# questions
Fabian Terranova
Hello everyone šŸ‘‹ I am new here, and currently I am evaluating whether Kedro pipelines can be used for our preprocessing purposes. I watched the tutorial series on YouTube and read the stable documentation, which I was able to follow well. I have already been able to preprocess my first CSV, but I would now like to test the use of SFTP. However, I am getting basic errors with my approach. Based on "Load a CSV file stored in a remote location through SSH" I have made my configurations in catalog.yml and credentials.yml and have also installed paramiko.
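For reference, a minimal sketch of such a configuration, along the lines of the docs example (the dataset name, host, path, and credential values are placeholders):

```yaml
# conf/base/catalog.yml -- placeholder names and paths
sftp_catalog_as_in_catalog_yml:
  type: pandas.CSVDataset
  filepath: "sftp:///path/to/remote/location/file.csv"
  credentials: sftp_credentials
  load_args:
    encoding: utf-8
    sep: ";"
  save_args:
    index: False

# conf/local/credentials.yml -- placeholder values
sftp_credentials:
  username: my_username
  host: example.com
  port: 22
  password: my_password
```

However, when I try to load the dataset from my Jupyter notebook, I get the following errors: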
```python
df = catalog.load('sftp_catalog_as_in_catalog_yml')
```

```
URLError: <urlopen error unknown url type: sftp>
```

```
DatasetError: Failed while loading data from data set CSVDataset(filepath=/path/to/remote/location/file.csv,
load_args={'encoding': utf-8, 'sep': ;}, protocol=sftp, save_args={'index': False}).
<urlopen error unknown url type: sftp>
```
I have checked the SFTP access, credentials, etc.; all seem to be correct. However, I'm wondering whether SFTP is supported yet (without further individual configuration or script development)? When I look at the documentation on the Data Catalog, I see that SSH/FTP is listed in relation to `fsspec`, and given the code snippet for the SSH example from the first link, I would assume that this functionality is already implemented, or am I wrong here? Has anyone been able to use SFTP successfully according to the documentation, or would I have to write my own dataset that can, for example, use paramiko to fetch files from an SFTP server or write them back to it? Another idea would be to place the download in a node and program it there with paramiko, in order to then process the downloaded CSV in other nodes/pipelines (sketched below). However, I am somewhat reluctant to take this approach. Would you also advise against this? I would be grateful for any inputs and tips.
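For reference, the node-based download described above would look roughly like this (a hypothetical sketch; the function name, paths, and credential keys are illustrative, not from an actual project):

```python
import paramiko


def download_remote_csv(credentials: dict) -> str:
    """Hypothetical node: fetch the remote CSV via paramiko so that
    downstream nodes can read it from local disk."""
    local_path = "data/01_raw/file.csv"  # illustrative local target
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        credentials["host"],
        port=credentials.get("port", 22),
        username=credentials["username"],
        password=credentials["password"],
    )
    try:
        client.open_sftp().get("/path/to/remote/location/file.csv", local_path)
    finally:
        client.close()
    return local_path
```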
Juan Luis
hi @Fabian Terranova! this looks similar to https://github.com/pandas-dev/pandas/issues/46765
so probably we can treat it like a bug and fix it in the way it's suggested there
could you open an issue on Kedro? https://github.com/kedro-org/kedro/issues/
Fabian Terranova
Hi @Juan Luis Thanks for your reply! Sure, the issue is created: https://github.com/kedro-org/kedro/issues/3646. May I ask, based on your experience, when I can expect a solution to this? Will it take days, weeks, or months? I am asking so I know whether I can wait for a possible fix, or whether I should look for an alternative solution for the download myself in the meantime.
Juan Luis
left a comment at https://github.com/kedro-org/kedro/issues/3646#issuecomment-1959441547. don't expect a quick (days or weeks) fix, I'm afraid
šŸ™Œ 1
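In the meantime, a possible workaround (an untested sketch: it bypasses pandas' urllib-based URL handling, which is what raises the `URLError`, by opening the remote file with `fsspec` directly; paramiko must still be installed for the `sftp` protocol):

```python
import fsspec
import pandas as pd

# Untested sketch: let fsspec handle the sftp:// URL and hand the open file
# object to pandas, so pandas never tries to urlopen the sftp URL itself.
# Host, path, and load arguments are the placeholders used earlier.
with fsspec.open(
    "sftp:///path/to/remote/location/file.csv",
    host="example.com",
    port=22,
    username="my_username",
    password="my_password",
) as f:
    df = pd.read_csv(f, encoding="utf-8", sep=";")
```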
Fabian Terranova
@Juan Luis I have created the file structure

```
src/projectname/datasets
ā”œā”€ā”€ __init__.py
ā””ā”€ā”€ sftp_dataset.py
```

Do I need to make further changes to any configuration files, or should Kedro recognize the custom dataset automatically? Currently I am stuck with:

```
DatasetError: An exception occurred when parsing config for dataset 'catalogname':
Class 'projectname.datasets.SFTPCapableCSVDataset' not found, is this a typo?
```

I tried following the guide at https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html. Thanks in advance!
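For illustration, a minimal sketch of what `sftp_dataset.py` might contain (a reconstruction following the custom-dataset guide, not the actual implementation from this thread; it assumes paramiko and password-based credentials as in the credentials.yml sketch above):

```python
# src/projectname/datasets/sftp_dataset.py -- illustrative sketch only;
# the real SFTPCapableCSVDataset from the thread may differ.
from __future__ import annotations

from typing import Any

import pandas as pd
import paramiko
from kedro.io import AbstractDataset


class SFTPCapableCSVDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    """CSV dataset that reads from / writes to an SFTP server via paramiko."""

    def __init__(
        self,
        filepath: str,
        credentials: dict[str, Any],
        load_args: dict[str, Any] | None = None,
        save_args: dict[str, Any] | None = None,
    ):
        self._filepath = filepath
        self._credentials = credentials
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _connect(self) -> paramiko.SSHClient:
        # Assumes password-based credentials, matching the earlier sketch.
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(
            self._credentials["host"],
            port=self._credentials.get("port", 22),
            username=self._credentials["username"],
            password=self._credentials["password"],
        )
        return client

    def _load(self) -> pd.DataFrame:
        client = self._connect()
        try:
            with client.open_sftp().open(self._filepath) as f:
                return pd.read_csv(f, **self._load_args)
        finally:
            client.close()

    def _save(self, data: pd.DataFrame) -> None:
        client = self._connect()
        try:
            with client.open_sftp().open(self._filepath, "w") as f:
                # to_csv without a path returns a string; encode for paramiko.
                f.write(data.to_csv(**self._save_args).encode("utf-8"))
        finally:
            client.close()

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "load_args": self._load_args}
```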
Juan Luis
@Fabian Terranova you're almost there! Can you check:
1. Does `python -c "from projectname.datasets import SFTPCapableCSVDataset"` work?
   a. If not, you'll probably need to install your own code: `pip install -e .` (assuming there's a proper `pyproject.toml` next to `src/`).
   b. If it does, but Kedro still doesn't find it, more debugging is needed.

Admittedly, the `pip install .` part is not in the docs. Could you open a new issue?
Fabian Terranova
Hi @Juan Luis Thanks for your reply! Initially step 1 didn't work, but with `pip install .` and some necessary changes to `src/projectname/datasets/__init__.py` I was able to import the class and successfully run the `catalog.load` method. Possible steps to consider for the docs:
1. Create the custom dataset according to the current docs in `src/projectname/datasets/datasetname.py`, with the `__init__.py` alongside.
2. Exclude a possible pitfall by explicitly mentioning the (in my case required) lines in `__init__.py`; this file probably needs:
```python
from .datasetname import SFTPCapableCSVDataset

__all__ = ["SFTPCapableCSVDataset"]
```
3. `cd projectname`
4. `pip install .`
5. In `conf/base/catalog.yml` the type should be `projectname.datasets.SFTPCapableCSVDataset`, where `projectname` and `SFTPCapableCSVDataset` have to be exchanged for the corresponding names of the respective project (see the example below).

With these steps I was able to successfully call `catalog.load` within the Jupyter notebook. Sure, I will open a new issue regarding the docs later today.
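For step 5, the resulting entry would look roughly like this (again an illustrative sketch with the placeholder names used earlier in the thread):

```yaml
# conf/base/catalog.yml -- placeholder names
sftp_catalog_as_in_catalog_yml:
  type: projectname.datasets.SFTPCapableCSVDataset
  filepath: /path/to/remote/location/file.csv
  credentials: sftp_credentials
  load_args:
    encoding: utf-8
    sep: ";"
  save_args:
    index: False
```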
The issue is created in the Kedro repository: https://github.com/kedro-org/kedro/issues/3654. Thanks again!
šŸ’Æ 1
Juan Luis
> Initially step 1 didn't work, but with `pip install .` and some necessary changes to `src/projectname/datasets/__init__.py` I was able to import the class and successfully run the `catalog.load` method.
fantastic news šŸ‘šŸ¼ yeah we don't emphasize enough that users need to install their own code. good to have a reminder on that
thanks for opening the issue! adding that to our backlog
šŸ‘ 1
Fabian Terranova
> fantastic news šŸ‘šŸ¼ yeah we don't emphasize enough that users need to install their own code. good to have a reminder on that

I am very grateful that there is already a lot of documentation regarding the extension options šŸ™Œ. And I think that with this little hint for local/private extension development (without having to make a mandatory public dataset contribution), such implementations can be realized very flexibly. That said, we prefer to work with standard approaches and existing datasets wherever the use case can be covered by them. šŸ˜