# questions
Fabian Terranova
Hello everyone šŸ‘‹ I am new here, and currently I am evaluating whether Kedro pipelines can be used for our preprocessing purposes. I watched the tutorial series on YouTube and read the stable documentation, which I was able to follow well. I have already been able to preprocess my first CSV, but I would now like to test the use of SFTP. However, I am getting basic errors with my approach. Based on "Load a CSV file stored in a remote location through SSH" I have made my configurations in catalog.yml and credentials.yml and have also installed paramiko.
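For reference, a minimal sketch of such a configuration, along the lines of the docs example (the dataset name, host, path, and credential values are placeholders):

```yaml
# conf/base/catalog.yml -- placeholder names and paths
sftp_catalog_as_in_catalog_yml:
  type: pandas.CSVDataset
  filepath: "sftp:///path/to/remote/location/file.csv"
  credentials: sftp_credentials
  load_args:
    encoding: utf-8
    sep: ";"
  save_args:
    index: False

# conf/local/credentials.yml -- placeholder values
sftp_credentials:
  username: my_username
  host: example.com
  port: 22
  password: my_password
```

However, when I try to load the dataset from my Jupyter notebook, I get the following errors: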
```python
df = catalog.load('sftp_catalog_as_in_catalog_yml')
```

```
URLError: <urlopen error unknown url type: sftp>
```

```
DatasetError: Failed while loading data from data set CSVDataset(filepath=/path/to/remote/location/file.csv,
load_args={'encoding': utf-8, 'sep': ;}, protocol=sftp, save_args={'index': False}).
<urlopen error unknown url type: sftp>
```
I have checked the SFTP access, credentials, etc.; all seem to be correct. However, I'm wondering whether SFTP is supported yet (without further individual configuration or script development)? When I look at the documentation on the Data Catalog, I see that SSH/FTP is listed in relation to `fsspec`, and given the code snippet for the SSH example from the first link, I would assume that this functionality is already implemented, or am I wrong here? Has anyone been able to use SFTP successfully according to the documentation, or would I have to write my own dataset that can, for example, use paramiko to fetch files from an SFTP server or write them back to it? Another idea would be to place the download in a node and program it there with paramiko, in order to then process the downloaded CSV in other nodes/pipelines (sketched below). However, I am somewhat reluctant to take this approach. Would you also advise against this? I would be grateful for any inputs and tips.
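For reference, the node-based download described above would look roughly like this (a hypothetical sketch; the function name, paths, and credential keys are illustrative, not from an actual project):

```python
import paramiko


def download_remote_csv(credentials: dict) -> str:
    """Hypothetical node: fetch the remote CSV via paramiko so that
    downstream nodes can read it from local disk."""
    local_path = "data/01_raw/file.csv"  # illustrative local target
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        credentials["host"],
        port=credentials.get("port", 22),
        username=credentials["username"],
        password=credentials["password"],
    )
    try:
        client.open_sftp().get("/path/to/remote/location/file.csv", local_path)
    finally:
        client.close()
    return local_path
```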
Juan Luis
hi @Fabian Terranova! this looks similar to https://github.com/pandas-dev/pandas/issues/46765
so probably we can treat it like a bug and fix it in the way it's suggested there
could you open an issue on Kedro? https://github.com/kedro-org/kedro/issues/
Fabian Terranova
Hi @Juan Luis Thanks for your reply! Sure, the issue is created: https://github.com/kedro-org/kedro/issues/3646. May I ask, based on your experience, when I can expect a solution to this? Will it take days, weeks, or months? I am asking so I know whether I can wait for a possible fix, or whether I should look for an alternative solution for the download myself in the meantime.
Juan Luis
left a comment at https://github.com/kedro-org/kedro/issues/3646#issuecomment-1959441547. don't expect a quick (days or weeks) fix, I'm afraid
šŸ™Œ 1
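In the meantime, a possible workaround (an untested sketch: it bypasses pandas' urllib-based URL handling, which is what raises the `URLError`, by opening the remote file with `fsspec` directly; paramiko must still be installed for the `sftp` protocol):

```python
import fsspec
import pandas as pd

# Untested sketch: let fsspec handle the sftp:// URL and hand the open file
# object to pandas, so pandas never tries to urlopen the sftp URL itself.
# Host, path, and load arguments are the placeholders used earlier.
with fsspec.open(
    "sftp:///path/to/remote/location/file.csv",
    host="example.com",
    port=22,
    username="my_username",
    password="my_password",
) as f:
    df = pd.read_csv(f, encoding="utf-8", sep=";")
```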
Fabian Terranova
@Juan Luis I have created the file structure

```
src/projectname/datasets
ā”œā”€ā”€ __init__.py
ā””ā”€ā”€ sftp_dataset.py
```

Do I need to make further changes to any configuration files, or should Kedro recognize the custom dataset automatically? Currently I am stuck with:

```
DatasetError: An exception occurred when parsing config for dataset 'catalogname':
Class 'projectname.datasets.SFTPCapableCSVDataset' not found, is this a typo?
```

I tried following the guide at https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html. Thanks in advance!
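For illustration, a minimal sketch of what `sftp_dataset.py` might contain (a reconstruction following the custom-dataset guide, not the actual implementation from this thread; it assumes paramiko and password-based credentials as in the credentials.yml sketch above):

```python
# src/projectname/datasets/sftp_dataset.py -- illustrative sketch only;
# the real SFTPCapableCSVDataset from the thread may differ.
from __future__ import annotations

from typing import Any

import pandas as pd
import paramiko
from kedro.io import AbstractDataset


class SFTPCapableCSVDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    """CSV dataset that reads from / writes to an SFTP server via paramiko."""

    def __init__(
        self,
        filepath: str,
        credentials: dict[str, Any],
        load_args: dict[str, Any] | None = None,
        save_args: dict[str, Any] | None = None,
    ):
        self._filepath = filepath
        self._credentials = credentials
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _connect(self) -> paramiko.SSHClient:
        # Assumes password-based credentials, matching the earlier sketch.
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(
            self._credentials["host"],
            port=self._credentials.get("port", 22),
            username=self._credentials["username"],
            password=self._credentials["password"],
        )
        return client

    def _load(self) -> pd.DataFrame:
        client = self._connect()
        try:
            with client.open_sftp().open(self._filepath) as f:
                return pd.read_csv(f, **self._load_args)
        finally:
            client.close()

    def _save(self, data: pd.DataFrame) -> None:
        client = self._connect()
        try:
            with client.open_sftp().open(self._filepath, "w") as f:
                # to_csv without a path returns a string; encode for paramiko.
                f.write(data.to_csv(**self._save_args).encode("utf-8"))
        finally:
            client.close()

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "load_args": self._load_args}
```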
Juan Luis
@Fabian Terranova you're almost there! Can you check:
1. Does `python -c "from projectname.datasets import SFTPCapableCSVDataset"` work?
   a. If not, you'll probably need to install your own code: `pip install -e .` (assuming there's a proper `pyproject.toml` next to `src/`).
   b. If it does, but Kedro still doesn't find it, more debugging is needed.

Admittedly, the `pip install .` part is not in the docs. Could you open a new issue?
Fabian Terranova
Hi @Juan Luis Thanks for your reply! Initially step 1 didn't work, but with `pip install .` and some necessary changes to `src/projectname/datasets/__init__.py` I was able to import the class and successfully run the `catalog.load` method. Possible steps to consider for the docs:
1. Create the custom dataset according to the current docs in `src/projectname/datasets/datasetname.py`, with the `__init__.py` alongside.
2. Exclude a possible pitfall by explicitly mentioning the (in my case required) lines in `__init__.py`; this file probably needs:
```python
from .datasetname import SFTPCapableCSVDataset

__all__ = ["SFTPCapableCSVDataset"]
```
3. `cd projectname`
4. `pip install .`
5. In `conf/base/catalog.yml` the type should be `projectname.datasets.SFTPCapableCSVDataset`, where `projectname` and `SFTPCapableCSVDataset` have to be exchanged for the corresponding names of the respective project (see the example below).

With these steps I was able to successfully call `catalog.load` within the Jupyter notebook. Sure, I will open a new issue regarding the docs later today.
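For step 5, the resulting entry would look roughly like this (again an illustrative sketch with the placeholder names used earlier in the thread):

```yaml
# conf/base/catalog.yml -- placeholder names
sftp_catalog_as_in_catalog_yml:
  type: projectname.datasets.SFTPCapableCSVDataset
  filepath: /path/to/remote/location/file.csv
  credentials: sftp_credentials
  load_args:
    encoding: utf-8
    sep: ";"
  save_args:
    index: False
```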
The issue is created in the Kedro repository: https://github.com/kedro-org/kedro/issues/3654. Thanks again!
šŸ’Æ 1
Juan Luis
> Initially step 1 didn't work, but with `pip install .` and some necessary changes to `src/projectname/datasets/__init__.py` I was able to import the class and successfully run the `catalog.load` method.
fantastic news šŸ‘šŸ¼ yeah we don't emphasize enough that users need to install their own code. good to have a reminder on that
thanks for opening the issue! adding that to our backlog
šŸ‘ 1
Fabian Terranova
> fantastic news šŸ‘šŸ¼ yeah we don't emphasize enough that users need to install their own code. good to have a reminder on that

I am very grateful that there is already a lot of documentation regarding the extension options šŸ™Œ. And I think that with this little hint for local/private extension development (without having to make a mandatory public dataset contribution), such implementations can be realized very flexibly. That said, we prefer to work with standard approaches and existing datasets wherever the use case can be covered by them. šŸ˜