Fabian Terranova
02/21/2024, 6:40 PM
df = catalog.load('sftp_catalog_as_in_catalog_yml')
URLError: <urlopen error unknown url type: sftp>
DatasetError: Failed while loading data from data set CSVDataset(filepath=/path/to/remote/location/file.csv,
load_args={'encoding': utf-8, 'sep': ;}, protocol=sftp, save_args={'index': False}).
<urlopen error unknown url type: sftp>
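For context: the "unknown url type" error can be reproduced without any server, because Python's urllib ships no handler for the sftp scheme and rejects the URL before any network traffic happens. A minimal sketch (host and path are placeholders):

```python
import urllib.error
import urllib.request

# urllib has no opener for the sftp scheme, so it fails immediately,
# without attempting a connection -- the same URLError as in the traceback.
try:
    urllib.request.urlopen("sftp://host/path/to/file.csv")
except urllib.error.URLError as exc:
    print(exc)  # message contains "unknown url type"
```

If urllib is the component raising here, the URL likely never reached an fsspec SFTP filesystem at all; note also that fsspec's sftp protocol requires the paramiko package to be installed.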
I have checked the SFTP access, credentials, etc., and all seem to be correct. However, I'm wondering whether SFTP is supported out of the box yet (without further individual configuration or script development)?
When I look at the documentation at Data_Catalog, I see that SSH / FTP is listed in relation to fsspec,
and given the code snippet for the SSH example from the first link, I would assume that this functionality is already implemented, or am I wrong here?
Has anyone already been able to use SFTP successfully according to the documentation, or would I have to write my own custom dataset that uses, for example, paramiko to fetch files from an SFTP server or write them back to it?
Another idea would be to place paramiko in a node and implement the download there, in order to then process the downloaded CSV in other nodes / other pipelines. However, I am somewhat reluctant to take this approach. Would you also advise against it?
I would be grateful for any inputs and tips.
Juan Luis
02/22/2024, 8:10 AM
Juan Luis
02/22/2024, 8:11 AM
Juan Luis
02/22/2024, 8:11 AM
Fabian Terranova
02/22/2024, 11:15 AM
Juan Luis
02/22/2024, 1:20 PM
Fabian Terranova
02/23/2024, 6:59 PM
src/projectname/datasets
├── __init__.py
└── sftp_dataset.py
Do I need to make further changes to any configuration files? Or should kedro recognize the custom data set automatically?
Currently I am stuck with
DatasetError: An exception occurred when parsing config for dataset 'catalogname':
Class 'projectname.datasets.SFTPCapableCSVDataset' not found, is this a typo?
I tried following this guide here https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html
Thanks in advance!
Juan Luis
02/23/2024, 11:49 PM
does
python -c "from projectname.datasets import SFTPCapableCSVDataset"
work?
a. if not, you'll probably need to install your own code: pip install -e . (assuming there's a proper pyproject.toml next to src/)
b. if it does, but Kedro still doesn't find it, more debugging is needed
Juan Luis
02/23/2024, 11:50 PM
pip install .
part is not in the docs. could you open a new issue?
Fabian Terranova
02/26/2024, 11:31 AM
Initially step 1. didn't work, but with pip install . and some necessary changes to the projectname/src/datasets/__init__.py I was able to import and successfully run the catalog.load method.
Possible steps to consider for the docs:
1. Create the custom dataset according to the recent docs in projectname/src/datasets/datasetname.py, with the __init__.py alongside.
2. Rule out a possible pitfall by explicitly mentioning the (in my case required) lines in __init__.py; this file probably needs
from .datasetname import SFTPCapableCSVDataset
__all__ = ["SFTPCapableCSVDataset"]
3. cd projectname
4. pip install .
5. In conf/base/catalog.yml the type should be projectname.datasets.SFTPCapableCSVDataset, where projectname and SFTPCapableCSVDataset have to be replaced with the corresponding names of the respective project.
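Step 5 would then look roughly like this (a sketch only: the entry name, host, path, and credentials are placeholders, and it assumes the custom dataset keeps CSVDataset's constructor arguments; load_args/save_args mirror the ones from the original error message):

```yaml
# conf/base/catalog.yml -- sketch of an entry using the custom dataset.
remote_csv:
  type: projectname.datasets.SFTPCapableCSVDataset
  filepath: "sftp://user@sftp.example.com/path/to/remote/location/file.csv"
  load_args:
    encoding: utf-8
    sep: ";"
  save_args:
    index: false
```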
With these steps I was able to successfully call catalog.load within the Jupyter notebook.
Sure, I will open a new issue regarding the docs later today.
Fabian Terranova
02/26/2024, 1:23 PM
Juan Luis
02/26/2024, 1:33 PM
> Initially step 1. didn't work, but with pip install . and some necessary changes to the projectname/src/datasets/__init__.py I was able to import and successfully run the catalog.load method.
fantastic news 👍 yeah we don't emphasize enough that users need to install their own code. good to have a reminder on that
Juan Luis
02/26/2024, 1:33 PM
Fabian Terranova
02/26/2024, 1:46 PM
> fantastic news 👍 yeah we don't emphasize enough that users need to install their own code. good to have a reminder on that
I am very grateful that there is already a lot of documentation regarding the extension options. And I think that with this little hint for local/private extension development (without a mandatory public dataset contribution), these implementations can also be realized very flexibly. Even if we prefer to work with standard approaches and existing datasets, as long as the use case can be mapped with them.