# questions
r
Hi everyone! 👋 I'm trying to load specific CSV files from an SFTP connection in Kedro, and I need to filter the files using a wildcard pattern. For example, I'd like to load only files that match something like:
/doc_20250620*_delta.csv
But I noticed that YAML interprets `*` as an anchor, and it doesn't seem to behave like a wildcard here. How can I configure a dataset in `catalog.yml` to use a wildcard when loading files from an SFTP path (e.g. to only fetch files starting with a certain prefix and ending with `_delta.csv`)? Is there native support for this kind of pattern in Kedro's SFTPDataSet or do I need to implement a custom dataset? Any guidance or examples would be super appreciated! 🙏
s
You could try the `PartitionedDataSet` to handle wildcard paths with SFTP.
r
hello @Sajid Alam Thanks for the suggestion! I actually have a huge number of files on the server with the same filename prefix, so I'm a bit concerned that using a wildcard path with `PartitionedDataSet` might take a long time to load or list all those files. Do you know if there's a way to optimize that, or limit how many files it tries to process at once?
s
`PartitionedDataset` lists the directory once, but I don't believe it opens any file until your node calls the load function for that partition.
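To illustrate the lazy-loading point, here is a minimal sketch of a node that filters partitions by a wildcard before loading anything. It assumes the node receives a dict mapping partition IDs to load callables, and that the partition IDs look like the file names (this depends on how the dataset path and any filename suffix are configured). The names `select_partitions` and `load_matching` are hypothetical helpers, not part of Kedro.

```python
from fnmatch import fnmatchcase
from typing import Any, Callable, Dict


def select_partitions(
    partitions: Dict[str, Callable[[], Any]], pattern: str
) -> Dict[str, Callable[[], Any]]:
    """Keep only the partitions whose ID matches a shell-style wildcard."""
    return {
        partition_id: load
        for partition_id, load in partitions.items()
        if fnmatchcase(partition_id, pattern)
    }


def load_matching(
    partitions: Dict[str, Callable[[], Any]],
    pattern: str = "doc_20250620*_delta.csv",  # assumed partition ID format
) -> Dict[str, Any]:
    """Hypothetical node: call the load function only for matching partitions."""
    selected = select_partitions(partitions, pattern)
    # Each load() call is what actually opens a file; skipped partitions stay untouched.
    return {
        partition_id: load()
        for partition_id, load in sorted(selected.items())
    }
```

With this approach the wildcard lives in Python (or in a node parameter) rather than in `catalog.yml`, so the directory is still listed in full, but only the matching files are opened.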
r
I will check it, thank u ^^
j
what does your YAML file look like @Rachid Cherqaoui? maybe if you wrap `/doc_20250620*_delta.csv` in double quotes, YAML won't treat `*` as an anchor (but I'm not sure)