# questions
v
Hello everyone! Is there a way to load a flat file from S3 based on some conditions, like pulling the latest file from the mentioned bucket?
l
Split it into two problems:
• Listing files in the bucket based on your conditions
  ā—¦ Listing metadata can help you filter the one you'd like to load
• Loading the correct file from that list
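For the listing half, a rough sketch with boto3 (the bucket and prefix names are placeholders, and a real implementation would also handle pagination and empty results):

```python
import boto3

s3 = boto3.client("s3")
# List the objects under a prefix; each entry carries metadata such as
# Size and LastModified that can be used for filtering.
resp = s3.list_objects_v2(Bucket="your_bucket", Prefix="data/02_intermediate/company/")
# Pick the most recently modified file.
latest = max(resp.get("Contents", []), key=lambda obj: obj["LastModified"])
print(latest["Key"])  # the key of the newest file, ready to be loaded
```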
v
But how is that done? The example below expects the exact name of the file:
```yaml
motorbikes:
  type: pandas.CSVDataset
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3
  load_args:
    sep: ','
    skiprows: 5
    skipfooter: 1
    na_values: ['#NA', NA]
```
I just know the name of the bucket; we need to fetch the files based on some conditions, right?
l
in that case you'll have to implement a custom dataset
v
I see
l
you can expand the behaviour of the PandasDataset and override the `load` method
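As a rough illustration of that idea (the class and names here are hypothetical, and depending on your kedro-datasets version the hook to override may be `_load` rather than `load`):

```python
from kedro_datasets.pandas import CSVDataset


class FilteredCSVDataset(CSVDataset):
    """Hypothetical subclass that decides which concrete file to read."""

    def load(self):
        # ...resolve the file to read here (e.g. the newest file in a
        # folder), then delegate to the stock CSV loading behaviour:
        return super().load()
```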
v
understood
l
this is an example that adds additional functionality for sheets
v
how challenging is it going to be?
l
but in the same manner you can add a dataset, have filtering args in the constructor, and use those args in the `load` method
šŸ‘ 1
v
cannot open the above link
l
my bad, updated link
not hard, datasets are just classes with load and save methods really
šŸ‘ 1
v
I just found that PartitionedDataset provides a way of iterating over each file present in a bucket/folder: https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-load @Laurens Vijnck
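Per those docs, a PartitionedDataset loads as a dict mapping each partition id to a loader callable, so the selection logic would then live in a node; a hedged sketch (the function name is hypothetical, and taking `max` assumes file names that sort chronologically):

```python
from typing import Callable

import pandas as pd


def load_latest(partitions: dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    # Partition ids are paths relative to the dataset's `path`; taking the
    # max only works if the file names sort chronologically.
    latest_id = max(partitions)
    return partitions[latest_id]()  # call the loader to actually read the file
```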
l
ah yes, that one exists, though then you'll be implementing the conditions in the node
šŸ‘ 1
I usually prefer to have the dataset internals handling the selection
šŸ‘ 1
you might even be able to extend the PartitionedDataset and overload the `load` method there to call super and thereafter do the filtering
šŸ‘ 1