Sergey S
11/10/2023, 9:36 PM
parent/
1. data_1.csv # useless for us and can be ignored
2. data_2.csv
3. scores.json
The files in each report directory we care about are "data_2.csv" and "scores.json".
We could use TemplatedConfigLoader to simply define the two files in catalog.yaml:
scores_data:
  type: pandas.JSONDataSet
  filepath: "gs://my-bucket/reports/${report_dir}/scores.json"
data_2_data:
  type: pandas.CSVDataSet
  filepath: "gs://my-bucket/reports/${report_dir}/data_2.csv"
And then in conf/base/globals.yml:
report_dir: "report_2023_03_01"
Would that be a recommended approach here? Or is there a better way of doing this?
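For reference, the substitution that the templating above performs can be illustrated with Python's standard-library `string.Template`, which happens to share the `${name}` placeholder syntax. This is only an illustration and not how Kedro's config loaders are implemented; `resolve_filepaths`, `catalog_templates` and `global_values` are hypothetical names standing in for the catalog entries and the globals file.

```python
from string import Template

# Hypothetical stand-in for the templated catalog entries.
catalog_templates = {
    "scores_data": "gs://my-bucket/reports/${report_dir}/scores.json",
    "data_2_data": "gs://my-bucket/reports/${report_dir}/data_2.csv",
}

# Hypothetical stand-in for the values defined in globals.yml.
global_values = {"report_dir": "report_2023_03_01"}


def resolve_filepaths(templates, values):
    """Substitute ${...} placeholders in every template string."""
    return {
        name: Template(path).substitute(values)
        for name, path in templates.items()
    }


resolved = resolve_filepaths(catalog_templates, global_values)
for name, path in resolved.items():
    print(f"{name}: {path}")
```

With this setup, switching to another report only means changing the single `report_dir` value.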
I was thinking about using PartitionedDataset to simply point Kedro to the report directory and treat the folder as a dataset.
The issue with that approach seems to be that PartitionedDataset requires a dataset argument in __init__() to specify the type for all the files inside the directory, whereas in our case the directory contains mixed file types.
Is there a way with Kedro to create a custom dataset that works at the folder level and, depending on the files inside the folder, loads them differently with custom logic?
Ian Whalen
11/10/2023, 10:05 PM
PartitionedDataset might help here.
https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html
If your directory looks like:
parent/
1.csv
2.csv
3.csv
you can point it to parent and loop over the contents in your node.
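A rough sketch of such a node follows. When a node receives a PartitionedDataset as input, it gets a dictionary mapping each partition ID to a load function, so the loop has to call each function to obtain the data. The `fake_partitions` dict is a made-up stand-in for that input so the example runs without Kedro, and `concat_partitions` is a hypothetical node function name.

```python
from typing import Any, Callable, Dict, List


def concat_partitions(partitions: Dict[str, Callable[[], Any]]) -> List[Any]:
    """Call each partition's load function and gather the rows in ID order."""
    rows: List[Any] = []
    for partition_id in sorted(partitions):
        load_partition = partitions[partition_id]
        rows.extend(load_partition())  # data is only read at this point
    return rows


# Made-up stand-in for what the node would receive: ID -> load function.
fake_partitions = {
    "1.csv": lambda: [{"value": 1}],
    "2.csv": lambda: [{"value": 2}],
    "3.csv": lambda: [{"value": 3}],
}

all_rows = concat_partitions(fake_partitions)
print(all_rows)
```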
Note that it returns your load functions for the partitions, not the data itself.
Sergey S
11/10/2023, 10:55 PM
parent/
1. data_1.csv
2. data_2.csv
3. scores.json
I only care about "data_2.csv" and "scores.json" in each report folder.
The idea was to have two separate processing nodes for "data_2.csv" (process_data_2_node) and "scores.json" (process_scores_node) and the user only needs to specify the path to this report folder on a GCS bucket.
The docs for PartitionedDataset seem to suggest that a single file type has to be provided; however, in my case I have a mix of file types that I care about (csv, json).
How would one go about creating the first node that, given a path to the folder, returns or reroutes individual files inside the folder to different nodes?
Nok Lam Chan
11/11/2023, 6:52 AM
Lukas Innig
11/11/2023, 9:55 AM
Sergey S
11/11/2023, 4:12 PM
OmegaConfigLoader and templating the path:
# conf/base/catalog.yaml
scores_data:
  type: pandas.JSONDataSet
  filepath: "gs://my-bucket/reports/${report_dir}/scores.json"
data_2_data:
  type: pandas.CSVDataSet
  filepath: "gs://my-bucket/reports/${report_dir}/data_2.csv"
• Option 2 would be to use a custom dataset that points to a folder and reads those two files, plus a node that outputs them separately for the specialized processing nodes.
In the case of a custom dataset that only supports reading the data, I would have to use fsspec, right?
I was hoping the AbstractDataSet would already provide the file I/O abstraction so that the user would only implement the logic of dealing with files.
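A minimal sketch of the folder-level read logic for Option 2, under stated assumptions: `load_report_folder`, `split_report`, `PARSERS` and the `opener` parameter are hypothetical names and not part of Kedro. The `opener` defaults to the built-in `open` so the example runs against a local folder; the assumption is that `fsspec.open` could be passed instead to read from a `gs://` path. In a real project this logic would presumably sit inside the load method of a custom dataset class.

```python
import csv
import json
from typing import Any, Callable, Dict, Tuple

# Hypothetical mapping: file name inside the report folder -> parser.
PARSERS: Dict[str, Callable[[Any], Any]] = {
    "data_2.csv": lambda handle: list(csv.DictReader(handle)),
    "scores.json": json.load,
}


def load_report_folder(
    folder: str, opener: Callable[..., Any] = open
) -> Dict[str, Any]:
    """Read only the files listed in PARSERS; everything else is ignored."""
    report = {}
    for file_name, parse in PARSERS.items():
        path = f"{folder.rstrip('/')}/{file_name}"
        with opener(path, "r") as handle:
            report[file_name] = parse(handle)
    return report


def split_report(report: Dict[str, Any]) -> Tuple[Any, Any]:
    """Fan the loaded folder out into two outputs, one per processing node."""
    return report["data_2.csv"], report["scores.json"]
```

A function like `split_report` could then be registered as a node with two outputs, one feeding process_data_2_node and the other process_scores_node, so the user only has to specify the report folder.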