# questions
s
Hi all, first time Kedro user. I am working on a simple POC project to showcase Kedro to the team. However, I can't wrap my head around how to use Kedro to solve a simple and very common use case of processing specific files inside a directory. Let's say every day a report directory with files is uploaded to cloud storage. A report directory has the following files:
```
parent/
    data_1.csv   # useless for us and can be ignored
    data_2.csv
    scores.json
```
The files in each report directory we care about are "data_2.csv" and "scores.json". We could use `TemplatedConfigLoader` to simply define the two files in `catalog.yaml`:
```yaml
scores_data:
    type: pandas.JSONDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/scores.json"

data_2_data:
    type: pandas.CSVDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/data_2.csv"
```
And then in `conf/base/globals.yml`:
```yaml
report_dir: "report_2023_03_01"
```
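For completeness, a minimal sketch of how that templating could be wired up, assuming a Kedro 0.18.x project where the config loader is registered in `settings.py` (the module path is illustrative):

```python
# src/my_project/settings.py -- illustrative module path
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    # pick up conf/<env>/globals.yml so ${report_dir} is substituted in the catalog
    "globals_pattern": "*globals.yml",
}
```

Note that `TemplatedConfigLoader` is deprecated in newer Kedro releases in favour of `OmegaConfigLoader`, where globals are referenced as `${globals:report_dir}` instead; check the syntax against the Kedro version you are on.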
Would that be a recommended approach here, or is there a better way of doing this? I was thinking about using `PartitionedDataset` to simply point Kedro to the report directory and treat the folder as a dataset. The issue with that approach seems to be that `PartitionedDataset` requires a `dataset` argument in `__init__()` to specify the type for all the files inside the directory, whereas in our case we have mixed files. Is there a way with Kedro to create a custom dataset that works on a folder level and, depending on the files inside the folder, loads them differently with custom logic?
i
`PartitionedDataset` might help here (https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html). If your directory looks like:
```
parent/
    1.csv
    2.csv
    3.csv
```
you can point it to `parent` and loop over the contents in your node. Note that it returns load functions for the partitions, not the data itself.
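As a concrete illustration of that last point, a node consuming a `PartitionedDataset` might look like the sketch below (the function name and the assumption that every partition is a CSV loaded as a DataFrame are illustrative, not from the thread):

```python
from typing import Callable, Dict

import pandas as pd


def concat_partitions(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    """Concatenate every partition under parent/ into a single DataFrame.

    Kedro passes a dict of {partition_id: load_function}; a file is only
    read when its load function is called.
    """
    frames = []
    for partition_id, load_partition in sorted(partitions.items()):
        frames.append(load_partition())
    return pd.concat(frames, ignore_index=True)
```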
s
Thanks @Ian Whalen. The report directories look like this:
```
parent/
    data_1.csv
    data_2.csv
    scores.json
```
I only care about "data_2.csv" and "scores.json" in each report folder. The idea was to have two separate processing nodes, one for "data_2.csv" (process_data_2_node) and one for "scores.json" (process_scores_node), with the user only needing to specify the path to the report folder on a GCS bucket. The docs for `PartitionedDataset` seem to suggest that a single file type has to be provided, whereas in my case I care about a mix of files (CSV, JSON). How would one go about creating a first node that, given a path to the folder, returns or reroutes the individual files inside the folder to different nodes?
How would your node process the data? Does the logic depend on the data type, or is it the same for both files?
I think the real question is why the two datasets need to share the same folder. If you are treating the folder as a partition, you could define them in two different subdirectories; is that an option?
Partitioning usually expects a homogeneous type of data. You can make it work dynamically, but I think separating the files into folders is much easier.
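If restructuring the folders is not possible, one workaround worth sketching (a suggestion, not something stated in the thread) is to declare two `PartitionedDataset` entries over the same folder and let `filename_suffix` filter each one down to a single file type:

```yaml
# conf/base/catalog.yml -- sketch; bucket path and entry names are illustrative
report_csvs:
  type: PartitionedDataSet
  path: gs://my-bucket/reports/report_2023_03_01/
  dataset: pandas.CSVDataSet
  filename_suffix: ".csv"

report_scores:
  type: PartitionedDataSet
  path: gs://my-bucket/reports/report_2023_03_01/
  dataset: pandas.JSONDataSet
  filename_suffix: ".json"
```

Each node then only receives the partitions with the matching suffix; the exact class names (`PartitionedDataSet` vs `partitions.PartitionedDataset`) depend on the Kedro / kedro-datasets version.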
l
I'd argue that if this report is so common, it's worth adding a new custom dataset that handles it. It's not very complex to add one, if you follow the example
s
Thanks for the insights @Nok Lam Chan. The reports are generated by an outside system, so I have no influence on the structure of the files inside the report folders. My idea was to create two separate nodes to process each file, one node for scores.json and one node for data_2.csv.
• Option 1 seems to be to use `OmegaConfigLoader` and templating the path:
```yaml
# conf/base/catalog.yaml
scores_data:
    type: pandas.JSONDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/scores.json"

data_2_data:
    type: pandas.CSVDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/data_2.csv"
```
• Option 2 would be to use a custom dataset that points to a folder and reads those two files, plus a node that passes those two files on to specialized nodes for processing. In the case of a custom dataset that only supports reading the data, I would have to use fsspec, right? I was hoping `AbstractDataSet` would already provide the file I/O abstraction so that the user would only have to implement the logic of dealing with the files.
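For reference, a rough sketch of what such a read-only folder dataset could look like, following the custom-dataset pattern from the Kedro docs: `AbstractDataSet` supplies the load/save/describe interface, but the file I/O itself is typically done with fsspec. The class name, the tuple return type and loading scores.json with pandas are illustrative assumptions, not something confirmed in the thread:

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Tuple

import fsspec
import pandas as pd
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ReportFolderDataSet(AbstractDataSet):
    """Loads data_2.csv and scores.json from a single report folder."""

    def __init__(self, filepath: str):
        # filepath points at the report folder, e.g. gs://my-bucket/reports/report_2023_03_01
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._path = PurePosixPath(path)
        self._fs = fsspec.filesystem(protocol)

    def _load(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
        csv_path = get_filepath_str(self._path / "data_2.csv", self._protocol)
        json_path = get_filepath_str(self._path / "scores.json", self._protocol)
        with self._fs.open(csv_path) as f:
            data_2 = pd.read_csv(f)
        with self._fs.open(json_path) as f:
            scores = pd.read_json(f)
        return data_2, scores

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This dataset is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"path": str(self._path), "protocol": self._protocol}
```

A node taking this dataset as input would receive the `(data_2, scores)` tuple and could return the two pieces as separate outputs for the downstream processing nodes.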