# questions
s
Hi all, first time Kedro user. I am working on a simple POC project to showcase Kedro to the team. However, I can't wrap my head around how to use Kedro to solve a simple and very common use case of processing specific files inside a directory. Let's say every day a report directory with files is uploaded to cloud storage. A report directory has the following files:
```
parent/
    data_1.csv   # useless for us and can be ignored
    data_2.csv
    scores.json
```
The files in each report directory we care about are "data_2.csv" and "scores.json". We could use `TemplatedConfigLoader` to simply define the two files in `catalog.yaml`:
```yaml
scores_data:
    type: pandas.JSONDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/scores.json"

data_2_data:
    type: pandas.CSVDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/data_2.csv"
```
And then in `conf/base/globals.yml`:
```yaml
report_dir: "report_2023_03_01"
```
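For completeness, a minimal sketch of how that templating could be wired up, assuming a Kedro 0.18.x project where the config loader is registered in `settings.py` (the module path is illustrative):

```python
# src/my_project/settings.py -- illustrative module path
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    # pick up conf/<env>/globals.yml so ${report_dir} is substituted in the catalog
    "globals_pattern": "*globals.yml",
}
```

Note that `TemplatedConfigLoader` is deprecated in newer Kedro releases in favour of `OmegaConfigLoader`, where globals are referenced as `${globals:report_dir}` instead; check the syntax against the Kedro version you are on.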
Would that be a recommended approach here, or is there a better way of doing this? I was thinking about using `PartitionedDataset` to simply point Kedro to the report directory and treat the folder as a dataset. The issue with that approach seems to be that `PartitionedDataset` requires a `dataset` argument in `__init__()` to specify the type for all the files inside the directory, whereas in our case we have mixed files. Is there a way with Kedro to create a custom dataset that works on a folder level and, depending on the files inside the folder, loads them differently with custom logic?
i
`PartitionedDataset` might help here (https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html). If your directory looks like:
```
parent/
    1.csv
    2.csv
    3.csv
```
you can point it to `parent` and loop over the contents in your node. Note that it returns load functions for the partitions, not the data itself.
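As a concrete illustration of that last point, a node consuming a `PartitionedDataset` might look like the sketch below (the function name and the assumption that every partition is a CSV loaded as a DataFrame are illustrative, not from the thread):

```python
from typing import Callable, Dict

import pandas as pd


def concat_partitions(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    """Concatenate every partition under parent/ into a single DataFrame.

    Kedro passes a dict of {partition_id: load_function}; a file is only
    read when its load function is called.
    """
    frames = []
    for partition_id, load_partition in sorted(partitions.items()):
        frames.append(load_partition())
    return pd.concat(frames, ignore_index=True)
```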
s
Thanks @Ian Whalen. The report directories look like this:
```
parent/
    data_1.csv
    data_2.csv
    scores.json
```
I only care about "data_2.csv" and "scores.json" in each report folder. The idea was to have two separate processing nodes, one for "data_2.csv" (process_data_2_node) and one for "scores.json" (process_scores_node), with the user only needing to specify the path to the report folder on a GCS bucket. The docs for `PartitionedDataset` seem to suggest that a single file type has to be provided, whereas in my case I care about a mix of files (CSV, JSON). How would one go about creating a first node that, given a path to the folder, returns or reroutes the individual files inside the folder to different nodes?
How would your node process the data? Does the logic depend on the data type, or is it the same for both files?
I think the real question is why the two datasets need to share the same folder. If you are treating the folder as a partition, you could define them in two different subdirectories; is that an option?
Partitioning usually expects a homogeneous type of data. You can make it work dynamically, but I think separating the files into folders is much easier.
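If restructuring the folders is not possible, one workaround worth sketching (a suggestion, not something stated in the thread) is to declare two `PartitionedDataset` entries over the same folder and let `filename_suffix` filter each one down to a single file type:

```yaml
# conf/base/catalog.yml -- sketch; bucket path and entry names are illustrative
report_csvs:
  type: PartitionedDataSet
  path: gs://my-bucket/reports/report_2023_03_01/
  dataset: pandas.CSVDataSet
  filename_suffix: ".csv"

report_scores:
  type: PartitionedDataSet
  path: gs://my-bucket/reports/report_2023_03_01/
  dataset: pandas.JSONDataSet
  filename_suffix: ".json"
```

Each node then only receives the partitions with the matching suffix; the exact class names (`PartitionedDataSet` vs `partitions.PartitionedDataset`) depend on the Kedro / kedro-datasets version.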
l
I'd argue that if this report is so common, it's worth adding a new custom dataset that handles it. It's not very complex to add one, if you follow the example
s
Thanks for the insights @Nok Lam Chan. The reports are generated by an outside system, so I have no influence on the structure of the files inside the report folders. My idea was to create two separate nodes to process each file, one node for scores.json and one node for data_2.csv.
• Option 1 seems to be to use `OmegaConfigLoader` and templating the path:
```yaml
# conf/base/catalog.yaml
scores_data:
    type: pandas.JSONDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/scores.json"

data_2_data:
    type: pandas.CSVDataSet
    filepath: "gs://my-bucket/reports/${report_dir}/data_2.csv"
```
• Option 2 would be to use a custom dataset that points to a folder and reads those two files, plus a node that passes those two files on to specialized nodes for processing. In the case of a custom dataset that only supports reading the data, I would have to use fsspec, right? I was hoping `AbstractDataSet` would already provide the file I/O abstraction so that the user would only have to implement the logic of dealing with the files.
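For reference, a rough sketch of what such a read-only folder dataset could look like, following the custom-dataset pattern from the Kedro docs: `AbstractDataSet` supplies the load/save/describe interface, but the file I/O itself is typically done with fsspec. The class name, the tuple return type and loading scores.json with pandas are illustrative assumptions, not something confirmed in the thread:

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Tuple

import fsspec
import pandas as pd
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class ReportFolderDataSet(AbstractDataSet):
    """Loads data_2.csv and scores.json from a single report folder."""

    def __init__(self, filepath: str):
        # filepath points at the report folder, e.g. gs://my-bucket/reports/report_2023_03_01
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._path = PurePosixPath(path)
        self._fs = fsspec.filesystem(protocol)

    def _load(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
        csv_path = get_filepath_str(self._path / "data_2.csv", self._protocol)
        json_path = get_filepath_str(self._path / "scores.json", self._protocol)
        with self._fs.open(csv_path) as f:
            data_2 = pd.read_csv(f)
        with self._fs.open(json_path) as f:
            scores = pd.read_json(f)
        return data_2, scores

    def _save(self, data: Any) -> None:
        raise NotImplementedError("This dataset is read-only")

    def _describe(self) -> Dict[str, Any]:
        return {"path": str(self._path), "protocol": self._protocol}
```

A node taking this dataset as input would receive the `(data_2, scores)` tuple and could return the two pieces as separate outputs for the downstream processing nodes.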