Yury Fedotov
06/27/2024, 6:25 AM
I use the int layer as a typed/concatenated mirror of raw, then pri, feat, etc.
And while my raw dataset definitions are quite long and differ from dataset to dataset, e.g. like this:
raw_notifications_multisheet:
  type: pandas.ExcelDataset
  filepath: data/01_raw/...xlsx
  load_args:
    sheet_name: null
    dtype:
      Order: str
      Equipment: str
  <<: *raw_layer
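The `<<: *raw_layer` merge key is standard YAML, so it assumes a matching anchor defined earlier in the same catalog file. A minimal sketch of what such anchors could look like, assuming they carry kedro-viz layer metadata (the `_raw_layer` naming and contents here are illustrative, not from the original message):

```yaml
# Hypothetical anchor definitions. Top-level entries starting with "_"
# are ignored by Kedro's catalog loader, so they exist only to be merged.
_raw_layer: &raw_layer
  metadata:
    kedro-viz:
      layer: raw

_intermediate_layer: &intermediate_layer
  metadata:
    kedro-viz:
      layer: intermediate
```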
It takes me just 3 dataset definitions to capture an arbitrary number of int, pri, and feat layer datasets, all of which I just want to save as parquet files:
"int_{dataset}":
type: pandas.ParquetDataset
filepath: data/02_intermediate/int_{dataset}.parquet
<<: *intermediate_layer
"pri_{dataset}":
type: pandas.ParquetDataset
filepath: data/03_primary/pri_{dataset}.parquet
<<: *primary_layer
"feat_{dataset}":
type: pandas.ParquetDataset
filepath: data/04_feature/feat_{dataset}.parquet
<<: *feature_layer
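At run time, any node input or output whose name matches one of these patterns gets its dataset generated on the fly. For example, a hypothetical output named `int_orders` would match `"int_{dataset}"` and effectively expand to:

```yaml
# What the "int_{dataset}" factory resolves to for a made-up dataset
# named int_orders -- {dataset} is substituted everywhere it appears:
int_orders:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/int_orders.parquet
  <<: *intermediate_layer
```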
Without dataset factories, the catalog YAML would have been incredibly long, or at best I would have had to use a Jinja for loop, which requires knowing all datasets in advance of the run.
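For comparison, the Jinja alternative would look roughly like this (a sketch only: the dataset names are made up, and it relies on a Jinja2-enabled config loader, which not all Kedro config loaders support):

```yaml
# Hypothetical: every dataset must be enumerated before the run.
{% for dataset in ["notifications", "orders", "equipment"] %}
int_{{ dataset }}:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/int_{{ dataset }}.parquet
{% endfor %}
```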