# user-research
Hey Kedro team, just wanted to share that I recently found dataset factories to be a super cool feature, and to say thanks for that 😁. I'm building a data ingestion & processing pipeline inspired by this article (@datajoely), where I have an `int` layer as a typed/concatenated mirror of `raw`, then `pri` and `feat` layers, etc. And while my `raw` dataset definitions are quite long and differ from dataset to dataset, e.g. like this:
```yaml
raw_notifications_multisheet:
  type: pandas.ExcelDataset
  filepath: data/01_raw/...xlsx
  load_args:
    sheet_name: null
    dtype:
      Order: str
      Equipment: str
  <<: *raw_layer
```
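(In case it helps anyone reading along: `*raw_layer` and friends are plain YAML anchors I define once at the top of the catalog; Kedro ignores top-level keys starting with an underscore, so they never become datasets. The exact metadata below is illustrative, not copied from my catalog:)

```yaml
# Shared YAML anchors, merged into dataset definitions via <<: *...
# Keys starting with "_" are skipped by the Kedro catalog.
# The kedro-viz layer metadata here is illustrative.
_raw_layer: &raw_layer
  metadata:
    kedro-viz:
      layer: raw

_intermediate_layer: &intermediate_layer
  metadata:
    kedro-viz:
      layer: intermediate
```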
it takes me just 3 dataset definitions to capture an arbitrary number of `int`, `pri`, and `feat` layer datasets, all of which I just want to save as a parquet file:
```yaml
"int_{dataset}":
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/int_{dataset}.parquet
  <<: *intermediate_layer

"pri_{dataset}":
  type: pandas.ParquetDataset
  filepath: data/03_primary/pri_{dataset}.parquet
  <<: *primary_layer

"feat_{dataset}":
  type: pandas.ParquetDataset
  filepath: data/04_feature/feat_{dataset}.parquet
  <<: *feature_layer
```
If not for dataset factories, the catalog YAML would have been incredibly long, or at best I would have had to use a Jinja for-loop, which requires knowing all the datasets in advance of the run.
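For comparison, the Jinja version would have looked roughly like this (dataset names are just for illustration), with the full list hard-coded before the run:

```yaml
# Jinja alternative: every dataset must be enumerated up front
{% for ds in ["notifications", "orders", "equipment"] %}
int_{{ ds }}:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/int_{{ ds }}.parquet
{% endfor %}
```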
🚀 10
💛 7