Yury Fedotov
06/27/2024, 6:25 AM
I use the int layer as a typed/concatenated mirror of raw, then pri, feat, etc.
And while my raw dataset definitions are quite long and differ from dataset to dataset, e.g. like this:
raw_notifications_multisheet:
  type: pandas.ExcelDataset
  filepath: data/01_raw/...xlsx
  load_args:
    sheet_name: null
    dtype:
      Order: str
      Equipment: str
  <<: *raw_layer
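The `<<: *raw_layer` merge key is standard YAML, so it assumes a matching anchor defined earlier in the same catalog file. A minimal sketch of what such anchors could look like, assuming they carry kedro-viz layer metadata (the `_raw_layer` naming and contents here are illustrative, not from the original message):

```yaml
# Hypothetical anchor definitions. Top-level entries starting with "_"
# are ignored by Kedro's catalog loader, so they exist only to be merged.
_raw_layer: &raw_layer
  metadata:
    kedro-viz:
      layer: raw

_intermediate_layer: &intermediate_layer
  metadata:
    kedro-viz:
      layer: intermediate
```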
It takes me just 3 dataset definitions to capture an arbitrary number of int, pri, and feat layer datasets, all of which I just want to save as parquet files:
"int_{dataset}":
type: pandas.ParquetDataset
filepath: data/02_intermediate/int_{dataset}.parquet
<<: *intermediate_layer
"pri_{dataset}":
type: pandas.ParquetDataset
filepath: data/03_primary/pri_{dataset}.parquet
<<: *primary_layer
"feat_{dataset}":
type: pandas.ParquetDataset
filepath: data/04_feature/feat_{dataset}.parquet
<<: *feature_layer
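At run time, any node input or output whose name matches one of these patterns gets its dataset generated on the fly. For example, a hypothetical output named `int_orders` would match `"int_{dataset}"` and effectively expand to:

```yaml
# What the "int_{dataset}" factory resolves to for a made-up dataset
# named int_orders -- {dataset} is substituted everywhere it appears:
int_orders:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/int_orders.parquet
  <<: *intermediate_layer
```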
Without dataset factories, the catalog YAML would have been incredibly long, or at best I would have had to use a Jinja for loop, which requires knowing all datasets in advance of the run.
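For comparison, the Jinja alternative would look roughly like this (a sketch only: the dataset names are made up, and it relies on a Jinja2-enabled config loader, which not all Kedro config loaders support):

```yaml
# Hypothetical: every dataset must be enumerated before the run.
{% for dataset in ["notifications", "orders", "equipment"] %}
int_{{ dataset }}:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/int_{{ dataset }}.parquet
{% endfor %}
```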