Andrew Stewart
08/24/2023, 10:12 PM
data subdir organization from the default kedro starter:
/data
├── 01_raw
├── 02_intermediate
├── 03_primary
├── 04_feature
├── 05_model_input
├── 06_models
├── 07_model_output
└── 08_reporting
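For anyone who wants to reproduce that layout outside the starter, here is a minimal sketch using only the standard library. The function name `scaffold_data_dir` and the constant `DEFAULT_LAYERS` are hypothetical, made up for illustration; only the folder names come from the tree above.

```python
from pathlib import Path

# Layer names copied from the default starter layout quoted above.
DEFAULT_LAYERS = [
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
]


def scaffold_data_dir(root, layers=DEFAULT_LAYERS):
    """Create root/data/<layer> for every layer and return the created paths.

    Hypothetical helper: not part of Kedro, just pathlib.
    """
    created = []
    for name in layers:
        path = Path(root) / "data" / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Passing a different `layers` list is enough to try out any of the alternative schemes discussed in this thread.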
But I'm curious what other folder organization schemes folks are using. What's your favorite??
Deepyaman Datta
08/24/2023, 11:49 PM
Juan Luis
08/25/2023, 7:30 AM
Databricks medallion architecture is also pretty common.
so, Bronze (raw) + Silver (validated) + Gold (enriched)? https://www.databricks.com/glossary/medallion-architecture
Iñigo Hidalgo
08/25/2023, 9:44 AM
quantumtrope
08/25/2023, 1:54 PM
Nok Lam Chan
08/25/2023, 5:32 PM
Andrew Stewart
08/25/2023, 8:17 PM
Deepyaman Datta
08/25/2023, 9:22 PM
So I guess my question wrt that stated philosophy would be: does this kedro layer thinking change at all if one manages data transformations completely outside of Kedro (for example, using dbt or a feature store)?
I would only use stuff from model_input onwards (or whatever point you transition over; potentially, if you're reading features, you register them in features).
Andrew Stewart
08/29/2023, 5:43 PM
• features to stage data pulled from a feature store
• model_input to stage data constructed from features and ready for fitting
The 01_ - 04_ kedro layers were designed from a primarily DE perspective; I wonder what additional granularity of layers would logically arise from rethinking them from a DS perspective?
For example (and there will be some overlap here):
• benchmark datasets (from papers, etc)
• data before train/test split
• training data
• test data
• "inference" data that has been transformed/scored/augmented by model inferences (07_)
I could also see granular expansion of 08_reporting:
• model_diagnostics data
• model_validation data
• model_comparison data (experiment tracking)
├── data
│ ├── 01_source <-- Datasets cached from external sources
│ ├── 02_staging <-- Transformed datasets
│ ├── 03_benchmark <-- Benchmark datasets & metrics
│ ├── 04_training <-- Direct inputs to model fitting
│ ├── 05_testing <-- Direct inputs to model inference
│ ├── 06_models <-- Serialised models
│ ├── 07_inference <-- Data generated from models
│ ├── 08_diagnostic <-- Model diagnostic data
│ ├── 09_evaluation <-- Model evaluation data
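One way to keep a layout like this consistent in code is to look folders up by a short layer key instead of hard-coding paths. The sketch below is only an illustration under that assumption: `DS_LAYERS` and `layer_path` are hypothetical names, and the folder names and descriptions are copied from the tree above.

```python
from pathlib import Path

# Short key -> (folder name, description), copied from the tree above.
DS_LAYERS = {
    "source": ("01_source", "Datasets cached from external sources"),
    "staging": ("02_staging", "Transformed datasets"),
    "benchmark": ("03_benchmark", "Benchmark datasets & metrics"),
    "training": ("04_training", "Direct inputs to model fitting"),
    "testing": ("05_testing", "Direct inputs to model inference"),
    "models": ("06_models", "Serialised models"),
    "inference": ("07_inference", "Data generated from models"),
    "diagnostic": ("08_diagnostic", "Model diagnostic data"),
    "evaluation": ("09_evaluation", "Model evaluation data"),
}


def layer_path(layer, filename, root=Path("data")):
    """Return root/<layer folder>/filename, rejecting unknown layer keys.

    Hypothetical helper: not part of Kedro.
    """
    try:
        dirname, _description = DS_LAYERS[layer]
    except KeyError:
        raise ValueError(
            f"unknown layer {layer!r}; expected one of {sorted(DS_LAYERS)}"
        ) from None
    return Path(root) / dirname / filename
```

For example, `layer_path("training", "train.parquet").as_posix()` gives `data/04_training/train.parquet`, and a misspelled key fails loudly instead of silently creating a new folder.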
Nok Lam Chan
09/01/2023, 6:33 PM
Andrew Stewart
09/05/2023, 4:23 PM