# questions
a
Hey all - we're all familiar with the common `data` subdir organization from the default kedro starter:
```
/data
├── 01_raw
├── 02_intermediate
├── 03_primary
├── 04_feature
├── 05_model_input
├── 06_models
├── 07_model_output
└── 08_reporting
```
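(For context, each of those folders typically corresponds to a layer tag on the catalog entries stored in it. A minimal sketch of what that looks like — dataset class names follow recent `kedro-datasets` releases, and the `kedro-viz` metadata key may differ on older versions:)
```yaml
# conf/base/catalog.yml -- illustrative entries only
companies:
  type: pandas.CSVDataset                  # kedro-datasets >= 2.0 naming
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      layer: raw                           # rendered as a layer band in kedro-viz

model_input_table:
  type: pandas.ParquetDataset
  filepath: data/05_model_input/model_input_table.parquet
  metadata:
    kedro-viz:
      layer: model_input
```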
But I'm curious what other folder organization schemes folks are using. What's your favorite??
d
Databricks medallion architecture is also pretty common. I've also seen some Kedro + Databricks architecture use a hybrid of medallion + Kedro layers. I think it doesn't really matter, as long as you have something consistent. For me, the best data organization scheme is one that I don't need to think about. To that end, I'm generally happy using Kedro's until I need another layer, in which case I can add it as necessary.
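A hybrid along those lines might look something like this (purely an illustrative sketch; the folder names, numbering, and layer mapping are my own, not a standard):
```
/data
├── 01_bronze        <-- raw, as ingested (≈ 01_raw)
├── 02_silver        <-- validated/cleaned (≈ 02_intermediate + 03_primary)
├── 03_gold          <-- enriched, business-level (≈ 04_feature)
├── 04_model_input
├── 05_models
├── 06_model_output
└── 07_reporting
```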
j
> Databricks medallion architecture is also pretty common.
so, Bronze (raw) + Silver (validated) + Gold (enriched)? https://www.databricks.com/glossary/medallion-architecture
i
To me it isn't always 100% clear where things should go between raw and intermediate, or between primary and feature, but I do try to follow that layered structure, though sometimes I have a filtering step between raw and intermediate.
👀 1
q
I've been replacing 08 with a "processed output" step because our model output isn't our final product. That pushes reporting to 09
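That leaves the tail of the tree looking something like this (the folder name is just one way to spell it):
```
├── 07_model_output
├── 08_processed_output   <-- model output post-processed into the final product
└── 09_reporting
```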
👍🏼 1
n
Agree with @Iñigo Hidalgo, the layers are generally a guideline rather than a strict order. An "output" from one pipeline can be an "input" to another, so the semantics change depending on how you structure things. In general you'll have an end-to-end pipeline, and where data sits along it roughly determines what counts as "intermediate" or "model_input".
a
Interesting thoughts all, thanks.
I like the idea of expanding 08 into more granular post-model layers
I also find myself not really needing the level of granularity in the 'data engineering' layers
So I guess my question wrt that stated philosophy would be: does this kedro layer thinking change at all if one manages data transformations completely outside of Kedro (for example, using dbt or a feature store)?
d
> So I guess my question wrt that stated philosophy would be: does this kedro layer thinking change at all if one manages data transformations completely outside of Kedro (for example, using dbt or a feature store)?
I would only use stuff from `model_input` onwards (or whatever point you transition over at; potentially, if you're reading features, you register them in `features`).
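For instance, a catalog entry that consumes a table built by dbt straight into the `model_input` layer might look like this (the table, schema, and credential names are hypothetical):
```yaml
# conf/base/catalog.yml -- hypothetical entry consuming a dbt-built table
model_input_table:
  type: pandas.SQLTableDataset
  table_name: analytics.model_input_table   # materialised by dbt, not Kedro
  credentials: warehouse_credentials        # defined in credentials.yml with a `con` key
  metadata:
    kedro-viz:
      layer: model_input
```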
a
Yeah, I tend to agree.
Like maybe use `features` to stage data pulled from a feature store, and `model_input` to stage data constructed from features and ready for fitting.
But also, since the granularity of the `01_`-`04_` kedro layers was designed from a primarily DE perspective, I wonder what additional granularity of layers would logically arise from rethinking from a DS perspective? For example (and there will be some overlap here):
• benchmark datasets (from papers, etc.)
• data before train/test split
• training data
• test data
• "inference" data that has been transformed/scored/augmented by model inferences (`07_`)
I could also see granular expansion of `08_reporting`:
• model_diagnostics data
• model_validation data
• model_comparison data (experiment tracking)
💡 1
After some thought, here's a possible "DS perspective-driven" version of the kedro layers:
```
├── data
│   ├── 01_source         <-- Datasets cached from external sources
│   ├── 02_staging        <-- Transformed datasets
│   ├── 03_benchmark      <-- Benchmark datasets & metrics
│   ├── 04_training       <-- Direct inputs to model fitting
│   ├── 05_testing        <-- Direct inputs to model inference
│   ├── 06_models         <-- Serialised models
│   ├── 07_inference      <-- Data generated from models
│   ├── 08_diagnostic     <-- Model diagnostic data
│   └── 09_evaluation     <-- Model evaluation data
```
🙌 1
n
In case you're interested in making your own custom starter: https://docs.kedro.org/en/stable/kedro_project_setup/starters.html You can either create it as a repository or make it pip-installable.
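If you go the repository route, new projects can then be spun up from it with something along these lines (the URL is a placeholder; `--checkout` pins a branch or tag):
```
kedro new --starter=https://github.com/your-org/your-kedro-starter.git --checkout=main
```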
a
Yep, doing exactly that now 🙂
K 1