# questions
a
Hey all - we're all familiar with the common `data` subdir organization from the default kedro starter:
```
/data
├── 01_raw
├── 02_intermediate
├── 03_primary
├── 04_feature
├── 05_model_input
├── 06_models
├── 07_model_output
└── 08_reporting
```
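(For context, each of those folders typically corresponds to a layer tag on the catalog entries stored in it. A minimal sketch of what that looks like — dataset class names follow recent `kedro-datasets` releases, and the `kedro-viz` metadata key may differ on older versions:)
```yaml
# conf/base/catalog.yml -- illustrative entries only
companies:
  type: pandas.CSVDataset                  # kedro-datasets >= 2.0 naming
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      layer: raw                           # rendered as a layer band in kedro-viz

model_input_table:
  type: pandas.ParquetDataset
  filepath: data/05_model_input/model_input_table.parquet
  metadata:
    kedro-viz:
      layer: model_input
```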
But I'm curious what other folder organization schemes folks are using. What's your favorite??
d
Databricks medallion architecture is also pretty common. I've also seen some Kedro + Databricks architecture use a hybrid of medallion + Kedro layers. I think it doesn't really matter, as long as you have something consistent. For me, the best data organization scheme is one that I don't need to think about. To that end, I'm generally happy using Kedro's until I need another layer, in which case I can add it as necessary.
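A hybrid along those lines might look something like this (purely an illustrative sketch; the folder names, numbering, and layer mapping are my own, not a standard):
```
/data
├── 01_bronze        <-- raw, as ingested (≈ 01_raw)
├── 02_silver        <-- validated/cleaned (≈ 02_intermediate + 03_primary)
├── 03_gold          <-- enriched, business-level (≈ 04_feature)
├── 04_model_input
├── 05_models
├── 06_model_output
└── 07_reporting
```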
j
> Databricks medallion architecture is also pretty common.
so, Bronze (raw) + Silver (validated) + Gold (enriched)? https://www.databricks.com/glossary/medallion-architecture
i
To me it isn't always 100% clear where things should go between raw and intermediate, or between primary and feature, but I do try to follow that layered structure, though sometimes I have a filtering step between raw and intermediate.
👀 1
q
I've been replacing 08 with a "processed output" step because our model output isn't our final product. That pushes reporting to 09
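That leaves the tail of the tree looking something like this (the folder name is just one way to spell it):
```
├── 07_model_output
├── 08_processed_output   <-- model output post-processed into the final product
└── 09_reporting
```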
👍🏼 1
n
Agree with @Iñigo Hidalgo, the layers are generally a guideline rather than a strict order. An "output" from one pipeline can be an "input" to another, so the semantics change depending on how you structure things. In general you'll have an end-to-end pipeline, and where data sits along it roughly determines what counts as "intermediate" or "model_input".
a
Interesting thoughts all, thanks.
I like the idea of expanding 08 into more granular post-model layers
I also find myself not really needing the level of granularity in the 'data engineering' layers
So I guess my question wrt that stated philosophy would be: does this kedro layer thinking change at all if one manages data transformations completely outside of Kedro (for example, using dbt or a feature store)?
d
> So I guess my question wrt that stated philosophy would be: does this kedro layer thinking change at all if one manages data transformations completely outside of Kedro (for example, using dbt or a feature store)?
I would only use stuff from `model_input` onwards (or whatever point you transition over at; potentially, if you're reading features, you register them in `features`).
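For instance, a catalog entry that consumes a table built by dbt straight into the `model_input` layer might look like this (the table, schema, and credential names are hypothetical):
```yaml
# conf/base/catalog.yml -- hypothetical entry consuming a dbt-built table
model_input_table:
  type: pandas.SQLTableDataset
  table_name: analytics.model_input_table   # materialised by dbt, not Kedro
  credentials: warehouse_credentials        # defined in credentials.yml with a `con` key
  metadata:
    kedro-viz:
      layer: model_input
```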
a
Yeah, I tend to agree.
Like maybe use `features` to stage data pulled from a feature store, and `model_input` to stage data constructed from features and ready for fitting.
But also, since the granularity of the `01_`-`04_` kedro layers was designed from a primarily DE perspective, I wonder what additional granularity of layers would logically arise from rethinking from a DS perspective? For example (and there will be some overlap here):
• benchmark datasets (from papers, etc.)
• data before train/test split
• training data
• test data
• "inference" data that has been transformed/scored/augmented by model inferences (`07_`)
I could also see granular expansion of `08_reporting`:
• model_diagnostics data
• model_validation data
• model_comparison data (experiment tracking)
💡 1
After some thought, here's a possible "DS perspective-driven" version of the kedro layers:
```
├── data
│   ├── 01_source         <-- Datasets cached from external sources
│   ├── 02_staging        <-- Transformed datasets
│   ├── 03_benchmark      <-- Benchmark datasets & metrics
│   ├── 04_training       <-- Direct inputs to model fitting
│   ├── 05_testing        <-- Direct inputs to model inference
│   ├── 06_models         <-- Serialised models
│   ├── 07_inference      <-- Data generated from models
│   ├── 08_diagnostic     <-- Model diagnostic data
│   └── 09_evaluation     <-- Model evaluation data
```
🙌 1
n
In case you're interested in making your own custom starter: https://docs.kedro.org/en/stable/kedro_project_setup/starters.html You can either create it as a repository or make it pip-installable.
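If you go the repository route, new projects can then be spun up from it with something along these lines (the URL is a placeholder; `--checkout` pins a branch or tag):
```
kedro new --starter=https://github.com/your-org/your-kedro-starter.git --checkout=main
```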
a
Yep, doing exactly that now 🙂
K 1