# questions
l
How do you avoid over-DRYing ("Don't Repeat Yourself") when using Kedro? Given the fairly opinionated syntax and project structure it proposes, I find it easy to DRY up bits of code that would be best left not DRY (e.g. preprocessing code). I wonder if anyone else has had similar thoughts
h
Someone will reply to you shortly. In the meantime, this might help:
d
So in the Kedro tutorial we keep everything in one project; longer term, I move all business logic into independently tested packages. This also means your Kedro projects are really lightweight representations of flow and the data catalog. Using dataset factories in the data catalog also massively improves DRY
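As a sketch of that split (the module and function names below are hypothetical, not from the Kedro docs):

```python
# Business logic lives in an independently tested package; the Kedro
# project only wires flow together. All names here are hypothetical.

# --- e.g. my_company/preprocessing.py, unit-tested on its own ---
def clean_orders(df):
    """Pure transformation: all dataframe logic lives here."""
    return df  # real cleaning steps would go here

# --- the Kedro project then stays a thin flow description ---
# (commented out since it needs a Kedro installation and project)
# from kedro.pipeline import node, pipeline
# pipe = pipeline([node(clean_orders, inputs="raw_orders", outputs="clean_orders")])
```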
d
What's an example of over-DRY?
d
I think it’s possible to make things so DRY it’s hard to follow, but in general it’s not a problem
l
@Deepyaman Datta to me, writing functions that are a thin wrapper around some pandas/polars operations, much more straightforward to just read the plain dataframe operations in their native language
@datajoely I am still fairly new to kedro, what do you mean by dataset factories? I can't see a mention of it in the docs
Also, do you have an example of Kedro projects built on top of independently tested packages? An advanced tutorial for it would be a fantastic addition to the docs
l
ok, apologies for that, I need to change search engine! that was Google's top result
shouldn't the example be:
boats:
  type: pandas.CSVDataset
  filepath: data/01_raw/boats.csv

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/cars.csv

planes:
  type: pandas.CSVDataset
  filepath: data/01_raw/planes.csv
??
d
Yes, but with the factory approach all three of those can be collapsed into one DRY pattern-matching entry called a dataset factory. A dataset factory is similar to a regular expression, and you can think of it as a reversed f-string. In this case, the name of the input dataset factory_data matches the pattern {name}_data with the _data suffix, so it resolves name to factory. Similarly, it resolves name to process for the output dataset process_data. This allows you to use one dataset factory pattern to replace multiple dataset entries. It keeps your catalog concise, and you can generalise datasets using similar names, types or namespaces.
l
That's very cool on the catalog front! I'd love to see how people avoid over-DRYing in pipelines and nodes, especially how people build packages for this!
d
@Deepyaman Datta to me, writing functions that are a thin wrapper around some pandas/polars operations, much more straightforward to just read the plain dataframe operations in their native language
This is possible! For example, say you want to use https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop_nulls.html. Instead of defining a node function, you can do from operator import methodcaller and use methodcaller("drop_nulls") as your node function.
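A minimal runnable sketch of the idea (the Frame class below is a stand-in for a polars DataFrame so the snippet has no dependencies; the node(...) line is only illustrative):

```python
from operator import methodcaller

# methodcaller("drop_nulls") builds a callable equivalent to
# lambda df: df.drop_nulls(), so it can be passed straight to a node
# without writing a one-line wrapper function:
#
#   node(methodcaller("drop_nulls"), inputs="raw", outputs="clean")

class Frame:
    """Stand-in for a polars/pandas DataFrame with a drop_nulls method."""
    def drop_nulls(self):
        return "cleaned"

drop_nulls = methodcaller("drop_nulls")
print(drop_nulls(Frame()))  # prints "cleaned"
```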
l
I see, that's nice, thanks! I'll have a look at the methodcaller method!