# questions
l
How do you avoid over-DRYing ("Don't Repeat Yourself") when using Kedro? Given the fairly opinionated syntax and project structure it proposes, I find it easy to DRY up bits of code that would be best left not DRY (e.g. preprocessing code). I wonder if anyone else has had similar thoughts
h
Someone will reply to you shortly. In the meantime, this might help:
d
So in the Kedro tutorial we keep everything in one project; longer term, I move all business logic into independently tested packages. This also means your Kedro projects are really lightweight representations of flow and the data catalog. Using dataset factories in the data catalog also massively improves DRY
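As a sketch of that split (the module and function names below are hypothetical, not from the Kedro docs):

```python
# Business logic lives in an independently tested package; the Kedro
# project only wires flow together. All names here are hypothetical.

# --- e.g. my_company/preprocessing.py, unit-tested on its own ---
def clean_orders(df):
    """Pure transformation: all dataframe logic lives here."""
    return df  # real cleaning steps would go here

# --- the Kedro project then stays a thin flow description ---
# (commented out since it needs a Kedro installation and project)
# from kedro.pipeline import node, pipeline
# pipe = pipeline([node(clean_orders, inputs="raw_orders", outputs="clean_orders")])
```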
d
What's an example of over-DRY?
d
I think it’s possible to make things so DRY it’s hard to follow, but in general it’s not a problem
l
@Deepyaman Datta to me, writing functions that are a thin wrapper around some pandas/polars operations, much more straightforward to just read the plain dataframe operations in their native language
@datajoely I am still fairly new to kedro, what do you mean by dataset factories? I can't see a mention of it in the docs
Also, do you have an example of Kedro projects built on top of independently tested packages? An advanced tutorial for it would be a fantastic addition to the docs
l
ok, apologies for that, I need to change search engine! that was Google's top result
shouldn't the example be:
boats:
  type: pandas.CSVDataset
  filepath: data/01_raw/boats.csv

cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/cars.csv

planes:
  type: pandas.CSVDataset
  filepath: data/01_raw/planes.csv
??
d
Yes, but with the factory approach all three of those can be collapsed into one DRY pattern-matching entry called a dataset factory. A dataset factory is similar to a regular expression, and you can think of it as a reversed f-string. In this case, the name of the input dataset factory_data matches the pattern {name}_data with the _data suffix, so it resolves name to factory. Similarly, it resolves name to process for the output dataset process_data. This allows you to use one dataset factory pattern to replace multiple dataset entries. It keeps your catalog concise, and you can generalise datasets using similar names, types or namespaces.
l
That's very cool on the catalog front! I'd love to see how people avoid over-DRYing in pipelines and nodes, especially how people build packages for this!
d
@Deepyaman Datta to me, writing functions that are a thin wrapper around some pandas/polars operations, much more straightforward to just read the plain dataframe operations in their native language
This is possible! For example, say you want to use https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop_nulls.html. Instead of defining a node function, you can do from operator import methodcaller and use methodcaller("drop_nulls") as your node function.
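A minimal runnable sketch of the idea (the Frame class below is a stand-in for a polars DataFrame so the snippet has no dependencies; the node(...) line is only illustrative):

```python
from operator import methodcaller

# methodcaller("drop_nulls") builds a callable equivalent to
# lambda df: df.drop_nulls(), so it can be passed straight to a node
# without writing a one-line wrapper function:
#
#   node(methodcaller("drop_nulls"), inputs="raw", outputs="clean")

class Frame:
    """Stand-in for a polars/pandas DataFrame with a drop_nulls method."""
    def drop_nulls(self):
        return "cleaned"

drop_nulls = methodcaller("drop_nulls")
print(drop_nulls(Frame()))  # prints "cleaned"
```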
l
I see, that's nice, thanks! I'll have a look at the methodcaller method!