# questions
o
I guess what I’m missing is how Kedro is integrated as part of a real-world application, and not just data science in a vacuum. Is there, say, a kedro folder in Git with a per-experiment folder and relative Python imports for the core code? Pointers to a real-world application on GitHub that uses Kedro across different experiments would be useful.
s
What does "data science" in a vacuum mean? Basically you would copy/paste your codebase into the specific kedro project for it to be run by the kedro cli. However reusability and importability of kedro pipelines -- every node is just a regular python function -- is given 100%. See also https://kedro.readthedocs.io/en/stable/nodes_and_pipelines/modular_pipelines.html Regarding 'real world github': how many companys are there that release ANY analytics/datascience codebase publicly? Perhaps there are others here who could chime in with a live session on kinda nonconfidential prod code. Regarding Experiment tracking: https://kedro.readthedocs.io/en/stable/logging/experiment_tracking.html Perhaps you could elaborate a bit more on the use case.
o
Hi @Sebastian Pehle, thanks! So I have an existing codebase in GitHub that implements a classification pipeline.
Now say I want to adopt Kedro for my MLOps/SDLC (Software Development Lifecycle). If I start a new Kedro workspace for experiment #1, it encourages me to put my data science code under src/, right? Then I start another Kedro workspace for experiment #2. Should I duplicate my code under src/ for both experiments 1 and 2?
Should I use relative imports in Python?
say this is my folder hierarchy:
```
docs/
kedro/
  experiment1/
    src/
  experiment2/
    src/
src/
  regression_model_main.py
  regression_model_core.py
```
What I’m trying to do is wrap my head around code organization and reusability of my Python code across different workspaces / models / experiments.
Does it make sense?
s
You mean you want to use Kedro as a sub-framework inside an existing project framework, and for experiments only?
o
I would like to adopt a “healthy” way of developing data science and Kedro seems compelling, but I am a bit puzzled about code reuse across experiments
Concrete example: let’s say I have a data preprocessing helper that applies certain business logic to clean the data, and I want to apply a bug fix to all of my existing experiments, i.e. different Kedro folders. What is the best practice here?
1. Backport the changes and manually apply the fix to each experiment?
2. Clone/fork the existing experiment you’d like to fix and apply the fix on top of the newly forked experiment?
3. Maintain a single source of truth, i.e. repo/src/, and do relative Python imports from all experiments/folders to repo/src?
What does Kedro encourage its users to do?
s
I think I'm now aware of your aim: how would one incorporate already-built modules to be shared across many different projects? Inside Kedro (copy/paste into a specific project/micro-package) or as 'regular' Python modules imported from outside? However, I have to pass on this one as it is beyond my knowledge.
NB: you can always import the nodes.py with functions of general business logic from a multipurpose Kedro pipeline -- or from some other, regular Python module -- and call it inside the node for the specific project, as long as Python knows where to look. Therefore no copy-pasting is needed and the pipelines are always up to date -- as long as the local folder is up to date with Git. Collaboration across different machines regarding the path is another problem to solve here, but:

```python
import sys

import pandas as pd


def apply_businesslogic_to_df(df: pd.DataFrame) -> pd.DataFrame:
    # Make the shared pipeline package importable (machine-specific path).
    sys.path.insert(1, 'PATH/TO/kedro-multipurpose/src/kedro_multipurpose/pipelines')
    from data_processing_businesslogic import nodes

    return nodes.apply_businesslogic(df)
```
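One way around the path problem would be standard Python packaging instead of the sys.path insert: package the shared business logic once and `pip install -e` it into each experiment's environment, so every project imports the same single source of truth. A minimal sketch (the package name "business_logic" and the layout are made up for illustration):

```python
# setup.py for a hypothetical shared package "business_logic",
# living next to the experiments, e.g. at repo/setup.py with the
# code under repo/src/business_logic/.
from setuptools import find_packages, setup

setup(
    name="business_logic",
    version="0.1.0",
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    install_requires=["pandas"],
)
```

After `pip install -e repo/`, any experiment's nodes.py can do `from business_logic import apply_businesslogic` with no path manipulation, and a bug fix under repo/src/ reaches every experiment immediately.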