# questions
l
I’d really like to know/read more about how advanced Kedro users use Kedro. Beyond the first pipeline being set up:
- How do you add new features to existing pipelines?
- How do you do dev -> prod promotion?
- Do you use notebooks to iterate on new ideas and then merge, or develop directly in scripts?
- How do you deal with pipelines that do retraining or incorporate model drift/data drift scenarios?
- Do you use the Kedro CLI or configure more advanced runs via the Python SDK?
l
These questions arise from experience and frustrations after using Kedro for several months. In my team, a training Jupyter notebook has been the basis for training new models for each client we onboard (demand forecasting), and we use Kedro pipelines mainly for predictions and refitting. Jupyter was chosen for training the models rather than Kedro directly in order to have more interactivity, with the idea of migrating the core/production training notebook into a Kedro pipeline at some point. I am currently trying to steer my team away from using notebooks in this fashion and go pipeline- or script-native instead, but I’ve found significant pushback. I wonder how others approach these issues.
y
1. Depending on what you mean by features:
   a. If you mean independent ML estimator variables, that's typically just a change in the YAML parameters file (which adds the new variables in).
   b. If you mean functionalities, that typically implies adding new nodes.
2. I haven't been setting this up myself, but I think a common pattern is to make both code and data have `dev` and `prod` versions (branches).
3. Never notebooks, only `.py` files, and I wouldn't call them scripts. I think of them more as Python packages which contain data processing functions, with Kedro as a very thin layer just to chain those functions together in a particular order and pass data between them.
4. Trigger their runs manually every X days.
5. Just the standard `run` CLI.
🙌🏼 1
🙌 1
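To make points 1a and 3 above concrete, here is a minimal sketch of that style of pipeline. The dataset, parameter, and function names are hypothetical, not taken from the thread:

```python
# conf/base/parameters.yml might contain (hypothetical names):
#   feature_columns:
#     - price
#     - promotion_flag
#     - weekday

import pandas as pd
from kedro.pipeline import Pipeline, node, pipeline


def build_features(raw: pd.DataFrame, feature_columns: list[str]) -> pd.DataFrame:
    """Point 1a: the estimator variables come straight from parameters.yml."""
    return raw[feature_columns]


def train_model(features: pd.DataFrame):
    """The actual modelling logic lives in plain Python functions like this."""
    ...


def create_pipeline() -> Pipeline:
    # Point 3: Kedro is a thin layer that only chains the functions together
    # and passes data between them.
    return pipeline(
        [
            node(build_features, ["raw_sales", "params:feature_columns"], "model_input"),
            node(train_model, "model_input", "model"),
        ]
    )
```

Adding a new estimator variable is then a one-line change to the YAML list; adding a new functionality is a new function plus a new `node(...)` entry.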
l
That's really helpful @Yury Fedotov, thank you very much. I wonder what the core contributors think as well, and whether it'd be worth having extended documentation or tutorials on this. @datajoely @Juan Luis @Deepyaman Datta
d
I also think it would be beneficial to have a best-practice, real-world Kedro pipeline that people can see as a reference, but most of the big projects are not openly available.
👌🏼 1
l
Thanks so much @Deepyaman Datta, that's very enlightening; it seems like I need to go do some studying on these articles. Another question: how common is it in Kedro to have a pipeline that trains a model or several models (as in my example of demand forecasting) and is just parametrised to run for every new client? i.e. training a new model for a new client as a one-off pipeline, and then having other pipelines that generate the predictions?
d
The other practice I've gotten into: my Python business logic lives in an independent, well-tested package. The Kedro code is an extremely simple declaration of flow. Even better if that package is in an internal artifact store.
❤️ 1
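A rough sketch of that pattern, assuming a hypothetical internal package name — the pipeline module contains no logic of its own, only wiring:

```python
# pipelines/forecasting/pipeline.py -- no business logic here, only flow.
# The imported functions live in an internal package that is tested and
# versioned on its own (e.g. published to an internal artifact store).
from kedro.pipeline import Pipeline, node, pipeline

from my_company_forecasting import clean_sales, fit_forecaster, make_predictions


def create_pipeline() -> Pipeline:
    return pipeline(
        [
            node(clean_sales, "raw_sales", "sales_clean"),
            node(fit_forecaster, ["sales_clean", "params:model_options"], "forecaster"),
            node(make_predictions, ["forecaster", "sales_clean"], "forecast"),
        ]
    )
```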
And also my current favourite pattern is Kedro + Ibis
❤️ 1
against Snowflake in prod, DuckDB with synthetic data locally
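A hedged sketch of what that Kedro + Ibis pattern can look like: the same backend-agnostic table expression runs against DuckDB locally and Snowflake in prod. Connection details and table/column names below are made up:

```python
import ibis


def connect(env: str):
    # In a Kedro project this switch would typically live in environment-specific
    # config (conf/local vs conf/prod); a plain flag keeps the sketch short.
    if env == "local":
        return ibis.duckdb.connect("synthetic.duckdb")  # local synthetic data
    return ibis.snowflake.connect(
        user="...", account="...", database="...", warehouse="..."  # stand-in values
    )


def daily_demand(con):
    orders = con.table("orders")
    # The same expression compiles to DuckDB SQL locally and Snowflake SQL in prod.
    return orders.group_by(["sku", "order_date"]).aggregate(
        units=orders.quantity.sum()
    )
```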
d
> Another question: how common is it in Kedro to have a pipeline that trains a model or several models (as in my example of demand forecasting) and is just parametrised to run for every new client? i.e. training a new model for a new client as a one-off pipeline, and then having other pipelines that generate the predictions?
I think this is pretty standard? But it would be better if a DS or MLE who has done a lot of this work in Kedro more recently could answer; it's been 5 years since I worked on demand forecasting models, and I don't remember the patterns 😅
👌🏼 1
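For context, one common way to express "the same training pipeline, parametrised per client" in Kedro is a namespaced modular pipeline. The sketch below assumes a hypothetical project layout and client list, and the exact way namespaces prefix parameter names depends on the Kedro version:

```python
from kedro.pipeline import Pipeline, pipeline

# Hypothetical import: the single-client training pipeline defined elsewhere
from demand_forecasting.pipelines.training import create_pipeline as training_template


def per_client_training(clients: list[str]) -> Pipeline:
    """Build one namespaced copy of the training pipeline per client."""
    combined = Pipeline([])
    for client in clients:
        # namespace prefixes the copy's dataset (and, depending on Kedro version,
        # parameter) names, so each client resolves to its own catalog entries,
        # e.g. "client_a.model_input"
        combined += pipeline(training_template(), namespace=client)
    return combined
```

Onboarding a new client is then mostly a matter of adding that client's entries to the catalog and parameters files, and the prediction pipelines can consume the per-client model datasets the same way.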