# questions
о
Hi all! I have a couple of questions regarding best practices for Kedro usage. Frequently, ML models incorporate some preprocessing logic right in the model classes, and there may be a fairly complex class inheritance structure to provide abstractions, for example to try models with a similar interface: a `BaseRegressor` and lots of subclasses like `LGBMRegressor` and `LinearRegressor`. All these wrappers do not only call sklearn.model.predict or lgbm.model.predict but also incorporate quite a long list of data preparations. So my first question is: how compatible is this paradigm of "advanced and abstract ML development" with Kedro, which (to the best of my understanding) is mostly about pipelines? In the basic examples I see that there may be any number of preprocessing steps like load, filter, enrich, fillna, etc., and then just a train step. That fits the pipeline logic perfectly, but it probably doesn't work well if you keep some methods in the model class and also rely on internal state. Maybe you know some good practices or have any ideas?
The second question is similar to the first one but mostly covers the inference part. Please correct me if I'm wrong, but I mostly see Kedro as a framework for preprocessing and training ML routines. What is recommended if I want to reuse some of my logic (already defined as data_processing nodes) for model inference? Thank you very much!
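To make the first question concrete, here is a minimal sketch of the kind of wrapper I mean (class, method, and column names are just illustrative):

```python
from abc import ABC, abstractmethod

import pandas as pd
from sklearn.linear_model import LinearRegression


class BaseRegressor(ABC):
    """Shared interface: the preprocessing lives inside the model class."""

    def _prepare(self, df: pd.DataFrame) -> pd.DataFrame:
        # in reality this is a long chain of filtering / enrichment / fillna
        return df.fillna(0)

    def fit(self, df: pd.DataFrame) -> "BaseRegressor":
        df = self._prepare(df)
        self._fit(df.drop(columns=["target"]), df["target"])
        return self

    def predict(self, df: pd.DataFrame):
        return self._predict(self._prepare(df))

    @abstractmethod
    def _fit(self, X, y): ...

    @abstractmethod
    def _predict(self, X): ...


class LinearRegressor(BaseRegressor):
    def _fit(self, X, y):
        self._model = LinearRegression().fit(X, y)

    def _predict(self, X):
        return self._model.predict(X)
```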
h
Someone will reply to you shortly. In the meantime, this might help:
r
Hi Олег Литвинов, if you haven't seen this already: for complex projects, Kedro recommends using namespaces and modular pipelines as a good practice. Check the modular pipelines and namespace docs for further information.
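As a rough sketch of what that looks like (node functions and dataset names are just placeholders), you can declare a set of nodes once and reuse it under two namespaces:

```python
from kedro.pipeline import node, pipeline


def preprocess(raw_data):       # stand-in for the real preprocessing
    return raw_data.fillna(0)


def train_model(model_input):   # stand-in for the real training
    ...


base = pipeline(
    [
        node(preprocess, "raw_data", "model_input"),
        node(train_model, "model_input", "regressor"),
    ]
)

# The same nodes, namespaced per model variant; datasets and parameters get
# prefixed (lgbm.model_input, linear.regressor, ...), while the shared input
# "raw_data" stays un-prefixed thanks to the mapping.
lgbm = pipeline(base, namespace="lgbm", inputs={"raw_data": "raw_data"})
linear = pipeline(base, namespace="linear", inputs={"raw_data": "raw_data"})
```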
о
I will check this out, thank you! Would be happy to hear various options and experiences 🙂
y
And here is kedro-mlflow and its tutorial, which is specifically designed to address these issues: https://github.com/Galileo-Galilei/kedro-mlflow-tutorial
Think about it as a "scikit-learn-like pipeline, but for any arbitrary Kedro pipeline"
K 1
👍 1
It requires mlflow though
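To give you a feel for it, the core entry point is `pipeline_ml_factory`; a rough sketch of how the tutorial wires it up in `pipeline_registry.py` (the module, pipeline, tag, and dataset names below are illustrative, not the tutorial's exact code):

```python
from kedro_mlflow.pipeline import pipeline_ml_factory

from my_project.pipelines.ml import create_ml_pipeline  # hypothetical module


def register_pipelines():
    # a Kedro pipeline whose nodes are tagged "training" and/or "inference"
    ml_pipeline = create_ml_pipeline()

    # When the training pipeline runs, kedro-mlflow logs the inference
    # pipeline together with its fitted artifacts as a single MLflow model.
    train_and_log = pipeline_ml_factory(
        training=ml_pipeline.only_nodes_with_tags("training"),
        inference=ml_pipeline.only_nodes_with_tags("inference"),
        input_name="instances",
    )
    return {"__default__": train_and_log, "training": train_and_log}
```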
о
Thank you very much, colleagues! I appreciate your ideas! Please let me know if there are any other options to consider
Dear @Ravi Kumar Pilla, thank you for sharing the docs. I see how this helps to establish a good separation of preprocessing and modelling, as well as training two different models (via namespaces). However, I still don't have a good idea of how to reuse, for example, the preprocessing logic/nodes/parameters during model inference. Do you perhaps have some examples of that? Dear @Yolan Honoré-Rougé, thank you for the tutorial. This makes a lot of sense and addresses my original question. It looks like the core idea here is to use tags, right? I see this example hasn't been updated for a while. Is there any particular reason for that? Is this approach still considered the best practice?
y
Hi, some answers:
• A namespace is a way to tag all the nodes of a pipeline, so both suggestions are closely related.
• Yes, the key idea is to use tags because (in sklearn vocabulary) some steps are "fit" (e.g. create something from data to reuse at inference time; in MLflow vocabulary this is called an "artifact") and other steps are "transform" (e.g. apply a fitted object to data). You never want to do "fit_transform", because you need to separate the steps: you "fit" only at training time, and you "transform" both at training time and at inference time.
• Unfortunately the example has not been updated because the starter changed between 0.18 and 0.19, and I would have to update all the examples and screenshots and never took the time, but it works perfectly in 0.19. There were no breaking changes to pipelines and nodes between the two major versions.
• I don't know if it's "best practice", but given the number of related issues in this channel and in the kedro-mlflow repo issues and discussions (look for the "pipeline_ml_factory" keyword to see them), and the numerous projects I've seen in production using it, I am quite confident this is considered a good approach (likely the best available at the moment).
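In node terms, the fit / transform split looks roughly like this (function, dataset, and tag names are only placeholders, not taken from the tutorial):

```python
from kedro.pipeline import node, pipeline


def fit_encoder(input_data): ...             # "fit": training only, produces an artifact
def apply_encoder(input_data, encoder): ...  # "transform": training AND inference
def train_model(features, target): ...       # training only
def predict(features, model): ...            # inference only


ml_pipeline = pipeline(
    [
        node(fit_encoder, "input_data", "encoder", tags=["training"]),
        node(apply_encoder, ["input_data", "encoder"], "features", tags=["training", "inference"]),
        node(train_model, ["features", "target"], "model", tags=["training"]),
        node(predict, ["features", "model"], "predictions", tags=["inference"]),
    ]
)

training_pipeline = ml_pipeline.only_nodes_with_tags("training")
inference_pipeline = ml_pipeline.only_nodes_with_tags("inference")
```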
❤️ 1
K 1
о
Thank you very much for the follow-up! This sounds great. In the meantime, I found a very similar issue mentioned here: https://github.com/kedro-org/kedro/issues/464. It looks like this issue/thread started somewhere around there, and it is a very useful discussion that helps frame some understanding. From it, I see that model serving used to be outside of Kedro's scope, but that was 4 years ago, so it looks like it is now pretty well covered and addresses the main inference goals. Thank you again!
y
Yes, the original poster contributed directly to the kedro-mlflow code base back then
❤️ 1
If you want deep control over pipeline serving, check out kedro-boot and the fastapi mapping
❤️ 1
о
After a couple of days of investigating, I realised how similar tags and namespaces are. Is there a preferred way of using one or the other?
r
Hi @Олег Литвинов, both tags and namespaces help you group nodes in your Kedro project and structure a complex project.
Inclusive grouping: by this I mean nodes can be part of more than one group; for example, a Kedro node can have multiple tags.
• Tagging is inclusive
• Tagging cannot provide modularity
• Not suitable for deployment because of non-exclusivity
• Tagging is good when you want nodes belonging to more than one group
Exclusive grouping: by this I mean a node is part of only one group, for example by applying a namespace to a pipeline (a set of nodes).
• Namespaces provide hierarchy, and hierarchy is exclusive
• Namespaces provide modularity
• Most suitable for visualization and deployment
• A namespace is good when you want nodes belonging to exactly one group
Hope this helps. Thank you
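A tiny illustration of the difference (function, dataset, and namespace names are just placeholders):

```python
from kedro.pipeline import node, pipeline


def clean(df): ...
def train(df): ...


# Inclusive grouping: a node may carry several tags at once
tagged = pipeline(
    [
        node(clean, "raw", "clean_data", tags=["preprocessing", "inference"]),
        node(train, "clean_data", "model", tags=["training"]),
    ]
)
tagged.only_nodes_with_tags("inference")  # select by tag; groups may overlap

# Exclusive grouping: the whole pipeline lives under exactly one prefix,
# so datasets become "churn.clean_data", "churn.model", etc.
namespaced = pipeline(tagged, namespace="churn", inputs={"raw": "raw"})
```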
❤️ 1
👍 1