# questions
Ricardo Araújo:
Hello everyone! A bit new to Kedro, but making progress! I wonder what the best practice is for the following scenario: I have a pipeline (not a Kedro pipeline yet, just the generic concept) that processes a dataset. I want to be able to run the same pipeline on different pre-specified datasets (all would already be in the catalog). These datasets are very different and require dataset-specific wrangling and filtering. However, the later parts of the pipeline (modeling, evaluation) are essentially the same (the data processing transforms the datasets into a common format), except for a few parameters (say, the number of epochs for training a neural net is different for each dataset).
So I'd want to run Kedro and perhaps specify at the prompt which dataset I'm using, and Kedro would run the appropriate nodes for that dataset.
Of course, I could build a Kedro pipeline for each dataset, but that seems wasteful, since there's a lot that is shared across datasets. I know namespaces would solve some of the problem, but I don't quite see how they would address the need for some completely different nodes per dataset.
Thoughts would be much appreciated! 🙂
To make things simpler: I have three nodes -- `PreProcess`, `Train`, `Evaluate`. `PreProcess` is dataset-specific; everything else is the same for all datasets. It seems I would need to have `PreProcessDatasetA`, `PreProcessDatasetB`, etc., but then I don't know how to tell Kedro which one to use, nor how to tell it that I want to run the pipeline on a specific dataset.
Deepyaman Datta:
`Train` + `Evaluate` can be a modular pipeline that you use (let's call it `data_science_pipeline`); and then you would have your `PreProcess` happening in another pipeline (or pipelines). It's honestly a design decision whether you want one `data_engineering_pipeline` consisting of `PreProcessA`, `PreProcessB`, etc., or if you want to break it up into `data_engineering_pipeline_a`, `data_engineering_pipeline_b`, etc. In reality, your `data_science_pipeline` is the reused part.
In my past experience, we actually had something like this at a large scale: multiple teams across domains (retail, pharma, telco, banking, etc.) all used the same `data_science_pipeline` for 90% of cases (the pipeline was heavily parametrizable), but the `data_engineering_pipeline` was different per domain and per use case within the domain.
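In code, the split might look roughly like this; the node functions and dataset names below are made-up placeholders rather than anything from the thread:

```python
# A minimal sketch (all function and dataset names are illustrative assumptions)
# of the split described above: one reusable data_science_pipeline, plus a
# per-dataset data_engineering pipeline that emits a common "model_input".
from kedro.pipeline import node, pipeline


def preprocess_a(raw_a):
    """Dataset-A-specific wrangling that produces the common model_input format."""
    return raw_a


def train_model(model_input, train_params):
    """Shared training logic (placeholder)."""
    ...


def evaluate_model(model, model_input):
    """Shared evaluation logic (placeholder)."""
    ...


# Reused for every dataset (Train + Evaluate).
data_science_pipeline = pipeline(
    [
        node(train_model, ["model_input", "params:train"], "model", name="train"),
        node(evaluate_model, ["model", "model_input"], "metrics", name="evaluate"),
    ]
)

# One of these per dataset; each one ends in the same "model_input" output.
data_engineering_pipeline_a = pipeline(
    [node(preprocess_a, "raw_dataset_a", "model_input", name="preprocess_a")]
)
```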
Marco Bignotti:
Hi everyone! Apologies for reiterating on this, but I have a similar use case and I would like to understand what others have implemented. In particular, I have a question for @Deepyaman Datta: where would you define the `data_science_pipeline` you mentioned, to make it reusable across different projects? I'm asking because I'm developing a package that uses Kedro internally. Users will then install the package for each project. Inside the package, I would like to give complete freedom for the data engineering bit, but then I would like to define fixed (but configurable) modular pipelines that broadly correspond to different data science tasks (e.g. Classification, Regression, Anomaly Detection, ...). What goes inside these modular pipelines should respect the defined inputs, outputs and the Python types to be used internally. For example, a Classification pipeline should take a pandas DataFrame as input, use a sklearn estimator inside, and return a numpy array with the predictions. The user can choose the dataframe and which estimator to use (which can also be a sklearn Pipeline), but they must respect the structure. Then I would like to allow connecting these pipelines with other modular pipelines (just as you described). Any help would be highly appreciated! Thanks!
Ricardo Araújo:
@Deepyaman Datta thanks for the answer! How would you choose which `data_engineering_pipeline` to run? Since all the data engineering pipelines have the same output, I'm unsure how Kedro would handle that -- it would probably run all of them. I could probably create another meta-pipeline to join each of them, but that creates less reusable code (since if parts of the tail of the pipeline changed, I'd have to update all pipelines manually).
My ideal situation, I think, would be to have `PreProcessA`, `PreProcessB` as nodes inside a single pipeline, and a way to tell Kedro which one to use.
The difference between my question and @Marco Bignotti's is that in his case there would still be a single first-stage pipeline in the end, since the projects are separate, while in my case there would be multiple first-stage pipelines in a single project that could be chosen from somehow.
I think I found a solution, would love to hear opinions:
• Inside the same Kedro project, have a (modular) `CommonPipeline` that is shared across datasets;
• Create a modular pipeline for each dataset, containing only the dataset-specific nodes, and give the entry node the same name in all of them (say, "start"); in my case, I created a pipeline `PreProcessA` and another `PreProcessB`;
• Create a further modular pipeline for each dataset, which joins the dataset-specific pipeline with the `CommonPipeline` (in one it will be `PreProcessA + CommonPipeline`, in another `PreProcessB + CommonPipeline`), and set a different namespace for each (say, `DatasetA`, `DatasetB`).
• Now I can run `kedro run --from-nodes=DatasetA.start`.
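For reference, a rough sketch of that wiring (function, dataset, and namespace names are made up for illustration):

```python
# A rough sketch of the namespacing approach above; the function, dataset, and
# namespace names are illustrative assumptions.
from kedro.pipeline import node, pipeline


def preprocess_a(raw):
    """Dataset-A-specific wrangling, exposed as the node named "start"."""
    return raw


def train_and_evaluate(model_input):
    """Stand-in for the shared Train + Evaluate part (the CommonPipeline)."""
    ...


common_pipeline = pipeline(
    [node(train_and_evaluate, "model_input", "metrics", name="train_and_evaluate")]
)

preprocess_a_pipeline = pipeline(
    [node(preprocess_a, "raw_a", "model_input", name="start")]
)

# Wrapping the combined pipeline in a namespace prefixes node and dataset names,
# so the node "start" becomes "DatasetA.start" and
# `kedro run --from-nodes=DatasetA.start` selects only this dataset's branch.
dataset_a_pipeline = pipeline(
    preprocess_a_pipeline + common_pipeline,
    namespace="DatasetA",
    inputs={"raw_a"},  # keep the raw catalog entry outside the namespace
)
```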
Modular pipelines + namespaces FTW
Deepyaman Datta:
@Marco Bignotti This can be done using the micro-packaging workflow (https://kedro.readthedocs.io/en/stable/nodes_and_pipelines/micro_packaging.html). We published the packaged `data_science_pipeline` to an internal PyPI (JFrog Artifactory), and then pulled the pipeline into these different projects.
@Ricardo Araújo yep, that sounds good. I believe in our case we just added these combinations (e.g. `PreProcessA + CommonPipeline`) to the pipeline registry, so we could just do `kedro run --pipeline use_case_a`.
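Presumably something along these lines in `pipeline_registry.py` (module and pipeline names are hypothetical):

```python
# A sketch of what that pipeline_registry.py might look like; the module and
# pipeline names are assumptions, not taken from the thread.
from kedro.pipeline import Pipeline

# Hypothetical modular pipelines living in the project's `pipelines` package.
from my_project.pipelines import common, preprocess_a, preprocess_b


def register_pipelines() -> dict[str, Pipeline]:
    common_pipeline = common.create_pipeline()
    use_case_a = preprocess_a.create_pipeline() + common_pipeline
    use_case_b = preprocess_b.create_pipeline() + common_pipeline
    return {
        "use_case_a": use_case_a,  # run with `kedro run --pipeline use_case_a`
        "use_case_b": use_case_b,
        "__default__": use_case_a,
    }
```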
Ricardo Araújo:
Thanks for the help, @Deepyaman Datta!
Marco Bignotti:
Thanks a lot for the tips @Deepyaman Datta and @Ricardo Araújo!!! Admittedly, I don't particularly like the micro-packaging workflow. I would rather define a class that takes in the necessary objects and validates them, one that can be imported in other projects. But then I would need to find a way to make it compatible with other modular pipelines and Kedro in general. I'll try a few experiments and see what I can come up with. Thank you!
Deepyaman Datta:
@Marco Bignotti does not using the `pull` part of the micro-packaging workflow, and just importing the packaged pipeline, work? Curious to learn more about what you don't like; it's not used much, but your use case is what it's designed for, so it would be great to get this feedback.
Marco Bignotti:
Ok, let me try to give a better explanation of my use case, to make it easier to understand what I'm trying to do. Let's take Anomaly Detection, for instance. We have defined the Anomaly Detection task as a pipeline composed of the following Python objects:
• [Optional] *Preprocessing*. This step should define a scikit-learn transformer, or a list of transformers, possibly with some checks on the dimensionality of the output dataset.
• *Estimator*. This is a scikit-learn-compatible anomaly detector with the `fit` and `predict` methods. The result of the `predict` method should be a one-dimensional numpy array containing an anomaly score for each input point.
• [Optional] *Postprocessing*. A scikit-learn transformer that applies some transformations to the anomaly score (e.g. z-scoring, filtering). The dimensionality should remain unchanged.
• *Threshold*. Again, a scikit-learn estimator that takes the anomaly score as input and returns a numpy array with labels (0 for normal and 1 for anomaly).
The previous steps are then composed together to create another estimator, namely a scikit-learn pipeline. The resulting estimator is the model that needs to be trained, registered somewhere (e.g. the MLflow model registry), and used in production. The structure of the pipeline and of any auxiliary check (e.g. dimensionality of input/output data) must be fixed and thoroughly tested, since this is what will go to production. However, the user/data scientist should have the possibility of trying anything they want in terms of which transformers and estimators to use inside the pipeline, as long as the objects passed to the pipeline respect the required interface. Ideally, it would be nice to define these objects (e.g. preprocessing, estimator, postprocessing, ...) in a configuration file and then try different experiments. A tool like Hydra would make this experimentation rather trivial, because the user only needs to define the yaml relative to the experiment they want to perform (https://hydra.cc/docs/patterns/configuring_experiments/), but this is another problem.
I am not entirely sure how this would fit into the micro-packaging workflow, although I haven't spent too much time on it (yet). However, the main thing that stops me from using it is the need to build and manage tar files, which seems to me a bit more complicated than just defining some code in another package that can be installed.
Moreover, if I understand correctly, when you install a micro-package you are simply injecting the source code of the original modular pipeline. This implies that the user can modify the source code, which is something I do not want to allow.
Deepyaman Datta:
> Moreover, if I understand correctly, when you install a micro-package you are simply injecting the source code of the original modular pipeline. This implies that the user can modify the source code, which is something I do not want to allow.
If you pull a micro-package, yes, you are injecting source code. I understand why you don't want to do that. Without pulling, micro-packaging is basically just a way to distribute a pipeline + nodes as a Python package, which can be imported as a dependency. But one could also argue that without pulling the micro-package you're not doing much beyond building a standard Python package, and you don't need this Kedro-specific functionality. So I think I get your point; it makes sense when you don't want the user to be able to modify the common code.
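For illustration, the "import it as a dependency" route might look roughly like this in the consuming project's `pipeline_registry.py` (package and module names are assumptions):

```python
# A small sketch of importing a pip-installed, packaged pipeline instead of
# pulling its source; the package and module names here are hypothetical.
from kedro.pipeline import Pipeline

from data_science_pipeline import create_pipeline as create_ds_pipeline  # installed from the internal index
from my_project.pipelines import data_engineering  # project-specific wrangling


def register_pipelines() -> dict[str, Pipeline]:
    return {"__default__": data_engineering.create_pipeline() + create_ds_pipeline()}
```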
Marco Bignotti:
@Deepyaman Datta exactly! So, I think I chose the wrong verb: it's not that I don't like micro-packaging, it's just not useful for my use case. I guess the simplest solution would be to create a class, something like `AnomalyDetectionTask`, that accepts the objects I mentioned before (preprocessing, estimators, postprocessing, ...), and then instantiate the class inside a node. I would lose the full pipeline visualization when running `kedro viz`, but since the structure is fixed, I can document the inner workings of `AnomalyDetectionTask` elsewhere. Should it be of any interest, I will let you know if I can come up with something that works. I know that MLflow is creating similar fixed pipelines for specific data science tasks, but they are still experimental and, at the moment, it's not possible to create user-defined pipelines. Maybe this might be an interesting topic for Kedro as well. In any case, thanks a lot for the support! I'm really impressed by the amount of work and thought that you as a team put into the development and maintenance of Kedro. I really hope it will have even more success in the future.
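A very rough sketch of what such a class could look like; the names, checks, and node function are purely illustrative:

```python
# A sketch of an AnomalyDetectionTask-style class: it validates the user-supplied
# components against the required interface and builds the fixed pipeline
# structure. The checks and names are illustrative assumptions.
from sklearn.pipeline import Pipeline


class AnomalyDetectionTask:
    def __init__(self, estimator, preprocessing=None, postprocessing=None, threshold=None):
        if not (hasattr(estimator, "fit") and hasattr(estimator, "predict")):
            raise TypeError("estimator must implement fit() and predict()")
        if (postprocessing is not None or threshold is not None) and not hasattr(estimator, "transform"):
            # If later steps consume the anomaly score, the estimator needs to
            # expose it via transform() (e.g. wrapped as in the earlier sketch).
            raise TypeError("estimator must expose its score via transform() when followed by other steps")
        self.estimator = estimator
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.threshold = threshold

    def build(self) -> Pipeline:
        """Compose the fixed structure; optional steps are simply skipped."""
        steps = [("estimator", self.estimator)]
        if self.preprocessing is not None:
            steps.insert(0, ("preprocessing", self.preprocessing))
        if self.postprocessing is not None:
            steps.append(("postprocessing", self.postprocessing))
        if self.threshold is not None:
            steps.append(("threshold", self.threshold))
        return Pipeline(steps)


# Inside a Kedro node, the class is just instantiated and fitted:
def train_anomaly_model(data, estimator):
    model = AnomalyDetectionTask(estimator=estimator).build()
    model.fit(data)
    return model
```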
y
Hi @Marco Bignotti, I think that packaging your Kedro pipeline in a custom MLflow model perfectly suits your use case: packaging an entire Kedro pipeline with "preprocessing + model + postprocessing" in a single MLflow model is exactly what the `kedro-mlflow` plugin was originally built for. You can see this answer for more detail: https://kedro-org.slack.com/archives/C03RKP2LW64/p1669237790092029?thread_ts=1666864575.162609&cid=C03RKP2LW64, and obviously ping me if needed.