# questions
Ricardo Araújo:
Hello everyone! A bit new to Kedro, but making progress! I wonder what the best practice is for the following scenario: I have a pipeline (not a Kedro pipeline yet, just the generic concept) that processes a dataset. I want to be able to run the same pipeline on different pre-specified datasets (all would already be in the catalog). These datasets are very different and require dataset-specific wrangling and filtering. However, the later parts of the pipeline (modeling, evaluation) are essentially the same (the data processing transforms the datasets into a common format), except for a few parameters (say, the number of epochs for training a neural net is different for each dataset).
So I'd want to run Kedro and perhaps specify at the prompt which dataset I'm using, and Kedro would run the appropriate nodes for that dataset.
Of course, I could build a Kedro pipeline for each dataset, but that seems wasteful, since there's a lot that is shared across datasets. I know namespaces would solve some of the problem, but I don't quite see how they would address the need for some completely different nodes per dataset.
Thoughts would be much appreciated! 🙂
To make things simpler: I have three nodes -- `PreProcess`, `Train`, `Evaluate`. `PreProcess` is dataset-specific; everything else is the same for all datasets. It seems I would need to have `PreProcessDatasetA`, `PreProcessDatasetB`, etc., but then I don't know how to tell Kedro which one to use, nor how to tell it that I want to run the pipeline on a specific dataset.
Deepyaman Datta:
`Train` + `Evaluate` can be a modular pipeline that you use (let's call it `data_science_pipeline`); and then you would have your `PreProcess` happening in another pipeline (or pipelines). It's honestly a design decision whether you want one `data_engineering_pipeline` consisting of `PreProcessA`, `PreProcessB`, etc., or if you want to break it up into `data_engineering_pipeline_a`, `data_engineering_pipeline_b`, etc. In reality, your `data_science_pipeline` is the reused part.
In my past experience, we actually had something like this at a large scale: multiple teams across domains (retail, pharma, telco, banking, etc.) all used the same `data_science_pipeline` for 90% of cases (the pipeline was heavily parametrizable), but the `data_engineering_pipeline` was different per domain and per use case within the domain.
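In code, the split might look roughly like this; the node functions and dataset names below are made-up placeholders rather than anything from the thread:

```python
# A minimal sketch (all function and dataset names are illustrative assumptions)
# of the split described above: one reusable data_science_pipeline, plus a
# per-dataset data_engineering pipeline that emits a common "model_input".
from kedro.pipeline import node, pipeline


def preprocess_a(raw_a):
    """Dataset-A-specific wrangling that produces the common model_input format."""
    return raw_a


def train_model(model_input, train_params):
    """Shared training logic (placeholder)."""
    ...


def evaluate_model(model, model_input):
    """Shared evaluation logic (placeholder)."""
    ...


# Reused for every dataset (Train + Evaluate).
data_science_pipeline = pipeline(
    [
        node(train_model, ["model_input", "params:train"], "model", name="train"),
        node(evaluate_model, ["model", "model_input"], "metrics", name="evaluate"),
    ]
)

# One of these per dataset; each one ends in the same "model_input" output.
data_engineering_pipeline_a = pipeline(
    [node(preprocess_a, "raw_dataset_a", "model_input", name="preprocess_a")]
)
```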
Marco Bignotti:
Hi everyone! Apologies for reiterating on this, but I have a similar use case and I would like to understand what others have implemented. In particular, I have a question for @Deepyaman Datta: where would you define the `data_science_pipeline` you mentioned, to make it reusable across different projects? I'm asking because I'm developing a package that uses Kedro internally. Users will then install the package for each project. Inside the package, I would like to give complete freedom for the data engineering bit, but then I would like to define fixed (but configurable) modular pipelines that broadly correspond to different data science tasks (e.g. Classification, Regression, Anomaly Detection, ...). What goes inside these modular pipelines should respect the defined inputs, outputs and the Python types to be used internally. For example, a Classification pipeline should take a pandas DataFrame as input, use a sklearn estimator inside, and return a numpy array with the predictions. The user can choose the dataframe and which estimator to use (which can also be a sklearn Pipeline), but they must respect the structure. Then I would like to allow connecting these pipelines with other modular pipelines (just as you described). Any help would be highly appreciated! Thanks!
Ricardo Araújo:
@Deepyaman Datta thanks for the answer! How would you choose which `data_engineering_pipeline` to run? Since all the data engineering pipelines have the same output, I'm unsure how Kedro would handle that -- it would probably run all of them. I could probably create another meta-pipeline to join each of them, but that creates less reusable code (since if parts of the tail of the pipeline changed, I'd have to update all pipelines manually).
My ideal situation, I think, would be to have `PreProcessA`, `PreProcessB` as nodes inside a single pipeline, and a way to tell Kedro which one to use.
The difference between my question and @Marco Bignotti's is that in his case there would still be a single first-stage pipeline in the end, since the projects are separate, while in my case there would be multiple first-stage pipelines in a single project that could be chosen from somehow.
I think I found a solution, would love to hear opinions:
• Inside the same Kedro project, have a (modular) `CommonPipeline` that is shared across datasets;
• Create a modular pipeline for each dataset, containing only the dataset-specific nodes, and give the entry node the same name in all of them (say, "start"); in my case, I created a pipeline `PreProcessA` and another `PreProcessB`;
• Create a further modular pipeline for each dataset, which joins the dataset-specific pipeline with the `CommonPipeline` (in one it will be `PreProcessA + CommonPipeline`, in another `PreProcessB + CommonPipeline`), and set a different namespace for each (say, `DatasetA`, `DatasetB`).
• Now I can run `kedro run --from-nodes=DatasetA.start`.
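For reference, a rough sketch of that wiring (function, dataset, and namespace names are made up for illustration):

```python
# A rough sketch of the namespacing approach above; the function, dataset, and
# namespace names are illustrative assumptions.
from kedro.pipeline import node, pipeline


def preprocess_a(raw):
    """Dataset-A-specific wrangling, exposed as the node named "start"."""
    return raw


def train_and_evaluate(model_input):
    """Stand-in for the shared Train + Evaluate part (the CommonPipeline)."""
    ...


common_pipeline = pipeline(
    [node(train_and_evaluate, "model_input", "metrics", name="train_and_evaluate")]
)

preprocess_a_pipeline = pipeline(
    [node(preprocess_a, "raw_a", "model_input", name="start")]
)

# Wrapping the combined pipeline in a namespace prefixes node and dataset names,
# so the node "start" becomes "DatasetA.start" and
# `kedro run --from-nodes=DatasetA.start` selects only this dataset's branch.
dataset_a_pipeline = pipeline(
    preprocess_a_pipeline + common_pipeline,
    namespace="DatasetA",
    inputs={"raw_a"},  # keep the raw catalog entry outside the namespace
)
```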
Modular pipelines + namespaces FTW
Deepyaman Datta:
@Marco Bignotti This can be done using the micro-packaging workflow (https://kedro.readthedocs.io/en/stable/nodes_and_pipelines/micro_packaging.html). We published the packaged `data_science_pipeline` to an internal PyPI (JFrog Artifactory), and then pulled the pipeline into these different projects.
@Ricardo Araújo yep, that sounds good. I believe in our case we just added these combinations (e.g. `PreProcessA + CommonPipeline`) to the pipeline registry, so we could just do `kedro run --pipeline use_case_a`.
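Presumably something along these lines in `pipeline_registry.py` (module and pipeline names are hypothetical):

```python
# A sketch of what that pipeline_registry.py might look like; the module and
# pipeline names are assumptions, not taken from the thread.
from kedro.pipeline import Pipeline

# Hypothetical modular pipelines living in the project's `pipelines` package.
from my_project.pipelines import common, preprocess_a, preprocess_b


def register_pipelines() -> dict[str, Pipeline]:
    common_pipeline = common.create_pipeline()
    use_case_a = preprocess_a.create_pipeline() + common_pipeline
    use_case_b = preprocess_b.create_pipeline() + common_pipeline
    return {
        "use_case_a": use_case_a,  # run with `kedro run --pipeline use_case_a`
        "use_case_b": use_case_b,
        "__default__": use_case_a,
    }
```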
Ricardo Araújo:
Thanks for the help, @Deepyaman Datta!
Marco Bignotti:
Thanks a lot for the tips @Deepyaman Datta and @Ricardo Araújo!!! Admittedly, I don't particularly like the micro-packaging workflow. I would rather define a class that takes in the necessary objects and validates them, one that can be imported in other projects. But then I would need to find a way to make it compatible with other modular pipelines and Kedro in general. I'll try a few experiments and see what I can come up with. Thank you!
Deepyaman Datta:
@Marco Bignotti does not using the `pull` part of the micro-packaging workflow, and just importing the packaged pipeline, work? Curious to learn more about what you don't like; it's not used much, but your use case is what it's designed for, so it would be great to get this feedback.
Marco Bignotti:
Ok, let me try to give a better explanation of my use case, to make it easier to understand what I'm trying to do. Let's take Anomaly Detection, for instance. We have defined the Anomaly Detection task as a pipeline composed of the following Python objects:
• [Optional] *Preprocessing*. This step should define a scikit-learn transformer, or a list of transformers, possibly with some checks on the dimensionality of the output dataset.
• *Estimator*. This is a scikit-learn-compatible anomaly detector with the `fit` and `predict` methods. The result of the `predict` method should be a one-dimensional numpy array containing an anomaly score for each input point.
• [Optional] *Postprocessing*. A scikit-learn transformer that applies some transformations to the anomaly score (e.g. z-scoring, filtering). The dimensionality should remain unchanged.
• *Threshold*. Again, a scikit-learn estimator that takes the anomaly score as input and returns a numpy array with labels (0 for normal and 1 for anomaly).
The previous steps are then composed together to create another estimator, namely a scikit-learn pipeline. The resulting estimator is the model that needs to be trained, registered somewhere (e.g. the MLflow model registry), and used in production. The structure of the pipeline and of any auxiliary check (e.g. dimensionality of input/output data) must be fixed and thoroughly tested, since this is what will go to production. However, the user/data scientist should have the possibility of trying anything they want in terms of which transformers and estimators to use inside the pipeline, as long as the objects passed to the pipeline respect the required interface. Ideally, it would be nice to define these objects (e.g. preprocessing, estimator, postprocessing, ...) in a configuration file and then try different experiments. A tool like Hydra would make this experimentation rather trivial, because the user only needs to define the yaml relative to the experiment they want to perform (https://hydra.cc/docs/patterns/configuring_experiments/), but this is another problem.
I am not entirely sure how this would fit into the micro-packaging workflow, although I haven't spent too much time on it (yet). However, the main thing that stops me from using it is the need to build and manage tar files, which seems to me a bit more complicated than just defining some code in another package that can be installed.
Moreover, if I understand correctly, when you install a micro-package you are simply injecting the source code of the original modular pipeline. This implies that the user can modify the source code, which is something I do not want to allow.
Deepyaman Datta:
> Moreover, if I understand correctly, when you install a micro-package you are simply injecting the source code of the original modular pipeline. This implies that the user can modify the source code, which is something I do not want to allow.
If you pull a micro-package, yes, you are injecting source code. I understand why you don't want to do that. Without pulling, micro-packaging is basically just a way to distribute a pipeline + nodes as a Python package, which can be imported as a dependency. But one could also argue that without pulling the micro-package you're not doing much beyond building a standard Python package, and you don't need this Kedro-specific functionality. So I think I get your point; it makes sense when you don't want the user to be able to modify the common code.
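For illustration, the "import it as a dependency" route might look roughly like this in the consuming project's `pipeline_registry.py` (package and module names are assumptions):

```python
# A small sketch of importing a pip-installed, packaged pipeline instead of
# pulling its source; the package and module names here are hypothetical.
from kedro.pipeline import Pipeline

from data_science_pipeline import create_pipeline as create_ds_pipeline  # installed from the internal index
from my_project.pipelines import data_engineering  # project-specific wrangling


def register_pipelines() -> dict[str, Pipeline]:
    return {"__default__": data_engineering.create_pipeline() + create_ds_pipeline()}
```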
Marco Bignotti:
@Deepyaman Datta exactly! So, I think I chose the wrong verb: it's not that I don't like micro-packaging, it's just not useful for my use case. I guess the simplest solution would be to create a class, something like `AnomalyDetectionTask`, that accepts the objects I mentioned before (preprocessing, estimators, postprocessing, ...), and then instantiate the class inside a node. I would lose the full pipeline visualization when running `kedro viz`, but since the structure is fixed, I can document the inner workings of `AnomalyDetectionTask` elsewhere. Should it be of any interest, I will let you know if I can come up with something that works. I know that MLflow is creating similar fixed pipelines for specific data science tasks, but they are still experimental and, at the moment, it's not possible to create user-defined pipelines. Maybe this might be an interesting topic for Kedro as well. In any case, thanks a lot for the support! I'm really impressed by the amount of work and thought that you as a team put into the development and maintenance of Kedro. I really hope it will have even more success in the future.
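A very rough sketch of what such a class could look like; the names, checks, and node function are purely illustrative:

```python
# A sketch of an AnomalyDetectionTask-style class: it validates the user-supplied
# components against the required interface and builds the fixed pipeline
# structure. The checks and names are illustrative assumptions.
from sklearn.pipeline import Pipeline


class AnomalyDetectionTask:
    def __init__(self, estimator, preprocessing=None, postprocessing=None, threshold=None):
        if not (hasattr(estimator, "fit") and hasattr(estimator, "predict")):
            raise TypeError("estimator must implement fit() and predict()")
        if (postprocessing is not None or threshold is not None) and not hasattr(estimator, "transform"):
            # If later steps consume the anomaly score, the estimator needs to
            # expose it via transform() (e.g. wrapped as in the earlier sketch).
            raise TypeError("estimator must expose its score via transform() when followed by other steps")
        self.estimator = estimator
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.threshold = threshold

    def build(self) -> Pipeline:
        """Compose the fixed structure; optional steps are simply skipped."""
        steps = [("estimator", self.estimator)]
        if self.preprocessing is not None:
            steps.insert(0, ("preprocessing", self.preprocessing))
        if self.postprocessing is not None:
            steps.append(("postprocessing", self.postprocessing))
        if self.threshold is not None:
            steps.append(("threshold", self.threshold))
        return Pipeline(steps)


# Inside a Kedro node, the class is just instantiated and fitted:
def train_anomaly_model(data, estimator):
    model = AnomalyDetectionTask(estimator=estimator).build()
    model.fit(data)
    return model
```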
y
Hi @Marco Bignotti, I think that packaging your Kedro pipeline in a custom MLflow model perfectly suits your use case: packaging an entire Kedro pipeline with "preprocessing + model + postprocessing" in a single MLflow model is exactly what the `kedro-mlflow` plugin was originally built for. You can see this answer for more detail: https://kedro-org.slack.com/archives/C03RKP2LW64/p1669237790092029?thread_ts=1666864575.162609&cid=C03RKP2LW64, and obviously ping me if needed.