Ricardo Araújo
01/10/2023, 11:52 PM
Deepyaman Datta
01/11/2023, 5:57 AM
`Train + Evaluate` can be a modular pipeline that you use (let's call it `data_science_pipeline`); and then you would have your `PreProcess` happening in another pipeline(s). It's honestly a design decision whether you want one `data_engineering_pipeline` consisting of `PreProcessA`, `PreProcessB`, etc., or if you want to break it up into `data_engineering_pipeline_a`, `data_engineering_pipeline_b`, etc. In reality, your `data_science_pipeline` is the reused part. In our case, we reused the same `data_science_pipeline` for 90% of cases (the pipeline was heavily parametrizable), but the `data_engineering_pipeline` was different per domain and use case within the domain.
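As an illustration of that split, a minimal sketch with placeholder node functions and dataset names (not the actual project code being discussed):

```python
# A minimal sketch of the split described above; node functions and dataset
# names are placeholders, not the actual project code being discussed.
from kedro.pipeline import node, pipeline


def preprocess_a(raw_a):                      # dataset-specific placeholder
    ...

def train_model(model_input, model_options):  # shared, parametrized placeholder
    ...

def evaluate_model(model, model_input):       # shared placeholder
    ...


# The reused part: Train + Evaluate, configured via parameters.
data_science_pipeline = pipeline(
    [
        node(train_model, inputs=["model_input", "params:model_options"], outputs="model"),
        node(evaluate_model, inputs=["model", "model_input"], outputs="metrics"),
    ]
)

# One of several dataset-specific pipelines, each producing the common
# "model_input" that data_science_pipeline consumes.
data_engineering_pipeline_a = pipeline(
    [node(preprocess_a, inputs="raw_a", outputs="model_input")]
)
```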
Marco Bignotti
01/11/2023, 10:04 AM
How did you package the `data_science_pipeline` you mentioned to make it reusable across different projects?
I'm asking because I'm developing a package that is using Kedro internally. Users will then install the package for each project. Inside the package, I would like to give complete freedom for the data engineering bit, but then I would like to define fixed (but configurable) modular pipelines that broadly correspond to different data science tasks (e.g. Classification, Regression, Anomaly Detection,...).
What goes inside these modular pipelines should respect the defined inputs, outputs and the Python types to be used internally. For example, a Classification pipeline should take a pandas DataFrame as input, use an sklearn estimator inside, and return a numpy array with the predictions. The user can choose the dataframe and what estimator to use (which can also be an sklearn Pipeline), but they must respect the structure.
Then I would like to allow connecting these pipelines with other modular pipelines (just as you described).
Any help would be highly appreciated!
Thanks!
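One possible way to express the fixed-but-configurable contract Marco describes, purely as an illustration (the node function, type checks, and dataset names below are assumptions, not an existing Kedro feature):

```python
# Illustrative sketch: the package owns the pipeline structure and type checks,
# while the estimator is supplied by the user (e.g. via the catalog or config).
import numpy as np
import pandas as pd
from kedro.pipeline import node, pipeline


def predict(features: pd.DataFrame, estimator) -> np.ndarray:
    """Fixed node enforcing the DataFrame-in / ndarray-out contract."""
    if not isinstance(features, pd.DataFrame):
        raise TypeError("Classification pipelines expect a pandas DataFrame as input")
    predictions = estimator.predict(features)  # any fitted sklearn estimator/Pipeline
    if not isinstance(predictions, np.ndarray):
        raise TypeError("The estimator must return a numpy array of predictions")
    return predictions


def create_classification_pipeline():
    # Users swap in their own estimator without being able to change this structure.
    return pipeline(
        [node(predict, inputs=["features", "estimator"], outputs="predictions")]
    )
```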
Ricardo Araújo
01/11/2023, 10:54 AM
My problem was how to tell Kedro which `data_engineering_pipeline` to run. Since all data_engineering pipelines have the same output, I'm unsure how Kedro would handle that -- and it would probably run all of them. I could probably create another meta-pipeline to join each, but that creates less reusable code (since if parts of the tail of the pipeline changed, I'd have to update all pipelines manually). What I ended up doing:
• Create a `CommonPipeline` that is shared across datasets;
• Create modular pipelines for each dataset, containing only the dataset-specific nodes; give the same name to all of them (say, "start"); in my case, I created a pipeline `PreProcessA` and another `PreProcessB`;
• Create different modular pipelines for each dataset, which will join the dataset-specific pipeline with the `CommonPipeline` (in one it will be `PreProcessA+CommonPipeline`, in another it will be `PreProcessB+CommonPipeline`); set different namespaces for each (say, DatasetA, DatasetB).
• Now I can run `kedro run --from-nodes=DataSetA.start` (see the sketch below).
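A rough sketch of that layout, with placeholder node functions (only the structure, namespaces, and node name follow the bullets above):

```python
# Rough sketch of the layout described in the bullets above; the node functions
# are placeholders, only the structure and names matter here.
from kedro.pipeline import node, pipeline


def start_a(raw_a):      # dataset A-specific preprocessing (placeholder)
    ...

def start_b(raw_b):      # dataset B-specific preprocessing (placeholder)
    ...

def train(model_input):  # shared tail (placeholder)
    ...


common_pipeline = pipeline([node(train, inputs="model_input", outputs="model")])

pre_process_a = pipeline([node(start_a, inputs="raw_a", outputs="model_input", name="start")])
pre_process_b = pipeline([node(start_b, inputs="raw_b", outputs="model_input", name="start")])

# Joining each dataset-specific pipeline with the common one under its own
# namespace turns the "start" nodes into DatasetA.start and DatasetB.start,
# so `kedro run --from-nodes=DatasetA.start` runs only that branch.
dataset_a = pipeline(pre_process_a + common_pipeline, namespace="DatasetA")
dataset_b = pipeline(pre_process_b + common_pipeline, namespace="DatasetB")
```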
Deepyaman Datta
01/11/2023, 2:13 PM
We published the `data_science_pipeline` to an internal PyPI (JFrog Artifactory), and then pull the pipeline in these different projects. We also added the combined pipelines (e.g. `PreProcessA+CommonPipeline`) to the pipeline registry, so we could just do `kedro run --pipeline use_case_a`.
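For context, a hypothetical sketch of what that registry wiring could look like, assuming the shared pipeline is installed from the internal PyPI as a normal dependency (all package and module names here are made up):

```python
# src/<project>/pipeline_registry.py -- hypothetical sketch; the package and
# module names are made up to illustrate importing a published pipeline.
from data_science_pipeline import create_pipeline as create_ds_pipeline  # installed from internal PyPI

from my_project.pipelines import pre_process_a, pre_process_b  # project-local modules


def register_pipelines():
    data_science = create_ds_pipeline()
    return {
        "use_case_a": pre_process_a.create_pipeline() + data_science,
        "use_case_b": pre_process_b.create_pipeline() + data_science,
        "__default__": pre_process_a.create_pipeline() + data_science,
    }
```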
Ricardo Araújo
01/11/2023, 2:23 PM
Marco Bignotti
01/11/2023, 2:51 PM
Deepyaman Datta
01/11/2023, 3:06 PM
Would skipping the `pull` part of the micropackaging workflow, and just importing the packaged pipeline, work? Curious to learn more about what you don't like; it's not used much, but your use case is what it's designed for, so it would be great to get this feedback.
Marco Bignotti
01/11/2023, 3:25 PM
• Estimator. A scikit-learn estimator with `fit` and `predict` methods. The result of the `predict` method should be a one-dimensional numpy array containing an anomaly score for each point in input.
• [Optional] Postprocessing. A scikit-learn transformer that does some transformations of the anomaly score (e.g. z-scoring, filtering). The dimensionality should remain unchanged.
• Threshold. Again, a scikit-learn estimator that takes the anomaly score as input and returns a numpy array with labels (0 for normal and 1 for anomaly).
The previous steps are then composed together to create another estimator, namely a scikit-learn pipeline. The resulting estimator is the model that needs to be trained, registered somewhere (e.g. the MLflow model registry), and that will be used in production.
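A hedged sketch of that composition: in a scikit-learn Pipeline every intermediate step must expose fit/transform, so the score-producing detector is wrapped in a small adapter; the concrete detector, threshold value, and class names are only illustrative.

```python
# Illustrative composition of the steps above into a single sklearn estimator.
# IsolationForest, the adapter, and the threshold value are assumptions.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ScoreAdapter(BaseEstimator, TransformerMixin):
    """Expose an anomaly detector's scores as a transform step."""

    def __init__(self, detector):
        self.detector = detector

    def fit(self, X, y=None):
        self.detector.fit(X)
        return self

    def transform(self, X):
        scores = -self.detector.score_samples(X)  # higher = more anomalous
        return scores.reshape(-1, 1)              # keep 2-D for the next step


class FixedThreshold(BaseEstimator):
    """Final step: turn anomaly scores into 0/1 labels."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return (np.asarray(X).ravel() > self.threshold).astype(int)


model = Pipeline(
    [
        ("preprocessing", StandardScaler()),             # optional, user-supplied
        ("estimator", ScoreAdapter(IsolationForest())),  # user-supplied detector
        ("threshold", FixedThreshold(threshold=0.6)),    # fixed structure
    ]
)
```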
The structure of the pipeline and of any auxiliary check (e.g. dimensionality of input/output data) must be fixed and thoroughly tested, since this is what will go to production. However, the user/data scientist should have the possibility of trying anything they want in terms of what transformers and estimators to use inside the pipeline, as long as the objects passed to the pipeline respect the required interface.
Ideally, it would be nice to define these objects (e.g. preprocessing, estimator, postprocessing, ...) in a configuration file and then try different experiments. A tool like Hydra would make this experimentation rather trivial, because the user only needs to define the YAML for the experiment they want to perform (https://hydra.cc/docs/patterns/configuring_experiments/), but this is another problem.
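Not Hydra itself, but as a rough illustration of the idea, pipeline steps can be instantiated from a plain YAML spec by dotted class path (the config keys and helper below are assumptions, a hand-rolled stand-in for what a tool like Hydra provides out of the box):

```python
# Illustrative only: build a step from a YAML spec by dotted class path.
import importlib
import yaml

CONFIG = """
estimator:
  class: sklearn.ensemble.IsolationForest
  kwargs:
    n_estimators: 200
"""


def instantiate(spec):
    module_path, _, class_name = spec["class"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**spec.get("kwargs", {}))


config = yaml.safe_load(CONFIG)
estimator = instantiate(config["estimator"])  # IsolationForest(n_estimators=200)
```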
Deepyaman Datta
01/11/2023, 6:05 PM
> Moreover, if I understand correctly, when you install a micro-package you are simply injecting the source code of the original modular pipeline. This implies that the user can modify the source code, which is something I do not want to allow.
If you pull a micro-package, yes, you are injecting source code. I understand why you don't want to do that. Without pulling, micro-packaging is basically just a way to distribute a pipeline + nodes as a Python package, which can be imported as a dependency. But also, one could argue that without pulling the micro-package, you're not doing much past building a standard Python package, and you don't need this Kedro-specific functionality, so I think I get your point/think it makes sense when you don't want the user to be able to modify the common code.
Marco Bignotti
01/11/2023, 6:57 PM
In the end, the simplest solution is probably to define a class, say `AnomalyDetectionTask`, that accepts the objects I mentioned before (preprocessing, estimators, postprocessing, ...), and then instantiate the class inside a node. I would lose the full pipeline visualization when running `kedro viz`, but since the structure is fixed, I can document the inner workings of `AnomalyDetectionTask` elsewhere.
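Purely as a sketch of that single-class idea (the class, its arguments, and the score/threshold conventions are assumptions based on the structure described earlier):

```python
# Hypothetical sketch: the fixed structure lives in AnomalyDetectionTask and a
# single Kedro node runs it, so kedro viz shows one node instead of the steps.
import numpy as np
from kedro.pipeline import node, pipeline


class AnomalyDetectionTask:
    def __init__(self, estimator, preprocessing=None, postprocessing=None, threshold=0.5):
        self.estimator = estimator
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.threshold = threshold

    def fit_predict(self, data) -> np.ndarray:
        X = self.preprocessing.fit_transform(data) if self.preprocessing is not None else data
        scores = self.estimator.fit(X).predict(X)  # per the contract: 1-D anomaly scores
        if self.postprocessing is not None:
            scores = self.postprocessing.fit_transform(scores.reshape(-1, 1)).ravel()
        return (scores > self.threshold).astype(int)  # 0 = normal, 1 = anomaly


def run_anomaly_detection(data, estimator):
    # All fixed logic is hidden inside this single node.
    return AnomalyDetectionTask(estimator).fit_predict(data)


anomaly_detection_pipeline = pipeline(
    [node(run_anomaly_detection, inputs=["model_input", "estimator"], outputs="labels")]
)
```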
Should it be of any interest, I will let you know if I can come up with something that works. I know that MLflow is creating similar, fixed pipelines for specific data science tasks, but they are still experimental and, at the moment, it's not possible to create user defined pipelines. Maybe this might be an interesting topic for Kedro as well.
In any case, thanks a lot for the support!!
I'm really impressed by the amount of work and thought that you as a team put into the development and maintenance of Kedro! I really hope it will have even more success in the future.
Yolan Honoré-Rougé
01/27/2023, 8:40 PM