# questions
s
Hello 🙂 ! I have a quite specific use case and can't find a way to fully exploit Kedro's potential: I have one pipeline that I want to execute on several datasets. Let's say, for example, that I have one table in a SQL database and that I want to build n datasets, e.g. with something like "SELECT * FROM TABLE WHERE INDICATOR = {i}", and train one model on each. I (GPT-4, actually) found a way to achieve this using the KedroContext object, but I would also like to use kedro_viz to visualize my pipeline (for example I/O shapes and so on) under a specific KedroContext. For example, say that at the end of my training I don't have a model for INDICATOR = 3: I would like to see whether any raw data were retrieved from the database, whether I had a problem during preprocessing, etc. Also, I would like to know if there is any automatic filing process, for example with filepaths derived from the node graph where available, or whether I have to set every folder up manually. Thank you in advance, any help would be greatly appreciated!
👀 1
d
So there are ways of using the library components like this, and it is functional, but it's "out of bounds", i.e. not really the encouraged way of using Kedro. We don't really support conditional nodes like you suggest in a native way. My recommendation is to read about the hooks system and previous threads on dynamic pipelines. As a general rule of thumb, if you need to create your own `KedroContext`, you've gone too far.
n
Is this more about dynamic pipelines or a Kedro-Viz problem?
Could you show the relevant snippets?
s
My aim is to be able to do something like this:
• First, execute one pipeline for each of my datasets. For example, I would like to be able to do `kedro run --indicator 3`, which would do a "SELECT * FROM TABLE WHERE INDICATOR = 3", save the result to my_dataset_3.csv, then train model no. 3 and save it under my_model_3.pkl. I have ~1000 indicators and want one model for each, so there is no real way to do it manually.
• Second, I would like to do `kedro viz --indicator 3` so that kedro_viz displays my pipeline with e.g. dataset statistics, metrics, etc. related to my model_3.pkl.
For now, my best shot at the first part consists of a hook that reads from a config:
```python
import logging

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, MemoryDataSet


class DataCatalogHooks:
    @property
    def _logger(self):
        return logging.getLogger(self.__class__.__name__)

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        # Read the current indicator from an external config file
        # (load_config is my own helper) and expose it to the pipeline
        # as an in-memory dataset named "indicator".
        config = load_config(path_of_config_containing_indicator)
        catalog.add("indicator", MemoryDataSet(data=config.indicator))
```
having this pipeline:
```python
from kedro.pipeline import Pipeline, node

# raw and preprocess are the project's node functions
# (assumed to live in the usual nodes module).
from .nodes import preprocess, raw


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            node(
                func=raw,
                inputs=["indicator"],
                outputs="raw_train_data",
            ),
            node(
                func=preprocess,
                inputs=["indicator", "raw_train_data"],
                outputs="preprocess_train_data",
            ),
        ]
    )
```
and write another program that, for each current_indicator in [1, 1000], updates the config to indicator: current_indicator and then invokes "kedro run" as a subprocess. But I'm a bit lost on how to do the kedro viz part.
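A minimal sketch of such a driver loop, assuming the indicator lives in a hypothetical conf/base/indicator.yml file that the hook above reads (the filename and key are illustrative, not Kedro conventions):
```python
# Hypothetical driver script: rewrite the config the hook reads, then
# invoke `kedro run` once per indicator as a subprocess.
import subprocess

import yaml

for current_indicator in range(1, 1001):
    # Overwrite the (hypothetical) config file with the current indicator.
    with open("conf/base/indicator.yml", "w") as f:
        yaml.safe_dump({"indicator": current_indicator}, f)

    # Each run picks up the new indicator via the after_catalog_created hook.
    subprocess.run(["kedro", "run"], check=True)
```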
n
For the datasets part, my instinct is that it can be replaced by dataset factories: https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html
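For illustration, a hypothetical catalog.yml sketch (dataset names and paths are made up) where a single factory pattern covers every indicator:
```yaml
# One catalog pattern serves my_dataset_1.csv ... my_dataset_1000.csv;
# the {indicator} placeholder is resolved from the dataset name at runtime.
"{indicator}.raw_train_data":
  type: pandas.CSVDataSet
  filepath: data/01_raw/my_dataset_{indicator}.csv

"{indicator}.model":
  type: pickle.PickleDataSet
  filepath: data/06_models/my_model_{indicator}.pkl
```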
and for the 1-1000 part, it should just be a namespaced pipeline with an outer loop, similar to what you would do for a time-series forecast. See https://noklam.github.io/blog/posts/understand_namespace/2023-09-26-understand-kedro-namespace-pipeline.html
s
Thank you a lot! I'll try dataset factories this afternoon. But if, as suggested, I build one big pipeline using namespaced pipelines, wouldn't that make kedro_viz crash, or at least become unusable, by trying to plot 1000 pipelines at the same time?
Hi! I found a way to achieve my goals, partly using what you suggested 🙂. Here is a very simple example: I write the following pipeline:
```python
from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline() -> Pipeline:
    iterator_list = [1, 2]
    # Build one namespaced copy of the pipeline per iterator value;
    # summing the list merges them into a single Pipeline.
    my_pipeline = sum(
        [
            pipeline(
                [
                    node(
                        func=raw,
                        inputs=["indicator"],
                        outputs="raw_train_data#csv",
                    ),
                    node(
                        func=preprocess,
                        inputs=["indicator", "raw_train_data#csv"],
                        outputs="preprocess_train_data#csv",
                    ),
                ],
                namespace=str(iterator),
            )
            for iterator in iterator_list
        ]
    )
    return my_pipeline
```
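For the "#csv"-suffixed names above to resolve, the catalog needs matching dataset factory patterns. A hypothetical sketch (the filepaths are illustrative):
```yaml
# "{namespace}" captures the iterator prefix added by namespace=str(iterator).
"{namespace}.raw_train_data#csv":
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/{namespace}/raw_train_data.csv

"{namespace}.preprocess_train_data#csv":
  type: pandas.CSVDataSet
  filepath: data/03_primary/{namespace}/preprocess_train_data.csv
```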
Then I run `kedro run`. It will run my pipeline for iterators 1 and 2. For visualization, I can set `iterator_list = [1]` and run `kedro viz` to display only my first pipeline.
You can then replace iterator_list with something set through a config.
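For instance, a minimal sketch using Kedro's OmegaConfigLoader to read the list from parameters, assuming a hypothetical iterator_list: [1, 2] entry in conf/base/parameters.yml:
```python
from kedro.config import OmegaConfigLoader


def load_iterator_list() -> list:
    # Load and merge conf/*/parameters*.yml, then pull the list out.
    loader = OmegaConfigLoader(conf_source="conf")
    return loader["parameters"]["iterator_list"]
```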
I don't really have enough experience with Kedro, so I don't really know how to implement a plugin that natively enables this.