# questions
h
Hi guys, how can I change the default dataset from MemoryDataset to a kedro-mlflow dataset? Or provide some rule to map such datasets without having to maintain two sources of datasets?

Currently I am using a modular pipeline to create namespaces for the different experiments I want to run in a single session. For example, I am testing the accuracy of several prediction methods in two ways: 1. a random train-test split, and 2. a date-based train-test split, to check performance on the latest data and detect drift. I can very easily create multiple pipelines by remapping some inputs and outputs using the modular pipeline concept; however, I want to cache some of the training steps, since these are very big (and costly) multi-modal models. I use kedro-mlflow to log the artefacts and metrics to MLflow and S3, but this requires those datasets to be described in the catalog.yml. I used the TemplatedConfigLoader with Jinja2 syntax to create a list of datasets, but now I have to maintain these lists in two different places, which is begging for bugs.

My preferred solution would be to have a single parameters file where I specify all the parameters I want to run in a grid (ParameterGrid). This could look like:
parameters.yml

ParameterGrid:
  name_of_parameter:
    version_1:
      - value1
      - value2
    version_2:
      - value1
etc.
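For example (a rough sketch only, not my actual code; `create_base_pipeline` and the exact grid structure are placeholders), the registry could expand this grid into one namespaced pipeline per version:

# pipeline_registry.py (sketch)
from kedro.pipeline import Pipeline, pipeline
from my_project.pipelines.training import create_base_pipeline  # placeholder

def register_pipelines() -> dict:
    # in reality this grid would be read from conf/base/parameters.yml;
    # hard-coded here for brevity
    parameter_grid = {
        "name_of_parameter": {
            "version_1": ["value1", "value2"],
            "version_2": ["value1"],
        }
    }
    base = create_base_pipeline()
    namespaced = [
        # namespace= prefixes all free inputs, outputs and parameters
        pipeline(base, namespace=version)
        for version in parameter_grid["name_of_parameter"]
    ]
    return {"__default__": sum(namespaced, Pipeline([]))}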
and now I could run through these options with the namespace. However, I then need dataset entries in the catalog.yml which match these `version_1` and `version_2` names, since I don't want these outputs to be stored in memory and then destroyed. Instead I want to use the kedro_mlflow datasets. So for example, for the parquet files I would use something like:
X_test_{{ split_crit }}:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
      type: pandas.ParquetDataSet
      filepath: s3://sagemaker-vertex/data/05_model_input/X_test_{{ split_crit }}.parquet
and for the metrics:
my_model_metrics_{{ split_crit }}:
    type: kedro_mlflow.io.metrics.MlflowMetricDataSet
    key: accuracy
and for the models:
multi_modal_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  save_args:
    conda_env:
        python: "3.9.10"
        dependencies:
            - "mlflow==1.27.0"
However, in Kedro these output datasets cannot be shared across the namespaced pipelines (even though in MLflow this would be fine).
d
so there are two questions here: 1. how to change the default dataset, and 2. how to do a parameter sweep with multiple runs
for 2 it’s a bit more complicated
generating different CLI commands may be the neatest
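e.g. something like this (just a sketch; it assumes each namespace from the grid is also registered as a pipeline under the same name, and the parameters.yml path is a guess at your layout):

# run_grid.py (sketch)
import subprocess
import yaml

with open("conf/base/parameters.yml") as f:
    params = yaml.safe_load(f)

for version in params["ParameterGrid"]["name_of_parameter"]:
    # one `kedro run` per grid entry, i.e. one session per experiment
    subprocess.run(["kedro", "run", "--pipeline", version], check=True)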
h
yeah, I thought about doing a run.yml with the different CLI commands
however I version my data every run, and these versions are tied to the session
so calling kedro run multiple times would make me lose the connection to the dataset versions
I think the load-multiple-datasets-with-similar-configuration approach mentioned in https://kedro.readthedocs.io/en/stable/data/data_catalog.html#load-multiple-datasets-with-similar-configuration looks very close to a good solution
basically, I would like Kedro to use the dataset in the DataCatalog with the name `model`
but namespace it such that it does not collide
to me, the issue is really with MemoryDataset being the default
I would like to specify a different default dataset for some outputs, which would then be namespaced
and I think a catalog entry style like:
_multi_modal_model: &multi_modal_model
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  save_args:
    conda_env:
        python: "3.9.10"
        dependencies:
            - "mlflow==1.27.0"
would be great
also because if I changed the CLI inputs, I would need to change them based on the parameters in the parameters.yml file
so then I would also need to access the DataCatalog before creating the CLI commands
instead of using the modular pipeline mechanism and simply remapping some inputs in the pipeline_registry
which is where I think you would expect such logic to live
anyway, my current solution is really ugly, and I figured I can't be the first one to have this issue, so I am curious about your views
d
So I think a `before_pipeline_run` hook gives you everything you need to modify this
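roughly like this (untested sketch; the dataset type and S3 path are copied from your snippets above, and the name-based rule is just a placeholder for however you identify the namespaced outputs):

# hooks.py (sketch)
from kedro.framework.hooks import hook_impl
from kedro_mlflow.io.artifacts import MlflowArtifactDataSet

class MlflowDefaultDatasetHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # outputs without a catalog entry would otherwise fall back to MemoryDataset;
        # register an MLflow-backed dataset for them instead
        for name in pipeline.all_outputs():
            if name in catalog.list():
                continue
            if name.startswith("X_test_"):  # placeholder rule for namespaced outputs
                catalog.add(
                    name,
                    MlflowArtifactDataSet(
                        data_set=dict(
                            type="pandas.ParquetDataSet",
                            filepath=f"s3://sagemaker-vertex/data/05_model_input/{name}.parquet",
                        )
                    ),
                )

and then register it in settings.py with HOOKS = (MlflowDefaultDatasetHook(),)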