# questions
h
Hi guys, how can I change the default dataset from MemoryDataset to a kedro-mlflow dataset? Or provide some rule to map such datasets without having to maintain two sources of datasets?

Currently I am using a modular pipeline to create namespaces for the different experiments I want to run in a single session. For example, I am testing the accuracy of several prediction methods in two ways: 1. a random train-test split, and 2. a date-based train-test split, to check performance on the latest data and detect drift. I can very easily create multiple pipelines by remapping some inputs and outputs using the modular pipeline concept; however, I want to cache some of the training steps, since these are very big (and costly) multi-modal models. I use kedro-mlflow to log the artefacts and metrics to MLflow and S3, but this requires those datasets to be described in the catalog.yml. I used the TemplatedConfigLoader with Jinja2 syntax to create a list of datasets, but now I have to maintain these lists in two different places, which is begging for bugs.

My preferred solution would be to have a single parameters file where I specify all the parameters I want to run in a grid (ParameterGrid). This could look like:
parameters.yml

ParameterGrid:
  name_of_parameter:
    version_1:
      - value1
      - value2
    version_2:
      - value1
etc.
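For example (a rough sketch only, not my actual code; `create_base_pipeline` and the exact grid structure are placeholders), the registry could expand this grid into one namespaced pipeline per version:

# pipeline_registry.py (sketch)
from kedro.pipeline import Pipeline, pipeline
from my_project.pipelines.training import create_base_pipeline  # placeholder

def register_pipelines() -> dict:
    # in reality this grid would be read from conf/base/parameters.yml;
    # hard-coded here for brevity
    parameter_grid = {
        "name_of_parameter": {
            "version_1": ["value1", "value2"],
            "version_2": ["value1"],
        }
    }
    base = create_base_pipeline()
    namespaced = [
        # namespace= prefixes all free inputs, outputs and parameters
        pipeline(base, namespace=version)
        for version in parameter_grid["name_of_parameter"]
    ]
    return {"__default__": sum(namespaced, Pipeline([]))}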
and now I could run through these options with the namespace. However, I then need dataset entries in the catalog.yml which match these `version_1` and `version_2` names, since I don't want these outputs to be stored in memory and then destroyed. Instead I want to use the kedro_mlflow datasets. So for example, for the parquet files I would use something like:
X_test_{{ split_crit }}:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
      type: pandas.ParquetDataSet
      filepath: s3://sagemaker-vertex/data/05_model_input/X_test_{{ split_crit }}.parquet
and for the metrics:
my_model_metrics_{{ split_crit }}:
    type: kedro_mlflow.io.metrics.MlflowMetricDataSet
    key: accuracy
and for the models:
multi_modal_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  save_args:
    conda_env:
        python: "3.9.10"
        dependencies:
            - "mlflow==1.27.0"
However, in Kedro these output datasets cannot be shared across the namespaced pipelines (even though in MLflow this would be fine).
d
so there are two questions here: 1. how to change the default dataset, and 2. how to do a parameter sweep with multiple runs
for 2 it’s a bit more complicated
generating different CLI commands may be the neatest
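e.g. something like this (just a sketch; it assumes each namespace from the grid is also registered as a pipeline under the same name, and the parameters.yml path is a guess at your layout):

# run_grid.py (sketch)
import subprocess
import yaml

with open("conf/base/parameters.yml") as f:
    params = yaml.safe_load(f)

for version in params["ParameterGrid"]["name_of_parameter"]:
    # one `kedro run` per grid entry, i.e. one session per experiment
    subprocess.run(["kedro", "run", "--pipeline", version], check=True)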
h
yeah, I thought about doing a run.yml with the different CLI commands
however I version my data every run, and these versions are tied to the session
so calling kedro run multiple times would make me lose the connection to the dataset versions
I think the load-multiple-datasets-with-similar-configuration approach mentioned in https://kedro.readthedocs.io/en/stable/data/data_catalog.html#load-multiple-datasets-with-similar-configuration looks very close to a good solution
basically, I would like Kedro to use the dataset in the DataCatalog with the name `model`
but namespace it such that it does not collide
to me, the issue is really with MemoryDataset being the default
I would like to specify a different default dataset for some outputs, which would then be namespaced
and I think a catalog entry style like:
_multi_modal_model: &multi_modal_model
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  save_args:
    conda_env:
        python: "3.9.10"
        dependencies:
            - "mlflow==1.27.0"
would be great
also because if I changed the CLI inputs, I would need to change them based on the parameters in the parameters.yml file
so then I would also need to access the DataCatalog before creating the CLI commands
instead of using the modular pipeline mechanism and simply remapping some inputs in the pipeline_registry
which is where I think you would expect such logic to live
anyway, my current solution is really ugly, and I figured I can't be the first one to have this issue, so I am curious about your views
d
So I think a `before_pipeline_run` hook gives you everything you need to modify this
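roughly like this (untested sketch; the dataset type and S3 path are copied from your snippets above, and the name-based rule is just a placeholder for however you identify the namespaced outputs):

# hooks.py (sketch)
from kedro.framework.hooks import hook_impl
from kedro_mlflow.io.artifacts import MlflowArtifactDataSet

class MlflowDefaultDatasetHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # outputs without a catalog entry would otherwise fall back to MemoryDataset;
        # register an MLflow-backed dataset for them instead
        for name in pipeline.all_outputs():
            if name in catalog.list():
                continue
            if name.startswith("X_test_"):  # placeholder rule for namespaced outputs
                catalog.add(
                    name,
                    MlflowArtifactDataSet(
                        data_set=dict(
                            type="pandas.ParquetDataSet",
                            filepath=f"s3://sagemaker-vertex/data/05_model_input/{name}.parquet",
                        )
                    ),
                )

and then register it in settings.py with HOOKS = (MlflowDefaultDatasetHook(),)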