# questions
Hello, I'm very new to the ML world. I'm trying to set up a general framework for our ML projects, one we could start from every time, where I would like to use:
• kedro for creating training pipelines and overall project structure
• mlflow for experiment tracking and model registry
• dvc for dataset versioning
• TensorFlow as the machine learning framework
• Ray Tune for hyperparameter tuning
But I'm struggling, and I would appreciate it if anybody could give me high-level instructions on how to combine all of this. Let's use a hypothetical but probably very common and realistic scenario: I am asked to create an ML model that performs best on handwritten character recognition. I have at hand a dataset of handwritten characters and associated labels (e.g. MNIST) that I plan to extend on a regular basis to make it grow. I'm planning to try different ML models, and for each model I will run hyperparameter tuning to get the "best" out of it. At the end of the day I want to cherry-pick the best-performing model (let's assume accuracy is the key criterion).
Here are my initial thoughts on how to approach this, which maybe can be confirmed (or not):
1. The scenario corresponds to one single kedro project (and not one project per experiment; cf. my definition of experiment in point 3).
2. I will create a first pipeline that transforms the dataset, splitting it into train/val/test and doing feature engineering.
3. Each trial of a different model topology is what I call an experiment, and will be materialized as one kedro pipeline per experiment; that kedro pipeline will basically do training + hyperparameter tuning on the validation set + evaluation on the test set.
4. `kedro run` is going to run the whole project, i.e. the data preparation pipeline plus all experiments together in one single run.
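To make points 1 and 4 concrete, here is a minimal sketch of how I imagine the registry mapping pipeline names to what `kedro run --pipeline=...` executes. All pipeline and node names are hypothetical; in a real Kedro project this would live in `src/<package>/pipeline_registry.py` and return `kedro.pipeline.Pipeline` objects, but here plain lists of step names stand in so the sketch is self-contained:

```python
# Conceptual stand-in for src/<package>/pipeline_registry.py.
# In a real Kedro project each value would be a kedro.pipeline.Pipeline;
# plain lists of (made-up) node names are used here instead.

def register_pipelines():
    data_prep = ["split_train_val_test", "feature_engineering"]

    # One pipeline per "experiment" (model topology), as in point 3.
    experiments = {
        "experiment_cnn": ["tune_cnn", "evaluate_cnn_on_test"],
        "experiment_mlp": ["tune_mlp", "evaluate_mlp_on_test"],
    }

    pipelines = {"data_prep": data_prep, **experiments}

    # A bare `kedro run` executes "__default__": data prep + all experiments.
    pipelines["__default__"] = data_prep + [
        step for steps in experiments.values() for step in steps
    ]
    return pipelines
```

With this layout, `kedro run --pipeline=experiment_cnn` reruns just one experiment, relying on the persisted parquet outputs of `data_prep`.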
I can also rerun a single experiment with `kedro run --pipeline=experimentX`, since the output of my data preparation pipeline is persisted in parquet files; I can run it once and then run my experiments many times. If what I just described sounds conceptually right and is the intended way of using Kedro, I more or less managed to reach that point...
Now I want to add experiment tracking with MLflow. I added the kedro-mlflow plugin to the project. But this is where things start to get (even) more complicated... Just doing that doesn't really work that well. I see that in the mlflow UI, my kedro project name is logged as an experiment, while each `kedro run` I do is logged as a run. Which does very weird things: if I do a `kedro run`, it logs a single mlflow run named "__default__" containing a big mess of all my experiments together; but next to that, if I do `kedro run --pipeline=experimentX`, I get a single run named "experimentX". I feel like my kedro project organization is not mapping correctly to the kedro-mlflow concepts? What I would like to see in mlflow is probably that each pipeline I created that corresponds to an experiment shows up as a distinct experiment in mlflow terms (does that make sense?), and that each run of a given pipeline (via `kedro run` or `kedro run --pipeline=experimentX`) adds runs inside the corresponding experiment... Or am I wrong there, and should I rather see my whole "scenario" as one mlflow experiment, with some kind of run and sub-run hierarchy, where the first level corresponds to experiments and the second level to a run of each experiment?
Now jumping to DVC. I didn't try it yet, but what I envision here is that I will create a git repo dedicated to the versioning of my dataset.
So inside that git repo, I will configure/initialize DVC and create tags for each baseline of my dataset (e.g. v1, v2, etc.), where the actual dataset files will be hosted on a server and only the DVC files will be baselined in git. The benefit of doing that is that version control of the dataset is centralized and can be reused across multiple different "scenarios" I want to work on using that same dataset. Then, back in my kedro project (which is in a separate git repo, btw), I will remove the exclusion of "data/01_raw" from the .gitignore file and import the dataset repo into that folder as a git submodule pointing to the right version of the dataset I want. That way, my kedro project is baselined in git together with a given version of the dataset it works on (makes sense?).
I'm then hoping that when I do a `kedro run`, the git info about my kedro project gets logged into the mlflow runs, so that for a given trained model in mlflow I can trace back which dataset and code base were used to train it. But from what I have seen with the kedro-mlflow plugin, for some reason my git info doesn't get logged into the run tags... This is another problem I'm facing and not managing to resolve for now...
As for Ray Tune, I'm hoping this will be as simple as updating the code to use it inside the training node...
Any comments, suggestions, or ideas are very welcome. I'm assuming I'm trying to do something very common, so maybe there is even a sample project somewhere doing all this?
Thanks, Edouard
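The DVC plan above could be sketched with roughly these commands (a workflow sketch only: `charset-data`, the S3 remote, and `<dataset-repo-url>` are all placeholders):

```shell
# In a dedicated dataset repo: track the data with DVC, keep only
# .dvc metafiles in git, and tag each dataset baseline.
git init charset-data && cd charset-data
dvc init
dvc remote add -d storage s3://my-bucket/charset-data   # placeholder remote
dvc add raw/characters            # writes raw/characters.dvc + raw/.gitignore
git add raw/characters.dvc raw/.gitignore .dvc
git commit -m "Dataset baseline v1"
git tag v1
dvc push                          # upload the actual files to the remote

# In the kedro project repo: pin the dataset repo at a given version.
git submodule add <dataset-repo-url> data/01_raw/charset-data
(cd data/01_raw/charset-data && git checkout v1 && dvc pull)
git add .gitmodules data/01_raw/charset-data
git commit -m "Pin dataset v1"
```

The `dvc pull` inside the submodule is what actually materializes the data files locally, since git only tracks the `.dvc` metafiles.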
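Coming back to the kedro-mlflow mapping question above: as far as I understand, the experiment and run naming is driven by the plugin's `conf/local/mlflow.yml`. A fragment might look roughly like this (a sketch only; the exact keys vary between kedro-mlflow versions, and `handwritten_chars` is a placeholder name):

```yaml
# conf/local/mlflow.yml (kedro-mlflow) -- sketch, key names may vary by version
server:
  mlflow_tracking_uri: mlruns    # local ./mlruns file store
tracking:
  experiment:
    name: handwritten_chars      # placeholder: every `kedro run` logs here
  run:
    name: null                   # left unset, runs appear to take the pipeline name
```

This would explain why all my runs land in a single experiment named after the project, with runs called "__default__" or "experimentX".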
• You can use Kedro with MLflow easily via https://github.com/Galileo-Galilei/kedro-mlflow
• DVC and Kedro don't gel super nicely together; it can be done, but our support for native DataSet versioning, and the Delta formats (Spark and non-Spark), also work in this space
• This is quite an advanced configuration. You can read the recent article by GetInData on how to do the parameter-sweep part, which is a bit like what you describe: https://getindata.com/blog/kedro-dynamic-pipelines
• TensorFlow is supported by Kedro
• With reproducibility in mind, we don't support the conditional logic that would let Kedro then pick the best option. There are ways to achieve this, but it's a conscious decision not to support it in a simple way
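For example, one way to "pick the best" without conditional logic inside the pipeline is a small post-hoc selection script that runs after all experiments finish. A sketch, assuming each experiment logged an `accuracy` metric; in practice the `runs` records would come from the tracking store (e.g. via `mlflow.search_runs`) rather than a hard-coded list:

```python
# Post-hoc model selection, kept outside the Kedro pipeline on purpose.
# `runs` stands in for what a tracking-store query would return; the run
# names and metric values below are made up for illustration.

def pick_best_run(runs, metric="accuracy"):
    """Return the run dict with the highest value for `metric`."""
    scored = [r for r in runs if metric in r.get("metrics", {})]
    if not scored:
        raise ValueError(f"no runs logged metric {metric!r}")
    return max(scored, key=lambda r: r["metrics"][metric])

runs = [
    {"name": "experiment_cnn", "metrics": {"accuracy": 0.991}},
    {"name": "experiment_mlp", "metrics": {"accuracy": 0.972}},
]
best = pick_best_run(runs)  # best["name"] == "experiment_cnn"
```

Keeping this step outside `kedro run` keeps every pipeline run deterministic while still letting you promote the winner (e.g. to the mlflow model registry) afterwards.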
I'd comment on the kedro-mlflow part but likely not before this weekend because there are a lot of different subtleties here.