# questions
s
Hi everyone, I have a question about integrating MLflow into my Kedro project. Currently, all outputs from my Kedro project are being stored in a designated folder within the project directory (e.g.,
mykedroproject/
), as specified in my
catalog.yml
. However, I've noticed that when I implement MLflow, artifacts and metrics are logged in a different location (under the
mlruns
directory). This results in the same outputs being stored twice: once through Kedro and again via MLflow. Do you have any advice on how to address this issue so that I store results only once? Ideally, I would like to have specific artifacts displayed in the MLflow UI, sourced directly from the
mykedroproject/
folder. Thanks in advance!!
👍 1
j
Hi Sid, have you tried the
kedro-mlflow
plugin? You can follow the official guide here: Kedro-MLflow Plugin Guide. This guide explains how to configure and use the plugin effectively. If you need advanced customisation, refer to the last chapter of the guide, which details how to use hooks.
👍 1
y
I am not sure I understand the question: MLflow duplicates your data, parameters, and metrics by design, so that you keep track of the entire history, while Kedro only keeps track of the last version written during
kedro run
. If you have Kedro versioning enabled, you can turn it off, but if you are using datasets without versioning, this is the intended behaviour.
👍 1
Usually people use a tracking server and an S3 backend to store data, because storing every run locally can take up a lot of disk space.
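For reference, a minimal sketch of what that can look like from the client side with the plain MLflow Python API. The tracking URI, bucket, and experiment name are hypothetical placeholders, and this assumes a tracking server is already running. Note that `artifact_location` only takes effect when the experiment is first created.

```python
def s3_artifact_root(bucket: str, prefix: str) -> str:
    """Build an s3:// URI to use as an experiment's artifact root."""
    return f"s3://{bucket}/{prefix.strip('/')}"


def use_remote_tracking(tracking_uri: str, experiment: str, artifact_root: str) -> str:
    """Point MLflow at a remote server and return the experiment ID.

    The experiment is created with a remote artifact root if it does not
    exist yet; an existing experiment keeps whatever root it already has.
    """
    import mlflow  # imported here so the helper above works without MLflow

    mlflow.set_tracking_uri(tracking_uri)
    existing = mlflow.get_experiment_by_name(experiment)
    if existing is None:
        return mlflow.create_experiment(experiment, artifact_location=artifact_root)
    return existing.experiment_id


# Hypothetical usage (all names are placeholders):
# use_remote_tracking(
#     "http://my-mlflow-server:5000",
#     "mykedroproject",
#     s3_artifact_root("my-bucket", "mlflow-artifacts"),
# )
```

Writing to S3 this way also needs `boto3` installed and AWS credentials available in the environment.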
s
Thank you, Jitendra and Yolan! Very helpful insights. I’ve successfully configured my setup to store artifacts, such as
regressor.pickle
, exclusively in the
mlruns
directory. However, I am facing challenges with the reverse process: I want MLflow to retrieve artifacts directly from my Kedro project directory, so that my project structure remains intact (without the same artifact being duplicated in
mlruns
). Specifically, I want to maintain my organized subfolders within
mykedroproject
(e.g.,
raw
,
features
, etc.) that adhere to the Kedro layer nomenclature. This arrangement makes debugging more straightforward, as I can avoid using run IDs and UUIDs assigned by MLflow. And yes, I am currently versioning my Kedro run artifacts. Any thoughts/advice on this?
l
Would be nice to have input on this one! We've also tried to do this, but it seems it's not possible to organise them as you like, since MLflow forces runs to be housed in a directory for the experiment.
💯 1
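One possible workaround, sketched here as an assumption rather than a confirmed solution: leave the files where Kedro writes them and log only a small manifest (relative path, size, checksum) to MLflow. The run then points at the Kedro folder without copying its contents. The function names and the manifest file name are hypothetical; `mlflow.log_dict` is the only MLflow call used. The MLflow UI would show the manifest, not a preview of the files themselves.

```python
import hashlib
from pathlib import Path


def build_manifest(data_dir: str) -> dict:
    """List every file under data_dir with its relative path, size, and SHA-256."""
    root = Path(data_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append(
            {
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            }
        )
    return {"root": str(root.resolve()), "files": entries}


def log_manifest(data_dir: str, artifact_file: str = "kedro_manifest.json") -> None:
    """Log the manifest as one small JSON artifact instead of copying the files."""
    import mlflow  # imported here so build_manifest works without MLflow

    # log_dict writes to the active run, or starts one if none is active.
    mlflow.log_dict(build_manifest(data_dir), artifact_file)
```

Calling `log_manifest` at the end of a run (for example from a Kedro hook) records where the versioned outputs live, and the checksums let you verify later that the files have not changed.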