# plugins-integrations
h
For kedro-mlflow: Would it be possible to only store certain files to mlflow instead of also having to store them to the file path? And if not, would the maintainers @Yolan Honoré-Rougé be open to a PR for the mlflow artifact dataset which would allow this behavior? For example, such that when the file path is empty, a temp file is used for storing the data, which is then logged to mlflow
mlflow 1
I first tried by creating a custom_resolver that would generate a temp file, but cleaning it up afterwards turned out to be quite complicated (basically I would need to create a hook for it), so the cleanest option is to include it in the kedro-mlflow dataset
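The temp-file idea above could be sketched roughly like this. This is a minimal sketch, not the actual kedro-mlflow implementation: `log_artifact` here is a stand-in for `mlflow.log_artifact`, and the function name is hypothetical. It shows why cleanup is trivial inside a dataset's `save()` but awkward with a resolver, which would need a separate hook to delete the temp files afterwards.

```python
import tempfile
from pathlib import Path

logged = []  # stand-in for the MLflow artifact store

def log_artifact(local_path: str) -> None:
    """Stand-in for mlflow.log_artifact: record the file name that
    would be uploaded to the active run."""
    logged.append(Path(local_path).name)

def save_to_mlflow_only(data: str) -> None:
    """Write data to a temporary file, log it, and let the context
    manager clean up on exit. Doing this inside the dataset's save()
    keeps the temp file's whole lifecycle in one place."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "artifact.txt"
        tmp_path.write_text(data)
        log_artifact(str(tmp_path))
    # the temp directory is gone here; only the logged copy remains

save_to_mlflow_only("hello")
```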
y
Yes that's something I should have done for a long time, definitely something I'd accept a PR for!
h
nice! I'll get on it
🔥 4
y
The more I think about it, the more I am inclined to think this lives in the `MlflowArtifactDataset.save` method and not in a hook
This would increase consistency between interactive and CLI workflow
And maybe we can tweak it to accept a temp folder manager and pass it through a resolver
Need to prototype a little to find out the best developer experience
h
yeah, the hook was only an experiment to avoid having to make any changes to the kedro-mlflow dataset
I was thinking, by the way, that the reverse of this idea would also be interesting: an option to only save the reference to the artifact to mlflow, but not the artifact itself. Such that when you load it, it loads the location of the artifact (given the experiment id, for example) and then loads the data using the underlying dataset. Basically an mlflow artifact dataset where you can choose whether you want to save only to mlflow, only to the filepath (but versioned with mlflow), or save to both. As far as I understood, the current implementation saves to both, right?
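The reference-only variant proposed above could look something like this sketch. It is purely illustrative, not kedro-mlflow behaviour: the in-memory `artifact_index` stands in for the MLflow tracking store, and all names are hypothetical. (With real MLflow, a local path for a run's artifact can be recovered with `mlflow.artifacts.download_artifacts(run_id=..., artifact_path=...)`.)

```python
import tempfile
from pathlib import Path

# Stand-in for the MLflow tracking store: run id -> artifact filepath.
artifact_index: dict[str, str] = {}

def save_reference_only(run_id: str, filepath: str, data: str) -> None:
    """Save with the underlying dataset (plain text here) and record
    only the artifact's location against the run id, instead of
    uploading the file itself."""
    Path(filepath).write_text(data)
    artifact_index[run_id] = filepath

def load_by_run_id(run_id: str) -> str:
    """Look up the filepath recorded for the run, then load the data
    with the underlying dataset."""
    return Path(artifact_index[run_id]).read_text()

with tempfile.TemporaryDirectory() as tmp:
    fp = str(Path(tmp) / "big_file.txt")
    save_reference_only("run-123", fp, "payload")
    roundtrip = load_by_run_id("run-123")
```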
y
Yes it saves both
I am not sure we want to do this, this is quite confusing
If you don't want to save in mlflow, just keep the underlying dataset
You can have 2 envs, one with mlflow and one without
h
yeah, I get your point; it only arises when you have to use a custom dataset to save something to s3, for example. I had a case where a file was simply too big to be saved as a pickle, so I used the cloudpickle implementation. That raises the question whether the built-in saving methods of mlflow would be able to log that artifact, in which case I would want to (for example) version by the mlflow run id, but not save using mlflow.log_artifact. Anyway, there are also other ways around that, like dataset factories, hooks, and other solutions
unfortunately kedro's built-in tools do not allow one to pass the mlflow_run_id from runtime params to the factories where you need it, so I had to do some ugly hacks in a custom CLI implementation and an environment variable
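The environment-variable workaround described above might look roughly like this. It is a sketch under assumptions: the resolver name and the idea of a custom CLI wrapper exporting the variable are hypothetical, not the actual implementation. (With Kedro's `OmegaConfigLoader`, such a function would be registered through the `custom_resolvers` config-loader argument in `settings.py`.)

```python
import os

def mlflow_run_id(default: str = "local") -> str:
    """Hypothetical resolver body: read the run id from an environment
    variable exported by the custom CLI wrapper before the Kedro
    session starts, falling back to a default outside tracked runs."""
    return os.environ.get("MLFLOW_RUN_ID", default)

# The custom CLI wrapper would export the variable before `kedro run`:
os.environ["MLFLOW_RUN_ID"] = "abc123"
resolved = mlflow_run_id()
```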
anyway, I'm rambling a bit. I'll get back to you with a proposal for saving only to mlflow, as discussed earlier
👍 1
y
I think you should rather change what your node returns if you want to log it in mlflow, instead of changing the dataset behaviour
Yes, I understand the incompatibility of runtime_params + factory, but the resolver should still be the way to go