# plugins-integrations
h
For kedro-mlflow: Would it be possible to only store certain files to mlflow instead of also having to store them to the file path? And if not, would the maintainers @Yolan Honoré-Rougé be open to a PR for the mlflow artifact dataset which would allow this behavior? For example, such that when the file path is empty, a temp file is used for storing the data, which is then logged to mlflow
mlflow 1
I first tried by creating a custom_resolver that would generate a temp file, but cleaning it up afterwards turned out to be quite complicated (basically I would need to create a hook for it), so the cleanest option is to include it in the kedro-mlflow dataset
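The temp-file idea above could be sketched roughly like this. This is a minimal sketch, not the actual kedro-mlflow implementation: `log_artifact` here is a stand-in for `mlflow.log_artifact`, and the function name is hypothetical. It shows why cleanup is trivial inside a dataset's `save()` but awkward with a resolver, which would need a separate hook to delete the temp files afterwards.

```python
import tempfile
from pathlib import Path

logged = []  # stand-in for the MLflow artifact store

def log_artifact(local_path: str) -> None:
    """Stand-in for mlflow.log_artifact: record the file name that
    would be uploaded to the active run."""
    logged.append(Path(local_path).name)

def save_to_mlflow_only(data: str) -> None:
    """Write data to a temporary file, log it, and let the context
    manager clean up on exit. Doing this inside the dataset's save()
    keeps the temp file's whole lifecycle in one place."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / "artifact.txt"
        tmp_path.write_text(data)
        log_artifact(str(tmp_path))
    # the temp directory is gone here; only the logged copy remains

save_to_mlflow_only("hello")
```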
y
Yes that's something I should have done for a long time, definitely something I'd accept a PR for!
h
nice! I'll get on it
🔥 4
y
The more I think about it, the more I am inclined to think this lives in the `MlflowArtifactDataset.save` method and not in a hook
This would increase consistency between interactive and CLI workflow
And maybe we can tweak it to accept a temp folder manager and pass it through a resolver
Need to prototype a little to find out the best developer experience
h
yeah, the hook was only an experiment to avoid having to make any changes to the kedro-mlflow dataset
I was thinking, by the way, that the reverse of this idea would also be interesting: an option to only save the reference to the artifact to mlflow, but not the artifact itself. Such that when you load it, it loads the location of the artifact (given the experiment id, for example) and then loads the data using the underlying dataset. Basically an mlflow artifact dataset where you can choose whether you want to save only to mlflow, only to the filepath (but versioned with mlflow), or save to both. As far as I understood, the current implementation saves to both, right?
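The reference-only variant proposed above could look something like this sketch. It is purely illustrative, not kedro-mlflow behaviour: the in-memory `artifact_index` stands in for the MLflow tracking store, and all names are hypothetical. (With real MLflow, a local path for a run's artifact can be recovered with `mlflow.artifacts.download_artifacts(run_id=..., artifact_path=...)`.)

```python
import tempfile
from pathlib import Path

# Stand-in for the MLflow tracking store: run id -> artifact filepath.
artifact_index: dict[str, str] = {}

def save_reference_only(run_id: str, filepath: str, data: str) -> None:
    """Save with the underlying dataset (plain text here) and record
    only the artifact's location against the run id, instead of
    uploading the file itself."""
    Path(filepath).write_text(data)
    artifact_index[run_id] = filepath

def load_by_run_id(run_id: str) -> str:
    """Look up the filepath recorded for the run, then load the data
    with the underlying dataset."""
    return Path(artifact_index[run_id]).read_text()

with tempfile.TemporaryDirectory() as tmp:
    fp = str(Path(tmp) / "big_file.txt")
    save_reference_only("run-123", fp, "payload")
    roundtrip = load_by_run_id("run-123")
```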
y
Yes it saves both
I am not sure we want to do this, this is quite confusing
If you don't want to save in mlflow, just keep the underlying dataset
You can have 2 envs, one with mlflow and one without
h
yeah, I get your point; it only arises when you have to use a custom dataset to save something to s3, for example. I had a case where a file was simply too big to be saved as a pickle, so I used the cloudpickle implementation. That raises the question whether the built-in saving methods of mlflow would be able to log that artifact, in which case I would want to (for example) version by the mlflow run id, but not save using mlflow.log_artifact. Anyway, there are also other ways around that, like dataset factories, hooks, and other solutions
unfortunately kedro's built-in tools do not allow one to pass the mlflow_run_id from runtime params to the factories where you need it, so I had to do some ugly hacks in a custom CLI implementation and an environment variable
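The environment-variable workaround described above might look roughly like this. It is a sketch under assumptions: the resolver name and the idea of a custom CLI wrapper exporting the variable are hypothetical, not the actual implementation. (With Kedro's `OmegaConfigLoader`, such a function would be registered through the `custom_resolvers` config-loader argument in `settings.py`.)

```python
import os

def mlflow_run_id(default: str = "local") -> str:
    """Hypothetical resolver body: read the run id from an environment
    variable exported by the custom CLI wrapper before the Kedro
    session starts, falling back to a default outside tracked runs."""
    return os.environ.get("MLFLOW_RUN_ID", default)

# The custom CLI wrapper would export the variable before `kedro run`:
os.environ["MLFLOW_RUN_ID"] = "abc123"
resolved = mlflow_run_id()
```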
anyway, I'm rambling a bit. I'll get back to you with a proposal for saving only to mlflow, as discussed earlier
👍 1
y
I think you should rather change what your node returns if you want to log it in mlflow, instead of changing the dataset behaviour
Yes, I understand the incompatibility of runtime_params + factory, but the resolver should still be the way to go