# questions
h
Hi team, I am facing an issue with logging an artifact using kedro-mlflow. In catalog.yml I've modified a dataset from this
reporting_patient.pre_cohort_waterfall:
  type: pandas.CSVDataSet
  filepath: ${base_path}/reporting_patient/pre_cohort_waterfall.csv
to this
reporting_patient.pre_cohort_waterfall:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pandas.CSVDataSet
    filepath: ${base_path}/reporting_patient/pre_cohort_waterfall.csv
mlflow is connected with Databricks, and the filepath ${base_path}/reporting_patient/pre_cohort_waterfall.csv is on S3. When I run the relevant node, MlflowArtifactDataSet successfully saves the dataset to S3, but it fails to log the artifact when calling mlflow.store.artifact.databricks_artifact_repo.DatabricksArtifactRepository.log_artifact. The error is raised in the source code when it tries to upload the file to S3. It passes ${base_path}/reporting_patient/pre_cohort_waterfall.csv as a local path, but the dataset is an S3 object and was never written to that local path, so I end up with the error No such file or directory. Is there something I am missing in the configuration that causes this issue? Any advice on this matter? Thanks in advance.
m
Hi @Hadeel Mustafa, we are aware of this bug and will be releasing a fix in our next release.
👍 1
m
For plugin related questions please go to #plugins-integrations
👍 1
m
In the next 1-2 weeks.
h
thanks!
m
If you want, you could install the unreleased main branch directly, which already contains the fix.
h
@Merel could you please give me the link to that main branch? I would like to test it
n
Hey @Hadeel Mustafa! Kedro is open source on GitHub, you can find it here: https://github.com/kedro-org/kedro You can always pip install directly from a git repo:
pip install git+https://github.com/kedro-org/kedro.git@main
h
thanks @Nok Lam Chan
@Nok Lam Chan another related question, due to this limitation. I know that run_id is an opaque identifier that MLflow generates and that I cannot create it myself. I am now trying to log the artifact in a different way, where the catalog path includes the run_id, like this
reporting_patient.{mlflow_run_id}_pre_cohort_waterfall:
  type: pandas.CSVDataSet
  filepath: ${base_path}/reporting_patient/{mlflow_run_id}/pre_cohort_waterfall.csv
Using the Kedro dataset factories or any other method, is there a way to store mlflow_run_id before Kedro starts running the node?
n
how are you initiating MLflow now?
So the best way to do this is with hooks
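A minimal sketch of the path substitution such a hook would need to perform, assuming the run id is already known at that point. With MLflow, an active run's id can be read from mlflow.active_run().info.run_id once a run has been started; the function name fill_run_id and the example id below are hypothetical. Plain string replacement is used instead of str.format because the ${base_path} global also contains braces and must be left untouched.

```python
def fill_run_id(filepath_template: str, run_id: str) -> str:
    """Replace the {mlflow_run_id} placeholder, leaving ${...} globals untouched."""
    return filepath_template.replace("{mlflow_run_id}", run_id)


# Hypothetical run id, for illustration only.
template = "${base_path}/reporting_patient/{mlflow_run_id}/pre_cohort_waterfall.csv"
resolved = fill_run_id(template, "abc123")
print(resolved)
```

This only shows the string handling. Where exactly the hook applies the resolved path to the catalog depends on the Kedro version and is not covered here.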
h
I am using mlflow.yml only for now
n
Are you using any plugin or you are reading this yourself? mlflow.yml is just a configuration file
Ah ok, I just read the full thread, so you are using kedro-mlflow. Cc @Yolan Honoré-Rougé, the author of kedro-mlflow, to see if he has a better solution here.
h
yes kedro-mlflow
@Nok Lam Chan @Yolan Honoré-Rougé I figured out that I can use the experiment name as part of the catalog path, since it can be user-defined like this
tracking:
  run:
    id: null # if `id` is None, a new run will be created
    name: ${mlflow_experiment_name} # if `name` is None, pipeline name will be used for the run name
    nested: True  # if `nested` is False, you won't be able to launch sub-runs inside your nodes
so the catalog path can be
filepath: ${base_path}/reporting_patient/${mlflow_experiment_name}/pre_cohort_waterfall.csv
@Yolan Honoré-Rougé another question though: globals.yml is not captured as parameters. I understand why, since Kedro globals are not Kedro parameters, but what is the best way to log them in kedro-mlflow? @Nok Lam Chan (if you can add input here 🙏)
n
@Hadeel Mustafa Which configloader and Kedro version are you using?
Two ways to do it:
• config_loader["globals"] to log all global parameters
• log the global parameters inside your parameter files, i.e.
# parameters.yml
x: ${global_something_parameters}
a
re: the original question - this does not seem like the same bug that we fixed, that was specifically for dataset factories.
n
These template values should already be used somewhere in your parameters.yml, otherwise why do you need the globals.yml? To log them explicitly, either go with method 1 or simply read your globals.yml.
h
@Nok Lam Chan kedro: 0.18.12
@Nok Lam Chan the first method creates a dependency: whenever a global is added, one must remember to add it to parameters.yml as well, so I wanted to steer away from this method. Yes, you are correct, we do use some of the globals as part of a parameter value, but other globals are used in catalog.yml only.
n
And which config loader are you using?
TemplatedConfigLoader:
• conf_loader._config_mapping
OmegaConfigLoader:
• conf_loader["globals"]
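A minimal sketch of what could be done with the globals once they are in hand, assuming they have already been read into a plain dictionary (through one of the config loader accesses above, or by reading globals.yml directly). Nested values are flattened into dotted keys so that the result can be passed to mlflow.log_params, which takes a flat dictionary. The helper name flatten_params and all example values are hypothetical.

```python
from typing import Any, Dict


def flatten_params(params: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested dict into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in params.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the key path along.
            flat.update(flatten_params(value, full_key))
        else:
            flat[full_key] = value
    return flat


# Hypothetical globals, for illustration only.
example_globals = {
    "base_path": "s3://my-bucket/project",
    "mlflow_experiment_name": "cohort_v1",
    "model": {"seed": 42},
}
flat_globals = flatten_params(example_globals)
print(flat_globals)
```

Inside a hook with an active run, flat_globals could then be logged with mlflow.log_params(flat_globals). Prefixing the keys (for example with "globals.") is one way to avoid clashes with the regular Kedro parameters.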
y
Actually I think this is an mlflow limitation. The underlying log_artifact only works with a local path.
You can work around it by creating a custom dataset which first downloads the data from S3 to a local temp directory and then logs it.
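A minimal sketch of the download-then-log step such a custom dataset would need, written with the standard library only. The download and log_artifact callables are injected, so nothing here is tied to a particular S3 client. The function name log_remote_file_as_artifact is hypothetical and this is not kedro-mlflow's implementation.

```python
import tempfile
from pathlib import Path, PurePosixPath
from typing import Callable


def log_remote_file_as_artifact(
    remote_path: str,
    download: Callable[[str, str], None],
    log_artifact: Callable[[str], None],
) -> None:
    """Copy a remote object to a temporary local file, then log that local file."""
    # Keep the original file name so the artifact is named as expected.
    filename = PurePosixPath(remote_path).name
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = Path(tmp_dir) / filename
        download(remote_path, str(local_path))
        # log_artifact must run while the temporary directory still exists.
        log_artifact(str(local_path))
```

In practice, log_artifact could be mlflow.log_artifact, and download could be the get method of an fsspec-compatible filesystem such as s3fs. Both are assumptions about the surrounding setup. The temporary directory is removed as soon as the artifact has been logged.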
h
thank you @Yolan Honoré-Rougé, this is helpful!
👍 1
r
Similar to the thread but not exactly the same: we're using modular pipelines with namespaces and dataset factories for running backtests. When using a JSON dataset (or any other dataset really), Kedro correctly replaces {namespace} with the actual namespace name. When using the kedro_mlflow.io.artifacts.MlflowArtifactDataSet wrapper, the {namespace} replacement does not work. The same is true for pkl, json, csv, etc. Is that expected behavior? Code sample:
# conf/catalog.yml
# namespace = NBA_v1

# Working
"{namespace}.nba_model_best_params":
  type: kedro.extras.datasets.json.JSONDataSet
  filepath: data/09_tracking/{namespace}_best_params.json
  versioned: true
# >>>> This saves data/09_tracking/NBA_v1_best_params.json (correct) <<<<

# Not working
"{namespace}.nba_model_best_params":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: kedro.extras.datasets.json.JSONDataSet
    filepath: data/09_tracking/{namespace}_best_params.json
    versioned: true
# >>>> This saves data/09_tracking/{namespace}_best_params.json (wrong) <<<<
kedro==0.18.13
kedro-mlflow==0.11.9
mlflow==2.5.0
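To illustrate the behavior being asked about, here is a minimal sketch of placeholder resolution that recurses into nested configuration, so that a filepath inside data_set is resolved as well. It is an illustration under stated assumptions, not Kedro's actual implementation, and the function name resolve_placeholders is hypothetical.

```python
from typing import Any, Dict


def resolve_placeholders(config: Any, values: Dict[str, str]) -> Any:
    """Return a copy of config with {name} placeholders replaced at every nesting level."""
    if isinstance(config, dict):
        return {key: resolve_placeholders(val, values) for key, val in config.items()}
    if isinstance(config, list):
        return [resolve_placeholders(item, values) for item in config]
    if isinstance(config, str):
        for name, value in values.items():
            config = config.replace("{" + name + "}", value)
        return config
    # Booleans, numbers and None are returned unchanged.
    return config


entry = {
    "type": "kedro_mlflow.io.artifacts.MlflowArtifactDataSet",
    "data_set": {
        "type": "kedro.extras.datasets.json.JSONDataSet",
        "filepath": "data/09_tracking/{namespace}_best_params.json",
        "versioned": True,
    },
}
resolved_entry = resolve_placeholders(entry, {"namespace": "NBA_v1"})
print(resolved_entry["data_set"]["filepath"])
```

A resolver that only rewrites the top-level string values of an entry would leave the nested filepath untouched, which matches the symptom in the "Not working" case.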
e
I think this was fixed here (not sure if available in 0.18.13): https://github.com/kedro-org/kedro/issues/2992 thread related: https://kedro-org.slack.com/archives/C03RKPCLYGY/p1693494350796279
👀 1
a
@Rennan Haro, this has been fixed like @Erwin pointed out but hasn’t been released yet. This fix will be out in 0.18.14
r
Awesome. Installing from source solved it. Thanks!!! (And sorry for the duplicated issue, could not find the original one)
❤️ 5