Hi team I am facing an issue with logging an artifact using Kedro #questions

Hadeel Mustafa

09/22/2023, 11:03 AM

Hi team, I am facing an issue with logging an artifact using kedro-mlflow. In catalog.yml I've modified a dataset from this

reporting_patient.pre_cohort_waterfall:
  type: pandas.CSVDataSet
  filepath: ${base_path}/reporting_patient/pre_cohort_waterfall.csv

to this

reporting_patient.pre_cohort_waterfall:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pandas.CSVDataSet
    filepath: ${base_path}/reporting_patient/pre_cohort_waterfall.csv

MLflow is connected to Databricks, and the filepath ${base_path}/reporting_patient/pre_cohort_waterfall.csv is on S3. When I run the relevant node, MlflowArtifactDataSet successfully saves the dataset to S3, but it fails to log the artifact when calling mlflow.store.artifact.databricks_artifact_repo.DatabricksArtifactRepository.log_artifact. The error is raised in the source code when it tries to upload the file to S3. It passes ${base_path}/reporting_patient/pre_cohort_waterfall.csv as a local path, but the dataset is an object on S3 and is not written there, so I end up with the error "No such file or Directory". Is there something I am missing in the configuration that causes this issue? Any advice on this matter? Thanks in advance.

Merel

09/22/2023, 11:10 AM

Hi @Hadeel Mustafa, we are aware of this bug and will be releasing a fix in our next release.

👍 1

marrrcin

09/22/2023, 11:10 AM

For plugin-related questions, please go to #C03RKPCLYGY

👍 1

Hadeel Mustafa

09/22/2023, 11:14 AM

@Merel @marrrcin thank you for the response. @Merel when is this fix expected to be released?

Merel

09/22/2023, 11:14 AM

In the next 1-2 weeks.

Hadeel Mustafa

09/22/2023, 11:14 AM

thanks!

Merel

09/22/2023, 11:15 AM

If you want, you could install the unreleased main branch directly, which already contains the fix.

Hadeel Mustafa

09/22/2023, 11:16 AM

@Merel could you please give me the link to that main branch? I would like to test it

Nok Lam Chan

09/22/2023, 11:38 AM

Hey @Hadeel Mustafa! Kedro is open source on GitHub, so you can already find this here: https://github.com/kedro-org/kedro. You can always pip install directly from a git repo:

pip install git+https://github.com/kedro-org/kedro.git@main

Hadeel Mustafa

09/22/2023, 11:52 AM

thanks @Nok Lam Chan

Hadeel Mustafa

09/22/2023, 12:21 PM

@Nok Lam Chan another related question. Because of this limitation, I am now trying to log the artifact in a different way. I know that run_id is an opaque identifier that MLflow produces and that I cannot create it myself. The catalog path would include the run_id, like this

reporting_patient.{mlflow_run_id}_pre_cohort_waterfall:
  type: pandas.CSVDataSet
  filepath: ${base_path}/reporting_patient/{mlflow_run_id}/pre_cohort_waterfall.csv

Using Kedro dataset factories or any other method, is there a way to store mlflow_run_id before Kedro starts running the node, or something similar?

Nok Lam Chan

09/22/2023, 12:22 PM

how are you initiating MLflow now?

Nok Lam Chan

09/22/2023, 12:22 PM

So the best way to do this is with hooks

Hadeel Mustafa

09/22/2023, 12:23 PM

I am using mlflow.yml only for now

Nok Lam Chan

09/22/2023, 12:23 PM

Are you using any plugin, or are you reading this yourself? mlflow.yml is just a configuration file

Nok Lam Chan

09/22/2023, 12:24 PM

Ah ok - just reading the full thread, so you are using kedro-mlflow

Nok Lam Chan

09/22/2023, 12:25 PM

So you may find these hooks useful: https://docs.kedro.org/en/stable/hooks/introduction.html

Nok Lam Chan

09/22/2023, 12:25 PM

Cc @Yolan Honoré-Rougé, the author of kedro-mlflow, to see if he has a better solution here.

Hadeel Mustafa

09/22/2023, 12:25 PM

yes kedro-mlflow

Hadeel Mustafa

09/22/2023, 12:45 PM

@Nok Lam Chan @Yolan Honoré-Rougé I figured I can use the experiment name as part of the catalog path, and it can be user-defined like this

tracking:
  run:
    id: null # if `id` is None, a new run will be created
    name: ${mlflow_experiment_name} # if `name` is None, pipeline name will be used for the run name
    nested: True  # if `nested` is False, you won't be able to launch sub-runs inside your nodes

Hadeel Mustafa

09/22/2023, 12:45 PM

so the catalog path can be

filepath: ${base_path}/reporting_patient/${mlflow_experiment_name}/pre_cohort_waterfall.csv

Hadeel Mustafa

09/22/2023, 12:48 PM

@Yolan Honoré-Rougé another question though: globals.yml is not captured as parameters. I understand why (Kedro globals are not Kedro parameters), but what is the best way to log them in kedro-mlflow? @Nok Lam Chan (if you can add input here 🙏 )

Nok Lam Chan

09/22/2023, 12:49 PM

@Hadeel Mustafa Which configloader and Kedro version are you using?

Nok Lam Chan

09/22/2023, 12:50 PM

Two ways to do it:
• config_loader["globals"] to log all global parameters
• log the global parameters inside your parameter files, i.e.

# parameters.yml
x: ${global_something_parameters}

Ankita Katiyar

09/22/2023, 12:50 PM

re: the original question - this does not seem like the same bug that we fixed; that one was specifically for dataset factories.

Nok Lam Chan

09/22/2023, 12:51 PM

These template values should already be used somewhere in your parameters.yml, otherwise why do you need the globals.yml? To log them explicitly, either go with method 1 or simply read your globals.yml

Hadeel Mustafa

09/22/2023, 12:51 PM

@Nok Lam Chan kedro: 0.18.12

Hadeel Mustafa

09/22/2023, 12:55 PM

@Nok Lam Chan the first method creates a dependency: whenever a global is added, one must be aware that they need to add it to parameters.yml as well, so I wanted to steer away from this method. Yes, you are correct, we do use some of the globals as part of a parameter value, but other globals are used in catalog.yml only

Nok Lam Chan

09/22/2023, 1:04 PM

And which configloader are you using?

Nok Lam Chan

09/22/2023, 1:05 PM

TemplatedConfigLoader:
• conf_loader._config_mapping
OmegaConfigLoader:
• conf_loader["globals"]
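A minimal sketch of the "log all globals" idea, under stated assumptions: globals are taken to be a (possibly nested) mapping, and the helper name flatten_for_logging and the example values are hypothetical. It only prepares a flat dictionary of string values; how the globals are read from the config loader is left to the accessors mentioned above.

```python
from typing import Any, Dict, Mapping


def flatten_for_logging(config: Mapping[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten a nested mapping into dotted keys with string values."""
    flat: Dict[str, str] = {}
    for key, value in config.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested sections, carrying the dotted prefix along.
            flat.update(flatten_for_logging(value, full_key))
        else:
            flat[full_key] = str(value)
    return flat


# Hypothetical globals, standing in for whatever the config loader returns.
example_globals = {"base_path": "s3://bucket/project", "model": {"seed": 42}}
print(flatten_for_logging(example_globals))
```

The flattened dictionary could then be passed to mlflow.log_params(...) from a hook once a run is active; where exactly to call it depends on your project setup and is not confirmed in this thread.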

Yolan Honoré-Rougé

09/22/2023, 4:57 PM

Actually I think this is an mlflow limitation. The underlying log_artifact only works with a local path

Yolan Honoré-Rougé

09/22/2023, 4:59 PM

See https://github.com/Galileo-Galilei/kedro-mlflow/issues/15

Yolan Honoré-Rougé

09/22/2023, 5:01 PM

You can tweak it by creating a custom dataset which downloads the data from S3 to a local temp directory first, and then log it
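A rough sketch of the copy-then-log step suggested above, not a full custom dataset. The helper name log_remote_file_as_artifact and its open_remote / log_artifact parameters are hypothetical and are not part of kedro-mlflow or MLflow; the two callables are injected so the logic can be tried without S3 or a tracking server.

```python
import posixpath
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable, ContextManager


def log_remote_file_as_artifact(
    remote_path: str,
    open_remote: Callable[[str], ContextManager[BinaryIO]],
    log_artifact: Callable[[str], None],
) -> None:
    """Copy a remote object into a local temp directory, then log the local copy."""
    # Keep the original file name so the logged artifact is recognisable.
    file_name = posixpath.basename(remote_path)
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = Path(tmp_dir) / file_name
        with open_remote(remote_path) as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # The logger receives a path that really exists on the local disk.
        log_artifact(str(local_path))
    # The temp directory (and the local copy) is removed on exit.
```

In a real project, open_remote could be something like lambda path: fsspec.open(path, "rb") and log_artifact could be mlflow.log_artifact. Both are assumptions about the setup rather than something confirmed in this thread.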

Hadeel Mustafa

09/25/2023, 8:01 AM

thank you @Yolan Honoré-Rougé, this is helpful!

👍 1

Rennan Haro

10/03/2023, 2:36 PM

Similar to the thread but not exactly the same: We’re using modular pipelines with namespaces and dataset factories for running backtests. When using a JSON dataset (or any other dataset really), Kedro is able to correctly replace the {namespace} with the actual namespace name. When using the kedro_mlflow.io.artifacts.MlflowArtifactDataSet wrapper, the {namespace} replacement does not work. The same is true for pkl, json, csv, etc. Is that expected behavior? Code sample:

# conf/catalog.yml
# namespace = NBA_v1

# Working
"{namespace}.nba_model_best_params":
  type: kedro.extras.datasets.json.JSONDataSet
  filepath: data/09_tracking/{namespace}_best_params.json
  versioned: true
# >>>> This saves data/09_tracking/NBA_v1_best_params.json (correct) <<<<

# Not working
"{namespace}.nba_model_best_params":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: kedro.extras.datasets.json.JSONDataSet
    filepath: data/09_tracking/{namespace}_best_params.json
    versioned: true
# >>>> This saves data/09_tracking/{namespace}_best_params.json (wrong) <<<<

kedro==0.18.13

kedro-mlflow==0.11.9

mlflow==2.5.0

Erwin

10/03/2023, 2:44 PM

I think this was fixed here (not sure if available in 0.18.13): https://github.com/kedro-org/kedro/issues/2992
Related thread: https://kedro-org.slack.com/archives/C03RKPCLYGY/p1693494350796279

👀 1

Ankita Katiyar

10/03/2023, 2:46 PM

@Rennan Haro, this has been fixed, as @Erwin pointed out, but hasn’t been released yet. The fix will be out in 0.18.14
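The symptom in the catalog sample above is consistent with placeholders being filled only at the top level of an entry. As a simplified illustration (this is not Kedro's actual implementation, and resolve_placeholders is a hypothetical helper), filling {namespace} in a wrapped dataset means walking the nested data_set block as well:

```python
from typing import Any, Mapping


def resolve_placeholders(config: Any, values: Mapping[str, str]) -> Any:
    """Fill `{name}` placeholders in every string, at any nesting depth."""
    if isinstance(config, dict):
        return {key: resolve_placeholders(val, values) for key, val in config.items()}
    if isinstance(config, list):
        return [resolve_placeholders(item, values) for item in config]
    if isinstance(config, str):
        for name, value in values.items():
            config = config.replace("{" + name + "}", value)
        return config
    # Non-string leaves (booleans, numbers, None) are returned unchanged.
    return config


# The wrapped entry from the sample above, written as a Python dict.
entry = {
    "type": "kedro_mlflow.io.artifacts.MlflowArtifactDataSet",
    "data_set": {
        "type": "kedro.extras.datasets.json.JSONDataSet",
        "filepath": "data/09_tracking/{namespace}_best_params.json",
        "versioned": True,
    },
}
resolved = resolve_placeholders(entry, {"namespace": "NBA_v1"})
print(resolved["data_set"]["filepath"])
```

A resolver that only looked at top-level string values would leave the nested filepath untouched, which matches the "wrong" path reported above.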

Rennan Haro

10/03/2023, 3:05 PM

Awesome. Installing from source solved it. Thanks!!! (And sorry for the duplicated issue, could not find the original one)

❤️ 5
