Hadeel Mustafa
09/22/2023, 11:03 AM
kedro-mlflow. In catalog.yml, I've modified a dataset from this:
reporting_patient.pre_cohort_waterfall:
  type: pandas.CSVDataSet
  filepath: ${base_path}/reporting_patient/pre_cohort_waterfall.csv
to this:
reporting_patient.pre_cohort_waterfall:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pandas.CSVDataSet
    filepath: ${base_path}/reporting_patient/pre_cohort_waterfall.csv
MLflow is connected to Databricks, and the filepath ${base_path}/reporting_patient/pre_cohort_waterfall.csv is on S3.
When I run the relevant node, MlflowArtifactDataSet successfully saves the dataset to S3, but it fails to log the artifact when calling mlflow.store.artifact.databricks_artifact_repo.DatabricksArtifactRepository.log_artifact.
The error is raised in the source code when it tries to upload the file to S3. It passes ${base_path}/reporting_patient/pre_cohort_waterfall.csv as a local path, although the dataset is stored as an S3 object and is not written there, so I end up with the error "No such file or directory".
Is there something I am missing in the configuration that causes this issue? Any advice on this matter?
Thanks in advance.
Merel
09/22/2023, 11:10 AM
marrrcin
09/22/2023, 11:10 AM
Hadeel Mustafa
09/22/2023, 11:14 AM
Merel
09/22/2023, 11:14 AM
Hadeel Mustafa
09/22/2023, 11:14 AM
Merel
09/22/2023, 11:15 AM
main branch and install that directly, which already contains the fix.
Hadeel Mustafa
09/22/2023, 11:16 AM
Nok Lam Chan
09/22/2023, 11:38 AM
pip install git+https://github.com/kedro-org/kedro.git@main
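As a rough illustration of the "No such file or directory" failure described at the top of the thread: MLflow's log_artifact takes a path on the local filesystem, so a file that only exists on S3 cannot be handed to it directly. The sketch below is not kedro-mlflow's implementation. It writes the rows to a temporary local CSV first and then passes that local path to a logging callable. log_csv_artifact and its parameters are hypothetical names, and the logger is injected so that mlflow.log_artifact can be supplied by the caller.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Sequence


def log_csv_artifact(
    rows: Iterable[Sequence[object]],
    header: Sequence[str],
    filename: str,
    log_artifact: Callable[[str], None],
) -> None:
    """Write rows to a temporary local CSV, then log that local file.

    Hypothetical helper: `log_artifact` is any callable that accepts a
    local file path, for example `mlflow.log_artifact`.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = Path(tmp_dir) / filename
        with local_path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        # The file has to exist on the local disk at the moment it is logged;
        # the temporary directory is removed once the call returns.
        log_artifact(str(local_path))
```

With MLflow installed and a run active, log_csv_artifact(rows, header, "pre_cohort_waterfall.csv", mlflow.log_artifact) would upload the temporary file to the run's artifact store, independently of where the Kedro catalog persists the dataset.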
Hadeel Mustafa
09/22/2023, 11:52 AM
Hadeel Mustafa
09/22/2023, 12:21 PM
run_id is an opaque identifier that MLflow produces, and I cannot create it.
I am now trying to log the artifact in a different way: the catalog path will include the run_id, like this:
reporting_patient.{mlflow_run_id}_pre_cohort_waterfall:
  type: pandas.CSVDataSet
  filepath: ${base_path}/reporting_patient/{mlflow_run_id}/pre_cohort_waterfall.csv
Using Kedro dataset factories or any other method, is there a way to store mlflow_run_id before Kedro starts running the node, or something similar?
Nok Lam Chan
09/22/2023, 12:22 PM
Nok Lam Chan
09/22/2023, 12:22 PM
Hadeel Mustafa
09/22/2023, 12:23 PM
Nok Lam Chan
09/22/2023, 12:23 PM
Nok Lam Chan
09/22/2023, 12:24 PM
kedro-mlflow
Nok Lam Chan
09/22/2023, 12:25 PM
Nok Lam Chan
09/22/2023, 12:25 PM
kedro-mlflow and see if he has a better solution here.
Hadeel Mustafa
09/22/2023, 12:25 PM
Hadeel Mustafa
09/22/2023, 12:45 PM
name as part of the catalog path, which can be user-defined, like this:
tracking:
  run:
    id: null # if `id` is None, a new run will be created
    name: ${mlflow_experiment_name} # if `name` is None, pipeline name will be used for the run name
    nested: True # if `nested` is False, you won't be able to launch sub-runs inside your nodes
Hadeel Mustafa
09/22/2023, 12:45 PM
filepath: ${base_path}/reporting_patient/${mlflow_experiment_name}/pre_cohort_waterfall.csv
Hadeel Mustafa
09/22/2023, 12:48 PM
globals.yml is not captured as parameters. I understand why: Kedro globals are not Kedro parameters. What is the best way to log it in kedro-mlflow?
@Nok Lam Chan (if you can add input here 🙏)
Nok Lam Chan
09/22/2023, 12:49 PM
Nok Lam Chan
09/22/2023, 12:50 PM
# parameters.yml
x: ${global_something_parameters}
Ankita Katiyar
09/22/2023, 12:50 PM
Nok Lam Chan
09/22/2023, 12:51 PM
parameters.yml already, otherwise why do you need the globals.yml? To log it explicitly, either go with method 1 or simply read your globals.yml.
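The second option, reading globals.yml and logging it yourself, could look roughly like the sketch below. It is only an illustration under stated assumptions: flatten_globals and log_globals are hypothetical helpers, the globals are assumed to be already loaded into a dict, and the logging function is injected so that mlflow.log_params can be passed in. Nested keys are flattened into dotted names because MLflow parameters are flat key/value pairs.

```python
from typing import Any, Callable, Dict, Mapping


def flatten_globals(config: Mapping[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested mapping into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested sections and carry the parent key as prefix.
            flat.update(flatten_globals(value, full_key))
        else:
            flat[full_key] = value
    return flat


def log_globals(
    config: Mapping[str, Any],
    log_params: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Flatten the globals and pass them to a logger such as `mlflow.log_params`."""
    flat = flatten_globals(config)
    log_params(flat)
    return flat
```

With PyYAML and MLflow installed, something like log_globals(yaml.safe_load(Path("conf/base/globals.yml").read_text()), mlflow.log_params) inside an active run would record the values. The conf/base/globals.yml location is an assumption about the project layout.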
Hadeel Mustafa
09/22/2023, 12:51 PM
0.18.12
Hadeel Mustafa
09/22/2023, 12:55 PM
Nok Lam Chan
09/22/2023, 1:04 PM
Nok Lam Chan
09/22/2023, 1:05 PM
Yolan Honoré-Rougé
09/22/2023, 4:57 PM
log_artifact only works with a local path.
Yolan Honoré-Rougé
09/22/2023, 4:59 PM
Yolan Honoré-Rougé
09/22/2023, 5:01 PM
Hadeel Mustafa
09/25/2023, 8:01 AM
Rennan Haro
10/03/2023, 2:36 PM
{namespace} with the actual namespace name. When using the kedro_mlflow.io.artifacts.MlflowArtifactDataSet wrapper, the {namespace} replacement does not work. The same is true for pkl, json, csv, etc.
Is that expected behavior?
Code sample:
# conf/catalog.yml
# namespace = NBA_v1
# Working
"{namespace}.nba_model_best_params":
  type: kedro.extras.datasets.json.JSONDataSet
  filepath: data/09_tracking/{namespace}_best_params.json
  versioned: true
# >>>> This saves data/09_tracking/NBA_v1_best_params.json (correct) <<<<
# Not working
"{namespace}.nba_model_best_params":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: kedro.extras.datasets.json.JSONDataSet
    filepath: data/09_tracking/{namespace}_best_params.json
    versioned: true
# >>>> This saves data/09_tracking/{namespace}_best_params.json (wrong) <<<<
kedro==0.18.13
kedro-mlflow==0.11.9
mlflow==2.5.0
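One plausible reading of the behaviour above, offered as an assumption rather than a confirmed diagnosis of Kedro 0.18.13, is that the {namespace} placeholder is substituted only in top-level string values, so the filepath nested under data_set is left untouched. The sketch below contrasts a shallow substitution with a recursive one. resolve_shallow and resolve_recursive are hypothetical helpers written for illustration, not Kedro APIs.

```python
from typing import Any, Dict


def resolve_shallow(config: Dict[str, Any], values: Dict[str, str]) -> Dict[str, Any]:
    """Fill placeholders in top-level string values only."""
    return {
        key: value.format_map(values) if isinstance(value, str) else value
        for key, value in config.items()
    }


def resolve_recursive(config: Any, values: Dict[str, str]) -> Any:
    """Fill placeholders in strings at any depth of nested dicts and lists."""
    if isinstance(config, dict):
        return {key: resolve_recursive(value, values) for key, value in config.items()}
    if isinstance(config, list):
        return [resolve_recursive(item, values) for item in config]
    if isinstance(config, str):
        return config.format_map(values)
    return config


# Same shape as the "Not working" catalog entry above.
wrapped = {
    "type": "kedro_mlflow.io.artifacts.MlflowArtifactDataSet",
    "data_set": {
        "type": "kedro.extras.datasets.json.JSONDataSet",
        "filepath": "data/09_tracking/{namespace}_best_params.json",
    },
}
values = {"namespace": "NBA_v1"}

shallow = resolve_shallow(wrapped, values)
deep = resolve_recursive(wrapped, values)

print(shallow["data_set"]["filepath"])  # data/09_tracking/{namespace}_best_params.json
print(deep["data_set"]["filepath"])     # data/09_tracking/NBA_v1_best_params.json
```

The shallow version reproduces the reported symptom for the wrapped entry, while the unwrapped entry would resolve correctly under either version because its filepath sits at the top level.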
Erwin
10/03/2023, 2:44 PM
Ankita Katiyar
10/03/2023, 2:46 PM
Rennan Haro
10/03/2023, 3:05 PM