# questions
g
Hi all, when running a pipeline in a distributed manner (multiple GPUs) using the `ThreadRunner`, how can I save the metrics as a versioned `kedro_datasets.tracking.MetricsDataset` and some graphs with `matplotlib.MatplotlibWriter` in just one file (something like `sync_dist=True`) without creating a versioned file for each sub-process?
m
That’s a nice one 😄 We stumbled upon a similar issue in `kedro-azureml` - our solution was to save only on the master node: https://github.com/getindata/kedro-azureml/blob/8e5979f5040e03032215e9db25af51538ec6a26a/kedro_azureml/datasets/runner_dataset.py#L82
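The idea behind the linked approach can be sketched roughly like this (a minimal, standalone illustration, not the actual kedro-azureml implementation; the `RANK` environment-variable check is an assumption that depends on your launcher):

```python
import os


def is_distributed_master_node() -> bool:
    """Heuristic: many launchers (e.g. torch.distributed) expose the
    process rank via the RANK env var; rank 0 is the master.
    The exact variable depends on your setup - adjust as needed."""
    return os.environ.get("RANK", "0") == "0"


class MasterOnlySaveWrapper:
    """Wraps a dataset-like object and skips save() on worker
    processes, so only one versioned file is ever written."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def save(self, data):
        if is_distributed_master_node():
            self._wrapped.save(data)
        # Non-master workers silently drop the save.

    def load(self):
        return self._wrapped.load()
```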
g
Thanks for that tip. The solution works, but a UserWarning appears because the object being saved is a versioned abstract dataset. Any idea how to get rid of this warning?
Another question: when running distributed training, it is recommended to use the ThreadRunner, right? Is there a way to run only one node with the behavior of the ThreadRunner and the rest with the behavior of the SequentialRunner? Or do I have to run these parts of the pipeline separately?
m
https://kedro-org.slack.com/archives/C03RKP2LW64/p1699803807916089?thread_ts=1699603409.411269&cid=C03RKP2LW64 Most likely the second option. I’m not aware of any option to switch between runners mid-pipeline, as this goes against Kedro’s architecture.
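If you do split the pipeline, a runner can be picked per invocation from the CLI; a sketch assuming hypothetical pipeline names `prepare` and `train` registered in the project:

```
kedro run --pipeline=prepare --runner=SequentialRunner
kedro run --pipeline=train --runner=ThreadRunner
```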
https://kedro-org.slack.com/archives/C03RKP2LW64/p1699802420388689?thread_ts=1699603409.411269&cid=C03RKP2LW64 This warning most likely comes from the additional processes that are trying to call `.save` on the dataset. If you’ve added the guardrails (`is_distributed_master_node`) then the warning can probably be ignored.
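If the warning is still noisy with the guardrails in place, one generic option (a sketch using the stdlib `warnings` module, not a Kedro-specific API) is to suppress it locally around the save call:

```python
import warnings


def save_quietly(dataset, data):
    """Call dataset.save(data) while suppressing UserWarnings,
    e.g. the versioned-dataset warning from redundant worker saves."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=UserWarning)
        dataset.save(data)
```

`warnings.catch_warnings` restores the previous filter state on exit, so this silences warnings only for the single call rather than the whole run.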