# questions
g
Hi all, when running a pipeline in a distributed manner (multiple GPUs) using the `ThreadRunner`, how can I save the metrics as a versioned `kedro_datasets.tracking.MetricsDataset` and some graphs with `matplotlib.MatplotlibWriter` in just one file (something like `sync_dist=True`) without creating a versioned file for each sub-process?
m
That’s a nice one 😄 We stumbled upon a similar issue in `kedro-azureml` - our solution was to save only on the master node: https://github.com/getindata/kedro-azureml/blob/8e5979f5040e03032215e9db25af51538ec6a26a/kedro_azureml/datasets/runner_dataset.py#L82
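The idea behind the linked approach can be sketched roughly like this (a minimal, standalone illustration, not the actual kedro-azureml implementation; the `RANK` environment-variable check is an assumption that depends on your launcher):

```python
import os


def is_distributed_master_node() -> bool:
    """Heuristic: many launchers (e.g. torch.distributed) expose the
    process rank via the RANK env var; rank 0 is the master.
    The exact variable depends on your setup - adjust as needed."""
    return os.environ.get("RANK", "0") == "0"


class MasterOnlySaveWrapper:
    """Wraps a dataset-like object and skips save() on worker
    processes, so only one versioned file is ever written."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def save(self, data):
        if is_distributed_master_node():
            self._wrapped.save(data)
        # Non-master workers silently drop the save.

    def load(self):
        return self._wrapped.load()
```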
g
Thanks for that tip. The solution works, but a UserWarning appears because the object being saved is a versioned abstract dataset. Any idea how to get rid of this warning?
Another question: when running distributed training, it is recommended to use the ThreadRunner, right? Is there a way to run only one node with the behavior of the ThreadRunner and the rest with the behavior of the SequentialRunner? Or do I have to run these parts of the pipeline separately?
m
https://kedro-org.slack.com/archives/C03RKP2LW64/p1699803807916089?thread_ts=1699603409.411269&cid=C03RKP2LW64 Most likely the second option. I’m not aware of any option to switch between runners mid-pipeline, as this goes against Kedro’s architecture.
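If you do split the pipeline, a runner can be picked per invocation from the CLI; a sketch assuming hypothetical pipeline names `prepare` and `train` registered in the project:

```
kedro run --pipeline=prepare --runner=SequentialRunner
kedro run --pipeline=train --runner=ThreadRunner
```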
https://kedro-org.slack.com/archives/C03RKP2LW64/p1699802420388689?thread_ts=1699603409.411269&cid=C03RKP2LW64 This warning most likely comes from the additional processes that are trying to call `.save` on the dataset. If you’ve added the guardrails (`is_distributed_master_node`) then the warning can probably be ignored.
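If the warning is still noisy with the guardrails in place, one generic option (a sketch using the stdlib `warnings` module, not a Kedro-specific API) is to suppress it locally around the save call:

```python
import warnings


def save_quietly(dataset, data):
    """Call dataset.save(data) while suppressing UserWarnings,
    e.g. the versioned-dataset warning from redundant worker saves."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=UserWarning)
        dataset.save(data)
```

`warnings.catch_warnings` restores the previous filter state on exit, so this silences warnings only for the single call rather than the whole run.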