Hi there! I'm fairly new to kedro and want to use ...
# questions
i
Hi there! I'm fairly new to kedro and want to use it for a machine learning/data science project. My scenario is the following:
• I have a dataset that I want to train a model on. This is fixed.
• I want to train my model multiple times with different initial conditions for the training to assess how robust the convergence is.
• Later, I want to collect the results of all trainings to visualize them and do some statistics on them.
Now, my question is: what would be the best way to do steps 2 and 3 using kedro? More specifically, how do I orchestrate the inputs and outputs? How can I save the results of all trainings in different files, and then load them all together (while making sure I'm not loading any outdated ones)? I understand that there are PartitionedDatasets, but they seem to require all data to be held in memory and then saved at the same time. In contrast, I want to perform my training on a cluster, in separate jobs, that save to separate files, but these files should later on be collected for the visualization. What's the way to go here?
d
I understand that there are PartitionedDatasets, but they seem to require all data to be held in memory and then saved at the same time.
PartitionedDataset can be lazy. https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-lazy-saving
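A rough sketch of what lazy saving looks like on the node side, as I understand the linked docs: the node returns a dictionary that maps partition IDs to callables rather than to data, and each callable is only invoked when that partition gets written. The names `train_one`, `train_many`, and the `seed_<n>` partition IDs are made up for illustration, and the "training" is a placeholder.

```python
import random
from typing import Callable, Dict, List


def train_one(seed: int) -> dict:
    """Hypothetical stand-in for a real training run."""
    rng = random.Random(seed)
    return {"seed": seed, "final_loss": rng.random()}


def train_many(seeds: List[int]) -> Dict[str, Callable[[], dict]]:
    """Node function whose output would be mapped to a PartitionedDataset.

    Each value is a zero-argument callable, so no result is computed or
    held in memory until the corresponding partition is saved.
    """
    # `s=seed` binds the current seed; a bare closure would see only the last one.
    return {f"seed_{seed}": (lambda s=seed: train_one(s)) for seed in seeds}
```

Calling `train_many([0, 1, 2])` yields three callables and runs no training until each one is invoked.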
In contrast, I want to perform my training on a cluster, in separate jobs, that save to separate files, but these files should later on be collected for the visualization.
When you say "on a cluster," just to confirm--each job leverages the cluster for distributed training, but probably doing one job at a time?
i
PartitionedDataset can be lazy
I see, thanks! In this setting, could I run a pipeline multiple times, and save each result as a new partition?
When you say "on a cluster," just to confirm--each job leverages the cluster for distributed training, but probably doing one job at a time?
Yes. I have separate jobs. I want to run the same pipeline multiple times and then collect the results of all runs
d
could I run a pipeline multiple times, and save each result as a new partition?
Yes, that should work.
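One way this could be laid out, as a sketch under assumptions rather than a definitive recipe: each job runs a training node that returns a single-entry dictionary, so only its own partition is written, and a separate collection node later receives the partition-ID-to-load-function dictionary that PartitionedDataset provides on load. The `experiment` prefix used to skip outdated results is my own convention, not a kedro feature, and `seed` and `experiment` are assumed to arrive as kedro parameters. All function names and the placeholder loss are hypothetical.

```python
from typing import Any, Callable, Dict, List


def train_single_run(seed: int, experiment: str) -> Dict[str, dict]:
    """Training node for one job: returns exactly one partition."""
    # Placeholder result; a real node would train the model here.
    result = {"seed": seed, "final_loss": 1.0 / (1 + seed)}
    # The partition ID encodes the experiment (assumed convention) and the seed.
    return {f"{experiment}/seed_{seed}": result}


def collect_results(
    partitions: Dict[str, Callable[[], Any]], experiment: str
) -> List[Any]:
    """Collection node: load only partitions of the requested experiment."""
    results = []
    for partition_id in sorted(partitions):
        if partition_id.startswith(f"{experiment}/"):
            # Each value is a load function; data is read only when it is called.
            results.append(partitions[partition_id]())
    return results
```

Changing the experiment name for a new batch of runs keeps older partitions on disk while leaving them out of the statistics.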
i
Perfect, thanks!