Hi there I m fairly new to kedro and want to use it for a ma Kedro #questions


Inga Kottlarz

05/13/2024, 11:39 AM

Hi there! I'm fairly new to kedro and want to use it for a machine learning/data science project. My scenario is the following:
• I have a dataset that I want to train a model on. This is fixed.
• I want to train my model multiple times with different initial conditions for the training, to assess how robust the convergence is.
• Later, I want to collect the results of all trainings to visualize them and do some statistics on them.

Now, my question is: what would be the best way to do steps 2 and 3 using kedro; more specifically, how do I orchestrate the inputs and outputs? How can I save the results of all trainings in different files, and then load them all together (while making sure I'm not loading any outdated ones)? I understand that there are PartitionedDatasets, but they seem to require all data to be held in memory and then saved at the same time. In contrast, I want to perform my training on a cluster, in separate jobs that save to separate files, but these files should later on be collected for the visualization. What's the way to go here?

Deepyaman Datta

05/13/2024, 2:59 PM

I understand that there are PartitionedDatasets, but they seem to require all data to be held in memory and then saved at the same time.

PartitionedDataset can be lazy. https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-lazy-saving
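
To illustrate the lazy saving mentioned above, here is a minimal sketch of a node function whose output is meant for a PartitionedDataset. It relies on the documented behaviour that a callable partition value is only invoked when that partition is written. The names train_once and train_all_lazily, the "seed_N" partition keys, and the toy metrics are assumptions made for illustration and do not come from the thread.

```python
import random
from typing import Callable, Dict, List


def train_once(seed: int) -> Dict[str, float]:
    """Hypothetical stand-in for a real training run; returns toy metrics."""
    rng = random.Random(seed)
    return {"seed": seed, "final_loss": rng.random()}


def train_all_lazily(seeds: List[int]) -> Dict[str, Callable[[], Dict[str, float]]]:
    """Return one callable per partition instead of the computed results.

    When this dictionary is the output of a node that writes to a
    PartitionedDataset, each callable is invoked at save time, so only
    one result needs to be in memory at once.
    """
    # Bind each seed through a default argument so every callable
    # keeps its own value rather than the last one in the loop.
    return {f"seed_{seed}": (lambda seed=seed: train_once(seed)) for seed in seeds}
```

Nothing is trained when train_all_lazily returns; the work happens partition by partition as the dataset is saved.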

In contrast, I want to perform my training on a cluster, in separate jobs, that save to separate files, but these files should later on be collected for the visualization.

When you say "on a cluster," just to confirm--each job leverages the cluster for distributed training, but probably doing one job at a time?

Inga Kottlarz

05/13/2024, 3:10 PM

PartitionedDataset can be lazy

I see, thanks! In this setting, could I run a pipeline multiple times, and save each result as a new partition?

When you say "on a cluster," just to confirm--each job leverages the cluster for distributed training, but probably doing one job at a time?

Yes. I have separate jobs. I want to run the same pipeline multiple times and then collect the results of all runs

Deepyaman Datta

05/13/2024, 5:01 PM

could I run a pipeline multiple times, and save each result as a new partition?

Yes, that should work.
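
One way this could look, as a rough sketch rather than a definitive setup: each cluster job runs a training node that returns a dictionary with a single partition, and a later pipeline loads all partitions together. The sketch assumes that both nodes point at the same PartitionedDataset catalog entry, that the dataset is left at its default of not overwriting existing partitions, and that the seed reaches the node as a parameter. The function names, the "seed_N" key scheme, and the expected_seeds check (one possible guard against picking up outdated files) are hypothetical, and the exact partition IDs seen on load depend on the catalog configuration.

```python
import random
from typing import Callable, Dict, List


def train_single_run(seed: int) -> Dict[str, Dict[str, float]]:
    """Hypothetical training node: one job produces exactly one partition."""
    rng = random.Random(seed)
    metrics = {"seed": seed, "final_loss": rng.random()}
    # The key becomes the partition ID, so each job writes its own file.
    return {f"seed_{seed}": metrics}


def collect_runs(
    partitions: Dict[str, Callable[[], Dict[str, float]]],
    expected_seeds: List[int],
) -> List[Dict[str, float]]:
    """Hypothetical collection node for the visualization pipeline.

    On load, a PartitionedDataset provides a dictionary that maps each
    partition ID to a function that loads that partition. Only the
    partitions named by expected_seeds are loaded, so stale files left
    over from earlier experiments are ignored.
    """
    wanted = [f"seed_{seed}" for seed in expected_seeds]
    missing = [name for name in wanted if name not in partitions]
    if missing:
        raise KeyError(f"Missing partitions: {missing}")
    return [partitions[name]() for name in wanted]
```

Listing the expected seeds explicitly also makes a failed or unfinished job visible, because the collection step raises an error instead of silently plotting fewer runs.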

Inga Kottlarz

05/14/2024, 9:04 AM

Perfect, thanks!
