Inga Kottlarz
05/13/2024, 11:39 AM
I understand that there are PartitionedDatasets, but they seem to require all data to be held in memory and then saved at the same time. In contrast, I want to perform my training on a cluster, in separate jobs that save to separate files, and these files should later be collected for visualization. What's the way to go here?
Deepyaman Datta
05/13/2024, 2:59 PM
> I understand that there are PartitionedDatasets, but they seem to require all data to be held in memory and then saved at the same time.
PartitionedDataset can be lazy: https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-lazy-saving
> In contrast, I want to perform my training on a cluster, in separate jobs that save to separate files, and these files should later be collected for visualization.
When you say "on a cluster," just to confirm: each job leverages the cluster for distributed training, but you're probably running one job at a time?
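To illustrate the lazy-saving point: per the linked Kedro docs, a node feeding a PartitionedDataset can return a dict mapping partition IDs to callables instead of materialized data, and each callable is only invoked at save time. The node and partition names below are a hedged sketch, not code from this conversation:

```python
from typing import Any, Callable

def train_models(param_grid: list[dict]) -> dict[str, Callable[[], Any]]:
    """Return {partition_id: callable}. PartitionedDataset invokes each
    callable when saving that partition, so only one result needs to be
    in memory at a time (lazy saving)."""

    def make_trainer(params: dict) -> Callable[[], Any]:
        def _train() -> dict:
            # Placeholder for real training; return whatever the
            # underlying dataset (e.g. pickle.PickleDataset) can save.
            return {"params": params, "weights": [0.0]}
        return _train

    return {f"run_{i}": make_trainer(p) for i, p in enumerate(param_grid)}
```

Each `run_i` key becomes one partition file; Kedro calls the corresponding function only when it writes that file.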
Inga Kottlarz
05/13/2024, 3:10 PM
I see, thanks! In this setting, could I run a pipeline multiple times and save each result as a new partition?
> When you say "on a cluster," just to confirm: each job leverages the cluster for distributed training, but you're probably running one job at a time?
Yes, I have separate jobs. I want to run the same pipeline multiple times and then collect the results of all runs.
Deepyaman Datta
05/13/2024, 5:01 PM
> Could I run a pipeline multiple times and save each result as a new partition?
Yes, that should work.
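On the collection side: when a node takes a PartitionedDataset as input, Kedro passes it a dict of {partition_id: load_callable}, one entry per file found in the partition directory, so a downstream visualization node can gather all runs. A hedged sketch (the node name and what each partition contains are assumptions):

```python
from typing import Any, Callable

def collect_runs(partitions: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    """Materialize every run's result for a visualization step.
    Kedro supplies `partitions` as {partition_id: load_callable};
    calling a value loads that one file."""
    return {run_id: load() for run_id, load in sorted(partitions.items())}
```

Since each cluster job writes its own file into the shared partition path, the visualization pipeline run simply picks up whatever partitions exist at load time.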
Inga Kottlarz
05/14/2024, 9:04 AM