# questions
Thank you @Antony Milne and @Deepyaman Datta for responding to my question about data versioning with a PartitionedDataSet (one cannot use the `versioned: True` argument in the data catalog for this kind of dataset). Perhaps it is better that I explain the root issue/challenge, in case there are solutions I am missing.
The Problem: By default, Kedro overwrites data objects with each run, using the paths set in the data catalog.
The Question: What is a convenient solution/tech stack for enabling the execution of multiple parallel ML experiments in my Kedro pipeline, while maintaining that…
1. Each experiment triggers the data to be versioned effectively. Ideally…
    a. When there are changes to the data, the data is copied and assigned a unique ID (sha, md5, timestamp), perhaps with metadata regarding the parameters that were used to generate the data. In this case, it is important that the data is stored in a sensible, organized manner.
    b. When there have been no changes to the data, the same unique ID (and metadata) are used and can be extracted
2. The unique IDs (and metadata) for each dataset relevant to the ML run can be extracted and stored alongside the (presumably lighter) results of the experiment
3. Given 1. and 2. above, the results are reproducible (they offer point-in-time correctness)
The solutions I have thus far come across are problematic:
• Writing a class to set dynamic dataset filepaths
    ◦ The main issue with this approach is that it is incredibly high-maintenance. It requires continuous, careful attention to the parameters used to define the dynamic filepaths.
        ▪︎ For example, if I set the filepath to
using parameters
and I change parameter
, changing the composition of the data, a new dataset will overwrite the previous dataset. If I had wanted to keep them both, I would have had to remember to update the filepath parameters to include
. Of course, with many different data-defining parameters, this becomes problematic rather quickly.
• Use Kedro versioning: use the `versioned: True` argument in the catalog underneath the datasets you want to version
    ◦ The first issue with this approach is that it appears to version all of the data with every new run, presenting a massive storage issue and the necessity for a custom retention policy to clear useless/outdated data.
    ◦ The second issue is that this doesn't work with PartitionedDataSet datasets.
Are there any effective solutions I am missing?
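For what it's worth, requirements 1 and 2 above can be sketched with nothing but the standard library. This is only an illustration of content-addressed snapshots, not a Kedro feature: `snapshot`, `file_sha256`, `params_id`, the `store_dir` layout and the `metadata.json` file name are all hypothetical names. It handles a single file; a partitioned dataset would presumably need the same treatment per partition.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file contents in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def params_id(params: dict) -> str:
    """Hash ALL parameters, so none has to be remembered in a filepath template."""
    canonical = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(data_path: Path, store_dir: Path, params: dict) -> dict:
    """Copy data_path into store_dir/<content hash>/ only if that content is new.

    Returns a small record that can be saved next to the experiment results.
    """
    data_id = file_sha256(data_path)
    target_dir = store_dir / data_id
    target = target_dir / data_path.name
    copied = not target.exists()
    record = {
        "data_id": data_id,
        "params_id": params_id(params),
        "params": params,
        "stored_at": str(target),
    }
    if copied:
        # New content: copy it once and write the metadata beside it.
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(data_path, target)
        (target_dir / "metadata.json").write_text(
            json.dumps(record, indent=2, default=str)
        )
    record["copied"] = copied
    return record
```

Calling `snapshot` twice on unchanged data returns the same `data_id` and copies nothing the second time, which is the behaviour described in 1b. Because the ID comes from the bytes on disk rather than from a hand-maintained filepath template, changing a parameter that alters the data yields a new directory without any extra bookkeeping.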
Regarding the storage problem - would it be possible to do a clean-up with
to check whether the current file is identical to an existing version and delete it if so? The timestamp isn't just for versioning; it is also a record of data lineage. If you start deleting files and caching, you will have to keep track of which pipeline run created which artefact.
Deleting duplicated data using
could be a useful solution, thank you!
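A rough sketch of what such a clean-up could look like, using plain `hashlib` rather than whichever tool was suggested above. It assumes the versioned layout is `<dataset_dir>/<timestamp>/<file>`, with timestamp folder names that sort chronologically; `prune_duplicate_versions` is a hypothetical helper, not part of Kedro. To address the lineage concern, it returns a mapping from each duplicate version to the earlier version holding the same bytes, so that mapping can be stored before anything is deleted.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path: Path) -> str:
    # Reads the whole file; fine for a sketch, chunk the read for large files.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def prune_duplicate_versions(dataset_dir: Path, dry_run: bool = True) -> dict:
    """Find version folders whose single file duplicates an earlier version.

    Returns {duplicate_version: kept_version}. Nothing is deleted unless
    dry_run is False.
    """
    first_seen = {}  # content hash -> earliest version name
    aliases = {}     # duplicate version name -> version that is kept
    version_dirs = sorted(p for p in dataset_dir.iterdir() if p.is_dir())
    for version_dir in version_dirs:
        files = sorted(f for f in version_dir.iterdir() if f.is_file())
        if len(files) != 1:
            continue  # unexpected layout: leave it alone
        digest = _sha256(files[0])
        if digest in first_seen:
            aliases[version_dir.name] = first_seen[digest]
            if not dry_run:
                shutil.rmtree(version_dir)
        else:
            first_seen[digest] = version_dir.name
    return aliases
```

Running it with the default `dry_run=True` first only reports what would be removed. Saving the returned mapping alongside the experiment results keeps a trace of which run originally produced each deleted version.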