Deepyaman Datta
10/22/2023, 1:32 PM
PartitionedDataset users out there! We have a question for you, related to enabling versioning for PartitionedDataset: which of the options below makes the most sense to you?
1. https://github.com/kedro-org/kedro/pull/521 proposes to enable versioning of the underlying dataset by specifying versioned: true in the dataset config:
    station_data:
      type: PartitionedDataset
      path: data/03_primary/station_data
      dataset:
        type: pandas.CSVDataset
        versioned: true
On the plus side, putting versioned: true inside the dataset config makes it clear that the versioning is applied to the underlying dataset, not to the PartitionedDataset itself. However, there are some edge cases (see https://github.com/kedro-org/kedro/pull/521#issuecomment-744653023, if you're keen).
2. Alternatively, we can move the versioned: true flag to the top-level PartitionedDataset config:
    station_data:
      type: PartitionedDataset
      path: data/03_primary/station_data
      versioned: true
      dataset:
        type: pandas.CSVDataset
Note that the versioning is still of the underlying dataset (e.g. data/03_primary/station_data/first_station.csv/<version>/first_station.csv), even though the config is at the top level.
3. None of these options make sense; what you really need is versioning of the top-level dataset. (Note that we don't have a solution designed for this case, but it would be great to know nonetheless!)
Please feel free to vote using 1️⃣2️⃣3️⃣, and elaborate further on your thoughts in the thread below!
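In case it helps to picture the per-partition layout mentioned in option 2, here is a minimal sketch of how such a path could be assembled. The helper name versioned_partition_path and the version string are made up for illustration; this is not Kedro's actual implementation or API.

```python
from pathlib import PurePosixPath


def versioned_partition_path(base_path: str, partition: str, version: str) -> str:
    """Hypothetical helper (not part of Kedro's API).

    Builds <base_path>/<partition>/<version>/<partition>, i.e. the layout
    quoted in option 2, where each partition gets its own version folder.
    """
    partition_path = PurePosixPath(base_path) / partition
    # The file name is repeated below the version directory.
    return str(partition_path / version / partition_path.name)


# The version string here is a placeholder chosen for the example.
example = versioned_partition_path(
    "data/03_primary/station_data",
    "first_station.csv",
    "2023-10-22T13.32.00.000Z",
)
print(example)
```

Under this assumption, every partition carries its own version directory, which is different from option 3, where a single version would sit above the whole station_data folder.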