# user-research
Hello to all of our wonderful `PartitionedDataset` users out there! We have a question for you about enabling versioning for `PartitionedDataset`. Which of the options below makes the most sense to you?

1. https://github.com/kedro-org/kedro/pull/521 proposes to enable versioning of the underlying dataset by specifying `versioned: true` in the dataset config:
```yaml
station_data:
  type: PartitionedDataset
  path: data/03_primary/station_data
  dataset:
    type: pandas.CSVDataset
    versioned: true
```
On the plus side, putting `versioned: true` under the `dataset` key makes it clear that the versioning is applied to the underlying dataset, not to the `PartitionedDataset`. However, there are some edge cases (see https://github.com/kedro-org/kedro/pull/521#issuecomment-744653023, if you're keen).

2. Alternatively, we can move the `versioned: true` flag to the top-level `PartitionedDataset` config:
```yaml
station_data:
  type: PartitionedDataset
  path: data/03_primary/station_data
  versioned: true
  dataset:
    type: pandas.CSVDataset
```
Note that the versioning is still of the underlying dataset (e.g. `data/03_primary/station_data/first_station.csv/<version>/first_station.csv`), even though the config is at the top level.

3. Neither of these options makes sense; what you really need is versioning of the top-level dataset. (Note that we don't have a solution designed for this case, but it would be great to know nonetheless!)

Please feel free to vote using 1️⃣2️⃣3️⃣, and elaborate further on your thoughts in the thread below!
1️⃣ 3
2️⃣ 1
3️⃣ 1
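
For anyone trying to picture the per-partition layout mentioned under option 2, here is a rough sketch of how that example path is put together. `versioned_partition_path` is a made-up helper for illustration only, not part of Kedro, and it assumes the layout is exactly `<path>/<partition>/<version>/<partition>` as in the example above.

```python
from pathlib import PurePosixPath


def versioned_partition_path(base: str, partition: str, version: str) -> str:
    """Hypothetical helper (not a Kedro API).

    Builds <base>/<partition>/<version>/<partition>, i.e. each partition
    gets its own version directory, rather than the whole
    PartitionedDataset directory being versioned as one unit.
    """
    return str(PurePosixPath(base) / partition / version / partition)


# "<version>" is left as a placeholder, as in the message above.
example = versioned_partition_path(
    "data/03_primary/station_data", "first_station.csv", "<version>"
)
print(example)
```

Under that assumption, options 1 and 2 would produce the same files on disk; they differ only in where the `versioned: true` flag sits in the config.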