Hello Is it possible to give versioned dataset a name instea Kedro #questions

Danhua Yan

01/06/2023, 5:03 PM

Hello! Is it possible to give a versioned dataset a name instead of using a timestamp? I’m using datasets like pandas.ParquetDataSet, spark.SparkDataSet, and pickle.PickleDataSet, and using yml configs to save:

dataset:
  type: pandas.ParquetDataSet
  filepath: some_path
  versioned: true

Deepyaman Datta

01/06/2023, 5:08 PM

Short answer: no

Deepyaman Datta

01/06/2023, 5:10 PM

Slightly longer: you can always modify Kedro to use a different versioning scheme under the hood, if you want to play with this. However, the version names should still be sortable, else Kedro won't be able to load the "latest" version. What's your use case?
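To illustrate the sortability point, here is a minimal sketch. It is not Kedro's code; it assumes that the "latest" version is whichever version string sorts last, and the timestamp strings are only written in the style of Kedro's default version names. The helper name latest_version is hypothetical.

```python
def latest_version(versions):
    """Pick the 'latest' version by plain string ordering.

    Assumption for illustration only: the versioning scheme resolves
    'latest' by sorting version strings, as the message above implies.
    """
    return max(versions)


# Timestamp-prefixed labels: the fixed-width prefix keeps chronological order,
# so a suffix such as _conf_v1 does not disturb the sort.
prefixed = [
    "2023-01-06T17.03.00.000Z_conf_v2",
    "2023-01-06T18.11.00.000Z_conf_v1",
]

# Free-form names: string order says nothing about which run is newer.
named = ["conf_v2", "conf_v1"]

print(latest_version(prefixed))  # the 18.11 run, regardless of its suffix
print(latest_version(named))     # "conf_v2", even if conf_v1 was saved later
```

With a timestamp prefix the newest run still sorts last; with free-form names the result depends only on alphabetical order, which is why an unsortable naming scheme would break loading the "latest" version.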

Danhua Yan

01/06/2023, 5:42 PM

Thanks @Deepyaman Datta! My use case is that I want to rerun several nodes with different parameters and save the different runs to compare outputs. The goal is to save datasets like {timestamp}_conf_v1, {timestamp}_conf_v2, etc. so it’s easier to analyze the output. I know MLflow could probably do this, but I want to see if there’s an option that doesn’t require changing the source code.

Deepyaman Datta

01/06/2023, 6:07 PM

Have you considered the experiment tracking built into Kedro? Let me grab a link

Deepyaman Datta

01/06/2023, 6:11 PM

Actually, not sure this is a good fit: https://kedro.readthedocs.io/en/stable/logging/experiment_tracking.html @Merel could confirm, I think, being a lot more familiar with experiment tracking. The kedro-mlflow plug-in could be another idea; I don't know its functionality that well, either.

Danhua Yan

01/06/2023, 6:13 PM

Thanks for the link. It seems like it tracks metrics rather than different versions of a dataset. I can probably work around this for now by tracking timestamps and configs somewhere else.
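One way such a workaround might look is a small JSON manifest that maps each version timestamp to a config label and its parameters. This is only a sketch using the standard library: record_run, the runs.json file name, and the alpha parameter are all hypothetical, and how the version string for a given run is obtained is left open.

```python
import json
import tempfile
from pathlib import Path


def record_run(manifest_path, version, label, params):
    """Add one run to a JSON manifest keyed by version timestamp.

    Hypothetical helper: it only records which label and parameters
    belong to which version; it does not touch the datasets themselves.
    """
    path = Path(manifest_path)
    manifest = json.loads(path.read_text()) if path.exists() else {}
    manifest[version] = {"label": label, "params": params}
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


# Example usage in a temporary directory (version strings are made up).
with tempfile.TemporaryDirectory() as tmp:
    manifest_file = Path(tmp) / "runs.json"
    record_run(manifest_file, "2023-01-06T17.03.00.000Z", "conf_v1", {"alpha": 0.1})
    runs = record_run(manifest_file, "2023-01-06T18.11.00.000Z", "conf_v2", {"alpha": 0.5})

for version in sorted(runs):
    print(version, runs[version]["label"])
```

The versioned datasets keep their timestamp folders, and the manifest supplies the human-readable name when comparing outputs.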

Elias WILLEMSE

01/08/2023, 11:50 AM

Perhaps using PartitionedDataSet (link below)? You can then generate a key, and it will save accordingly. PS. We’ve also been looking into this use case. https://kedro.readthedocs.io/en/stable/kedro.io.PartitionedDataSet.html
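A rough sketch of the key-generation idea, assuming the node output is wired to a PartitionedDataSet entry in the catalog (not shown here). On save, a PartitionedDataSet takes a dictionary that maps partition keys to data, so a node can return a single-entry dict whose key carries both the timestamp and the config label. The helper names make_partition_key and collect_output are hypothetical.

```python
from datetime import datetime, timezone


def make_partition_key(label, now=None):
    """Build a key such as 2023-01-08T11.50.00_conf_v1 (hypothetical format)."""
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime('%Y-%m-%dT%H.%M.%S')}_{label}"


def collect_output(result, label, now=None):
    """Node-style function: wrap a result in a dict keyed by partition name.

    The returned dict is what would be handed to the partitioned dataset,
    which saves each key as its own partition.
    """
    return {make_partition_key(label, now): result}


# Example with a fixed time so the key is predictable.
fixed = datetime(2023, 1, 8, 11, 50, tzinfo=timezone.utc)
partitions = collect_output({"score": 0.9}, "conf_v1", now=fixed)
print(list(partitions))
```

Because the timestamp comes first in the key, the partitions still sort chronologically while the suffix identifies the configuration.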
