#questions

Danhua Yan

01/06/2023, 5:03 PM
Hello! Is it possible to give a versioned dataset a name instead of using a timestamp? I’m using datasets like pandas.ParquetDataSet, spark.SparkDataSet, and pickle.PickleDataSet, and using YAML configs to save:
dataset:
  type: pandas.ParquetDataSet
  filepath: some_path
  versioned: true

Deepyaman Datta

01/06/2023, 5:08 PM
Short answer: no
Slightly longer: you can always modify Kedro to use a different versioning scheme under the hood, if you want to play with this. However, the version names should still be sortable, else Kedro won't be able to load the "latest" version. What's your use case?

Danhua Yan

01/06/2023, 5:42 PM
Thanks @Deepyaman Datta! My use case is that I want to rerun several nodes with different parameters and save the different runs to compare outputs. The goal is to save datasets like {timestamp}_conf_v1, {timestamp}_conf_v2, etc. so it’s easier to analyze the output. I know MLflow could probably do this, but I want to see if there’s an option without changing the source code.
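
A minimal sketch of the {timestamp}_conf_v1 naming idea, showing why a timestamp-first name stays sortable (the requirement raised earlier in the thread). The helper `make_version_name` is hypothetical and not part of Kedro; the timestamp layout is only assumed to resemble Kedro's default version strings. Whether Kedro would accept such names depends on changing its versioning scheme, as discussed above.

```python
from datetime import datetime, timezone
from typing import Optional


def make_version_name(label: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: build a '{timestamp}_{label}' version name.

    The timestamp comes first and has a fixed width, so sorting the names
    as plain strings also sorts them chronologically.
    """
    now = now or datetime.now(timezone.utc)
    # Assumed layout: millisecond precision, dots instead of colons so the
    # name is safe to use as a directory name.
    stamp = now.strftime("%Y-%m-%dT%H.%M.%S.%f")[:-3] + "Z"
    return f"{stamp}_{label}"


# Two runs with different configs, created about an hour apart.
earlier = make_version_name("conf_v2", datetime(2023, 1, 6, 17, 3, tzinfo=timezone.utc))
later = make_version_name("conf_v1", datetime(2023, 1, 6, 18, 7, tzinfo=timezone.utc))

# The "latest" version is simply the largest name in string order,
# regardless of the config label that follows the timestamp.
latest = max([earlier, later])
```

Because the label only comes after the fixed-width timestamp, it never affects which version counts as the latest.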

Deepyaman Datta

01/06/2023, 6:07 PM
Have you considered the experiment tracking built in to kedro? Let me grab a link
Actually, not sure this is a good fit: https://kedro.readthedocs.io/en/stable/logging/experiment_tracking.html @Merel could confirm, I think; a lot more familiar with experiment tracking. The kedro-mlflow plug-in could be another idea; I don't know its functionality that well, either.

Danhua Yan

01/06/2023, 6:13 PM
Thanks for the link, seems like it tracks metrics instead of different versions of the dataset. I can probably do a workaround to track timestamps and configs somewhere else for now.

Elias WILLEMSE

01/08/2023, 11:50 AM
Perhaps using PartitionedDataSet (link below)? You can then generate a key, and it will save accordingly. PS. We’ve also been looking into this use case. https://kedro.readthedocs.io/en/stable/kedro.io.PartitionedDataSet.html
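
A rough sketch of the PartitionedDataSet suggestion, under stated assumptions. A node that writes to a PartitionedDataSet returns a dictionary whose keys become partition names, so a generated key can carry both the timestamp and the config label. The function `tag_output`, the catalog entry name, and the path in the comment are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Assumed catalog entry (names and path are illustrative only):
#
# run_outputs:
#   type: PartitionedDataSet
#   path: data/07_model_output/run_outputs
#   dataset: pandas.ParquetDataSet
#   filename_suffix: ".parquet"


def tag_output(
    output: Any, conf_name: str, now: Optional[datetime] = None
) -> Dict[str, Any]:
    """Hypothetical node: wrap an output in a single-partition dictionary.

    The key '{timestamp}_{conf_name}' is what the partition would be
    named when the dictionary is saved to a PartitionedDataSet.
    """
    now = now or datetime.now(timezone.utc)
    key = f"{now.strftime('%Y-%m-%dT%H.%M.%S')}_{conf_name}"
    return {key: output}


# Example with a placeholder payload; in a real pipeline this would be
# the DataFrame produced by the rerun nodes.
partitions = tag_output(
    {"rows": 3}, "conf_v1", datetime(2023, 1, 6, 17, 3, tzinfo=timezone.utc)
)
```

Each rerun then adds a new partition next to the earlier ones instead of overwriting them, which makes the outputs easy to load side by side for comparison.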