# questions
d
Hello! Is it possible to give a versioned dataset a name instead of using a timestamp? I'm using datasets like pandas.ParquetDataSet, spark.SparkDataSet, and pickle.PickleDataSet, and saving them through yml configs:
dataset:
  type: pandas.ParquetDataSet
  filepath: some_path
  versioned: true
d
Short answer: no
Slightly longer: you can always modify Kedro to use a different versioning scheme under the hood, if you want to play with this. However, the version names should still be sortable, or else Kedro won't be able to load the "latest" version. What's your use case?
d
Thanks @Deepyaman Datta! My use case is that I want to rerun several nodes with different parameters and save the different runs to compare outputs. The goal is to save datasets like {timestamp}_conf_v1, {timestamp}_conf_v2, etc. so it's easier to analyze the output. I know MLflow could probably do this, but I want to see if there's an option without changing the source code.
d
Have you considered the experiment tracking built into Kedro? Let me grab a link
Actually, I'm not sure this is a good fit: https://kedro.readthedocs.io/en/stable/logging/experiment_tracking.html @Merel could confirm, I think, being a lot more familiar with experiment tracking. The kedro-mlflow plug-in could be another idea; I don't know its functionality that well, either.
d
Thanks for the link. It seems like it tracks metrics rather than different versions of the dataset. I can probably use a workaround for now and track the timestamps and configs somewhere else.
e
Perhaps using PartitionedDataSet (link below)? You can then generate a key, and it will save accordingly. PS. We've also been looking into this use case. https://kedro.readthedocs.io/en/stable/kedro.io.PartitionedDataSet.html
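To make the key idea concrete, here is a minimal sketch, not a definitive implementation. It assumes the node output is declared as a PartitionedDataSet in the catalog, which on save expects a dictionary mapping partition keys to data. The helper names (make_partition_key, tag_output) and the timestamp format are made up for illustration and are not part of Kedro. The timestamp goes first so that keys still sort chronologically.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def make_partition_key(label: str, now: Optional[datetime] = None) -> str:
    """Build a sortable key of the form '{timestamp}_{label}'.

    The timestamp format is an assumption chosen so that plain string
    sorting matches chronological order; it is not Kedro's own format.
    """
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime('%Y-%m-%dT%H.%M.%SZ')}_{label}"


def tag_output(data: Any, label: str, now: Optional[datetime] = None) -> Dict[str, Any]:
    """Wrap a node's output as {partition_key: data}.

    A node returning this dictionary to a PartitionedDataSet output would
    get one saved partition named after the key.
    """
    return {make_partition_key(label, now): data}
```

A node could then end with something like return tag_output(df, "conf_v1"), with the label taken from the run's parameters, so each run's output lands under its own name instead of overwriting the previous one.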