Debanjan Banerjee
08/07/2023, 10:40 AM
`versioned` always points to a new version once the data is written, right? Can we ensure there is a `prod` version created that the rest of the datasets always read from in production, and that we can change in params or somewhere when we want to?
For example, we can do this manually like so:
parameters.yml:
run_date: &run_date 20230101
version: *run_date  # this can also be prod/dev/uat etc.
catalog.yml:
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
  file_format: csv
But this won't utilise the `versioned: True` feature. Is there any way we can achieve the above functionality with `versioned`? That would be much cleaner imo.
Deepyaman Datta
08/07/2023, 12:31 PM
`versioned: true` enables a very specific versioning scheme that loads from the most "recent" folder, which is determined by listing and sorting subdirectories at a particular level. There's no support for always reading from a specific version, unless you want to patch that version discovery logic to do something custom.
Nok Lam Chan
08/07/2023, 4:24 PM
`kedro run --load-versions`?
Two ways of dealing with this:
1. Native Kedro versioning scheme: `kedro run --load-versions`
2. Similar to your approach, simply define a templated value that points to where your `prod` version lives.
For `production`, there is a process to validate it, after which you can simply move it to a specific path (closer to approach 2), be it a CI/CD build process or manual human intervention. With experiment tracking tools such as MLflow, you may use artifacts, tags, etc. to achieve a similar thing.
Richard Bownes
08/08/2023, 8:59 AM
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
  file_format: csv
I want to do something similar with table names in BigQuery. How do I go about picking an environment at run time to do it?
datajoely
08/08/2023, 9:14 AM
You can template the `table_name` key/value the exact same way:
https://docs.kedro.org/en/stable/kedro.extras.datasets.pandas.GBQTableDataSet.html
Debanjan Banerjee
08/08/2023, 4:40 PM
weather_hive:
  type: spark.SparkHiveDataSet
  table_name: ${table_name_from_params/globals}
  database_name: ${db_name_from_params/globals}
  mode: overwrite
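To illustrate how placeholders like the ones in the entry above could resolve differently per environment, here is a minimal Python sketch. It uses only the standard library's `string.Template` and is not Kedro's own config loader. The names `GLOBALS_BY_ENV`, `CATALOG_TEMPLATE`, `render` and `catalog_for`, the simplified placeholder names, and the example database and table values are all hypothetical.

```python
from string import Template

# Hypothetical per-environment values (assumption): in a real project these
# would normally come from per-environment config files, not a Python dict.
GLOBALS_BY_ENV = {
    "dev": {"db_name": "analytics_dev", "table_name": "weather_dev"},
    "prod": {"db_name": "analytics", "table_name": "weather"},
}

# Catalog entry mirroring the one above, with simplified placeholder names.
CATALOG_TEMPLATE = {
    "weather_hive": {
        "type": "spark.SparkHiveDataSet",
        "table_name": "${table_name}",
        "database_name": "${db_name}",
        "mode": "overwrite",
    }
}


def render(node, values):
    """Recursively substitute ${name} placeholders in nested dicts and lists."""
    if isinstance(node, dict):
        return {key: render(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [render(child, values) for child in node]
    if isinstance(node, str):
        # substitute() raises KeyError when a placeholder has no value,
        # so a missing setting fails loudly instead of passing through.
        return Template(node).substitute(values)
    return node


def catalog_for(env):
    """Return the catalog with placeholders filled in for one environment."""
    return render(CATALOG_TEMPLATE, GLOBALS_BY_ENV[env])


if __name__ == "__main__":
    print(catalog_for("prod")["weather_hive"])
```

In a Kedro project the environment itself is usually selected on the command line with `kedro run --env`, so the lookup by `env` here only stands in for that selection step.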
Nok Lam Chan
08/09/2023, 8:10 PM