kedro `versioned` always points to a new version ...
# questions
d
Kedro `versioned` always points to a new version once writing the data, right? Can we ensure there is a `prod` version created that the rest of the datasets always read from in production, and that we can change in params or somewhere when we want to? For example, we can do this manually in `parameters.yml`:
```yaml
run_date: &run_date 20230101

version: *run_date  # this can also be prod/dev/uat etc.
```
and in `catalog.yml`:
```yaml
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
  file_format: csv
```
But this won't utilise the `versioned: True` feature. Is there any way we can achieve the above functionality with `versioned`? That would be much cleaner imo.
d
Not 100% sure I understand, but I want to say no. `versioned: true` enables a very specific versioning scheme that will load from the most "recent" folder, which is determined by listing and sorting subdirectories at a particular level. There's no support for a specific version to always read from, unless you want to patch that version-discovery logic to do something custom.
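For reference, the built-in scheme looks roughly like this (the path and timestamp below are illustrative, not from your project):

```yaml
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/file.csv
  file_format: csv
  versioned: true

# On each save Kedro writes into a timestamped subfolder, e.g.
#   s3a://your_bucket/data/01_raw/weather/file.csv/2023-01-01T00.00.00.000Z/file.csv
# and on load it picks the latest such subfolder unless a version is pinned.
```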
n
`kedro run --load-versions`? Two ways of dealing with this:
1. Native Kedro versioning scheme: `kedro run --load-versions`
2. Similarly to your approach, simply define a templated value to point to where your `prod` version lives.
The way I see it, in order to mark something as `production` there is a process to validate it, after which you can simply move it to a specific path (closer to approach 2), be it a CI/CD build process or manual human intervention. With experiment-tracking tools or things like MLflow, you may use artifacts, tags, etc. to achieve a similar thing.
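For approach 1, pinning which saved version a run reads from looks roughly like this (the dataset name and timestamp are illustrative, and the dataset must be `versioned`):

```bash
# read the `weather` dataset at a specific saved version instead of the latest
kedro run --load-versions=weather:2023-01-01T00.00.00.000Z
```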
r
I'd love to piggyback off this. In this example we have a parameterised filepath:
```yaml
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
  file_format: csv
```
I want to do something similar with table names in BigQuery. How do I go about picking an environment at run time to do it?
d
You can parametrise your `table_name` key/value the exact same way: https://docs.kedro.org/en/stable/kedro.extras.datasets.pandas.GBQTableDataSet.html
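A minimal sketch of what that could look like, assuming your config loader resolves `${...}` templates; the entry name, project and template keys here are placeholders:

```yaml
weather_bq:
  type: pandas.GBQTableDataSet
  dataset: ${bq_dataset}        # BigQuery dataset, set per environment
  table_name: ${bq_table_name}  # table name, set per environment
  project: your-gcp-project
```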
d
We are doing this right now too @Richard Bownes. Like Joel mentioned, this will work:
```yaml
weather_hive:
  type: spark.SparkHiveDataSet
  table_name: ${table_name_from_params/globals}
  database_name: ${db_name_from_params/globals}
  mode: overwrite
```
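To pick the environment at run time, one option (a sketch assuming the TemplatedConfigLoader is set up with a globals pattern; the key names `hive_table`/`hive_db` are hypothetical, and the catalog entry would then reference `${hive_table}`/`${hive_db}`) is to define the values per environment and select it with `kedro run --env prod`:

```yaml
# conf/base/globals.yml -- defaults (e.g. dev)
hive_table: weather_dev
hive_db: analytics_dev
```

```yaml
# conf/prod/globals.yml -- used when running `kedro run --env prod`
hive_table: weather
hive_db: analytics
```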
n
@Richard Bownes does this solve your problem? Would love to hear that.