kedro `versioned` always points to a new version ...
# questions
d
Kedro `versioned` always points to a new version once writing the data, right? Can we ensure there is a `prod` version created that the rest of the datasets always read from in production, and that we can change in params or somewhere when we want to? For example, we can do this manually in `parameters.yml`:
```yaml
run_date: &run_date 20230101

version: *run_date  # this can also be prod/dev/uat etc.
```
and in `catalog.yml`:
```yaml
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
  file_format: csv
```
But this won't utilise the `versioned: True` feature. Is there any way we can achieve the above functionality with `versioned`? That would be much cleaner imo.
d
Not 100% sure I understand, but I want to say no. `versioned: true` enables a very specific versioning scheme that will load from the most "recent" folder, which is determined by listing and sorting subdirectories at a particular level. There's no support for a specific version to always read from, unless you want to patch that version-discovery logic to do something custom.
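For reference, the built-in scheme looks roughly like this (the path and timestamp below are illustrative, not from your project):

```yaml
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/file.csv
  file_format: csv
  versioned: true

# On each save Kedro writes into a timestamped subfolder, e.g.
#   s3a://your_bucket/data/01_raw/weather/file.csv/2023-01-01T00.00.00.000Z/file.csv
# and on load it picks the latest such subfolder unless a version is pinned.
```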
n
`kedro run --load-versions`? Two ways of dealing with this:
1. Native Kedro versioning scheme: `kedro run --load-versions`
2. Similarly to your approach, simply define a templated value to point to where your `prod` version lives.
The way I see it, in order to mark something as `production` there is a process to validate it, after which you can simply move it to a specific path (closer to approach 2), be it a CI/CD build process or manual human intervention. With experiment-tracking tools or things like MLflow, you may use artifacts, tags, etc. to achieve a similar thing.
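For approach 1, pinning which saved version a run reads from looks roughly like this (the dataset name and timestamp are illustrative, and the dataset must be `versioned`):

```bash
# read the `weather` dataset at a specific saved version instead of the latest
kedro run --load-versions=weather:2023-01-01T00.00.00.000Z
```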
r
I'd love to piggyback off this. In this example we have a parameterised filepath:
```yaml
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather/${version}/file.csv
  file_format: csv
```
I want to do something similar with table names in BigQuery. How do I go about picking an environment at run time to do it?
d
You can parametrise your `table_name` key/value the exact same way: https://docs.kedro.org/en/stable/kedro.extras.datasets.pandas.GBQTableDataSet.html
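A minimal sketch of what that could look like, assuming your config loader resolves `${...}` templates; the entry name, project and template keys here are placeholders:

```yaml
weather_bq:
  type: pandas.GBQTableDataSet
  dataset: ${bq_dataset}        # BigQuery dataset, set per environment
  table_name: ${bq_table_name}  # table name, set per environment
  project: your-gcp-project
```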
d
We are doing this right now too @Richard Bownes. Like Joel mentioned, this will work:
```yaml
weather_hive:
  type: spark.SparkHiveDataSet
  table_name: ${table_name_from_params/globals}
  database_name: ${db_name_from_params/globals}
  mode: overwrite
```
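To pick the environment at run time, one option (a sketch assuming the TemplatedConfigLoader is set up with a globals pattern; the key names `hive_table`/`hive_db` are hypothetical, and the catalog entry would then reference `${hive_table}`/`${hive_db}`) is to define the values per environment and select it with `kedro run --env prod`:

```yaml
# conf/base/globals.yml -- defaults (e.g. dev)
hive_table: weather_dev
hive_db: analytics_dev
```

```yaml
# conf/prod/globals.yml -- used when running `kedro run --env prod`
hive_table: weather
hive_db: analytics
```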
n
@Richard Bownes does this solve your problem? Would love to hear that.