What's the best design pattern for registering many catalog Kedro #questions

Miguel Rodríguez

09/14/2024, 8:27 PM

What's the best design pattern for registering many catalog entries that are very similar but differ in small details? E.g. Spark tables with the same config but a different filepath and partition column each. I don't want to write the same 10 lines of config for every dataset when the only difference is the _partition_by_ and filepath.

• Factories would make me add every possible customization (like the partition column) to the name as a placeholder, and I'd end up with long, verbose dataset names
• YAML anchors are not recommended with OmegaConf and don't work across multiple files
• OmegaConf interpolations can't solve this in a simple way either

I feel this should be a common pattern many people will face, but I can't find an elegant solution. I found https://github.com/kedro-org/kedro/issues/3625, which seems to be motivated by the same issue.

Nok Lam Chan

09/15/2024, 12:00 AM

Could you provide some examples?

Miguel Rodríguez

09/15/2024, 12:05 AM

Now my datasets look like this:

"{country}.prm_my_dataset":
  type: spark.SparkDataset
  file_format: delta
  credentials: ${globals:datalake_credential}
  save_args:
    mode: overwrite
    mergeSchema: true
    partitionOverwriteMode: dynamic
    partitionBy: ["date_column"]
  filepath: ${globals:versioned_storage_path}${globals:namespace}/data/{country}/03_primary/prm_my_dataset
  metadata:
    kedro-viz:
      layer: primary

Most of my datasets look very similar, with only the layer, partitioning and filepath changing. At 13 lines per entry and around a hundred entries, my catalog ends up quite unreadable. And if I want to change something at project level (e.g. add some new metadata to all datasets), I have to do a lot of find-and-replace and end up with an even more complex catalog.

Merel

09/17/2024, 2:23 PM

The officially supported ways of reducing catalog entries in Kedro are indeed currently dataset factories and resolvers. I'm afraid there's no other official feature that can help solve your problem. (Thanks for commenting on the issue above; the more responses and evidence we have that this needs to be solved, the better we can prioritise!)
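
For illustration only, the "shared defaults plus small per-dataset overrides" idea from this thread could be sketched in plain Python as below. This is not a Kedro feature or API. `BASE_ENTRY`, `deep_merge` and `spark_entry` are hypothetical names, the default values are copied from the entry posted above, and how the resulting dictionary would be wired into a project's catalog is deliberately left open.

```python
from copy import deepcopy

# Shared defaults, copied from the catalog entry posted earlier in the thread.
BASE_ENTRY = {
    "type": "spark.SparkDataset",
    "file_format": "delta",
    "credentials": "${globals:datalake_credential}",
    "save_args": {
        "mode": "overwrite",
        "mergeSchema": True,
        "partitionOverwriteMode": "dynamic",
    },
}


def deep_merge(base, overrides):
    """Return a new dict: overrides win, nested dicts are merged recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def spark_entry(name, layer_dir, layer, partition_by):
    """Hypothetical helper: build one catalog entry from the shared defaults."""
    return deep_merge(
        BASE_ENTRY,
        {
            # "{country}" is kept as a literal placeholder in the path.
            "filepath": (
                "${globals:versioned_storage_path}${globals:namespace}"
                f"/data/{{country}}/{layer_dir}/{name}"
            ),
            "save_args": {"partitionBy": list(partition_by)},
            "metadata": {"kedro-viz": {"layer": layer}},
        },
    )


# One line per dataset instead of 13 lines of repeated config.
catalog = {
    "{country}.prm_my_dataset": spark_entry(
        "prm_my_dataset", "03_primary", "primary", ["date_column"]
    ),
}
```

With this shape, a project-wide change (such as adding new metadata to every dataset) is a single edit to `BASE_ENTRY` rather than a find-and-replace across a hundred entries. The trade-off is that the catalog is no longer plain YAML.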

👍 1
