Hi Using the spaceflights as an example let s say I have two Kedro #questions


Afiq Johari

01/18/2024, 2:24 PM

Hi, using the spaceflights example, let's say I have two different spaceflight companies, SpaceX and SpaceY. The spaceflights example has the following data catalog:


companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataset
  filepath: data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataset
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl

Given SpaceX and SpaceY, I would like to define the spaceflight company as a parameter, so that the data catalog becomes:


companies:
  type: pandas.CSVDataset
  filepath: ${spaceflight_company}/data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataset
  filepath: ${spaceflight_company}/data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataset
  filepath: ${spaceflight_company}/data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl

This saves me from duplicating catalog entries such as companies_SpaceX, companies_SpaceY, etc., and keeps the data catalog simple. All the Kedro nodes and pipelines would then depend on which spaceflight company I want to run. So instead of


kedro run
kedro run --nodes=preprocess_companies_node,preprocess_shuttles_node

I would like to be able to specify which spaceflight company to run, something like:


kedro run --spaceflightcompany=SpaceX
kedro run --spaceflightcompany=SpaceX --nodes=preprocess_companies_node,preprocess_shuttles_node


Ankita Katiyar

01/18/2024, 2:36 PM

Hey Afiq, check out the dataset factories feature which might be useful - https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html


Deepyaman Datta

01/18/2024, 3:16 PM

Also see https://docs.kedro.org/en/stable/development/commands_reference.html#modifying-a-kedro-run; you can

kedro run --namespace SpaceX

to run a namespaced pipeline. You can reuse a modular pipeline with these nodes. Like @Ankita Katiyar said, you would use something like dataset factories to make sure you have a catalog entry for each namespace.
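
A minimal sketch of what such a catalog could look like, assuming the spaceflights pipeline is instantiated once per company under the namespaces `SpaceX` and `SpaceY` (so its datasets are named `SpaceX.companies`, `SpaceY.companies`, and so on) and that the files are laid out as in the question. The placeholder name `{namespace}` is an illustrative choice, not something prescribed in this thread:

```yaml
# Dataset factory patterns: each entry is meant to match the namespaced
# dataset names of every company, e.g. SpaceX.companies and SpaceY.companies.
# Assumption: pipelines are namespaced per company as described above.
# Values that start with "{" are quoted so YAML does not parse them as a mapping.
"{namespace}.companies":
  type: pandas.CSVDataset
  filepath: "{namespace}/data/01_raw/companies.csv"

"{namespace}.reviews":
  type: pandas.CSVDataset
  filepath: "{namespace}/data/01_raw/reviews.csv"

"{namespace}.shuttles":
  type: pandas.ExcelDataset
  filepath: "{namespace}/data/01_raw/shuttles.xlsx"
  load_args:
    engine: openpyxl
```

With entries like these, a run restricted to the `SpaceX` namespace should resolve `SpaceX.companies` to `SpaceX/data/01_raw/companies.csv`, without separate catalog entries per company.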

