Thiago José Moser Poletto
11/07/2024, 2:15 PM
Hall
11/07/2024, 2:15 PM
datajoely
11/07/2024, 2:31 PM
run_identifier
using an env var is the easiest way to do this
Thiago José Moser Poletto
11/07/2024, 2:52 PM
datajoely
11/07/2024, 3:01 PM
datajoely
11/07/2024, 3:02 PM
datajoely
11/07/2024, 3:02 PM
Nok Lam Chan
11/07/2024, 3:07 PM
"I'm loading a base from BigQuery and splitting each row to run in another pipeline"
How are you doing this?
Nok Lam Chan
11/07/2024, 3:07 PM
runtime_params
Thiago José Moser Poletto
11/07/2024, 3:10 PM
from functools import reduce
from operator import add

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    # load_catalog_params and train_setup are project-specific helpers
    params = load_catalog_params()
    print("input_tables entry from parameters/general.yml: ", params["input_tables"])
    all_pipelines = []
    for group in range(int(params["n_cores"])):
        p = Pipeline([
            node(func=train_setup,
                 inputs=["imagens", "masks", "split_data_input", f"train_config_params_{group}"],
                 outputs=[f"metrics_{group}", f"epoch_loss_{group}", f"validity_metrics_{group}", f"model_save_{group}", f"train_params_{group}"],
                 name=f"train_setup_{group}")
        ], tags="training")
        all_pipelines += [p]
    return reduce(add, all_pipelines)
Nok Lam Chan
11/07/2024, 3:10 PM
{run_identifier}_xxx_dataset:
  filepath: {run_identifier}/some_folder/some_dataset.parquet
Then in those namespaced pipelines you use the run_identifier as part of the input/output names.
Thiago José Moser Poletto
11/07/2024, 3:11 PM
Thiago José Moser Poletto
11/07/2024, 3:12 PM
Thiago José Moser Poletto
11/07/2024, 3:13 PM
Nok Lam Chan
11/07/2024, 3:17 PM
You can use a hook to override the catalog during node run.
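A minimal sketch of that idea, using the after_catalog_created hook to register run-specific datasets; the run identifier comes from an env var as datajoely suggested, and the dataset names and paths below are only illustrative:

import os

from kedro.framework.hooks import hook_impl
from kedro_datasets.pandas import ParquetDataset


class RunIdentifierHooks:
    @hook_impl
    def after_catalog_created(self, catalog):
        # Register one output per group so each run writes under its own folder.
        # "metrics_{group}" mirrors the dynamically generated outputs above.
        run_id = os.environ.get("RUN_IDENTIFIER", "local-run")
        for group in range(4):
            catalog.add(
                f"metrics_{group}",
                ParquetDataset(filepath=f"data/{run_id}/metrics_{group}.parquet"),
            )

The hook class would then be registered in settings.py via HOOKS = (RunIdentifierHooks(),).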
Thiago José Moser Poletto
11/07/2024, 3:18 PM
Thiago José Moser Poletto
11/07/2024, 3:19 PM
Nok Lam Chan
11/07/2024, 3:20 PM
You can use more than one kedro run command to do what you think of as a whole job. In an orchestrator it shouldn't be a problem, since you can treat two runs as one job and it doesn't have to be mapped 1:1.
Thiago José Moser Poletto
11/07/2024, 3:25 PM
Thiago José Moser Poletto
11/07/2024, 7:48 PM
{run_identifier}_xxx_dataset:
  filepath: {run_identifier}/some_folder/some_dataset.parquet
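For illustration, the suggestion above could look roughly like this in create_pipeline, reusing train_setup and the image/mask datasets from the earlier snippet (those names are assumptions); the remapped output name is what the {run_identifier}_xxx_dataset pattern catches:

import os

from kedro.pipeline import Pipeline, node, pipeline

from .nodes import train_setup  # project-specific node function


def base_pipeline() -> Pipeline:
    return Pipeline([
        node(train_setup, inputs=["imagens", "masks"], outputs="xxx_dataset", name="train_setup"),
    ])


def create_pipeline(**kwargs) -> Pipeline:
    run_identifier = os.environ.get("RUN_IDENTIFIER", "dev")
    return pipeline(
        base_pipeline(),
        inputs={"imagens", "masks"},  # shared inputs, kept out of the namespace
        outputs={"xxx_dataset": f"{run_identifier}_xxx_dataset"},  # matched by the factory pattern
        namespace=run_identifier,
    )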
Nok Lam Chan
11/08/2024, 11:47 AM
Thiago José Moser Poletto
11/08/2024, 11:57 AM
Nok Lam Chan
11/08/2024, 11:59 AM
Thiago José Moser Poletto
11/08/2024, 12:00 PM
Thiago José Moser Poletto
11/08/2024, 12:02 PM
Nok Lam Chan
11/08/2024, 12:03 PM
{layer}_some_dataset:
  filepath: data/{layer}/data.pq
A dataset factory allows you to define multiple datasets with a single pattern, while a namespace is merely a "group" of datasets/pipelines, which supports more powerful features with kedro-viz. For Kedro itself, namespaces do not change anything; they are mainly for organising your code.
Thiago José Moser Poletto
11/08/2024, 12:05 PM
Nok Lam Chan
11/08/2024, 12:05 PM
node(my_func, inputs="some_data", outputs="my_fav_dataset")
then your catalog may look like:
some_data:
  ...
{some}_dataset:
  filepath: data/{some}/dataset.pq
With pattern matching, the outputs will automatically match the 2nd dataset in the catalog. Without a pattern, Kedro defaults to an in-memory dataset that is thrown away after the run.
Nok Lam Chan
11/08/2024, 12:05 PMinputs
, outputs
string.Thiago José Moser Poletto
11/08/2024, 12:07 PM
Nok Lam Chan
11/08/2024, 12:07 PM
Kedro uses the node and the catalog to reason about the dataset. It's not super clear in the docs; I will try to add some explanation.
Nok Lam Chan
11/08/2024, 12:08 PM
Thiago José Moser Poletto
11/08/2024, 12:09 PM
Nok Lam Chan
11/08/2024, 12:12 PM
some_data:
  ...
{some}_dataset_{abc}:
  filepath: data/{some}/dataset.pq
  layer: {abc}
{some}_dataset:
  filepath: data/{some}/dataset.pq
Extending the example, this time we have 2 dataset factories. An outputs/inputs name can match more than one factory pattern, and the more specific one wins. There is a command, kedro catalog rank, to help you understand the resolution.
For example, a dataset called kedro_dataset will match {some}_dataset, while kedro_dataset_something will match {some}_dataset_{abc}.
Nok Lam Chan
11/08/2024, 12:12 PM
It's not an f-string or a Jinja template.
Nok Lam Chan
11/08/2024, 12:13 PM
{} is just a placeholder
Nok Lam Chan
11/08/2024, 12:13 PM
For kedro_dataset_something:
some -> kedro
abc -> something
Nok Lam Chan
11/08/2024, 12:15 PM
Kedro uses parse as the underlying library, which is a bit like the reverse of an f-string. This example should help you understand more:
>>> from parse import compile
>>> p = compile("It's {}, I love it!")
>>> print(p)
<Parser "It's {}, I love it!">
>>> p.parse("It's spam, I love it!")
<Result ('spam',) {}>
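Tying this back to the earlier {some}_dataset / {some}_dataset_{abc} message, here is a small sketch of the same matching done directly with parse; Kedro's own ranking logic adds the "more specific pattern wins" rule on top of this:

from parse import parse

# Ordered roughly as kedro catalog rank would list them: most specific first.
patterns = ["{some}_dataset_{abc}", "{some}_dataset"]

for name in ["kedro_dataset", "kedro_dataset_something"]:
    for pattern in patterns:
        result = parse(pattern, name)
        if result:
            print(f"{name} -> {pattern} {result.named}")
            break

# kedro_dataset -> {some}_dataset {'some': 'kedro'}
# kedro_dataset_something -> {some}_dataset_{abc} {'some': 'kedro', 'abc': 'something'}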
Thiago José Moser Poletto
11/08/2024, 12:23 PM
Thiago José Moser Poletto
11/08/2024, 12:24 PM
Thiago José Moser Poletto
11/08/2024, 12:24 PM
Thiago José Moser Poletto
11/08/2024, 12:49 PM
Nok Lam Chan
11/08/2024, 1:04 PM
Thiago José Moser Poletto
11/08/2024, 1:07 PM
Nok Lam Chan
11/08/2024, 1:09 PM
Thiago José Moser Poletto
11/08/2024, 1:09 PM
Thiago José Moser Poletto
11/08/2024, 1:14 PM
Nok Lam Chan
11/08/2024, 1:43 PM
Thiago José Moser Poletto
11/08/2024, 1:45 PM
months = ["jan", "feb", "mar", "apr"]
# instead use:
months = catalog.load('months')  # let's say... something like that
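Since the catalog isn't available while pipelines are being assembled, a common workaround is to read such a list from conf/ at pipeline-creation time instead, e.g. with OmegaConfigLoader. A rough sketch, assuming a "months" key in parameters.yml and a hypothetical process_month node:

from kedro.config import OmegaConfigLoader
from kedro.pipeline import Pipeline, node

from .nodes import process_month  # hypothetical node function


def create_pipeline(**kwargs) -> Pipeline:
    # Read parameters straight from conf/ because the DataCatalog does not exist yet here.
    conf = OmegaConfigLoader(conf_source="conf", base_env="base", default_run_env="local")
    months = conf["parameters"]["months"]  # e.g. ["jan", "feb", "mar", "apr"]
    return Pipeline([
        node(process_month, inputs=f"sales_{m}", outputs=f"report_{m}", name=f"process_{m}")
        for m in months
    ])

This keeps the pipeline structure driven by configuration rather than by a dataset loaded at runtime, which is closer to how Kedro expects pipelines to be defined.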
Nok Lam Chan
11/08/2024, 1:49 PM
Nok Lam Chan
11/08/2024, 1:49 PM
Thiago José Moser Poletto
11/12/2024, 3:06 PM
Thiago José Moser Poletto
11/12/2024, 8:10 PM