# questions
t
Guys, I would like to check with you if there's a simpler way to use a run_identifier in the path of a catalog entry: I'm loading a base table from BigQuery and splitting each row to run in another pipeline, where I load and save the inputs/outputs dynamically. I would like to get a value from a column and use it as the run_identifier in the catalog path: `filepath: ${root_folder}/${current_datetime}/${run_identifier}/data/model/{placeholder:name}.pt` Is there a known way to do something like that? I'm open to suggestions...
d
I think setting `run_identifier` using an env var is the easiest way to do this
t
yeah, but I need it to be updated dynamically, based on a value from a row that comes from the input...
d
oh gotcha
that's slightly difficult in Kedro as IO / logic are decoupled intentionally
it can be done but it's a bit difficult
n
So I understand that you basically want to override some parameters based on a node output. https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html

> I'm loading a base from BigQuery and splitting each row to run in another pipeline

How are you doing this?
If you are calling a separate Kedro pipeline, you can simply inject those identifiers as part of the `runtime_params`
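For illustration, a minimal sketch of that (the pipeline name and parameter key here are hypothetical):

```bash
# one run per row, injecting the identifier as a runtime parameter;
# nodes can consume it as "params:run_identifier"
kedro run --pipeline=model --params=run_identifier=row_42
```

Depending on your Kedro version, the OmegaConfigLoader can also resolve runtime params inside catalog entries, e.g. `${runtime_params:run_identifier}` in a filepath.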
t
I do have a pipeline that loads the BQ table and splits that data into rows, which are saved dynamically in the catalog. Once that orchestration pipeline is done, the model pipeline runs.
```python
from functools import reduce
from operator import add

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    # load_catalog_params and train_setup are project-specific helpers
    params = load_catalog_params()
    print("input_tables entry from parameters/general.yml:", params["input_tables"])
    all_pipelines = []
    # one training sub-pipeline per core, each with its own suffixed datasets
    for group in range(int(params["n_cores"])):
        p = Pipeline(
            [
                node(
                    func=train_setup,
                    inputs=["imagens", "masks", "split_data_input", f"train_config_params_{group}"],
                    outputs=[f"metrics_{group}", f"epoch_loss_{group}", f"validity_metrics_{group}",
                             f"model_save_{group}", f"train_params_{group}"],
                    name=f"train_setup_{group}",
                )
            ],
            tags="training",
        )
        all_pipelines.append(p)
    return reduce(add, all_pipelines)
```
n
Otherwise, I am thinking of using a namespaced pipeline + dataset factory, so the catalog looks like:
```yaml
"{run_identifier}_xxx_dataset":
  type: pandas.ParquetDataset  # a type is required; assumed here for illustration
  filepath: "{run_identifier}/some_folder/some_dataset.parquet"
```
Then in those namespaced pipelines you use the run_identifier as part of the input/output names.
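For illustration, a rough sketch of that wiring (`train_model` and the identifiers are placeholders; note that Kedro's `namespace` argument joins names with a dot, so the matching factory pattern would be `"{run_identifier}.xxx_dataset"`):

```python
from kedro.pipeline import Pipeline, node, pipeline


def train_model(data):  # placeholder node function
    ...


def create_pipeline(**kwargs) -> Pipeline:
    base = Pipeline([node(train_model, inputs="xxx_dataset", outputs="model")])
    run_identifiers = ["run_a", "run_b"]  # assumption: known when pipelines are built
    # namespace=rid renames datasets to "<rid>.xxx_dataset" / "<rid>.model",
    # which the "{run_identifier}.xxx_dataset" factory entry then matches
    return sum((pipeline(base, namespace=rid) for rid in run_identifiers), Pipeline([]))
```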
t
so it's the outputs from that train_setup where I would like to be able to add the run_identifier into the path
so that for every whole run, I'll have a folder with date/run_identifier/data..
so that I can identify which run/model output has which params behind it
n
I see, I don't have an immediate solution. This is tricky because a runtime output is defining the dataset, which was initialised well before the node runs. https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html The other way is likely using a `hook` to override the catalog during the node run.
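For illustration, a sketch of that mechanism (the dataset name, filepath, and the `ParquetDataset` import are assumptions, not something from this thread; older kedro-datasets releases spell it `ParquetDataSet`):

```python
from kedro.framework.hooks import hook_impl
from kedro_datasets.pandas import ParquetDataset


class RuntimeOutputHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # "extra_params" carries any --params overrides from the CLI
        run_id = (run_params.get("extra_params") or {}).get("run_identifier", "default")
        # register (or replace) a run-specific output before the run starts
        catalog.add(
            "model_metrics",
            ParquetDataset(filepath=f"data/{run_id}/model_metrics.pq"),
            replace=True,
        )
```

It would then be registered in `settings.py` via `HOOKS = (RuntimeOutputHooks(),)`.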
t
oh yeah I thought of that too, I'll check if it works properly and post here
thanks guys
n
Alternatively, if you can split this into two separate Kedro runs, this would be very simple. The first run generates the rows to run; the second is a set of `kedro run` invocations with `runtime_params` that read the result of the previous run, potentially from a table. The downside is that you cannot use a single `kedro run` command to do what you think of as one whole job. In an orchestrator it shouldn't be a problem, since you can treat the two runs as one job; it doesn't have to be mapped 1:1.
👍🏻 1
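For illustration, the two-stage shape could look like this in a plain shell script (pipeline names and the hand-off file are hypothetical):

```bash
# stage 1: materialise the rows that define the runs
kedro run --pipeline=generate_rows        # writes e.g. data/rows.csv

# stage 2: one kedro run per row, keyed by runtime_params
for run_id in $(cut -d, -f1 data/rows.csv | tail -n +2); do
    kedro run --pipeline=model --params=run_identifier="$run_id"
done
```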
t
I'll try that as well
Hey @Nok Lam Chan, do you have documentation on how to implement that example you gave: namespace pipeline + dataset factory?
```yaml
"{run_identifier}_xxx_dataset":
  filepath: "{run_identifier}/some_folder/some_dataset.parquet"
```
n
t
nice... I get the namespace part now, but the layer one: where is it defined? I mean, the namespace is defined in the node... and the layer?
n
What do you mean by layer?
t
sorry... haha
and is it possible to use both layer and namespace, for that matter?
n
I see, this is using both namespace and factory at the same time. For example, if you only use a factory but not a namespace, it may look like this:
```yaml
"{layer}_some_dataset":
  filepath: data/{layer}/data.pq
```
Dataset factories allow you to define multiple datasets with a single pattern, while a namespace is merely a "group" of datasets/pipelines, which unlocks more powerful features in Kedro-Viz. For Kedro itself, namespaces don't change anything; they are mainly for organising your code.
t
ok, but I just don't know where the value "layer" is coming from, sorry about that. I'm somewhat new to the more advanced Kedro concepts, so I don't know about factories
👍🏼 1
n
So for example, when you have a node without any dataset factory:
```python
node(my_func, inputs="some_data", outputs="my_fav_dataset")
```
then your catalog may look like:
```yaml
some_data:
  ...

"{some}_dataset":
  filepath: data/{some}/dataset.pq
```
With pattern matching, the `outputs` will automatically match the 2nd dataset in the catalog. Without a pattern, Kedro defaults to an in-memory dataset that is thrown away after the run.
The `{layer}` and `{dataset_name}` placeholders are pattern-matched against the `inputs` / `outputs` strings.
t
so how would a node and catalog that have a layer look? just to see if I truly got it
n
don't worry about that, I can see the confusion as it requires both `node` and `catalog` to reason about the dataset. It's not super clear in the docs; I will try to add some explanation.
❤️ 1
hmm, reusing the example above
t
ohhh so that will be the layer? or can I call it anything, as long as it follows the pattern in that mentioned rank...?
n
```yaml
some_data:
  ...

"{some}_dataset_{abc}":
  filepath: data/{some}/dataset.pq
  layer: "{abc}"

"{some}_dataset":
  filepath: data/{some}/dataset.pq
```
Extending the example, this time we have two dataset factories. An output/input can match more than one factory pattern, and the more specific one wins. There is a command, `kedro catalog rank`, to help you understand the resolution. For example, a dataset called `kedro_dataset` will match `{some}_dataset`, while `kedro_dataset_something` will match `{some}_dataset_{abc}`.
It can be called anything; it's very similar to an `f-string` or a Jinja template.
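For reference, the resolution command mentioned above:

```bash
# lists the catalog's dataset factory patterns in the order Kedro
# resolves them, from most specific to least
kedro catalog rank
```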
The name inside `{}` is just a placeholder.
With `kedro_dataset_something`: `some` -> `kedro`, `abc` -> `something`.
We use `parse` as the underlying library, which is a bit like the reverse of an `f-string`. This example should help you understand more:
```python
>>> from parse import compile
>>> p = compile("It's {}, I love it!")
>>> print(p)
<Parser "It's {}, I love it!">
>>> p.parse("It's spam, I love it!")
<Result ('spam',) {}>
```
t
I see now, I believe I'll be able to solve the issue I'm having with these features.
👍🏼 1
I'll use a for loop inside the node to extract the information I need and use it as the layer
to segment the path
would it be possible to access data from an input in the pipeline node definition?
n
Added some docs here: https://github.com/kedro-org/kedro/pull/4308 The build is stuck for some reason but you can review this temporarily with this link: https://5500-kedroorg-kedro-ov91qdu83us.ws-us116.gitpod.io/docs/build/html/data/kedro_dataset_factories.html
t
the second link doesn't work, 401
n
try again?
t
it worked now
but like in that example you shared, where you create a list of months that you use as namespaces to save the outputs: would it be possible to access a value from the catalog instead of creating that list?
n
Do you have the value already before you create the nodes?
t
yeah, they are created in another pipeline; basically it would be accessing a catalog entry inside the pipeline.py file... like you did there with the months, but accessing an entry...
```python
months = ["jan", "feb", "mar", "apr"]
# instead use something like:
months = catalog.load("months")  # let's say... something like that
```
n
It is possible, though not the most conventional way: https://github.com/kedro-org/kedro/issues/2627#issuecomment-1691596460
👍🏻 1
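For illustration, following the linked issue, something along these lines can work (the exact session API varies across Kedro versions; `months` is the hypothetical catalog entry):

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def load_months() -> list:
    # bootstrap the project so its config can be found from pipeline code
    bootstrap_project(Path.cwd())
    with KedroSession.create(project_path=Path.cwd()) as session:
        context = session.load_context()
        return context.catalog.load("months")
```

The caveat, and why it's unconventional: pipeline creation now depends on the project config being loadable at import time, which can be fragile.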
t
Hey @Nok Lam Chan, I do have a param in conf/base/parameters/test.yml: `group_id: null`. I'm trying to update it during a pipeline run with catalog.save(); is that possible?
I managed to work around what I needed; it's just the namespace that I'm not getting to work for some reason.