# questions
a
How can I access data within a node? Meaning, the node/pipeline doesn't explicitly have the data as an input, but I still want to access the data within the node. The data is declared in catalog.yml. I tried looking for the catalog in the global variables while debugging.
⁉️ 1
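For context, "getting at the catalog" from arbitrary code usually means something like the sketch below (normally done in a script or notebook, not a node). The dataset name is illustrative and the session API differs slightly between Kedro versions; whether this belongs inside a node is what the rest of the thread addresses.
```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())  # run from the project root
with KedroSession.create() as session:
    # The context exposes the DataCatalog built from catalog.yml
    catalog = session.load_context().catalog
    data = catalog.load("my_dataset")  # illustrative dataset name
```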
d
Hi Afiq, why don't you want to include the data as an input?
d
the bigger question is what do you want to do?
👍 1
it’s been designed intentionally that nodes have no knowledge of IO
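In other words, a node body only ever sees already-loaded objects. A minimal sketch of that separation (make_forecast is a made-up name):
```python
import pandas as pd


def make_forecast(training_data: pd.DataFrame) -> pd.DataFrame:
    # The runner loads "training_data" from the catalog and passes the
    # resulting object in; the function itself never opens files or
    # touches the catalog.
    return training_data  # placeholder for real forecasting logic
```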
a
@Dmitry Sorokin @datajoely, good question! I currently have a Kedro node responsible for generating forecasts for a given model. Previously, each model could utilize this node: it received a training dataset and output the forecast dataset. However, a new model requires forecasts from multiple models as input, producing the forecast dataset as output. This new model disrupts the current design, yet we still want to reuse the same node. So accessing the data catalog within the node seems like a way to move forward. But yeah, likely a temporary measure for now.
d
so is it that you want conditional logic or you want to reuse the same pipeline code?
a
@datajoely As of now, we've implemented conditional logic in the node function, and we're also able to reuse the same pipeline.
d
so we have a pattern for reusing pipelines here: https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#how-to-use-a-modular-pipeline-with-different-parameters. Conditional stuff isn’t really supported. My heuristic rule is that if you’re loading the catalog inside a node, you’re probably ‘out of bounds’ within Kedro. It’s possible, and people do have success getting it to work, but you’ll be fighting some of the assumptions that Kedro was built against.
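For reference, that reuse pattern looks roughly like this (names are illustrative):
```python
from kedro.pipeline import Pipeline, node, pipeline


def make_forecast(training_data):  # stand-in for the real node function
    return training_data


base = Pipeline([node(make_forecast, inputs="training_data", outputs="forecast")])

# namespace= prefixes every dataset name, so the same node code runs
# against "model_a.training_data", "model_b.training_data", and so on,
# each instance reading and writing its own catalog entries.
model_a = pipeline(base, namespace="model_a")
model_b = pipeline(base, namespace="model_b")
```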
a
@datajoely personally, I find the challenge lies in our initial design, which was created to meet the original requirements. As new requirements emerge, it becomes difficult to determine whether we should develop a separate project entirely or adjust/add pipelines to the current one. There are also instances where requirements change over time, and the older design is no longer suitable for some requirements while still suiting others.
👍 1
d
It would be great to get a diagram / GitHub issue describing the problem you’re coming up against. I’m super keen to make Kedro work for your situation whilst balancing our commitment to simplicity, reproducibility and determinism. If you have time, it would help us make Kedro better in the future.
💯 1
i
Hey Afiq, I've had a similar requirement in the past: varying numbers of inputs from different pipelines, but wanting to use the same node. One solution I came up with: the node expects a list of inputs as its first parameter,
```python
def node_fun(list_of_inputs, **kwargs):
    # One input: keep the original behaviour; several inputs: new logic.
    # original_logic / new_logic stand in for the real implementations.
    if len(list_of_inputs) == 1:
        return original_logic(list_of_inputs[0], **kwargs)
    return new_logic(list_of_inputs, **kwargs)
```
Then what I have is a previous node which takes an arbitrary number of inputs and returns it as a list
```python
def gather_args_into_list(*args):
    """Utility function which gathers all arguments into a list. Useful to
    combine multiple Kedro node outputs into a single list.
    """
    return list(args)
```
Then you would have
```python
from kedro.pipeline import Pipeline, node

Pipeline(
    [
        node(
            func=gather_args_into_list,
            inputs=[
                "any",
                "number",
                "of",
                "inputs",
            ],
            outputs="list_of_inputs",
        ),
        node(
            func=node_fun,
            inputs="list_of_inputs",
            outputs="output",
        ),
    ]
)
```
This allows for a variable number of inputs in the first node without needing to change the pipeline structure. It's a little bit awkward, but I would strongly encourage you to avoid using the catalog to load data manually within a pipeline, as that will make maintenance SO much more annoying down the line. I speak from experience as I currently maintain various projects where we do this, and it's impossible to keep track of where data is being loaded.
🎉 4
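Applied to the earlier multi-model forecast case, the same two nodes might be wired up like this, reusing gather_args_into_list and node_fun from above (the forecast dataset names are made up):
```python
from kedro.pipeline import Pipeline, node

Pipeline(
    [
        # Collect each model's forecast into one list-valued dataset
        node(
            func=gather_args_into_list,
            inputs=["model_a.forecast", "model_b.forecast", "model_c.forecast"],
            outputs="list_of_forecasts",
        ),
        # The shared node dispatches on the number of gathered inputs
        node(
            func=node_fun,
            inputs="list_of_forecasts",
            outputs="combined_forecast",
        ),
    ]
)
```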