# questions
a
How can I access data within a node? Meaning, the node/pipeline doesn't explicitly have the data as an input, but I still want to access the data within the node. The data is declared in catalog.yml. I tried looking for the catalog in the global variables while debugging.
⁉️ 1
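For context, "getting at the catalog" from arbitrary code usually means something like the sketch below (normally done in a script or notebook, not a node). The dataset name is illustrative and the session API differs slightly between Kedro versions; whether this belongs inside a node is what the rest of the thread addresses.
```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())  # run from the project root
with KedroSession.create() as session:
    # The context exposes the DataCatalog built from catalog.yml
    catalog = session.load_context().catalog
    data = catalog.load("my_dataset")  # illustrative dataset name
```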
d
Hi Afiq, why don't you want to include the data as an input?
d
the bigger question is what do you want to do?
👍 1
it’s been designed intentionally that nodes have no knowledge of IO
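In other words, a node body only ever sees already-loaded objects. A minimal sketch of that separation (make_forecast is a made-up name):
```python
import pandas as pd


def make_forecast(training_data: pd.DataFrame) -> pd.DataFrame:
    # The runner loads "training_data" from the catalog and passes the
    # resulting object in; the function itself never opens files or
    # touches the catalog.
    return training_data  # placeholder for real forecasting logic
```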
a
@Dmitry Sorokin @datajoely, good question! I currently have a Kedro node responsible for generating forecasts for a given model. Previously, each model could utilize this node: it received a training dataset and output the forecast dataset. However, a new model requires forecasts from multiple models as input, producing the forecast dataset as output. This new model disrupts the current design, yet we still want to reuse the same node. So accessing the data catalog within the node seems like a way to move forward. But yeah, likely a temporary measure for now.
d
so is it that you want conditional logic or you want to reuse the same pipeline code?
a
@datajoely As of now, we've implemented conditional logic in the node function, and we're also able to reuse the same pipeline.
d
so we have a pattern for reusing pipelines here: https://docs.kedro.org/en/stable/nodes_and_pipelines/modular_pipelines.html#how-to-use-a-modular-pipeline-with-different-parameters. Conditional stuff isn’t really supported. My heuristic rule is that if you’re loading the catalog inside a node, you’re probably ‘out of bounds’ within Kedro. It’s possible, and people do have success getting it to work, but you’ll be fighting some of the assumptions that Kedro was built against.
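For reference, that reuse pattern looks roughly like this (names are illustrative):
```python
from kedro.pipeline import Pipeline, node, pipeline


def make_forecast(training_data):  # stand-in for the real node function
    return training_data


base = Pipeline([node(make_forecast, inputs="training_data", outputs="forecast")])

# namespace= prefixes every dataset name, so the same node code runs
# against "model_a.training_data", "model_b.training_data", and so on,
# each instance reading and writing its own catalog entries.
model_a = pipeline(base, namespace="model_a")
model_b = pipeline(base, namespace="model_b")
```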
a
@datajoely personally, I find the challenge lies in our initial design, which was created to meet the original requirements. As new requirements emerge, it becomes difficult to determine whether we should develop a separate project entirely or adjust/add pipelines to the current one. There are also instances where requirements change over time, and the older design is no longer suitable for some requirements while still suiting others.
👍 1
d
It would be great to get a diagram / GitHub issue describing the problem you’re coming up against. I’m super keen to make Kedro work for your situation whilst balancing our commitment to simplicity, reproducibility and determinism. If you have time, it would help us make Kedro better in the future.
💯 1
i
Hey Afiq, I've had a similar requirement in the past: varying numbers of inputs from different pipelines, but wanting to use the same node. One solution I came up with: the node expects a list of inputs as its first parameter,
```python
def node_fun(list_of_inputs, **kwargs):
    # One input: keep the original behaviour; several inputs: new logic.
    # original_logic / new_logic stand in for the real implementations.
    if len(list_of_inputs) == 1:
        return original_logic(list_of_inputs[0], **kwargs)
    return new_logic(list_of_inputs, **kwargs)
```
Then what I have is a previous node which takes an arbitrary number of inputs and returns it as a list
```python
def gather_args_into_list(*args):
    """Utility function which gathers all arguments into a list. Useful to
    combine multiple Kedro node outputs into a single list.
    """
    return list(args)
```
Then you would have
```python
from kedro.pipeline import Pipeline, node

Pipeline(
    [
        node(
            func=gather_args_into_list,
            inputs=[
                "any",
                "number",
                "of",
                "inputs",
            ],
            outputs="list_of_inputs",
        ),
        node(
            func=node_fun,
            inputs="list_of_inputs",
            outputs="output",
        ),
    ]
)
```
This allows for a variable number of inputs in the first node without needing to change the pipeline structure. It's a little bit awkward, but I would strongly encourage you to avoid using the catalog to load data manually within a pipeline, as that will make maintenance SO much more annoying down the line. I speak from experience as I currently maintain various projects where we do this, and it's impossible to keep track of where data is being loaded.
🎉 4
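Applied to the earlier multi-model forecast case, the same two nodes might be wired up like this, reusing gather_args_into_list and node_fun from above (the forecast dataset names are made up):
```python
from kedro.pipeline import Pipeline, node

Pipeline(
    [
        # Collect each model's forecast into one list-valued dataset
        node(
            func=gather_args_into_list,
            inputs=["model_a.forecast", "model_b.forecast", "model_c.forecast"],
            outputs="list_of_forecasts",
        ),
        # The shared node dispatches on the number of gathered inputs
        node(
            func=node_fun,
            inputs="list_of_forecasts",
            outputs="combined_forecast",
        ),
    ]
)
```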