Hey team, I process several partitioned data sets ...
# questions
a
Hey team, I process several partitioned datasets in their own namespaces and combine the results. I want to add an origin column to my data so that I can trace each data entry back to its source. For example, data from partset1 should get the string "partset1" in a column "origin" in my pandas data frame. However, I cannot find a way to access the name of the dataset currently being processed (or the name of the namespace) from within a node. Is there an elegant way to do this?
h
Someone will reply to you shortly. In the meantime, this might help:
y
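A partitioned dataset loads into your node as a dict mapping partition IDs (file paths relative to the dataset's base path) to load functions, one per partition. So, inside the node, you can do:
for partition_id, load_partition in partitioned_dataset.items():
    df = load_partition()  # call the load function to get the dataframe
    df["origin"] = partition_id
Collect the dataframes as you go and then concatenate them.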
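A minimal sketch of the tag-and-concatenate step, assuming each value in the dictionary is a load function that returns a pandas DataFrame, which is how Kedro's PartitionedDataset hands partitions to a node. The function name add_origin_and_concat is made up for illustration.

```python
from typing import Callable, Dict

import pandas as pd


def add_origin_and_concat(
    partitions: Dict[str, Callable[[], pd.DataFrame]],
) -> pd.DataFrame:
    """Load every partition, tag its rows with the partition ID, and combine them."""
    frames = []
    # Sort by partition ID so the row order of the result is deterministic.
    for partition_id, load_partition in sorted(partitions.items()):
        df = load_partition()  # partitions are loaded lazily, one at a time
        frames.append(df.assign(origin=partition_id))  # assign returns a copy
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Note that the partition ID only identifies the file inside the dataset, so this does not by itself tell you which catalog entry or namespace the data came from.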
a
Unfortunately, the file name is not enough for me to identify the data. I need the parent directory, the namespace, or the name of the dataset as stored in the catalog.
m
@Andreas Postel You could use the dataset-specific hooks to access the dataset_name: https://docs.kedro.org/en/stable/api/kedro.framework.hooks.specs.DatasetSpecs.html#kedro.framework.hooks.specs.DatasetSpecs This is not from within the node, though. The namespace is only accessible in the pipeline-level hooks.
Oh, you can actually also use the node-specific hooks, because then you can access the namespace from node.namespace: https://docs.kedro.org/en/stable/api/kedro.framework.hooks.specs.NodeSpecs.html
a
@Merel Could you please explain that in more detail? I managed to read the namespace and print it on the command line, directly from the hook class. But how do I access the namespace from within a node?
r
Hi @Andreas Postel, sorry for the delay in response. The namespace of a node is not directly accessible within the function executed by that node because the namespace is a property of the pipeline's configuration, not an argument passed to the node function. However, you can achieve this by injecting the namespace as an input in a before_node_run hook (i.e., modifying node inputs mentioned here). Thank you
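One way this could look, as a rough sketch rather than a confirmed solution from the thread: a before_node_run hook that returns a dictionary, which Kedro merges into the node's inputs. It assumes the node already declares a placeholder input whose name ends in "origin" (for example a parameter), because the hook overwrites existing inputs. The names ORIGIN_INPUT, NamespaceInjectionHooks, and tag_with_origin are hypothetical, and the ImportError fallback exists only so the sketch runs without Kedro installed.

```python
from typing import Any, Dict, Optional

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch can be run without Kedro
    def hook_impl(func):
        return func

# Hypothetical placeholder input that every namespaced node declares,
# e.g. "params:origin", which becomes "params:partset1.origin" once namespaced.
ORIGIN_INPUT = "origin"


class NamespaceInjectionHooks:
    """Overwrite the placeholder input with the namespace of the running node."""

    @hook_impl
    def before_node_run(
        self, node: Any, inputs: Dict[str, Any]
    ) -> Optional[Dict[str, Any]]:
        if not node.namespace:
            return None  # nothing to inject for nodes without a namespace
        overrides = {
            name: node.namespace
            for name in inputs
            # Strip the namespace prefix and an optional "params:" prefix.
            if name.rsplit(".", 1)[-1].removeprefix("params:") == ORIGIN_INPUT
        }
        return overrides or None


def tag_with_origin(df, origin: str):
    """Example node function: receives the injected namespace as `origin`."""
    return df.assign(origin=origin)
```

The hook class would be registered in the project's settings.py, e.g. HOOKS = (NamespaceInjectionHooks(),). The placeholder also has to be loadable before the hook runs, so it needs a default value in the parameters or the catalog.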
a
Hi @Ravi Kumar Pilla, I was able to solve it thanks to the reference in your last message. Thank you!