Arnout Verboven
04/25/2025, 1:07 AM
node
Post
Deepyaman Datta
04/25/2025, 1:18 AM
node.inputs gets called for each node, and if caching could help on that side. Or, would there be any reason not to just convert inputs in the Node.__init__() method?
Deepyaman Datta
04/25/2025, 1:20 AM
Node.__init__() and measuring the difference in time, and also making a PR with this potential enhancement (if there's a substantial perf improvement) and making sure it doesn't break things... I would be very happy to take a look?
Chris Schopp
04/25/2025, 2:06 AM
<PipelineWarningContent
  isVisible={visible}
  title="Whoa, that's a chonky pipeline!"
Kacper Leśniara
04/25/2025, 8:20 AM
Nok Lam Chan
04/25/2025, 9:04 AM
Nok Lam Chan
04/25/2025, 9:06 AM
Kacper Leśniara
04/25/2025, 9:08 AM
Nok Lam Chan
04/25/2025, 9:15 AM
Nok Lam Chan
04/25/2025, 9:17 AM
Nok Lam Chan
04/25/2025, 9:18 AM
Arnout Verboven
04/25/2025, 9:24 AM
_dict_inputs_to_list (102s)
• With caching: ~2k calls to _dict_inputs_to_list (0.12s)
I've made a quick PR but couldn't get the tests to run, so maybe someone else could take this instead.
Arnout Verboven
04/25/2025, 9:25 AM
Nok Lam Chan
04/25/2025, 9:43 AM
_dict_inputs_to_list
4.36 μs ± 37.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
The operation is quite fast, so I am surprised it takes 100+ seconds.
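The per-loop figure above has the shape of IPython `%timeit` output. A rough way to take a comparable measurement with the standard-library `timeit` module is sketched below. `my_func` and `bind_inputs` are hypothetical stand-ins written for this example, not Kedro's actual `_dict_inputs_to_list`, so the absolute numbers will differ from the ones quoted in the thread.

```python
import inspect
import timeit


def my_func(a, b, group=None):
    """Hypothetical node function, used only for this measurement."""
    return a, b, group


def bind_inputs(func, inputs):
    """Stand-in for a dict-to-list conversion: bind the dict keys to the
    function signature and return the dataset names in parameter order."""
    bound = inspect.signature(func).bind(**inputs)
    return [*bound.args, *bound.kwargs.values()]


inputs = {"b": "dataset_b", "a": "dataset_a"}
n = 10_000

# Total wall time for n calls, converted to microseconds per call.
seconds = timeit.timeit(lambda: bind_inputs(my_func, inputs), number=n)
per_call_us = seconds / n * 1e6
print(f"{per_call_us:.2f} µs per call over {n} calls")
```

Multiplying the per-call time by the number of calls actually made in a run gives a quick sanity check on whether the conversion alone can account for the total runtime.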
Arnout Verboven
04/25/2025, 9:47 AM
update_wrapper(partial(my_func, ...), my_func) to provide non-Kedro arguments, which might make the inspect.signature slower? But also keep in mind _dict_inputs_to_list was run 1.5M times.
Juan Luis
04/25/2025, 10:17 AM
Juan Luis
04/25/2025, 10:18 AM
Juan Luis
04/25/2025, 10:21 AM
_dict_inputs_to_list runtime is inspect.signature(...).bind. I have some ideas on the caching though
Artur Dobrogowski
04/25/2025, 11:16 AM
Juan Luis
04/25/2025, 12:04 PM
Arnout Verboven
04/25/2025, 2:05 PM
# Option A
def process(df):
    ...
    return df
groups = ["X", "Y", "Z"]
for group in groups:  # Process each group separately
    node(  # Filter
        func=partial(<filter>, group=group),
        inputs="df",
        outputs=f"{group}.df",
    )
    node(  # Process
        func=process,
        inputs=f"{group}.df",
        outputs=f"{group}.df_processed",
    )
node(  # Combine the groups
    func=<concat>,
    inputs=[f"{group}.df_processed" for group in groups],
    outputs="df_processed",
)
# Option B
def process_groups(df):
    for group in df["group"].unique():  # or using groupby
        df.loc[df["group"] == group, ...
    return df
node(
    func=process_groups,
    inputs="df",
    outputs="df_processed",
)
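To make the size difference between the two options concrete, here is a minimal sketch that builds the Option A layout with plain Python objects and counts the resulting nodes. `NodeSpec`, `filter_group`, `concat` and `build_option_a` are hypothetical names introduced for this illustration; they stand in for Kedro's `node(...)` and for the `<filter>` and `<concat>` placeholders above, which the thread does not spell out.

```python
from functools import partial
from typing import Callable, List, NamedTuple, Union


class NodeSpec(NamedTuple):
    """Hypothetical stand-in for a pipeline node (not Kedro's Node)."""
    func: Callable
    inputs: Union[str, List[str]]
    outputs: str


def filter_group(rows, group):
    # Assumed filter: keep only the rows that belong to one group.
    return [row for row in rows if row["group"] == group]


def process(rows):
    return rows


def concat(*parts):
    # Assumed concat: flatten the per-group results back into one list.
    return [row for part in parts for row in part]


def build_option_a(groups):
    """Two nodes per group (filter, process) plus one final combine node."""
    specs = []
    for group in groups:
        specs.append(NodeSpec(partial(filter_group, group=group), "df", f"{group}.df"))
        specs.append(NodeSpec(process, f"{group}.df", f"{group}.df_processed"))
    specs.append(
        NodeSpec(concat, [f"{g}.df_processed" for g in groups], "df_processed")
    )
    return specs


option_a = build_option_a(["X", "Y", "Z"])
print(len(option_a))  # 2 * 3 + 1 nodes
```

Under these assumptions Option A grows as 2N + 1 nodes for N groups, while Option B stays at a single node regardless of N. Any per-node overhead at pipeline construction time scales accordingly.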
Deepyaman Datta
04/25/2025, 8:10 PM
"I've made a quick PR but couldn't get the tests to run, so maybe someone else could take this instead"
I just left a comment on there, but happy to try taking it over (or somebody else on the team can) if you'd prefer.
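For readers following the thread, the caching idea under discussion can be sketched roughly as follows. This is not the code from the PR mentioned above, which is not shown in the thread. `SketchNode` is a hypothetical class written for this example, and `functools.cached_property` is just one possible way to compute the input list once per node.

```python
import inspect
from functools import cached_property


class SketchNode:
    """Hypothetical stand-in for a pipeline node; not Kedro's Node class."""

    def __init__(self, func, inputs):
        self._func = func
        self._inputs = inputs
        self.conversions = 0  # counts how often the costly conversion runs

    def _inputs_to_list(self):
        # The expensive step named in the thread: binding dict inputs
        # against the function signature to get them in parameter order.
        self.conversions += 1
        if isinstance(self._inputs, dict):
            bound = inspect.signature(self._func).bind(**self._inputs)
            return [*bound.args, *bound.kwargs.values()]
        if isinstance(self._inputs, str):
            return [self._inputs]
        return list(self._inputs or [])

    @cached_property
    def inputs(self):
        # Computed on first access, then stored on the instance.
        return self._inputs_to_list()


def join(a, b):
    return f"{a}-{b}"


sketch = SketchNode(join, {"b": "dataset_b", "a": "dataset_a"})
for _ in range(1_000):
    sketch.inputs  # repeated access reuses the cached list

print(sketch.inputs, sketch.conversions)
```

One design caveat with this approach: the cached value is a mutable list shared by every caller, so code that modifies the returned list would also change what later accesses see. Returning a copy or a tuple avoids that at a small cost.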
Kacper Leśniara
05/09/2025, 8:22 AM
Juan Luis
05/09/2025, 10:04 AM