# questions
s
Hey channel, our Kedro pipeline is throwing the following error. Has anyone seen this error before? PS: the same code used to run fine before, but now it fails with this.
i
What type is the dataset that causes this error? This kind of pickle error usually comes from things like database connections. But it's odd that it's failing to be saved in a MemoryDataset.
s
It's a nested dictionary with model objects.
y
It's because you are trying to deepcopy a non-pickleable object (my bet would be some weird TensorFlow object).
Try this:
my_data:
    type: MemoryDataset
    copy_mode: assign
🌠 1
s
I'm not sure I completely understand this. What are we trying to do?
y
Put this entry in your catalog, where ``my_data`` is the output of your pipeline that raises the error.
By default, if you don't specify anything, Kedro deepcopies the object to store it in memory, and this raises an error on non-pickleable objects. Here, ``copy_mode: assign`` means "store in memory without deepcopying".
👍 2
💯 1
s
Okay, let me try this. Thank you
It worked, thank you 🙂
👍 2
m
Why do we even need to deepcopy by default? I mean, if it’s causing errors for some datasets, isn’t it better to change the default?
y
The rationale is that different nodes can access the same input dataset. If one modifies the data in place, the second node is impacted, and that's impossible to debug at the node level: it's a pipeline-level bug. This was even more prevalent when the node order was not deterministic (I think we switched away from ``toposort`` for that reason), because the exact same pipeline could trigger the bug randomly depending on how the node order was resolved.
The current default can lead to subtle bugs too, though, like this one, or running out of RAM when deepcopying a large DataFrame.
But changing the default would likely introduce very subtle bugs as well, with concurrent access.
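If it helps, here's a tiny sketch (plain Python, no Kedro; node_a/node_b are made-up stand-ins for pipeline nodes) of the kind of bug the default protects against:
import copy

def node_a(data):
    data["rows"].append("extra")  # mutates its input in place
    return len(data["rows"])

def node_b(data):
    return len(data["rows"])

shared = {"rows": ["r1", "r2"]}

# copy_mode: assign -> both nodes see the same object
print(node_a(shared))  # 3
print(node_b(shared))  # 3, silently corrupted by node_a

# copy_mode: deepcopy (the default) -> each node gets its own copy
shared = {"rows": ["r1", "r2"]}
print(node_a(copy.deepcopy(shared)))  # 3
print(node_b(copy.deepcopy(shared)))  # 2, unaffected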
👍 2
m
And is there a way to enforce that nodes don't modify data in place?
y
It's very unlikely we can (after all, we don't control anything inside a node; users can run any Python code), and unlikely we'd want to (it would prevent some performance optimisations and may have unintended side effects).
That said, if you want to change the default in your own projects, you can do it easily.
Use a dataset factory:
"{default_dataset}":
    type: MemoryDataset
    copy_mode: assign
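And if you want to check what ``assign`` actually does, a quick sketch against the ``MemoryDataset`` API (class name as in Kedro >= 0.19; older versions spell it ``MemoryDataSet``):
from kedro.io import MemoryDataset

obj = {"model": object()}  # stand-in for a nested dict of model objects

# assign: stored by reference, nothing is copied on save or load
ds = MemoryDataset(copy_mode="assign")
ds.save(obj)
assert ds.load() is obj

# default: a dict gets deepcopied, so load() returns a distinct object
ds_default = MemoryDataset()
ds_default.save(obj)
assert ds_default.load() is not obj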
👍 1