# questions
s
Hey channel, our Kedro pipeline is throwing the following error. Has anyone seen this error before? PS: the same code used to run fine before, but now it fails with this.
i
What type is the dataset that causes this error? This kind of pickle error usually comes from things like database connections. But it's odd that it's failing to be saved in a MemoryDataset.
s
It's a nested dictionary with model objects.
y
It's because you are trying to deepcopy a non-pickleable object (my bet would be some weird TensorFlow object).
Try this:
my_data:
    type: MemoryDataset
    copy_mode: assign
🌠 1
s
I'm not sure I completely understand this. What are we trying to do?
y
Put this entry in your catalog, where ``my_data`` is the output of your pipeline that raises the error.
By default, if you don't specify anything, Kedro deepcopies the object to store it in memory, and this raises an error on non-pickleable objects. Here, ``copy_mode: assign`` means "store in memory without deepcopying".
👍 2
💯 1
s
Okay, let me try this. Thank you
It worked, thank you 🙂
👍 2
m
Why do we even need to deepcopy by default? I mean, if it’s causing errors for some datasets, isn’t it better to change the default?
y
The rationale is that different nodes can access the same input dataset. If one modifies the data in place, the second node is impacted, and that's impossible to debug at the node level: it's a pipeline-level bug. This was even more prevalent when the node order was not deterministic (I think we switched away from ``toposort`` for that reason), because the exact same pipeline could trigger the bug randomly depending on how the node order was resolved.
The current default can lead to subtle bugs too, though, like this one, or running out of RAM when deepcopying a large DataFrame.
But changing the default would likely introduce very subtle bugs as well, with concurrent access.
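If it helps, here's a tiny sketch (plain Python, no Kedro; node_a/node_b are made-up stand-ins for pipeline nodes) of the kind of bug the default protects against:
import copy

def node_a(data):
    data["rows"].append("extra")  # mutates its input in place
    return len(data["rows"])

def node_b(data):
    return len(data["rows"])

shared = {"rows": ["r1", "r2"]}

# copy_mode: assign -> both nodes see the same object
print(node_a(shared))  # 3
print(node_b(shared))  # 3, silently corrupted by node_a

# copy_mode: deepcopy (the default) -> each node gets its own copy
shared = {"rows": ["r1", "r2"]}
print(node_a(copy.deepcopy(shared)))  # 3
print(node_b(copy.deepcopy(shared)))  # 2, unaffected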
👍 2
m
And is there a way to enforce that nodes don't modify data in place?
y
It's very unlikely we can (after all, we don't control anything inside a node; users can run any Python code), and unlikely we'd want to (it would prevent some performance optimisations and may have unintended side effects).
That said, if you want to change the default in your own projects, you can do it easily.
Use a dataset factory:
"{default_dataset}":
    type: MemoryDataset
    copy_mode: assign
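And if you want to check what ``assign`` actually does, a quick sketch against the ``MemoryDataset`` API (class name as in Kedro >= 0.19; older versions spell it ``MemoryDataSet``):
from kedro.io import MemoryDataset

obj = {"model": object()}  # stand-in for a nested dict of model objects

# assign: stored by reference, nothing is copied on save or load
ds = MemoryDataset(copy_mode="assign")
ds.save(obj)
assert ds.load() is obj

# default: a dict gets deepcopied, so load() returns a distinct object
ds_default = MemoryDataset()
ds_default.save(obj)
assert ds_default.load() is not obj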
👍 1