# questions
g
I have a question about MemoryDataset's default copy method. I noticed that if the data is a pandas DataFrame or a NumPy array, a copy rather than an assignment (i.e. making a reference) is used by default. I'm wondering what the rationale for that is: making a reference is often cheaper at runtime than making either a shallow or a deep copy. Why isn't assignment the default? https://docs.kedro.org/en/stable/_modules/kedro/io/memory_dataset.html#MemoryDataset
👀 1
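For context, a minimal plain-pandas sketch of what the three copy_mode values (assign, copy, deepcopy) mean for in-memory data; this only illustrates the semantics, it is not Kedro's actual implementation:

import copy

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

assigned = df             # "assign": same object, zero copy cost, shared mutable state
copied = df.copy()        # "copy": calls the object's own .copy(); for pandas this duplicates the data
deep = copy.deepcopy(df)  # "deepcopy": a fully independent object graph

df.loc[0, "a"] = 99       # mutate the original in place
assert assigned.loc[0, "a"] == 99   # the plain reference sees the change
assert copied.loc[0, "a"] == 1      # the copies do not
assert deep.loc[0, "a"] == 1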
d
The default goal is to preserve the same behavior, whether somebody uses a MemoryDataset or, say, pandas.ParquetDataset. It would be confusing if your pipeline started behaving differently based on how you configured your catalog.
g
@Deepyaman Datta It does make sense to me to have the same default behaviour where possible, but I think I am missing some premises to fully understand the default in MemoryDataset. Why isn't assignment the default regardless of dataset type?
d
Without a copy, pandas assignments can be unsafe: https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy This can't really happen with Spark, Polars, Ibis, etc.
🙌 1
thankyou 1
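To make the hazard concrete, a minimal sketch with two hypothetical node functions (not Kedro API): if both receive the same DataFrame object by reference and one mutates it in place, the other silently sees the modified data:

import pandas as pd

def double_column(df: pd.DataFrame) -> pd.DataFrame:
    # In-place mutation: with copy_mode="assign" this would leak to every other consumer.
    df["a"] = df["a"] * 2
    return df

def total(df: pd.DataFrame) -> int:
    return int(df["a"].sum())

shared = pd.DataFrame({"a": [1, 2, 3]})  # imagine this object handed to both nodes by reference
double_column(shared)
print(total(shared))  # 12, not the 6 you would get if total() had received its own copy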
g
@Deepyaman Datta Ah, thank you! I had forgotten about Pandas' approach. Thank you for sharing that documentation.
@Deepyaman Datta I have a follow-up question if you have time. To control this assign/copy/deepcopy behaviour in my Kedro project, what is the conventional way to do it? Should I make a Kedro catalog entry with MemoryDataset as the dataset type?
d
Yep, sounds good! That said, Kedro explicitly tries to separate data transformation logic from I/O. You should probably document it clearly if you want to do this, so that somebody doesn't come along later, swap in a different dataset, and things behave weirdly.
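For the Python side, a minimal sketch assuming kedro>=0.19 (where the class is named MemoryDataset); the catalog-entry form just sets type: MemoryDataset and copy_mode: assign, like the factory example further down:

import pandas as pd

from kedro.io import MemoryDataset

df = pd.DataFrame({"a": [1, 2, 3]})

# Hand the object through by reference instead of copying it on save/load.
dataset = MemoryDataset(data=df, copy_mode="assign")
assert dataset.load() is df  # "assign" hands back the very same object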
g
@Deepyaman Datta Great! Thank you for answering my questions about this topic. 🙂
y
One additional reason + one comment:
• Kedro pipelines used to be sorted non-deterministically, and a pandas DataFrame could be modified by different nodes. Running the same pipeline twice with the exact same configuration could lead to different results 🤯 The order is now deterministic, but @Deepyaman Datta's reason is still valid.
• You can change the default behaviour with a dataset factory in your catalog:
"{default}":
    type: MemoryDataset  # resolves to kedro.io.MemoryDataset
    copy_mode: assign
👍 2