# questions
g
I have a question about MemoryDataset's default copy method. I noticed that if the data is a pandas DataFrame or a NumPy array, a copy rather than an assignment (i.e. making a reference) is used by default. I'm wondering what the rationale for that is: making a reference is often cheaper at runtime than making either a shallow or a deep copy. Why isn't assignment the default? https://docs.kedro.org/en/stable/_modules/kedro/io/memory_dataset.html#MemoryDataset
👀 1
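For context, a minimal plain-pandas sketch of what the three copy_mode values (assign, copy, deepcopy) mean for in-memory data; this only illustrates the semantics, it is not Kedro's actual implementation:

import copy

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

assigned = df             # "assign": same object, zero copy cost, shared mutable state
copied = df.copy()        # "copy": calls the object's own .copy(); for pandas this duplicates the data
deep = copy.deepcopy(df)  # "deepcopy": a fully independent object graph

df.loc[0, "a"] = 99       # mutate the original in place
assert assigned.loc[0, "a"] == 99   # the plain reference sees the change
assert copied.loc[0, "a"] == 1      # the copies do not
assert deep.loc[0, "a"] == 1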
d
The default goal is to preserve the same behavior, whether somebody uses a MemoryDataset or, say, pandas.ParquetDataset. It would be confusing if your pipeline started behaving differently based on how you configured your catalog.
g
@Deepyaman Datta It does make sense to me to have the same default behaviour where possible, but I think I am missing some premises to fully understand the default in MemoryDataset. Why isn't assignment the default regardless of dataset type?
d
Without a copy, pandas assignments can be unsafe: https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy This can't really happen with Spark, Polars, Ibis, etc.
🙌 1
thankyou 1
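To make the hazard concrete, a minimal sketch with two hypothetical node functions (not Kedro API): if both receive the same DataFrame object by reference and one mutates it in place, the other silently sees the modified data:

import pandas as pd

def double_column(df: pd.DataFrame) -> pd.DataFrame:
    # In-place mutation: with copy_mode="assign" this would leak to every other consumer.
    df["a"] = df["a"] * 2
    return df

def total(df: pd.DataFrame) -> int:
    return int(df["a"].sum())

shared = pd.DataFrame({"a": [1, 2, 3]})  # imagine this object handed to both nodes by reference
double_column(shared)
print(total(shared))  # 12, not the 6 you would get if total() had received its own copy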
g
@Deepyaman Datta Ah, thank you! I had forgotten about Pandas' approach. Thank you for sharing that documentation.
@Deepyaman Datta I have a follow-up question if you have time. To control this assign/copy/deepcopy behaviour in my Kedro project, what is the conventional way to do it? Should I make a Kedro catalog entry with MemoryDataset as the dataset type?
d
Yep, sounds good! That said, Kedro explicitly tries to separate data transformation logic from I/O. You should probably document it clearly if you want to do this, so that somebody doesn't come along later, swap in a different dataset, and things behave weirdly.
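For the Python side, a minimal sketch assuming kedro>=0.19 (where the class is named MemoryDataset); the catalog-entry form just sets type: MemoryDataset and copy_mode: assign, like the factory example further down:

import pandas as pd

from kedro.io import MemoryDataset

df = pd.DataFrame({"a": [1, 2, 3]})

# Hand the object through by reference instead of copying it on save/load.
dataset = MemoryDataset(data=df, copy_mode="assign")
assert dataset.load() is df  # "assign" hands back the very same object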
g
@Deepyaman Datta Great! Thank you for answering my questions about this topic. 🙂
y
One additional reason + one comment:
• Kedro pipelines used to be sorted non-deterministically, and a pandas DataFrame could be modified by different nodes. Running the same pipeline twice with the exact same configuration could lead to different results 🤯 The order is now deterministic, but @Deepyaman Datta's reason is still valid.
• You can change the default behaviour with a dataset factory in your catalog:
"{default}":
    type: MemoryDataset  # resolves to kedro.io.MemoryDataset
    copy_mode: assign
👍 2