Is it possible to set a default `copy mode` for `MemoryDatas Kedro #questions

Is it possible to set a default `copy_mode` for `M...

Deepyaman Datta

01/22/2024, 5:03 PM

Is it possible to set a default

copy_mode

for

MemoryDataset

currently? If not, would it make sense to support this, or would it make sense to replace the default dataset?

K 1

Iñigo Hidalgo

01/22/2024, 5:08 PM

I have asked this precise thing before, because I was having issues when passing Ibis tables between nodes as they are not serializable

Iñigo Hidalgo

01/22/2024, 5:09 PM

https://linen-slack.kedro.org/t/12319671/hi-i-would-like-to-set-a-default-copy-mode-for-datasets-of-a

Deepyaman Datta

01/22/2024, 5:15 PM

Yeah, I'm also having this question w.r.t. Ibis tables. 😄

Deepyaman Datta

01/22/2024, 5:16 PM

OK, I will take a look at this issue, but happy for any other thoughts on how to potentially resolve this.

Iñigo Hidalgo

01/22/2024, 5:17 PM

The suggestion to use dataset factories seemed good. As I’m still on an earlier kedro version I haven’t had the chance to use them yet, but I do think they should solve the issue, with the downside of forcing you to follow a specific naming pattern for the intermediate ibis tables which should be set to assign.

Iñigo Hidalgo

01/22/2024, 5:18 PM

The

_infer_copy_mode

method struck me as quite an odd implementation too, with a specific handling case for pandas and numpy, and then always just defaulting to deepcopy. It does seem like something that should be user configurable if wanted

Iñigo Hidalgo

01/22/2024, 5:19 PM

I understand it was developed in the time when pandas was the only real dataframe library, but looking at it now it seems a bit outdated

Iñigo Hidalgo

01/22/2024, 5:41 PM

some sort of mapping of types to default dataset type passed in settings.py or something along those lines

Nok Lam Chan

01/22/2024, 5:46 PM

Are you able to set

copy_mode

on catalog.yml?

Iñigo Hidalgo

01/22/2024, 5:46 PM

when i tested it that fixed it, but at least in my case i was generating a lot of intermediate datasets which i did not want to have to specify in the catalog

Nok Lam Chan

01/22/2024, 5:47 PM

If so, I think you can settle this with dataset factory pattern. basically

Copy code

# catalog.yml

{_catch_all_}:
  type: MemoryDataset
  copy_mode: "infer"

Nok Lam Chan

01/22/2024, 5:47 PM

let me look up the latest change - I am not too up to date with this

Nok Lam Chan

01/22/2024, 5:48 PM

the idea is override the default dataset with extra configuration if it makes sense

Deepyaman Datta

01/22/2024, 5:48 PM

Yeah, I just found this: https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html#how-to-override-the-default-dataset-creation-with-dataset-factories And was going to say the same thing @Nok Lam Chan just said. I guess I would do

copy_mode: "assign"

Deepyaman Datta

01/22/2024, 5:49 PM

It's not 100% ideal since that would force it for pandas datasets, etc., too; but this would be good enough for me for now, until can overwrite

_infer_copy_mode

or make it more flexible.

Iñigo Hidalgo

01/22/2024, 5:49 PM

then i guess the next step of subclassing memorydataset and modifying that method is trivial

Iñigo Hidalgo

01/22/2024, 5:49 PM

thanks @Nok Lam Chan

Nok Lam Chan

01/22/2024, 5:50 PM

Yeah I think you can combine both solutions with Dataset factory + CustomMemoryDataset

Deepyaman Datta

01/22/2024, 5:51 PM

Subclassing is still a bit ugly; I agree there's value to making the default special-cased around pandas/numpy/Spark and having the possibility of treating something else with

DataFrame

type incorrectly

Deepyaman Datta

01/22/2024, 5:51 PM

(I guess Dask would work as is, too)

Iñigo Hidalgo

01/22/2024, 5:51 PM

Subclassing is still a bit ugly

how so?

Deepyaman Datta

01/22/2024, 5:52 PM

@Juan Luis you do a lot of those examples with Polars; doesn't LazyFrame end up weird, or do you replace the default dataset

Deepyaman Datta

01/22/2024, 5:53 PM

> Subclassing is still a bit ugly

how so?

Sorry, what I meant is, subclassing doesn't make a bad solution, but I think the fact that you need to create a dataset subclass of a dataset a lot of users aren't necessarily actively thinking about, feels non-ideal. Maybe it's fine.

Deepyaman Datta

01/22/2024, 5:54 PM

some sort of mapping of types to default dataset type passed in settings.py or something along those lines

Something like this, or rather a mapping of default copy mode for datatype, could be nice IMO 🙂

Iñigo Hidalgo

01/22/2024, 5:54 PM

yeah in that sense I agree. it's a decent workaround but not the most intuitive solution

Nok Lam Chan

01/22/2024, 5:56 PM

I agree it's not ergonomic, but most user wouldn't care (or know) this

copy_mode

configuration exists. I guess my question is why do you need to change this default?

Iñigo Hidalgo

01/22/2024, 5:57 PM

Deepyaman will be more familiar with the specifics of Ibis, but their dataframe equivalent is non-serializable, i assume because the objects themselves handle db connections. So the deepcopy mode, which attemptst to serialize objects, cannot handle the Tables

👍🏼 1

Nok Lam Chan

01/22/2024, 5:57 PM

Is there a pattern of datatype that requires different

copy_mode

? I think the Memorydataset default was trying to do that, though it hasn't been updated for years.

Deepyaman Datta

01/22/2024, 5:58 PM

I think basically any of those lazy types (Spark, Ibis, etc.) should default to

assign

Nok Lam Chan

01/22/2024, 6:01 PM

#1910 Add an attribute to dataset classes to flag persistence - could we take the same approach? Though I am also worry that we start adding a lot of flags, and each of them are handling a very special thing.

Nok Lam Chan

01/22/2024, 6:02 PM

Oh sorry - I guess in this case you aren't dealing with new dataset but rather you want the default memory dataset to take care of special data type instead

👍 1

Iñigo Hidalgo

01/22/2024, 6:06 PM

in this case it's just changing an

__init__

kwarg, and it's more depending on the type of the data being saved *into the dataset at

_save

time than the type of the dataset itself

Nok Lam Chan

01/22/2024, 6:12 PM

Copy code

def _infer_copy_mode(data: Any) -> str:

    if pd and isinstance(data, pd.DataFrame) or np and isinstance(data, np.ndarray):
        copy_mode = "copy"
    elif type(data).__name__ == "DataFrame":
        copy_mode = "assign"
    else:
        copy_mode = "deepcopy"
    return copy_mode

Basically you want this but expose to

settings.py

so it's customisable?

Nok Lam Chan

01/22/2024, 6:14 PM

I'd argue this seems over-engineered, I'd prefer the framework just try to do its best (like how it's working now). Because every project is gonna have the same settings, if a Lazy datatype should have an

assign

mode, I don't see why anyone would want it differently. If so we should just have it embedded in framework rather than creating a new configuration in

settings.py

Iñigo Hidalgo

01/22/2024, 6:53 PM

to me that function just seems too limited in what it does. refactoring it to somehow use a mapping, which by default could replicate the logic of the current method but be user-extendable somehow would in my opinion be an interesting functionality to have

Iñigo Hidalgo

01/22/2024, 7:00 PM

every project is gonna have the same settings

that is true, but that puts the maintenance and testing burden on kedro to support compatibility with packages it shouldn't even know about. is there an auto import of settings functionality like (i think) there is with hooks?

Nok Lam Chan

01/22/2024, 11:56 PM

It is limited, probably because it was not expected to be extended thus it's an internal method. The refactoring will involve some rewrite of API as well. If you are interested in this, feel free to open an issue, as I know we will start revisit some

<http://kedro.io|kedro.io>

DataCatalog API.

✔️ 1

Nok Lam Chan

01/23/2024, 6:36 PM

FYR: https://github.com/kedro-org/kedro/issues/3551

thankyou 2

Deepyaman Datta

01/25/2024, 3:32 PM

Interestingly, I think some Ibis tables may be serializable? I didn't have an issue using

MemoryDataset

without changing copy mode in my project right now, but this is using DuckDB backend. Probably, some other tables encapsulate more state (will need to verify).

Iñigo Hidalgo

01/25/2024, 3:49 PM

@Juan Luis pointed this out:

looks like the
.con
property is not part of
__init__
, but later created on the fly https://github.com/ibis-project/ibis/blob/6d4a34f74bf4880d33495ad94d44d58fdd97c2b0/ibis/backends/base/sql/alchemy/__init__.py#L122

https://github.com/kedro-org/kedro/issues/2374#issuecomment-1593132824 Not sure if that's the issue but i would expect it depends on the connection method

48 Views

Open in Slack

Previous Next