https://kedro.org/ logo
#questions
Title
# questions
d

Deepyaman Datta

01/22/2024, 5:03 PM
Is it possible to set a default
copy_mode
for
MemoryDataset
currently? If not, would it make sense to support this, or would it make sense to replace the default dataset?
K 1
i

Iñigo Hidalgo

01/22/2024, 5:08 PM
I have asked this precise thing before, because I was having issues when passing Ibis tables between nodes as they are not serializable
d

Deepyaman Datta

01/22/2024, 5:15 PM
Yeah, I'm also having this question w.r.t. Ibis tables. 😄
OK, I will take a look at this issue, but happy for any other thoughts on how to potentially resolve this.
i

Iñigo Hidalgo

01/22/2024, 5:17 PM
The suggestion to use dataset factories seemed good. As I’m still on an earlier kedro version I haven’t had the chance to use them yet, but I do think they should solve the issue, with the downside of forcing you to follow a specific naming pattern for the intermediate ibis tables which should be set to assign.
The
_infer_copy_mode
method struck me as quite an odd implementation too, with a specific handling case for pandas and numpy, and then always just defaulting to deepcopy. It does seem like something that should be user configurable if wanted
I understand it was developed in the time when pandas was the only real dataframe library, but looking at it now it seems a bit outdated
some sort of mapping of types to default dataset type passed in settings.py or something along those lines
n

Nok Lam Chan

01/22/2024, 5:46 PM
Are you able to set
copy_mode
on catalog.yml?
i

Iñigo Hidalgo

01/22/2024, 5:46 PM
when i tested it that fixed it, but at least in my case i was generating a lot of intermediate datasets which i did not want to have to specify in the catalog
n

Nok Lam Chan

01/22/2024, 5:47 PM
If so, I think you can settle this with dataset factory pattern. basically
Copy code
# catalog.yml

{_catch_all_}:
  type: MemoryDataset
  copy_mode: "infer"
let me look up the latest change - I am not too up to date with this
the idea is override the default dataset with extra configuration if it makes sense
d

Deepyaman Datta

01/22/2024, 5:48 PM
Yeah, I just found this: https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html#how-to-override-the-default-dataset-creation-with-dataset-factories And was going to say the same thing @Nok Lam Chan just said. I guess I would do
copy_mode: "assign"
.
It's not 100% ideal since that would force it for pandas datasets, etc., too; but this would be good enough for me for now, until can overwrite
_infer_copy_mode
or make it more flexible.
i

Iñigo Hidalgo

01/22/2024, 5:49 PM
then i guess the next step of subclassing memorydataset and modifying that method is trivial
thanks @Nok Lam Chan
n

Nok Lam Chan

01/22/2024, 5:50 PM
Yeah I think you can combine both solutions with Dataset factory + CustomMemoryDataset
d

Deepyaman Datta

01/22/2024, 5:51 PM
Subclassing is still a bit ugly; I agree there's value to making the default special-cased around pandas/numpy/Spark and having the possibility of treating something else with
DataFrame
type incorrectly
(I guess Dask would work as is, too)
i

Iñigo Hidalgo

01/22/2024, 5:51 PM
Subclassing is still a bit ugly
how so?
d

Deepyaman Datta

01/22/2024, 5:52 PM
@Juan Luis you do a lot of those examples with Polars; doesn't LazyFrame end up weird, or do you replace the default dataset
> Subclassing is still a bit ugly
how so?
Sorry, what I meant is, subclassing doesn't make a bad solution, but I think the fact that you need to create a dataset subclass of a dataset a lot of users aren't necessarily actively thinking about, feels non-ideal. Maybe it's fine.
some sort of mapping of types to default dataset type passed in settings.py or something along those lines
Something like this, or rather a mapping of default copy mode for datatype, could be nice IMO 🙂
i

Iñigo Hidalgo

01/22/2024, 5:54 PM
yeah in that sense I agree. it's a decent workaround but not the most intuitive solution
n

Nok Lam Chan

01/22/2024, 5:56 PM
I agree it's not ergonomic, but most user wouldn't care (or know) this
copy_mode
configuration exists. I guess my question is why do you need to change this default?
i

Iñigo Hidalgo

01/22/2024, 5:57 PM
Deepyaman will be more familiar with the specifics of Ibis, but their dataframe equivalent is non-serializable, i assume because the objects themselves handle db connections. So the deepcopy mode, which attemptst to serialize objects, cannot handle the Tables
👍🏼 1
n

Nok Lam Chan

01/22/2024, 5:57 PM
Is there a pattern of datatype that requires different
copy_mode
? I think the Memorydataset default was trying to do that, though it hasn't been updated for years.
d

Deepyaman Datta

01/22/2024, 5:58 PM
I think basically any of those lazy types (Spark, Ibis, etc.) should default to
assign
n

Nok Lam Chan

01/22/2024, 6:01 PM
#1910 Add an attribute to dataset classes to flag persistence - could we take the same approach? Though I am also worry that we start adding a lot of flags, and each of them are handling a very special thing.
Oh sorry - I guess in this case you aren't dealing with new dataset but rather you want the default memory dataset to take care of special data type instead
👍 1
i

Iñigo Hidalgo

01/22/2024, 6:06 PM
in this case it's just changing an
__init__
kwarg, and it's more depending on the type of the data being saved *into the dataset at
_save
time than the type of the dataset itself
n

Nok Lam Chan

01/22/2024, 6:12 PM
Copy code
def _infer_copy_mode(data: Any) -> str:

    if pd and isinstance(data, pd.DataFrame) or np and isinstance(data, np.ndarray):
        copy_mode = "copy"
    elif type(data).__name__ == "DataFrame":
        copy_mode = "assign"
    else:
        copy_mode = "deepcopy"
    return copy_mode
Basically you want this but expose to
settings.py
so it's customisable?
I'd argue this seems over-engineered, I'd prefer the framework just try to do its best (like how it's working now). Because every project is gonna have the same settings, if a Lazy datatype should have an
assign
mode, I don't see why anyone would want it differently. If so we should just have it embedded in framework rather than creating a new configuration in
settings.py
i

Iñigo Hidalgo

01/22/2024, 6:53 PM
to me that function just seems too limited in what it does. refactoring it to somehow use a mapping, which by default could replicate the logic of the current method but be user-extendable somehow would in my opinion be an interesting functionality to have
every project is gonna have the same settings
that is true, but that puts the maintenance and testing burden on kedro to support compatibility with packages it shouldn't even know about. is there an auto import of settings functionality like (i think) there is with hooks?
n

Nok Lam Chan

01/22/2024, 11:56 PM
It is limited, probably because it was not expected to be extended thus it's an internal method. The refactoring will involve some rewrite of API as well. If you are interested in this, feel free to open an issue, as I know we will start revisit some
<http://kedro.io|kedro.io>
DataCatalog API.
✔️ 1
d

Deepyaman Datta

01/25/2024, 3:32 PM
Interestingly, I think some Ibis tables may be serializable? I didn't have an issue using
MemoryDataset
without changing copy mode in my project right now, but this is using DuckDB backend. Probably, some other tables encapsulate more state (will need to verify).
i

Iñigo Hidalgo

01/25/2024, 3:49 PM
@Juan Luis pointed this out:
looks like the
.con
property is not part of
__init__
, but later created on the fly https://github.com/ibis-project/ibis/blob/6d4a34f74bf4880d33495ad94d44d58fdd97c2b0/ibis/backends/base/sql/alchemy/__init__.py#L122
https://github.com/kedro-org/kedro/issues/2374#issuecomment-1593132824 Not sure if that's the issue but i would expect it depends on the connection method
4 Views