Hi I would like to set a default copy mode for datasets of a Kedro #questions

Iñigo Hidalgo

06/06/2023, 7:40 AM

Hi, I would like to set a default copy_mode for datasets of a certain type: Ibis Tables should always be passed through as "assign". I would like to build a query on an Ibis table over multiple nodes, which would imply creating lots of MemoryDatasets, and I would like to avoid having to declare an instance in the catalog for each one just to specify its copy_mode. https://github.com/kedro-org/kedro/blob/39f2168b81c550873c685eea42f1018c2927dbb8/kedro/io/memory_dataset.py#L83 Would it make sense to somehow modify the behavior of _infer_copy_mode? In this issue it was mentioned as a possibility but was discarded because it’s too “heavy”, but I think adding one additional branch to the already-existing pandas check could be worth it for incorporating Ibis functionality.

👀 1

👍 2
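
A minimal sketch of what that extra branch could look like, modeled on the linked _infer_copy_mode. The function name infer_copy_mode_sketch is hypothetical, and treating ibis.expr.types.Table as the Ibis table type is an assumption that may not hold for every Ibis version; this is not Kedro's actual implementation.

```python
from typing import Any


def infer_copy_mode_sketch(data: Any) -> str:
    """Hypothetical variant of _infer_copy_mode with one extra Ibis branch."""
    # Optional dependencies: fall back to None when a library is not installed.
    try:
        import pandas as pd
    except ImportError:
        pd = None
    try:
        import numpy as np
    except ImportError:
        np = None
    try:
        # Assumption: Ibis table expressions are instances of this class.
        from ibis.expr.types import Table as IbisTable
    except ImportError:
        IbisTable = None

    if IbisTable is not None and isinstance(data, IbisTable):
        # Ibis tables are lazy expressions, so hand over the same object.
        return "assign"
    if (pd is not None and isinstance(data, pd.DataFrame)) or (
        np is not None and isinstance(data, np.ndarray)
    ):
        return "copy"
    if type(data).__name__ == "DataFrame":
        # Other DataFrame-like objects are passed by reference.
        return "assign"
    return "deepcopy"
```

With this in place, a plain Python object still falls through to "deepcopy", and only the Ibis check is new relative to the existing pandas and NumPy checks.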

Iñigo Hidalgo

06/06/2023, 7:46 AM

Tbh I haven’t benchmarked it, but I think the logic of trying to import pandas each time a memory dataset is created might already be quite heavy. Would it make sense to refactor that into some sort of class method, so that instead of needlessly importing the modules each time we set a flag once if numpy or pandas is available?

👍 2

Antony Milne

06/06/2023, 10:02 PM

I think this is a great idea (both the idea of adding it to _infer_copy_mode and somehow making the whole thing lighter). My objection on the issue you linked to I think was that it would have just been a temporary thing for LightGBM and we didn’t want to put in lots of special cases. But this ibis case sounds like one that is not a temporary workaround but the “correct” way to do it. So as far as I’m concerned, go for it!

👍 1

Iñigo Hidalgo

06/07/2023, 7:00 AM

Thanks Antony. In the case of LGBM that makes sense, and I guess you really wouldn’t want to be passing the original object anyways, since it is mutable.

👍 1
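
To illustrate the mutability concern, here is a small sketch of the three copy modes in plain Python. pass_through is a hypothetical helper written for this example, not Kedro's code, and a dict stands in for a mutable model object.

```python
import copy
from typing import Any


def pass_through(data: Any, copy_mode: str) -> Any:
    """Hypothetical helper: return data according to the given copy mode."""
    if copy_mode == "assign":
        return data  # same object, no copy at all
    if copy_mode == "copy":
        return data.copy()  # relies on the object's own copy() method
    if copy_mode == "deepcopy":
        return copy.deepcopy(data)  # fully independent copy
    raise ValueError(f"Unknown copy mode: {copy_mode}")


# A dict stands in for a mutable model.
model = {"params": [1]}

# With "assign", a downstream change is visible on the original object.
assigned = pass_through(model, "assign")
assigned["params"].append(2)

# With "deepcopy", the original is isolated from downstream changes.
isolated = pass_through(model, "deepcopy")
isolated["params"].append(3)
```

After this runs, model has picked up the change made through assigned but not the one made through isolated, which is why "assign" suits an Ibis table expression better than a mutable model.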

Iñigo Hidalgo

06/07/2023, 7:02 AM

I think I remember reading about the option to add a different dataset class as default. But I can't find it in the documentation or the issues atm, was it something that was planned for the future?

Juan Luis

06/07/2023, 7:32 AM

it was mentioned as a possibility in this issue https://github.com/kedro-org/kedro/issues/2423 and it's being worked out as part of the dataset factories, I believe

Iñigo Hidalgo

06/07/2023, 7:43 AM

Thank you Juanlu, apparently I'd already been in that issue in the past since I'd reacted to some of the comments 😅

Iñigo Hidalgo

06/07/2023, 7:44 AM

So basically, as of right now, I should aim to change the MemoryDataset implementation itself?

Antony Milne

06/07/2023, 9:31 AM

Yes, exactly 👍 I think this is a good improvement even when the dataset factories exist, not just a temporary thing.
