# questions
i
Hi, I would like to set a default copy_mode for datasets of a certain type: Ibis Tables should always be passed through as "assign". I would like to build a query on an Ibis table over multiple nodes, which would imply creating lots of MemoryDatasets, and I would like to avoid having to declare an instance in the catalog for each one just to specify its copy_mode. https://github.com/kedro-org/kedro/blob/39f2168b81c550873c685eea42f1018c2927dbb8/kedro/io/memory_dataset.py#L83 Would it make sense to somehow modify the behavior of _infer_copy_mode? In this issue it was mentioned as a possibility but was discarded because it’s too “heavy”, but I think adding one additional branch to the already-existing pandas check could be worth it for incorporating Ibis functionality.
👀 1
👍 2
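For illustration, a minimal sketch of what the extra branch might look like. This is not the Kedro implementation: the function name `infer_copy_mode_sketch` is hypothetical, the pandas/NumPy checks are only loosely modeled on the linked function, and it assumes that `ibis.expr.types.Table` is the base class of Ibis table expressions.

```python
from typing import Any


def infer_copy_mode_sketch(data: Any) -> str:
    """Hypothetical variant of _infer_copy_mode with an extra Ibis branch."""
    # Each library is optional, so fall back to None when it is not installed.
    try:
        import pandas as pd
    except ImportError:
        pd = None
    try:
        import numpy as np
    except ImportError:
        np = None
    try:
        # Assumption: ibis.expr.types.Table is the base class of Ibis tables.
        import ibis.expr.types as ir
    except ImportError:
        ir = None

    if (pd and isinstance(data, pd.DataFrame)) or (np and isinstance(data, np.ndarray)):
        return "copy"
    if ir and isinstance(data, ir.Table):
        # Ibis tables are lazy expressions, so pass the same object through.
        return "assign"
    return "deepcopy"
```

With something like this, nodes that return Ibis tables would get "assign" without any catalog entry, while other objects keep the existing defaults.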
Tbh I haven’t benchmarked it, but I think the logic of importing pandas each time a MemoryDataset is created might already be quite heavy. Would it make sense to refactor that into some sort of class method, to avoid needlessly importing the modules each time and instead set a flag if numpy or pandas is available?
👍 2
a
I think this is a great idea (both the idea of adding it to _infer_copy_mode and somehow making the whole thing lighter). My objection on the issue you linked to I think was that it would have just been a temporary thing for LightGBM and we didn’t want to put in lots of special cases. But this ibis case sounds like one that is not a temporary workaround but the “correct” way to do it. So as far as I’m concerned, go for it!
👍 1
i
Thanks Antony. In the case of LGBM that makes sense, and I guess you really wouldn’t want to be passing the original object anyways, since it is mutable.
👍 1
j
it was mentioned as a possibility in this issue https://github.com/kedro-org/kedro/issues/2423 and it's being worked out as part of the dataset factories, I believe
i
Thank you Juanlu, apparently I'd already been in that issue in the past since I'd reacted to some of the comments 😅
So basically, as of right now, I should aim to change the MemoryDataset implementation itself?
a
Yes, exactly 👍 I think this is a good improvement even when the dataset factories exist, not just a temporary thing.