# plugins-integrations
a
I have a question regarding `kedro-pandera`. @Nok Lam Chan, why does the hook validate datasets before each node run instead of after a dataset is loaded / before it is saved? It leads to the same data being re-validated multiple times when a dataset is shared among nodes.
👀 2
m
I think it's mainly because they have access to the catalog variable when using before/after node run. Btw, even if you switched to before/after dataset saved/loaded, you would still run validations multiple times. The only way to avoid that is to add extra logic to the hook so it only validates on load when the dataset is a "free" input (`pipeline.inputs()` in Kedro's language). Would be a nice addition though…
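The idea above could be sketched roughly like this. Assumptions: hook signatures are simplified, `validate` is a hypothetical stand-in for the pandera schema check, and in a real plugin each method would be decorated with `@hook_impl` from `kedro.framework.hooks`.

```python
class ValidateFreeInputsOnLoadHook:
    """Sketch: validate a dataset on load only when it is a pipeline
    "free" input, i.e. not produced by any node in the pipeline."""

    def __init__(self, validate):
        # `validate` is a hypothetical callable, e.g. a pandera schema check
        self._validate = validate
        self._free_inputs = set()

    def before_pipeline_run(self, run_params, pipeline, catalog):
        # pipeline.inputs() lists datasets not produced by any node,
        # i.e. the only ones whose data actually comes from storage
        self._free_inputs = set(pipeline.inputs())

    def after_dataset_loaded(self, dataset_name, data):
        # Skip intermediate datasets; they were already validated
        # (or will be) when the producing node saved them
        if dataset_name in self._free_inputs:
            self._validate(dataset_name, data)
```

Intermediate datasets passed between nodes would then never trigger a load-time validation, removing the duplication described above.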
n
I am slightly out of date with `kedro-pandera` development lately, so pinging @Yolan Honoré-Rougé here. If there is no response this week, I will come back to this next week. I am a bit overloaded at the moment with `vscode-kedro` and various things.
a
Yes, you're right. I've tested the solution with after dataset loaded and sadly this hook behaves differently than I expected: a dataset counts as "loaded" every time it is passed to a node.
And in the newest version I see an exception was added to avoid re-validating multiple times, by tracking the set of already-validated datasets, which is something I wanted to add myself.
I just thought that validation should occur only once on load, and every time before save.
y
Sorry for being late to the party. Like @Matthias Roels says, the main pain point is multiple validation. There is a recent community addition which prevents validation from happening both at loading and saving (even though this is questionable: serialisation might modify the data, especially with different load_args and save_args), and we should tackle this problem. As pointed out, we need to keep track of validation at the dataset level or the hook level to avoid this, but the desirable behaviour is unclear. We want to avoid a performance penalty, but on the other hand, if the dataset is modified on the fly by a hook, or if it is loaded interactively, we want to validate it each time. Maybe we can hash the dataset, but this adds an extra layer of complexity. Nonetheless, I'll accept PRs to avoid multiple validation in the short run rather than wait until we define the full design. As you have noticed, the repo has little activity (@datajoely, @Nok Lam Chan and myself have a lot to do and not enough time to be committed everywhere), so I am glad the plugin lives through community-driven development solving the main issues in the short term.
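The hashing idea could look something like the minimal sketch below (all names are hypothetical, not from the plugin): a cache keyed on the dataset name plus a content fingerprint, so unchanged data is skipped but modified data is re-validated. For a pandas DataFrame, one possible fingerprint would be hashing `pd.util.hash_pandas_object(df).values.tobytes()`.

```python
import hashlib


class ValidationCache:
    """Sketch: remember which (dataset, content) pairs were validated."""

    def __init__(self):
        self._seen = set()

    def should_validate(self, dataset_name, fingerprint):
        # `fingerprint` is any bytes digest of the data's content,
        # e.g. derived from pd.util.hash_pandas_object for a DataFrame
        key = (dataset_name, hashlib.sha256(fingerprint).hexdigest())
        if key in self._seen:
            return False  # same name, same content: skip revalidation
        self._seen.add(key)
        return True  # new or modified content: validate
```

This handles the "modified on the fly by a hook" case at the cost of computing a hash on every load/save, which is exactly the complexity trade-off mentioned above.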
👀 1
a
Yeah, the problem here was that the dataset hook was triggered at a different moment than I expected. I expected it to fire only once when a dataset is loaded into memory and once when it is saved from memory, not before/after each node run for each dataset. I wrote my own hook that does the same as what was added in the latest release on load, but always checks the schema on save; that's when I was surprised by this behaviour.
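The load-once / save-always behaviour described here could be sketched like this (simplified hook signatures, hypothetical `validate` callable; a real Kedro hook would decorate these methods with `@hook_impl`):

```python
class ValidateOnceOnLoadAlwaysOnSaveHook:
    """Sketch: validate each dataset once on load, every time on save."""

    def __init__(self, validate):
        # `validate` is a hypothetical callable, e.g. a pandera schema check
        self._validate = validate
        self._validated_on_load = set()

    def after_dataset_loaded(self, dataset_name, data):
        # Only the first load of a given dataset is validated
        if dataset_name not in self._validated_on_load:
            self._validated_on_load.add(dataset_name)
            self._validate("load", dataset_name, data)

    def before_dataset_saved(self, dataset_name, data):
        # Always validate on save: the node may have changed the schema
        self._validate("save", dataset_name, data)
```

Validating on every save is the conservative choice, since each save carries new node output, while repeated loads of an unchanged dataset are the redundant case.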
👍 1