# plugins-integrations
a
I have a question regarding `kedro-pandera`. @Nok Lam Chan, why does the hook validate datasets before each node run instead of after a dataset is loaded / before it is saved? It leads to the same data being re-validated multiple times when a dataset is shared among nodes.
👀 2
m
I think it's mainly because they have access to the catalog variable when using before/after node run. Btw, even if you switched to before/after dataset saved/loaded, you would still run validations multiple times. The only way to avoid that is to add extra logic to the hook so it only validates on load when the dataset is a "free" input (`pipeline.inputs()` in Kedro's language). Would be a nice addition though…
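The idea above could be sketched roughly like this. Assumptions: hook signatures are simplified, `validate` is a hypothetical stand-in for the pandera schema check, and in a real plugin each method would be decorated with `@hook_impl` from `kedro.framework.hooks`.

```python
class ValidateFreeInputsOnLoadHook:
    """Sketch: validate a dataset on load only when it is a pipeline
    "free" input, i.e. not produced by any node in the pipeline."""

    def __init__(self, validate):
        # `validate` is a hypothetical callable, e.g. a pandera schema check
        self._validate = validate
        self._free_inputs = set()

    def before_pipeline_run(self, run_params, pipeline, catalog):
        # pipeline.inputs() lists datasets not produced by any node,
        # i.e. the only ones whose data actually comes from storage
        self._free_inputs = set(pipeline.inputs())

    def after_dataset_loaded(self, dataset_name, data):
        # Skip intermediate datasets; they were already validated
        # (or will be) when the producing node saved them
        if dataset_name in self._free_inputs:
            self._validate(dataset_name, data)
```

Intermediate datasets passed between nodes would then never trigger a load-time validation, removing the duplication described above.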
n
I am slightly out of date with `kedro-pandera` development lately, so pinging @Yolan Honoré-Rougé here. If there is no response this week, I will come back to this next week. I am a bit overloaded at the moment with `vscode-kedro` and various things.
a
Yes, you're right. I've tested the solution with after dataset loaded and sadly this hook behaves differently than I expected: a dataset counts as "loaded" every time it is passed to a node.
And in the newest version I see an exception was added to avoid re-validating multiple times, by tracking the set of already-validated datasets, which is something I wanted to add myself.
I just thought that validation should occur only once on load, and every time before save.
y
Sorry for being late to the party. Like @Matthias Roels says, the main pain point is multiple validation. There is a recent community addition which prevents validation from happening both at loading and saving (even though this is questionable: serialisation might modify the data, especially with different load_args and save_args), and we should tackle this problem. As pointed out, we need to keep track of validation at the dataset level or the hook level to avoid this, but the desirable behaviour is unclear. We want to avoid a performance penalty, but on the other hand, if the dataset is modified on the fly by a hook, or if it is loaded interactively, we want to validate it each time. Maybe we can hash the dataset, but this adds an extra layer of complexity. Nonetheless, I'll accept PRs to avoid multiple validation in the short run rather than wait until we define the full design. As you have noticed, the repo has little activity (@datajoely, @Nok Lam Chan and myself have a lot to do and not enough time to be committed everywhere), so I am glad the plugin lives through community-driven development solving the main issues in the short term.
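The hashing idea could look something like the minimal sketch below (all names are hypothetical, not from the plugin): a cache keyed on the dataset name plus a content fingerprint, so unchanged data is skipped but modified data is re-validated. For a pandas DataFrame, one possible fingerprint would be hashing `pd.util.hash_pandas_object(df).values.tobytes()`.

```python
import hashlib


class ValidationCache:
    """Sketch: remember which (dataset, content) pairs were validated."""

    def __init__(self):
        self._seen = set()

    def should_validate(self, dataset_name, fingerprint):
        # `fingerprint` is any bytes digest of the data's content,
        # e.g. derived from pd.util.hash_pandas_object for a DataFrame
        key = (dataset_name, hashlib.sha256(fingerprint).hexdigest())
        if key in self._seen:
            return False  # same name, same content: skip revalidation
        self._seen.add(key)
        return True  # new or modified content: validate
```

This handles the "modified on the fly by a hook" case at the cost of computing a hash on every load/save, which is exactly the complexity trade-off mentioned above.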
👀 1
a
Yeah, the problem here was that the dataset hook was triggered at a different moment than I expected. I expected it to fire only once when a dataset is loaded into memory and once when it is saved from memory, not before/after each node run for each dataset. I wrote my own hook that does the same as what was added in the latest release on load, but always checks the schema on save; that's when I was surprised by this behaviour.
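The load-once / save-always behaviour described here could be sketched like this (simplified hook signatures, hypothetical `validate` callable; a real Kedro hook would decorate these methods with `@hook_impl`):

```python
class ValidateOnceOnLoadAlwaysOnSaveHook:
    """Sketch: validate each dataset once on load, every time on save."""

    def __init__(self, validate):
        # `validate` is a hypothetical callable, e.g. a pandera schema check
        self._validate = validate
        self._validated_on_load = set()

    def after_dataset_loaded(self, dataset_name, data):
        # Only the first load of a given dataset is validated
        if dataset_name not in self._validated_on_load:
            self._validated_on_load.add(dataset_name)
            self._validate("load", dataset_name, data)

    def before_dataset_saved(self, dataset_name, data):
        # Always validate on save: the node may have changed the schema
        self._validate("save", dataset_name, data)
```

Validating on every save is the conservative choice, since each save carries new node output, while repeated loads of an unchanged dataset are the redundant case.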
👍 1