# plugins-integrations
d
Am I correct in understanding that Kedro-Pandera will only work with pandas schemas currently? I saw that it uses `pandera.io.deserialize_schema` under the hood in its schema resolver, and that seems to be implemented in pandera only for pandas — is that right?
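For context, the pandas-only serialized format that `deserialize_schema` consumes looks roughly like the YAML below. This is a hedged sketch of pandera's schema serialization: the column names are made up for illustration, and exact field names can vary between pandera versions.

```yaml
schema_type: dataframe
version: 0.17.2
columns:
  passenger_id:
    dtype: int64
    nullable: false
  fare:
    dtype: float64
    nullable: true
    checks:
      greater_than_or_equal_to: 0.0
index: null
coerce: false
strict: false
```

There is currently no equivalent serialization module for the pyspark or polars backends, which is why the resolver is effectively pandas-only.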
l
Hiya Deepy, I think I've just discovered the same thing. Seems only pandas is supported so far
I do wonder though, Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch
y
Hi @Deepyaman Datta, sorry I missed it. Yes, that is the case, but hopefully you can build your own resolver to pass another schema; I'm not absolutely sure how the hook will behave, though. There is still a lot to do for this plugin, and unfortunately I don't think it will happen in the foreseeable future
d
I do wonder though, Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch
Yeah. I'm guessing this is also not a huge lift on the pandera side to just include the parsers for other schemas; they all look pretty similar.
n
Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch
Can you explain this a little more?
l
Well, we define datasets in Kedro as spark or pandas already
could that info be used to construct the correct schema object in pandera?
n
I see. I think the issue here is that `kedro-pandera` relies on `pandera` to do this deserialization step (from YAML to schema object). `pandera` only supports `pandas` so far: https://github.com/unionai-oss/pandera/blob/main/pandera/io/pandas_io.py
d
could that info be used to construct the correct schema object in pandera?
`infer_schema` is another piece of the functionality. But you don't necessarily want to set your validation rules based on the inferred schema; maybe you want some subset or something. I think this is more P2 functionality.
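A minimal sketch of that "some subset" idea, in pure Python. The inferred schema is represented here as a plain column-to-dtype mapping, and `subset_schema` is a hypothetical helper, not kedro-pandera or pandera API:

```python
# Hypothetical illustration: validate only a chosen subset of an
# inferred schema, rather than everything inference produced.

def subset_schema(inferred: dict, keep: set) -> dict:
    """Keep only the columns we actually want to validate."""
    return {col: dtype for col, dtype in inferred.items() if col in keep}

# Suppose inference over a dataframe produced this mapping:
inferred = {"id": "int64", "name": "object", "score": "float64", "debug_col": "object"}

# We only care about validating two of those columns:
rules = subset_schema(inferred, keep={"id", "score"})
# rules == {"id": "int64", "score": "float64"}
```

The point being: inference is a starting point for authoring rules, not necessarily the validation contract itself.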
l
Two things are coming together here indeed. I was not saying we should validate based on the schema inferred from the dataset; I intended to say that the type (and only the type, i.e., SparkDataset, PandasDataset) should aid in parsing the YAML into the correct Pandera object. Specifically:
• `SparkDataset` results in creation of `pandera.pyspark.DataFrameModel`
• `PandasDataset` results in creation of `pandera.api.pandas.model.DataFrameModel`
and so on. What surprises me, though, is that the validation rules are somehow expressed based on the target data model (e.g., pandas or spark). I think it should be more generic: there should be an abstract, data-type-agnostic way to express validation rules, and Pandera should figure out how to apply them to the underlying dataset type (not sure if I am making sense here)