Deepyaman Datta
09/16/2024, 12:53 PM
pandera.io.deserialize_schema under the hood in its schema resolver, and that seems to be only implemented in pandera for pandas, is that right?
Laurens Vijnck
09/17/2024, 8:58 AM
Laurens Vijnck
09/17/2024, 9:00 AM
Yolan Honoré-Rougé
09/17/2024, 11:48 AM
Deepyaman Datta
09/17/2024, 12:52 PM
"I do wonder though, Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch"
Yeah. I'm guessing this is also not a huge lift on the pandera side to just include the parser for other schemas; they all look pretty similar.
Nok Lam Chan
09/18/2024, 1:42 PM
"Kedro already does have the dataset definition. Updating the code to ensure that the dataset type is used to construct the proper Pandera object should not be much of a stretch"
Can you explain a little more on this?
Laurens Vijnck
09/18/2024, 1:46 PM
Laurens Vijnck
09/18/2024, 1:46 PM
Nok Lam Chan
09/18/2024, 1:52 PM
kedro-pandera relies on pandera to do this deserialisation step (from YAML to object). pandera only supports pandas so far, https://github.com/unionai-oss/pandera/blob/main/pandera/io/pandas_io.py
Deepyaman Datta
09/18/2024, 5:00 PM
"could that info be used to construct the correct schema object in pandera?"
infer_schema is another piece of the functionality. But you don't necessarily want to set your validation rule based on the inferred schema, maybe you want some subset or something. I think this is more P2 functionality.
Laurens Vijnck
09/19/2024, 7:40 AM
• SparkDataset results in creation of pandera.pyspark.DataFrameModel
• PandasDataset results in creation of pandera.api.pandas.model.DataFrameModel
and so on
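The mapping in the list above could be sketched as a small lookup table that resolves a dataset type name to a dotted import path and imports it only on demand. This is a minimal sketch, not kedro-pandera's implementation: MODEL_PATHS, model_path_for, and load_object are hypothetical names, and the dataset names and pandera paths are copied from the message as written rather than checked against a pandera release.

```python
from importlib import import_module

# Hypothetical lookup table: dataset type name -> dotted path of the pandera model.
# Names and paths are copied from the message above (assumption, not verified).
MODEL_PATHS = {
    "SparkDataset": "pandera.pyspark.DataFrameModel",
    "PandasDataset": "pandera.api.pandas.model.DataFrameModel",
}


def model_path_for(dataset_type: str) -> str:
    """Return the dotted path of the model class for a dataset type name."""
    try:
        return MODEL_PATHS[dataset_type]
    except KeyError:
        raise ValueError(f"no pandera model registered for {dataset_type!r}") from None


def load_object(dotted_path: str):
    """Import the module part of a dotted path and return the named attribute."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(import_module(module_name), attr)
```

With this shape, load_object(model_path_for("SparkDataset")) would import the Spark model only when a Spark dataset is actually configured, so a pandas-only project would presumably not need pyspark installed.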
What surprises me though, is that the validation rules are somehow expressed based on the target data model (e.g., pandas or spark). I think it should be more generic, where there is an abstract, data type agnostic way to express validation rules and Pandera should figure out how to apply the validation rule to the underlying dataset type.
(not sure if I am making sense here)
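One way to picture the data-type-agnostic idea: declare a rule once, and let a per-container backend decide how to evaluate it. The sketch below is hypothetical and uses only the standard library. Rule, BACKENDS, and validate are invented names, and plain Python containers (a list of row dicts and a dict of columns) stand in for Spark and pandas. It is not a description of how pandera is structured.

```python
import operator
from dataclasses import dataclass
from typing import Any

# Comparison names a rule may use, mapped to stdlib operators.
OPS = {"ge": operator.ge, "le": operator.le, "eq": operator.eq}


@dataclass(frozen=True)
class Rule:
    """Backend-agnostic rule: every value in `column` must satisfy `op` against `value`."""
    column: str
    op: str
    value: Any


def _check_rows(data: list, rule: Rule) -> bool:
    # Row-oriented stand-in: a list of dicts, one dict per record.
    return all(OPS[rule.op](row[rule.column], rule.value) for row in data)


def _check_columns(data: dict, rule: Rule) -> bool:
    # Column-oriented stand-in: a dict mapping column name to a list of values.
    return all(OPS[rule.op](v, rule.value) for v in data[rule.column])


# Hypothetical registry: container type -> function that applies a rule to it.
BACKENDS = {list: _check_rows, dict: _check_columns}


def validate(data: Any, rules: list) -> list:
    """Return the rules that fail for `data`, dispatching on the container type."""
    check = BACKENDS.get(type(data))
    if check is None:
        raise TypeError(f"no backend registered for {type(data).__name__}")
    return [rule for rule in rules if not check(data, rule)]
```

The same list of Rule objects can be passed to validate with either container, and only the backend that evaluates them changes.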