Matthias Roels07/27/2023, 3:35 PM
built-in. Combined with an additional dbt-expectations plugin, you get almost all data validation functionality from Great Expectations for free (i.e. with no additional Python dependencies). Would it make sense for Kedro to have a similar plugin? One that requires no additional dependencies, is compatible with the config system and kedro-datasets, and can be run via hooks (before you even write outputs to a filesystem). I had a quick look at the Great Expectations codebase and, as far as I can tell, it doesn’t look too complex to implement such a plugin.
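A minimal sketch of the hook-based idea described above, not an existing plugin. It assumes Kedro's `before_dataset_saved` hook, which runs before a dataset is written. The `DataValidationHooks` class, the `expectations` layout, and the two check names (`not_null`, `between`) are hypothetical and only illustrate how config-driven rules could be applied without extra dependencies. The fallback decorator is there only so the sketch also runs without Kedro installed.

```python
from typing import Any, Dict, List

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so this sketch runs without Kedro installed.
    def hook_impl(func):
        return func


class DataValidationHooks:
    """Hypothetical hook class; `expectations` is an assumed config layout
    mapping a dataset name to a list of rules."""

    def __init__(self, expectations: Dict[str, List[Dict[str, Any]]]):
        self.expectations = expectations

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data: Any) -> None:
        # Runs before the dataset is saved, so bad data is never written.
        failures = []
        for rule in self.expectations.get(dataset_name, []):
            # Works for anything indexable by column name (dict of lists,
            # pandas DataFrame, ...). Note: NaN is not treated as null here.
            values = list(data[rule["column"]])
            if rule["check"] == "not_null":
                ok = all(v is not None for v in values)
            elif rule["check"] == "between":
                ok = all(rule["min"] <= v <= rule["max"] for v in values)
            else:
                raise ValueError(f"Unknown check: {rule['check']}")
            if not ok:
                failures.append(f"{rule['column']}: {rule['check']}")
        if failures:
            raise ValueError(
                f"Validation failed for '{dataset_name}': {failures}"
            )
```

In a Kedro project, an instance of such a class would typically be registered through `HOOKS` in `settings.py`. The rules themselves could live in the project's configuration; how they are loaded is left open here.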
datajoely07/27/2023, 3:38 PM
Juan Luis07/27/2023, 3:46 PM
work? Does it generate test code, or perform validation on the fly? How would this be different from generating appropriate tests + running
datajoely07/27/2023, 3:46 PM
Deepyaman Datta07/27/2023, 4:02 PM
Nok Lam Chan07/27/2023, 4:13 PM
or you create additional config files to map GE datasets to Kedro Datasets. The existing
hook will be very useful
datajoely07/27/2023, 4:44 PM
Deepyaman Datta07/27/2023, 4:45 PM
> I guess this is why the old plugin isn’t maintained anymore.
The plugin was never maintained by the Kedro team (fair, plenty of things that the core team is responsible for), but also no other team/group is willing to make a small, well-defined maintenance commitment for it (🤷), nor were people willing to open it up to the outside community.
> Would love to see both kedro + great_expectations & kedro + pandera plugins available for the community
Is there a measured demand anywhere? I'd be more inclined to build out the Kedro-Great Expectations plugin if I know there are a lot of people who would be willing to use it (still, no promises :P); otherwise, I think it's better that somebody who actually uses Kedro day-to-day and needs the functionality themselves builds it out.
> GX has now reached a good level of stability, it was too early last time this was done properly
As a side note, last I looked at this, they removed the CLI commands entirely from the getting started workflow. But the CLI still exists. I'm not sure if this is indicative of them wanting to move towards more Python integrations.
Juan Luis07/27/2023, 4:47 PM
> Is there a measured demand anywhere?
I don't have evidence for this, but I can only guess that people working on data pipelines want to do data validation as well, or at a minimum some form of unit testing
Nok Lam Chan07/27/2023, 4:49 PM
now, not much for data validation & data versioning.
Neeraj Malhotra07/27/2023, 6:59 PM
Kedro + GE. I think both should exist and consumers will decide what they want. But if it's a question about demand for one of them, then I would go with Kedro + pandera, as GE has been there for a loooong time but hasn’t impressed users (mostly pains down the line). That was one of the reasons we chose kedro + pandera. Moreover, Pandera’s demand has skyrocketed 🚀 in recent times with 10M+ downloads. I think that tells a story.. 🤓 But again, I am indifferent to both as long as users want them and someone can really commit time to develop it. 🙂
Matthias Roels07/27/2023, 7:35 PM
Yolan Honoré-Rougé07/27/2023, 8:05 PM
is a bit different than
seems to be more focused on validating schema at runtime, while GX is an entire framework to pull data and validate data statistics over time, so I don't think they really compete in the same field. I really disliked gx in the 0.11-1.12 series because the API was unreliable and the abstractions weren't very clear, so it was hard to integrate. The docs seem clearer now, but I don't have experience with recent versions, so I can't really be assertive here.
Neeraj Malhotra07/27/2023, 8:22 PM
Nok Lam Chan07/27/2023, 8:29 PM
Matthias Roels07/27/2023, 8:30 PM
Yolan Honoré-Rougé07/27/2023, 8:46 PM
> But just like GX, it is missing polars support which is getting quite popular
I feel it is easier (and has more chances to happen in a short timeframe) to bring
(or anything else) to data validation backend (e.g.
) when the library gets traction rather than having kedro manage it all by itself, because the team will soon become overwhelmed. I'd be very curious to see if native experiment tracking is really more used by kedro users than other solutions like
Deepyaman Datta07/28/2023, 5:52 AM
property--that way, you don't have to handle each lib separately. E.g. in that case you would have it available for polars too via https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.__dataframe__.html
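A rough sketch of the library-agnostic approach hinted at above, assuming the input implements the dataframe interchange protocol via `__dataframe__` (as the linked polars page describes, and as pandas also does in recent versions). The function name `summarize_columns` and the shape of its return value are hypothetical. Only `column_names()`, `get_column_by_name()` and `null_count` from the protocol are used.

```python
from typing import Any, Dict, Iterable


def summarize_columns(df: Any, required: Iterable[str]) -> Dict[str, Any]:
    """Check required columns on any object exposing `__dataframe__`.

    Hypothetical helper: reports which required columns are missing and
    the null count of those that are present.
    """
    if not hasattr(df, "__dataframe__"):
        raise TypeError(
            "object does not implement the dataframe interchange protocol"
        )
    required = list(required)
    xdf = df.__dataframe__()  # library-neutral interchange object
    present = set(xdf.column_names())
    missing = sorted(set(required) - present)
    null_counts = {
        name: xdf.get_column_by_name(name).null_count
        for name in required
        if name in present
    }
    return {"missing": missing, "null_counts": null_counts}
```

With this shape, a validation plugin would not need a separate code path per dataframe library: anything exposing `__dataframe__` could be passed in directly, e.g. `summarize_columns(df, ["id", "age"])`.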
Matthias Roels07/28/2023, 6:49 AM