Matthias Roels
07/27/2023, 3:35 PM
`dbt test` built-in. Combined with an additional dbt-expectations plugin, you get almost all data validation functionality from Great Expectations for free (i.e. with no additional Python dependencies).
Would it make sense for kedro to have a similar plugin? One that requires no additional dependencies, is compatible with the config system and kedro-datasets, and can be run via hooks (before you even write outputs to a filesystem).
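A minimal sketch of what such a hook-based plugin could look like, assuming Kedro's `before_dataset_saved` hook and a hand-rolled registry of checks. `ValidationHooks`, `not_null`, and the `checks` mapping are hypothetical names for illustration, not an existing plugin API, and the fallback decorator is only there so the sketch runs without Kedro installed.

```python
from typing import Any, Callable, Dict, List

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch runs without Kedro installed.
    def hook_impl(func):
        return func


Row = Dict[str, Any]
Check = Callable[[List[Row]], bool]


def not_null(column: str) -> Check:
    """Hypothetical check: every row has a non-None value in `column`."""
    return lambda rows: all(row.get(column) is not None for row in rows)


class ValidationHooks:
    """Hypothetical hook class that validates data before it is written."""

    def __init__(self, checks: Dict[str, List[Check]]):
        # Maps dataset names to the checks that must pass before saving.
        self.checks = checks

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data: Any) -> None:
        # Raise before the dataset reaches the filesystem if any check fails.
        for check in self.checks.get(dataset_name, []):
            if not check(data):
                raise ValueError(f"Validation failed for dataset '{dataset_name}'")


# Rows are plain dicts here to keep the sketch dependency-free.
hooks = ValidationHooks({"model_input": [not_null("id")]})
hooks.before_dataset_saved(dataset_name="model_input", data=[{"id": 1}, {"id": 2}])
```

In a real project the instance would presumably be registered through `HOOKS` in `settings.py`, and the checks would operate on whatever object the node returns rather than on lists of dicts.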
I had a quick look into the Great Expectations code base and, as far as I can tell, it doesn’t look too complex to implement such a plugin.
datajoely
07/27/2023, 3:38 PM
Juan Luis
07/27/2023, 3:46 PM
how does `dbt test` work? does it generate test code, or perform validation on the fly?
how would this be different from generating appropriate tests + running `pytest`?
datajoely
07/27/2023, 3:46 PM
Deepyaman Datta
07/27/2023, 4:02 PM
Nok Lam Chan
07/27/2023, 4:13 PM
`catalog.yml` + metadata, or you create an additional config file to map GE datasets to Kedro Datasets.
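One way to read the first option, as a sketch: validation rules live under the `metadata` key of a catalog entry, and a plugin collects them per dataset. The catalog is shown as the dict you would get after loading `catalog.yml`; the `validation` key, the rule names, and `collect_rules` are assumptions for illustration, not an existing Kedro or GE convention.

```python
from typing import Any, Dict

# Catalog entries as they might look after loading catalog.yml.
# The `validation` block under `metadata` is a hypothetical convention.
catalog_config: Dict[str, Dict[str, Any]] = {
    "companies": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/companies.csv",
        "metadata": {"validation": {"not_null": ["id"], "unique": ["id"]}},
    },
    "reviews": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/reviews.csv",
    },
}


def collect_rules(config: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return the validation rules declared per dataset, skipping entries without any."""
    rules = {}
    for name, entry in config.items():
        validation = (entry.get("metadata") or {}).get("validation")
        if validation:
            rules[name] = validation
    return rules


rules = collect_rules(catalog_config)
```

Keeping the rules next to the dataset definition avoids a second mapping file, at the cost of tying the rule format to the catalog schema.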
The existing `after_dataset_loaded` hook will be very useful.
datajoely
07/27/2023, 4:44 PM
Deepyaman Datta
07/27/2023, 4:45 PM
> I guess this is why the old plugin isn’t maintained anymore.
The plugin was never maintained by the Kedro team (fair, plenty of things that the core team is responsible for), but also no other team/group is willing to make a small, well-defined maintenance commitment for it (🤷), nor were people willing to open it up to the outside community.
> Would love to see both kedro + great_expectations & kedro + pandera plugins available for the community
Is there a measured demand anywhere? I'd be more inclined to build out the Kedro-Great Expectations plugin if I knew there were a lot of people who would be willing to use it (still, no promises :P); otherwise, I think it's better if somebody who actually uses Kedro day-to-day and needs the functionality themselves builds it out.
> GX has now reached a good level of stability, it was too early last time this was done properly
As a side note, last I looked at this, they had removed the CLI commands entirely from the getting started workflow. But the CLI still exists. I'm not sure if this is indicative of them wanting to move towards more Python integrations.
Juan Luis
07/27/2023, 4:47 PM
> Is there a measured demand anywhere?
I don't have evidence for this, but I can only guess that people working on data pipelines want to do data validation as well, or at a minimum some form of unit testing.
Nok Lam Chan
07/27/2023, 4:49 PM
`kedro-mlflow`, `kedro-neptune` now, not much for data validation & data versioning.
Neeraj Malhotra
07/27/2023, 6:59 PM
`Kedro + GE` or `Kedro + pandera`. I think both should exist and consumers will decide what they want.
But if it's a question about demand for one of them, then I would go with `kedro + pandera`, as GE has been there for a loooong time but hasn’t impressed users (mostly pains down the line). That was one of the reasons we chose `Pandera`. Moreover, Pandera’s demand has skyrocketed 🚀 in recent times with 10M+ downloads. I think that tells a story.. 🤓
But again, I am indifferent to both as long as users want them and someone can really commit time to develop it. 🙂
Matthias Roels
07/27/2023, 7:35 PM
Yolan Honoré-Rougé
07/27/2023, 8:05 PM
`pandera` is a bit different than `GX`. `pandera` seems to be more focused on validating schemas at runtime, while GX is an entire framework to pull data and validate data statistics over time, so I don't think they really compete in the same field. I really disliked gx in the 0.11-1.12 series because the API was unreliable and the abstractions weren't very clear, so it was hard to integrate. The docs seem clearer now, but I don't have experience with recent versions so I can't really be assertive here.
Neeraj Malhotra
07/27/2023, 8:22 PM
Nok Lam Chan
07/27/2023, 8:29 PM
Matthias Roels
07/27/2023, 8:30 PM
Yolan Honoré-Rougé
07/27/2023, 8:46 PM
> But just like GX, it is missing polars support which is getting quite popular
I feel it is easier (and has more chances to happen in a short timeframe) to bring `polars` (or anything else) to a data validation backend (e.g. `pandera` or `gx`) when the library gets traction, rather than having kedro manage it all by itself, because the team will soon become overwhelmed. I'd be very curious to see if native experiment tracking is really more used by kedro users than other solutions like `kedro-mlflow`, `kedro-neptune`...
Deepyaman Datta
07/28/2023, 5:52 AM
`__dataframe__` property--that way, you don't have to handle each lib separately. E.g. in that case you would have it available for polars too via https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.__dataframe__.html
Matthias Roels
07/28/2023, 6:49 AM
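The `__dataframe__` suggestion above can be sketched as a check written once against the dataframe interchange protocol instead of once per library. This assumes the object returned by `__dataframe__()` exposes `column_names()` as the protocol describes; `require_columns` and the stand-in classes are hypothetical and only keep the example free of pandas or polars.

```python
from typing import Any, Iterable, List


def require_columns(df: Any, required: Iterable[str]) -> List[str]:
    """Return the required column names that are missing from `df`."""
    if not hasattr(df, "__dataframe__"):
        raise TypeError("object does not support the dataframe interchange protocol")
    interchange = df.__dataframe__()
    present = set(interchange.column_names())
    return [name for name in required if name not in present]


class _FakeInterchange:
    """Stand-in for the interchange object a real library would return."""

    def __init__(self, columns: List[str]):
        self._columns = columns

    def column_names(self) -> List[str]:
        return list(self._columns)


class FakeFrame:
    """Stand-in for a dataframe from any library implementing the protocol."""

    def __init__(self, columns: List[str]):
        self._columns = columns

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return _FakeInterchange(self._columns)


missing = require_columns(FakeFrame(["id", "price"]), ["id", "created_at"])
```

With a library that implements the protocol, the same function should accept its dataframes directly, with no per-library branch in the validation code.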