# plugins-integrations
m
dbt has `dbt test` built-in. Combined with the additional dbt-expectations plugin, you get almost all of the data validation functionality of Great Expectations for free (i.e. with no additional Python dependencies). Would it make sense for kedro to have a similar plugin? One that requires no additional dependencies, is compatible with the config system and kedro-datasets, and that can be run via hooks (before you even write outputs to a filesystem). I had a quick look into the Great Expectations code-base and, as far as I can tell, it doesn't look too complex to implement such a plugin.
👍🏼 1
d
I’ve wanted this for years
I think Pandera is the way to do this and thanks to @Neeraj Malhotra it now supports PySpark!
j
how does `dbt test` work? does it generate test code, perform validation on the fly? how would this be different from generating appropriate tests + running `pytest`?
d
It runs on persisted data
so it's not online checks, it's more about checks on materialised stuff. It's possible to add tests at a project level using YAML or to annotate individual SQL files.
d
It's not too complex. QuantumBlack/McKinsey had a good Kedro-Great Expectations plugin internally, but it's not maintained (nor released). I started implementing https://github.com/deepyaman/kedro-great-expectations, but got distracted. IMO Pandera and Great Expectations are different tools (with pros/cons), and it's not fair to simply say that Pandera is the way to go. However, if you are interested in a Pandera integration, I don't know what the plan forward with https://github.com/Galileo-Galilei/kedro-pandera is.
n
@Juan Luis https://docs.getdbt.com/docs/build/tests You get some declarative tests out of the box, and with Great Expectations (via dbt-expectations) you can declare more exotic tests.
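For reference, this is roughly what declarative dbt tests look like in a model's schema YAML (the model/column names and the dbt-expectations check below are illustrative, not from this thread):

```yaml
# models/schema.yml (illustrative model and column names)
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          # from the dbt-expectations package
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```

`dbt test` then runs each of these as a query against the materialised table and reports failing rows.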
👀 1
It shouldn't be too hard to implement. GE has its own config; essentially you need to make a plugin that makes sure the configs are compatible. It could either leverage `catalog.yml` + `metadata`, or you create an additional config file to map GE datasets to Kedro datasets. The existing `after_dataset_loaded` hook will be very useful.
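To make the hook idea concrete, here is a stdlib-only sketch of validation rules keyed by dataset name and run from a hook-shaped method. This is not kedro code: in a real plugin the method would be registered via kedro's `@hook_impl`, the rules would be parsed from `catalog.yml` metadata, and `data` would be a dataframe rather than the list-of-dicts stand-in used here. All names below are made up.

```python
from typing import Any, Callable

# Hypothetical rule set: dataset name -> list of (description, predicate over rows).
# In a real plugin this would be built from catalog.yml `metadata` entries.
RULES: dict[str, list[tuple[str, Callable[[list[dict[str, Any]]], bool]]]] = {
    "orders": [
        ("order_id not null", lambda rows: all(r.get("order_id") is not None for r in rows)),
        ("order_id unique", lambda rows: len({r["order_id"] for r in rows}) == len(rows)),
    ],
}

class ValidationHooks:
    def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
        """Run the configured checks right after a dataset is loaded."""
        failures = [desc for desc, check in RULES.get(dataset_name, []) if not check(data)]
        if failures:
            raise ValueError(f"{dataset_name} failed checks: {failures}")

hooks = ValidationHooks()
hooks.after_dataset_loaded("orders", [{"order_id": 1}, {"order_id": 2}])  # passes
```

Because the check runs at load time, a bad input fails the run before any downstream node executes or writes output.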
Not a big fan of the GE API; historically it has had a lot of breaking changes, and I guess this is why the old plugin isn't maintained anymore. It supports SQL/Spark/Pandas; Pandera didn't have native Spark support until now.
Would love to see both kedro + great_expectations & kedro + pandera plugins available for the community 🙂
d
GX has now reached a good level of stability, it was too early last time this was done properly
👍 1
d
I guess this is why the old plugin isn’t maintained anymore.
The plugin was never maintained by the Kedro team (fair, plenty of things that the core team is responsible for), but also no other team/group is willing to make a small, well-defined maintenance commitment for it (🤷), nor were people willing to open it up to the outside community.
Would love to see both kedro + great_expectations & kedro + pandera plugins available for the community
Is there a measured demand anywhere? I'd be more inclined to build out the Kedro-Great Expectations plugin if I knew there are a lot of people who would be willing to use it (still, no promises :P); otherwise, I think it's better for somebody who actually uses Kedro day-to-day and needs the functionality themselves to build it out.
👍🏼 1
GX has now reached a good level of stability, it was too early last time this was done properly
As a side note, last I looked at this, they removed the CLI commands entirely from the getting-started workflow. But the CLI still exists. I'm not sure if this is indicative of them wanting to move towards more Python integrations.
👀 1
j
Is there a measured demand anywhere?
I don't have evidence for this, but I can only guess that people working on data pipelines want to do data validation as well, or at a minimum some form of unit testing.
n
Kedro pipeline + data versioning + experiment tracking + data validation was a natural choice of integration when I started 3 years ago.
For experiment tracking we have `kedro-mlflow` and `kedro-neptune` now; not much for data validation & data versioning.
n
In general, I am indifferent to plugins, whether `Kedro + GE` or `Kedro + pandera`. I think both should exist and consumers will decide what they want. But if it's a question about demand for one of them, then I would go with `kedro + pandera`, as GE has been there for a loooong time but hasn't impressed users (mostly pains down the line). That was one of the reasons we chose `Pandera`. Moreover, Pandera's demand has skyrocketed 🚀 in recent times with 10M+ downloads. I think that tells a story.. 🤓 But again, I am indifferent to both as long as users want them and someone can really commit time to develop it. 🙂
👍 1
m
I wasn't talking about plugins for GE or pandera, but rather creating one within the kedro framework natively! With support for pandas/spark and polars. The plugin would then allow for checks of in-memory datasets after they are created and before you persist the data and start running the next node. This way, you would need no additional dependencies! I hate GE because it's complex to use and installs a ton of additional dependencies to make it work. And to be honest, managing Python dependencies is already enough of a nightmare; I don't need a ton of extra stuff that I don't really need. Edit: the same can be said for MLflow btw. That's why I am watching experiment tracking with great interest!
Just had a look at pandera. It looks interesting, but it only supports pandas and pyspark-pandas. Not sure what the latter means (can we even do validations on generic spark dataframes?) though…
y
I'd love to have a data validation plugin for kedro, and I am actually keen on making this work. I really plan to work on kedro-pandera, but I have no timeline for this, maybe this summer. From what I read, I think that the goal of `pandera` is a bit different from `GX`'s. `pandera` seems to be more focused on validating schemas at runtime, while GX is an entire framework to pull data and validate data statistics over time, so I don't think they really compete on the same field. I really disliked GX in the 0.11-0.12 series because the API was unreliable and the abstractions weren't very clear, so it was hard to integrate. The docs seem clearer now, but I don't have experience with recent versions so I can't really be assertive here.
🔥 1
👍 2
n
@Matthias Roels, Pandera does support native PySpark dataframes: https://pandera.readthedocs.io/en/stable/pyspark_sql.html
👍 1
n
@Matthias Roels are you suggesting a kedro-native plugin? isn't it reinventing the wheel, though, when we can leverage these existing libraries which already do the job?
👍🏼 1
m
But just like GX, it is missing polars support which is getting quite popular
Reading through the comments, a kedro-pandera plugin might be a good starting point to bring data validation into kedro. Unfortunately, I have no experience with pandera so I am not the right person to implement it.
y
But just like GX, it is missing polars support which is getting quite popular
I feel it is easier (and has more chances of happening in a short timeframe) to bring `polars` (or anything else) to a data validation backend (e.g. `pandera` or `gx`) when the library gets traction, rather than having kedro manage it all by itself, because the team will soon become overwhelmed. I'd be very curious to see if native experiment tracking is really more used by kedro users than other solutions like `kedro-mlflow`, `kedro-neptune`...
d
I feel like these libs should move towards type-checking the `__dataframe__` property -- that way, you don't have to handle each lib separately. E.g. in that case you would have it available for polars too via https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.__dataframe__.html
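To illustrate the idea: the dataframe interchange protocol lets a check work against any frame exposing `__dataframe__()`. The sketch below uses only `column_names()` from the protocol and a tiny stdlib stand-in for the demo (a real pandas or polars frame would plug in the same way; `FakeFrame` and `check_required_columns` are made-up names):

```python
def check_required_columns(df, required: set[str]) -> set[str]:
    """Return the required columns missing from any protocol-compliant frame."""
    names = set(df.__dataframe__().column_names())
    return required - names

class _FakeInterchange:
    """Minimal stand-in for the object __dataframe__() returns."""
    def __init__(self, names):
        self._names = list(names)
    def column_names(self):
        return self._names

class FakeFrame:
    """Stand-in for pandas/polars: only implements __dataframe__()."""
    def __init__(self, names):
        self._names = names
    def __dataframe__(self, allow_copy=True):
        return _FakeInterchange(self._names)

missing = check_required_columns(FakeFrame(["order_id", "status"]), {"order_id", "amount"})
# missing == {"amount"}
```

The same function would accept a `polars.DataFrame` or `pandas.DataFrame` unchanged, which is exactly the appeal of type-checking via the protocol.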
💯 1
m
It's not just type checking. It's also checking for null values, checking unique constraints, checking whether a column contains a particular set of values, etc. But also making sure there is room for custom business checks (i.e. the framework should be extensible).
As it is not a good idea to allow all users to create extensions in the DataFrame APIs directly, those things will never be possible within these frameworks alone. But you could make use of the fact that the DataFrame APIs are relatively stable to create "metrics" classes to get all sorts of info out of a dataframe (which is what pandera does too?). Using these metrics classes, it is straightforward to define a set of data checks/tests/validations (whatever you want to call them). Then it's just a matter of deciding the best way to generate the output (or creating a bunch of options for the user to decide). I'm thinking about logs, an HTML report, kedro viz, …
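A stdlib-only sketch of the "metrics classes" idea described above: small metric classes extract values from a frame-like object (rows of dicts here, standing in for a real dataframe), and checks assert on those values. Users extend it by subclassing `Metric`. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable

Rows = list[dict[str, Any]]

class Metric:
    """Base class: subclass this for custom business metrics."""
    def compute(self, rows: Rows) -> Any:
        raise NotImplementedError

@dataclass
class NullCount(Metric):
    column: str
    def compute(self, rows: Rows) -> int:
        return sum(1 for r in rows if r.get(self.column) is None)

@dataclass
class DistinctCount(Metric):
    column: str
    def compute(self, rows: Rows) -> int:
        return len({r.get(self.column) for r in rows})

@dataclass
class Check:
    name: str
    metric: Metric
    passes: Callable[[Any], bool]  # predicate over the metric value

    def run(self, rows: Rows) -> dict[str, Any]:
        value = self.metric.compute(rows)
        return {"check": self.name, "value": value, "passed": bool(self.passes(value))}

def run_suite(rows: Rows, checks: list[Check]) -> list[dict[str, Any]]:
    """Plain-dict results: easy to route to logs, an HTML report, or kedro-viz."""
    return [c.run(rows) for c in checks]
```

Keeping the report as plain dicts is deliberate: the same result structure can feed whichever output the user picks (logs, HTML, viz), matching the "bunch of options" point above.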