# plugins-integrations
m
dbt has `dbt test` built-in. Combined with the additional dbt-expectations plugin, you get almost all of the data validation functionality of Great Expectations for free (i.e. with no additional Python dependencies). Would it make sense for kedro to have a similar plugin? One that requires no additional dependencies, is compatible with the config system and kedro-datasets, and that can be run via hooks (before you even write outputs to a filesystem). I had a quick look into the Great Expectations code-base and, as far as I can tell, it doesn't look too complex to implement such a plugin.
👍🏼 1
d
I’ve wanted this for years
I think Pandera is the way to do this and thanks to @Neeraj Malhotra it now supports PySpark!
j
how does `dbt test` work? does it generate test code, perform validation on the fly? how would this be different from generating appropriate tests + running `pytest`?
d
It runs on persisted data
so it's not online checks, it's more about checks on materialised stuff. It's possible to add tests at a project level using YAML or to annotate individual SQL files.
d
It's not too complex. QuantumBlack/McKinsey had a good Kedro-Great Expectations plugin internally, but it's not maintained (nor released). I started implementing https://github.com/deepyaman/kedro-great-expectations, but got distracted. IMO Pandera and Great Expectations are different tools (with pros/cons), and it's not fair to simply say that Pandera is the way to go. However, if you are interested in a Pandera integration, I don't know what the plan forward with https://github.com/Galileo-Galilei/kedro-pandera is.
n
@Juan Luis https://docs.getdbt.com/docs/build/tests You get some declarative tests out of the box, and with Great Expectations (via dbt-expectations) you can declare more exotic tests.
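For reference, this is roughly what declarative dbt tests look like in a model's schema YAML (the model/column names and the dbt-expectations check below are illustrative, not from this thread):

```yaml
# models/schema.yml (illustrative model and column names)
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          # from the dbt-expectations package
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```

`dbt test` then runs each of these as a query against the materialised table and reports failing rows.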
👀 1
It shouldn't be too hard to implement. GE has its own config; essentially you need to make a plugin that makes sure the configs are compatible. It could either leverage `catalog.yml` + `metadata`, or you create an additional config file to map GE datasets to Kedro datasets. The existing `after_dataset_loaded` hook will be very useful.
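To make the hook idea concrete, here is a stdlib-only sketch of validation rules keyed by dataset name and run from a hook-shaped method. This is not kedro code: in a real plugin the method would be registered via kedro's `@hook_impl`, the rules would be parsed from `catalog.yml` metadata, and `data` would be a dataframe rather than the list-of-dicts stand-in used here. All names below are made up.

```python
from typing import Any, Callable

# Hypothetical rule set: dataset name -> list of (description, predicate over rows).
# In a real plugin this would be built from catalog.yml `metadata` entries.
RULES: dict[str, list[tuple[str, Callable[[list[dict[str, Any]]], bool]]]] = {
    "orders": [
        ("order_id not null", lambda rows: all(r.get("order_id") is not None for r in rows)),
        ("order_id unique", lambda rows: len({r["order_id"] for r in rows}) == len(rows)),
    ],
}

class ValidationHooks:
    def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
        """Run the configured checks right after a dataset is loaded."""
        failures = [desc for desc, check in RULES.get(dataset_name, []) if not check(data)]
        if failures:
            raise ValueError(f"{dataset_name} failed checks: {failures}")

hooks = ValidationHooks()
hooks.after_dataset_loaded("orders", [{"order_id": 1}, {"order_id": 2}])  # passes
```

Because the check runs at load time, a bad input fails the run before any downstream node executes or writes output.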
Not a big fan of the GE API; historically it has had a lot of breaking changes, and I guess this is why the old plugin isn't maintained anymore. It supports SQL/Spark/Pandas; Pandera didn't have native Spark support until now.
Would love to see both kedro + great_expectations & kedro + pandera plugins available for the community 🙂
d
GX has now reached a good level of stability, it was too early last time this was done properly
👍 1
d
I guess this is why the old plugin isn’t maintained anymore.
The plugin was never maintained by the Kedro team (fair, plenty of things that the core team is responsible for), but also no other team/group is willing to make a small, well-defined maintenance commitment for it (🤷), nor were people willing to open it up to the outside community.
Would love to see both kedro + great_expectations & kedro + pandera plugins available for the community
Is there a measured demand anywhere? I'd be more inclined to build out the Kedro-Great Expectations plugin if I knew there are a lot of people who would be willing to use it (still, no promises :P); otherwise, I think it's better for somebody who actually uses Kedro day-to-day and needs the functionality themselves to build it out.
👍🏼 1
GX has now reached a good level of stability, it was too early last time this was done properly
As a side note, last I looked at this, they removed the CLI commands entirely from the getting-started workflow. But the CLI still exists. I'm not sure if this is indicative of them wanting to move towards more Python integrations.
👀 1
j
Is there a measured demand anywhere?
I don't have evidence for this, but I can only guess that people working on data pipelines want to do data validation as well, or at a minimum some form of unit testing.
n
Kedro pipeline + data versioning + experiment tracking + data validation was a natural choice of integration when I started 3 years ago.
For experiment tracking we have `kedro-mlflow` and `kedro-neptune` now; not much for data validation & data versioning.
n
In general, I am indifferent to plugins, whether `Kedro + GE` or `Kedro + pandera`. I think both should exist and consumers will decide what they want. But if it's a question about demand for one of them, then I would go with `kedro + pandera`, as GE has been there for a loooong time but hasn't impressed users (mostly pains down the line). That was one of the reasons we chose `Pandera`. Moreover, Pandera's demand has skyrocketed 🚀 in recent times with 10M+ downloads. I think that tells a story.. 🤓 But again, I am indifferent to both as long as users want them and someone can really commit time to develop it. 🙂
👍 1
m
I wasn't talking about plugins for GE or pandera, but rather creating one within the kedro framework natively! With support for pandas/spark and polars. The plugin would then allow for checks of in-memory datasets after they are created and before you persist the data and start running the next node. This way, you would need no additional dependencies! I hate GE because it's complex to use and installs a ton of additional dependencies to make it work. And to be honest, managing Python dependencies is already enough of a nightmare; I don't need a ton of extra stuff that I don't really need. Edit: the same can be said for MLflow btw. That's why I am watching experiment tracking with great interest!
Just had a look at pandera. It looks interesting, but it only supports pandas and pyspark-pandas. Not sure what the latter means (can we even do validations on generic spark dataframes?) though…
y
I'd love to have a data validation plugin for kedro, and I am actually keen on making this work. I really plan to work on kedro-pandera, but I have no timeline for this, maybe this summer. From what I read, I think that the goal of `pandera` is a bit different from `GX`'s. `pandera` seems to be more focused on validating schemas at runtime, while GX is an entire framework to pull data and validate data statistics over time, so I don't think they really compete on the same field. I really disliked GX in the 0.11-0.12 series because the API was unreliable and the abstractions weren't very clear, so it was hard to integrate. The docs seem clearer now, but I don't have experience with recent versions so I can't really be assertive here.
🔥 1
👍 2
n
@Matthias Roels, Pandera does support native PySpark dataframes: https://pandera.readthedocs.io/en/stable/pyspark_sql.html
👍 1
n
@Matthias Roels are you suggesting a kedro-native plugin? isn't it reinventing the wheel, though, when we can leverage these existing libraries which already do the job?
👍🏼 1
m
But just like GX, it is missing polars support which is getting quite popular
Reading through the comments, a kedro-pandera plugin might be a good starting point to bring data validation into kedro. Unfortunately, I have no experience with pandera so I am not the right person to implement it.
y
But just like GX, it is missing polars support which is getting quite popular
I feel it is easier (and has more chances of happening in a short timeframe) to bring `polars` (or anything else) to a data validation backend (e.g. `pandera` or `gx`) when the library gets traction, rather than having kedro manage it all by itself, because the team will soon become overwhelmed. I'd be very curious to see if native experiment tracking is really more used by kedro users than other solutions like `kedro-mlflow`, `kedro-neptune`...
d
I feel like these libs should move towards type-checking the `__dataframe__` property -- that way, you don't have to handle each lib separately. E.g. in that case you would have it available for polars too via https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.__dataframe__.html
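To illustrate the idea: the dataframe interchange protocol lets a check work against any frame exposing `__dataframe__()`. The sketch below uses only `column_names()` from the protocol and a tiny stdlib stand-in for the demo (a real pandas or polars frame would plug in the same way; `FakeFrame` and `check_required_columns` are made-up names):

```python
def check_required_columns(df, required: set[str]) -> set[str]:
    """Return the required columns missing from any protocol-compliant frame."""
    names = set(df.__dataframe__().column_names())
    return required - names

class _FakeInterchange:
    """Minimal stand-in for the object __dataframe__() returns."""
    def __init__(self, names):
        self._names = list(names)
    def column_names(self):
        return self._names

class FakeFrame:
    """Stand-in for pandas/polars: only implements __dataframe__()."""
    def __init__(self, names):
        self._names = names
    def __dataframe__(self, allow_copy=True):
        return _FakeInterchange(self._names)

missing = check_required_columns(FakeFrame(["order_id", "status"]), {"order_id", "amount"})
# missing == {"amount"}
```

The same function would accept a `polars.DataFrame` or `pandas.DataFrame` unchanged, which is exactly the appeal of type-checking via the protocol.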
💯 1
m
It's not just type checking. It's also checking for null values, checking unique constraints, checking whether a column contains a particular set of values, etc. But also making sure there is room for custom business checks (i.e. the framework should be extensible).
As it is not a good idea to allow all users to create extensions in the DataFrame APIs directly, those things will never be possible within these frameworks alone. But you could make use of the fact that the DataFrame APIs are relatively stable to create "metrics" classes to get all sorts of info out of a dataframe (which is what pandera does too?). Using these metrics classes, it is straightforward to define a set of data checks/tests/validations (whatever you want to call them). Then it's just a matter of deciding the best way to generate the output (or creating a bunch of options for the user to decide). I'm thinking about logs, an HTML report, kedro viz, …
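A stdlib-only sketch of the "metrics classes" idea described above: small metric classes extract values from a frame-like object (rows of dicts here, standing in for a real dataframe), and checks assert on those values. Users extend it by subclassing `Metric`. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable

Rows = list[dict[str, Any]]

class Metric:
    """Base class: subclass this for custom business metrics."""
    def compute(self, rows: Rows) -> Any:
        raise NotImplementedError

@dataclass
class NullCount(Metric):
    column: str
    def compute(self, rows: Rows) -> int:
        return sum(1 for r in rows if r.get(self.column) is None)

@dataclass
class DistinctCount(Metric):
    column: str
    def compute(self, rows: Rows) -> int:
        return len({r.get(self.column) for r in rows})

@dataclass
class Check:
    name: str
    metric: Metric
    passes: Callable[[Any], bool]  # predicate over the metric value

    def run(self, rows: Rows) -> dict[str, Any]:
        value = self.metric.compute(rows)
        return {"check": self.name, "value": value, "passed": bool(self.passes(value))}

def run_suite(rows: Rows, checks: list[Check]) -> list[dict[str, Any]]:
    """Plain-dict results: easy to route to logs, an HTML report, or kedro-viz."""
    return [c.run(rows) for c in checks]
```

Keeping the report as plain dicts is deliberate: the same result structure can feed whichever output the user picks (logs, HTML, viz), matching the "bunch of options" point above.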