# questions
i
Hello! I'm looking into ways to add data validation to our pipelines at runtime and came across this really cool example project using Great Expectations by (I assume) @Erwin https://github.com/erwinpaillacan/kedro-great-expectations-example It seems like a good way forward, using the hooks to run the validations if the dataset has some validations mapped to it in config, but I was wondering if anybody has done it a different way, by treating the Great Expectations outputs as kedro datasets themselves. I ask this because we have all our blob connectors implemented as kedro custom datasets, and the easiest way for us to save these validations would be by treating them as outputs from kedro nodes. I'm not interested in the HTML report output, only in the JSON outputs, as we would then want to send alerts based on those.
👀 1
👍 1
I'm wondering if great expectations is too opinionated about the way it treats its outputs to adapt well to the kedro view of nodes-outputs etc.
d
So we had a version of a plug-in that we’ve never open sourced because the GE API kept breaking
i
Yeah we have that closed-source wheel from y'all haha
Trying to move away from it as it limits us to 17.X
d
from reflection there are two ways of doing this:
• online checks that validate at runtime
• offline checks that run on persisted data (kinda like how dbt works)
If I’m honest I really prefer Pandera for the first one
but it’s not 100% there for Spark yet
and we’ve actually got a team at QB who are looking to contribute that missing part to the library
i
We're pure pandas so Spark isn't a dealbreaker for us. Thing is we already have validations built up so would need a rly good reason to move towards Pandera. It does seem to be more lightweight, and potentially configurable through kedro datasets. Do you have any examples within kedro I could look into?
j
notice that pandera 0.14 refactored everything to prepare for Spark and Polars compatibility 🔮
🔥 1
d
well @Iñigo Hidalgo to use pandera you can just decorate your python functions
but if you’re looking to leverage your existing validators then that may be worth pursuing
in truth though, I’ve not got enough GE experience to recommend any specific next steps
i
i think i'll spend some time looking into pandera, since i have seen it brought up a few times. runtime validation is the most important use case for us rn, so if I can get something basic up and running soon it might be worth pursuing
❤️ 3
do you have any examples of somebody who's integrated it into kedro? I'm particularly interested in generating these expectations (schemas in pandera's case) from one pipeline run, for example feature engineering, saving the config as a kedro (yaml?) dataset and then instantiating these schemas to run in a different pipeline, for example when predicting. We're trying to write code which we can reuse across dozens of different pipelines without too much ad hoc config, and pretty much all our code runs through kedro, so good integration using params and datasets would save us a lot of time
d
I don't, unfortunately
i
I guess it's quite a specific requirement. Pandera does seem interesting though. I need to look further into what its checks actually return, since we'll want to store these results either in SQL or blob
n
Worth mentioning we are going to add the ability to use custom attributes on Datasets; this may be useful if you need to build validation plugins. https://github.com/kedro-org/kedro/issues/2440
👍 1
i
@Nok Lam Chan Would the idea be for it to work similarly to node tags?
d
sort of, but it has potential use in hooks and viz
i
ohh okay, so not like tagging it to be processed by a plugin, but rather adding the expected schema directly as its attributes
d
yeah it’s quite a low level feature that will enable plug-ins to be a lot more powerful
i
ok. that's interesting but would basically mean we would have to store these attributes as part of the catalog.yml, right? or does OmegaConf allow combining, for example, a catalog.yml with a validations.yml where the top-level keys (dataset names) are the same, but the validations.yml dict just has the validation schema or smth? I'm thinking something like: `catalog.yml`:
```yaml
sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...
```
`validations.yml`:
```yaml
sales_intermediate:
  expected_schema:
    col1: ...
    col2: ...
```
my reasoning is that the validations will probably change more frequently than the catalog itself, so it might be cumbersome to store them as part of the catalog
d
that’s a very good point
```yaml
sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...
  validations : ${../validations.yml}
```
we would need to do some sort of linking like that
no idea if that’s the right syntax
would appreciate a comment on that issue!
👍 1
i
is that a feature of omegaconf? like I've mentioned before, still stuck on an ancient version of kedro so haven't been able to play around with it lol
n
I think it will work but may require some workarounds. I'll try to add that somewhere in our GH issues
i
Hi, bouncing back from the dataset attributes discussion, I had a question based on a comment of @Antony Milne's, but it isn't exactly suited for the issue the discussion is in, so I thought I'd post it here: he mentions that at the moment, when merging keys from different YAML files, an exception is thrown from `self._check_duplicates(seen_file_to_keys)`, which might be inherited from anyconfig and so could be removed to facilitate the behaviour I asked about above. This triggered a further thought: I haven't been following recent developments, but do you distinguish the merging done within environments from the merging done across environments (e.g. when merging local & base)? I ask because at the moment (on kedro 0.17.1) I have some datasets defined in the base catalog, and when I want to save those datasets locally instead of to blob, I overwrite them in the local environment with a different dataset definition. At the moment, since anyconfig replaces the entire dictionary with the one from local, it overwrites correctly, but I was wondering what the expected behaviour would be if you use OmegaConf's subkey merging.
```
conf/
  base/
    catalog/
      datasets.yml
      validations.yml  # subkeys should be merged with those in the same environment
  local/
    catalog/
      datasets.yml  # top-level keys (dataset names) and their dictionary here should overwrite those in base (including those from validations?)
```
n
In short, omegaconf will merge within its environment; the final merge across environments is a dict merge, so it shouldn't change anything.