# questions
i
Hello! I'm looking into ways to add data validation to our pipelines at runtime and came across this really cool example project using Great Expectations by (I assume) @Erwin https://github.com/erwinpaillacan/kedro-great-expectations-example It seems like a good way forward, using the hooks to run the validations if the dataset has some validations mapped to it in config, but I was wondering if anybody has done it a different way, by treating the Great Expectations outputs as kedro datasets themselves. I ask this because we have all our blob connectors implemented as kedro custom datasets, and the easiest way for us to save these validations would be by treating them as outputs from kedro nodes. I'm not interested in the HTML report output, only in the JSON outputs, as we would then want to send alerts based on those.
👀 1
👍 1
I'm wondering if great expectations is too opinionated about the way it treats its outputs to adapt well to the kedro view of nodes-outputs etc.
d
So we had a version of a plug-in that we’ve never open sourced because the GE API kept breaking
i
Yeah we have that closed-source wheel from y'all haha
Trying to move away from it as it limits us to 17.X
d
from reflection there are two ways of doing this:
• online checks that validate at runtime
• offline checks that run on persisted data (kinda like how dbt works)
If I’m honest I really prefer Pandera for the first one
but it’s not 100% there for Spark yet
and we’ve actually got a team at QB who are looking to contribute that missing part to the library
i
We're pure pandas so Spark isn't a dealbreaker for us. Thing is we already have validations built up so would need a rly good reason to move towards Pandera. It does seem to be more lightweight, and potentially configurable through kedro datasets. Do you have any examples within kedro I could look into?
j
notice that pandera 0.14 refactored everything to prepare for Spark and Polars compatibility 🔮
🔥 1
d
well @Iñigo Hidalgo to use pandera you can just decorate your python functions
but if you’re looking to leverage your existing validators then that may be worth pursuing
in truth though, I’ve not got enough GE experience to recommend any specific next steps
i
i think i'll spend some time looking into pandera, since i have seen it brought up a few times. runtime validation is the most important use case for us rn, so if I can get something basic up and running soon it might be worth pursuing
❤️ 3
do you have any examples of somebody who's integrated it into kedro? I'm particularly interested in generating these expectations (schemas in pandera's case) from one pipeline run, for example feature engineering, saving the config as a kedro (yaml?) dataset and then instantiating these schemas to run in a different pipeline, for example when predicting. We're trying to write code which we can reuse across dozens of different pipelines without too much ad hoc config, and pretty much all our code runs through kedro, so good integration using params and datasets would save us a lot of time
d
I don't, unfortunately
i
I guess it's quite a specific requirement. Pandera does seem interesting though. I need to look further into what its checks actually return, since we'll want to store these results either in SQL or blob
n
Worth mentioning we are going to add the ability to use custom attributes on Datasets; this may be useful if you need to build validation plugins. https://github.com/kedro-org/kedro/issues/2440
👍 1
i
@Nok Lam Chan Would the idea be for it to work similarly to node tags?
d
sort of, but it has potential use in hooks and viz
i
ohh okay, so not like tagging it to be processed by a plugin, but rather adding the expected schema directly as its attributes
d
yeah it’s quite a low level feature that will enable plug-ins to be a lot more powerful
i
ok. that's interesting but would basically mean we would have to store these attributes as part of the catalog.yml, right? or does OmegaConf allow combining, for example, a catalog.yml with a validations.yml where the top-level keys (dataset names) are the same, but the validations.yml dict just has the validation schema or smth? I'm thinking something like: `catalog.yml`:
```yaml
sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...
```
`validations.yml`:
```yaml
sales_intermediate:
  expected_schema:
    col1: ...
    col2: ...
```
my reasoning is that the validations will probably change more frequently than the catalog itself, so it might be cumbersome to store them as part of the catalog
d
that’s a very good point
```yaml
sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...
  validations : ${../validations.yml}
```
we would need to do some sort of linking like that
no idea if that’s the right syntax
would appreciate a comment on that issue!
👍 1
i
is that a feature of omegaconf? like I've mentioned before, still stuck on an ancient version of kedro so haven't been able to play around with it lol
n
I think it will work but may require some workarounds. I'll try to add that somewhere in our GH issues
i
Hi, bouncing back from the dataset attributes discussion, I had a question based on a comment of @Antony Milne's, but it isn't exactly suited for the issue the discussion is in, so I thought I'd post it here: he mentions that at the moment, when merging keys from different YAML files, an exception is thrown from `self._check_duplicates(seen_file_to_keys)`, which might be inherited from anyconfig and so could be removed to facilitate the behaviour I asked about above. This triggered a further thought: I haven't been following recent developments, but do you distinguish the merging done within environments from the merging done across environments (e.g. when merging local & base)? I ask because at the moment (on kedro 0.17.1) I have some datasets defined in the base catalog, and when I want to save those datasets locally instead of to blob, I overwrite them in the local environment with a different dataset definition. At the moment, since anyconfig replaces the entire dictionary with the one from local, it overwrites correctly, but I was wondering what the expected behaviour would be if you use OmegaConf's subkey merging.
```
conf/
  base/
    catalog/
      datasets.yml
      validations.yml  # subkeys should be merged with those in the same environment
  local/
    catalog/
      datasets.yml  # top-level keys (dataset names) and their dictionary here should overwrite those in base (including those from validations?)
```
n
In short, omegaconf will merge within its environment; the final merge across environments is a dict merge, so it shouldn't change anything.