In a non-production environment, during development, data engineers can validate the datasets between nodes to check that each function executed correctly. However, I wonder if you have some guidance on how to do this in a production environment, especially when CI/CD is involved and the pipeline is not run locally but in, say, Airflow.
I am aware that there are data quality tools like Great Expectations that can automate this within a CI/CD pipeline.
Some questions I have:
• When an automated data quality tool fails the (Airflow) pipeline that is running in a CI/CD pipeline, what is the recommended way of fixing the data and re-running the pipeline? Is it recommended to re-run the whole pipeline, or can we run only a subset of it? Or is it so hard to find out where the data quality issue resides in the overall (master) pipeline that it is better to re-run the pipeline as a whole?
• I understand that it is better to automate data quality testing. However, is there also something like manual data quality testing, especially in the context of a production system? Can we express something like a manual validation step within Kedro, where the pipeline waits until a user presses a button to continue or, in case of a data error, (partially) re-runs the pipeline with newly uploaded corrected data?
Nok Lam Chan
07/09/2023, 7:45 AM
Both Airflow and Kedro operate on a DAG. The Kedro CLI has different options for running a pipeline: as long as you know which dataset failed, you only need to re-run from that point, and Kedro can figure out the dependencies for you.
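To illustrate the idea of "re-run from that point", here is a minimal sketch in plain Python of how the set of affected nodes can be derived from a DAG once you know the failed dataset. This is not Kedro's actual implementation; the node and dataset names (`clean`, `feature_table`, etc.) and the helper `nodes_to_rerun` are made up for illustration. In Kedro itself, the equivalent is done for you by CLI options such as `kedro run --from-inputs=<dataset>` or `kedro run --from-nodes=<node>`.

```python
from collections import deque

# Hypothetical pipeline: node name -> (input datasets, output datasets).
NODES = {
    "clean": ({"raw"}, {"cleaned"}),
    "features": ({"cleaned"}, {"feature_table"}),
    "train": ({"feature_table"}, {"model"}),
    "report": ({"cleaned"}, {"summary"}),
}


def nodes_to_rerun(nodes, failed_dataset):
    """Return the nodes that consume failed_dataset, directly or transitively."""
    affected = set()
    seen = {failed_dataset}
    queue = deque([failed_dataset])
    while queue:
        dataset = queue.popleft()
        for name, (inputs, outputs) in nodes.items():
            if dataset in inputs and name not in affected:
                affected.add(name)
                # Everything this node produces is now stale as well.
                for out in outputs - seen:
                    seen.add(out)
                    queue.append(out)
    return affected


# After correcting "feature_table", only its downstream nodes need to run.
rerun = nodes_to_rerun(NODES, "feature_table")
print(sorted(rerun))
```

The node that produced the bad dataset is deliberately not included: the sketch assumes the dataset itself has been corrected (for example, re-uploaded), so only its consumers need to run again. If the producing node is at fault, you would start from that node instead.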