# questions
j
Good morning! We're looking for best practices to handle data quality issues within Kedro. Specifically:
1. We need to implement both manual and automated data curation
2. Ideally want to keep as much as possible within the Kedro pipeline structure
3. The current challenge is how to apply and track incoming data correction requests

Has anyone implemented something similar? Looking for patterns/approaches that worked well.
h
Someone will reply to you shortly. In the meantime, this might help:
j
CC @Elena Khaustova
e
Hi @Jacques Vergine, can you please provide some more details regarding the challenge you mentioned and concrete examples explaining the difficulties?
j
Thank you for answering quickly! I will get clarity on this with the users and get back to you next week with precise information and concrete examples.
👍 1
Thanks for your patience! Let me clarify our specific use case: We're working with a large medical knowledge graph (tens of millions of nodes, hundreds of millions of edges) where we need to handle different types of data quality issues. So far we've identified two:
1. One-off corrections (e.g., incorrect drug-disease relationships, wrong labels)
2. Systematic errors (e.g., data source integration issues, edge accuracy problems)

My first thoughts:
- For systematic errors: handle through GitHub issues and fix at the pipeline level
- For one-off corrections: create a structured format to collect user feedback that can be automatically processed by Kedro pipelines while preserving raw data integrity

Key requirements:
- Enable domain experts to implement corrections without engineering involvement
- Scale well with potentially large numbers of corrections
- Maintain traceability of changes
- Keep as much as possible within the Kedro framework

While this looks like a long-term solution, I'm also interested in short/medium-term solutions that would tick most of the boxes. Does this provide you with enough details?
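For the one-off corrections idea above, a minimal sketch of what the "structured corrections" step could look like as a plain Python function (which a Kedro node would wrap, with the raw edges and the corrections file registered as separate catalog datasets so the raw data is never overwritten). All names here (`apply_corrections`, the `action`/`new_relation` fields) are illustrative assumptions, not Kedro APIs or an agreed schema:

```python
def apply_corrections(edges, corrections):
    """Return a corrected copy of `edges`; the raw input is never mutated.

    edges: list of dicts with keys (source, target, relation)
    corrections: list of dicts with keys (source, target, action, new_relation)
        where action is "remove" (drop the edge) or "relabel" (replace its
        relation with new_relation). Hypothetical schema for illustration.
    """
    # Index correction requests by edge identity for O(1) lookup,
    # so the pass over tens of millions of edges stays linear.
    index = {(c["source"], c["target"]): c for c in corrections}
    corrected = []
    for edge in edges:
        c = index.get((edge["source"], edge["target"]))
        if c is None:
            corrected.append(dict(edge))   # no correction requested
        elif c["action"] == "remove":
            continue                       # drop the incorrect relationship
        elif c["action"] == "relabel":
            fixed = dict(edge)
            fixed["relation"] = c["new_relation"]
            corrected.append(fixed)
    return corrected
```

Because the corrections live in their own versioned dataset (e.g., a CSV that domain experts edit), traceability comes for free from version control, and experts never touch pipeline code.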