# questions
j
Good morning! We're looking for best practices to handle data quality issues within Kedro. Specifically:
1. We need to implement both manual and automated data curation
2. Ideally want to keep as much as possible within the Kedro pipeline structure
3. The current challenge is how to apply and track incoming data correction requests

Has anyone implemented something similar? Looking for patterns/approaches that worked well.
h
Someone will reply to you shortly. In the meantime, this might help:
j
CC @Elena Khaustova
e
Hi @Jacques Vergine, can you please provide some more details regarding the challenge you mentioned and concrete examples explaining the difficulties?
j
Thank you for answering quickly! I will get clarity on this with the users and get back to you next week with precise information and concrete examples.
👍 1
Thanks for your patience! Let me clarify our specific use case: We're working with a large medical knowledge graph (tens of millions of nodes, hundreds of millions of edges) where we need to handle different types of data quality issues. So far we've identified two:
1. One-off corrections (e.g., incorrect drug-disease relationships, wrong labels)
2. Systematic errors (e.g., data source integration issues, edge accuracy problems)

My first thoughts:
- For systematic errors: handle through GitHub issues and fix at the pipeline level
- For one-off corrections: create a structured format to collect user feedback that can be automatically processed by Kedro pipelines while preserving raw data integrity

Key requirements:
- Enable domain experts to implement corrections without engineering involvement
- Scale well with potentially large numbers of corrections
- Maintain traceability of changes
- Keep as much as possible within the Kedro framework

While this looks like a long-term solution, I'm also interested in short/medium-term solutions that would tick most of the boxes. Does this provide you with enough details?
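For the one-off corrections idea above, a minimal sketch of what the "structured corrections" step could look like as a plain Python function (which a Kedro node would wrap, with the raw edges and the corrections file registered as separate catalog datasets so the raw data is never overwritten). All names here (`apply_corrections`, the `action`/`new_relation` fields) are illustrative assumptions, not Kedro APIs or an agreed schema:

```python
def apply_corrections(edges, corrections):
    """Return a corrected copy of `edges`; the raw input is never mutated.

    edges: list of dicts with keys (source, target, relation)
    corrections: list of dicts with keys (source, target, action, new_relation)
        where action is "remove" (drop the edge) or "relabel" (replace its
        relation with new_relation). Hypothetical schema for illustration.
    """
    # Index correction requests by edge identity for O(1) lookup,
    # so the pass over tens of millions of edges stays linear.
    index = {(c["source"], c["target"]): c for c in corrections}
    corrected = []
    for edge in edges:
        c = index.get((edge["source"], edge["target"]))
        if c is None:
            corrected.append(dict(edge))   # no correction requested
        elif c["action"] == "remove":
            continue                       # drop the incorrect relationship
        elif c["action"] == "relabel":
            fixed = dict(edge)
            fixed["relation"] = c["new_relation"]
            corrected.append(fixed)
    return corrected
```

Because the corrections live in their own versioned dataset (e.g., a CSV that domain experts edit), traceability comes for free from version control, and experts never touch pipeline code.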