Thanks for your patience! Let me clarify our specific use case:
We're working with a large medical knowledge graph (tens of millions of nodes, hundreds of millions of edges) where we need to handle different types of data quality issues. So far we've identified two:
1. One-off corrections (e.g., incorrect drug-disease relationships, wrong labels)
2. Systematic errors (e.g., data source integration issues, edge accuracy problems)
My first thoughts:
- For systematic errors: Handle through GitHub issues and fix at the pipeline level
- For one-off corrections: Create a structured format to collect user feedback that can be automatically processed by Kedro pipelines while preserving raw data integrity
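To make the one-off-corrections idea concrete, here is a minimal sketch of what the automated processing could look like, assuming both the edge table and the correction log are tabular (pandas) datasets. The column names (`subject`, `predicate`, `object`, `action`, `new_predicate`) are hypothetical, not a settled schema; the point is that the raw data is never mutated, only a patched copy is emitted downstream:

```python
import pandas as pd

def apply_corrections(edges: pd.DataFrame, corrections: pd.DataFrame) -> pd.DataFrame:
    """Apply a correction log to the raw edge table without mutating it.

    The raw edges stay untouched upstream; this function emits a patched
    copy, so every change is traceable back to a row in the correction log.
    Column names here are illustrative placeholders.
    """
    patched = edges.copy()
    key = ["subject", "predicate", "object"]

    # Deletions: drop edges that experts flagged as incorrect.
    deletes = corrections[corrections["action"] == "delete"]
    patched = patched.merge(deletes[key], on=key, how="left", indicator=True)
    patched = patched[patched["_merge"] == "left_only"].drop(columns="_merge")

    # Relabels: replace the predicate with the expert-supplied value.
    relabels = corrections[corrections["action"] == "relabel"]
    patched = patched.merge(relabels[key + ["new_predicate"]], on=key, how="left")
    patched["predicate"] = patched["new_predicate"].fillna(patched["predicate"])
    return patched.drop(columns="new_predicate")
```

In Kedro terms this could be wired up as an ordinary node, e.g. `node(apply_corrections, inputs=["raw_edges", "correction_log"], outputs="patched_edges")`, so the raw dataset and the correction log both live in the catalog and the patched view is reproducible from them.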
Key requirements:
- Enable domain experts to implement corrections without engineering involvement
- Scale well with potentially large numbers of corrections
- Maintain traceability of changes
- Keep as much as possible within the Kedro framework
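For the structured feedback format itself, one option that ticks the traceability box is a flat correction log that domain experts can edit directly (e.g. a version-controlled CSV). All column names and values below are illustrative placeholders, not a fixed schema:

```csv
subject,predicate,object,action,new_predicate,author,timestamp,rationale
drugX,treats,diseaseY,delete,,jdoe,2024-05-01,"Relationship retracted in source publication"
drugZ,causes,diseaseW,relabel,associated_with,asmith,2024-05-02,"Causality not established"
```

Keeping `author`, `timestamp`, and `rationale` alongside each correction means the log doubles as an audit trail, and storing it in git gives change history for free.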
While the above describes a long-term solution, I'm also interested in short/medium-term solutions that would tick most of the boxes.
Does this provide you with enough details?