Hello, does anyone know of any best practices for data ingestion, particularly concerning the data cleaning process from third-party providers?
I'm currently integrating data from third-party sources, and it's often necessary to clean the data before it can be utilized by other machine learning products.
The challenge lies in maintaining a record of the data cleaning processes and being able to audit why and how the data is transformed over time.
Are there any generic frameworks or templates I can explore to manage this cleaning process effectively?
Juan Luis
04/18/2024, 7:47 AM
hi @Afiq Johari! a couple of ideas:
• Databricks coined something called the "medallion architecture" with bronze, silver, and gold layers https://www.databricks.com/glossary/medallion-architecture I think it's a useful framing (not too different from Kedro `/data` layers)
• the way to maintain a record of the data cleaning process is by coding it. for example with Kedro pipelines 😄 but of course anything will do
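to make the "record the cleaning process as code" idea concrete, here's a minimal framework-free sketch in plain Python (all function and field names here are made up for illustration — Kedro would give you this same structure plus data catalogs and lineage out of the box): each cleaning step is a named function, and the runner logs an audit record per step so you can see why and how the data changed over time.

```python
import json
from datetime import datetime, timezone

# Each cleaning step is a small named function (hypothetical examples).
def drop_nulls(rows):
    """Drop rows with any missing value (bronze -> silver style step)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def normalize_names(rows):
    """Standardize the 'name' field: strip whitespace, lowercase."""
    return [{**r, "name": r["name"].strip().lower()} for r in rows]

def run_pipeline(rows, steps):
    """Apply steps in order, recording an audit entry for each one."""
    audit = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        audit.append({
            "step": step.__name__,          # which transformation ran
            "rows_in": before,              # row count before the step
            "rows_out": len(rows),          # row count after the step
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return rows, audit

raw = [
    {"name": "  Alice ", "age": 30},
    {"name": "Bob", "age": None},
]
clean, audit = run_pipeline(raw, [drop_nulls, normalize_names])
print(json.dumps(audit, indent=2))
```

the audit list is the "record" part: persist it next to the output data (or emit it as structured logs) and you can answer "why does this row look like this?" months later. in Kedro the equivalent would be wiring `drop_nulls` and `normalize_names` up as `node`s in a `Pipeline`, with the intermediate datasets declared in the catalog.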