Hello, does anyone know of any best practices for data ingestion, particularly concerning the data cleaning process from third-party providers?
I'm currently integrating data from third-party sources, and it's often necessary to clean the data before it can be utilized by other machine learning products.
The challenge lies in maintaining a record of the data cleaning processes and being able to audit why and how the data is transformed over time.
Are there any generic frameworks or templates I can explore to manage this cleaning process effectively?
Juan Luis
04/18/2024, 7:47 AM
hi @Afiq Johari! a couple of ideas:
• Databricks coined something called the "medallion architecture" with bronze, silver, and gold layers https://www.databricks.com/glossary/medallion-architecture I think it's a useful framing (not too different from Kedro `/data` layers)
• the way to maintain a record of the data cleaning process is by coding it. for example with Kedro pipelines 😄 but of course anything will do
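to make the "record the cleaning process as code" idea concrete, here's a minimal framework-free sketch in plain Python (all function and field names here are made up for illustration — Kedro would give you this same structure plus data catalogs and lineage out of the box): each cleaning step is a named function, and the runner logs an audit record per step so you can see why and how the data changed over time.

```python
import json
from datetime import datetime, timezone

# Each cleaning step is a small named function (hypothetical examples).
def drop_nulls(rows):
    """Drop rows with any missing value (bronze -> silver style step)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def normalize_names(rows):
    """Standardize the 'name' field: strip whitespace, lowercase."""
    return [{**r, "name": r["name"].strip().lower()} for r in rows]

def run_pipeline(rows, steps):
    """Apply steps in order, recording an audit entry for each one."""
    audit = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        audit.append({
            "step": step.__name__,          # which transformation ran
            "rows_in": before,              # row count before the step
            "rows_out": len(rows),          # row count after the step
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return rows, audit

raw = [
    {"name": "  Alice ", "age": 30},
    {"name": "Bob", "age": None},
]
clean, audit = run_pipeline(raw, [drop_nulls, normalize_names])
print(json.dumps(audit, indent=2))
```

the audit list is the "record" part: persist it next to the output data (or emit it as structured logs) and you can answer "why does this row look like this?" months later. in Kedro the equivalent would be wiring `drop_nulls` and `normalize_names` up as `node`s in a `Pipeline`, with the intermediate datasets declared in the catalog.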