# random
Afiq Johari:
Hello, does anyone know of any best practices for data ingestion, particularly the data cleaning step for third-party providers? I'm currently integrating data from third-party sources, and it usually needs cleaning before other machine learning products can use it. The challenge is keeping a record of the cleaning steps and being able to audit why and how the data is transformed over time. Are there any generic frameworks or templates I can explore to manage this cleaning process effectively?
j:
hi @Afiq Johari! a couple of ideas:
• Databricks coined something called the "medallion architecture" with bronze, silver, and gold layers https://www.databricks.com/glossary/medallion-architecture I think it's a useful framing (not too different from Kedro data layers)
• the way to maintain a record of the data cleaning process is by coding it, for example with Kedro pipelines 😄 but of course anything will do
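The two ideas above (bronze/silver/gold layers, and making the cleaning auditable by coding it) can be sketched in plain Python. This is a minimal illustration with hypothetical data, field names, and cleaning rules, not a real Kedro pipeline; in Kedro each layer function would become a `node` in a `Pipeline`:

```python
# Minimal medallion-style sketch: each layer is a pure function, and an
# audit log records every transformation applied to the raw data.
# All record fields and cleaning rules below are hypothetical examples.

def bronze_layer(raw_records):
    """Land the third-party data as-is, untouched."""
    return list(raw_records)

def silver_layer(bronze, audit_log):
    """Clean the data; append an entry to audit_log for each rule applied."""
    cleaned = []
    for rec in bronze:
        rec = dict(rec)  # don't mutate the bronze copy
        # Example rule 1: drop records with a negative price
        if rec.get("price") is not None and rec["price"] < 0:
            audit_log.append(f"id={rec['id']}: negative price {rec['price']}, dropped")
            continue
        # Example rule 2: normalize whitespace and casing in names
        if rec.get("name"):
            before = rec["name"]
            rec["name"] = before.strip().title()
            if rec["name"] != before:
                audit_log.append(f"id={rec['id']}: name {before!r} -> {rec['name']!r}")
        cleaned.append(rec)
    return cleaned

def gold_layer(silver):
    """Aggregate the cleaned data for downstream ML consumers."""
    return {
        "n_records": len(silver),
        "avg_price": sum(r["price"] for r in silver) / len(silver),
    }

raw = [
    {"id": 1, "name": " alice ", "price": 10.0},
    {"id": 2, "name": "BOB", "price": -5.0},
    {"id": 3, "name": "Carol", "price": 20.0},
]

audit = []
gold = gold_layer(silver_layer(bronze_layer(raw), audit))
print(gold)   # {'n_records': 2, 'avg_price': 15.0}
print(audit)  # one entry per cleaning action, e.g. the dropped negative price
```

Because the cleaning lives in version-controlled functions rather than ad-hoc scripts, "why and how the data was transformed" is answered by the audit log plus git history; wiring these functions into a Kedro `Pipeline` additionally gets you dataset versioning and lineage for free.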