Kedro is an open-source Python framework for creating maintainable and modular data science code.


Hey all! I have a random question. How often do you struggle with missing metadata (e.g. missing domain understanding, dataset descriptions and definitions for the columns) in internal and external datasets used in your companies, and how do you usually handle this? Just curious to hear your experiences!

> How often do you struggle with missing metadata
Almost every project

> how do you usually handle this
Three things that typically help:
1. Tag dictionary - think of it as a table where each row describes one column of the raw data, with the raw column name, human-readable name, unit, normal range...
2. Documenting the data schema and, very importantly, version-controlling it in the repo rather than storing it in some PowerPoint. I use `mermaid` markdown for that and maintain an ER diagram.
3. Making `pandera` schemas for key datasets, which let you specify column descriptions.
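
For anyone who wants something concrete for item 1, here is a minimal sketch of a tag dictionary in plain Python. The `Tag` fields follow the columns listed above; the tag names (`TI_101`, `PI_204`), their ranges, and the helper functions `describe` and `out_of_range` are all hypothetical, made up for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One row of the tag dictionary: describes one raw data column."""
    raw_name: str
    readable_name: str
    unit: str
    normal_min: float
    normal_max: float


# Hypothetical entries; in practice these would live in a CSV or YAML file
# that is version-controlled alongside the code.
TAG_DICTIONARY = {
    tag.raw_name: tag
    for tag in [
        Tag("TI_101", "Reactor inlet temperature", "degC", 20.0, 95.0),
        Tag("PI_204", "Pump discharge pressure", "bar", 1.0, 6.5),
    ]
}


def describe(raw_name: str) -> str:
    """Return a human-readable label with unit for a raw column name."""
    tag = TAG_DICTIONARY[raw_name]
    return f"{tag.readable_name} [{tag.unit}]"


def out_of_range(row: dict[str, float]) -> list[str]:
    """Return readable names of known columns whose value is outside the normal range."""
    flagged = []
    for raw_name, value in row.items():
        tag = TAG_DICTIONARY.get(raw_name)
        if tag is not None and not (tag.normal_min <= value <= tag.normal_max):
            flagged.append(tag.readable_name)
    return flagged
```

Keeping the dictionary as data rather than as comments means the same table can drive column renaming, plot labels, and basic sanity checks.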

I'm currently creating Pandera schemas and putting info in the class docstrings. Haven't tried the column descriptions yet, like <@U05JMSKG6MT> mentioned.
Going to look at `mermaid` markdown for ER diagrams, though! That's what I've been wanting to do but forgot about `mermaid`.