# random
y
Hey all! I have a random question. How often do you struggle with missing metadata (e.g. missing domain understanding, dataset descriptions and definitions for the columns) in internal and external datasets used in your companies, and how do you usually handle this? Just curious to hear your experiences!
👀 2
y
> How often do you struggle with missing metadata
Almost every project.
> how do you usually handle this
Three things that typically help:
1. Tag dictionary: think of it as a table where one row covers one column in the raw data, with the raw column name, human-readable name, unit, normal range...
2. Documenting the data schema and, very importantly, version-controlling it with the repo, not storing it in some PowerPoint. I use `mermaid` markdown for that and maintain an ER diagram.
3. Making `pandera` schemas for key datasets; they let you specify column descriptions.
💡 1
🙌 1
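For anyone who wants to see what the tag dictionary idea could look like in code, here is a minimal sketch using only the standard library. The column names (`tmp_01`, `prs_07`), units, and ranges are made up for illustration, and `TagEntry`, `describe`, and `out_of_range` are hypothetical helpers, not part of any library mentioned in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagEntry:
    """One row of the tag dictionary: describes one raw data column."""
    raw_name: str
    readable_name: str
    unit: str
    normal_min: float
    normal_max: float


# Hypothetical entries; in practice this would likely live in a
# version-controlled CSV or YAML file next to the code.
TAG_DICTIONARY = {
    entry.raw_name: entry
    for entry in [
        TagEntry("tmp_01", "Reactor temperature", "degC", 20.0, 90.0),
        TagEntry("prs_07", "Inlet pressure", "bar", 1.0, 5.0),
    ]
}


def describe(raw_name: str) -> str:
    """Return a human-readable description, or flag the column as undocumented."""
    entry = TAG_DICTIONARY.get(raw_name)
    if entry is None:
        return f"{raw_name}: undocumented column"
    return (
        f"{entry.readable_name} [{entry.unit}], "
        f"normal range {entry.normal_min} to {entry.normal_max}"
    )


def out_of_range(row: dict) -> list:
    """List documented columns whose value falls outside the normal range."""
    flagged = []
    for raw_name, value in row.items():
        entry = TAG_DICTIONARY.get(raw_name)
        if entry is not None and not (entry.normal_min <= value <= entry.normal_max):
            flagged.append(raw_name)
    return flagged
```

Calling `describe` on every column of a new dataset is a quick way to spot which ones still lack metadata, and `out_of_range` shows how the "normal range" field can double as a cheap sanity check.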
c
I'm currently creating Pandera schemas and putting info in the class docstrings. Haven't tried the column descriptions yet, like @Yury Fedotov mentioned. Going to look at `mermaid` markdown for ER diagrams, though! That's what I've been wanting to do but forgot about `mermaid`.
👍 1
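On keeping the ER diagram in the repo: one possible approach is to generate the `mermaid` text from a small schema definition so the diagram and the column comments stay in sync. This is only a sketch; the tables, columns, and the `to_mermaid_er` helper are hypothetical, and the output assumes Mermaid's `erDiagram` attribute syntax (type, name, optional key, optional quoted comment).

```python
# Hypothetical schema: table name -> list of (type, column, key, comment).
SCHEMA = {
    "CUSTOMER": [
        ("int", "customer_id", "PK", "Surrogate key"),
        ("string", "name", "", "Full legal name"),
    ],
    "ORDER": [
        ("int", "order_id", "PK", "Surrogate key"),
        ("int", "customer_id", "FK", "Buyer"),
    ],
}

# Hypothetical relationships: (left table, cardinality, right table, label).
RELATIONSHIPS = [("CUSTOMER", "||--o{", "ORDER", "places")]


def to_mermaid_er(schema: dict, relationships: list) -> str:
    """Render the schema as Mermaid erDiagram text."""
    lines = ["erDiagram"]
    for left, cardinality, right, label in relationships:
        lines.append(f"    {left} {cardinality} {right} : {label}")
    for table, columns in schema.items():
        lines.append(f"    {table} {{")
        for col_type, name, key, comment in columns:
            parts = [col_type, name]
            if key:
                parts.append(key)
            if comment:
                # Mermaid expects attribute comments in double quotes.
                parts.append(f'"{comment}"')
            lines.append("        " + " ".join(parts))
        lines.append("    }")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid_er(SCHEMA, RELATIONSHIPS))
```

The printed text can be pasted into a `mermaid` code fence in a markdown file, so changes to the schema show up as ordinary diffs in code review.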