Does your team have a standard for documenting the expected Kedro #random

Iñigo Hidalgo

03/18/2024, 12:42 PM

Does your team have a standard for documenting the expected columns in a tabular input to a function? E.g. which columns must be present, with any required types, and which columns are added/removed in the returned data? I am mostly asking about docstring conventions, not things like pandera schemas which go much further. For example:

def to_candlestick(daily_data: pl.DataFrame, frequency="7d"):
    """
    daily_data:
        primary_key:
            - timestamp
            - ticker
        schema:
            timestamp: datetime
            open: float
            high: float
            low: float
            close: float
    Return:
        plotly graph object with a candlestick chart without a Rangeslider
    """

Juan Luis

03/18/2024, 12:50 PM

I'm also interested in this! haven't seen any docstring standard whatsoever

Iñigo Hidalgo

03/18/2024, 12:53 PM

yeah we recently started building sphinx pages for some internal packages and in reviewing docstrings i have seen that we very commonly indicate that "this" or "that" column is required, but it's just mentioned in passing, or as a note under whichever Args: entry. I am brainstorming what it would look like. things like "this subset of columns is unique" (primary key), this column cannot contain nulls, etc
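A minimal sketch of what such a convention could look like, assuming a made-up "Columns (...)" / "Adds" docstring layout and a hypothetical helper `parse_columns`. Neither is an existing standard or library feature; the function `to_weekly` and its columns are illustrative only.

```python
import inspect


def to_weekly(daily_data):
    """Aggregate daily prices to weekly bars (body omitted in this sketch).

    Args:
        daily_data: tabular price data.

    Columns (daily_data):
        timestamp: datetime, key, not null
        ticker: str, key, not null
        close: float

    Adds:
        week: date
    """


def parse_columns(func, section):
    """Return {column: [tokens]} for one docstring section (hypothetical layout)."""
    doc = inspect.getdoc(func) or ""  # getdoc strips the common indentation
    columns, in_section = {}, False
    for line in doc.splitlines():
        if line.strip() == section + ":":
            in_section = True
        elif in_section and line.startswith("    ") and ":" in line:
            name, spec = line.strip().split(":", 1)
            columns[name] = [token.strip() for token in spec.split(",")]
        elif in_section:
            break  # a blank or dedented line ends the section
    return columns


required = parse_columns(to_weekly, "Columns (daily_data)")
primary_key = [name for name, tokens in required.items() if "key" in tokens]
print(primary_key)
```

Keeping the section machine-readable is optional; the same layout still reads fine for humans in rendered Sphinx pages even if nothing ever parses it.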

Nok Lam Chan

03/18/2024, 1:02 PM

I have seen a pandera schema used to automatically generate this metadata (e.g. which columns are added), IIRC

Nok Lam Chan

03/18/2024, 1:02 PM

sure it's not generic and you have to follow some convention to enable this kind of lineage

Nok Lam Chan

03/18/2024, 1:03 PM

things like "this subset of columns is unique" (primary key), this column cannot contain nulls, etc

This already sounds very much like a Great Expectations/pandera schema kind of thing

Iñigo Hidalgo

03/18/2024, 1:08 PM

This already sounds very much like a Great Expectations/pandera schema kind of thing

totally, but I am only trying to address this from the documentation side, not introducing an additional runtime check or anything like that

Nok Lam Chan

03/18/2024, 1:10 PM

CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255) NOT NULL,
    Age int
);

I think for anything that is complicated enough, there is nothing that can beat code in terms of "complexity"; code is, after all, the simplest thing

Nok Lam Chan

03/18/2024, 1:10 PM

for a lightweight solution, maybe something similar to the SQL world?
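As a rough sketch of the SQL-style idea, the snippet below renders a column spec as DDL-like text that could be pasted into a docstring. The helper `ddl_doc` and the spec format (dtype plus a nullable flag) are assumptions made for illustration, not part of any library.

```python
def ddl_doc(table, columns, primary_key=()):
    """Render {name: (dtype, nullable)} as CREATE TABLE-style documentation text."""
    lines = []
    for name, (dtype, nullable) in columns.items():
        suffix = "" if nullable else " NOT NULL"
        lines.append(f"    {name} {dtype}{suffix},")
    if primary_key:
        lines.append(f"    PRIMARY KEY ({', '.join(primary_key)}),")
    if lines:
        lines[-1] = lines[-1].rstrip(",")  # no trailing comma on the last entry
    return f"CREATE TABLE {table} (\n" + "\n".join(lines) + "\n);"


# Hypothetical spec loosely based on the candlestick example in this thread.
spec = {
    "timestamp": ("datetime", False),
    "ticker": ("varchar", False),
    "open": ("float", True),
}
print(ddl_doc("daily_data", spec, primary_key=("timestamp", "ticker")))
```

The output is plain text only; nothing is validated at runtime, which matches the documentation-only goal discussed above.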

datajoely

03/18/2024, 1:46 PM

So my teams are now using Pandera directly on the python function bound to the Kedro node

datajoely

03/18/2024, 1:46 PM

Also Pandera shipped Polars support this last week

datajoely

03/18/2024, 1:47 PM

Frankly I’m not a fan of docstrings for metadata like this, the amount of tooling you need to validate them suggests to me you should just use something imperative in the first place

Juan Luis

03/18/2024, 2:11 PM

yeah I agree it's difficult to build tooling around docstrings. but I also think there should be something for humans and not only machines

datajoely

03/18/2024, 2:12 PM

so this is where I want the equivalent of

dbt build docs

in Kedro Viz constructed from pandera annotations

datajoely

03/18/2024, 2:12 PM

even without the schema info a table view of the catalog would be helpful

💯 1

Swamini Khurana

03/18/2024, 2:19 PM

Supporting the request for a tabular view of the catalog 🙏

Nok Lam Chan

03/18/2024, 2:21 PM

How would a tabular view work? @Swamini Khurana Can you elaborate on what it would look like?
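One way to picture a tabular view, as a rough sketch rather than anything Kedro Viz provides: render catalog entries as a plain-text table. The `catalog` dict below only mimics the shape of a catalog.yml mapping; the dataset names and file paths are made up, and `catalog_table` is a hypothetical helper.

```python
# Hypothetical entries shaped like a catalog.yml mapping (names and paths invented).
catalog = {
    "daily_prices": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/daily_prices.csv",
    },
    "weekly_candles": {
        "type": "pandas.ParquetDataset",
        "filepath": "data/03_primary/weekly_candles.parquet",
    },
}


def catalog_table(entries, fields=("type", "filepath")):
    """Return a left-aligned text table with one row per dataset."""
    header = ("dataset", *fields)
    rows = [
        (name, *(str(config.get(field, "")) for field in fields))
        for name, config in entries.items()
    ]
    # Column width is the longest cell in that column, header included.
    widths = [max(len(row[i]) for row in [header, *rows]) for i in range(len(header))]

    def fmt(row):
        return "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()

    rule = tuple("-" * width for width in widths)
    return "\n".join([fmt(header), fmt(rule), *(fmt(row) for row in rows)])


print(catalog_table(catalog))
```

Schema details (columns, keys, nullability) could become further columns in such a table if they were recorded somewhere parseable.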

Iñigo Hidalgo

03/18/2024, 3:27 PM

totally agree that pandera would be "the" way to do it if I had full control over our stack. but just like type hinting when it first came out, not everybody is immediately onboard. i am trying to address this from the documentation side as it will at least get people in my team thinking about this sort of thing. and if we have good documentation about expected schemas, the step of actually enforcing them using pandera should come more naturally.
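To illustrate how documented schemas might ease a later move to enforcement, here is a small stdlib-only sketch. The decorator `expects_columns` is hypothetical: it appends the required columns to the docstring and only warns when one is missing. It assumes the input exposes a `.columns` attribute, as pandas and polars DataFrames do; a `SimpleNamespace` stands in for a real frame here.

```python
import functools
import inspect
import warnings
from types import SimpleNamespace


def expects_columns(**columns):
    """Hypothetical decorator: document required columns, warn if any are absent."""

    def decorate(func):
        listing = "\n".join(f"    {name}: {dtype}" for name, dtype in columns.items())

        @functools.wraps(func)
        def wrapper(data, *args, **kwargs):
            missing = sorted(set(columns) - set(data.columns))
            if missing:
                warnings.warn(f"{func.__name__}: missing columns {missing}")
            return func(data, *args, **kwargs)

        # Dedent the original docstring first so the appended section lines up.
        base = inspect.cleandoc(func.__doc__ or "")
        wrapper.__doc__ = f"{base}\n\nRequired columns:\n{listing}"
        return wrapper

    return decorate


@expects_columns(timestamp="datetime", close="float")
def last_close(data):
    """Return the most recent closing price."""
    return data.close[-1]


frame = SimpleNamespace(columns=["timestamp", "close"], close=[10.0, 10.5])
print(last_close(frame))
print(last_close.__doc__)
```

Because the same declaration feeds both the docs and the warning, swapping the warning for a real schema library later would not require rewriting the documentation.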

Iñigo Hidalgo

03/18/2024, 3:28 PM

table view of the catalog would be helpful
