#random

Iñigo Hidalgo

03/18/2024, 12:42 PM
Does your team have a standard for documenting the expected columns in a tabular input to a function? E.g. which columns must be present, with any required types, and which columns are added/removed in the returned data? I am mostly asking about docstring conventions, not things like pandera schemas which go much further. For example:
def to_candlestick(daily_data: pl.DataFrame, frequency="7d"):
    """
    daily_data:
        primary_key:
            - timestamp
            - ticker
        schema:
            timestamp: datetime
            open: float
            high: float
            low: float
            close: float
    Return:
        plotly graph object with a candlestick chart without a Rangeslider
    """

Juan Luis

03/18/2024, 12:50 PM
I'm also interested in this! haven't seen any docstring standard whatsoever

Iñigo Hidalgo

03/18/2024, 12:53 PM
yeah we recently started building sphinx pages for some internal packages and in reviewing docstrings i have seen that we very commonly indicate that "this" or "that" column is required, but it's just mentioned in passing, or as a note under whichever Args: entry. I am brainstorming what it would look like. things like "this subset of columns is unique" (primary key), this column cannot contain nulls, etc
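One way the convention from the first message could grow to cover those cases is sketched below. The `nullable`, `adds`, and `removes` keys, the `add_daily_range` function, and its `range` column are all hypothetical, chosen only for illustration. Nothing is checked at runtime; the docstring is purely for human readers.

```python
from __future__ import annotations

import inspect
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; polars is not needed at runtime
    import polars as pl


def add_daily_range(daily_data: pl.DataFrame) -> pl.DataFrame:
    """Add the high-low range for each row.

    daily_data:
        primary_key:  # this subset of columns is unique
            - timestamp
            - ticker
        schema:
            timestamp: datetime
            ticker: str
            high: float
            low: float
        nullable: []  # no column may contain nulls
    Return:
        adds:
            range: float  # high - low
        removes: []
    """
    # Body omitted on purpose: only the docstring convention is illustrated.


# The cleaned docstring is what help() or a docs build would pick up.
print(inspect.getdoc(add_daily_range))
```

Keeping the block YAML-like means it stays readable as plain text and could later be parsed by tooling if the team decides to go that far.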

Nok Lam Chan

03/18/2024, 1:02 PM
I have seen pandera schemas used to automatically generate this metadata (e.g. which columns are added), IIRC.
Sure, it's not generic, and you have to follow some convention to enable this kind of lineage.
things like "this subset of columns is unique" (primary key), this column cannot contain nulls, etc
This already sounds very much like a Great Expectations/pandera schema kind of thing

Iñigo Hidalgo

03/18/2024, 1:08 PM
This already sounds very much like a Great Expectations/pandera schema kind of thing
totally, but I am only trying to address this from the documentation side, not introducing an additional runtime check or anything like that

Nok Lam Chan

03/18/2024, 1:10 PM
CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255) NOT NULL,
    Age int
);
I think for anything that is complicated enough, nothing can beat code in terms of "complexity"; code is, after all, the simplest thing.
For a lightweight solution, maybe something similar to the SQL world?
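A SQL-flavoured notation could be kept as data and appended to the docstring, so it is written once and shown by `help()`. This is only a sketch: `Column`, `render_ddl`, and `expects` are hypothetical helpers, not part of pandera, Kedro, or any other library, and no data is inspected when the function is called.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Column:
    """One expected column (hypothetical helper, for documentation only)."""

    name: str
    dtype: str
    nullable: bool = True


def render_ddl(table: str, columns: Sequence[Column]) -> str:
    """Render the columns as a SQL-like CREATE TABLE block."""
    width = max(len(c.name) for c in columns)
    rows = [
        f"    {c.name.ljust(width)} {c.dtype}" + ("" if c.nullable else " NOT NULL")
        for c in columns
    ]
    return f"CREATE TABLE {table} (\n" + ",\n".join(rows) + "\n);"


def expects(table: str, columns: Sequence[Column]) -> Callable:
    """Append the rendered schema to the decorated function's docstring."""

    def decorator(func: Callable) -> Callable:
        # Only __doc__ is modified; the function itself is returned unchanged.
        func.__doc__ = (func.__doc__ or "").rstrip() + "\n\n" + render_ddl(table, columns)
        return func

    return decorator


@expects("daily_data", [
    Column("timestamp", "datetime", nullable=False),
    Column("ticker", "varchar", nullable=False),
    Column("close", "float"),
])
def to_candlestick(daily_data, frequency="7d"):
    """Build a candlestick chart from daily prices."""


print(to_candlestick.__doc__)
```

Because the schema lives in a plain list of `Column` objects, the same list could later be translated into a real validation schema without rewriting the documentation.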

datajoely

03/18/2024, 1:46 PM
So my teams are now using Pandera directly on the python function bound to the Kedro node
Also Pandera shipped Polars support this last week
Frankly I’m not a fan of docstrings for metadata like this, the amount of tooling you need to validate them suggests to me you should just use something imperative in the first place

Juan Luis

03/18/2024, 2:11 PM
yeah I agree it's difficult to build tooling around docstrings. but I also think there should be something for humans and not only machines

datajoely

03/18/2024, 2:12 PM
so this is where I want the equivalent of dbt build docs in Kedro Viz, constructed from pandera annotations
even without the schema info a table view of the catalog would be helpful
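As a rough idea of what a table view of the catalog could look like without any schema info, the sketch below renders catalog entries as plain text. It assumes the catalog has already been loaded into a dict of dicts with `type` and `filepath` keys, as a parsed catalog.yml typically would be; `catalog_table` and the two entries are hypothetical and do not use the Kedro API.

```python
from typing import Mapping


def catalog_table(catalog: Mapping[str, Mapping[str, str]]) -> str:
    """Render catalog entries as a plain-text table, one row per dataset."""
    header = ("dataset", "type", "filepath")
    rows = [header] + [
        (name, entry.get("type", ""), entry.get("filepath", ""))
        for name, entry in sorted(catalog.items())
    ]
    # Column widths follow the longest cell in each column.
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]
    lines.insert(1, "  ".join("-" * w for w in widths))
    return "\n".join(lines)


# Hypothetical entries, shaped like a parsed catalog.yml.
catalog = {
    "daily_prices": {"type": "polars.CSVDataset", "filepath": "data/01_raw/daily_prices.csv"},
    "candlesticks": {"type": "plotly.JSONDataset", "filepath": "data/08_reporting/candlesticks.json"},
}
print(catalog_table(catalog))
```

Extra columns such as primary key or expected schema could be added to the same table once that metadata is recorded somewhere machine-readable.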

Swamini Khurana

03/18/2024, 2:19 PM
Supporting the request for a tabular view of the catalog 🙏

Nok Lam Chan

03/18/2024, 2:21 PM
How would a tabular view work? @Swamini Khurana Can you elaborate on what it would look like?

Iñigo Hidalgo

03/18/2024, 3:27 PM
totally agree that pandera would be "the" way to do it if I had full control over our stack. but just like type hinting when it first came out, not everybody is immediately on board. i am trying to address this from the documentation side as it will at least get people in my team thinking about this sort of thing. and if we have good documentation about expected schemas, the step of actually enforcing them using pandera should come more naturally.