wave Hiya I d like to test an experiment test tube with you Kedro #user-research

:wave: Hiya - I’d like to test an experiment :tes...

datajoely

02/23/2024, 1:54 PM

👋 Hiya - I’d like to test an experiment 🧪. with you all Building on @Deepyaman Datta’s awesome work with Ibis I’ve been thinking about the lazy execution engine could enable us to understand what operations are going on within Kedro nodes whether you’re using Pandas, Polars, DuckDB, Spark etc. I’ve hacked together a little prototype for retrieving column level lineage within Kedro nodes 🎉 that use Ibis ibis - perhaps in the future we could think of this as a ‘double click’ 👈 in Kedro-Viz that explains how values are actually calculated. If we were to actually ship this it makes sense to integrate with standards like OpenLineage Anyway, here’s my prototype repo - and I have a couple of questions: https://github.com/datajoely/jaffle-shop-lineage 1. Would you use this? 2. How would it help your workflow today? 3. What else would be cool? And here is an interactive network diagram of what it looks like we expand the

rename_payments

node https://datajoely.github.io/jaffle-shop-lineage/

🥳 6

ibis 2

Nok Lam Chan

02/23/2024, 3:16 PM

Just read the diagram more carefully, when we say "column level lineage" here, does it tells me • Nodes A takes Column a,b,c from Table A and output Table B column e,f,g or • Column a from Table A <-> Table B, column E? noneoftheless, this is super impressive🔥

datajoely

02/23/2024, 3:18 PM

I think the first one

datajoely

02/23/2024, 3:18 PM

the node in question does this

datajoely

02/23/2024, 3:18 PM

Copy code

def rename_payments(source: ir.Table) -> ir.Table:
    # `amount` is currently stored in cents, so we convert it to dollars
    amount_in_dollars = (source.amount / 100).name("amount")

    renamed = source[
        source.id.name("payment_id"),
        "order_id",
        "payment_method",
        amount_in_dollars,
    ]
    return renamed

👍🏼 1

datajoely

02/23/2024, 3:19 PM

so the node inputs are mapped to the Ibis graph ‘database table’ and the outputs map to the ‘selection’

datajoely

02/23/2024, 3:21 PM

for a much more complex node:

Copy code

def process_orders(
    orders: ir.Table, payments: ir.Table, payment_methods: list[str]
) -> Tuple[ir.Table, ir.Table]:
    total_amount_by_payment_method = {}
    for payment_method in payment_methods:
        total_amount_by_payment_method[f"{payment_method}_amount"] = ibis.coalesce(
            payments.amount.sum(where=payments.payment_method == payment_method), 0
        )

    order_payments = payments.group_by("order_id").aggregate(
        **total_amount_by_payment_method, total_amount=payments.amount.sum()
    )

    final = orders.left_join(order_payments, "order_id")[
        [
            orders.order_id,
            orders.customer_id,
            orders.order_date,
            orders.status,
            *[
                order_payments[f"{payment_method}_amount"]
                for payment_method in payment_methods
            ],
            order_payments.total_amount.name("amount"),
        ]
    ]
    return final, order_payments.sample(0.5)

datajoely

02/23/2024, 3:21 PM

this is the ibis graph

datajoely

02/23/2024, 3:29 PM

Deepyaman Datta

02/23/2024, 3:43 PM

If I understand, this focuses on lineage in a node. Would I be correct in assuming, if you can figure out the lineage in the node, then connecting it between datasets like in dbt (pic below) is just a matter of throwing each bit in the Kedro DAG?

👍 1

datajoely

02/23/2024, 3:48 PM

Yeah this is more of a proof of concept that I could match the Ibis query plan to a part of the Kedro DAG. Because the networks get quite complicated I thought it was best to build the POC where only we only focus on one node at a time

🙌 1

datajoely

02/23/2024, 3:49 PM

The point of the POC shows this is actually really easy to retrieve and work with as a property graph and therefore we could potentially come up with some pretty interesting abstractions, simplifications, user experiences in Viz

Iñigo Hidalgo

02/23/2024, 5:35 PM

This is super cool

Iñigo Hidalgo

02/23/2024, 5:35 PM

I may be way off-base here but: I remember reading this post recently https://www.linkedin.com/posts/toby-mao_column-level-lineage-is-freely-available-activity-7164763519266045952-7tQe/ SQLMesh (https://github.com/TobikoData/sqlmesh) provides column-level lineage for dbt models, built using SQLGlot, and ibis itself uses SQLGlot too. Are there any potential synergies there? Note: i don't actually know what SQLMesh is

datajoely

02/23/2024, 5:36 PM

Yes! this is exactly what triggered my experiment. SQLMesh is Toby’s interesting alternative to dbt , SQLGlot is the big enabler here

Iñigo Hidalgo

02/23/2024, 6:30 PM

To answer your questions 1. absolutely, particularly if we actually manage to have some sort of cross-backend lineage visibility. we're deeply in pandas-only world, so if we can somehow link pandas operations to database tables using something like this that would give us pretty full visibility into most of our data lineage 2. other than the above, it would also help when debugging, for example, if i can quickly track down the source of a derived feature's errors 3. what i mentioned in point 1, cross-backend lineage would take this over the top

🚀 2

33 Views

Open in Slack

Previous Next