:wave: Hiya - I’d like to test an experiment :tes...
# user-research
d
👋 Hiya - I’d like to test an experiment 🧪. with you all Building on @Deepyaman Datta’s awesome work with Ibis I’ve been thinking about the lazy execution engine could enable us to understand what operations are going on within Kedro nodes whether you’re using Pandas, Polars, DuckDB, Spark etc. I’ve hacked together a little prototype for retrieving column level lineage within Kedro nodes 🎉 that use Ibis ibis - perhaps in the future we could think of this as a ‘double click’ 👈 in Kedro-Viz that explains how values are actually calculated. If we were to actually ship this it makes sense to integrate with standards like OpenLineage Anyway, here’s my prototype repo - and I have a couple of questions: https://github.com/datajoely/jaffle-shop-lineage 1. Would you use this? 2. How would it help your workflow today? 3. What else would be cool? And here is an interactive network diagram of what it looks like we expand the
rename_payments
node https://datajoely.github.io/jaffle-shop-lineage/
🥳 6
ibis 2
n
Just read the diagram more carefully, when we say "column level lineage" here, does it tells me • Nodes A takes Column a,b,c from Table A and output Table B column e,f,g or • Column a from Table A <-> Table B, column E? noneoftheless, this is super impressive🔥
d
I think the first one
the node in question does this
Copy code
def rename_payments(source: ir.Table) -> ir.Table:
    # `amount` is currently stored in cents, so we convert it to dollars
    amount_in_dollars = (source.amount / 100).name("amount")

    renamed = source[
        source.id.name("payment_id"),
        "order_id",
        "payment_method",
        amount_in_dollars,
    ]
    return renamed
👍🏼 1
so the node inputs are mapped to the Ibis graph ‘database table’ and the outputs map to the ‘selection’
for a much more complex node:
Copy code
def process_orders(
    orders: ir.Table, payments: ir.Table, payment_methods: list[str]
) -> Tuple[ir.Table, ir.Table]:
    total_amount_by_payment_method = {}
    for payment_method in payment_methods:
        total_amount_by_payment_method[f"{payment_method}_amount"] = ibis.coalesce(
            payments.amount.sum(where=payments.payment_method == payment_method), 0
        )

    order_payments = payments.group_by("order_id").aggregate(
        **total_amount_by_payment_method, total_amount=payments.amount.sum()
    )

    final = orders.left_join(order_payments, "order_id")[
        [
            orders.order_id,
            orders.customer_id,
            orders.order_date,
            orders.status,
            *[
                order_payments[f"{payment_method}_amount"]
                for payment_method in payment_methods
            ],
            order_payments.total_amount.name("amount"),
        ]
    ]
    return final, order_payments.sample(0.5)
this is the ibis graph
image.png
d
If I understand, this focuses on lineage in a node. Would I be correct in assuming, if you can figure out the lineage in the node, then connecting it between datasets like in dbt (pic below) is just a matter of throwing each bit in the Kedro DAG?
👍 1
d
Yeah this is more of a proof of concept that I could match the Ibis query plan to a part of the Kedro DAG. Because the networks get quite complicated I thought it was best to build the POC where only we only focus on one node at a time
🙌 1
The point of the POC shows this is actually really easy to retrieve and work with as a property graph and therefore we could potentially come up with some pretty interesting abstractions, simplifications, user experiences in Viz
i
This is super cool
I may be way off-base here but: I remember reading this post recently https://www.linkedin.com/posts/toby-mao_column-level-lineage-is-freely-available-activity-7164763519266045952-7tQe/ SQLMesh (https://github.com/TobikoData/sqlmesh) provides column-level lineage for dbt models, built using SQLGlot, and ibis itself uses SQLGlot too. Are there any potential synergies there? Note: i don't actually know what SQLMesh is
d
Yes! this is exactly what triggered my experiment. SQLMesh is Toby’s interesting alternative to dbt , SQLGlot is the big enabler here
i
To answer your questions 1. absolutely, particularly if we actually manage to have some sort of cross-backend lineage visibility. we're deeply in pandas-only world, so if we can somehow link pandas operations to database tables using something like this that would give us pretty full visibility into most of our data lineage 2. other than the above, it would also help when debugging, for example, if i can quickly track down the source of a derived feature's errors 3. what i mentioned in point 1, cross-backend lineage would take this over the top
🚀 2