Sasha Collin
12/20/2023, 4:51 PM
GBQQueryDataset and GBQTableDataset.
My pipeline includes two functions:
1. function_1 takes Table1 (a GBQQueryDataset) as input and produces Table2 (a GBQTableDataset) as output.
2. function_2 then processes Table2, applying a query to it and outputting Table4 (a GBQTableDataset). Here, Table3 is a GBQQueryDataset derived from Table2 in BigQuery.
I am seeking guidance on how to correctly establish data lineage between Table2 and Table3. Specifically, I want to ensure that:
1. The nodes in the pipeline are executed in the correct sequence.
2. In Kedro Viz, it's clearly visible that Table3 is generated from Table2 using a specific query.
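(Editorial sketch, not part of the original message.) Both requirements come down to the same thing: Kedro orders nodes, and Kedro Viz draws edges, from the dataset names that nodes declare as inputs and outputs. The toy model below uses only the Python standard library to illustrate that idea. It is not Kedro's implementation, and it assumes function_2 reads Table3 from the catalog. The node name build_table3 is hypothetical, standing in for any node that declares Table2 as input and Table3 as output.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-ins for pipeline nodes: (name, input datasets, output datasets).
# This models dependency resolution by dataset name only; it is not Kedro code.
NODES_WITHOUT_LINK = [
    ("function_1", ["Table1"], ["Table2"]),
    # Assumption: function_2 reads Table3, which is defined only as a query dataset.
    ("function_2", ["Table3"], ["Table4"]),
]

NODES_WITH_LINK = [
    ("function_1", ["Table1"], ["Table2"]),
    # Hypothetical extra node declaring that Table3 is built from Table2.
    ("build_table3", ["Table2"], ["Table3"]),
    ("function_2", ["Table3"], ["Table4"]),
]


def node_dependencies(nodes):
    """Map each node name to the set of nodes that produce its inputs."""
    producer = {out: name for name, _, outputs in nodes for out in outputs}
    return {
        name: {producer[i] for i in inputs if i in producer}
        for name, inputs, _ in nodes
    }


def execution_order(nodes):
    """Return one valid execution order for the given nodes."""
    return list(TopologicalSorter(node_dependencies(nodes)).static_order())


if __name__ == "__main__":
    print(node_dependencies(NODES_WITHOUT_LINK))
    print(execution_order(NODES_WITH_LINK))
```

In the first node list nothing produces Table3, so function_2 has no recorded dependency on function_1 and the two nodes could run in either order. Once some node declares Table2 as input and Table3 as output, the order is fixed and the Table2 to Table3 edge exists in the graph.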
Could you provide insights or best practices on how to set this up effectively in Kedro?
Thank you for your assistance!
Best regards,
Sasha.

Dmitry Sorokin
12/20/2023, 10:01 PM

Dmitry Sorokin
12/21/2023, 5:01 PM
This problem can be solved with https://ibis-project.org/; @Deepyaman Datta will write a blog post about it soon.

Deepyaman Datta
12/21/2023, 5:37 PM
If you want to poke around an example pipeline using Ibis for this, you can see https://github.com/deepyaman/jaffle-shop/. The specific example is with DuckDB, but you can configure it to use BigQuery instead. It's a pretty barebones implementation you can use for inspiration; if you need things like credentials support, I haven't implemented it as part of the dataset yet.
Another thing you could look into is bigframes. It provides a pandas API on top of BigQuery (also leveraging Ibis under the hood). It provides read_gbq and to_gbq methods, but I'm not 100% sure if you can create views.

Deepyaman Datta
01/31/2024, 2:15 PM