Sasha Collin
12/20/2023, 4:51 PM
GBQQueryDataset and GBQTableDataset.
My pipeline includes two functions:
1. function_1 takes Table1 (a GBQQueryDataset) as input and produces Table2 (a GBQTableDataset) as output.
2. function_2 then processes Table2, applying a query to it and outputting Table4 (a GBQTableDataset). Here, Table3 is a GBQQueryDataset derived from Table2 in BigQuery.
I am seeking guidance on how to correctly establish data lineage between Table2 and Table3. Specifically, I want to ensure that:
1. The nodes in the pipeline are executed in the correct sequence.
2. In Kedro Viz, it's clearly visible that Table3 is generated from Table2 using a specific query.
Could you provide insights or best practices on how to set this up effectively in Kedro?
Thank you for your assistance!
Best regards,
Sasha.
Dmitry Sorokin
12/20/2023, 10:01 PM
Dmitry Sorokin
12/21/2023, 5:01 PM
Deepyaman Datta
12/21/2023, 5:37 PM
This problem can be solved with https://ibis-project.org/, @Deepyaman Datta will write a blog post about it soon.
If you want to poke around an example pipeline using Ibis for this, you can see https://github.com/deepyaman/jaffle-shop/. The specific example is with DuckDB, but you can configure it to use BigQuery instead. It's a pretty barebones implementation you can use for inspiration; if you need things like credentials support, I haven't implemented it as part of the dataset yet.
Another thing you could look into is bigframes. It provides a pandas API on top of BigQuery (also leveraging Ibis under the hood). It provides read_gbq and to_gbq methods, but I'm not 100% sure if you can create views.
Deepyaman Datta
01/31/2024, 2:15 PM
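Editor's sketch (not from the thread): Kedro orders nodes by matching the dataset names that one node outputs with the names another node takes as input. The snippet below models only that name matching with the standard library, so it does not use Kedro itself. It illustrates why a separate Table3 catalog entry leaves function_2 unlinked from function_1. The dataset names table1 to table4, the build_graph helper, and the workaround of also passing table2 to function_2 are assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def build_graph(nodes):
    """Map each node to the nodes it depends on, linked purely by dataset name.

    `nodes` maps a node name to (inputs, outputs), mirroring how a Kedro
    node declares the catalog entries it reads and writes.
    """
    producers = {out: name for name, (_, outs) in nodes.items() for out in outs}
    return {
        name: {producers[i] for i in ins if i in producers}
        for name, (ins, _) in nodes.items()
    }


# As described in the question: table3 is its own catalog entry (a query over
# table2 in BigQuery), so no dataset name connects function_2 to function_1.
unlinked = {
    "function_1": (["table1"], ["table2"]),
    "function_2": (["table3"], ["table4"]),
}

# Hypothetical workaround: function_2 also declares table2 as an input,
# which creates the missing edge through a shared dataset name.
linked = {
    "function_1": (["table1"], ["table2"]),
    "function_2": (["table2", "table3"], ["table4"]),
}

unlinked_graph = build_graph(unlinked)
linked_graph = build_graph(linked)
run_order = list(TopologicalSorter(linked_graph).static_order())

print(unlinked_graph)  # function_2 has no dependency on function_1
print(run_order)       # function_1 is scheduled before function_2
```

In a real project the extra input would also cause Table2 to be loaded by function_2, which may be undesirable for large tables; that cost is one reason to look at the Ibis-based approach suggested above.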