Hello Kedro Team I am currently developing a pipeline in Ked Kedro #questions

Sasha Collin

12/20/2023, 4:51 PM

Hello Kedro Team, I am currently developing a pipeline in Kedro that involves loading, querying, and saving datasets using Google BigQuery components, namely `GBQQueryDataset` and `GBQTableDataset`. My pipeline includes two functions:
1. `function_1` takes `Table1` (`GBQQueryDataset`) as input and produces `Table2` (`GBQTableDataset`) as output.
2. `function_2` then processes `Table2`, applying a query to it and outputting `Table4` (`GBQTableDataset`). Here, `Table3` is a `GBQQueryDataset` derived from `Table2` in BigQuery.

I am seeking guidance on how to correctly establish data lineage between `Table2` and `Table3`. Specifically, I want to ensure that:
1. The nodes in the pipeline are executed in the correct sequence.
2. In Kedro Viz, it's clearly visible that `Table3` is generated from `Table2` using a specific query.

Could you provide insights or best practices on how to set this up effectively in Kedro? Thank you for your assistance! Best regards, Sasha.

Dmitry Sorokin

12/20/2023, 10:01 PM

Hi Sasha, that's a good question. As I understand it, Table3 is like a view based on Table2. Have you tried putting it in a node that takes Table2 and Table3 as inputs and returns Table5, which would be equal to Table3? Table3 is an input object for Kedro, but you need a dependency on Table2 so that it is ready first. A cleaner solution would be to move the transformation from the QueryDataset into the node's function, but I think you want to process it on the BigQuery side. I think that's also possible: don't use Table3 as an input; instead, have a Kedro node trigger a BigQuery procedure that returns select * from Table3, with Table2 as the input and Table3 as the output of that node.
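
A minimal sketch of the first option, assuming the catalog entries are named `table1` to `table5` and that `table3` is the `GBQQueryDataset` whose SQL reads from Table2. The helper `expose_table3` and the placeholder bodies of `function_1` and `function_2` are hypothetical, not taken from the original pipeline:

```python
def function_1(table1):
    """Placeholder for the original function_1 (Table1 -> Table2)."""
    return table1


def expose_table3(table2, table3):
    """Return Table3 unchanged (this is "Table5" in the message above).

    table2 is deliberately unused: listing it as an input is what makes
    Kedro run this node only after function_1 has saved Table2.
    """
    return table3


def function_2(table5):
    """Placeholder for the original function_2 (-> Table4)."""
    return table5


def create_pipeline():
    # Imported here so the plain functions above can be tried without Kedro.
    from kedro.pipeline import node, pipeline

    return pipeline(
        [
            node(function_1, inputs="table1", outputs="table2", name="function_1"),
            node(
                expose_table3,
                inputs=["table2", "table3"],
                outputs="table5",
                name="table3_from_table2",
            ),
            node(function_2, inputs="table5", outputs="table4", name="function_2"),
        ]
    )
```

Because `table2` is both an output of `function_1` and an input of `expose_table3`, the ordering follows from the dataset names, and Kedro Viz should draw an edge from Table2 into the node that yields Table5. One trade-off of this sketch: Kedro still loads Table2 for the node even though the function ignores it.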

Dmitry Sorokin

12/21/2023, 5:01 PM

This problem can be solved with https://ibis-project.org/. @Deepyaman Datta will write a blog post about it soon.

Deepyaman Datta

12/21/2023, 5:37 PM

If you want to poke around an example pipeline using Ibis for this, you can see https://github.com/deepyaman/jaffle-shop/. The specific example is with DuckDB, but you can configure it to use BigQuery instead. It's a pretty barebones implementation you can use for inspiration; if you need things like credentials support, I haven't implemented it as part of the dataset yet. Another thing you could look into is bigframes. It provides a pandas API on top of BigQuery (also leveraging Ibis under the hood). It provides `read_gbq` and `to_gbq` methods, but I'm not 100% sure if you can create views.
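
A rough sketch of how the Ibis approach could look as a Kedro node. It assumes `table2` and `table3` are registered in the catalog with an Ibis-backed table dataset similar to the one in the linked repo, so that `table2` arrives as a lazy Ibis table expression and `table3` is written on the BigQuery side when the dataset is saved. The function name `build_table3` and the column names are hypothetical:

```python
def build_table3(table2):
    """Describe Table3 as a query over Table2.

    With an Ibis-backed dataset, table2 is a lazy table expression, so this
    only builds the query; no rows are pulled into Python here.
    """
    # customer_id and amount are hypothetical column names.
    return table2.select("customer_id", "amount")


def create_pipeline():
    # Imported here so build_table3 can be tried without Kedro installed.
    from kedro.pipeline import node, pipeline

    return pipeline(
        [
            node(build_table3, inputs="table2", outputs="table3", name="build_table3"),
        ]
    )
```

With this shape, Table2 is a regular input and Table3 a regular output of one node, so both the execution order and the Table2 to Table3 edge in Kedro Viz come from the pipeline definition rather than from SQL hidden in a catalog entry.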

Deepyaman Datta

01/31/2024, 2:15 PM

Blog post for reference: https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis
