Hello Kedro Team, I am currently developing a pipe...
# questions
Hello Kedro Team,

I am currently developing a pipeline in Kedro that involves loading, querying, and saving datasets using Google BigQuery components, namely `GBQQueryDataset` and `GBQTableDataset`. My pipeline includes two functions:

1. `function_1` takes `Table1` (a `GBQQueryDataset`) as input and produces `Table2` (a `GBQTableDataset`) as output.
2. `function_2` then processes `Table2`, applying a query to it and outputting `Table4` (a `GBQTableDataset`). Here, `Table3` is a `GBQQueryDataset` derived from `Table2` in BigQuery.

I am seeking guidance on how to correctly establish data lineage between `Table2` and `Table3`. Specifically, I want to ensure that:

1. The nodes in the pipeline are executed in the correct sequence.
2. In Kedro Viz, it's clearly visible that `Table3` is generated from `Table2` using a specific query.

Could you provide insights or best practices on how to set this up effectively in Kedro? Thank you for your assistance!

Best regards,
Sasha.
Hi Sasha, that's a good question. As I understand it, Table3 is like a view based on Table2. Have you tried a node that takes both Table2 and Table3 as inputs? It looks like that node would then need to return a Table5 that is equal to Table3. For Kedro, Table3 is just an input object, but you need a dependency that makes sure Table2 is ready first. A cleaner solution would be to move the transformation out of the query dataset and into the node's function, but I think you want to process it on the BigQuery side. I think that is also possible: don't use Table3 as an input; instead, have a Kedro node trigger a BigQuery procedure that returns select * from Table3, with Table2 as the input and Table3 as the output of that node.
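To make the ordering issue concrete, here is a rough, standard-library-only sketch of the idea. It is a toy model, not Kedro's implementation: Kedro orders nodes by matching the dataset names in one node's outputs against another node's inputs, and the helper names `direct_dependencies` and `execution_order`, as well as the bridging node `create_table3`, are made up for illustration. How that bridging node would actually materialise the view in BigQuery is left out.

```python
from graphlib import TopologicalSorter


def direct_dependencies(nodes):
    """Map each node to the nodes that produce one of its inputs."""
    producer = {
        dataset: name
        for name, (_, outputs) in nodes.items()
        for dataset in outputs
    }
    return {
        name: {producer[dataset] for dataset in inputs if dataset in producer}
        for name, (inputs, _) in nodes.items()
    }


def execution_order(nodes):
    """Return one valid run order: producers always come before consumers."""
    return list(TopologicalSorter(direct_dependencies(nodes)).static_order())


# Wiring from the question: node name -> (input datasets, output datasets).
# Table3 is a free catalog input, so nothing ties function_2 to function_1.
original = {
    "function_1": (["Table1"], ["Table2"]),
    "function_2": (["Table3"], ["Table4"]),
}

# Suggested wiring: a hypothetical node consumes Table2 and produces Table3.
bridged = {
    **original,
    "create_table3": (["Table2"], ["Table3"]),
}

print(direct_dependencies(original)["function_2"])
print(execution_order(bridged))
```

In the original wiring `function_2` has no upstream node at all, which is why the run order is not guaranteed. Once a node declares Table2 as input and Table3 as output, the chain from `function_1` to `function_2` is fixed, and the same edge is what a lineage view would draw between Table2 and Table3.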
This problem can be solved with https://ibis-project.org/, @Deepyaman Datta will write a blog post about it soon
If you want to poke around an example pipeline using Ibis for this, you can see https://github.com/deepyaman/jaffle-shop/. The specific example is with DuckDB, but you can configure it to use BigQuery instead. It's a pretty barebones implementation you can use for inspiration; if you need things like credentials support, I haven't implemented it as part of the dataset yet.

Another thing you could look into is bigframes. It provides a pandas API on top of BigQuery (also leveraging Ibis under the hood). It provides `read_gbq` and `to_gbq` methods, but I'm not 100% sure if you can create views.