Hi everyone Wanted to check with you as more experienced ked Kedro #questions

Hi everyone! Wanted to check with you as more expe...

Eugene P

12/22/2022, 2:52 PM

Hi everyone! Wanted to check with you as more experienced kedroids if I’m doing something stupid. Several preliminary steps of my workflow requires running some heavy-lifting sql queries with Postgres/PostGIS (must be executed in particular order). At the moment I’m doing it the following way: 1. I have separate folder with SQL queries 2. I use catalog to declare

pandas.SQLQueryDataSet

. One for each query. 3. I have generic node function to call SQL query, returning empty df like this

Copy code

def run_sql_script_node(sql_query_dataset: pd.DataFrame,
                        blank_df_for_nodes_order: pd.DataFrame,):
    return pd.DataFrame()

4. I define required nodes controlling the execution order by using consecutive empty_df outputs/inputs

Copy code

node(
   func=run_sql_script_node,
   inputs=["create_rropen_cadcost_schema_and_tables_dataset", "empty_cadcost_df0"],
   outputs="empty_cadcost_df1",
   name="create_rropen_cadcost_schema_and_tables_node",
        ),
node(
   func=run_sql_script_node,
   inputs=["create_rropen_cadcost_staging_table_dataset", "empty_cadcost_df1"],
   outputs="empty_cadcost_df2",
   name="create_rropen_cadcost_staging_table_dataset_node",
        ),

I do understand that Kedro may be the not-the-100%-appropriate-tool to control SQL workflows, but for the sake of total DS pipeline integrity and my kedro-learning would like to stick to it (it is amazing, btw!). This workaround works and works correctly, but I was thinking that this approach can be further simplified? May be there is a way to execute sql-queries in particular order without creation of catalog entries for datasets, for example? Thx in advance for critique and suggestions!

Olivier Ho

12/22/2022, 3:24 PM

I had similar issues to you but i'm not from the kedro team so take my answer with a grain of salt. If the sql queries do not change. Would it not be better to create a custom dataset where the _load function returns an iterator as it was defined in the SQLQueryDataSet

👍 1

Olivier Ho

12/22/2022, 3:24 PM

Eugene P

12/22/2022, 3:27 PM

Sounds promising, thanks! I’ll give a try

11 Views

Open in Slack

Previous Next