Eugene P
12/22/2022, 2:52 PMpandas.SQLQueryDataSet
. One for each query.
3. I have generic node function to call SQL query, returning empty df like this
def run_sql_script_node(sql_query_dataset: pd.DataFrame,
blank_df_for_nodes_order: pd.DataFrame,):
return pd.DataFrame()
4. I define required nodes controlling the execution order by using consecutive empty_df outputs/inputs
node(
func=run_sql_script_node,
inputs=["create_rropen_cadcost_schema_and_tables_dataset", "empty_cadcost_df0"],
outputs="empty_cadcost_df1",
name="create_rropen_cadcost_schema_and_tables_node",
),
node(
func=run_sql_script_node,
inputs=["create_rropen_cadcost_staging_table_dataset", "empty_cadcost_df1"],
outputs="empty_cadcost_df2",
name="create_rropen_cadcost_staging_table_dataset_node",
),
I do understand that Kedro may be the not-the-100%-appropriate-tool to control SQL workflows, but for the sake of total DS pipeline integrity and my kedro-learning would like to stick to it (it is amazing, btw!).
This workaround works and works correctly, but I was thinking that this approach can be further simplified? May be there is a way to execute sql-queries in particular order without creation of catalog entries for datasets, for example?
Thx in advance for critique and suggestions!Olivier Ho
12/22/2022, 3:24 PMEugene P
12/22/2022, 3:27 PM