Hello,
Is it possible to use a SQL MERGE statement in a “kedro way”? Here is my issue:
I have a pipeline that starts by pulling historical data from an API. It then does some cleaning and other processing, and uploads the result to our internal database. For the upload, I use pandas.SQLTableDataset.
But the data from the API, being historical, doesn’t change much – if I run the pipeline again after a few days, there will be some new rows and maybe a few corrections to old values.
So ideally, I want to pull only about 1 month of data from the API and then use a SQL MERGE to merge that data into the existing database table.
I don’t think there is a Kedro dataset that can do this for me. So my initial thought is to create a custom dataset that does two steps (rough sketch below):
1. uses pandas.SQLTableDataset to save the data to a temporary db table (say tbl_data_temp)
2. uses SQLAlchemy to execute a MERGE of tbl_data_temp into tbl_data_main
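For concreteness, here is a rough sketch of what I have in mind. Assumptions: a recent Kedro where the base class is kedro.io.AbstractDataset (older versions call it AbstractDataSet), SQL Server-style MERGE syntax (Postgres would need INSERT ... ON CONFLICT instead), and hypothetical table/column names (tbl_data_main, tbl_data_temp, key_columns, value_columns):

```python
from typing import Any

import pandas as pd
from kedro.io import AbstractDataset  # AbstractDataSet on older Kedro versions
from kedro_datasets.pandas import SQLTableDataset
from sqlalchemy import create_engine, text


class SQLMergeTableDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    """Saves a DataFrame by staging it to a temp table, then merging into the main table."""

    def __init__(
        self,
        table_name: str,
        temp_table_name: str,
        key_columns: list[str],
        value_columns: list[str],
        credentials: dict[str, Any],
    ) -> None:
        self._table_name = table_name
        self._temp_table_name = temp_table_name
        self._key_columns = key_columns
        self._value_columns = value_columns
        self._con = credentials["con"]
        # Step 1 is delegated to the stock dataset: overwrite the temp table each run
        self._staging = SQLTableDataset(
            table_name=temp_table_name,
            credentials=credentials,
            save_args={"if_exists": "replace", "index": False},
        )

    def _save(self, data: pd.DataFrame) -> None:
        # Step 1: upload the ~1 month slice into the temp table
        self._staging.save(data)

        # Step 2: merge the temp table into the main table
        on_clause = " AND ".join(f"main.{c} = temp.{c}" for c in self._key_columns)
        set_clause = ", ".join(f"main.{c} = temp.{c}" for c in self._value_columns)
        cols = self._key_columns + self._value_columns
        insert_cols = ", ".join(cols)
        insert_vals = ", ".join(f"temp.{c}" for c in cols)
        merge_sql = f"""
            MERGE {self._table_name} AS main
            USING {self._temp_table_name} AS temp
            ON {on_clause}
            WHEN MATCHED THEN
                UPDATE SET {set_clause}
            WHEN NOT MATCHED THEN
                INSERT ({insert_cols}) VALUES ({insert_vals});
        """
        # Run inside a transaction so the merge commits or rolls back as a unit
        with create_engine(self._con).begin() as conn:
            conn.execute(text(merge_sql))

    def _load(self) -> pd.DataFrame:
        return pd.read_sql_table(self._table_name, con=self._con)

    def _describe(self) -> dict[str, Any]:
        return {"table_name": self._table_name, "temp_table_name": self._temp_table_name}
```

In the catalog, the output of my last node would then point at this dataset instead of the plain pandas.SQLTableDataset, so the node itself stays pure and the merge lives entirely on the I/O side.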
Does this seem like the best approach? I generally dislike doing any kind of processing inside custom datasets because, for transparency, I like all the processing to happen in the pipeline’s nodes, but I can’t think of a better solution here.
Thank you
Explanation of SQL MERGE:
https://www.sqlshack.com/understanding-the-sql-merge-statement/