Hi everyone, I’m trying to figure out a way to do ...
# questions
Hi everyone, I’m trying to figure out a way to do upsert to a db via kedro. Does kedro sql dataset support upsert functionality? if yes, could you please point me to the documentation for reference? Thanks!
The docs for the SQLTableDataSet are here: https://docs.kedro.org/en/latest/kedro_datasets.pandas.SQLTableDataSet.html When saving the data, it will call pandas.DataFrame.to_sql Possible save options are
*if_exists*{'fail', 'replace', 'append'}, default 'fail'
So out of the box, afaik there is no upsert functionality. Some issues I can think of why that is: 1. How to upsert exactly depends on the database you are using 2. Upsert needs to be aware of what index you are matching against But you can definitely create a subclass of `SQLTableDataSet`and overwrite the
method with your own upsert method! What database do you use? Do you use SQLAlchemy? And, do you already have a method implementing upserting data somehow or not yet?
👍 1
I am using Postgres and SQLAlchemy orm. I do have a method implementing upserting data. Just wanted to know if it is a good practice to perform an upsert in a kedro pipeline.
I, as just a Kedro user, definitely think it is! We use a Spark Dataset and upsert it to Postgres and it's been great for us. I would suggest creating a custom dataset that subclasses
and use your upsert implementation in the
method -- One reason this might not be entirely conform Kedro principles is that the output of your pipeline depends on the state of your database, therefore the data pipeline is not 100% reproducible 🤔
👍 1
Agreed! It’s not a CRUD paradigm
So if you read this section on how we talk about upserts in Delta we talk about how you can force Kedro into doing this pattern but as you’ve both correctly identified it’s not a comfortable paradigm for Kedro since it’s been designed with reproducibility in mind https://docs.kedro.org/en/0.17.6/11_tools_integration/01_pyspark.html#spark-and-delta-lake-interaction
I’d also consider @Iñigo Hidalgo’s
which may be a nicer modern way of doing this https://github.com/inigohidalgo/kedro-ibis-dataset
👍 1
I would have to look into what Ibis' capabilities are wrt to upserts (@Cody Peterson)
As all the functionality in the Ibis dataset is handled by Ibis, the dataset is a shallow wrapper 🙂
🥳 1
because this is something that varies across SQL backends, it's not "natively" supported in Ibis but you can run anything with
(the latter if no records are returned) to achieve whatever behavior you could in SQL
👍 3
❤️ 1