Hi everyone, I’m trying to figure out a way to do ...
# questions
p
Hi everyone, I’m trying to figure out a way to do upsert to a db via kedro. Does kedro sql dataset support upsert functionality? if yes, could you please point me to the documentation for reference? Thanks!
l
The docs for the SQLTableDataSet are here: https://docs.kedro.org/en/latest/kedro_datasets.pandas.SQLTableDataSet.html When saving the data, it will call pandas.DataFrame.to_sql Possible save options are
*if_exists*{'fail', 'replace', 'append'}, default 'fail'
So out of the box, afaik there is no upsert functionality. Some issues I can think of why that is: 1. How to upsert exactly depends on the database you are using 2. Upsert needs to be aware of what index you are matching against But you can definitely create a subclass of `SQLTableDataSet`and overwrite the
_save()
method with your own upsert method! What database do you use? Do you use SQLAlchemy? And, do you already have a method implementing upserting data somehow or not yet?
👍 1
p
I am using Postgres and SQLAlchemy orm. I do have a method implementing upserting data. Just wanted to know if it is a good practice to perform an upsert in a kedro pipeline.
l
I, as just a Kedro user, definitely think it is! We use a Spark Dataset and upsert it to Postgres and it's been great for us. I would suggest creating a custom dataset that subclasses
pandas.SQLTableDataSet
and use your upsert implementation in the
_save()
method -- One reason this might not be entirely conform Kedro principles is that the output of your pipeline depends on the state of your database, therefore the data pipeline is not 100% reproducible 🤔
👍 1
p
Agreed! It’s not a CRUD paradigm
1
d
So if you read this section on how we talk about upserts in Delta we talk about how you can force Kedro into doing this pattern but as you’ve both correctly identified it’s not a comfortable paradigm for Kedro since it’s been designed with reproducibility in mind https://docs.kedro.org/en/0.17.6/11_tools_integration/01_pyspark.html#spark-and-delta-lake-interaction
I’d also consider @Iñigo Hidalgo’s
IbisDataSet
which may be a nicer modern way of doing this https://github.com/inigohidalgo/kedro-ibis-dataset
👍 1
i
I would have to look into what Ibis' capabilities are wrt to upserts (@Cody Peterson)
As all the functionality in the Ibis dataset is handled by Ibis, the dataset is a shallow wrapper 🙂
🥳 1
c
because this is something that varies across SQL backends, it's not "natively" supported in Ibis but you can run anything with
.sql
or
.raw_sql
(the latter if no records are returned) to achieve whatever behavior you could in SQL
👍 3
❤️ 1