# questions
a
Hi team, is there any way (without reinventing the wheel) to make atomic outputs using Kedro? Let me explain the use case: the last node of one pipeline has 4 outputs, which are 4 tables to write to a Postgres database. We need to roll back the inserts in case one of them fails. For instance, if the insert for output table number 3 fails, we would need to delete the data inserted for output tables 1 and 2. Thanks in advance.
n
I haven't used Kedro for this kind of ETL flow before. Are you mainly using the SQL dataset with pandas? https://stackoverflow.com/questions/63440779/how-to-rollback-dataframe-to-sql-in-python-in-sqlalchemy
a
Yes, my question was whether there is any way to do it "Kedro style" using the catalog. We can always do it like that, using a transaction at the end of the node, but then we would need to remove those writes from the catalog.
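For context, a minimal sketch of that transaction-in-the-node approach from the StackOverflow link above, assuming pandas DataFrames and a SQLAlchemy engine (the connection string and table names here are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- swap in your Postgres credentials.
engine = create_engine("postgresql://user:password@host:5432/dbname")


def write_all_or_nothing(df1: pd.DataFrame, df2: pd.DataFrame,
                         df3: pd.DataFrame, df4: pd.DataFrame) -> None:
    # engine.begin() opens a transaction that commits only if the block
    # completes; if any to_sql call raises, everything is rolled back.
    with engine.begin() as conn:
        df1.to_sql("table_1", con=conn, if_exists="append", index=False)
        df2.to_sql("table_2", con=conn, if_exists="append", index=False)
        df3.to_sql("table_3", con=conn, if_exists="append", index=False)
        df4.to_sql("table_4", con=conn, if_exists="append", index=False)
```

The catch, as noted above, is that the writes then live inside the node function rather than in the catalog.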
n
What if the node doesn’t execute the query immediately but rather appends the operations to a query plan? Would it be easier to make this a transaction in that case? @Deepyaman Datta do you think the ibis dataset can help here?
d
Unfortunately, Ibis doesn't have any higher-level API for transactions at this time; you'd need to drop down into the driver, anyway.
👀 1
> The last node of one pipeline has 4 outputs which are 4 tables to write to a Postgres database.

In a sense (not literally, because it's not about filepaths; just trying to give an analogy), this is like a `PartitionedDataset`? But in Kedro, there's no support for transaction-like behavior on `PartitionedDataset`. That said, I think this could make sense to support more broadly...
💡 1
i
@Alex Ferrero are you specifically talking about failures on insert? Because if the failure you're trying to handle is something that happens within the Kedro pipeline, you could solve this by having one final "gatekeeper" node that takes all the outputs as MemoryDatasets and then outputs the actual 4 SQL datasets as a final step (sketch below). This wouldn't help you with rolling back a transaction if the failure happens on write, though maybe having the 4 datasets as outputs of the same node could allow you to do something with hooks, but we are definitely in "wheel-reinventing" territory here. EDIT: need to read better
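A minimal sketch of that gatekeeper layout, assuming hypothetical dataset names: everything upstream of the gatekeeper stays a MemoryDataset, and only its four outputs are registered in the catalog as SQL-backed datasets:

```python
from kedro.pipeline import node


def gatekeeper(out1, out2, out3, out4):
    # Pass-through: by the time this node runs, all upstream computation
    # has succeeded, so the only remaining risk is the write itself.
    return out1, out2, out3, out4


final_write = node(
    func=gatekeeper,
    inputs=["out1_memory", "out2_memory", "out3_memory", "out4_memory"],
    outputs=["table_1_sql", "table_2_sql", "table_3_sql", "table_4_sql"],
    name="gatekeeper_write",
)
```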
d
@Iñigo Hidalgo after rereading it, my impression was that the 4 datasets are already outputs of the same node:
> The last node of one pipeline has 4 outputs which are 4 tables to write to a Postgres database.
👍 1
a
Although it might be reinventing the wheel, we are going to consider creating a custom class inheriting from SQLTableDataset with a create_connection method, based on what @Nok Lam Chan sent in the first reply... we'll see how it goes 👀 Thanks, guys!!
👀 1
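For anyone finding this thread later, a minimal sketch of one possible custom dataset along those lines. This is not the implementation the team ended up with; it swaps the SQLTableDataset subclass for a hypothetical AbstractDataset (Kedro 0.19+ naming) that accepts a dict of DataFrames keyed by table name and writes them all inside a single SQLAlchemy transaction:

```python
from __future__ import annotations

import pandas as pd
from kedro.io import AbstractDataset
from sqlalchemy import create_engine


class AtomicMultiTableDataset(AbstractDataset):
    """Writes a dict of {table_name: DataFrame} in one transaction."""

    def __init__(self, credentials: dict, save_args: dict | None = None):
        # e.g. credentials: {"con": "postgresql://user:pass@host:5432/db"}
        self._con = credentials["con"]
        self._save_args = save_args or {"if_exists": "append", "index": False}

    def _save(self, data: dict[str, pd.DataFrame]) -> None:
        engine = create_engine(self._con)
        # Commit only if every to_sql call succeeds; otherwise the whole
        # transaction is rolled back, so no table is left partially written.
        with engine.begin() as conn:
            for table_name, df in data.items():
                df.to_sql(table_name, con=conn, **self._save_args)

    def _load(self) -> dict[str, pd.DataFrame]:
        raise NotImplementedError("This sketch is write-only.")

    def _describe(self) -> dict:
        return {"save_args": self._save_args}
```

The last node would then return a single dict instead of four separate outputs, trading catalog granularity for atomicity.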