# questions
a
Hi team, is there any way (without reinventing the wheel) to make atomic outputs using Kedro? Let me explain the use case: the last node of one pipeline has 4 outputs, which are 4 tables to write to a Postgres database. We need to roll back the inserts in case one of them fails. For instance, if the insert for output table number 3 fails, we would need to delete the data inserted for output tables 1 and 2. Thanks in advance.
n
I haven't used Kedro for this kind of ETL flow before. Are you mainly using the SQL dataset with pandas? https://stackoverflow.com/questions/63440779/how-to-rollback-dataframe-to-sql-in-python-in-sqlalchemy
a
Yes, my question was whether there is any way to do it "Kedro style" using the catalog. We can always do it like that, using a transaction at the end of the node, but then we would need to remove those writes from the catalog.
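For context, a minimal sketch of that transaction-in-the-node approach from the StackOverflow link above, assuming pandas DataFrames and a SQLAlchemy engine (the connection string and table names here are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- swap in your Postgres credentials.
engine = create_engine("postgresql://user:password@host:5432/dbname")


def write_all_or_nothing(df1: pd.DataFrame, df2: pd.DataFrame,
                         df3: pd.DataFrame, df4: pd.DataFrame) -> None:
    # engine.begin() opens a transaction that commits only if the block
    # completes; if any to_sql call raises, everything is rolled back.
    with engine.begin() as conn:
        df1.to_sql("table_1", con=conn, if_exists="append", index=False)
        df2.to_sql("table_2", con=conn, if_exists="append", index=False)
        df3.to_sql("table_3", con=conn, if_exists="append", index=False)
        df4.to_sql("table_4", con=conn, if_exists="append", index=False)
```

The catch, as noted above, is that the writes then live inside the node function rather than in the catalog.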
n
What if the node doesn’t execute the query immediately but rather appends the operations to a query plan? Would it be easier to make this a transaction in that case? @Deepyaman Datta do you think the ibis dataset can help here?
d
Unfortunately, Ibis doesn't have any higher-level API for transactions at this time; you'd need to drop down into the driver, anyway.
👀 1
> The last node of one pipeline has 4 outputs which are 4 tables to write to a Postgres database.

In a sense (not literally, because it's not about filepaths; just trying to give an analogy), this is like a `PartitionedDataset`? But in Kedro, there's no support for transaction-like behavior on `PartitionedDataset`. That said, I think this could make sense to support more broadly...
💡 1
i
@Alex Ferrero are you specifically talking about failures on insert? Because if the failure you're trying to handle is something that happens within the Kedro pipeline, you could solve this by having one final "gatekeeper" node that takes all the outputs as MemoryDatasets and then outputs the actual 4 SQL datasets as a final step (sketch below). This wouldn't help you with rolling back a transaction if the failure happens on write, though maybe having the 4 datasets as outputs of the same node could allow you to do something with hooks, but we are definitely in "wheel-reinventing" territory here. EDIT: need to read better
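A minimal sketch of that gatekeeper layout, assuming hypothetical dataset names: everything upstream of the gatekeeper stays a MemoryDataset, and only its four outputs are registered in the catalog as SQL-backed datasets:

```python
from kedro.pipeline import node


def gatekeeper(out1, out2, out3, out4):
    # Pass-through: by the time this node runs, all upstream computation
    # has succeeded, so the only remaining risk is the write itself.
    return out1, out2, out3, out4


final_write = node(
    func=gatekeeper,
    inputs=["out1_memory", "out2_memory", "out3_memory", "out4_memory"],
    outputs=["table_1_sql", "table_2_sql", "table_3_sql", "table_4_sql"],
    name="gatekeeper_write",
)
```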
d
@Iñigo Hidalgo after rereading it, my impression was that the 4 datasets are already outputs of the same node:
> The last node of one pipeline has 4 outputs which are 4 tables to write to a Postgres database.
👍 1
a
Although it might be reinventing the wheel, we are going to consider creating a custom class inheriting from SQLTableDataset with a create_connection method, based on what @Nok Lam Chan sent in the first reply... we'll see how it goes 👀 Thanks, guys!!
👀 1
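For anyone finding this thread later, a minimal sketch of one possible custom dataset along those lines. This is not the implementation the team ended up with; it swaps the SQLTableDataset subclass for a hypothetical AbstractDataset (Kedro 0.19+ naming) that accepts a dict of DataFrames keyed by table name and writes them all inside a single SQLAlchemy transaction:

```python
from __future__ import annotations

import pandas as pd
from kedro.io import AbstractDataset
from sqlalchemy import create_engine


class AtomicMultiTableDataset(AbstractDataset):
    """Writes a dict of {table_name: DataFrame} in one transaction."""

    def __init__(self, credentials: dict, save_args: dict | None = None):
        # e.g. credentials: {"con": "postgresql://user:pass@host:5432/db"}
        self._con = credentials["con"]
        self._save_args = save_args or {"if_exists": "append", "index": False}

    def _save(self, data: dict[str, pd.DataFrame]) -> None:
        engine = create_engine(self._con)
        # Commit only if every to_sql call succeeds; otherwise the whole
        # transaction is rolled back, so no table is left partially written.
        with engine.begin() as conn:
            for table_name, df in data.items():
                df.to_sql(table_name, con=conn, **self._save_args)

    def _load(self) -> dict[str, pd.DataFrame]:
        raise NotImplementedError("This sketch is write-only.")

    def _describe(self) -> dict:
        return {"save_args": self._save_args}
```

The last node would then return a single dict instead of four separate outputs, trading catalog granularity for atomicity.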