# questions
Guys, is there any built-in solution to handle large databases, so that the nodes process them partially? Like, let's say, 100k rows running in batches of 10k each, instead of doing it by hand with a for loop or something like that...
Someone will reply to you shortly. In the meantime, this might help:
I'm actually doing a blog post on this topic as we speak
but you can use the PartitionedDataset
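A minimal sketch of how that looks on the node side: Kedro passes a PartitionedDataset into a node as a dict of partition IDs mapped to load functions, so each partition is only read when its loader is called. The aggregation below is a placeholder, not the real processing step.

```python
from typing import Callable, Dict

import pandas as pd


def summarise_partitions(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    """Consume a PartitionedDataset one partition at a time.

    Kedro injects the dataset as a dict of {partition_id: load_function},
    so a partition is only read into memory when its loader is called.
    """
    summaries = []
    for partition_id, load_partition in sorted(partitions.items()):
        chunk = load_partition()  # loads just this partition
        # Placeholder aggregation -- replace with the real per-chunk processing.
        summaries.append(
            pd.DataFrame({"partition": [partition_id], "rows": [len(chunk)]})
        )
    return pd.concat(summaries, ignore_index=True)
```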
Nice to know, I'd love to read it... Yeah, I mean, I saw a little bit about PartitionedDataset; it's just that it wasn't clear to me whether it's usable in all scenarios, like to avoid problems with lack of VM resources, to let me run even with a lower count of CPUs and so on...
I do want to learn more about the "Kedro way" of things, to understand its full potential, you know.
@Thiago José Moser Poletto It's not really a question of the "Kedro way", but if you want to process large volumes of data from a database, the best way is to do the compute on the database. For example, Ibis is a Python dataframe library that lets you lazily execute code on the backend. Ibis can be fairly easily integrated with Kedro (there are built-in datasets and examples). Would this help, or am I misunderstanding your question?
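For illustration, a minimal Ibis sketch (the DuckDB connection, table, and column names are placeholders; a Postgres or BigQuery backend works the same way): the expression is built lazily, the filter and aggregation run inside the database, and only the small result is pulled back into memory.

```python
import ibis

# Placeholder connection -- swap in the backend you actually use
# (ibis.postgres.connect, ibis.bigquery.connect, ...).
con = ibis.duckdb.connect("warehouse.db")

orders = con.table("orders")          # hypothetical table
expr = (
    orders
    .filter(orders.amount > 0)        # hypothetical column
    .group_by("customer_id")
    .aggregate(total=orders.amount.sum())
)

# Nothing is computed until execute(); the heavy lifting happens
# in the database, and only the aggregated result comes back.
df = expr.execute()
```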
https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#how-to-use-generator-functions-in-a-node @Thiago José Moser Poletto Kedro supports generator functions. You only need a dataset that loads a part of the data and `yield`s it; then, in the node, you iterate through the generated data chunks, process them, and `yield` them back, and Kedro will call `save` on a dataset that supports append. You can check the docs for an example of how that works.
This way only a small part of the dataset gets loaded into memory at a time.
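A rough sketch of what such a generator node could look like, assuming the input dataset is configured to load lazily in chunks (e.g. a pandas dataset with `chunksize` in its load args) and the output dataset appends on each save; the transformation is a placeholder.

```python
from typing import Iterator

import pandas as pd


def process_in_chunks(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Generator node: receive the data chunk by chunk and yield each
    processed chunk back, so only one chunk is held in memory at a time.
    Kedro calls save() on the output dataset for every yielded chunk.
    """
    for chunk in chunks:
        # Placeholder transformation -- replace with the real logic.
        chunk = chunk.dropna()
        yield chunk
```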
I appreciate that, guys, I'll read up and try that, @Ivan Danov... My problem right now is actually unknown: I was running code that I didn't build, and it was working, but as soon as I changed the input data, which is larger than what we used to use, it stopped working for some reason. The kernel dies before it finishes, and since the process takes quite some time, it's kind of impossible for me to keep watching the code execution. That's why I'd like a way to make sure the input data is processed properly at every step of the way. But I'll try the solution Ivan mentioned and see what I can do with that..
👍 1