# questions
Guys, is there any built-in solution to handle large databases, so that the nodes process them partially? Like, let's say, 100k rows running in batches of 10k each, instead of doing it by hand with a for loop or something like that...
Someone will reply to you shortly. In the meantime, this might help:
I'm actually doing a blog post on this topic as we speak
but you can use the PartitionedDataset
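A minimal sketch of how that looks on the node side: Kedro passes a PartitionedDataset into a node as a dict of partition IDs mapped to load functions, so each partition is only read when its loader is called. The aggregation below is a placeholder, not the real processing step.

```python
from typing import Callable, Dict

import pandas as pd


def summarise_partitions(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    """Consume a PartitionedDataset one partition at a time.

    Kedro injects the dataset as a dict of {partition_id: load_function},
    so a partition is only read into memory when its loader is called.
    """
    summaries = []
    for partition_id, load_partition in sorted(partitions.items()):
        chunk = load_partition()  # loads just this partition
        # Placeholder aggregation -- replace with the real per-chunk processing.
        summaries.append(
            pd.DataFrame({"partition": [partition_id], "rows": [len(chunk)]})
        )
    return pd.concat(summaries, ignore_index=True)
```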
Nice to know, I'd love to read it... Yeah, I mean, I saw a little bit about PartitionedDataset; it's just that it wasn't clear to me whether it's usable in all scenarios, like to avoid problems with lack of VM resources, to let me run even with a lower count of CPUs and so on...
I do want to learn more about the "Kedro way" of things, to understand its full potential, you know.
@Thiago José Moser Poletto It's not really a question of the "Kedro way", but if you want to process large volumes of data from a database, the best way is to do the compute on the database. For example, Ibis is a Python dataframe library that lets you lazily execute code on the backend. Ibis can be fairly easily integrated with Kedro (there are built-in datasets and examples). Would this help, or am I misunderstanding your question?
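For illustration, a minimal Ibis sketch (the DuckDB connection, table, and column names are placeholders; a Postgres or BigQuery backend works the same way): the expression is built lazily, the filter and aggregation run inside the database, and only the small result is pulled back into memory.

```python
import ibis

# Placeholder connection -- swap in the backend you actually use
# (ibis.postgres.connect, ibis.bigquery.connect, ...).
con = ibis.duckdb.connect("warehouse.db")

orders = con.table("orders")          # hypothetical table
expr = (
    orders
    .filter(orders.amount > 0)        # hypothetical column
    .group_by("customer_id")
    .aggregate(total=orders.amount.sum())
)

# Nothing is computed until execute(); the heavy lifting happens
# in the database, and only the aggregated result comes back.
df = expr.execute()
```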
https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#how-to-use-generator-functions-in-a-node @Thiago José Moser Poletto Kedro supports generator functions. You only need a dataset that loads a part of the data and `yield`s it; then, in the node, you iterate through the generated data chunks, process them, and `yield` them back, and Kedro will call `save` on a dataset that supports append. You can check the docs for an example of how that works.
This way only a small part of the dataset gets loaded into memory at a time.
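A rough sketch of what such a generator node could look like, assuming the input dataset is configured to load lazily in chunks (e.g. a pandas dataset with `chunksize` in its load args) and the output dataset appends on each save; the transformation is a placeholder.

```python
from typing import Iterator

import pandas as pd


def process_in_chunks(chunks: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Generator node: receive the data chunk by chunk and yield each
    processed chunk back, so only one chunk is held in memory at a time.
    Kedro calls save() on the output dataset for every yielded chunk.
    """
    for chunk in chunks:
        # Placeholder transformation -- replace with the real logic.
        chunk = chunk.dropna()
        yield chunk
```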
I appreciate that, guys, I'll read up and try that, @Ivan Danov... My problem right now is actually unknown: I was running code that I didn't build, and it was working, but as soon as I changed the input data, which is larger than what we used to use, it stopped working for some reason. The kernel dies before it finishes, and since the process takes quite some time, it's kind of impossible for me to keep watching the code execution. That's why I'd like a way to make sure the input data is processed properly at every step of the way. But I'll try the solution Ivan mentioned and see what I can do with that..
👍 1