# questions
s
Hello team, I was wondering if there's an approach to break a pandas DataFrame into chunks, run a few operations on each, and write each chunk to Parquet in append mode (without concatenating the chunks back)? So the Kedro node would have multiple writes.
d
Yes!
PartitionedDataSet
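With lazy saving, your node returns a dict of partition name -> callable, and Kedro invokes each callable one at a time when it saves the dataset, so the chunks get written out without ever being concatenated. A minimal sketch (the function name, chunk size, and catalog entry below are just placeholders; I'm assuming pandas.ParquetDataSet as the underlying dataset):

```python
# Minimal sketch of lazy saving with PartitionedDataSet.
# Assumed catalog.yml entry (dataset name and path are placeholders):
#
#   processed_chunks:
#     type: PartitionedDataSet
#     path: data/03_primary/processed_chunks
#     dataset: pandas.ParquetDataSet
#     filename_suffix: ".parquet"
from typing import Callable, Dict

import pandas as pd


def split_and_process(
    df: pd.DataFrame, chunk_size: int = 100_000
) -> Dict[str, Callable[[], pd.DataFrame]]:
    """Return one callable per chunk; each runs only when its partition is saved."""

    def make_partition(start: int) -> Callable[[], pd.DataFrame]:
        def build() -> pd.DataFrame:
            chunk = df.iloc[start : start + chunk_size].copy()
            chunk["processed"] = True  # stand-in for the real per-chunk operations
            return chunk

        return build

    return {
        f"part_{start // chunk_size:05d}": make_partition(start)
        for start in range(0, len(df), chunk_size)
    }
```

Each key of the returned dict becomes one .parquet file under the configured path.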
s
Gotcha thank you!
Thank you, this approach worked great! I was wondering how the lazy loading works; maybe this is just a Python concept and not related to the PartitionedDataSet implementation. It looks like multiple partitions are being saved at the same time and therefore their callables are being executed in parallel, is that right? Is there a way to control the number of executions at one time and limit RAM usage?
d
hey so generators are a really powerful piece of Python that let you defer computation until a value is actually needed, rather than computing everything up front. The opposite is called eager evaluation. If I do
range(1, 1_000_000_000_000)
I get a lazy object back (strictly a range, which behaves much like a generator): nothing has been computed yet. But if I call
list(range(1, 1_000_000_000_000))
I force Python to actually generate every number between 1 and a trillion at once. This is exactly how we do the PartitionedDataSet, i.e. we only load the data when the user needs it.
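To make the lazy vs. eager contrast concrete, a tiny plain-Python sketch (nothing Kedro-specific here):

```python
def numbers(n):
    """A generator: produces one value at a time, only when asked."""
    i = 0
    while i < n:
        yield i
        i += 1


gen = numbers(10**12)  # returns instantly; nothing is computed yet
first = next(gen)      # computes exactly one value (0)
# list(gen) would try to materialize ~a trillion ints and exhaust memory
```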
One clarification: the partitions aren't processed in parallel; they are handled sequentially unless you do something custom.
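That also answers the RAM question: since saving is sequential, only one chunk is materialized at a time, as long as each callable builds its chunk on demand. The same idea applies on the loading side, where Kedro passes a partitioned input to a node as a dict of partition id -> load callable; a sketch (the print is a stand-in for whatever you do per partition):

```python
from typing import Callable, Dict

import pandas as pd


def consume_partitions(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> None:
    # Kedro passes a PartitionedDataSet input as {partition_id: load_callable}
    for name, load in sorted(partitions.items()):
        df = load()           # only this partition is held in memory
        print(name, len(df))  # stand-in for the real per-partition work
        del df                # release it before loading the next one
```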
s
Ahh I see, thank you, very cool stuff!
d
my pleasure