# questions
s
Hello team, I was wondering if there's an approach to break a pandas DataFrame into chunks, run a few operations on each, and write each chunk to Parquet in append mode (without concatenating the chunks back)? So the Kedro node would have multiple writes.
d
Yes!
PartitionedDataSet
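With lazy saving, your node returns a dict of partition name -> callable, and Kedro invokes each callable one at a time when it saves the dataset, so the chunks get written out without ever being concatenated. A minimal sketch (the function name, chunk size, and catalog entry below are just placeholders; I'm assuming pandas.ParquetDataSet as the underlying dataset):

```python
# Minimal sketch of lazy saving with PartitionedDataSet.
# Assumed catalog.yml entry (dataset name and path are placeholders):
#
#   processed_chunks:
#     type: PartitionedDataSet
#     path: data/03_primary/processed_chunks
#     dataset: pandas.ParquetDataSet
#     filename_suffix: ".parquet"
from typing import Callable, Dict

import pandas as pd


def split_and_process(
    df: pd.DataFrame, chunk_size: int = 100_000
) -> Dict[str, Callable[[], pd.DataFrame]]:
    """Return one callable per chunk; each runs only when its partition is saved."""

    def make_partition(start: int) -> Callable[[], pd.DataFrame]:
        def build() -> pd.DataFrame:
            chunk = df.iloc[start : start + chunk_size].copy()
            chunk["processed"] = True  # stand-in for the real per-chunk operations
            return chunk

        return build

    return {
        f"part_{start // chunk_size:05d}": make_partition(start)
        for start in range(0, len(df), chunk_size)
    }
```

Each key of the returned dict becomes one .parquet file under the configured path.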
s
Gotcha thank you!
Thank you, this approach worked great! I was wondering how the lazy loading works; maybe this is just a Python concept and not related to the PartitionedDataSet implementation. It looks like multiple partitions are being saved at the same time and therefore their callables are being executed in parallel, is that right? Is there a way to control the number of executions at one time and limit RAM usage?
d
hey so generators are a really powerful piece of Python that let you defer computation until a value is actually needed, rather than computing everything up front. The opposite is called eager evaluation. If I do
range(1, 1_000_000_000_000)
I get a lazy object back (strictly a range, which behaves much like a generator): nothing has been computed yet. But if I call
list(range(1, 1_000_000_000_000))
I force Python to actually generate every number between 1 and a trillion at once. This is exactly how we do the PartitionedDataSet, i.e. we only load the data when the user needs it.
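To make the lazy vs. eager contrast concrete, a tiny plain-Python sketch (nothing Kedro-specific here):

```python
def numbers(n):
    """A generator: produces one value at a time, only when asked."""
    i = 0
    while i < n:
        yield i
        i += 1


gen = numbers(10**12)  # returns instantly; nothing is computed yet
first = next(gen)      # computes exactly one value (0)
# list(gen) would try to materialize ~a trillion ints and exhaust memory
```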
One clarification: the partitions aren't processed in parallel; they are handled sequentially unless you do something custom.
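That also answers the RAM question: since saving is sequential, only one chunk is materialized at a time, as long as each callable builds its chunk on demand. The same idea applies on the loading side, where Kedro passes a partitioned input to a node as a dict of partition id -> load callable; a sketch (the print is a stand-in for whatever you do per partition):

```python
from typing import Callable, Dict

import pandas as pd


def consume_partitions(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> None:
    # Kedro passes a PartitionedDataSet input as {partition_id: load_callable}
    for name, load in sorted(partitions.items()):
        df = load()           # only this partition is held in memory
        print(name, len(df))  # stand-in for the real per-partition work
        del df                # release it before loading the next one
```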
s
Ahh I see, thank you, very cool stuff!
d
my pleasure