#questions

Richard Purvis

02/27/2024, 7:36 PM
Hello, is it possible to save data from a pandas iterator to a partitioned dataset in chunks? For example, reading from `pd.read_csv` with a `chunksize` arg. I have seen the lazy save for a partitioned dataset article (link). However, this requires a pre-defined dictionary with callable items, and if you are iterating through chunks you wouldn't be able to predefine keys. CC @Yury Fedotov
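For illustration, a minimal sketch of the pattern being described (the file path and chunk size are placeholders):
```python
import pandas as pd

# Reading a large CSV lazily: each iteration yields one DataFrame chunk.
# The number of chunks isn't known up front, so the partition keys can't
# be predefined the way the PartitionedDataset lazy-save example expects.
for chunk in pd.read_csv("data/01_raw/big_file.csv", chunksize=100_000):
    ...  # process one chunk at a time
```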

datajoely

02/28/2024, 2:16 AM
Can you not lazily create a dictionary of callables to do this, where your keys are just enumerated?

Juan Luis

02/28/2024, 6:45 AM
hi @Richard Purvis, I think this is similar to @Biel Stela's request here https://kedro-org.slack.com/archives/C03RKP2LW64/p1706716264070519
just so that I understand, would your node function be `return`ing the individual chunks? `yield` them? something else?

Richard Purvis

02/28/2024, 1:23 PM
@Juan Luis It would be yielding them. @datajoely I'm not sure what you mean by enumerate, as in the Python `enumerate()` function?

datajoely

02/28/2024, 1:24 PM
yeah just so you can have a lazily defined set of keys for each chunk
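For illustration, one way that enumerated-keys suggestion could look (the function name, path, and chunk size are placeholders):
```python
import pandas as pd

def create_partitions(filepath: str) -> dict:
    # Enumerate the chunks so each partition gets a generated key.
    # The default argument binds the current chunk to each lambda
    # (otherwise every callable would close over the last chunk).
    reader = pd.read_csv(filepath, chunksize=100_000)
    return {
        f"part_{i:04d}": (lambda chunk=chunk: chunk)
        for i, chunk in enumerate(reader)
    }
```
One caveat: the dict comprehension consumes the reader eagerly, so every chunk is held in memory via the lambda defaults; the generator approach discussed below avoids that.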

Juan Luis

02/28/2024, 1:39 PM
we have an example in the docs with a generator node using `yield` and a custom dataset, please have a look: https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html#saving-data-with-generators
if this isn't quite it, let's continue the conversation 🙂
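For illustration, a minimal sketch of that generator-node pattern (the function name, path, and chunk size are placeholders); per the linked docs, each yielded chunk is saved by the dataset as it is produced:
```python
import pandas as pd

def process_large_csv(filepath: str):
    # A generator node: Kedro saves each yielded value as it arrives,
    # so the full file never has to be held in memory at once.
    for chunk in pd.read_csv(filepath, chunksize=100_000):
        yield chunk
```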

Richard Purvis

02/29/2024, 5:06 PM
@Juan Luis This appears to be exactly what I need, thank you!

Juan Luis

02/29/2024, 5:13 PM
amazing 🙌🏼