# questions
e
Question regarding saving data which is too big for RAM. I have a use case where a Kedro pipeline creates an RDF file - for the sake of this question, an RDF file is simply a text file. It is used to bulk-load data into graph databases. My go-to was to use `text.TextDataSet` for this purpose, but that means my node has to return a string with the entire contents of the RDF file. Because the file is going to be so big, I want to write it in batches: create 10% of the contents, write it, create the next 10%, write it, and so on until reaching 100%. Is there a way in Kedro to achieve something like that? I looked at `IncrementalDataSet`, but it seems to have nothing to do with this use case.
d
You can use `PartitionedDataset` and lazily save each partition.
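For context, a minimal sketch of what lazy saving with `PartitionedDataset` could look like, assuming the node returns a dict of callables that Kedro only evaluates when each partition is saved. The function names, partition keys and the toy RDF serialisation below are illustrative, not from the thread:

```python
from typing import Callable


def split_into_batches(records: list[dict], n_batches: int) -> list[list[dict]]:
    """Split records into roughly equal batches (illustrative helper)."""
    size = max(1, -(-len(records) // n_batches))  # ceiling division
    return [records[i : i + size] for i in range(0, len(records), size)]


def serialise_to_rdf(batch: list[dict]) -> str:
    """Turn one batch of records into RDF text (placeholder serialisation)."""
    return "\n".join(f'<{r["s"]}> <{r["p"]}> "{r["o"]}" .' for r in batch) + "\n"


def create_rdf_partitions(records: list[dict]) -> dict[str, Callable[[], str]]:
    """Node output for a PartitionedDataset: one callable per partition.

    Kedro calls each callable only when it saves that partition, so the
    full file contents never sit in memory at once.
    """
    batches = split_into_batches(records, n_batches=10)
    return {
        f"part_{i:03d}": (lambda batch=batch: serialise_to_rdf(batch))
        for i, batch in enumerate(batches)
    }
```

The catalog entry would then point a `PartitionedDataset` at a folder with an underlying `text.TextDataSet`, so each callable's return value is written as its own text file.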
e
But I need them all in the same file - the tool that loads them into the graph expects a single file.
d
I would subclass `text.TextDataSet` and get it to write chunks using a generator.
e
@datajoely interesting. So `_save` would accept a generator instead of a string?
e
@datajoely but in this example it's not accepting a generator - it accepts the actual data structure, which is a dataframe. So my node should be a generator, but `_save` should still accept `str`, and Kedro will unpack it, if I understand correctly?
m
Yes, Kedro unpacks the generator and calls `_save` for each item - your dataset implementation can just append to a single file. Example: https://kedro-org.slack.com/archives/C03RKP2LW64/p1699443585183569?thread_ts=1699440250.364219&cid=C03RKP2LW64
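A minimal sketch of that idea, assuming the generator-node approach described above. The class name `AppendableTextDataSet`, the node function and the toy serialisation are made up for illustration (the real example is in the linked message), and the import path matches the 0.18-era `AbstractDataSet` naming used in this thread:

```python
from pathlib import Path
from typing import Any, Iterator

from kedro.io import AbstractDataSet


class AppendableTextDataSet(AbstractDataSet[str, str]):
    """Appends every saved chunk to a single text file."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _save(self, data: str) -> None:
        # Called once per yielded chunk; append so all chunks land in one file.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        with self._filepath.open("a", encoding="utf-8") as f:
            f.write(data)

    def _load(self) -> str:
        return self._filepath.read_text(encoding="utf-8")

    def _describe(self) -> dict[str, Any]:
        return {"filepath": str(self._filepath)}


def generate_rdf(records: list[dict]) -> Iterator[str]:
    """Generator node: yields roughly 10% of the RDF text at a time.

    Kedro unpacks the generator and calls the dataset's _save once per chunk.
    """
    batch_size = max(1, len(records) // 10)
    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        yield "\n".join(f'<{r["s"]}> <{r["p"]}> "{r["o"]}" .' for r in batch) + "\n"
```

Registering `AppendableTextDataSet` in the catalog with a `filepath` and wiring `generate_rdf` as the node should keep only one chunk in memory at a time while still producing a single output file.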
e
@marrrcin we meet again 😄 (we worked together on kedro-azure - we donated the datasets for Azure). Thanks, I'll try it out!