#questions

Elior Cohen

11/14/2023, 2:02 PM
Question regarding saving data which is too big for RAM. I have a use case where a Kedro pipeline creates an RDF file - for the sake of this question, an RDF file is simply a text file. It is used to load bulks of data into graph databases. My go-to was to use `text.TextDataSet` for this purpose, but this means that my node should return a string with the entire contents of this RDF file. Because the file is going to be so big, I want to write it in batches: create 10% of the contents, write them, then create another 10% and write them, and so on until reaching 100%. Is there a way in Kedro to achieve something like that? I looked at `IncrementalDataSet` but it seems it has nothing to do with this use case

Deepyaman Datta

11/14/2023, 2:05 PM
You can use `PartitionedDataset`, and lazily save each partition.
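Roughly like this - a sketch only; `serialize_chunk` and `iter_chunks` are placeholder names, and the catalog entry is just illustrative:

```python
# Hypothetical catalog entry (YAML), shown here as a comment:
#
#   rdf_parts:
#     type: PartitionedDataset
#     path: data/07_model_output/rdf_parts
#     dataset: text.TextDataSet
#     filename_suffix: ".ttl"
from typing import Callable


def make_rdf_parts(records) -> dict[str, Callable[[], str]]:
    """Return one lazily evaluated partition per chunk.

    Each value is a zero-argument callable; PartitionedDataset calls it
    only when that partition is being written, so only one chunk's worth
    of text is in memory at a time.
    """
    def render(chunk) -> Callable[[], str]:
        return lambda: serialize_chunk(chunk)  # placeholder serializer

    return {
        f"part_{i:03d}": render(chunk)
        for i, chunk in enumerate(iter_chunks(records))  # placeholder chunker
    }
```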

Elior Cohen

11/14/2023, 2:06 PM
But I need them all in the same file - the tool that loads them into the graph expects a single file

datajoely

11/14/2023, 2:12 PM
I would subclass `text.TextDataSet` and get it to write chunks using a generator
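Something along these lines - an untested sketch that leans on `TextDataSet`'s private internals (`_fs`, `_get_save_path`, `_invalidate_cache`), which may differ between kedro-datasets versions:

```python
from kedro.io.core import get_filepath_str
from kedro_datasets.text import TextDataSet


class AppendableTextDataSet(TextDataSet):
    """Like TextDataSet, but _save appends, so calling it once per
    chunk builds up a single file."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._append = False  # first _save truncates, later calls append

    def _save(self, data: str) -> None:
        # Same path resolution as the parent class, different open mode.
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        mode = "a" if self._append else "w"
        with self._fs.open(save_path, mode=mode) as fs_file:
            fs_file.write(data)
        self._append = True
        self._invalidate_cache()
```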

Elior Cohen

11/14/2023, 2:14 PM
@datajoely interesting. So `_save` would accept a generator instead of a string?

Elior Cohen

11/14/2023, 2:17 PM
@datajoely but in this example it's not accepting a generator - it accepts the actual data structure, which is a dataframe. So my node should be a generator, but `_save` should still accept `str`, and Kedro will unpack it, if I understand correctly?

marrrcin

11/14/2023, 2:21 PM
Yes, Kedro unpacks the generator and calls `_save` for each item - your dataset implementation can just append to a single file. Example: https://kedro-org.slack.com/archives/C03RKP2LW64/p1699443585183569?thread_ts=1699440250.364219&cid=C03RKP2LW64
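So the node side is just a generator function - a sketch, where `to_triple` and the batch size are placeholders:

```python
from collections.abc import Iterator


def build_rdf(records) -> Iterator[str]:
    """Yield the RDF contents in batches; Kedro calls the output
    dataset's _save once per yielded string."""
    batch = []
    for record in records:
        batch.append(to_triple(record))  # placeholder record->triple serializer
        if len(batch) >= 100_000:
            yield "".join(batch)
            batch = []
    if batch:  # flush the remainder
        yield "".join(batch)
```

Wire `build_rdf` up as an ordinary node whose output is the append-mode dataset above, and the chunks end up concatenated in a single file.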

Elior Cohen

11/14/2023, 2:22 PM
@marrrcin we meet again 😄 (we worked together on kedro-azure when we donated the datasets for Azure). Thanks, I'll try it out