# questions
e
Question regarding saving data which is too big for RAM. I have a use case where a Kedro pipeline creates an RDF file - for the sake of this question, an RDF file is simply a text file. It is used to bulk-load data into graph databases. My go-to was to use `text.TextDataSet` for this purpose, but that means my node has to return a string with the entire contents of the RDF file. Because the file is going to be so big, I want to write it in batches: create 10% of the contents, write it, create the next 10%, write it, and so on until reaching 100%. Is there a way in Kedro to achieve something like that? I looked at `IncrementalDataSet`, but it seems to have nothing to do with this use case.
d
You can use `PartitionedDataset` and lazily save each partition.
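For context, a minimal sketch of what lazy saving with `PartitionedDataset` could look like, assuming the node returns a dict of callables that Kedro only evaluates when each partition is saved. The function names, partition keys and the toy RDF serialisation below are illustrative, not from the thread:

```python
from typing import Callable


def split_into_batches(records: list[dict], n_batches: int) -> list[list[dict]]:
    """Split records into roughly equal batches (illustrative helper)."""
    size = max(1, -(-len(records) // n_batches))  # ceiling division
    return [records[i : i + size] for i in range(0, len(records), size)]


def serialise_to_rdf(batch: list[dict]) -> str:
    """Turn one batch of records into RDF text (placeholder serialisation)."""
    return "\n".join(f'<{r["s"]}> <{r["p"]}> "{r["o"]}" .' for r in batch) + "\n"


def create_rdf_partitions(records: list[dict]) -> dict[str, Callable[[], str]]:
    """Node output for a PartitionedDataset: one callable per partition.

    Kedro calls each callable only when it saves that partition, so the
    full file contents never sit in memory at once.
    """
    batches = split_into_batches(records, n_batches=10)
    return {
        f"part_{i:03d}": (lambda batch=batch: serialise_to_rdf(batch))
        for i, batch in enumerate(batches)
    }
```

The catalog entry would then point a `PartitionedDataset` at a folder with an underlying `text.TextDataSet`, so each callable's return value is written as its own text file.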
e
But I need them all in the same file - the tool that loads them into the graph expects a single file.
d
I would subclass `text.TextDataSet` and get it to write chunks using a generator.
e
@datajoely interesting. So `_save` would accept a generator instead of a string?
e
@datajoely but in this example it's not accepting a generator - it accepts the actual data structure, which is a dataframe. So my node should be a generator, but `_save` should still accept `str`, and Kedro will unpack it, if I understand correctly?
m
Yes, Kedro unpacks the generator and calls `_save` for each item - your dataset implementation can just append to a single file. Example: https://kedro-org.slack.com/archives/C03RKP2LW64/p1699443585183569?thread_ts=1699440250.364219&cid=C03RKP2LW64
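A minimal sketch of that idea, assuming the generator-node approach described above. The class name `AppendableTextDataSet`, the node function and the toy serialisation are made up for illustration (the real example is in the linked message), and the import path matches the 0.18-era `AbstractDataSet` naming used in this thread:

```python
from pathlib import Path
from typing import Any, Iterator

from kedro.io import AbstractDataSet


class AppendableTextDataSet(AbstractDataSet[str, str]):
    """Appends every saved chunk to a single text file."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _save(self, data: str) -> None:
        # Called once per yielded chunk; append so all chunks land in one file.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        with self._filepath.open("a", encoding="utf-8") as f:
            f.write(data)

    def _load(self) -> str:
        return self._filepath.read_text(encoding="utf-8")

    def _describe(self) -> dict[str, Any]:
        return {"filepath": str(self._filepath)}


def generate_rdf(records: list[dict]) -> Iterator[str]:
    """Generator node: yields roughly 10% of the RDF text at a time.

    Kedro unpacks the generator and calls the dataset's _save once per chunk.
    """
    batch_size = max(1, len(records) // 10)
    for start in range(0, len(records), batch_size):
        batch = records[start : start + batch_size]
        yield "\n".join(f'<{r["s"]}> <{r["p"]}> "{r["o"]}" .' for r in batch) + "\n"
```

Registering `AppendableTextDataSet` in the catalog with a `filepath` and wiring `generate_rdf` as the node should keep only one chunk in memory at a time while still producing a single output file.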
e
@marrrcin we meet again 😄 (we worked together on kedro-azure - we donated the datasets for Azure). Thanks, I'll try it out!