# questions
Hi team, is there a way to drop certain DataFrames from pipelines that aren't required anymore, without defining them in the catalog? My assumption is that if you don't define a dataset in the catalog, it continues to live in a `MemoryDataSet`, right?
Are they not required because an experiment has been abandoned?
Let's say, for example, I have a Spark DataFrame and I'm converting it to pandas using transcoding for further nodes. After that, the Spark DataFrame is no longer needed: if I save it via the catalog it will eat up disk space, and if I don't save it, it will continue to sit in a `MemoryDataSet`, consuming memory.
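For context, transcoding in Kedro is expressed in `catalog.yml` by giving the same dataset name two entries with `@` suffixes. A minimal sketch, assuming a Parquet file at a hypothetical path (the dataset name and filepath here are illustrative, not from the thread):

```yaml
# Hypothetical catalog.yml entries for transcoding between Spark and pandas.
# Both entries point at the same file; the @suffix selects the load/save API.
my_data@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/my_data
  file_format: parquet

my_data@pandas:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/my_data
```

A node that produces `my_data@spark` writes the file once; downstream nodes that declare `my_data@pandas` as input read the same file back as a pandas DataFrame.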
Oh, in that sense: delete the catalog entry and the Python GC will deal with it.
If it's Spark, it's likely not even your memory but the cluster's memory, and it's lazily evaluated, so it should have no impact there!
How are you transcoding but at the same time it's a `MemoryDataSet`?
And `MemoryDataSet`s are automatically cleaned up, and released as soon as they're no longer needed by the pipeline (i.e. usually sooner than the end of the pipeline).
Yes, I believe this is managed by Kedro; essentially the class has its own mini GC mechanism to track whether a dataset is still needed.
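The reference-counting idea behind that mechanism can be sketched in plain Python. This is not Kedro's actual implementation (the class and node structure here are illustrative): count how many nodes still need each dataset, decrement after each node runs, and drop the dataset as soon as the count reaches zero.

```python
from collections import Counter


class ReleasingRunner:
    """Minimal sketch of reference-counted dataset release, similar in
    spirit to how a pipeline runner can free in-memory datasets early.
    Nodes are plain dicts: {"inputs": [names], "func": callable}."""

    def run(self, nodes, datasets):
        # Count how many nodes still need each input dataset.
        load_counts = Counter(dep for node in nodes for dep in node["inputs"])
        for node in nodes:
            inputs = {name: datasets[name] for name in node["inputs"]}
            # Each node returns a dict of {output_name: value}.
            datasets.update(node["func"](**inputs))
            # Decrement counts; release a dataset once nothing needs it,
            # which happens before the whole pipeline has finished.
            for name in node["inputs"]:
                load_counts[name] -= 1
                if load_counts[name] == 0:
                    datasets.pop(name)
        return datasets
```

Running a two-node chain shows the intermediate dataset being freed after its last consumer, while the final output survives:

```python
nodes = [
    {"inputs": ["raw"], "func": lambda raw: {"clean": raw * 2}},
    {"inputs": ["clean"], "func": lambda clean: {"result": clean + 1}},
]
out = ReleasingRunner().run(nodes, {"raw": 10})
# "raw" and "clean" were released along the way; only "result" remains.
```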