# questions
Hi team, is there a way to drop certain DataFrames from pipelines that aren't required anymore, without defining them in the catalog? My assumption is that if you don't define a dataset in the catalog, it continues to live in a `MemoryDataSet`, right?
Are they not required because an experiment has been abandoned?
Let's say, for example, I have a Spark DataFrame and I'm converting it to pandas using transcoding for further nodes. After that, the Spark DataFrame is no longer needed: if I save it via the catalog it will eat up disk space, and if I don't save it, it will continue to sit in a `MemoryDataSet`, consuming memory.
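For context, transcoding in Kedro is expressed in `catalog.yml` by giving the same dataset name two entries with `@` suffixes. A minimal sketch, assuming a Parquet file at a hypothetical path (the dataset name and filepath here are illustrative, not from the thread):

```yaml
# Hypothetical catalog.yml entries for transcoding between Spark and pandas.
# Both entries point at the same file; the @suffix selects the load/save API.
my_data@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/my_data
  file_format: parquet

my_data@pandas:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/my_data
```

A node that produces `my_data@spark` writes the file once; downstream nodes that declare `my_data@pandas` as input read the same file back as a pandas DataFrame.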
Oh, in that sense: delete the catalog entry and the Python GC will deal with it.
If it's Spark, it's likely not even your memory but the cluster's memory, and it's lazily evaluated, so it should have no impact there!
How are you transcoding but at the same time it's a `MemoryDataSet`?
And `MemoryDataSet`s are automatically cleaned up, and released as soon as they're no longer needed by the pipeline (i.e. usually sooner than the end of the pipeline).
Yes, I believe this is managed by Kedro; essentially the class has its own mini GC mechanism to track whether a dataset is still needed.
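The reference-counting idea behind that mechanism can be sketched in plain Python. This is not Kedro's actual implementation (the class and node structure here are illustrative): count how many nodes still need each dataset, decrement after each node runs, and drop the dataset as soon as the count reaches zero.

```python
from collections import Counter


class ReleasingRunner:
    """Minimal sketch of reference-counted dataset release, similar in
    spirit to how a pipeline runner can free in-memory datasets early.
    Nodes are plain dicts: {"inputs": [names], "func": callable}."""

    def run(self, nodes, datasets):
        # Count how many nodes still need each input dataset.
        load_counts = Counter(dep for node in nodes for dep in node["inputs"])
        for node in nodes:
            inputs = {name: datasets[name] for name in node["inputs"]}
            # Each node returns a dict of {output_name: value}.
            datasets.update(node["func"](**inputs))
            # Decrement counts; release a dataset once nothing needs it,
            # which happens before the whole pipeline has finished.
            for name in node["inputs"]:
                load_counts[name] -= 1
                if load_counts[name] == 0:
                    datasets.pop(name)
        return datasets
```

Running a two-node chain shows the intermediate dataset being freed after its last consumer, while the final output survives:

```python
nodes = [
    {"inputs": ["raw"], "func": lambda raw: {"clean": raw * 2}},
    {"inputs": ["clean"], "func": lambda clean: {"result": clean + 1}},
]
out = ReleasingRunner().run(nodes, {"raw": 10})
# "raw" and "clean" were released along the way; only "result" remains.
```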