# questions
i
Hello Kedro Community! I'm diving into the world of Kedro pipelines to work with substantial single-cell data. The data is very big, and I'm planning to implement the transformations in different nodes. However, keeping all intermediate datasets in memory isn't feasible, and saving them all would eat up a significant chunk of disk space. I'm wondering if any of you have faced a similar challenge and have insights on how to address it effectively. Specifically, I'm curious if there's a way for Kedro to automatically manage and clean up MemoryDatasets that are no longer in use, helping to optimize memory usage and disk space. Thanks in advance for your help.
j
hello @Irene Robles, welcome! Kedro MemoryDatasets are transient and passed from node to node, so at any given point in time there will only be as many of them as needed to carry the computation. after a node is done using them, they get discarded immediately
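To illustrate that, here is a minimal sketch (the node functions and dataset names are hypothetical): any node output that is not registered in the catalog defaults to a MemoryDataset and is released once its last consumer has run.

from kedro.pipeline import node, pipeline

def normalise(adata):
    # hypothetical transformation on an AnnData object
    ...

def cluster(adata):
    ...

example_pipeline = pipeline(
    [
        # "normalised" is not in catalog.yml, so it becomes a MemoryDataset
        node(normalise, inputs="raw_counts", outputs="normalised"),
        # once this node has consumed "normalised", the runner releases it,
        # so only the data needed for the current step stays in RAM
        node(cluster, inputs="normalised", outputs="clusters"),
    ]
)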
also, the default Kedro runner is sequential, so nodes won't run in parallel and RAM usage will stay contained
if this still poses problems for you we can discuss a bit more in depth how to tackle those issues
what kind of data would you be loading? n-dimensional arrays, some Biopython format?
i
Hi Juanlu, thanks a lot for your help
I am using scanpy.AnnData
d
Also remember the lazy callable pattern of PartitionedDataSet!
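For reference, a minimal sketch of that lazy pattern (normalise is a hypothetical function): a PartitionedDataSet loads as a dict of partition id to load callable, and a node can likewise return a dict of callables so each partition is only materialised when Kedro saves it.

def process_partitions(partitions: dict) -> dict:
    """Work through a PartitionedDataSet one partition at a time."""
    results = {}
    for partition_id, load_partition in partitions.items():
        # returning a callable (instead of the data itself) defers both the
        # load and the computation until Kedro saves this partition, so only
        # one partition needs to fit in memory at any given moment
        results[partition_id] = lambda load=load_partition: normalise(load())
    return results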
i
I will look into it
j
I see, https://scanpy.readthedocs.io/en/latest/usage-principles.html#anndata for the record. You'll probably need to define your own dataset wrapping sc.read and adata.write; when you get to that point, let us know!
i
Something like:
import scanpy as sc

from kedro.io import AbstractDataSet
from typing import Any, Dict

class scRNAseqDataset(AbstractDataSet):
    def __init__(self, filepath):
        """Creates a new instance of scRNAseqDataset.

        Args:
            filepath: Path to the h5ad file.
        """
        self._filepath = filepath

    def _load(self) -> sc.AnnData:
        return sc.read_h5ad(self._filepath)

    def _save(self, data: sc.AnnData) -> None:
        data.write_h5ad(self._filepath, compression='gzip')

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath)
?
j
that's a good start 💯 it's missing some of the fsspec logic that enables you to use datasets in remote locations, as well as the versioning stuff. but the code should be more or less like that
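For context, a hedged sketch of what that fsspec wiring typically looks like, following the custom-dataset pattern in the Kedro docs and building on the class above. The versioning side (subclassing AbstractVersionedDataSet) is left out, and a fully remote-aware version would also route reads and writes through self._fs rather than a plain path.

from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import scanpy as sc
from kedro.io import AbstractDataSet
from kedro.io.core import get_protocol_and_path


class scRNAseqDataset(AbstractDataSet):
    def __init__(self, filepath: str):
        # split e.g. "s3://bucket/data.h5ad" into protocol and path so the
        # same dataset class can point at local or remote storage
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> sc.AnnData:
        # local read shown here; remote storage would need the file opened
        # through self._fs instead of passing a bare path
        return sc.read_h5ad(str(self._filepath))

    def _save(self, data: sc.AnnData) -> None:
        data.write_h5ad(str(self._filepath), compression="gzip")

    def _exists(self) -> bool:
        # fsspec answers existence checks uniformly across storage backends
        return self._fs.exists(str(self._filepath))

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, protocol=self._protocol)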