# questions
i
Hello Kedro Community! I'm diving into the world of Kedro pipelines to work with substantial single-cell data. The data is very big, and I'm planning to implement the transformations in different nodes. However, keeping all intermediate datasets in memory isn't feasible, and saving them all would eat up a significant chunk of disk space. I'm wondering if any of you have faced a similar challenge and have insights on how to address it effectively. Specifically, I'm curious if there's a way for Kedro to automatically manage and clean up MemoryDatasets that are no longer in use, helping to optimize memory usage and disk space. Thanks in advance for your help.
j
hello @Irene Robles, welcome! Kedro MemoryDatasets are transient and passed from node to node, so at any given point in time there will only be as many of them as needed to carry the computation. after a node is done using them, they get discarded immediately
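To illustrate that, here is a minimal sketch (the node functions and dataset names are hypothetical): any node output that is not registered in the catalog defaults to a MemoryDataset and is released once its last consumer has run.

from kedro.pipeline import node, pipeline

def normalise(adata):
    # hypothetical transformation on an AnnData object
    ...

def cluster(adata):
    ...

example_pipeline = pipeline(
    [
        # "normalised" is not in catalog.yml, so it becomes a MemoryDataset
        node(normalise, inputs="raw_counts", outputs="normalised"),
        # once this node has consumed "normalised", the runner releases it,
        # so only the data needed for the current step stays in RAM
        node(cluster, inputs="normalised", outputs="clusters"),
    ]
)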
also, the default Kedro runner is sequential, so nodes won't run in parallel and RAM usage will stay contained
if this still poses problems for you we can discuss a bit more in depth how to tackle those issues
what kind of data would you be loading? n-dimensional arrays, some Biopython format?
i
Hi Juanlu, thanks a lot for your help
I am using scanpy.AnnData
d
Also remember the lazy callable pattern of PartitionedDataSet!
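For reference, a minimal sketch of that lazy pattern (normalise is a hypothetical function): a PartitionedDataSet loads as a dict of partition id to load callable, and a node can likewise return a dict of callables so each partition is only materialised when Kedro saves it.

def process_partitions(partitions: dict) -> dict:
    """Work through a PartitionedDataSet one partition at a time."""
    results = {}
    for partition_id, load_partition in partitions.items():
        # returning a callable (instead of the data itself) defers both the
        # load and the computation until Kedro saves this partition, so only
        # one partition needs to fit in memory at any given moment
        results[partition_id] = lambda load=load_partition: normalise(load())
    return results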
i
I will look into it
j
I see, https://scanpy.readthedocs.io/en/latest/usage-principles.html#anndata for the record. You'll probably need to define your own dataset wrapping sc.read and adata.write; when you get to that point, let us know!
i
Something like:
import scanpy as sc

from kedro.io import AbstractDataSet
from typing import Any, Dict

class scRNAseqDataset(AbstractDataSet):
    def __init__(self, filepath):
        """Creates a new instance of scRNAseqDataset.

        Args:
            filepath: Path to the h5ad file.
        """
        self._filepath = filepath

    def _load(self) -> sc.AnnData:
        return sc.read_h5ad(self._filepath)

    def _save(self, data: sc.AnnData) -> None:
        data.write_h5ad(self._filepath, compression='gzip')

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath)
?
j
that's a good start 💯 it's missing some of the fsspec logic that enables you to use datasets in remote locations, as well as the versioning stuff. but the code should be more or less like that
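For context, a hedged sketch of what that fsspec wiring typically looks like, following the custom-dataset pattern in the Kedro docs and building on the class above. The versioning side (subclassing AbstractVersionedDataSet) is left out, and a fully remote-aware version would also route reads and writes through self._fs rather than a plain path.

from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import scanpy as sc
from kedro.io import AbstractDataSet
from kedro.io.core import get_protocol_and_path


class scRNAseqDataset(AbstractDataSet):
    def __init__(self, filepath: str):
        # split e.g. "s3://bucket/data.h5ad" into protocol and path so the
        # same dataset class can point at local or remote storage
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> sc.AnnData:
        # local read shown here; remote storage would need the file opened
        # through self._fs instead of passing a bare path
        return sc.read_h5ad(str(self._filepath))

    def _save(self, data: sc.AnnData) -> None:
        data.write_h5ad(str(self._filepath), compression="gzip")

    def _exists(self) -> bool:
        # fsspec answers existence checks uniformly across storage backends
        return self._fs.exists(str(self._filepath))

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, protocol=self._protocol)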