# questions
a
Hello guys, I've been struggling for weeks with a splitting problem in Kedro: I have a generator that lazily loads my data (a huge WebDataset) and I want to split it into 2 datasets. How can I do this, since I can only get a sample's data when I access it? Thanks in advance for your answers!!
d
Have you seen the PartitionedDataset?
a
Yes, but it doesn't fit my needs. I have a lot of small files, so I use tar archives and WebDataset to save and process them. But I'm stuck on the splitting operation, since I can't divide an iterator into 2 output objects without accessing each sample.
n
I'd like to help, but I'm afraid I don't understand the question. Any chance you can show some code snippets, even if the API is imaginary?
a
Hello! Thanks for the answer! Here is an example:
```python
# Here is what my loader is doing
import webdataset as wds

dataset = (
    wds.WebDataset(files, nodesplitter=self.split_by_node)
    .decode(
        wds.torch_audio,
    )
)

def preprocess(sample):
    # Do some operations
    return sample

# Here is my node code
def compute_node(dataset):
    # assign the result: .map() returns the mapped pipeline
    dataset = dataset.map(preprocess)
    return dataset

# Here is the target node output
def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)

    # How to split the webdataset with Kedro without increasing complexity?
    return split1, split2
```
So I want to achieve a split, but since a WebDataset is only an iterator (no len() method), I can't split it before accessing a sample.
a
Hmm, I don't think the first link will work, since the whole point of WebDataset is to avoid loading all elements into memory at once.
And the second link is what I'm already doing, but it doesn't document how to split an iterator.
n
Can you split an iterator with pure Python? It may be easier to forget about Kedro for a second. Can you split it in the first place when you generate the iterator instead?
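(For reference, a pure-Python sketch with itertools.tee; split_stream and the predicate are illustrative names, and tee has to buffer every item one branch has read but the other hasn't, so the two splits should be consumed roughly in step:)
```python
# A sketch of splitting a lazy iterator in pure Python with itertools.tee.
from itertools import tee

def split_stream(samples, predicate):
    """Return two lazy iterators: items matching predicate, and the rest."""
    left, right = tee(samples)
    return (
        (s for s in left if predicate(s)),
        (s for s in right if not predicate(s)),
    )

# Nothing is loaded until the splits are iterated.
evens, odds = split_stream(iter(range(10)), lambda n: n % 2 == 0)
print(list(evens))  # [0, 2, 4, 6, 8]
print(list(odds))   # [1, 3, 5, 7, 9]
```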
a
No, WebDataset is streaming, so it reads the file on the fly as soon as it's downloaded.
I have to process sample by sample, but Kedro doesn't seem to be built for this. I've made a fix, but it's horrible:
```python
def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)
    # How to split the webdataset with Kedro without increasing complexity?
    for sample in dataset:
        if condition:  # some per-sample rule
            yield {}, sample
        else:
            yield sample, {}
```
n
What do these two different returns actually do? Are they consumed differently?
a
Well, it's to return to one dataset or the other, but it's horrible, haha.
It ignores the sample if the dict is empty.
n
The downstream consumer has to understand the difference anyway; you could create a data class instead.
I don't know the semantics of what that sample means under different conditions, but roughly: a class DataClass holding the fields (self.a = a, self.b = b), and the function would process a DataClass instead.
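A minimal sketch of that idea (TaggedSample and its fields are invented names; preprocess and condition come from the snippets above):
```python
# A sketch: tag each sample with its split instead of hiding it in an
# empty-dict tuple slot. TaggedSample is an invented name.
from dataclasses import dataclass
from typing import Any

@dataclass
class TaggedSample:
    sample: Any
    split: str  # e.g. "train" or "test", decided per sample

def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)
    for sample in dataset:
        # `condition` stands for whatever per-sample rule decides the split
        yield TaggedSample(sample, "train" if condition(sample) else "test")
```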
a
Yes, but I'll have to change the runner, right?
Like, I'll output the data class and then the runner (or a custom hook) will have to process it?
n
Hm, not sure I'm following. Why do you need to modify the runner? After all, a node is just a function that takes input and produces output. Maybe you can elaborate on what those two different outputs are? The idea of the catalog and datasets is all about I/O abstraction; the node (function) operates on the data directly. Just like a node takes a pandas DataFrame as input, not a pandas.CSVDataset.
a
Yes, I know about that, but maybe I wasn't clear. Let's say you have a partitioned dataset. You don't want to load everything into memory, so you iterate over the samples and yield each result in order to free the memory, right? What I want to do is mostly the same, except I want to choose which dataset to write to for each sample. But in Kedro I can't yield to only one dataset if my node outputs to 2 datasets.
n
So the question is about dynamically saving an output to different datasets?
a
Yes exactly
n
If you treat it as a MemoryDataset, it works out of the box. The problem is when you want to use the Kedro DataCatalog: there is no built-in dataset that can save each item as a different type.
So the solution here is to move that condition into a custom dataset.
a
But that's the Kedro way again, no? It moves the logic into the dataset.
I can't write a new dataset every time I want to split.
n
Is it I/O logic?
a
By I/O logic, do you mean that the split is defined dynamically depending on the sample? If so, then yes, it is.
n
In my mind, it is handling how to save a Python object, and that fits the purpose of what a Dataset is supposed to do. It's just that in this case you have a dynamic output which could be different types of objects. It makes sense to me to have your node return a certain DataClass and a dataset that understands how to handle that DataClass.
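A sketch of what such a dataset could look like (assuming a recent Kedro where the base class is kedro.io.AbstractDataset, and reusing the hypothetical TaggedSample from above; the pickle format and file layout are placeholders, not an existing Kedro dataset):
```python
# A sketch of a custom dataset that routes each TaggedSample to a
# sub-folder named after its split. Write-only; names are illustrative.
import pickle
from itertools import count
from pathlib import Path

from kedro.io import AbstractDataset  # AbstractDataSet in older Kedro versions

class SplitRoutingDataset(AbstractDataset):
    def __init__(self, path: str):
        self._path = Path(path)
        self._counter = count()  # unique file names across saves

    def _save(self, tagged) -> None:
        # with a generator node, Kedro saves each yielded chunk as it arrives
        out_dir = self._path / tagged.split  # e.g. "train/" or "test/"
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / f"{next(self._counter):08d}.pkl", "wb") as f:
            pickle.dump(tagged.sample, f)

    def _load(self):
        raise NotImplementedError("this sketch is write-only")

    def _describe(self) -> dict:
        return {"path": str(self._path)}
```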
a
Even for a purpose like a train/test split? Like, you're suggesting adding a path variable to the data class so that the custom dataset will know where to save?
Won't it break Kedro-Viz?
n
A train/test split is beyond the job of a dataset; this should be done by a node.
a
Yes, but if it's done by a node, how can I do it? Unlike with a partitioned dataset, I don't have the list of samples (I only discover them as I iterate over the iterator).
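(One way a node could do this without ever having the list of samples, as a sketch: bucket each sample by a stable hash of its key. This assumes WebDataset-style sample dicts with a "__key__" entry and reuses the hypothetical TaggedSample from above; the 80/20 ratio and helper names are illustrative.)
```python
# A sketch of a streaming train/test split: no len() or sample list needed.
import zlib

def assign_split(key: str, test_fraction: float = 0.2) -> str:
    # crc32 is stable across processes and runs (unlike built-in hash()),
    # so the same sample always lands in the same split
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return "test" if bucket < test_fraction * 100 else "train"

def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)
    for sample in dataset:
        yield TaggedSample(sample, assign_split(sample["__key__"]))
```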