# questions
a
Hello guys, I've been struggling for weeks with a splitting problem in Kedro: I have a generator that lazily loads my data (a huge WebDataset) and I want to split it into 2 datasets. How can I do this, since I can only get a sample's data when I access it? Thanks in advance for your answers!!
d
Have you seen the PartitionedDataset?
a
Yes, but it doesn't fit my needs. I have a lot of small files, so I use tar archives and WebDataset to save and process them. But I'm stuck on the splitting operation, since I can't divide an iterator into 2 output objects without accessing each sample.
n
I'd like to help, but I'm afraid I don't understand the question. Any chance you can show some code snippets, even if the API is imaginary?
a
Hello! Thanks for the answer! Here is an example:
```python
# Here is what my loader is doing
import webdataset as wds

dataset = (
    wds.WebDataset(files, nodesplitter=self.split_by_node)
    .decode(
        wds.torch_audio,
    )
)

def preprocess(sample):
    # Do some operations
    return sample

# Here is my node code
def compute_node(dataset):
    # assign the result: .map() returns the mapped pipeline
    dataset = dataset.map(preprocess)
    return dataset

# Here is the target node output
def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)

    # How to split the webdataset with Kedro without increasing complexity?
    return split1, split2
```
So I want to achieve a split, but since a WebDataset is only an iterator (no len() method), I can't split it before accessing a sample.
a
Hmm, I don't think the first link will work, since the whole point of WebDataset is to avoid loading all elements into memory at once.
And the second link is what I'm already doing, but it doesn't document how to split an iterator.
n
Can you split an iterator with pure Python? It may be easier to forget about Kedro for a second. Can you split it in the first place when you generate the iterator instead?
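(For reference, a pure-Python sketch with itertools.tee; split_stream and the predicate are illustrative names, and tee has to buffer every item one branch has read but the other hasn't, so the two splits should be consumed roughly in step:)
```python
# A sketch of splitting a lazy iterator in pure Python with itertools.tee.
from itertools import tee

def split_stream(samples, predicate):
    """Return two lazy iterators: items matching predicate, and the rest."""
    left, right = tee(samples)
    return (
        (s for s in left if predicate(s)),
        (s for s in right if not predicate(s)),
    )

# Nothing is loaded until the splits are iterated.
evens, odds = split_stream(iter(range(10)), lambda n: n % 2 == 0)
print(list(evens))  # [0, 2, 4, 6, 8]
print(list(odds))   # [1, 3, 5, 7, 9]
```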
a
No, WebDataset is streaming, so it reads the file on the fly as soon as it's downloaded.
I have to process sample by sample, but Kedro doesn't seem to be built for this. I've made a fix, but it's horrible:
```python
def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)
    # How to split the webdataset with Kedro without increasing complexity?
    for sample in dataset:
        if condition:  # some per-sample rule
            yield {}, sample
        else:
            yield sample, {}
```
n
What do these two different returns actually do? Are they consumed differently?
a
Well, it's to return to one dataset or the other, but it's horrible, haha.
It ignores the sample if the dict is empty.
n
The downstream consumer has to understand the difference anyway; you could create a data class instead.
I don't know the semantics of what that sample means under different conditions, but roughly: a class DataClass holding the fields (self.a = a, self.b = b), and the function would process a DataClass instead.
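A minimal sketch of that idea (TaggedSample and its fields are invented names; preprocess and condition come from the snippets above):
```python
# A sketch: tag each sample with its split instead of hiding it in an
# empty-dict tuple slot. TaggedSample is an invented name.
from dataclasses import dataclass
from typing import Any

@dataclass
class TaggedSample:
    sample: Any
    split: str  # e.g. "train" or "test", decided per sample

def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)
    for sample in dataset:
        # `condition` stands for whatever per-sample rule decides the split
        yield TaggedSample(sample, "train" if condition(sample) else "test")
```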
a
Yes, but I'll have to change the runner, right?
Like, I'll output the data class and then the runner (or a custom hook) will have to process it?
n
Hm, not sure I'm following. Why do you need to modify the runner? After all, a node is just a function that takes input and produces output. Maybe you can elaborate on what those two different outputs are? The idea of the catalog and datasets is all about I/O abstraction; the node (function) operates on the data directly. Just like a node takes a pandas DataFrame as input, not a pandas.CSVDataset.
a
Yes, I know about that, but maybe I wasn't clear. Let's say you have a partitioned dataset. You don't want to load everything into memory, so you iterate over the samples and yield each result in order to free the memory, right? What I want to do is mostly the same, except I want to choose which dataset to write to for each sample. But in Kedro I can't yield to only one dataset if my node outputs to 2 datasets.
n
So the question is about dynamically saving an output to different datasets?
a
Yes exactly
n
If you treat it as a MemoryDataset, it works out of the box. The problem is when you want to use the Kedro DataCatalog: there is no built-in dataset that can save each item as a different type.
So the solution here is to move that condition into a custom dataset.
a
But that's the Kedro way again, no? It moves the logic into the dataset.
I can't write a new dataset every time I want to split.
n
Is it I/O logic?
a
By I/O logic, do you mean that the split is defined dynamically depending on the sample? If so, then yes, it is.
n
In my mind, it is handling how to save a Python object, and that fits the purpose of what a Dataset is supposed to do. It's just that in this case you have a dynamic output which could be different types of objects. It makes sense to me to have your node return a certain DataClass and a dataset that understands how to handle that DataClass.
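A sketch of what such a dataset could look like (assuming a recent Kedro where the base class is kedro.io.AbstractDataset, and reusing the hypothetical TaggedSample from above; the pickle format and file layout are placeholders, not an existing Kedro dataset):
```python
# A sketch of a custom dataset that routes each TaggedSample to a
# sub-folder named after its split. Write-only; names are illustrative.
import pickle
from itertools import count
from pathlib import Path

from kedro.io import AbstractDataset  # AbstractDataSet in older Kedro versions

class SplitRoutingDataset(AbstractDataset):
    def __init__(self, path: str):
        self._path = Path(path)
        self._counter = count()  # unique file names across saves

    def _save(self, tagged) -> None:
        # with a generator node, Kedro saves each yielded chunk as it arrives
        out_dir = self._path / tagged.split  # e.g. "train/" or "test/"
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / f"{next(self._counter):08d}.pkl", "wb") as f:
            pickle.dump(tagged.sample, f)

    def _load(self):
        raise NotImplementedError("this sketch is write-only")

    def _describe(self) -> dict:
        return {"path": str(self._path)}
```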
a
Even for a purpose like a train/test split? Like, you're suggesting adding a path variable to the data class so that the custom dataset will know where to save?
Won't it break Kedro-Viz?
n
A train/test split is beyond the job of a dataset; this should be done by a node.
a
Yes, but if it's done by a node, how can I do it? Unlike with a partitioned dataset, I don't have the list of samples (I only discover them as I iterate over the iterator).
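(One way a node could do this without ever having the list of samples, as a sketch: bucket each sample by a stable hash of its key. This assumes WebDataset-style sample dicts with a "__key__" entry and reuses the hypothetical TaggedSample from above; the 80/20 ratio and helper names are illustrative.)
```python
# A sketch of a streaming train/test split: no len() or sample list needed.
import zlib

def assign_split(key: str, test_fraction: float = 0.2) -> str:
    # crc32 is stable across processes and runs (unlike built-in hash()),
    # so the same sample always lands in the same split
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return "test" if bucket < test_fraction * 100 else "train"

def compute_node_and_split(dataset):
    dataset = dataset.map(preprocess)
    for sample in dataset:
        yield TaggedSample(sample, assign_split(sample["__key__"]))
```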