# questions
h
Hi All, the other day i was making a custom dataset for the Huggingface AudioFolder dataset, which takes a folder as an argument. As such, i gave it the parameter `data_dir` as input instead of `filepath`. It took me roughly an hour of debugging to figure out why loading the dataset was now dependent on the current working directory and just wouldn't load if i gave it a relative path (data/01_raw/..) instead of workspace/project_name/data/01_raw/…. Anyway, the issue was that `filepath` has a (buried) custom resolver in the AbstractDataSet base class. So would it be a good idea to add to the docs for custom datasets that `filepath` has that behaviour? And maybe we could add an example of how to make a FolderDataset, since all the current datasets in kedro-datasets point to specific files, but i'd wager there are folks out there who would want to read an entire folder's worth of data.
d
Are you using the `PartitionedDataSet` as your base class?
We’re very keen to introduce some HuggingFace support into the core library, so we’d love to help you through this and perhaps get a PR into `kedro-datasets` if you’re able to? cc @Juan Luis
j
yes, there’s some magic with the `filepath` property specifically, sorry you had a rough experience @Hugo Evers
definitely let’s document it, do you want to open an issue?
and finally what @datajoely says, would you be open to contributing it upstream?
h
sure, i was just looking up the code!
it's for a hobby/side project anyway
(generating retro console video game sounds using AudioDiffusion)
🔥 1
👾 2
d
COOL!
then we’d also love to promote your work with a blog post
h
we found this entire archive of retro console sounds, scraped it, and now i have a few gigs of sounds
cool!
d
❤️
please shout if you get stuck on anything
h
i work on this with Mark Tensen, so ill definitely ask him if he’s okay with the blog post
but i think he will be 😉
anyway
the core code for the dataset is:
```python
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class AudioFolderDataSet(AbstractDataSet[Dict[str, Any], Dataset]):
    """``AudioFolderDataSet`` loads audio data from the Hugging Face AudioFolder dataset.
    https://huggingface.co/docs/datasets/audio_dataset#audiofolder

    Example:
    ::

        >>> AudioFolderDataSet(filepath='/path/to/data')
    """

    DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: Dict[str, Any] = {}

    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
        metadata: Dict[str, Any] = None,
    ):
        """Creates a new instance of AudioFolderDataSet to load audio data from the Hugging Face AudioFolder dataset.

        Args:
            filepath: The location of the AudioFolder dataset directory.
        """
        protocol, self.path = get_protocol_and_path(filepath)
        self._protocol = protocol
        _fs_args = deepcopy(fs_args) or {}
        _fs_open_args_load = _fs_args.pop("open_args_load", {})
        _fs_open_args_save = _fs_args.pop("open_args_save", {})
        _credentials = deepcopy(credentials) or {}

        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)

        self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)

        self.metadata = metadata
        super().__init__()

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        _fs_open_args_save.setdefault("mode", "wb")
        self._fs_open_args_load = _fs_open_args_load
        self._fs_open_args_save = _fs_open_args_save

        self._filepath = get_filepath_str(PurePosixPath(self.path), self._protocol)

    def _load(self) -> Dict[str, Any]:
        """Loads data from the AudioFolder dataset.

        Returns:
            Data from the AudioFolder dataset as a dictionary of train, validation, and test sets.
        """

        # Forward any user-provided load arguments to load_dataset.
        return load_dataset("audiofolder", data_dir=self._filepath, **self._load_args)

    def _save(self, data: Dict[str, Any]) -> None:
        """Saves audio data to the specified filepath."""
        raise NotImplementedError("AudioFolderDataSet does not support saving data.")

    def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset."""
        return dict(filepath=self._filepath)

    def _exists(self) -> bool:
        return self._fs.exists(self._filepath)
```
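using it then looks something like this (the folder path is just a placeholder):
```python
# Illustrative usage of the class above; the path is a placeholder.
dataset = AudioFolderDataSet(filepath="data/01_raw/retro_sounds")
audio = dataset.load()  # DatasetDict with a "train" split (plus "validation"/"test" if present)
print(audio)
```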
but i haven’t tested yet whether this will work with data stored on s3
that has always been an issue with these huggingface datasets
also with trained models, which is kind of a pain
i could implement CloudPathLib to do it
if necessary
d
I think you should be inheriting from this class https://docs.kedro.org/en/stable/kedro.io.PartitionedDataSet.html
and fsspec should handle the cloud abstraction
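a rough sketch of what that could look like, assuming a hypothetical per-file `WavDataSet` (the path and dataset type are placeholders):
```python
from kedro.io import PartitionedDataSet

# Rough sketch: each file in the folder becomes one partition; fsspec picks the
# filesystem from the path prefix (local, s3://, gcs://, ...).
parts = PartitionedDataSet(
    path="s3://my-bucket/retro-sounds",        # placeholder location
    dataset="my_project.datasets.WavDataSet",  # hypothetical per-file dataset
    filename_suffix=".wav",
)

partitions = parts.load()  # {partition_id: callable that loads that file}
for partition_id, load_partition in partitions.items():
    data = load_partition()
```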
h
okay! ill test whether fsspec does that, i know i tried with Transformers before, and it required a local path
so i ended up with creating a tempdir and uploading that to s3
and reverse with downloading (quite ugly tbh)
ill have a look at the PartitionedDataSet
i did not realise it was meant for this purpose
d
FolderDataSet is arguably a better name!
its cousin IncrementalDataSet is also possibly helpful
j
a bit late to the party but I opened an issue about documenting the magic behavior @Hugo Evers mentioned at the beginning of the thread https://github.com/kedro-org/kedro/issues/2942
d
also @Hugo Evers - very keen to see (hear 👂) what you built!
h
hi, how nice you remembered! my partner in crime and i were both traveling for the past month, but we'll pick it up again in september! in the meantime im trying to run LlaMa 2 on AWS Batch using kedro
d
Oh that’s super cool - if you’d be interested in showcasing your work on our blog please let us know 🙂
h
okay, lemme give you a little sneak peek of what im doing for a client now
🥳 1
i should ask whether i can post the real details, but ill give you a little peek at what im building now
so, this client wants to classify job ads to determine their ISCO label, basically what kind of job it is
we got very poor performance using Bert and human labeled data, like 60% acc
so now, what i did was i engineered a prompt that extracts the relevant information from the job description and summarises it into a very specific template using LlaMa v2 70b
so now the labelers can label them much quicker, the descriptions are much shorter and more readable
also you can now actually featurize the descriptions
anyway, obviously you cant deploy a 70b model to production for just classification inference
so im generating 2M summarised job descriptions by running Llama in batch mode on a P4d instance, and then train a flan-t5 model to learn how to do that summarisation
so you can use a much cheaper and smaller instance to do inference
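the distillation step is roughly this shape (a toy sketch: model size, column names and hyperparameters are placeholders, not the actual client setup):
```python
# Toy sketch of the distillation step: fine-tune a small flan-t5 on
# (job description, LLaMA-generated summary) pairs so inference no longer
# needs the 70b model. All names and settings here are illustrative.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

pairs = [  # in practice: ~2M rows generated by the 70b model in batch mode
    {"description": "We are looking for a backend engineer ...",
     "summary": "Software developer; builds and maintains backend services ..."},
]
train_ds = Dataset.from_list(pairs)

def tokenize(batch):
    # The raw job ad is the input; the LLaMA-generated summary is the target.
    model_inputs = tokenizer(batch["description"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = train_ds.map(tokenize, batched=True, remove_columns=train_ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-summariser", num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```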
d
super cool
h
anyway, its not completely finished yet, but this llama model is quite competitive with chatgpt 3.5
and a lot cheaper, like a factor 1000
(in my very specific case)
d
I would really love to share a tutorial on how users can get started in this space with Kedro
h
we might need to add a huggingface dataset i think
d
Yes that’s actually a priority for myself and @Juan Luis
h
because pickling llama is not working for me
d
I bet!
h
and i am yet to try storing it using mlflow
but most people probably dont want to have to implement kedro-mlflow just for this
d
yes agreed
h
(and cloud-hosted mlflow cant handle a 150 gig model :p)
so yeah, you really need some power tools to do this heavy lifting
but i must say that kedro made it a lot easier
🥳 1
K 1
i should finish this work before the end of august, at least thats my deadline, so i could get back to you then?
I really hope my client allows me to share the details
d
completely understandable
j
of course, keep us posted @Hugo Evers!
d
and yeah zero rush!
h
btw, you mentioned you were working on a dataset that works with transformers; my implementation hinges on the following:
```python
from tempfile import TemporaryDirectory

from cloudpathlib import S3Path
from transformers import AutoModel


def load_from_s3(s3_path: S3Path) -> AutoModel:
    # Download the model files from S3 into a temporary directory, then load them.
    with TemporaryDirectory() as tmp_dir:
        s3_path.download_to(tmp_dir)
        return AutoModel.from_pretrained(tmp_dir)


def save_to_s3(model: AutoModel, s3_path: S3Path) -> None:
    # Write the model to a temporary directory, then upload that directory to S3.
    with TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir)
        s3_path.upload_from(tmp_dir)
```
do you have something similar, or do you see any shortcomings?
```python
from tempfile import TemporaryDirectory
from typing import Any, Dict

from cloudpathlib import CloudPath
from transformers import AutoModel

from kedro.io import AbstractDataSet

class HFTransformersDataset(AbstractDataSet):
    def __init__(self, path: str):
        self.cloud_path = CloudPath(path)

    def _load(self) -> AutoModel:
        """
        Loads the model from the cloud path.

        Returns:
            AutoModel: The loaded Hugging Face Transformers model.
        """
        with TemporaryDirectory() as tmp_dir:
            self.cloud_path.download_to(tmp_dir)
            return AutoModel.from_pretrained(tmp_dir)

    def _save(self, model: AutoModel) -> None:
        """
        Saves the model to the cloud path.

        Args:
            model (AutoModel): The Hugging Face Transformers model to save.
        """
        with TemporaryDirectory() as tmp_dir:
            model.save_pretrained(tmp_dir)
            self.cloud_path.upload_from(tmp_dir)

    def _describe(self) -> Dict[str, Any]:
        return dict(
            cloud_path=str(self.cloud_path),
        )
```
this implementation would work for different cloud providers, but it does assume the credentials are already in the environment, so thats still WIP
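usage would be something like this (the bucket and key are made up, and it assumes credentials are already available in the environment, as mentioned above):
```python
# Illustrative usage; the S3 location is a placeholder.
dataset = HFTransformersDataset(path="s3://my-bucket/models/summariser")

model = dataset.load()  # download_to a temp dir, then AutoModel.from_pretrained
dataset.save(model)     # save_pretrained to a temp dir, then upload_from
```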