# questions
h
Hi All, the other day i was making a custom dataset for the Huggingface AudioFolder dataset, which takes a folder as an argument. As such, i gave it the parameter `data_dir` as input instead of `filepath`. It took me roughly an hour of debugging to figure out why loading the dataset was now dependent on the current working directory and just wouldn't load if i gave it a relative path (data/01_raw/..) instead of workspace/project_name/data/01_raw/…. Anyway, the issue was that `filepath` has a (buried) custom resolver in the AbstractDataSet base class. So would it be a good idea to add to the docs for custom datasets that `filepath` has that behaviour? And maybe we could add an example of how to make a FolderDataset, since all the current datasets in kedro-datasets point to specific files, but i'd wager there are folks out there who would want to read an entire folder's worth of data.
d
Are you using the `PartitionedDataSet` as your base class?
We’re very keen to introduce some HuggingFace support into the core library, so we’d love to help you through this and perhaps get a PR into `kedro-datasets` if you’re able to? cc @Juan Luis
j
yes, there’s some magic with the `filepath` property specifically, sorry you had a rough experience @Hugo Evers
definitely let’s document it, do you want to open an issue?
and finally what @datajoely says, would you be open to contributing it upstream?
h
sure, i was just looking up the code!
it's for a hobby/side project anyway
(generating retro console video game sounds using AudioDiffusion)
🔥 1
👾 2
d
COOL!
then we’d also love to promote your work with a blog post
h
we found this entire archive of retro console sounds, scraped it, and now i have a few gigs of sounds
cool!
d
❤️
please shout if you get stuck on anything
h
i work on this with Mark Tensen, so ill definitely ask him if he’s okay with the blog post
but i think he will be 😉
anyway
the core code for the dataset is:
```python
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class AudioFolderDataSet(AbstractDataSet[Dict[str, Any], Dataset]):
    """``AudioFolderDataSet`` loads audio data from the Hugging Face AudioFolder dataset.
    https://huggingface.co/docs/datasets/audio_dataset#audiofolder

    Example:
    ::

        >>> AudioFolderDataSet(filepath='/path/to/data')
    """

    DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: Dict[str, Any] = {}

    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
        metadata: Dict[str, Any] = None,
    ):
        """Creates a new instance of AudioFolderDataSet to load audio data from the Hugging Face AudioFolder dataset.

        Args:
            filepath: The location of the AudioFolder dataset directory.
        """
        protocol, self.path = get_protocol_and_path(filepath)
        self._protocol = protocol
        _fs_args = deepcopy(fs_args) or {}
        _fs_open_args_load = _fs_args.pop("open_args_load", {})
        _fs_open_args_save = _fs_args.pop("open_args_save", {})
        _credentials = deepcopy(credentials) or {}

        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)

        self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)

        self.metadata = metadata
        super().__init__()

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        _fs_open_args_save.setdefault("mode", "wb")
        self._fs_open_args_load = _fs_open_args_load
        self._fs_open_args_save = _fs_open_args_save

        self._filepath = get_filepath_str(PurePosixPath(self.path), self._protocol)

    def _load(self) -> Dict[str, Any]:
        """Loads data from the AudioFolder dataset.

        Returns:
            Data from the AudioFolder dataset as a dictionary of train, validation, and test sets.
        """

        # Forward any user-provided load arguments to load_dataset.
        return load_dataset("audiofolder", data_dir=self._filepath, **self._load_args)

    def _save(self, data: Dict[str, Any]) -> None:
        """Saves audio data to the specified filepath."""
        raise NotImplementedError("AudioFolderDataSet does not support saving data.")

    def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset."""
        return dict(filepath=self._filepath)

    def _exists(self) -> bool:
        return self._fs.exists(self._filepath)
```
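using it then looks something like this (the folder path is just a placeholder):
```python
# Illustrative usage of the class above; the path is a placeholder.
dataset = AudioFolderDataSet(filepath="data/01_raw/retro_sounds")
audio = dataset.load()  # DatasetDict with a "train" split (plus "validation"/"test" if present)
print(audio)
```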
but i haven’t tested yet whether this will work with data stored on s3
that has always been an issue with these huggingface datasets
also with trained models, which is kind of a pain
i could implement CloudPathLib to do it
if necessary
d
I think you should be inheriting from this class https://docs.kedro.org/en/stable/kedro.io.PartitionedDataSet.html
and fsspec should handle the cloud abstraction
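a rough sketch of what that could look like, assuming a hypothetical per-file `WavDataSet` (the path and dataset type are placeholders):
```python
from kedro.io import PartitionedDataSet

# Rough sketch: each file in the folder becomes one partition; fsspec picks the
# filesystem from the path prefix (local, s3://, gcs://, ...).
parts = PartitionedDataSet(
    path="s3://my-bucket/retro-sounds",        # placeholder location
    dataset="my_project.datasets.WavDataSet",  # hypothetical per-file dataset
    filename_suffix=".wav",
)

partitions = parts.load()  # {partition_id: callable that loads that file}
for partition_id, load_partition in partitions.items():
    data = load_partition()
```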
h
okay! ill test whether fsspec does that, i know i tried with Transformers before, and it required a local path
so i ended up with creating a tempdir and uploading that to s3
and reverse with downloading (quite ugly tbh)
ill have a look at the PartitionedDataSet
i did not realise it was meant for this purpose
d
FolderDataSet is arguably a better name!
its cousin IncrementalDataSet is also possibly helpful
j
a bit late to the party but I opened an issue about documenting the magic behavior @Hugo Evers mentioned at the beginning of the thread https://github.com/kedro-org/kedro/issues/2942
d
also @Hugo Evers - very keen to see (hear 👂) what you built!
h
hi, how nice you remembered! my partner in crime and i were both traveling for the past month, but we'll pick it up again in september! in the meantime im trying to run LlaMa 2 on AWS Batch using kedro
d
Oh that’s super cool - if you’d be interested in showcasing your work on our blog please let us know 🙂
h
okay, lemme give you a little sneak peek of what im doing for a client now
🥳 1
i should ask whether i can post the real details, but ill give you a little peek at what im building now
so, this client wants to classify job ads to determine their ISCO label, basically what kind of job it is
we got very poor performance using Bert and human labeled data, like 60% acc
so now, what i did was i engineered a prompt that extracts the relevant information from the job description and summarises it into a very specific template using LlaMa v2 70b
so now the labelers can label them much quicker, the descriptions are much shorter and more readable
also you can now actually featurize the descriptions
anyway, obviously you cant deploy a 70b model to production for just classification inference
so im generating 2M summarised job descriptions by running Llama in batch mode on a P4d instance, and then train a flan-t5 model to learn how to do that summarisation
so you can use a much cheaper and smaller instance to do inference
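the distillation step is roughly this shape (a toy sketch: model size, column names and hyperparameters are placeholders, not the actual client setup):
```python
# Toy sketch of the distillation step: fine-tune a small flan-t5 on
# (job description, LLaMA-generated summary) pairs so inference no longer
# needs the 70b model. All names and settings here are illustrative.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

pairs = [  # in practice: ~2M rows generated by the 70b model in batch mode
    {"description": "We are looking for a backend engineer ...",
     "summary": "Software developer; builds and maintains backend services ..."},
]
train_ds = Dataset.from_list(pairs)

def tokenize(batch):
    # The raw job ad is the input; the LLaMA-generated summary is the target.
    model_inputs = tokenizer(batch["description"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = train_ds.map(tokenize, batched=True, remove_columns=train_ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-summariser", num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```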
d
super cool
h
anyway, its not completely finished yet, but this llama model is quite competitive with chatgpt 3.5
and a lot cheaper, like a factor 1000
(in my very specific case)
d
I would really love to share a tutorial on how users can get started in this space with Kedro
h
we might need to add a huggingface dataset i think
d
Yes that’s actually a priority for myself and @Juan Luis
h
because pickling llama is not working for me
d
I bet!
h
and i am yet to try storing it using mlflow
but most people probably dont want to have to implement kedro-mlflow just for this
d
yes agreed
h
(and cloud-hosted mlflow cant handle a 150 gig model :p)
so yeah, you really need some power tools to do this heavy lifting
but i must say that kedro made it a lot easier
🥳 1
K 1
i should finish this work before the end of august, at least thats my deadline, so i could get back to you then?
I really hope my client allows me to share the details
d
completely understandable
j
of course, keep us posted @Hugo Evers!
d
and yeah zero rush!
h
btw, you mentioned you were working on a dataset that works with transformers; my implementation hinges on the following:
```python
from tempfile import TemporaryDirectory

from cloudpathlib import S3Path
from transformers import AutoModel


def load_from_s3(s3_path: S3Path) -> AutoModel:
    # Download the model files from S3 into a temporary directory, then load them.
    with TemporaryDirectory() as tmp_dir:
        s3_path.download_to(tmp_dir)
        return AutoModel.from_pretrained(tmp_dir)


def save_to_s3(model: AutoModel, s3_path: S3Path) -> None:
    # Write the model to a temporary directory, then upload that directory to S3.
    with TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir)
        s3_path.upload_from(tmp_dir)
```
do you have something similar, or do you see any shortcomings?
```python
from tempfile import TemporaryDirectory
from typing import Any, Dict

from cloudpathlib import CloudPath
from transformers import AutoModel

from kedro.io import AbstractDataSet

class HFTransformersDataset(AbstractDataSet):
    def __init__(self, path: str):
        self.cloud_path = CloudPath(path)

    def _load(self) -> AutoModel:
        """
        Loads the model from the cloud path.

        Returns:
            AutoModel: The loaded Hugging Face Transformers model.
        """
        with TemporaryDirectory() as tmp_dir:
            self.cloud_path.download_to(tmp_dir)
            return AutoModel.from_pretrained(tmp_dir)

    def _save(self, model: AutoModel) -> None:
        """
        Saves the model to the cloud path.

        Args:
            model (AutoModel): The Hugging Face Transformers model to save.
        """
        with TemporaryDirectory() as tmp_dir:
            model.save_pretrained(tmp_dir)
            self.cloud_path.upload_from(tmp_dir)

    def _describe(self) -> Dict[str, Any]:
        return dict(
            cloud_path=str(self.cloud_path),
        )
```
this implementation would work for different cloud providers, but it does assume the credentials are already in the environment, so thats still WIP
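usage would be something like this (the bucket and key are made up, and it assumes credentials are already available in the environment, as mentioned above):
```python
# Illustrative usage; the S3 location is a placeholder.
dataset = HFTransformersDataset(path="s3://my-bucket/models/summariser")

model = dataset.load()  # download_to a temp dir, then AutoModel.from_pretrained
dataset.save(model)     # save_pretrained to a temp dir, then upload_from
```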