#questions

Hugo Evers

06/29/2023, 8:23 AM
Hi all, the other day I was making a custom dataset for the Hugging Face AudioFolder dataset, which takes a folder as an argument. As such, I gave it the parameter data_dir as input instead of filepath. It took me roughly an hour of debugging to figure out why loading the dataset was now dependent on the current working directory, and why it just wouldn't load if I gave it a relative path (data/01_raw/..) instead of workspace/project_name/data/01_raw/…. Anyway, the issue was that filepath has a (buried) custom resolver in the AbstractDataSet base class. So would it be a good idea to add to the docs for custom datasets that filepath has that behaviour? And maybe we could add an example of how to make a FolderDataset, since all the current datasets in kedro-datasets point to specific files, but I'd wager there are folks out there who want to read an entire folder's worth of data.
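For context, the special handling hangs off the filepath argument specifically: Kedro's bundled datasets all parse it through helpers in kedro.io.core, and (as described above) a resolver buried in the base class makes it project-root aware, so renaming the parameter to data_dir bypasses all of that. A quick illustration of the protocol parsing (the S3 path is made up):

from kedro.io.core import get_protocol_and_path

protocol, path = get_protocol_and_path("s3://my-bucket/data/01_raw/sounds")
# protocol == "s3", path == "my-bucket/data/01_raw/sounds"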

datajoely

06/29/2023, 8:24 AM
Are you using the PartitionedDataSet as your base class?
We're very keen to introduce some HuggingFace support into the core library, so we'd be glad to help you through this and perhaps get a PR into kedro-datasets if you're able to? cc @Juan Luis

Juan Luis

06/29/2023, 8:39 AM
yes there’s some magic with the
filepath
property specifically, sorry you had a rough experience @Hugo Evers
definitely let’s document it, do you want to open an issue?
and finally what @datajoely says, would you be open to contributing it upstream?

Hugo Evers

06/29/2023, 8:42 AM
sure, I was just looking up the code!
it's for a hobby/side project anyway
(generating retro console video game sounds using AudioDiffusion)
🔥 1
👾 2

datajoely

06/29/2023, 8:43 AM
COOL!
then we’d also love to promote your work with a blog post

Hugo Evers

06/29/2023, 8:43 AM
we found this entire archive of retro console sounds, scraped it, and now i have a few gigs of sounds
cool!

datajoely

06/29/2023, 8:43 AM
❤️
please shout if you get stuck on anything

Hugo Evers

06/29/2023, 8:44 AM
I work on this with Mark Tensen, so I'll definitely ask him if he's okay with the blog post
but I think he will be 😉
anyway
the core code for the dataset is:
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class AudioFolderDataSet(AbstractDataSet[Dict[str, Any], Dataset]):
    """``AudioFolderDataSet`` loads audio data from the Hugging Face AudioFolder dataset.
    https://huggingface.co/docs/datasets/audio_dataset#audiofolder

    Example:
    ::

        >>> AudioFolderDataSet(filepath='/path/to/data')
    """

    DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: Dict[str, Any] = {}

    def __init__(
        self,
        filepath: str,
        load_args: Dict[str, Any] = None,
        save_args: Dict[str, Any] = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
        metadata: Dict[str, Any] = None,
    ):
        """Creates a new instance of AudioFolderDataSet to load audio data from the Hugging Face AudioFolder dataset.

        Args:
            data_dir: The location of the AudioFolder dataset.
        """
        protocol, self.path = get_protocol_and_path(filepath)
        self._protocol = protocol
        _fs_args = deepcopy(fs_args) or {}
        _fs_open_args_load = _fs_args.pop("open_args_load", {})
        _fs_open_args_save = _fs_args.pop("open_args_save", {})
        _credentials = deepcopy(credentials) or {}

        if protocol == "file":
            _fs_args.setdefault("auto_mkdir", True)

        self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)

        self.metadata = metadata
        super().__init__()

        # Handle default load and save arguments
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        _fs_open_args_save.setdefault("mode", "wb")
        self._fs_open_args_load = _fs_open_args_load
        self._fs_open_args_save = _fs_open_args_save

        self._filepath = get_filepath_str(PurePosixPath(self.path), self._protocol)

    def _load(self) -> Dict[str, Any]:
        """Loads data from the AudioFolder dataset.

        Returns:
            Data from the AudioFolder dataset as a dictionary of train, validation, and test sets.
        """

        return load_dataset("audiofolder", data_dir=self._filepath, **self._load_args)

    def _save(self, data: Dict[str, Any]) -> None:
        """Saves audio data to the specified filepath."""
        raise NotImplementedError("AudioFolderDataSet does not support saving data.")

    def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset."""
        return dict(filepath=self._filepath)

    def _exists(self) -> bool:
        return self._fs.exists(self._filepath)
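A minimal usage sketch for the class above (the directory path is a hypothetical project-relative example):

dataset = AudioFolderDataSet(filepath="data/01_raw/retro_sounds")
audio = dataset.load()  # DatasetDict, with a "train" split by default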
but i haven’t tested yet whether this will work with data stored on s3
that has always been an issue with these huggingface datasets
also with trained models, which is kind of a pain
i could implement CloudPathLib to do it
if necessary

datajoely

06/29/2023, 8:46 AM
I think you should be inheriting from this class https://docs.kedro.org/en/stable/kedro.io.PartitionedDataSet.html
and fsspec should handle the cloud abstraction
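If fsspec doesn't cover a given case end-to-end, one hedged fallback is to materialise the remote folder locally first and point load_dataset at the copy (the bucket name and paths below are made up):

import fsspec
from datasets import load_dataset

def load_audiofolder_from_s3(s3_dir: str, local_dir: str):
    fs = fsspec.filesystem("s3")
    fs.get(s3_dir, local_dir, recursive=True)  # download the whole folder
    return load_dataset("audiofolder", data_dir=local_dir)

ds = load_audiofolder_from_s3("my-bucket/data/01_raw/sounds", "data/01_raw/sounds")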

Hugo Evers

06/29/2023, 8:47 AM
okay! I'll test whether fsspec does that, I know I tried with Transformers before, and it required a local path
so I ended up creating a tempdir and uploading that to s3
and the reverse for downloading (quite ugly tbh)
I'll have a look at the PartitionedDataSet
I did not realise it was meant for this purpose

datajoely

06/29/2023, 8:49 AM
FolderDataSet is arguably a better name!
its cousin IncrementalDataSet is also possibly helpful

Juan Luis

08/17/2023, 3:09 PM
a bit late to the party but I opened an issue about documenting the magic behavior @Hugo Evers mentioned at the beginning of the thread https://github.com/kedro-org/kedro/issues/2942

datajoely

08/17/2023, 3:10 PM
also @Hugo Evers - very keen to see (hear 👂) what you built!

Hugo Evers

08/18/2023, 3:24 PM
hi, how nice that you remembered! my partner in crime and I were both traveling for the past month, but we'll pick it up again in September! in the meantime I'm trying to run Llama 2 on AWS Batch using kedro

datajoely

08/18/2023, 3:25 PM
Oh that’s super cool - if you’d be interested in showcasing your work on our blog please let us know 🙂

Hugo Evers

08/18/2023, 3:25 PM
okay, lemme give you a little sneak peek of what I'm doing for a client now
🥳 1
I should ask whether I can post the real details, but I'll give you a little peek at what I'm building now
so, this client wants to classify job ads to determine their ISCO label, basically what kind of job it is
we got very poor performance using BERT and human-labeled data, like 60% accuracy
so what I did was engineer a prompt that extracts the relevant information from the job description and summarises it into a very specific template using Llama 2 70B
so now the labelers can label them much quicker, the descriptions are much shorter and more readable
also you can now actually featurize the descriptions
anyway, obviously you can't deploy a 70B model to production just for classification inference
so I'm generating 2M summarised job descriptions by running Llama in batch mode on a P4d instance, and then training a Flan-T5 model to learn how to do that summarisation
so you can use a much cheaper and smaller instance to do inference
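In outline, the distillation step described above could look something like this sketch; the base model, column names, and hyperparameters are illustrative assumptions rather than the actual client setup:

from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# pairs of (raw job ad, Llama-generated template summary); dummy rows here
train = Dataset.from_dict({
    "description": ["Full text of a job ad ..."],
    "summary": ["Summary in the fixed template ..."],
})

def tokenize(batch):
    enc = tokenizer(batch["description"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(
        text_target=batch["summary"], truncation=True, max_length=256
    )["input_ids"]
    return enc

train = train.map(tokenize, batched=True, remove_columns=["description", "summary"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments("flan-t5-job-summaries", per_device_train_batch_size=8),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()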

datajoely

08/18/2023, 3:32 PM
super cool

Hugo Evers

08/18/2023, 3:33 PM
anyway, it's not completely finished yet, but this Llama model is quite competitive with ChatGPT 3.5
and a lot cheaper, like a factor of 1000
(in my very specific case)

datajoely

08/18/2023, 3:34 PM
I would really love to share a tutorial on how users can get started in this space with Kedro

Hugo Evers

08/18/2023, 3:35 PM
we might need to add a Hugging Face dataset, I think

datajoely

08/18/2023, 3:35 PM
Yes that’s actually a priority for myself and @Juan Luis

Hugo Evers

08/18/2023, 3:35 PM
because pickling Llama is not working for me

datajoely

08/18/2023, 3:35 PM
I bet!

Hugo Evers

08/18/2023, 3:35 PM
and I have yet to try storing it using mlflow
but most people probably don't want to have to implement kedro-mlflow just for this

datajoely

08/18/2023, 3:36 PM
yes agreed

Hugo Evers

08/18/2023, 3:36 PM
(and cloud-hosted mlflow can't handle a 150 gig model :p)
so yeah, you really need some power tools to do this heavy lifting
but I must say that kedro made it a lot easier
🥳 1
K 1
I should finish this work before the end of August, at least that's my deadline, so I could get back to you then?
I really hope my client allows me to share the details

datajoely

08/18/2023, 3:43 PM
completely understandable

Juan Luis

08/18/2023, 3:43 PM
of course, keep us posted @Hugo Evers!

datajoely

08/18/2023, 3:43 PM
and yeah zero rush!

Hugo Evers

08/18/2023, 4:26 PM
btw, you mentioned you were working on a dataset that works with transformers; my implementation hinges on the following:
from tempfile import TemporaryDirectory

from cloudpathlib import S3Path
from transformers import AutoModel


def load_from_s3(s3_path: S3Path) -> AutoModel:
    # download the saved model folder locally, then load it from disk
    with TemporaryDirectory() as tmp_dir:
        s3_path.download_to(tmp_dir)
        return AutoModel.from_pretrained(tmp_dir)


def save_to_s3(model: AutoModel, s3_path: S3Path) -> None:
    # write the model to a local folder, then upload the folder to S3
    with TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir)
        s3_path.upload_from(tmp_dir)
do you have something similar, or do you see any shortcomings?
from tempfile import TemporaryDirectory
from typing import Any, Dict

from cloudpathlib import CloudPath
from transformers import AutoModel
from kedro.io import AbstractDataSet

class HFTransformersDataset(AbstractDataSet):
    def __init__(self, path: str):
        self.cloud_path = CloudPath(path)

    def _load(self) -> AutoModel:
        """
        Loads the model from the cloud path.

        Returns:
            AutoModel: The loaded Hugging Face Transformers model.
        """
        with TemporaryDirectory() as tmp_dir:
            self.cloud_path.download_to(tmp_dir)
            return AutoModel.from_pretrained(tmp_dir)

    def _save(self, model: AutoModel) -> None:
        """
        Saves the model to the cloud path.

        Args:
            model (AutoModel): The Hugging Face Transformers model to save.
        """
        with TemporaryDirectory() as tmp_dir:
            model.save_pretrained(tmp_dir)
            self.cloud_path.upload_from(tmp_dir)

    def _describe(self) -> Dict[str, Any]:
        return dict(
            cloud_path=str(self.cloud_path),
        )
this implementation would work for different cloud providers, but it does assume the credentials are already in the environment, so that's still WIP
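One hedged way to close that gap on S3 specifically is to hand credentials to cloudpathlib explicitly; the shape of the credentials dict below mirrors a Kedro catalog credentials entry and is an assumption:

from cloudpathlib import S3Client

def make_cloud_path(path: str, credentials: dict):
    # pass keys explicitly instead of relying on ambient environment variables
    client = S3Client(
        aws_access_key_id=credentials.get("key"),
        aws_secret_access_key=credentials.get("secret"),
    )
    return client.CloudPath(path)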