# questions
**f (Francis Duval):** Hello all! I want to add a transformer model from HuggingFace to my Kedro pipeline. I know there is this class: `kedro_datasets.huggingface.HFTransformerPipelineDataset`.
If I want to have a model from HF in my data catalog, I can do:
```yaml
hf_model:
  type: huggingface.HFTransformerPipelineDataset
  model_name: Helsinki-NLP/opus-mt-fr-en
```
However, this won't work since I don't have the certificates for the HF website; I get an SSL error. So I downloaded the model and put it in the data folder: `01_raw/opus-mt-fr-en`. How can I then add this model to my data catalog? I tried:
```yaml
hf_model:
  type: huggingface.HFTransformerPipelineDataset
  model_name: 01_raw/opus-mt-fr-en
```
but it does not work. I know I could make a custom dataset class, but I'm wondering if there's a simpler solution. Thanks!
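For context, a minimal sketch of how the model files can be fetched ahead of time, run from a machine that can reach huggingface.co (the `AutoModelForSeq2SeqLM` class is an assumption matching this translation model; the paths follow the thread's layout):

```python
# One-off download script (sketch): save the tokenizer and model locally,
# then copy the folder into the Kedro project's data/01_raw/ directory.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"
local_dir = "data/01_raw/opus-mt-fr-en"

AutoTokenizer.from_pretrained(model_name).save_pretrained(local_dir)
AutoModelForSeq2SeqLM.from_pretrained(model_name).save_pretrained(local_dir)
```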
Well, I guess I don't have to have the model as a dataset... I could simply have two nodes, one that loads the tokenizer and another that loads the model:
```python
from kedro.pipeline import node

node(
    func=load_tokenizer,
    inputs='01_raw/opus-mt-fr-en',
    outputs='tokenizer',
    name='load_tokenizer',
)
node(
    func=load_model,
    inputs='01_raw/opus-mt-fr-en',
    outputs='model',
    name='load_model',
)
```
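A hedged sketch of what those two node functions could look like (`load_tokenizer` and `load_model` aren't shown in the thread; `AutoModelForSeq2SeqLM` is an assumption matching the opus-mt translation model):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def load_tokenizer(model_path: str):
    # Read the tokenizer files from the local model directory.
    return AutoTokenizer.from_pretrained(model_path)


def load_model(model_path: str):
    # Read the model weights from the same local directory.
    return AutoModelForSeq2SeqLM.from_pretrained(model_path)
```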
**m:** Have you tried to specify the `model_name` as an absolute path to your downloaded model?
👍 1
**j:** hmmmm, that's a useful feature request @Francis Duval. I often have SSL errors because of company policies 😅 What does the `transformers.pipeline` call look like when using a local model?
**f:** So this is working:
```yaml
translation_model:
  type: huggingface.HFTransformerPipelineDataset
  task: translation_fr_to_en
  model_name: data/01_raw/opus-mt-fr-en
```
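For reference, `HFTransformerPipelineDataset` roughly wraps `transformers.pipeline`, which accepts a local directory in place of a Hub model name; a sketch of the equivalent call, with an invented example sentence:

```python
from transformers import pipeline

# pipeline() resolves `model` as a local directory when one exists at that path.
translator = pipeline(
    task="translation_fr_to_en",
    model="data/01_raw/opus-mt-fr-en",
)
print(translator("Bonjour tout le monde")[0]["translation_text"])
```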
However, I'd like to have the tokenizer and the actual model in 2 different Kedro datasets. For this, I think I would need to create a custom dataset class!
So, I did:
```python
from kedro.io import AbstractDataset
from transformers import AutoTokenizer


class TokenizerDataset(AbstractDataset):
    def __init__(self, filepath: str):
        super().__init__()
        self.filepath = filepath

    def _load(self):
        # Load the tokenizer from the local model directory.
        return AutoTokenizer.from_pretrained(self.filepath)

    def _save(self, model):
        raise NotImplementedError('Not yet implemented')

    def _describe(self):
        return {'filepath': self.filepath}
```
and then, in the Catalog:
```yaml
translation_tokenizer:
  type: ibc_codes.datasets.tokenizer_dataset.TokenizerDataset
  filepath: data/01_raw/opus-mt-fr-en
```
**j:** that looks about right @Francis Duval!
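To complete the pair, the model half could follow the same pattern (a hypothetical sketch mirroring `TokenizerDataset`; `AutoModelForSeq2SeqLM` is an assumption matching the opus-mt model, and its catalog entry would mirror `translation_tokenizer`):

```python
from kedro.io import AbstractDataset
from transformers import AutoModelForSeq2SeqLM


class ModelDataset(AbstractDataset):
    def __init__(self, filepath: str):
        super().__init__()
        self.filepath = filepath

    def _load(self):
        # Load the model weights from the local model directory.
        return AutoModelForSeq2SeqLM.from_pretrained(self.filepath)

    def _save(self, model):
        raise NotImplementedError('Not yet implemented')

    def _describe(self):
        return {'filepath': self.filepath}
```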