# questions
**f (Francis Duval):** Hello all! I want to add a transformer model from HuggingFace to my Kedro pipeline. I know there is this class: `kedro_datasets.huggingface.HFTransformerPipelineDataset`.
If I want to have a model from HF in my data catalog, I can do:
```yaml
hf_model:
  type: huggingface.HFTransformerPipelineDataset
  model_name: Helsinki-NLP/opus-mt-fr-en
```
However, this won't work since I don't have the certificates for the HF website; I get an SSL error. So I downloaded the model and put it in the data folder: `01_raw/opus-mt-fr-en`. How can I then add this model to my data catalog? I tried:
```yaml
hf_model:
  type: huggingface.HFTransformerPipelineDataset
  model_name: 01_raw/opus-mt-fr-en
```
but it does not work. I know I could make a custom dataset class, but I'm wondering if there's a simpler solution. Thanks!
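For context, a minimal sketch of how the model files can be fetched ahead of time, run from a machine that can reach huggingface.co (the `AutoModelForSeq2SeqLM` class is an assumption matching this translation model; the paths follow the thread's layout):

```python
# One-off download script (sketch): save the tokenizer and model locally,
# then copy the folder into the Kedro project's data/01_raw/ directory.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-fr-en"
local_dir = "data/01_raw/opus-mt-fr-en"

AutoTokenizer.from_pretrained(model_name).save_pretrained(local_dir)
AutoModelForSeq2SeqLM.from_pretrained(model_name).save_pretrained(local_dir)
```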
Well, I guess I don't have to have the model as a dataset... I could simply have two nodes, one that loads the tokenizer and another that loads the model:
```python
from kedro.pipeline import node

node(
    func=load_tokenizer,
    inputs='01_raw/opus-mt-fr-en',
    outputs='tokenizer',
    name='load_tokenizer',
)
node(
    func=load_model,
    inputs='01_raw/opus-mt-fr-en',
    outputs='model',
    name='load_model',
)
```
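A hedged sketch of what those two node functions could look like (`load_tokenizer` and `load_model` aren't shown in the thread; `AutoModelForSeq2SeqLM` is an assumption matching the opus-mt translation model):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def load_tokenizer(model_path: str):
    # Read the tokenizer files from the local model directory.
    return AutoTokenizer.from_pretrained(model_path)


def load_model(model_path: str):
    # Read the model weights from the same local directory.
    return AutoModelForSeq2SeqLM.from_pretrained(model_path)
```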
**m:** Have you tried to specify the `model_name` as an absolute path to your downloaded model?
👍 1
**j:** hmmmm, that's a useful feature request @Francis Duval. I often have SSL errors because of company policies 😅 What does the `transformers.pipeline` call look like when using a local model?
**f:** So this is working:
```yaml
translation_model:
  type: huggingface.HFTransformerPipelineDataset
  task: translation_fr_to_en
  model_name: data/01_raw/opus-mt-fr-en
```
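For reference, `HFTransformerPipelineDataset` roughly wraps `transformers.pipeline`, which accepts a local directory in place of a Hub model name; a sketch of the equivalent call, with an invented example sentence:

```python
from transformers import pipeline

# pipeline() resolves `model` as a local directory when one exists at that path.
translator = pipeline(
    task="translation_fr_to_en",
    model="data/01_raw/opus-mt-fr-en",
)
print(translator("Bonjour tout le monde")[0]["translation_text"])
```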
However, I'd like to have the tokenizer and the actual model in 2 different Kedro datasets. For this, I think I would need to create a custom dataset class!
So, I did:
```python
from kedro.io import AbstractDataset
from transformers import AutoTokenizer


class TokenizerDataset(AbstractDataset):
    def __init__(self, filepath: str):
        super().__init__()
        self.filepath = filepath

    def _load(self):
        # Load the tokenizer from the local model directory.
        return AutoTokenizer.from_pretrained(self.filepath)

    def _save(self, model):
        raise NotImplementedError('Not yet implemented')

    def _describe(self):
        return {'filepath': self.filepath}
```
and then, in the Catalog:
```yaml
translation_tokenizer:
  type: ibc_codes.datasets.tokenizer_dataset.TokenizerDataset
  filepath: data/01_raw/opus-mt-fr-en
```
**j:** that looks about right @Francis Duval!
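To complete the pair, the model half could follow the same pattern (a hypothetical sketch mirroring `TokenizerDataset`; `AutoModelForSeq2SeqLM` is an assumption matching the opus-mt model, and its catalog entry would mirror `translation_tokenizer`):

```python
from kedro.io import AbstractDataset
from transformers import AutoModelForSeq2SeqLM


class ModelDataset(AbstractDataset):
    def __init__(self, filepath: str):
        super().__init__()
        self.filepath = filepath

    def _load(self):
        # Load the model weights from the local model directory.
        return AutoModelForSeq2SeqLM.from_pretrained(self.filepath)

    def _save(self, model):
        raise NotImplementedError('Not yet implemented')

    def _describe(self):
        return {'filepath': self.filepath}
```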