Francis Duval
02/22/2024, 7:10 PM
kedro_datasets.huggingface.HFTransformerPipelineDataset

If I want to have a model from HF in my data catalog, I can do:
hf_model:
  type: huggingface.HFTransformerPipelineDataset
  model_name: Helsinki-NLP/opus-mt-fr-en
However, this won't work since I don't have the certificates for the HF website: I get an SSL error. So I downloaded the model and put it in the data folder: 01_raw/opus-mt-fr-en.
Then, how can I then add this model to my data catalog? I tried:
hf_model:
  type: huggingface.HFTransformerPipelineDataset
  model_name: 01_raw/opus-mt-fr-en
but it does not work. I know I could make a custom dataset class, but I'm wondering if there's a simpler solution. Thanks!

Francis Duval
02/22/2024, 7:20 PM
node(
    func=load_tokenizer,
    inputs='01_raw/opus-mt-fr-en',
    outputs='tokenizer',
    name='load_tokenizer'
)
node(
    func=load_model,
    inputs='01_raw/opus-mt-fr-en',
    outputs='model',
    name='load_model'
)
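The load_tokenizer / load_model node functions aren't shown in the thread; a minimal sketch of what they could look like, assuming the checkpoint directory path reaches the node as a plain string (e.g. via params) and that transformers is installed (the imports are deferred into the functions so the module can be loaded without it):

```python
def load_tokenizer(model_dir: str):
    # Deferred import: only needed when the node actually runs
    from transformers import AutoTokenizer
    # from_pretrained() accepts a local directory, so no Hub/SSL access is needed
    return AutoTokenizer.from_pretrained(model_dir)

def load_model(model_dir: str):
    from transformers import AutoModelForSeq2SeqLM
    # opus-mt-* checkpoints are MarianMT sequence-to-sequence models
    return AutoModelForSeq2SeqLM.from_pretrained(model_dir)
```

Note that in Kedro a node's inputs are catalog dataset names, so the path string itself would need to come from the catalog or from parameters rather than being used directly as an input name.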
marrrcin
02/23/2024, 8:41 AM
Did you try model_name as an absolute path to your downloaded model?

Juan Luis
02/23/2024, 11:00 AM
What does the transformers.pipeline call look like when using a local model?

Francis Duval
02/23/2024, 7:23 PM
translation_model:
  type: huggingface.HFTransformerPipelineDataset
  task: translation_fr_to_en
  model_name: data/01_raw/opus-mt-fr-en
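For reference, the equivalent direct transformers call would point pipeline() at the local checkpoint directory. A sketch (build_translator is a hypothetical helper, not from the thread):

```python
def build_translator(model_dir: str = "data/01_raw/opus-mt-fr-en"):
    # Deferred import: only needed when the pipeline is actually built
    from transformers import pipeline
    # A local directory works in place of a Hub model id, so no network
    # request (and hence no SSL handshake) is involved
    return pipeline(task="translation_fr_to_en", model=model_dir)
```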
However, I'd like to have the tokenizer and the actual model in 2 different Kedro datasets. For this, I think I would need to create a custom dataset class!

Francis Duval
02/23/2024, 7:39 PM
from kedro.io import AbstractDataset
from transformers import AutoTokenizer

class TokenizerDataset(AbstractDataset):
    def __init__(self, filepath: str):
        super().__init__()
        self.filepath = filepath

    def _load(self):
        return AutoTokenizer.from_pretrained(self.filepath)

    def _save(self, model):
        raise NotImplementedError('Not yet implemented')

    def _describe(self):
        return {
            'filepath': self.filepath
        }
and then, in the Catalog:
translation_tokenizer:
  type: ibc_codes.datasets.tokenizer_dataset.TokenizerDataset
  filepath: data/01_raw/opus-mt-fr-en
Juan Luis
02/23/2024, 11:47 PM