# questions
e
Hi!! I am trying to declare a Hugging Face dataset in the catalog:
dataset_hf:
  type: kedro_datasets.huggingface.HFDataset
  dataset_name: Davlan/sib200
Then I try to instantiate it in kedro jupyter lab, but I get an error because I can't pass the name parameter when I run this:
catalog.load('dataset_hf')
the error message is this:
in <module>:1                                                                                    │
│                                                                                                  │
│ ❱ 1 catalog.load('dataset_hf')                                                                   │
│   2                                                                                              │
│                                                                                                  │
│ C:\Users\eromero\Documents\Proyectos_edu\ia-text-classifiers-git_edu_V2\venv\lib\site-packages\k │
│ edro\io\data_catalog.py:490 in load                                                              │
│                                                                                                  │
│   487 │   │   │   extra={"markup": True},                                                        │
│   488 │   │   )                                                                                  │
│   489 │   │                                                                                      │
│ ❱ 490 │   │   result = dataset.load()                                                            │
│   491 │   │                                                                                      │
│   492 │   │   return result                                                                      │
│   493                                                                                            │
│                                                                                                  │
│ C:\Users\eromero\Documents\Proyectos_edu\ia-text-classifiers-git_edu_V2\venv\lib\site-packages\k │
│ edro\io\core.py:608 in load                                                                      │
│                                                                                                  │
│   605 │   │   return self._filepath / version / self._filepath.name                              │
│   606 │                                                                                          │
│   607 │   def load(self) -> _DO:                                                                 │
│ ❱ 608 │   │   return super().load()                                                              │
│   609 │                                                                                          │
│   610 │   def save(self, data: _DI) -> None:                                                     │
│   611 │   │   self._version_cache.clear()                                                        │
│                                                                                                  │
│ C:\Users\eromero\Documents\Proyectos_edu\ia-text-classifiers-git_edu_V2\venv\lib\site-packages\k │
│ edro\io\core.py:195 in load                                                                      │
│                                                                                                  │
│   192 │   │   │   message = (                                                                    │
│   193 │   │   │   │   f"Failed while loading data from data set {str(self)}.\n{str(exc)}"        │
│   194 │   │   │   )                                                                              │
│ ❱ 195 │   │   │   raise DatasetError(message) from exc                                           │
│   196 │                                                                                          │
│   197 │   def save(self, data: _DI) -> None:                                                     │
│   198 │   │   """Saves data by delegation to the provided save method.                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
DatasetError: Failed while loading data from data set HFDataset(dataset_author=Davlan, dataset_name=Davlan/sib200, 
dataset_tags=['task_categories:text-classification', 'task_ids:topic-classification', 'annotations_creators:found',
'language_creators:expert-generated', 'multilinguality:multilingual', 'size_categories:1K<n<10K', 
'source_datasets:original', 'language:ace', 'language:acm', 'language:acq', 'language:aeb', 'language:af', 
'language:ajp', 'language:ak', 'language:als', 'language:am', 'language:apc', 'language:ar', 'language:ars', 
'language:ary', 'language:arz', 'language:as', 'language:ast', 'language:awa', 'language:ayr', 'language:azb', 
'language:azj', 'language:ba', 'language:bm', 'language:ban', 'language:be', 'language:bem', 'language:bn', 
'language:bho', 'language:bjn', 'language:bo', 'language:bs', 'language:bug', 'language:bg', 'language:ca', 
'language:ceb', 'language:cs', 'language:cjk', 'language:ckb', 'language:crh', 'language:cy', 'language:da', 
'language:de', 'language:dik', 'language:dyu', 'language:dz', 'language:el', 'language:en', 'language:eo', 
'language:et', 'language:eu', 'language:ee', 'language:fo', 'language:fj', 'language:fi', 'language:fon', 
'language:fr', 'language:fur', 'language:fuv', 'language:gaz', 'language:gd', 'language:ga', 'language:gl', 
'language:gn', 'language:gu', 'language:ht', 'language:ha', 'language:he', 'language:hi', 'language:hne', 
'language:hr', 'language:hu', 'language:hy', 'language:ig', 'language:ilo', 'language:id', 'language:is', 
'language:it', 'language:jv', 'language:ja', 'language:kab', 'language:kac', 'language:kam', 'language:kn', 
'language:ks', 'language:ka', 'language:kk', 'language:kbp', 'language:kea', 'language:khk', 'language:km', 
'language:ki', 'language:rw', 'language:ky', 'language:kmb', 'language:kmr', 'language:knc', 'language:kg', 
'language:ko', 'language:lo', 'language:lij', 'language:li', 'language:ln', 'language:lt', 'language:lmo', 
'language:ltg', 'language:lb', 'language:lua', 'language:lg', 'language:luo', 'language:lus', 'language:lvs', 
'language:mag', 'language:mai', 'language:ml', 'language:mar', 'language:min', 'language:mk', 'language:mt', 
'language:mni', 'language:mos', 'language:mi', 'language:my', 'language:nl', 'language:nn', 'language:nb', 
'language:npi', 'language:nqo', 'language:nso', 'language:nus', 'language:ny', 'language:oc', 'language:ory', 
'language:pag', 'language:pa', 'language:pap', 'language:pbt', 'language:pes', 'language:plt', 'language:pl', 
'language:pt', 'language:prs', 'language:quy', 'language:ro', 'language:rn', 'language:ru', 'language:sg', 
'language:sa', 'language:sat', 'language:scn', 'language:shn', 'language:si', 'language:sk', 'language:sl', 
'language:sm', 'language:sn', 'language:sd', 'language:so', 'language:st', 'language:es', 'language:sc', 
'language:sr', 'language:ss', 'language:su', 'language:sv', 'language:swh', 'language:szl', 'language:ta', 
'language:taq', 'language:tt', 'language:te', 'language:tg', 'language:tl', 'language:th', 'language:ti', 
'language:tpi', 'language:tn', 'language:ts', 'language:tk', 'language:tum', 'language:tr', 'language:tw', 
'language:tzm', 'language:ug', 'language:uk', 'language:umb', 'language:ur', 'language:uzn', 'language:vec', 
'language:vi', 'language:war', 'language:wo', 'language:xh', 'language:ydd', 'language:yo', 'language:yue', 
'language:zh', 'language:zsm', 'language:zu', 'license:cc-by-sa-4.0', 'news-topic', 'sib-200', 'sib200', 
'croissant', 'arxiv:2309.07445', 'region:us']).
Config name is missing.
Please pick one among the available configs: ['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 
'afr_Latn', 'ajp_Arab', 'aka_Latn', 'als_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'arb_Latn', 'ars_Arab', 
'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 
'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 
'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 
'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 
'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 
'fuv_Latn', 'gaz_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 
'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 
'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 
'kas_Deva', 'kat_Geor', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khk_Cyrl', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 
'kir_Cyrl', 'kmb_Latn', 'kmr_Latn', 'knc_Arab', 'knc_Latn', 'kon_Latn', 'kor_Hang', 'lao_Laoo', 'lij_Latn', 
'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 
'lus_Latn', 'lvs_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Arab', 'min_Latn', 'mkd_Cyrl', 
'mlt_Latn', 'mni_Beng', 'mos_Latn', 'mri_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 
'nqo_Nkoo', 'nqo_Nkoo.zip', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 
'pap_Latn', 'pbt_Arab', 'pes_Arab', 'plt_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'quy_Latn', 'ron_Latn', 
'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 
'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'srd_Latn', 'srp_Cyrl', 
'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'taq_Latn', 'taq_Tfng', 'tat_Cyrl', 
'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 
'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 
'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 
'zho_Hant', 'zsm_Latn', 'zul_Latn']
Example of usage:
        `load_dataset('sib200', 'ace_Arab')`
How can I define that parameter in the catalog?
n
catalog.load("dataset_hf")
, what you did, is correct.
Was it working when you did a
kedro run
? It looks like there is an issue with the data itself; I can't tell from your truncated stacktrace.
e
When I run
kedro run
I don't get an error, because I don't use it in any node.
I have done this test:
[image: image.png]
I tested the datasets library without kedro, and I need to pass a second parameter,
glg_Latn
, which is the subset on Hugging Face; if I don't, I get the same error as with kedro.
I think it is necessary to modify the kedro class to include that parameter and others.
On this dataset I can choose several "Subset" and "Split" options; how do I do that in kedro?
n
How are you passing the "language:ka"? Is it a parameter?
It would be helpful if you could provide: • the
__init__
and
_load
methods of your custom dataset • the definition in
pipeline.py
If you want to access the Dataset object instead of the underlying data, you can do
dataset = catalog.datasets.dataset_hf
. Then if you need to load it you can do
dataset.load()
e
A work colleague has followed a guide about kedro-mlflow and has created this example:
I saw that there is a class in the kedro-datasets library to get a dataset from Hugging Face, and I want to replace that node (white mark) with the kedro_datasets.huggingface.HFDataset class in the catalog.
You can see how the return needs to pass two parameters:
load_dataset(path_corpus, 'glg_Latn')
Is it possible to do it??
n
For now you will need to use a custom dataset, because this argument is not supported yet. Do you want to open a PR to do this? It's a fairly simple change.
    def __init__(self, *, dataset_name: str):
        self.dataset_name = dataset_name

    def _load(self):
        return load_dataset(self.dataset_name)
Basically you need to add that second argument for "language", or it could be just `**kwargs` for whatever
load_dataset
takes. Cc @Juan Luis
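As a rough illustration of that suggestion, here is a minimal sketch of a custom dataset that forwards extra keyword arguments to load_dataset. This is hypothetical: the class name and the dataset_kwargs parameter are assumptions for illustration, not the real kedro-datasets API.

```python
# Hypothetical sketch: a custom dataset that forwards extra arguments to
# datasets.load_dataset. `CustomHFDataset` and `dataset_kwargs` are
# illustrative names, not part of kedro-datasets.
from typing import Any, Optional


class CustomHFDataset:
    def __init__(self, *, dataset_name: str, dataset_kwargs: Optional[dict] = None):
        self.dataset_name = dataset_name
        # e.g. {"name": "glg_Latn", "split": "train"} -- whatever load_dataset accepts
        self.dataset_kwargs: dict = dataset_kwargs or {}

    def _load(self) -> Any:
        # Imported lazily so the class can be defined without `datasets` installed.
        from datasets import load_dataset

        return load_dataset(self.dataset_name, **self.dataset_kwargs)
```

In the catalog, dataset_kwargs could then carry the config name that load_dataset needs as its second argument.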
j
yeah I think this would be a nice addition to the Hugging Face dataset itself! I'm looking at https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset, is that
name
or
data_dir
? regardless, in the same way that the HFPipelineDataset has a
pipeline_kwargs
https://github.com/kedro-org/kedro-plugins/blob/afe4c98cd6a18a2e2e217989a5fe70a6a9[…]sets/kedro_datasets/huggingface/transformer_pipeline_dataset.py we could have the same for datasets
@Eduardo Romero López would you like to open an issue on https://github.com/kedro-org/kedro-plugins/issues first? and then the fix for that is quite easy 😄
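If such a change landed, a catalog entry could hypothetically look like this; the dataset_kwargs key is an assumption mirroring HFPipelineDataset's pipeline_kwargs, not currently supported syntax:

```yaml
# Hypothetical catalog entry: `dataset_kwargs` is an assumed key, not yet supported.
dataset_hf:
  type: kedro_datasets.huggingface.HFDataset
  dataset_name: Davlan/sib200
  dataset_kwargs:
    name: glg_Latn
```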
e
Thanks very much!! @Juan Luis @Nok Lam Chan. I am going to create the issue and then fix it. The parameter is
name
; I verified it in the error message.
🙌🏼 1