Eduardo Romero López
02/27/2024, 12:58 PM

```yaml
dataset_hf:
  type: kedro_datasets.huggingface.HFDataset
  dataset_name: Davlan/sib200
```

Then I instantiate it in `kedro jupyter lab`, but I get an error because I can't pass the `name` parameter when I run `catalog.load('dataset_hf')`.
the error message is this:
in <module>:1 │
│ │
│ ❱ 1 catalog.load('dataset_hf') │
│ 2 │
│ │
│ C:\Users\eromero\Documents\Proyectos_edu\ia-text-classifiers-git_edu_V2\venv\lib\site-packages\k │
│ edro\io\data_catalog.py:490 in load │
│ │
│ 487 │ │ │ extra={"markup": True}, │
│ 488 │ │ ) │
│ 489 │ │ │
│ ❱ 490 │ │ result = dataset.load() │
│ 491 │ │ │
│ 492 │ │ return result │
│ 493 │
│ │
│ C:\Users\eromero\Documents\Proyectos_edu\ia-text-classifiers-git_edu_V2\venv\lib\site-packages\k │
│ edro\io\core.py:608 in load │
│ │
│ 605 │ │ return self._filepath / version / self._filepath.name │
│ 606 │ │
│ 607 │ def load(self) -> _DO: │
│ ❱ 608 │ │ return super().load() │
│ 609 │ │
│ 610 │ def save(self, data: _DI) -> None: │
│ 611 │ │ self._version_cache.clear() │
│ │
│ C:\Users\eromero\Documents\Proyectos_edu\ia-text-classifiers-git_edu_V2\venv\lib\site-packages\k │
│ edro\io\core.py:195 in load │
│ │
│ 192 │ │ │ message = ( │
│ 193 │ │ │ │ f"Failed while loading data from data set {str(self)}.\n{str(exc)}" │
│ 194 │ │ │ ) │
│ ❱ 195 │ │ │ raise DatasetError(message) from exc │
│ 196 │ │
│ 197 │ def save(self, data: _DI) -> None: │
│ 198 │ │ """Saves data by delegation to the provided save method. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
DatasetError: Failed while loading data from data set HFDataset(dataset_author=Davlan, dataset_name=Davlan/sib200,
dataset_tags=['task_categories:text-classification', 'task_ids:topic-classification', 'annotations_creators:found',
'language_creators:expert-generated', 'multilinguality:multilingual', 'size_categories:1K<n<10K',
'source_datasets:original', 'language:ace', 'language:acm', 'language:acq', 'language:aeb', 'language:af',
'language:ajp', 'language:ak', 'language:als', 'language:am', 'language:apc', 'language:ar', 'language:ars',
'language:ary', 'language:arz', 'language:as', 'language:ast', 'language:awa', 'language:ayr', 'language:azb',
'language:azj', 'language:ba', 'language:bm', 'language:ban', 'language:be', 'language:bem', 'language:bn',
'language:bho', 'language:bjn', 'language:bo', 'language:bs', 'language:bug', 'language:bg', 'language:ca',
'language:ceb', 'language:cs', 'language:cjk', 'language:ckb', 'language:crh', 'language:cy', 'language:da',
'language:de', 'language:dik', 'language:dyu', 'language:dz', 'language:el', 'language:en', 'language:eo',
'language:et', 'language:eu', 'language:ee', 'language:fo', 'language:fj', 'language:fi', 'language:fon',
'language:fr', 'language:fur', 'language:fuv', 'language:gaz', 'language:gd', 'language:ga', 'language:gl',
'language:gn', 'language:gu', 'language:ht', 'language:ha', 'language:he', 'language:hi', 'language:hne',
'language:hr', 'language:hu', 'language:hy', 'language:ig', 'language:ilo', 'language:id', 'language:is',
'language:it', 'language:jv', 'language:ja', 'language:kab', 'language:kac', 'language:kam', 'language:kn',
'language:ks', 'language:ka', 'language:kk', 'language:kbp', 'language:kea', 'language:khk', 'language:km',
'language:ki', 'language:rw', 'language:ky', 'language:kmb', 'language:kmr', 'language:knc', 'language:kg',
'language:ko', 'language:lo', 'language:lij', 'language:li', 'language:ln', 'language:lt', 'language:lmo',
'language:ltg', 'language:lb', 'language:lua', 'language:lg', 'language:luo', 'language:lus', 'language:lvs',
'language:mag', 'language:mai', 'language:ml', 'language:mar', 'language:min', 'language:mk', 'language:mt',
'language:mni', 'language:mos', 'language:mi', 'language:my', 'language:nl', 'language:nn', 'language:nb',
'language:npi', 'language:nqo', 'language:nso', 'language:nus', 'language:ny', 'language:oc', 'language:ory',
'language:pag', 'language:pa', 'language:pap', 'language:pbt', 'language:pes', 'language:plt', 'language:pl',
'language:pt', 'language:prs', 'language:quy', 'language:ro', 'language:rn', 'language:ru', 'language:sg',
'language:sa', 'language:sat', 'language:scn', 'language:shn', 'language:si', 'language:sk', 'language:sl',
'language:sm', 'language:sn', 'language:sd', 'language:so', 'language:st', 'language:es', 'language:sc',
'language:sr', 'language:ss', 'language:su', 'language:sv', 'language:swh', 'language:szl', 'language:ta',
'language:taq', 'language:tt', 'language:te', 'language:tg', 'language:tl', 'language:th', 'language:ti',
'language:tpi', 'language:tn', 'language:ts', 'language:tk', 'language:tum', 'language:tr', 'language:tw',
'language:tzm', 'language:ug', 'language:uk', 'language:umb', 'language:ur', 'language:uzn', 'language:vec',
'language:vi', 'language:war', 'language:wo', 'language:xh', 'language:ydd', 'language:yo', 'language:yue',
'language:zh', 'language:zsm', 'language:zu', 'license:cc-by-sa-4.0', 'news-topic', 'sib-200', 'sib200',
'croissant', 'arxiv:2309.07445', 'region:us']).
Config name is missing.
Please pick one among the available configs: ['ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab',
'afr_Latn', 'ajp_Arab', 'aka_Latn', 'als_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'arb_Latn', 'ars_Arab',
'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl',
'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt',
'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn',
'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn',
'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn',
'fuv_Latn', 'gaz_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn',
'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn',
'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab',
'kas_Deva', 'kat_Geor', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khk_Cyrl', 'khm_Khmr', 'kik_Latn', 'kin_Latn',
'kir_Cyrl', 'kmb_Latn', 'kmr_Latn', 'knc_Arab', 'knc_Latn', 'kon_Latn', 'kor_Hang', 'lao_Laoo', 'lij_Latn',
'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn',
'lus_Latn', 'lvs_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Arab', 'min_Latn', 'mkd_Cyrl',
'mlt_Latn', 'mni_Beng', 'mos_Latn', 'mri_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva',
'nqo_Nkoo', 'nqo_Nkoo.zip', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru',
'pap_Latn', 'pbt_Arab', 'pes_Arab', 'plt_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'quy_Latn', 'ron_Latn',
'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Olck', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn',
'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'srd_Latn', 'srp_Cyrl',
'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'taq_Latn', 'taq_Tfng', 'tat_Cyrl',
'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn',
'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn',
'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans',
'zho_Hant', 'zsm_Latn', 'zul_Latn']
Example of usage:
`load_dataset('sib200', 'ace_Arab')`
Eduardo Romero López
02/27/2024, 12:59 PM

Nok Lam Chan
02/27/2024, 1:54 PM
`catalog.load("dataset_hf")`, what you did is correct.

Nok Lam Chan
02/27/2024, 1:56 PM
`kedro run`? It looks like there is an issue with the data itself; I can't tell from your truncated stacktrace.

Eduardo Romero López
02/27/2024, 2:40 PM
With `kedro run` I don't get an error, because I don't use it in any node.

Eduardo Romero López
02/27/2024, 2:43 PM
I have to pass `glg_Latn`, like the subset on Hugging Face; if I don't do that, I get the same error as with Kedro.

Eduardo Romero López
02/27/2024, 2:45 PM

Eduardo Romero López
02/27/2024, 2:48 PM

Nok Lam Chan
02/27/2024, 3:07 PM

Nok Lam Chan
02/27/2024, 3:08 PM
• the `_load` method of your custom dataset
• the definition in `pipeline.py`

Nok Lam Chan
02/27/2024, 3:11 PM
`dataset = catalog.datasets.dataset_hf`. Then if you need to load it you can do `dataset.load()`.
Eduardo Romero López
02/27/2024, 3:21 PM

Eduardo Romero López
02/27/2024, 3:24 PM

Eduardo Romero López
02/27/2024, 3:25 PM
Is it possible to do `load_dataset(path_corpus, 'glg_Latn')`?

Nok Lam Chan
02/27/2024, 4:00 PM

```python
def __init__(self, *, dataset_name: str):
    self.dataset_name = dataset_name

def _load(self):
    return load_dataset(self.dataset_name)
```

Basically you need to add that second argument for "language", or it could be just `**kwargs` for whatever `load_dataset` takes.
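Nok's fragment can be fleshed out into a complete, self-contained sketch. Here `dataset_kwargs` is a hypothetical parameter name and `load_dataset` is stubbed out, so this only illustrates the forwarding pattern, not the real `kedro_datasets` or `datasets` APIs:

```python
from typing import Optional


def load_dataset(path, name=None, **kwargs):
    # Stand-in for datasets.load_dataset so the sketch runs offline;
    # the real function downloads the dataset from the Hugging Face Hub.
    return {"path": path, "name": name}


class SimpleHFDataset:
    """Minimal sketch of a custom dataset that forwards extra kwargs."""

    def __init__(self, *, dataset_name: str, dataset_kwargs: Optional[dict] = None):
        self.dataset_name = dataset_name
        self.dataset_kwargs = dataset_kwargs or {}

    def _load(self):
        # Forward whatever load_dataset accepts, e.g. name="glg_Latn"
        return load_dataset(self.dataset_name, **self.dataset_kwargs)


ds = SimpleHFDataset(
    dataset_name="Davlan/sib200",
    dataset_kwargs={"name": "glg_Latn"},
)
print(ds._load())  # {'path': 'Davlan/sib200', 'name': 'glg_Latn'}
```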
Cc @Juan Luis

Juan Luis
02/27/2024, 5:41 PM
`name` or `data_dir`? Regardless, in the same way that the HFPipelineDataset has a `pipeline_kwargs`
https://github.com/kedro-org/kedro-plugins/blob/afe4c98cd6a18a2e2e217989a5fe70a6a9[…]sets/kedro_datasets/huggingface/transformer_pipeline_dataset.py
we could have the same for datasets.

Juan Luis
02/27/2024, 5:42 PM

Eduardo Romero López
02/28/2024, 7:11 AM
It's `name`; I verified it in the error message.
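For reference, if `HFDataset` gained a kwargs-forwarding option like the one discussed above (mirroring `pipeline_kwargs` on the transformer pipeline dataset), a catalog entry might look like the sketch below. Note that `dataset_kwargs` is a hypothetical key for illustration, not a released API:

```yaml
dataset_hf:
  type: kedro_datasets.huggingface.HFDataset
  dataset_name: Davlan/sib200
  # hypothetical key, forwarded to datasets.load_dataset(..., name="glg_Latn")
  dataset_kwargs:
    name: glg_Latn
```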