Hugo Evers
12/21/2023, 10:36 AM
%load_ext kedro.ipython
And I want to load a dataset with results from a dataset factory, but I get an error stating that the dataset is not in the catalog.
Is this not possible?
Lodewic van Twillert
12/21/2023, 10:58 AM
#1, call catalog.exists(...) and it will resolve the dataset, and then you can just load it normally:
catalog.load("my_namespace.my_dataset") # Error
catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset") # Success
#2, resolve the dataset factory yourself. Nice to know how it works under the hood but probably not what you'll use:
import copy

from kedro.utils import load_obj


def get_dataset(name, catalog):
    """Get dataset object from catalog by name, whether it's from a dataset factory or not."""
    if name not in catalog._data_sets.keys():
        # resolve dataset name manually
        matched_pattern = catalog._match_pattern(catalog._dataset_patterns, name)
        ds_config_copy = copy.deepcopy(
            catalog._dataset_patterns[matched_pattern]
        )
        matched_dataset = catalog._get_dataset(matched_pattern)
        ds_config = catalog._resolve_config(
            name, matched_pattern, ds_config_copy
        )
        # pop "type" once, so the fallback below does not hit a KeyError
        dataset_type = ds_config.pop("type")
        try:
            # if dataset is from kedro-datasets
            dataset = load_obj(".".join(["kedro_datasets", dataset_type]))(**ds_config)
        except (AttributeError, ModuleNotFoundError):
            # if custom dataset
            dataset = load_obj(dataset_type)(**ds_config)
    else:
        # if dataset is in catalog, just get it
        dataset = catalog._data_sets[name]
    return dataset


dataset = get_dataset(name="my_namespace.dataset", catalog=catalog)
dataset._describe()
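For readers who want the idea behind "resolving" a factory entry without touching Kedro internals, here is a minimal, standard-library-only sketch. It is not Kedro's implementation (Kedro's own matching and pattern ordering differ); the helper names `pattern_to_regex`, `fill_placeholders` and `resolve_dataset_config` are made up for illustration, and it assumes each placeholder appears only once in the pattern key.

```python
import re


def pattern_to_regex(pattern):
    """Turn a key like '{namespace}.dataset' into a regex with one named group per placeholder."""
    # re.split with a capture group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)      # literal text, matched exactly
        else:
            regex += f"(?P<{part}>.+?)"   # placeholder, captured by name
    return re.compile(regex)


def fill_placeholders(value, fields):
    """Recursively substitute '{placeholder}' in strings, returning new objects."""
    if isinstance(value, str):
        for key, replacement in fields.items():
            value = value.replace("{" + key + "}", replacement)
        return value
    if isinstance(value, dict):
        return {k: fill_placeholders(v, fields) for k, v in value.items()}
    if isinstance(value, list):
        return [fill_placeholders(v, fields) for v in value]
    return value


def resolve_dataset_config(name, patterns):
    """Return the filled-in config of the first pattern that matches `name`, else None."""
    for pattern, config in patterns.items():
        match = pattern_to_regex(pattern).fullmatch(name)
        if match:
            return fill_placeholders(config, match.groupdict())
    return None
```

With a pattern keyed `{namespace}.dataset`, calling `resolve_dataset_config("my_namespace.dataset", patterns)` substitutes `my_namespace` wherever `{namespace}` appears in the entry's values, which is roughly what the catalog does before instantiating the dataset.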
Juan Luis
12/21/2023, 11:06 AM
catalog.load("my_namespace.my_dataset") # Error
catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset") # Success
...what kind of sorcery is this?
@Ankita Katiyar any clues?
Lodewic van Twillert
12/21/2023, 11:08 AM
Nok Lam Chan
12/21/2023, 12:34 PM
Nok Lam Chan
12/21/2023, 12:35 PM
def load(self, name: str, version: str = None) -> Any:
    ...
    load_version = Version(version, None) if version else None
    dataset = self._get_dataset(name, version=load_version)

def exists(self, name: str) -> bool:
    """Checks whether registered data set exists by calling its `exists()`
    method. Raises a warning and returns False if `exists()` is not
    implemented.

    Args:
        name: A data set to be checked.

    Returns:
        Whether the data set output exists.

    """
    try:
        dataset = self._get_dataset(name)
Looking through the code quickly, both methods use the identical call to access the dataset, so I am not sure how this could happen.
Which version of Kedro are you using?
Lodewic van Twillert
12/21/2023, 12:41 PM
"{namespace}.dataset":
  type: json.JSONDataset
  filepath: "data/01_raw/{namespace}.json"
Then the code I sent earlier.
Using kedro 0.18.14.
In the meantime I upgraded to 0.19.1 and it seems to not work anymore with that version..? 🤷‍♂️
Ankita Katiyar
12/21/2023, 12:43 PM
catalog._get_dataset() which is called internally by the exists() fn
Nok Lam Chan
12/21/2023, 12:45 PM
load method also using self._get_dataset?
Lodewic van Twillert
12/21/2023, 12:51 PM
catalog.load("my_namespace.my_dataset") # Not an error
catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset") # Also Success
Oof, my bad @Nok Lam Chan, you're right: just calling catalog.load() on a factory dataset works. (It worked in 0.18.14 at least.)
My test case was just bad locally and I copied it.
---
I have a bad habit of selecting datasets like catalog._data_sets["my_namespace.my_dataset"].load()
and that failed because the dataset is not in that dictionary of _data_sets yet. Different error.
Ankita Katiyar
12/21/2023, 12:59 PM
_datasets now in 0.19.1 (removed the underscore)
Nok Lam Chan
12/21/2023, 1:13 PM
(The mix of _data_sets, DataSet, Dataset, dataset was a bit of legacy that we finally cleaned up in 0.19, so it may be causing some trouble, especially if you are accessing the internal variables.)
Nok Lam Chan
12/21/2023, 1:15 PM
And I want to load a dataset with results from a dataset factory, but I get an error stating that the dataset is not in the catalog.
Is this not possible?
It would be nice if you could include your Kedro version, your catalog entry, and a snippet showing exactly how you do this.
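If code really has to reach into the catalog's private dictionary across the rename mentioned above, a small shim can hide the difference. This is only a sketch: `registered_datasets` is a hypothetical helper name, and the two attribute names are taken from this thread (`_data_sets` on 0.18.x, `_datasets` on 0.19.x). Both are private and may change again.

```python
def registered_datasets(catalog):
    """Return the catalog's internal name -> dataset dict, whichever name it has.

    Assumption from this thread: the attribute is `_data_sets` in 0.18.x and
    `_datasets` in 0.19.x.
    """
    for attr in ("_datasets", "_data_sets"):
        mapping = getattr(catalog, attr, None)
        if mapping is not None:
            return mapping
    raise AttributeError("catalog exposes neither _datasets nor _data_sets")
```

As noted earlier in the thread, a factory dataset that has not been resolved yet is not in this dictionary, so `catalog.load(name)` remains the safer way to read data.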
Nok Lam Chan
12/21/2023, 1:15 PM
Lodewic van Twillert
12/21/2023, 1:16 PM
Hugo Evers
12/21/2023, 2:47 PM
"{namespace}.X":
  type: pandas.ParquetDataSet
  filepath: s3://..../data/04_intermediate/{namespace}_X.parquet
but I have a different deployment pattern with its own configs for AWS Batch.
So my conf/aws_batch folder contains the following catalog.yml entry:
"{default_dataset}":
  type: extras.datasets.CloudpickleDataset
  folder: "s3://.../data/tmp/${mlflow_run_id|}/{default_dataset}.bin"
And I wanted to access the dataset and model associated with that MLflow run id.
So I did
import mlflow
mlflow.start_run(run_id=....)
and then
%load_ext kedro.ipython
but I don't see an option to specify conf=aws_batch
Only when I do session.run(**params)
So I can use the output of session.run, but that is not as clean as loading the datasets directly.
But maybe I am missing something?
Nok Lam Chan
12/21/2023, 3:01 PM
aws_batch here?
Nok Lam Chan
12/21/2023, 3:02 PM
conf_source? Or is it an environment?
Nok Lam Chan
12/21/2023, 3:02 PM
%reload_kedro and you can specify conf_source as an argument
Hugo Evers
12/21/2023, 3:03 PM
Hugo Evers
12/21/2023, 3:03 PM
kedro run --env=aws_batch
Nok Lam Chan
12/21/2023, 3:19 PM
%reload_kedro --env=aws_batch
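Outside a notebook, roughly the same thing can be done from plain Python by creating a session for the environment and taking the catalog from its context. A minimal sketch, assuming a standard Kedro project layout: `load_catalog_for_env` is a hypothetical helper name, and the Kedro imports are deferred into the function body so the helper can be defined even where Kedro is not installed.

```python
from pathlib import Path


def load_catalog_for_env(project_path, env):
    """Return the DataCatalog of a Kedro project for a given configuration environment."""
    # Imported lazily so this helper can be defined without Kedro installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_path).resolve()
    # Registers the project's settings and package before a session is created.
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path, env=env) as session:
        return session.load_context().catalog
```

Something like `catalog = load_catalog_for_env(".", "aws_batch")` followed by `catalog.load(...)` would then read datasets using the entries under conf/aws_batch. Whether the `${mlflow_run_id|}` placeholder resolves as intended depends on the project's own config setup, which this sketch does not cover.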
Hugo Evers
12/21/2023, 3:21 PM
Nok Lam Chan
12/21/2023, 4:08 PM