#questions

Hugo Evers

12/21/2023, 10:36 AM
How can one use default datasets in the Jupyter integration for Kedro? I ran %load_ext kedro.ipython and I want to load a dataset that comes from a dataset factory, but I get an error stating that the dataset is not in the catalog. Is this not possible?

Lodewic van Twillert

12/21/2023, 10:58 AM
Two ways, easy and complex.
#1: touch the dataset using catalog.exists(...) and it will resolve the dataset, and then you can just load it normally:
```
catalog.load("my_namespace.my_dataset")  # Error

catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset")  # Success
```
#2: resolve the dataset factory yourself. Nice to know how it works under the hood, but probably not what you'll use:
```
import copy
from kedro.utils import load_obj

def get_dataset(name, catalog):
    """Get dataset object from catalog by name, whether it's from a dataset factory or not."""
    # Uses private DataCatalog attributes, which can change between Kedro versions.
    if name not in catalog._data_sets:
        # resolve dataset name manually
        matched_pattern = catalog._match_pattern(catalog._dataset_patterns, name)
        ds_config_copy = copy.deepcopy(
            catalog._dataset_patterns[matched_pattern]
        )
        ds_config = catalog._resolve_config(
            name, matched_pattern, ds_config_copy
        )

        # pop "type" only once, so the fallback below can still use it
        dataset_type = ds_config.pop("type")
        try:
            # if dataset is from kedro-datasets
            dataset_cls = load_obj(".".join(["kedro_datasets", dataset_type]))
        except (AttributeError, ModuleNotFoundError):
            # if custom dataset
            dataset_cls = load_obj(dataset_type)
        dataset = dataset_cls(**ds_config)
    else:
        # if dataset is in catalog, just get it
        dataset = catalog._data_sets[name]
    return dataset

dataset = get_dataset(name="my_namespace.dataset", catalog=catalog)
dataset._describe()
```

Juan Luis

12/21/2023, 11:06 AM
```
catalog.load("my_namespace.my_dataset")  # Error

catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset")  # Success
```
...what kind of sorcery is this? 😅 @Ankita Katiyar any clues?

Lodewic van Twillert

12/21/2023, 11:08 AM
https://github.com/takikadiri/kedro-boot/blob/main/kedro_boot/runner.py#L76 Idea from kedro-boot btw. I happened to be scrolling through it earlier today and noticed that this is how they handle it.

Nok Lam Chan

12/21/2023, 12:34 PM
Is there an example that I can reproduce? I can have a look at this.
```
def load(self, name: str, version: str = None) -> Any:
    ...
    load_version = Version(version, None) if version else None
    dataset = self._get_dataset(name, version=load_version)
```
```
def exists(self, name: str) -> bool:
    """Checks whether registered data set exists by calling its `exists()`
    method. Raises a warning and returns False if `exists()` is not
    implemented.

    Args:
        name: A data set to be checked.

    Returns:
        Whether the data set output exists.

    """
    try:
        dataset = self._get_dataset(name)
```
Looking through the code quickly, both use the identical method to access the data, so I am not sure how it could happen. Which version of Kedro are you using?

Lodewic van Twillert

12/21/2023, 12:41 PM
Dataset like this to test:
```
"{namespace}.dataset":
  type: json.JSONDataset
  filepath: "data/01_raw/{namespace}.json"
```
Then the code I sent earlier. Using kedro 0.18.14. In the meantime I upgraded to 0.19.1 and it seems to not work anymore with that version..? 🤷‍♂️

Ankita Katiyar

12/21/2023, 12:43 PM
It's because datasets from dataset factories are only materialised when they're first used, and this logic is in catalog._get_dataset(), which is called internally by the exists() function.

Nok Lam Chan

12/21/2023, 12:45 PM
Isn't the load method also using self._get_dataset?

Lodewic van Twillert

12/21/2023, 12:51 PM
```
catalog.load("my_namespace.my_dataset")  # Not an error

catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset")  # Also Success
```
Oof my bad @Nok Lam Chan, you're right: just calling catalog.load() on a factory dataset works (it did in 0.18.14 at least). My test case was just bad locally and I copied it 😬. I have a bad habit of selecting datasets like catalog._data_sets["my_namespace.my_dataset"].load(), and that failed because the dataset is not in the _data_sets dictionary yet. Different error.
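
To illustrate the difference between going through load() or exists() and indexing the internal dictionary directly, here is a minimal toy model of lazy materialisation. It is not Kedro's implementation: the ToyCatalog class, its attribute names, and the regex-based matching are assumptions made for illustration only.

```python
import re


class ToyCatalog:
    """Toy stand-in for a catalog with dataset factory patterns (not Kedro code)."""

    def __init__(self, datasets, patterns):
        self._datasets = dict(datasets)  # explicitly registered entries
        self._patterns = dict(patterns)  # e.g. {"{namespace}.dataset": {...}}

    @staticmethod
    def _to_regex(pattern):
        # Turn "{namespace}.dataset" into a regex with one named group per placeholder.
        parts = re.split(r"\{(\w+)\}", pattern)
        out = ""
        for i, part in enumerate(parts):
            out += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        return re.compile(out)

    def _get_dataset(self, name):
        # Lazy step: on first access, resolve a matching pattern and cache the result.
        if name not in self._datasets:
            for pattern, config in self._patterns.items():
                match = self._to_regex(pattern).fullmatch(name)
                if match:
                    self._datasets[name] = {
                        key: value.format(**match.groupdict())
                        if isinstance(value, str)
                        else value
                        for key, value in config.items()
                    }
                    break
            else:
                raise KeyError(f"Dataset '{name}' not found in the catalog")
        return self._datasets[name]

    def exists(self, name):
        try:
            self._get_dataset(name)
        except KeyError:
            return False
        return True

    def load(self, name):
        # The toy "load" just returns the resolved config.
        return self._get_dataset(name)


catalog = ToyCatalog(
    datasets={},
    patterns={
        "{namespace}.dataset": {
            "type": "json.JSONDataset",
            "filepath": "data/01_raw/{namespace}.json",
        }
    },
)
print("my_namespace.dataset" in catalog._datasets)  # False: nothing materialised yet
print(catalog.load("my_namespace.dataset")["filepath"])  # data/01_raw/my_namespace.json
print("my_namespace.dataset" in catalog._datasets)  # True: cached after first access
```

In this sketch both load() and exists() go through _get_dataset(), so either one resolves the pattern, while indexing the internal dictionary before the first access fails with a KeyError.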

Ankita Katiyar

12/21/2023, 12:59 PM
@Lodewic van Twillert we've renamed it to _datasets now in 0.19.1 (removed the underscore between "data" and "sets").

Nok Lam Chan

12/21/2023, 1:13 PM
(The _data_sets / DataSet / Dataset / dataset naming was a bit of legacy that we finally cleaned up in 0.19, so it may be causing some trouble, especially if you are accessing the internal variables.)
I wonder if @Hugo Evers' original issue is solved.
> And i want to load a dataset with results from a dataset factory, but i get an error stating that the dataset is not in the catalog. Is this not possible?
It would be nice if you could include your Kedro version, your catalog entry, and snippets showing exactly how you do this.
@Lodewic van Twillert No worries 😁 thanks for helping our community and being an active member :)

Lodewic van Twillert

12/21/2023, 1:16 PM
Thanks, learned some things myself from this interaction 👍

Hugo Evers

12/21/2023, 2:47 PM
Thanks, it works when I use the namespaced dataset, like:
```
"{namespace}.X":
  type: pandas.ParquetDataSet
  filepath: s3://..../data/04_intermediate/{namespace}_X.parquet
```
But I have a different deployment pattern with its own configs for AWS Batch. So my conf/aws_batch folder contains the following catalog.yml entry:
```
"{default_dataset}":
  type: extras.datasets.CloudpickleDataset
  folder: "s3://.../data/tmp/${mlflow_run_id|}/{default_dataset}.bin"
```
And I wanted to access the dataset and model associated with that MLflow run id. So I did
```
import mlflow
mlflow.start_run(run_id=....)
```
and then
```
%load_ext kedro.ipython
```
but I don't see an option to specify conf=aws_batch, only when I do session.run(**params). So I can use the output of session.run, but that is not as clean as loading the datasets directly. But maybe I am missing something?

Nok Lam Chan

12/21/2023, 3:01 PM
What is aws_batch here? Do you mean the conf_source, or is it an environment?
For the former, you should use %reload_kedro, and you can specify conf_source as an argument.

Hugo Evers

12/21/2023, 3:03 PM
It's an env, so kedro run --env=aws_batch

Nok Lam Chan

12/21/2023, 3:19 PM
%reload_kedro --env=aws_batch

Hugo Evers

12/21/2023, 3:21 PM
Cool, thanks! In that case I really did miss something. I'll test whether it works. Can I also pass extra_params? Probably, right? I'll check, thanks!

Nok Lam Chan

12/21/2023, 4:08 PM
Yes
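
For reference, roughly the same thing can be done from plain Python instead of the line magic. This is a minimal sketch assuming the Kedro 0.18/0.19 session API (KedroSession.create with env and extra_params keywords); the helper name load_catalog_for_env and the dataset name in the usage comment are hypothetical.

```python
from pathlib import Path


def load_catalog_for_env(project_path, env, extra_params=None):
    """Return the DataCatalog of a Kedro project for the given environment."""
    # Imported inside the function so the helper can be defined without Kedro installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_path).resolve()
    bootstrap_project(project_path)  # set up the project before creating a session
    with KedroSession.create(
        project_path=project_path, env=env, extra_params=extra_params
    ) as session:
        return session.load_context().catalog


# Usage inside a Kedro project (names below are placeholders):
# catalog = load_catalog_for_env(".", env="aws_batch")
# data = catalog.load("some_dataset")
```

The env argument plays the same role as --env=aws_batch above, so catalog entries from conf/aws_batch, including factory patterns, should be picked up when the catalog is built.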