How can one use default datasets in the Jupyter integration for Kedro? #questions

Hugo Evers

12/21/2023, 10:36 AM

How can one use default datasets in the Jupyter integration for Kedro? So I ran

%load_ext kedro.ipython

And i want to load a dataset with results from a dataset factory, but i get an error stating that the dataset is not in the catalog. Is this not possible?

👀 1

Lodewic van Twillert

12/21/2023, 10:58 AM

Two ways, easy and complex. #1, touch the dataset using

catalog.exists(...)

and it will resolve the dataset, and then you can just load it normally:

catalog.load("my_namespace.my_dataset")  # Error

catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset")  # Success

#2, resolve the dataset factory yourself. Nice to know how it works under the hood but probably not what you'll use:

import copy
from kedro.utils import load_obj

def get_dataset(name, catalog):
    """Get dataset object from catalog by name, whether it's from a dataset factory or not."""
    if name not in catalog._data_sets.keys():
        # resolve dataset name manually
        matched_pattern = catalog._match_pattern(catalog._dataset_patterns, name)
        ds_config_copy = copy.deepcopy(
            catalog._dataset_patterns[matched_pattern]
        )
        matched_dataset = catalog._get_dataset(matched_pattern)

        ds_config = catalog._resolve_config(
            name, matched_pattern, ds_config_copy
        )
        
        # pop "type" once; popping it again in the except branch would raise KeyError
        dataset_type = ds_config.pop("type")
        try:
            # if dataset is from kedro-datasets
            dataset = load_obj(f"kedro_datasets.{dataset_type}")(**ds_config)
        except (AttributeError, ModuleNotFoundError):
            # if custom dataset, the type is already a full import path
            dataset = load_obj(dataset_type)(**ds_config)
    else:
        # if dataset is in catalog, just get it
        dataset = catalog._data_sets[name]
    return dataset

dataset = get_dataset(name="my_namespace.dataset", catalog=catalog)
dataset._describe()

👀 1

👍 1

Juan Luis

12/21/2023, 11:06 AM

catalog.load("my_namespace.my_dataset")  # Error

catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset")  # Success

...what kind of sorcery is this? 😅 @Ankita Katiyar any clues?

Lodewic van Twillert

12/21/2023, 11:08 AM

https://github.com/takikadiri/kedro-boot/blob/main/kedro_boot/runner.py#L76 Idea from kedro-boot btw, happened to be scrolling through earlier today and noticed that that is how they handle it

Nok Lam Chan

12/21/2023, 12:34 PM

Is there an example that I can reproduce? I can have a look at this

Nok Lam Chan

12/21/2023, 12:35 PM

def load(self, name: str, version: str = None) -> Any:
        ...
        load_version = Version(version, None) if version else None
        dataset = self._get_dataset(name, version=load_version)

def exists(self, name: str) -> bool:
        """Checks whether registered data set exists by calling its `exists()`
        method. Raises a warning and returns False if `exists()` is not
        implemented.

        Args:
            name: A data set to be checked.

        Returns:
            Whether the data set output exists.

        """
        try:
            dataset = self._get_dataset(name)

Looking through the code quickly they use the identical method to access the data so I am not sure how it could happen. Which version of Kedro are you using?

Lodewic van Twillert

12/21/2023, 12:41 PM

Dataset like this to test:

"{namespace}.dataset":
  type: json.JSONDataset
  filepath: "data/01_raw/{namespace}.json"

Then the code I sent earlier. Using kedro 0.18.14. In the meantime I upgraded to 0.19.1 and it seems to not work anymore with that version..? 🤷‍♂️

Ankita Katiyar

12/21/2023, 12:43 PM

It’s because the dataset factories datasets are only materialised when they’re first used and this logic is in

catalog._get_dataset()

which is called internally by the

exists()
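
A minimal sketch of the lazy behaviour described above, using a toy stand-in rather than Kedro itself. ToyCatalog, _pattern_to_regex and the regex-based matching are hypothetical names and simplifications introduced here for illustration; Kedro's real resolution logic differs in detail. The point is only that a factory pattern yields a concrete entry the first time the name is requested through _get_dataset, and not before.

```python
import re


def _pattern_to_regex(pattern):
    """Turn a factory pattern such as "{namespace}.dataset" into a regex."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = [
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    ]
    return re.compile("".join(pieces))


class ToyCatalog:
    """Toy model of a catalog with dataset factories (not Kedro's implementation)."""

    def __init__(self, patterns):
        self._patterns = patterns  # factory pattern -> config template
        self._datasets = {}        # concrete entries materialised so far

    def _get_dataset(self, name):
        # Materialise the entry on first use, then serve it from the cache.
        if name not in self._datasets:
            self._datasets[name] = self._resolve(name)
        return self._datasets[name]

    def _resolve(self, name):
        for pattern, template in self._patterns.items():
            match = _pattern_to_regex(pattern).fullmatch(name)
            if match:
                # Fill the placeholders in every string value of the template.
                return {
                    key: value.format(**match.groupdict())
                    if isinstance(value, str)
                    else value
                    for key, value in template.items()
                }
        raise KeyError(f"Dataset '{name}' not found in the catalog")


catalog = ToyCatalog(
    {
        "{namespace}.dataset": {
            "type": "json.JSONDataset",
            "filepath": "data/01_raw/{namespace}.json",
        }
    }
)
name = "my_namespace.dataset"
known_before = name in catalog._datasets  # nothing materialised yet
resolved = catalog._get_dataset(name)     # first use resolves the pattern
known_after = name in catalog._datasets   # now cached under its concrete name
```

In this toy, looking the name up directly in the internal dictionary before the first _get_dataset call fails, which mirrors the "different error" that shows up when the private dictionary is indexed by hand.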

Nok Lam Chan

12/21/2023, 12:45 PM

isn't the

load

method also using

self._get_dataset

Lodewic van Twillert

12/21/2023, 12:51 PM

catalog.load("my_namespace.my_dataset")  # Not an error 

catalog.exists("my_namespace.my_dataset")
catalog.load("my_namespace.my_dataset")  # Also Success

Oof my bad @Nok Lam Chan, you're right and just calling catalog.load() on a factory dataset just works. (Worked in 0.18.14 at least) My test case was just bad locally and I copied 😬 --- I have a bad habit of selecting datasets like

catalog._data_sets["my_namespace.my_dataset"].load()

and that failed because the dataset is not in that dictionary of

_data_sets

yet. Different error

👍 1

👍🏼 1

Ankita Katiyar

12/21/2023, 12:59 PM

@Lodewic van Twillert we’ve renamed it to

_datasets

now in 0.19.1 (removed the underscore between "data" and "sets")

👍 1

Nok Lam Chan

12/21/2023, 1:13 PM

(the _data_sets / DataSet / Dataset / dataset naming was a bit of legacy that we finally cleaned up in 0.19, so it may be causing some trouble, especially if you are accessing the internal variables)
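
Since the internal attribute was renamed between 0.18 and 0.19, code that peeks at it breaks on upgrade. If the internals really must be inspected, one hedged option is to look the attribute up under both names. The helper name materialised_datasets is hypothetical, the SimpleNamespace objects below are stand-ins for a real catalog, and both attributes are private, so this may stop working in any release. Calling catalog.load(name) remains the public route.

```python
from types import SimpleNamespace


def materialised_datasets(catalog):
    """Return the catalog's internal name -> dataset mapping under either name.

    Assumption: the mapping lives in `_datasets` (newer naming) or
    `_data_sets` (older naming). Both are private attributes.
    """
    for attr in ("_datasets", "_data_sets"):
        mapping = getattr(catalog, attr, None)
        if mapping is not None:
            return mapping
    raise AttributeError("catalog exposes neither _datasets nor _data_sets")


# Stand-ins for catalogs with the old and the new attribute name.
old_style = SimpleNamespace(_data_sets={"my_namespace.my_dataset": "old"})
new_style = SimpleNamespace(_datasets={"my_namespace.my_dataset": "new"})

old_value = materialised_datasets(old_style)["my_namespace.my_dataset"]
new_value = materialised_datasets(new_style)["my_namespace.my_dataset"]
```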

👍 1

Nok Lam Chan

12/21/2023, 1:15 PM

I wonder if @Hugo Evers original issue is solved.

And i want to load a dataset with results from a dataset factory, but i get an error stating that the dataset is not in the catalog.

Is this not possible? (edited)

It would be nice if you can include your Kedro version and catalog entry and the snippets of how you do this exactly.

Nok Lam Chan

12/21/2023, 1:15 PM

@Lodewic van Twillert No worries 😁thanks for helping our community and being an active member :)

Lodewic van Twillert

12/21/2023, 1:16 PM

Thanks, learned some things myself from this interaction 👍

Hugo Evers

12/21/2023, 2:47 PM

Thanks, it works when i do the namespace dataset, like:

"{namespace}.X":
  type: pandas.ParquetDataSet
  filepath: s3://..../data/04_intermediate/{namespace}_X.parquet

but i have a different deployment pattern with their own configs for aws batch. So my conf/aws_batch folder contains the following catalog.yml entry:

"{default_dataset}":
  type: extras.datasets.CloudpickleDataset
  folder: "s3://.../data/tmp/${mlflow_run_id|}/{default_dataset}.bin"

And i wanted to access the dataset and model associated with that MLflow run id. So i did

import mlflow
mlflow.start_run(run_id=....)

and then

%load_ext kedro.ipython

but I don't see an option to specify conf=aws_batch, only when I do session.run(**params). So I can use the output of session.run, but that is not as clean as loading the datasets directly. But maybe I am missing something?

Nok Lam Chan

12/21/2023, 3:01 PM

What is

aws_batch

here?

Nok Lam Chan

12/21/2023, 3:02 PM

do you mean the

conf_source

? or is it an environment

Nok Lam Chan

12/21/2023, 3:02 PM

For the former one, you should use

%reload_kedro

and you can specify

conf_source

as an argument

Hugo Evers

12/21/2023, 3:03 PM

its an env

Hugo Evers

12/21/2023, 3:03 PM

kedro run --env=aws_batch

Nok Lam Chan

12/21/2023, 3:19 PM

%reload_kedro --env=aws_batch

Hugo Evers

12/21/2023, 3:21 PM

cool! thanks, in that case i really did miss something, ill test whether it works. Can i also pass extra_params? probably yeah right. ill check, thanks!

Nok Lam Chan

12/21/2023, 4:08 PM

Yes
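
For use outside a notebook, the same environment selection can be sketched with Kedro's session API. This is an illustration under assumptions rather than something from the thread: load_catalog is a hypothetical helper name, and it assumes it is pointed at the root of a Kedro project with Kedro installed. The env and extra_params keywords of KedroSession.create play the role of the %reload_kedro arguments discussed above.

```python
from pathlib import Path


def load_catalog(project_path, env=None, extra_params=None):
    """Return the data catalog of a Kedro project for the given environment."""
    # Imported inside the function so that defining it does not require Kedro.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_path).resolve()
    bootstrap_project(project_path)  # registers the project's settings and pipelines
    with KedroSession.create(
        project_path=project_path, env=env, extra_params=extra_params
    ) as session:
        return session.load_context().catalog


# Example call, to be run from within a Kedro project:
# catalog = load_catalog(".", env="aws_batch")
```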
