When I say `catalog.list()` in a Kedro Jupyter Lab instance it ... (Kedro #questions)

Lukas Innig

11/15/2023, 10:31 AM

When I say `catalog.list()` in a Kedro Jupyter Lab instance, it doesn’t return all registered datasets. Anything that’s using a dataset factory seems to be missing. Is there a way to infer those somehow?

Ankita Katiyar

11/15/2023, 10:40 AM

The factory datasets are registered after they are first used during a session run. Checking for their existence with `catalog.exists(dataset_name)` will also register them in the catalog:

# `pipelines` is the dict of registered pipelines available in a Kedro Jupyter session
for dataset in pipelines["__default__"].data_sets():
    catalog.exists(dataset)

And then `catalog.list()` should list them.
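
To make the behaviour described above concrete, here is a minimal sketch in plain Python. It is a toy model, not Kedro's implementation: `ToyCatalog`, its `_resolve` helper, and the `{name}_model` pattern are all hypothetical names made up for illustration. It only shows the idea that pattern-based entries are invisible to `list()` until something asks about a matching name.

```python
import re


class ToyCatalog:
    """Toy stand-in for a catalog with pattern ("factory") entries. Not Kedro code."""

    def __init__(self, explicit, patterns):
        self._datasets = dict(explicit)  # entries registered up front
        self._patterns = dict(patterns)  # pattern -> config template

    def list(self):
        # Only reports what has actually been registered so far.
        return sorted(self._datasets)

    def _resolve(self, name):
        # Turn "{name}_model" into a regex with one named group per placeholder.
        for pattern, template in self._patterns.items():
            parts = re.split(r"\{(\w+)\}", pattern)
            regex = "".join(
                f"(?P<{part}>.+)" if i % 2 else re.escape(part)
                for i, part in enumerate(parts)
            )
            match = re.fullmatch(regex, name)
            if match:
                groups = match.groupdict()
                return {
                    key: value.format(**groups) if isinstance(value, str) else value
                    for key, value in template.items()
                }
        return None

    def exists(self, name):
        # Toy simplification: True means "an entry could be resolved".
        # The side effect is the point: a matching pattern gets registered.
        if name not in self._datasets:
            config = self._resolve(name)
            if config is None:
                return False
            self._datasets[name] = config
        return True


catalog = ToyCatalog(
    explicit={"companies": {"type": "csv", "filepath": "data/companies.csv"}},
    patterns={"{name}_model": {"type": "pickle", "filepath": "data/{name}.pkl"}},
)
print(catalog.list())  # ['companies']
for name in ["companies", "regression_model"]:
    catalog.exists(name)
print(catalog.list())  # ['companies', 'regression_model']
```

The loop at the end plays the role of iterating over a pipeline's dataset names: the catalog itself never learns which names exist until it is asked about each one.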

Lukas Innig

11/15/2023, 10:42 AM

that’s brilliant! Thank you

Lukas Innig

11/15/2023, 10:44 AM

aha, and custom datasets will have to implement an `_exists` method for this to work. That’s quite useful!

Ankita Katiyar

11/15/2023, 10:47 AM

It should work without the `_exists` method too, since internally it just checks whether the catalog entry exists and doesn’t call the dataset’s `exists` method

Lukas Innig

11/15/2023, 10:48 AM

I am getting a

2023-11-15 10:42:25,539 - kedro.io.core - WARNING - 'exists()' not implemented for 'DataRobotProjectDataset'. Assuming output does not exist.

But it’s fine - I’m actually rather happy to implement a custom `exists` method

Ankita Katiyar

11/15/2023, 10:52 AM

Ah, my bad. It does need it!

Ankita Katiyar

11/15/2023, 10:53 AM

It’ll throw a warning but the dataset should still get registered

Lukas Innig

11/15/2023, 10:53 AM

ah you’re right. It does appear in the list
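
The warning quoted above suggests a base class whose default `_exists` logs a message and reports that the data does not exist. The sketch below imitates that pattern in plain Python so it runs without Kedro installed. `ToyBaseDataset`, `NoExistsDataset`, and `FileBackedDataset` are hypothetical names, and the real Kedro base class has more required methods than shown here.

```python
import logging
from pathlib import Path

logger = logging.getLogger("toy.io")


class ToyBaseDataset:
    """Toy base class imitating the 'exists() not implemented' fallback."""

    def exists(self) -> bool:
        # Public entry point delegates to the overridable hook.
        return self._exists()

    def _exists(self) -> bool:
        # Default hook: warn and assume the data is not there.
        logger.warning(
            "'exists()' not implemented for '%s'. Assuming output does not exist.",
            type(self).__name__,
        )
        return False


class NoExistsDataset(ToyBaseDataset):
    """Custom dataset that does not override the hook, so it triggers the warning."""


class FileBackedDataset(ToyBaseDataset):
    """Custom dataset that overrides the hook with a real check."""

    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def _exists(self) -> bool:
        return self._filepath.exists()
```

With this shape, calling `exists()` on `NoExistsDataset` only produces a warning and `False`, which matches the observation in the thread that the dataset still ends up registered despite the warning.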

Lukas Innig

11/15/2023, 10:54 AM

cool cool - super helpful. Thank you 🙏

❤️ 1

Nok Lam Chan

11/15/2023, 11:34 AM

@Lukas Innig What did you expect to see instead? Factory datasets are created during a pipeline run, so the result is going to depend on which pipeline you are running.

Nok Lam Chan

11/15/2023, 11:36 AM

On the other hand, `catalog.list(pipeline=<name>)` is not possible because the catalog is not aware of a Pipeline object. This is coordinated by the session/runner instead 😶

Nok Lam Chan

11/15/2023, 11:37 AM

(just brainstorming out loud about how we can make this a better experience)

Lukas Innig

11/15/2023, 11:37 AM

I guess I was expecting to see all the datasets that I specified in catalog.yml. To be honest, I haven’t thought too deeply about it

👍🏼 1

Nok Lam Chan

11/15/2023, 11:49 AM

I created this ticket to document the workaround: https://github.com/kedro-org/kedro/issues/3312 I can’t think of a nice solution just now, but I will keep this in mind.

🔥 1
