I've run into two blockers (probably a newbie mistake) | Kedro #questions

Sergey S

11/14/2023, 12:10 AM

I've run into two blockers (probably a newbie mistake): 1. Following the docs about custom datasets, I've created a dummy custom dataset under `kedro-example/src/kedro-example/datasets.py`:

class KedroCocoDataset(AbstractDataset):
...

Then in `catalog.yaml`:

coco_dataset:
  type: kedro-example.datasets.KedroCocoDataset
  filepath: data/01_raw/014-playment-parts-100-images-variant-a.json

However, in the notebook (started by `kedro jupyter lab`), when running `%reload_kedro` I get an error:

DatasetError: An exception occurred when parsing config for dataset 'coco_dataset':
Class 'kedro-example.datasets.KedroCocoDataset' not found or one of its dependencies has not been installed.

Any help is appreciated.

marrrcin

11/14/2023, 7:50 AM

Do you have proper imports in `kedro-example/src/kedro-example/datasets.py`? Looks like a Python-level issue rather than a Kedro issue.

Nok Lam Chan

11/14/2023, 8:46 AM

Is it a typo? It should be `kedro_example` instead of `kedro-example`. This is a Python convention: you cannot have `-` in a package name used as a namespace. P.S. The most notable example is probably `scikit-learn`: you do `pip install scikit-learn`, but you do `import sklearn` instead of `import scikit-learn`.
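To make the naming rule concrete, here is a minimal sketch that checks whether each part of a dotted path is a valid Python identifier. The helper name `is_importable_name` is hypothetical and only for illustration; it is not part of Kedro, and Kedro does not necessarily perform this check itself.

```python
import keyword


def is_importable_name(dotted_path: str) -> bool:
    """Return True if every dot-separated part is a valid, non-keyword identifier."""
    return all(
        part.isidentifier() and not keyword.iskeyword(part)
        for part in dotted_path.split(".")
    )


# A hyphen is not allowed in an identifier, so the first path can never be imported.
print(is_importable_name("kedro-example.datasets.KedroCocoDataset"))  # False
print(is_importable_name("kedro_example.datasets.KedroCocoDataset"))  # True
```

The same check explains the `scikit-learn` versus `sklearn` split: the distribution name may contain a hyphen, but the importable package name may not.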

Nok Lam Chan

11/14/2023, 8:48 AM

If you do `kedro new`, this should be handled properly; if not, then it's a bug on our side 👀

Nok Lam Chan

11/14/2023, 8:50 AM

The project name 'kedro-example' has been applied to: 
- The project title in /Users/Nok_Lam_Chan/GitHub/kedro/kedro-example/README.md 
- The folder created for your project in /Users/Nok_Lam_Chan/GitHub/kedro/kedro-example 
- The project's python package in /Users/Nok_Lam_Chan/GitHub/kedro/kedro-example/src/kedro_example

I quickly tested it with the latest release, and it is working as expected.

Lukas Innig

11/14/2023, 10:23 AM

The way I debug these issues is to try to import the dataset from jupyter. The error is then a lot easier to debug
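A minimal sketch of that debugging approach, runnable in a notebook cell: import the class by its dotted path yourself so the underlying Python error is shown directly. The helper name `load_class` is hypothetical, and this is not how Kedro resolves dataset types internally; it only reproduces the import step by hand.

```python
import importlib


def load_class(dotted_path: str):
    """Import `package.module.ClassName` and return the class, letting errors surface."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)  # ModuleNotFoundError if the module is missing
    return getattr(module, class_name)  # AttributeError if the class is missing


try:
    load_class("kedro-example.datasets.KedroCocoDataset")
except ModuleNotFoundError as exc:
    # The message names the exact module that could not be found.
    print(f"Import failed: {exc}")
```

If the module itself imports fine but one of its own imports is missing, the traceback points at that dependency instead, which separates the two cases.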

Nok Lam Chan

11/14/2023, 10:41 AM

@Lukas Innig that's true. See https://github.com/kedro-org/kedro/pull/3272; in the next release this should hopefully be a nicer experience. In the past we could not differentiate between a missing module and a missing dependency; it is possible now that kedro-datasets is lazy loading.

Sergey S

11/14/2023, 12:14 PM

Thanks, issue #1 is fixed; it was indeed an issue with using `kedro-example` instead of `kedro_example` in the code.

Issue #2 remains. I followed the project structure of the `iris-pandas` example (`pipelines.py` and `nodes.py` in the `src` directory) and my test pipeline works 🥳 However, if I move `pipelines.py` and `nodes.py` to a subdirectory called `pipelines` (see image below), I get the error `ValueError: Pipeline contains no nodes after applying all provided filters`.

The issue seems to be that `find_pipelines()`, used in `pipeline_registry.py`, does not find my pipeline if it's defined in a subfolder. If I manually import my pipeline from the subdirectory (`from kedro_example.pipelines.pipeline import create_pipeline`) and manually assign it (`pipelines["__default__"] = create_pipeline()`), then everything works.

Sergey S

11/14/2023, 12:35 PM

Following the `spaceflight` example, I've added `from .pipeline import create_pipeline` to `__init__.py` in the pipelines directory, but that did not help when using `pipelines = find_pipelines()` in `pipeline_registry.py`.

Sergey S

11/14/2023, 12:44 PM

Copying the directory structure of `spaceflight` works:

src/
  kedro-proj/
    pipelines/
      data_processing/
        __init__.py # with 'from .pipeline import create_pipeline'
        pipeline.py
        nodes.py

Defining my pipeline directly inside the `pipelines` directory does not work:

src/
  kedro-proj/
    pipelines/
      __init__.py # with 'from .pipeline import create_pipeline'
      pipeline.py
      nodes.py

Nok Lam Chan

11/14/2023, 1:32 PM

As you have found out yourself, you can always import the pipeline manually. `find_pipelines` is a helper function that automatically discovers your pipelines. It is not magic: Kedro doesn't know where your pipeline is, so it relies on a standard structure, and yours doesn't follow it. For autodiscovery to work, you need to create a subfolder in `pipelines`. The easiest way to do so is to use the CLI: `kedro pipeline create`.

Nok Lam Chan

11/14/2023, 1:32 PM

You can read more about how autodiscovery works here: https://docs.kedro.org/en/latest/nodes_and_pipelines/pipeline_registry.html#pipeline-autodiscovery
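As a rough illustration of why the flat layout is skipped, here is a simplified sketch of directory-based discovery. It is not Kedro's actual `find_pipelines` implementation; the function `discover_pipelines`, the package name `demo_project`, and the string return values standing in for real `Pipeline` objects are all assumptions made for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def discover_pipelines(package_name: str) -> dict:
    """Call create_pipeline() in each sub-package of `<package_name>.pipelines`."""
    pipelines_pkg = importlib.import_module(f"{package_name}.pipelines")
    found = {}
    for pkg_dir in map(Path, pipelines_pkg.__path__):
        for child in sorted(pkg_dir.iterdir()):
            # Only sub-directories count; files directly in pipelines/ are ignored.
            if not child.is_dir() or child.name.startswith((".", "_")):
                continue
            module = importlib.import_module(f"{package_name}.pipelines.{child.name}")
            factory = getattr(module, "create_pipeline", None)
            if callable(factory):
                found[child.name] = factory()
    return found


# Build a throwaway project on disk containing both layouts from the thread.
root = Path(tempfile.mkdtemp())
pipelines_dir = root / "demo_project" / "pipelines"
nested = pipelines_dir / "data_processing"
nested.mkdir(parents=True)
(root / "demo_project" / "__init__.py").write_text("")
# Flat layout: create_pipeline lives directly in pipelines/ (not discovered).
(pipelines_dir / "__init__.py").write_text("from .pipeline import create_pipeline\n")
(pipelines_dir / "pipeline.py").write_text("def create_pipeline():\n    return 'flat'\n")
# Nested layout: one sub-package per pipeline (discovered).
(nested / "__init__.py").write_text("from .pipeline import create_pipeline\n")
(nested / "pipeline.py").write_text("def create_pipeline():\n    return 'nested'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()
print(discover_pipelines("demo_project"))  # {'data_processing': 'nested'}
```

Because the loop only looks at sub-directories of `pipelines/`, a `pipeline.py` placed directly in `pipelines/` is never visited, which matches the behaviour described in this thread.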

Sergey S

11/14/2023, 7:52 PM

Gotcha, thanks! I had assumed that `find_pipelines` recursively iterates over the whole `src` folder to find my pipelines. That it requires a default folder structure explains why it didn't work, thanks!

Nok Lam Chan

11/15/2023, 9:49 AM

I think there is a benefit to adopting some patterns here. Kedro tries to promote best practices, and having arbitrary pipeline files may not be the best way to do it. On the other hand, I also don't think this is the ONLY possible structure; it is what the Kedro starters promote, and it matches the Kedro modular pipeline structure. We could potentially make it configurable to recognise different structures, but this hasn't been raised a lot. If you are interested, feel free to raise an issue or PR.
