# questions
s
I've run into two blockers (probably a newbie mistake): 1. Following the docs about custom datasets, I've created a dummy custom dataset under `kedro-example/src/kedro-example/datasets.py`:
class KedroCocoDataset(AbstractDataset):
...
Then in
catalog.yaml
coco_dataset:
  type: kedro-example.datasets.KedroCocoDataset
  filepath: data/01_raw/014-playment-parts-100-images-variant-a.json
However, in the notebook (started by
kedro jupyter lab
) when running
%reload_kedro
I get an error:
DatasetError: An exception occurred when parsing config for dataset 'coco_dataset':
Class 'kedro-example.datasets.KedroCocoDataset' not found or one of its dependencies has not been installed.
Any help is appreciated.
m
Do you have proper imports in the
kedro-example/src/kedro-example/datasets.py
? This looks like a Python-level issue rather than a Kedro issue.
n
Is it a typo? It should be
kedro_example
instead of
kedro-example
. Python does not allow
-
in a package or module name used in an import statement. P.S. The most notable example is probably
scikit-learn
, you do
pip install scikit-learn
but you do
import sklearn
instead of
import scikit-learn
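To make the naming rule concrete, here is a minimal sketch that checks whether every part of a dotted class path (such as the `type:` value in `catalog.yaml`) is a valid Python identifier. The helper name `is_valid_dotted_path` is made up for illustration and is not part of Kedro.

```python
import keyword


def is_valid_dotted_path(path: str) -> bool:
    """Return True if every dot-separated part could appear in an import statement."""
    # Hypothetical helper for illustration only; not a Kedro API.
    return all(
        part.isidentifier() and not keyword.iskeyword(part)
        for part in path.split(".")
    )


# The hyphenated project name fails the check; the underscored package name passes.
print(is_valid_dotted_path("kedro-example.datasets.KedroCocoDataset"))  # False
print(is_valid_dotted_path("kedro_example.datasets.KedroCocoDataset"))  # True
```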
If you run
kedro new
this should be handled properly; if not, it's a bug on our side 👀
The project name 'kedro-example' has been applied to: 
- The project title in /Users/Nok_Lam_Chan/GitHub/kedro/kedro-example/README.md 
- The folder created for your project in /Users/Nok_Lam_Chan/GitHub/kedro/kedro-example 
- The project's python package in /Users/Nok_Lam_Chan/GitHub/kedro/kedro-example/src/kedro_example
I quickly tested it with the latest release, and it works as expected.
l
The way I debug these issues is to try to import the dataset directly from Jupyter. The underlying error is then a lot easier to see.
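A minimal sketch of that debugging approach, runnable in a notebook cell: split the dotted path from the catalog entry into module and class, import it with `importlib`, and surface the original exception instead of the wrapped `DatasetError`. The helper name `try_import_class` is an assumption for illustration, not a Kedro function.

```python
import importlib


def try_import_class(dotted_path: str):
    """Import `package.module.ClassName`; return the class, or the exception raised."""
    # Hypothetical helper for illustration only; not a Kedro API.
    module_path, _, class_name = dotted_path.rpartition(".")
    try:
        module = importlib.import_module(module_path)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        # Returning the exception keeps the real cause visible for inspection.
        return exc


# The path from the catalog entry in this thread; no package with this name is importable.
result = try_import_class("kedro-example.datasets.KedroCocoDataset")
print(type(result).__name__, result)
```

If the module imports but the class name is misspelled, the same helper returns an `AttributeError` instead, which separates the two failure modes.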
n
@Lukas Innig that's true. With https://github.com/kedro-org/kedro/pull/3272 this should hopefully be a nicer experience in the next release. In the past we could not differentiate between a missing module and a missing dependency; that is possible now because kedro-datasets is lazily loaded.
s
Thanks, issue #1 is fixed; it was indeed caused by using
kedro-example
instead of
kedro_example
in the code. Issue #2 remains. I followed the project structure of the
iris-pandas
example (
pipeline.py
and
nodes.py
in the
src
directory) and my test pipeline works 🥳 However, if I move
pipeline.py
and
nodes.py
to a subdirectory called
pipelines
(see image below), I get the error
ValueError: Pipeline contains no nodes after applying all provided filters
The issue seems to be that
find_pipelines()
defined in
pipeline_registry.py
does not find my pipeline if it's defined in a subfolder. If I manually import my pipeline (
from kedro_example.pipelines.pipeline import create_pipeline
) from the subdirectory and manually assign it (
pipelines["__default__"] = create_pipeline()
) then everything works.
Following the
spaceflight
example, I've added
from .pipeline import create_pipeline
to
__init__.py
in the pipelines directory, but that did not help when using
pipelines = find_pipelines()
in the
pipeline_registry.py
Copying the directory structure of
spaceflight
works:
src/
  kedro-proj/
    pipelines/
      data_processing/
        __init__.py # with 'from .pipeline import create_pipeline'
        pipeline.py
        nodes.py
Defining my pipeline directly inside the
pipelines
directory does not work:
src/
  kedro-proj/
    pipelines/
      __init__.py # with 'from .pipeline import create_pipeline'
      pipeline.py
      nodes.py
n
As you have found out yourself, you can always import the pipeline manually. find_pipelines is a helper function that automatically discovers your pipelines. It is not magic: Kedro doesn't know where your pipeline is, so it relies on a standard structure, which yours does not follow. For it to work, you need to create a subfolder in pipelines. The easiest way to do so is to use the CLI command 'kedro pipeline create'.
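To illustrate why the layout matters, here is a simplified, standard-library-only model of the behaviour described above. It is not Kedro's actual implementation: `discover_pipelines`, `make_project`, and the package names `nested_proj` and `flat_proj` are all made up for this sketch. It assumes discovery looks only at direct subdirectories of the `pipelines` package and calls `create_pipeline()` in each one.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def discover_pipelines(package_name: str) -> dict:
    """Collect create_pipeline() results from direct subpackages of <package>.pipelines."""
    # Simplified model for illustration; not Kedro's find_pipelines().
    pipelines_pkg = importlib.import_module(f"{package_name}.pipelines")
    found = {}
    for entry in sorted(Path(pipelines_pkg.__file__).parent.iterdir()):
        # Only subdirectories count; loose files such as pipelines/pipeline.py are skipped.
        if not entry.is_dir() or entry.name.startswith(("_", ".")):
            continue
        module = importlib.import_module(f"{package_name}.pipelines.{entry.name}")
        if hasattr(module, "create_pipeline"):
            found[entry.name] = module.create_pipeline()
    return found


def make_project(root: Path, package_name: str, nested: bool) -> None:
    """Write a throwaway package using either the nested or the flat layout."""
    pipelines_dir = root / package_name / "pipelines"
    target = pipelines_dir / "data_processing" if nested else pipelines_dir
    target.mkdir(parents=True)
    (root / package_name / "__init__.py").write_text("")
    (pipelines_dir / "__init__.py").write_text("")
    (target / "__init__.py").write_text("from .pipeline import create_pipeline\n")
    # A list stands in for a real pipeline object.
    (target / "pipeline.py").write_text("def create_pipeline():\n    return ['node']\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_project(root, "nested_proj", nested=True)
    make_project(root, "flat_proj", nested=False)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        nested_result = discover_pipelines("nested_proj")
        flat_result = discover_pipelines("flat_proj")
    finally:
        sys.path.remove(tmp)

print(nested_result)  # {'data_processing': ['node']}
print(flat_result)    # {}
```

In the flat layout, `create_pipeline` is importable from `flat_proj.pipelines` itself, yet the subdirectory scan never looks there, which mirrors the empty result reported in this thread.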
s
Gotcha, thanks! I assumed that
find_pipelines
recursively iterates over the whole
src
folder to find my pipelines. The fact that it requires a default folder structure explains why it didn't work, thanks!
n
I think there is a benefit to adopting some patterns here. Kedro tries to promote best practices, and having arbitrary pipeline files may not be the best way to do it. On the other hand, I don't think this is the ONLY possible structure; it is what the Kedro starters promote, and it matches the Kedro modular pipeline structure. We could potentially make it configurable to recognise different structures, but this hasn't been raised much. If you are interested, feel free to raise an issue or PR.