# questions
g
Hi all! I'm having an issue importing a custom dataset. The pipeline is running on a Microsoft-hosted agent in Azure DevOps. If I use a macOS image it works perfectly, but if I switch to an Ubuntu image I get the error below. It seems related to the environment. Weirdly enough, we're developing on a Linux VM and it works fine locally. Any ideas?
```
DatasetError: An exception occurred when parsing config for dataset
'raw_default_dataset':
Class '<my-project>.datasets.spark_lowercase_dataset.SparkLowerDataset' not found, is
this a typo?
Hint: If you are trying to use a dataset from `kedro-datasets`, make sure that
the package is installed in your current environment. You can do so by running
`pip install kedro-datasets` or `pip install kedro-datasets[<dataset-group>]` to
install `kedro-datasets` along with related dependencies for the specific
dataset group.
```
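For context, an error like this means the dotted class path in the catalog could not be imported. The lookup can be reproduced outside Kedro with a small helper (illustrative, not Kedro's actual API), which pinpoints whether the module or the attribute is the part that fails on the Ubuntu agent:

```python
import importlib


def load_class(dotted_path: str):
    """Import `pkg.module.ClassName` and return the class, mimicking how a
    catalog entry's class path is resolved (illustrative helper only)."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)   # raises ModuleNotFoundError if the module is missing
    return getattr(module, class_name)              # raises AttributeError if the class name is wrong

# Run this in the failing job with the path from the catalog entry; here a
# stdlib class stands in for the project-specific one:
print(load_class("collections.OrderedDict"))
```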
m
Hi Gianni, is the dataset installed correctly inside the environment?
g
Hi Merel, yes, it's installed correctly. As mentioned, if I use a macOS image it works fine (same pipeline); I only get this error when I switch to an Ubuntu image. Everything else is the same, including the requirements.
m
This is outside of my own experience, so I'm asking ChatGPT for some help: It sounds like your custom dataset class (`SparkLowerDataset`) isn't being found when you switch to an Ubuntu image, but it works fine on macOS. Let's troubleshoot this step by step! Here's what might be going wrong:

🛠️ 1. Package Installation Issue
The error suggests the class isn't found; it could be that your package isn't installed in the Ubuntu environment. 👉 Check your install step. Make sure you're installing your project as a package:

```yaml
- script: pip install -e .
  displayName: "Install project package"
```

If you have a `requirements.txt` or `pyproject.toml`, make sure it's installed too:

```yaml
- script: pip install -r src/requirements.txt
  displayName: "Install requirements"
```
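To rule out a silent install failure on the agent, a small check like this (hypothetical helper name) can confirm the distribution is actually registered in the interpreter's environment:

```python
from importlib import metadata


def is_installed(dist_name: str) -> bool:
    """Return True if `pip install` actually registered this distribution
    in the current environment (illustrative helper)."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False

# Substitute the project's own distribution name here; `pip` is just a
# stand-in that is present in any pip-managed environment.
print(is_installed("pip"))
```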
📂 2. PYTHONPATH Issue
Ubuntu might not pick up the `src` directory by default, while macOS sometimes handles that more gracefully. 👉 Set `PYTHONPATH` for subsequent steps. Note that `##vso[task.prependpath]` modifies `PATH`, not `PYTHONPATH`; to set `PYTHONPATH` across steps, use `task.setvariable`:

```yaml
- script: echo "##vso[task.setvariable variable=PYTHONPATH]$(Build.SourcesDirectory)/src"
  displayName: "Set PYTHONPATH for subsequent steps"
```

Or export it within the step that needs it (a plain `export` does not persist to later steps):

```yaml
- script: |
    export PYTHONPATH=$(pwd)/src
    kedro run
  displayName: "Run with PYTHONPATH set"
```
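Whether the `src` directory actually made it onto the interpreter's search path can be verified from inside the job with a snippet like this (illustrative):

```python
import pathlib
import sys


def on_sys_path(directory: str) -> bool:
    """Check whether `directory` is on sys.path, resolving both sides so
    relative and absolute spellings compare equal (illustrative helper)."""
    target = pathlib.Path(directory).resolve()
    # An empty sys.path entry means the current working directory.
    return any(pathlib.Path(entry or ".").resolve() == target for entry in sys.path)

# In the pipeline this would be `on_sys_path("src")`; expect True once
# PYTHONPATH is set correctly.
print(on_sys_path("src"))
```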
🧩 3. Case Sensitivity in File/Folder Names
Ubuntu's filesystem is case-sensitive, while macOS's default filesystem is not. 👉 Double-check your file names:
• Is the file named exactly as expected? (`spark_lowercase_dataset.py`)
• Does your import match the case exactly? (`from <my_project>.datasets.spark_lowercase_dataset import SparkLowerDataset`)

🏷️ 4. Module Discovery & Project Name
If your project has a custom structure, `kedro` might not discover the dataset properly. 👉 Check your `pyproject.toml` or `setup.py`:
Make sure the package is defined correctly:

```toml
[tool.kedro]
package_name = "<my_project>"
```

And that your `setup.py` includes the right packages:

```python
packages=find_packages(where="src"),
package_dir={"": "src"},
```
🏃 5. Docker or Image Differences
If you're using Spark, Ubuntu might not have the right libraries installed, or the classpath might differ. 👉 Try adding any OS-level dependencies in your pipeline:

```yaml
- script: sudo apt-get update && sudo apt-get install -y openjdk-11-jdk
  displayName: "Install Java (for Spark)"
```
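The case-sensitivity point (step 3) is easy to test directly, since a file name can "exist" under a case-insensitive macOS filesystem while failing on Ubuntu. A sketch with a hypothetical helper:

```python
import pathlib


def exact_case_exists(path: str) -> bool:
    """True only if the final path component matches a directory entry
    byte-for-byte -- unlike Path.exists(), which is also satisfied by a
    case-insensitive match on macOS's default filesystem."""
    p = pathlib.Path(path)
    parent = p.parent
    if not parent.is_dir():
        return False
    return p.name in {child.name for child in parent.iterdir()}

# e.g. exact_case_exists("src/<my_project>/datasets/spark_lowercase_dataset.py")
# returning False on Ubuntu but Path.exists() returning True on macOS would
# confirm a casing mismatch.
```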
g
Hi Merel, thanks for the help! Unfortunately, none of the suggestions above worked, but I ended up fixing it. In case anyone else needs the information: it was related to the Python environment. I had to delete it and recreate it from scratch.
m
Ah, glad to hear you managed to fix it in the end! Do you know what went wrong in the environment?
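For anyone landing here later: "recreating the environment from scratch" on a hosted agent generally means not reusing any cached virtual environment. A fresh one per run might look like this (an illustrative sketch, assuming a venv-based setup and the paths used earlier in the thread):

```yaml
# Illustrative: build a throwaway virtual environment on every run
# instead of restoring a possibly stale cached one.
- script: |
    rm -rf .venv
    python -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    pip install -e .
    pip install -r src/requirements.txt
  displayName: "Recreate Python environment from scratch"
```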