# questions
z
Hello team. Is there a way to use `kedro.extras.datasets.spark.SparkDataSet` without installing the dependencies specified in `kedro[spark]`? I am on a Databricks cluster where the installation of pyspark is blocked.
n
As long as you have `pyspark` and `s3fs` installed, it should be fine.
z
Thanks @Nok Lam Chan. However, if I `pip install kedro[spark]`, it still tries to install `pyspark`. Annoyingly, even though this is a Databricks cluster, `pyspark` does not show up in `pip freeze`, even though I can `import pyspark`.
So the error goes:
• `pip install kedro[spark]`
• pip cannot see pyspark (although it is importable), so it tries to install it
• pyspark is blocked on our cluster
• fail
n
I am a bit confused: why do you need to run `pip install kedro[spark]` if you don't want to? As long as you have the library there, it should be fine.
This sounds a bit weird. Are you looking at the right environment? Are you using `!pip freeze` or `%pip`? If I remember correctly, the shell environment on Databricks is different from your Python environment.
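[Editor's note: the environment mismatch described above can be checked from inside the notebook itself. A minimal sketch, using only the standard library, that asks the current Python interpreter whether it can see pyspark, rather than trusting a shell-level `pip freeze`:]

```python
import importlib.util
import sys

# Ask *this* Python interpreter whether pyspark is importable,
# independently of what `pip freeze` reports from a shell subprocess
# (which may run against a different environment on Databricks).
spec = importlib.util.find_spec("pyspark")
print("pyspark importable:", spec is not None)
print("interpreter:", sys.executable)
```

If this prints `True` while `!pip freeze` shows nothing, the shell and the notebook are looking at different environments.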
z
Okay, I have resolved it. TL;DR: maybe it will be helpful for users to have more details on ImportError.
• Initially, it was my confusion that I needed to `pip install kedro[spark]` to make `spark.SparkDataSet` available, but it seems the code is always in the main kedro package.
• When I ran the pipeline with `__main__.py`, the error message hid the actual import error (it just threw a message that pointed me to a page in the Kedro documentation on managing dependencies).
• I tried `from kedro.extras.datasets.spark import SparkDataSet`. However, potentially due to `suppress(ImportError)`, the error was still not helpful: it just said it cannot import SparkDataSet from `kedro.extras.datasets.spark`.
• Finally, I ran `from kedro.extras.datasets.spark.spark_dataset import SparkDataSet`. That showed the real errors: `hdfs` and `s3fs` were not installed.
• After I installed the two packages, all import errors were solved and the pipeline is now happy.
Hope this makes sense, and thanks @Nok Lam Chan.
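[Editor's note: the failure mode described above can be illustrated with a small self-contained sketch (not Kedro's actual code): a loader that wraps an import in `suppress(ImportError)` loses the original error message, while importing the module directly surfaces the real cause.]

```python
from contextlib import suppress

# Hypothetical loader: the ImportError is silently swallowed, so the
# caller only learns "not available", never *which* dependency failed.
def try_load(module_name):
    with suppress(ImportError):
        return __import__(module_name)
    return None  # the underlying ImportError message is lost here

print(try_load("definitely_missing_dep"))  # -> None, with no hint why

# Importing the module directly re-raises the true error message:
try:
    __import__("definitely_missing_dep")
except ImportError as err:
    print("real cause:", err)
```

This is why importing the full concrete path (`kedro.extras.datasets.spark.spark_dataset`) revealed the missing `hdfs` and `s3fs` packages.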
n
@Zirui Xu Thank you for your very detailed response! This is indeed a problem caused by the import alias (the fact that you only need to type `pandas.CSVDataSet` instead of `kedro.extras.datasets.pandas.CSVDataSet`). So it is hard to determine whether a module was not found because of a missing dependency or a non-existing module. Under the hood, `kedro` will look for this module in a couple of places until it finds one. I'll try to look at it and see if there is something we can improve. For the time being, your debugging strategy is correct. I would also open up a Python console and import the full path!
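[Editor's note: a hedged sketch of the lookup behaviour described above. This is not Kedro's actual implementation; it only illustrates why searching several candidate locations makes "missing dependency" and "non-existing module" indistinguishable to the caller.]

```python
import importlib

# Resolve a short alias like "pandas.CSVDataSet" by trying several
# candidate module paths in turn. Both a genuinely absent module and a
# module whose own dependencies are missing raise ImportError, so the
# loop cannot tell the two cases apart.
def resolve(alias, search_prefixes=("kedro.extras.datasets.",)):
    module_path, _, class_name = alias.rpartition(".")
    for prefix in ("",) + tuple(search_prefixes):
        try:
            module = importlib.import_module(prefix + module_path)
            return getattr(module, class_name)
        except (ImportError, AttributeError):
            continue  # real cause is discarded; try the next location
    raise ImportError(f"Class {alias!r} not found in any search location")
```

For example, `resolve("json.JSONDecoder")` succeeds via the bare path, while a failing dataset alias only produces the generic final ImportError.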