# questions
z
Hello team. Is there a way to use `kedro.extras.datasets.spark.SparkDataSet` without installing the dependencies specified in `kedro[spark]`? I am on a Databricks cluster where the installation of pyspark is blocked.
n
As long as you have `pyspark` and `s3fs` installed, it should be fine.
z
Thanks @Nok Lam Chan. However, if I `pip install kedro[spark]`, it still tries to install `pyspark`. Annoyingly, even though this is a Databricks cluster, `pyspark` does not show up in `pip freeze`, even though I can `import pyspark`.
So the error goes:
• `pip install kedro[spark]`
• pip cannot see pyspark (although it is importable), so it tries to install it
• pyspark is blocked on our cluster
• fail
n
I am a bit confused: why do you need to run `pip install kedro[spark]` if you don't want to? As long as you have the library there, it should be fine.
This sounds a bit weird. Are you looking at the right environment? Are you using `!pip freeze` or `%pip`? If I remember correctly, the shell environment on Databricks is different from your Python environment.
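[Editor's note: the environment mismatch described above can be checked from inside the notebook itself. A minimal sketch, using only the standard library, that asks the current Python interpreter whether it can see pyspark, rather than trusting a shell-level `pip freeze`:]

```python
import importlib.util
import sys

# Ask *this* Python interpreter whether pyspark is importable,
# independently of what `pip freeze` reports from a shell subprocess
# (which may run against a different environment on Databricks).
spec = importlib.util.find_spec("pyspark")
print("pyspark importable:", spec is not None)
print("interpreter:", sys.executable)
```

If this prints `True` while `!pip freeze` shows nothing, the shell and the notebook are looking at different environments.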
z
Okay, I have resolved it. TL;DR: maybe it will be helpful for users to have more details on ImportError.
• Initially, it was my confusion that I needed to `pip install kedro[spark]` to make `spark.SparkDataSet` available, but it seems the code is always in the main kedro package.
• When I ran the pipeline with `__main__.py`, the error message hid the actual import error (it just threw a message that pointed me to a page in the Kedro documentation on managing dependencies).
• I tried `from kedro.extras.datasets.spark import SparkDataSet`. However, potentially due to `suppress(ImportError)`, the error was still not helpful: it just said it cannot import SparkDataSet from `kedro.extras.datasets.spark`.
• Finally, I ran `from kedro.extras.datasets.spark.spark_dataset import SparkDataSet`. That showed the real errors: `hdfs` and `s3fs` were not installed.
• After I installed the two packages, all import errors were solved and the pipeline is now happy.
Hope this makes sense, and thanks @Nok Lam Chan.
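[Editor's note: the failure mode described above can be illustrated with a small self-contained sketch (not Kedro's actual code): a loader that wraps an import in `suppress(ImportError)` loses the original error message, while importing the module directly surfaces the real cause.]

```python
from contextlib import suppress

# Hypothetical loader: the ImportError is silently swallowed, so the
# caller only learns "not available", never *which* dependency failed.
def try_load(module_name):
    with suppress(ImportError):
        return __import__(module_name)
    return None  # the underlying ImportError message is lost here

print(try_load("definitely_missing_dep"))  # -> None, with no hint why

# Importing the module directly re-raises the true error message:
try:
    __import__("definitely_missing_dep")
except ImportError as err:
    print("real cause:", err)
```

This is why importing the full concrete path (`kedro.extras.datasets.spark.spark_dataset`) revealed the missing `hdfs` and `s3fs` packages.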
n
@Zirui Xu Thank you for your very detailed response! This is indeed a problem caused by the import alias (the fact that you only need to type `pandas.CSVDataSet` instead of `kedro.extras.datasets.pandas.CSVDataSet`). So it is hard to determine whether a module was not found because of a missing dependency or a non-existing module. Under the hood, `kedro` will look for this module in a couple of places until it finds one. I'll try to look at it and see if there is something we can improve. For the time being, your debugging strategy is correct. I would also open up a Python console and import the full path!
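[Editor's note: a hedged sketch of the lookup behaviour described above. This is not Kedro's actual implementation; it only illustrates why searching several candidate locations makes "missing dependency" and "non-existing module" indistinguishable to the caller.]

```python
import importlib

# Resolve a short alias like "pandas.CSVDataSet" by trying several
# candidate module paths in turn. Both a genuinely absent module and a
# module whose own dependencies are missing raise ImportError, so the
# loop cannot tell the two cases apart.
def resolve(alias, search_prefixes=("kedro.extras.datasets.",)):
    module_path, _, class_name = alias.rpartition(".")
    for prefix in ("",) + tuple(search_prefixes):
        try:
            module = importlib.import_module(prefix + module_path)
            return getattr(module, class_name)
        except (ImportError, AttributeError):
            continue  # real cause is discarded; try the next location
    raise ImportError(f"Class {alias!r} not found in any search location")
```

For example, `resolve("json.JSONDecoder")` succeeds via the bare path, while a failing dataset alias only produces the generic final ImportError.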