Zirui Xu
10/27/2022, 4:34 PM
[...] `kedro.extras.datasets.spark.SparkDataSet` without installing dependencies specified in `kedro[spark]`? I am on a Databricks cluster where the installation of pyspark is blocked.

Nok Lam Chan
10/27/2022, 4:45 PM
[...] `pyspark` and `s3fs` installed it should be fine.

Zirui Xu
10/27/2022, 4:59 PM
[...] `pip install kedro[spark]` it still tries to install `pyspark`.
Annoyingly, even though this is a Databricks cluster, if I `pip freeze`, pyspark is not there, even though I can `import pyspark`.
• `pip install kedro[spark]` -> pip cannot see pyspark (although it is import-able), so it tries to install it
• pyspark is blocked on our cluster
• fail

Nok Lam Chan
10/27/2022, 5:13 PM
[...] `pip install kedro[spark]` if you don’t want to? As long as you have the library there it should be fine.
[...] `!pip freeze` or `%pip`? The shell environment on Databricks is different from your Python environment, if I remember correctly.

Zirui Xu
10/27/2022, 5:24 PM
[...] `pip install kedro[spark]` to make `spark.SparkDataSet` available - but it seems the code is always in the main kedro package.
• When I ran the pipeline with `__main__.py`, the error message hid the actual import error (it just threw a message that pointed me to a page in the kedro documentation on managing dependencies).
• I tried `from kedro.extras.datasets.spark import SparkDataSet`. However, potentially due to `suppress(ImportError)`, the error was still not helpful - it just said it cannot import `SparkDataSet` from `kedro.extras.datasets.spark`.
• Finally I ran `from kedro.extras.datasets.spark.spark_dataset import SparkDataSet`. That showed the real errors: `hdfs` and `s3fs` were not installed.
• After I installed the two packages, all import errors were solved and the pipeline is now happy.

Nok Lam Chan
10/28/2022, 11:05 AM
[...] `pandas.CSVDataSet` instead of `kedro.extras.datasets.pandas.CSVDataSet`. So it is hard to determine whether a module is not found because of a missing dependency or because the module does not exist. Under the hood, `kedro` will look for this module in a couple of places until it finds one.
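To illustrate why the two failure cases blur together, here is a minimal sketch of a shorthand lookup that tries several prefixes in turn. It is not kedro's actual implementation: the `resolve` function and the `PREFIXES` tuple are hypothetical, and standard-library modules stand in for dataset packages.

```python
import importlib

# Hypothetical prefixes, tried in order. This is NOT kedro's real list.
PREFIXES = ("", "email.")


def resolve(short_name, prefixes=PREFIXES):
    """Return the first object found for a dotted `short_name` under any prefix."""
    for prefix in prefixes:
        module_path, _, attr = (prefix + short_name).rpartition(".")
        if not module_path:
            continue  # nothing to import for an undotted name
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Either the prefix is wrong, or the module exists but one of
            # its own dependencies is missing. Both raise ImportError, so
            # the lookup cannot tell them apart and simply moves on.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise LookupError(f"Cannot find {short_name!r} under any of {prefixes}")


# "mime.text.MIMEText" is not importable as written, but it is found
# under the "email." prefix.
print(resolve("mime.text.MIMEText"))
```

Because every `ImportError` is swallowed inside the loop, the final error can only say that nothing was found, which matches the unhelpful message described above.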
I’ll try to look at it and see if there is something that we can improve.
For the time being, your debugging strategy is correct. I would also open up a Python console and import the full path!
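The "import the full path" check can be sketched as a small helper. The function name `real_import_error` is made up for illustration; the module path is the one from the thread, and the snippet assumes nothing about what is installed.

```python
import importlib


def real_import_error(module_path):
    """Import `module_path` directly and return the original ImportError, or None."""
    try:
        importlib.import_module(module_path)
    except ImportError as exc:
        return exc
    return None


# Import the module that defines the dataset, not the package that re-exports it.
err = real_import_error("kedro.extras.datasets.spark.spark_dataset")
if err is None:
    print("import succeeded")
else:
    # `err.name` is the module that could not be found, when Python knows it.
    print(f"{type(err).__name__}: {err} (missing module: {err.name})")
```

Importing the defining module directly bypasses any `suppress(ImportError)` in the package's `__init__`, so the missing dependency is reported by name.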