# questions
a
Hi everyone, I'm having an issue with the SparkDataset in Kedro. I have my data stored in DBFS and installed kedro-datasets[spark.SparkDataset]==4.1.0. However, I get strange errors about missing modules such as hdfs, s3fs, and so on. I feel like I'm missing some parent dependency that would pull all of these in. I couldn't find anything online; does anyone know what I'm missing, rather than trying to install all of these manually? A specific error message is, for example:
DatasetError: An exception occurred when parsing config for dataset '<DATASETNAME>':
No module named 'hdfs'. Please see the documentation on how to install relevant dependencies for kedro_datasets.spark.SparkDataset:
👀 1
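For context, this kind of error typically comes from the dataset lazily importing its optional I/O backends when the catalog config is parsed, so a missing package surfaces as `No module named ...`. A minimal sketch of that pattern (hypothetical names, not the actual kedro-datasets source):

```python
import importlib


class DatasetError(Exception):
    """Stand-in for the error type raised when dataset config parsing fails."""


def resolve_backend(module_name, dataset="kedro_datasets.spark.SparkDataset"):
    """Import an optional I/O backend, converting a missing module into a
    config-time DatasetError like the one quoted above."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        raise DatasetError(
            f"{err}. Please see the documentation on how to install "
            f"relevant dependencies for {dataset}"
        ) from err


# A module that exists resolves normally; a missing one raises DatasetError.
resolve_backend("json")
```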
e
Hi Alexis, could you please share some more details: installation command you use, pip version and python version?
I’ve just checked with
pip 24.2, python 3.11
`pip install 'kedro-datasets[spark.SparkDataset]==4.1.0'` seems to work fine
As well as
`pip install 'kedro-datasets[spark-sparkdataset]==4.1.0'`
a
We have a custom pipeline that does our Python installs; let me see if I can find out!
👍 1
I can see that it's installing correctly, but I get an error about s3fs missing. Strange
Do you happen to know if s3fs is a dependency?
e
does
pip freeze
show you all dependencies installed?
Yes, it should be installed as a dependency. That’s what I have:
(test-env) ➜  ~ pip freeze
aiobotocore==2.13.3
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aioitertools==0.11.0
aiosignal==1.3.1
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
arrow==1.3.0
attrs==24.2.0
binaryornot==0.4.4
botocore==1.34.162
build==1.2.1
cachetools==5.5.0
certifi==2024.7.4
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cookiecutter==2.6.0
docopt==0.6.2
dynaconf==3.2.6
frozenlist==1.4.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
hdfs==2.7.3
idna==3.8
importlib_metadata==8.4.0
importlib_resources==6.4.4
Jinja2==3.1.4
jmespath==1.0.1
kedro==0.19.8
kedro-datasets==4.1.0
kedro-telemetry==0.6.0
lazy_loader==0.4
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
more-itertools==10.4.0
multidict==6.0.5
omegaconf==2.3.0
packaging==24.1
parse==1.20.2
platformdirs==4.2.2
pluggy==1.5.0
pre-commit-hooks==4.6.0
py4j==0.10.9.7
Pygments==2.18.0
pyproject_hooks==1.1.0
pyspark==3.5.2
python-dateutil==2.9.0.post0
python-slugify==8.0.4
pytoolconfig==1.3.1
PyYAML==6.0.2
requests==2.32.3
rich==13.8.0
rope==1.13.0
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3fs==2024.6.1
six==1.16.0
smmap==5.0.1
text-unidecode==1.3
toml==0.10.2
types-python-dateutil==2.9.0.20240821
typing_extensions==4.12.2
urllib3==2.2.2
wrapt==1.16.0
yarl==1.9.4
zipp==3.20.1
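Rather than eyeballing the freeze output, the dependencies a distribution declares under an extra can also be checked programmatically from its requirement strings. A small sketch; the sample strings below are hypothetical, shaped like the output of `importlib.metadata.requires("kedro-datasets")`:

```python
def deps_for_extra(requirements, extra):
    """Return the base requirements declared under a given extra marker,
    given PEP 508-style requirement strings."""
    out = []
    for req in requirements:
        if f'extra == "{extra}"' in req:
            out.append(req.split(";")[0].strip())
    return out


# Hypothetical requirement strings for illustration only.
sample = [
    'fsspec>=2021.4',
    'hdfs>=2.5.8, <3.0 ; extra == "spark-sparkdataset"',
    's3fs>=2021.4 ; extra == "spark-sparkdataset"',
    'pyspark>=2.2 ; extra == "spark-sparkdataset"',
]
print(deps_for_extra(sample, "spark-sparkdataset"))
# hdfs and s3fs (as seen in the errors above) show up as extra-gated deps
```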
a
You're right, my bad: the issue lies in our CVE scanner not allowing a bunch of the Spark dataset dependencies through. I'll have to look into that. Thanks, and sorry for taking your time!
np 1