Hi everyone! This might not be a pure kedro issue ...
# questions
j
Hi everyone! This might not be a pure Kedro issue, but I'm looking for some input on the Kedro SparkDataSet save method. I'm getting this error when running a Kedro pipeline; I think it has to do with the dependencies / environment variables. Let me know your thoughts. Windows machine, Python 3.7
pyarrow==0.14.0
java version "1.8.0_341"
Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_341
Branch HEAD
Compiled by user yumwang on 2022-10-15T09:47:01Z
Revision fbbcf9434ac070dd4ced4fb9efe32899c6db12a9
Url <https://github.com/apache/spark>
f
I'm not very familiar with Spark, but `Method <...> does not exist` in py4j is likely the equivalent of `NoSuchMethodError` in Java. This likely means that your version of Spark is expecting a different version of the Python library than what is available at runtime, so you should check which version Spark is expecting and which is actually being used. If there's a discrepancy, fixing this will probably fix your problem.
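A minimal sketch of that version check, assuming the relevant packages are `pyspark`, `py4j`, `pyarrow` and `pandas` (that list is my guess, not something confirmed in this thread). It only reports what is installed in the current Python environment, so the output still has to be compared by hand against the Spark 3.3.1 banner above:

```python
import os

try:
    # Python 3.8+
    from importlib.metadata import PackageNotFoundError, version
except ImportError:
    # Python 3.7 needs the importlib_metadata backport installed
    from importlib_metadata import PackageNotFoundError, version


def installed_versions(names=("pyspark", "py4j", "pyarrow", "pandas")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:10s} {ver or 'not installed'}")
    # SPARK_HOME decides which Spark installation (and JVM jars) gets launched
    print("SPARK_HOME =", os.environ.get("SPARK_HOME", "<not set>"))
```

If the `pyspark` version printed here differs from the version of the Spark installation under `SPARK_HOME`, that would be the kind of discrepancy described above.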
j
I tried `1.0.0` but got the same error. I also tried updating spark/bin/conf/spark-defaults with `spark.sql.execution.arrow.pyspark.enabled=false`, but still got the same error.
2023-02-07 13:32:29,274 - py.warnings - WARNING - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile. Trace:
py4j.Py4JException: Method readArrowStreamFromFile([class org.apache.spark.sql.SQLContext, class java.lang.String]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
        at py4j.Gateway.invoke(Gateway.java:276)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Unknown Source)


Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

2023-02-07 13:32:29,275 - py.warnings - WARNING - iteritems is deprecated and will be removed in a future version. Use .items instead.

2023-02-07 13:32:29,429 - kedro.io.data_catalog - INFO - Saving data to 'ftr_account_customer_month_spine' (SparkDataSet)...
23/02/07 13:32:53 ERROR FileFormatWriter: Aborting job 5f8a8e57-29de-46a3-899d-195f59b90171.
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
f
The way I usually debug version discrepancies like this (if there isn't a version compatibility matrix, as in this case) is to open the code (e.g. on GitHub) for the versions in use and check the signature of the method being called. If it doesn't match, I go through other releases of the library being called (so in this case, pyarrow/pyspark) until I find a match for that method signature. I'm afraid you'll probably have to do something similar, unless someone here has experience with those specific versions.
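As a rough local alternative to browsing GitHub, the Python side of the call can be located in the installed package itself. This is only a sketch: `find_call_sites` is a hypothetical helper name, and it does a plain text search over the package's `.py` files, nothing more.

```python
import importlib.util
from pathlib import Path


def find_call_sites(package, needle):
    """Return (file, line number, line text) for each line in the installed
    package's .py files that contains `needle`."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []  # package not installed, or not a regular package
    hits = []
    for root in spec.submodule_search_locations:
        for py_file in sorted(Path(root).rglob("*.py")):
            try:
                text = py_file.read_text(encoding="utf-8", errors="replace")
            except OSError:
                continue  # unreadable file: skip it
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    hits.append((str(py_file), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Method name taken from the py4j error message in the log above
    for path, lineno, line in find_call_sites("pyspark", "readArrowStreamFromFile"):
        print(f"{path}:{lineno}: {line}")
```

The arguments passed at the call site it prints can then be compared with the signature of `PythonSQLUtils.readArrowStreamFromFile` in the Spark release that is actually running.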
b
Looks like you need to write to somewhere that is not in `C:\Users` when overwriting during save: https://stackoverflow.com/questions/51561061/scala-spark-overwrite-parquet-file-failed-to-delete-file-or-dir
Or perhaps having the directory open in your file browser causes a lock.
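A small sketch for checking whether a save path falls under `C:\Users` before moving it elsewhere. `is_under` is a hypothetical helper, and the example paths are made up; it only compares path strings and does not detect file locks.

```python
from pathlib import PureWindowsPath


def is_under(path, parent=r"C:\Users"):
    """True if `path` is `parent` itself or lies somewhere below it.
    PureWindowsPath compares case-insensitively and accepts / or \\."""
    candidate = PureWindowsPath(path)
    base = PureWindowsPath(parent)
    return candidate == base or base in candidate.parents


if __name__ == "__main__":
    # Hypothetical save locations, for illustration only
    for target in (r"C:\Users\me\project\data\out.parquet", r"D:\data\out.parquet"):
        print(target, "->", "under C:\\Users" if is_under(target) else "outside C:\\Users")
```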