Hi everyone This might not be a pure kedro issue but looking Kedro #questions


JOEL WILSON

02/07/2023, 7:15 AM

Hi everyone! This might not be a pure kedro issue, but I'm looking for some input on the kedro SparkDataSet save method. I'm getting this error when running a kedro pipeline; I think this has to do with the dependencies / environment variables. Lmk your thoughts. Windows machine, Python 3.7.

pyarrow==0.14.0


java version "1.8.0_341"
Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Scala version 2.12.15, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_341
Branch HEAD
Compiled by user yumwang on 2022-10-15T09:47:01Z
Revision fbbcf9434ac070dd4ced4fb9efe32899c6db12a9
Url <https://github.com/apache/spark>


Filip Panovski

02/07/2023, 9:27 AM

I'm not very familiar with Spark, but "Method <...> does not exist" in py4j is likely the equivalent of NoSuchMethodError in Java. This likely means that your version of Spark is expecting a different version of the Python library than what is available at runtime, so you should check which version Spark is expecting and which is actually being used. If there's a discrepancy, fixing this will probably fix your problem.
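One way to run the check described above is to compare the version of the pyspark Python package against the version printed in the Spark banner (3.3.1 in this thread). The sketch below is only an illustration: the helper names are hypothetical, and it assumes that matching major.minor versions is the relevant criterion, which the thread does not confirm.

```python
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return the (major, minor) part of a version string such as '3.3.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_compatible(python_side: str, jvm_side: str) -> bool:
    """Assumption: the two sides should agree on major.minor."""
    return major_minor(python_side) == major_minor(jvm_side)


def installed_pyspark_version() -> Optional[str]:
    """Return the version of the importable pyspark package, or None."""
    try:
        import pyspark
    except ImportError:
        return None
    return pyspark.__version__


if __name__ == "__main__":
    jvm_version = "3.3.1"  # taken from the Spark banner posted in this thread
    py_version = installed_pyspark_version()
    if py_version is None:
        print("pyspark is not importable in this environment")
    elif versions_compatible(py_version, jvm_version):
        print(f"pyspark {py_version} matches Spark {jvm_version}")
    else:
        print(f"pyspark {py_version} does not match Spark {jvm_version}")
```

If the two versions differ, aligning them (for example by installing the pyspark release that matches the Spark install) would be the first thing to try.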

Filip Panovski

02/07/2023, 10:01 AM

Have you tried pyarrow==0.14.1 or 1.0.0?

JOEL WILSON

02/07/2023, 10:35 AM

I tried 1.0.0 but got the same error. I also tried updating spark/bin/conf/spark-defaults with spark.sql.execution.arrow.pyspark.enabled=false, but still the same error.


2023-02-07 13:32:29,274 - py.warnings - WARNING - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile. Trace:
py4j.Py4JException: Method readArrowStreamFromFile([class org.apache.spark.sql.SQLContext, class java.lang.String]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
        at py4j.Gateway.invoke(Gateway.java:276)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Unknown Source)


Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

2023-02-07 13:32:29,275 - py.warnings - WARNING - iteritems is deprecated and will be removed in a future version. Use .items instead.

2023-02-07 13:32:29,429 - kedro.io.data_catalog - INFO - Saving data to 'ftr_account_customer_month_spine' (SparkDataSet)...
23/02/07 13:32:53 ERROR FileFormatWriter: Aborting job 5f8a8e57-29de-46a3-899d-195f59b90171.
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
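The warning in the log above still reports 'spark.sql.execution.arrow.pyspark.enabled' as set to true, which suggests the edit to the defaults file was not picked up by this run. Spark normally reads defaults from conf/spark-defaults.conf under SPARK_HOME. A minimal sketch of setting the option directly on the session instead and printing the effective value; the function names and the app name are hypothetical, and how the session is actually created inside this Kedro project is not shown in the thread.

```python
from typing import Dict

ARROW_KEY = "spark.sql.execution.arrow.pyspark.enabled"


def arrow_disabled_conf() -> Dict[str, str]:
    """Spark options that turn off the Arrow optimization."""
    return {ARROW_KEY: "false"}


def build_session_without_arrow(app_name: str = "arrow-check"):
    """Create (or reuse) a SparkSession with the Arrow optimization disabled."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in arrow_disabled_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    # Print the value the running session actually uses.
    print(ARROW_KEY, "=", spark.conf.get(ARROW_KEY))
    return spark
```

Note that the log shows Spark falling back to the non-Arrow path after this warning, so the failure that aborts the save is the later UnsatisfiedLinkError.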


Filip Panovski

02/07/2023, 10:59 AM

The way I usually debug version discrepancies like this (if there isn't a version compatibility matrix, like in this case) is to open the code (on e.g. GitHub) for the versions in use and check the signature of the method being called. If it doesn't match, I go through other releases of the library being called (so in this case, pyarrow/pyspark) until I find a match for that method signature. I'm afraid you'll probably have to do something similar, unless someone here has experience with those specific versions.
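As a starting point for that kind of signature hunt, the method name and argument types can be pulled out of the py4j message and then searched for in the source of each release (here, the PythonSQLUtils class named in the trace). A small sketch; the function name and the regular expression are illustrative and only cover messages shaped like the one in this thread.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. "Method foo([class a.B, class c.D]) does not exist"
_MISSING_METHOD = re.compile(r"Method (\w+)\(\[(.*?)\]\) does not exist")


def parse_missing_method(message: str) -> Optional[Tuple[str, List[str]]]:
    """Extract (method name, argument types) from a py4j 'does not exist' error."""
    match = _MISSING_METHOD.search(message)
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2)
    arg_types = []
    for item in raw_args.split(","):
        item = item.strip()
        if not item:
            continue
        # py4j prefixes each type with "class "; drop it for readability.
        if item.startswith("class "):
            item = item[len("class "):]
        arg_types.append(item)
    return name, arg_types


if __name__ == "__main__":
    line = (
        "py4j.Py4JException: Method readArrowStreamFromFile("
        "[class org.apache.spark.sql.SQLContext, class java.lang.String]) "
        "does not exist"
    )
    print(parse_missing_method(line))
```

The extracted argument types show what the Python side passed, which can then be compared with the parameter types declared for the same method in the Spark release that is installed.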

Ben Horsburgh

02/07/2023, 1:00 PM

Looks like you need to write to somewhere that is not in C:\Users when overwriting during save: https://stackoverflow.com/questions/51561061/scala-spark-overwrite-parquet-file-failed-to-delete-file-or-dir

Ben Horsburgh

02/07/2023, 1:00 PM

Or perhaps that having the directory open in your file browser causes a lock
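One more thing that may be worth ruling out: the save in the log aborts with java.lang.UnsatisfiedLinkError on NativeIO$Windows.access0. On Windows this error is commonly associated with Hadoop's native helper files (winutils.exe and hadoop.dll) being absent or not matching the Hadoop version Spark was built with. Nobody in the thread confirmed this as the cause. The sketch below assumes the conventional layout where HADOOP_HOME points at a folder that contains bin\winutils.exe and bin\hadoop.dll; the function name is hypothetical.

```python
import os
from pathlib import Path
from typing import List, Optional


def missing_hadoop_native_files(hadoop_home: Optional[str]) -> List[str]:
    """List what is missing from the assumed HADOOP_HOME\\bin layout."""
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]
    bin_dir = Path(hadoop_home) / "bin"
    expected = ("winutils.exe", "hadoop.dll")
    return [name for name in expected if not (bin_dir / name).is_file()]


if __name__ == "__main__":
    problems = missing_hadoop_native_files(os.environ.get("HADOOP_HOME"))
    if problems:
        print("Possible issues:", ", ".join(problems))
    else:
        print("winutils.exe and hadoop.dll found under HADOOP_HOME\\bin")
```

An empty result only means the files exist; it does not verify that they match the Hadoop version in use.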
