meharji arumilli
12/28/2022, 7:19 PM
Rob
12/29/2022, 10:37 PM
MemoryDataSet. So in your catalog, place something like this:
your_pipeline:
  type: MemoryDataSet
  copy_mode: assign
Notice the copy_mode, and for more reference, check this doc:
Use MemoryDataSet with copy_mode="assign" for non-DataFrame Spark objects
copy_mode="deepcopy" (default setting I guess?) will probably fail, since your data is still part of Spark's logical plan.
meharji arumilli
12/29/2022, 10:47 PM
your_pipeline:
  type: MemoryDataSet
  copy_mode: assign
should be used when “Sometimes, you might want to use Spark objects that aren’t DataFrame as inputs and outputs in your pipeline.” as per the docs https://kedro.readthedocs.io/en/0.17.3/11_tools_integration/01_pyspark.html#use-kedro-s-built-in-spark-datasets-to-load-and-save-raw-data
preprocessed_data is a Spark DataFrame, and here I'm trying to save it to local storage:
preprocessed_data:
  type: spark.SparkDataSet
  filepath: data/${project}/05_model_input/df_preprocessed.parquet
  file_format: parquet
preprocessed_data:
  type: MemoryDataSet
  copy_mode: assign
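As a side note on the copy_mode discussion above, here is a minimal sketch of what the three modes mean in plain Python. `ToyMemoryStore` and `LiveHandle` are hypothetical names made up for this illustration; this is not Kedro's implementation, only the general idea that "assign" hands back the same object while "deepcopy" tries to duplicate it, which can fail for objects tied to external state.

```python
import copy


class ToyMemoryStore:
    """Hypothetical in-memory store; a stand-in, not Kedro's MemoryDataSet."""

    _COPIERS = {
        "assign": lambda obj: obj,   # hand back the very same object
        "copy": copy.copy,           # shallow copy
        "deepcopy": copy.deepcopy,   # recursive copy
    }

    def __init__(self, copy_mode="deepcopy"):
        if copy_mode not in self._COPIERS:
            raise ValueError(f"unknown copy_mode: {copy_mode!r}")
        self._copier = self._COPIERS[copy_mode]
        self._data = None

    def save(self, obj):
        self._data = self._copier(obj)

    def load(self):
        return self._copier(self._data)


class LiveHandle:
    """Hypothetical object bound to external state, so it refuses deep copies."""

    def __deepcopy__(self, memo):
        raise TypeError("cannot deep-copy a live handle")


handle = LiveHandle()
store = ToyMemoryStore(copy_mode="assign")
store.save(handle)
print(store.load() is handle)
```

With `copy_mode="deepcopy"` the same `save(handle)` call raises `TypeError` in this toy, which is the kind of failure being warned about for Spark objects.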
Rob
12/29/2022, 10:52 PM
meharji arumilli
12/29/2022, 10:54 PM
preprocessed_data:
  type: MemoryDataSet
  copy_mode: assign
preprocessed_data:
  type: spark.SparkDataSet
  filepath: data/${project}/05_model_input/df_preprocessed.parquet
  file_format: parquet
but this already has an issue as we use `preprocessed_data` twice in the catalog
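To make the duplicate-key concern concrete: depending on the parser, a repeated top-level key in a YAML file is either rejected or silently collapsed into one entry, so only one of the two definitions would survive. The sketch below is a rough standard-library check, not part of Kedro; `duplicate_top_level_keys` is a hypothetical helper, and it assumes top-level keys start at column 0 with nested keys indented.

```python
import re
from collections import Counter

# A top-level key starts at column 0, is not a comment or list item,
# and is followed by a colon and then whitespace or end of line.
_TOP_LEVEL_KEY = re.compile(r"^([^\s#\-][^:#]*?)\s*:(?:\s|$)")


def duplicate_top_level_keys(yaml_text):
    """Return top-level keys that occur more than once, in first-seen order."""
    keys = []
    for line in yaml_text.splitlines():
        match = _TOP_LEVEL_KEY.match(line)
        if match:
            keys.append(match.group(1))
    counts = Counter(keys)
    return [key for key in dict.fromkeys(keys) if counts[key] > 1]


catalog_text = """\
preprocessed_data:
  type: MemoryDataSet
  copy_mode: assign
preprocessed_data:
  type: spark.SparkDataSet
  filepath: data/${project}/05_model_input/df_preprocessed.parquet
  file_format: parquet
"""

print(duplicate_top_level_keys(catalog_text))
```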
Rob
12/29/2022, 10:56 PM
<Pipeline Name>:
  type: MemoryDataSet
  copy_mode: assign
preprocessed_data:
  type: spark.SparkDataSet
  filepath: data/${project}/05_model_input/df_preprocessed.parquet
  file_format: parquet
meharji arumilli
12/29/2022, 11:02 PM
feature_engineering:
  type: MemoryDataSet
  copy_mode: assign
preprocessed_data:
  type: spark.SparkDataSet
  filepath: data/${project}/05_model_input/df_preprocessed.parquet
  file_format: parquet
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o742.save.
: java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
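One possible reading of this trace, not confirmed anywhere in the thread, is that some Spark setting names a class from the `org.apache.spark.internal.io.cloud` package while the jar that provides it is not on the classpath (as far as I know, that package ships in Spark's optional hadoop-cloud module). The sketch below only scans a plain dict of settings for such references. `settings_needing_cloud_classes` is a hypothetical helper, and the example settings are invented for illustration, since the real Spark config is not shown here.

```python
CLOUD_PACKAGE = "org.apache.spark.internal.io.cloud"


def settings_needing_cloud_classes(spark_conf):
    """Return the settings whose value names a class inside CLOUD_PACKAGE."""
    return {
        key: value
        for key, value in spark_conf.items()
        if isinstance(value, str) and value.startswith(CLOUD_PACKAGE + ".")
    }


# Hypothetical settings for illustration only; not taken from the thread.
example_conf = {
    "spark.driver.maxResultSize": "3g",
    "spark.sql.sources.commitProtocolClass": (
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
    ),
}

print(settings_needing_cloud_classes(example_conf))
```

If a check like this flags a setting, the usual options would be to remove that setting when writing to local storage or to make the matching jar available, but which applies here is a guess.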
Rob
12/29/2022, 11:04 PM
meharji arumilli
12/29/2022, 11:05 PM
$ java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
Rob
12/29/2022, 11:10 PM
meharji arumilli
12/29/2022, 11:17 PM
Rob
12/29/2022, 11:18 PM
meharji arumilli
12/29/2022, 11:20 PM
Rob
12/29/2022, 11:22 PM
meharji arumilli
12/29/2022, 11:23 PM