#questions

Sid Shetty

08/04/2023, 3:29 PM
Hey team, I am saving a partitioned dataset of PySpark DataFrames as Parquet. Catalog entry:
cpa_llm.blocking_output@partitions:
  type: PartitionedDataSet
  path: data/cpa_llm/blocking_output
  overwrite: True
  filename_suffix: ".parquet"
  dataset:
    type: spark.SparkDataSet
    file_format: parquet
    save_args:
      mode: overwrite
When I read the same data back as a Spark dataset, I get this error:
AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
but when I read just one of the partitions, the schema is inferred fine. I was wondering if there may be a step I am missing here, or if you would recommend some other file format over Parquet for storing the files. Appreciate any help here 😄

datajoely

08/04/2023, 3:30 PM
unfortunately you can’t use PartitionedDataSet with a Spark dataset
👍 1
since Spark is already doing something similar under the hood anyway

Sid Shetty

08/04/2023, 3:33 PM
Ahh I see, thank you
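
For readers landing on this thread: a minimal sketch of what the advice above could look like in the catalog, assuming the goal is to let Spark handle the partitioning itself instead of wrapping the dataset in PartitionedDataSet. The entry name is reused from the question, and the `block_id` column is hypothetical (it does not appear in the thread); swap in whichever column the data should be split by, or drop `partitionBy` entirely.

```yaml
# Sketch only: a plain SparkDataSet entry, no PartitionedDataSet wrapper.
cpa_llm.blocking_output:
  type: spark.SparkDataSet
  filepath: data/cpa_llm/blocking_output
  file_format: parquet
  save_args:
    mode: overwrite
    # Hypothetical column name; Spark writes one block_id=<value>/ subfolder per value.
    partitionBy: ["block_id"]
```

Written this way, the folder uses Spark's own `column=value` directory layout, so loading the same `filepath` as a Spark dataset should be able to discover the files and infer the schema from the top-level path.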