# questions
s
Hey team, I am saving a partitioned dataset of PySpark DataFrames as parquet. Catalog entry:
cpa_llm.blocking_output@partitions:
  type: PartitionedDataSet
  path: data/cpa_llm/blocking_output
  overwrite: True
  filename_suffix: ".parquet"
  dataset:
    type: spark.SparkDataSet
    file_format: parquet
    save_args:
      mode: overwrite
When I read the same data back as a Spark dataset, I get this error:
AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
But when I read from one particular partition, it infers the schema fine. I was wondering if there may be a step I am missing here, or if you would recommend some other file format over parquet to store the files. Appreciate any help here 😄
d
Unfortunately you can't use PartitionedDataSet with a Spark dataset
👍 1
since Spark is already doing something similar under the hood anyway
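A rough sketch of what that could look like: drop the PartitionedDataSet wrapper and let Spark write the partitions itself. The `block_id` column is a made-up placeholder for whatever column you want to partition on, and this assumes `save_args` are passed through to Spark's DataFrame writer, so treat it as a starting point rather than a tested entry:

```yaml
cpa_llm.blocking_output:
  type: spark.SparkDataSet
  filepath: data/cpa_llm/blocking_output
  file_format: parquet
  save_args:
    mode: overwrite
    # hypothetical partition column, replace with a real column in your DataFrame
    partitionBy:
      - block_id
```

With a layout written by Spark this way, reading the top-level folder back as a Spark dataset should let it discover the partitions and infer the schema.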
s
Ahh I see, thank you