Hello! I have a catalog that uses transcoding between pandas and Spark (see the catalog snippet below for how it looks). A node saves its output as the pandas version, e.g. `df@pandas`, and the node right after it loads it as Spark, e.g. `df@spark`. When I run the whole pipeline, it throws the error below when it tries to load `df@spark`. Interestingly though, *if I restart the pipeline from the node that failed, the error is no longer produced* and the pipeline runs fine. Any ideas what could be the issue? pyarrow version is 6.0.1. Thanks!
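For context, the nodes are wired roughly like this (a minimal sketch; only the `df@pandas`/`df@spark` dataset names match my catalog, every other name is a made-up placeholder):

```python
# Minimal sketch of the relevant wiring; "df@pandas" and "df@spark" are the
# transcoded catalog entries, everything else is a placeholder.
from kedro.pipeline import Pipeline, node


def build_master(raw):
    # Returns a pandas DataFrame; Kedro persists it via pandas.ParquetDataSet.
    return raw


def train(master):
    # Receives the same data, loaded back as a Spark DataFrame.
    return master


pipeline = Pipeline(
    [
        node(build_master, inputs="raw", outputs="df@pandas", name="build_master"),
        node(train, inputs="df@spark", outputs="model", name="train"),
    ]
)
```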
```
╭─────────────────────────────── Traceback (most recent call last) 
[TRUNCATED]

NotADirectoryError: blahblah/user/giulio/05_model_input/master.parquet/2023-10-09T18.45.28.662Z/master.parquet

The above exception was the direct cause of the following exception:

[TRUNCATED] 
      
DataSetError: Failed while loading data from data set 
ParquetDataSet(filepath=blahblah/user/giulio/05_model_input/master.parquet, load_args={'engine': pyarrow}, 
protocol=s3, save_args={'engine': pyarrow, 'index': False}, version=Version(load=None, save='2023-10-09T18.45.28.662Z')).
dptx-fingerprint/user/giulio/myc/screen_1k/05_model_input/master.parquet/2023-10-09T18.45.28.662Z/master.parquet
```

```yaml
_spark_parquet_ds: &_spark_parquet
  type: spark.SparkDataSet
  file_format: parquet
  versioned: True
  layer: model_input
  load_args:
    header: True
  save_args:
    mode: overwrite
    header: True

_pandas_parquet_ds: &_pandas_parquet
  layer: model_input
  versioned: True
  type: pandas.ParquetDataSet
  save_args:
    index: False  # Ensure no extra col with index is created when loading with Spark
    engine: pyarrow

df@spark:
  <<: *_spark_parquet
  filepath: s3a://blahblah/${USER_NAMESPACE_PATH}/${data_layers.model_input}/master.parquet

df@pandas:
  <<: *_pandas_parquet
  filepath: s3://blahblah/${USER_NAMESPACE_PATH}/${data_layers.model_input}/master.parquet
```
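In case it helps, this is roughly how I inspect what the versioned save actually puts on S3 (a sketch with s3fs; the bucket and prefix below are placeholders, not my real path). Per the traceback, the versioned pandas dataset writes a single file at `<filepath>/<version>/master.parquet`:

```python
# Sketch for inspecting the versioned artifact on S3 with s3fs.
# Bucket/prefix are placeholders; the version timestamp is the one from the error.
import s3fs

fs = s3fs.S3FileSystem()
version = "2023-10-09T18.45.28.662Z"
path = f"my-bucket/some/prefix/05_model_input/master.parquet/{version}"

print(fs.ls(path))                          # should list one object: master.parquet
print(fs.isdir(f"{path}/master.parquet"))   # False: it is a plain file, not a directory
```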