Giulio Morina
10/09/2023, 6:53 PM
I have a pipeline where one node saves a dataframe as df@pandas and the node right after that loads it in Spark, e.g. df@spark. When running the whole pipeline, it throws the error below when it tries to load df@spark. Interestingly though, if I restart the pipeline from the node that failed, the error is no longer produced and the pipeline runs fine. Any ideas what could be the issue? pyarrow version is 6.0.1. Catalog entries and a rough sketch of the pipeline wiring are below the traceback. Thanks!
Traceback (most recent call last)
[TRUNCATED]
NotADirectoryError: blahblah/user/giulio/05_model_input/master.parquet/2023-10-09T18.45.28.662Z/master.parquet
The above exception was the direct cause of the following exception:
[TRUNCATED]
DataSetError: Failed while loading data from data set
ParquetDataSet(filepath=blahblah/user/giulio/05_model_input/master.parquet, load_args={'engine': pyarrow},
protocol=s3, save_args={'engine': pyarrow, 'index': False}, version=Version(load=None, save='2023-10-09T18.45.28.662Z')).
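If it helps with reproducing: unless I'm misreading the Kedro source, the failing load should be roughly equivalent to running this outside Kedro (path copied from the traceback; pandas resolves the s3:// URL through s3fs/fsspec):

import pandas as pd

# Roughly what pandas.ParquetDataSet does on load for the versioned path
# shown in the traceback (engine taken from the catalog's load_args).
df = pd.read_parquet(
    "s3://blahblah/user/giulio/05_model_input/master.parquet/"
    "2023-10-09T18.45.28.662Z/master.parquet",
    engine="pyarrow",
)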
Catalog entries:

_spark_parquet_ds: &_spark_parquet
  type: spark.SparkDataSet
  file_format: parquet
  versioned: True
  layer: model_input
  load_args:
    header: True
  save_args:
    mode: overwrite
    header: True

_pandas_parquet_ds: &_pandas_parquet
  layer: model_input
  versioned: True
  type: pandas.ParquetDataSet
  save_args:
    index: False  # Ensure no extra col with index is created when loading with Spark
    engine: pyarrow

df@spark:
  <<: *_spark_parquet
  filepath: s3a://blahblah/${USER_NAMESPACE_PATH}/${data_layers.model_input}/master.parquet

df@pandas:
  <<: *_pandas_parquet
  filepath: s3://blahblah/${USER_NAMESPACE_PATH}/${data_layers.model_input}/master.parquet