# questions
s
Hello team! I saved a pandas DataFrame as Parquet on HDFS, but I'm having problems reading it back (it raises a "no file" error). Has anyone had similar problems before?
saving it like this
__pandas_parquet: &pandas_parquet
  type: pandas.ParquetDataset
  save_args:
    index: False
p05_model_input.master_table@pandas:
  <<: *pandas_parquet
  filepath: /data/_model_input/master_table.parquet
catalog.load("p05_model_input.master_table@pandas")
returns a "no file" error,
but
catalog.load("p05_model_input.master_table@spark")
works
my spark settings are
_spark: &spark
  type: spark.SparkDataset
  file_format: parquet
  save_args:
    mode: overwrite
d
Hi Sabrina, it should work if the filepath is correct. There might be an issue with the anchoring. Have you tried setting it up without using the anchor?
s
I've checked multiple times and the filepath is correct. A quick Google search says I need pyarrow to read Parquet from HDFS (i.e. pd.read_parquet(path, engine="pyarrow")), and I see Kedro's function calls pd.read_parquet directly, so I was wondering if the problem is compatibility with HDFS??
I'm sorry, what does "setting up without anchor" mean again??
d
I believe it uses pyarrow by default, but you can try explicitly adding it to the load_args like this:
load_args:
  engine: pyarrow
Also, try specifying the hdfs scheme in the filepath, like so: hdfs:///data/_model... By anchoring, I mean the &pandas_parquet syntax in your configuration.
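Putting those two suggestions together, the catalog entry would look roughly like this (a sketch only; it reuses the filepath from your earlier message, and assumes your HDFS namenode is the default one implied by the bare hdfs:/// scheme):

```yaml
p05_model_input.master_table@pandas:
  <<: *pandas_parquet
  filepath: hdfs:///data/_model_input/master_table.parquet
  load_args:
    engine: pyarrow
```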
n
Kedro delegates most of the work to pandas in this case. If possible, can you try reading the file with pure pandas, without Kedro, first?