# questions
s
Hello team! I saved a pandas DataFrame as Parquet on HDFS, but I'm having problems reading it back (it raises a "no file" error). Has anyone had similar problems before?
saving it like this
__pandas_parquet: &pandas_parquet
  type: pandas.ParquetDataset
  save_args:
    index: False
p05_model_input.master_table@pandas:
  <<: *pandas_parquet
  filepath: /data/_model_input/master_table.parquet
catalog.load("p05_model_input.master_table@pandas")
returns a "no file" error,
but
catalog.load("p05_model_input.master_table@spark")
works
my spark settings are
_spark: &spark
  type: spark.SparkDataset
  file_format: parquet
  save_args:
    mode: overwrite
d
Hi Sabrina, it should work if the filepath is correct. There might be an issue with the anchoring. Have you tried setting it up without using the anchor?
s
I've checked multiple times and the filepath is correct. A quick Google search says I need pyarrow to read Parquet from HDFS (i.e. pd.read_parquet(path, engine="pyarrow")), and I see Kedro's function calls pd.read_parquet directly, so I was wondering if the problem is compatibility with HDFS??
I'm sorry, what does "setting up without anchor" mean again??
d
I believe it uses pyarrow by default, but you can try explicitly adding it to the load_args like this:
load_args:
  engine: pyarrow
Also, try specifying the hdfs scheme in the filepath, like so: hdfs:///data/_model... By anchoring, I mean the &pandas_parquet syntax in your configuration.
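Putting those two suggestions together, the catalog entry would look roughly like this (a sketch only; it reuses the filepath from your earlier message, and assumes your HDFS namenode is the default one implied by the bare hdfs:/// scheme):

```yaml
p05_model_input.master_table@pandas:
  <<: *pandas_parquet
  filepath: hdfs:///data/_model_input/master_table.parquet
  load_args:
    engine: pyarrow
```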
n
Kedro delegates most of the work to pandas in this case. If possible, can you try reading the file with pure pandas, without Kedro, first?