Evžen Šírek
02/03/2023, 10:01 AM
Is it possible to use the fastparquet engine with the ParquetDataSet?
It is possible to specify the engine in the catalog entry:
dataset:
  type: pandas.ParquetDataSet
  filepath: data/dataset.parquet
  load_args:
    engine: fastparquet
  save_args:
    engine: fastparquet
However, when I do that, I get a DataSetError with I/O operation on closed file when Kedro tries to save the dataset. When I manually save the data with pandas and engine=fastparquet (which is what Kedro should do according to the docs), it works well. Is this expected? Thanks! :))
Environment:
python==3.10.4, pandas==1.5.1, kedro==0.18.4, fastparquet==2023.1.0
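For reference, the manual save described above would look roughly like this minimal sketch; the dataframe and path are made up for illustration:

import pandas as pd

# Hypothetical data, only to illustrate the manual save that works.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})

# Writing straight to a path with the fastparquet engine succeeds,
# which is what the catalog's save_args should translate to.
df.to_parquet("data/dataset.parquet", engine="fastparquet")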
datajoely
02/03/2023, 1:21 PM
Evžen Šírek
02/03/2023, 1:55 PM
We had some issues with pyarrow - like not being able to save timedelta datatypes - which fastparquet helped us with, so we would like to stick with it. We would also like to use the append feature of fastparquet.
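A quick sketch of the append feature being referred to; the path and dataframes are invented, and it assumes pandas forwards the append keyword through to fastparquet.write:

import pandas as pd

batch_1 = pd.DataFrame({"x": [1, 2]})
batch_2 = pd.DataFrame({"x": [3, 4]})

# First write creates the file; append=True is a fastparquet-specific
# keyword that pandas passes through to the engine, adding new row groups.
batch_1.to_parquet("data/dataset.parquet", engine="fastparquet")
batch_2.to_parquet("data/dataset.parquet", engine="fastparquet", append=True)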
datajoely
02/03/2023, 2:04 PM
If you look at our implementation, the save method is doing a few things:
def _save(self, data: pd.DataFrame) -> None:
    save_path = get_filepath_str(self._get_save_path(), self._protocol)

    if Path(save_path).is_dir():
        raise DataSetError(
            f"Saving {self.__class__.__name__} to a directory is not supported."
        )

    if "partition_cols" in self._save_args:
        raise DataSetError(
            f"{self.__class__.__name__} does not support save argument "
            f"'partition_cols'. Please use 'kedro.io.PartitionedDataSet' instead."
        )

    bytes_buffer = BytesIO()
    data.to_parquet(bytes_buffer, **self._save_args)

    with self._fs.open(save_path, mode="wb") as fs_file:
        fs_file.write(bytes_buffer.getvalue())

    self._invalidate_cache()
• It writes the dataframe to a bytes buffer in memory
• It opens an fsspec path as binary
• It writes the data to the file
The I/O operation on closed file is happening at the fsspec part.
location.of.my_class.ParquetDataSet
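One possible explanation for where that error comes from, sketched under the assumption (not confirmed in the thread) that fastparquet closes the file object it writes to: the BytesIO buffer would already be closed by the time fs_file.write(bytes_buffer.getvalue()) runs.

from io import BytesIO

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

buffer = BytesIO()
df.to_parquet(buffer, engine="fastparquet")

# Assumption: fastparquet closes the buffer once it has finished writing.
# If that holds, buffer.closed is True here and the next call reproduces
# the error from the thread:
#   ValueError: I/O operation on closed file.
print(buffer.closed)
data_bytes = buffer.getvalue()

With engine="pyarrow" the buffer would stay open, which would explain why the default engine does not hit this.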
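The location.of.my_class.ParquetDataSet reference above points at the usual escape hatch: a custom dataset class registered in the catalog by its import path. A minimal sketch of such a class, assuming fastparquet accepts an already-open file object and that fsspec handles a double close; the module path and class name are placeholders:

from pathlib import Path

import pandas as pd
from kedro.extras.datasets.pandas import ParquetDataSet as _ParquetDataSet
from kedro.io.core import DataSetError, get_filepath_str


class ParquetDataSet(_ParquetDataSet):
    """ParquetDataSet variant that skips the intermediate BytesIO buffer."""

    def _save(self, data: pd.DataFrame) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)

        if Path(save_path).is_dir():
            raise DataSetError(
                f"Saving {self.__class__.__name__} to a directory is not supported."
            )

        # Let pandas/fastparquet write directly into the fsspec file object,
        # so there is no in-memory buffer left for the engine to close.
        with self._fs.open(save_path, mode="wb") as fs_file:
            data.to_parquet(fs_file, **self._save_args)

        self._invalidate_cache()

In the catalog, type: location.of.my_class.ParquetDataSet would then pick up this class instead of the built-in pandas.ParquetDataSet.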
Evžen Šírek
02/03/2023, 2:16 PM
datajoely
02/03/2023, 2:24 PM
Evžen Šírek
02/03/2023, 2:31 PM