Are there any known issues using fastparquet to save a panda Kedro #questions

Eric Bell

02/12/2024, 6:53 AM

Are there any known issues using fastparquet to save a pandas dataset? I'm brand new to kedro ... was literally trying to get my very first node ever to process and it was throwing an error...it did it with the node I wrote and also one of the sample nodes.

kedro.io.core.DatasetError: Failed while saving data to data set ParquetDataset(filepath=D:/XXXXX/bmad-data-analysis/data/02_intermediate/preprocessed_companies.pq, load_args={}, protocol=file, save_args={}).
I/O operation on closed file.

I uninstalled fastparquet and installed pyarrow, and it works now.

Nok Lam Chan

02/12/2024, 4:56 PM

Can you post the full error stacktrace? I think there should be more info above this error

Nok Lam Chan

02/12/2024, 4:59 PM

And how did you install the dependencies? (And which pip, kedro, and kedro-datasets versions?)
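
For reference, one way to collect the versions asked about here is the standard library's importlib.metadata. This is only a minimal sketch: the package list and the helper name installed_versions are illustrative assumptions, not something from the thread or from Kedro itself.

```python
from importlib.metadata import PackageNotFoundError, version

# Illustrative list (an assumption): the packages mentioned in this thread.
PACKAGES = ["pip", "kedro", "kedro-datasets", "pandas", "pyarrow", "fastparquet"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it inside the project's environment (for example through pipenv) reports the versions for that environment rather than the system Python.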

Nok Lam Chan

02/12/2024, 6:12 PM

FYR, if you install from the project's requirements file or pyproject.toml, it should typically install pyarrow already. I can confirm this is an issue: I manually deleted pyarrow, installed fastparquet, and arrived at the same error. For now I suggest using pyarrow, since it is increasingly the community standard, but at the same time we will look into fixing the bug.
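
A quick way to see which Parquet engine an environment can actually use is to check whether each module is importable. The sketch below uses only the standard library, and the function name available_parquet_engines is a hypothetical helper made up for illustration. It relies on pandas' documented default of trying pyarrow first and falling back to fastparquet when no engine is specified.

```python
from importlib.util import find_spec


def available_parquet_engines():
    """Return the Parquet engines importable in this environment.

    The order mirrors pandas' default lookup: pyarrow first, then fastparquet.
    """
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    engines = available_parquet_engines()
    if engines:
        print("Importable Parquet engines:", ", ".join(engines))
    else:
        print("Neither pyarrow nor fastparquet is installed.")
```

If only fastparquet is listed, the environment matches the setup in which the error above was reproduced.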

datajoely

02/12/2024, 6:20 PM

pyarrow is deffo the standard these days, we should remove any references to fastparquet on our side

Eric Bell

02/13/2024, 3:39 AM

Sorry for the delayed response, and thank you for looking into this. Since you've been able to recreate the problem, it doesn't sound like you still need me to post the stacktrace. I've been using pipenv to manage my environments. I'm actually a bit confused right now ... I just created a new kedro project. Pipenv sees the requirements.txt file and uses it. Previously this didn't seem to install pyarrow or fastparquet, and pandas displayed a warning message that one of those two packages was going to be a non-optional requirement. But my current project did in fact install pyarrow. I think for now the best thing is to drop this. You've been able to replicate the issue relevant to kedro, so any problems I have with pyarrow being installed or not don't seem to warrant continued discussion.

Nok Lam Chan

02/13/2024, 12:13 PM

Did you manually install fastparquet?

Nok Lam Chan

02/13/2024, 12:14 PM

https://github.com/kedro-org/kedro-plugins/issues/313 I think this is breaking the installation

Eric Bell

02/13/2024, 11:50 PM

Yes, I manually installed fastparquet and when that didn't work I installed pyarrow. This is what is currently coming up after installing pandas:

>>> import pandas
<stdin>:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at <https://github.com/pandas-dev/pandas/issues/54466>

Wow now I'm really going insane ... I'm certain that when I first saw this message, it said "pyarrow or fastparquet" ... now it only says "pyarrow"

👀 1

datajoely

02/14/2024, 11:41 AM

I reckon slightly newer versions of Pandas and Kedro have both dropped fastparquet and some wires have got crossed. Hopefully if you install the latest in a clean env this isn’t an issue 🤞

👍 1
