# questions
Zubin Roy:
Hi team. I am encountering an error whenever I try to save a file using Polars. I can load the file fine as a Polars dataframe, but when it comes to saving it the code always errors out with the message below (I've also included the catalog entry). I have tried this with the EagerPolarsDataset and get the same result. Any help or advice would be appreciated. Catalog entry:
df_2:
  type: polars.LazyPolarsDataset
  filepath: data/01_raw/test.parquet
  file_format: parquet
Error:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
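For reference, this is roughly how the failure can be reproduced with just the dataset class (a minimal sketch; the path and format come from the catalog entry above):

from kedro_datasets.polars import LazyPolarsDataset

ds = LazyPolarsDataset(filepath="data/01_raw/test.parquet", file_format="parquet")
lf = ds.load()   # loading works fine and returns a Polars LazyFrame
ds.save(lf)      # the save is where the process dies with SIGSEGV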
Juan Luis:
Hi @Zubin Roy, are you able to save the same file with Polars, without going through Kedro's save? For example, you can try:
$ ipython
In [1]: %load_ext kedro

In [2]: df = catalog.load("df_2")  # the dataset name from the catalog entry above

In [3]: df.collect().write_parquet("data/01_raw/test.parquet")  # collect first, since the lazy dataset loads a LazyFrame
that way you can see if the problem is in Kedro or Polars
Zubin Roy:
Hey @Juan Luis, so it works if I do it outside of the Kedro catalog. See the code below.
import polars as pl

if isinstance(df, pl.LazyFrame):
    df = df.collect()

df.write_parquet("data/01_raw/test_1.parquet", use_pyarrow=True)
I think the error is coming from the dataset's save method (https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-6.0.0/_modules/kedro_datasets/polars/lazy_polars_dataset.html#LazyPolarsDataset), in particular this line:
save_method(file=fs_file, **self._save_args)
I'm unsure how to solve the issue, but I'm pretty sure that's what is causing the error above. For my purposes a manual save to an output file path will work, but I'm curious whether other people have flagged this issue when saving Polars dataframes as Parquet files. (If it's a CSV, the Kedro catalog works fine!)
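For context, based on the linked source, the save boils down to something like the following (a hand-written equivalent, not the actual dataset code):

import fsspec
import polars as pl

# What LazyPolarsDataset._save appears to do for a parquet file_format:
lf = pl.scan_parquet("data/01_raw/test.parquet")  # what the lazy dataset loads
df = lf.collect()                                 # LazyFrames are collected before saving

fs = fsspec.filesystem("file")
with fs.open("data/01_raw/test.parquet", mode="wb") as fs_file:
    # save_method(file=fs_file, **self._save_args) resolves to this when there are no save_args:
    df.write_parquet(file=fs_file)

So the segfault seems to be happening inside Polars' write_parquet when it's handed the fsspec file object.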
Juan Luis:
What happens if you remove the use_pyarrow from your df.write_parquet call?
Zubin Roy:
If it is removed, I get the same error:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
So does that mean it's a Polars issue? Or, when we save files using the Kedro catalog, are we calling write_parquet without that argument?
Elena Khaustova:
It looks so: if write_parquet is used without use_pyarrow=True, Polars defaults to its own Rust-based Parquet backend. So the easiest workaround will be passing it through save_args:
df_2:
  type: polars.LazyPolarsDataset
  filepath: data/01_raw/test.parquet
  file_format: parquet
  save_args:
    use_pyarrow: true
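If you want to check this outside the YAML catalog, the same thing can be done programmatically (a quick sketch using the kedro-datasets class; names match the entry above):

from kedro_datasets.polars import LazyPolarsDataset

ds = LazyPolarsDataset(
    filepath="data/01_raw/test.parquet",
    file_format="parquet",
    save_args={"use_pyarrow": True},  # forwarded to df.write_parquet(..., use_pyarrow=True)
)
ds.save(df)  # accepts a DataFrame or a LazyFrame; LazyFrames are collected first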
Zubin Roy:
@Elena Khaustova I had not realised you can put save_args into the Kedro catalog entries, so that's super nifty! Thank you