# questions
Sorry for the spam. I am having issues with polars.GenericDataset; I was going to open an issue but thought I'd raise it here first in case I'm being dumb. Currently I load Azure Blob parquet files like this:
pl.read_parquet(
    f"az://{os.environ['CONTAINER_NAME_ENV_KEY']}/data/02_intermediate/blablabla.parquet",
    storage_options = {
        "account_name": os.environ["AZURE_STORAGE_ACCOUNT_DATA_NAME"],
        "anon": False
    }
)
However, the following configuration:
#credentials.yml
azure_blob:
  account_name: ${oc.env:AZURE_STORAGE_ACCOUNT_DATA_NAME}
  anon: false
#catalog.yml
input_data:
  type: polars.GenericDataset
  file_format: parquet
  filepath: az://${oc.env:CONTAINER_NAME_ENV_KEY}/data/blabla.parquet
  credentials: azure_blob
results in the following error when I try to load:
--> 153 with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
    154     return load_method(fs_file, **self._load_args)

File ~/.venv/lib/python3.10/site-packages/fsspec/spec.py:1241, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1240 ac = kwargs.pop("autocommit", not self._intrans)
-> 1241 f = self._open(
   1242     path,
   1243     mode=mode,
   1244     block_size=block_size,
   1245     autocommit=ac,
   1246     cache_options=cache_options,
...
    201     )
--> 202     raise DatasetError(message) from exc

DatasetError: Failed while loading data from data set GenericDataset(file_format=parquet, filepath=/data/blablabla.parquet, load_args={}, protocol=az, save_args={}).
[Errno 2] No such file or directory: 'data/blablabla.parquet'
(Please ignore any inconsistencies in container names, filenames, etc. I tried to remove some information when pasting into Slack, but I probably wasn't super thorough.) The container name is being stripped from the filepath, which I assume is then supplied to fsspec somewhere else, but I'm not entirely sure why the load fails when the pure polars call works. I know polars recently did away with fsspec and implemented its own native cloud support (https://github.com/pola-rs/polars/pull/11210), but I'm not sure if it has anything to do with that.
I have verified that az://${oc.env:CONTAINER_NAME_ENV_KEY}/data/blabla.parquet is being correctly interpolated to give the proper filepath.
OK, changing az to abfs fixes it, but az works with polars and should work with fsspec too through adlfs.
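One possible explanation for the az vs. abfs difference, sketched below as a guess rather than the actual dataset code: if the filepath is split with a URL parser and the container (the URL netloc) is only re-attached for an allow-list of cloud schemes, then any scheme missing from that list silently loses its container. The function name split_protocol_and_path and the CLOUD_SCHEMES tuple are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hypothetical allow-list for illustration only; the list a real library
# uses may differ. Note that "az" is deliberately absent here.
CLOUD_SCHEMES = ("s3", "gs", "gcs", "abfs", "abfss")


def split_protocol_and_path(filepath: str) -> tuple[str, str]:
    """Split a URL into (protocol, path), keeping the bucket/container
    only for schemes listed in CLOUD_SCHEMES."""
    parts = urlsplit(filepath)
    protocol = parts.scheme or "file"
    path = parts.path
    if parts.netloc and protocol in CLOUD_SCHEMES:
        # Recognised cloud scheme: the netloc is the container, so keep it.
        path = parts.netloc + path
    return protocol, path


print(split_protocol_and_path("abfs://mycontainer/data/blabla.parquet"))
print(split_protocol_and_path("az://mycontainer/data/blabla.parquet"))
```

Under this assumption the abfs URL keeps mycontainer in its path, while the az URL comes back as just /data/blabla.parquet, which is the same shape as the filepath shown in the DatasetError above.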
@Juan Luis I'd appreciate it if you could have a look at this when you have a chance, specifically the change on polars' side, which has done away with the need for fsspec. It should simplify the polars dataset implementation. Let me know if this would be better suited as an issue :)
There's no blocker from my side, but I thought it was a cool addition in polars, as it expands the capabilities around filter pushdown and similar optimizations, which weren't available through fsspec before.