Hi team! I am trying to load a spark parquet as a ...
# questions
p
Hi team! I am trying to load a spark parquet as a
polars.LazyPolarsDataset
, for which I assume the filepath needs to be a glob pattern. But since kedro-datasets>=6.0.0, the load checks that the file itself exists without expanding the glob pattern, if one is passed in. Is this a bug or am I doing something wrong?
e
Hi Puneet, could you elaborate a bit on the issue you're having? Do you mean that globals are not resolving in your dataset path?
p
Seems like when I pass
filepath: path/to/my/parquet_folder/*.parquet
, on this line we are checking whether the file exists or not. Since filepath is a glob pattern and not a static file path, the check fails and the parquet is not loaded.
๐Ÿ‘ 1
p
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/circleci/project/project_folder/venv-env/lib/python3.10/site-packages/kedr │
│ o/io/core.py:245 in load                                                     │
│                                                                              │
│   242 │   │   │   self._logger.debug("Loading %s", str(self))                │
│   243 │   │   │                                                              │
│   244 │   │   │   try:                                                       │
│ ❱ 245 │   │   │   │   return load_func(self)                                 │
│   246 │   │   │   except DatasetError:                                       │
│   247 │   │   │   │   raise                                                  │
│   248 │   │   │   except Exception as exc:                                   │
│                                                                              │
│ /home/circleci/project/project_folder/venv-env/lib/python3.10/site-packages/kedr │
│ o_datasets/polars/lazy_polars_dataset.py:205 in load                         │
│                                                                              │
│   202 │   def load(self) -> pl.LazyFrame:                                    │
│   203 │   │   load_path = str(self._get_load_path())                         │
│   204 │   │   if not self._exists():                                         │
│ ❱ 205 │   │   │   raise FileNotFoundError(errno.ENOENT, os.strerror(errno.EN │
│   206 │   │                                                                  │
│   207 │   │   if self._protocol == "file":                                   │
│   208 │   │   │   # With local filesystems, we can use Polar's build-in I/O  │
╰──────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [Errno 2] No such file or directory: 
'/home/circleci/project/project_folder/data/pipeline_1/*.parquet'
๐Ÿ‘ 1
e
@Deepyaman Datta, hey, are we aiming to support glob patterns here?
p
I can raise a quick fix if that works?
Let me know if checking glob pattern for exists is a good enough solution
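For reference, a minimal sketch of what a glob-aware existence check could look like. This is not the kedro-datasets code: `path_or_glob_exists` is a hypothetical helper, and it only covers the local filesystem using the standard library. For remote protocols the check would presumably need to go through the filesystem object's own glob support instead.

```python
import glob
import tempfile
from pathlib import Path


def path_or_glob_exists(load_path: str) -> bool:
    """Hypothetical helper: True if the path exists, or, when the path
    contains glob wildcards, if at least one file matches the pattern."""
    if any(ch in load_path for ch in "*?["):
        # Expand the pattern and check that something matched.
        return len(glob.glob(load_path)) > 0
    # No wildcards: fall back to a plain existence check.
    return Path(load_path).exists()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for one part file of a Spark-written parquet folder.
        (Path(tmp) / "part-0.parquet").touch()
        pattern = str(Path(tmp) / "*.parquet")

        # A plain existence check treats the pattern as a literal file name.
        print("plain exists:", Path(pattern).exists())
        # The glob-aware check expands the pattern first.
        print("glob-aware exists:", path_or_glob_exists(pattern))
```

With something like this, a pattern that matches no files would still raise FileNotFoundError at load time, so the early failure that the current check provides is kept.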
๐Ÿ‘ 1
e
Yeah, thank you, @Puneet Saini. Give us some time to align on what this check was originally for.
d
Let me know if checking glob pattern for exists is a good enough solution
I think that's fine. The CI logs from before this check was implemented are gone since it's been a while, but it was only added to fix breaking tests. I think the reason is that a Polars LazyFrame doesn't otherwise test file existence until you actually go to collect the result of some operation. You could try removing the check and seeing which test fails.
e
@Puneet Saini feel free to open an issue with a suggested fix, or we'll take this into the sprint ourselves.