# questions
f
Hi fellows, we are using the `versioned` `pandas.ParquetDataset` in conjunction with `s3` in a pipeline that runs at regular intervals. Yesterday, we realized that our `s3` costs were growing linearly with each run. 😅 Digging into the reason behind it, it seems that the pipeline performs an enormous number of `LIST` operations on `s3`. I therefore started to dig into the implementation of the dataset versioning, which led me to this issue on `s3fs`, which states that the `glob` method performs one request per sub-folder. I am not completely sure that this is what is happening in our case, but it looks quite similar, as the versioning creates one folder per version. Has anyone encountered this kind of issue? If my interpretation is correct, is there a simple way to solve this problem? My idea right now would be to create a custom `Dataset` that overrides the `glob_function=self._fs.glob` in `pandas.ParquetDataset` to use the `list_objects_v2` of `boto3`. Thanks in advance!
💸 1
👀 1
j
> My idea right now would be to create a custom `Dataset` to override the `glob_function=self._fs.glob` in `pandas.ParquetDataset` to use the `list_objects_v2` of `boto3`.
that's maybe something quick you can try, yes! you can inherit from `kedro_datasets.pandas.ParquetDataset` and override the methods you need, that will probably make it easier
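A minimal sketch of the idea discussed above: a glob-like helper built on paginated `list_objects_v2` calls, so that one request covers up to 1000 keys instead of one request per sub-folder. The function name `glob_with_list_objects` is hypothetical, and it assumes the pattern arrives in the `bucket/key/*/file` form without the `s3://` prefix. The client is passed in, so in practice it would be something like `boto3.client("s3")`.

```python
from fnmatch import fnmatchcase


def glob_with_list_objects(s3_client, pattern):
    """Return 'bucket/key' paths matching `pattern` via paginated list_objects_v2.

    Assumption: `pattern` looks like 'bucket/some/prefix/*/file.parquet'.
    Note: fnmatch's '*' also matches '/', which is looser than a real glob.
    """
    bucket, _, key_pattern = pattern.partition("/")

    # Everything before the first wildcard can be sent to S3 as a plain prefix.
    wildcard_positions = [key_pattern.find(c) for c in "*?[" if c in key_pattern]
    prefix = key_pattern[: min(wildcard_positions)] if wildcard_positions else key_pattern

    paginator = s3_client.get_paginator("list_objects_v2")
    matches = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without any matching object carry no "Contents" entry.
        for obj in page.get("Contents", []):
            if fnmatchcase(obj["Key"], key_pattern):
                matches.append(f"{bucket}/{obj['Key']}")
    return matches
```

One way to wire this in, assuming the dataset keeps its glob callable in a `_glob_function` attribute (worth checking against the installed Kedro version), would be to subclass `kedro_datasets.pandas.ParquetDataset` and assign a wrapper around this helper to that attribute after calling `super().__init__`. The helper only matches object keys, so a Parquet dataset saved as a directory of part files would need a different pattern.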
f
That's indeed what I had in mind.
n
Interesting. I think this sounds like the right direction. Could you share this info? 1. The version of your fsspec. 2. How you define your path in the catalog. Will read more into the issue later.
f
Name: fsspec
Version: 2024.2.0
And the catalog entry looks something like this:
wide-basic-mlm:
  type: pandas.ParquetDataset
  filepath: s3://virtual/wide-basic-mlm.parquet
  versioned: True
thankyou 1
d
Should we think about an `expiry` attribute of the `AbstractVersionedDataset`?
f
I implemented the hack and tested it by loading the dataset with `kedro ipython`. I can confirm that it is much faster than the original implementation.
❤️ 2
n
How much time does it take before and after?
I guess more precisely I want to know the listdir time, but that may be trickier to measure.
`Dataset.exists()` maybe?
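For a rough before/after comparison, a small timing helper like the sketch below could be pasted into a `kedro ipython` session. The name `time_calls` is hypothetical. It returns every timing rather than only the best one, on the assumption that repeated calls on a versioned dataset may be served from a cache, so the first call is likely the one that reflects the listing cost.

```python
import time


def time_calls(func, *args, repeat=3, **kwargs):
    """Call `func` `repeat` times; return (list of durations in seconds, last result)."""
    durations = []
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations, result
```

With a dataset instance at hand (here a hypothetical variable `dataset`), `time_calls(dataset.exists)` would give the timings for the existence check with either glob implementation.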
f
A couple of seconds before and less than a second after. But this is just a feeling, no proper timing.
🔥 1
j
there has been no activity on the s3fs issue you pointed out @Flavien... is this something we could maybe change in our own datasets?