Flavien
07/18/2024, 12:15 PM
We are using versioned pandas.ParquetDataset in conjunction with s3 in a pipeline that runs at regular intervals. Yesterday, we realized that our s3 costs were growing linearly with each run. 😅 Digging into the reason behind it, it seems that the pipeline performs an enormous number of LIST operations on s3. I therefore started to dig into the implementation of the dataset versioning, which led me to this issue on s3fs, which states that the glob method performs one request per sub-folder. I am not completely sure that this is what is happening in our case, but it looks quite similar, as the versioning creates one folder per version.
Has anyone encountered this kind of issue?
If my interpretation is correct, is there any simple way to solve this problem? My idea right now would be to create a custom Dataset that overrides the glob_function=self._fs.glob in pandas.ParquetDataset to use the list_objects_v2 of boto3.
Thanks in advance!
Juan Luis
07/18/2024, 12:21 PM
"My idea right now would be to create a custom Dataset to override the glob_function=self._fs.glob in pandas.ParquetDataset to use the list_objects_v2 of boto3."
that's maybe something quick you can try, yes! you can inherit kedro_datasets.pandas.ParquetDataset and override the methods you need, that will probably make it easier
Flavien
07/18/2024, 12:21 PM
Nok Lam Chan
07/18/2024, 12:36 PM
Flavien
07/18/2024, 12:38 PM
Name: fsspec
Version: 2024.2.0
And something like that
wide-basic-mlm:
  type: pandas.ParquetDataset
  filepath: s3://virtual/wide-basic-mlm.parquet
  versioned: True
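A minimal sketch of the listing idea raised earlier in the thread, not the implementation that was actually tried: instead of globbing, ask S3 for the version folders directly with a paginated list_objects_v2 call. It assumes the usual versioned layout of <filepath>/<version>/<filename>, timestamp-like version names that sort lexicographically, and a boto3 S3 client passed in by the caller. The function names are hypothetical.

```python
from typing import Any, Iterator, Optional


def iter_version_prefixes(client: Any, bucket: str, dataset_key: str) -> Iterator[str]:
    """Yield one key prefix per version folder under the dataset key.

    `client` is assumed to be a boto3 S3 client, i.e. boto3.client("s3").
    With Delimiter="/", S3 groups keys by folder and returns the folders as
    CommonPrefixes, so the files inside each version are not enumerated.
    """
    prefix = dataset_key.rstrip("/") + "/"
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # Pages without version folders simply have no "CommonPrefixes" key.
        for entry in page.get("CommonPrefixes", []):
            yield entry["Prefix"]


def latest_version(client: Any, bucket: str, dataset_key: str) -> Optional[str]:
    """Return the lexicographically greatest version name, or None if none exist."""
    versions = (
        prefix.rstrip("/").rsplit("/", 1)[-1]
        for prefix in iter_version_prefixes(client, bucket, dataset_key)
    )
    return max(versions, default=None)


# Example call (needs AWS credentials, so it is left commented out):
# import boto3
# latest_version(boto3.client("s3"), "virtual", "wide-basic-mlm.parquet")
```

Since a single LIST response can carry up to 1,000 entries, the request count here should grow with roughly one request per 1,000 version folders rather than one per folder. How such a function would be wired into a custom dataset depends on the kedro-datasets version in use.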
datajoely
07/18/2024, 1:15 PM
[...] expiry attribute of the AbstractVersionedDataset?
Flavien
07/18/2024, 3:15 PM
[...] kedro ipython. I can confirm that it is much faster than the original implementation.
Nok Lam Chan
07/18/2024, 3:21 PM
Nok Lam Chan
07/18/2024, 3:21 PM
Nok Lam Chan
07/18/2024, 3:21 PM
Flavien
07/18/2024, 3:22 PM
Juan Luis
07/18/2024, 3:29 PM