Flavien
07/18/2024, 12:15 PM
We are using a versioned pandas.ParquetDataset in conjunction with S3 in a pipeline that runs at regular intervals. Yesterday, we realized that our S3 costs were growing linearly with each run. 😅 Digging into the cause, it seems that the pipeline performs an enormous number of LIST operations on S3. I therefore started to dig into how the versioning of the dataset is implemented, which led me to this issue on s3fs, which states that the glob method performs one request per sub-folder. I am not completely sure that this is what is happening in our case, but it looks quite similar, as the versioning creates one folder per version.
Has anyone encountered this kind of issue?
If my interpretation is correct, is there a simple way to solve this problem? My idea right now would be to create a custom Dataset that overrides the glob_function=self._fs.glob in pandas.ParquetDataset to use the list_objects_v2 of boto3.
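A minimal sketch of that idea, assuming the glob pattern arrives as "bucket/key/*/file" without the s3:// prefix and that `client` is a boto3 S3 client created elsewhere (for example with boto3.client("s3")). The function name glob_via_single_listing is hypothetical. Everything before the first wildcard is sent to S3 as a Prefix and the flat, paginated list_objects_v2 results are filtered locally, so the number of requests grows with the number of keys (up to 1,000 per page) rather than with the number of version folders.

```python
import fnmatch
from typing import Any, List

_WILDCARDS = "*?["


def glob_via_single_listing(client: Any, pattern: str) -> List[str]:
    """Return sorted "bucket/key" paths matching `pattern`.

    `client` is assumed to be a boto3 S3 client; `pattern` is assumed to look
    like "bucket/some/key/*/file.parquet" (no "s3://" prefix).
    """
    bucket, _, key_pattern = pattern.partition("/")

    # Everything before the first wildcard can be pushed down to S3 as a prefix.
    positions = [key_pattern.find(c) for c in _WILDCARDS]
    first_wildcard = min((p for p in positions if p != -1), default=len(key_pattern))
    prefix = key_pattern[:first_wildcard]

    paginator = client.get_paginator("list_objects_v2")
    matches = []
    # No Delimiter is passed, so each page is a flat listing of keys under the
    # prefix instead of one listing per "sub-folder".
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if fnmatch.fnmatchcase(obj["Key"], key_pattern):
                matches.append(f"{bucket}/{obj['Key']}")
    return sorted(matches)
```

Note that fnmatch lets * match "/" as well, so this is looser than a real glob; for a version layout with one folder per version that should not matter, but it is worth checking against the actual bucket contents.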
Thanks in advance!
Juan Luis
07/18/2024, 12:21 PM
"My idea right now would be to create a custom Dataset to override the glob_function=self._fs.glob in pandas.ParquetDataset to use the list_objects_v2 of boto3."
that's maybe something quick you can try, yes! you can inherit kedro_datasets.pandas.ParquetDataset and override the methods you need, that will probably make it easier
Flavien
07/18/2024, 12:21 PM
Nok Lam Chan
07/18/2024, 12:36 PM
Flavien
07/18/2024, 12:38 PM
Name: fsspec
Version: 2024.2.0
And something like this:
wide-basic-mlm:
  type: pandas.ParquetDataset
  filepath: s3://virtual/wide-basic-mlm.parquet
  versioned: True
datajoely
07/18/2024, 1:15 PM
expiry attribute of the AbstractVersionedDataset?
Flavien
07/18/2024, 3:15 PM
kedro ipython. I can confirm that it is much faster than the original implementation.
Nok Lam Chan
07/18/2024, 3:21 PM
Nok Lam Chan
07/18/2024, 3:21 PM
Nok Lam Chan
07/18/2024, 3:21 PM
Flavien
07/18/2024, 3:22 PM
Juan Luis
07/18/2024, 3:29 PM
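The suggestion earlier in the thread, to inherit from kedro_datasets.pandas.ParquetDataset and override only what is needed, could look roughly like the mixin below. It assumes, without confirmation from the thread, that the versioned base class stores its glob callable in a private attribute named _glob_function once __init__ has run. That attribute name, and the names CustomGlobMixin and glob_versions, are assumptions to check against the installed Kedro version.

```python
from typing import List


class CustomGlobMixin:
    """Replace the glob callable a versioned dataset uses to discover versions.

    Intended to be listed before the real dataset class in the bases, so that
    this __init__ runs the parent's __init__ first and then swaps the callable.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumption: the parent class keeps its glob callable in the private
        # attribute `_glob_function`; verify this for your Kedro version.
        self._glob_function = self.glob_versions

    def glob_versions(self, pattern: str) -> List[str]:
        """Return the paths matching `pattern`; subclasses supply the listing."""
        raise NotImplementedError
```

A hypothetical dataset would then be declared as class S3ListingParquetDataset(CustomGlobMixin, ParquetDataset), with glob_versions implemented on top of list_objects_v2, and referenced from the catalog by its full import path in the type field.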