Hi fellows we are using the `versioned` `pandas ParquetDatas Kedro #questions

Flavien

07/18/2024, 12:15 PM

Hi fellows, we are using the `versioned` `pandas.ParquetDataset` in conjunction with `s3` in a pipeline that runs at regular intervals. Yesterday, we realized that our `s3` costs were growing linearly with each run. 😅 Digging into the reason behind this, it seems that the pipeline performs an enormous number of `LIST` operations on `s3`. I therefore started to dig into the implementation of the versioning of the dataset, which led me to this issue on `s3fs`, which states that the `glob` method performs one request per sub-folder. I am not completely sure that this is what is happening in our case, but it looks quite similar, as the versioning creates one folder per version. Has anyone encountered this kind of issue? If my interpretation is correct, is there a simple way to solve this problem? My idea right now would be to create a custom `Dataset` to override the `glob_function=self._fs.glob` in `pandas.ParquetDataset` to use the `list_objects_v2` of `boto3`. Thanks in advance!

💸 1

👀 1
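
A minimal sketch of the idea in the message above, under stated assumptions: the glob pattern is assumed to look like `bucket/<path>/*/<name>` (one folder per version), and `glob_with_single_listing` is a hypothetical helper name, not part of Kedro, `s3fs`, or `boto3`. It uses the `boto3` `list_objects_v2` paginator, where each `LIST` request returns up to 1,000 keys, so the number of requests grows with the number of objects under the prefix rather than with the number of version folders.

```python
import fnmatch
from typing import Any, List


def glob_with_single_listing(s3_client: Any, pattern: str) -> List[str]:
    """Resolve a glob pattern with paginated list_objects_v2 calls.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    # Accept both "s3://bucket/key" and "bucket/key" style patterns.
    if pattern.startswith("s3://"):
        pattern = pattern[len("s3://"):]
    bucket, _, key_pattern = pattern.partition("/")

    # The literal part of the key before the first wildcard becomes the Prefix.
    wildcard_positions = [key_pattern.find(ch) for ch in "*?["]
    first_wildcard = min(
        (pos for pos in wildcard_positions if pos != -1),
        default=len(key_pattern),
    )
    prefix = key_pattern[:first_wildcard]

    matches = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Note: fnmatch lets "*" match "/" as well, unlike a strict glob.
            if fnmatch.fnmatchcase(obj["Key"], key_pattern):
                matches.append(f"{bucket}/{obj['Key']}")
    return sorted(matches)
```

With real credentials this would be called as `glob_with_single_listing(boto3.client("s3"), pattern)`. Whether it actually reduces the `LIST` count in a given pipeline would still need to be checked against the S3 request metrics.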

Juan Luis

07/18/2024, 12:21 PM

> My idea right now would be to create a custom `Dataset` to override the `glob_function=self._fs.glob` in `pandas.ParquetDataset` to use the `list_objects_v2` of `boto3`.

That's maybe something quick you can try, yes! You can inherit from `kedro_datasets.pandas.ParquetDataset` and override the methods you need; that will probably make it easier.
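
The inherit-and-override suggestion could be sketched roughly as follows. This is an illustration, not the Kedro API: `with_glob_override` is a hypothetical helper, and it assumes the versioned base class keeps its glob callable in a private attribute named `_glob_function`, which should be checked against the installed Kedro version.

```python
from typing import Any, Callable, List, Type


def with_glob_override(
    dataset_cls: Type, glob_function: Callable[[str], List[str]]
) -> Type:
    """Return a subclass of `dataset_cls` whose glob callable is replaced."""

    class GlobOverrideDataset(dataset_cls):
        def __init__(self, *args: Any, **kwargs: Any) -> None:
            super().__init__(*args, **kwargs)
            # Assumption: the base class stores its glob callable in the
            # private attribute `_glob_function` and uses it to find versions.
            self._glob_function = glob_function

    GlobOverrideDataset.__name__ = f"GlobOverride{dataset_cls.__name__}"
    return GlobOverrideDataset
```

The returned class could be assigned to a module-level name and referenced from the catalog's `type:` field. `glob_function` can be any callable that takes a pattern string and returns the matching paths, for example one built on `list_objects_v2`. Because this relies on a private attribute, it may break on upgrade.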

Flavien

07/18/2024, 12:21 PM

That's indeed what I had in mind.

Nok Lam Chan

07/18/2024, 12:36 PM

Interesting. I think this sounds like the right direction. Could you share this info: the version of your `fsspec`, and how you define your path in the catalog? Will read more into the issue later.

Flavien

07/18/2024, 12:38 PM

Name: fsspec
Version: 2024.2.0

And something like that

wide-basic-mlm:
  type: pandas.ParquetDataset
  filepath: s3://virtual/wide-basic-mlm.parquet
  versioned: True

thankyou 1

datajoely

07/18/2024, 1:15 PM

Should we think about an `expiry` attribute of the `AbstractVersionedDataset`?

Flavien

07/18/2024, 3:15 PM

I implemented the hack and tested it by loading the dataset with `kedro ipython`. I can confirm that it is much faster than the original implementation.

❤️ 2

Nok Lam Chan

07/18/2024, 3:21 PM

How much time does it take before and after?

Nok Lam Chan

07/18/2024, 3:21 PM

I guess, more precisely, I want to know the listdir time, but that may be trickier.

Nok Lam Chan

07/18/2024, 3:21 PM

Dataset.exists() maybe?

Flavien

07/18/2024, 3:22 PM

A couple of seconds before and less than a second after. But this is just a feeling, no proper timing.

🔥 1
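
For a proper before/after comparison, a small timing helper along these lines could be used. It is only a sketch: `time_calls` is a hypothetical name, and it measures wall-clock time with `time.perf_counter`, so network variance is included.

```python
import time
from typing import Callable, List


def time_calls(func: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `func` `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations
```

Inside `kedro ipython`, this could be called as `time_calls(lambda: catalog.load("wide-basic-mlm"))` with the original and the custom dataset in turn. Repeated calls on the same object may benefit from caching, so the first duration is likely the most informative.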

Juan Luis

07/18/2024, 3:29 PM

there has been no activity on the s3fs issue you pointed out @Flavien... is this something we could maybe change in our own datasets?
