# questions
m
Quick question: I noticed the implementation of `_load` for the pandas parquet_dataset is different in `kedro.extras.datasets` vs the `kedro-datasets` plugin! The difference is significant, as the one in kedro extras can be extremely slow (2 hours compared to 10 seconds to load a dataset). In our case, we had a dataset on S3 generated by a Spark job (hence a “directory” of (snappy) parquet files with a `_SUCCESS` file) with 137,808 rows and 6,410 columns. With that dataset, I could validate that
`pq.ParquetDataset(load_path, filesystem=self._fs).read(**self._load_args)`
took longer than 15 minutes (after that, I ran out of patience, since `pd.read_parquet()` on the same dataset loaded within 10 seconds). So the question is: should we already switch from the kedro extras datasets to the new `kedro-datasets` plugin to solve this issue? Is this plugin ready to use with the current kedro version (v0.18.x)? And can we then simply remove the pandas extras from our `requirements`?
As a follow-up: does anyone have an idea why `pq.ParquetDataset(...).read(...)` is so much slower (720x 😱) than `pq.read_table()`?
m
Yes, `kedro-datasets` is ready for use! The earlier you start using it instead of the datasets in `kedro.extras` the better, because those will be completely removed in Kedro 0.19.0.
d
FYI, our original Parquet dataset in extras predates pandas implementing this on their end, so the newer one defers to the pandas implementation
m
Thanks for the quick replies! I will plan the change ASAP then 😄. And yes, I am aware that at some point in time pandas wasn’t able to do this, but I was surprised by the difference in speed 😱, especially given that they all use pyarrow under the hood.
d
Yeah, IIRC we didn’t even use the pyarrow library at first
we used something called fastparquet
then also in general the pandas implementation will have a lot more eyes on it and get optimised much more
in the medium term I think we’re going to introduce a `pandas.DeltaTable` dataset, which may have the same problem longer term: it doesn’t exist natively today, so we’ll build it… but it will likely land in pandas at some point and we’ll remove our implementation then
this whole process is much more art than science!
😄 1
m
On a side note, is there a reason why delta-spark is pinned much more strictly in kedro-datasets compared to kedro? https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/setup.py#L79 https://github.com/kedro-org/kedro/blob/main/setup.py#L76
n
https://github.com/kedro-org/kedro-plugins/issues/159 Just created the issue to keep track of it
Good spot @Matthias Roels, we should update that in `kedro-plugins`. The `kedro` requirements were relaxed in https://github.com/kedro-org/kedro/commit/27f5490893dfce10a63ddeb57bf45110587d2b90, but we never synced this change to the plugins.
It’s an oversight.
https://github.com/kedro-org/kedro-plugins/pull/160 for quick approval @datajoely @Merel since you were the OP & reviewer (got enough approval already, thanks!)
👍 1