# questions
m
Quick question: I noticed the implementation of `_load` for the pandas parquet_dataset is different in `kedro.extras.datasets` vs the `kedro-datasets` plugin! The difference is significant, as the one in kedro extras can be extremely slow (2 hours compared to 10 seconds to load a dataset). In our case, we had a dataset on S3 generated by a Spark job (hence a “directory” of (snappy) parquet files with a `_SUCCESS` file) with 137,808 rows and 6,410 columns. With that dataset, I could validate that
`pq.ParquetDataset(load_path, filesystem=self._fs).read(**self._load_args)`
took longer than 15 minutes (after that, I ran out of patience, since `pd.read_parquet()` on the same dataset loaded within 10 seconds). So the question is: should we already switch from the kedro extras datasets to the new `kedro-datasets` plugin to solve this issue? Is this plugin ready to use with the current kedro version (v0.18.x)? And can we then simply remove the pandas extras from our `requirements`?
As a follow-up: does anyone have an idea why `pq.ParquetDataset(...).read(...)` is so much slower (720x 😱) than `pq.read_table()`?
m
Yes, `kedro-datasets` is ready for use! The earlier you start using it instead of the datasets in `kedro.extras` the better, because those will be completely removed in Kedro 0.19.0.
d
FYI, our original Parquet dataset in extras predates pandas implementing this on their end, so the newer one defers to the pandas implementation
m
Thanks for the quick replies! I will plan the change ASAP then 😄. And yes, I am aware that at some point in time pandas wasn’t able to do this, but I was surprised by the difference in speed 😱, especially given that they all use pyarrow under the hood.
d
Yeah, IIRC we didn’t even use the pyarrow library at first
we used something called fastparquet
then also in general the pandas implementation will have a lot more eyes on it and get optimised much more
in the medium term I think we’re going to introduce a `pandas.DeltaTable` dataset, which may have the same problem longer term: it doesn’t exist natively today, so we’ll build it… but it will likely land in pandas at some point and we’ll remove our implementation then
this whole process is much more art than science!
😄 1
m
On a side note, is there a reason why delta-spark is pinned much more strictly in kedro-datasets compared to kedro? https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/setup.py#L79 https://github.com/kedro-org/kedro/blob/main/setup.py#L76
n
https://github.com/kedro-org/kedro-plugins/issues/159 Just created the issue to keep track of it
Good spot @Matthias Roels, we should update that in `kedro-plugins`. The `kedro` requirements were relaxed in https://github.com/kedro-org/kedro/commit/27f5490893dfce10a63ddeb57bf45110587d2b90, but we never synced this change to the plugins.
It’s an oversight.
https://github.com/kedro-org/kedro-plugins/pull/160 for quick approval @datajoely @Merel since you were the OP & reviewer (got enough approval already, thanks!)
👍 1