Matthias Roels
04/06/2023, 12:22 PM
_load for the pandas parquet_dataset is different in kedro.extras.datasets vs the kedro-datasets plugin! The difference is significant, as the one in kedro extras can be extremely slow (2 hours compared to 10 seconds to load a dataset)!
In our case, we had a dataset on S3 generated by a Spark job (hence a “directory” of (snappy) parquet files with a _SUCCESS file) with 137808 rows and 6410 columns. With that dataset, I could validate that pq.ParquetDataset(load_path, filesystem=self._fs).read(**self._load_args) indeed took longer than 15 minutes (after that, I ran out of patience, since pd.read_parquet() on the same dataset loaded within 10 seconds).
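A minimal sketch of how such a comparison could be timed, assuming pandas and pyarrow are installed and load_path points at the parquet directory. The helper names (time_call, speedup, compare) are hypothetical and not part of Kedro; only pq.ParquetDataset(...).read(...) and pd.read_parquet() come from the message above. Note that the first call returns a pyarrow Table and the second a pandas DataFrame, so this is a rough comparison rather than a strict benchmark.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast path is than the slow one."""
    return slow_seconds / fast_seconds


def load_with_parquet_dataset(load_path: str, filesystem=None, **load_args):
    # The call quoted in the message; returns a pyarrow Table.
    # pyarrow is imported lazily because it is an optional dependency here.
    import pyarrow.parquet as pq

    return pq.ParquetDataset(load_path, filesystem=filesystem).read(**load_args)


def load_with_read_parquet(load_path: str, **load_args):
    # The pandas call the message reports as fast; returns a DataFrame.
    import pandas as pd

    return pd.read_parquet(load_path, **load_args)


def compare(load_path: str) -> float:
    """Time both loaders once on load_path and return the speedup factor."""
    _, slow = time_call(lambda: load_with_parquet_dataset(load_path))
    _, fast = time_call(lambda: load_with_read_parquet(load_path))
    return speedup(slow, fast)
```

With the figures reported in this thread, speedup(2 * 60 * 60, 10) gives 720.0, which matches the 720x mentioned.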
So the question is: should we already switch from kedro extras datasets to the new kedro-datasets plugin to solve this issue? Is this plugin already ready to use with the current kedro version (v0.18.x)? And can we then simply remove the pandas extras from our requirements?
pq.ParquetDataset(...).read(...) is so much slower (720x 😱) than pq.read_table()?
Merel
04/06/2023, 12:36 PM
kedro-datasets is ready for use! The earlier you can start using it instead of the datasets in kedro.extras the better, because those will be completely removed in Kedro 0.19.0.
datajoely
04/06/2023, 12:41 PM
Matthias Roels
04/06/2023, 12:49 PM
datajoely
04/06/2023, 12:50 PM
pandas.DeltaTable dataset which may have the same problem longer term, i.e. it doesn’t exist natively today, so we’ll build it… but it will likely land in pandas at some point and we’ll remove our implementation then
Matthias Roels
04/06/2023, 1:13 PM
Nok Lam Chan
04/06/2023, 1:18 PM
kedro-plugins kedro requirements were relaxed in https://github.com/kedro-org/kedro/commit/27f5490893dfce10a63ddeb57bf45110587d2b90 but we never synced this change to plugins