# questions
f
Hi All! I'm trying to use:
spark.sql.parquet.aggregatePushdown: true
together with
spark.SparkDataset
But the query plan is not making use of the Parquet metadata when computing things such as max() or min() of columns. Ideally I would expect the aggregation to be pushed down into the scan, but no luck. Has anyone made
spark.sql.parquet.aggregatePushdown
work with Kedro? Thanks
d
Kedro shouldn’t be making a difference here
it’s really just a thin wrapper on
spark.read
and
df.write
d
Did you set your config in an appropriate place? https://docs.kedro.org/en/stable/integrations/pyspark_integration.html
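For reference, in the setup described in those docs the Spark options usually live in a `conf/base/spark.yml` file that a project hook loads into the `SparkConf` at startup. A minimal sketch (file name and loading mechanism assumed to follow the linked docs; adjust to your project):

```yaml
# conf/base/spark.yml — sketch, assuming the standard Kedro PySpark hook
# reads this file and applies each key to the SparkConf before the
# SparkSession is created
spark.sql.parquet.aggregatePushdown: true
```

If the option shows up in `spark.sparkContext.getConf().getAll()` at runtime, the config was picked up correctly.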
f
I checked, and the SparkSession object does have the aggregatePushdown option set to true when printing the config dictionary of the Spark session. But when computing the max of a column it's still doing the shuffles. Wondering if maybe someone has used this option and made it work 👀 Thanks both for the input
d
Probably a Spark support question at this point, as @datajoely mentioned. If you want to be extra sure, you can try the same thing without Kedro at all.
f
Thanks, yes. Just in case anyone is wondering: the solution was that you must also enable the V2 DataSource API so aggregate pushdown can work:
spark.sql.sources.useV1SourceList: ""
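Putting the two settings together, a `spark.yml` sketch (assuming the Kedro hook from the docs loads this file) would be:

```yaml
# Both settings are needed: Parquet aggregate pushdown is implemented only
# in the DataSource V2 Parquet reader, and Spark falls back to the V1
# source by default. Clearing useV1SourceList forces the V2 path.
spark.sql.parquet.aggregatePushdown: true
spark.sql.sources.useV1SourceList: ""
```

You can confirm pushdown worked by calling `df.explain()` and checking that the plan shows the aggregate pushed into the scan instead of a shuffle plus HashAggregate.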