# questions
f
Hi All! I'm trying to use:
spark.sql.parquet.aggregatePushdown: true
together with
spark.SparkDataset
But the query plan is not making use of the Parquet metadata when computing things such as max() or min() of columns. Ideally I would expect the aggregation to be pushed down into the scan, but no luck. Has anyone made
spark.sql.parquet.aggregatePushdown
work with Kedro? Thanks
d
Kedro shouldn’t be making a difference here
it’s really just a thin wrapper on
spark.read
and
df.write
d
Did you set your config in an appropriate place? https://docs.kedro.org/en/stable/integrations/pyspark_integration.html
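For reference, in the setup described in those docs the Spark options usually live in a `conf/base/spark.yml` file that a project hook loads into the `SparkConf` at startup. A minimal sketch (file name and loading mechanism assumed to follow the linked docs; adjust to your project):

```yaml
# conf/base/spark.yml — sketch, assuming the standard Kedro PySpark hook
# reads this file and applies each key to the SparkConf before the
# SparkSession is created
spark.sql.parquet.aggregatePushdown: true
```

If the option shows up in `spark.sparkContext.getConf().getAll()` at runtime, the config was picked up correctly.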
f
I checked, and the SparkSession object does have the aggregatePushdown option set to true when printing the config dictionary of the Spark session. But when computing the max of a column it's still doing the shuffles. Wondering if maybe someone has used this option and made it work 👀 Thanks both for the input
d
Probably a Spark support question at this point, as @datajoely mentioned. If you want to be extra sure, you can try the same thing without Kedro at all.
f
Thanks, yes. Just in case anyone is wondering: the solution was that you must also enable the V2 DataSource API so aggregate pushdown can work:
spark.sql.sources.useV1SourceList: ""
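Putting the two settings together, a `spark.yml` sketch (assuming the Kedro hook from the docs loads this file) would be:

```yaml
# Both settings are needed: Parquet aggregate pushdown is implemented only
# in the DataSource V2 Parquet reader, and Spark falls back to the V1
# source by default. Clearing useV1SourceList forces the V2 path.
spark.sql.parquet.aggregatePushdown: true
spark.sql.sources.useV1SourceList: ""
```

You can confirm pushdown worked by calling `df.explain()` and checking that the plan shows the aggregate pushed into the scan instead of a shuffle plus HashAggregate.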