Together with Polars, DuckDB greatly reduces the need for Spark. Unless your data simply cannot be processed on a single machine, Spark is almost always overkill. The way I see it: use DuckDB + dbt if you want to go SQL all the way, use Polars if you want to stay in Python (it fits best if you use Kedro, IMO), and use Spark if you have to process terabytes of data.