Nice article by the wonderful folks at Data Minded, a Belgian data engineering consultancy (who also have an academy and a managed product)!
Together with Polars, DuckDB heavily reduces the need for Spark. Unless there is no way to process your data on a single machine, Spark is almost always overkill. The way I see it: use DuckDB + dbt if you want to go SQL all the way, use Polars if you want to stay in Python (suits best if you use Kedro IMO), and use Spark if you have to process TBs of data.
Nok Lam Chan
09/25/2023, 7:45 PM
And Ibis could potentially enable a similar workflow with Kedro