# random
i
Tangentially related to Kedro, but a while ago I saw a post (maybe LinkedIn) about how people are realizing that distributed data processing tools (e.g. PySpark, maybe?) aren't actually necessary, and that DuckDB (I think) was enough for 95% of applications, since people are usually under 1TB scale for most operations. Anyone know where I could read more about something like this?
d
There was a https://www.smalldatasf.com conference recently, recordings coming soon
👍 1
But equally - that's why I'm super excited by kedro + ibis + duckdb as our default pattern
🥳 1
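If it helps to picture the pattern, here's a minimal sketch with Ibis on a local DuckDB file (the file, table, and column names are just made up for illustration); in Kedro the connection details would typically live in the catalog rather than in node code:
```python
import ibis

# connect to (or create) a local DuckDB database file
con = ibis.duckdb.connect("warehouse.duckdb")

# register a Parquet file as a table (path is hypothetical)
orders = con.read_parquet("data/orders.parquet")

# build a lazy Ibis expression; nothing runs until results are requested
summary = orders.group_by("customer_id").aggregate(
    total_spend=orders.amount.sum()
)

print(summary.to_pandas().head())
```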
i
Nice, thanks Joel! I might be implementing some data connectors soon and wanted to use that exact pattern
d
Yeah my current team is doing that locally against synthetic health data and then the same code is deployed in prod against real data they can't see
👀 1
I'm really liking it
j
you probably mean [image]
👍 1
i
That was the exact image! Nice one haha
d
DataFusion is the other gamechanger - sdflabs is the startup around that I'm most interested in
👍🏼 1
n
I just did a presentation about this at PyConHK😄 (haven't seen this diagram tho)
👀 2
d
can you share your slides?
n
DuckDB usually handles larger-than-memory datasets better than Polars; <1 TB is probably relative to how big your machine is. Both are fast as long as the data fits in memory most of the time.
👍 2
d
you can also use lazy Polars though, right?
n
Yes, that helps too, but in DuckDB, for example, a larger-than-memory join will spill to disk automatically. I haven't done benchmarking myself, but from summaries I've read by other people, DuckDB does a better job. For example, here DuckDB did a 50GB join on a 16GB RAM MacBook: https://duckdb.org/2024/06/26/benchmarks-over-time.html
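Roughly what I mean, as a sketch (paths and the limit values are made up; the settings just make the memory cap and spill location explicit):
```python
import duckdb

con = duckdb.connect("joins.duckdb")

# cap memory and point spill files at a scratch directory
con.execute("SET memory_limit = '8GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# a join over Parquet files that may be larger than RAM; intermediate
# state that doesn't fit in memory gets spilled to temp_directory
result = con.execute(
    """
    SELECT l.id, l.value, r.label
    FROM read_parquet('left/*.parquet') AS l
    JOIN read_parquet('right/*.parquet') AS r USING (id)
    """
).df()
```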
d
https://docs.coiled.io/blog/tpch.html is another interesting benchmark. All benchmarks are flawed, but I trust Dask more on a DuckDB vs. Polars benchmark. 🤣
👀 2
I also had an interesting chat with Thomas Fan (scikit-learn maintainer) at PyData NYC, and one of the things I took away is that, despite their optimizations, ML preprocessing on DuckDB (or Polars) should be faster than using scikit-learn, where possible, because of query optimization. So even on the ML side, if you really need to scale up, you can try some of these local options before going distributed. (Maybe IbisML? 😂)
😁 2
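As a rough sketch of what I mean (the column names are invented), a scaling step can be written as an Ibis expression so the engine does the work instead of pandas + scikit-learn:
```python
import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB
features = con.read_parquet("features.parquet")  # hypothetical feature table

# standard-scale two columns inside the engine instead of in pandas
scaled = features.mutate(
    age_scaled=(features.age - features.age.mean()) / features.age.std(),
    income_scaled=(features.income - features.income.mean()) / features.income.std(),
)

df = scaled.to_pandas()  # only materialize right before model fitting
```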
https://ibis-project.org/posts/1tbc/ is also interesting, if you haven't seen it before, by @Cody Peterson.
👍 2
n
oh the Dask one was the one I read
f
There is also this article on the topic: https://motherduck.com/blog/big-data-is-dead/
Some arguments for distributed compute I've heard are around fault tolerance and not reprocessing everything, but I'm not sure I'm convinced, especially on joins
👍 2
👍🏼 1
d
Realistically, even when I was doing lots of work with PySpark, it was almost never necessary. I made that point, but the usual counterargument was, "But what about when we need to scale??" (Spoiler: we never magically went from 1 GB to 100 GB on these projects, much less TBs.) That said, there are of course cases where you need to scale and use distributed compute. I find Ibis to be a good fit for doing this without rewriting (much); in a recent tutorial, I ran the same code on a sample locally with DuckDB and on the full data with Trino. (And, while I say Ibis is a good fit, obviously I'm biased to some extent, having worked on it and the Kedro integration.)
👍 1
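For anyone curious, the swap looks roughly like this (connection details and names are placeholders, not the actual tutorial code):
```python
import ibis

def summarize(t):
    # backend-agnostic Ibis expression
    return t.group_by("region").aggregate(n=t.count())

# locally: DuckDB against a sample
local = ibis.duckdb.connect()
print(summarize(local.read_parquet("sample.parquet")).to_pandas())

# in prod: swap the connection for Trino and keep summarize() unchanged
# prod = ibis.trino.connect(host="trino.internal", port=8080, database="hive", schema="raw")
# print(summarize(prod.table("events")).to_pandas())
```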
n
When rerunning is fast enough, fault tolerance isn't a good argument. I haven't done any pipelines lately, but I'm convinced that 100GBs of data is well within single-node capability. The advantage of Spark at smaller scale isn't really distributed computing but rather the ecosystem and first-class integration with most platforms. (That will change with time too.)