# random
i
Tangentially related to Kedro, but a while ago I saw a post (maybe LinkedIn) about how people are realizing that distributed data processing tools (e.g. PySpark, maybe?) aren't actually necessary, and that DuckDB (I think) was enough for 95% of applications, since people are usually under 1TB scale for most operations. Anyone know where I could read more about something like this?
d
There was a https://www.smalldatasf.com conference recently, recordings coming soon
👍 1
But equally - that's why I'm super excited by kedro + ibis + duckdb as our default pattern
🥳 1
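If it helps to picture the pattern, here's a minimal sketch with Ibis on a local DuckDB file (the file, table, and column names are just made up for illustration); in Kedro the connection details would typically live in the catalog rather than in node code:
```python
import ibis

# connect to (or create) a local DuckDB database file
con = ibis.duckdb.connect("warehouse.duckdb")

# register a Parquet file as a table (path is hypothetical)
orders = con.read_parquet("data/orders.parquet")

# build a lazy Ibis expression; nothing runs until results are requested
summary = orders.group_by("customer_id").aggregate(
    total_spend=orders.amount.sum()
)

print(summary.to_pandas().head())
```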
i
Nice, thanks Joel! I might be implementing some data connectors soon and wanted to use that exact pattern
d
Yeah my current team is doing that locally against synthetic health data and then the same code is deployed in prod against real data they can't see
👀 1
I'm really liking it
j
you probably mean [image]
👍 1
i
That was the exact image! Nice one haha
d
DataFusion is the other gamechanger - sdflabs is the startup around that I'm most interested in
👍🏼 1
n
I just did a presentation about this at PyConHK😄 (haven't seen this diagram tho)
👀 2
d
can you share your slides?
n
DuckDB usually handles larger-than-memory datasets better than Polars; <1 TB is probably relative to how big your machine is. Both are fast as long as the data fits in memory most of the time.
👍 2
d
you can also use lazy Polars though, right?
n
Yes, that helps too, but in DuckDB, for example, a larger-than-memory join will spill to disk automatically. I haven't done benchmarking myself, but from summaries I've read by other people, DuckDB does a better job. For example, here DuckDB did a 50GB join on a 16GB RAM MacBook: https://duckdb.org/2024/06/26/benchmarks-over-time.html
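Roughly what I mean, as a sketch (paths and the limit values are made up; the settings just make the memory cap and spill location explicit):
```python
import duckdb

con = duckdb.connect("joins.duckdb")

# cap memory and point spill files at a scratch directory
con.execute("SET memory_limit = '8GB'")
con.execute("SET temp_directory = '/tmp/duckdb_spill'")

# a join over Parquet files that may be larger than RAM; intermediate
# state that doesn't fit in memory gets spilled to temp_directory
result = con.execute(
    """
    SELECT l.id, l.value, r.label
    FROM read_parquet('left/*.parquet') AS l
    JOIN read_parquet('right/*.parquet') AS r USING (id)
    """
).df()
```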
d
https://docs.coiled.io/blog/tpch.html is another interesting benchmark. All benchmarks are flawed, but I trust Dask more on a DuckDB vs. Polars benchmark. 🤣
👀 2
I also had an interesting chat with Thomas Fan (scikit-learn maintainer) at PyData NYC, and one of the things I took away is that, despite their optimizations, ML preprocessing on DuckDB (or Polars) should be faster than using scikit-learn, where possible, because of query optimization. So even on the ML side, if you really need to scale up, you can try some of these local options before going distributed. (Maybe IbisML? 😂)
😁 2
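As a rough sketch of what I mean (the column names are invented), a scaling step can be written as an Ibis expression so the engine does the work instead of pandas + scikit-learn:
```python
import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB
features = con.read_parquet("features.parquet")  # hypothetical feature table

# standard-scale two columns inside the engine instead of in pandas
scaled = features.mutate(
    age_scaled=(features.age - features.age.mean()) / features.age.std(),
    income_scaled=(features.income - features.income.mean()) / features.income.std(),
)

df = scaled.to_pandas()  # only materialize right before model fitting
```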
https://ibis-project.org/posts/1tbc/ is also interesting, if you haven't seen it before, by @Cody Peterson.
👍 2
n
oh the Dask one was the one I read
f
There is also this article on the topic: https://motherduck.com/blog/big-data-is-dead/
Some arguments for distributed compute I've heard are around fault tolerance and not reprocessing everything, but I'm not sure I'm convinced, especially on joins
👍 2
👍🏼 1
d
Realistically, even when I was doing lots of work with PySpark, it was almost never necessary. I made that point, but the usual counterargument was, "But what about when we need to scale??" (Spoiler: we never magically went from 1 GB to 100 GB on these projects, much less TBs.) That said, there are of course cases where you need to scale and use distributed compute. I find Ibis to be a good fit for doing this without rewriting (much); in a recent tutorial, I ran the same code on a sample locally with DuckDB and on the full data with Trino. (And, while I say Ibis is a good fit, obviously I'm biased to some extent, having worked on it and the Kedro integration.)
👍 1
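For anyone curious, the swap looks roughly like this (connection details and names are placeholders, not the actual tutorial code):
```python
import ibis

def summarize(t):
    # backend-agnostic Ibis expression
    return t.group_by("region").aggregate(n=t.count())

# locally: DuckDB against a sample
local = ibis.duckdb.connect()
print(summarize(local.read_parquet("sample.parquet")).to_pandas())

# in prod: swap the connection for Trino and keep summarize() unchanged
# prod = ibis.trino.connect(host="trino.internal", port=8080, database="hive", schema="raw")
# print(summarize(prod.table("events")).to_pandas())
```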
n
When rerunning is fast enough, fault tolerance isn't a good argument. I haven't done any pipelines lately, but I'm convinced that 100GBs of data is well within single-node capability. The advantage of Spark at smaller scale isn't really distributed computing but rather the ecosystem and first-class integration with most platforms. (That will change with time too.)