# questions
m
Hi @Deepyaman Datta, @datajoely, @Juan Luis, fantastic work and write-up! I don't know if this is the right place for non-technical questions; let me know. My main question: for many years we've been saying that Kedro solves the experimentation vs. production dilemma, i.e., "the only useful code is production code". Then the article wakes you up with `unfortunately, deploying the same data pipelines in production often doesn't work as well as one would hope`. Don't get me wrong, I've had my own experience with "the solution doesn't scale", but there can be many reasons behind that, e.g., Spark config is a rabbit hole, costly operations, joins between large and small tables, etc. I'm a bit unclear on the problem statement: what part of Kedro makes a production-ready data pipeline not scalable? Was it production-ready in the first place? If it's not scalable, shouldn't we just start directly in SQL? Just to make it clear, I'm not challenging the value of the Kedro–Ibis integration at all; I love this project and am a big supporter of it. But I'd like to better understand the source of the scalability problem. Thank you!
❤️ 3
👍🏼 1
K 4
👍 2
j
Thanks for such a candid question; we're absolutely happy to see it here. I'll leave the SMEs to answer it, but I'd also say it's brilliant to have discussion of blog posts. It feels like a milestone, because normally we publish, do some cheerleading, and move on with no follow-up on the content. What's more, if you feel there's a second post to come that discusses the topic further, that's a great outcome and I'd be happy to pick it up and publish it!
💯 3
d
Great question @Mate Scharnitzky. Fundamentally, I think SQL vs. DataFrame is a UX issue, not an engineering one.
• When talking about SQL, we need to separate the query language from the execution engine. Ibis lets you work with the query-language part in a consistent DataFrame syntax across multiple execution-engine backends (1:M); there's a small sketch right after this message.
• IMO DataFrames are way more expressive and nuanced than SQL (even creative modern SQL like DuckDB's, which tries to add lots of Pythonic stuff).
• If you're a SQL-only team, dbt/sqlmesh-style workflows are great; for teams where a lot of the work will be in Python (particularly the ML part), I really like the idea of keeping the codebase consistent.
• For me, the ability to hot-swap a DuckDB development environment for a Spark production environment feels like witchcraft, but also something that becomes simple to do if we adopt Ibis from the beginning.
• The term "scalability" is broad: there's obviously a performance angle, but I think the team-collaboration dimension (and even the team-of-teams dimension) is under-celebrated.
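To make the query-language vs. execution-engine split concrete, here's a minimal sketch (the table and column names are made up for illustration; `ibis.memtable` runs on Ibis's default local DuckDB backend):
```python
import ibis
from ibis import _

# Hypothetical in-memory table, purely for illustration.
orders = ibis.memtable(
    {"region": ["EU", "EU", "US"], "amount": [10.0, 20.0, 5.0]}
)

# One DataFrame-style expression; nothing executes yet.
summary = (
    orders.filter(_.amount > 0)
    .group_by("region")
    .aggregate(total=_.amount.sum())
)

# The same expression can be rendered as SQL for a chosen dialect...
print(ibis.to_sql(summary, dialect="duckdb"))

# ...and it only runs when you ask for results (here on the default DuckDB backend).
print(summary.to_pandas())
```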
👍 3
👍🏽 1
m
Thank you, Joel!
n
I think this also has to do with a bias we have: a lot of projects start with PySpark very early on, even if it's just a few CSVs. That's a trade-off we pay to avoid converting all the code to PySpark later on, because you'd run into more scalability issues if you hadn't started with something that scales.
👍 1
j
On top of what @datajoely said, and to give a more specific example: I think it's typical for data teams to start small with pandas, then realise that pandas can't handle big volumes of data, then rewrite the thing in PySpark, then despair because nobody likes debugging PySpark. Ibis allows backend swapping and has a nice lazy query engine that can potentially make this transition easier (rough sketch below). Anyway @Mate Scharnitzky, you're totally right: putting stuff in production is hard the moment development != production (and it rarely is). Kedro tries to bridge the gap, but in the end it's a framework that structures your code.
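A rough sketch of the backend hot-swap idea, for illustration only (the connection details, table name, and the `spark` SparkSession are placeholders, not anything from the post):
```python
import ibis
from ibis import _


def summarise(orders):
    """Backend-agnostic transformation: takes and returns an Ibis table expression."""
    return (
        orders.filter(_.amount > 0)
        .group_by("region")
        .aggregate(total=_.amount.sum())
    )


# Development: a local DuckDB database file.
dev_con = ibis.duckdb.connect("dev.duckdb")
dev_result = summarise(dev_con.table("orders")).to_pandas()

# Production: point the same transformation at Spark instead.
# (Assumes an existing SparkSession called `spark`.)
prod_con = ibis.pyspark.connect(spark)
prod_result = summarise(prod_con.table("orders")).to_pandas()
```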
👍 1
and also I 💯 agree with @Jo Stichbury, it's fantastic to have discussion on blog posts and I think #questions is the best place!
m
Thank you, all!
d
All great points! I'd like to add: using Spark from the beginning can be a good, forward-thinking choice (there's a reason Kedro has provided first-class support since pretty much the start), but it may still not be the best choice. I don't want to get into benchmark wars, but if a team is using Snowflake (just as an example; this applies equally to many other engines), I think Snowflake-native code will generally outperform loading data into a Spark cluster (even with predicate pushdown to reduce the amount of data loaded, which isn't always an option), processing it, and potentially writing back to Snowflake. The cost will almost certainly be lower, too. Also, who's managing the Spark cluster? I've personally been on many projects where a prerequisite of using Spark with Kedro was convincing the client to adopt Databricks. In these cases, maybe the right answer (or at least the path of least resistance) is to perform the compute natively on the engine (see the sketch below). This is part of the reason dbt is so widely used.
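As a rough sketch of what "compute natively on the engine" could look like via Ibis's Snowflake backend (the account, warehouse, table, and column names below are all placeholders, and the exact connection arguments may differ by Ibis version):
```python
import ibis
from ibis import _

# Placeholder connection details; real credentials would come from your environment.
con = ibis.snowflake.connect(
    user="me",
    account="my_org-my_account",
    database="ANALYTICS/PUBLIC",
    warehouse="COMPUTE_WH",
    authenticator="externalbrowser",
)

orders = con.table("ORDERS")  # hypothetical table

# The aggregation compiles to Snowflake SQL and runs inside the warehouse;
# only the small result set comes back to the client.
top_regions = (
    orders.group_by("REGION")
    .aggregate(total=_.AMOUNT.sum())
    .order_by(ibis.desc("total"))
    .limit(10)
)
print(top_regions.to_pandas())
```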
👍 1
👍🏼 1
i
+1 for another blog post on this 🚀 I would love to see the discussion all spelled out
👀 1