# random
n

https://youtu.be/RnXsJzo7Ww8?si=hHjyNDKtWMSMY9vb

Modular machine learning pipeline 😁
🥳 2
🫥 1
🔥 2
d
Not sure this is accurate...
😅 1
I'm no LLM expert, but what does Hamilton do to power LLM applications? There's no framework code that does the same... https://github.com/search?q=repo%3ADAGWorks-Inc%2Fhamilton%20llm&type=code It's also weird that Hamilton is a library, dbt is a library (sure), but Kedro is maybe not a library (what criteria got it classified as less of a library than dbt?). And I'm not entirely sure where column-level lineage comes from. I believe they produce individual `pd.Series` as outputs of nodes by convention in many examples, so basically your node graph counts as column-level lineage? I do like bits and pieces of positioning around Hamilton, but I'm definitely put off by false advertising. 😂
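For readers less familiar with the convention being referenced, here is a minimal sketch of that pattern, loosely following Hamilton's documented hello-world examples (the function and column names are illustrative, not from the talk):

```python
# Sketch of a Hamilton-style module: each function is a node, inputs are
# declared via parameter names, and returning one pd.Series per node makes
# the function-level DAG double as column-level lineage. Names are illustrative.
import pandas as pd


def spend(raw_df: pd.DataFrame) -> pd.Series:
    """Column node: pulls the `spend` column out of the input dataframe."""
    return raw_df["spend"]


def signups(raw_df: pd.DataFrame) -> pd.Series:
    """Column node: pulls the `signups` column out of the input dataframe."""
    return raw_df["signups"]


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Derived column: depends on `spend` and `signups` by parameter name."""
    return spend / signups
```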
t
@Deepyaman Datta Hey, author here! Happy to chime in! To address the comments one by one:

## "Library vs. system"
The exact notion is up for debate, but the idea was:
- What do users need to opt into?
- Once they opt in, how portable is their code?

I think we can agree that Dagster and Airflow are systems that define "how you'll do your work". Once you write dbt or Hamilton code, yes, you are committing to a framework, but you can run dbt or Hamilton anywhere you want. I used 🚸 because my experience is that `kedro run` is largely promoted as the principal way to execute code, which limits its portability.

## LLMs
- We have multiple examples on GitHub and ready-to-use dataflows for vector search, RAG, text summarization, etc.
- The key difference here is how the Hamilton `Driver` facilitates in-memory operations vs. the Kedro `Runner` classes. As far as I understand, you wouldn't run Kedro within a FastAPI application, but it's a common pattern for Hamilton (see the online feature engineering example, and the sketch below).

## Column-level operations
- It has been at the core of Hamilton since its inception.
- We have many utilities to create column-level nodes from dataframe-level nodes, allowing for very granular lineage, data validation, and schema checking.
- Our `Driver` can resolve these column-level nodes into a dataframe seamlessly for users running Pandas, Polars (regular and lazy), Dask, PySpark, or Vaex.
- It's a beloved feature and a big reason many companies adopt Hamilton for feature engineering and powering their ML pipelines.

IMO, the "false advertising" claim is stronger than any of these points. Happy to discuss and rectify anything I misunderstood about Kedro.
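To make the in-memory / FastAPI point concrete, a minimal sketch (the `features` module is the hypothetical one sketched earlier in the thread, and the route and payload shape are made up for illustration; `driver.Builder` and `Driver.execute` are standard Hamilton calls):

```python
# Minimal sketch of running a Hamilton Driver in-process inside a FastAPI
# service. `features` is the hypothetical module from the earlier sketch;
# the endpoint and payload are assumptions for illustration.
import pandas as pd
from fastapi import FastAPI
from hamilton import driver

import features  # hypothetical module defining spend / signups / spend_per_signup

app = FastAPI()
dr = driver.Builder().with_modules(features).build()


@app.post("/score")
def score(payload: dict) -> dict:
    raw_df = pd.DataFrame([payload])
    # Request only the node(s) needed; Hamilton resolves and runs upstream nodes.
    result = dr.execute(["spend_per_signup"], inputs={"raw_df": raw_df})
    return {"spend_per_signup": float(result["spend_per_signup"].iloc[0])}
```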
n
t
@Nok Lam Chan great-looking project! I did do my due diligence, but it wasn't exactly easy to find in the docs.
w
Even without `kedro-boot` you can pretty much run a Kedro session anywhere, including a FastAPI service. I've been running Kedro sessions inside a Slack bot handler for a while now, and invoking pipeline steps from external code by distributing the project as a PyPI package. That's very portable if you ask me.
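For concreteness, a rough sketch of what that pattern can look like (the project path, pipeline name, and handler are assumptions; `bootstrap_project`, `KedroSession.create`, and `session.run` are the standard Kedro entry points):

```python
# Rough sketch of invoking a Kedro pipeline from external code (e.g. a bot
# handler or web service). The project path and pipeline name are assumptions.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = Path("/opt/my-kedro-project")  # hypothetical installed project


def handle_event(event: dict) -> dict:
    """E.g. a Slack/HTTP handler that runs one named pipeline per request."""
    bootstrap_project(PROJECT_PATH)
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        return session.run(pipeline_name="inference")  # hypothetical pipeline name
```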
m
Hi Thierry! (we exchanged a bit after this talk 🙂) On my side, the only surprising thing in this slide was the use of "declarative vs imperative." Maybe I'm mistaken, but I see Kedro as mostly a declarative framework (meaning here that the graph declaration is distinct and decoupled from the graph execution), whereas on your slides, "declarative" vs. "imperative" seem to be used to mean implicit vs. explicit node input and output mapping (where Hamilton infers the graph from the names of function parameters, while Kedro requires explicit mapping to named datasets and parameters). Otherwise, great presentation! In my job, I mostly use Kedro (the concept of the catalog and "data-centric" graph definition is a lifesaver when trying to transfer moderately complex graphs, mixing data sources, to other teams). But I also like the simplicity, the API design, and the UI of Hamilton. Excited to see how both frameworks continue to evolve!
👍 2
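For readers following along, the mapping distinction looks roughly like this (the dataset names and function are illustrative; `node` and `Pipeline` are the standard Kedro constructs, in contrast to the name-based wiring in the Hamilton sketch earlier in the thread):

```python
# Sketch of the explicit wiring Kedro uses: a node's inputs/outputs are mapped
# to named catalog datasets rather than inferred from parameter names.
# Dataset and function names are illustrative only.
from kedro.pipeline import Pipeline, node


def compute_spend_per_signup(spend, signups):
    return spend / signups


pipeline = Pipeline(
    [
        node(
            func=compute_spend_per_signup,
            inputs=["spend", "signups"],   # explicit input dataset names
            outputs="spend_per_signup",    # explicit output dataset name
        ),
    ]
)
```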
t
Hi @Martin S! TL;DR: individual features or APIs can be described as declarative or imperative, but labeling the tool as a whole is up for debate. For example, Python itself is imperative. To ground the discussion, I wrote about Airflow, Dagster, and Hamilton, where Airflow is fully imperative, Dagster allows both, and Hamilton is declarative. Here are some notes by "feature":

## Pipeline definition
• Kedro decouples: transform logic (the function), node definition, pipeline definition, and data catalog (how to materialize the asset).
• Hamilton couples transform logic, node definition, and pipeline definition. We use materializers, which you can couple or not with your node definition.
• Both tools have an automatic DAG resolver from node definitions.
• I'd say both are declarative here.
• From my experience, and within the context of the talk, the coupling Hamilton provides is an advantage for pipeline readability and maintainability.

## Pipeline execution
• Hamilton executes pipelines by requesting (declaring) the assets you want computed (nodes are nouns). It doesn't compute unnecessary nodes.
• Kedro's primary execution pattern (via CLI or programmatically) is to run the pipeline in full.
  ◦ This is typical of imperative frameworks, and it's the pattern heavily promoted in the Kedro resources online.
  ◦ The data catalog is declarative, but it's incidental to execution. You're still running the full pipeline.
• Kedro's `from_nodes` and `to_nodes` provide a declarative API and match `overrides` and the requested nodes in Hamilton (see the sketch below).

@Deepyaman Datta Let me know if there's any false advertising!
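A short sketch of the execution-side comparison (continuing the hypothetical `features` module from earlier; `Driver.execute` with `inputs`/`overrides` is standard Hamilton, and the Kedro options noted in the trailing comment are the documented `kedro run` flags):

```python
# Sketch: Hamilton execution declares the outputs you want; `overrides`
# injects a precomputed value so the `signups` node is never executed.
# `features` is the hypothetical module sketched earlier in the thread.
import pandas as pd
from hamilton import driver

import features

dr = driver.Builder().with_modules(features).build()
raw_df = pd.DataFrame({"spend": [10.0, 20.0], "signups": [2, 4]})

result = dr.execute(
    ["spend_per_signup"],                      # requested nodes (nouns)
    inputs={"raw_df": raw_df},
    overrides={"signups": pd.Series([5, 5])},  # short-circuits the `signups` node
)

# Rough Kedro counterparts are pipeline slicing, e.g.
#   kedro run --from-nodes=<node> --to-nodes=<node>
# or the matching from_nodes=/to_nodes= arguments to session.run().
```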
n
@Thierry Jean isn't `to_output` equivalent to what you call running by asset? It resolves the necessary nodes to compute, if I understand correctly: https://docs.kedro.org/en/stable/development/commands_reference.html#run-the-project
t
@Nok Lam Chan as per the end of my previous message, as far as I understand, yes. Happy to answer more questions about Hamilton on our Slack channel: https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg
👍🏼 1