# random
n

https://youtu.be/RnXsJzo7Ww8?si=hHjyNDKtWMSMY9vb

Modular machine learning pipeline 😁
🥳 2
🫥 1
🔥 2
d
Not sure this is accurate...
😅 1
I'm no LLM expert, but what does Hamilton do to power LLM applications? There's no framework code that does the same... https://github.com/search?q=repo%3ADAGWorks-Inc%2Fhamilton%20llm&type=code It's also weird that Hamilton is a library, dbt is a library (sure), but Kedro is maybe not a library (what criteria got it classified as less of a library than dbt?). And I'm not entirely sure where column-level lineage comes from. I believe they produce individual `pd.Series` as outputs of nodes by convention in many examples, so basically your node graph counts as column-level lineage? I do like bits and pieces of positioning around Hamilton, but I'm definitely put off by false advertising. 😂
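For readers less familiar with the convention being referenced, here is a minimal sketch of that pattern, loosely following Hamilton's documented hello-world examples (the function and column names are illustrative, not from the talk):

```python
# Sketch of a Hamilton-style module: each function is a node, inputs are
# declared via parameter names, and returning one pd.Series per node makes
# the function-level DAG double as column-level lineage. Names are illustrative.
import pandas as pd


def spend(raw_df: pd.DataFrame) -> pd.Series:
    """Column node: pulls the `spend` column out of the input dataframe."""
    return raw_df["spend"]


def signups(raw_df: pd.DataFrame) -> pd.Series:
    """Column node: pulls the `signups` column out of the input dataframe."""
    return raw_df["signups"]


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Derived column: depends on `spend` and `signups` by parameter name."""
    return spend / signups
```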
t
@Deepyaman Datta Hey, author here! Happy to chime in! To address the comments one by one:

## "Library vs. system"
The exact notion is up for debate, but the idea was:
- What do users need to opt into?
- Once they opt in, how portable is their code?

I think we can agree that Dagster and Airflow are systems that define "how you'll do your work". Once you write dbt or Hamilton code, yes, you are committing to a framework, but you can run dbt or Hamilton anywhere you want. I used 🚸 because my experience is that `kedro run` is largely promoted as the principal way to execute code, which limits its portability.

## LLMs
- We have multiple examples on GitHub and ready-to-use dataflows for vector search, RAG, text summarization, etc.
- The key difference here is how the Hamilton `Driver` facilitates in-memory operations vs. the Kedro `Runner` classes. As far as I understand, you wouldn't run Kedro within a FastAPI application, but it's a common pattern for Hamilton (see the online feature engineering example, and the sketch below).

## Column-level operations
- It has been at the core of Hamilton since its inception.
- We have many utilities to create column-level nodes from dataframe-level nodes, allowing for very granular lineage, data validation, and schema checking.
- Our `Driver` can resolve these column-level nodes into a dataframe seamlessly for users running Pandas, Polars (regular and lazy), Dask, PySpark, or Vaex.
- It's a beloved feature and a big reason many companies adopt Hamilton for feature engineering and powering their ML pipelines.

IMO, the "false advertising" claim is stronger than any of these points. Happy to discuss and rectify anything I misunderstood about Kedro.
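To make the in-memory / FastAPI point concrete, a minimal sketch (the `features` module is the hypothetical one sketched earlier in the thread, and the route and payload shape are made up for illustration; `driver.Builder` and `Driver.execute` are standard Hamilton calls):

```python
# Minimal sketch of running a Hamilton Driver in-process inside a FastAPI
# service. `features` is the hypothetical module from the earlier sketch;
# the endpoint and payload are assumptions for illustration.
import pandas as pd
from fastapi import FastAPI
from hamilton import driver

import features  # hypothetical module defining spend / signups / spend_per_signup

app = FastAPI()
dr = driver.Builder().with_modules(features).build()


@app.post("/score")
def score(payload: dict) -> dict:
    raw_df = pd.DataFrame([payload])
    # Request only the node(s) needed; Hamilton resolves and runs upstream nodes.
    result = dr.execute(["spend_per_signup"], inputs={"raw_df": raw_df})
    return {"spend_per_signup": float(result["spend_per_signup"].iloc[0])}
```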
n
t
@Nok Lam Chan great-looking project! I did do my due diligence, but it wasn't exactly easy to find in the docs.
w
Even without `kedro-boot` you can pretty much run a Kedro session anywhere, including a FastAPI service. I've been running Kedro sessions inside a Slack bot handler for a while now, and invoking pipeline steps from external code by distributing the project as a PyPI package. That's very portable if you ask me.
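For concreteness, a rough sketch of what that pattern can look like (the project path, pipeline name, and handler are assumptions; `bootstrap_project`, `KedroSession.create`, and `session.run` are the standard Kedro entry points):

```python
# Rough sketch of invoking a Kedro pipeline from external code (e.g. a bot
# handler or web service). The project path and pipeline name are assumptions.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = Path("/opt/my-kedro-project")  # hypothetical installed project


def handle_event(event: dict) -> dict:
    """E.g. a Slack/HTTP handler that runs one named pipeline per request."""
    bootstrap_project(PROJECT_PATH)
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        return session.run(pipeline_name="inference")  # hypothetical pipeline name
```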
m
Hi Thierry! (we exchanged a bit after this talk 🙂) On my side, the only surprising thing in this slide was the use of "declarative vs imperative." Maybe I'm mistaken, but I see Kedro as mostly a declarative framework (meaning here that the graph declaration is distinct and decoupled from the graph execution), whereas on your slides, "declarative" vs. "imperative" seem to be used to mean implicit vs. explicit node input and output mapping (where Hamilton infers the graph from the names of function parameters, while Kedro requires explicit mapping to named datasets and parameters). Otherwise, great presentation! In my job, I mostly use Kedro (the concept of the catalog and "data-centric" graph definition is a lifesaver when trying to transfer moderately complex graphs, mixing data sources, to other teams). But I also like the simplicity, the API design, and the UI of Hamilton. Excited to see how both frameworks continue to evolve!
👍 2
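For readers following along, the mapping distinction looks roughly like this (the dataset names and function are illustrative; `node` and `Pipeline` are the standard Kedro constructs, in contrast to the name-based wiring in the Hamilton sketch earlier in the thread):

```python
# Sketch of the explicit wiring Kedro uses: a node's inputs/outputs are mapped
# to named catalog datasets rather than inferred from parameter names.
# Dataset and function names are illustrative only.
from kedro.pipeline import Pipeline, node


def compute_spend_per_signup(spend, signups):
    return spend / signups


pipeline = Pipeline(
    [
        node(
            func=compute_spend_per_signup,
            inputs=["spend", "signups"],   # explicit input dataset names
            outputs="spend_per_signup",    # explicit output dataset name
        ),
    ]
)
```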
t
Hi @Martin S! TL;DR: individual features or APIs can be described as declarative or imperative, but labeling the tool as a whole is up for debate. For example, Python itself is imperative. To ground the discussion, I wrote about Airflow, Dagster, and Hamilton, where Airflow is fully imperative, Dagster allows both, and Hamilton is declarative. Here are some notes by "feature":

## Pipeline definition
• Kedro decouples: transform logic (the function), node definition, pipeline definition, and data catalog (how to materialize the asset).
• Hamilton couples transform logic, node definition, and pipeline definition. We use materializers, which you can couple or not with your node definition.
• Both tools have an automatic DAG resolver from node definitions.
• I'd say both are declarative here.
• From my experience, and within the context of the talk, the coupling Hamilton provides is an advantage for pipeline readability and maintainability.

## Pipeline execution
• Hamilton executes pipelines by requesting (declaring) the assets you want computed (nodes are nouns). It doesn't compute unnecessary nodes.
• Kedro's primary execution pattern (via CLI or programmatically) is to run the pipeline in full.
  ◦ This is typical of imperative frameworks, and it's the pattern heavily promoted in the Kedro resources online.
  ◦ The data catalog is declarative, but it's incidental to execution. You're still running the full pipeline.
• Kedro's `from_nodes` and `to_nodes` provide a declarative API and match `overrides` and the requested nodes in Hamilton (see the sketch below).

@Deepyaman Datta Let me know if there's any false advertising!
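A short sketch of the execution-side comparison (continuing the hypothetical `features` module from earlier; `Driver.execute` with `inputs`/`overrides` is standard Hamilton, and the Kedro options noted in the trailing comment are the documented `kedro run` flags):

```python
# Sketch: Hamilton execution declares the outputs you want; `overrides`
# injects a precomputed value so the `signups` node is never executed.
# `features` is the hypothetical module sketched earlier in the thread.
import pandas as pd
from hamilton import driver

import features

dr = driver.Builder().with_modules(features).build()
raw_df = pd.DataFrame({"spend": [10.0, 20.0], "signups": [2, 4]})

result = dr.execute(
    ["spend_per_signup"],                      # requested nodes (nouns)
    inputs={"raw_df": raw_df},
    overrides={"signups": pd.Series([5, 5])},  # short-circuits the `signups` node
)

# Rough Kedro counterparts are pipeline slicing, e.g.
#   kedro run --from-nodes=<node> --to-nodes=<node>
# or the matching from_nodes=/to_nodes= arguments to session.run().
```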
n
@Thierry Jean isn't `to_output` equivalent to what you call running by asset? It resolves the necessary nodes to compute, if I understand correctly: https://docs.kedro.org/en/stable/development/commands_reference.html#run-the-project
t
@Nok Lam Chan as per the end of my previous message, as far as I understand, yes. Happy to answer more questions about Hamilton on our Slack channel: https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg
👍🏼 1