# random
l
I want to share a truly random experience with everyone: I'm facing the problems that Kedro solves, which is 5k lines of ETL/DS Python in a single file that needs to be broken down. It's hard to adopt Kedro because of the perceived complexity and boilerplate, which requires wholesale buy-in (I know we can adopt library components individually, but it's still not straightforward), so I was trying to simplify the framework's interface. Got stuck, asked ChatGPT, and 100 exchanges later we went full circle back to the current API as the optimal one. This is ChatGPT's opinion of the API:
โค๏ธ 3
๐Ÿš€ 1
๐Ÿ˜„ 3
n
Super interesting, I also played with ChatGPT on API design a while ago and got similar results. What solution did you end up going with? We are trying to remove some of the boilerplate to make it as lean as possible (or at least easier to work with if you have your own template already). I'm curious what issues you bumped into when you tried to use Kedro library components only.
d
@Lim H. Is it not blasphemy using ChatGPT for this specific problem? 👀 🤣
j
this is both hilarious and reassuring, thanks a lot for sharing @Lim H. 😄 still, any way you think we can lower the perceived complexity or make adopting individual library components easier? we're actively working on these areas, so even if ChatGPT blessed our API ✨, we can keep that design while flattening the learning curve
l
@datajoely blasphemy to which God? 😄 @Juan Luis @Nok Lam Chan my main problem is I don't know where to start dropping Kedro into an existing project of reasonable complexity without deep refactoring, and this is an issue. Say I want to adopt the catalog. To do it quickly I would use the Python API. But then the catalog definition is in the code as a dictionary with ~50 datasets, and that's the same problem as before. After that, to be idiomatic, I will have to convert it back to YAML at some point. Also, where can I start with parameters refactoring?
A typed catalog would really help convince people because then at least if it's in the code, it's typed and much easier to manage
๐Ÿ‘ 1
๐Ÿ‘๐Ÿผ 1
But then again, this goes back to the question of whether the catalog is code or configuration. I can see the arguments for both sides. Another thing that I think might help is making pipeline creation a little easier. Right now there are 2 steps:
• Turn a pure function into a node
• Add the node to a list to form a pipeline
The problem is we already have a nodes & pipelines abstraction in place, and it's a bit hard to convince people that this is better since it doesn't look easier.
๐Ÿ‘๐Ÿผ 1
d
One of the reflections from one of the early devs was that if dataclasses had existed when they made it, we would have gone down that route. I wonder if we should do a spike of what a pydantic-style catalog looks like? I'm currently on a pydantic / pandera spree for everything so it's very front of mind
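(Not existing Kedro API, just a hedged sketch of what a pydantic-style catalog could look like; CSVEntry and Catalog are hypothetical names.)
```python
from pydantic import BaseModel

class CSVEntry(BaseModel):
    # hypothetical typed catalog entry, validated by pydantic at construction time
    filepath: str
    sep: str = ","

class Catalog(BaseModel):
    # each attribute is a dataset, so typos and missing fields fail early
    companies: CSVEntry
    reviews: CSVEntry

catalog = Catalog(
    companies=CSVEntry(filepath="data/01_raw/companies.csv"),
    reviews=CSVEntry(filepath="data/01_raw/reviews.csv", sep=";"),
)
print(catalog.companies.filepath)
```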
l
One thing I hacked together yesterday was allowing people to define a pipeline.py with a bunch of
def func_a(a: DataCatalog.A) -> DataCatalog.B: ...
def func_b(b: DataCatalog.B) -> DataCatalog.C: ...
and it automatically works out this is a pipeline of
A -> func_a -> B -> func_b
the main problem is it loses the pure-function nature of the node, but I guess you can just do that under the hood
This is fairly trivial to do, but my track record of hacking Kedro and letting it gather dust is abysmal, so I'm hesitant to do anything important with it
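(A hedged sketch of the annotation-driven idea in plain Python, with no Kedro dependency; A/B/C stand in for the hypothetical DataCatalog.A/B/C markers, and the "pipeline" here is just an ordered list of (inputs, func, output) triples.)
```python
from typing import get_type_hints

class A: ...
class B: ...
class C: ...

def func_a(a: A) -> B: ...
def func_b(b: B) -> C: ...

def infer_nodes(*funcs):
    """Derive each function's inputs and output from its type annotations."""
    nodes = []
    for func in funcs:
        hints = get_type_hints(func)
        output = hints.pop("return")   # assumes every function is fully annotated
        inputs = list(hints.values())
        nodes.append((inputs, func, output))
    return nodes

# -> [([A], func_a, B), ([B], func_b, C)], i.e. A -> func_a -> B -> func_b -> C
print(infer_nodes(func_a, func_b))
```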
d
Perhaps there is a model where you do a little prototype and then the team takes it over?
we're pushing to get 0.19.x out, and now that we've settled on the future of the configuration system I'm super super excited to start solving the bigger problems that were held back because of that uncertainty
l
I'm still not entirely convinced this is any better than the current API, hence the conversation with ChatGPT lol.
as in writing
node(func_a, inputs=[DataCatalog.A], outputs=[DataCatalog.B])
is fine too
another thing that makes me think maybe my use case is not entirely suited for Kedro is that logically it makes most sense for my application to have more than one catalog. A Kedro pipeline assumes one catalog per run, I think
anyway, all these are probably just fun, random musings 😄
j
I love the idea of being able to have a two-way conversion between YAML and Python classes (YAML and Python dictionary is already trivial) and it's something we could definitely explore
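(The dict half of the round trip with PyYAML, as a minimal sketch; the catalog entries are made up, and the typed-class layer on top is the part that would still need designing.)
```python
import yaml

catalog_yaml = """
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
"""

# YAML -> Python dict
catalog_dict = yaml.safe_load(catalog_yaml)
catalog_dict["reviews"] = {
    "type": "pandas.CSVDataset",
    "filepath": "data/01_raw/reviews.csv",
}

# Python dict -> YAML
print(yaml.safe_dump(catalog_dict, sort_keys=False))
```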
d
Two-way conversion of YAML and JSON serialisation of Pipeline definitions are my current wishlist 🙏
@Lim H. if you had any thoughts on this I would really appreciate it https://kedro-org.slack.com/archives/C03R8N2M8KT/p1695909377585899
j
I'd like to note one thing:
> Say I want to adopt the catalog. To do it quickly I would use the Python API. But then the catalog definition is in the code as a dictionary with ~50 datasets and that's the same problem as before.
>
> After that, to be idiomatic, I will then have to convert it back to YAML at some point.
what would you think about writing the catalog in YAML directly and loading it in Python using
catalog = DataCatalog.from_config(OmegaConfigLoader.load_file("catalog.yml"))
without having to use the Kedro project template at all @Lim H.? (context and discussion: https://github.com/kedro-org/kedro/issues/2819)
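(Something close to this is already possible without the project template by swapping the config loader for plain PyYAML; a sketch that assumes kedro and kedro-datasets are installed, a catalog.yml exists, and it contains a "companies" entry.)
```python
import yaml
from kedro.io import DataCatalog

with open("catalog.yml") as f:
    catalog_config = yaml.safe_load(f)

# DataCatalog.from_config is existing Kedro API; only the loading step is swapped
catalog = DataCatalog.from_config(catalog_config)
df = catalog.load("companies")  # assumes a "companies" dataset is defined
```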
i
My ideas on designing some of those classes were to not be far from the moment where the ConfigLoader can be replaced with a simple dictionary object, the DataCatalog with a dictionary-like object, and then everything should work
E.g. all the operations are not too dissimilar:
catalog.load("dataset") -> catalog["dataset"]
catalog.save("dataset", data) -> catalog["dataset"] = data
Each dataset definition is nothing but a dataclass with three operations you can apply on it: load / save / exists.
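(A hedged sketch of that dictionary-like shape, not Kedro code: a hypothetical in-memory dataset exposing only load / save / exists, plus a catalog wrapper that maps item access onto them.)
```python
from dataclasses import dataclass
from typing import Any

@dataclass
class MemoryEntry:
    # hypothetical dataset: nothing but the three operations
    _data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data

    def exists(self) -> bool:
        return self._data is not None

class DictCatalog:
    # dictionary-like access that delegates to load/save on the entries
    def __init__(self) -> None:
        self._entries: dict[str, MemoryEntry] = {}

    def __getitem__(self, name: str) -> Any:
        return self._entries[name].load()  # catalog.load("dataset")

    def __setitem__(self, name: str, data: Any) -> None:
        self._entries.setdefault(name, MemoryEntry()).save(data)  # catalog.save("dataset", data)

catalog = DictCatalog()
catalog["dataset"] = {"rows": 3}
print(catalog["dataset"])  # {'rows': 3}
```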