# random
l
I want to share a truly random experience with everyone: I'm facing the problems that Kedro solves, which is 5k lines of ETL/DS Python in a single file that needs to be broken down. It's hard to adopt Kedro because of the perceived complexity and boilerplate, which requires wholesale buy-in (I know we can adopt library components individually, but it's still not straightforward), so I was trying to simplify the framework's interface. Got stuck, asked ChatGPT, and 100 exchanges later we went full circle back to the current API as the optimal one. This is ChatGPT's opinion of the API:
โค๏ธ 3
๐Ÿš€ 1
๐Ÿ˜„ 3
n
Super interesting, I also played with ChatGPT on API design a while ago and got similar results. What solution did you end up going with? We are trying to remove some of the boilerplate to make it as lean as possible (or at least easier to work with if you have your own template already). I'm curious what issues you bumped into when you tried to use Kedro library components only.
d
@Lim H. Is it not blasphemy using ChatGPT for this specific problem? 👀 🤣
j
this is both hilarious and reassuring, thanks a lot for sharing @Lim H. 😄 still, any way you think we can lower the perceived complexity or make adopting individual library components easier? we're actively working on these areas, so even if ChatGPT blessed our API ✨, we can keep that design while flattening the learning curve
l
@datajoely blasphemy to which God? 😄 @Juan Luis @Nok Lam Chan my main problem is I don't know where to start dropping Kedro into an existing project of reasonable complexity without deep refactoring, and this is an issue. Say I want to adopt the catalog. To do it quickly I would use the Python API. But then the catalog definition is in the code as a dictionary with ~50 datasets, and that's the same problem as before. After that, to be idiomatic, I will have to convert it back to YAML at some point. Also, where can I start with parameters refactoring?
A typed catalog would really help convince people because then at least if it's in the code, it's typed and much easier to manage
๐Ÿ‘ 1
๐Ÿ‘๐Ÿผ 1
But then again, this goes back to the question of whether the catalog is code or configuration. I can see the arguments for both sides. Another thing that I think might help is making pipeline creation a little easier. Right now there are 2 steps:
• Turn a pure function into a node
• Add the node to a list to form a pipeline
The problem is we already have a nodes & pipelines abstraction in place, and it's a bit hard to convince people that this is better since it doesn't look easier.
๐Ÿ‘๐Ÿผ 1
d
One of the reflections from one of the early devs was that if dataclasses had existed when they made it, we would have gone down that route. I wonder if we should do a spike of what a pydantic-style catalog looks like? I'm currently on a pydantic / pandera spree for everything so it's very front of mind
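(Not existing Kedro API, just a hedged sketch of what a pydantic-style catalog could look like; CSVEntry and Catalog are hypothetical names.)
```python
from pydantic import BaseModel

class CSVEntry(BaseModel):
    # hypothetical typed catalog entry, validated by pydantic at construction time
    filepath: str
    sep: str = ","

class Catalog(BaseModel):
    # each attribute is a dataset, so typos and missing fields fail early
    companies: CSVEntry
    reviews: CSVEntry

catalog = Catalog(
    companies=CSVEntry(filepath="data/01_raw/companies.csv"),
    reviews=CSVEntry(filepath="data/01_raw/reviews.csv", sep=";"),
)
print(catalog.companies.filepath)
```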
l
One thing I hacked together yesterday was allowing people to define a pipeline.py with a bunch of
def func_a(a: DataCatalog.A) -> DataCatalog.B: ...
def func_b(b: DataCatalog.B) -> DataCatalog.C: ...
and it automatically works out this is a pipeline of
A -> func_a -> B -> func_b
the main problem is it loses the pure-function nature of the node, but I guess you can just do that under the hood
This is fairly trivial to do, but my track record of hacking Kedro and letting it gather dust is abysmal, so I'm hesitant to do anything important with it
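(A hedged sketch of the annotation-driven idea in plain Python, with no Kedro dependency; A/B/C stand in for the hypothetical DataCatalog.A/B/C markers, and the "pipeline" here is just an ordered list of (inputs, func, output) triples.)
```python
from typing import get_type_hints

class A: ...
class B: ...
class C: ...

def func_a(a: A) -> B: ...
def func_b(b: B) -> C: ...

def infer_nodes(*funcs):
    """Derive each function's inputs and output from its type annotations."""
    nodes = []
    for func in funcs:
        hints = get_type_hints(func)
        output = hints.pop("return")   # assumes every function is fully annotated
        inputs = list(hints.values())
        nodes.append((inputs, func, output))
    return nodes

# -> [([A], func_a, B), ([B], func_b, C)], i.e. A -> func_a -> B -> func_b -> C
print(infer_nodes(func_a, func_b))
```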
d
Perhaps there is a model where you do a little prototype and then the team takes it over?
we're pushing to get 0.19.x out, and now that we've settled on the future of the configuration system I'm super super excited to start solving the bigger problems that were held back because of that uncertainty
l
I'm still not entirely convinced this is any better than the current API, hence the conversation with ChatGPT lol.
as in writing
node(func_a, inputs=[DataCatalog.A], outputs=[DataCatalog.B])
is fine too
another thing that makes me think maybe my use case is not entirely suited for Kedro is that logically it makes most sense for my application to have more than one catalog. A Kedro pipeline assumes one catalog per run, I think
anyway, all these are probably just fun, random musings 😄
j
I love the idea of being able to have a two-way conversion between YAML and Python classes (YAML and Python dictionary is already trivial) and it's something we could definitely explore
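(The dict half of the round trip with PyYAML, as a minimal sketch; the catalog entries are made up, and the typed-class layer on top is the part that would still need designing.)
```python
import yaml

catalog_yaml = """
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
"""

# YAML -> Python dict
catalog_dict = yaml.safe_load(catalog_yaml)
catalog_dict["reviews"] = {
    "type": "pandas.CSVDataset",
    "filepath": "data/01_raw/reviews.csv",
}

# Python dict -> YAML
print(yaml.safe_dump(catalog_dict, sort_keys=False))
```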
d
Two-way conversion of YAML and JSON serialisation of Pipeline definitions are my current wishlist 🙏
@Lim H. if you had any thoughts on this I would really appreciate it https://kedro-org.slack.com/archives/C03R8N2M8KT/p1695909377585899
j
I'd like to note one thing:
> Say I want to adopt the catalog. To do it quickly I would use the Python API. But then the catalog definition is in the code as a dictionary with ~50 datasets and that's the same problem as before.
>
> After that, to be idiomatic, I will then have to convert it back to YAML at some point.
what would you think about writing the catalog in YAML directly and loading it in Python using
catalog = DataCatalog.from_config(OmegaConfigLoader.load_file("catalog.yml"))
without having to use the Kedro project template at all @Lim H.? (context and discussion: https://github.com/kedro-org/kedro/issues/2819)
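(Something close to this is already possible without the project template by swapping the config loader for plain PyYAML; a sketch that assumes kedro and kedro-datasets are installed, a catalog.yml exists, and it contains a "companies" entry.)
```python
import yaml
from kedro.io import DataCatalog

with open("catalog.yml") as f:
    catalog_config = yaml.safe_load(f)

# DataCatalog.from_config is existing Kedro API; only the loading step is swapped
catalog = DataCatalog.from_config(catalog_config)
df = catalog.load("companies")  # assumes a "companies" dataset is defined
```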
i
My ideas on designing some of those classes were to not be far from the moment where the ConfigLoader can be replaced with a simple dictionary object, the DataCatalog with a dictionary-like object, and then everything should work
E.g. all the operations are not too dissimilar:
catalog.load("dataset") -> catalog["dataset"]
catalog.save("dataset", data) -> catalog["dataset"] = data
Each dataset definition is nothing but a dataclass with three operations you can apply on it: load / save / exists.
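(A hedged sketch of that dictionary-like shape, not Kedro code: a hypothetical in-memory dataset exposing only load / save / exists, plus a catalog wrapper that maps item access onto them.)
```python
from dataclasses import dataclass
from typing import Any

@dataclass
class MemoryEntry:
    # hypothetical dataset: nothing but the three operations
    _data: Any = None

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data

    def exists(self) -> bool:
        return self._data is not None

class DictCatalog:
    # dictionary-like access that delegates to load/save on the entries
    def __init__(self) -> None:
        self._entries: dict[str, MemoryEntry] = {}

    def __getitem__(self, name: str) -> Any:
        return self._entries[name].load()  # catalog.load("dataset")

    def __setitem__(self, name: str, data: Any) -> None:
        self._entries.setdefault(name, MemoryEntry()).save(data)  # catalog.save("dataset", data)

catalog = DictCatalog()
catalog["dataset"] = {"rows": 3}
print(catalog["dataset"])  # {'rows': 3}
```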