# plugins-integrations
b
Hey everyone, a while ago I took part in a Kedro user group and brought up the idea of sharing a Data Catalog across multiple projects and team members. Other people on the call seemed interested in the concept as well, and I finally got around to developing it. I'd love for you all to check out the repo and give me any feedback or recommendations!
🎉 2
👍 2
👀 2
👏🏼 1
d
This is cool!
🥳 2
j
amazing @Brandon Meek, thanks a lot for sharing! I recall those conversations but I don't have the links at hand, could you share some pointers to make a cross reference there?
d
Very interesting, in concept! I'll take a closer look later. Immediate thought:
cars:
  type: universal_catalog.UniversalCatalogDataset
  source_name: cars
  url: http://localhost:5000/
I'd rather not have to redefine each catalog entry as a `UniversalCatalogDataset` (maybe there are cases where I have local overrides, but not for every dataset); is it possible to just replace the catalog object with a remote catalog?
b
@Deepyaman Datta I think that's a really cool idea too, I'll look into it. @Juan Luis sorry, I'm not sure I know what you mean by a cross reference.
j
sorry, my question was very obtuse 😅 do you remember in which issue or discussion the idea of a "universal catalog" was mentioned?
to @Deepyaman Datta’s point: indeed, I guess the key would be to be able to do
# settings.py

from kedro_universal_catalog import UniversalCatalog

DATA_CATALOG_CLASS = UniversalCatalog
and most of the rest should be the same, right?
👍 2
d
Yeah, though I don't know if it's a problem of how you still enable local entries/overrides. Depends on user needs. :)
d
or whether this folds into the wider set of requirements for a 1st-party API plugin
b
@Juan Luis unfortunately I don't. I was still at my last employer when I had the meeting, and I lost my calendar and Slack when I was let go. I think it was earlier this year, maybe in January?
@Deepyaman Datta I think the simplest implementation would be for me to create another dataset, maybe just `UniversalCatalog`, that pulls the entire catalog, and then if any of the values need to be overwritten you could use the OmegaConfigLoader merge strategies
👍 1
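To make the override idea above concrete, here is a minimal plain-Python sketch of the two ways a local entry could be combined with a shared one. It only illustrates the general "destructive" versus "soft" merge behaviour and is not Kedro's or the plugin's implementation; the function names, the `s3://shared/cars.csv` path, and the local path are assumptions made up for the example.

```python
from copy import deepcopy


def destructive_merge(remote: dict, local: dict) -> dict:
    """A local entry replaces the remote entry of the same name wholesale."""
    merged = deepcopy(remote)
    merged.update(deepcopy(local))
    return merged


def soft_merge(remote: dict, local: dict) -> dict:
    """Merge recursively: local values win, untouched remote keys survive."""
    merged = deepcopy(remote)
    for key, value in local.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = soft_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical shared entry and a local override that only changes the path.
remote = {"cars": {"type": "pandas.CSVDataset", "filepath": "s3://shared/cars.csv"}}
local = {"cars": {"filepath": "data/01_raw/cars.csv"}}

print(destructive_merge(remote, local))
print(soft_merge(remote, local))
```

With the destructive variant the local `cars` entry loses its `type`, so a partial override would only work with the soft variant.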
@Deepyaman Datta check out the latest merge, I used @Juan Luis's recommendation to make the change in `settings.py`, but it's a subclass that returns a `DataCatalog` that merges your project's `DataCatalog` with the remote catalog
🔥 3
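The merge described above could be sketched roughly as follows. This is not the plugin's actual code: the JSON payload shape, the helper names `fetch_remote_catalog` and `build_catalog_config`, and the stand-in fetcher are all assumptions for illustration. The resulting dictionary has the same shape as a regular catalog config, so it could presumably be handed to something like `DataCatalog.from_config`.

```python
import json
from typing import Callable
from urllib.request import urlopen


def fetch_remote_catalog(url: str) -> dict:
    # Hypothetical endpoint: assumed to return a JSON object of catalog entries.
    with urlopen(url) as response:
        return json.load(response)


def build_catalog_config(
    url: str,
    local_entries: dict,
    fetch: Callable[[str], dict] = fetch_remote_catalog,
) -> dict:
    """Combine shared entries with the project's own; project entries win."""
    remote_entries = fetch(url)
    # Later keys take precedence, so local definitions override remote ones.
    return {**remote_entries, **local_entries}


# Stand-in fetcher so the example runs without a catalog server.
def fake_fetch(url: str) -> dict:
    return {
        "cars": {"type": "pandas.CSVDataset", "filepath": "s3://shared/cars.csv"},
        "trucks": {"type": "pandas.CSVDataset", "filepath": "s3://shared/trucks.csv"},
    }


local_entries = {"cars": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/cars.csv"}}
config = build_catalog_config("http://localhost:5000/", local_entries, fetch=fake_fetch)
print(sorted(config))
```

Passing the fetcher in as a parameter keeps the merge logic testable without a running server.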
d
Very cool! I forget who was talking about it (was it you, @Juan Luis?), but there's the challenge of how you package conf for deployment in things like Airflow. Maybe you shouldn't be packaging conf, and you should centrally host it with something like this. 🙂 I think it's much more akin to some existing patterns, at least thinking of something like Kafka schema registry or Iceberg catalog.
j
> but there's the challenge of how do you package conf for deployment in things like Airflow. Maybe you shouldn't be packaging conf, and you should centrally host your conf like something like this.
I've been thinking about this since @Ankita Katiyar and @Dmitry Sorokin started working on `kedro-airflow`, but then I tried to look for prior art in the MLOps world and haven't really seen any "centralized config store". If anything, there are secrets managers, experiment tracking systems for ML, and the rest is roughly what @Yolan Honoré-Rougé called "external configuration" in his first "universal Kedro deployments" issue. Looking at Google's seminal paper "Hidden Technical Debt in Machine Learning Systems", "Configuration" there merely refers to ML systems: features, data, hyperparameters. And yet, configuration in Kedro is much broader.
d
At least for the above examples (Iceberg and Kafka Schema Registry), REST catalog is very commonly used. Not sure about the broader landscape, as these are what I worked with (for other reasons) more recently.
d
This configuration store is a concept we’ve introduced in my internal project, and it could actually be something the OSS project should steal
…also maybe we should build one of these against this https://github.com/unitycatalog/unitycatalog 👀
j
I was about to open a discussion about just that
d
Also, while this conversation quickly expands in scope, just want to clarify -- @Brandon Meek this shouldn't block the great work you're doing! Even if Kedro (eventually) does provide this functionality out of the box, it can leverage your work heavily (or simply include it). Having a battle-tested plugin people are already using is a huge advantage. There's also nothing fundamentally wrong with some things continuing to be well-maintained, independent plugins.
👍 2
d
this is a really exciting avenue and what you’ve created @Brandon Meek is a great first step
👍 1