# questions
g
Hello! I have a question w.r.t. the catalog: what is the simplest way to know whether a dataset is defined in the "base" env or has been overridden by the selected env's catalog? Is this info stored anywhere? Thanks! :)
h
Someone will reply to you shortly. In the meantime, this might help:
d
Off the top of my head, this isn't stored anywhere. The config loader just merges the hierarchical config using `OmegaConf`; no information about the source is kept.
What you could do is attach some `metadata` to all entries in each file, basically stating the source? Then you could access that field. But there's no built-in way I know of.
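A minimal sketch of that idea, using plain dicts to stand in for catalog entries (the entry names and the `source_env` metadata key are illustrative, not a Kedro convention):

```python
# Tag each catalog entry with the env it came from, then look the
# tag up after the configs are merged.

base_catalog = {
    "companies": {"type": "pandas.CSVDataset", "filepath": "data/companies.csv",
                  "metadata": {"source_env": "base"}},
    "model": {"type": "pickle.PickleDataset", "filepath": "data/model.pkl",
              "metadata": {"source_env": "base"}},
}
prod_catalog = {
    "companies": {"type": "pandas.CSVDataset", "filepath": "s3://bucket/companies.csv",
                  "metadata": {"source_env": "prod"}},
}

# Destructive merge: an env entry replaces the base entry wholesale,
# so the tag survives the merge.
merged = {**base_catalog, **prod_catalog}

def source_of(name: str) -> str:
    """Return the env a dataset's effective definition came from."""
    return merged[name]["metadata"]["source_env"]

print(source_of("companies"))  # prod
print(source_of("model"))      # base
```

The catch is that you have to keep the tags in sync by hand in every YAML file, which is why doing it automatically at load time (below) is more attractive.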
In case @Merel knows something, since she drove the work on the new config loader.
g
Thanks @Deepyaman Datta! That's a good idea. Do you think it's possible to have this done automatically at load time? I don't see how it would be possible with a hook.
One alternative I can think of is to load the catalog and then call `.to_config()` for both "base" and the selected env, and infer where datasets come from based on the differences.
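That comparison can be sketched without Kedro at all, treating the two catalog configs as plain dicts (the helper name and the simplifying assumption, that appearing in the env config counts as an override, are mine):

```python
def infer_sources(base_cfg: dict, env_cfg: dict) -> dict:
    """Map each dataset name to where its effective definition comes from.

    Assumption: a dataset counts as overridden as soon as it appears in
    the env config, whether or not the definition actually differs.
    """
    sources = {}
    for name in {**base_cfg, **env_cfg}:
        sources[name] = "env" if name in env_cfg else "base"
    return sources

base_cfg = {"companies": {"filepath": "data/companies.csv"},
            "model": {"filepath": "data/model.pkl"}}
env_cfg = {"companies": {"filepath": "s3://bucket/companies.csv"}}

print(infer_sources(base_cfg, env_cfg))
# {'companies': 'env', 'model': 'base'}
```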
m
Don't have a lot of time to dig right now, but when using destructive merge (the default) and with debug logging on, you'll see this in the log messages:

```
"Config from path '%s' will override the following "
                "existing top-level config keys: %s"
```
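To surface that message you need DEBUG-level logging for Kedro's config machinery; a minimal sketch (the `kedro.config` logger name is an assumption based on the module path, not something confirmed in this thread):

```python
import logging

# Make DEBUG records visible at all...
logging.basicConfig(level=logging.DEBUG)

# ...and enable them specifically for Kedro's config loaders, so the
# "will override the following existing top-level config keys" message
# (a %-style format string in the source) shows up when envs merge.
logging.getLogger("kedro.config").setLevel(logging.DEBUG)
```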
d
I don't think it's possible with a hook, unfortunately. However, you can create your own config loader (extending the base `OmegaConfigLoader`), and you could extend it very slightly by defining your own merge strategy, or accepting one in the config loader's constructor. For example, the destructive merge strategy @Merel mentioned: https://github.com/kedro-org/kedro/blob/0.19.11/kedro/config/omegaconf_config.py#L529-L544 Here, you could insert a key into each item in the dict to be merged, using the `env_path`? That should work; you'd probably need to play around with it a bit.
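A rough sketch of that merge idea, stripped of Kedro internals: a destructive-style dict merge that stamps each overriding entry with the env directory it was loaded from (the function name and the `_source_env` key are illustrative, not Kedro fields):

```python
from pathlib import Path

def merge_with_source(base: dict, overrides: dict, env_path: str) -> dict:
    """Destructive merge that stamps each overriding entry with its source env.

    Mimics the idea behind OmegaConfigLoader's destructive merge strategy;
    an entry from `overrides` replaces the base entry wholesale.
    """
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict):
            # Tag the entry with the last path component, e.g. "conf/prod" -> "prod".
            value = {**value, "_source_env": Path(env_path).name}
        merged[key] = value
    return merged

base = {"model": {"type": "pickle.PickleDataset"}}
prod = {"model": {"type": "pickle.PickleDataset", "filepath": "s3://bucket/m.pkl"}}
result = merge_with_source(base, prod, "conf/prod")
print(result["model"]["_source_env"])  # prod
```

In a real subclass you'd apply the same stamping inside the overridden merge method, where the `env_path` is already in scope.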
e
Indeed, you can't access `env` through hooks.
g
Thank you everyone! I think your solution should work nicely for me @Deepyaman Datta πŸ™‚
Hi again! Is there any way to disable config merging altogether? What worries me in practice is that, when loading a dataset, we don't know which env it is defined in. For example, let's assume I have been developing a model and created the pipelines in my "dev" env. The model itself is defined as a dataset in my base env. When I move the pipelines to my prod environment, I forget to move my model dataset to my prod env, and Kedro keeps using the base env to find it. A couple of months later, I work on improving the model, and the result of my trial and error is moved to prod by mistake; I don't even notice, because I don't know where my datasets are loaded from. In practice, in this specific case I'd use MLflow to manage models, so this would be very unlikely, but it could still happen with, say, intermediate data.
d
I don't think there's an out-of-the-box option (unless some configurable merge strategy does that?), but you could also always define a custom config loader. This is probably also a question where you might get better answers in #C03RKP2LW64; @Merel and others spent a lot of time on config loaders. (Oops, this was in #C03RKP2LW64 already, haven't had enough coffee yet. β˜•)
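One way a custom loader could address the "silent fallback to base" worry: instead of disabling merging, fail fast when any dataset is defined only in base. A sketch under my own assumptions (the function name is hypothetical, and whether you'd want this for every env is a project decision):

```python
def assert_fully_overridden(base_cfg: dict, env_cfg: dict, env: str) -> None:
    """Raise if any dataset would silently fall back to its base definition."""
    missing = sorted(set(base_cfg) - set(env_cfg))
    if missing:
        raise ValueError(
            f"Datasets defined in 'base' but not redefined in '{env}': {missing}"
        )

base_cfg = {"companies": {}, "model": {}}
prod_cfg = {"companies": {}}
try:
    assert_fully_overridden(base_cfg, prod_cfg, "prod")
except ValueError as err:
    print(err)  # Datasets defined in 'base' but not redefined in 'prod': ['model']
```

Hooked into a custom config loader's merge step, this would have caught the forgotten model dataset at load time instead of months later.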
πŸ‘ 1
πŸŽ‰ 1
g
@Merel I would still like to have different envs, but I would like all datasets to be redefined in all of them. I am afraid that, as I can't check which env a specific dataset is defined in, I may have a dev pipeline modifying critical prod assets.
m
You still get to decide which environment overrides which: https://docs.kedro.org/en/stable/configuration/configuration_basics.html#how-to-change-the-default-overriding-environment. But of course that doesn't solve the case where a dataset is missing from the overriding env.
I'd have to dive into this a bit deeper to come up with a solution.
g
Thank you @Merel πŸ™‚