[Omegaconf] Hi Team, I’d like to use Omegaconf tem...
# questions
m
[Omegaconf] Hi Team, I’d like to use Omegaconf templating for the project, but when I set
CONFIG_LOADER_CLASS = OmegaConfigLoader
in
settings.py
and run
kedro ipython
, I get a
ValueError: Duplicate keys found in
my
raw_layer.yml
and
.ipynb_checkpoints/raw_layer-checkpoint.yml
.
settings.py
from <my project>.hooks import SparkHooks
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader

CONFIG_LOADER_ARGS = {
      "config_patterns": {
          "spark": ["spark*", "spark*/**"],
          "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
          "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
          "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
          "logging": ["logging*", "logging*/**", "**/logging*"],
      }
}

HOOKS = (SparkHooks(),)
hooks.py
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in project's conf folder.
        """

        # Load the spark configuration in spark.yaml using the config loader
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
If I use the default
ConfigLoader
it works as expected. Why would
OmegaConf
read from ipynb checkpoints? How can I switch that off? Thank you!
n
Could you try removing the checkpoints folder for now?
I am also curious how this happened. Did you edit your config inside a Jupyter notebook or on Databricks?
m
When I remove the folder I get an error for the line
spark_conf = SparkConf().setAll(parameters.items())
in
hooks.py
that parameters is str and doesn’t have an items() method.
I’m working in Sagemaker notebook instance.
n
This is weird. So it doesn't work with or without the checkpoints folder?
m
Correct
I switched to TemplatedConfigLoader, it’s all good
n
Could you run a tree command inside the folder? Are there any other hidden files, maybe?
And add a quick print statement in the hook to see what parameters is.
I can work on a fix, but I just want to make sure it's only because of the ipynb checkpoints.
e
@Mate Scharnitzky when using spark + Omega you need to use the hook like this:
parameters = context.config_loader.get("spark")
spark_conf = SparkConf().setAll(parameters.items())
m
Thanks @Erwin, let me check it tomorrow and get back to you.
n
@Erwin what is the change here? How was parameters accessed before?
OmegaConfigLoader should support both the get method and the dictionary interface.
e
(Pdb) context.config_loader.get("spark*", "spark*/**")
'spark*/**'

(Pdb) context.config_loader.get("spark")
{'spark.driver.maxResultSize': '3g', 'spark.hadoop.fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem', 'spark.sql.execution.arrow.pyspark.enabled': True, 'spark.scheduler.mode': 'FAIR'}
(Pdb)
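The first result above looks like ordinary dictionary semantics: on a dict-like object, get(key, default) treats the second argument as a fallback value, so a two-pattern call hands back the second pattern when the first is not a key. A minimal sketch, assuming the loader behaves like a mapping here (ToyLoader is a hypothetical stand-in, not Kedro's loader):

```python
from collections import UserDict


class ToyLoader(UserDict):
    """Hypothetical stand-in for a dict-like config loader (not Kedro code)."""


loader = ToyLoader({"spark": {"spark.scheduler.mode": "FAIR"}})

# Two positional arguments: the second one is the *default*, not a second pattern.
two_args = loader.get("spark*", "spark*/**")

# A single key that exists: the stored configuration is returned.
one_arg = loader.get("spark")

# Dictionary access returns the same configuration.
by_key = loader["spark"]

print(repr(two_args))
print(one_arg == by_key)
```

Under that assumption, calling .items() on the two-argument result fails because the result is a plain string.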
n
Thank you! I will run some tests tomorrow to confirm this.
d
I think we need to change the starter here
n
Thanks for reporting this! I can confirm this is a bug, and sorry for the awful experience. I think this should be tackled in two parts: 1. https://github.com/kedro-org/kedro/issues/2583 (fix the checkpoint handling) 2. Fix the starter. Cc @Merel
m
Thank you for investigating it!
I’ll try the workaround suggested by Juan.
n
May I ask which Kedro version you are using?
It's weird: I thought I could reproduce the example, but when I tried again I failed to reproduce it.
OK, I know what's wrong. I will have a fix for part 2. If you want to use
OmegaConfigLoader
you need to change the
settings.py
and
hooks.py
a little bit. See https://github.com/kedro-org/kedro-starters/pull/142/files
m
We’re using kedro
0.18.11
n
So in
hooks.py
, change the line to this:
parameters = context.config_loader["spark"]
and in settings.py, uncomment the config:
CONFIG_LOADER_ARGS = {
      "config_patterns": {
          "spark" : ["spark*/", "spark*/**"],
          "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
      }
}
m
Is there a particular reason why catalog, credentials and logging are not in the
config_patterns
? E.g.,
from kedro.config import OmegaConfigLoader
CONFIG_LOADER_ARGS = {
       "config_patterns": {
           "spark": ["spark*", "spark*/**"],
           "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
           "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
           "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
           "logging": ["logging*", "logging*/**", "**/logging*"],
       }
 }
Or it’s not meant to be exhaustive and it’s up to the user?
n
Yes, basically it works like a dictionary
.update
call.
The defaults are unchanged; you only need to specify anything extra, e.g.
spark
If you dive into the source code of
OmegaConfigLoader
, you will see:
self.config_patterns = {
            "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
            "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
            "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
            "logging": ["logging*", "logging*/**", "**/logging*"],
        }
        self.config_patterns.update(config_patterns or {})
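To make the merge behaviour concrete, here is a small standalone sketch of the same idea. The resolve_patterns helper and DEFAULT_PATTERNS name are hypothetical and only mirror the snippet above; they are not part of Kedro's API.

```python
# Defaults copied from the snippet above.
DEFAULT_PATTERNS = {
    "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
    "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
    "logging": ["logging*", "logging*/**", "**/logging*"],
}


def resolve_patterns(config_patterns=None):
    """Hypothetical helper: start from the defaults, then apply user overrides."""
    patterns = {key: list(value) for key, value in DEFAULT_PATTERNS.items()}
    patterns.update(config_patterns or {})
    return patterns


# Adding an extra key keeps every default in place.
with_spark = resolve_patterns({"spark": ["spark*", "spark*/**"]})

# Passing an existing key replaces that key's pattern list entirely.
narrow_catalog = resolve_patterns({"catalog": ["catalog.yml", "**/catalog.yml"]})

print(sorted(with_spark))
print(narrow_catalog["catalog"])
```

Note that update replaces a key's whole list rather than extending it, which is why overriding catalog narrows what gets loaded.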
m
Ah, makes sense!
n
This is for backward compatibility: if no arguments are given it should behave exactly the same, but you can override the patterns if you wish.
config_loader["xxx"]
is the preferred way to use
config_loader
,
get
is not in the
AbstractConfigLoader
interface
so using
get
will only work for TemplatedConfigLoader and ConfigLoader, while all config loaders should support the dictionary interface.
m
yes, I had troubles with it a couple of weeks ago 🙂
n
Sorry about that, I think this is our mistake. It is designed in a backward compatible way but we haven’t updated our starters properly.
m
No worries at all. This only happens to those who actually build something for others. Kedro is a beautiful piece of software and the support you, Juan, Joel… et al. provide is world-class.
Are you sure this works? In your PR,
CONFIG_LOADER_CLASS = OmegaConfigLoader
is commented out, so I think it falls back to the default ConfigLoader. Even after I add this line, I still get the duplicate key error.
Ok, this is how I could make it work:
settings.py
-> I used Juan’s workaround from here
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader

CONFIG_LOADER_ARGS = {
      "config_patterns": {
          "spark": ["spark*", "spark*/**"],
          "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
          "catalog": ["catalog.yml", "**/catalog.yml", "globals_catalog.yml"],
      }
}

from gci_pricing.hooks import SparkHooks

HOOKS = (SparkHooks(),)
globals_catalog.yml
_base_path: "s3a://..."

_folders:
  raw: "01_raw"
  int: "02_intermediate"
  prm: "03_primary"
  eda: "04_eda"
  ftr: "05_feature"
  mip: "06_model_input"
  mdl: "07_models"
  mop: "08_model_output"
  rpt: "09_reporting"
hooks.py
-> your suggestion
parameters = context.config_loader["spark"]
n
Sorry, this only fixes the starter part, not the duplicate key issue. Changing the
config_pattern
can prevent OmegaConfigLoader from loading the
checkpoints
folder. However, as you point out, with
ConfigLoader
we don't need to do this. The correct fix is not clear yet (I am hoping to add this to the backlog next week), but it should be a fix within
OmegaConfigLoader
.
I didn't want to change the default to OmegaConfigLoader; it should still be ConfigLoader. But if you create a new project from the starter now (with the new version), you should be able to uncomment OmegaConfigLoader and use it straight away without changing the hooks.
So the complete fix will be in two parts: 1. The issue that @Juan Luis raised, which fixes the inconsistency in how OmegaConfigLoader treats the patterns (kedro). 2. The issue that the starters use an API that isn't supported by all config loaders; they should use the
dict
interface. (kedro-starters)
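For anyone following along, the duplicate key error can be pictured with a toy merge. This is only an illustration under assumptions: merge_configs and is_hidden are hypothetical helpers, the conf/base/catalog paths are made up, and none of this is the actual fix planned for OmegaConfigLoader.

```python
from pathlib import PurePosixPath


def is_hidden(path):
    """True if any path component starts with a dot (e.g. .ipynb_checkpoints)."""
    return any(part.startswith(".") for part in PurePosixPath(path).parts)


def merge_configs(files, skip_hidden=False):
    """Hypothetical merge: combine per-file dicts, rejecting repeated top-level keys."""
    merged = {}
    for path, content in files.items():
        if skip_hidden and is_hidden(path):
            continue  # ignore notebook checkpoint copies and other hidden paths
        duplicates = merged.keys() & content.keys()
        if duplicates:
            raise ValueError(f"Duplicate keys found in {path}: {sorted(duplicates)}")
        merged.update(content)
    return merged


# Made-up file layout: the checkpoint is a copy of the real catalog file.
files = {
    "conf/base/catalog/raw_layer.yml": {"raw_table": {"type": "example"}},
    "conf/base/catalog/.ipynb_checkpoints/raw_layer-checkpoint.yml": {
        "raw_table": {"type": "example"}
    },
}

try:
    merge_configs(files)
except ValueError as error:
    print(error)

# Skipping hidden paths leaves only the real file.
print(merge_configs(files, skip_hidden=True))
```

Because the checkpoint file repeats the same keys as the original, any broad pattern that picks up both files triggers the error, while a filter on hidden paths avoids it.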
m
Thanks! I reverted back to TemplatedConfigLoader as the
config_pattern
workaround felt too restrictive. I think we're good for now.
Have a great weekend!