Mate Scharnitzky
08/09/2023, 7:41 PM
When I set CONFIG_LOADER_CLASS = OmegaConfigLoader in settings.py and run kedro ipython, I get a ValueError: Duplicate keys found in my raw_layer.yml and .ipynb_checkpoints/raw_layer-checkpoint.yml.

settings.py:
from <my project>.hooks import SparkHooks
from kedro.config import OmegaConfigLoader
CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
        "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
        "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
        "logging": ["logging*", "logging*/**", "**/logging*"],
    }
}
HOOKS = (SparkHooks(),)
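The spark pattern above, and the hook below, assume a spark.yml under conf/base. A minimal example; the values are the ones that appear in the Pdb session later in the thread:

```yaml
# conf/base/spark.yml (illustrative)
spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
spark.scheduler.mode: FAIR
```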
hooks.py:
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession
class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in the project's conf folder.
        """
        # Load the spark configuration in spark.yaml using the config loader
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
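An aside on the Duplicate keys error itself: Jupyter writes a byte-identical copy of every edited file into an .ipynb_checkpoints/ folder, and recursive glob patterns such as catalog*/** descend into it, so every key is seen twice. A minimal pathlib sketch; the catalog/ folder layout and the companies entry are assumptions about the reporter's project, and Kedro's own file discovery may differ in detail:

```python
import tempfile
from pathlib import Path

# Recreate a conf folder with a Jupyter checkpoint copy of a catalog file.
base = Path(tempfile.mkdtemp())
(base / "catalog" / ".ipynb_checkpoints").mkdir(parents=True)
(base / "catalog" / "raw_layer.yml").write_text(
    "companies:\n  type: pandas.CSVDataSet\n"
)
(base / "catalog" / ".ipynb_checkpoints" / "raw_layer-checkpoint.yml").write_text(
    "companies:\n  type: pandas.CSVDataSet\n"
)

# The same recursive pattern used in config_patterns picks up both files,
# so the `companies` key is seen twice -> "Duplicate keys found".
matched = sorted(p.relative_to(base).as_posix() for p in base.glob("catalog*/**/*.yml"))
print(matched)
```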
If I use the default ConfigLoader it works as expected. Why would OmegaConfigLoader read from ipynb checkpoints? How can I switch that off?
Thank you!

Nok Lam Chan
08/09/2023, 8:00 PM

Nok Lam Chan
08/09/2023, 8:01 PM

Nok Lam Chan
08/09/2023, 8:02 PM

Mate Scharnitzky
08/09/2023, 8:05 PM
With OmegaConfigLoader I get an error at spark_conf = SparkConf().setAll(parameters.items()) in hooks.py: parameters is a str and doesn't have an items() method.

Mate Scharnitzky
08/09/2023, 8:05 PM

Nok Lam Chan
08/09/2023, 8:29 PM

Mate Scharnitzky
08/09/2023, 8:32 PM

Mate Scharnitzky
08/09/2023, 8:32 PM

Nok Lam Chan
08/09/2023, 8:51 PM

Nok Lam Chan
08/09/2023, 8:58 PM

Nok Lam Chan
08/09/2023, 8:58 PM

Erwin
08/09/2023, 9:11 PM
parameters = context.config_loader.get("spark")
spark_conf = SparkConf().setAll(parameters.items())

Mate Scharnitzky
08/09/2023, 9:38 PM

Nok Lam Chan
08/09/2023, 9:41 PM

Nok Lam Chan
08/09/2023, 9:42 PM

Erwin
08/09/2023, 9:42 PM
(Pdb) context.config_loader.get("spark*", "spark*/**")
'spark*/**'
(Pdb) context.config_loader.get("spark")
{'spark.driver.maxResultSize': '3g', 'spark.hadoop.fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem', 'spark.sql.execution.arrow.pyspark.enabled': True, 'spark.scheduler.mode': 'FAIR'}
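The session above shows the pitfall: OmegaConfigLoader supports the plain dictionary interface, so get(key, default) behaves like dict.get, and the second glob pattern is silently treated as a default return value rather than an extra pattern. A minimal sketch with an ordinary dict standing in for the config loader:

```python
# A plain dict standing in for the config loader's dictionary interface.
config = {"spark": {"spark.scheduler.mode": "FAIR"}}

# Two positional args: the second one is a *default*, not a second pattern.
print(config.get("spark*", "spark*/**"))  # -> 'spark*/**' (key not found, default returned)

# A single-key lookup works as expected.
print(config.get("spark"))  # -> {'spark.scheduler.mode': 'FAIR'}
```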
Nok Lam Chan
08/09/2023, 11:23 PM

datajoely
08/10/2023, 8:46 AM

Nok Lam Chan
08/10/2023, 3:25 PM

Mate Scharnitzky
08/10/2023, 3:27 PM

Mate Scharnitzky
08/10/2023, 3:29 PM

Nok Lam Chan
08/11/2023, 3:10 PM

Nok Lam Chan
08/11/2023, 3:10 PM

Nok Lam Chan
08/11/2023, 3:30 PM
With OmegaConfigLoader you need to change the settings.py and hooks.py a little bit.
See https://github.com/kedro-org/kedro-starters/pull/142/files

Mate Scharnitzky
08/11/2023, 3:33 PM
0.18.11

Nok Lam Chan
08/11/2023, 3:35 PM
In hooks.py, change this:
parameters = context.config_loader["spark"]
and in settings.py, uncomment the config:
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    }
}

Mate Scharnitzky
08/11/2023, 3:37 PM
Do I need to include all of the default patterns in config_patterns?
E.g.,
from kedro.config import OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
        "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
        "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
        "logging": ["logging*", "logging*/**", "**/logging*"],
    }
}

Mate Scharnitzky
08/11/2023, 3:40 PM

Nok Lam Chan
08/11/2023, 3:43 PM
It uses the .update method

Nok Lam Chan
08/11/2023, 3:43 PM
so you only need to specify spark

Nok Lam Chan
08/11/2023, 3:44 PM
If you look at OmegaConfigLoader, you will see:
self.config_patterns = {
    "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
    "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
    "logging": ["logging*", "logging*/**", "**/logging*"],
}
self.config_patterns.update(config_patterns or {})

Mate Scharnitzky
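So, per the snippet above, user-supplied patterns are merged over the defaults with plain dict.update semantics, which is why listing only spark is enough. A sketch with the defaults abbreviated:

```python
# Defaults as hard-coded in OmegaConfigLoader (abbreviated), merged with a
# user-supplied config_patterns via plain dict.update.
defaults = {
    "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
    "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
}
user_patterns = {"spark": ["spark*", "spark*/**"]}
defaults.update(user_patterns or {})

# "spark" is added; the built-in keys survive untouched.
print(sorted(defaults))  # -> ['catalog', 'parameters', 'spark']
```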
08/11/2023, 3:45 PM

Nok Lam Chan
08/11/2023, 3:46 PM
config_loader["xxx"] is the preferred way to use a config loader; get is not in the AbstractConfigLoader interface

Nok Lam Chan
08/11/2023, 3:46 PM
get only works for TemplatedConfigLoader and ConfigLoader, while all config loaders should support the dictionary interface

Mate Scharnitzky
08/11/2023, 3:47 PM

Nok Lam Chan
08/11/2023, 3:49 PM

Mate Scharnitzky
08/11/2023, 3:53 PM

Mate Scharnitzky
08/11/2023, 4:34 PM
CONFIG_LOADER_CLASS = OmegaConfigLoader is commented out, so I think it falls back to the default ConfigLoader. When I add this line though, I still get the duplicate key error.

Mate Scharnitzky
08/11/2023, 4:45 PM
settings.py -> I used Juan's workaround from here
from kedro.config import OmegaConfigLoader
CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
        "catalog": ["catalog.yml", "**/catalog.yml", "globals_catalog.yml"],
    }
}
from gci_pricing.hooks import SparkHooks
HOOKS = (SparkHooks(),)

globals_catalog.yml
_base_path: "s3a://..."
_folders:
  raw: "01_raw"
  int: "02_intermediate"
  prm: "03_primary"
  eda: "04_eda"
  ftr: "05_feature"
  mip: "06_model_input"
  mdl: "07_models"
  mop: "08_model_output"
  rpt: "09_reporting"
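For readers wondering how these underscore-prefixed globals get used: because globals_catalog.yml is loaded under the same catalog pattern, catalog entries can reference its keys through OmegaConf's ${...} interpolation syntax. A sketch; the entry name, dataset type, and file path below are illustrative assumptions, not from the thread:

```yaml
# catalog.yml (illustrative entry)
raw_companies:
  type: spark.SparkDataSet
  filepath: "${_base_path}/${_folders.raw}/companies.parquet"
  file_format: parquet
```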
hooks.py -> your suggestion
parameters = context.config_loader["spark"]

Nok Lam Chan
08/11/2023, 5:08 PM
Tweaking config_patterns can stop OmegaConfigLoader from loading the checkpoints folder.
However, as you point out, with ConfigLoader we don't need to do this. The correct fix is unclear yet (I am hoping to add this into the backlog next week), but it should be a fix within OmegaConfigLoader.

Nok Lam Chan
08/11/2023, 5:11 PM

Nok Lam Chan
08/11/2023, 5:13 PM
dict interface. (kedro-starters)

Mate Scharnitzky
08/11/2023, 6:53 PM
The config_patterns workaround felt too restrictive. I think we're good for now.

Mate Scharnitzky
08/11/2023, 6:53 PM