Mate Scharnitzky
08/09/2023, 7:41 PM
When I set CONFIG_LOADER_CLASS = OmegaConfigLoader in settings.py and run kedro ipython, I get a ValueError: Duplicate keys found in my raw_layer.yml and .ipynb_checkpoints/raw_layer-checkpoint.yml.
settings.py
```python
from <my project>.hooks import SparkHooks
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
        "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
        "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
        "logging": ["logging*", "logging*/**", "**/logging*"],
    }
}
HOOKS = (SparkHooks(),)
```
hooks.py
```python
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in the project's conf folder.
        """
        # Load the spark configuration in spark.yaml using the config loader
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```
If I use the default ConfigLoader, it works as expected. Why would OmegaConfigLoader read from .ipynb_checkpoints? How can I switch that off? Thank you!

Nok Lam Chan
08/09/2023, 8:00 PM

Mate Scharnitzky
08/09/2023, 8:05 PM
At spark_conf = SparkConf().setAll(parameters.items()) in hooks.py, I also get that parameters is a str and doesn't have an items() method.

Nok Lam Chan
08/09/2023, 8:29 PM

Mate Scharnitzky
08/09/2023, 8:32 PM

Nok Lam Chan
08/09/2023, 8:51 PM

Erwin
08/09/2023, 9:11 PMparameters = context.config_loader.get("spark")
spark_conf = SparkConf().setAll(parameters.items())
Mate Scharnitzky
08/09/2023, 9:38 PM

Nok Lam Chan
08/09/2023, 9:41 PM

Erwin
08/09/2023, 9:42 PM

```
(Pdb) context.config_loader.get("spark*", "spark*/**")
'spark*/**'
(Pdb) context.config_loader.get("spark")
{'spark.driver.maxResultSize': '3g', 'spark.hadoop.fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem', 'spark.sql.execution.arrow.pyspark.enabled': True, 'spark.scheduler.mode': 'FAIR'}
(Pdb)
```
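[Editor's note] The pdb output above shows why the original hooks.py call failed: OmegaConfigLoader exposes a dict-like interface, so .get(key, default) follows standard mapping semantics, and the second argument "spark*/**" is returned as a default value rather than being used as a second glob pattern. A minimal sketch of the behaviour, with a plain dict standing in for the loader (the contents are illustrative):

```python
# A plain dict standing in for the config loader's mapping interface.
loader = {"spark": {"spark.driver.maxResultSize": "3g"}}

# There is no key literally named "spark*", so .get() returns the
# second argument -- it is a default value, not a second glob pattern.
print(loader.get("spark*", "spark*/**"))  # -> spark*/**

# Looking up the config group name itself returns the loaded config.
print(loader.get("spark"))  # -> {'spark.driver.maxResultSize': '3g'}
```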
Nok Lam Chan
08/09/2023, 11:23 PM

datajoely
08/10/2023, 8:46 AM

Nok Lam Chan
08/10/2023, 3:25 PM

Mate Scharnitzky
08/10/2023, 3:27 PM

Nok Lam Chan
08/11/2023, 3:10 PM
To use OmegaConfigLoader you need to change settings.py and hooks.py a little bit. See https://github.com/kedro-org/kedro-starters/pull/142/files

Mate Scharnitzky
08/11/2023, 3:33 PM
0.18.11
Nok Lam Chan
08/11/2023, 3:35 PM
In hooks.py, change this:

```python
parameters = context.config_loader["spark"]
```

and in settings.py, uncomment the config:

```python
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    }
}
```
Mate Scharnitzky
08/11/2023, 3:37 PM
Don't we need to keep the rest of the config_patterns?
E.g.,

```python
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
        "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
        "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
        "logging": ["logging*", "logging*/**", "**/logging*"],
    }
}
```
Nok Lam Chan
08/11/2023, 3:43 PM
No, the defaults are merged in with the .update method, so you only need to add the spark entry. If you look at OmegaConfigLoader, you will see:

```python
self.config_patterns = {
    "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
    "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
    "logging": ["logging*", "logging*/**", "**/logging*"],
}
self.config_patterns.update(config_patterns or {})
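[Editor's note] Because of that .update call, user-supplied patterns are merged on top of the defaults key by key rather than replacing them wholesale. A quick sketch of the merge behaviour, using plain dicts with the default patterns copied from the snippet above:

```python
# Default patterns, as in the OmegaConfigLoader snippet above.
defaults = {
    "catalog": ["catalog*", "catalog*/**", "**/catalog*"],
    "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    "credentials": ["credentials*", "credentials*/**", "**/credentials*"],
    "logging": ["logging*", "logging*/**", "**/logging*"],
}
# The user only needs to supply the extra "spark" entry.
user_patterns = {"spark": ["spark*", "spark*/**"]}

config_patterns = dict(defaults)
config_patterns.update(user_patterns or {})

# The defaults survive; "spark" is simply added alongside them.
print(sorted(config_patterns))
# -> ['catalog', 'credentials', 'logging', 'parameters', 'spark']
```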
Mate Scharnitzky
08/11/2023, 3:45 PM

Nok Lam Chan
08/11/2023, 3:46 PM
config_loader["xxx"] is the preferred way to use the config_loader; get is not in the AbstractConfigLoader interface. get only works for TemplatedConfigLoader and ConfigLoader, while all config loaders should support the dictionary interface.

Mate Scharnitzky
08/11/2023, 3:47 PM

Nok Lam Chan
08/11/2023, 3:49 PM

Mate Scharnitzky
08/11/2023, 3:53 PM
CONFIG_LOADER_CLASS = OmegaConfigLoader is commented out, so I think it falls back to the default ConfigLoader. If I add this line though, I still get the duplicated key error.

settings.py
-> I used Juan’s workaround from here
```python
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
        "catalog": ["catalog.yml", "**/catalog.yml", "globals_catalog.yml"],
    }
}

from gci_pricing.hooks import SparkHooks

HOOKS = (SparkHooks(),)
```
globals_catalog.yml

```yaml
_base_path: "s3a://..."
_folders:
  raw: "01_raw"
  int: "02_intermediate"
  prm: "03_primary"
  eda: "04_eda"
  ftr: "05_feature"
  mip: "06_model_input"
  mdl: "07_models"
  mop: "08_model_output"
  rpt: "09_reporting"
```
hooks.py
-> your suggestion

```python
parameters = context.config_loader["spark"]
```
Nok Lam Chan
08/11/2023, 5:08 PM
The config_pattern can stop OmegaConfigLoader from loading the checkpoints folder.
However, as you point out, with ConfigLoader we don't need to do this. The correct fix is unclear yet (I am hoping to add this to the backlog next week), but it should be a fix within OmegaConfigLoader.
dict interface (kedro-starters)

Mate Scharnitzky
08/11/2023, 6:53 PM
The config_pattern workaround felt too restrictive. I think we're good for now.