Carlos Prieto - Tomtom
01/15/2025, 1:34 PM
[…] abfss in Kedro 0.19.3, which was not present in version 0.18. Here is a summary of the problem:
In Kedro 0.18, I configured the credentials for accessing storage through Spark configurations with Azure Service Principal, and everything worked fine. However, after upgrading to Kedro 0.19.3, the same setup stopped working. After spending a couple of days troubleshooting, I discovered that adding the credentials as environment variables resolved the issue.
My questions are:
1. Does Kedro 0.19.3 read these environment variables directly?
2. Is this behavior managed by Kedro itself or by the abfss library?
Additionally, it seems redundant to have to provide the credentials both in the Spark configuration and as environment variables. This redundancy is confusing and feels like a bug rather than a feature. Could you please clarify whether this is the intended behavior?
Execution Environment:
• This is being executed in Databricks.
• The Spark configuration for the Azure Service Principal is added to the Databricks cluster in use. (The cluster configuration includes credentials for multiple storage accounts.)
• Only one storage account's credentials can be set as environment variables, but since the Spark config already authenticates the Spark session, filling in these variables, even with incorrect values, is enough to unblock access to all the storage accounts.
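To make the observed behavior concrete, here is a minimal, hypothetical sketch (not Kedro or adlfs source code) of the kind of environment-variable fallback an fsspec-style filesystem layer can perform when no explicit credentials are given; all names here are illustrative:

```python
import os

# Hypothetical sketch (not Kedro/adlfs source): a filesystem layer that
# receives no explicit credentials may fall back to AZURE_* environment
# variables, independently of the already-authenticated Spark session.
def resolve_service_principal(explicit=None):
    """Return service-principal credentials, falling back to env vars."""
    explicit = explicit or {}
    resolved = {
        "tenant_id": explicit.get("tenant_id") or os.environ.get("AZURE_TENANT_ID"),
        "client_id": explicit.get("client_id") or os.environ.get("AZURE_CLIENT_ID"),
        "client_secret": explicit.get("client_secret") or os.environ.get("AZURE_CLIENT_SECRET"),
    }
    missing = [key for key, value in resolved.items() if not value]
    if missing:
        # This mirrors a "missing credentials" error raised at dataset
        # instantiation time, before Spark ever reads the data.
        raise ValueError(f"Missing credentials: {missing}")
    return resolved
```

This would explain why setting the variables unblocks the dataset check even with values that the Spark session itself never uses.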
Thank you for your assistance!

Hall
01/15/2025, 1:34 PM

Juan Luis
01/15/2025, 1:53 PM
• […] OmegaConfigLoader, is that correct?
• as far as I understand (but I could be wrong), Kedro doesn't do any magic env-variable loading for credentials. Apart from PySpark, are there any relevant Python dependencies in your environment?

Carlos Prieto - Tomtom
01/20/2025, 11:46 AM

# Class that manages how configuration is loaded.
from kedro.config import OmegaConfigLoader  # noqa: E402

CONFIG_LOADER_CLASS = OmegaConfigLoader

# Keyword arguments to pass to the CONFIG_LOADER_CLASS constructor.
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "config_patterns": {
        "spark": ["spark*", "spark*/**"],
    },
}
• I also wonder why, even though the Spark session is already authenticated with credentials through the Databricks cluster config, Kedro raises a missing-credentials error for the dataset when it instantiates the Kedro SparkDataset and the environment variables are not set. Here are the dependencies for my project:
delta-spark==2.3.0
kedro==0.19.3
pyspark==3.3.2
azure-identity==1.12.0
azure-keyvault-secrets==4.7.0
pandas==1.5.3
country_converter==1.0.0
unidecode==1.3.6
haversine==2.8.0
rapidfuzz==3.1.2
numpy==1.23.1
azure-mgmt-network==25.2.0
azure-mgmt-compute==30.4.0
kedro-viz==8.0.1
kedro-datasets[spark-sparkdataset, spark-sparkjdbcdataset, pandas-csvdataset, pickle-pickledataset]==3.0.1
hdfs==2.7.3
s3fs==2024.3.1
postal==1.1.10
deltalake==0.16.3
opentraveldata==0.0.9.post2
fuzzywuzzy==0.18.0
python-Levenshtein==0.25.0
country-converter==1.0.0
babel==2.14.0
langchain==0.0.347
openai>=0.27.0
geopandas~=0.11.0
tiktoken==0.6.0
faiss-cpu==1.8.0
Thanks for the help 🙂
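For completeness, one way to avoid duplicating credentials as environment variables is to pass them to the dataset through the catalog, so the filesystem-level existence check receives them directly. This is only a sketch, assuming the kedro-datasets SparkDataset forwards a credentials mapping to the underlying fsspec filesystem (check the behavior of the exact version you run); the container, account, and path names are placeholders:

```yaml
# conf/local/credentials.yml (placeholder values)
azure_sp:
  tenant_id: <tenant-id>
  client_id: <client-id>
  client_secret: <client-secret>

# conf/base/catalog.yml
my_dataset:
  type: spark.SparkDataset
  filepath: abfss://container@account.dfs.core.windows.net/path/to/data
  file_format: delta
  credentials: azure_sp
```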