# questions
Hello everyone! I'm trying to use `SparkDataset` to read and write to the Azure Data Lake File System, using the `abfs://` prefix. I noticed that, although the dataset requires credentials to be passed in the init method, these credentials are not used when writing, requiring the Spark session to be configured globally. This seems a bit out of line with the Kedro standard, as it doesn't allow us to have datasets from multiple sources. Shouldn't we be using these credentials directly when writing and reading, without relying on the global Spark configuration?
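For reference, this is roughly the kind of setup I mean (account name, container, paths and credential keys below are placeholders, not my real config):

```python
# Hypothetical sketch of the dataset setup described above; all names are placeholders.
from kedro_datasets.spark import SparkDataset

cars = SparkDataset(
    filepath="abfss://container@myaccount.dfs.core.windows.net/data/cars.parquet",
    file_format="parquet",
    # Credentials are accepted at init time, but they don't seem to be applied on save:
    credentials={"account_name": "myaccount", "account_key": "<storage-account-key>"},
)
```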
Hi @Júlio Resende, as far as I understand:
• When you use `abfs://` or `abfss://`, Spark defers to the Azure Hadoop connector to handle authentication.
• That connector ignores per-writer `.options(...)` for authentication and only checks the Hadoop configuration.
• So unless you've already set these globally on the Spark session (see the sketch below), `.save()` will fail to authenticate, even if you passed `credentials` into the Kedro dataset.
• For some formats (e.g. JDBC, S3 connectors), Spark allows passing authentication tokens directly as reader options. That's why Kedro's `SparkDataSet` supports merging `credentials` into `.load()`.
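In other words, something along these lines has to be set on the session itself. This is the account-key flavour of ABFS auth (other auth modes use different `fs.azure.*` properties), and the account name and key are placeholders:

```python
# Sketch: account-key ABFS auth set globally on the Spark session (placeholder values).
# The Azure Hadoop connector reads fs.azure.account.key.<account>.dfs.core.windows.net
# from the Hadoop/Spark configuration; per-writer .options(...) are not consulted.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set(
    "fs.azure.account.key.myaccount.dfs.core.windows.net",  # placeholder account name
    "<storage-account-key>",  # placeholder secret, normally injected from credentials
)

# With the session configured, a plain write to abfss:// can authenticate:
# df.write.save("abfss://container@myaccount.dfs.core.windows.net/data/out.parquet")
```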
Reading: works with credentials in the dataset, because Kedro passes them as options to `DataFrameReader`.
Writing: those same credentials don't get passed to `DataFrameWriter`; Spark tries to resolve the ABFS path and falls back to the Hadoop configs.
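A rough sketch of that asymmetry, taking the above at face value (paths and credential keys are placeholders):

```python
# Sketch of the read/write asymmetry described above; paths and keys are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "abfss://container@myaccount.dfs.core.windows.net/data/table.parquet"
credentials = {"fs.azure.account.key.myaccount.dfs.core.windows.net": "<key>"}

# Load: dataset credentials end up as reader options.
df = spark.read.options(**credentials).load(path, format="parquet")

# Save: the writer gets no such options, so the ABFS path is resolved against
# whatever authentication is already in the global Hadoop/Spark configuration.
df.write.mode("overwrite").save(path, format="parquet")
```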
Hi @Júlio Resende, things are slightly different with Spark indeed. Part of the reason is that Spark has its own authentication mechanism, which is different from Kedro's (`fsspec`-based) one. You can still have multiple `spark.yml` configurations to keep different sets of Spark credentials, though that's not as granular as dataset-level credentials.
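For example, one common pattern is to apply `spark.yml` to the session in a hook. The sketch below assumes the project's config loader resolves a `spark` pattern per environment, so switching environments switches which credentials apply:

```python
# Sketch: loading a per-environment spark.yml and applying it to the Spark session.
# Assumes context.config_loader["spark"] resolves the spark.yml of the active
# environment (e.g. conf/base vs conf/azure); the app name below is a placeholder.
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # One global session per run; its credentials come from whichever
        # environment's spark.yml was loaded above.
        (
            SparkSession.builder.appName("my-kedro-project")
            .config(conf=spark_conf)
            .getOrCreate()
        )
```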
Thank you! I created a custom dataset that uses the credentials in the `.options(...)` method.
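In case it's useful to others, a minimal sketch of what such a dataset can look like. The class name and option keys are illustrative, and versioning/error handling are left out:

```python
# Sketch: a custom dataset that merges credentials into .options(...) on both
# the reader and the writer. All names here are illustrative, not the real code.
from typing import Any, Optional

from kedro.io import AbstractDataset
from pyspark.sql import DataFrame, SparkSession


class SparkOptionsDataset(AbstractDataset[DataFrame, DataFrame]):
    def __init__(
        self,
        filepath: str,
        file_format: str = "parquet",
        credentials: Optional[dict[str, Any]] = None,
    ) -> None:
        self._filepath = filepath
        self._file_format = file_format
        self._credentials = credentials or {}

    def _load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        return (
            spark.read.options(**self._credentials)
            .format(self._file_format)
            .load(self._filepath)
        )

    def _save(self, data: DataFrame) -> None:
        # Same trick on the write side: merge the credentials into the writer options.
        (
            data.write.options(**self._credentials)
            .format(self._file_format)
            .mode("overwrite")
            .save(self._filepath)
        )

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "file_format": self._file_format}
```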