#questions

Toni - TomTom - Madrid

03/25/2024, 12:00 PM
Hi, we have run into an annoying problem when working with Delta tables in Azure/Databricks. We found the issue when running locally ~and when running from a notebook within Databricks~.
We are trying to connect to a Delta table using spark.SparkDataset with file_format: delta:
search_sessions_logs:
  type: spark.SparkDataset
  filepath: abfss://bronze@adlsmapsanalyticspoi.dfs.core.windows.net/external_sources/search_logs_amigo
  file_format: delta
Removing the abfss:// prefix and using the dbfs path works from a notebook.
This is the error we get when trying to connect to the Delta table:
TypeError: DatabricksFileSystem.__init__() missing 2 required positional arguments: 'instance' and 'token'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 198, in main
    cli_collection()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 127, in main
    super().main(
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/project.py", line 225, in run
    session.run(
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/session/session.py", line 374, in run
    catalog = context._get_catalog(
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/context/context.py", line 231, in _get_catalog
    catalog: DataCatalog = settings.DATA_CATALOG_CLASS.from_config(
  File "/usr/local/lib/python3.10/site-packages/kedro/io/data_catalog.py", line 294, in from_config
    datasets[ds_name] = AbstractDataset.from_config(
  File "/usr/local/lib/python3.10/site-packages/kedro/io/core.py", line 164, in from_config
    raise DatasetError(
kedro.io.core.DatasetError: 
DatabricksFileSystem.__init__() missing 2 required positional arguments: 'instance' and 'token'.
Dataset 'search_sessions_logs' must only contain arguments valid for the constructor of 'kedro_datasets.spark.spark_dataset.SparkDataset'.

Juan Luis

03/25/2024, 1:33 PM
Hey Toni! Where does DatabricksFileSystem come from?

Toni - TomTom - Madrid

03/26/2024, 7:51 AM
Yes, in the past (version 0.18) we were able to connect from local to Delta tables in Databricks/Azure Blob Storage by means of spark.yml (adding the Azure credentials there), but now it seems that this is no longer the format it expects.
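
For context, a spark.yml of the kind mentioned here might look roughly like the sketch below. It assumes the project loads spark.yml into the SparkSession through a hook and that the storage account is accessed with an account key; neither is confirmed in this thread. The setting name follows the Hadoop ABFS driver convention, and the key value is a placeholder.

```yaml
# spark.yml (hypothetical sketch; the key value is a placeholder)
# Hadoop ABFS account-key setting, passed via Spark's "spark.hadoop." prefix.
# The account name is taken from the filepath shared earlier in the thread.
spark.hadoop.fs.azure.account.key.adlsmapsanalyticspoi.dfs.core.windows.net: "<storage-account-key>"
```

Secrets like this would normally live in a local, uncommitted configuration environment rather than in the shared one.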

Nok Lam Chan

03/26/2024, 4:24 PM
Can you replace dbfs:// with /dbfs/? Please also share what you put in filepath, since the example you shared above doesn't seem to match the error message here.
It shouldn't use DatabricksFileSystem to start with. IIRC the dbfs filesystem provided by fsspec is not useful. In addition, Spark has a native way to access remote storage, so it shouldn't even use fsspec.
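
The suggested change could be sketched as the catalog entry below. The mount path is a placeholder, since the actual dbfs path used in this thread was not shared, and this form is only expected to resolve on a Databricks cluster, where DBFS is exposed under /dbfs.

```yaml
# catalog.yml (sketch; <mount-name> is a placeholder, not the real mount)
search_sessions_logs:
  type: spark.SparkDataset
  # POSIX-style /dbfs/ path instead of the dbfs:// URL form
  filepath: /dbfs/mnt/<mount-name>/external_sources/search_logs_amigo
  file_format: delta
```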

Toni - TomTom - Madrid

03/27/2024, 7:30 AM
Thanks! I tried all the possible paths and got the same problem. The error comes from using the Kedro SparkDataset with file_format: delta. It seems that it now expects credentials in order to connect remotely. Do you know what the expected schema of this "credentials" dictionary is?

Nok Lam Chan

03/27/2024, 11:48 AM
@Toni - TomTom - Madrid Sorry this happened. Can you share what your catalog looks like and what error and stack trace you get? I tried:
search_sessions_logs:
  type: spark.SparkDataset
  filepath: dbfs:/bronze@adlsmapsanalyticspoi.dfs.core.windows.net/external_sources/search_logs_amigo
  file_format: delta
And I only get a Java error. I don't think you need to provide any credentials, since Spark authenticates in its own way. If it's triggering fsspec then it's very likely a bug, but I would appreciate it if you could share an example that we can reproduce.

Toni - TomTom - Madrid

04/01/2024, 7:26 AM
Hi Nok! Thanks for answering. To reproduce the problem you need to have a Delta table saved in Azure Blob Storage File System (the path shown in the catalog is a mount in Databricks) and try to read it locally as a spark.SparkDataset. Sorry, I cannot share any public dataset for this 😞

Guillermo Caminero

04/01/2024, 1:26 PM
I have an additional problem related to this one for Azure connections. When I try to connect via the wasb[s] protocol, it tells me that it doesn't know the protocol. It should be supported for Azure access as well, just like abfs[s].

Nok Lam Chan

04/01/2024, 10:45 PM
@Guillermo Caminero Could you create a separate GitHub issue about this? I have not used wasbs before. Kedro supports remote storage via fsspec; for Azure specifically, that is https://github.com/fsspec/adlfs. I didn't find any mention of wasbs there, so chances are it's not supported.

Guillermo Caminero

04/02/2024, 7:36 AM
Hello @Nok Lam Chan, first of all, thanks. In the end I couldn't connect via WASB. It's true that if you have a version 1 blob storage there is no other option, but it's a bit old now. I did manage to connect via ABFSS using fsspec. In this case the library requires the credentials for the Azure blob to use specific parameter names: https://github.com/fsspec/adlfs. I have used, for example, account_name and sas_token, but you can also use account_name and account_key, or a service principal. If you set the credentials with these names, Kedro is able to pass them on to fsspec, since it forwards them as **credentials. You can find examples of how to do it manually here (with Kedro it is only configuration): https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-spark-pool-filesystem-spec
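
The configuration described here could look roughly like the sketch below. It is not a verified setup: the entry name azure_storage and the placeholder values are made up for illustration, the filepath is reused from the first message in the thread, and the credential keys follow the parameter names documented by adlfs.

```yaml
# catalog.yml
search_sessions_logs:
  type: spark.SparkDataset
  filepath: abfss://bronze@adlsmapsanalyticspoi.dfs.core.windows.net/external_sources/search_logs_amigo
  file_format: delta
  credentials: azure_storage    # hypothetical entry name, looked up in credentials.yml

# credentials.yml (keep out of version control)
azure_storage:
  account_name: adlsmapsanalyticspoi
  sas_token: "<sas-token>"      # alternatively: account_key: "<account-key>"
```

These credentials presumably only configure the fsspec filesystem that Kedro builds for the dataset; Spark still reads the data with its own authentication, as noted earlier in the thread.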