Hi we have found this annoying problem when dealing with Del Kedro #questions

Toni - TomTom - Madrid

03/25/2024, 12:00 PM

Hi, we have found this annoying problem when dealing with Delta tables in Azure/Databricks. We found the ~~same~~ issue running locally ~~and running from notebook within Databricks~~

Toni - TomTom - Madrid

03/25/2024, 12:13 PM

we are trying to connect to a Delta table using spark.SparkDataset, file_format: delta

Toni - TomTom - Madrid

03/25/2024, 12:14 PM

search_sessions_logs:
  type: spark.SparkDataset
  filepath: abfss://bronze@adlsmapsanalyticspoi.dfs.core.windows.net/external_sources/search_logs_amigo
  file_format: delta

💡 1

Toni - TomTom - Madrid

03/25/2024, 12:16 PM

removing the abfss:// prefix and using the dbfs path works from a notebook

Toni - TomTom - Madrid

03/25/2024, 12:27 PM

We get this error when trying to connect to the Delta table:

Toni - TomTom - Madrid

03/25/2024, 12:27 PM

TypeError: DatabricksFileSystem.__init__() missing 2 required positional arguments: 'instance' and 'token'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 198, in main
    cli_collection()
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 127, in main
    super().main(
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/cli/project.py", line 225, in run
    session.run(
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/session/session.py", line 374, in run
    catalog = context._get_catalog(
  File "/usr/local/lib/python3.10/site-packages/kedro/framework/context/context.py", line 231, in _get_catalog
    catalog: DataCatalog = settings.DATA_CATALOG_CLASS.from_config(
  File "/usr/local/lib/python3.10/site-packages/kedro/io/data_catalog.py", line 294, in from_config
    datasets[ds_name] = AbstractDataset.from_config(
  File "/usr/local/lib/python3.10/site-packages/kedro/io/core.py", line 164, in from_config
    raise DatasetError(
kedro.io.core.DatasetError: 
DatabricksFileSystem.__init__() missing 2 required positional arguments: 'instance' and 'token'.
Dataset 'search_sessions_logs' must only contain arguments valid for the constructor of 'kedro_datasets.spark.spark_dataset.SparkDataset'.

Juan Luis

03/25/2024, 1:33 PM

hey Toni! where does `DatabricksFileSystem` come from?

Juan Luis

03/25/2024, 1:34 PM

oh it comes from fsspec probably https://github.com/fsspec/filesystem_spec/blob/2b87bff85bd2b52fe53ff4d3aa26708268983024/fsspec/implementations/dbfs.py#L25

👍 1

Toni - TomTom - Madrid

03/26/2024, 7:51 AM

yes, in the past (version 0.18) we were able to connect from local to Delta tables in Databricks/Azure Blob Storage by means of the spark.yml (adding Azure credentials there), but now it seems to expect a different format
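
For context, one common way to give Spark direct access to an `abfss://` location is an account-key entry in the Spark configuration. The thread does not show what the spark.yml in question contained, so the following is only a minimal sketch under that assumption. The helper name `abfs_account_key_conf` and the key value are hypothetical; the configuration key follows the Hadoop ABFS pattern `fs.azure.account.key.<account>.dfs.core.windows.net`, prefixed with `spark.hadoop.` so that Spark forwards it to Hadoop.

```python
from urllib.parse import urlsplit


def abfs_account_key_conf(uri: str, account_key: str) -> dict[str, str]:
    """Build the Spark conf entry for account-key access to an abfs[s] URI.

    Hypothetical helper for illustration; not part of Kedro or Spark.
    """
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"expected an abfs[s] URI, got {uri!r}")
    # The URI authority has the form <container>@<account>.dfs.core.windows.net
    container, sep, host = parts.netloc.partition("@")
    if not sep or not container or not host:
        raise ValueError(f"expected <container>@<host> in {uri!r}")
    # One key/value pair that could be placed in spark.yml or a SparkConf.
    return {f"spark.hadoop.fs.azure.account.key.{host}": account_key}


if __name__ == "__main__":
    uri = (
        "abfss://bronze@adlsmapsanalyticspoi.dfs.core.windows.net"
        "/external_sources/search_logs_amigo"
    )
    # "<ACCOUNT_KEY>" is a placeholder; never commit a real key.
    for key, value in abfs_account_key_conf(uri, "<ACCOUNT_KEY>").items():
        print(f"{key}: {value}")
```

With this style of setup Spark authenticates on its own, independently of any fsspec credentials in the catalog.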

Nok Lam Chan

03/26/2024, 4:24 PM

can you replace `dbfs://` with `/dbfs/`? Please also share what you put in `filepath`, since the example you shared above doesn't seem to match the error message here

Nok Lam Chan

03/26/2024, 4:25 PM

It shouldn't use the `DatabricksFileSystem` to start with. IIRC the dbfs filesystem provided by fsspec is not useful. In addition, Spark has a native way to access remote storage, so it shouldn't even use fsspec.

Nok Lam Chan

03/26/2024, 4:27 PM

https://docs.databricks.com/en/files/index.html#do-i-need-to-provide-a-uri-scheme-to-access-data

`dbfs:/` should be equivalent to `/dbfs` on Databricks
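
The equivalence mentioned above can be sketched as a small path rewrite. This is only an illustration: `dbfs_uri_to_local_path` is a hypothetical helper, not a Kedro or Databricks function, and it assumes the `/dbfs` mount is available, which is only the case on a Databricks cluster.

```python
def dbfs_uri_to_local_path(path: str) -> str:
    """Map 'dbfs:/a/b' (URI form) to '/dbfs/a/b' (local mount form).

    Hypothetical helper; paths not starting with 'dbfs:' are returned unchanged.
    """
    prefix = "dbfs:"
    if not path.startswith(prefix):
        return path
    # Drop the scheme and any leading slashes so that 'dbfs:/x',
    # 'dbfs://x' and 'dbfs:///x' are all treated the same way.
    remainder = path[len(prefix):].lstrip("/")
    return "/dbfs/" + remainder


if __name__ == "__main__":
    print(dbfs_uri_to_local_path("dbfs:/mnt/bronze/search_logs"))
```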

Toni - TomTom - Madrid

03/27/2024, 7:30 AM

Thanks! I tried all the possible paths and got the same problem. This error comes from using the Kedro SparkDataset with file_format: delta. It seems like it now expects a credentials file to connect remotely. Do you know what the expected schema of this “credentials” dictionary is?

Nok Lam Chan

03/27/2024, 11:48 AM

@Toni - TomTom - Madrid Sorry this happened. Can you share what your catalog looks like and what error and stack trace you get here? I tried:

search_sessions_logs:
  type: spark.SparkDataset
  filepath: dbfs:/bronze@adlsmapsanalyticspoi.dfs.core.windows.net/external_sources/search_logs_amigo
  file_format: delta

And I only get a Java error. I don't think you need to put any `credentials`, since Spark authenticates in its own way. If it's triggering fsspec then it's very likely a bug, but I would appreciate it if you could share an example that we can reproduce.

Toni - TomTom - Madrid

04/01/2024, 7:26 AM

Hi Nok! thx for answering. To reproduce the problem you need to have a Delta table saved in Azure Blob Storage File System (the path shown in the catalog is a mount in Databricks) and try to read it locally as spark.SparkDataset. Sorry I cannot share any public dataset for this 😞

Guillermo Caminero

04/01/2024, 1:26 PM

I have an additional problem related to this one for Azure connections. When I try to connect via the wasb[s] protocol it tells me that it doesn't know the protocol. It should be implemented for Azure access as well, alongside abfs[s].

🎉 1

Nok Lam Chan

04/01/2024, 10:45 PM

@Guillermo Caminero Could you create a separate GitHub issue about this? I have not used wasbs before. Kedro supports remote storage via `fsspec`; for Azure specifically that is https://github.com/fsspec/adlfs. I didn't find any mention of wasbs, so chances are it's not supported.

Guillermo Caminero

04/02/2024, 7:36 AM

Hello @Nok Lam Chan, first of all thanks. In the end I couldn't connect via WASB. It's true that if you have Blob Storage version 1 there is no other option, but it's a bit old now. I did manage to connect via ABFSS using fsspec. In this case the library requires the credentials for Azure Blob Storage to use specific parameter names: https://github.com/fsspec/adlfs. For example, I used `account_name` and `sas_token`, but you can also use `account_name` and `account_key`, or a service principal. If you set the credentials with these names, Kedro is able to pass them to fsspec, since it unpacks them as `**credentials`. You can find examples of how to do it manually here (with Kedro it is only configuration): https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/tutorial-spark-pool-filesystem-spec
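
To make the naming requirement concrete, here is a minimal sketch that checks a credentials dictionary against the combinations named in the message above. The helper `credential_shape` is hypothetical and not part of Kedro or adlfs. The service principal key names (`tenant_id`, `client_id`, `client_secret`) are not spelled out in the thread; they are assumed here from the adlfs parameter names.

```python
# Key combinations described in the message above. The service principal
# entry is an assumption based on adlfs parameter names.
CREDENTIAL_SHAPES = {
    "sas_token": {"account_name", "sas_token"},
    "account_key": {"account_name", "account_key"},
    "service_principal": {"account_name", "tenant_id", "client_id", "client_secret"},
}


def credential_shape(credentials: dict) -> str:
    """Return which known shape the credentials match, or raise ValueError.

    Hypothetical helper for illustration only.
    """
    keys = set(credentials)
    for name, required in CREDENTIAL_SHAPES.items():
        # A shape matches when all of its required keys are present.
        if required <= keys:
            return name
    raise ValueError(f"unrecognised credential keys: {sorted(keys)}")


if __name__ == "__main__":
    creds = {"account_name": "<ACCOUNT>", "sas_token": "<SAS_TOKEN>"}  # placeholders
    print(credential_shape(creds))
    # A dictionary like this is what would be unpacked as keyword arguments,
    # for example fsspec.filesystem("abfss", **creds) when adlfs is installed.
```

Because the dictionary is unpacked as keyword arguments, a misspelled key is likely to surface as a constructor error rather than an authentication error.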

👍 1

❤️ 1
