# questions

Raghunath Nair

11/22/2023, 12:17 PM
Hello Team, I’m using Kedro 0.18.13 and I see it doesn’t support the abfss protocol yet? Can you confirm why we get an error when using abfss in file paths? It was working fine in previous releases like 0.17.7; can you tell me why it isn’t working now?

Juan Luis

11/22/2023, 12:24 PM
hi @Raghunath Nair , what error do you get? also, what version of kedro-datasets do you have?

Nok Lam Chan

11/22/2023, 1:39 PM
@Raghunath Nair Could you upgrade to at least 0.18.5? https://github.com/kedro-org/kedro/issues/2110 There was a bug in some early 0.18.x releases
For context, we use fsspec to handle different storage systems, but there were issues with fsspec, so we added support for abfss manually.
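[Editor's note: the fsspec-style dispatch described above can be sketched as follows. This is an illustrative toy registry, not fsspec's or Kedro's real code; all names are made up. It shows why a protocol missing from the registry fails with a "protocol not found"-style error.]

```python
# Toy sketch of fsspec-style protocol dispatch: a registry maps the URL
# scheme to a filesystem implementation; unknown schemes raise an error.
KNOWN_IMPLEMENTATIONS = {
    "abfs": "adlfs.AzureBlobFileSystem",
    "abfss": "adlfs.AzureBlobFileSystem",  # added explicitly, as the message describes
    "s3": "s3fs.S3FileSystem",
}

def resolve_protocol(filepath: str) -> str:
    """Return the implementation registered for the path's URL scheme."""
    protocol = filepath.split("://", 1)[0]
    if protocol not in KNOWN_IMPLEMENTATIONS:
        raise ValueError(f"Protocol not known: {protocol}")
    return KNOWN_IMPLEMENTATIONS[protocol]

print(resolve_protocol("abfss://container@account.dfs.core.windows.net/data"))
# prints: adlfs.AzureBlobFileSystem
```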

Raghunath Nair

11/22/2023, 4:12 PM
Hi guys, it says: protocol abfss not found
@Nok Lam Chan do you mean Kedro 0.18.15?

Juan Luis

11/22/2023, 4:14 PM
0.18.15 doesn't exist, 0.18.5 should work

Raghunath Nair

11/22/2023, 4:16 PM
@Juan Luis so, you mean Kedro 0.18.5
Right now I’m using Kedro datasets 1.7.1 and Kedro 0.18.14 combination
And can I use the same datasets 1.7.1 version along with 0.18.5?

Juan Luis

11/22/2023, 4:19 PM
@Raghunath Nair sorry, you said initially that you were using 0.18.3
so, with 0.18.14 you still see the same error?

Raghunath Nair

11/22/2023, 4:19 PM
Yeah, I tried using 0.18.13 and then I got info that the bug is resolved in 0.18.14
@Juan Luis still seeing those issues
Sorry, I meant Kedro 0.18.13, my bad
And yes, with Kedro 0.18.14 it still says protocol abfss couldn’t be found

Juan Luis

11/22/2023, 4:25 PM
can you pip install adlfs and try again?

Raghunath Nair

11/22/2023, 4:25 PM
I tried; it says I need to authenticate with an account key

Juan Luis

11/22/2023, 4:25 PM
well, that's progress 🙂

Raghunath Nair

11/22/2023, 4:25 PM
Even after using abfss, we’re logging in via Databricks with OAuth
yeah, but we can’t use it, we need to authenticate via the Databricks config!
So, we connect to ADLS Gen2 via abfss over OAuth2
abfss should connect as always without credentials from a token; we need to connect with OAuth, it’s still a bug
Can someone please help?

Nok Lam Chan

11/22/2023, 4:48 PM
I am not particularly familiar with how this authentication works. How did you authenticate via OAuth2? Is that using an Azure service principal as described in https://docs.databricks.com/en/storage/azure-storage.html#connect-to-azure-data-lake-storage-gen2-or-blob-storage-using-azure-credentials?

Raghunath Nair

11/22/2023, 4:49 PM
Correct, but to use that we need abfss, which is not found with Kedro 0.18.14 @Nok Lam Chan

Nok Lam Chan

11/22/2023, 4:50 PM
It doesn’t seem to be a Kedro-specific problem to me. Does it work with just adlfs? It offers some options for setting your credentials: https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials
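[Editor's note: for reference, a hypothetical example of passing adlfs-style service-principal credentials to an fsspec-backed dataset through Kedro's credentials config. All entry names, account names, and keys below are placeholders, not from the conversation.]

```yaml
# catalog.yml (illustrative)
my_dataset:
  type: pandas.CSVDataSet
  filepath: "abfss://container@account.dfs.core.windows.net/data.csv"
  credentials: azure_creds

# credentials.yml (illustrative; adlfs's AzureBlobFileSystem accepts
# service-principal fields like these)
azure_creds:
  account_name: "<storage-account>"
  tenant_id: "<directory-id>"
  client_id: "<application-id>"
  client_secret: "<service-credential>"
```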

Raghunath Nair

11/22/2023, 4:52 PM
I was using it without adlfs and it was working fine in 0.17.7

Nok Lam Chan

11/22/2023, 4:58 PM
Can you show some of the catalog entries (with the details masked)? Do you use Spark only? From my understanding you need adlfs at least for fsspec: https://github.com/fsspec/filesystem_spec/blob/f7b454e544de7f2e5bc8ab737219e34e6282bdb5/fsspec/registry.py#L136-L138

Raghunath Nair

11/22/2023, 5:01 PM
Sure @Nok Lam Chan

Nok Lam Chan

11/22/2023, 5:01 PM
Spark has its own native way to access different filesystems, so maybe you could access abfss without adlfs.

Raghunath Nair

11/22/2023, 5:01 PM
we can’t use abfs, we need to use abfss, as abfs is blocked due to TLS 1.2 policies
@Nok Lam Chan the catalog looks like:
    dataset:
      type: spark.SparkDataSet
      filepath: abfss://abc@edlcor.dfs.core.windows.net/test/1/data
      file_format: delta
      load_args:
        header: true
      layer: raw
And we authenticate for abfss via the Spark conf on Databricks clusters
The spark conf looks like:
    service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")
    spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
This always works with abfss

Nok Lam Chan

11/22/2023, 5:10 PM
https://github.com/kedro-org/kedro/blob/0.17.7/kedro/extras/datasets/spark/spark_dataset.py Would you be able to create a custom dataset from the old Spark dataset definition and update your catalog accordingly? https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html

Raghunath Nair

11/22/2023, 5:10 PM
it only doesn’t work with Kedro :/

Nok Lam Chan

11/22/2023, 5:10 PM
basically with Spark, Kedro doesn’t do anything, since Spark comes with its own thing

Raghunath Nair

11/22/2023, 5:11 PM
can you give an example

Nok Lam Chan

11/22/2023, 5:11 PM
What I want to verify is whether it’s a problem with the dataset itself

Raghunath Nair

11/22/2023, 5:11 PM
How can I convert this
I can try it right now

Nok Lam Chan

11/22/2023, 5:12 PM
The idea is to isolate the problem, since you are jumping from 0.17.7 to 0.18.x and there are many changes. Since you mentioned it works in 0.17.7, it would help us narrow down whether this is a dataset problem or whether something is going wrong in Kedro’s core
essentially testing Kedro 0.18.14 but with the 0.17.7 version of spark.SparkDataSet

Raghunath Nair

11/22/2023, 5:14 PM
Yeah that’s the Kedro datasets version
Shall I upgrade the version and try?
Will it help?

Nok Lam Chan

11/22/2023, 5:14 PM
It’s late here, so I can only help up to this point. I will come back to this tomorrow morning, but if you could try that, it would be great

Raghunath Nair

11/22/2023, 5:14 PM
Which one is compatible with 0.18.14?

Nok Lam Chan

11/22/2023, 5:14 PM
It doesn’t matter; since you are creating a custom dataset, you will not be using kedro-datasets

Raghunath Nair

11/22/2023, 5:15 PM
Aah ok
Let me give it a try
Keep you posted here!
Thanks @Nok Lam Chan, see you tomorrow, have a nice one!
@Nok Lam Chan I tried the custom dataset and it’s still the same error; it’s because Kedro is using fsspec and it doesn’t support abfss. It’s a blocker for us now
How can we add abfss support to Kedro then?

Nok Lam Chan

11/23/2023, 5:23 AM
What error did you get? It would be great if you could open an issue and provide all the necessary details, i.e. the error, stacktrace, and versions of kedro and kedro-datasets. I am actually slightly surprised it works in 0.17.7: https://github.com/kedro-org/kedro/blob/0.17.7/kedro/extras/datasets/spark/spark_dataset.py If you check the implementation, there was no fsspec at all. In kedro-datasets 1.8.0 we do use fsspec: https://github.com/kedro-org/kedro-plugins/blob/16c6d5e144ad1f67afba9984ca606e13e51217e4/kedro-datasets/kedro_datasets/spark/spark_dataset.py#L355 I will be very surprised if you get the same error from two different implementations

Raghunath Nair

11/23/2023, 7:53 AM
@Nok Lam Chan will this be solved if I use an older kedro-datasets, as there is no fsspec there?

Nok Lam Chan

11/23/2023, 7:54 AM
@Raghunath Nair Sorry can we keep the conversation inside the thread?

Raghunath Nair

11/23/2023, 7:54 AM
yeah, I checked the fsspec protocols list, it indeed doesn’t support abfss
Sure, sorry :)
So, can I use the Kedro 0.17.7 code for the custom dataset? Do you think it’ll work?

Nok Lam Chan

11/23/2023, 7:56 AM
I hope so; at least it will give us a better idea of where the change is coming from

Raghunath Nair

11/23/2023, 7:56 AM
In 0.17.7 abfss worked like a charm. I think the abfss fix is only partially done: it only does the test for validation, but at the time of opening the file it still doesn’t have a condition for abfss, hence it breaks
This is my conclusion from my analysis
@Nok Lam Chan cool, let me try that and let you know :)
Keep you posted, it’s an awesome exercise!

Nok Lam Chan

11/23/2023, 8:05 AM
I am not very familiar with adlfs, but based on my understanding of fsspec: you can find this in adlfs, so regardless of abfs or abfss, fsspec uses the same AzureBlobFileSystem class.
entry_points={
        "fsspec.specs": [
            "abfss=adlfs.AzureBlobFileSystem",
        ],
    },
Since you mentioned you never had to use adlfs before (I assume it is not installed, right?), I suspect the connection was done via Spark directly in 0.17.7, which is why I asked you to test with the old spark.SparkDataSet. Both Spark and fsspec could work, but we don’t need to solve both at the same time.

Raghunath Nair

11/23/2023, 8:14 AM
@Nok Lam Chan no, I think the issue here is fsspec opening a filesystem
If you peek into fsspec, its list of cloud protocols doesn’t contain abfss, only abfs, hence the issue. I can easily reproduce this from a Python console with fsspec: the error occurs when fsspec tries to open the file, and in Kedro 0.18 and higher abfss is not taken into consideration at all.

Nok Lam Chan

11/23/2023, 8:16 AM
This is not true
I haven’t verified this, but they have an entry point that registers it into fsspec’s specs: https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/setup.py#L44-L47
It should be fairly easy to test: just give fsspec.filesystem() a path starting with abfss:// and see if it gives you an AzureBlobFileSystem class.
Anyway, if you can test with the 0.17.7 implementation, which doesn’t use fsspec at all, we don’t need to guess.

Raghunath Nair

11/23/2023, 8:20 AM
Yeah, the filesystem does give an error, I tried it
It says: abfss protocol not found
That’s why I can confirm it so confidently

Nok Lam Chan

11/23/2023, 8:24 AM
fsspec.filesystem("abfss")
File ~/GitHub/adlfs/adlfs/spec.py:318, in AzureBlobFileSystem.__init__(self, account_name, account_key, connection_string, credential, sas_token, request_session, socket_timeout, blocksize, client_id, client_secret, tenant_id, anon, location_mode, loop, asynchronous, default_fill_cache, default_cache_type, version_aware, assume_container_exists, max_concurrency, timeout, connection_timeout, read_timeout, **kwargs)
307 if (
I get the error from adlfs, so I do think it is registered properly

Raghunath Nair

11/23/2023, 8:32 AM
I did this test yesterday evening, so today I’m going to try with the 0.17.7 code

Nok Lam Chan

11/23/2023, 8:34 AM
Did you still get a protocol not found error?

Raghunath Nair

11/23/2023, 8:38 AM
Yeah I did

Nok Lam Chan

11/23/2023, 8:42 AM
Please make sure you have adlfs installed. It’s very strange that we are getting different results
adlfs                2023.8.0
fsspec             2023.10.0
And please check whether your environment is corrupted; if possible, start with a fresh virtual environment.

Raghunath Nair

11/23/2023, 8:59 AM
@Nok Lam Chan the issue is that when adlfs is installed, abfss expects us to log in a different way, with an account key, which is not compliant for us
So we can’t use adlfs; it should take the Spark configuration from the cluster itself rather than injecting credentials into the dataset
Which is not a clean solution

Nok Lam Chan

11/23/2023, 9:05 AM
Do you mind opening an issue on Kedro repository? https://github.com/kedro-org/kedro/issues/new/choose

Raghunath Nair

11/23/2023, 9:07 AM
Sure, doing it now