# questions
r
Hello Team, I’m using Kedro 0.18.13 and I see it doesn’t support the abfss protocol yet? Can you confirm why we get an error when using an abfss path in file paths? It was working fine in previous releases like 0.17.7, can you tell me why it’s not working now?
j
hi @Raghunath Nair, what error do you get? also, which version of kedro-datasets do you have?
n
@Raghunath Nair Could you upgrade to at least 0.18.5? https://github.com/kedro-org/kedro/issues/2110 There was a bug in some early 0.18.x release
For context, we use `fsspec` to handle different storage systems, but there are issues with `fsspec`, so we added support for `abfss` manually.
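For illustration, a minimal sketch of how `fsspec` resolves a protocol name to a filesystem class (assuming `adlfs` is installed; the expected output is a guess at the class path):
```python
import fsspec

# fsspec maps protocol names such as "abfss" to filesystem classes,
# either via its built-in registry or via entry points that packages
# like adlfs declare. Without adlfs installed, this raises an error.
cls = fsspec.get_filesystem_class("abfss")
print(cls)  # expected: adlfs's AzureBlobFileSystem
```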
r
Hi guys, it says protocol abfss not found
@Nok Lam Chan do you mean Kedro 0.18.15?
j
0.18.15 doesn't exist, 0.18.5 should work
r
@Juan Luis so, you mean Kedro 0.18.5
Right now I’m using Kedro datasets 1.7.1 and Kedro 0.18.14 combination
And can I use the same datasets 1.7.1 version along with 0.18.5?
j
@Raghunath Nair sorry, you said initially that you were using 0.18.3
so, with 0.18.14 you still see the same error?
r
Yeah, I tried using 0.18.13 and then I got info that the bug is resolved in 0.18.14
@Juan Luis still seeing those issues
Sorry I mean Kedro 0.18.13 my bad
And yes, with Kedro 0.18.14 it still says protocol `abfss` couldn’t be found
j
can you `pip install adlfs` and try again?
r
I tried, it says I need to authenticate with an account key
j
well, that's progress 🙂
r
Even after using abfss - we’re logging in via Databricks with OAuth
yeah but we can’t use it, we need to authenticate via the Databricks config!
So, we connect to ADLS Gen2 via abfss over OAuth2
abfss should connect as it always did, without credentials from a token; we need to connect with OAuth, so it’s still a bug
Can someone please help?
n
I am not particularly familiar with how this authentication works. How did you authenticate via OAuth2, is that using the Azure service principal as described in https://docs.databricks.com/en/storage/azure-storage.html#connect-to-azure-data-lake-storage-gen2-or-blob-storage-using-azure-credentials?
r
Correct, but to use that we need abfss, which is not found with Kedro 0.18.14 @Nok Lam Chan
n
It doesn’t seem to be a Kedro-specific problem to me. Does it work with just adlfs?
It offers some options to set your credentials - https://github.com/fsspec/adlfs?tab=readme-ov-file#setting-credentials
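For instance, a minimal sketch of the service-principal option from that README (all angle-bracket values are placeholders to fill in):
```python
import adlfs

# Service-principal (OAuth2) credentials, one of the options from the
# adlfs README; the placeholder values below must be filled in.
fs = adlfs.AzureBlobFileSystem(
    account_name="<storage-account>",
    tenant_id="<directory-id>",
    client_id="<application-id>",
    client_secret="<service-credential>",
)
print(fs.ls("<container>/test"))  # list a path to verify the connection
```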
r
I was using it without adlfs and it was working fine in 0.17.7
n
Can you show some of the catalog entries (with the details masked)? Do you use Spark only? From my understanding you need `adlfs` at least for `fsspec` https://github.com/fsspec/filesystem_spec/blob/f7b454e544de7f2e5bc8ab737219e34e6282bdb5/fsspec/registry.py#L136-L138
r
Sure @Nok Lam Chan
n
Spark has its native methods to access different filesystems, so maybe you could access abfs without adlfs.
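Roughly like this sketch of direct Spark access (the path is made up); Spark resolves abfss:// through its own Hadoop filesystem layer, so no fsspec/adlfs is involved:
```python
from pyspark.sql import SparkSession

# Spark handles abfss:// itself via the cluster's Hadoop configuration,
# so this read never touches fsspec or adlfs.
spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/some/path"
)
df.show()
```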
r
we can’t use abfs, we need to use abfss, as abfs is blocked due to TLS 1.2 policies
@Nok Lam Chan the catalog looks like:
```yaml
dataset:
  type: spark.SparkDataSet
  filepath: abfss://abc@edlcor.dfs.core.windows.net/test/1/data
  file_format: delta
  load_args:
    header: true
  layer: raw
```
And we authenticate via spark conf for abfss via Databricks clusters
The spark conf looks like:
```python
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")
```
This always works with abfss
n
https://github.com/kedro-org/kedro/blob/0.17.7/kedro/extras/datasets/spark/spark_dataset.py Would you be able to create a custom dataset with the old spark dataset definition and update your catalog accordingly? https://docs.kedro.org/en/stable/data/how_to_create_a_custom_dataset.html
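A rough sketch of the shape such a custom dataset could take — illustrative only, not the actual 0.17.7 code (the class name here is made up):
```python
from kedro.io import AbstractDataSet
from pyspark.sql import DataFrame, SparkSession


class OldStyleSparkDataSet(AbstractDataSet):
    """Illustrative dataset that delegates I/O entirely to Spark (no fsspec)."""

    def __init__(self, filepath: str, file_format: str = "parquet",
                 load_args: dict = None, save_args: dict = None):
        self._filepath = filepath
        self._file_format = file_format
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def _load(self) -> DataFrame:
        # Spark resolves abfss:// itself via the cluster's Hadoop config,
        # so no fsspec filesystem lookup happens here.
        spark = SparkSession.builder.getOrCreate()
        return spark.read.load(self._filepath, self._file_format, **self._load_args)

    def _save(self, data: DataFrame) -> None:
        data.write.save(self._filepath, self._file_format, **self._save_args)

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "file_format": self._file_format}
```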
r
only doesn’t work with Kedro :/
n
basically with Spark, Kedro doesn’t do anything, since Spark comes with its own thing
r
can you give an example
n
What I want to verify is whether it’s a problem with the dataset itself
r
How can I convert this
I can try now itself
n
The idea is to isolate the problem, since you are jumping from 0.17.7 to 0.18.x and there are many changes. Since you mentioned it works in 0.17.7, it would help us narrow down the scope: whether this is a `dataset` problem or something going wrong in kedro’s core
essentially testing kedro 0.18.14 but with a 0.17.7 version of Spark.SparkDataset
r
Yeah that’s the Kedro datasets version
Shall I upgrade the version and try?
Will it help?
n
It’s late here so I can only help up to this point, I will come back to this tomorrow morning but if you could try that it would be great
r
Is that one compatible with 0.18.14?
n
It doesn’t matter, since you are creating a custom dataset, you will not be using kedro-datasets
r
Aah ok
Let me give a try
Keep you posted here!
Thanks @Nok Lam Chan, see you tomorrow, have a nice one!
@Nok Lam Chan I tried the custom dataset, it’s still the same error. It’s because Kedro is using fsspec and it doesn’t support abfss; it’s a blocker for us now
How can we add abfss support with Kedro then?
n
What error did you get? It would be great if you could open an issue and provide all the necessary details, i.e. error, stacktrace, versions of kedro and kedro-datasets. I am actually slightly surprised it works for 0.17.7 - https://github.com/kedro-org/kedro/blob/0.17.7/kedro/extras/datasets/spark/spark_dataset.py If you check the implementation there was no `fsspec` at all. In kedro-datasets 1.8.0, we do use fsspec https://github.com/kedro-org/kedro-plugins/blob/16c6d5e144ad1f67afba9984ca606e13e51217e4/kedro-datasets/kedro_datasets/spark/spark_dataset.py#L355 I will be very surprised if you get the same error with two different implementations
r
@Nok Lam Chan will this be solved if I use an older kedro-datasets, as there is no fsspec there?
n
@Raghunath Nair Sorry can we keep the conversation inside the thread?
r
yeah I checked the fsspec protocols list, it doesn’t support abfss indeed
Sure sorry :)
So, can I use the Kedro 0.17.7 code for the custom dataset, do you think it’ll work?
n
I hope so, at least it will give us more of an idea where the change is coming from
r
As in 0.17.7 abfss worked like a charm. I think the abfss fix is only partially done: it does the validation check, but at the time of opening the filesystem there is still no condition for abfss, hence it breaks
This is my conclusion from my analysis
@Nok Lam Chan cool let me try that and let you know :)
Keep you posted - it’s an awesome exercise!
n
I am not very familiar with adlfs, but based on my understanding of `fsspec`: you can find this in `adlfs`, so regardless of `abfs` or `abfss`, `fsspec` uses the same AzureBlobFileSystem class.
```python
entry_points={
    "fsspec.specs": [
        "abfss=adlfs.AzureBlobFileSystem",
    ],
},
```
Since you mentioned you never had to use `adlfs` before (I assume it is not installed, right?), I suspect the connection was done via Spark directly in 0.17.7, which is why I asked you to test with the old Spark.SparkDataset. Both Spark and fsspec could work, but we don’t need to solve both at the same time.
r
@Nok Lam Chan no, I think the issue here is fsspec opening a filesystem
If you peek into fsspec, it doesn’t contain the cloud protocol abfss in its list, only abfs, hence the issue. I can easily reproduce this from a Python console with fsspec. The error occurs when fsspec tries to open the file, and in Kedro 0.18 and higher abfss is not taken into consideration at all.
n
This is not true
I haven’t verified this, but they have an entry point to register this into fsspec’s spec https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/setup.py#L44-L47
it should be fairly easy to test, just give `fsspec.filesystem()` a path starting with `abfss://` and see if it gives you an AzureBlobFileSystem class
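Something like this quick check (sketched with the bare protocol name; interpreting the resulting error tells you whether registration worked):
```python
import fsspec

# If adlfs is registered for "abfss", this call reaches
# AzureBlobFileSystem and typically fails on missing credentials --
# an adlfs error, meaning the protocol WAS found. If it is not
# registered, fsspec itself raises "Protocol not known: abfss".
try:
    fsspec.filesystem("abfss")
except Exception as exc:
    print(type(exc).__name__, exc)
```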
anyway, if you can test with the 0.17.7 implementation, which doesn’t use fsspec at all, we don’t need to guess this.
r
Yeah, the filesystem does give an error, I tried it
It says abfss protocol not found
That’s why I can confirm it so confidently
n
```
fsspec.filesystem("abfss")

File ~/GitHub/adlfs/adlfs/spec.py:318, in AzureBlobFileSystem.__init__(self, account_name, account_key, connection_string, credential, sas_token, request_session, socket_timeout, blocksize, client_id, client_secret, tenant_id, anon, location_mode, loop, asynchronous, default_fill_cache, default_cache_type, version_aware, assume_container_exists, max_concurrency, timeout, connection_timeout, read_timeout, **kwargs)
    307 if (
```
I get an error from `adlfs`, so I do think it is registered properly
r
I did this test yesterday evening, so today I’m gonna try with the 0.17.7 code
n
Did you still get a protocol not found error?
r
Yeah I did
n
Please make sure you have `adlfs` installed. It’s very strange we are getting different results
```
adlfs   2023.8.0
fsspec  2023.10.0
```
And please check if your environment is corrupted; if possible, start with a fresh virtual environment.
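For example, a quick version check from inside the same environment Kedro runs in (nothing Kedro-specific, just a sketch):
```python
# Print the installed versions to compare against the ones above.
import adlfs
import fsspec

print("adlfs:", adlfs.__version__)
print("fsspec:", fsspec.__version__)
```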
r
@Nok Lam Chan the issue is that if adlfs is installed, abfss expects the login in a different way, with an account key, which is not compliant for us
So, we can’t use adlfs - it should take the Spark configuration from the cluster itself rather than injecting credentials into the dataset itself
Which is not a clean solution
n
Do you mind opening an issue on Kedro repository? https://github.com/kedro-org/kedro/issues/new/choose
r
Sure doing it now