# questions
m
Hello, I would like to work with Delta Tables using PySpark in a GCS bucket, but I'm having trouble using `spark.DeltaTableDataset`:
```yaml
table_name:
  type: spark.DeltaTableDataset
  filepath: "gs://XXXX/poc-kedro/table_name/*.parquet"
```
Could you tell me what might be wrong with this? Additionally, could you explain how to specify the credentials for accessing the table with this Dataset?
👀 1
r
Hi @Mohamed El Guendouz, what is the trouble you are facing here? Is it only related to credentials, or something else? Do you see any error that can give us more information on the issue? Thank you
m
Hi @Ravi Kumar Pilla, when trying to load data from a Delta table using PySpark and Kedro, an error occurs. The process attempts to load the dataset from a Google Cloud Storage (GCS) bucket, but fails with `TypeError: 'JavaPackage' object is not callable`, which points to an issue with the `DeltaTable.forPath()` method in the `delta.tables` library. This leads to a `DatasetError` in Kedro, preventing the data from being loaded successfully.
```
File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gs://).
'JavaPackage' object is not callable
```
I was wondering if this issue could be caused by the fact that I haven't provided credentials, but the dataset doesn't seem to allow specifying them in its parameters.
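For reference, `spark.DeltaTableDataset` loads through the active SparkSession, so GCS credentials are normally supplied as Spark/Hadoop configuration (for example in the project's `spark.yml`) rather than as dataset parameters. A minimal sketch of the equivalent session-level settings, assuming service-account keyfile auth; the keyfile path is a placeholder:

```python
# Sketch: supply GCS credentials at the Spark level, not on the dataset.
# The keyfile path is a placeholder; the GCS connector jar must also be
# on the classpath (e.g. via spark.jars).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-gcs-check")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.gs.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account.json")
    .getOrCreate()
)
```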
n
Did you use `gs` instead of `gcs`, or is it just a typo?
m
`gs://`
n
Per https://gcsfs.readthedocs.io/en/latest/, isn't the prefix `gcs`? Not sure if I am missing anything here.
m
Yes, I tried, but I have the same issue with `gcs`:
```
File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gcs://).
'JavaPackage' object is not callable
```
n
I guess the second question is: do you have Delta configured for your SparkContext?
The error is not from Python, so very likely your Spark configuration is not working.
👍 1
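The `'JavaPackage' object is not callable` error typically means the Delta jars are not on the JVM classpath when the session starts. One common way to rule that out is the delta-spark helper, which pins the matching jars for you; a minimal sketch, assuming the `delta-spark` PyPI package is installed:

```python
# Sketch: build a Delta-enabled session; configure_spark_with_delta_pip
# adds the matching Delta jars so DeltaTable.forPath resolves to a real
# JVM class instead of a bare 'JavaPackage' (the error seen above).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-check")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```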
m
Yes, that's possible 👍 Here is my Spark configuration:
```yaml
spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
```
r
Is this for S3? `spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem`
👍 1
Shouldn't we use the Google Cloud Storage filesystem, since we are targeting GCS?
m
For now, I've made a few changes to the configuration, and I'm able to successfully launch the Spark session:
```yaml
spark.driver.maxResultSize: 3g
spark.jars.packages: io.delta:delta-core_2.12:2.0.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: XXXX.json
```
However, it's still not recognizing my table… Also, `gcs://` isn't working on my end, but with `gs://` it is at least able to fetch the table files.
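One way to narrow this down is to try the same path with plain delta-spark in that session, taking Kedro out of the picture; a sketch using the thread's placeholder path:

```python
# Sketch: read the table root directly with delta-spark. "spark" is the
# session built with the configuration above; the bucket path is the
# thread's placeholder, not a real location.
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "gs://XXXX/poc-kedro/table_name")
dt.toDF().show(5)  # materialise a few rows as a sanity check
```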
r
I think we use fsspec to access files, and based on the docs the URL should have the prefix `gcs://`. I am not sure how `gs` is working for you. But with `gs://`, is your issue resolved?
OK, fsspec uses both protocols to identify gcsfs.
👍 1
m
Yeah, I don't understand it either... Unfortunately, no, it's not resolved: I'm getting an error saying that the URL I provided is not a Delta table. I've tried the same paths multiple times in notebooks, and it works there.
👀 1
r
Can you check whether a folder `gs://your-bucket-name/path/to/delta-table/_delta_log` exists? Also, can you try `filepath: "gs://XXXX/poc-kedro/table_name"` (the table root, instead of the wildcard parquet path)?
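For the `_delta_log` check, a minimal sketch with gcsfs (the filesystem fsspec resolves for `gs://`/`gcs://` URLs); the bucket path and keyfile are placeholders:

```python
# Sketch: verify the Delta transaction log exists in the bucket.
import gcsfs

# token may be a path to a service-account JSON keyfile (placeholder here)
fs = gcsfs.GCSFileSystem(token="/path/to/service-account.json")

# A genuine Delta table root must contain a _delta_log/ directory.
print(fs.exists("XXXX/poc-kedro/table_name/_delta_log"))
print(fs.ls("XXXX/poc-kedro/table_name"))
```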
m
```
File "/opt/anaconda3/lib/python3.11/site-packages/kedro/runner/runner.py", line 494, in _run_node_sequential
    inputs[name] = catalog.load(name)
                   ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/data_catalog.py", line 515, in load
    result = dataset.load()
             ^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name, fs_prefix=gs://).
`gs://XXXXXX/poc-kedro/table_name` is not a Delta table.
```
👀 1
r
Thanks for your patience @Mohamed El Guendouz. I am new to this, and the docs show an example using a single parquet file but not a wildcard, so I am not sure whether multiple files are supported. Let me get some help from the team. Meanwhile, if you manage to resolve the issue, please let us know. Thank you
```yaml
weather@delta:
  type: spark.DeltaTableDataset
  filepath: data/02_intermediate/data.parquet
```
👍 1
:gratitude-thank-you: 1
m
Yes, exactly. That's what I found strange when reading the official documentation, since a Delta table consists of multiple files, not just a single Parquet file. Thanks, I'll keep you posted if I find a solution.
👍 1
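Since a Delta table is a directory (parquet part-files plus a `_delta_log/` transaction log), the dataset needs the table root rather than a `*.parquet` wildcard or a single part-file. A minimal sketch of exercising the dataset directly in Python, assuming `kedro-datasets` is installed and reusing the thread's placeholder path:

```python
# Sketch: instantiate the dataset outside the catalog to test loading.
# The filepath is the thread's placeholder; point it at the table root.
from kedro_datasets.spark import DeltaTableDataset

ds = DeltaTableDataset(filepath="gs://XXXX/poc-kedro/table_name")
delta_table = ds.load()      # returns a delta.tables.DeltaTable
delta_table.toDF().show(5)   # materialise a few rows as a sanity check
```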
I also tested with a single file from the table, and I got the same error:
```
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet, fs_prefix=gs://). gs://XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet is not a Delta table.
```
Hey @Ravi Kumar Pilla! 🙂 I figured out the cause of my issue: it was missing permissions on the service account (SA). I found the solution by trying to read the table from a notebook inside the Kedro project. Thanks for your help! 👍
🥳 1
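For anyone hitting the same thing: the `is not a Delta table` message can mask a permissions problem, since a service account that cannot list the `_delta_log/` prefix looks the same as the log not existing. A sketch of verifying the SA can list it, assuming `google-cloud-storage` is installed; all names are placeholders:

```python
# Sketch: check the service account can list the Delta transaction log.
from google.cloud import storage

# Authenticate explicitly with the same keyfile Spark uses (placeholder).
client = storage.Client.from_service_account_json("/path/to/service-account.json")

# If the SA lacks storage.objects.list on the bucket, this raises a 403
# instead of returning the transaction-log entries.
blobs = client.list_blobs("XXXX", prefix="poc-kedro/table_name/_delta_log/",
                          max_results=5)
print([b.name for b in blobs])
```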