# questions
m
Hello, I would like to work with Delta Tables using PySpark in a GCS bucket, but I'm having trouble using `spark.DeltaTableDataset`:
```yaml
table_name:
  type: spark.DeltaTableDataset
  filepath: "gs://XXXX/poc-kedro/table_name/*.parquet"
```
Could you tell me what might be wrong with this? Additionally, could you explain how to specify the credentials for accessing the table with this Dataset?
👀 1
r
Hi @Mohamed El Guendouz, what is the trouble you are facing here? Is it only related to credentials, or something else? Do you see any error that can give us more information on the issue? Thank you
m
Hi @Ravi Kumar Pilla, when trying to load data from a Delta table using PySpark and Kedro, an error occurs. The process attempts to load the dataset from a Google Cloud Storage (GCS) bucket, but fails with `TypeError: 'JavaPackage' object is not callable`, which points to an issue with the `DeltaTable.forPath()` method in the `delta.tables` library. This leads to a `DatasetError` in Kedro, preventing the data from being loaded successfully.
```
File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gs://).
'JavaPackage' object is not callable
```
I was wondering if this issue could be caused by the fact that I haven't provided credentials, but the dataset doesn't seem to allow specifying them in its parameters.
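For reference, `spark.DeltaTableDataset` loads through the active SparkSession, so GCS credentials are normally supplied as Spark/Hadoop configuration (for example in the project's `spark.yml`) rather than as dataset parameters. A minimal sketch of the equivalent session-level settings, assuming service-account keyfile auth; the keyfile path is a placeholder:

```python
# Sketch: supply GCS credentials at the Spark level, not on the dataset.
# The keyfile path is a placeholder; the GCS connector jar must also be
# on the classpath (e.g. via spark.jars).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-gcs-check")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.gs.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account.json")
    .getOrCreate()
)
```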
n
Did you use `gs` instead of `gcs`, or is it just a typo?
m
`gs://`
n
Per https://gcsfs.readthedocs.io/en/latest/, isn't the prefix `gcs`? Not sure if I am missing anything here.
m
Yes, I tried, but I have the same issue with `gcs`:
```
File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXX/poc-kedro/table_name/*.parquet, fs_prefix=gcs://).
'JavaPackage' object is not callable
```
n
I guess the second question is: do you have Delta configured for your SparkContext?
The error is not from Python, so very likely your Spark configuration is not working.
👍 1
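The `'JavaPackage' object is not callable` error typically means the Delta jars are not on the JVM classpath when the session starts. One common way to rule that out is the delta-spark helper, which pins the matching jars for you; a minimal sketch, assuming the `delta-spark` PyPI package is installed:

```python
# Sketch: build a Delta-enabled session; configure_spark_with_delta_pip
# adds the matching Delta jars so DeltaTable.forPath resolves to a real
# JVM class instead of a bare 'JavaPackage' (the error seen above).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-check")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```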
m
Yes, that's possible 👍 Here is my Spark configuration:
```yaml
spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
```
r
Is this for S3? `spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem`
👍 1
Shouldn't we use the Google Cloud Storage filesystem, since we are targeting GCS?
m
For now, I've made a few changes to the configuration, and I'm able to successfully launch the Spark session:
```yaml
spark.driver.maxResultSize: 3g
spark.jars.packages: io.delta:delta-core_2.12:2.0.0
spark.jars: https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
spark.sql.execution.arrow.pyspark.enabled: true
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.properties.defaults.compatibility.symlinkFormatManifest.enabled: true
# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: XXXX.json
```
However, it's still not recognizing my table… Also, `gcs://` isn't working on my end, but with `gs://` it is at least able to fetch the table files.
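One way to narrow this down is to try the same path with plain delta-spark in that session, taking Kedro out of the picture; a sketch using the thread's placeholder path:

```python
# Sketch: read the table root directly with delta-spark. "spark" is the
# session built with the configuration above; the bucket path is the
# thread's placeholder, not a real location.
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "gs://XXXX/poc-kedro/table_name")
dt.toDF().show(5)  # materialise a few rows as a sanity check
```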
r
I think we use fsspec to access files, and based on the docs the URL should have the prefix `gcs://`. I am not sure how `gs` is working for you. But with `gs://`, is your issue resolved?
OK, fsspec uses both protocols to identify gcsfs.
👍 1
m
Yeah, I don't understand it either... Unfortunately, no, it's not resolved: I'm getting an error saying that the URL I provided is not a Delta table. I've tried the same paths multiple times in notebooks, and it works there.
👀 1
r
Can you check whether a folder `gs://your-bucket-name/path/to/delta-table/_delta_log` exists? Also, can you try `filepath: "gs://XXXX/poc-kedro/table_name"` (the table root, instead of the wildcard parquet path)?
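For the `_delta_log` check, a minimal sketch with gcsfs (the filesystem fsspec resolves for `gs://`/`gcs://` URLs); the bucket path and keyfile are placeholders:

```python
# Sketch: verify the Delta transaction log exists in the bucket.
import gcsfs

# token may be a path to a service-account JSON keyfile (placeholder here)
fs = gcsfs.GCSFileSystem(token="/path/to/service-account.json")

# A genuine Delta table root must contain a _delta_log/ directory.
print(fs.exists("XXXX/poc-kedro/table_name/_delta_log"))
print(fs.ls("XXXX/poc-kedro/table_name"))
```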
m
```
File "/opt/anaconda3/lib/python3.11/site-packages/kedro/runner/runner.py", line 494, in _run_node_sequential
    inputs[name] = catalog.load(name)
                   ^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/data_catalog.py", line 515, in load
    result = dataset.load()
             ^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.11/site-packages/kedro/io/core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name, fs_prefix=gs://).
`gs://XXXXXX/poc-kedro/table_name` is not a Delta table.
```
👀 1
r
Thanks for your patience @Mohamed El Guendouz. I am new to this, and the docs show an example using a single parquet file but not a wildcard, so I am not sure whether multiple files are supported. Let me get some help from the team. Meanwhile, if you manage to resolve the issue, please let us know. Thank you
```yaml
weather@delta:
  type: spark.DeltaTableDataset
  filepath: data/02_intermediate/data.parquet
```
👍 1
:gratitude-thank-you: 1
m
Yes, exactly. That's what I found strange when reading the official documentation, since a Delta table consists of multiple files, not just a single Parquet file. Thanks, I'll keep you posted if I find a solution.
👍 1
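Since a Delta table is a directory (parquet part-files plus a `_delta_log/` transaction log), the dataset needs the table root rather than a `*.parquet` wildcard or a single part-file. A minimal sketch of exercising the dataset directly in Python, assuming `kedro-datasets` is installed and reusing the thread's placeholder path:

```python
# Sketch: instantiate the dataset outside the catalog to test loading.
# The filepath is the thread's placeholder; point it at the table root.
from kedro_datasets.spark import DeltaTableDataset

ds = DeltaTableDataset(filepath="gs://XXXX/poc-kedro/table_name")
delta_table = ds.load()      # returns a delta.tables.DeltaTable
delta_table.toDF().show(5)   # materialise a few rows as a sanity check
```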
I also tested with a single file from the table, and I got the same error:
```
kedro.io.core.DatasetError: Failed while loading data from data set DeltaTableDataset(filepath=XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet, fs_prefix=gs://). gs://XXXXXX/poc-kedro/table_name/processed_at=XXXXXX/part-00000-0076fd68-4ca3-46f6-982f-e77c539af8a1.c000.snappy.parquet is not a Delta table.
```
Hey @Ravi Kumar Pilla! 🙂 I figured out the cause of my issue: it was missing permissions on the service account (SA). I found the solution by trying to read the table from a notebook inside the Kedro project. Thanks for your help! 👍
🥳 1
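For anyone hitting the same thing: the `is not a Delta table` message can mask a permissions problem, since a service account that cannot list the `_delta_log/` prefix looks the same as the log not existing. A sketch of verifying the SA can list it, assuming `google-cloud-storage` is installed; all names are placeholders:

```python
# Sketch: check the service account can list the Delta transaction log.
from google.cloud import storage

# Authenticate explicitly with the same keyfile Spark uses (placeholder).
client = storage.Client.from_service_account_json("/path/to/service-account.json")

# If the SA lacks storage.objects.list on the bucket, this raises a 403
# instead of returning the transaction-log entries.
blobs = client.list_blobs("XXXX", prefix="poc-kedro/table_name/_delta_log/",
                          max_results=5)
print([b.name for b in blobs])
```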