Hi all, I can read and write `spark.SparkDataSet` ...
# questions
l
Hi all, I can read and write `spark.SparkDataSet` from an S3 bucket without issues for up to 1 hour in Kedro. However, when I run a node that requires more than 1 hour to process, my Kedro job is aborted after 1 hour and throws the following error. I am using Kedro 0.18.7. Please let me know if you have any clues regarding this issue. Is there any timeout-related setting in the AWS SDK used by Kedro? Thank you!
24/08/06 14:42:45 ERROR Utils: Aborting task
org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on <s3a://processed-data/projects/input_data/vitals.parquet/_temporary/0/_temporary/attempt_20240806144231705394881749042866_0023_m_000025_1754/hospitalID=10/unitAdmitYear=2018/part-00025-b0bdae35-5932-4179-a258-75ce64d1d156.c000.snappy.parquet>: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: HTNMAAWDQTDPKXBJ; S3 Extended Request ID: 5Il7bsSCCQBmf/sr84F5/S3tAlPtcINtVxTMCVmRAtU23j39Nu9Q0VGcMryPOMR8Gku7ueGrEMY=; Proxy: null), S3 Extended Request ID: 5Il7bsSCCQBmf/sr84F5/S3tAlPtcINtVxTMCVmRAtU23j39Nu9Q0VGcMryPOMR8Gku7ueGrEMY=:400 Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: HTNMAAWDQTDPKXBJ; S3 Extended Request ID: 5Il7bsSCCQBmf/sr84F5/S3tAlPtcINtVxTMCVmRAtU23j39Nu9Q0VGcMryPOMR8Gku7ueGrEMY=; Proxy: null)…
n
Are you using Amazon EMR or any other Amazon service? Could you provide the full stack trace if possible? I'd like to see whether Kedro throws this error or whether it's coming from somewhere else.
l
Thanks, @Nok Lam Chan, for looking into this. I am using a SageMaker instance to run the Kedro project. I will share the trace if I can get it from CloudWatch. By the way, is it mandatory to set the credential profile under each data catalog entry (e.g., `credentials: dev_s3`)? I am getting the same 1-hour timeout issue regardless of whether I put the credentials in the data catalog or simply export them in the terminal.