Dinesh D (03/11/2024, 11:05 AM):

```python
from unittest.mock import patch


def handler(event, context):
    # Defer Kedro imports so they run inside the Lambda execution context.
    from kedro.framework.project import configure_project

    configure_project("spaceflights_step_functions")
    node_to_run = event["node_name"]
    # Lambda offers no shared memory (/dev/shm), so multiprocessing locks
    # fail there; patching Lock out lets the session start.
    with patch("multiprocessing.Lock"):
        from kedro.framework.session import KedroSession

        with KedroSession.create(env="aws") as session:
            # Run only the single node requested by this invocation.
            session.run(node_names=[node_to_run])
```
We have a catalog with datasets like CSV, Parquet, SparkDataset, etc. The node we want to run doesn't require the SparkDataset, but Kedro tries to load every dataset in the catalog, which fails because the Spark library doesn't work in Lambda:

[ERROR] PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.

I have Java installed and JAVA_HOME correctly set in the Lambda Docker image.

1. Can we exclude loading datasets that are not required for a node?
2. Why doesn't PySpark work on Lambda? (We don't plan on running Spark on Lambda; just curious what the limitations are.)
datajoely (03/11/2024, 11:11 AM):
In `settings.py`, you should create some logic that only loads the Spark hook if you're NOT in Lambda, e.g. by checking `AWS_EXECUTION_ENV`:
https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
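A minimal sketch of that idea, assuming the project's `SparkHooks` lives in a `hooks.py` module (the import path and hook name are project-specific):

```python
# settings.py
import os

# Hypothetical import path; adjust to wherever your SparkHooks is defined.
from spaceflights_step_functions.hooks import SparkHooks

# The Lambda runtime sets AWS_EXECUTION_ENV; note that per the linked docs
# it is not available in OCI/container images, so also checking
# AWS_LAMBDA_FUNCTION_NAME is safer for Docker-based Lambdas.
_in_lambda = (
    "AWS_EXECUTION_ENV" in os.environ or "AWS_LAMBDA_FUNCTION_NAME" in os.environ
)

# Register the Spark hook only when NOT running inside Lambda.
HOOKS = () if _in_lambda else (SparkHooks(),)
```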
Dinesh D (03/11/2024, 11:38 AM):
The dataset is never called. Is this behaviour because of lazy loading?
datajoely (03/11/2024, 11:43 AM):

```python
HOOKS = (SparkHooks(),)
```
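For context: in the standard spaceflights setup, `SparkHooks` initialises a `SparkSession` in `after_context_created`, so Spark starts as soon as the Kedro session is created, before any node runs and regardless of which datasets the node touches. A sketch of that hook, following the pattern from the Kedro docs (details may differ per project):

```python
# hooks.py
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from the project's spark config.

        This fires on session creation, before any node executes, which is
        why the JAVA_GATEWAY_EXITED error appears in Lambda even when the
        node never loads a SparkDataset.
        """
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```

This is why conditionally registering the hook in `settings.py` fixes the error: the `SparkSession` comes from the hook, not from merely having a SparkDataset entry in the catalog.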