# questions
Dinesh D
Hi Team, firstly thank you for creating an awesome tool! I have a problem with deploying Kedro on Lambda (Docker). We use the script below to run a Kedro node on Lambda.
from unittest.mock import patch


def handler(event, context):
    from kedro.framework.project import configure_project

    configure_project("spaceflights_step_functions")
    node_to_run = event["node_name"]
    # AWS Lambda does not support the POSIX semaphores used by multiprocessing.Lock,
    # so we patch it out to let the Kedro session start.
    with patch("multiprocessing.Lock"):
        from kedro.framework.session import KedroSession

        with KedroSession.create(env="aws") as session:
            # Run only the node named in the invocation event.
            session.run(node_names=[node_to_run])
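For local testing the handler can be exercised with a hand-built event along these lines (the node name below is just a placeholder, substitute one from your own pipeline):

# Hypothetical local invocation; Lambda passes an event dict carrying the node name.
if __name__ == "__main__":
    handler({"node_name": "preprocess_companies_node"}, None)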
We have a catalog with datasets such as CSV, Parquet, SparkDataSet, etc. The node we want to run doesn't require the SparkDataSet, but Kedro tries to load all the datasets in the catalog, which fails because the Spark library doesn't work in Lambda.
[ERROR] PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
I have Java installed and JAVA_HOME correctly set in the Lambda Docker image. 1. Can we exclude loading datasets which are not required for a node? 2. Why doesn't PySpark work on Lambda? (We don't plan on running Spark on Lambda, just curious what the limitations are.)
datajoely
Hi @Dinesh D, so we can go down the route of debugging the Spark issue, but if your case doesn't need it I have a solution that I think will work without it. In your settings.py you should create some logic which only loads the Spark hook if you're NOT in Lambda, e.g. by checking AWS_EXECUTION_ENV: https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
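Something like this should do it (a rough sketch, assuming your SparkHooks class lives in spaceflights_step_functions/hooks.py as in the spaceflights starter; adjust the import to wherever yours is actually defined):

# settings.py -- sketch: only register the Spark hook when NOT running on Lambda.
# AWS sets AWS_EXECUTION_ENV inside the Lambda runtime, so its absence means we
# are running somewhere else (local machine, EMR, etc.).
import os

if "AWS_EXECUTION_ENV" in os.environ:
    HOOKS = ()  # on Lambda: no Spark hook, so no SparkSession gets created
else:
    from spaceflights_step_functions.hooks import SparkHooks  # assumed module path

    HOOKS = (SparkHooks(),)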
Dinesh D
Hey @datajoely, thank you for the quick response, I will check the Spark hook. Regarding debugging Spark on Lambda, it's not necessary now as we don't plan on using it... cheers!
datajoely
💪
essentially without the hook the Spark context won't be initialised when the Kedro session is created
and if the dataset is never called you shouldn’t get a runtime error when it tries to load itself
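For reference, the hook in question usually looks something like this (a sketch of the common pattern, assuming a Kedro version that has the after_context_created hook; your project's hooks.py may differ):

# hooks.py -- sketch of a typical SparkHooks: it starts the SparkSession as soon
# as the Kedro context is created, which is why merely registering it fails on
# Lambda even when no Spark dataset is ever loaded.
from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # Real projects usually read conf/**/spark.yml via context.config_loader
        # here and pass it in as a SparkConf; kept minimal for the sketch.
        spark = SparkSession.builder.appName("spaceflights_step_functions").getOrCreate()
        spark.sparkContext.setLogLevel("WARN")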
Dinesh D
1. AFAIK hooks are used to manipulate execution at different steps, so in this case, should I remove the SparkDataSet after the catalog is created, using hooks? You're saying the Spark context won't be created without the hook; can you send a link to where that hook is located? 2. "dataset is never called" - is this behaviour because of lazy loading?
datajoely
in your settings.py do you not have something like this?
HOOKS = (SparkHooks(),)
Dinesh D
Everything in my settings is commented out.
Should I uncomment HOOKS and set it to an empty tuple?
datajoely
try that please 🙂
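i.e. in the Lambda image the simplest version of settings.py would just be:

# settings.py -- no hooks registered at all, so SparkHooks never runs and no
# SparkSession is created when the session starts on Lambda.
HOOKS = ()

or keep the AWS_EXECUTION_ENV check from earlier so the same settings.py works in every environment.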
Dinesh D
yep, thank you!