Dinesh D (03/11/2024, 11:05 AM):

```python
from unittest.mock import patch


def handler(event, context):
    # Defer Kedro imports so they run inside the Lambda execution context.
    from kedro.framework.project import configure_project

    configure_project("spaceflights_step_functions")
    node_to_run = event["node_name"]
    # Lambda offers no shared memory (/dev/shm), so multiprocessing locks
    # fail there; patching Lock out lets the session start.
    with patch("multiprocessing.Lock"):
        from kedro.framework.session import KedroSession

        with KedroSession.create(env="aws") as session:
            # Run only the single node requested by this invocation.
            session.run(node_names=[node_to_run])
```
We have a catalog with datasets like CSV, Parquet, SparkDataset, etc. The node we want to run doesn't require the SparkDataset, but Kedro tries to load every dataset in the catalog, which fails because the Spark library doesn't work in Lambda:

[ERROR] PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.

I have Java installed and JAVA_HOME correctly set in the Lambda Docker image.

1. Can we exclude loading datasets that are not required for a node?
2. Why doesn't PySpark work on Lambda? (We don't plan on running Spark on Lambda; just curious what the limitations are.)
datajoely (03/11/2024, 11:11 AM):
In `settings.py`, you should create some logic that only loads the Spark hook if you're NOT in Lambda, e.g. by checking `AWS_EXECUTION_ENV`:
https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html
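A minimal sketch of that idea, assuming the project's `SparkHooks` lives in a `hooks.py` module (the import path and hook name are project-specific):

```python
# settings.py
import os

# Hypothetical import path; adjust to wherever your SparkHooks is defined.
from spaceflights_step_functions.hooks import SparkHooks

# The Lambda runtime sets AWS_EXECUTION_ENV; note that per the linked docs
# it is not available in OCI/container images, so also checking
# AWS_LAMBDA_FUNCTION_NAME is safer for Docker-based Lambdas.
_in_lambda = (
    "AWS_EXECUTION_ENV" in os.environ or "AWS_LAMBDA_FUNCTION_NAME" in os.environ
)

# Register the Spark hook only when NOT running inside Lambda.
HOOKS = () if _in_lambda else (SparkHooks(),)
```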
Dinesh D (03/11/2024, 11:38 AM):
The dataset is never called. Is this behaviour because of lazy loading?
datajoely (03/11/2024, 11:43 AM):

```python
HOOKS = (SparkHooks(),)
```
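For context: in the standard spaceflights setup, `SparkHooks` initialises a `SparkSession` in `after_context_created`, so Spark starts as soon as the Kedro session is created, before any node runs and regardless of which datasets the node touches. A sketch of that hook, following the pattern from the Kedro docs (details may differ per project):

```python
# hooks.py
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialise a SparkSession from the project's spark config.

        This fires on session creation, before any node executes, which is
        why the JAVA_GATEWAY_EXITED error appears in Lambda even when the
        node never loads a SparkDataset.
        """
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```

This is why conditionally registering the hook in `settings.py` fixes the error: the `SparkSession` comes from the hook, not from merely having a SparkDataset entry in the catalog.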