# questions
w
Hi team, is there any guide on submitting a `spark` job to EMR through `livy` for a `kedro` project?
d
We don’t have great docs on this, to be honest; the simplest way is to run `kedro package` and deploy it that way
w
@datajoely thanks a lot! Something like this?
• run `kedro package` to create the egg file and deploy it to an S3 location
• in the Airflow `LivyOperator`, pass the location of the file to the `py_files` parameter
• create a Python file that imports the packaged kedro project and passes `sys.argv` to `main`; deploy this file to an S3 location
```python
from test_livy.__main__ import main
import sys

if __name__ == "__main__":
    main(sys.argv)
```
• in the Airflow Livy operator, pass the location of this file to the `file` parameter
• kedro commands are passed via the `args` parameter of the Airflow Livy operator
```python
# airflow task

t1 = LivyOperator(
    task_id="run_kedro_pipeline",
    driver_memory="1g",
    num_executors=1,
    executor_memory="1g",
    executor_cores=1,
    polling_interval=30,
    file="s3://{{ var.json.AWS_BUCKETS.app.name }}/applications/spark/emr/test_kedro_livy.py",
    args=["--pipeline", "test", "--params", "pipeline:test,app_name:test,ds:{{ ds }}"],
    dag=dag,
    livy_conn_id="livy_emr",
)
```
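To sanity-check how the operator's `args` reach the wrapper, here is a minimal sketch of the hand-off. The `fake_main` parser is hypothetical: in reality kedro's own CLI parses `--pipeline` and `--params`; this only illustrates that Livy appends the `args` list to the script's arguments, so they arrive in `sys.argv`:

```python
import argparse

def fake_main(argv):
    """Stand-in for the packaged project's main(); kedro's real CLI
    parses these flags itself -- this only illustrates the hand-off."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pipeline")
    parser.add_argument("--params")
    opts = parser.parse_args(argv)
    return opts.pipeline, opts.params

# Livy passes the operator's `args` list as the script's arguments,
# so on the cluster sys.argv[1:] would look like this (concrete date
# substituted for the {{ ds }} template):
argv = ["--pipeline", "test", "--params", "pipeline:test,app_name:test,ds:2024-01-01"]
print(fake_main(argv))  # → ('test', 'pipeline:test,app_name:test,ds:2024-01-01')
```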
d
did it work?
I think that looks sensible
w
trying to test it; I don't always have an environment to work on 👀
d
okay it looks sensible
if you do find a solution we’d love to write some docs on this ❤️
(or accept a contribution 😛 )
let us know how it goes and we’ll do our best to support
w
on another related question, is it true that kedro will only work with spark deployment mode "client"? https://github.com/kedro-org/kedro/issues/529
d
I’m not entirely sure, but that issue is quite old so Kedro itself will have changed since then
w
okie okie. so we tried to run with `--deployment-mode cluster` and it gives a spark error with exitCode 13; changing to `client` seems to bypass that issue for now. @datajoely do you have any good suggestions on deploying the "conf" files to EMR for the packaged kedro project? we thought about a bootstrap action, but wouldn't that restart the cluster every time?
d
this is something that will improve in the next version of kedro
but for now you would have to write your own procedure to put this folder in the right place
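One possible procedure (a sketch, not official kedro tooling) is to mirror the `conf/` prefix from S3 onto the node before invoking the packaged project. The bucket/prefix/destination names below are hypothetical, and the `boto3` usage assumes the EMR node has AWS credentials and `boto3` installed:

```python
import os
from pathlib import PurePosixPath

def key_to_local(key: str, prefix: str, dest: str) -> str:
    """Map an S3 key under `prefix` to a path under `dest`,
    e.g. 'app/conf/base/catalog.yml' -> '<dest>/base/catalog.yml'."""
    rel = PurePosixPath(key).relative_to(prefix)
    return os.path.join(dest, *rel.parts)

def sync_conf(bucket: str, prefix: str, dest: str) -> None:
    """Download every object under s3://<bucket>/<prefix> into `dest`.
    Hypothetical names; requires AWS credentials on the node."""
    import boto3  # assumption: boto3 is available on the EMR node
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            local = key_to_local(obj["Key"], prefix, dest)
            os.makedirs(os.path.dirname(local), exist_ok=True)
            s3.download_file(bucket, obj["Key"], local)

# e.g. sync_conf("my-bucket", "applications/spark/emr/conf", "/tmp/conf")
# at the top of the wrapper script, before calling main()
```

Calling this at the top of the Livy-submitted wrapper keeps the conf deployment inside the job itself, avoiding a bootstrap action entirely.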
w
that's awesome! looking forward to that update
we tested that the steps above are working once the conf files are also deployed onto the cluster