Hi folks, I am exploring two approaches to running a Kedro project in Airflow:

1) I used kedro-docker to generate a Dockerfile and built a Docker image, test:latest. The image is now run in Airflow using DockerOperator, as below:
tasks["run_image"] = DockerOperator(
task_id="test_model",
docker_conn_id='docker_ecr',
image="<http://123456789.dkr.ecr.eu-central-1.amazonaws.com/test:latest|123456789.dkr.ecr.eu-central-1.amazonaws.com/test:latest>",
api_version="auto",
auto_remove=True,
force_pull=True,
docker_url="<unix://var/run/docker.sock>"
)
Now the DAG runs this as a Docker image with a single task. Running the image via DockerOperator does not let the nodes/pipelines appear as individual tasks in Airflow, and the whole image has to be rerun from scratch if it breaks for some reason.

2) Using kedro package, the project was packaged into a wheel file, and the DAG was generated with kedro airflow create.
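For context, the generated DAG wraps each node in an operator that runs it through a KedroSession; simplified, it looks roughly like this (written from memory of the generated file, so details may differ between kedro-airflow versions):

from airflow.models import BaseOperator
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

class KedroOperator(BaseOperator):
    """Runs exactly one node of the packaged Kedro project."""

    def __init__(self, package_name, pipeline_name, node_name, project_path, env, **kwargs):
        super().__init__(**kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context):
        # Requires the project wheel to be installed in the worker environment
        configure_project(self.package_name)
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])

The generated file then instantiates one KedroOperator per node inside a DAG and chains them to mirror the pipeline, e.g. tasks["preprocess"] >> tasks["train"].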
The advantage of this approach over the previous one is that the DAG has one task per node (as in the sketch above), which allows the pipeline to resume from where it broke, unlike the first approach. To deploy this to an EC2 instance, the documentation (https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow) says to deploy the project to the Airflow executors' environment:

pip install <path-to-wheel-file>
So I would install the .whl into the Airflow worker container (if that is what the Airflow executors' environment means; correct me if I'm wrong).

The first approach pulls the image from the Docker registry only when the DAG is scheduled to run, and removes the container after the image has run. This allows multiple projects/DAGs to pull their images and run their tasks, so the EC2 instance stays lighter. With the second approach, the wheels of all projects have to be installed on the EC2 instance to run the project-specific DAGs. This fills up the EC2 instance, since every project's .whl must be installed even though each project only runs when its DAG is scheduled.

Can anyone suggest a scalable way to do the second approach? I would like to keep control of the individual tasks within the project, which is missing from the first approach. Or, how are kedro-airflow users managing multiple projects on EC2, for example?
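For what it's worth, the hybrid I keep coming back to is one DockerOperator per node, so the images stay in the registry but the tasks still map to nodes. A rough sketch of what I mean (node names are illustrative, and the exact kedro run flag depends on the Kedro version):

from datetime import datetime
from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(dag_id="test-model-per-node", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    def node_task(node_name):
        # One short-lived container per Kedro node; nothing installed on the EC2 host
        return DockerOperator(
            task_id=node_name,
            docker_conn_id="docker_ecr",
            image="123456789.dkr.ecr.eu-central-1.amazonaws.com/test:latest",
            command=f"kedro run --nodes={node_name}",  # --node on older Kedro versions
            api_version="auto",
            auto_remove=True,
            force_pull=True,
            docker_url="unix://var/run/docker.sock",
        )

    preprocess = node_task("preprocess")
    train = node_task("train")
    preprocess >> train

That would keep the per-task resumability of the second approach while keeping the instance as light as the first.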