Deploying Airflow on Google Kubernetes Engine with Helm
Part One: configuring a basic Airflow deployment with Helm
Objectives
This two-part article will demonstrate how to deploy and configure Apache Airflow on Google Kubernetes Engine (GKE) on GCP using the official Helm chart.
In this first part we will:
- Set up a Kubernetes cluster on GKE.
- Deploy and configure Airflow using Helm and the values.yaml file.
- Expose the Airflow web server on GKE through a GCP LoadBalancer.
At the end of this part we will have an Airflow deployment running the LocalExecutor and an Airflow web server accessible through a GCP LoadBalancer, but without any DAGs yet.
In the second part we will:
- Manage Airflow Connections using GKE Secrets.
- Install Airflow dependencies and custom operators via a Docker image loaded from the Artifact Registry.
- Automatically pull Airflow DAGs from a private GitHub repository with the git-sync feature.
- Integrate other GCP services such as Google Cloud Storage.
After part two we will have extended our Airflow deployment with a DAG that writes a daily batch of data to a Google Cloud Storage bucket.
Introduction
GCP is an excellent cloud provider choice for Airflow. The apache-airflow-providers-google Python package provides a large number of Airflow operators, hooks and sensors. This makes integrating Airflow with the many GCP services, such as BigQuery and GCS, a breeze.
It's worth noting that GCP offers its own managed deployment of Airflow called Cloud Composer. However, by managing our own deployment on Kubernetes we maintain more granular control over the underlying infrastructure. This allows us to optimize the infrastructure for our specific use case and lower the cost.
Prerequisites
This article assumes that the following prerequisites have been met on your workstation:
- A GCP project named 'airflow-gke' with an active billing account (potentially with free trial credit).
- The CLI tools gcloud, kubectl and helm.
If you need a quick introduction to Kubernetes, watch this light-hearted video.
1. Creating a Kubernetes cluster on GKE
Before we can initialize a Kubernetes cluster on GKE we must first set the project in the gcloud CLI using its Project ID:
gcloud config set project airflow-gke-338120
The Project ID can be found in the Project info panel on the GCP dashboard. Your GCP project will have a different Project ID than the one in this article.
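If you are not sure which Project ID to use, you can also list all projects and their IDs from the terminal:
gcloud projects list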
Now we can create a cluster named airflow-cluster with a public endpoint. You are free to choose a different geographical region.
gcloud container clusters create airflow-cluster \
--machine-type n1-standard-4 \
--num-nodes 1 \
--region "europe-west4"
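Creating the cluster can take several minutes. Once the command returns, you can list the cluster and its status from the terminal as well:
gcloud container clusters list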
We will use the kubectl CLI to interact with our newly deployed Kubernetes cluster on GKE. Authenticate kubectl against this cluster with the following command:
gcloud container clusters get-credentials airflow-cluster --region "europe-west4"
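As a quick sanity check, you can confirm that kubectl is now pointed at the GKE cluster by listing its nodes. A regional cluster with --num-nodes 1 typically creates one node per zone, so expect three nodes here:
kubectl get nodes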
Finally, we will create a Kubernetes namespace called airflow for this deployment using the kubectl CLI. This is not strictly necessary, but it's worthwhile to learn how this feature works.
kubectl create namespace airflow
Don't forget to pass the namespace airflow to the --namespace or -n flag in the following kubectl commands.
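For example, once Airflow has been deployed you can list all pods in this namespace with:
kubectl get pods -n airflow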
Browse to the Clusters tab on Kubernetes Engine to view the newly created cluster.
2. Deploying the official Apache Airflow Helm chart
Now that the cluster is up and running we can install Airflow with Helm. Helm is a package manager that bundles Kubernetes applications into so-called charts. Apache Airflow released the official Helm chart for Airflow in July 2021. With this chart we can bootstrap Airflow on our newly created Kubernetes cluster with relative ease.
First, add the official Apache Airflow chart repository to your local Helm installation:
helm repo add apache-airflow https://airflow.apache.org
Verify that the repository has been added:
helm repo list
Now Airflow can be deployed on GKE with just one command:
helm upgrade --install airflow apache-airflow/airflow -n airflow --debug
- The first airflow argument is the name we give to the release; apache-airflow/airflow is the Helm chart that we deploy.
- The second airflow argument is the Kubernetes namespace that we created earlier.
- The --debug flag is recommended to view the progress and spot potential problems.
Verify the deployment by browsing to the Services & Ingress tab on Kubernetes Engine.
Here we can see the various services of our Airflow deployment. By default, the Helm chart is configured to use the CeleryExecutor, which is why there is an airflow-flower and an airflow-redis service. We will change this to the LocalExecutor later.
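If you prefer the terminal over the GCP console, you can list the same services with kubectl:
kubectl get svc -n airflow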
3. Accessing the Airflow web server via port-forwarding to the ClusterIP service
By default the Helm chart is configured to use a Kubernetes ClusterIP service for the airflow-webserver, as can be seen in the Type column of the image above. This service routes requests to the correct pod but has no external endpoint. To access it from outside the cluster we must forward port 8080 of the pod to port 8080 of our workstation:
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow
The web server can now be accessed on localhost:8080. The default credentials are username admin and password admin.
Forwarding the port each time we want to visit the web server is not convenient. In the next section we will replace the Kubernetes ClusterIP service with a GCP LoadBalancer that exposes the Airflow web server externally.
4. Configuring our Airflow deployment
In this section we will modify the deployment by editing the values.yaml file of our Helm chart. Write the chart's default configuration, which is what is currently deployed, to a file called values.yaml:
helm show values apache-airflow/airflow > values.yaml
We will use the file to change two aspects of our deployment: the Airflow executor and the web server service. To do this, edit the following values in the values.yaml file:
- Replace the CeleryExecutor with the LocalExecutor on line 202.
- Replace the ClusterIP service with a LoadBalancer on line 752.
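After these two edits, the relevant parts of values.yaml should look roughly as follows (the key names come from the official chart; the exact line numbers may differ depending on the chart version):

# Airflow executor (was "CeleryExecutor")
executor: "LocalExecutor"

# Expose the web server externally (was ClusterIP)
webserver:
  service:
    type: LoadBalancer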
Once saved, you can pass the file to the helm command with the -f flag when you upgrade the release:
helm upgrade --install airflow apache-airflow/airflow -n airflow \
-f values.yaml \
--debug
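As an optional check, you can confirm that Helm has recorded a new revision of the release:
helm list -n airflow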
Let's verify the new deployment again.
Notice how there are fewer Kubernetes services than before? This is because the LocalExecutor doesn't need the Redis broker and the Flower UI. Hence GKE deleted these pods and replaced them with an airflow-scheduler service.
Notice also how the airflow-webserver service is now of the "External load balancer" type. GKE spotted the change from ClusterIP to LoadBalancer and automatically created this GCP service for you. You can now access the web server via the IP address in the Endpoints column.
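You can also look up the external IP address with kubectl; the EXTERNAL-IP column shows the address and, with the chart's default settings, the web server is served on port 8080:
kubectl get svc airflow-webserver -n airflow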
Currently there are no DAGs. We will learn how to add a DAG along with its dependencies in the second part of this article.