This two-part article demonstrates how to deploy and configure Apache Airflow on Google Kubernetes Engine (GKE) on GCP using the official Helm chart.
In this first part we will:
- Set up a Kubernetes cluster on GKE.
- Deploy and configure Airflow using Helm and the values.yaml file.
- Expose the Airflow web server on GKE through a GCP LoadBalancer.
At the end of this part we will have an Airflow deployment running the LocalExecutor and an Airflow web server accessible through a GCP LoadBalancer, but without any DAGs yet.
In the second part we will:
- Manage Airflow Connections using GKE Secrets.
- Install Airflow dependencies and custom operators via a Docker image loaded from the Artifact Registry.
- Automatically pull Airflow DAGs from a private GitHub repository with the git-sync feature.
- Integrate other GCP services such as Google Cloud Storage.
After part two we will have extended our Airflow deployment with a DAG that writes a daily batch of data to a Google Cloud Storage bucket.
GCP is an excellent cloud provider choice for Airflow. The apache-airflow-providers-google Python package provides a large number of Airflow operators, hooks and sensors, which makes integrating Airflow with the many GCP services such as BigQuery and GCS a breeze.
It's worth noting that GCP offers its own managed deployment of Airflow called Cloud Composer. However, by managing our own deployment on Kubernetes we maintain more granular control over the underlying infrastructure. This allows us to optimize the infrastructure for our specific use case and lower the cost.
This article assumes that the following prerequisites have been met on your workstation:
- A GCP project named 'airflow-gke' with an active billing account (potentially with free trial credit).
- The CLI tools gcloud, kubectl and helm installed.
If you need a quick introduction to Kubernetes watch this light-hearted video.
1. Creating a Kubernetes cluster on GKE
Before we can initialize a Kubernetes cluster on GKE we must first set the project in the gcloud CLI using its Project ID:
gcloud config set project airflow-gke-338120
The Project ID can be found in the Project info panel on the GCP dashboard. Your GCP project will have a different Project ID than the one in this article.
Now we can create a cluster named airflow-cluster with a public endpoint. You are free to choose a different geographical region.
gcloud container clusters create airflow-cluster \
    --machine-type n1-standard-4 \
    --num-nodes 1 \
    --region "europe-west4"
We will use the kubectl CLI to interact with our newly deployed Kubernetes cluster on GKE. Authenticate kubectl against this cluster with the following command:
gcloud container clusters get-credentials airflow-cluster --region "europe-west4"
Finally, we will create a Kubernetes namespace called airflow for this deployment using the kubectl CLI. This is not strictly necessary, but it's worthwhile to learn how this feature works.
kubectl create namespace airflow
Don't forget to pass the namespace airflow to the -n flag in the commands that follow.
Browse to the Clusters tab on Kubernetes Engine to view the newly created cluster.
2. Deploying the official Apache Airflow Helm chart
Now that the cluster is up and running we can install Airflow with Helm. Helm is a package manager that bundles Kubernetes applications into so called charts. Apache Airflow released the official Helm chart for Airflow in July 2021. With this chart we can bootstrap Airflow on our newly created Kubernetes cluster with relative ease.
First install the official Helm chart for Apache Airflow in your local Helm repository:
helm repo add apache-airflow https://airflow.apache.org
Verify that the chart is in your local repository:
helm repo list
Now Airflow can be deployed on GKE with just one command:
helm upgrade --install airflow apache-airflow/airflow -n airflow --debug
- The first airflow argument is the name we give to the release.
- apache-airflow/airflow is the Helm chart that we deploy.
- The second airflow argument (after -n) is the Kubernetes namespace that we created earlier.
- The --debug flag is recommended to view the progress and spot potential problems.
Verify the deployment by browsing to the Services & Ingress tab on Kubernetes Engine.
Here we can see the various services of our Airflow deployment. By default, the Helm chart is configured to use the CeleryExecutor, which is why there is an airflow-redis service. We will change this to the LocalExecutor later.
3. Accessing the Airflow web server via port-forwarding to the ClusterIP service
By default the Helm chart is configured to use a Kubernetes ClusterIP service for the airflow-webserver, as can be seen in the Type column of the image above. This service routes requests to the correct pod but has no external endpoint. To access it from outside the cluster we must forward port 8080 of the pod to port 8080 of our workstation:
kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow
The web server can now be accessed on localhost:8080. The default credentials are username admin and password admin.
Forwarding the port each time we want to visit the web server is not convenient. In the next section we will substitute the Kubernetes ClusterIP service with a GCP LoadBalancer that will expose the Airflow web server externally.
4. Configuring our Airflow deployment
In this section we will modify the deployment by editing the values.yaml file of our Helm chart. Write the chart's default configuration to a file called values.yaml:
helm show values apache-airflow/airflow > values.yaml
We will use the file to change two aspects of our deployment: the Airflow executor and the web server service. To do this, edit the following values in the values.yaml file:
- Replace the CeleryExecutor with the LocalExecutor on line 202.
- Replace the ClusterIP with the LoadBalancer on line 752.
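After these edits, the relevant parts of values.yaml should look roughly as follows. The exact line numbers depend on the chart version, so search for the keys rather than relying on the numbers:

```yaml
# Executor setting (top level, around line 202)
executor: "LocalExecutor"

# Web server service type (under the webserver section, around line 752)
webserver:
  service:
    type: LoadBalancer
```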
Once saved, you can pass the file to the helm command with the -f flag when upgrading the release:
helm upgrade --install airflow apache-airflow/airflow -n airflow \
    -f values.yaml \
    --debug
Let's verify the new deployment again.
Notice how there are fewer Kubernetes services than before? This is because the LocalExecutor doesn't need the Redis broker and the Flower UI, so the corresponding pods and services were removed.
Notice also how the airflow-webserver service is now of the "External load balancer" Type. GKE spotted the change from ClusterIP to LoadBalancer and automatically created this GCP service for you. You can now access the web server via the IP address in the Endpoints column.
Currently there are no DAGs. We will learn how to add a DAG along with its dependencies in the second part of this article.
Cover image: Infiflex 2019-2020, Google Kubernetes Engine - GKE - Features & Benefits, accessed 18 January 2022, https://www.infiflex.com/google-kubernetes-engine--gke--features--benefits.