Deploying Airflow on Google Kubernetes Engine with Helm

Part One: configuring a basic Airflow deployment with Helm

Denis Gontcharov · Jan 18, 2022 · 6 min read

Objectives

This two-part article demonstrates how to deploy and configure Apache Airflow on Google Kubernetes Engine (GKE) on GCP using the official Helm chart.

In this first part we will:

  • Set up a Kubernetes cluster on GKE.
  • Deploy and configure Airflow using Helm and the values.yaml file.
  • Expose the Airflow web server on GKE through a GCP LoadBalancer.

At the end of this part we will have an Airflow deployment running the LocalExecutor and an Airflow web server accessible through a GCP LoadBalancer, but without any DAGs yet.

In the second part we will:

  • Manage Airflow Connections using GKE Secrets.
  • Install Airflow dependencies and custom operators via a Docker image loaded from the Artifact Registry.
  • Automatically pull Airflow DAGs from a private GitHub repository with the git-sync feature.
  • Integrate other GCP services such as Google Cloud Storage.

After part two we will have extended our Airflow deployment with a DAG that writes a daily batch of data to a Google Cloud Storage bucket.

Introduction

GCP is an excellent cloud provider choice for Airflow. The apache-airflow-providers-google Python package provides a large number of Airflow operators, hooks and sensors. This makes integrating Airflow with GCP services such as BigQuery and GCS a breeze.

It's worth noting that GCP offers its own managed deployment of Airflow called Cloud Composer. However, by managing our own deployment on Kubernetes we maintain more granular control over the underlying infrastructure. This allows us to optimize the infrastructure for our specific use case and lower the cost.

Prerequisites

This article assumes that the following prerequisites have been met on your workstation:

  1. A GCP project named 'airflow-gke' with an active billing account (potentially with free trial credit).
  2. The CLI-tools gcloud, kubectl and helm.
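
You can quickly confirm that all three CLI tools are installed and on your PATH (the exact version output will vary):

gcloud --version
kubectl version --client
helm version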

If you need a quick introduction to Kubernetes, watch this light-hearted video.

1. Creating a Kubernetes cluster on GKE

Before we can initialize a Kubernetes cluster on GKE we must first set the project in the gcloud CLI using its Project ID:

gcloud config set project airflow-gke-338120

The Project ID can be found in the Project info panel on the GCP dashboard. Your GCP project will have a different Project ID than the one in this article.

[Screenshot: the Project info panel on the GCP dashboard showing the Project ID]
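
If you prefer the command line, you can also list your projects and their IDs with:

gcloud projects list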

Now we can create a cluster named airflow-cluster with a public endpoint. You are free to choose a different geographical region.

gcloud container clusters create airflow-cluster \
--machine-type n1-standard-4 \
--num-nodes 1 \
--region "europe-west4"

We will use the kubectl CLI to interact with our newly deployed Kubernetes cluster on GKE. Authenticate kubectl against this cluster with the following command:

gcloud container clusters get-credentials airflow-cluster --region "europe-west4"
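
A quick way to confirm that kubectl now points at the right cluster is to list its nodes (with the regional settings above you should see one node per zone of the region):

kubectl get nodes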

Finally, we will create a Kubernetes namespace called airflow for this deployment using the kubectl CLI. This is not strictly necessary but it's worthwhile to learn how this feature works.

kubectl create namespace airflow

Don't forget to pass the namespace airflow to the --namespace or -n flag in the following kubectl commands.
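
If you don't want to type the flag every time, you can optionally make airflow the default namespace of your current kubectl context:

kubectl config set-context --current --namespace=airflow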

Browse to the Clusters tab on Kubernetes Engine to view the newly created cluster.

[Screenshot: the Clusters tab on Kubernetes Engine showing airflow-cluster]

2. Deploying the official Apache Airflow Helm chart

Now that the cluster is up and running, we can install Airflow with Helm. Helm is a package manager that bundles Kubernetes applications into so-called charts. Apache Airflow released the official Helm chart for Airflow in July 2021. With this chart we can bootstrap Airflow on our newly created Kubernetes cluster with relative ease.

First install the official Helm chart for Apache Airflow in your local Helm repository:

helm repo add apache-airflow https://airflow.apache.org

Verify that the chart is in your local repository:

helm repo list
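
Optionally, refresh the repository index and check which chart version will be installed:

helm repo update
helm search repo apache-airflow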

Now Airflow can be deployed on GKE with just one command:

helm upgrade --install airflow apache-airflow/airflow -n airflow --debug

  • The first airflow argument is the name we give to the release.
  • apache-airflow/airflow is the Helm chart that we deploy.
  • The airflow argument after the -n flag is the Kubernetes namespace that we created earlier.
  • The --debug flag is recommended to view the progress and spot potential problems.
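
Once the command finishes, you can also check the release and its pods from the command line:

helm ls -n airflow
kubectl get pods -n airflow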

Verify the deployment by browsing to the Services & Ingress tab on Kubernetes Engine.

[Screenshot: the Services & Ingress tab on Kubernetes Engine listing the Airflow services]

Here we can see the various services of our Airflow deployment. By default, the Helm chart is configured to use the CeleryExecutor, which is why there are airflow-flower and airflow-redis services. We will change this to the LocalExecutor later.

3. Accessing the Airflow web server via port-forwarding to the ClusterIP service

By default the Helm chart is configured to use a Kubernetes ClusterIP service for the airflow-webserver, as can be seen in the Type column of the image above. This service routes requests to the correct pod but has no external endpoint. To access it from outside the cluster, we must forward port 8080 of the pod to port 8080 of our workstation:

kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow

The web server can now be accessed on localhost:8080. The default credentials are username admin and password admin.

[Screenshot: the Airflow web UI reached via localhost:8080]

Forwarding the port each time we want to visit the web server is not convenient. In the next section we will substitute the Kubernetes ClusterIP service with a GCP LoadBalancer that will expose the Airflow web server externally.

4. Configuring our Airflow deployment

In this section we will modify the deployment by editing the values.yaml file of our Helm chart. Write the chart's default configuration to a file called values.yaml:

helm show values apache-airflow/airflow > values.yaml

We will use this file to change two aspects of our deployment: the Airflow executor and the web server service. To do this, edit the following values in the values.yaml file (the line numbers refer to the chart version used here and may shift in other versions):

  1. Replace the CeleryExecutor with LocalExecutor on line 202.
  2. Replace the ClusterIP service with LoadBalancer on line 752.
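
For reference, the relevant parts of values.yaml should look roughly like this after the edits (the surrounding keys may differ slightly between chart versions):

# Airflow executor (was "CeleryExecutor")
executor: "LocalExecutor"

# ...

webserver:
  service:
    type: LoadBalancer   # was ClusterIP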

Once saved, pass the file to the helm command with the -f flag when upgrading the release:

helm upgrade --install airflow apache-airflow/airflow -n airflow  \
  -f values.yaml \
  --debug

Let's verify the new deployment again.

[Screenshot: the Services & Ingress tab after the upgrade, showing fewer Airflow services]

Notice how there are fewer Kubernetes services than before? This is because the LocalExecutor doesn't need the Redis broker or the Flower UI. Hence GKE deleted these pods and replaced them with an airflow-scheduler service.

Notice also that the airflow-webserver service is now of the type "External load balancer". GKE spotted the change from ClusterIP to LoadBalancer and automatically created this GCP service for you. You can now access the web server via the IP address in the Endpoints column.
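
You can also retrieve the external IP address directly from the command line:

kubectl get svc airflow-webserver -n airflow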

[Screenshot: the Airflow web UI accessed via the LoadBalancer's external IP address]

Currently there are no DAGs. We will learn how to add a DAG along with its dependencies in the second part of this article.

