2025-08-23

Operational Notes on Airflow, Spark, and Alerting on Kubernetes

Practical lessons from running workflow orchestration and distributed compute in Kubernetes.

Airflow, Spark, Kubernetes

Running Airflow on Kubernetes is powerful, but it also creates a new class of operational problems.

The scheduler, workers, metadata database, message broker, ingress, secrets, DAGs, Spark jobs, and alerting all have to work together. When one layer is misconfigured, the failure can look like an application issue even when the real problem is infrastructure.

Start with the alert path

Before relying on email alerts, prove the full path works.

That means testing:

application configuration
Kubernetes secrets
SMTP host reachability
network policy
relay permissions
sender address
failure callback behavior

A small test DAG that fails on purpose is usually better than waiting for a real production failure to find out whether alerts arrive.
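One way such a test DAG might look is sketched below. It assumes Airflow 2.4 or later in the 2.x line and relies on Airflow's built-in `email_on_failure` behavior; import paths and alerting options differ in other versions. The DAG id, task id, and recipient address are hypothetical. The guarded import exists only so the file can be loaded where Airflow is not installed.

```python
from datetime import datetime

try:
    # Assumption: Airflow 2.4+ (2.x line). Other versions use different paths.
    from airflow import DAG
    from airflow.operators.python import PythonOperator
except ImportError:
    DAG = None  # lets this file be imported where Airflow is absent

ALERT_RECIPIENT = "platform-alerts@example.com"  # hypothetical address


def fail_on_purpose():
    """Raise so Airflow marks the task failed and triggers its alerting."""
    raise RuntimeError("Intentional failure to exercise the alert path")


if DAG is not None:
    with DAG(
        dag_id="alert_path_smoke_test",  # hypothetical DAG id
        start_date=datetime(2025, 1, 1),
        schedule=None,  # no schedule: trigger manually when testing
        catchup=False,
        default_args={
            "email": [ALERT_RECIPIENT],
            "email_on_failure": True,
            "retries": 0,  # fail on the first attempt, no waiting
        },
    ) as dag:
        PythonOperator(
            task_id="fail_on_purpose",
            python_callable=fail_on_purpose,
        )
```

Triggering this DAG by hand and checking that the message reaches the recipient's inbox exercises configuration, secrets, network, and relay in a single run.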

Keep SMTP boring

For internal platforms, the most reliable SMTP setup is often the least fancy one.

A practical pattern is:

Airflow pod → node-local or internal SMTP relay → mail system

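The important part is making sure the relay allows traffic from the pod network and that the sender address is set deliberately, to an address the relay and the downstream mail system will accept, rather than left at a default.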
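To check the relay independently of Airflow, a short script using only the Python standard library can send a test message along the same path. This is a minimal sketch: the relay host, port, and both addresses are placeholders, and it assumes the relay accepts unauthenticated mail from the pod network.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical values; replace with your own relay and addresses.
RELAY_HOST = "smtp-relay.internal.example"
RELAY_PORT = 25
SENDER = "airflow-alerts@example.com"
RECIPIENT = "platform-team@example.com"


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text test message with an explicit sender address."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Airflow alert path test"
    msg.set_content("Test message sent through the SMTP relay.")
    return msg


def send_via_relay(msg: EmailMessage, host: str, port: int = 25,
                   timeout: float = 10.0) -> None:
    """Send msg through an unauthenticated relay.

    Raises an smtplib or socket error if the relay is unreachable
    or rejects the sender or recipient.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.send_message(msg)


# Example call, run from inside the Airflow pod:
# send_via_relay(build_test_message(SENDER, RECIPIENT), RELAY_HOST, RELAY_PORT)
```

If the relay rejects the message, the exception usually says whether the problem is the sender, the recipient, or the connection itself, which narrows down which layer to look at next.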

Debug from inside the pod

When alerting fails, test from the same network context as the application.

kubectl exec -it <pod-name> -- sh

Then test DNS and connectivity:

nslookup <smtp-host>
nc -vz <smtp-host> 25

If the pod cannot reach the relay, changing Airflow settings will not fix the issue.
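Some container images ship without nslookup or nc. Since an Airflow image always has Python, the same two checks can be approximated with the standard library. The function name and the shape of the returned dictionary below are illustrative choices, not part of any tool.

```python
import socket


def check_smtp_reachability(host: str, port: int = 25,
                            timeout: float = 5.0) -> dict:
    """Resolve host, then try a TCP connection, reporting each step."""
    result = {"resolved": [], "connected": False, "error": None}

    # Step 1: name resolution (roughly what nslookup checks).
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result

    # Step 2: TCP connection (roughly what nc -vz checks).
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result


# Example call from a Python shell inside the pod (placeholder host):
# print(check_smtp_reachability("smtp-relay.internal.example", 25))
```

Separating the DNS step from the TCP step matters because the fixes differ: a failed lookup points at cluster DNS or the hostname, while a failed connection points at network policy, the relay's listener, or a firewall.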

DAG-level failure callbacks help

Global alerting is useful, but DAG-level callbacks make important jobs easier to reason about.

A pattern I like:

one default alert path
DAG-specific context in the subject
enough metadata to identify the failed task
a link back to logs or the Airflow UI
no secrets or private data in the alert body
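The pattern above can be sketched as a small formatting function plus a callback. This assumes the Airflow 2.x task context, where `context["task_instance"]` exposes `dag_id`, `task_id`, `run_id`, `try_number`, and `log_url`; the function names are hypothetical. The exception text and task parameters are deliberately left out of the body, since either can carry sensitive values.

```python
def format_failure_alert(context: dict) -> tuple[str, str]:
    """Build (subject, body) from an Airflow task context.

    Only identifiers and the log link are included; the exception
    message and task params are omitted to avoid leaking data.
    """
    ti = context["task_instance"]
    subject = f"[Airflow] {ti.dag_id}.{ti.task_id} failed"
    body = "\n".join([
        f"DAG: {ti.dag_id}",
        f"Task: {ti.task_id}",
        f"Run: {ti.run_id}",
        f"Try: {ti.try_number}",
        f"Logs: {ti.log_url}",
    ])
    return subject, body


def notify_on_failure(context: dict) -> None:
    """Hypothetical on_failure_callback for a DAG or task."""
    subject, body = format_failure_alert(context)
    # Sending is left out on purpose: hand subject and body to whatever
    # default alert path the platform uses. Printing keeps this runnable.
    print(subject)
    print(body)
```

Keeping the formatting separate from the sending makes the alert content testable with a fake context, without a running scheduler or mail relay. The callback would be attached with `on_failure_callback=notify_on_failure` on the DAG's default arguments or on an individual task.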

The bigger lesson

Workflow systems are only as reliable as their operational plumbing.

Before trusting alerts, prove the path end to end.