2025-08-23
Operational Notes on Airflow, Spark, and Alerting on Kubernetes
Practical lessons from running workflow orchestration and distributed compute in Kubernetes.
Running Airflow on Kubernetes is powerful, but it also creates a new class of operational problems.
The scheduler, workers, metadata database, message broker, ingress, secrets, DAGs, Spark jobs, and alerting all have to work together. When one layer is misconfigured, the failure can look like an application issue even when the real problem is infrastructure.
Start with the alert path
Before relying on email alerts, prove the full path works.
That means testing:
- application configuration
- Kubernetes secrets
- SMTP host reachability
- network policy
- relay permissions
- sender address
- failure callback behavior
A small test DAG is usually better than waiting for a real production failure.
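A minimal sketch of such a test DAG, assuming Airflow 2.4 or later with email alerting configured through the standard SMTP settings. The DAG id, task id, and recipient address are hypothetical placeholders, and the imports sit inside the factory function only so the failing callable can be checked without Airflow installed.

```python
from datetime import datetime


def always_fail():
    """Fail on purpose so the alert path is exercised."""
    raise RuntimeError("intentional failure: alert path smoke test")


def build_alert_smoke_test_dag(recipient="oncall@example.com"):
    """Build a manually triggered DAG whose only task always fails.

    The recipient default is a placeholder, not a real address.
    """
    # Imported here so always_fail() can be tested without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="alert_path_smoke_test",
        start_date=datetime(2025, 1, 1),
        schedule=None,   # never scheduled; trigger it by hand
        catchup=False,
        default_args={
            "email": [recipient],
            "email_on_failure": True,
            "retries": 0,  # fail immediately instead of retrying
        },
    ) as dag:
        PythonOperator(task_id="fail_on_purpose", python_callable=always_fail)
    return dag
```

Airflow only picks up DAG objects bound at module level, so the DAG file would end with `dag = build_alert_smoke_test_dag()`. Trigger it manually and confirm the message arrives in the intended mailbox.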
Keep SMTP boring
For internal platforms, the most reliable SMTP setup is often the least fancy one.
A practical pattern is:
Airflow pod → node-local or internal SMTP relay → mail system
The important part is making sure the relay accepts traffic from the pod network and that the sender address is chosen deliberately, not left at a default the mail system may reject.
Debug from inside the pod
When alerting fails, test from the same network context as the application.
kubectl exec -it <pod-name> -- sh
Then test DNS and connectivity:
nslookup <smtp-host>
nc -vz <smtp-host> 25
If the pod cannot reach the relay, changing Airflow settings will not fix the issue.
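If the image does not ship `nslookup` or `nc`, the same check can be run with the Python interpreter that is already in an Airflow pod. This is a rough sketch using only the standard library; the function name `can_connect` is made up for illustration, and the check stops at the TCP handshake rather than speaking SMTP.

```python
import socket


def can_connect(host, port=25, timeout=5.0):
    """Return (ok, detail): resolve the host, then try a TCP connection."""
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return False, f"DNS lookup failed for {host}: {exc}"
    try:
        # A completed handshake means the network path and port are open.
        with socket.create_connection((addr, port), timeout=timeout):
            return True, f"connected to {addr}:{port}"
    except OSError as exc:
        return False, f"TCP connect to {addr}:{port} failed: {exc}"
```

Run it from a Python shell inside the pod, for example `print(can_connect("<smtp-host>", 25))`. A DNS failure points at cluster DNS or the hostname. A connect failure points at network policy, the relay listener, or the port.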
DAG-level failure callbacks help
Global alerting is useful, but DAG-level callbacks let important jobs carry their own context, so a failure is easier to trace.
A pattern I like:
- one default alert path
- DAG-specific context in the subject
- enough metadata to identify the failed task
- a link back to logs or the Airflow UI
- no secrets or private data in the alert body
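One way the pattern above could look as a failure callback. This is a sketch under assumptions: Airflow 2.x, where the callback context exposes `task_instance`, `run_id`, and `exception`, and where `airflow.utils.email.send_email` uses the configured SMTP settings. The names `ALERT_RECIPIENTS`, `build_alert`, and `notify_failure` are hypothetical.

```python
import html

ALERT_RECIPIENTS = ["oncall@example.com"]  # placeholder address


def build_alert(context):
    """Return (subject, body) for a failed task from the callback context."""
    ti = context["task_instance"]
    exc = context.get("exception")
    subject = f"[airflow] {ti.dag_id}.{ti.task_id} failed"
    lines = [
        f"DAG: {ti.dag_id}",
        f"Task: {ti.task_id}",
        f"Run: {context.get('run_id')}",
        f"Try: {ti.try_number}",
        # Only the exception type: the message itself may hold private data.
        f"Error type: {type(exc).__name__ if exc is not None else 'unknown'}",
        f"Logs: {ti.log_url}",
    ]
    return subject, "\n".join(lines)


def notify_failure(context):
    """Failure callback: email the alert through Airflow's SMTP settings."""
    # Lazy import so build_alert() can be tested without Airflow installed.
    from airflow.utils.email import send_email

    subject, body = build_alert(context)
    html_body = "<br>".join(html.escape(line) for line in body.splitlines())
    send_email(to=ALERT_RECIPIENTS, subject=subject, html_content=html_body)
```

Passing `"on_failure_callback": notify_failure` in a DAG's `default_args` attaches it to every task in that DAG. Keeping the message builder separate from the send step makes it easy to check that nothing sensitive ends up in the body.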
The bigger lesson
Workflow systems are only as reliable as their operational plumbing.
Before trusting alerts, prove the path end to end.