2024-02-07
Troubleshooting NVIDIA GPU Operator on Kubernetes
A practical, sanitized checklist for debugging GPU nodes, device plugins, drivers, runtime configuration, and validator pods.
GPU problems in Kubernetes usually show up as a pod stuck in Init, a device plugin that never becomes ready, or a workload that requests a GPU but cannot see one inside the container.
This checklist works upward from the host: the node itself, its labels, the GPU Operator pods, the container runtime, and finally a test pod.
Start at the node
Before debugging Kubernetes objects, confirm the node can see the GPU.
nvidia-smi
lsmod | grep nvidia
cat /proc/driver/nvidia/version
If nvidia-smi fails on the host, Kubernetes is not the first problem. Fix the driver, kernel module, Secure Boot state, or hardware visibility first.
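The three host checks can be wrapped so that each one reports pass or fail on its own line. This is a minimal sketch, not part of any NVIDIA tooling; the helper names run_check and host_gpu_checks are made up for illustration.

```shell
# run_check LABEL CMD [ARGS...]: run a command quietly and report ok/FAIL.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# host_gpu_checks: the three host-level checks from this section.
# Returns non-zero if any of them fails.
host_gpu_checks() {
    failed=0
    run_check "nvidia-smi runs" nvidia-smi || failed=1
    run_check "nvidia kernel module loaded" sh -c 'lsmod | grep -q nvidia' || failed=1
    run_check "driver version file readable" test -r /proc/driver/nvidia/version || failed=1
    return "$failed"
}

# Usage, on the GPU node itself:
#   host_gpu_checks || echo "fix the host before looking at Kubernetes"
```

Running all three checks even after the first failure shows whether the problem is the whole driver stack or only one layer of it.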
Confirm the node labels
The GPU Operator and GPU Feature Discovery usually add labels under the nvidia.com/ prefix, such as nvidia.com/gpu.present and nvidia.com/gpu.product.
kubectl get nodes --show-labels | grep -i nvidia
kubectl describe node <gpu-node-name> | grep -i nvidia -A5 -B5
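The --show-labels output puts every label on one long comma-separated line, which is hard to scan. A small filter can split it and keep only the nvidia.com/ entries. This is a sketch; nvidia_labels is a hypothetical helper name, and it assumes the labels are the last whitespace-separated column of the kubectl output.

```shell
# nvidia_labels: read a comma-separated label list on stdin and
# print only the nvidia.com/ labels, one per line, sorted.
nvidia_labels() {
    tr ',' '\n' | grep '^nvidia\.com/' | sort
}

# Usage (the labels are the last column of the --show-labels output):
#   kubectl get node <gpu-node-name> --show-labels --no-headers \
#     | awk '{print $NF}' | nvidia_labels
```

If the filter prints nothing, the node has no nvidia.com/ labels at all, which usually points at the operator or GPU Feature Discovery pods rather than the workload.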
You want to see GPU capacity advertised:
kubectl describe node <gpu-node-name> | grep -A10 -i capacity
Look for something like:
nvidia.com/gpu: 1
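Capacity and Allocatable are separate sections in the describe output, and a mismatch between them is worth noticing. The sketch below pulls the nvidia.com/gpu value out of each. It assumes the usual layout of kubectl describe node, with Capacity: and Allocatable: headings at the start of a line; gpu_counts is a hypothetical helper name.

```shell
# gpu_counts: read `kubectl describe node` output on stdin and print the
# nvidia.com/gpu value found under Capacity and under Allocatable.
gpu_counts() {
    awk '
        /^Capacity:/    { section = "capacity" }
        /^Allocatable:/ { section = "allocatable" }
        # Any other top-level heading ends the section we care about.
        /^[A-Z]/ && !/^(Capacity|Allocatable):/ { section = "" }
        section != "" && $1 == "nvidia.com/gpu:" { print section "=" $2 }
    '
}

# Usage:
#   kubectl describe node <gpu-node-name> | gpu_counts
```

No output means the node is not advertising nvidia.com/gpu at all, so the device plugin is the next thing to check.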
Check the GPU Operator namespace
kubectl get pods -n gpu-operator -o wide
Common failure areas:
- driver daemonset
- container toolkit daemonset
- device plugin
- GPU feature discovery
- validator pods
Read logs from the failing component:
kubectl logs -n gpu-operator <pod-name>
kubectl describe pod -n gpu-operator <pod-name>
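With several daemonsets in the namespace, it helps to list only the pods that are not healthy before reading logs. The filter below is a sketch that assumes the default kubectl get pods column order (NAME, READY, STATUS); unhealthy_pods is a hypothetical helper name.

```shell
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and
# print pods that are neither Completed nor fully ready.
unhealthy_pods() {
    awk '
        $3 == "Completed" { next }                # finished validator pods are fine
        $3 != "Running"   { print $1, $3; next }  # Init:*, CrashLoopBackOff, Pending, ...
        {
            split($2, ready, "/")                 # READY column looks like 1/2
            if (ready[1] != ready[2]) print $1, $3, $2
        }
    '
}

# Usage:
#   kubectl get pods -n gpu-operator --no-headers | unhealthy_pods
# Then, for each pod it prints:
#   kubectl logs -n gpu-operator <pod-name> --all-containers --tail=50
```

The --all-containers flag matters here because many GPU Operator pods do their real work in init containers, which a plain kubectl logs does not show.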
Runtime configuration matters
A common failure mode is installing the driver but not wiring containerd to the NVIDIA runtime.
Check the runtime configuration:
containerd config dump | grep -i nvidia -n
If the NVIDIA runtime is missing, configure it with the NVIDIA container toolkit.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
sudo systemctl restart kubelet
Then check again:
containerd config dump | grep -i nvidia -n
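The before-and-after check can be turned into a yes/no answer for use in scripts. This sketch assumes the runtime appears in the dumped config as a TOML table whose name ends in containerd.runtimes.nvidia, which is how it commonly shows up; has_nvidia_runtime is a hypothetical helper name.

```shell
# has_nvidia_runtime: read a containerd config on stdin and succeed if it
# defines a runtime table named "nvidia".
has_nvidia_runtime() {
    grep -Eq 'containerd\.runtimes\.nvidia\]'
}

# Usage, on the GPU node:
#   if containerd config dump | has_nvidia_runtime; then
#       echo "nvidia runtime is configured"
#   else
#       echo "nvidia runtime is missing"
#   fi
```

Matching the table header rather than the word nvidia avoids false positives from comments or unrelated paths in the config.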
Test with a simple pod
Use a minimal GPU test before blaming the application.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
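Running the smoke test is a short apply, wait, read, delete cycle. The sketch below assumes the manifest has been saved to a file (the name gpu-smoke-test.yaml in the usage comment is only an example) and that the cluster's kubectl supports wait --for=jsonpath. The run_smoke_test function and the KUBECTL override are hypothetical conveniences, not part of kubectl.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL="kubectl --context=dev".
KUBECTL=${KUBECTL:-kubectl}

# run_smoke_test MANIFEST: apply the pod, wait for it to succeed,
# print its logs, and clean up. Returns non-zero if the pod did not succeed.
run_smoke_test() {
    manifest=$1
    pod=gpu-smoke-test    # must match metadata.name in the manifest
    $KUBECTL apply -f "$manifest" || return 1
    if $KUBECTL wait --for=jsonpath='{.status.phase}'=Succeeded "pod/$pod" --timeout=120s; then
        result=0
    else
        result=1
        # Events usually explain scheduling or runtime failures.
        $KUBECTL describe pod "$pod"
    fi
    $KUBECTL logs "$pod"
    $KUBECTL delete pod "$pod" --wait=false
    return "$result"
}

# Usage:
#   run_smoke_test gpu-smoke-test.yaml
```

A successful run prints the nvidia-smi table from inside the container. If the pod stays Pending, the describe output normally shows whether the scheduler could not find nvidia.com/gpu or the nvidia RuntimeClass.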
Keep the blast radius small
When changing GPU runtime behavior, test one node first.
A safe pattern:
- cordon the GPU node
- drain non-critical workloads
- change driver/runtime/toolkit configuration
- reboot if needed
- run nvidia-smi on the host
- run a GPU smoke-test pod
- uncordon
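The pattern above can be sketched as a wrapper that only uncordons the node when the change and its verification succeed. Note that kubectl drain evicts every pod not managed by a DaemonSet, not just non-critical ones, so treat this as a starting point. maintain_gpu_node and the KUBECTL override are hypothetical names.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL="kubectl --context=dev".
KUBECTL=${KUBECTL:-kubectl}

# maintain_gpu_node NODE CMD [ARGS...]: cordon and drain NODE, run CMD
# (the change plus its checks), and uncordon only if CMD succeeds.
maintain_gpu_node() {
    node=$1
    shift
    $KUBECTL cordon "$node" || return 1
    $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    if "$@"; then
        $KUBECTL uncordon "$node"
    else
        echo "change or verification failed; $node left cordoned" >&2
        return 1
    fi
}

# Usage (apply_and_verify is whatever script makes the change on the node,
# runs nvidia-smi there, and then runs the smoke-test pod):
#   maintain_gpu_node <gpu-node-name> ./apply_and_verify
```

Leaving the node cordoned on failure keeps new GPU workloads away from it until someone has looked at what went wrong.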
GPU debugging gets much easier when every step answers one question at a time.