2024-02-07

Troubleshooting NVIDIA GPU Operator on Kubernetes

A practical, sanitized checklist for debugging GPU nodes, device plugins, drivers, runtime configuration, and validator pods.

GPU problems in Kubernetes usually show up as a pod stuck in Init, a device plugin that never becomes ready, or a workload that requests a GPU but cannot see one inside the container.

The checklist below works outward from the host: the node itself, node labels and capacity, the operator's pods, the container runtime, and finally a minimal test pod.

Start at the node

Before debugging Kubernetes objects, confirm that the GPU node itself can see the GPU. Run these on the host:

nvidia-smi
lsmod | grep nvidia
cat /proc/driver/nvidia/version

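If nvidia-smi fails on the host, Kubernetes is not the first problem. Fix the driver, kernel module, Secure Boot state, or hardware visibility first. One caveat: when the GPU Operator deploys the driver in a container, nvidia-smi may not be installed on the host at all; in that case, run it inside the driver daemonset pod instead.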
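
The three host checks can be wrapped so that the first failing layer is obvious. This is a minimal sketch that assumes a host-installed driver; run_check and host_gpu_checks are hypothetical helper names, not part of any NVIDIA tooling.

```shell
# Run a command quietly and print OK or FAIL with a label.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# Check from the lowest layer up and stop at the first failure.
host_gpu_checks() {
    run_check "nvidia kernel module loaded" grep -q '^nvidia ' /proc/modules || return 1
    run_check "driver version file readable" test -r /proc/driver/nvidia/version || return 1
    run_check "nvidia-smi responds" nvidia-smi || return 1
}
```

Calling host_gpu_checks on the node prints one line per check. A failure on the first line points at the kernel module rather than at anything in Kubernetes.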

Confirm the node labels

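The GPU Operator and GPU Feature Discovery label GPU nodes with keys under the nvidia.com/ prefix, such as nvidia.com/gpu.present and nvidia.com/gpu.product. Missing labels suggest that discovery never ran on that node.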

kubectl get nodes --show-labels | grep -i nvidia
kubectl describe node <gpu-node-name> | grep -i nvidia -A5 -B5

You want to see GPU capacity advertised:

kubectl describe node <gpu-node-name> | grep -A10 -i capacity

Look for something like:

nvidia.com/gpu: 1
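
Capacity is what the node reports, while the scheduler places pods against Allocatable, so it can help to read that section as well. The sketch below pulls the GPU count out of the describe output; allocatable_gpus is a hypothetical helper name, and the parsing assumes the usual indented layout of kubectl describe node.

```shell
# Reads `kubectl describe node` output on stdin and prints the nvidia.com/gpu
# value from the Allocatable section (prints 0 if the resource is absent).
allocatable_gpus() {
    awk '
        /^Allocatable:/ { in_alloc = 1; next }   # entering the section
        /^[^ \t]/       { in_alloc = 0 }         # any other top-level key ends it
        in_alloc && $1 == "nvidia.com/gpu:" { count = $2 }
        END { print count + 0 }
    '
}
```

Usage: kubectl describe node <gpu-node-name> | allocatable_gpus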

Check the GPU Operator namespace

kubectl get pods -n gpu-operator -o wide

Common failure areas:

  • driver daemonset
  • container toolkit daemonset
  • device plugin
  • GPU feature discovery
  • validator pods

Read logs from the failing component:

kubectl logs -n gpu-operator <pod-name>
kubectl describe pod -n gpu-operator <pod-name>
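
To find the failing component quickly, the pod listing can be filtered down to pods that are neither fully Ready nor Completed. This is a rough sketch: unhealthy_pods is a hypothetical helper name, and it assumes the default kubectl get pods column order (NAME, READY, STATUS).

```shell
# Reads `kubectl get pods` output on stdin and prints "name status" for pods
# that are neither Completed nor fully Ready and Running.
unhealthy_pods() {
    awk '
        NR == 1 { next }                  # skip the header row
        $3 == "Completed" { next }        # finished one-shot pods are fine
        {
            split($2, ready, "/")         # READY column looks like "1/2"
            if (ready[1] != ready[2] || $3 != "Running") print $1, $3
        }
    '
}
```

Usage: kubectl get pods -n gpu-operator | unhealthy_pods

For a pod stuck in Init, the interesting logs belong to an init container, so pass its name explicitly: kubectl logs -n gpu-operator <pod-name> -c <container-name>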

Runtime configuration matters

A common failure mode is installing the driver but not wiring containerd to the NVIDIA runtime.

Check the runtime configuration:

containerd config dump | grep -i nvidia -n

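If the NVIDIA runtime is missing, configure it with the NVIDIA Container Toolkit. The commands below apply when the toolkit is installed on the host; if the operator's container toolkit daemonset manages containerd for you, read that pod's logs instead of editing the host by hand.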

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
sudo systemctl restart kubelet

Then check again:

containerd config dump | grep -i nvidia -n
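
Two things are worth reading out of that dump: whether a runtime handler named nvidia is registered, and which runtime is the default. The sketch below does both; the function names are hypothetical, and the patterns assume the TOML key layout that containerd config dump normally prints.

```shell
# Both functions read `containerd config dump` output on stdin.

# Succeeds if a runtime handler named exactly "nvidia" is registered.
has_nvidia_runtime() {
    grep -Eq 'containerd\.runtimes\.nvidia[].]'
}

# Prints the CRI default runtime name (for example runc or nvidia).
default_runtime() {
    grep -m1 'default_runtime_name' | cut -d= -f2 | tr -d " \t\"'"
}
```

Usage: containerd config dump | has_nvidia_runtime && echo "nvidia runtime registered"

If the handler is registered but the default is still runc, pods only get the NVIDIA runtime when they select it explicitly, for example through a RuntimeClass.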

Test with a simple pod

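Use a minimal GPU test before blaming the application. The manifest below assumes a RuntimeClass named nvidia exists (kubectl get runtimeclass lists them); if nvidia is already the default containerd runtime, the runtimeClassName line can be dropped.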

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
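
One way to run the manifest and collect the result is sketched below. It assumes the manifest is saved as gpu-smoke-test.yaml (a hypothetical file name) and that kubectl is version 1.23 or newer, which is needed for the jsonpath form of kubectl wait. run_gpu_smoke_test is a hypothetical helper name.

```shell
# Apply the smoke-test manifest, wait for the pod to finish, and show the result.
run_gpu_smoke_test() {
    manifest=${1:-gpu-smoke-test.yaml}   # assumed file name for the manifest above
    kubectl apply -f "$manifest" || return 1
    # A pod stuck in Pending or Init never reaches Succeeded and times out here.
    if kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
        pod/gpu-smoke-test --timeout=120s; then
        kubectl logs gpu-smoke-test       # should contain the nvidia-smi table
    else
        kubectl describe pod gpu-smoke-test   # events explain scheduling or runtime errors
        return 1
    fi
}
```

On success the logs show the same nvidia-smi table you saw on the host. Clean up afterwards with kubectl delete pod gpu-smoke-test.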

Keep the blast radius small

When changing GPU runtime behavior, test one node first.

A safe pattern:

  1. cordon the GPU node
  2. drain non-critical workloads
  3. change driver/runtime/toolkit configuration
  4. reboot if needed
  5. run nvidia-smi on the host
  6. run a GPU smoke-test pod
  7. uncordon
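
Steps 1, 2, and 7 can be sketched as two small wrappers around kubectl. The function names are hypothetical and the timeout is an arbitrary choice. Note that kubectl drain evicts every pod that is not managed by a DaemonSet, which is broader than "non-critical"; narrow it with --pod-selector if some workloads must stay.

```shell
# Steps 1-2: stop new scheduling, then evict workloads from the node.
gpu_node_maintenance_begin() {
    node=$1
    kubectl cordon "$node" || return 1
    # DaemonSet pods (including the operator's own) stay in place.
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

# Step 7: allow scheduling again once the host and smoke-test checks pass.
gpu_node_maintenance_end() {
    node=$1
    kubectl uncordon "$node"
}
```

Usage: gpu_node_maintenance_begin <gpu-node-name>, make the change and verify it, then gpu_node_maintenance_end <gpu-node-name>.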

GPU debugging gets much easier when every step answers one question at a time.