NVIDIA GPU Support#
Enable GPU access in your K3s cluster with the NVIDIA device plugin, including support for GPU time-slicing to allow multiple pods to share a single GPU.
Overview#
The NVIDIA device plugin for Kubernetes enables:
- GPU discovery and advertisement to the cluster
- GPU resource scheduling
- GPU time-slicing for improved utilization
- Multiple users/pods accessing GPUs simultaneously
Prerequisites#
Host System Requirements#
- NVIDIA Drivers: Install NVIDIA drivers on the host system
```bash
# Check if drivers are installed
nvidia-smi
```

- NVIDIA Container Toolkit: Required for container GPU access
```bash
# Install NVIDIA Container Toolkit
# See: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure the Docker daemon to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
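Note that if K3s runs against its embedded containerd rather than Docker, the Docker-specific step above does not apply; what matters is that the toolkit is installed before K3s starts, since K3s detects nvidia-container-runtime when it generates its containerd configuration. A minimal sanity check, assuming a default systemd-managed K3s install:

```bash
# The toolkit binaries should be on the PATH before K3s (re)starts
which nvidia-ctk nvidia-container-runtime

# Restart K3s so it regenerates its containerd config and picks up the runtime
# (use k3s-agent instead on agent-only nodes)
sudo systemctl restart k3s
```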
Installation#
Deploy NVIDIA Device Plugin#
Run the installation script from the repository:
```bash
# From the repository root
bash nvidia/nvidia-device-plugin.sh
```

This script (an equivalent manual installation is sketched after this list):
- Deploys the NVIDIA device plugin as a DaemonSet
- Configures GPU time-slicing (8 slices per GPU by default)
- Sets up the necessary ConfigMaps
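For reference, a roughly equivalent manual installation (not necessarily what the script does) uses the upstream NVIDIA Helm chart; the exact values for wiring in the time-slicing ConfigMap vary by chart version, so treat this as a sketch:

```bash
# Add the upstream device plugin chart and install it into kube-system;
# run `helm show values nvdp/nvidia-device-plugin` to see how the chart
# expects time-slicing configuration to be supplied.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system
```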
Verify Installation#
```bash
# Check that the device plugin pods are running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised
kubectl describe nodes | grep nvidia.com/gpu
```

You should see output like:

```
nvidia.com/gpu: 8
```

The number represents the total number of GPU slices available, not the number of physical GPUs.
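To see the advertised slice count per node without grepping describe output, the allocatable resources can be queried directly (this assumes jq is available, as used in the Troubleshooting section below):

```bash
# Print each node's name and its advertised GPU slice count
kubectl get nodes -o json | \
  jq -r '.items[] | [.metadata.name, (.status.allocatable["nvidia.com/gpu"] // "0")] | @tsv'
```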
Configuration#
GPU Time-Slicing#
GPU time-slicing allows multiple pods to share a single GPU, improving utilization. Our default configuration sets up 8 time slices per GPU.
The configuration is defined in nvidia-device-plugin-config.yaml:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        replicas: 8
```

Adjusting Time Slices: To change the number of slices, edit the replicas value and reapply:
```bash
kubectl apply -f nvidia/nvidia-device-plugin-config.yaml
kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n kube-system
```
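After reapplying, it is worth confirming that the DaemonSet finished rolling out and that nodes re-advertise the new slice count:

```bash
# Wait for the restarted plugin pods to become ready, then re-check capacity
kubectl rollout status daemonset nvidia-device-plugin-daemonset -n kube-system
kubectl describe nodes | grep nvidia.com/gpu
```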
Resource Limits in Pods#
To request GPU resources in your pods:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU slice
```
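Because each nvidia.com/gpu: 1 request consumes one time slice rather than a whole card, several such pods can land on the same physical GPU (time-slicing shares compute but provides no memory isolation between pods). A hypothetical example using a made-up name, gpu-sharing-demo, with four one-slice replicas that can fit on a single 8-slice GPU:

```bash
# Illustrative only: four one-slice replicas that can share one physical GPU
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sharing-demo
spec:
  replicas: 4
  selector:
    matchLabels:
      app: gpu-sharing-demo
  template:
    metadata:
      labels:
        app: gpu-sharing-demo
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda:12.0.0-base-ubuntu22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```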
Usage in JupyterHub#
GPU access can be configured in JupyterHub’s config.yaml:
```yaml
singleuser:
  profileList:
    - display_name: "GPU Instance"
      description: "Notebook with GPU access"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"
        extra_resource_guarantees:
          nvidia.com/gpu: "1"
```

Users can verify GPU access within their notebooks:
```python
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)
```

Testing GPU Access#
Test Job#
Deploy a test job to verify GPU functionality:
```bash
kubectl apply -f nvidia/test-nvidia-smi-job.yaml
```

Check the job output:
```bash
# Get the pod name
kubectl get pods | grep test-nvidia-smi

# View logs
kubectl logs test-nvidia-smi-xxxxx
```

You should see the nvidia-smi output showing your GPU.
Manual Test Pod#
Save the following manifest as gpu-test.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

```bash
kubectl apply -f gpu-test.yaml
kubectl logs gpu-test
kubectl delete pod gpu-test
```

Troubleshooting#
GPUs Not Visible#
- Check device plugin pods:

```bash
kubectl get pods -n kube-system | grep nvidia
kubectl logs -n kube-system <nvidia-device-plugin-pod>
```

- Verify NVIDIA runtime:

```bash
# On the host
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```

- Check the node's allocatable GPU resources:

```bash
kubectl get nodes -o json | jq '.items[].status.allocatable'
```

Pods Can’t Access GPU#
- Check resource requests:

```bash
kubectl describe pod <pod-name>
```

- Verify container runtime configuration:

```bash
sudo systemctl status containerd
```

- Check the K3s containerd config at /var/lib/rancher/k3s/agent/etc/containerd/config.toml (see the sketch below).
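For the last step above, a hedged sketch of what to look for; if the generated config lists an nvidia runtime but it is not the node’s default runtime, GPU pods may additionally need to opt in via a RuntimeClass (whether this applies depends on how K3s was configured):

```bash
# The K3s-generated containerd config should contain an nvidia runtime entry
sudo grep -i -A 3 'nvidia' /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# If the nvidia runtime is present but not the default, create a RuntimeClass
# and set runtimeClassName: nvidia on GPU pods (and on the device plugin pods)
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```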
Time-Slicing Not Working#
- Verify ConfigMap:

```bash
kubectl get configmap nvidia-device-plugin-config -n kube-system -o yaml
```

- Check device plugin version: ensure you’re using a recent version that supports time-slicing
- Restart device plugin:

```bash
kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n kube-system
```
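The device plugin typically logs the sharing configuration it loaded at startup, which is a quick way to confirm the ConfigMap was actually picked up after the restart:

```bash
# Inspect the restarted plugin's startup log for the time-slicing settings
kubectl logs -n kube-system daemonset/nvidia-device-plugin-daemonset | grep -i -A 5 'timeslicing'
```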
Resource Management#
Monitoring GPU Usage#
Monitor GPU utilization on the host:
```bash
# Continuous monitoring
watch -n 1 nvidia-smi

# One-time check
nvidia-smi
```
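For longer observation windows (e.g. to judge whether the configured slice count matches actual demand), nvidia-smi can also emit a CSV stream of utilization and memory figures:

```bash
# Sample GPU utilization and memory every 5 seconds in CSV form
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total \
  --format=csv -l 5
```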
From within the cluster:

```bash
kubectl exec -it <pod-name> -- nvidia-smi
```

GPU Resource Quotas#
To limit GPU usage per namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: my-namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    limits.nvidia.com/gpu: "4"
```
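Once the quota is applied, current consumption against it can be checked with a standard describe (the namespace and quota name are the ones from the example above):

```bash
# Shows Used vs Hard for the GPU resources covered by the quota
kubectl describe resourcequota gpu-quota -n my-namespace
```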
Best Practices#
- Time-Slicing: Use time-slicing for workloads that don’t fully utilize the GPU
- Resource Limits: Always set GPU resource limits to prevent pods from requesting more than needed
- Monitoring: Regularly monitor GPU utilization to optimize time-slice configuration
- MIG (Multi-Instance GPU): For supported GPUs (A100, H100), consider MIG for better isolation instead of time-slicing