vLLM#
Deploy vLLM for high-throughput LLM inference with GPU acceleration.
Overview#
vLLM is a fast and easy-to-use library for LLM inference and serving. It features:
- State-of-the-art serving throughput
- Efficient memory management with PagedAttention
- Continuous batching of requests
- Optimized CUDA kernels
- Support for popular models (Llama, Mistral, GPT, etc.)
Prerequisites#
- K3s installed
- NVIDIA GPU support configured (a quick check is sketched below)
- Sufficient GPU memory for your chosen model
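A quick way to confirm the GPU is schedulable is to list what each node advertises as allocatable. The sketch below uses the official kubernetes Python client; it assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource and that your kubeconfig points at the K3s cluster:

# Sketch: verify that at least one node advertises an allocatable NVIDIA GPU.
# Assumes the `kubernetes` package is installed and kubeconfig targets the K3s cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: nvidia.com/gpu={gpus}")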
Deployment#
The vllm/ directory contains Kubernetes manifests for deploying vLLM.
Quick Start#
cd vllm
# Deploy vLLM
./up.sh
# Check status
kubectl get pods -n vllm
# View logs
kubectl logs -n vllm deployment/vllm
# Access the service
kubectl get ingress -n vllm
Configuration Files#
- deployment.yaml - vLLM deployment with GPU
- service.yaml - Service for cluster access
- ingress.yaml - External HTTPS access
- up.sh - Deploy script
- down.sh - Cleanup script
Deployment Configuration#
The deployment is configured to:
- Request 1 GPU
- Mount model cache volume
- Expose OpenAI-compatible API
- Auto-download models on first run
Example deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-2-7b-chat-hf
        - --dtype
        - float16
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: cache
          mountPath: /root/.cache
      volumes:
      - name: cache
        emptyDir: {}
Usage#
Access the API#
Once deployed, vLLM provides an OpenAI-compatible API:
# Get the ingress URL
kubectl get ingress -n vllm
# Example: https://vllm.carlboettiger.info
API Examples#
Using curl:
curl -X POST https://vllm.carlboettiger.info/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
Using Python with OpenAI client:
from openai import OpenAI
client = OpenAI(
    base_url="https://vllm.carlboettiger.info/v1",
    api_key="not-needed"  # vLLM doesn't require API key by default
)
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="San Francisco is a",
    max_tokens=50,
    temperature=0.7
)
print(response.choices[0].text)
Chat completion:
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100
)
print(response.choices[0].message.content)
Streaming Responses#
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Tell me a story"}
    ],
    max_tokens=200,
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Configuration#
Change Model#
Edit deployment.yaml to use a different model:
args:
- --model
- mistralai/Mistral-7B-Instruct-v0.2  # Change this
- --dtype
- float16
Popular models:
- meta-llama/Llama-2-7b-chat-hf
- meta-llama/Llama-2-13b-chat-hf
- mistralai/Mistral-7B-Instruct-v0.2
- tiiuae/falcon-7b-instruct
Note: Ensure your GPU has sufficient memory for the model.
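As a rough sizing guide (a sketch that covers the weights only, not the KV cache or runtime overhead), the memory needed is approximately the parameter count times the bytes per parameter of the chosen dtype:

# Rough, back-of-the-envelope estimate of GPU memory needed for model weights.
# Ignores KV cache, activations, and CUDA overhead, so treat it as a lower bound.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "awq-4bit": 0.5}

def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(7, "float16"))   # ~14 GB for a 7B model at float16
print(weight_memory_gb(13, "float16"))  # ~26 GB for a 13B model at float16

Note that vLLM also pre-allocates KV-cache space up to the --gpu-memory-utilization fraction, so leave headroom beyond the weight estimate.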
Persistent Model Cache#
Use a PersistentVolumeClaim to cache models:
volumes:
- name: cache
  persistentVolumeClaim:
    claimName: vllm-cache
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-cache
  namespace: vllm
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: openebs-zfs
  resources:
    requests:
      storage: 50Gi
Quantization#
Use quantization for larger models:
args:
- --model
- meta-llama/Llama-2-13b-chat-hf
- --quantization
- awq  # or 'gptq', 'squeezellm'
- --dtype
- float16
Tensor Parallelism#
For multi-GPU setups:
args:
- --model
- meta-llama/Llama-2-70b-chat-hf
- --tensor-parallel-size
- "4"
resources:
  limits:
    nvidia.com/gpu: 4
Monitoring#
Check Logs#
kubectl logs -n vllm deployment/vllm -f
GPU Usage#
# On the host
nvidia-smi
# Or from the pod
kubectl exec -n vllm deployment/vllm -- nvidia-smi
Metrics#
vLLM exposes Prometheus-format metrics at /metrics:
curl https://vllm.carlboettiger.info/metrics
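For an ad-hoc look without a Prometheus scrape, a short sketch (using the requests package and the example ingress URL above) fetches the endpoint and prints only the vLLM-specific series:

# Sketch: dump vLLM's Prometheus metrics (series prefixed with "vllm") for a quick check.
import requests

resp = requests.get("https://vllm.carlboettiger.info/metrics", timeout=10)
resp.raise_for_status()

for line in resp.text.splitlines():
    if line.startswith("vllm"):  # skip comments and generic Python/process metrics
        print(line)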
Troubleshooting#
Pod Not Starting#
# Check pod status
kubectl describe pod -n vllm <pod-name>
# Common issues:
# - GPU not available
# - Insufficient GPU memory
# - Model download failure
Out of Memory#
- Use smaller model: Switch to 7B instead of 13B
- Enable quantization: Use AWQ or GPTQ
- Adjust max tokens: Limit max_model_len
args:
- --model
- meta-llama/Llama-2-7b-chat-hf
- --max-model-len
- "2048"
Model Download Issues#
- Check internet connectivity:
kubectl exec -n vllm deployment/vllm -- ping huggingface.co
- Use Hugging Face token for gated models:
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token
- Pre-download models: Download models to the persistent volume first (see the sketch below)
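One way to do the pre-download is with huggingface_hub.snapshot_download from a one-off pod or job that mounts the same cache volume at /root/.cache. The repo id, cache path, and HF_TOKEN handling below are illustrative assumptions matching the deployment above:

# Sketch: pre-download a model into the cache directory that the deployment mounts.
# Run this where the vllm-cache volume is mounted at /root/.cache.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",   # model to pre-fetch
    cache_dir="/root/.cache/huggingface/hub",  # same cache path the vLLM container sees
    token=os.environ.get("HF_TOKEN"),          # only required for gated models
)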
API Not Responding#
- Check service:
kubectl get svc -n vllm
kubectl describe svc vllm-service -n vllm
- Check ingress:
kubectl get ingress -n vllm
kubectl describe ingress vllm-ingress -n vllm
- Test internally:
kubectl run -it --rm test --image=curlimages/curl --restart=Never -- \
  curl http://vllm-service.vllm.svc.cluster.local:8000/health
Advanced Configuration#
Enable Authentication#
Add API key authentication:
args:
- --model
- meta-llama/Llama-2-7b-chat-hf
- --api-key
- $(API_KEY)
env:
- name: API_KEY
  valueFrom:
    secretKeyRef:
      name: vllm-secret
      key: api-key
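Clients must then present the key as a bearer token; with the OpenAI client that is just the api_key argument. A sketch against the example endpoint, assuming the key is available to the client in a VLLM_API_KEY environment variable:

# Sketch: call the protected endpoint. The OpenAI client sends the key as
# "Authorization: Bearer <key>", which is what vLLM's --api-key check expects.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://vllm.carlboettiger.info/v1",
    api_key=os.environ["VLLM_API_KEY"],  # the value stored in the vllm-secret
)

response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="San Francisco is a",
    max_tokens=20,
)
print(response.choices[0].text)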
Custom Ingress Rules#
Restrict access by IP:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: default-ipwhitelist@kubernetescrd
Resource Limits#
Adjust CPU and memory:
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "16Gi"
    cpu: "4"
  requests:
    nvidia.com/gpu: 1
    memory: "8Gi"
    cpu: "2"
Performance Tuning#
Batch Size#
args:
- --max-num-batched-tokens
- "4096"
- --max-num-seqs
- "256"
GPU Memory Utilization#
args:
- --gpu-memory-utilization
- "0.9"  # Use 90% of GPU memory
Speculative Decoding#
args:
- --model
- meta-llama/Llama-2-70b-chat-hf
- --speculative-model
- meta-llama/Llama-2-7b-chat-hf
- --num-speculative-tokens
- "5"
Cleanup#
cd vllm
./down.sh
Or manually:
kubectl delete namespace vllm