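Last month our team decided to deploy DeepSeek V4 on our own GPU cluster to run internal code and documentation generation tasks. Honestly, getting the model to run isn't hard; the hard part is keeping it stable in production. It took me about a week to get the whole chain working, from containerization and K8s orchestration to Prometheus monitoring. This post collects the pitfalls I hit and the setup we ended up with, in the hope that it saves you a few detours.
A production-grade DeepSeek V4 deployment needs three layers: containerization (Dockerfile + vLLM), K8s orchestration (Deployment + HPA + Service), and end-to-end monitoring with Prometheus + Grafana. This post gives the complete configuration files for each layer, all tested in our environment.
Conclusions first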
| Stage | Approach | Core tools | Pain level |
|---|---|---|---|
| Model inference | vLLM 0.8.x | vllm serve | ⭐⭐ |
| Containerization | Multi-stage Dockerfile | CUDA 12.4 + Python 3.11 | ⭐⭐⭐ |
| Orchestration | K8s Deployment + HPA | GPU resource limits + custom-metric autoscaling | ⭐⭐⭐⭐ |
| Monitoring and alerting | Prometheus + Grafana | vLLM built-in metrics + custom exporter | ⭐⭐⭐ |
| Log collection | Loki + Promtail | Structured logs | ⭐⭐ |
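Environment setup
My test environment:
- GPU: 4 × NVIDIA A100 80GB (the full-precision DeepSeek V4 needs at least 2 × A100 80GB; one card is enough for a quantized build)
- K8s: v1.29, with the NVIDIA GPU Operator installed
- OS: Ubuntu 22.04
- Storage: model weights on an NFS mount, about 140GB (full BF16 version)
First, confirm that the GPU Operator is healthy: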
# Check that the GPUs are detected correctly
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Should print the GPU count of each node, e.g. "4"
# Check that nvidia-device-plugin is running
kubectl get pods -n gpu-operator | grep nvidia-device-plugin
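If you would rather script this check, the same information can be pulled out of the kubectl JSON in Python. This is only a minimal sketch: the function name gpu_capacity_by_node and the sample payload are mine, not part of any tool. In practice you would feed it the parsed output of kubectl get nodes -o json.

```python
import json

def gpu_capacity_by_node(nodes: dict) -> dict:
    """Map node name -> GPU count from `kubectl get nodes -o json` output."""
    result = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        capacity = item.get("status", {}).get("capacity", {})
        # Nodes without the device plugin simply lack this key.
        result[name] = int(capacity.get("nvidia.com/gpu", "0"))
    return result

# Hypothetical sample: one GPU node and one CPU-only node.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-1"}, "status": {"capacity": {"nvidia.com/gpu": "4"}}},
  {"metadata": {"name": "cpu-node-1"}, "status": {"capacity": {"cpu": "32"}}}
]}
""")
print(gpu_capacity_by_node(sample))
```

A node that reports 0 here is a hint that the device plugin is not running on it.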
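Step 1: Containerizing with a Dockerfile
This Dockerfile cost me two days. The biggest pitfall was compatibility between the CUDA version and vLLM: with CUDA 12.6, some vLLM kernels failed to compile, and dropping back to 12.4 fixed it.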
# ============================================
# Production Dockerfile for DeepSeek V4
# vLLM inference engine on CUDA 12.4
# ============================================
# Stage 1: base runtime environment
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.11 python3.11-venv python3-pip \
        curl wget git \
    && rm -rf /var/lib/apt/lists/*
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
# Stage 2: install dependencies
FROM base AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Stage 3: final image
FROM base AS runtime
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/dist-packages /usr/local/lib/python3.11/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin
# Health check script
COPY healthcheck.sh /app/healthcheck.sh
RUN chmod +x /app/healthcheck.sh
# Expose the vLLM API port and a metrics port
# (vLLM serves its own /metrics on the API port, 8000)
EXPOSE 8000
EXPOSE 9090
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD /app/healthcheck.sh
# Start the vLLM inference server
ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]
CMD [ \
    "--model", "/models/deepseek-v4", \
    "--tensor-parallel-size", "2", \
    "--max-model-len", "32768", \
    "--gpu-memory-utilization", "0.92", \
    "--enable-prefix-caching", \
    "--port", "8000", \
    "--served-model-name", "deepseek-v4", \
    "--trust-remote-code" \
]
requirements.txt:
vllm==0.8.4
prometheus-client==0.21.0
healthcheck.sh:
#!/bin/bash
# Check that vLLM responds on its health endpoint
curl -sf http://localhost:8000/health || exit 1
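Build commands:
docker build -t deepseek-v4-vllm:latest .
# Quick local test (assuming the model weights are in /data/models/deepseek-v4)
# --shm-size raises Docker's 64MB /dev/shm default, which tensor parallelism needs
docker run --gpus '"device=0,1"' \
    --shm-size=16g \
    -v /data/models/deepseek-v4:/models/deepseek-v4 \
    -p 8000:8000 \
    deepseek-v4-vllm:latest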
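It is worth checking how much GPU memory the --tensor-parallel-size and --gpu-memory-utilization flags in the Dockerfile leave for the KV cache. Here is a back-of-the-envelope sketch. It assumes the 140GB of weights split evenly across the tensor-parallel ranks, and it ignores activations, the CUDA context, and GB/GiB differences. kv_cache_budget_gb is a hypothetical helper of mine, not a vLLM function.

```python
def kv_cache_budget_gb(weights_gb: float, tp_size: int,
                       gpu_mem_gb: float, utilization: float) -> float:
    """Rough per-GPU memory left for the KV cache after loading weights."""
    usable = gpu_mem_gb * utilization       # cap set by --gpu-memory-utilization
    weights_per_gpu = weights_gb / tp_size  # even split across TP ranks (assumption)
    return usable - weights_per_gpu

budget = kv_cache_budget_gb(weights_gb=140, tp_size=2, gpu_mem_gb=80, utilization=0.92)
print(f"~{budget:.1f} GB per GPU left for KV cache")
```

Under these assumptions the margin is only a few GB per GPU, so if KV cache usage stays pinned near the top, a smaller --max-model-len or more GPUs per replica may be worth a look.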
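Step 2: K8s orchestration
This is the most complex part of the whole setup. I split the configuration into several files to make it easier to maintain.
2.1 Namespace and PV (model weight storage)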
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-serving
  labels:
    app.kubernetes.io/part-of: deepseek-v4
---
# pv-model.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: deepseek-v4-weights
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadOnlyMany
  nfs:
    server: 10.0.1.50
    path: /exports/models/deepseek-v4
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-v4-weights
  namespace: llm-serving
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 200Gi
  volumeName: deepseek-v4-weights
2.2 Deployment (the core)
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v4
  namespace: llm-serving
  labels:
    app: deepseek-v4
    version: v4-bf16
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-v4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # no unavailable Pods during a rolling update
      maxSurge: 1
  template:
    metadata:
      labels:
        app: deepseek-v4
        version: v4-bf16
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: your-registry.com/deepseek-v4-vllm:latest
          ports:
            - containerPort: 8000
              name: api
              protocol: TCP
          resources:
            requests:
              cpu: "8"
              memory: "64Gi"
              nvidia.com/gpu: "2"
            limits:
              cpu: "16"
              memory: "128Gi"
              nvidia.com/gpu: "2"
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1"
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
            - name: NCCL_P2P_DISABLE
              value: "0"
          volumeMounts:
            - name: model-weights
              mountPath: /models/deepseek-v4
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120  # model loading takes time
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            timeoutSeconds: 5
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30  # up to 5 minutes of probing after the initial delay
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: deepseek-v4-weights
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi  # tensor parallel communication needs a large shm
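Pitfall warning: do not forget the emptyDir for /dev/shm. vLLM's tensor parallelism relies on shared memory for inter-GPU communication, and with the 64MB default it runs out of memory and crashes right away. On my first deployment the Pod kept going into CrashLoopBackOff, and it took a long dig through the logs to find the cause.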
2.3 Service + HPA
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: deepseek-v4-svc
  namespace: llm-serving
  labels:
    app: deepseek-v4
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      name: api
  selector:
    app: deepseek-v4
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-v4-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-v4
  minReplicas: 2
  maxReplicas: 6
  metrics:
    # Pods-type metrics are read through the custom metrics API,
    # so an adapter such as prometheus-adapter has to serve this metric
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "20"  # scale out above 20 concurrent requests per Pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120  # add at most 1 Pod every 2 minutes (GPUs are expensive)
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300  # scale in more conservatively
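To get a feel for how this HPA reacts, here is the replica calculation from the Kubernetes documentation, desired = ceil(current × metric / target), sketched in Python. The 10% tolerance is the Kubernetes default. The function name is mine, and the sketch ignores the behavior policies, which further limit each change to one Pod per period. It also assumes the metric is already available to the HPA through a custom metrics adapter.

```python
import math

def desired_replicas(current: int, avg_metric: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """HPA core formula: ceil(current * metric / target), clamped to [min, max]."""
    ratio = avg_metric / target
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Values from hpa.yaml: target of 20 running requests per Pod, 2 to 6 replicas.
print(desired_replicas(2, 30, 20, 2, 6))   # 30 per Pod on 2 Pods: scale out
print(desired_replicas(2, 21, 20, 2, 6))   # within tolerance: no change
print(desired_replicas(4, 100, 20, 2, 6))  # capped by maxReplicas
```

Because the scaleUp policy adds only one Pod every 120 seconds, a large jump in the computed value is reached gradually rather than all at once.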
The overall architecture looks like this:
graph TB
    Client[Client requests] --> Ingress[Ingress / Gateway]
    Ingress --> SVC[K8s Service<br/>deepseek-v4-svc:8000]
    SVC --> Pod1[Pod 1<br/>vLLM + 2×A100]
    SVC --> Pod2[Pod 2<br/>vLLM + 2×A100]
    SVC --> PodN[Pod N<br/>scaled by HPA]
    Pod1 --> PV[(NFS PV<br/>model weights 140GB)]
    Pod2 --> PV
    PodN --> PV
    Pod1 -->|/metrics| Prom[Prometheus]
    Pod2 -->|/metrics| Prom
    PodN -->|/metrics| Prom
    Prom --> Grafana[Grafana Dashboard]
    Prom --> Alert[AlertManager<br/>DingTalk / Feishu alerts]
    HPA[HPA Controller] -->|reads vllm_num_requests_running| Prom
    HPA -->|scales out / in| SVC
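Step 3: Monitoring with Prometheus + Grafana
vLLM ships with a /metrics endpoint that exposes a bunch of metrics in Prometheus format, which is genuinely nice of it. The defaults weren't enough for us, though, so I added a few key ones of my own.
3.1 Prometheus scrape configuration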
# prometheus-scrape-config.yaml (append to Prometheus's scrape_configs)
scrape_configs:
  - job_name: 'deepseek-v4-vllm'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - llm-serving
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Build the scrape address as <pod IP>:<annotated port>
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        separator: ':'
      # Expose the Pod name as the "pod" label used in alert descriptions
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
    metric_relabel_configs:
      # vLLM exports names with a "vllm:" prefix; rename them to "vllm_"
      - source_labels: [__name__]
        regex: 'vllm:(.+)'
        action: replace
        target_label: __name__
        replacement: 'vllm_${1}'
      # Keep only the vLLM metrics
      - source_labels: [__name__]
        regex: 'vllm_.*'
        action: keep
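Before wiring up Prometheus, it helps to look at what /metrics actually returns. The sketch below parses the Prometheus text exposition format with the standard library only and keeps the vLLM series. The sample payload is made up for illustration: real output has many more series, and the exact names and labels depend on your vLLM version.

```python
import re

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')

def parse_metrics(text: str, prefix: str = "vllm") -> list:
    """Return (name, labels, value) tuples for series whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = LINE_RE.match(line)
        if match and match.group(1).startswith(prefix):
            samples.append((match.group(1), match.group(2) or "", float(match.group(3))))
    return samples

# Illustrative payload, not captured from a real server.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-v4"} 3.0
vllm:num_requests_waiting{model_name="deepseek-v4"} 0.0
process_cpu_seconds_total 12.5
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Against a live Pod you would fetch the text first, for example with urllib.request.urlopen on the Pod's port 8000 at /metrics, and pass the decoded body to parse_metrics.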
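3.2 Core metrics
These are the ones I think you have to watch in production. Note that vLLM itself exports its metrics with a vllm: prefix (for example vllm:num_requests_running); the underscore names below assume they are renamed at scrape time, so check the /metrics output of your version.
| Metric | Type | Meaning | Suggested alert threshold |
|---|---|---|---|
| vllm_num_requests_running | Gauge | Requests currently being processed | Alert at > 50 |
| vllm_num_requests_waiting | Gauge | Length of the waiting queue | Alert at > 20 |
| vllm_gpu_cache_usage_perc | Gauge | GPU KV cache usage | Alert at > 95% |
| vllm_avg_generation_throughput_toks_per_s | Gauge | Generation throughput (tokens/s) | Alert at < 100 |
| vllm_request_success_total | Counter | Total successful requests | Used to compute the success rate |
| vllm_e2e_request_latency_seconds | Histogram | End-to-end request latency | Alert at P99 > 30s |
| vllm_time_to_first_token_seconds | Histogram | Time to first token (TTFT) | Alert at P99 > 5s |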
3.3 AlertManager alerting rules
# alerting-rules.yaml
groups:
  - name: deepseek-v4-alerts
    rules:
      - alert: HighRequestQueueDepth
        expr: vllm_num_requests_waiting{job="deepseek-v4-vllm"} > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "DeepSeek V4 request queue is backing up"
          description: "{{ $value }} requests waiting in the queue for 2 minutes; consider scaling out"
      - alert: GPUCacheNearlyFull
        expr: vllm_gpu_cache_usage_perc{job="deepseek-v4-vllm"} > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU KV cache usage above 95%"
          description: "KV cache on Pod {{ $labels.pod }} is nearly exhausted; new requests will be rejected"
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 30
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "DeepSeek V4 P99 latency above 30 seconds"
      - alert: LowThroughput
        expr: vllm_avg_generation_throughput_toks_per_s{job="deepseek-v4-vllm"} < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Generation throughput is too low; possible performance problem"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total{namespace="llm-serving", container="vllm"}[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "vLLM Pod keeps restarting"
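The HighP99Latency rule relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. Here is a simplified Python sketch of that idea. It skips the edge cases the real PromQL function handles, and the bucket values are invented.

```python
import math

def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Invented latency histogram: 100 requests, upper bounds in seconds.
latency_buckets = [(1.0, 50), (5.0, 90), (30.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
print(p50, p95)
```

The estimate is only as fine as the bucket layout, so a threshold such as 30s is easiest to reason about when it coincides with a bucket boundary.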
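Pitfalls
These are lessons I paid for the hard way:
Pitfall 1: OOMKilled while loading the model
On the first deployment, the Pod was killed a few dozen seconds after starting. kubectl describe pod showed OOMKilled. The cause was a K8s memory limit that was too small: the DeepSeek V4 BF16 weights are 140GB, and they pass through CPU memory before being loaded onto the GPUs. Things only stabilized once I raised the memory limit to 128Gi.
Pitfall 2: Pod killed by a probe timeout during startup
Large models load slowly: on A100s, DeepSeek V4 takes roughly 2-3 minutes to be fully ready. The default initialDelaySeconds is nowhere near enough, so the kubelet judged the Pod unhealthy and restarted it before loading had finished. The fix is a startupProbe with failureThreshold: 30, which gives startup a full 5 minutes.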
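The time a startupProbe allows is easy to estimate: initialDelaySeconds + failureThreshold × periodSeconds. Here is a small sketch with the values from deployment.yaml (the helper name is mine):

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate upper bound on how long the kubelet waits before killing the container."""
    return initial_delay + period * failure_threshold

budget = startup_budget_seconds(initial_delay=60, period=10, failure_threshold=30)
print(f"{budget} s (~{budget / 60:.0f} min)")
```

So the settings above give roughly 6 minutes in total: 5 minutes of probing plus the 60-second initial delay, which leaves headroom over the 2-3 minute load time.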
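Pitfall 3: Cross-node tensor parallel communication is painfully slow
At first I wanted to run TP=4 across 4 A100s, but two of the cards were on node A and two on node B. NCCL ended up communicating over the network, and inference became about 3× slower. Conclusion: keep tensor parallelism within a single node. In K8s you can enforce this with a nodeSelector or topology constraints.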
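Pitfall 4: /dev/shm defaults to 64MB
I mentioned this above, but it is worth repeating. Docker's default shm is only 64MB, so in K8s you have to mount a larger one yourself using an emptyDir with medium: Memory.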
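A quick way to confirm the mount took effect is to check the size of /dev/shm from inside the container. This is a minimal sketch using only the standard library. The 1 GiB threshold is an arbitrary assumption of mine, not a vLLM requirement.

```python
import shutil

GIB = 1024 ** 3

def shm_is_large_enough(total_bytes: int, min_bytes: int = 1 * GIB) -> bool:
    """True if the shared-memory filesystem is at least min_bytes in size."""
    return total_bytes >= min_bytes

def check_shm(path: str = "/dev/shm") -> bool:
    """Print the size of the shm mount and report whether it passes the threshold."""
    total = shutil.disk_usage(path).total
    print(f"{path}: {total / GIB:.2f} GiB")
    return shm_is_large_enough(total)

# Docker's 64MB default fails the check; the 16Gi emptyDir passes.
print(shm_is_large_enough(64 * 1024 ** 2))
print(shm_is_large_enough(16 * GIB))
```

Run check_shm() inside the Pod, for example through kubectl exec, to see the size the container actually got.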
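Hybrid setup: local deployment + cloud API
Honestly, the operational cost of self-hosting DeepSeek V4 is not low. Our team's approach: high-frequency, latency-sensitive internal tasks go to the locally deployed V4, while low-frequency tasks or those that need closed-source models such as GPT-5.5 or Claude Opus 4.6 go through an API.
ofox.ai is an AI model aggregation platform: a single API key can call 50+ models including GPT-5.5, Claude Opus 4.6, Gemini 3, and DeepSeek V4, with low-latency direct connections that need no proxy and support for Alipay payments. We put a simple router in our code: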
from openai import OpenAI

# Local DeepSeek V4 (K8s Service address)
local_client = OpenAI(
    api_key="not-needed",
    base_url="http://deepseek-v4-svc.llm-serving:8000/v1"
)

# Cloud aggregation API (closed-source models such as GPT-5.5 / Claude)
cloud_client = OpenAI(
    api_key="your-ofox-key",
    base_url="https://api.ofox.ai/v1"
)

def smart_route(prompt: str, task_type: str = "general"):
    """Pick the local or the cloud model based on the task type."""
    if task_type in ("code_review", "doc_gen"):
        # High-frequency internal tasks → local DeepSeek V4
        return local_client.chat.completions.create(
            model="deepseek-v4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=4096,
            stream=True
        )
    else:
        # Tasks that need closed-source model capabilities → cloud API
        return cloud_client.chat.completions.create(
            model="gpt-5.5",  # or claude-opus-4.6
            messages=[{"role": "user", "content": prompt}],
            max_tokens=4096,
            stream=True
        )
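This way the local cluster handles 80% of the request volume and the remaining 20% goes through the aggregation API, which balances cost and flexibility.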
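Since smart_route is called with stream=True, it returns an iterator of chunks rather than a finished response. Below is a sketch of how a caller might assemble the text. It assumes the OpenAI SDK's chunk shape (chunk.choices[0].delta.content, which can be None), and the fake chunks stand in for a live stream so the snippet runs without a server.

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Concatenate the text deltas from a chat-completions stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:              # content is None on role/finish chunks
            parts.append(delta)
    return "".join(parts)

def fake_chunk(text):
    # Stand-in that mimics only the attributes used above; not a real SDK object.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

stream = [fake_chunk(None), fake_chunk("Hello"), fake_chunk(", world"),
          SimpleNamespace(choices=[])]
print(collect_stream(stream))
```

In real use the call would look like collect_stream(smart_route(prompt, "doc_gen")), although for interactive output you would more likely print each delta as it arrives.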
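Summary
After running the whole setup, the core lessons come down to three:
- shm and startupProbe are pitfalls you will hit; don't wait until the Pod is restarting in a loop to look into them
- Don't run tensor parallelism across nodes; the performance hit from NCCL going over the network is enormous
- The metrics most worth watching are vllm_gpu_cache_usage_perc and vllm_num_requests_waiting; when these two spike, it's time to scale out
Self-hosting suits teams that have GPU resources and a steady request volume. If your workload involves switching between many models or has large swings in traffic, an API aggregation platform may be the better deal. The two aren't mutually exclusive, so mix them.
Questions are welcome in the comments, especially about K8s + GPU pitfalls. I've hit plenty of them and will help where I can.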