DeepSeek V4 Local Deployment + Production-Grade Monitoring: A Complete Ops Setup from Dockerfile to K8s (2026)


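Last month our team decided to deploy DeepSeek V4 on our own GPU cluster to run internal code and documentation generation tasks. Honestly, getting the model to run is not hard; the hard part is keeping it stable in production. It took me about a week to get the whole chain working, from containerization through K8s orchestration to Prometheus monitoring. This post collects the pitfalls I hit and the setup I ended up with, in the hope that it saves you some detours.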

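A production-grade DeepSeek V4 deployment has three layers: container packaging (Dockerfile + vLLM), K8s orchestration (Deployment + HPA + Service), and end-to-end monitoring with Prometheus + Grafana. This post gives the complete configuration files for each layer, all tested in my own environment.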

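The conclusion first

| Stage | Approach | Core tools | Pitfall rating |
|---|---|---|---|
| Model inference | vLLM 0.8.x | vllm serve | ⭐⭐ |
| Containerization | Multi-stage Dockerfile | CUDA 12.4 + Python 3.11 | ⭐⭐⭐ |
| Orchestration and scheduling | K8s Deployment + HPA | GPU resource limits + custom-metric autoscaling | ⭐⭐⭐⭐ |
| Monitoring and alerting | Prometheus + Grafana | vLLM built-in metrics + custom exporter | ⭐⭐⭐ |
| Log collection | Loki + Promtail | Structured logs | ⭐⭐ |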

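Environment setup

My test environment:

  • GPU: 4 × NVIDIA A100 80GB (the full-precision DeepSeek V4 needs at least 2 × A100 80GB; one card is enough for the quantized version)
  • K8s: v1.29, with the NVIDIA GPU Operator installed
  • OS: Ubuntu 22.04
  • Storage: model weights on an NFS mount, roughly 140GB (full BF16 version)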

First confirm that the GPU Operator is healthy:

# Check that the GPUs are detected correctly
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Should print your GPU count, e.g. "4"

# Check that nvidia-device-plugin is running
kubectl get pods -n gpu-operator | grep nvidia-device-plugin
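If you would rather do the same check from a script, for example in a pre-deploy job, the jq query above can be sketched in Python. This is only an illustration: `gpu_capacity_by_node` and `fetch_nodes_json` are names I made up, and the sample JSON is a trimmed, invented stand-in for real `kubectl get nodes -o json` output.

```python
import json
import subprocess


def gpu_capacity_by_node(nodes_json: str) -> dict[str, int]:
    """Map node name -> advertised nvidia.com/gpu capacity (0 if the key is absent)."""
    nodes = json.loads(nodes_json)
    result = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        capacity = item.get("status", {}).get("capacity", {})
        # Kubernetes reports extended resources as strings, e.g. "4".
        result[name] = int(capacity.get("nvidia.com/gpu", "0"))
    return result


def fetch_nodes_json() -> str:
    """Run the same kubectl query as above (needs kubectl on PATH and cluster access)."""
    return subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout


# Invented sample: one GPU node and one CPU-only node.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"capacity": {"cpu": "64", "nvidia.com/gpu": "4"}}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"capacity": {"cpu": "32"}}},
]})

print(gpu_capacity_by_node(SAMPLE))
```

Against a real cluster you would call `gpu_capacity_by_node(fetch_nodes_json())` and fail the job if any node you plan to schedule on reports fewer than 2 GPUs.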

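Step 1: Containerizing with a Dockerfile

This Dockerfile cost me two days. The biggest pitfall was compatibility between the CUDA version and vLLM: with CUDA 12.6, some of vLLM's kernels failed to compile, and going back to 12.4 fixed it.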

# ============================================
# Production-grade Dockerfile for DeepSeek V4
# Based on the vLLM inference engine, CUDA 12.4
# ============================================

# Stage 1: base runtime environment
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y --no-install-recommends \
 python3.11 python3.11-venv python3-pip \
 curl wget git \
 && rm -rf /var/lib/apt/lists/*

RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
 && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Stage 2: install dependencies
FROM base AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 3: final image
FROM base AS runtime

WORKDIR /app

COPY --from=builder /usr/local/lib/python3.11/dist-packages /usr/local/lib/python3.11/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Health check script
COPY healthcheck.sh /app/healthcheck.sh
RUN chmod +x /app/healthcheck.sh

# Expose the vLLM API port (which also serves /metrics) and a port for a separate metrics exporter
EXPOSE 8000
EXPOSE 9090

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
 CMD /app/healthcheck.sh

# Start the vLLM inference server
ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server"]

CMD [ \
 "--model", "/models/deepseek-v4", \
 "--tensor-parallel-size", "2", \
 "--max-model-len", "32768", \
 "--gpu-memory-utilization", "0.92", \
 "--enable-prefix-caching", \
 "--port", "8000", \
 "--served-model-name", "deepseek-v4", \
 "--trust-remote-code" \
]

requirements.txt

vllm==0.8.4
prometheus-client==0.21.0

healthcheck.sh

#!/bin/bash
# Check that vLLM responds normally
curl -sf http://localhost:8000/health || exit 1
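The same /health check is also handy outside the container, for example in a deploy script that must wait for the model to finish loading before sending traffic. A minimal sketch, assuming the server listens on localhost:8000 as in the Dockerfile; `http_health_ok` and `wait_until_healthy` are hypothetical helper names, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if GET url answers with HTTP 200, False on errors or timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and 4xx/5xx responses.
        return False


def wait_until_healthy(check: Callable[[], bool],
                       max_attempts: int = 30,
                       interval_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to max_attempts times, sleeping interval_s between failures."""
    for attempt in range(max_attempts):
        if check():
            return True
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False
```

Usage would look like `wait_until_healthy(lambda: http_health_ok("http://localhost:8000/health"))`. The `sleep` parameter exists only so the loop can be exercised without real waiting.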

Build command:

docker build -t deepseek-v4-vllm:latest .

# Quick local test (assuming the model weights are in /data/models/deepseek-v4)
docker run --gpus '"device=0,1"' \
 -v /data/models/deepseek-v4:/models/deepseek-v4 \
 -p 8000:8000 \
 deepseek-v4-vllm:latest

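Step 2: K8s orchestration config

This is the most complex part of the whole setup. I split the config into several files to make it easier to maintain.

2.1 Namespace and PV (model weight storage)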

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm-serving
  labels:
    app.kubernetes.io/part-of: deepseek-v4
---
# pv-model.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: deepseek-v4-weights
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadOnlyMany
  nfs:
    server: 10.0.1.50
    path: /exports/models/deepseek-v4
  persistentVolumeReclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-v4-weights
  namespace: llm-serving
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""  # empty class so the claim binds to the static PV above, not a default StorageClass
  resources:
    requests:
      storage: 200Gi
  volumeName: deepseek-v4-weights

2.2 Deployment (the core)

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v4
  namespace: llm-serving
  labels:
    app: deepseek-v4
    version: v4-bf16
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-v4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # no Pod may become unavailable during a rolling update
      maxSurge: 1
  template:
    metadata:
      labels:
        app: deepseek-v4
        version: v4-bf16
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: your-registry.com/deepseek-v4-vllm:latest
          ports:
            - containerPort: 8000
              name: api
              protocol: TCP
          resources:
            requests:
              cpu: "8"
              memory: "64Gi"
              nvidia.com/gpu: "2"
            limits:
              cpu: "16"
              memory: "128Gi"
              nvidia.com/gpu: "2"
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1"
            - name: VLLM_LOGGING_LEVEL
              value: "INFO"
            - name: NCCL_P2P_DISABLE
              value: "0"
          volumeMounts:
            - name: model-weights
              mountPath: /models/deepseek-v4
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120  # model loading takes time
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            timeoutSeconds: 5
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30  # 30 × 10s: up to 5 minutes of probing after the initial delay
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: deepseek-v4-weights
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi  # tensor parallel communication needs a large shm

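Pitfall warning: do not forget the emptyDir for /dev/shm. vLLM's tensor parallelism relies on shared memory for inter-GPU communication, and the default 64MB runs out immediately and crashes the process. On my first deployment the Pod kept going into CrashLoopBackOff, and it took a long dig through the logs to find out this was the cause.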
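To confirm what actually got mounted, you can check the size of /dev/shm from inside the container. A small sketch under my own assumptions: the 16 GiB requirement simply mirrors the sizeLimit in the Deployment above, and `shm_report` / `check_dev_shm` are made-up helper names.

```python
import shutil

GIB = 1024 ** 3


def shm_report(total_bytes: int, required_gib: float = 16.0) -> str:
    """Describe whether a shared-memory mount of total_bytes meets the requirement."""
    total_gib = total_bytes / GIB
    if total_gib >= required_gib:
        return f"ok: /dev/shm is {total_gib:.1f} GiB"
    return f"too small: /dev/shm is {total_gib:.2f} GiB, want >= {required_gib:g} GiB"


def check_dev_shm(path: str = "/dev/shm", required_gib: float = 16.0) -> str:
    """Measure the mount at path (Linux only) and report on it."""
    return shm_report(shutil.disk_usage(path).total, required_gib)


# The Docker default of 64 MiB fails the check:
print(shm_report(64 * 1024 ** 2))
```

Running `python -c "..."` with `check_dev_shm()` through `kubectl exec` in the vllm container is a quick way to catch a missing emptyDir before the first tensor parallel request does.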

2.3 Service + HPA

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: deepseek-v4-svc
  namespace: llm-serving
  labels:
    app: deepseek-v4
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      name: api
  selector:
    app: deepseek-v4
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-v4-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-v4
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Pods
      pods:
        metric:
          # Pods metrics come from the custom metrics API, so an adapter (e.g. prometheus-adapter)
          # must expose vLLM's running-requests gauge under this name
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "20"  # scale out when average concurrent requests per Pod exceed 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120  # add at most 1 Pod every 2 minutes (GPUs are expensive)
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300  # scale in more conservatively
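To get a feel for when this HPA will actually add Pods, it helps to plug numbers into the replica formula from the Kubernetes documentation: desired = ceil(current × currentMetric / target), with no change when the ratio is within the default 10% tolerance. The sketch below is my own simplification; it ignores the stabilization windows and the one-Pod-per-period policies configured above, and `desired_replicas` is a made-up name.

```python
import math


def desired_replicas(current_replicas: int, avg_metric: float, target: float,
                     min_replicas: int = 2, max_replicas: int = 6,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula asks for, before rate-limiting policies apply."""
    ratio = avg_metric / target
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas          # close enough to target: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, avg_metric=30, target=20))  # 3
print(desired_replicas(2, avg_metric=21, target=20))  # 2 (within tolerance)
print(desired_replicas(5, avg_metric=60, target=20))  # 6 (capped at maxReplicas)
```

So with two Pods each averaging 30 running requests, the formula asks for a third Pod, and the scaleUp policy then lets it be added no faster than one Pod per 120 seconds.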

The overall architecture looks like this:

graph TB
    Client[Client requests] --> Ingress[Ingress / Gateway]
    Ingress --> SVC[K8s Service<br/>deepseek-v4-svc:8000]
    SVC --> Pod1[Pod 1<br/>vLLM + 2×A100]
    SVC --> Pod2[Pod 2<br/>vLLM + 2×A100]
    SVC --> PodN[Pod N<br/>scaled dynamically by HPA]

    Pod1 --> PV[(NFS PV<br/>model weights 140GB)]
    Pod2 --> PV
    PodN --> PV

    Pod1 -->|/metrics| Prom[Prometheus]
    Pod2 -->|/metrics| Prom
    PodN -->|/metrics| Prom

    Prom --> Grafana[Grafana Dashboard]
    Prom --> Alert[AlertManager<br/>DingTalk / Feishu alerts]

    HPA[HPA Controller] -->|reads vllm_num_requests_running| Prom
    HPA -->|scales out / in| SVC

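Step 3: Monitoring with Prometheus + Grafana

vLLM ships with a /metrics endpoint that exposes a good set of metrics in Prometheus format, which is genuinely generous. The defaults were not enough for me, though, so I added a few key ones of my own.

3.1 Prometheus scrape config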

# prometheus-scrape-config.yaml (append to Prometheus's scrape_configs)
scrape_configs:
  - job_name: 'deepseek-v4-vllm'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - llm-serving
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Build the scrape address as <pod IP>:<annotated port>
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        separator: ':'
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'vllm[:_].*'  # vLLM's built-in metric names use a "vllm:" prefix
        action: keep
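Before trusting the scrape config, it is worth looking at what a Pod really exposes. The sketch below parses Prometheus text-format output and keeps only the samples whose name starts with a given prefix, which is roughly what the `metric_relabel_configs` rule does. The parser is deliberately simplified (it is not a full implementation of the exposition format), `parse_samples` is a made-up name, and the sample text is invented for illustration.

```python
import re

# name, optional {labels}, then the value; a trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str, prefix: str = "vllm"):
    """Return (name, labels, value) tuples for metrics whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Invented sample of what a /metrics response might contain.
SAMPLE_TEXT = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-v4"} 3.0
process_cpu_seconds_total 12.5
vllm:num_requests_waiting{model_name="deepseek-v4"} 0.0
"""

for name, labels, value in parse_samples(SAMPLE_TEXT):
    print(name, labels, value)
```

Feeding it the body of `curl http://<pod-ip>:8000/metrics` shows the exact metric names on your vLLM version, which is what the regex and the alert expressions need to match.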

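3.2 Core metrics

These are the ones I think you must watch in production (vLLM exposes its built-in metrics under a vllm: prefix):

| Metric | Type | Meaning | Suggested alert threshold |
|---|---|---|---|
| vllm:num_requests_running | Gauge | Requests currently being processed | Alert above 50 |
| vllm:num_requests_waiting | Gauge | Length of the waiting queue | Alert above 20 |
| vllm:gpu_cache_usage_perc | Gauge | GPU KV cache utilization | Alert above 95% |
| vllm:avg_generation_throughput_toks_per_s | Gauge | Generation throughput (tokens/s) | Alert below 100 |
| vllm:request_success_total | Counter | Total successful requests | Used to compute the success rate |
| vllm:e2e_request_latency_seconds | Histogram | End-to-end request latency | Alert when P99 > 30s |
| vllm:time_to_first_token_seconds | Histogram | Time to first token (TTFT) | Alert when P99 > 5s |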
3.3 AlertManager alerting rules

# alerting-rules.yaml
groups:
  - name: deepseek-v4-alerts
    rules:
      - alert: HighRequestQueueDepth
        expr: vllm:num_requests_waiting{job="deepseek-v4-vllm"} > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "DeepSeek V4 request queue is backing up"
          description: "{{ $value }} requests have been waiting for 2 minutes, consider scaling out"

      - alert: GPUCacheNearlyFull
        expr: vllm:gpu_cache_usage_perc{job="deepseek-v4-vllm"} > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU KV cache utilization above 95%"
          description: "KV cache on Pod {{ $labels.pod }} is about to run out, new requests will be rejected"

      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m])) > 30
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "DeepSeek V4 P99 latency above 30 seconds"

      - alert: LowThroughput
        expr: vllm:avg_generation_throughput_toks_per_s{job="deepseek-v4-vllm"} < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Generation throughput is too low, possible performance problem"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total{namespace="llm-serving", container="vllm"}[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "vLLM Pod is restarting repeatedly"
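The P99 rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. The sketch below mirrors that idea in plain Python so the threshold is easier to reason about. It is my own simplified rendering, not Prometheus code, and the bucket boundaries and counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The last bucket must have upper bound +Inf, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Invented latency buckets in seconds: 100 requests in total.
latency = [(1, 50), (5, 90), (10, 98), (30, 100), (math.inf, 100)]
print(histogram_quantile(0.99, latency))  # 20.0
```

With these numbers the 99th request falls in the 10s to 30s bucket, so the estimate is 20s and the 30s alert would not fire. The estimate can never be finer than the bucket boundaries, which is worth remembering when picking thresholds.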

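Pitfall log

These lessons were paid for in real time and money:

Pitfall 1: OOMKilled while loading the model

On the first deployment the Pod was killed a few dozen seconds after it started. kubectl describe pod showed OOMKilled. The cause was a K8s memory limit that was too small: the DeepSeek V4 BF16 weights are 140GB, and they have to pass through CPU memory before being loaded onto the GPUs. It only became stable once I raised the memory limit to 128Gi.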

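Pitfall 2: Pod killed because startup outlasts the probes

Large models load slowly; on A100s DeepSeek V4 takes roughly 2-3 minutes to become fully ready. The default initialDelaySeconds is nowhere near enough, so the kubelet judged the Pod unhealthy and restarted it before loading had finished. The fix is a startupProbe with failureThreshold: 30, which allows a full 5 minutes for startup.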
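The arithmetic behind the startup window is simple enough to script, which helps when you change model size or storage speed. A rough sketch: the budget is approximated as the initial delay plus periodSeconds × failureThreshold, ignoring probe timeouts, and both function names are made up for illustration.

```python
import math


def startup_budget_seconds(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Approximate time a container gets before the startupProbe gives up."""
    return initial_delay_s + period_s * failure_threshold


def min_failure_threshold(load_time_s: float, initial_delay_s: int, period_s: int) -> int:
    """Smallest failureThreshold whose budget covers load_time_s (at least 1)."""
    remaining = max(0.0, load_time_s - initial_delay_s)
    return max(1, math.ceil(remaining / period_s))


# Values from the Deployment: 60s delay, 10s period, 30 failures allowed.
print(startup_budget_seconds(60, 10, 30))   # 360
# A 3-minute model load needs at least this many allowed failures:
print(min_failure_threshold(180, 60, 10))   # 12
```

So the settings in this post give about 300 seconds of probing on top of the 60-second initial delay, comfortable headroom over a 2-3 minute load. If your weights load from slower storage, measure the load time first and size failureThreshold with some margin.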

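Pitfall 3: cross-node tensor parallel communication is painfully slow

At first I wanted to run TP=4 across 4 A100s, but two of the cards were on node A and two on node B. NCCL then had to communicate over the network, and inference speed dropped by a factor of 3. Conclusion: tensor parallelism must stay within a single node. In K8s you can enforce this with nodeSelector or topology constraints.

Pitfall 4: /dev/shm defaults to 64MB

I mentioned this earlier, but it bears repeating. Docker's default shm is only 64MB, so in K8s you have to mount a larger shm yourself with emptyDir + medium: Memory.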

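Hybrid setup: local deployment and cloud APIs complementing each other

Honestly, the operational cost of self-hosting DeepSeek V4 is not low. Our team's approach: high-frequency, latency-sensitive internal tasks go to the locally deployed V4, while low-frequency tasks, or ones that need closed-source models such as GPT-5.5 or Claude Opus 4.6, go through an API.

ofox.ai is an AI model aggregation platform: one API key can call 50+ models including GPT-5.5, Claude Opus 4.6, Gemini 3, and DeepSeek V4, with low-latency direct connections that need no proxy, and it supports payment via Alipay. We put a simple router in our code: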

from openai import OpenAI

# Local DeepSeek V4 (K8s Service address)
local_client = OpenAI(
    api_key="not-needed",
    base_url="http://deepseek-v4-svc.llm-serving:8000/v1"
)

# Cloud aggregation API (closed-source models such as GPT-5.5 / Claude)
cloud_client = OpenAI(
    api_key="your-ofox-key",
    base_url="https://api.ofox.ai/v1"
)

def smart_route(prompt: str, task_type: str = "general"):
    """Pick the local or the cloud model based on the task type."""
    if task_type in ("code_review", "doc_gen"):
        # High-frequency internal tasks -> local DeepSeek V4
        return local_client.chat.completions.create(
            model="deepseek-v4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=4096,
            stream=True
        )
    else:
        # Tasks that need closed-source model capabilities -> cloud API
        return cloud_client.chat.completions.create(
            model="gpt-5.5",  # or claude-opus-4.6
            messages=[{"role": "user", "content": prompt}],
            max_tokens=4096,
            stream=True
        )

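This way the local cluster handles 80% of the request volume and the remaining 20% goes through the aggregation API, which balances cost and flexibility.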

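Summary

After running the whole setup, the core lessons come down to three:

  1. shm and startupProbe are pitfalls you will hit; do not wait until the Pod is restarting in a loop before you go looking
  2. Do not run tensor parallel across nodes; the performance loss when NCCL goes over the network is enormous
  3. The metrics most worth watching are vllm:gpu_cache_usage_perc and vllm:num_requests_waiting; when these two spike, it is time to scale out

Self-hosting suits teams that have GPU resources and a stable request volume. If your scenario involves switching between many models or highly variable traffic, going straight to an API aggregation platform may be more cost-effective. The two are not mutually exclusive; mixing them works fine.

If you have questions, let's talk in the comments, especially about K8s + GPU pitfalls. I have hit quite a few and will help where I can.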