Kubernetes Production Pitfalls: 10 Real-World Failure Cases and Solutions


Failures are inevitable once a K8s cluster is in production. This article collects 10 real production cases covering Pod scheduling, networking, storage, RBAC, and other high-frequency problem areas, to save you three years of trial and error.


Background: Why Does K8s Break So Often?

Kubernetes is complex because it spans several layers:

┌─────────────────────────────────────────┐
│           Application layer             │
│   Pod / Deployment / StatefulSet        │
├─────────────────────────────────────────┤
│           Scheduling layer              │
│   Scheduler / Node / Taint/Toleration   │
├─────────────────────────────────────────┤
│           Network layer                 │
│   CNI / Service / Ingress / DNS         │
├─────────────────────────────────────────┤
│           Storage layer                 │
│   PV / PVC / StorageClass               │
├─────────────────────────────────────────┤
│           Control plane                 │
│   API Server / etcd / Controller        │
└─────────────────────────────────────────┘

A failure at any layer can take the application down. The cases below are pitfalls we actually hit.


Case 1: Pod Stuck in Pending

Symptom

kubectl get pods -n production
# Output:
NAME                     READY   STATUS    RESTARTS   AGE
web-app-7d9f8b6c5-x2n4p   0/1     Pending   0          5m

The Pod stays in Pending and never gets scheduled.

Troubleshooting steps

# 1. Inspect the Pod
kubectl describe pod web-app-7d9f8b6c5-x2n4p -n production

# 2. Common causes:
# - Insufficient resources (not enough CPU/memory)
# - Affinity/anti-affinity constraints
# - Taint restrictions
# - Unbound PVC

# 3. Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# 4. Check taints
kubectl get nodes -o custom-columns=NODE:.metadata.name,TAINTS:.spec.taints
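
The Events section of kubectl describe pod usually names the cause directly. A typical FailedScheduling event looks roughly like this (exact wording varies by Kubernetes version; the node counts are illustrative):

```text
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/3 nodes are available: 1 node(s) had untolerated taint {dedicated: }, 2 Insufficient cpu.
```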

Solution

Scenario A: insufficient resources

# Set sane default requests/limits so Pods fit the available capacity
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - max:
      cpu: "4"
      memory: 8Gi
    default:
      cpu: 500m
      memory: 1Gi
    defaultRequest:
      cpu: 200m
      memory: 512Mi
    type: Container
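
A namespace ResourceQuota is the other half of capacity planning: when it is exhausted, new Pods are rejected at creation time (visible in the owning ReplicaSet's events) rather than left Pending, so it is worth checking too. A minimal sketch (the numbers are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"      # total CPU requests allowed in the namespace
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
```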

Scenario B: taint restrictions

# Inspect taints
kubectl describe node node-1 | grep Taints

# Temporarily remove a taint (for testing only)
kubectl taint node node-1 dedicated-

# Or add a matching toleration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Exists"
        effect: "NoSchedule"

Case 2: Pod in CrashLoopBackOff

Symptom

kubectl get pods -n production
# Output:
NAME                     READY   STATUS              RESTARTS   AGE
api-server-5f4d7c8b9-abc   0/1     CrashLoopBackOff    3          2m

The Pod keeps restarting.

Troubleshooting

# Check the logs of the previous (crashed) instance
kubectl logs api-server-5f4d7c8b9-abc -n production --previous

# Common causes:
# - Application fails on startup (bad configuration)
# - Health check failures
# - OOMKilled (memory limit exceeded)
# - A dependency service is unreachable

Real case: misconfigured health check

# Broken config (the liveness probe fires before the app has started,
# so the kubelet kills and restarts the container in a loop)
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 0  # Too short!
      periodSeconds: 5
# Corrected config
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # give the app enough time to start
      periodSeconds: 10
      failureThreshold: 3
    resources:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"
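
On Kubernetes 1.18+, a startupProbe is often a cleaner fix than a long initialDelaySeconds: the other probes are held back until the startup probe has succeeded, so slow starts and fast liveness checks no longer conflict. A sketch (thresholds are illustrative):

```yaml
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 5
      failureThreshold: 30   # tolerates up to 30 × 5s = 150s of startup time
```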

Case 3: Service Not Reachable

Symptom

# Test from inside a Pod
kubectl exec -it nginx-pod -- sh
/ # curl http://api-service:8080/health
# curl: couldn't connect to host

Troubleshooting flow

# 1. Check that the Service exists
kubectl get svc -n production | grep api

# 2. Check the Endpoints
kubectl get endpoints api-service -n production
# If the list is empty, the selector matched no Pods

# 3. Compare Pod labels with the Service selector
kubectl get pods -n production --show-labels | grep app
kubectl get svc api-service -n production -o yaml | grep -A 5 selector

Real case: label mismatch

# Deployment labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
      version: v2
  template:
    metadata:
      labels:
        app: api-server
        version: v2  # the new release uses v2
# But the Service selector still says v1
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  selector:
    app: api-server
    version: v1  # ❌ here!
  ports:
  - port: 8080
    targetPort: 8080

Fix: update the Service selector or the Deployment labels so they match.


Case 4: DNS Resolution Failures

Symptom

kubectl exec -it test-pod -- nslookup kubernetes
# Server:    10.96.0.10
# ** server can't find kubernetes.default: NXDOMAIN

Troubleshooting

# 1. Check that CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. Check the CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# 3. Test an external name for comparison
kubectl exec -it test-pod -- nslookup www.baidu.com

Solution

# Option 1: scale up CoreDNS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 3  # run at least 2 replicas in production
# Option 2: configure the Pod's DNS policy
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  dnsPolicy: ClusterFirst  # the default
  # or supply custom DNS settings
  dnsConfig:
    nameservers:
      - 8.8.8.8
    searches:
      - default.svc.cluster.local
    options:
      - name: ndots
        value: "2"
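
With dnsPolicy: ClusterFirst, the kubelet renders an /etc/resolv.conf inside each Pod along these lines (the nameserver IP and cluster domain depend on the cluster; the namespace here is assumed to be production). The default ndots:5 is what lets short names like api-service resolve through the search list:

```text
nameserver 10.96.0.10
search production.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```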

Case 5: PVC Won't Bind

Symptom

kubectl get pvc -n production
# NAME        STATUS    VOLUME                                     CAPACITY
# data-pvc    Pending   pvc-8f7a3c2b-xxxx                          10Gi
# Stuck in Pending!

Troubleshooting

# 1. Inspect the PVC
kubectl describe pvc data-pvc -n production

# Common causes:
# - The StorageClass does not exist
# - Storage quota exceeded
# - The cloud provider does not support the requested volume type

# 2. Check the StorageClasses
kubectl get storageclass

# 3. Check the cloud provider's storage limits
# Tencent Cloud CBS: at most 20 cloud disks per node

Solution

# Option 1: use the correct StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "cbs-balanced"  # Tencent Cloud CBS
  resources:
    requests:
      storage: 50Gi
# Option 2: clean up leftover volumes
kubectl get pv | grep Released   # PVs whose claims have been deleted
kubectl delete pvc <pvc-name> -n <namespace>
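
One more case worth knowing: a StorageClass with volumeBindingMode: WaitForFirstConsumer keeps the PVC Pending on purpose until a Pod that uses it is scheduled, so "Pending" there is normal, not a failure. A sketch of such a class (the provisioner name is cloud-specific; com.tencent.cloud.csi.cbs is assumed here for Tencent Cloud CBS):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cbs-balanced
provisioner: com.tencent.cloud.csi.cbs    # cloud-specific CSI driver
volumeBindingMode: WaitForFirstConsumer   # bind only once a consuming Pod is scheduled
reclaimPolicy: Delete
```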

Case 6: Ingress Returns 502 Bad Gateway

Symptom

The browser gets a 502, but the Pod itself is running normally.

Troubleshooting

# 1. Check the Ingress controller
kubectl get pods -n ingress-nginx

# 2. Check the backend
kubectl describe ingress my-ingress -n production

# 3. Test the Pod directly from the controller
kubectl exec -it nginx-ingress-xxx -n ingress-nginx -- curl -v http://<pod-ip>:8080/health

Real case: wrong health-check path

# Original Ingress config (no health-check endpoint exposed)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
# Fixed: answer health checks directly from nginx via a server-snippet
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/server-snippet: |
      location /health {
        return 200 'OK';
      }
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

Case 7: Insufficient RBAC Permissions

Symptom

# The application reports:
Forbidden: User "system:serviceaccount:default:my-app" 
cannot list pods in namespace "production"

Troubleshooting

# 1. Check the ServiceAccount
kubectl get sa my-app -n default

# 2. Check the Roles
kubectl get role -n production

# 3. Check the RoleBindings
kubectl get rolebinding -n production

Solution

# Create a Role with the needed read permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
# Bind it to the ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
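
The grant can be verified without redeploying the application; kubectl auth can-i accepts an --as flag to impersonate the ServiceAccount:

```shell
# Should print "yes" once the RoleBinding is in place
kubectl auth can-i list pods \
  --as=system:serviceaccount:default:my-app -n production

# List everything this ServiceAccount may do in the namespace
kubectl auth can-i --list \
  --as=system:serviceaccount:default:my-app -n production
```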

Case 8: OOMKilled (Memory Limit Exceeded)

Symptom

kubectl get pods -n production
# NAME        READY   STATUS      RESTARTS   AGE
# api-xxx     0/1     OOMKilled   2          10m

Troubleshooting and fix

# Check the configured limits
kubectl describe pod api-xxx -n production | grep -A 5 "Limits"

# Fix: raise the memory limit
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          limits:
            memory: "2Gi"  # raised from 512Mi to 2Gi
            cpu: "1"
          requests:
            memory: "1Gi"
            cpu: "500m"
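
The authoritative confirmation is the container's last termination state, which records why it was killed (pod name taken from the example above):

```shell
kubectl get pod api-xxx -n production \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# "OOMKilled" means the kernel killed the process for exceeding its memory limit
```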

Case 9: Node NotReady

Symptom

kubectl get nodes
# NAME        STATUS     ROLES    AGE   VERSION
# node-1      NotReady   worker   30d   v1.28.0

Troubleshooting

# 1. SSH to the affected node
ssh root@node-1

# 2. Check the kubelet status
systemctl status kubelet

# 3. Check the logs
journalctl -u kubelet -n 100 --no-pager

# Common causes:
# - Disk pressure (out of disk space)
# - Memory pressure
# - Expired kubelet certificates

Solution

# Free disk space (careful: removes stopped containers and all unused images/volumes)
docker system prune -a --volumes

# Restart the kubelet
systemctl restart kubelet

# If certificates have expired (kubeadm clusters, run on the control plane)
kubeadm certs check-expiration
kubeadm certs renew all
systemctl restart kubelet

Case 10: HPA Won't Scale Out

Symptom

kubectl get hpa -n production
# NAME        REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS
# api-hpa     Deployment/api     85%/80%   2         10        2
# CPU is at 85% > 80%, yet the replica count stays at 2!

Troubleshooting

# 1. Inspect the HPA
kubectl describe hpa api-hpa -n production

# 2. Common causes:
# - Pods have no resource requests (CPU/memory)
# - Metrics Server is not running
# - The replica cap has been reached

# 3. Check the Metrics Server
kubectl get pods -n kube-system | grep metrics

# 4. Test metrics collection
kubectl top pods -n production

Solution

# Pods must declare resource requests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          requests:
            cpu: "500m"  # required! the HPA computes utilization from this
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
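
Since autoscaling/v2 the HPA also takes an optional behavior section to control scaling speed; a sketch that damps scale-down so brief load dips don't remove Pods (values are illustrative):

```yaml
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min of low load before scaling in
      policies:
      - type: Pods
        value: 1                       # remove at most one Pod per minute
        periodSeconds: 60
```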

Production Checklist

Before every rollout, verify:

# 1. Resource requests/limits are in place
kubectl describe limitrange -n production

# 2. Health checks are configured
kubectl get deploy -n production -o yaml | grep -c readinessProbe

# 3. Network connectivity
kubectl run net-test --rm -it --image=busybox -- wget -qO- http://api-service:8080/health

# 4. Storage availability
kubectl get storageclass
kubectl get pvc -n production

# 5. RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:production:my-app -n production

Summary: Key Takeaways

| Category    | Common problem      | Prevention                                  |
|-------------|---------------------|---------------------------------------------|
| Scheduling  | Pod Pending         | Plan capacity ahead; keep a resource buffer |
| Lifecycle   | CrashLoopBackOff    | Sensible health checks + resource limits    |
| Network     | Service unreachable | Verify label matching; test DNS             |
| Storage     | PVC Pending         | Confirm the StorageClass exists             |
| Permissions | RBAC errors         | Least privilege; verify before rollout      |
| Autoscaling | HPA not scaling     | Always set resource requests                |

Golden rule: validate thoroughly in a test environment before touching production!


👤 About the Author

A public-cloud practitioner based in Henan, central China, selling mainly Tencent Cloud / Alibaba Cloud / Huawei Cloud. Has stepped in countless pits and now focuses on putting large AI models into production. Follow the WeChat account "公有云cloud" for AI updates.

Blog: yunduancloud.icu