Kubernetes Troubleshooting Notes


20230626

After installing the Calico network plugin, cross-node pod communication failed, Service access failed, and the apiserver was unreachable.

  • Symptoms:
    • Cluster pods responded to ping, but curl timed out
    • All Calico components were running normally, and restarting calico-node changed nothing
    • iptables.masqueradeBit was set to 30
  • Settings already checked:
    • IP_AUTODETECTION_METHOD (tried both can-reach and interface)
    • CALICO_IPV4POOL_IPIP (set to Always)
Solution
  • Cause: the MTU of the virtual (veth) interfaces was set to 1500
    • Change veth_mtu in the Calico ConfigMap to 0 or 1440 (see the sketch below)
    • Change the kubelet setting iptables.masqueradeBit back to its default of 14
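
A minimal sketch of the ConfigMap change, assuming Calico was installed from the standard manifest (ConfigMap calico-config in kube-system; adjust the names if your install differs):

kubectl -n kube-system patch configmap calico-config \
  --type merge -p '{"data":{"veth_mtu":"1440"}}'   # 0 lets recent Calico versions auto-detect the MTU
kubectl -n kube-system rollout restart daemonset calico-node   # restart so pods pick up the new MTU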

20211105

Viewing pod logs returned MethodNotAllowed (HTTP 405).


Solution
  • Cause: the kubelet's debugging handlers (which serve log and exec requests) had been disabled and needed to be re-enabled, either:
    • by adding --enable-debugging-handlers=true to the kubelet command line, or
    • by adding enableDebuggingHandlers: true to the kubelet YAML config (see the sketch below)
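
A minimal sketch of the config-file route, assuming the kubeadm default path (an assumption; adjust to your setup):

# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enableDebuggingHandlers: true

After editing, restart the kubelet (systemctl restart kubelet) for the setting to take effect.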

20190507

A single-node k8s machine hung and had to be force power-cycled. When the cluster came back up, etcd kept restarting in a loop, complaining that it could not find a snapshot backup file. The startup log:

{"level":"warn","ts":"2022-10-26T02:23:54.885Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":250025,"snapshot-fi
le-path":"/var/lib/etcd/member/snap/000000000003d0a9.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-10-26T02:23:54.885Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to fi
nd database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/releas
e/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.i
o/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV
2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server
/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000502300, 0xc0004acd40, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0006322d0, 0x1234726, 0x2a, 0xc0004acd40, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe21b69e44, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc000221d40, 0x1, 0x1, 0xc00064a000, ...)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000642000, 0xc000642600, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000642000, 0x12089be, 0x6, 0xc0000a8101, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a140, 0x14, 0x14)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a140, 0x14, 0x14)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
main.main()
	/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32 +0x45
Solution
  • Delete all files in the snapshot directory and restart the cluster:
    • systemctl stop kubelet
    • rm -rf /var/lib/etcd/member/snap/*
    • systemctl start kubelet

No k8s data is lost by doing this. etcd records every operation in its WAL; the snap files are only snapshots, used to load state quickly at startup. With the snapshots deleted, etcd re-reads the WAL on startup and replays it to rebuild the data, so there is no need to worry about data loss.
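
That said, since the command wipes the snap directory wholesale, a more cautious variant (a sketch; the path matches the log output above) keeps a full copy of the data directory first, so the original state can be restored if the replay fails:

systemctl stop kubelet
cp -a /var/lib/etcd /var/lib/etcd.bak   # full backup of the etcd data directory
rm -rf /var/lib/etcd/member/snap/*
systemctl start kubelet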

After deploying Kubernetes in a development environment with Istio serving as the ingress, problems showed up once gitlab, jenkins, and jumpserver were deployed:

  • Every service intermittently returned "upstream connect error or disconnect/reset before headers" on page refresh
  • The jumpserver web terminal reported "connect closed" or "connect error" about 3~5 seconds after connecting
  • kubectl exec sessions into a pod exited automatically after about 10 seconds

Troubleshooting steps:
  • Confirmed the backend services themselves were running normally
  • Access through a NodePort worked, but showed the same intermittent errors
  • Access through the Service worked, but showed the same intermittent errors
  • Checked istio-ingress's support for the WebSocket protocol
  • Inspected the istio-ingress traffic flow in Kiali

OS                Docker version   Kubernetes version   Istio version
CentOS 7.9.2009   v19.03.15        v1.20.5              1.9.1

Pod status

[root@k8smaster1 ~]# kubectl get pod -n tools
NAME                          READY   STATUS    RESTARTS   AGE
gitlab-8b46fdb7-2xvkx         1/1     Running   0          4d16h
jenkins-5746bdbf76-bcflk      1/1     Running   0          41h
jumpserver-84f75577bd-r67d8   1/1     Running   0          22h

Service status

[root@k8smaster1 ~]# kubectl get svc -n tools
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)            AGE
gitlab       ClusterIP   10.100.127.175   <none>        80/TCP,22/TCP      5d19h
glusterfs    ClusterIP   10.105.254.77    <none>        49152/TCP          5d19h
jenkins      ClusterIP   10.103.40.60     <none>        80/TCP,50000/TCP   42h
jumpserver   ClusterIP   10.96.12.39      <none>        80/TCP,2222/TCP    22h
[root@k8smaster1 ~]# curl 10.103.40.60/login?from=%2f >/dev/null -s -vvv
* About to connect() to 10.103.40.60 port 80 (#0)
*   Trying 10.103.40.60...
* Connected to 10.103.40.60 (10.103.40.60) port 80 (#0)
> GET /login?from=%2f HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.103.40.60
> Accept: */*
< HTTP/1.1 200 OK
< Date: Wed, 14 Apr 2021 02:26:46 GMT
< X-Content-Type-Options: nosniff
< Content-Type: text/html;charset=utf-8
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: no-cache,no-store,must-revalidate
< X-Hudson: 1.395
< X-Jenkins: 2.285

Kiali configuration view

(screenshot omitted)

Kubernetes installation configuration (kubeadm)

apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  description: "Another bootstrap token"
  ttl: 500h0m0s
  usages:
  - signing
  - authentication
localAPIEndpoint:
  advertiseAddress: ${hostip}
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: $(hostname)
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 4m0s
  certSANs:
    - ${k8shost1}
    - ${k8shostip1}
    - 127.0.0.1
  extraArgs:
    authorization-mode: "Node,RBAC"
    advertise-address: "${k8shostip1}"
    anonymous-auth: "true"
    cert-dir: "/var/run/kubernetes"
    audit-log-compress: "false"
    delete-collection-workers: "2"
    kubelet-timeout: "10s"
    logging-format: "json"
    default-not-ready-toleration-seconds: "180"
    default-unreachable-toleration-seconds: "180"
certificatesDir: /etc/kubernetes/pki
controlPlaneEndpoint: "${k8shostip1}:8443"
controllerManager:
  extraArgs:
    horizontal-pod-autoscaler-cpu-initialization-period: "1m0s"
    horizontal-pod-autoscaler-downscale-stabilization: "10m0s"
    horizontal-pod-autoscaler-sync-period: "30s"
    node-eviction-rate: "0.5"
    pod-eviction-timeout: "20s"
    cluster-name: "kubernetes"
    concurrent-deployment-syncs: "10"
    concurrent-endpoint-syncs: "10"
    concurrent-namespace-syncs: "10"
    concurrent-replicaset-syncs: "10"
    concurrent-service-syncs: "3"
    horizontal-pod-autoscaler-tolerance: "0.5"
    logging-format: "json"
    kube-api-qps: "200"
    kube-api-burst: "200"
scheduler:
  extraArgs:
    bind-address: "0.0.0.0"
    leader-elect-retry-period: "5s"
    port: "10251"
    secure-port: "10259"
dns:
  type: CoreDNS
etcd:
  local:
    dataDir: "/var/lib/etcd"
imageRepository: "registry.aliyuncs.com/google_containers"
kubernetesVersion: v1.20.5
networking:
  serviceSubnet: "10.96.0.0/12"
  podSubnet: "172.168.0.0/24"
  dnsDomain: "cluster.local"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"
address: "0.0.0.0"
port: 10250
nodeStatusReportFrequency: "30s"
nodeStatusUpdateFrequency: "10s"
evictionPressureTransitionPeriod: "10s"
runtimeRequestTimeout: "30s"
evictionHard:
  "memory.available": "1024Mi"
  "nodefs.available": "20%"
streamingConnectionIdleTimeout: "10s"
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
bindAddress: "0.0.0.0"
healthzBindAddress: "0.0.0.0:10256"
metricsBindAddress: "0.0.0.0:10249"
bindAddressHardFail: false
enableProfiling: true
configSyncPeriod: "10s"
mode: "ipvs"
ipvs:
  strictARP: true
  tcpTimeout: "10s"
  tcpFinTimeout: "10s"
  udpTimeout: "3s"
udpIdleTimeout: "0s"
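# kube-proxy writes these conntrack timeouts into the kernel; the short values below are likely behind the intermittent upstream resets (see the fix below)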
conntrack:
  tcpEstablishedTimeout: "10s"
  tcpCloseWaitTimeout: "20s"
Solutions
    1. kubectl exec sessions exit automatically
    • Cause: a kubelet parameter added when the cluster was initialized
      • Fix: change streamingConnectionIdleTimeout in the KubeletConfiguration
        • from: streamingConnectionIdleTimeout: "10s"
        • to:   streamingConnectionIdleTimeout: "0s" (0s disables the idle timeout; see the sketch below)
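
A sketch for applying this on a node, assuming the kubeadm default kubelet config path (the value may be rendered with or without quotes depending on the kubeadm version):

sed -i 's/^streamingConnectionIdleTimeout:.*/streamingConnectionIdleTimeout: "0s"/' /var/lib/kubelet/config.yaml
systemctl restart kubelet
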
    2. Pages intermittently show "upstream connect error or disconnect/reset before headers"
    • Cause: the kernel's conntrack tracking of iptables-forwarded TCP connections timed out too quickly
      • Fix: raise the kernel limits as follows (a persistence sketch follows this list)
        • net.netfilter.nf_conntrack_tcp_timeout_established=432000
        • net.netfilter.nf_conntrack_tcp_timeout_close=432000
        • net.bridge.bridge-nf-call-ip6tables = 1
        • net.bridge.bridge-nf-call-iptables = 1
        • net.netfilter.nf_conntrack_tcp_max_retrans=43200
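
To apply these on every node and keep them across reboots, a minimal sketch (the drop-in file name is an arbitrary choice):

cat <<'EOF' > /etc/sysctl.d/99-k8s-conntrack.conf
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_close = 432000
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 43200
EOF
sysctl --system   # reload all sysctl drop-ins immediately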