Kubernetes Error Troubleshooting Notes


Pods scheduled to a node end up Evicted

kubectl describe node XXX
The node was low on resource: ephemeral-storage. Container xxx was using 7738496Ki, which exceeds its request of 0.


The node had condition: [DiskPressure]

grep 'threshold' /var/log/messages
kubelet: I1108 03:29:23.358995    5591 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 85% which is over the high threshold (85%). Trying to free 95863558144 bytes down to the low threshold (80%).
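The 85% / 80% thresholds in that log correspond to kubelet settings; a sketch of confirming or tuning them in the kubelet config (the file path assumes a kubeadm install):

```yaml
# /var/lib/kubelet/config.yaml (kubeadm's default location)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # disk usage that triggers image GC
imageGCLowThresholdPercent: 80    # GC frees space until usage drops to this
```

Restart the kubelet after editing.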

ResourceQuota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-example
spec:
  hard:
    requests.cpu: 2
    requests.memory: 2Gi
    limits.cpu: 3
    limits.memory: 4Gi
    #pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: Exists
      scopeName: NotBestEffort
  • Every Pod container must declare CPU and memory Requests and Limits;
  • the sum of all CPU Requests must not exceed 2 cores;
  • the sum of all CPU Limits must not exceed 3 cores;
  • the sum of all memory Requests must not exceed 2 GiB;
  • the sum of all memory Limits must not exceed 4 GiB.
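For illustration, a Pod that this quota admits must declare all four values; the name and image below are made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo        # illustrative name
spec:
  containers:
  - name: app
    image: nginx          # illustrative image
    resources:
      requests:
        cpu: "500m"       # counts toward requests.cpu (max 2)
        memory: 512Mi     # counts toward requests.memory (max 2Gi)
      limits:
        cpu: "1"          # counts toward limits.cpu (max 3)
        memory: 1Gi       # counts toward limits.memory (max 4Gi)
```

Because it declares requests, the Pod is not BestEffort, so the NotBestEffort scope above applies to it.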

LimitRange 

pass

failed to garbage collect required amount of images. Wanted to free 47664015769 bytes, but freed 0 bytes

Automatic garbage collection of unused images failed; free up disk space on the node manually.

lookup dns timeout

DNS queries hit the resolver's 5-second timeout.

        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c 
              - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"
  • single-request-reopen makes the resolver send the A and AAAA queries from different source ports, so the two requests do not share the same conntrack table entry, which avoids the conflict.

    Alternatively, set the resolver option declaratively via dnsConfig:

    template:
      spec:
        dnsConfig:
          options:
          - name: single-request-reopen

Reference: cloud.tencent.com/developer/a…

Unable to get node resources

Unable to connect to the server: x509: certificate has expired or is not yet
Caused by expired cluster certificates.
Back up the certificates: tar zcf /root/k8s_pki.tar.gz /etc/kubernetes/pki
Renew all certificates: kubeadm alpha certs renew all
Replace the .kube/config file, then restart the three control-plane components (kube-apiserver, kube-controller-manager, kube-scheduler) on the control node.

A newly joined node shows the wrong IP in kubectl get

--node-ip="192.168.223.102"
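The flag is passed to the kubelet; on a kubeadm-provisioned CentOS node the usual place is /etc/sysconfig/kubelet (/etc/default/kubelet on Debian-family systems) — an assumption, since the original does not say where it was set:

```
# /etc/sysconfig/kubelet (assumed kubeadm default path)
KUBELET_EXTRA_ARGS=--node-ip=192.168.223.102
```

Then systemctl restart kubelet.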

Node NotReady; kubelet reports "use of closed network connection"

Cause: a bug in Go's HTTP/2 handling, triggered with roughly 1-in-1000 probability.
Workaround: systemctl restart kubelet
Permanent fix: disable HTTP/2, at the cost of a much larger number of HTTP/1.1 connections.

etcd elects leaders frequently, causing scheduler and controller-manager to restart repeatedly

etcd drops messages "since streamMsg's sending buffer is full": the handling of follower requests is delayed by network latency.
Fix: increase the heartbeat interval and election timeout between members:
    - --heartbeat-interval=500
    - --election-timeout=5000

Master kubelet emits large numbers of ERROR logs

Failed to list *v1.Secret: secrets is forbidden: User "system:node:pro-star-manager223-75" cannot list resource "secrets" in API group "" in the namespace "xgsj": No Object name found
Fix: grant the missing permission: kubectl create clusterrolebinding system-node-role-bound --clusterrole=system:node --group=system:nodes

docker cannot rm, kill, or stop a container

kill -9 the container's process, check which network mode the container uses, then force-detach it from the network: docker network disconnect --force bridge mysql1

Node NotReady

Deploying base services fails: MountVolume.SetUp failed for volume "kube-proxy-token-k7pd7" : failed to sync secret cache: timed out waiting for the condition
Node kubelet log: Failed to initialize CSINodeInfo after retrying
the server could not find the requested resource
Cause: the kubelet version is newer than the rest of the cluster, and its CSIMigration feature gate is enabled by default.
Fix: disable the gate in /var/lib/kubelet/config.yaml:
featureGates:
  CSIMigration: false

Reference: kubernetes.io/zh/docs/ref…

Node kubelet emits large numbers of error logs

Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Cause: kubelet cannot read the cgroup stats needed for pod resource accounting; many reports online say the fix belongs in the kubeadm RPM packages, and the problem appears only on CentOS machines.
Resulting problem: kubectl exec into containers on the affected node fails.
Fix: add to /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf (in the [Service] section):
CPUAccounting=true
MemoryAccounting=true

Reference: www.mydlq.club/article/80/

calico deployment fails: Number of node(s) with BGP peering established = 0; calico/node is not ready

            - name: IP_AUTODETECTION_METHOD
              value: "interface=eth1"
Set IP_AUTODETECTION_METHOD to the name of the interface actually carrying node traffic.

calico:Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused

[FATAL][2313] int_dataplane.go 824: Kernel's RPF check is set to 'loose'.  This would allow endpoints to spoof their IP address.  Calico requires net.ipv4.conf.all.rp_filter to be set to 0 or 1. If you require loose RPF and you are not concerned about spoofing, this check can be disabled by setting the IgnoreLooseRPF configuration parameter to 'true'.
sysctl -w net.ipv4.conf.all.rp_filter=1
rp_filter controls whether the kernel validates the source address of incoming packets:
0: no source-address validation.
1: strict reverse-path validation: for each incoming packet, check that the reverse path is the best route back to the source; if it is not, drop the packet.
2: loose reverse-path validation: check only that the source address is reachable via some interface; if the reverse path is unreachable, drop the packet.
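As the log message itself suggests, an alternative to changing the sysctl is telling Felix to tolerate loose RPF (only if spoofing is not a concern). A sketch of the calico resource, assuming the usual "default" singleton and applying it with calicoctl:

```yaml
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  ignoreLooseRPF: true   # the IgnoreLooseRPF parameter named in the log
```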

Kubelet fails to start

Kubelet: Failed to watch directory /sys/fs/cgroup/memory/system.slice/XXX no space left on device

sysctl fs.inotify.max_user_watches=524288
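That sysctl does not survive a reboot; to persist it, drop it into a sysctl config file (the file name is a conventional choice, not from the original):

```
# /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_watches = 524288
```

Apply with sysctl --system.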

Reference: www.bookstack.cn/read/kubern…

Kubelet fails to start

Failed to start ContainerManager Cannot set property TasksAccounting, or unknown property

yum update systemd

Namespace cannot be deleted

kubectl get ns rdbms -o json > tmp.json
Edit the JSON file and delete the spec section (which holds the finalizers), then PUT it back through the namespace's finalize subresource (substitute your stuck namespace's name):
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://192.168.130.230:8001/api/v1/namespaces/tunnel-proxy/finalize
kubectl replace --raw "/api/v1/namespaces/$NAMESPACE/finalize" -f ./$NAMESPACE.json
Force-delete a stuck PVC: kubectl patch pvc pvc-9cd01e19-93b4-4bd8-bfc8-9d96cbe03f46 -p '{"metadata":{"finalizers":null}}'

Node NotReady: failed to ensure node lease exists, will retry in 7s, error: an error on the server ("") has prevented the request from succeeding

Fix: raise the apiserver flag --http2-max-streams-per-connection to 1000.

Reference: segmentfault.com/a/119000004…

Starting a container fails with mkdir /var/lib/kubelet: read-only file system

Cause: kubelet's cgroup driver is systemd while docker's is cgroupfs; the two must match.
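The original does not give the fix; the usual one is to make both drivers agree, e.g. switching docker to systemd via /etc/docker/daemon.json and then restarting docker and the kubelet:

```json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
```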

kubelet error: container runtime status check may not have completed yet, PLEG is not healthy

Caused by a bug in the systemd-219-67.el7.x86_64 package:
yum update -y systemd && systemctl daemon-reexec && systemctl restart docker