Pods scheduled onto a node keep getting Evicted
kubectl describe node XXX
The node was low on resource: ephemeral-storage. Container xxx was using 7738496Ki, which exceeds its request of 0.
The node had condition: [DiskPressure]
grep 'threshold' /var/log/messages
kubelet: I1108 03:29:23.358995 5591 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 85% which is over the high threshold (85%). Trying to free 95863558144 bytes down to the low threshold (80%).
ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-example
spec:
  hard:
    requests.cpu: 2
    requests.memory: 2Gi
    limits.cpu: 3
    limits.memory: 4Gi
    #pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: Exists
      scopeName: NotBestEffort
- Every Pod container must declare CPU and memory Requests and Limits;
- the sum of all CPU Requests must not exceed 2 cores;
- the sum of all CPU Limits must not exceed 3 cores;
- the sum of all memory Requests must not exceed 2 GiB;
- the sum of all memory Limits must not exceed 4 GiB.
LimitRange
(skipped)
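The notes skip LimitRange, but it pairs naturally with the quota above: a quota that counts requests/limits rejects any pod that omits them, unless a LimitRange injects defaults. A minimal sketch (name and values are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-cpu-defaults
spec:
  limits:
  - type: Container
    default:            # injected as the container's limit when unset
      cpu: 500m
      memory: 512Mi
    defaultRequest:     # injected as the container's request when unset
      cpu: 250m
      memory: 256Mi
```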
failed to garbage collect required amount of images. Wanted to free 47664015769 bytes, but freed 0 bytes
Automatic garbage collection of unused images failed; free the space manually.
lookup dns timeout
Caused by the 5-second DNS resolver timeout.
lifecycle:
  postStart:
    exec:
      command:
      - /bin/sh
      - -c
      - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"
single-request-reopen makes the resolver send the A and AAAA queries from different source ports, so the two requests no longer share one conntrack table entry and the conflict is avoided.
template:
  spec:
    dnsConfig:
      options:
      - name: single-request-reopen
Ref: cloud.tencent.com/developer/a…
Unable to get node resources
Unable to connect to the server: x509: certificate has expired or is not yet valid
Caused by expired certificates.
Back up the certificates: tar zcf /root/k8s_pki.tar.gz /etc/kubernetes/pki
Renew all certificates: kubeadm alpha certs renew all
Replace the .kube/config file (with the regenerated admin.conf) and restart the three control-plane components on the master.
A newly joined node shows the wrong IP in kubectl get
Set the kubelet flag --node-ip="192.168.223.102"
Node NotReady; kubelet logs: use of closed network connection
Cause: a Go HTTP/2 handling bug, triggered with roughly 1-in-1000 probability
Workaround: systemctl restart kubelet
Permanent fix: disable HTTP/2, at the cost of far more HTTP/1.1 connections
etcd holds leader elections frequently, causing scheduler and controller-manager to restart frequently
etcd logs show messages dropped "since streamMsg's sending buffer is full": follower request handling is delayed by network latency
Fix: increase the heartbeat interval and election timeout between members
- --heartbeat-interval=500
- --election-timeout=5000
Large numbers of ERROR logs from the master kubelet
Failed to list *v1.Secret: secrets is forbidden: User "system:node:pro-star-manager223-75" cannot list resource "secrets" in API group "" in the namespace "xgsj": No Object name found
Fix: grant the permission: kubectl create clusterrolebinding system-node-role-bound --clusterrole=system:node --group=system:nodes
A container cannot be removed with docker rm, kill, or stop
kill -9 the container's process; check which network mode the container uses, then detach the network: docker network disconnect --force bridge mysql1
Node NotReady
Error while deploying base services: MountVolume.SetUp failed for volume "kube-proxy-token-k7pd7" : failed to sync secret cache: timed out waiting for the condition
Node kubelet log: Failed to initialize CSINodeInfo after retrying
the server could not find the requested resource
Cause: this node runs a newer Kubernetes version than the rest of the cluster, so its CSIMigration feature gate defaults to on
Fix: in /var/lib/kubelet/config.yaml set featureGates: CSIMigration: false
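In context, the fix edits the kubelet config file; a sketch of the relevant fragment (the surrounding fields are whatever kubeadm already wrote there):

```yaml
# /var/lib/kubelet/config.yaml — then restart kubelet
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CSIMigration: false
```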
Large numbers of kubelet logs on a node
Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Cause: kubelet cannot read resource usage for these cgroups; many reports say a fix should ship in the kube rpm packages; only seen on CentOS machines
Resulting problem: kubectl exec into containers on the affected node fails
Fix: in /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf add under [Service]:
CPUAccounting=true
MemoryAccounting=true
Calico deployment fails: Number of node(s) with BGP peering established = 0; calico/node is not ready
- name: IP_AUTODETECTION_METHOD
value: "interface=eth1"
Set this to the node's actual interface name.
calico:Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused
[FATAL][2313] int_dataplane.go 824: Kernel's RPF check is set to 'loose'. This would allow endpoints to spoof their IP address. Calico requires net.ipv4.conf.all.rp_filter to be set to 0 or 1. If you require loose RPF and you are not concerned about spoofing, this check can be disabled by setting the IgnoreLooseRPF configuration parameter to 'true'.
sysctl -w net.ipv4.conf.all.rp_filter=1
Controls whether the kernel validates the source address of incoming packets:
- 0: no source-address validation.
- 1: strict reverse-path filtering: for each incoming packet, verify that the reverse path is the best route back to the source; drop the packet if it is not.
- 2: loose reverse-path filtering: for each incoming packet, verify only that the source is reachable via some interface; drop the packet if it is not.
Kubelet fails to start
kubelet: Failed to watch directory /sys/fs/cgroup/memory/system.slice/XXX: no space left on device
sysctl fs.inotify.max_user_watches=524288
Ref: www.bookstack.cn/read/kubern…
Kubelet fails to start
Failed to start ContainerManager Cannot set property TasksAccounting, or unknown property
yum update systemd
A namespace cannot be deleted (stuck in Terminating)
kubectl get ns rdbms -o json > tmp.json
Edit the JSON file and delete the spec section, then (with kubectl proxy running) PUT it to the namespace's finalize endpoint:
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://192.168.130.230:8001/api/v1/namespaces/tunnel-proxy/finalize
kubectl replace --raw "/api/v1/namespaces/$NAMESPACE" -f ./$NAMESPACE.json
Force-delete a stuck PVC by clearing its finalizers: kubectl patch pvc pvc-9cd01e19-93b4-4bd8-bfc8-9d96cbe03f46 -p '{"metadata":{"finalizers":null}}'
Node NotReady; kubelet log: failed to ensure node lease exists, will retry in 7s, error: an error on the server ("") has prevented the request from succeeding
Fix: raise the apiserver flag --http2-max-streams-per-connection to 1000
Ref: segmentfault.com/a/119000004…
Starting a container fails: mkdir /var/lib/kubelet: read-only file system
Cause: kubelet's cgroup driver is systemd while docker's is cgroupfs
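A common way to resolve the mismatch is to move Docker onto the systemd driver so it matches kubelet — add this to /etc/docker/daemon.json and restart docker (merge with any existing keys in the file):

```json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
```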
kubelet error: container runtime status check may not have completed yet; PLEG is not healthy
Caused by a bug in systemd-219-67.el7.x86_64
yum update -y systemd && systemctl daemon-reexec && systemctl restart docker