20230626
After installing the Calico network plugin, cross-node pod communication fails, Service access fails, and the apiserver cannot be reached
- Symptoms:
  - Cluster pods can be pinged, but curl to them times out
  - All Calico components are running normally, and restarting calico-node does not help
  - The kubelet configuration has iptables.masqueradeBit: 30
- Settings already checked (a quick way to verify them follows this list):
  - IP_AUTODETECTION_METHOD is can-reach or interface
  - CALICO_IPV4POOL_IPIP is Always
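For reference, one way to confirm how those two environment variables are currently set on the calico-node DaemonSet (object and variable names as in the stock calico.yaml manifest):

```bash
kubectl -n kube-system get daemonset calico-node -o yaml \
  | grep -A 1 -E 'IP_AUTODETECTION_METHOD|CALICO_IPV4POOL_IPIP'
```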
Solution
- Cause: the MTU of the virtual (veth) interface was set to 1500; with IPIP encapsulation (CALICO_IPV4POOL_IPIP: Always) the tunnel header adds 20 bytes, so full-size packets no longer fit and get dropped
- Change veth_mtu in the calico-config ConfigMap to 0 or 1440 (a sketch of the change follows below)
- Change the kubelet configuration iptables.masqueradeBit back to its default value of 14
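A minimal sketch of the ConfigMap change, assuming the stock calico.yaml manifest where the MTU lives under the veth_mtu key of the calico-config ConfigMap in kube-system:

```bash
# Set veth_mtu to 1440 (or "0" to let newer Calico releases auto-detect it),
# then restart calico-node; existing pods keep the old MTU until recreated
kubectl -n kube-system patch configmap calico-config \
  --type merge -p '{"data":{"veth_mtu":"1440"}}'
kubectl -n kube-system rollout restart daemonset calico-node
```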
20211105
Viewing the logs reports MethodNotAllowed (HTTP 405), as shown in the screenshot below
Solution
- Cause: log collection (the kubelet's debugging handlers) was disabled and needs to be re-enabled.
- Either add the flag on the kubelet command line:
  --enable-debugging-handlers=true
- Or add it in the kubelet configuration YAML (a node-level sketch follows):
  enableDebuggingHandlers: true
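On a kubeadm-managed node, one way to apply the YAML form is to flip the setting in the kubelet's config file and restart the kubelet (the path assumes kubeadm defaults):

```bash
# If the key is absent it already defaults to true
sed -i 's/^enableDebuggingHandlers: false/enableDebuggingHandlers: true/' \
  /var/lib/kubelet/config.yaml
systemctl restart kubelet
```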
20190507
A single-node Kubernetes machine hung and had to be force power-cycled. After the cluster was brought back up, etcd kept restarting, complaining that a snapshot backup file could not be found. The startup output is as follows:
{"level":"warn","ts":"2022-10-26T02:23:54.885Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":250025,"snapshot-fi
le-path":"/var/lib/etcd/member/snap/000000000003d0a9.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-10-26T02:23:54.885Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to fi
nd database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/releas
e/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.i
o/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV
2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server
/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000502300, 0xc0004acd40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0006322d0, 0x1234726, 0x2a, 0xc0004acd40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffe21b69e44, 0x4, 0x0, 0x0, 0x0, 0x0, 0xc000221d40, 0x1, 0x1, 0xc00064a000, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000642000, 0xc000642600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000642000, 0x12089be, 0x6, 0xc0000a8101, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a140, 0x14, 0x14)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a140, 0x14, 0x14)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
main.main()
/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32 +0x45
Solution
- Delete all files in the snapshot directory and restart the cluster
- systemctl stop kubelet
- rm -rf /var/lib/etcd/member/snap/*
- systemctl start kubelet
After the steps above, no Kubernetes data is lost. etcd records every operation in its WAL (write-ahead log); the snap files are only snapshots used to load state quickly at startup. After the snapshots are deleted, etcd re-reads the WAL files on startup and replays them, so there is no need to worry about data loss.
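Before deleting anything, it is worth confirming that the WAL files the replay depends on are still present, and checking etcd health once it is back up. This is only a sketch; the data directory and certificate paths below assume kubeadm defaults:

```bash
# The WAL files must exist for the replay described above to work
ls -lh /var/lib/etcd/member/wal/
ls -lh /var/lib/etcd/member/snap/

# After the restart, verify etcd is healthy again
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
```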
After deploying Kubernetes in a development environment with Istio handling ingress traffic, problems appeared once GitLab, Jenkins, and JumpServer were deployed:
- Refreshing any service's page intermittently returns
  upstream connect error or disconnect/reset before headers
- The JumpServer web terminal reports connect closed or connect error about 3-5 seconds after connecting
- A kubectl exec session into a pod exits automatically after about 10 seconds
Troubleshooting approach:
- The backend services themselves are running normally
- Access via NodePort works, but the same intermittent errors appear
- Access via the Service (ClusterIP) works, but the same intermittent errors appear (a quick way to quantify this is sketched after this list)
- Check whether istio-ingress supports the WebSocket protocol
- Use Kiali to inspect how traffic flows through istio-ingress
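One simple way to quantify the "intermittent" failures at each hop is a curl loop with a count of the returned status codes; the ClusterIP below is the jenkins Service listed further down and is only illustrative:

```bash
# 50 requests against the Service; the tally of non-200 codes shows how often
# the upstream error occurs at this hop (repeat against the NodePort and the
# ingress gateway address to compare)
for i in $(seq 1 50); do
  curl -s -o /dev/null -w '%{http_code}\n' 'http://10.103.40.60/login?from=%2f'
done | sort | uniq -c
```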
| OS | Docker version | Kubernetes version | Istio version |
| --- | --- | --- | --- |
| CentOS 7.9.2009 | v19.03.15 | v1.20.5 | 1.9.1 |
Pod status
[root@k8smaster1 ~]# kubectl get pod -n tools
NAME READY STATUS RESTARTS AGE
gitlab-8b46fdb7-2xvkx 1/1 Running 0 4d16h
jenkins-5746bdbf76-bcflk 1/1 Running 0 41h
jumpserver-84f75577bd-r67d8 1/1 Running 0 22h
Service status
[root@k8smaster1 ~]# kubectl get svc -n tools
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
gitlab ClusterIP 10.100.127.175 <none> 80/TCP,22/TCP 5d19h
glusterfs ClusterIP 10.105.254.77 <none> 49152/TCP 5d19h
jenkins ClusterIP 10.103.40.60 <none> 80/TCP,50000/TCP 42h
jumpserver ClusterIP 10.96.12.39 <none> 80/TCP,2222/TCP 22h
[root@k8smaster1 ~]# curl 10.103.40.60/login?from=%2f >/dev/null -s -vvv
* About to connect() to 10.103.40.60 port 80 (#0)
* Trying 10.103.40.60...
* Connected to 10.103.40.60 (10.103.40.60) port 80 (#0)
> GET /login?from=%2f HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.103.40.60
> Accept: */*
< HTTP/1.1 200 OK
< Date: Wed, 14 Apr 2021 02:26:46 GMT
< X-Content-Type-Options: nosniff
< Content-Type: text/html;charset=utf-8
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: no-cache,no-store,must-revalidate
< X-Hudson: 1.395
< X-Jenkins: 2.285
Kiali configuration view (screenshot)
Kubernetes installation configuration
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  description: "Another bootstrap token"
  ttl: 500h0m0s
  usages:
  - signing
  - authentication
localAPIEndpoint:
  advertiseAddress: ${hostip}
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: $(hostname)
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 4m0s
  certSANs:
  - ${k8shost1}
  - ${k8shostip1}
  - 127.0.0.1
  extraArgs:
    authorization-mode: "Node,RBAC"
    advertise-address: "${k8shostip1}"
    anonymous-auth: "true"
    cert-dir: "/var/run/kubernetes"
    audit-log-compress: "false"
    delete-collection-workers: "2"
    kubelet-timeout: "10s"
    logging-format: "json"
    default-not-ready-toleration-seconds: "180"
    default-unreachable-toleration-seconds: "180"
certificatesDir: /etc/kubernetes/pki
controlPlaneEndpoint: "${k8shostip1}:8443"
controllerManager:
  extraArgs:
    horizontal-pod-autoscaler-cpu-initialization-period: "1m0s"
    horizontal-pod-autoscaler-downscale-stabilization: "10m0s"
    horizontal-pod-autoscaler-sync-period: "30s"
    node-eviction-rate: "0.5"
    pod-eviction-timeout: "20s"
    cluster-name: "kubernetes"
    concurrent-deployment-syncs: "10"
    concurrent-endpoint-syncs: "10"
    concurrent-namespace-syncs: "10"
    concurrent-replicaset-syncs: "10"
    concurrent-service-syncs: "3"
    horizontal-pod-autoscaler-tolerance: "0.5"
    logging-format: "json"
    kube-api-qps: "200"
    kube-api-burst: "200"
scheduler:
  extraArgs:
    bind-address: "0.0.0.0"
    leader-elect-retry-period: "5s"
    port: "10251"
    secure-port: "10259"
dns:
  type: CoreDNS
etcd:
  local:
    dataDir: "/var/lib/etcd"
imageRepository: "registry.aliyuncs.com/google_containers"
kubernetesVersion: v1.20.5
networking:
  serviceSubnet: "10.96.0.0/12"
  podSubnet: "172.168.0.0/24"
  dnsDomain: "cluster.local"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"
address: "0.0.0.0"
port: 10250
nodeStatusReportFrequency: "30s"
nodeStatusUpdateFrequency: "10s"
evictionPressureTransitionPeriod: "10s"
runtimeRequestTimeout: "30s"
evictionHard:
  "memory.available": "1024Mi"
  "nodefs.available": "20%"
streamingConnectionIdleTimeout: "10s"
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
bindAddress: "0.0.0.0"
healthzBindAddress: "0.0.0.0:10256"
metricsBindAddress: "0.0.0.0:10249"
bindAddressHardFail: false
enableProfiling: true
configSyncPeriod: "10s"
mode: "ipvs"
ipvs:
  strictARP: true
  tcpTimeout: "10s"
  tcpFinTimeout: "10s"
  udpTimeout: "3s"
conntrack:
  tcpEstablishedTimeout: "10s"
  tcpCloseWaitTimeout: "20s"
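For completeness, a sketch of how a config like the one above would be applied (assuming it is saved as kubeadm-config.yaml; the file name is illustrative):

```bash
# Substitute ${hostip}, ${k8shost1}, ${k8shostip1} first, then:
kubeadm init --config kubeadm-config.yaml
```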
Solution
- kubectl exec sessions exit automatically
  - 1.1 Cause: a kubelet parameter added when the cluster was initialized
  - Fix: in the KubeletConfiguration, change
    streamingConnectionIdleTimeout: "10s"
    to
    streamingConnectionIdleTimeout: "0s"
    (0s disables the idle timeout; a sketch of applying both fixes on a node follows this list)
- Pages intermittently show
  upstream connect error or disconnect/reset before headers
  - 2.1 Cause: the kernel's netfilter conntrack settings cut off iptables-forwarded TCP connections too early; raise the kernel limits
  - Fix: set the following kernel parameters (persisting them is sketched after this list)
    net.netfilter.nf_conntrack_tcp_timeout_established=432000
    net.netfilter.nf_conntrack_tcp_timeout_close=432000
    net.bridge.bridge-nf-call-ip6tables = 1
    net.bridge.bridge-nf-call-iptables = 1
    net.netfilter.nf_conntrack_tcp_max_retrans=43200
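Hedged sketches of applying both fixes on a node (the kubelet config path assumes kubeadm defaults, and the sysctl drop-in file name is illustrative):

```bash
# Fix 1: disable the kubelet's idle timeout for exec/attach/port-forward
# streams, then restart the kubelet (the line exists in this cluster because
# the value was set explicitly in the kubeadm config above)
sed -i 's/^streamingConnectionIdleTimeout:.*/streamingConnectionIdleTimeout: "0s"/' \
  /var/lib/kubelet/config.yaml
systemctl restart kubelet

# Fix 2: apply and persist the conntrack/bridge sysctls from item 2
cat > /etc/sysctl.d/99-conntrack.conf <<'EOF'
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_close = 432000
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.netfilter.nf_conntrack_tcp_max_retrans = 43200
EOF
sysctl --system
```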