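Symptoms:
I run the Calico network plugin on k8s. Over the last couple of days the DNS service started misbehaving. After some digging I found that, of the two DNS pods, the pod IP on the master node could not be pinged, so DNS could not serve requests properly. I then checked the network plugin's pods and found that the calico-node pod on the master node was not healthy.
The errors looked like this: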
NAME                                       READY   STATUS    RESTARTS       AGE    IP               NODE            NOMINATED NODE   READINESS GATES
calico-kube-controllers-7cd8b89887-vfzwc   1/1     Running   2 (117d ago)   132d   10.244.118.109   xy-5-server14   <none>           <none>
calico-node-9qtv5                          1/1     Running   0              132d   192.168.5.19     xy-5-server19   <none>           <none>
calico-node-lxg9k                          0/1     Running   0              34s    192.168.5.14     xy-5-server14   <none>           <none>
calico-node-rmscn                          1/1     Running   0              33s    192.168.5.17     xy-5-server17   <none>           <none>
calico-typha-d4f58c4c9-8nf76               1/1     Running   0              132d   192.168.5.17     xy-5-server17   <none>           <none>
calico-typha-d4f58c4c9-dbf8g               1/1     Running   0              132d   192.168.5.14     xy-5-server14   <none>           <none>
csi-node-driver-92rbg                      2/2     Running   0              132d   10.244.116.196   xy-5-server17   <none>           <none>
csi-node-driver-gpgwd                      2/2     Running   0              132d   10.244.6.82      xy-5-server19   <none>           <none>
csi-node-driver-h9kbw                      2/2     Running   0              132d   10.244.118.101   xy-5-server14   <none>           <none>
[root@xy-5-server14 calico]# kubectl -n calico-system describe pod calico-node-lxg9k
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 47s default-scheduler Successfully assigned calico-system/calico-node-lxg9k to xy-5-server14
Normal Pulled 47s kubelet Container image "docker.io/calico/pod2daemon-flexvol:v3.24.5" already present on machine
Normal Created 47s kubelet Created container flexvol-driver
Normal Started 47s kubelet Started container flexvol-driver
Normal Pulled 46s kubelet Container image "docker.io/calico/cni:v3.24.5" already present on machine
Normal Created 45s kubelet Created container install-cni
Normal Started 45s kubelet Started container install-cni
Normal Pulled 42s kubelet Container image "docker.io/calico/node:v3.24.5" already present on machine
Normal Created 42s kubelet Created container calico-node
Normal Started 41s kubelet Started container calico-node
Warning Unhealthy 40s (x2 over 41s) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Warning Unhealthy 37s kubelet Readiness probe failed: 2023-07-18 08:18:19.246 [INFO][379] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
Warning Unhealthy 27s kubelet Readiness probe failed: 2023-07-18 08:18:29.242 [INFO][423] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
Warning Unhealthy 17s kubelet Readiness probe failed: 2023-07-18 08:18:39.246 [INFO][455] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
Warning Unhealthy 7s kubelet Readiness probe failed: 2023-07-18 08:18:49.249 [INFO][486] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
More generally, none of the pod IPs on the master node could be pinged.
Finding the problem
Install the Calico client (calicoctl). Reference: www.cnblogs.com/varden/p/15…
On the master:
[root@xy-5-server14 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 192.168.5.17 | node-to-node mesh | up    | 08:36:51 | Established |
| 192.168.5.19 | node-to-node mesh | up    | 08:37:15 | Established |
+--------------+-------------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
The peerings look fine here... On node1:
[root@xy-5-server17 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 192.168.5.19 | node-to-node mesh | up    | 2023-03-07 | Established |
| 10.4.0.1     | node-to-node mesh | start | 2023-07-17 | Connect     |
+--------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
There is the problem: the master should be peering as 192.168.5.14, but here it shows up as 10.4.0.1. node2 shows the same thing:
[root@xy-5-server19 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 192.168.5.17 | node-to-node mesh | up    | 08:18:24   | Established |
| 10.4.0.1     | node-to-node mesh | start | 2023-07-17 | Connect     |
+--------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
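With more than a handful of nodes, reading these tables by eye gets tedious. The following is a minimal sketch, not part of my original troubleshooting, that filters `calicoctl node status` output down to the peers whose INFO column is not Established. The function name `unestablished_peers` is my own, and the sketch assumes the table layout shown above.

```shell
#!/bin/sh
# Reads `calicoctl node status` output on stdin and prints every peer
# address whose INFO column is not "Established".
# Assumption: rows use the "| addr | type | state | since | info |" layout.
unestablished_peers() {
  awk -F'|' '
    NF >= 6 && $2 !~ /PEER ADDRESS/ {
      addr = $2; info = $6
      gsub(/^ +| +$/, "", addr)   # trim the padding around each cell
      gsub(/^ +| +$/, "", info)
      if (info != "Established") print addr
    }'
}

# Sample rows copied from the xy-5-server17 output above.
unestablished_peers <<'EOF'
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
| 192.168.5.19 | node-to-node mesh | up | 2023-03-07 | Established |
| 10.4.0.1 | node-to-node mesh | start | 2023-07-17 | Connect |
EOF
```

On a real node the usage would be `calicoctl node status | unestablished_peers`. For the sample rows it prints 10.4.0.1, the wrongly detected master address.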
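I found posts online describing the same problem: www.jianshu.com/p/4b175e733… cloud.tencent.com/developer/a… They say the network interface has to be specified explicitly. However, I installed Calico with the operator, so editing the calico-node DaemonSet directly has no effect: the operator reverts the change. That differs from what those posts describe.
Fixing the problem
I found the relevant setting in the official Calico docs: docs.tigera.io/calico/late…
Then I located the Installation resource in the cluster: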
[root@xy-5-server17 ~]# kubectl get Installation
NAME AGE
default 155d
[root@xy-5-server17 ~]# kubectl edit Installation default
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  creationTimestamp: "2023-02-13T09:18:18Z"
  finalizers:
  - tigera.io/operator-cleanup
  generation: 3
  name: default
  resourceVersion: "151883088"
  uid: 580c6998-4b1e-4616-8c0b-7a3fc4adf553
spec:
  calicoNetwork:
    bgp: Enabled
    hostPorts: Enabled
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      disableBGPExport: false
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    linuxDataplane: Iptables
    multiInterfaceMode: None
    nodeAddressAutodetectionV4:
      interface: ens4f1
  cni:
    ipam:
      type: Calico
    type: Calico
  controlPlaneReplicas: 2
  flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  kubeletVolumePluginPath: /var/lib/kubelet
  nodeUpdateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  nonPrivileged: Disabled
  variant: Calico
status:
  computed:
    calicoNetwork:
      bgp: Enabled
      hostPorts: Enabled
      ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        disableBGPExport: false
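Change this part of the configuration:
  nodeAddressAutodetectionV4:
    interface: ens4f1
as described in the docs, setting interface to the name of your own NIC.
After that, the calico-node pod on the master node became healthy, the DNS pod IPs could be pinged again, and the DNS service recovered. Problem solved.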
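I made the change interactively with kubectl edit. If you would rather script it, the same field can probably be set with kubectl patch. The following is a minimal sketch I have not run against this cluster. The helper name `build_autodetect_patch` is my own, and the script only prints the command for review instead of executing it. It assumes spec.calicoNetwork already exists, as it does in the Installation shown above.

```shell
#!/bin/sh
# Builds a JSON Patch (RFC 6902) that sets nodeAddressAutodetectionV4 to a
# single interface rule. With "add", an existing key has its value replaced,
# so any previously configured detection method is dropped.
# Assumption: /spec/calicoNetwork is already present in the Installation.
build_autodetect_patch() {
  printf '[{"op":"add","path":"/spec/calicoNetwork/nodeAddressAutodetectionV4","value":{"interface":"%s"}}]\n' "$1"
}

# Print the command for review; ens4f1 is the NIC name from this cluster.
echo "kubectl patch installations.operator.tigera.io default --type=json -p '$(build_autodetect_patch ens4f1)'"
```

After the Installation changes, the operator should roll out the calico-node DaemonSet again, which can be watched with `kubectl -n calico-system rollout status daemonset/calico-node`.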
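Fixing pod IPs that cannot be pinged with Calico on Rocky 9.2
After switching the operating system to Rocky 9.2, pod IPs that used to answer ping stopped responding.
I captured packets on the node hosting the pod:
The ping requests were arriving, but there seemed to be no reply going back?
The node itself could reach the pod IPs it hosts. From another machine, a route to the pod IP does exist, going through the vxlan.calico device: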
[root@sc-master-1 milvus]# ip route get 10.244.116.190
10.244.116.190 via 10.244.116.128 dev vxlan.calico src 10.244.181.192 uid 0
cache
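The route above leaves through vxlan.calico, so the ping should be VXLAN-encapsulated on its way to the target node. To repeat this check for several pod IPs, the device can be pulled out of the `ip route get` output. This is a minimal sketch; `route_dev` is a name I made up.

```shell
#!/bin/sh
# Prints the outgoing device named in `ip route get <addr>` output (stdin).
route_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Sample line copied from the output above.
echo "10.244.116.190 via 10.244.116.128 dev vxlan.calico src 10.244.181.192 uid 0" | route_dev
```

On a live node this would be `ip route get 10.244.116.190 | route_dev`. If it prints vxlan.calico, capturing `udp port 4789` on the physical NIC of both nodes may show where the packets get lost. That port is my assumption of Calico's default VXLAN port, so check your own configuration.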
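While reading the official Calico docs, I found this:
In other words, when the network is managed by NetworkManager, NetworkManager may interfere with Calico's routing.
So, following the docs, I added this config file: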
[root@sc-master-1 milvus]# cat /etc/NetworkManager/conf.d/calico.conf
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico;interface-name:vxlan-v6.calico;interface-name:wireguard.cali;interface-name:wg-v6.cali
[root@sc-master-1 milvus]#
To make the config take effect I had to reboot the machine (I have not yet found a way to apply it without a reboot). After that, pinging the pod IPs worked again.
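One thing that may be worth trying before a reboot is asking NetworkManager to reload its configuration (`nmcli general reload` or `systemctl reload NetworkManager`) and then checking whether it still manages any Calico devices. I have not verified that a reload alone is enough, so treat this as an assumption. The following is a minimal sketch of the check only. The function name `managed_calico_devices` is my own and the sample input is made up.

```shell
#!/bin/sh
# Reads "DEVICE:STATE" lines, the terse format of
# `nmcli -t -f DEVICE,STATE device`, and prints Calico-owned devices
# that NetworkManager still manages. The name patterns mirror the
# unmanaged-devices list in calico.conf above.
managed_calico_devices() {
  awk -F: '$1 ~ /^(cali|tunl|vxlan\.calico|vxlan-v6\.calico|wireguard\.cali|wg-v6\.cali)/ && $2 != "unmanaged" { print $1 }'
}

# Hypothetical sample input, not captured from the affected machine.
managed_calico_devices <<'EOF'
ens4f1:connected
vxlan.calico:connected (externally)
cali1a2b3c4d5e6:unmanaged
EOF
```

On a real node the usage would be `nmcli -t -f DEVICE,STATE device | managed_calico_devices`. Empty output would suggest NetworkManager has released the Calico interfaces. Routes it already altered may still need calico-node to restore them.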
There are some further notes:
I have not verified them, though. Maybe another time.
If the network still does not work after the changes above, clean up the k8s node and rejoin it to the cluster.
Run on the node: echo y |kubeadm reset && rm -rf /etc/cni/net.d && ipvsadm --clear && kubectl delete node xxxx
Then: kubeadm join 172.70.21.9:16443 --token y9t1ep.45vr8r23go9qde4a --discovery-token-ca-cert-hash sha256:ace0c89c5c1d184c12c8
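The reset-and-rejoin steps above can be sketched as a small dry-run script. It is a sketch under assumptions, not what I actually ran. By default it only prints the commands. The variable `CALICO_REJOIN_EXECUTE` and the function names are hypothetical. I also assume that kubectl access lives on a control-plane node, so the delete and token steps are split out from the node-side cleanup.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command unless CALICO_REJOIN_EXECUTE=yes.
run() {
  if [ "${CALICO_REJOIN_EXECUTE:-no}" = "yes" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# Cleanup on the node that is being reset.
reset_node() {
  run kubeadm reset -f          # -f skips the confirmation prompt
  run rm -rf /etc/cni/net.d     # remove leftover CNI configuration
  run ipvsadm --clear           # flush IPVS rules
}

# Steps on a machine with kubectl access (assumed: a control-plane node).
remove_and_reinvite() {
  run kubectl delete node "$1"
  run kubeadm token create --print-join-command   # prints a fresh join command
}

reset_node
remove_and_reinvite xxxx   # xxxx is the node-name placeholder used above
```

The join command printed by the last step is then run on the reset node, which avoids reusing an expired token.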