Resolving "BIRD is not ready: BGP not established"


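Symptoms:

I run the Calico network plugin on Kubernetes. Over the past couple of days the DNS service started misbehaving. Investigation showed that the DNS pod IPs on the master node could not be pinged, so DNS could not serve requests properly. Looking at the network plugin's pods, the calico-node pod on the master node was not healthy.

The errors were as follows: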

NAME                                       READY   STATUS    RESTARTS       AGE    IP               NODE            NOMINATED NODE   READINESS GATES
calico-kube-controllers-7cd8b89887-vfzwc   1/1     Running   2 (117d ago)   132d   10.244.118.109   xy-5-server14   <none>           <none>
calico-node-9qtv5                          1/1     Running   0              132d   192.168.5.19     xy-5-server19   <none>           <none>
calico-node-lxg9k                          0/1     Running   0              34s    192.168.5.14     xy-5-server14   <none>           <none>
calico-node-rmscn                          1/1     Running   0              33s    192.168.5.17     xy-5-server17   <none>           <none>
calico-typha-d4f58c4c9-8nf76               1/1     Running   0              132d   192.168.5.17     xy-5-server17   <none>           <none>
calico-typha-d4f58c4c9-dbf8g               1/1     Running   0              132d   192.168.5.14     xy-5-server14   <none>           <none>
csi-node-driver-92rbg                      2/2     Running   0              132d   10.244.116.196   xy-5-server17   <none>           <none>
csi-node-driver-gpgwd                      2/2     Running   0              132d   10.244.6.82      xy-5-server19   <none>           <none>
csi-node-driver-h9kbw                      2/2     Running   0              132d   10.244.118.101   xy-5-server14   <none>           <none>
[root@xy-5-server14 calico]# kubectl  -n calico-system describe pod calico-node-lxg9k 

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  47s                default-scheduler  Successfully assigned calico-system/calico-node-lxg9k to xy-5-server14
  Normal   Pulled     47s                kubelet            Container image "docker.io/calico/pod2daemon-flexvol:v3.24.5" already present on machine
  Normal   Created    47s                kubelet            Created container flexvol-driver
  Normal   Started    47s                kubelet            Started container flexvol-driver
  Normal   Pulled     46s                kubelet            Container image "docker.io/calico/cni:v3.24.5" already present on machine
  Normal   Created    45s                kubelet            Created container install-cni
  Normal   Started    45s                kubelet            Started container install-cni
  Normal   Pulled     42s                kubelet            Container image "docker.io/calico/node:v3.24.5" already present on machine
  Normal   Created    42s                kubelet            Created container calico-node
  Normal   Started    41s                kubelet            Started container calico-node
  Warning  Unhealthy  40s (x2 over 41s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  37s                kubelet            Readiness probe failed: 2023-07-18 08:18:19.246 [INFO][379] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
  Warning  Unhealthy  27s  kubelet  Readiness probe failed: 2023-07-18 08:18:29.242 [INFO][423] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
  Warning  Unhealthy  17s  kubelet  Readiness probe failed: 2023-07-18 08:18:39.246 [INFO][455] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19
  Warning  Unhealthy  7s  kubelet  Readiness probe failed: 2023-07-18 08:18:49.249 [INFO][486] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.5.17,192.168.5.19

More generally, none of the pod IPs on the master node could be pinged.
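To pick the unhealthy pod out of a long listing like the one above, a small filter can help. This is a minimal sketch: `not_ready_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` table layout, where the second column is READY in the form `ready/total`.

```shell
# Hypothetical helper: print the names of pods whose READY column (e.g. 0/1)
# shows fewer ready containers than the pod has in total.
# Reads `kubectl get pods` table output on stdin.
not_ready_pods() {
  awk 'NR > 1 && $2 ~ /^[0-9]+\/[0-9]+$/ {
         split($2, r, "/")
         if (r[1] + 0 < r[2] + 0) print $1
       }'
}

# Usage against a live cluster (namespace taken from the output above):
#   kubectl -n calico-system get pods -o wide | not_ready_pods
```

With the listing above, this would print only calico-node-lxg9k, the pod on xy-5-server14.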

Finding the problem

Install the Calico client (calicoctl); for reference, see www.cnblogs.com/varden/p/15…
On the master:

[root@xy-5-server14 ~]# calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 192.168.5.17 | node-to-node mesh | up    | 08:36:51 | Established |
| 192.168.5.19 | node-to-node mesh | up    | 08:37:15 | Established |
+--------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

The peerings look normal from here. On node1:

[root@xy-5-server17 ~]# calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 192.168.5.19 | node-to-node mesh | up    | 2023-03-07 | Established |
| 10.4.0.1     | node-to-node mesh | start | 2023-07-17 | Connect     |
+--------------+-------------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.

There is the problem: the master's address should be 192.168.5.14, but the peer is registered as 10.4.0.1 instead. node2 shows the same issue:

[root@xy-5-server19 ~]# calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+------------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+--------------+-------------------+-------+------------+-------------+
| 192.168.5.17 | node-to-node mesh | up    | 08:18:24   | Established |
| 10.4.0.1     | node-to-node mesh | start | 2023-07-17 | Connect     |
+--------------+-------------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.
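The check done by eye above can be sketched as a filter over the `calicoctl node status` table. `unestablished_peers` is a hypothetical helper name, and the parsing assumes the pipe-delimited table layout shown in the outputs above.

```shell
# Hypothetical helper: read `calicoctl node status` output on stdin and print
# every peer whose INFO column is not "Established", as "<peer> <info>".
unestablished_peers() {
  awk -F'|' '
    # Data rows look like: | 10.4.0.1 | node-to-node mesh | start | ... | Connect |
    NF >= 6 {
      peer = $2; info = $6
      gsub(/^[ \t]+|[ \t]+$/, "", peer)
      gsub(/^[ \t]+|[ \t]+$/, "", info)
      # Skip the header row and report anything that is not Established.
      if (peer != "PEER ADDRESS" && info != "Established") print peer, info
    }'
}

# Usage on a node:
#   calicoctl node status | unestablished_peers
```

An unexpected peer address in the result, such as 10.4.0.1 here, suggests that a node registered the wrong IP. Running `calicoctl get nodes -o wide` should show the IPv4 address Calico recorded for each node.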

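I found posts online describing the same problem: www.jianshu.com/p/4b175e733… cloud.tencent.com/developer/a… They say the network interface must be specified explicitly. However, my Calico was installed by the operator, so editing the calico-node DaemonSet directly has no effect: the operator reverts the change. That differs from what those posts describe.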

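Solution

I found the relevant setting in the official Calico documentation: docs.tigera.io/calico/late…

[Screenshot: 1111.png]

Then look up the Installation resource in the cluster and edit it: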

[root@xy-5-server17 ~]# kubectl get Installation
NAME      AGE
default   155d
[root@xy-5-server17 ~]# kubectl edit Installation default
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  creationTimestamp: "2023-02-13T09:18:18Z"
  finalizers:
  - tigera.io/operator-cleanup
  generation: 3
  name: default
  resourceVersion: "151883088"
  uid: 580c6998-4b1e-4616-8c0b-7a3fc4adf553
spec:
  calicoNetwork:
    bgp: Enabled
    hostPorts: Enabled
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      disableBGPExport: false
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
    linuxDataplane: Iptables
    multiInterfaceMode: None
    nodeAddressAutodetectionV4:
      interface: ens4f1
  cni:
    ipam:
      type: Calico
    type: Calico
  controlPlaneReplicas: 2
  flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  kubeletVolumePluginPath: /var/lib/kubelet
  nodeUpdateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  nonPrivileged: Disabled
  variant: Calico
status:
  computed:
    calicoNetwork:
      bgp: Enabled
      hostPorts: Enabled
      ipPools:
      - blockSize: 26
        cidr: 10.244.0.0/16
        disableBGPExport: false

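The section to change is:

    nodeAddressAutodetectionV4:
      interface: ens4f1

Set it as the documentation describes, using the network interface your nodes actually communicate over (ens4f1 in my case).
After that, the calico-node pod on the master node ran normally, the DNS pod IPs could be pinged, and the DNS service recovered. Problem solved.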
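The same change can probably be applied without an interactive edit. The sketch below builds a JSON merge patch for the Installation resource. `autodetect_patch` is a hypothetical helper name, and setting `firstFound` to null is an assumption: it removes that key in case your spec still carries it, since the operator may reject a spec that names more than one autodetection method.

```shell
# Hypothetical helper: build a JSON merge patch that selects the interface
# Calico should use for IPv4 address autodetection.
# "firstFound": null removes that key if it is set (assumption: your current
# spec may still carry it).
autodetect_patch() {
  printf '{"spec":{"calicoNetwork":{"nodeAddressAutodetectionV4":{"firstFound":null,"interface":"%s"}}}}' "$1"
}

# Usage (ens4f1 is the interface from this article; substitute your own):
#   kubectl patch installation default --type=merge -p "$(autodetect_patch ens4f1)"
#   kubectl -n calico-system rollout status daemonset/calico-node
```

Because the operator owns the calico-node DaemonSet, patching the Installation is the change that persists. The rollout command only waits for the restarted pods to become ready.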

Fixing unreachable pod IPs with Calico on Rocky 9.2

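After switching the operating system to Rocky 9.2, pod IPs that used to respond to ping became unreachable.

[Screenshot: 2023-09-13 09-59-34.png]

I captured packets on the node hosting the pod:

[Screenshot: 2023-09-13 10-01-27.png]

The capture shows the ping requests arriving, but apparently no replies going back. The node itself can reach the pod IPs that live on it. From another machine, a route to the pod via the vxlan.calico device does exist: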

[root@sc-master-1 milvus]# ip route get  10.244.116.190
10.244.116.190 via 10.244.116.128 dev vxlan.calico src 10.244.181.192 uid 0 
    cache 

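While reading the official Calico documentation, I found this:

[Screenshot: 2023-09-13 09-53-00.png]

In short, when the network is managed by NetworkManager, NetworkManager may interfere with Calico's routing. So, following the documentation, I added this configuration file: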

[root@sc-master-1 milvus]# cat /etc/NetworkManager/conf.d/calico.conf 
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico;interface-name:vxlan-v6.calico;interface-name:wireguard.cali;interface-name:wg-v6.cali
[root@sc-master-1 milvus]# 

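To make the configuration take effect I had to reboot the machine (so far I have not found a way to apply it without a reboot). After that, pinging the pod IPs worked again.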
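The drop-in file shown above can be written by a small script, which is handy when several nodes need it. This is a sketch under assumptions: `write_calico_nm_conf` is a hypothetical helper name, and the reload step in the usage comment is an untested alternative to rebooting that may not be sufficient on its own.

```shell
# Hypothetical helper: write the Calico drop-in into a NetworkManager conf.d
# directory (defaults to the path used above).
write_calico_nm_conf() {
  dir="${1:-/etc/NetworkManager/conf.d}"
  cat > "$dir/calico.conf" <<'EOF'
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico;interface-name:vxlan-v6.calico;interface-name:wireguard.cali;interface-name:wg-v6.cali
EOF
}

# Usage (as root). The reload is an untested alternative to rebooting:
#   write_calico_nm_conf
#   systemctl reload NetworkManager
#   nmcli device status    # Calico interfaces should be listed as "unmanaged"
```

If the interfaces still show as managed after the reload, rebooting as described above remains the fallback.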

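[Screenshot: 2023-09-13 09-57-32.png]

The documentation has some further notes:

[Screenshot: 2023-09-13 10-06-59.png]

I have not verified these, though; maybe another time.

If the network is still not working after the steps above, clean up the Kubernetes node and rejoin it to the cluster. On the node, run (xxxx is the node name): echo y |kubeadm reset && rm -rf /etc/cni/net.d && ipvsadm --clear && kubectl delete node xxxx
Then run: kubeadm join 172.70.21.9:16443 --token y9t1ep.45vr8r23go9qde4a --discovery-token-ca-cert-hash sha256:ace0c89c5c1d184c12c8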