Cilium in default VXLAN mode: BPF state persisted under /sys/fs/bpf


Cilium agent mounts the BPF filesystem



root@cili-control:~# k get po -n kube-system   cilium-f2rh4  -o yaml | grep "/sys/fs/bpf"
    - mountPath: /sys/fs/bpf
    - mount | grep "/sys/fs/bpf type bpf" || mount -t bpf bpf /sys/fs/bpf
    - mountPath: /sys/fs/bpf
    - mountPath: /sys/fs/bpf
      path: /sys/fs/bpf
    - mountPath: /sys/fs/bpf
    - mountPath: /sys/fs/bpf
    - mountPath: /sys/fs/bpf
root@cili-control:~# k get po -n kube-system   cilium-f2rh4  -o yaml | grep "bpf"
    - mountPath: /sys/fs/bpf
      name: bpf-maps
    - mount | grep "/sys/fs/bpf type bpf" || mount -t bpf bpf /sys/fs/bpf
    name: mount-bpf-fs
    - mountPath: /sys/fs/bpf
      name: bpf-maps
          key: clean-cilium-bpf-state
    - mountPath: /sys/fs/bpf
      name: bpf-maps
      path: /sys/fs/bpf
    name: bpf-maps
    - mountPath: /sys/fs/bpf
      name: bpf-maps
    name: mount-bpf-fs
    - mountPath: /sys/fs/bpf
      name: bpf-maps
    - mountPath: /sys/fs/bpf
      name: bpf-maps
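
The mount-bpf-fs init container above runs `mount | grep "/sys/fs/bpf type bpf" || mount -t bpf bpf /sys/fs/bpf`, that is, it mounts the BPF filesystem only if it is not already mounted. The same check can be sketched in Python against text in the /proc/mounts layout. The function name `bpffs_mounted` is made up for this illustration and is not part of Cilium.

```python
def bpffs_mounted(mounts_text: str, target: str = "/sys/fs/bpf") -> bool:
    """Return True if `target` is mounted with filesystem type "bpf".

    `mounts_text` is expected in the /proc/mounts layout:
    device  mountpoint  fstype  options  dump  pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields[1] is the mount point, fields[2] the filesystem type.
        if len(fields) >= 3 and fields[1] == target and fields[2] == "bpf":
            return True
    return False
```

On a node, `bpffs_mounted(open("/proc/mounts").read())` would report whether the init container still has anything to do.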

############### Only the tc directory is used here
root@cili-control:~# tree /sys/fs/bpf/
/sys/fs/bpf/
└── tc
    └── globals
        ├── cilium_auth_map
        ├── cilium_call_policy
        ├── cilium_calls_00360
        ├── cilium_calls_hostns_02993
        ├── cilium_calls_netdev_00006
        ├── cilium_calls_overlay_2
        ├── cilium_ct4_global
        ├── cilium_ct_any4_global
        ├── cilium_encrypt_state
        ├── cilium_events
        ├── cilium_ipcache
        ├── cilium_ipv4_frag_datagrams
        ├── cilium_l2_responder_v4
        ├── cilium_lb4_backends_v3
        ├── cilium_lb4_reverse_nat
        ├── cilium_lb4_services_v2
        ├── cilium_lxc
        ├── cilium_metrics
        ├── cilium_node_map
        ├── cilium_policy_00360
        ├── cilium_policy_02993
        ├── cilium_ratelimit
        ├── cilium_runtime_config
        ├── cilium_signals
        └── cilium_tunnel_map

2 directories, 25 files
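
Some of the maps in this listing end in a zero-padded number (cilium_policy_00360, cilium_calls_hostns_02993). A small sketch for grouping them by that number follows. Reading the number as an endpoint ID is an assumption; it is consistent with the second listing later in this post, where the directories under cilium/endpoints carry the same numbers. The helper name `endpoint_maps` is hypothetical.

```python
import re
from collections import defaultdict

# Matches per-endpoint policy and tail-call maps such as
# cilium_policy_00360, cilium_policy_v2_00125, cilium_calls_hostns_02993.
# Deliberately skips cilium_calls_netdev_* and cilium_calls_overlay_*,
# whose suffix is not assumed to be an endpoint ID here.
_PER_ENDPOINT = re.compile(
    r"^cilium_(?:policy(?:_v\d+)?|calls(?:_hostns)?)_(\d{5})$"
)


def endpoint_maps(names):
    """Group pinned map names by their (assumed) endpoint ID."""
    grouped = defaultdict(list)
    for name in names:
        match = _PER_ENDPOINT.match(name)
        if match:
            grouped[int(match.group(1))].append(name)
    return dict(grouped)
```

Fed the 25 names above, this would yield two groups, 360 and 2993, each with one calls map and one policy map.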

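Cilium uses tc (Traffic Control, the Linux traffic-control subsystem) mainly as a place to attach BPF programs inside the kernel network stack, which gives it fine-grained control over container traffic. The maps those programs use are pinned under /sys/fs/bpf/tc/globals, so they outlive the agent process for as long as the BPF filesystem stays mounted. The maps in the listing above show what the tc programs do: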

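1. Attaching BPF programs at the network-device level

tc is the kernel's traffic-control framework. It allows BPF programs to be attached to the ingress (receive) or egress (transmit) path of a network interface, such as a container's veth device or a physical NIC on the host. Cilium uses this to attach BPF programs at the key points of the container network: the host side of each Pod's veth pair (the lxc* devices), cilium_host and cilium_net, the cilium_vxlan overlay device, and the host's physical NICs. Cilium does not use a Linux bridge. Traffic is therefore processed as it enters or leaves a container.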

2. Core functions

Judging by the maps listed above (cilium_policy_*, cilium_ct4_global, and so on), the BPF programs attached through tc are mainly responsible for:

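  • **Network policy enforcement.** The per-endpoint cilium_policy_* maps hold the policy rules. As traffic passes, the tc BPF program looks up the peer's security identity (resolved from its IP address through cilium_ipcache) together with the port and protocol, and allows or drops the packet. L3/L4 policy is enforced directly in BPF; for L7 policy the BPF program redirects the traffic to Cilium's proxy.
  • **Connection tracking.** cilium_ct4_global (IPv4 connection tracking) and related maps record connection state (for example new, established, closing). The tc BPF programs rely on this state for stateful policy, such as allowing reply packets and rejecting invalid connections.
  • **Load balancing.** cilium_lb4_services_v2 (service frontends) and cilium_lb4_backends_v3 (backends) hold the load-balancing rules for Kubernetes Services. The BPF programs consult them to choose a backend Pod for each request.
  • **Address translation (NAT).** cilium_lb4_reverse_nat holds the reverse-NAT entries for load-balanced connections: on the reply path the BPF program rewrites the backend's source address back to the Service address the client originally contacted. Masquerading Pod IPs behind the node IP for external access is a separate function and does not use this map.
  • **Tunnel encapsulation / decapsulation (e.g. VXLAN).** cilium_tunnel_map holds the inter-node tunnel information. The tc BPF programs may take part in adding the VXLAN header to cross-node container traffic, or in removing it on the receiving side.
  • **Traffic monitoring and metrics.** cilium_metrics and related maps collect traffic statistics such as forwarded and dropped packet and byte counts. The BPF programs update them as packets are processed, and Cilium uses them for monitoring and alerting.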
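
How the ipcache, policy, and connection-tracking lookups fit together can be sketched with a toy model. This is not Cilium's datapath code: the class `ToyDatapath`, the identity number 1001, and the destination address 10.222.2.10 are all invented for the illustration, and the real maps have richer keys and values. It shows only the order of decisions: an established connection is allowed statefully, while a new one is allowed only if the peer's identity, port, and protocol match a policy entry.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

FlowKey = Tuple[str, int, str, int, str]   # src_ip, src_port, dst_ip, dst_port, proto
PolicyKey = Tuple[int, int, str]           # peer identity, dst_port, proto


@dataclass
class ToyDatapath:
    ipcache: Dict[str, int]                # IP -> security identity (toy ipcache)
    policy: Set[PolicyKey]                 # allowed (identity, port, proto) tuples
    conntrack: Set[FlowKey] = field(default_factory=set)

    def process(self, src_ip, src_port, dst_ip, dst_port, proto) -> str:
        flow = (src_ip, src_port, dst_ip, dst_port, proto)
        reverse = (dst_ip, dst_port, src_ip, src_port, proto)
        # 1. Known connection in either direction: stateful allow.
        if flow in self.conntrack or reverse in self.conntrack:
            return "ALLOW (established)"
        # 2. New connection: resolve the peer identity, then consult policy.
        identity = self.ipcache.get(src_ip, 0)   # 0 stands for "unknown" in this toy
        if (identity, dst_port, proto) in self.policy:
            self.conntrack.add(flow)
            return "ALLOW (new)"
        # 3. No conntrack entry and no matching policy entry.
        return "DROP"
```

With `ToyDatapath(ipcache={"10.222.1.33": 1001}, policy={(1001, 53, "UDP")})`, a first packet from 10.222.1.33 to port 53 creates a conntrack entry, and the reply in the opposite direction is then allowed without a second policy lookup.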

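3. Why tc rather than another mechanism?

Cilium chose tc because its hooks run early in the network stack: on ingress, right after the driver hands the packet to the kernel and before routing and netfilter. Compared with traditional iptables, which runs at the netfilter hooks higher up the stack, this means lower latency and higher throughput. (XDP runs even earlier, in the driver itself, but only on ingress.) tc also lets BPF programs be attached and replaced at runtime, which Cilium needs for policy changes to take effect immediately.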

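In short, tc is the attachment point for Cilium's BPF programs. By inserting BPF logic into the traffic path of network devices, Cilium implements policy enforcement, load balancing, connection tracking, and its other core functions, which makes tc a key part of its high-performance datapath.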

Environment



root@cili-control:~# k get node -A -o wide
NAME           STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
cili-control   Ready    control-plane   7m15s   v1.33.1   11.0.1.200    <none>        Ubuntu 22.04.5 LTS   5.15.0-160-generic   containerd://1.7.13
cili-work1     Ready    worker          7m5s    v1.33.1   11.0.1.201    <none>        Ubuntu 22.04.5 LTS   5.15.0-160-generic   containerd://1.7.13
cili-work2     Ready    worker          7m5s    v1.33.1   11.0.1.202    <none>        Ubuntu 22.04.5 LTS   5.15.0-160-generic   containerd://1.7.13
root@cili-control:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0c:29:c6:51:a4 brd ff:ff:ff:ff:ff:ff
    altname enp3s0
    inet 11.0.1.200/24 brd 11.0.1.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fec6:51a4/64 scope link 
       valid_lft forever preferred_lft forever
3: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0c:29:c6:51:ae brd ff:ff:ff:ff:ff:ff
    altname enp11s0
    inet 11.0.2.200/24 brd 11.0.2.255 scope global ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fec6:51ae/64 scope link 
       valid_lft forever preferred_lft forever
4: nodelocaldns: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether 56:3c:49:e5:6c:d2 brd ff:ff:ff:ff:ff:ff
    inet 169.254.25.10/32 scope global nodelocaldns
       valid_lft forever preferred_lft forever
5: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether ca:7c:9f:83:93:25 brd ff:ff:ff:ff:ff:ff
    inet 10.233.0.1/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.233.0.3/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.233.250.103/32 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
6: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:73:c8:76:ae:b4 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::f873:c8ff:fe76:aeb4/64 scope link 
       valid_lft forever preferred_lft forever
7: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 5a:21:54:a0:8d:00 brd ff:ff:ff:ff:ff:ff
    inet 10.222.2.61/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::5821:54ff:fea0:8d00/64 scope link 
       valid_lft forever preferred_lft forever
8: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 1e:61:6a:27:d7:f9 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1c61:6aff:fe27:d7f9/64 scope link 
       valid_lft forever preferred_lft forever
10: lxc_health@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8a:a0:f5:26:fd:fe brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::88a0:f5ff:fe26:fdfe/64 scope link 
       valid_lft forever preferred_lft forever
root@cili-control:~# 



root@cili-control:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         11.0.1.2        0.0.0.0         UG    0      0        0 ens160
10.222.0.0      10.222.2.61     255.255.255.0   UG    0      0        0 cilium_host
10.222.1.0      10.222.2.61     255.255.255.0   UG    0      0        0 cilium_host
10.222.2.0      10.222.2.61     255.255.255.0   UG    0      0        0 cilium_host
10.222.2.61     0.0.0.0         255.255.255.255 UH    0      0        0 cilium_host
11.0.1.0        0.0.0.0         255.255.255.0   U     0      0        0 ens160
11.0.2.0        0.0.0.0         255.255.255.0   U     0      0        0 ens192
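
To see which entry of this table a Pod address would hit, the kernel's longest-prefix match can be approximated in a few lines. The routes are copied from the `route -n` output above. The function name `lookup` is made up, metrics are ignored (they are all 0 here), and this is a sketch of the selection rule only, not of the kernel's FIB.

```python
import ipaddress

# (destination/prefix, gateway, interface), copied from the route table above.
ROUTES = [
    ("0.0.0.0/0", "11.0.1.2", "ens160"),
    ("10.222.0.0/24", "10.222.2.61", "cilium_host"),
    ("10.222.1.0/24", "10.222.2.61", "cilium_host"),
    ("10.222.2.0/24", "10.222.2.61", "cilium_host"),
    ("10.222.2.61/32", "0.0.0.0", "cilium_host"),
    ("11.0.1.0/24", "0.0.0.0", "ens160"),
    ("11.0.2.0/24", "0.0.0.0", "ens192"),
]


def lookup(dst: str):
    """Return (prefix, gateway, interface) of the most specific matching route."""
    addr = ipaddress.ip_address(dst)
    candidates = [
        (ipaddress.ip_network(net), gw, dev)
        for net, gw, dev in ROUTES
        if addr in ipaddress.ip_network(net)
    ]
    # The default route always matches, so candidates is never empty.
    net, gw, dev = max(candidates, key=lambda route: route[0].prefixlen)
    return str(net), gw, dev
```

For 10.222.1.33, one of the CoreDNS Pod IPs in this environment, the match is 10.222.1.0/24 via cilium_host, so traffic to a Pod on another node is handed to cilium_host first, which is presumably where Cilium's BPF programs take over and tunnel it.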

root@cili-control:~# kgp | grep coredns
kube-system   coredns-5847cf56c7-8mx6p               1/1     Running   0          3m23s   10.222.1.33    cili-work1     <none>           <none>
kube-system   coredns-5847cf56c7-hqk4f               1/1     Running   0          3m23s   10.222.1.158   cili-work1     <none>           <none>
root@cili-control:~# kgp | grep cilium
kube-system   cilium-2hnpl                           1/1     Running   0          3m53s   11.0.1.202     cili-work2     <none>           <none>
kube-system   cilium-f2rh4                           1/1     Running   0          3m53s   11.0.1.200     cili-control   <none>           <none>
kube-system   cilium-operator-7c6b45754-lp9nd        1/1     Running   0          3m53s   11.0.1.201     cili-work1     <none>           <none>
kube-system   cilium-st9hs                           1/1     Running   0          3m53s   11.0.1.201     cili-work1     <none>           <none>

3. Calico also uses BPF

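If a cluster is switched from Calico to Cilium, will the two conflict in how they use the bpf directory (for example by overwriting the same directory)?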

3.2 After switching directly from Calico to Cilium


root@k8s-work2:~/test# tree /sys/fs/bpf/
/sys/fs/bpf/
├── calico
│   ├── calico_failsafe_ports_v1
│   └── xdp
├── cilium
│   ├── devices
│   │   ├── cilium_geneve
│   │   │   └── links
│   │   ├── cilium_host
│   │   │   └── links
│   │   ├── cilium_net
│   │   │   └── links
│   │   ├── ens160
│   │   │   └── links
│   │   ├── ens256
│   │   │   └── links
│   │   └── tunl0
│   │       └── links
│   ├── endpoints
│   │   ├── 125
│   │   │   └── links
│   │   ├── 1642
│   │   │   └── links
│   │   ├── 1780
│   │   │   └── links
│   │   └── 623
│   │       └── links
│   └── socketlb
│       └── links
│           └── cgroup
│               ├── cil_sock4_connect
│               ├── cil_sock4_getpeername
│               ├── cil_sock4_post_bind
│               ├── cil_sock4_recvmsg
│               ├── cil_sock4_sendmsg
│               ├── cil_sock6_connect
│               ├── cil_sock6_getpeername
│               ├── cil_sock6_post_bind
│               ├── cil_sock6_recvmsg
│               ├── cil_sock6_sendmsg
│               └── cil_sock_release
└── tc
    └── globals
        ├── cilium_auth_map
        ├── cilium_call_policy
        ├── cilium_calls_00125
        ├── cilium_calls_00623
        ├── cilium_calls_01642
        ├── cilium_calls_01780
        ├── cilium_calls_hostns_00713
        ├── cilium_calls_netdev_00002
        ├── cilium_calls_netdev_00003
        ├── cilium_calls_netdev_00007
        ├── cilium_calls_netdev_00025
        ├── cilium_calls_overlay_2
        ├── cilium_ct4_global
        ├── cilium_ct_any4_global
        ├── cilium_egresscall_policy
        ├── cilium_events
        ├── cilium_ipcache_v2
        ├── cilium_ipv4_frag_datagrams
        ├── cilium_l2_responder_v4
        ├── cilium_lb4_affinity
        ├── cilium_lb4_backends_v3
        ├── cilium_lb4_maglev
        ├── cilium_lb4_reverse_nat
        ├── cilium_lb4_reverse_sk
        ├── cilium_lb4_services_v2
        ├── cilium_lb4_source_range
        ├── cilium_lb_affinity_match
        ├── cilium_lxc
        ├── cilium_metrics
        ├── cilium_node_map_v2
        ├── cilium_nodeport_neigh4
        ├── cilium_policystats
        ├── cilium_policy_v2_00125
        ├── cilium_policy_v2_00623
        ├── cilium_policy_v2_00713
        ├── cilium_policy_v2_01642
        ├── cilium_policy_v2_01780
        ├── cilium_ratelimit
        ├── cilium_ratelimit_metrics
        ├── cilium_runtime_config
        ├── cilium_signals
        ├── cilium_skip_lb4
        ├── cilium_snat_v4_alloc_retries
        └── cilium_snat_v4_external

30 directories, 56 files
root@k8s-work2:~/test#
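
In this listing, Calico's leftovers sit under /sys/fs/bpf/calico, while Cilium pins under /sys/fs/bpf/cilium and, with a cilium_ name prefix, under /sys/fs/bpf/tc/globals, so nothing shown here appears to land in the same directory. A quick way to check a listing for shared directories is sketched below. The owner is guessed from name prefixes only, which is an assumption; the `cali_` prefix and the helper names `owner` and `shared_directories` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def owner(path: str) -> str:
    """Guess which CNI owns a pinned path from its name prefixes (heuristic)."""
    for part in PurePosixPath(path).parts:
        if part.startswith(("calico", "cali_")):
            return "calico"
        if part.startswith(("cilium", "cil_")):
            return "cilium"
    return "unknown"


def shared_directories(paths):
    """Return the directories that hold pinned entries from more than one owner."""
    by_dir = defaultdict(set)
    for path in paths:
        by_dir[str(PurePosixPath(path).parent)].add(owner(path))
    return {d: owners for d, owners in by_dir.items() if len(owners) > 1}
```

An empty result means no directory holds entries from both projects. A shared directory would still not be a conflict by itself unless two entries also had the same name.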