Problem background
The kube-ovn deployment runs with the OVN LB disabled, so pods in ovn-default reach the coredns pods via IPVS on the node. Currently every pod in ovn-default resolves DNS without problems.
DNS resolution from node2 always works, but resolving directly on node1 only succeeds the first time; every subsequent packet fails. node3 has the same problem.
ovn-default: 10.222.0.0/18
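The node-side repro is a single query against the coredns svc IP (a sketch; the svc IP 10.233.0.10 is taken from the captures below, the query domain is an assumption for illustration):
# dig @10.233.0.10 kubernetes.default.svc.cluster.local +short
Running this repeatedly on node1 or node3, only the first query returns and the following ones time out; on node2 every query returns.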
The only difference between the two scenarios is the source IP.
When an ovn-default pod resolves DNS it uses its own pod IP; when the node issues the query directly, the source IP is actually the svc IP, presumably because the svc IPs are bound on the IPVS dummy interface and the kernel picks one for locally originated traffic. This can be confirmed with ping.
In the capture command below, the negated src net clauses filter out unrelated subnets such as ovn-default.
]# tcpdump -netvv -i any 'not src net 10.222.0.0/18 and not src net 10.251.0.0/16 and dst 10.233.0.10'
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 64, id 50255, offset 0, flags [DF], proto ICMP (1), length 84)
10.233.0.10 > 10.233.0.10: ICMP echo request, id 34632, seq 1, length 64
In 00:00:00:00:00:00 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 64, id 50256, offset 0, flags [none], proto ICMP (1), length 84)
10.233.0.10 > 10.233.0.10: ICMP echo reply, id 34632, seq 1, length 64
In this situation, the current suspicion is that the packet is dropped before iptables PREROUTING, which is why it cannot be captured. This can be confirmed later with iptables debug tracing.
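A minimal trace setup, assuming nft-based iptables 1.8+ where xtables-monitor prints the traced hops (with legacy iptables the TRACE output lands in the kernel log instead; port 53 is assumed for the DNS traffic):
# iptables -t raw -I PREROUTING -d 10.233.0.10 -p udp --dport 53 -j TRACE
# iptables -t raw -I OUTPUT -d 10.233.0.10 -p udp --dport 53 -j TRACE
# xtables-monitor --trace
If the packets are dropped before PREROUTING they will never appear in the trace, while the OUTPUT-side trace shows whether the node-originated query at least enters netfilter. Delete the two TRACE rules (iptables -t raw -D ...) when done.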
The best fix is to use localdns; the second best is to mimic how ovn-default pods access the service and add the svc CIDR to the ipset.
First, add the svc CIDR manually to see whether that solves the problem.
[root@csy-wx-pm-os01-eis-node01 deployer]# k get po -n kube-system kube-ovn-controller-6d5d4bd7cf-4qvjq -o yaml | grep -i range
- --service-cluster-ip-range=10.233.0.0/18,fd11:1111:1112:15::/108
[root@csy-wx-pm-os01-eis-node01 deployer]# k get po -n kube-system kube-ovn-cni-xhs2h -o yaml | grep -i range
- --service-cluster-ip-range=10.233.0.0/18,fd11:1111:1112:15::/108
[root@csy-wx-pm-os01-eis-node01 deployer]#
Only one place in the code manipulates these ipsets.
The two ipsets are:
Name: ovn40services
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 440
References: 7
Number of entries: 1
Members:
10.233.0.0/18
Name: ovn60services
Type: hash:net
Revision: 6
Header: family inet6 hashsize 1024 maxelem 1048576
Size in memory: 1272
References: 7
Number of entries: 1
Members:
fd11:1111:1112:15::/108
# These two ipsets are used for rule matching in the POSTROUTING stage
[root@csy-wx-pm-os01-eis-node01 deployer]# iptables-save | grep "ovn40services"
-A OVN-POSTROUTING -m set --match-set ovn40services src -m set --match-set ovn40subnets dst -m mark --mark 0x4000/0x4000 -j SNAT --to-source 10.251.137.30 --random-fully
-A INPUT -p tcp -m mark ! --mark 0x4000/0x4000 -m set --match-set ovn40services dst -m conntrack --ctstate NEW -j REJECT --reject-with icmp-port-unreachable
-A INPUT -m set --match-set ovn40services dst -j ACCEPT
-A INPUT -m set --match-set ovn40services src -j ACCEPT
-A FORWARD -m set --match-set ovn40services dst -j ACCEPT
-A FORWARD -m set --match-set ovn40services src -j ACCEPT
-A OUTPUT -p tcp -m mark ! --mark 0x4000/0x4000 -m set --match-set ovn40services dst -m conntrack --ctstate NEW -j REJECT --reject-with icmp-port-unreachable
[root@csy-wx-pm-os01-eis-node01 deployer]#
[root@csy-wx-pm-os01-eis-node01 deployer]# iptables-save | grep "ovn60services"
[root@csy-wx-pm-os01-eis-node01 deployer]#
Manually add the svc addresses to the ipset referenced by OVN-PREROUTING so that the traffic can be matched.
# iptables-save -t nat | grep -i OVN-PREROUTING
:OVN-PREROUTING - [0:0]
-A PREROUTING -m comment --comment "kube-ovn prerouting rules" -j OVN-PREROUTING
-A OVN-PREROUTING -i ovn0 -m set --match-set ovn40subnets src -m set --match-set KUBE-CLUSTER-IP dst,dst -j MARK --set-xmark 0x4000/0x4000
-A OVN-PREROUTING -p tcp -m addrtype --dst-type LOCAL -m set --match-set KUBE-NODE-PORT-LOCAL-TCP dst -j MARK --set-xmark 0x80000/0x80000
-A OVN-PREROUTING -p tcp -m set --match-set ovn40other-node src -m set --match-set KUBE-NODE-PORT-LOCAL-TCP dst -j MARK --set-xmark 0x4000/0x4000
-A OVN-PREROUTING -p udp -m addrtype --dst-type LOCAL -m set --match-set KUBE-NODE-PORT-LOCAL-UDP dst -j MARK --set-xmark 0x80000/0x80000
-A OVN-PREROUTING -p udp -m set --match-set ovn40other-node src -m set --match-set KUBE-NODE-PORT-LOCAL-UDP dst -j MARK --set-xmark 0x4000/0x4000
# This rule matches -i ovn0, i.e. packets coming in from ovn0. The assumption above was wrong: svc requests originating on the node definitely do not come in via ovn0; at most they go out via the IPVS dummy interface.
-A OVN-PREROUTING -i ovn0 -m set --match-set ovn40subnets src -m set --match-set KUBE-CLUSTER-IP dst,dst -j MARK --set-xmark 0x4000/0x4000
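Given the rule above, the manual test boils down to adding the svc CIDR to the set matched as src, plus a rule variant without the interface match (a sketch; set and chain names are taken from the output above, and this is a test-only change):
# ipset add ovn40subnets 10.233.0.0/18
# ipset test ovn40subnets 10.233.0.10
# iptables -t nat -I OVN-PREROUTING -m set --match-set ovn40subnets src -m set --match-set KUBE-CLUSTER-IP dst,dst -j MARK --set-xmark 0x4000/0x4000
The extra rule is needed because, as noted above, node-originated traffic never arrives via ovn0, so the existing -i ovn0 rule can never match it.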
Comparing the iptables rules on node1 and node2, apart from the docker rules they are considered identical.
After clearing the docker rules from the nat table, access indeed started working.
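One way to make the node-to-node comparison repeatable (a sketch; counters are zeroed so the diff only shows rule differences):
# iptables-save -t nat | sed 's/\[[0-9]*:[0-9]*\]/[0:0]/' | sort > /tmp/nat-$(hostname).rules
# diff /tmp/nat-node1.rules /tmp/nat-node2.rules
The docker cleanup used for the test amounts to flushing the nat-table DOCKER chain, e.g.:
# iptables -t nat -F DOCKER
(Test only: docker re-creates these rules on restart, and published container ports stop working while they are gone.)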
Note 1: after flushing iptables on a node, kube-ovn indeed does not restore everything; part of the rules are lost.
For example, on the node on the left (screenshot), after iptables was reset there is no OVN-PREROUTING and no OVN-POSTROUTING.
After restarting kube-ovn-cni, OVN-PREROUTING comes back, but OVN-POSTROUTING in the mangle table is still missing.
Normally, as in the screenshot on the right, OVN-POSTROUTING exists in the mangle table and the reply packets end up carrying the node IP.
Otherwise the reply packets end up carrying the ovn0 IP.
Because the mangle table lacks this chain, svc access from the node is affected by the docker chain; when the chain is present, it is not affected by the docker OUTPUT chain.
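A quick way to check a node for the missing hook, and to trigger the partial recovery described above (a sketch; the DaemonSet name matches the pod seen earlier):
# iptables -t mangle -S POSTROUTING | grep OVN-POSTROUTING || echo 'mangle OVN-POSTROUTING jump missing'
# kubectl -n kube-system rollout restart ds kube-ovn-cni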
Attempting to fix the bug
root@kube-ovn-worker:/kube-ovn# iptables -t mangle -nvL --line-numbers
Chain PREROUTING (policy ACCEPT 5801 packets, 2251K bytes)
num pkts bytes target prot opt in out source destination
Chain INPUT (policy ACCEPT 5627 packets, 2241K bytes)
num pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 166 packets, 8648 bytes)
num pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 5222 packets, 539K bytes)
num pkts bytes target prot opt in out source destination
Chain POSTROUTING (policy ACCEPT 5262 packets, 535K bytes)
num pkts bytes target prot opt in out source destination
1 5272 535K OVN-POSTROUTING all -- * * 0.0.0.0/0 0.0.0.0/0 /* kube-ovn postrouting rules */
2 5 300 TCPMSS tcp -- * eth0 0.0.0.0/0 0.0.0.0/0 tcp flags:0x06/0x02 TCPMSS set 1360
Chain KUBE-IPTABLES-HINT (0 references)
num pkts bytes target prot opt in out source destination
Chain KUBE-KUBELET-CANARY (0 references)
num pkts bytes target prot opt in out source destination
Chain OVN-OUTPUT (0 references)
num pkts bytes target prot opt in out source destination
Chain OVN-POSTROUTING (1 references)
num pkts bytes target prot opt in out source destination
1 0 0 DROP tcp -- * * 0.0.0.0/0 0.0.0.0/0 match-set ovn40subnets src tcp flags:0x04/0x04 state INVALID
Chain OVN-PREROUTING (0 references)
num pkts bytes target prot opt in out source destination
# iptables-save
*mangle
:PREROUTING ACCEPT [6964:2601772]
:INPUT ACCEPT [6754:2589644]
:FORWARD ACCEPT [202:10520]
:OUTPUT ACCEPT [6244:638387]
:POSTROUTING ACCEPT [6318:635793]
:KUBE-IPTABLES-HINT - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:OVN-OUTPUT - [0:0]
:OVN-POSTROUTING - [0:0]
:OVN-PREROUTING - [0:0]
-A POSTROUTING -m comment --comment "kube-ovn postrouting rules" -j OVN-POSTROUTING
-A POSTROUTING -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
-A OVN-POSTROUTING -p tcp -m set --match-set ovn40subnets src -m tcp --tcp-flags RST RST -m state --state INVALID -j DROP
COMMIT
Only the following rules are managed by kube-ovn-cni:
:OVN-OUTPUT - [0:0]
:OVN-POSTROUTING - [0:0]
:OVN-PREROUTING - [0:0]
-A POSTROUTING -m comment --comment "kube-ovn postrouting rules" -j OVN-POSTROUTING
-A POSTROUTING -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
-A OVN-POSTROUTING -p tcp -m set --match-set ovn40subnets src -m tcp --tcp-flags RST RST -m state --state INVALID -j DROP
COMMIT
Flush the mangle table:
# iptables -t mangle -F; iptables -t mangle -nvL --line-numbers
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain KUBE-IPTABLES-HINT (0 references)
num pkts bytes target prot opt in out source destination
Chain KUBE-KUBELET-CANARY (0 references)
num pkts bytes target prot opt in out source destination
Chain OVN-OUTPUT (0 references)
num pkts bytes target prot opt in out source destination
Chain OVN-POSTROUTING (0 references)
num pkts bytes target prot opt in out source destination
Chain OVN-PREROUTING (0 references)
num pkts bytes target prot opt in out source destination
###### However, testing on master shows that the rules do get restored
# iptables-save | grep -A 15 mangle
*mangle
:PREROUTING ACCEPT [3889:1072255]
:INPUT ACCEPT [3762:1064459]
:FORWARD ACCEPT [119:6188]
:OUTPUT ACCEPT [3423:325608]
:POSTROUTING ACCEPT [3522:329553]
:KUBE-IPTABLES-HINT - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:OVN-OUTPUT - [0:0]
:OVN-POSTROUTING - [0:0]
:OVN-PREROUTING - [0:0]
-A POSTROUTING -m comment --comment "kube-ovn postrouting rules" -j OVN-POSTROUTING
-A POSTROUTING -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
-A OVN-POSTROUTING -p tcp -m set --match-set ovn40subnets src -m tcp --tcp-flags RST RST -m state --state INVALID -j DROP
COMMIT
# Completed on Thu Jun 6 02:29:21 2024
root@kube-ovn-worker:/kube-ovn# iptables -t mangle -F;iptables-save | grep -A 15 mangle
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:KUBE-IPTABLES-HINT - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:OVN-OUTPUT - [0:0]
:OVN-POSTROUTING - [0:0]
:OVN-PREROUTING - [0:0]
COMMIT
# Completed on Thu Jun 6 02:30:53 2024
# Generated by iptables-save v1.8.7 on Thu Jun 6 02:30:53 2024
*nat
:PREROUTING ACCEPT [9:1168]
root@kube-ovn-worker:/kube-ovn# iptables-save | grep -A 15 mangle
*mangle
:PREROUTING ACCEPT [186:54365]
:INPUT ACCEPT [180:54053]
:FORWARD ACCEPT [6:312]
:OUTPUT ACCEPT [165:15070]
:POSTROUTING ACCEPT [171:15382]
:KUBE-IPTABLES-HINT - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:OVN-OUTPUT - [0:0]
:OVN-POSTROUTING - [0:0]
:OVN-PREROUTING - [0:0]
-A POSTROUTING -m comment --comment "kube-ovn postrouting rules" -j OVN-POSTROUTING
-A POSTROUTING -o eth0 -p tcp -m tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
-A OVN-POSTROUTING -p tcp -m set --match-set ovn40subnets src -m tcp --tcp-flags RST RST -m state --state INVALID -j DROP
COMMIT
# Completed on Thu Jun 6 02:31:03 2024
root@kube-ovn-worker:/kube-ovn#
If only the mangle table is flushed, the rules are indeed restored.
Testing the 1.12-mc branch
On the mc branch, two of the rules are missing as soon as it is deployed, which is exactly why they cannot be restored.
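A quick check right after deploying the mc branch (a sketch): list the kube-ovn-managed mangle rules and compare against the master output above, where both the POSTROUTING jump and the TCPMSS rule are present:
# iptables-save -t mangle | grep -E 'OVN-POSTROUTING|TCPMSS'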