Docker Networking Implementation
As a container platform, Docker gives applications runtime environments that are isolated from one another. Each container gets its own IP address and its own network stack, and in principle every port on the host can be mapped to a different container. Yet the host still has only one physical IP, so how does Docker manage to assign an IP to each container? And how does that virtual IP communicate with other network devices?
Network Namespaces
Docker creates an independent network namespace for each container, and each namespace has a complete Linux network stack. However, ip netns will not show the namespaces Docker creates, because ip netns looks in /var/run/netns while Docker keeps its namespaces under /var/run/docker/netns. So first we need to link the two directories:
ln -s /var/run/docker/netns /var/run/netns
Then query the namespaces again:
$ ip netns
74d131f5d904 (id: 4)
da00eb9092e3 (id: 3)
93a0713223a3 (id: 2)
2545bc356693 (id: 1)
5f24774c17ef (id: 0)
default
At this point you will notice that the namespace IDs do not match the container IDs. The mapping between the two is recorded in the output of docker inspect:
"NetworkSettings": {
"Bridge": "",
"SandboxID": "74d131f5d9048cd7521ee047e09b8b893e8c3b70c3ee830478b44f2d62eaf268",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"Ports": {},
"SandboxKey": "/var/run/docker/netns/74d131f5d904",
"SecondaryIPAddresses": null,
A bare namespace is still some distance from a network environment a container can actually use: a container with only a namespace is like a host with no network cable plugged in. So we first connect the namespace to a network interface with a virtual cable, a veth pair.
Then we assign an IP to the namespace end of the veth. Now the container has its own IP, and that end of the veth pair appears as the container's eth0 interface:
root@7084cc341104:/# ip a
...
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
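To make the mechanics concrete, here is a minimal hand-made sketch of the same plumbing using plain ip commands (the names demo, veth-host, and veth-ctr are invented, and 172.30.0.0/16 stands in for 172.17.0.0/16 to avoid clashing with a live docker0):

ip netns add demo                                    # create an empty namespace
ip link add veth-host type veth peer name veth-ctr   # create a veth pair (the "virtual cable")
ip link set veth-ctr netns demo                      # move one end into the namespace
ip netns exec demo ip link set veth-ctr name eth0    # rename it to eth0, as Docker does
ip netns exec demo ip addr add 172.30.0.2/16 dev eth0
ip netns exec demo ip link set eth0 up
ip link set veth-host up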
docker0
If you have ever run ifconfig on a Docker host, docker0 will look familiar: it shows up in every Docker environment. So what is docker0, and what does it do?
[node2 ~]$ ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:78:da:36:b0 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0 is a network bridge. Its physical counterpart is a layer-2 switch rather than a simple hub: a hub blindly forwards every frame arriving on one port to all other ports, while a Linux bridge learns MAC addresses and forwards each frame toward its destination, flooding only when the destination is unknown. Either way, devices attached to the same bridge can reach one another over it.
In practice, containers do not connect directly to the host's eth0. Instead, multiple containers attach to docker0. Containers on the same bridge can reach each other, while containers on different bridges are isolated from one another.
[node2 ~]$ brctl show
bridge name bridge id STP enabled interfaces
docker0 8000.024278da36b0 no veth85f5191
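Continuing the hand-made sketch from above, plugging the host end of the veth into a bridge is exactly the relationship brctl show displays (br-demo is an invented name; Docker does the equivalent for docker0 and each container's veth automatically):

ip link add name br-demo type bridge      # create a bridge, the stand-in for docker0
ip addr add 172.30.0.1/16 dev br-demo     # give it the gateway address of the subnet
ip link set br-demo up
ip link set veth-host master br-demo      # attach the veth's host end to the bridge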
How Containers Reach Other Devices
First, one thing needs to be stated clearly: although a namespace is a virtual network space created on the host, logically the host treats it as an independent network device; the namespace sits alongside the host, not inside it. Likewise, from its own point of view the namespace has no idea it is virtual. It behaves like any other host, and it does not gain connectivity to the host's network merely because the host created it.
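This isolation is easy to observe with a bare namespace (the name isolated is invented): until something wires it up, it cannot reach the host at all.

ip netns add isolated
ip netns exec isolated ping -c 1 192.168.0.17
# connect: Network is unreachable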
Start a container:
docker run -tid -p 80:80 ubuntu:18.04
Network information inside the container:
root@7084cc341104:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
Network information on the host:
[node2 ~]$ ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:78:da:36:b0 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.17 netmask 255.255.254.0 broadcast 0.0.0.0
ether ba:51:7b:8e:60:6a txqueuelen 0 (Ethernet)
RX packets 17 bytes 1418 (1.3 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 9 bytes 770 (770.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.18.0.9 netmask 255.255.0.0 broadcast 0.0.0.0
ether 02:42:ac:12:00:09 txqueuelen 0 (Ethernet)
RX packets 40021 bytes 57020190 (54.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 5434 bytes 821490 (802.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
loop txqueuelen 1 (Local Loopback)
RX packets 122 bytes 11156 (10.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 122 bytes 11156 (10.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth85f5191: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 8a:1e:0f:f0:b7:f8 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
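A side note on pairing: the @if5 suffix on the container's eth0 means its peer is interface index 5 on the host. One common way to confirm which host-side veth belongs to which container is the veth driver's peer_ifindex statistic (a sketch; index values and exact output depend on the host):

ethtool -S veth85f5191
# NIC statistics:
#      peer_ifindex: 4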
Containers reaching each other on the same host
Containers on the same host reach each other directly through docker0, which forwards the traffic to the target container.
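A quick way to watch this (busybox is chosen here simply because its image ships with ping; the names c1 and c2 are invented):

docker run -dit --name c1 busybox
docker run -dit --name c2 busybox
docker inspect -f '{{.NetworkSettings.IPAddress}}' c2   # e.g. 172.17.0.3
docker exec c1 ping -c 2 172.17.0.3                     # replies arrive straight over docker0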
The host accessing a container
While a script on the host pings the container (note the ping process below), capture traffic directly on docker0 with tcpdump:
[node2 ~]$ ps -ef
UID PID PPID C STIME TTY TIME CMD
...
root 10052 1 0 17:01 pts/0 00:00:00 bash -l
root 15523 10052 0 17:24 pts/0 00:00:00 sh sxm.sh
root 15524 15523 0 17:24 pts/0 00:00:00 ping 172.17.0.2
[node2 ~]$ tcpdump -n -i docker0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:24:20.743053 IP 172.17.0.1 > 172.17.0.2: ICMP echo request, id 15524, seq 18, length 64
17:24:20.743105 IP 172.17.0.2 > 172.17.0.1: ICMP echo reply, id 15524, seq 18, length 64
17:24:21.743080 IP 172.17.0.1 > 172.17.0.2: ICMP echo request, id 15524, seq 19, length 64
17:24:21.743132 IP 172.17.0.2 > 172.17.0.1: ICMP echo reply, id 15524, seq 19, length 64
17:24:22.743060 IP 172.17.0.1 > 172.17.0.2: ICMP echo request, id 15524, seq 20, length 64
17:24:22.743113 IP 172.17.0.2 > 172.17.0.1: ICMP echo reply, id 15524, seq 20, length 64
17:24:23.743062 IP 172.17.0.1 > 172.17.0.2: ICMP echo request, id 15524, seq 21, length 64
17:24:23.743140 IP 172.17.0.2 > 172.17.0.1: ICMP echo reply, id 15524, seq 21, length 64
[node2 ~]$ tcpdump -n -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
The ICMP packets never touch eth0; they go straight out docker0 to the container. Next, check the routing table:
[node2 ~]$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 172.18.0.1 0.0.0.0 UG 0 0 0 eth1
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth1
192.168.0.0 0.0.0.0 255.255.254.0 U 0 0 0 eth0
It is easy to see that the routing table directed the ICMP packets straight to docker0. One question remains, though: when pinging the container, why is the source IP 172.17.0.1 (docker0's address) rather than 192.168.0.17 (eth0's address)? Answering that, and understanding the remaining traffic directions, calls for a short introduction to NAT.
NAT is one part of iptables, the framework responsible for inspecting and modifying packets. iptables processes a packet in five stages (chains): PREROUTING on arrival, INPUT for local delivery, FORWARD for relayed traffic, OUTPUT for locally generated packets, and POSTROUTING just before the packet leaves. Roughly: inbound traffic flows PREROUTING -> routing decision -> INPUT or FORWARD -> POSTROUTING, while locally generated traffic flows OUTPUT -> POSTROUTING.
NAT rewrites a packet's destination or source IP, but it can only do so in the PREROUTING, OUTPUT, and POSTROUTING stages. Now let's look at what the host's NAT table contains:
[node2 ~]$ iptables -t nat -nvL
Chain PREROUTING (policy ACCEPT 3629 packets, 218K bytes)
pkts bytes target prot opt in out source destination
4098 246K DOCKER all -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain INPUT (policy ACCEPT 3615 packets, 217K bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 90 packets, 5827 bytes)
pkts bytes target prot opt in out source destination
50 3318 DOCKER_OUTPUT all -- * * 0.0.0.0/0 127.0.0.11
0 0 DOCKER all -- * * 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL
Chain POSTROUTING (policy ACCEPT 126 packets, 8209 bytes)
pkts bytes target prot opt in out source destination
14 889 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
50 3318 DOCKER_POSTROUTING all -- * * 0.0.0.0/0 127.0.0.11
0 0 MASQUERADE tcp -- * * 172.17.0.2 172.17.0.2 tcp dpt:80
Chain DOCKER (2 references)
pkts bytes target prot opt in out source destination
2 168 RETURN all -- docker0 * 0.0.0.0/0 0.0.0.0/0
0 0 DNAT tcp -- !docker0 * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 to:172.17.0.2:80
Chain DOCKER_OUTPUT (1 references)
pkts bytes target prot opt in out source destination
0 0 DNAT tcp -- * * 0.0.0.0/0 127.0.0.11 tcp dpt:53 to:127.0.0.11:44436
50 3318 DNAT udp -- * * 0.0.0.0/0 127.0.0.11 udp dpt:53 to:127.0.0.11:49966
Chain DOCKER_POSTROUTING (1 references)
pkts bytes target prot opt in out source destination
0 0 SNAT tcp -- * * 127.0.0.11 0.0.0.0/0 tcp spt:44436 to::53
0 0 SNAT udp -- * * 127.0.0.11 0.0.0.0/0 udp spt:49966 to::53
The NAT table organizes rules into chains: when a packet matches a rule's conditions, the rule's target action is applied, or the packet jumps to another chain for further matching. Sticking with the host-pings-container scenario, look first at the POSTROUTING stage:
Chain POSTROUTING (policy ACCEPT 126 packets, 8209 bytes)
pkts bytes target prot opt in out source destination
14 889 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
50 3318 DOCKER_POSTROUTING all -- * * 0.0.0.0/0 127.0.0.11
0 0 MASQUERADE tcp -- * * 172.17.0.2 172.17.0.2 tcp dpt:80
At first glance the MASQUERADE rule looks responsible: the source 172.17.0.1 does fall inside 172.17.0.0/16. But note its out-interface condition, !docker0; traffic leaving through docker0, as this ping does, never matches it. The source IP seen in tcpdump is simply the kernel's normal source-address selection: the route for 172.17.0.0/16 goes out docker0, so docker0's address 172.17.0.1 is chosen. MASQUERADE, which rewrites the source IP to that of the outgoing interface, kicks in when container traffic leaves through eth0 or eth1, as we will see below when the container reaches other hosts.
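The kernel's choice of source address can be confirmed directly (output abridged; the exact format varies by distribution):

ip route get 172.17.0.2
# 172.17.0.2 dev docker0 src 172.17.0.1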
Other targets you will encounter in the NAT table:
- ACCEPT: accept the packet
- DROP: drop the packet
- SNAT: rewrite the packet's source address
- DNAT: rewrite the packet's destination address
- MASQUERADE: a dynamic variant of SNAT that rewrites the source address to the IP of whichever interface the packet leaves through
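For reference, the two Docker-installed rules discussed above correspond to iptables commands roughly like the following; this is a sketch of their equivalents, not something to add by hand on a live host:

# SNAT container traffic that leaves through any interface other than docker0
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
# DNAT external traffic arriving on port 80 to the published container port
iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport 80 -j DNAT --to-destination 172.17.0.2:80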
A container accessing the host
The container hands packets through its veth to docker0, and since docker0 is itself an interface on the host, the packets have already arrived at their destination:
[node2 ~]$ tcpdump -n -i docker0 -vv
tcpdump: listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:16:42.831069 IP (tos 0x0, ttl 64, id 61181, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.0.2 > 192.168.0.17: ICMP echo request, id 382, seq 241, length 64
17:16:42.831133 IP (tos 0x0, ttl 64, id 12943, offset 0, flags [none], proto ICMP (1), length 84)
192.168.0.17 > 172.17.0.2: ICMP echo reply, id 382, seq 241, length 64
17:16:43.831072 IP (tos 0x0, ttl 64, id 61293, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.0.2 > 192.168.0.17: ICMP echo request, id 382, seq 242, length 64
17:16:43.831117 IP (tos 0x0, ttl 64, id 13129, offset 0, flags [none], proto ICMP (1), length 84)
192.168.0.17 > 172.17.0.2: ICMP echo reply, id 382, seq 242, length 64
[node2 ~]$ tcpdump -n -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
A container accessing other hosts
Ping another host, 192.168.0.18, from inside the container:
root@c3d1a0360e97:/# ping 192.168.0.18
PING 192.168.0.18 (192.168.0.18) 56(84) bytes of data.
64 bytes from 192.168.0.18: icmp_seq=1 ttl=63 time=0.239 ms
64 bytes from 192.168.0.18: icmp_seq=2 ttl=63 time=0.127 ms
Meanwhile, capture on the host's docker0 and eth0. The picture resembles the container-to-host case, except the packets are now forwarded onward through eth0, and MASQUERADE rewrites the source IP along the way (172.17.0.2 on docker0 becomes 192.168.0.17 on eth0):
[node2 ~]$ tcpdump -n -i docker0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:38:02.435348 IP 172.17.0.2 > 192.168.0.18: ICMP echo request, id 507, seq 21, length 64
17:38:02.435477 IP 192.168.0.18 > 172.17.0.2: ICMP echo reply, id 507, seq 21, length 64
17:38:03.435339 IP 172.17.0.2 > 192.168.0.18: ICMP echo request, id 507, seq 22, length 64
17:38:03.435473 IP 192.168.0.18 > 172.17.0.2: ICMP echo reply, id 507, seq 22, length 64
17:38:04.435051 IP 172.17.0.2 > 192.168.0.18: ICMP echo request, id 507, seq 23, length 64
17:38:04.435168 IP 192.168.0.18 > 172.17.0.2: ICMP echo reply, id 507, seq 23, length 64
...
[node2 ~]$ tcpdump -n -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:39:36.455071 IP 192.168.0.17 > 192.168.0.18: ICMP echo request, id 507, seq 115, length 64
17:39:36.455132 IP 192.168.0.18 > 192.168.0.17: ICMP echo reply, id 507, seq 115, length 64
17:39:37.455187 IP 192.168.0.17 > 192.168.0.18: ICMP echo request, id 507, seq 116, length 64
17:39:37.455278 IP 192.168.0.18 > 192.168.0.17: ICMP echo reply, id 507, seq 116, length 64
17:39:38.455096 IP 192.168.0.17 > 192.168.0.18: ICMP echo request, id 507, seq 117, length 64
17:39:38.455170 IP 192.168.0.18 > 192.168.0.17: ICMP echo reply, id 507, seq 117, length 64
17:39:39.455222 IP 192.168.0.17 > 192.168.0.18: ICMP echo request, id 507, seq 118, length 64
...
On host 192.168.0.18:
[node2 ~]$ tcpdump -n -i eth0 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:56:09.662920 IP 192.168.0.17 > 192.168.0.18: ICMP echo request, id 366, seq 1, length 64
16:56:09.662967 IP 192.168.0.18 > 192.168.0.17: ICMP echo reply, id 366, seq 1, length 64
16:56:10.664500 IP 192.168.0.17 > 192.168.0.18: ICMP echo request, id 366, seq 2, length 64
16:56:10.664546 IP 192.168.0.18 > 192.168.0.17: ICMP echo reply, id 366, seq 2, length 64
...
Other hosts accessing the container
When starting the container we published port 80 for external traffic. Now access 192.168.0.17:80 from host 192.168.0.18 and trace what the packets go through on their way to the container, again capturing on 192.168.0.17's eth0 and docker0:
[node2 ~]$ tcpdump -n -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:50:49.125332 IP 192.168.0.18.37188 > 192.168.0.17.http: Flags [S], seq 3131305202, win 29200, options [mss 1460,sackOK,TS val 2314544572 ecr 0,nop,wscale 7], length 0
17:50:49.125435 IP 192.168.0.17.http > 192.168.0.18.37188: Flags [R.], seq 0, ack 3131305203, win 0, length 0
...
[node2 ~]$ tcpdump -n -i docker0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:49:09.885622 IP 192.168.0.18.58076 > 172.17.0.2.http: Flags [S], seq 618561937, win 29200, options [mss 1460,sackOK,TS val 2314519762 ecr 0,nop,wscale 7], length 0
17:49:09.885705 IP 172.17.0.2.http > 192.168.0.18.58076: Flags [R.], seq 0, ack 618561938, win 0, length 0
...
By the time the traffic reaches docker0, the destination IP has changed from the host's address to the container's. No prizes for guessing: it is NAT again. Following the PREROUTING stage, we eventually reach this rule:
Chain DOCKER (2 references)
pkts bytes target prot opt in out source destination
2 168 RETURN all -- docker0 * 0.0.0.0/0 0.0.0.0/0
0 0 DNAT tcp -- !docker0 * 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 to:172.17.0.2:80
Here DNAT rewrites the packet's original destination IP to the container's address, which in turn makes the routing table forward the packet to docker0 and finally into the container. This is Docker's native published-port mechanism (the same idea Kubernetes exposes as NodePort), and it is what gives every port on the host the potential to front its own "virtual machine".
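To watch this DNAT path end to end with something actually listening (the captures above show RST replies because nothing was bound to port 80 inside the container), a hypothetical check with nginx:

# on 192.168.0.17
docker run -d -p 80:80 nginx
# on 192.168.0.18
curl http://192.168.0.17/        # the SYN is DNATed to 172.17.0.2:80 in PREROUTING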