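Kubernetes cluster kubelet tuning - inotify kernel parameter tuning

Preface

Common kubelet tuning measures include resource reservation (for the kubelet process and for system processes) and configuring eviction-hard thresholds. Setting sensible inotify kernel parameters matters just as much: one of our production K8S clusters (version 1.22) once ran into a kubelet failure caused by inotify exhaustion. Below we run an experiment on a K8S 1.30 cluster, the latest version at the time of writing, to simulate inotify exhaustion and observe the kubelet state on the node.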

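A brief introduction to inotify

inotify is a Linux kernel interface for efficient, real-time filesystem monitoring. A program registers watches on files or directories, is notified as soon as they change, and can react accordingly. It is one of the most important filesystem programming interfaces on Linux.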
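To make the interface concrete, here is a minimal sketch that calls inotify directly through libc with ctypes, so it needs no third-party package. It watches a temporary directory for IN_CREATE, creates one file, and decodes the resulting event. The helper name first_create_event and the file name hello.txt are made up for illustration, and the sketch assumes a Linux host where ctypes.CDLL(None) exposes the libc symbols.

```python
import ctypes
import os
import struct
import tempfile

IN_CREATE = 0x00000100  # event mask value from <sys/inotify.h>
EVENT_HEADER = struct.Struct("iIII")  # struct inotify_event: wd, mask, cookie, len

# Assumption: Linux, where dlopen(NULL) exposes the libc symbols.
_libc = ctypes.CDLL(None, use_errno=True)


def _check(result):
    """Raise OSError with the C errno if a libc call returned a negative value."""
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def first_create_event(directory, filename):
    """Watch `directory`, create `filename` in it, return (mask, name) of the first event."""
    fd = _check(_libc.inotify_init())
    try:
        # Each successful inotify_add_watch consumes one watch from the per-user limit.
        _check(_libc.inotify_add_watch(fd, os.fsencode(directory), IN_CREATE))
        open(os.path.join(directory, filename), "w").close()
        data = os.read(fd, 4096)  # the event is already queued, so this does not block
        _wd, mask, _cookie, length = EVENT_HEADER.unpack_from(data)
        name = data[EVENT_HEADER.size:EVENT_HEADER.size + length].rstrip(b"\0")
        return mask, os.fsdecode(name)
    finally:
        os.close(fd)  # closing the inotify descriptor releases its watches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mask, name = first_create_event(tmp, "hello.txt")
        print(f"mask={mask:#x} name={name}")
```

Every watch added this way counts against the per-user watch limit, which is the resource the experiment in this article exhausts.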

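The consume_inotify_watches program

To exhaust the limit set by /proc/sys/fs/inotify/max_user_watches on a Linux system, a program only has to create an inotify instance and keep adding watches on a very large number of files. The script below, consume_inotify_watches.py, does this in Python for our experiment. It depends on inotify-simple, which you install with pip install inotify-simple: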

import os
import inotify_simple
import tempfile
import errno
import time

def consume_inotify_watches():
    inotify = inotify_simple.INotify()
    watch_flags = inotify_simple.flags.MODIFY

    # Create a list to store watch descriptors
    watch_descriptors = []

    # Define the directory to create files in
    base_dir = '/tmp/inotify_test'

    # Create the base directory if it doesn't exist
    os.makedirs(base_dir, exist_ok=True)

    try:
        while True:
            # Create a temporary file in the base directory
            temp_file = tempfile.NamedTemporaryFile(dir=base_dir, delete=False)
            temp_file.close()

            # Add a new inotify watch on the temporary file
            wd = inotify.add_watch(temp_file.name, watch_flags)
            watch_descriptors.append((wd, temp_file.name))

            print(f'Added watch {wd} on file {temp_file.name}')

    except OSError as e:
        if e.errno == errno.ENOSPC:
            print(f'OSError: {e} (ENOSPC: No space left on device)')
            print('Reached the limit of inotify watches or another system limit.')
        else:
            print(f'OSError: {e}')
        print('Continuing to run...')

    # Enter an infinite loop to keep the program running
    while True:
        try:
            time.sleep(60)
        except KeyboardInterrupt:
            break  # Exit the loop if interrupted by the user

    print('Cleaning up...')
    # Clean up by removing all watches and deleting files
    for wd, temp_file in watch_descriptors:
        inotify.rm_watch(wd)
        os.remove(temp_file)

if __name__ == '__main__':
    consume_inotify_watches()

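The program creates a large number of temporary files under /tmp/inotify_test and adds an inotify watch on each one until the system limit is reached. When add_watch raises OSError with errno ENOSPC, the inotify watch limit has been exhausted. The program then keeps running, and keeps holding its watches, until it is interrupted.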

Build the image and push it to a registry

Create the Dockerfile

FROM python:3.9-slim

WORKDIR /app

COPY consume_inotify_watches.py .

RUN pip install inotify-simple

CMD ["python", "consume_inotify_watches.py"]

Build the image and push it to the target registry with the following commands:

# build
docker build -t inotify-consumer:v1 .
# tag and push
docker tag inotify-consumer:v1 registry.cn-hangzhou.aliyuncs.com/zzwade/zzwade:inotify-consumer-v1
docker push registry.cn-hangzhou.aliyuncs.com/zzwade/zzwade:inotify-consumer-v1

Deploying inotify-consumer on a killercoda K8S 1.30 cluster

I use the two-node K8S cluster provided by killercoda (one worker node and one controlplane node) to deploy the test workloads.

controlplane $ kubectl get nodes -o wide
NAME           STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
controlplane   Ready    control-plane   21d   v1.30.0   172.30.1.2    <none>        Ubuntu 20.04.5 LTS   5.4.0-131-generic   containerd://1.7.13
node01         Ready    <none>          21d   v1.30.0   172.30.2.2    <none>        Ubuntu 20.04.5 LTS   5.4.0-131-generic   containerd://1.7.13

Add a NoSchedule taint to controlplane so that pods can only run on node01:

kubectl taint node controlplane node-role.kubernetes.io/control-plane=:NoSchedule

First, create an nginx deployment with 2 replicas to check that the cluster can create pods normally:

kubectl apply -f ngx_deploy.yaml

ngx_deploy.yaml is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80

The nginx deployment is created successfully.

Now create an inotify-consumer deployment with a single replica:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inotify-consumer-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inotify-consumer
  template:
    metadata:
      labels:
        app: inotify-consumer
    spec:
      containers:
      - name: inotify-consumer
        image: registry.cn-hangzhou.aliyuncs.com/zzwade/zzwade:inotify-consumer-v1

kubectl apply -f inotify_deploy.yaml

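The inotify-consumer pod is created successfully.

Next, verify that the program really has exhausted inotify on the node: running tail -f there reports that inotify is already used up.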
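As an alternative to tail -f, the check can be scripted. The following is a minimal sketch, not part of the original experiment, that tries to create one inotify instance and add one watch through libc and reports which step fails. The function name probe_inotify and the returned strings are hypothetical. The watch limit is accounted per user, so the probe should run as the same user as the process you care about (root, in the kubelet's case).

```python
import ctypes
import errno
import os
import tempfile

IN_MODIFY = 0x00000002  # event mask value from <sys/inotify.h>

# Assumption: Linux, where dlopen(NULL) exposes the libc symbols.
_libc = ctypes.CDLL(None, use_errno=True)


def probe_inotify(path):
    """Try to add a single watch on `path` and classify the outcome."""
    fd = _libc.inotify_init()
    if fd < 0:
        err = ctypes.get_errno()
        if err == errno.EMFILE:
            # EMFILE: inotify instance limit or per-process fd limit reached.
            return "no-instance-available"
        raise OSError(err, os.strerror(err))
    try:
        wd = _libc.inotify_add_watch(fd, os.fsencode(path), IN_MODIFY)
        if wd < 0:
            err = ctypes.get_errno()
            if err == errno.ENOSPC:
                # ENOSPC: the per-user watch limit is used up.
                return "watches-exhausted"
            raise OSError(err, os.strerror(err))
        return "ok"
    finally:
        os.close(fd)  # also releases the probe's own watch


if __name__ == "__main__":
    print(probe_inotify(tempfile.gettempdir()))
```

On a node in the state described here, the probe would be expected to report the watch limit as exhausted, matching the "no space left on device" text that ENOSPC produces in the kubelet log.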

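Check the kubelet process state. The kubelet is still running, but its log already reports that inotify watches are exhausted, for example: inotify_add_watch /sys/fs/cgroup/devices/system.slice/phpsessionclean.service: no space left on device

Use journalctl -u kubelet to view the kubelet log (the entries below are listed newest first):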

-- Logs begin at Sun 2022-11-13 17:25:58 UTC, end at Sun 2024-06-02 15:43:04 UTC. --
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.214014     563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/pids/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.175435     563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/devices/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.175417     563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/memory/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.175361     563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/blkio/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.168847     563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:35:15 node01 kubelet[563]: I0602 15:35:15.270140     563 kuberuntime_container_linux.go:167] "No swap cgroup controller present" swapBehavior="" pod="default/inotify-consumer-deployment-64fb9669b5-k7gjg" containerName="inotify-consumer"
Jun 02 15:34:54 node01 kubelet[563]: I0602 15:34:54.346787     563 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-dc224" (UniqueName: "kubernetes.io/projected/51b582cc-1b62-4e07-b1c3-b5637d6c9175-kube-api-access-dc224") pod "inotify-consumer-deployment-64fb9669b5-k7gjg" (UID: "51b582cc-1b62-4e07-b1c3-b5637d6c9175") " pod="default/inotify-consumer-deployment-64fb9669b5-k7gjg"
Jun 02 15:34:54 node01 kubelet[563]: I0602 15:34:54.255375     563 topology_manager.go:215] "Topology Admit Handler" podUID="51b582cc-1b62-4e07-b1c3-b5637d6c9175" podNamespace="default" podName="inotify-consumer-deployment-64fb9669b5-k7gjg"
Jun 02 15:34:54 node01 kubelet[563]: I0602 15:34:54.254361     563 pod_startup_latency_tracker.go:104] "Observed pod startup duration" pod="default/nginx-deployment-576c6b7b6-gn8l8" podStartSLOduration=945.073709016 podStartE2EDuration="15m45.254340293s" podCreationTimestamp="2024-06-02 15:19:09 +0000 UTC" firstStartedPulling="2024-06-02 15:19:10.583719761 +0000 UTC m=+1263.519988745" lastFinishedPulling="2024-06-02 15:19:10.764351039 +0000 UTC m=+1263.700620022" observedRunningTime="2024-06-02 15:19:11.920369907 +0000 UTC m=+1264.856638897" watchObservedRunningTime="2024-06-02 15:34:54.254340293 +0000 UTC m=+2207.190609286"
...

Now create an openresty deployment with 2 replicas to see whether new pods can still be created normally:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: openresty-deployment
spec:
  selector:
    matchLabels:
      app: openresty
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: openresty
    spec:
      containers:
      - name: openresty
        image: openresty/openresty:alpine
        ports:
        - containerPort: 80

kubectl apply -f openresty_deploy.yaml

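The openresty pods are also created successfully.

However, if the kubelet process is restarted at this point, it fails to start.

The log shows why: with inotify exhausted, cAdvisor cannot add watches on the cgroup directories, so it fails to start and the kubelet exits (entries are again listed newest first):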

-- Logs begin at Sun 2022-11-13 17:25:58 UTC, end at Sun 2024-06-02 15:59:24 UTC. --
Jun 02 15:59:24 node01 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 02 15:59:24 node01 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.360483   18726 kubelet.go:1530] "Failed to start cAdvisor" err="inotify_add_watch /sys/fs/cgroup/memory/system.slice/systemd-journald.service: no space left on device"
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.360421   18726 watcher.go:152] Failed to watch directory "/sys/fs/cgroup/memory/system.slice": inotify_add_watch /sys/fs/cgroup/memory/system.slice/systemd-journald.service: no space left on device
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.359561   18726 watcher.go:152] Failed to watch directory "/sys/fs/cgroup/memory/system.slice/systemd-journald.service": inotify_add_watch /sys/fs/cgroup/memory/system.slice/systemd-journald.service: no space left on device
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.179166   18726 kubelet_network.go:61] "Updating Pod CIDR" originalPodCIDR="" newPodCIDR="192.168.1.0/24"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.173062   18726 kuberuntime_manager.go:1523] "Updating runtime config through cri with podcidr" CIDR="192.168.1.0/24"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.170919   18726 kubelet_node_status.go:76] "Successfully registered node" node="node01"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.170629   18726 kubelet_node_status.go:112] "Node was previously registered" node="node01"
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.169398   18726 kubelet.go:2361] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.156246   18726 kubelet_node_status.go:73] "Attempting to register node" node="node01"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.126395   18726 factory.go:221] Registration of the containerd container factory successfully
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.091227   18726 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.090619   18726 factory.go:221] Registration of the systemd container factory successfully
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.069000   18726 kubelet.go:2361] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.068889   18726 kubelet.go:2337] "Starting kubelet main sync loop"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.067265   18726 status_manager.go:217] "Starting to sync pod status with apiserver"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.066979   18726 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.063932   18726 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.058476   18726 reconciler.go:26] "Reconciler: start to sync state"
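When the kubelet log is long, it helps to pull out just the paths that could not be watched. The sketch below is an illustration only: the function name failed_watch_paths is hypothetical, and the regular expression assumes the message format seen in the log excerpts above ("inotify_add_watch <path>: no space left on device").

```python
import re
from collections import Counter

# Assumption: kubelet/cAdvisor report ENOSPC in this exact textual form.
_FAILED_WATCH = re.compile(
    r"inotify_add_watch (?P<path>\S+): no space left on device"
)


def failed_watch_paths(lines):
    """Count, per path, how many log lines report a failed inotify_add_watch."""
    counts = Counter()
    for line in lines:
        match = _FAILED_WATCH.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts


# Two lines taken from the log excerpts in this article.
SAMPLE = [
    'Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.360483   18726 '
    'kubelet.go:1530] "Failed to start cAdvisor" err="inotify_add_watch '
    '/sys/fs/cgroup/memory/system.slice/systemd-journald.service: '
    'no space left on device"',
    'Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.068889   18726 '
    'kubelet.go:2337] "Starting kubelet main sync loop"',
]

if __name__ == "__main__":
    for path, count in failed_watch_paths(SAMPLE).most_common():
        print(count, path)
```

To run it against a real node, the output of journalctl -u kubelet --no-pager could be saved to a file and its lines passed to failed_watch_paths.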

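Checking inotify usage

A production environment is certainly far more complex than this experiment. To find out which pods or processes are consuming large numbers of inotify watches, use this command: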

find /proc/*/fd/ -type l -lname "anon_inode:inotify" -printf "%hinfo/%f\n" | xargs grep -cE "^inotify" | column -t -s:
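The same lookup can be sketched in Python, which makes it easier to sort the result or feed it into monitoring. This is a rough equivalent of the command above rather than a drop-in replacement: the function name count_inotify_watches and the proc_root parameter are hypothetical, and it should run as root so that other processes' fd directories are readable.

```python
import os
from collections import Counter


def count_inotify_watches(proc_root="/proc"):
    """Return {pid: watch_count} for every process holding inotify descriptors."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) != "anon_inode:inotify":
                    continue
                fdinfo = os.path.join(proc_root, entry, "fdinfo", fd)
                with open(fdinfo) as info:
                    # Each watch appears as one line starting with "inotify".
                    counts[int(entry)] += sum(
                        1 for line in info if line.startswith("inotify")
                    )
            except OSError:
                continue  # descriptor closed while we were looking
    return dict(counts)


if __name__ == "__main__":
    usage = count_inotify_watches() if os.path.isdir("/proc") else {}
    top = sorted(usage.items(), key=lambda item: item[1], reverse=True)[:10]
    for pid, watches in top:
        print(f"{pid:>8} {watches}")
```

Unlike the shell pipeline, which prints one count per inotify descriptor, this sketch sums the watches of all descriptors belonging to a process.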

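Solution

Fixing inotify exhaustion comes down to two options: raise the inotify kernel parameter, or stop the program that is consuming an abnormal number of watches. On many systems the default value of max_user_watches is 8192: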

[root@z2024 ~]# cat /proc/sys/fs/inotify/max_user_watches
8192

On the K8S cluster provided by killercoda, the inotify kernel parameter is set to 524288:

controlplane $ cat /proc/sys/fs/inotify/max_user_watches
524288

Adjust the inotify kernel parameters to suit your own workloads.
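As a starting point, the sketch below reads the current limits and renders a sysctl fragment with a new value. It is an illustration under stated assumptions: the function names are hypothetical, and max_user_instances and max_queued_events are sibling files under /proc/sys/fs/inotify that this article does not otherwise discuss. The value 524288 is simply the one seen on the killercoda cluster, not a recommendation.

```python
import os

INOTIFY_DIR = "/proc/sys/fs/inotify"
LIMITS = ("max_user_watches", "max_user_instances", "max_queued_events")


def read_inotify_limits(base_dir=INOTIFY_DIR):
    """Read the current inotify limits as a {name: int} mapping."""
    limits = {}
    for name in LIMITS:
        with open(os.path.join(base_dir, name)) as f:
            limits[name] = int(f.read().strip())
    return limits


def render_sysctl_conf(max_user_watches, max_user_instances=None):
    """Render sysctl.d-style lines for the chosen inotify limits."""
    lines = [f"fs.inotify.max_user_watches = {max_user_watches}"]
    if max_user_instances is not None:
        lines.append(f"fs.inotify.max_user_instances = {max_user_instances}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 524288 is the value observed on the killercoda cluster above.
    print(render_sysctl_conf(524288), end="")
```

On a Linux host, read_inotify_limits() returns the live values. The rendered text could be saved to a file under /etc/sysctl.d/ (for example a hypothetical 99-inotify.conf) and loaded with sysctl --system, so that the setting survives a reboot.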

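Summary

The experiment above simulated inotify exhaustion on a node and showed how it breaks the kubelet: the running kubelet starts logging watch failures, and a restarted kubelet cannot start at all. Setting sensible inotify kernel parameters is therefore very important for the stability of both the system and the K8S cluster. If you spot any problems or see anything that could be improved, feel free to leave a comment, or follow my WeChat official account 运维小猪. Thanks!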