Preface
Common kubelet optimizations include reserving resources (for the kubelet process and for system processes) and configuring eviction-hard thresholds. Setting reasonable inotify kernel parameters is just as important: our previous production K8S cluster (version 1.22) once ran into a kubelet process failure caused by inotify exhaustion. Below we run an experiment on a current K8S cluster (version 1.30) to simulate inotify exhaustion on a node and observe what happens to kubelet.
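As a rough illustration of those two optimizations (not the configuration used in this experiment), resource reservation and hard-eviction thresholds can be passed to kubelet as flags; the values below are placeholders and should be sized for your own nodes, and on kubeadm nodes the same settings usually live in the kubelet config file as systemReserved / kubeReserved / evictionHard:
# e.g. in the kubelet service drop-in (/etc/default/kubelet on Debian/Ubuntu); illustrative values only
KUBELET_EXTRA_ARGS="--system-reserved=cpu=500m,memory=512Mi --kube-reserved=cpu=500m,memory=512Mi --eviction-hard=memory.available<500Mi,nodefs.available<10%"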
A brief introduction to inotify
inotify provides an efficient, real-time way to monitor the filesystem: a program can be notified as soon as a file or directory changes and react accordingly. It is an important filesystem programming interface on Linux/Unix systems.
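inotify consumption is bounded by a few kernel parameters, which you can inspect directly under /proc:
cat /proc/sys/fs/inotify/max_user_watches    # max number of watches per real user ID
cat /proc/sys/fs/inotify/max_user_instances  # max number of inotify instances per real user ID
cat /proc/sys/fs/inotify/max_queued_events   # max queued events per inotify instance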
The consume_inotify_watches program
To write a program that exhausts the /proc/sys/fs/inotify/max_user_watches limit on a Linux system, you can add an inotify watch on each of a large number of files until the per-user limit is hit (a single inotify instance is enough, since max_user_watches counts watches rather than instances). The consume_inotify_watches.py below is the Python code used for this experiment; it requires the inotify-simple dependency, installed with pip install inotify-simple:
import os
import inotify_simple
import tempfile
import errno
import time


def consume_inotify_watches():
    inotify = inotify_simple.INotify()
    watch_flags = inotify_simple.flags.MODIFY
    # Create a list to store watch descriptors
    watch_descriptors = []
    # Define the directory to create files in
    base_dir = '/tmp/inotify_test'
    # Create the base directory if it doesn't exist
    os.makedirs(base_dir, exist_ok=True)
    try:
        while True:
            # Create a temporary file in the base directory
            temp_file = tempfile.NamedTemporaryFile(dir=base_dir, delete=False)
            temp_file.close()
            # Add a new inotify watch on the temporary file
            wd = inotify.add_watch(temp_file.name, watch_flags)
            watch_descriptors.append((wd, temp_file.name))
            print(f'Added watch {wd} on file {temp_file.name}')
    except OSError as e:
        if e.errno == errno.ENOSPC:
            print(f'OSError: {e} (ENOSPC: No space left on device)')
            print('Reached the limit of inotify watches or another system limit.')
        else:
            print(f'OSError: {e}')
    print('Continuing to run...')
    # Enter an infinite loop to keep the program running
    while True:
        try:
            time.sleep(60)
        except KeyboardInterrupt:
            break  # Exit the loop if interrupted by the user
    print('Cleaning up...')
    # Clean up by removing all watches and deleting files
    for wd, temp_file in watch_descriptors:
        inotify.rm_watch(wd)
        os.remove(temp_file)


if __name__ == '__main__':
    consume_inotify_watches()
This program creates a large number of temporary files under /tmp/inotify_test and adds an inotify watch on each one until the system limit is reached. When the OSError is caught, the inotify watch limit has been exhausted.
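If you want to try the script directly on a disposable test machine first (outside of the container image built below), a minimal run looks like this; note that max_user_watches is accounted per real user ID, so run it as the user whose limit you want to exhaust:
pip install inotify-simple
python3 consume_inotify_watches.py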
Build the image and push it to an image registry
Create a Dockerfile:
FROM python:3.9-slim
WORKDIR /app
COPY consume_inotify_watches.py .
RUN pip install inotify-simple
CMD ["python", "consume_inotify_watches.py"]
Build the image and upload it to the target registry with the following commands:
# build
docker build -t inotify-consumer:v1 .
# tag and push
docker tag inotify-consumer:v1 registry.cn-hangzhou.aliyuncs.com/zzwade/zzwade:inotify-consumer-v1
docker push registry.cn-hangzhou.aliyuncs.com/zzwade/zzwade:inotify-consumer-v1
Deploying inotify-consumer on killercoda K8S 1.30
I use the 2-node K8S cluster provided by killercoda (one worker node and one controlplane node) to deploy the test workloads:
controlplane $ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
controlplane Ready control-plane 21d v1.30.0 172.30.1.2 <none> Ubuntu 20.04.5 LTS 5.4.0-131-generic containerd://1.7.13
node01 Ready <none> 21d v1.30.0 172.30.2.2 <none> Ubuntu 20.04.5 LTS 5.4.0-131-generic containerd://1.7.13
Taint the controlplane node with NoSchedule so that pods can only be scheduled onto node01:
kubectl taint node controlplane node-role.kubernetes.io/control-plane=:NoSchedule
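You can confirm the taint is in place with:
kubectl describe node controlplane | grep -i taints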
First, create an nginx Deployment with 2 replicas to check whether the cluster can create pods normally:
kubectl apply -f ngx_deploy.yaml
The contents of ngx_deploy.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
As expected, the nginx Deployment is created successfully; you can verify it for example as follows.
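Listing the pods by the app=nginx label should show both replicas Running on node01:
kubectl get pods -l app=nginx -o wide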
Now create an inotify-consumer Deployment with 1 replica:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inotify-consumer-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inotify-consumer
  template:
    metadata:
      labels:
        app: inotify-consumer
    spec:
      containers:
      - name: inotify-consumer
        image: registry.cn-hangzhou.aliyuncs.com/zzwade/zzwade:inotify-consumer-v1
kubectl apply -f inotify_deploy.yaml
You can see that the inotify-consumer pod is created successfully.
Now verify that the program has indeed exhausted inotify: running tail -f on node01 reports that inotify has been exhausted.
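Any tool on node01 that needs a new inotify watch will now hit the limit; for example, tailing any log file (the exact error text depends on the coreutils version):
tail -f /var/log/syslog   # run on node01; tail relies on inotify to follow the file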
Check the kubelet process: it is still in the running state, but it is already reporting that inotify is exhausted: inotify_add_watch /sys/fs/cgroup/devices/system.slice/phpsessionclean.service: no space left on device
View the kubelet logs with journalctl -u kubelet:
-- Logs begin at Sun 2022-11-13 17:25:58 UTC, end at Sun 2024-06-02 15:43:04 UTC. --
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.214014 563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/pids/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.175435 563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/devices/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.175417 563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/memory/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.175361 563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/blkio/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:39:04 node01 kubelet[563]: W0602 15:39:04.168847 563 watcher.go:93] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/phpsessionclean.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/phpsessionclean.service: no space left on device
Jun 02 15:35:15 node01 kubelet[563]: I0602 15:35:15.270140 563 kuberuntime_container_linux.go:167] "No swap cgroup controller present" swapBehavior="" pod="default/inotify-consumer-deployment-64fb9669b5-k7gjg" containerName="inotify-consumer"
Jun 02 15:34:54 node01 kubelet[563]: I0602 15:34:54.346787 563 reconciler_common.go:247] "operationExecutor.VerifyControllerAttachedVolume started for volume "kube-api-access-dc224" (UniqueName: "kubernetes.io/projected/51b582cc-1b62-4e07-b1c3-b5637d6c9175-kube-api-access-dc224") pod "inotify-consumer-deployment-64fb9669b5-k7gjg" (UID: "51b582cc-1b62-4e07-b1c3-b5637d6c9175") " pod="default/inotify-consumer-deployment-64fb9669b5-k7gjg"
Jun 02 15:34:54 node01 kubelet[563]: I0602 15:34:54.255375 563 topology_manager.go:215] "Topology Admit Handler" podUID="51b582cc-1b62-4e07-b1c3-b5637d6c9175" podNamespace="default" podName="inotify-consumer-deployment-64fb9669b5-k7gjg"
Jun 02 15:34:54 node01 kubelet[563]: I0602 15:34:54.254361 563 pod_startup_latency_tracker.go:104] "Observed pod startup duration" pod="default/nginx-deployment-576c6b7b6-gn8l8" podStartSLOduration=945.073709016 podStartE2EDuration="15m45.254340293s" podCreationTimestamp="2024-06-02 15:19:09 +0000 UTC" firstStartedPulling="2024-06-02 15:19:10.583719761 +0000 UTC m=+1263.519988745" lastFinishedPulling="2024-06-02 15:19:10.764351039 +0000 UTC m=+1263.700620022" observedRunningTime="2024-06-02 15:19:11.920369907 +0000 UTC m=+1264.856638897" watchObservedRunningTime="2024-06-02 15:34:54.254340293 +0000 UTC m=+2207.190609286"
...
Now create an openresty Deployment with 2 replicas and check whether the cluster can still schedule and run it normally:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openresty-deployment
spec:
  selector:
    matchLabels:
      app: openresty
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: openresty
    spec:
      containers:
      - name: openresty
        image: openresty/openresty:alpine
        ports:
        - containerPort: 80
kubectl apply -f openresty_deploy.yaml
The openresty pods are also created successfully.
At this point, if you restart the kubelet process, you will find that it fails to start.
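To reproduce this, restart kubelet on node01 and check its status:
systemctl restart kubelet
systemctl status kubelet --no-pager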
The logs show that, because inotify has been exhausted, cAdvisor and the cgroup watchers cannot start, which causes the kubelet startup to fail:
-- Logs begin at Sun 2022-11-13 17:25:58 UTC, end at Sun 2024-06-02 15:59:24 UTC. --
Jun 02 15:59:24 node01 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 02 15:59:24 node01 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.360483 18726 kubelet.go:1530] "Failed to start cAdvisor" err="inotify_add_watch /sys/fs/cgroup/memory/system.slice/systemd-journald.service: no space left on device"
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.360421 18726 watcher.go:152] Failed to watch directory "/sys/fs/cgroup/memory/system.slice": inotify_add_watch /sys/fs/cgroup/memory/system.slice/systemd-journald.service: no space left on device
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.359561 18726 watcher.go:152] Failed to watch directory "/sys/fs/cgroup/memory/system.slice/systemd-journald.service": inotify_add_watch /sys/fs/cgroup/memory/system.slice/systemd-journald.service: no space left on device
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.179166 18726 kubelet_network.go:61] "Updating Pod CIDR" originalPodCIDR="" newPodCIDR="192.168.1.0/24"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.173062 18726 kuberuntime_manager.go:1523] "Updating runtime config through cri with podcidr" CIDR="192.168.1.0/24"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.170919 18726 kubelet_node_status.go:76] "Successfully registered node" node="node01"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.170629 18726 kubelet_node_status.go:112] "Node was previously registered" node="node01"
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.169398 18726 kubelet.go:2361] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.156246 18726 kubelet_node_status.go:73] "Attempting to register node" node="node01"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.126395 18726 factory.go:221] Registration of the containerd container factory successfully
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.091227 18726 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.090619 18726 factory.go:221] Registration of the systemd container factory successfully
Jun 02 15:59:24 node01 kubelet[18726]: E0602 15:59:24.069000 18726 kubelet.go:2361] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.068889 18726 kubelet.go:2337] "Starting kubelet main sync loop"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.067265 18726 status_manager.go:217] "Starting to sync pod status with apiserver"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.066979 18726 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.063932 18726 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
Jun 02 15:59:24 node01 kubelet[18726]: I0602 15:59:24.058476 18726 reconciler.go:26] "Reconciler: start to sync state"
Checking inotify usage
A production environment will be far more complex than this lab. To find out which pods or processes are holding large numbers of inotify watches, you can use this command:
find /proc/*/fd/ -type l -lname "anon_inode:inotify" -printf "%hinfo/%f\n" | xargs grep -cE "^inotify" | column -t -s:
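The command above prints each /proc/<PID>/fdinfo/<fd> path together with the number of watches held on that file descriptor. To map a heavy consumer back to a pod or container, one simple option is to inspect that PID's command line and cgroup path (the cgroup path of a containerized process contains the pod UID and container ID); the PID below is just a placeholder:
PID=12345                      # replace with a PID reported by the command above
ps -o pid,comm,args -p "$PID"  # what process is holding the watches
cat /proc/"$PID"/cgroup        # the cgroup path reveals the pod UID / container ID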
Solution
The fix for inotify exhaustion boils down to either increasing the inotify kernel parameters or stopping the program that is abnormally consuming inotify watches. The default max_user_watches on many systems is 8192:
[root@z2024 ~]# cat /proc/sys/fs/inotify/max_user_watches
8192
On the K8S cluster provided by killercoda, max_user_watches is set to 524288:
controlplane $ cat /proc/sys/fs/inotify/max_user_watches
524288
You can adjust the inotify kernel parameters flexibly according to your own needs, for example:
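A common way to raise the limit immediately and persist it across reboots (524288 here simply mirrors the killercoda default; pick a value that fits your nodes):
sysctl -w fs.inotify.max_user_watches=524288
echo "fs.inotify.max_user_watches=524288" >> /etc/sysctl.conf
sysctl -p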
Summary
The experiment above simulated inotify exhaustion on a node and showed how it breaks the kubelet component. Setting reasonable inotify kernel parameters is very important for the stable operation of both the system and the K8S cluster. If you find any problems while reading, or see anything that could be improved, feel free to leave a comment, or follow my WeChat official account 运维小猪. Thank you!