Incident Log


High CPU load -> failed NFS service

node_load1{job="node-exporter"} / count without (cpu, mode) (node_cpu_seconds_total{job="node-exporter",mode="idle"}) > 2
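The expression above divides the 1-minute load average by the number of CPUs (the count of idle-mode series per instance) and matches when the ratio exceeds 2. The same ratio can be checked by hand on a host. The following is a minimal sketch, not part of the original alert setup; the function names load_per_core and is_overloaded are made up for illustration.

```shell
# load_per_core LOAD1 CORES
# Prints the load average divided by the core count, with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# is_overloaded LOAD1 CORES [THRESHOLD]
# Succeeds (exit 0) when load per core is strictly above the threshold.
# The default threshold of 2 mirrors the "> 2" in the expression above.
is_overloaded() {
    awk -v load="$1" -v cores="$2" -v limit="${3:-2}" \
        'BEGIN { exit !(load / cores > limit) }'
}

# Usage on a Linux host (assumes /proc/loadavg and nproc are available):
#   is_overloaded "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "load per core above 2"
```

Taking the inputs as arguments keeps the arithmetic separate from how the values are collected, so the check can also be fed numbers copied from a dashboard.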

One machine's load average reached 1780. Output of top on that machine:

top - 15:25:50 up 279 days, 3:04, 1 user, load average: 1970.90, 1970.54, 1970.48 

Tasks: 315 total, 3 running, 311 sleeping, 0 stopped, 1 zombie  
%Cpu(s): 3.6 us, 3.2 sy, 0.0 ni, 92.2 id, 0.2 wa, 0.0 hi, 0.8 si, 0.0 st  
MiB Mem : 24106.3 total, 3606.8 free, 8432.3 used, 12067.2 buff/cache  
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 15250.4 avail Mem  
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND  
1135 root 20 0 2369044 214764 31712 S 5.0 0.9 24442:55 kubelet  
825 root 20 0 2850752 123608 46920 S 3.7 0.5 13114:33 dockerd  
327 root 20 0 152652 97680 76044 S 3.3 0.4 8721:31 systemd-journal  
695235 root 20 0 3879716 889988 23676 S 1.7 3.6 1542:18 java  
3476640 root 20 0 1994028 50512 17556 S 1.7 0.2 62:43.48 CloudGuardClien  
5253 systemd+ 20 0 1022972 137336 14380 S 1.3 0.6 653:20.41 nginx-ingress-c  
315077 nobody 20 0 1241692 21780 11936 D 1.3 0.1 0:00.60 node_exporter  
4030932 root 20 0 2618652 504732 9688 S 1.3 2.0 220:50.09 ruby2.7  
734486 root 20 0 5396568 388748 142208 S 0.7 1.6 2060:27 deepflow-agent  
1112731 root 20 0 3870508 458436 16852 S 0.7 1.9 401:36.67 java  
2416532 root 20 0 13.6g 2.3g 23980 S 0.7 9.6 215:05.13 rps-consumer  
2870992 root 20 0 696956 110676 15524 S 0.7 0.4 5031:57 vector  
33 root 20 0 0 0 0 S 0.3 0.0 574:10.59 kauditd  
684 root 20 0 1655420 45980 13092 S 0.3 0.2 2283:47 containerd  
1061 root 20 0 719848 7768 1108 S 0.3 0.0 20:20.54 containerd-shim  
320797 root 20 0 11628 4012 3144 R 0.3 0.0 0:00.42 top  
582123 zabbix 20 0 18.6g 32888 15196 S 0.3 0.1 84:18.85 zabbix_agent2  
1123300 root 20 0 4359460 364100 16740 S 0.3 1.5 372:46.33 java  
2213571 root 16 -4 13016 1604 1236 S 0.3 0.0 1054:11 auditd  
1 root 20 0 172168 10696 6836 S 0.0 0.0 1474:57 systemd

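The load average is very high even though the CPUs are about 92% idle, which points to processes stuck in uninterruptible sleep (state D) rather than real CPU work. Use ps -eo to check for hung processes: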

root@xxx-xxx-k8s-p6:~# ps -eo pid,stat,comm,wchan:20 | grep D  
PID STAT COMMAND WCHAN  
315077 Dsl node_exporter -  
385696 D df -  
432830 D df -  
471464 D df -  
1121995 D df -  
1370503 D df -  
1377263 D 172.22.1.229-ma -  
2257684 D df -  
2270450 DN updatedb.mlocat -  
2607809 D df -  
4152393 D df -
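Note that grep D matches an uppercase D anywhere in the line, which is also why the header row (it contains "PID") shows up. A slightly stricter variant filters on the STAT column only. This is a sketch under the assumption that the input has the column layout shown above; d_state_only and count_d_by_command are illustrative names, not existing tools.

```shell
# d_state_only
# Reads `ps -eo pid,stat,comm,wchan:20` output on stdin and keeps the header
# plus the rows whose STAT column starts with D (uninterruptible sleep).
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# count_d_by_command
# Same input; prints "count command" for D-state rows, highest count first.
count_d_by_command() {
    awk 'NR > 1 && $2 ~ /^D/ { n[$3]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Usage:
#   ps -eo pid,stat,comm,wchan:20 | d_state_only
#   ps -eo pid,stat,comm,wchan:20 | count_d_by_command
```

Grouping by command makes a pile-up such as the repeated df entries above stand out at a glance.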

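Many df commands were stuck, along with one process whose name contains an IP address. systemctl then showed the NFS service in the failed state. After checking with the infrastructure team, it turned out that the NFS server at this IP had been decommissioned. Unmounting the stale mount with umount resolved the problem.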
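To find which mount points still refer to a decommissioned server, the mount table can be filtered by filesystem type and server address. The sketch below assumes the /proc/mounts format (device, mount point, type, ...) with NFS devices written as "server:/export"; nfs_mounts_for_server is a hypothetical helper name, and the mount point in the usage comment is a placeholder because the original notes do not record the real path.

```shell
# nfs_mounts_for_server SERVER
# Reads /proc/mounts-style lines on stdin and prints the mount points of
# nfs/nfs4 filesystems whose device is exported by SERVER.
nfs_mounts_for_server() {
    awk -v server="$1" '
        $3 ~ /^nfs4?$/ {
            split($1, parts, ":")      # device looks like "server:/export"
            if (parts[1] == server) print $2
        }'
}

# Usage (reading the mount table does not touch the dead server):
#   nfs_mounts_for_server 172.22.1.229 < /proc/mounts
#   umount -f -l /path/to/mountpoint   # placeholder path; -f forces, -l detaches lazily
```

The original notes only say that umount was used. The -f and -l flags are an assumption here; they are commonly needed when the NFS server no longer answers, since a plain umount can hang in the same way df did.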

Network unreachable

route -n
traceroute www.baidu.com
route print
tracert www.google.com

Compare the traceroute output (Linux) with the tracert output (Windows) to see at which hop the path fails.
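When the traces are long, it can help to extract the last hop that still answered and compare that number between the two sides. A minimal sketch, assuming the usual Linux traceroute layout where each hop line starts with the hop number and an unanswered hop prints "* * *"; last_replying_hop is an illustrative name.

```shell
# last_replying_hop
# Reads traceroute output on stdin and prints the number of the last hop
# that returned at least one reply (i.e. is not "* * *").
last_replying_hop() {
    awk '$1 ~ /^[0-9]+$/ && !($2 == "*" && $3 == "*" && $4 == "*") { hop = $1 }
         END { if (hop != "") print hop }'
}

# Usage:
#   traceroute www.baidu.com | last_replying_hop
```

A silent hop in the middle of a trace is not necessarily a failure, because some routers simply do not answer probes. The useful signal is the point after which no hop replies at all.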
