Kubernetes troubleshooting: handling pod creation failures with "Error: context deadline exceeded"
Symptoms
Describing the failing pod shows "Failed create pod sandbox":
$ kb describe po -n flink-test flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9
……
Warning FailedCreatePodSandBox 20m kubelet, 10.201.3.132 Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9": operation timeout: context deadline exceeded
Warning Failed 15m (x2 over 17m) kubelet, 10.201.3.132 Error: context deadline exceeded
On the node, the kubelet log shows createPodSandbox failing two minutes after the sandbox creation starts:
I1110 10:54:58.819051 455815 kuberuntime_manager.go:397] No sandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)" can be found. Need to start a new one
I1110 10:54:58.819113 455815 kuberuntime_manager.go:599] SyncPod received new pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)", will create a sandbox for it
I1110 10:54:58.819122 455815 kuberuntime_manager.go:608] Stopping PodSandbox for "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)", will start new one
I1110 10:54:58.819140 455815 kuberuntime_manager.go:660] Creating sandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)"
E1110 10:56:58.821594 455815 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9": operation timeout: context deadline exceeded
E1110 10:56:58.821617 455815 kuberuntime_manager.go:666] createPodSandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9": operation timeout: context deadline exceeded
E1110 10:56:58.821699 455815 pod_workers.go:190] Error syncing pod 82944f57-7f74-11ee-94d4-50af732e63b7 ("flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)"), skipping: failed to "CreatePodSandbox" for "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)" with CreatePodSandboxError: "CreatePodSandbox for pod \"flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9_flink-test(82944f57-7f74-11ee-94d4-50af732e63b7)\" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod \"flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9\": operation timeout: context deadline exceeded"
I1110 10:56:58.821736 455815 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"flink-test", Name:"flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9", UID:"82944f57-7f74-11ee-94d4-50af732e63b7", APIVersion:"v1", ResourceVersion:"1387738689", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "flink-340b9ac516be491cbef1698feb72cfbd-master-58857fcc46-2khl9": operation timeout: context deadline exceeded
Monitoring for the node showed disk utilization pinned at 100% and very high iowait.
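If no monitoring dashboard is at hand, the iowait share can be estimated directly on the node. The following is a minimal sketch, not the method used in this incident: it assumes the standard Linux /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal after the "cpu" label), and iowait_pct is a hypothetical helper name. Per-device utilization is usually read from `iostat -x` (sysstat) instead.

```shell
#!/usr/bin/env bash
# iowait_pct PREV CUR
# PREV and CUR are two samples of the aggregate "cpu ..." line from /proc/stat.
# Prints the percentage of CPU time spent in iowait between the two samples.
iowait_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, p, " "); split(b, c, " ")
    # Fields 2..9: user nice system idle iowait irq softirq steal
    # (guest time is already counted inside user, so it is skipped).
    for (i = 2; i <= 9; i++) total += c[i] - p[i]
    if (total <= 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * (c[6] - p[6]) / total
  }'
}

# Example on the affected node (one-second sampling window is an arbitrary choice):
#   prev=$(grep '^cpu ' /proc/stat); sleep 1; cur=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$prev" "$cur"
```

A value that stays high over several samples, together with a saturated disk, points at the same I/O bottleneck the monitoring showed.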
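Root cause
Business pods on the node were doing heavy disk reads and writes, which saturated disk I/O and caused kubelet to time out while creating the pod sandbox.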
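The timeout itself can be read off the kubelet log: the gap between the "Creating sandbox" line and the first CreatePodSandbox error is almost exactly two minutes, which is consistent with kubelet's default --runtime-request-timeout of 2m, assuming this node does not override it. A small sketch for computing that gap from two klog timestamps (klog_elapsed is a hypothetical helper, and both timestamps are assumed to fall on the same day):

```shell
#!/usr/bin/env bash
# klog_elapsed START END
# START and END are klog time fields in HH:MM:SS.ffffff form.
# Prints the elapsed time in seconds with millisecond precision.
klog_elapsed() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    split(s, a, ":"); split(e, b, ":")
    printf "%.3f\n", (b[1] * 3600 + b[2] * 60 + b[3]) - (a[1] * 3600 + a[2] * 60 + a[3])
  }'
}

# Timestamps taken from the kubelet log above:
#   klog_elapsed 10:54:58.819140 10:56:58.821594
```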
Finding the offending pod
$ sudo iotop -P
Total DISK READ: 0.00 B/s | Total DISK WRITE: 12.70 M/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 91.51 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
2575957 ?dif root 0.00 B/s 11.71 M/s 0.00 % 0.01 % java -classpath /opt/flink/lib/flink-cep_2.11-1.10.ATO~d342-46dd-86fa-4067b7ed7954 -Djobmanager.rpc.port=6123
2678081 be/4 root 0.00 B/s 1767.95 B/s 0.00 % 0.00 % java -Xmx20m -XX:+UnlockExperimentalVMOptions -XX:+Use~file /opt/flume/conf/flume-conf-0.properties --name a1
2541863 be/4 root 0.00 B/s 1767.95 B/s 0.00 % 0.00 % java -Xmx20m -XX:+UnlockExperimentalVMOptions -XX:+Use~file /opt/flume/conf/flume-conf-0.properties --name a1
2677687 be/4 root 0.00 B/s 138.12 K/s 0.00 % 0.00 % java -classpath /opt/flink/lib/flink-cep_2.11-1.10.ATO~71b7-4f5d-a6bf-289bde4490cb -Djobmanager.rpc.port=6123
2677642 be/4 root 0.00 B/s 189.92 K/s 0.00 % 0.00 % java -classpath /opt/flink/lib/flink-cep_2.11-1.10.ATO~e04a-4ee9-a056-2102bbd01af4 -Djobmanager.rpc.port=6123
1879470 be/4 root 0.00 B/s 1767.95 B/s 0.00 % 0.00 % java -Xms20m -Xmx20m -DpropertiesImplementation=org.ap~file /opt/flume/conf/flume-conf-0.properties --name a1
2277790 be/4 root 0.00 B/s 1767.95 B/s 0.00 % 0.00 % java -Xms20m -Xmx20m -DpropertiesImplementation=org.ap~file /opt/flume/conf/flume-conf-1.properties --name a1
2780359 be/4 root 0.00 B/s 1767.95 B/s 0.00 % 0.00 % java -Xmx20m -XX:+UnlockExperimentalVMOptions -XX:+Use~file /opt/flume/conf/flume-conf-0.properties --name a1
2676803 be/4 root 0.00 B/s 1767.95 B/s 0.00 % 0.00 % java -Xmx20m -XX:+UnlockExperimentalVMOptions -XX:+Use~file /opt/flume/conf/flume-conf-0.properties --name a1
# For example, trace PID 2575957 (?dif root 0.00 B/s 11.71 M/s) back to its container
$ sudo pstree -p |grep 2575957
| |-containerd-shim(2575933)-+-java(2575957)-+-java(2575990)
$ sudo ps -elf |grep 2575933
4 S root 2575933 301062 0 80 0 - 26924 - 10:43 ? 00:00:00 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/3e641f1288c308941b91a27c12b4f87f1359de90d0c36078b252df2c3333dbac -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
$ sudo docker ps |grep 3e641f12
3e641f1288c3 4a103abd2b9e "/opt/flink/bin/kube…" 5 hours ago Up 5 hours k8s_flink-task-manager_flink-f98b144330e54d7fb557c8409a7df1a3-taskmanager-1-6_flink_df5efb46-7f72-11ee-94d4-50af732e63b7_0
$ sudo docker ps -q | xargs sudo docker inspect --format '{{.State.Pid}}, {{.Id}}, {{.Name}}, {{.GraphDriver.Data.WorkDir}}' | grep 3e641f12
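The pstree and ps steps above can often be shortened by reading the process's cgroup membership, since the container ID normally appears in the cgroup path. This is a sketch under that assumption: the exact path layout depends on the cgroup driver, and container_id_from_cgroup is a hypothetical helper that simply picks the first 64-character hex string it sees.

```shell
#!/usr/bin/env bash
# container_id_from_cgroup
# Reads the contents of /proc/<pid>/cgroup on stdin and prints the first
# 64-character lowercase hex string, which is assumed to be the container ID.
container_id_from_cgroup() {
  grep -oE '[0-9a-f]{64}' | head -n 1
}

# Example on the node, using the PID reported by iotop:
#   sudo cat /proc/2575957/cgroup | container_id_from_cgroup
```

The first 12 characters of the result are what `docker ps` shows in its CONTAINER ID column.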
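# Then use the container ID to find the pod
# kubectl describe no lists the pods on the node; save each namespace and pod name to a file named 1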
$ sudo kubectl describe no xxxx | awk '/Non-terminated Pods:/,/Allocated resources:/ {if ($0 !~ /(Non-terminated Pods:|Allocated resources:)/) print}' | egrep -v '\-\-\-\-|CPU Requests' |awk '{print $1 " " $2}' > 1
#!/bin/bash
# For each "namespace pod" pair in file 1, print the pair if the pod's YAML mentions the container ID.
container_id="3e641f1288c3"
while read -r ns pod; do
  sudo kubectl get pod "$pod" -n "$ns" -o yaml | grep -q "$container_id" && echo "$ns $pod"
done < 1
# This prints the matching pod
flink flink-f98b144330e54d7fb557c8409a7df1a3-taskmanager-1-6
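When the node runs Docker through dockershim, as it appears to here, the container name printed by docker ps already encodes the pod: the name follows the pattern k8s_<container>_<pod>_<namespace>_<pod UID>_<attempt>. The sketch below parses that name instead of querying every pod on the node; pod_from_container_name is a hypothetical helper, and the approach assumes that naming scheme holds on your cluster.

```shell
#!/usr/bin/env bash
# pod_from_container_name NAME
# NAME is a Docker container name in dockershim's
# k8s_<container>_<pod>_<namespace>_<pod UID>_<attempt> format.
# Prints "<namespace> <pod>", or returns 1 if the name does not match.
pod_from_container_name() {
  local name="${1#/}"   # docker inspect prefixes names with "/"
  local IFS=_           # split on underscores; pod and namespace names cannot contain them
  local -a parts
  read -r -a parts <<< "$name"
  if [[ "${parts[0]}" != "k8s" || "${#parts[@]}" -lt 6 ]]; then
    return 1
  fi
  echo "${parts[3]} ${parts[2]}"
}

# Example with the container found above:
#   pod_from_container_name "$(sudo docker inspect --format '{{.Name}}' 3e641f1288c3)"
```

The same information is normally also available in the container labels io.kubernetes.pod.namespace and io.kubernetes.pod.name, which docker inspect can print.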
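Summary
"Error: context deadline exceeded" during pod creation usually points to a system-level problem on the node, such as disk I/O, CPU, or memory pressure. Also check that the network and the CNI plugin are healthy.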