Pod Anomalies
The cluster is deployed across four VMs on a Windows host. After Windows restarted, the Pods showed the following anomalies.
masha@node-0:~$ kubectl get pod
NAME READY STATUS RESTARTS AGE
gateway-75f5796c4c-kxcdf 0/1 ImagePullBackOff 9 (23h ago) 26d
mysql-pod-66f9d485fc-66fj8 1/1 Running 1 (23h ago) 24h
mysql-pod-66f9d485fc-cddtp 0/1 Error 1 2d6h
mysql-pod-66f9d485fc-pxrkz 0/1 Completed 0 24h
ImagePullBackOff
Literally, this means Kubernetes failed to pull the image. Run kubectl describe pod to inspect the Pod's events.
masha@node-0:~/projects/deployment$ kubectl describe pod gateway-75f5796c4c-kxcdf
Name: gateway-75f5796c4c-kxcdf
......
Node: node-1/192.168.50.91
......
Containers:
gateway:
......
Image: hjmasha/gateway:1.2
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning NodeNotReady 23h node-controller Node is not ready
Normal SandboxChanged 44m kubelet Pod sandbox changed, it will be killed and re-created.
Warning Failed 43m kubelet Failed to pull image "hjmasha/gateway:1.2": Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Failed 42m kubelet Failed to pull image "hjmasha/gateway:1.2": Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp 31.13.94.10:443: i/o timeout
Normal Pulling 40m (x5 over 44m) kubelet Pulling image "hjmasha/gateway:1.2"
Warning Failed 39m (x3 over 43m) kubelet Failed to pull image "hjmasha/gateway:1.2": Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 39m (x5 over 43m) kubelet Error: ErrImagePull
Normal BackOff 4m29s (x161 over 43m) kubelet Back-off pulling image "hjmasha/gateway:1.2"
Warning Failed 3m48s (x164 over 43m) kubelet Error: ImagePullBackOff
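The `BackOff ... (x161 over 43m)` counts in the events reflect the kubelet's image-pull back-off: after each failed pull it waits before retrying, doubling the delay up to a five-minute cap (the 300-second cap is documented behavior; the 10-second starting delay here is a typical value used for illustration). A quick sketch of the delay sequence:

```shell
# Sketch of the kubelet's image-pull back-off: the delay doubles after
# each failed pull, starting from a short initial delay (10s assumed
# here) and capped at 300s (5 minutes).
backoff=10
for attempt in 1 2 3 4 5 6 7; do
  echo "attempt ${attempt}: wait ${backoff}s before retrying"
  backoff=$(( backoff * 2 ))
  if [ "$backoff" -gt 300 ]; then backoff=300; fi
done
```

Once a pull finally succeeds, the back-off resets; until then the Pod sits in ImagePullBackOff between attempts.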
So Kubernetes cannot reach registry-1.docker.io/v2/. But the cluster was healthy before the restart, so the container image should still exist on that node. Log in to node-1 and check whether the image is there.
masha@node-1:~$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
hjmasha/gateway 1.2 1bf25ab819ac 3 weeks ago 24.8MB
......
The image does exist on node-1, yet Kubernetes still tries to pull it. Looking at the deployment behind gateway-75f5796c4c-kxcdf, the image pull policy is Always.
masha@node-0:~/projects/deployment$ kubectl get deploy gateway -o yaml
apiVersion: apps/v1
kind: Deployment
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
spec:
containers:
- image: hjmasha/gateway:1.2
imagePullPolicy: Always
.....
The image pull policies described in Images | Kubernetes are:
IfNotPresent: the image is pulled only if it is not already present locally.
Always: every time the kubelet launches a container, it queries the container image registry to resolve the name to an image digest. If the kubelet has a container image with that exact digest cached locally, it uses the cached image; otherwise, the kubelet pulls the image with the resolved digest and uses it to start the container.
Never: the kubelet does not try to fetch the image. If the image is somehow already present locally, the kubelet attempts to start the container; otherwise, startup fails.
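Since the image already sits on node-1, IfNotPresent is the policy we want. A minimal sketch of the only change needed in the deployment spec (field names as in the output above):

```yaml
# Sketch: switch the gateway deployment's pull policy
spec:
  template:
    spec:
      containers:
      - image: hjmasha/gateway:1.2
        imagePullPolicy: IfNotPresent   # was: Always
```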
Change the image pull policy to IfNotPresent and see what happens.
masha@node-0:~$ kubectl edit deploy gateway
deployment.apps/gateway edited
masha@node-0:~$ kubectl get pod
NAME READY STATUS RESTARTS AGE
gateway-6cdd57756f-s94r4 0/1 ErrImagePull 0 27s
gateway-75f5796c4c-kxcdf 0/1 ImagePullBackOff 9 (23h ago) 27d
masha@node-0:~$ kubectl get pod
NAME READY STATUS RESTARTS AGE
gateway-6cdd57756f-s94r4 0/1 ImagePullBackOff 0 28s
gateway-75f5796c4c-kxcdf 0/1 ImagePullBackOff 9 (23h ago) 27d
Still failing. Going back over the deployment, the Pod update strategy is RollingUpdate. According to Deployments | Kubernetes, when a Deployment is updated, a new ReplicaSet is created; any existing ReplicaSet whose Pods match the new Deployment's .spec.selector but whose .spec.template does not match is scaled down to 0 replicas (the ReplicaSet itself is not deleted).
Because the network recovered during the investigation, gateway was successfully scheduled onto node-3.
masha@node-0:~$ kubectl get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gateway-6cdd57756f-s94r4 1/1 Running 0 19m 10.0.3.14 node-3 <none> <none>
mysql-pod-66f9d485fc-66fj8 1/1 Running 1 (24h ago) 24h 10.0.2.12 node-2 <none> <none>
mysql-pod-66f9d485fc-cddtp 0/1 Error 1 2d7h <none> node-1 <none> <none>
mysql-pod-66f9d485fc-pxrkz 0/1 Completed 0 24h <none> node-3 <none> <none>
Tracing back through the troubleshooting, the following points deserve further study:
- Why did two Pods sit in ImagePullBackOff at the same time before the network recovered?
- If the image exists on only some of the nodes, and the deployment is deleted and then recreated, will Kubernetes automatically schedule the Pod onto a node that already has the target image?
- When a Pod fails on its current node, under what circumstances will Kubernetes schedule it onto another node?
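On the second question: the default kube-scheduler does include an ImageLocality scoring plugin that favors nodes already holding the Pod's images, but it is only a soft preference, not a guarantee. To force the Pod onto nodes known to have the image, a hard constraint such as a nodeSelector works; a hypothetical sketch (the `has-gateway-image` label is an assumption, not something present in the cluster above):

```yaml
# Hypothetical sketch: pin the Pod to manually labeled nodes, e.g. after
#   kubectl label node node-1 has-gateway-image=true
spec:
  template:
    spec:
      nodeSelector:
        has-gateway-image: "true"
      containers:
      - image: hjmasha/gateway:1.2
```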
Error and Completed
kubectl logs returns nothing, the Events shown by kubectl describe pod are also empty, and the containers on the corresponding node have already been reclaimed, so no relevant logs can be recovered. QAQ