ACK pods restarting continuously because of failed health checks


Background

image.png

Troubleshooting

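  • Is the service in the pod starting normally?
    • It keeps crashing.
      • The only way to probe it was from another pod in the same cluster that has tools such as curl. The result was sometimes OK and sometimes an error. This was later confirmed to be a side effect of the repeated restarts, which made the curl results unstable.
    sh-4.2# curl http://172.19.139.254:8084/healthz
    sh-4.2# curl http://172.19.166.100:8082/healthz
    
  • Remove the health check logic
    • Not fully verified, because the root cause was found first.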
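
Since a single curl gives a misleading picture while the pod is flapping, a small loop makes the pattern easier to see. The following is only a sketch: the function name `probe_health`, the 2-second timeout, and the output format are illustrative assumptions, not part of the original investigation.

```shell
#!/bin/sh
# probe_health URL COUNT [INTERVAL_SECONDS]
# Hypothetical helper: prints one line per attempt with the HTTP status code.
# curl reports 000 when no HTTP response was received (refused, timed out).
probe_health() {
  url=$1
  count=$2
  interval=${3:-1}
  i=1
  while [ "$i" -le "$count" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code,
    # --max-time: bound each attempt so a hung pod does not stall the loop.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url" || true)
    echo "$(date -u +%H:%M:%S) attempt=$i status=${code:-000}"
    i=$((i + 1))
    if [ "$i" -le "$count" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

Run from a pod in the same cluster, for example `probe_health http://172.19.139.254:8084/healthz 30 2`. Alternating 200 and 000 lines would match the unstable results described above.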

Root cause

image.png

Solution

Option 1: add an ipBlock (proved ineffective)

➜ helm git:(master) kubectl exec -ti argocd-redis-d486999b7-w7h96 --namespace=argocd -- /bin/sh

➜ helm git:(master) kubectl exec -ti centos-c68f668d8-lpldd -- /bin/sh

kubectl get networkpolicy argocd-repo-server-network-policy -nargocd -oyaml

➜ helm git:(master) kubectl logs -p argocd-application-controller-0 -n argocd


time="2022-05-13T15:04:58Z" level=info msg="Processing all cluster shards"

time="2022-05-13T15:04:58Z" level=info msg="appResyncPeriod=3m0s"

time="2022-05-13T15:04:58Z" level=info msg="Application Controller (version: v2.2.8+93d588c, built: 2022-03-23T00:27:32Z) starting (namespace: argocd)"

time="2022-05-13T15:04:58Z" level=info msg="Starting configmap/secret informers"

time="2022-05-13T15:04:58Z" level=info msg="Configmap/secret informer synced"

time="2022-05-13T15:04:58Z" level=info msg="Ignore status for CustomResourceDefinitions"

time="2022-05-13T15:04:58Z" level=info msg="Ignore '/spec/preserveUnknownFields' for CustomResourceDefinitions"

time="2022-05-13T15:04:58Z" level=info msg="0xc000bb7a40 subscribed to settings updates"

time="2022-05-13T15:04:58Z" level=info msg="Starting secretInformer forcluster"


➜ helm git:(master) kubectl logs -p argocd-repo-server-7f944f76bf-47mc8 -n argocd

time="2022-05-13T15:02:57Z" level=info msg="Generating self-signed gRPC TLS certificate for this session"

time="2022-05-13T15:02:57Z" level=info msg="Initializing GnuPG keyring at /app/config/gpg/keys"

time="2022-05-13T15:02:57Z" level=info msg="gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe826707274" dir= execID=6ca8f

time="2022-05-13T15:02:58Z" level=info msg=Trace args="[gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe826707274]" dir= operation_name="exec gpg" time_ms=231.27132699999999

time="2022-05-13T15:02:58Z" level=info msg="Populating GnuPG keyring with keys from /app/config/gpg/source"

time="2022-05-13T15:02:58Z" level=info msg="gpg --no-permission-warning --list-public-keys" dir= execID=2bf8f

time="2022-05-13T15:02:58Z" level=info msg=Trace args="[gpg --no-permission-warning --list-public-keys]" dir= operation_name="exec gpg" time_ms=3.854837

time="2022-05-13T15:02:58Z" level=info msg="gpg --no-permission-warning -a --export 479D0430FE5CC66C" dir= execID=04aa7

time="2022-05-13T15:02:58Z" level=info msg=Trace args="[gpg --no-permission-warning -a --export 479D0430FE5CC66C]" dir= operation_name="exec gpg" time_ms=2.7001579999999996

time="2022-05-13T15:02:58Z" level=info msg="gpg-wrapper.sh --no-permission-warning --list-secret-keys 479D0430FE5CC66C" dir= execID=61d57

time="2022-05-13T15:02:58Z" level=info msg=Trace args="[gpg-wrapper.sh --no-permission-warning --list-secret-keys 479D0430FE5CC66C]" dir= operation_name="exec gpg-wrapper.sh" time_ms=4.097624000000001

time="2022-05-13T15:02:58Z" level=info msg="Loaded 0 (and removed 0) keys from keyring"

time="2022-05-13T15:02:58Z" level=info msg="argocd-repo-server v2.2.8+93d588c serving on [::]:8081"

time="2022-05-13T15:02:58Z" level=info msg="Starting GPG sync watcher on directory '/app/config/gpg/source'"
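
The previous-container logs above contain only info-level startup lines, which is consistent with the containers being restarted after failed probes rather than exiting on their own error. One way to check that reading is to look at restart counts and probe-failure events. This is a sketch under assumptions: the function name `show_probe_failures` is hypothetical, and it relies on the kubelet recording failed probes as events with reason `Unhealthy`.

```shell
#!/bin/sh
# show_probe_failures [NAMESPACE]
# Hypothetical helper: lists restart counts per pod, then probe-failure events.
show_probe_failures() {
  ns=${1:-argocd}
  # Restart count of every container in each pod of the namespace.
  kubectl get pods -n "$ns" \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount'
  # Events emitted when a liveness or readiness probe fails.
  kubectl get events -n "$ns" --field-selector reason=Unhealthy --sort-by=.lastTimestamp
}
```

Calling `show_probe_failures argocd` prints both views for the namespace used in this post.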

kubectl edit networkpolicy argocd-repo-server-network-policy -nargocd

Edit the policy, add the following entry, and try again:

  • ipBlock: cidr: 172.19.0.0/16
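
For readers unfamiliar with where an `ipBlock` entry sits inside a NetworkPolicy, here is a sketch that prints a standalone manifest. The policy name `allow-pod-cidr-example`, the `podSelector` label, and the function name `ipblock_policy` are assumptions for illustration; the actual change in this post was made by editing the existing `argocd-repo-server-network-policy`, whose full spec is not shown.

```shell
#!/bin/sh
# ipblock_policy: hypothetical helper that prints an example manifest to stdout.
ipblock_policy() {
  cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pod-cidr-example        # hypothetical name
  namespace: argocd
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: argocd-repo-server   # assumed label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 172.19.0.0/16       # the CIDR tried in this post
EOF
}
```

The output could be piped to `kubectl apply -f -`. As noted in this post, adding the ipBlock did not fix the restarts in this cluster.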

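This option did not work.

The next step was to disable NetworkPolicy outright.

Option 2: to use IPvlan, NetworkPolicy must be disabled (effective; this was the root cause)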

image.png

  • Change the setting to true

  • Then restart the terway-eniip pods

kubectl delete -n kube-system pod -l app=terway-eniip

➜ helm git:(master) kubectl delete -n kube-system pod -l app=terway-eniip
pod "terway-eniip-62zzt" deleted
pod "terway-eniip-kjtt7" deleted
pod "terway-eniip-m9gfn" deleted
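
After deleting the pods, it can help to confirm that the setting took effect and that the replacement pods are ready. The sketch below rests on assumptions that the screenshot in this post does not confirm in text: that the setting is the `disable_network_policy` key of the `eni-config` ConfigMap in `kube-system`, and that the pods belong to a DaemonSet named `terway-eniip`. Both function names are hypothetical.

```shell
#!/bin/sh
# Prints the current value of the (assumed) Terway switch.
terway_policy_disabled() {
  kubectl -n kube-system get configmap eni-config \
    -o jsonpath='{.data.disable_network_policy}'
}

# Deletes the Terway pods and waits until the DaemonSet reports ready again.
restart_terway() {
  kubectl -n kube-system delete pod -l app=terway-eniip
  kubectl -n kube-system rollout status daemonset/terway-eniip --timeout=120s
}
```

If the assumptions hold, `terway_policy_disabled` should print `true` before `restart_terway` is run.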

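About NetworkPolicy

This has pros and cons. Few people use NetworkPolicy in practice; we will also pass the feedback along to see whether the IPvlan health check scenario can be optimized.

If IPvlan is selected, NetworkPolicy can be left unchecked. Without IPvlan, NetworkPolicy can be used.

NetworkPolicy support can be turned on or off later; IPvlan cannot.

With IPvlan, either turn NetworkPolicy off or simply do not create any NetworkPolicy resources.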
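
To see whether a cluster currently has any NetworkPolicy resources at all, a count across namespaces is a quick check. This is a minimal sketch; the function name `count_network_policies` is hypothetical.

```shell
#!/bin/sh
# Prints the number of NetworkPolicy objects across all namespaces.
# When none exist, kubectl writes its notice to stderr, so the count is 0.
count_network_policies() {
  kubectl get networkpolicy --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' '
}
```

A result of `0` means no policies exist, which corresponds to the "do not create any NetworkPolicy resources" choice above.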