Chaos Mesh in Practice

Installation

Environment

  • centos:7.9
  • docker:20.10.9
  • kubernetes:v1.20.4
  • helm:v3.2.1

Install with Helm

Step 1: Add the Chaos Mesh repository

Add the Chaos Mesh chart repository to Helm:

helm repo add chaos-mesh https://charts.chaos-mesh.org

Step 2: Create the namespace for Chaos Mesh

Installing Chaos Mesh in the chaos-testing namespace is recommended, but any namespace can be used:

kubectl create ns chaos-testing

Step 3: Install on a Docker container runtime

helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing
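The three steps above can be collected into one small script. This is only a convenience sketch: the run helper and the DRY_RUN variable are hypothetical names introduced here so the commands can be previewed before they touch a cluster. The commands themselves are the ones shown in this article.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# install_chaos_mesh: the three installation steps from this article, in order.
# The chain stops at the first command that fails.
install_chaos_mesh() {
  run helm repo add chaos-mesh https://charts.chaos-mesh.org &&
  run kubectl create ns chaos-testing &&
  run helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing
}
```

Calling DRY_RUN=1 install_chaos_mesh prints the three commands without running them. Calling install_chaos_mesh on its own runs them against the current kubectl context.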

Verify the installation

Check the running status

To check whether Chaos Mesh is running, execute the following command:

kubectl get po -n chaos-testing

The expected output is:

NAME                                        READY   STATUS    RESTARTS   AGE
chaos-controller-manager-5cd8dc646c-qvrwd   1/1     Running   0          103s
chaos-daemon-75p56                          1/1     Running   0          103s
chaos-daemon-gglmj                          1/1     Running   0          103s
chaos-daemon-pm6nq                          1/1     Running   0          103s
chaos-daemon-z6cfk                          1/1     Running   0          104s
chaos-dashboard-649585686-5rshc             1/1     Running   0          103s

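If your output matches the expected output, Chaos Mesh has been installed successfully.

If the STATUS of any Pod is not Running, run kubectl describe po -n chaos-testing <pod-name> to view that Pod's details, then troubleshoot according to the error messages.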
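To spot unhealthy Pods quickly, the table printed by kubectl get po can be filtered on its STATUS column. The function name not_running below is a hypothetical helper, and the sketch assumes the default five-column table layout shown above.

```shell
# not_running: read `kubectl get po` table output on stdin and print the
# name and status of every Pod whose STATUS column is not "Running".
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Usage against a live cluster (assumes kubectl is already configured):
#   kubectl get po -n chaos-testing | not_running
```

No output means that every listed Pod is Running.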

Check the dashboard

[root@m1 ~]# kubectl  get svc -n chaos-testing chaos-dashboard 
NAME              TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
chaos-dashboard   NodePort   10.233.40.47   <none>        2333:31519/TCP   5m15s

Access the NodePort

The NodePort is 31519, so open masterip:31519 in a browser.
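The NodePort can also be read from the Service row rather than by eye. The sketch below parses the PORT(S) value shown above. The function name node_port is hypothetical, masterip is the same placeholder used in the text, and plain http is an assumption about the dashboard setup.

```shell
# node_port: extract the NodePort from a PORT(S) value such as "2333:31519/TCP".
node_port() {
  ports=${1#*:}                 # drop the service port and the colon
  printf '%s\n' "${ports%%/*}"  # drop the "/TCP" suffix
}

# The Service row from the output above; PORT(S) is the 5th column.
svc_row='chaos-dashboard   NodePort   10.233.40.47   <none>        2333:31519/TCP   5m15s'
port=$(node_port "$(printf '%s\n' "$svc_row" | awk '{ print $5 }')")
echo "http://masterip:${port}"
```

On a live cluster the same value should be available from kubectl get svc -n chaos-testing chaos-dashboard -o jsonpath='{.spec.ports[0].nodePort}'.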

image.png

Generate a token

Click "Click here to generate". Check Cluster scoped, select Manager as the Role, then click COPY to copy the generated YAML and save it as rbac.yaml.

image.png

Apply the YAML

kubectl apply -f rbac.yaml
serviceaccount/account-default-viewer-dqscy created
role.rbac.authorization.k8s.io/role-default-viewer-dqscy created
rolebinding.rbac.authorization.k8s.io/bind-default-viewer-dqscy created

Get the token

kubectl describe -n default secrets account-default-viewer-dqscy

image.png
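Copying the token out of the describe output by hand is error-prone, so a small filter can help. This sketch assumes the usual "token:      <value>" line in the Data section of that output. The function name extract_token is hypothetical.

```shell
# extract_token: read `kubectl describe secrets` output on stdin and print
# the value that follows the "token:" label.
extract_token() {
  awk '$1 == "token:" { print $2 }'
}

# Usage against a live cluster:
#   kubectl describe -n default secrets account-default-viewer-dqscy | extract_token
```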

Enter the token

Enter the Name and Token, then click SUBMIT.

image.png

The page after submitting:

image.png

Experiments

Prepare a test Pod

kubectl  create deployment tomcat --image=tomcat:7
kubectl  get pod 
NAME                     READY   STATUS    RESTARTS   AGE
tomcat-5f7b97cd7-8xx6v   1/1     Running   0          6m18s

Pod faults

POD FAILURE

Create the experiment

Select Pod Fault, then Pod Failure.

image.png

Select the test namespace, fill in Name, choose Run continuously, then click Submit.

image.png

After submission:

image.png
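The same experiment can be kept as a manifest instead of being clicked together in the dashboard. The sketch below writes one to a file. The file name pod-failure.yaml, the experiment name pod-failure-example, and mode: one are assumptions made for illustration. The app: tomcat label is the one kubectl create deployment tomcat assigns. Leaving out duration is intended to correspond to Run continuously.

```shell
# Write a PodChaos manifest for a pod-failure experiment against the tomcat Pod.
cat > pod-failure.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example   # hypothetical experiment name
  namespace: default
spec:
  action: pod-failure
  mode: one                   # assumption: inject into one matching Pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tomcat
EOF
```

The file can then be applied with kubectl apply -f pod-failure.yaml, and kubectl delete -f pod-failure.yaml ends the experiment.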

Verify the fault

View the fault events:

image.png

The Pod status is CrashLoopBackOff:

kubectl  get pod 
NAME                     READY   STATUS             RESTARTS   AGE
tomcat-5f7b97cd7-8xx6v   0/1     CrashLoopBackOff   0          9m43s

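The Pod events show repeated image pull failures. Chaos Mesh injects Pod Failure by replacing the container image with a pause image, and in this environment gcr.io/google-containers/pause:latest cannot be pulled, so the container never starts: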

kubectl  describe pod tomcat-5f7b97cd7-8xx6v
......
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  12m                    default-scheduler  Successfully assigned default/tomcat-5f7b97cd7-8xx6v to fn01
  Normal   Pulled     12m                    kubelet            Container image "tomcat:7" already present on machine
  Normal   Created    12m                    kubelet            Created container tomcat
  Normal   Started    12m                    kubelet            Started container tomcat
  Normal   Killing    5m20s                  kubelet            Container tomcat definition changed, will be restarted
  Normal   BackOff    3m38s (x3 over 5m4s)   kubelet            Back-off pulling image "gcr.io/google-containers/pause:latest"
  Warning  Failed     3m38s (x3 over 5m4s)   kubelet            Error: ImagePullBackOff
  Normal   Pulling    2m35s (x4 over 5m20s)  kubelet            Pulling image "gcr.io/google-containers/pause:latest"
  Warning  Failed     2m10s (x4 over 5m5s)   kubelet            Failed to pull image "gcr.io/google-containers/pause:latest": rpc error: code = Unknown desc = Error response from daemon: Get "https://gcr.io/v2/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Failed     2m10s (x4 over 5m5s)   kubelet            Error: ErrImagePull
  Warning  BackOff    115s (x5 over 3m23s)   kubelet            Back-off restarting failed container

Recover from the fault

image.png

Check the Pod status: it is back to normal, and RESTARTS has increased by 1.

[root@m1 chaos]# kubectl  get pod 
NAME                     READY   STATUS    RESTARTS   AGE
tomcat-5f7b97cd7-8xx6v   1/1     Running   1          21m

POD KILL

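Create the experiment

Select Pod Fault, then Pod Kill.

Select the test namespace, fill in Name, choose Run continuously, then click Submit.

image.png

After submission:

image.png

Verify the fault

Listing the Pods shows that the original Pod has been killed and a new Pod has been created: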

kubectl  get pod 
NAME                     READY   STATUS    RESTARTS   AGE
tomcat-5f7b97cd7-74fj6   1/1     Running   0          2m43s

The ReplicaSet events show that a new Pod was started:

kubectl describe replicasets.apps tomcat-5f7b97cd7
Name:           tomcat-5f7b97cd7
Namespace:      default
......
Events:
  Type    Reason            Age    From                   Message
  ----    ------            ----   ----                   -------
  Normal  SuccessfulCreate  36m    replicaset-controller  Created pod: tomcat-5f7b97cd7-8xx6v
  Normal  SuccessfulCreate  2m47s  replicaset-controller  Created pod: tomcat-5f7b97cd7-74fj6
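The before and after comparison can be scripted by diffing two kubectl get pod listings. This is a minimal sketch using the two listings from this article. The helper names pod_names and new_pods are hypothetical.

```shell
# pod_names: print the NAME column of `kubectl get pod` output, skipping the header.
pod_names() {
  awk 'NR > 1 { print $1 }'
}

# new_pods BEFORE AFTER: print Pod names listed in AFTER but not in BEFORE.
new_pods() {
  printf '%s\n' "$2" | pod_names | while read -r name; do
    printf '%s\n' "$1" | pod_names | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

before='NAME                     READY   STATUS    RESTARTS   AGE
tomcat-5f7b97cd7-8xx6v   1/1     Running   1          21m'
after='NAME                     READY   STATUS    RESTARTS   AGE
tomcat-5f7b97cd7-74fj6   1/1     Running   0          2m43s'

new_pods "$before" "$after"
```

With these two listings the only name printed is the replacement Pod, tomcat-5f7b97cd7-74fj6.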

Container Kill

Create the experiment

Select Pod Fault, then Container Kill, and enter the container names; here the value is tomcat.

Select the test namespace, fill in Name, set Duration to 30s, then click Submit.

image.png
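For reference, a manifest roughly equivalent to the form above might look like the following sketch. The file name container-kill.yaml, the experiment name container-kill-example, and mode: one are assumptions. The containerNames and duration values mirror what was entered in the dashboard.

```shell
# Write a PodChaos manifest that kills the tomcat container inside the tomcat Pod.
cat > container-kill.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-example   # hypothetical experiment name
  namespace: default
spec:
  action: container-kill
  mode: one                      # assumption: inject into one matching Pod
  containerNames:
    - tomcat
  duration: '30s'                # mirrors the Duration entered in the dashboard
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tomcat
EOF
```

It can be applied with kubectl apply -f container-kill.yaml.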

Verify the fault

Check the Pod status: RESTARTS has increased by 1.

kubectl  get pod 
NAME                     READY   STATUS    RESTARTS   AGE
tomcat-5f7b97cd7-74fj6   1/1     Running   1          18m

The Pod events show that after the tomcat container exited, the Pod started a new container:

kubectl  describe pod tomcat-5f7b97cd7-74fj6
Name:         tomcat-5f7b97cd7-74fj6
Namespace:    default
......
Events:
  Type    Reason     Age                From               Message
  ----    ------     ----               ----               -------
  Normal  Scheduled  19m                default-scheduler  Successfully assigned default/tomcat-5f7b97cd7-74fj6 to fn01
  Normal  Pulled     31s (x2 over 19m)  kubelet            Container image "tomcat:7" already present on machine
  Normal  Created    30s (x2 over 19m)  kubelet            Created container tomcat
  Normal  Started    30s (x2 over 19m)  kubelet            Started container tomcat

Network

Bandwidth limiting

Environment preparation

Deploy the test application

kubectl  create deployment networktest --image=zhfangk8s/nginx-test

Check the Pod IP

kubectl  get pod  -o wide 
NAME                           READY   STATUS    RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
networktest-6ccdcf677f-757ph   1/1     Running   0          40m     10.233.105.12   fn03   <none>           <none>
tomcat-5f7b97cd7-74fj6         1/1     Running   1          4h51m   10.233.99.16    fn01   <none>           <none>

Generate test traffic

while true;do curl -O 10.233.105.12/test ;done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1000M  100 1000M    0     0   278M      0  0:00:03  0:00:03 --:--:--  278M

Create the bandwidth limit with YAML

Set the limit to 100mbps:

kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: bandwith
  namespace: default
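  annotations:
    # 'true' creates the experiment in a paused state; set it to 'false' or drop the annotation to start injecting immediately
    experiment.chaos-mesh.org/pause: 'true'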
spec:
  selector:
    namespaces:
      - default
    labelSelectors:
      app: networktest
  mode: one
  action: bandwidth
  bandwidth:
    rate: 100mbps
    limit: 10000000
    buffer: 100000000
  direction: to

Verify the bandwidth limit

The download speed drops from 278M to 11.9M (curl reports bytes per second):

[root@m1 chaos]# while true;do curl -O 10.233.105.12/test ;done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1000M  100 1000M    0     0  12.2M      0  0:01:21  0:01:21 --:--:-- 11.9M

In the Grafana monitoring, traffic drops from 2.47Gb to 100Mb, which matches the configured 100mbps limit.

image.png
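The curl and Grafana figures use different units, so a quick conversion helps when comparing them. The sketch below assumes that curl's M suffix means 1024*1024 bytes per second and that the Grafana panel shows decimal megabits per second. The function name to_mbit is hypothetical.

```shell
# to_mbit: convert a curl speed in MiB/s (1024*1024 bytes per second)
# to decimal megabits per second.
to_mbit() {
  awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / 1000000 }'
}

to_mbit 11.9   # speed after the limit
to_mbit 278    # speed before the limit
```

Under those assumptions 11.9M works out to about 99.8 Mbit/s, in line with the 100Mb seen in Grafana, and 278M to about 2332 Mbit/s, the same order as the 2.47Gb seen before the limit.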

Partition

Apply with YAML

Write the experiment configuration to the file network-partition.yaml, for example:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'tomcat'
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'networktest'

This configuration blocks connections from tomcat to networktest. The direction field can be set to to, from, or both.

Create the experiment with kubectl:

kubectl apply -f ./network-partition.yaml

Verify the experiment

Enter the tomcat container and ping networktest; it is unreachable:

kubectl exec -it tomcat-5f7b97cd7-74fj6 -- sh
# ping 10.233.105.12
PING 10.233.105.12 (10.233.105.12) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted

^C
--- 10.233.105.12 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 1003ms
# 
command terminated with exit code 1

Pinging from the host still succeeds:

ping 10.233.105.12
PING 10.233.105.12 (10.233.105.12) 56(84) bytes of data.
64 bytes from 10.233.105.12: icmp_seq=1 ttl=63 time=0.339 ms
64 bytes from 10.233.105.12: icmp_seq=2 ttl=63 time=0.335 ms
64 bytes from 10.233.105.12: icmp_seq=3 ttl=63 time=0.265 ms
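For a repeatable check, the packet-loss figure can be pulled out of ping's summary line and compared between the two vantage points. This sketch assumes the iputils summary format shown above. The function name packet_loss is hypothetical.

```shell
# packet_loss: read ping output on stdin and print the packet-loss percentage
# from the summary line, without the percent sign.
packet_loss() {
  sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Usage against a live cluster (pod name and IP taken from this article):
#   kubectl exec tomcat-5f7b97cd7-74fj6 -- ping -c 5 10.233.105.12 | packet_loss
```

A value of 100 from inside the tomcat container together with 0 from the host is what the partition above is expected to produce.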