Robusta in Practice: Cluster Alerting with askgpt



An introduction to Robusta

Robusta is an open-source Kubernetes troubleshooting platform written in Python. It sits on top of your monitoring stack (Prometheus, Elasticsearch, etc.) and tells you why alerts fired and how to fix them.

Robusta has three main parts, all open source:

  1. An automation engine for Kubernetes
  2. Built-in automations that enrich and remediate common alerts
  3. A handful of additional manual troubleshooting tools

There are also some optional extras:

  1. A bundled stack containing Robusta, the Prometheus Operator, and the default Kubernetes alerts
  2. A web UI for viewing all alerts, changes, and events in your cluster: platform.robusta.dev/

Robusta's automation engine consists of two Deployments:

  • robusta-forwarder, which connects to the API server, watches for Kubernetes changes, and forwards alerts to robusta-runner

  • robusta-runner, which executes playbooks

A playbook is made up of three parts:

  1. triggers: when to fire
  2. actions: what to do
  3. sinks: where to send notifications
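Putting the three parts together: a minimal playbook in generated_values.yaml might look like the sketch below (the trigger, action, and sink names are examples drawn from later sections of this article):

```yaml
customPlaybooks:
- triggers:                  # when to fire
    - on_deployment_update: {}
  actions:                   # what to do
    - resource_babysitter:
        fields_to_monitor: ["spec.replicas"]
  sinks:                     # where to send notifications
    - "main_slack_sink"
```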

(Architecture diagram: arch-11.png)

Use cases

Out of the box, Robusta monitors the following alerts and errors and offers remediation advice for them.

Prometheus Alerts

  • CPUThrottlingHigh - shows the cause and how to fix it.
  • HostOomKillDetected - shows which pods were killed.
  • KubeNodeNotReady - shows node resources and the affected pods.
  • HostHighCpuLoad - shows a breakdown of CPU usage.
  • KubernetesDaemonsetMisscheduled - flags known bugs and suggests fixes.
  • KubernetesDeploymentReplicasMismatch - shows the deployment's status.
  • NodeFilesystemSpaceFillingUp - shows disk usage.

Other errors

These are detected by watching the API server:

  • CrashLoopBackOff
  • ImagePullBackOff
  • Node NotReady

In addition, all Kubernetes events (kubectl get events) of WARNING severity and above are sent to the Robusta UI.

Change tracking

By default, every change to Deployments, DaemonSets, and StatefulSets is sent to the Robusta UI so it can be correlated with Prometheus alerts and other errors. These changes are not sent to other sinks (e.g. Slack) by default, because they would be spammy there.

Deployment

Robusta CLI (optional)

Download the robusta script:

# run via Docker
curl -fsSL -o /usr/bin/robusta https://docs.robusta.dev/master/_static/robusta
chmod +x /usr/bin/robusta
# or install with pip (Python 3)
pip install -U robusta-cli --no-cache

When you run the script, it starts a Docker container (docker run -it --rm --net host):

$ robusta version
version 0.10.10

Generate the configuration file ./generated_values.yaml for the Helm deployment:

$ robusta gen-config
Robusta reports its findings to external destinations (we call them "sinks").
We'll define some of them now.

Configure Slack integration? This is HIGHLY recommended. [Y/n]: y
If your browser does not automatically launch, open the below url:
https://api.robusta.dev/integrations/slack?id=1ad7d0d9-7466-4859-a446-6bebf71e82f7
You've just connected Robusta to the Slack of: SRE
Which slack channel should I send notifications to? # robusta-test
Configure MsTeams integration? [y/N]: n
Configure Robusta UI sink? This is HIGHLY recommended. [Y/n]: n
Robusta can use Prometheus as an alert source.
If you haven't installed it yet, Robusta can install a pre-configured Prometheus.
Would you like to do so? [y/N]: y
Would you like to enable two-way interactivity (e.g. fix-it buttons in Slack) via Robusta's cloud? [y/N]: n
Last question! Would you like to help us improve Robusta by sending exception reports? [y/N]: n
Saved configuration to ./generated_values.yaml - save this file for future use!
Finish installing with Helm (see the Robusta docs). By the way, you're missing out on the UI! See https://home.robusta.dev/ui/

By the way, we'll send you some messages later to get feedback. (We don't store your API key, so we scheduled future messages using Slack's API)

To retrieve generated_values.yaml from an existing cluster:

helm get values -o yaml robusta -n robusta | grep -v clusterName: | grep -v isSmallCluster: > 1.yaml

Deploying Robusta

Add the Helm repository:

helm repo add robusta https://robusta-charts.storage.googleapis.com && helm repo update

Install with Helm:

$ helm pull robusta/robusta
$ tar -xvf robusta-0.10.10.tgz
$ helm install robusta robusta -f ./generated_values.yaml \
-n robusta --create-namespace \
--set clusterName=test-159-63
# do not add --set isSmallCluster=true, or alerts will not be delivered
# append --debug for verbose output

Because we deployed into a dedicated namespace, the robusta CLI will error out unless we make it the default namespace for the context:

$ kubectl config set-context robusta --cluster=kubernetes --user=kubernetes-admin --namespace=robusta
$ kubectl config use-context robusta
$ kubectl config get-contexts
CURRENT   NAME                          CLUSTER      AUTHINFO           NAMESPACE
          kubernetes-admin@kubernetes   kubernetes   kubernetes-admin   kube-system
*         robusta                       kubernetes   kubernetes-admin   robusta
# or pass --namespace on each command
$ robusta playbooks list --namespace=robusta

Prometheus storage

$ vi robusta/values.yaml
    storageSpec:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: nfs-client # add your storageClassName
            resources:
              requests:
                storage: 100Gi

alertmanager

$ vi robusta/values.yaml
  alertmanager:
    tplConfig: true
    config:
      global:
        resolve_timeout: 5m
      route:
        group_by: [ 'job', 'instance' ]
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
        receiver: 'robusta'
        routes:
          - match_re:
              severity: 'info|warn|error|critical'
            repeat_interval: 4h
            continue: true
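To see what this routing does: the only sub-route matches every severity via match_re, overrides repeat_interval, and sets continue: true, so evaluation falls through and every alert still lands on the default 'robusta' receiver. A toy Python model of this one-level matching (an illustration, not Alertmanager's real algorithm):

```python
import re

def matching_receiver(labels, route):
    """Toy one-level model of the routing tree above: a matching sub-route
    without `continue: true` would capture the alert; otherwise the alert
    falls through to the top-level receiver."""
    for sub in route.get("routes", []):
        matched = all(
            re.fullmatch(pattern, labels.get(label, ""))
            for label, pattern in sub.get("match_re", {}).items()
        )
        if matched and not sub.get("continue", False):
            return sub.get("receiver", route["receiver"])
    return route["receiver"]

route = {
    "receiver": "robusta",
    "routes": [{"match_re": {"severity": "info|warn|error|critical"}, "continue": True}],
}
print(matching_receiver({"severity": "critical"}, route))  # robusta
```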

Runner playbook storage

vi robusta/templates/runner.yaml
kind: PersistentVolumeClaim
metadata:
  name: persistent-playbooks-pv-claim
  namespace: {{ .Release.Namespace }}
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: nfs-client # add your storageClassName
  resources:
    requests:
      storage: {{ if .Values.isSmallCluster }}"512Mi"{{ else }}{{ .Values.playbooksPersistentVolumeSize }}{{ end }}

Check the pods:

$ kubectl get po -n robusta
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-robusta-kube-prometheus-st-alertmanager-0   2/2     Running   1          3h38m
prometheus-robusta-kube-prometheus-st-prometheus-0       2/2     Running   0          57m
robusta-forwarder-69b54dc7fb-sfmbc                       1/1     Running   0          3h38m
robusta-grafana-5558c546dd-sffzq                         3/3     Running   0          3h38m
robusta-kube-prometheus-st-operator-547f8ccdbb-g594x     1/1     Running   0          3h38m
robusta-kube-state-metrics-6c588f97c9-mhkv8              1/1     Running   0          3h38m
robusta-prometheus-node-exporter-dgd22                   1/1     Running   0          3h38m
robusta-prometheus-node-exporter-ksntf                   1/1     Running   0          3h38m
robusta-prometheus-node-exporter-xpdx6                   1/1     Running   0          3h38m
robusta-runner-5549c7d86b-7cjh4                          1/1     Running   0          3h38m
robusta-runner-5549c7d86b-vlp9f                          1/1     Running   0          21m

Check the services:

$ kubectl get svc -n robusta
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   15h
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     15h
robusta-grafana                           NodePort    172.30.45.167    <none>        80:30747/TCP                 15h
robusta-kube-prometheus-st-alertmanager   NodePort    172.18.244.209   <none>        9093:30903/TCP               15h
robusta-kube-prometheus-st-operator       ClusterIP   172.17.117.196   <none>        443/TCP                      15h
robusta-kube-prometheus-st-prometheus     NodePort    172.27.236.40    <none>        9090:30090/TCP               15h
robusta-kube-state-metrics                ClusterIP   172.22.32.245    <none>        8080/TCP                     15h
robusta-prometheus-node-exporter          ClusterIP   172.28.205.53    <none>        9104/TCP                     15h
robusta-runner                            ClusterIP   172.23.90.171    <none>        80/TCP                       15h
# services default to ClusterIP; change these to NodePort
$ grep -r "type: NodePort" robusta
robusta/charts/kube-prometheus-stack/charts/grafana/values.yaml:  type: NodePort
robusta/charts/kube-prometheus-stack/values.yaml:    type: NodePort
robusta/charts/kube-prometheus-stack/values.yaml:    type: NodePort

Crash logs

Deploy a crashing pod to test:

kubectl apply -f https://gist.githubusercontent.com/robusta-lab/283609047306dc1f05cf59806ade30b6/raw

After the pod restarts twice, Slack receives a message:

(Screenshot: Slack crash notification)

Automations

Every automation has three parts:

  • Triggers: when to run (based on alerts, logs, changes, etc.)
  • Actions: what to do (more than 50 built-in actions)
  • Sinks: where to send the results; the default sinks are Slack and the Robusta UI

Quick start: on_deployment_update

To add an automation, append the following to generated_values.yaml:

customPlaybooks:
- triggers:
    - on_deployment_update: {}
  actions:
    - resource_babysitter:
        omitted_fields: []
        fields_to_monitor: ["spec.replicas"]
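resource_babysitter reports changes only to the dotted-path fields listed in fields_to_monitor. Conceptually, that field-level diffing looks like the sketch below (get_path and changed_fields are hypothetical helpers, not Robusta's actual code):

```python
def get_path(obj, dotted):
    """Fetch a nested value by a dotted path like 'spec.replicas'."""
    for key in dotted.split("."):
        obj = obj.get(key) if isinstance(obj, dict) else None
        if obj is None:
            break
    return obj

def changed_fields(old, new, fields_to_monitor):
    """Return (path, old_value, new_value) for each monitored field that changed."""
    return [
        (f, get_path(old, f), get_path(new, f))
        for f in fields_to_monitor
        if get_path(old, f) != get_path(new, f)
    ]

old = {"spec": {"replicas": 1}}
new = {"spec": {"replicas": 2}}
print(changed_fields(old, new, ["spec.replicas"]))  # [('spec.replicas', 1, 2)]
```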

Upgrade Robusta:

helm upgrade robusta robusta --values=generated_values.yaml -n robusta \
--set clusterName=test-159-63
# as noted above, omit --set isSmallCluster=true or alerts will not be delivered

Test whether on_deployment_update works:

kubectl scale --replicas 2  deploy robusta-runner  -n robusta

(Screenshot: Slack message for the replicas change)

If the web UI is enabled, you can also see the YAML change on the Timeline (this automation is already configured to push to the UI, which prevents duplicates).

(Screenshot: the YAML diff on the Robusta UI Timeline)

on_prometheus_alert

Defining alerts

Add to generated_values.yaml the default handling rule for Prometheus alerts:

builtinPlaybooks:
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}

Use customPlaybooks to define your own enrichment:

customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  - node_bash_enricher:
      bash_command: ps aux
  sinks:
    - "main_slack_sink"
  stop: True # stop matching subsequent playbooks

Add the HostHighCpuLoad alerting rule:

$ vi robusta/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/node-exporter.yaml
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: HostHighCpuLoad
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "CPU load is > 80%\n  VALUE = {{`{{`}} $value {{`}}`}}\n  LABELS = {{`{{`}} $labels {{`}}`}}"
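The expr averages the per-instance rate of idle CPU seconds over a 2-minute window, converts it to a percentage, and subtracts it from 100 to get CPU busy time. The same arithmetic on raw counter samples, as a small sketch:

```python
def cpu_busy_percent(idle_start, idle_end, window_seconds, num_cpus):
    """Mirrors 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100.
    idle_* are cumulative idle CPU-seconds summed across all CPUs."""
    idle_fraction = (idle_end - idle_start) / window_seconds / num_cpus
    return 100 - idle_fraction * 100

# 2 CPUs accumulate 60 idle CPU-seconds over a 2-minute window:
# each CPU averaged 25% idle, i.e. 75% busy -- below the 80% alert threshold.
print(cpu_busy_percent(0, 60, 120, 2))  # 75.0
```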

Silencing alerts

Wait 10 minutes after a node restarts before alerting:

customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: KubePodCrashLooping
  actions:
  - node_restart_silencer:
      post_restart_silence: 600 # seconds
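The silencer boils down to time arithmetic: drop the alert if it fired within post_restart_silence seconds of the node's restart. A minimal sketch (should_silence is a hypothetical name, not Robusta's internal function):

```python
def should_silence(restart_ts, alert_ts, post_restart_silence=600):
    """Return True if the alert fired within the silence window after a node restart."""
    return 0 <= alert_ts - restart_ts < post_restart_silence

boot = 1_700_000_000  # node restart time (unix seconds)
print(should_silence(boot, boot + 300))  # True  -- 5 minutes after restart: silenced
print(should_silence(boot, boot + 900))  # False -- 15 minutes after: delivered
```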

Custom playbooks

A pitfall you may hit with pyproject.toml: poetry.core.masonry.utils.module.ModuleOrPackageNotFound: No file/folder found for package robusta-playbook-actions. The fix is the packages entry:

[tool.poetry]
name = "robusta-playbook-actions"
version = "0.0.1"
description = ""
authors = ["xx"]
# `packages` must live under [tool.poetry]; this entry fixes the error above
packages = [
    { include = "robusta-playbook-actions", from = "." },
]

[tool.poetry.dependencies]
# if your playbook requires additional dependencies, add them here
#some-dependency = "^1.2.3"

[tool.poetry.dev-dependencies]
robusta-cli = "^0.8.9"
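The include entry has to point at a real package directory next to pyproject.toml. A layout that satisfies it might look like this (the .py file name is illustrative):

```
robusta-playbook/                  # the directory passed to `robusta playbooks push`
├── pyproject.toml
└── robusta-playbook-actions/      # must match the `include` value
    └── my_actions.py              # your @action functions
```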
$ robusta playbooks push robusta-playbook # push the playbook directory to the cluster
$ robusta playbooks list-dirs # list the stored playbook directories
======================================================================
Listing playbooks directories
======================================================================
======================================================================
Stored playbooks directories:
 robusta-playbook
# reload and check for errors
$ robusta playbooks reload
$ robusta playbooks list
--------------------------------------
triggers:
- on_event_create: {}

actions:
- report_scheduling_failure: {}

Robusta event hierarchy: PrometheusKubernetesAlert is a subclass of PodEvent.

(Diagram: the Robusta event class hierarchy)

A basic example

from robusta.api import *

@action
def my_action(event: PodEvent):
    # we have full access to the pod on which the alert fired
    pod = event.get_pod()
    pod_name = pod.metadata.name
    pod_logs = pod.get_logs()
    pod_processes = pod.exec("ps aux")

    # this is how you send data to slack or other destinations
    event.add_enrichment([
        MarkdownBlock("*Oh no!* An alert occurred on " + pod_name),
        FileBlock("crashing-pod.log", pod_logs)
    ])

An event.add_enrichment() example

docs.robusta.dev/master/deve…

@action
def test_playbook(event: ExecutionBaseEvent):
    event.add_enrichment(
        [
            MarkdownBlock(
                "This is a *markdown* message. Here are some movie characters:"
            ),
            TableBlock(
                [["Han Solo", "Star Wars"], ["Paul Atreides", "Dune"]],
                ["name", "movie"],
            ),
        ]
    )

A playbook example with parameters

import logging

from robusta.api import *

class BashParams(ActionParams):
   bash_command: str

@action
def pod_bash_enricher(event: PodEvent, params: BashParams):
    pod = event.get_pod()
    if not pod:
        logging.error(f"cannot run PodBashEnricher on event with no pod: {event}")
        return

    block_list: List[BaseBlock] = []
    exec_result = pod.exec(params.bash_command)
    block_list.append(MarkdownBlock(f"Command results for *{params.bash_command}:*"))
    block_list.append(MarkdownBlock(exec_result))
    event.add_enrichment(block_list)

Wire it up in generated_values.yaml:

customPlaybooks:
- triggers:
  - on_pod_update: {}
  actions:
  - pod_bash_enricher:
      bash_command: "ls -al /"

Loading playbooks from an external URL

docs.robusta.dev/master/user…

Adding ChatGPT help to alerts

github.com/robusta-dev…

globalConfig:
  chat_gpt_token: YOUR KEY GOES HERE
playbookRepos:
  chatgpt_robusta_actions:
    url: "https://github.com/robusta-dev/kubernetes-chatgpt-bot.git"
# you can search for CallbackBlock in the playbooks/ directory and read the existing playbooks to see how it works.
disableCloudRouting: false # make sure this is set to false
customPlaybooks:
# Add the 'Ask ChatGPT' button to all Prometheus alerts
- triggers:
  - on_prometheus_alert: {}
  actions:
  - chat_gpt_enricher: {}
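Conceptually, an enricher like chat_gpt_enricher assembles a prompt from the alert's name, labels, and annotations, sends it to the model, and attaches the answer as an enrichment block. A rough, hypothetical sketch of only the prompt-building step (see the linked repo for the bot's real code):

```python
def build_prompt(alert_name, labels, annotations):
    """Hypothetical helper: turn a Prometheus alert into a ChatGPT question."""
    label_lines = "\n".join(f"{k}={v}" for k, v in sorted(labels.items()))
    summary = annotations.get("summary", "")
    return (
        f"Explain the Kubernetes alert '{alert_name}' and how to fix it.\n"
        f"Summary: {summary}\n"
        f"Labels:\n{label_lines}"
    )

prompt = build_prompt(
    "HostHighCpuLoad",
    {"instance": "node-1", "severity": "warning"},
    {"summary": "Host high CPU load (instance node-1)"},
)
print(prompt.splitlines()[0])
```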

(Screenshots: the Ask ChatGPT button on an alert, and the reply in Slack)

Built-in actions

Only a small subset is listed here; for the full catalog see: docs.robusta.dev/master/cata…

Node

actions:
- node_bash_enricher: # run a command on the node
    bash_command: ls -l /etc/data/db
- node_status_enricher: {} # fetch the node's status
- node_running_pods_enricher: {} # list the pods running on the node
- node_allocatable_resources_enricher: {} # show the node's allocatable resources
- node_graph_enricher: # graph a node resource over time
    prometheus_url: http://prometheus-k8s.monitoring.svc.cluster.local:9090
    resource_type: Memory
- node_cpu_enricher: # show the node's CPU usage
    prometheus_url: http://prometheus-k8s.monitoring.svc.cluster.local:9090

Pod

actions:
- logs_enricher: {} # fetch the pod's logs
- pod_events_enricher: {} # fetch the pod's events
- pod_bash_enricher: # run a command inside the pod
    bash_command: ls -l /etc/data/db
- pod_ps: {} # list the pod's processes
- pod_oom_killer_enricher: {} # show details of the pod's OOM kill

actions:
- delete_pod: {} # delete the pod

Others

actions:
- incluster_ping: # ping a hostname from inside the cluster
    hostname: string
- disk_benchmark: # run a disk benchmark
    storage_class_name: string
- http_stress_test: # run an HTTP load test
    url: string
- create_pvc_snapshot: # snapshot a PVC
    name: some_pvc_name

Python troubleshooting

docs.robusta.dev/master/cata…

Java troubleshooting

docs.robusta.dev/master/cata…

Documentation:

docs.robusta.dev/master/