Robusta in Practice
An introduction to Robusta
Robusta is an open-source Kubernetes troubleshooting platform written in Python. It sits on top of your monitoring stack (Prometheus, Elasticsearch, etc.) and tells you why alerts fired and how to fix them.
Robusta consists of three main parts, all open source:
- an automation engine for Kubernetes
- built-in automations that enrich and fix common alerts
- assorted manual troubleshooting tools
There are also some optional extra components:
- a bundle containing Robusta, the Prometheus Operator, and default Kubernetes alerts
- a web UI at platform.robusta.dev for viewing all alerts, changes, and events in your clusters
Robusta's automation engine consists of two Deployments:
- robusta-forwarder, which connects to the API server, watches for Kubernetes changes, and forwards alerts to robusta-runner
- robusta-runner, which executes playbooks
A playbook has three parts:
- triggers: when to run
- actions: what to do
- sinks: where to send the results
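Put together, those three keys form one entry under customPlaybooks in the Helm values. A minimal hypothetical sketch (the alert name and sink name are placeholders, not taken from this article's setup):

```yaml
customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: KubePodCrashLooping   # when: this alert fires
  actions:
  - logs_enricher: {}                   # what: attach the pod's logs
  sinks:
  - "main_slack_sink"                   # where: send the result to Slack
```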
Use cases
By default, Robusta monitors the alerts and errors below and provides remediation suggestions.
Prometheus Alerts
- CPUThrottlingHigh - explains the cause and how to fix it.
- HostOomKillDetected - shows which Pods were killed.
- KubeNodeNotReady - shows node resources and affected Pods.
- HostHighCpuLoad - shows a breakdown of CPU usage.
- KubernetesDaemonsetMisscheduled - flags a known bug and suggests a fix.
- KubernetesDeploymentReplicasMismatch - shows the deployment's status.
- NodeFilesystemSpaceFillingUp - shows disk usage.
Other errors
These are identified by watching the API server:
- CrashLoopBackOff
- ImagePullBackOff
- Node NotReady
In addition, all Kubernetes events (kubectl get events) of WARNING severity and above are sent to the Robusta UI.
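A severity cutoff like "WARNING and above" can be pictured as a simple threshold filter. A hypothetical Python sketch (the level names and event shape here are illustrative, not Robusta's internal types):

```python
# Illustrative severity ladder; Robusta's real level names may differ.
SEVERITY_ORDER = ["DEBUG", "INFO", "WARNING", "ERROR"]

def events_for_ui(events, min_severity="WARNING"):
    """Keep only events at or above the minimum severity."""
    threshold = SEVERITY_ORDER.index(min_severity)
    return [e for e in events
            if SEVERITY_ORDER.index(e["severity"]) >= threshold]
```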
Change tracking
By default, all changes to Deployments, DaemonSets, and StatefulSets are sent to the Robusta UI so they can be correlated with Prometheus alerts and other errors. These changes are not sent to other sinks (e.g. Slack) by default, because they would be spammy.
Deployment
Robusta CLI (optional)
Download the robusta script:
# Docker-based launcher
curl -fsSL -o /usr/bin/robusta https://docs.robusta.dev/master/_static/robusta
chmod +x /usr/bin/robusta
# or install via pip (python3)
pip install -U robusta-cli --no-cache
Use the script; under the hood it runs a Docker container (docker run -it --rm --net host):
$ robusta version
version 0.10.10
Generate ./generated_values.yaml for the Helm deployment:
$ robusta gen-config
Robusta reports its findings to external destinations (we call them "sinks").
We'll define some of them now.
Configure Slack integration? This is HIGHLY recommended. [Y/n]: y
If your browser does not automatically launch, open the below url:
https://api.robusta.dev/integrations/slack?id=1ad7d0d9-7466-4859-a446-6bebf71e82f7
You've just connected Robusta to the Slack of: SRE
Which slack channel should I send notifications to? # robusta-test
Configure MsTeams integration? [y/N]: n
Configure Robusta UI sink? This is HIGHLY recommended. [Y/n]: n
Robusta can use Prometheus as an alert source.
If you haven't installed it yet, Robusta can install a pre-configured Prometheus.
Would you like to do so? [y/N]: y
Would you like to enable two-way interactivity (e.g. fix-it buttons in Slack) via Robusta's cloud? [y/N]: n
Last question! Would you like to help us improve Robusta by sending exception reports? [y/N]: n
Saved configuration to ./generated_values.yaml - save this file for future use!
Finish installing with Helm (see the Robusta docs). By the way, you're missing out on the UI! See https://home.robusta.dev/ui/
By the way, we'll send you some messages later to get feedback. (We don't store your API key, so we scheduled future messages using Slack's API)
To fetch generated_values.yaml from an existing cluster:
helm get values -o yaml robusta -n robusta | grep -v clusterName: | grep -v isSmallCluster: > 1.yaml
Deploying Robusta
Add the Helm repository:
helm repo add robusta https://robusta-charts.storage.googleapis.com && helm repo update
Install with Helm:
$ helm pull robusta/robusta
$ tar -xvf robusta-0.10.10.tgz
$ helm install robusta robusta -f ./generated_values.yaml \
    -n robusta --create-namespace \
    --set clusterName=test-159-63
# Do not add --set isSmallCluster=true, or alerts will not be received.
# Append --debug for verbose output.
Because the install uses a dedicated namespace, the robusta CLI will fail without it; set it as the default namespace in the kube context:
$ kubectl config set-context robusta --cluster=kubernetes --user=kubernetes-admin --namespace=robusta
$ kubectl config use-context robusta
$ kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
kubernetes-admin@kubernetes kubernetes kubernetes-admin kube-system
* robusta kubernetes kubernetes-admin robusta
# or pass --namespace explicitly
$ robusta playbooks list --namespace=robusta
Prometheus storage
$ vi robusta/values.yaml
storageSpec:
  volumeClaimTemplate:
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: nfs-client  # add a storageClassName
      resources:
        requests:
          storage: 100Gi
alertmanager
$ vi robusta/values.yaml
alertmanager:
  tplConfig: true
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: [ 'job', 'instance' ]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'robusta'
      routes:
      - match_re:
          severity: 'info|warn|error|critical'
        repeat_interval: 4h
        continue: true
Runner playbook storage
vi robusta/templates/runner.yaml
kind: PersistentVolumeClaim
metadata:
  name: persistent-playbooks-pv-claim
  namespace: {{ .Release.Namespace }}
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-client  # add a storageClassName
  resources:
    requests:
      storage: {{ if .Values.isSmallCluster }}"512Mi"{{ else }}{{ .Values.playbooksPersistentVolumeSize }}{{ end }}
Check the pods
$ kubectl get po -n robusta
NAME READY STATUS RESTARTS AGE
alertmanager-robusta-kube-prometheus-st-alertmanager-0 2/2 Running 1 3h38m
prometheus-robusta-kube-prometheus-st-prometheus-0 2/2 Running 0 57m
robusta-forwarder-69b54dc7fb-sfmbc 1/1 Running 0 3h38m
robusta-grafana-5558c546dd-sffzq 3/3 Running 0 3h38m
robusta-kube-prometheus-st-operator-547f8ccdbb-g594x 1/1 Running 0 3h38m
robusta-kube-state-metrics-6c588f97c9-mhkv8 1/1 Running 0 3h38m
robusta-prometheus-node-exporter-dgd22 1/1 Running 0 3h38m
robusta-prometheus-node-exporter-ksntf 1/1 Running 0 3h38m
robusta-prometheus-node-exporter-xpdx6 1/1 Running 0 3h38m
robusta-runner-5549c7d86b-7cjh4 1/1 Running 0 3h38m
robusta-runner-5549c7d86b-vlp9f 1/1 Running 0 21m
Check the services
$ kubectl get svc -n robusta
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 15h
prometheus-operated ClusterIP None <none> 9090/TCP 15h
robusta-grafana NodePort 172.30.45.167 <none> 80:30747/TCP 15h
robusta-kube-prometheus-st-alertmanager NodePort 172.18.244.209 <none> 9093:30903/TCP 15h
robusta-kube-prometheus-st-operator ClusterIP 172.17.117.196 <none> 443/TCP 15h
robusta-kube-prometheus-st-prometheus NodePort 172.27.236.40 <none> 9090:30090/TCP 15h
robusta-kube-state-metrics ClusterIP 172.22.32.245 <none> 8080/TCP 15h
robusta-prometheus-node-exporter ClusterIP 172.28.205.53 <none> 9104/TCP 15h
robusta-runner ClusterIP 172.23.90.171 <none> 80/TCP 15h
# The services default to ClusterIP; change them to NodePort
$ grep -r "type: NodePort" robusta
robusta/charts/kube-prometheus-stack/charts/grafana/values.yaml: type: NodePort
robusta/charts/kube-prometheus-stack/values.yaml: type: NodePort
robusta/charts/kube-prometheus-stack/values.yaml: type: NodePort
Crash logs
Test crash-log collection:
kubectl apply -f https://gist.githubusercontent.com/robusta-lab/283609047306dc1f05cf59806ade30b6/raw
After two restarts, Slack receives a notification.
Automation
Every automation has three parts:
- Triggers: when to run (based on alerts, logs, changes, etc.)
- Actions: what to do (more than 50 built-in actions)
- Sinks: where to send the results; the default sinks are Slack and the Robusta UI
Quick start: on_deployment_update
Add an automation by appending the following to generated_values.yaml:
customPlaybooks:
- triggers:
  - on_deployment_update: {}
  actions:
  - resource_babysitter:
      omitted_fields: []
      fields_to_monitor: ["spec.replicas"]
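fields_to_monitor narrows the diff to specific dotted field paths such as spec.replicas. A hypothetical Python sketch of that kind of comparison (not Robusta's actual implementation):

```python
def get_path(obj, dotted_path):
    """Walk a nested dict by a dotted path such as 'spec.replicas'."""
    for key in dotted_path.split("."):
        obj = obj.get(key) if isinstance(obj, dict) else None
        if obj is None:
            return None
    return obj

def changed_fields(old, new, fields_to_monitor):
    # Report only the monitored paths whose value differs between revisions.
    return [f for f in fields_to_monitor
            if get_path(old, f) != get_path(new, f)]
```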
Upgrade Robusta:
helm upgrade robusta robusta --values=generated_values.yaml -n robusta \
--set clusterName=test-159-63 --set isSmallCluster=true
Test whether on_deployment_update took effect:
kubectl scale --replicas 2 deploy robusta-runner -n robusta

If the web UI is enabled, the Timeline view also shows the YAML diff (this automation is configured out of the box to push to the UI, to avoid duplicates).

on_prometheus_alert
Defining alerts
Append to generated_values.yaml a default handler for Prometheus alerts:
builtinPlaybooks:
- triggers:
  - on_prometheus_alert: {}
  actions:
  - default_enricher: {}
Use customPlaybooks to define your own enrichment:
customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: HostHighCpuLoad
  actions:
  - node_bash_enricher:
      bash_command: ps aux
  sinks:
  - "main_slack_sink"
  stop: True  # stop matching subsequent playbooks
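stop: True halts matching of later playbooks for the same alert. A rough illustration of that first-match routing (the data structures here are hypothetical, not Robusta's internals):

```python
def matching_actions(playbooks, alert_name):
    """Collect actions from every playbook whose trigger matches,
    stopping after a playbook marked stop: True."""
    actions = []
    for pb in playbooks:
        if pb["alert_name"] == alert_name:
            actions.extend(pb["actions"])
            if pb.get("stop"):
                break  # suppress all later matches
    return actions

playbooks = [
    {"alert_name": "HostHighCpuLoad", "actions": ["node_bash_enricher"], "stop": True},
    {"alert_name": "HostHighCpuLoad", "actions": ["default_enricher"]},
]
```

With the stop flag set, only the first playbook's actions run for HostHighCpuLoad; the default enricher is skipped.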
Add a HostHighCpuLoad alert rule:
$ vi robusta/charts/kube-prometheus-stack/templates/prometheus/rules-1.14/node-exporter.yaml
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: HostHighCpuLoad
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "CPU load is > 80%\n  VALUE = {{`{{`}} $value {{`}}`}}\n  LABELS = {{`{{`}} $labels {{`}}`}}"
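The expr derives CPU usage as 100 minus the average idle-time rate scaled to a percentage, and fires above 80%. The same arithmetic as a small Python sketch (the input values are made up):

```python
def cpu_usage_percent(idle_fraction: float) -> float:
    """Mirror the PromQL: 100 - avg(rate(node_cpu_seconds_total{mode="idle"})) * 100."""
    return 100 - idle_fraction * 100

def should_alert(idle_fraction: float, threshold: float = 80.0) -> bool:
    # The rule fires when usage exceeds the threshold (and, in Prometheus,
    # stays there for the 5-minute "for" duration).
    return cpu_usage_percent(idle_fraction) > threshold
```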
Silencing alerts
Suppress this alert for 10 minutes after a node restarts:
customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: KubePodCrashLooping
  actions:
  - node_restart_silencer:
      post_restart_silence: 600  # seconds
Custom playbooks
A pyproject.toml pitfall you may hit: poetry.core.masonry.utils.module.ModuleOrPackageNotFound: No file/folder found for package robusta-playbook-actions
[tool.poetry]
name = "robusta-playbook-actions"
version = "0.0.1"
description = ""
authors = ["xx"]
packages = [
    { include="robusta-playbook-actions", from="." },  # added to fix the error above
]

[tool.poetry.dependencies]
# if your playbook requires additional dependencies, add them here
#some-dependency = "^1.2.3"

[tool.poetry.dev-dependencies]
robusta-cli = "^0.8.9"
$ robusta playbooks push robusta-playbook # push the playbook directory
$ robusta playbooks list-dirs # list loaded directories
======================================================================
Listing playbooks directories
======================================================================
======================================================================
Stored playbooks directories:
robusta-playbook
# reload and check for errors
$ robusta playbooks reload
$ robusta playbooks list
--------------------------------------
triggers:
- on_event_create: {}
actions:
- report_scheduling_failure: {}
In the Robusta event hierarchy, PrometheusKubernetesAlert is a subclass of PodEvent.
A basic example:
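Because of that subclass relationship, an action typed against PodEvent also runs for Prometheus alerts. A toy sketch of the idea (these are stand-in classes, not the real robusta.api types):

```python
class ExecutionBaseEvent:
    """Stand-in for the root event type."""

class PodEvent(ExecutionBaseEvent):
    """Stand-in for events that carry a pod."""
    def get_pod(self):
        return {"metadata": {"name": "demo-pod"}}

class PrometheusKubernetesAlert(PodEvent):
    """Stand-in for a Prometheus alert enriched with pod context."""

def my_action(event: PodEvent):
    # Accepts PodEvent and any subclass, including Prometheus alerts.
    return event.get_pod()["metadata"]["name"]
```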
from robusta.api import *

@action
def my_action(event: PodEvent):
    # we have full access to the pod on which the alert fired
    pod = event.get_pod()
    pod_name = pod.metadata.name
    pod_logs = pod.get_logs()
    pod_processes = pod.exec("ps aux")

    # this is how you send data to slack or other destinations
    event.add_enrichment([
        MarkdownBlock("*Oh no!* An alert occurred on " + pod_name),
        FileBlock("crashing-pod.log", pod_logs)
    ])
An event.add_enrichment() example:
@action
def test_playbook(event: ExecutionBaseEvent):
    event.add_enrichment(
        [
            MarkdownBlock(
                "This is a *markdown* message. Here are some movie characters:"
            ),
            TableBlock(
                [["Han Solo", "Star Wars"], ["Paul Atreides", "Dune"]],
                ["name", "movie"],
            ),
        ]
    )
A playbook example with parameters:
import logging
from robusta.api import *
class BashParams(ActionParams):
    bash_command: str

@action
def pod_bash_enricher(event: PodEvent, params: BashParams):
    pod = event.get_pod()
    if not pod:
        logging.error(f"cannot run PodBashEnricher on event with no pod: {event}")
        return

    block_list: List[BaseBlock] = []
    exec_result = pod.exec(params.bash_command)
    block_list.append(MarkdownBlock(f"Command results for *{params.bash_command}:*"))
    block_list.append(MarkdownBlock(exec_result))
    event.add_enrichment(block_list)
Then in generated_values.yaml:
customPlaybooks:
- triggers:
  - on_pod_update: {}
  actions:
  - pod_bash_enricher:
      bash_command: "ls -al /"
Loading playbooks from an external URL
Add ChatGPT assistance to alerts:
globalConfig:
  chat_gpt_token: YOUR KEY GOES HERE

playbookRepos:
  chatgpt_robusta_actions:
    url: "https://github.com/robusta-dev/kubernetes-chatgpt-bot.git"
    # you can search for CallbackBlock in the playbooks/ directory and read the existing playbooks to see how it works.

disableCloudRouting: false  # this must be set to false

customPlaybooks:
# Add the 'Ask ChatGPT' button to all Prometheus alerts
- triggers:
  - on_prometheus_alert: {}
  actions:
  - chat_gpt_enricher: {}
Built-in actions
Only a small subset is listed here; for the full catalog see docs.robusta.dev/master/cata…
Node
actions:
- node_bash_enricher:  # run a command on the node
    bash_command: ls -l /etc/data/db
- node_status_enricher: {}  # get node status
- node_running_pods_enricher: {}  # list pods running on the node
- node_allocatable_resources_enricher: {}  # get node allocatable resources
- node_graph_enricher:  # graph a node resource
    prometheus_url: http://prometheus-k8s.monitoring.svc.cluster.local:9090
    resource_type: Memory
- node_cpu_enricher:  # get node CPU information
    prometheus_url: http://prometheus-k8s.monitoring.svc.cluster.local:9090
Pod
actions:
- logs_enricher: {}  # fetch pod logs
- pod_events_enricher: {}  # fetch pod events
- pod_bash_enricher:  # run a command inside the pod
    bash_command: ls -l /etc/data/db
- pod_ps: {}  # list the pod's processes
- pod_oom_killer_enricher: {}  # get pod OOM-kill information
actions:
- delete_pod: {}  # delete the pod
其他
actions:
- incluster_ping:  # ping a hostname from inside the cluster
    hostname: string
- disk_benchmark:  # benchmark a disk
    storage_class_name: string
- http_stress_test:  # HTTP stress test
    url: string
- create_pvc_snapshot:  # snapshot a PVC
    name: some_pvc_name
Python troubleshooting
Java troubleshooting
Documentation: