AlertManager: The Alerting Component


AlertManager

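This post follows directly on from the previous one, on customizing Prometheus.

Preface

Once a monitoring stack is in place, an alerting mechanism is the essential next step. It pushes notifications through channels such as email, SMS, DingTalk, or WeCom (Enterprise WeChat), so that operations staff can find and fix problems as quickly as possible.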

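1. Create AlertManager

As usual, start by copying the default configuration file out of a container. This assumes a container named alertmanager already exists; remove it before running the start command below, since Docker container names must be unique.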

docker cp alertmanager:/etc/alertmanager/alertmanager.yml .

Start AlertManager with the configuration file mounted from the host

docker run --name alertmanager -d -p 9093:9093 -v /Users/yujian/Documents/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml  prom/alertmanager:latest

2. Configure AlertManager notification channels

Email: edit alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: xxxxxxx@163.com
  smtp_auth_username: xxxxxx@163.com
  smtp_auth_password: xxxxx
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: xxxxxxxx@qq.com

At this point the notification side of AlertManager is fully configured.

3. Create alert rules

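Alert rules define the conditions under which an alert fires. They are evaluated by Prometheus, which then forwards firing alerts to AlertManager (this requires an alerting.alertmanagers entry in prometheus.yml pointing at AlertManager, port 9093 in this setup; that entry is not shown here).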

# Edit prometheus.yml
rule_files:
  - "/etc/prometheus/rules.yml"
  # - "second_rules.yml"

The file /etc/prometheus/rules.yml does not exist yet, so let's create one on the host

vi rule.yml

groups:
- name: node-up
  rules:
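  - alert: cpumax  # alert name
    expr: easy_prometheus_system_cpu_percent{job="easy_prometheus"} > 20  # PromQL expression
    for: 3s  # how long the condition must hold before the alert fires
    annotations:  # threshold lowered to 20% so the alert is easy to trigger
      summary: "{{ $labels.instance }} cpu使用率超过20%!"  # "CPU usage above 20%!"
  - alert: node-up
    expr: up{job="easy_prometheus"} == 0  # PromQL expression
    for: 4s
    labels:  # extra labels attached to the alert
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} 已停止运行!"  # "has stopped running!"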

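Recreate the Prometheus container with rule.yml mounted at /etc/prometheus/rules.yml. Once it has started, open the Alerts page in the Prometheus UI to confirm that the rules were loaded.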

Webhook: replace the route and receivers sections of alertmanager.yml

route:
  group_by: ['instance']
  group_wait: 10s
  group_interval: 20s
  repeat_interval: 20s
  #repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://192.168.31.150:8089/webhook'

Message format (the body AlertManager POSTs to the webhook URL)

{
  "receiver": "webhook",
  "status": "resolved",
  "alerts": [
    {
      "status": "resolved",
      "labels": {
        "action": "Cpu利用率",
        "alertname": "cpumax",
        "application": "easy_prometheus",
        "cause": "Cpu利…",
        "exported_application": "easy_prometheus",
        "instance": "192.168.31.150:8089",
        "job": "easy_prometheus"
      },
      "annotations": {
        "summary": "192.168.31.150:8089 cpu使用率超过20%!"
      },
      "startsAt": "2021-06-19T03:21:56.117Z",
      "endsAt": "2021-06-19T03:22:11.117Z",
      "generatorURL": "http://406161e43292:9090/graph?g0.expr=easy_prometheus_system_cpu_percent%7Bjob%3D%22easy_prometheus%22%7D+%3E+20\u0026g0.tab=1",
      "fingerprint": "1bcf523f0c524538"
    }
  ],
  "groupLabels": {"instance": "192.168.31.150:8089"},
  "commonLabels": {
    "application": "easy_prometheus",
    "instance": "192.168.31.150:8089",
    "job": "easy_prometheus"
  },
  "commonAnnotations": {},
  "externalURL": "http://c731ba69bfca:9093",
  "version": "4",
  "groupKey": "{}:{instance=\"192.168.31.150:8089\"}",
  "truncatedAlerts": 0
}

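Next, modify the Easy-Prometheus source (already pushed to GitHub) so that it also listens for webhook notifications and forwards them to DingTalk.

The access_token is obtained by creating a custom robot in a DingTalk group.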

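// Needs encoding/json, fmt, io/ioutil, log and net/http from the standard
// library, plus the httpgo request helper used by Easy-Prometheus.

// Ding holds the only part of the webhook payload used here:
// the summary annotation of each alert.
type Ding struct {
	Alerts []struct {
		Annotations struct {
			Summary string `json:"summary"`
		} `json:"annotations"`
	} `json:"alerts"`
}

// dingding receives the AlertManager webhook and forwards the first alert's
// summary to a DingTalk group robot as a "link" message.
func dingding(w http.ResponseWriter, r *http.Request) {
	s, err := ioutil.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Println(string(s))
	ding := &Ding{}
	if err := json.Unmarshal(s, ding); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if len(ding.Alerts) == 0 { // avoid indexing an empty slice
		return
	}
	anno := ding.Alerts[0]
	req := &httpgo.Req{}
	x, err := req.Header("Content-Type", "application/json").
		Method(http.MethodPost).
		Url("https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx").
		Params(httpgo.Query{
			"link": map[string]interface{}{
				"title": "AlertManager通知",                // "AlertManager notification"
				"text":  "通知" + anno.Annotations.Summary, // "Notification" + summary
				// Image picked at random from the web.
				"picUrl": "https://photo.16pic.com/00/65/09/16pic_6509905_b.png",
				// Clicking the message title jumps straight to Prometheus.
				"messageUrl": "http://localhost:9090/alerts",
			},
			"msgtype": "link",
		}).Go().Body()
	if err != nil {
		log.Println(err)
	}
	fmt.Println(x)
}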

4. Test the alerts

To test, I started several applications at once so that CPU usage climbed above 20% and stayed there for 3 seconds.

DingTalk

[Screenshot: the alert notification as received in the DingTalk group]