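Alertmanager Cluster Deployment Guide
Running Alertmanager as a cluster requires version 0.15.2 or later.
A cluster deployment is configured through the following cluster-related command-line flags: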
--cluster.listen-address="0.0.0.0:9094"
    Listen address for cluster. Set to empty string to disable HA mode.
--cluster.advertise-address=CLUSTER.ADVERTISE-ADDRESS
    Explicit address to advertise in cluster.
--cluster.peer=CLUSTER.PEER ...
    Initial peers (may be repeated).
--cluster.peer-timeout=15s
    Time to wait between peers to send notifications.
--cluster.gossip-interval=200ms
    Interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased bandwidth.
--cluster.pushpull-interval=1m0s
    Interval for gossip state syncs. Setting this interval lower (more frequent) will increase convergence speeds across larger clusters at the expense of increased bandwidth usage.
--cluster.tcp-timeout=10s
    Timeout for establishing a stream connection with a remote node for a full state sync, and for stream read and write operations.
--cluster.probe-timeout=500ms
    Timeout to wait for an ack from a probed node before assuming it is unhealthy. This should be set to 99-percentile of RTT (round-trip time) on your network.
--cluster.probe-interval=1s
    Interval between random node probes. Setting this lower (more frequent) will cause the cluster to detect failed nodes more quickly at the expense of increased bandwidth usage.
--cluster.settle-timeout=1m0s
    Maximum time to wait for cluster connections to settle before evaluating notifications.
--cluster.reconnect-interval=10s
    Interval between attempting to reconnect to lost peers.
--cluster.reconnect-timeout=6h0m0s
    Length of time to attempt to reconnect to a lost peer.
The configuration file is as follows:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Start each of the three Alertmanager instances:
./alertmanager --config.file=alertmanager-1.yml --web.listen-address=":9193" --cluster.listen-address="0.0.0.0:9194" --cluster.peer 172.29.203.60:9194 --cluster.peer 172.29.203.60:9294 --cluster.peer 172.29.203.60:9394 --cluster.advertise-address 172.29.203.60:9194
./alertmanager --config.file=alertmanager-2.yml --web.listen-address=":9293" --cluster.listen-address="0.0.0.0:9294" --cluster.peer 172.29.203.60:9194 --cluster.peer 172.29.203.60:9294 --cluster.peer 172.29.203.60:9394 --cluster.advertise-address 172.29.203.60:9294
./alertmanager --config.file=alertmanager-3.yml --web.listen-address=":9393" --cluster.listen-address="0.0.0.0:9394" --cluster.peer 172.29.203.60:9194 --cluster.peer 172.29.203.60:9294 --cluster.peer 172.29.203.60:9394 --cluster.advertise-address 172.29.203.60:9394
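The three commands above differ only in the instance number and its ports, so they can be generated instead of typed by hand. The following is a minimal sketch under that assumption; the helper names peer_flags and start_command are hypothetical, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: derive the repeated flags from one host and a list of cluster ports.
# HOST and the port layout mirror the commands above; helper names are hypothetical.
HOST="172.29.203.60"
CLUSTER_PORTS="9194 9294 9394"

# Print one --cluster.peer flag per cluster port, space-separated on one line.
peer_flags() {
  flags=""
  for port in $CLUSTER_PORTS; do
    flags="$flags --cluster.peer=$HOST:$port"
  done
  printf '%s\n' "${flags# }"
}

# Print the full start command for instance N (1, 2 or 3).
start_command() {
  n="$1"
  web_port="9${n}93"
  cluster_port="9${n}94"
  printf './alertmanager --config.file=alertmanager-%s.yml --web.listen-address=":%s" --cluster.listen-address="0.0.0.0:%s" %s --cluster.advertise-address=%s:%s\n' \
    "$n" "$web_port" "$cluster_port" "$(peer_flags)" "$HOST" "$cluster_port"
}

# Print all three commands for review before running them.
for n in 1 2 3; do
  start_command "$n"
done
```

Printing first makes it easy to compare the generated lines with the hand-written ones before starting anything.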
Open the web port of any instance to view its status:
http://172.29.203.60:9193/#/status
The status page shows the current cluster information.
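The same check can be scripted instead of done in a browser. This is a minimal sketch, assuming the running release serves the v2 HTTP API at /api/v2/status (an assumption, since the text above only mentions the web page); status_url and check_cluster are hypothetical helper names, and nothing is contacted until check_cluster is called.

```shell
#!/bin/sh
# Host and web ports are taken from the start commands in this guide.
HOST="172.29.203.60"
WEB_PORTS="9193 9293 9393"

# Hypothetical helper: build the status URL for one instance.
# Assumes the API v2 status endpoint is available in the running release.
status_url() {
  printf 'http://%s:%s/api/v2/status\n' "$1" "$2"
}

# Hypothetical helper: fetch the status document from every instance.
# Requires curl and a running cluster; it is defined here but not invoked.
check_cluster() {
  for port in $WEB_PORTS; do
    echo "== instance on port $port =="
    curl -s "$(status_url "$HOST" "$port")"
    echo
  done
}
```

Once all three instances are up, calling check_cluster should print one JSON document per instance, and the cluster section of each is expected to list the other peers.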
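Alertmanager Configuration Overview
Alertmanager uses routes to define how alerts are handled. The routes form a tree of label matchers: each incoming alert is matched against the tree by its labels, and the matching node determines how it is processed. This section covers the routing-related configuration in detail.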
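To make the tree idea concrete, the sketch below writes a small sample configuration with one root route and two child routes. All receiver names and label values are hypothetical, the older match syntax is assumed (newer releases also offer matchers), and the validation step only runs if amtool happens to be installed.

```shell
#!/bin/sh
# Write a sample config with a two-level routing tree (all names are hypothetical).
config_file="${TMPDIR:-/tmp}/alertmanager-route-example.yml"
cat > "$config_file" <<'EOF'
route:
  receiver: 'default-receiver'      # fallback when no child route matches
  group_by: ['alertname']
  routes:
    - match:
        severity: 'critical'
      receiver: 'oncall-receiver'
    - match:
        team: 'database'
      receiver: 'dba-receiver'
receivers:
  - name: 'default-receiver'
  - name: 'oncall-receiver'
  - name: 'dba-receiver'
EOF

# Validate the file if amtool is available; otherwise skip the check.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config "$config_file"
fi
```

With this layout, an alert labelled severity=critical would go to the first child route, one labelled team=database to the second, and anything else would fall back to the root receiver.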
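Alertmanager is responsible for handling the alerts produced by Prometheus in one place, so its configuration usually contains the following main parts:
- Global configuration (global): defines shared global parameters, such as the global SMTP and Slack settings;
- Templates (templates): defines the templates used for alert notifications, such as HTML templates and email templates;
- Alert routing (route): decides, by label matching, how the current alert should be handled;
- Receivers (receivers): a receiver is an abstract concept; it can be an email address, WeChat, Slack, a webhook, and so on. Receivers are normally used together with alert routing;
- Inhibition rules (inhibit_rules): well-chosen inhibition rules reduce the number of noisy alerts.
The configuration format is as follows: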
global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ smtp_from: <tmpl_string> ]
  [ smtp_smarthost: <string> ]
  [ smtp_hello: <string> | default = "localhost" ]
  [ smtp_auth_username: <string> ]
  [ smtp_auth_password: <secret> ]
  [ smtp_auth_identity: <string> ]
  [ smtp_auth_secret: <secret> ]
  [ smtp_require_tls: <bool> | default = true ]
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ http_config: <http_config> ]
templates:
  [ - <filepath> ... ]
route: <route>
receivers:
  - <receiver> ...
inhibit_rules:
  [ - <inhibit_rule> ... ]
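In the global configuration, pay particular attention to resolve_timeout. It defines how long Alertmanager waits, after it stops receiving an alert, before marking that alert as resolved. This setting can affect when recovery notifications arrive, so choose a value that fits your own scenario; the default is 5 minutes. The following sections use practical examples to explain the rest of the Alertmanager configuration.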
Alertmanager Configuration
The Alertmanager configuration file is stored as a Kubernetes Secret.
Create an alertmanager.yaml configuration file with the following contents:
global:
  resolve_timeout: 5m
  smtp_smarthost: smtp.mxhichina.com:25
  smtp_from: xxxx@soulapp.cn
  smtp_auth_username: xxxx@soulapp.cn
  smtp_auth_identity: xxxx@soulapp.cn
  smtp_auth_password: xxxx
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default-receiver'
receivers:
- name: 'default-receiver'
  email_configs:
  - to: "ops@soulapp.cn"
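Create a Secret object from the configuration file:
# Delete the old Secret (in the same namespace used by the create command below)
kubectl delete secret alertmanager-main -n monitoring
# Create the new Secret
kubectl create secret generic alertmanager-main --from-file=./alertmanager.yaml -n monitoring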
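Alternatively, base64-encode the file contents, put the result into alertmanager-secret.yaml, and recreate the object, as shown in the figure.
Finally, open the Alertmanager status page to check whether the configuration has been applied.
As the figure shows, the email configuration has taken effect.
Default monitoring metrics reference: github.com/kubernetes-…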
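For the base64 route, the encoding step might look like the sketch below. The helper name encode_for_secret is hypothetical, and the example writes a small throwaway file so that it runs on its own; in practice the argument would be the real alertmanager.yaml.

```shell
#!/bin/sh
# Hypothetical helper: print the base64 form of a file on a single line,
# ready to paste into the data section of a Secret manifest.
encode_for_secret() {
  base64 < "$1" | tr -d '\n'
}

# Demonstration on a throwaway file (stand-in for alertmanager.yaml).
sample_file="$(mktemp)"
printf 'global:\n  resolve_timeout: 5m\n' > "$sample_file"

encoded="$(encode_for_secret "$sample_file")"
echo "$encoded"

# Round-trip check: decoding must give back the original contents.
printf '%s' "$encoded" | base64 --decode
```

Stripping the newlines matters because some base64 implementations wrap long output, while the value in the manifest is easiest to handle as a single line.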