Alertmanager Alerting Component Deployment (Part 3)

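Alertmanager cluster deployment guide

Cluster deployment requires Alertmanager version 0.15.2 or later.

A cluster deployment is configured through the cluster-related command-line flags listed below: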

      --cluster.listen-address="0.0.0.0:9094"
                                 Listen address for cluster. Set to empty string to disable HA mode.
      --cluster.advertise-address=CLUSTER.ADVERTISE-ADDRESS
                                 Explicit address to advertise in cluster.
      --cluster.peer=CLUSTER.PEER ...
                                 Initial peers (may be repeated).
      --cluster.peer-timeout=15s
                                 Time to wait between peers to send notifications.
      --cluster.gossip-interval=200ms
                                 Interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated across the cluster more quickly at the expense of increased
                                 bandwidth.
      --cluster.pushpull-interval=1m0s
                                 Interval for gossip state syncs. Setting this interval lower (more frequent) will increase convergence speeds across larger clusters at the expense of increased bandwidth
                                 usage.
      --cluster.tcp-timeout=10s  Timeout for establishing a stream connection with a remote node for a full state sync, and for stream read and write operations.
      --cluster.probe-timeout=500ms
                                 Timeout to wait for an ack from a probed node before assuming it is unhealthy. This should be set to 99-percentile of RTT (round-trip time) on your network.
      --cluster.probe-interval=1s
                                 Interval between random node probes. Setting this lower (more frequent) will cause the cluster to detect failed nodes more quickly at the expense of increased bandwidth
                                 usage.
      --cluster.settle-timeout=1m0s
                                 Maximum time to wait for cluster connections to settle before evaluating notifications.
      --cluster.reconnect-interval=10s
                                 Interval between attempting to reconnect to lost peers.
      --cluster.reconnect-timeout=6h0m0s
                                 Length of time to attempt to reconnect to a lost peer.

The configuration file is as follows:

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Start the three Alertmanager instances one by one:

./alertmanager --config.file=alertmanager-1.yml --web.listen-address=":9193" --cluster.listen-address="0.0.0.0:9194" --cluster.peer 172.29.203.60:9194  --cluster.peer 172.29.203.60:9294 --cluster.peer 172.29.203.60:9394 --cluster.advertise-address 172.29.203.60:9194
./alertmanager --config.file=alertmanager-2.yml --web.listen-address=":9293" --cluster.listen-address="0.0.0.0:9294" --cluster.peer 172.29.203.60:9194  --cluster.peer 172.29.203.60:9294 --cluster.peer 172.29.203.60:9394 --cluster.advertise-address 172.29.203.60:9294
./alertmanager --config.file=alertmanager-3.yml --web.listen-address=":9393" --cluster.listen-address="0.0.0.0:9394" --cluster.peer 172.29.203.60:9194  --cluster.peer 172.29.203.60:9294 --cluster.peer 172.29.203.60:9394 --cluster.advertise-address 172.29.203.60:9394
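
The three commands above differ only in their port numbers, so they can be generated rather than typed by hand. The following is a minimal sketch under that assumption; the helper names `peer_flags` and `start_command` are hypothetical and not part of Alertmanager. It only prints the command lines and does not start anything.

```shell
#!/bin/sh
# Sketch: build the start commands from a host and a list of cluster ports.

# peer_flags HOST PORT... prints one "--cluster.peer HOST:PORT" flag per port.
peer_flags() (
  host="$1"; shift
  flags=""
  for port in "$@"; do
    flags="${flags}${flags:+ }--cluster.peer ${host}:${port}"
  done
  printf '%s\n' "$flags"
)

# start_command INDEX HOST WEB_PORT CLUSTER_PORT PEER_PORT...
# prints the command line for one instance without running it.
start_command() (
  index="$1"; host="$2"; web_port="$3"; cluster_port="$4"; shift 4
  printf './alertmanager --config.file=alertmanager-%s.yml --web.listen-address=":%s" --cluster.listen-address="0.0.0.0:%s" %s --cluster.advertise-address %s:%s\n' \
    "$index" "$web_port" "$cluster_port" "$(peer_flags "$host" "$@")" "$host" "$cluster_port"
)

# Print the three commands used in this article (ports 91xx, 92xx, 93xx).
for i in 1 2 3; do
  start_command "$i" 172.29.203.60 "9${i}93" "9${i}94" 9194 9294 9394
done
```

Keeping the peer list in one place makes it harder for the instances to drift apart when a port or host changes.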

Open the web port of any instance to view its service information:

http://172.29.203.60:9193/#/status

The status page shows the current cluster information.
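
The same cluster information can also be read from the command line. The sketch below assumes an Alertmanager release that serves the v2 HTTP API (`/api/v2/status`), whose response lists each peer with an `address` field. The helper name `cluster_peer_addresses` is hypothetical, and the text matching is a rough substitute for a real JSON parser.

```shell
#!/bin/sh
# cluster_peer_addresses reads a /api/v2/status JSON document on stdin
# and prints one peer address per line.
cluster_peer_addresses() {
  # Pick out every "address":"host:port" pair, then keep only the value.
  grep -o '"address": *"[^"]*"' | cut -d'"' -f4
}

# Usage, assuming the first instance from this article is reachable:
#   curl -s http://172.29.203.60:9193/api/v2/status | cluster_peer_addresses
```

With all three instances joined, the output should list three addresses, one per cluster port.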

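Alertmanager configuration overview

In Alertmanager, routes define how alerts are handled. Routing is a tree of label matchers: each incoming alert is matched against the tree by its labels and handled according to the node it matches. This section covers routing in detail.

Alertmanager is mainly responsible for processing the alerts generated by Prometheus in one place, so an Alertmanager configuration generally contains the following main parts: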
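
To make the tree structure concrete, here is a minimal sketch that writes a small routing configuration to a file. The label names, label values, and receiver names (`oncall`, `database-team`) are hypothetical, as is the helper `write_route_example`. The `match` and `match_re` keys follow the older matcher syntax; newer releases also offer a `matchers` list.

```shell
#!/bin/sh
# write_route_example PATH writes a small routing-tree configuration to PATH.
write_route_example() {
  cat > "$1" <<'EOF'
route:
  # Root node: alerts that match no child route end up here.
  receiver: 'default-receiver'
  group_by: ['alertname']
  routes:
  # Child node 1: exact label match.
  - match:
      severity: 'critical'
    receiver: 'oncall'
  # Child node 2: regular-expression label match.
  - match_re:
      service: 'mysql|redis'
    receiver: 'database-team'
receivers:
- name: 'default-receiver'
- name: 'oncall'
- name: 'database-team'
EOF
}

# Usage, assuming amtool is installed alongside alertmanager:
#   write_route_example ./route-example.yml
#   amtool check-config ./route-example.yml
```

An alert labeled `severity: critical` would go to `oncall`, one labeled `service: redis` would go to `database-team`, and anything else would fall back to the root receiver.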

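  • Global configuration (global): defines shared global parameters, such as the global SMTP and Slack settings;
  • Templates (templates): define the templates used for alert notifications, such as HTML and email templates;
  • Alert routing (route): determines how an alert is handled, based on label matching;
  • Receivers (receivers): a receiver is an abstraction that can be an email address, WeChat, Slack, a webhook, and so on; receivers are generally used together with alert routes;
  • Inhibition rules (inhibit_rules): well-chosen inhibition rules reduce the number of redundant alerts.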

The configuration format is as follows:

global:
  [ resolve_timeout: <duration> | default = 5m ]
  [ smtp_from: <tmpl_string> ] 
  [ smtp_smarthost: <string> ] 
  [ smtp_hello: <string> | default = "localhost" ]
  [ smtp_auth_username: <string> ]
  [ smtp_auth_password: <secret> ]
  [ smtp_auth_identity: <string> ]
  [ smtp_auth_secret: <secret> ]
  [ smtp_require_tls: <bool> | default = true ]
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  [ http_config: <http_config> ]

templates:
  [ - <filepath> ... ]

route: <route>

receivers:
  - <receiver> ...

inhibit_rules:
  [ - <inhibit_rule> ... ]

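In the global configuration, pay attention to resolve_timeout. This parameter defines how long Alertmanager waits without receiving an alert again before marking it as resolved. It can therefore affect when recovery notifications arrive, so set it to suit your own scenario; the default is 5 minutes. The following sections use practical examples to explain the rest of the Alertmanager configuration.

Alertmanager configuration

The Alertmanager configuration file is stored as a Kubernetes Secret.

Create the alertmanager.yaml configuration file with the following content: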

global:
  resolve_timeout: 5m
  smtp_smarthost: smtp.mxhichina.com:25
  smtp_from: xxxx@soulapp.cn
  smtp_auth_username: xxxx@soulapp.cn
  smtp_auth_identity: xxxx@soulapp.cn
  smtp_auth_password: xxxx

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default-receiver'

receivers:
- name: 'default-receiver'
  email_configs:
  - to: "ops@soulapp.cn"

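Create a Secret object from the configuration file:

# Delete the old Secret
kubectl delete secret alertmanager-main -n monitoring
# Create the new Secret from the configuration file
kubectl create secret generic alertmanager-main --from-file=./alertmanager.yaml -n monitoring

Alternatively, base64-encode the file contents, update alertmanager-secret.yaml with the result, and re-create the object, as shown in the figure: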
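
As a rough illustration of the base64 approach, the sketch below renders a Secret manifest from a local configuration file. The manifest layout is an assumption based on the Secret name, namespace, and file name used in this article; the actual alertmanager-secret.yaml in your deployment may carry extra labels or fields. The helper name `render_secret_manifest` is hypothetical.

```shell
#!/bin/sh
# render_secret_manifest FILE prints a Secret manifest whose
# alertmanager.yaml field holds the base64-encoded contents of FILE.
render_secret_manifest() (
  file="$1"
  # Encode the file and strip line wraps so the value stays on one line.
  encoded="$(base64 < "$file" | tr -d '\n')"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
data:
  alertmanager.yaml: ${encoded}
EOF
)

# Usage:
#   render_secret_manifest ./alertmanager.yaml > alertmanager-secret.yaml
#   kubectl apply -f alertmanager-secret.yaml
```

Applying a manifest updates the Secret in place, so no separate delete step is needed with this route.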

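Finally, open the Alertmanager status page to confirm that the configuration was applied successfully.

As the figure shows, the email configuration has taken effect.

Default monitoring metrics reference: github.com/kubernetes-…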
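
Besides the status page, the stored Secret can be read back and compared with the local file. This is a minimal sketch, assuming the Secret name and namespace used earlier and a `base64` that accepts `-d`; the helper name `secret_matches_file` is hypothetical.

```shell
#!/bin/sh
# secret_matches_file FILE reads a base64-encoded Secret field on stdin
# and succeeds only if its decoded text equals the contents of FILE.
secret_matches_file() (
  file="$1"
  deployed="$(base64 -d)"
  [ "$deployed" = "$(cat "$file")" ]
)

# Usage:
#   kubectl get secret alertmanager-main -n monitoring \
#     -o jsonpath='{.data.alertmanager\.yaml}' \
#     | secret_matches_file ./alertmanager.yaml && echo "Secret is up to date"
```

This only confirms what is stored in the Secret; the status page remains the way to confirm what the running Alertmanager has actually loaded.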