prometheus告警问题分析今天来说一下prometheus告警失效的问题。运维prometheus的过程中发现，

团队作者:华仔

今天来说一下prometheus告警失效的问题。

问题分析

运维prometheus的过程中发现，有时候该告警的时候没有告警，不该告警的时候却告警了，告警有延迟等等。为了找出原因，也参考了一些资料。希望对大家有帮助。先来看一下prometheus和alertmanager的一些重要配置。如下所示：

# promtheus
global:
  # How frequently to scrape targets by default. 从目标抓取监控数据的间隔
  [ scrape_interval: <duration> | default = 1m ]
  # How long until a scrape request times out. 从目标住区数据的超时时间
  [ scrape_timeout: <duration> | default = 10s ]
  # How frequently to evaluate rules. 告警规则评估的时间间隔
  [ evaluation_interval: <duration> | default = 1m ]
# alertmanager
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ] # 初次发送告警的等待时间

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ] 同一个组其他新发生的告警发送时间间隔

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ] 重复发送同一个告警的时间间隔

由以上来看一下整个流程。国信证券-prometheus告警流程.png

告警延迟或频发

根据上图以及配置来看，prometheus抓取数据后，根据告警规则计算，表达式为真时，进入pending状态，当持续时间超过for配置的时间后进入active状态；数据同时会推送至alertmanager，在经过group_wait后发送通知。可见，在数据到达alertmanager后，如果group_wait设置越大，则收到告警的时间也就越长，也就会造成告警延迟；同理，如果group_wait设置过小，则频繁收到告警，显然也是不合理的。因此，需要按照具体场景进行设置。

不该告警的时候告警了

prometheus每经过scrape_interval时间向target拉取数据，再进行计算。与此同时，target的数据可能已经恢复正常了，也就是说，在for计算过程中，原数据已经恢复了正常，但是被告警跳过了，达到了持续时间，就触发了告警，也就发送了告警通知。但从grafana中看，觉得是正常。这是因为grafana以prometheus为数据源时，是range query，而不是像告警数据那样稀疏的。