
1. Using SkyWalking Alarm Rules

The SkyWalking alarm feature was added in the 6.x releases. Its core is driven by a set of rules defined in the config/alarm-settings.yml file. An alarm definition has two parts:

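Alarm rules: define how a metric alarm is triggered and which conditions are taken into account.
Webhooks: define which service endpoints are notified when an alarm is triggered.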

1-1. Configuring SkyWalking Alarm Rules

Every SkyWalking distribution ships with a default config/alarm-settings.yml file that predefines some commonly used alarm rules:

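Service average response time is over 1 second in the last 3 minutes.
Service success rate is lower than 80% in the last 2 minutes.
Service response time percentile is over 1 second in the last 3 minutes.
Service instance average response time is over 1 second in the last 2 minutes, and the instance name matches the regular expression.
Endpoint average response time is over 1 second in the last 2 minutes.
Database access average response time is over 1 second in the last 2 minutes.
Endpoint relation average response time is over 1 second in the last 2 minutes.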

You can see these predefined rules by opening the config/alarm-settings.yml file:

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    # service_sla is stored as a percentage multiplied by 100, so 8000 means 80%
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

Each xxx_rule entry is one rule. The webhooks section at the bottom lists the hooks used for alarm notification, which is covered further below.
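For orientation, enabling a webhook amounts to uncommenting an entry in that list. The fragment below is only a sketch: it reuses the placeholder address from the default file, which is not a real receiver and would need to be replaced with your own service URL.

```yaml
webhooks:
  # Placeholder address copied from the default file; point it at your own receiver.
  - http://127.0.0.1/notify/
```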

1-2. Meaning of the Rule Fields

service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.

Take the rule above as an example. It does not use every available field; the full list of fields and their meanings follows.

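Rule name: the name of the rule, which is also the unique name shown in the alarm message. It must end with _rule; the prefix is up to you.
Metrics name: the metric name, taken from the metric names defined in the OAL script. Only long, double and int types are currently supported. See the Official OAL script for details.
Include names: the entity names this rule applies to, such as service names or endpoint names (optional; defaults to all).
Exclude names: the entity names this rule does not apply to, such as service names or endpoint names (optional; defaults to none).
Threshold: the value that the metric is compared against.
OP: the comparison operator. Currently >, < and = are supported.
Period: the length of time, in minutes, over which the rule is evaluated. It is a time window, and it follows the clock of the environment where the backend is deployed.
Count: within one Period window, an alarm is sent once the number of values that cross Threshold (as judged by OP) reaches Count.
Silence period: after an alarm is triggered at time TN, no alarm is sent during TN -> TN + period. By default it equals Period, which means the same alarm (same Id under the same Metrics name) is triggered only once within one Period.
Message: the text of the alarm message.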
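Since the example rule leaves out the include and exclude fields, here is a minimal sketch of a rule that uses them. The rule name and the service names are hypothetical, and the key spellings include-names and exclude-names are assumed to match the SkyWalking version in use, so check them against your own alarm-settings.yml.

```yaml
rules:
  # Hypothetical rule name; it only needs to end with `_rule`.
  order_service_resp_time_rule:
    metrics-name: service_resp_time
    # Only these services are checked (hypothetical service names).
    include-names:
      - order-service
      - payment-service
    # This service is never checked by the rule (hypothetical service name).
    exclude-names:
      - legacy-batch-service
    op: ">"
    threshold: 1000
    # Evaluate the last 10 minutes of data.
    period: 10
    # Fire once 3 of those per-minute values exceed the threshold.
    count: 3
    # After firing, stay silent for the next 5 checks.
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
```

Read together, period, count and threshold mean that this sketch would alarm when the response time is above 1000 ms in at least 3 of the last 10 minutes, which is what the message text states.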

1-3. Alarm Notifications in the Console

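As shown below, when a service meets a rule set in the configuration file, the corresponding alarm information is displayed in the console.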