SkyWalking (9.7.0) Alerting Configuration


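We shipped a release the day before the Lunar New Year holiday, and everyone was happily getting ready to head home. Then the boss said: remember to bring your laptops and keep an eye on user feedback. If something urgent comes up, find a service area on the highway and fix it there.

But clever as I am, how could I let the boss find out that the service had a problem? I am still hoping for a year-end bonus after the holiday. The right move, obviously, is that when the service breaks, we developers quietly fix it behind the boss's back and act as if nothing happened.

Normally a few extra bugs are no big deal. It is software, after all, and some bugs are to be expected. But this is the critical moment that decides the year-end bonus, so we need the boss to believe our service is stable.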

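1. First, you need a SkyWalking instance

If you are able to play with K8S, see: Deploying SkyWalking in a K8S cluster - CSDN blog

If not, just run it locally: Starting SkyWalking locally and fixing the crash-on-startup problem - CSDN blog

Path of the alerting configuration file:

Open it and you will find some default rules. For what these rules do, see: Alerting | Apache SkyWalking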

rules:
  service_resp_time_rule:
    expression: sum(service_resp_time > 1000) >= 3
    period: 10
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    expression: sum(service_sla < 8000) >= 2
    period: 10
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    expression: sum(service_percentile{_='0,1,2,3,4'} > 1000) >= 3
    period: 10
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    expression: sum(service_instance_resp_time > 1000) >= 2
    period: 10
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    expression: sum(database_access_resp_time > 1000) >= 2
    period: 10
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    expression: sum(endpoint_relation_resp_time > 1000) >= 2
    period: 10
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes

2. Alert rule parameters

Alerting | Apache SkyWalking

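Rule name: the name of the rule. It must be unique and must end with _rule.

Expression: the alerting expression.

Include names: the list of entity names the rule applies to. In SkyWalking, entities come in several types; see Alerting | Apache SkyWalking

Pay attention to entity names here. When integrating the agent we usually set a namespace and a service group. For example, with SW_AGENT_NAMESPACE:"dev" and SW_AGENT_NAME:"dev::example-name", the service name should be written as dev::example-name|dev|. See: Table of Agent Configuration Properties | Apache SkyWalking

Exclude names: the list of entity names the rule does not apply to.

Include names regex: same as Include names, but given as a regular expression string.

Exclude names regex: same as Exclude names, but given as a regular expression string.

Tags: custom key-value pairs.

Period: how many minutes of metric values are cached for evaluating the expression, i.e. the time window the expression looks at.

Silence Period: the minimum interval between notifications. For example, if a rule triggers every minute and Silence Period is set to 3, a request is sent to the hook only once every 3 minutes.

Hooks: how notifications are sent to the outside world. Under the hood they are all webhooks.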
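The naming constraint on rule names is easy to check before restarting the OAP server. The following is a minimal sketch, not part of SkyWalking: `check_rule_names` is a hypothetical helper, and it checks only the two constraints stated above (the name is unique and ends with `_rule`).

```python
from collections import Counter
from typing import Iterable, List


def check_rule_names(names: Iterable[str]) -> List[str]:
    """Return human-readable problems; an empty list means the names look fine."""
    counts = Counter(names)  # a Counter keeps the first-seen order of its keys
    problems = []
    for name, count in counts.items():
        if count > 1:
            problems.append(f"duplicate rule name: {name}")
        if not name.endswith("_rule"):
            problems.append(f"rule name must end with _rule: {name}")
    return problems


if __name__ == "__main__":
    # Two names from this post plus one deliberately bad name.
    candidates = ["service_sla_custom_rule", "endpoint_sla", "service_sla_custom_rule"]
    for problem in check_rule_names(candidates):
        print(problem)
```

Feeding it the top-level keys under `rules:` is enough to catch a missing suffix before the server does.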

3. Anatomy of a rule

service_sla_custom_rule:
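    # service_sla is a built-in metric that the default rules also use; you can of course override the default rule.
    # sum((service_sla / 100) < 90) reads literally: the number of minutes in which the service SLA was below 90%.
    # The >= 4 is the key part. The expression is evaluated once a minute (my guess after reading the docs, it should be right),
    # so this means the SLA was below 90% in at least 4 minutes of the window set by period.
    expression: sum((service_sla / 100) < 90) >= 4

    # Exact string match
    include-names:
      - 'dev::example|dev|'

    # Regex form: everything in the dev group
    include-names-regex: '^dev::.*'

    # Size of the evaluation window in minutes. The expression is evaluated once a minute and needs >= 4 matches,
    # so period must be at least 4, otherwise the condition can never be met.
    period: 10

    # Notification silence period. If the SLA stays below 90 for 10 minutes, the first notification goes out at m4.
    # The next one would have come at m5, but because this is set to 2 it waits until the expression is evaluated again at m6.
    silence-period: 2

    # Custom tags in key-value form
    tags:
      level: ERROR

    # Notification text; parameters can be inserted as placeholders
    message: 'Service SLA is below 90%'

    # Hooks to notify through; if omitted, the default hooks are used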
    hooks:
      - '{hookType}.{hookName}'
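To see how expression, period and silence-period interact, here is a simplified model of the evaluation loop. It is a sketch under assumptions, not SkyWalking's implementation: it assumes one metric value per minute, a sliding window of `period` minutes, and a silence period that blocks notifications until `silence_period` minutes after the last one. The function name `firing_minutes` and the sample values are made up for illustration.

```python
from typing import List


def firing_minutes(sla_values: List[int], threshold_pct: float, min_hits: int,
                   period: int, silence_period: int) -> List[int]:
    """Return the 1-based minutes at which a notification would be sent.

    sla_values holds one service_sla value per minute, in SkyWalking's
    unit of percent * 100 (so 8500 means 85%).
    """
    fired = []
    next_allowed = 0  # earliest minute at which the next notification may go out
    for minute in range(1, len(sla_values) + 1):
        # Only the last `period` minutes are visible to the expression.
        window = sla_values[max(0, minute - period):minute]
        # Equivalent of sum((service_sla / 100) < threshold_pct)
        hits = sum(1 for value in window if value / 100 < threshold_pct)
        if hits >= min_hits and minute >= next_allowed:
            fired.append(minute)
            next_allowed = minute + silence_period
    return fired


if __name__ == "__main__":
    ten_bad_minutes = [8500] * 10  # SLA stuck at 85% for ten minutes
    print(firing_minutes(ten_bad_minutes, 90, 4, period=10, silence_period=2))
    print(firing_minutes(ten_bad_minutes, 90, 4, period=3, silence_period=2))
```

Under this model the rule above notifies at minutes 4, 6, 8 and 10, which matches the m4 and m6 behaviour described in the comments. With `period` set to 3 the window never holds four values, so nothing fires.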

4. Example rule definitions

4.1 SLA of any service below 100% in 3 of the last 5 minutes

service_success_rule:
  expression: sum((service_success / 100) < 100) >= 3
  period: 5
  silence-period: 5
  message: 'Service SLA is below 100%'

4.2 SLA of a single endpoint below 100% in 3 of the last 5 minutes

endpoint_sla_rule:
  expression: sum((endpoint_sla / 100) < 100) >= 3
  include-names:
    - 'GET:/test/custom1 in dev::example|dev|'
  period: 5
  message: 'SLA of this endpoint is below 100%'
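The entity names used in include-names follow a fixed pattern, so it can help to build them in code instead of typing them by hand. The helpers below are a sketch based only on the names that appear in this post (`dev::example|dev|` and `GET:/test/custom1 in dev::example|dev|`). The function names are hypothetical, and the case of an agent with a cluster configured or without a namespace is deliberately not handled.

```python
def service_entity(service_name: str, namespace: str) -> str:
    """Service entity name as seen in this post: '<service>|<namespace>|'."""
    return f"{service_name}|{namespace}|"


def endpoint_entity(endpoint: str, service_name: str, namespace: str) -> str:
    """Endpoint entity name as seen in this post: '<endpoint> in <service entity>'."""
    return f"{endpoint} in {service_entity(service_name, namespace)}"


if __name__ == "__main__":
    print(service_entity("dev::example", "dev"))
    print(endpoint_entity("GET:/test/custom1", "dev::example", "dev"))
```

The second call reproduces the include-names entry of the endpoint rule above.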

4.3 SLA of any database below 100% in 1 of the last 3 minutes

database_access_sla_rule:
  expression: sum((database_access_sla / 100) < 100) >= 1
  period: 3
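  message: 'DB SLA is below 100%'

5. Configuring hooks

In my tests, configuring Feishu directly resulted in only the first alert being delivered. Providing my own webhook that relays to Feishu worked fine. My guess is that the Feishu notification module built into SkyWalking has a problem; this still needs to be verified.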

5.1 Webhook

Alerting | Apache SkyWalking

A custom endpoint:

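import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.google.common.collect.ImmutableMap;
import lombok.Getter;
import lombok.Setter;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3+
import java.util.List;

@RestController
@RequestMapping("/alerting")
public class AlertingController {
    // I use Lark; for Feishu, change the host (open.feishu.cn).
    private static final String WEBHOOK_URL = "https://open.larksuite.com/open-apis/bot/v2/hook/<token>";

    // Requires a RestTemplate bean to be defined elsewhere in the application.
    @Resource
    private RestTemplate restTemplate;

    // SkyWalking posts a JSON array of alarm messages; forward each one to the bot as plain text.
    @PostMapping("skywalking")
    public void alert(@RequestBody List<AlarmMessage> alarmMessageList) {
        alarmMessageList.parallelStream().forEach(alarmMessage -> {
            String text = "Apache SkyWalking Alarm:\n\n" +
                    alarmMessage.getScope() + ": " + alarmMessage.getName() + "\n\n" +
                    alarmMessage.getAlarmMessage();

            ImmutableMap<String, Object> body = ImmutableMap.of(
                    "msg_type", "text",
                    "content", ImmutableMap.of("text", text)
            );

            restTemplate.postForEntity(WEBHOOK_URL, body, String.class);
        });
    }

    // Mirrors https://github.com/apache/skywalking/blob/master/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmMessage.java
    // The field names match the camelCase keys of the JSON payload, so no naming strategy or aliases are needed.
    @Getter
    @Setter
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class AlarmMessage {
        private int scopeId;
        private String scope;
        private String name;
        private String id0;
        private String id1;
        private String ruleName;
        private String alarmMessage;
        private List<Tag> tags;
        private long startTime;
        private List<String> hooks;
    }

    @Getter
    @Setter
    public static class Tag {
        private String key;
        private String value;
    }
}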

Configuration in the alarm settings file (config/alarm-settings.yml in the SkyWalking distribution):

hooks:
  webhook:
    default:
      # Mark this as the default hook
      is-default: true
      urls:
        - http://localhost:8080/alerting/skywalking
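For trying out the relay logic without a Spring application, the message building done by the controller can be sketched in a few lines of Python. This is an illustration only: `to_lark_payloads` is a hypothetical name, the sample alarm is made up, and the payload shape simply copies the `msg_type` / `content.text` structure used by the controller above.

```python
import json
from typing import Any, Dict, List


def to_lark_payloads(alarms: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn the JSON array posted by SkyWalking into one text payload per alarm."""
    payloads = []
    for alarm in alarms:
        text = (
            "Apache SkyWalking Alarm:\n\n"
            f"{alarm.get('scope')}: {alarm.get('name')}\n\n"
            f"{alarm.get('alarmMessage')}"
        )
        payloads.append({"msg_type": "text", "content": {"text": text}})
    return payloads


if __name__ == "__main__":
    # Hypothetical alarm with only the fields the formatter reads.
    sample = [{"scope": "SERVICE", "name": "dev::user-api|dev|", "alarmMessage": "SLA too low"}]
    print(json.dumps(to_lark_payloads(sample), indent=2))
```

Each resulting dictionary is what would be posted as JSON to the bot's webhook URL.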

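5.2 Feishu

Creating a bot is easy; pretty much anyone can do it.

Alerting | Apache SkyWalking

Custom bot guide - Development guides - Documentation - Lark Open Platform (larksuite.com)

Feishu group bot notification configuration: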

hooks:
  feishu:
    default:
      text-template: |-
        {
        "msg_type": "text",
        "content": {
          "text": "Apache SkyWalking Alarm: \n\n%s"
          }
        }
      webhooks:
        - url: https://open.larksuite.com/open-apis/bot/v2/hook/<token>
          secret: <secret>
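The text-template is a JSON document with a single `%s` placeholder for the alarm message. The sketch below shows how such a template could be rendered and checked locally. It assumes the placeholder is filled by plain string formatting; the JSON escaping of the message is an addition of this sketch and is not claimed to be what SkyWalking does. `render_template` is a hypothetical name.

```python
import json

# Same template as in the hook configuration; the raw string keeps the
# backslash-n sequences literal, exactly as they are in the YAML block scalar.
TEXT_TEMPLATE = r'''{
"msg_type": "text",
"content": {
  "text": "Apache SkyWalking Alarm: \n\n%s"
  }
}'''


def render_template(template: str, alarm_message: str) -> dict:
    """Fill the %s placeholder and parse the result to prove it is valid JSON."""
    # json.dumps escapes quotes and backslashes; [1:-1] drops the surrounding quotes.
    escaped = json.dumps(alarm_message)[1:-1]
    return json.loads(template % escaped)


if __name__ == "__main__":
    payload = render_template(TEXT_TEMPLATE, 'SLA of "dev::user-api|dev|" is low')
    print(payload["content"]["text"])
```

A template that fails `json.loads` here would also produce an invalid request body, so this is a cheap check to run before editing the hook configuration.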

6. Testing the alerts

  1. Start SkyWalking APM


  2. Attach the SkyWalking agent to the project

Add the skywalking-agent settings to the VM options:

-javaagent:E:\apache-skywalking-apm-bin\skywalking-agent\skywalking-agent.jar -Dskywalking.agent.service_name=dev::user-api -Dskywalking.agent.namespace=dev
  3. Provide the test endpoints

The two endpoints do the same thing and differ only in name. Each one returns the HTTP status given by the code request parameter.

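import io.swagger.annotations.Api;
import io.swagger.annotations.ApiOperation;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@Api(tags = "Test")
@RequestMapping("/test")
@Slf4j
public class TestController {
    // Responds with whatever HTTP status the caller passes in `code`.
    @ApiOperation(value = "Custom")
    @GetMapping("custom")
    public ResponseEntity<Object> custom(@RequestParam Integer code) {
        return ResponseEntity.status(code).build();
    }

    // Same behaviour under a second path, so endpoint-level rules can be told apart.
    @ApiOperation(value = "Custom")
    @GetMapping("custom1")
    public ResponseEntity<Object> custom1(@RequestParam Integer code) {
        return ResponseEntity.status(code).build();
    }
}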
  4. Write a Python script that calls the endpoints
import requests  
import time  
  
  
def call_api():  
    try:  
        print("Calling API at:", time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))  
        requests.get("http://127.0.0.1:10001/test/custom?code=500")  
        requests.get("http://127.0.0.1:10001/test/custom?code=200")  
        requests.get("http://127.0.0.1:10001/test/custom?code=400")  
        requests.get("http://127.0.0.1:10001/test/custom1?code=500")  
        requests.get("http://127.0.0.1:10001/test/custom1?code=200")  
        requests.get("http://127.0.0.1:10001/test/custom1?code=400")  
    except requests.RequestException as exc:
        # Connection or network failure; log it here and retry on the next round
        print("Error:", exc)
  
  
def main():  
    while True:  
        call_api()  
        time.sleep(10)  # call the endpoints every 10 seconds


if __name__ == "__main__":  
    main()


  5. Adjust the alert rules so they trigger more easily
rules:
  endpoint_sla2_rule:
    expression: sum((endpoint_sla / 100) < 99) >= 1
    period: 1
    silence-period: 1
    message: '{name} \n\n单接口异常'
  endpoint_sla3_rule:
    expression: sum((endpoint_sla / 100) < 99) >= 1
    include-names: 
      - 'GET:/test/custom in dev::user-api|dev|'
    period: 1
    silence-period: 1
    message: '{name} \n\n单接口异常include-names custom'
  endpoint_regex_rule:
    expression: sum((endpoint_sla / 100) < 99) >= 1
    include-names-regex: '.*custom.*'
    period: 1
    silence-period: 1
    message: '{name} \n\n正则匹配报警'

  endpoint_sla4_rule:
    expression: sum((endpoint_sla / 100) < 99) >= 1
    include-names: 
      - 'GET:/test/custom1 in dev::user-api|dev|'
    period: 1
    silence-period: 1
    message: '{name} \n\n单接口异常include-names custom1'
  service_success_rule:
    expression: sum((service_success / 100) < 99) >= 1
    # [Optional] Default, match all services in this metrics
    include-names:
      - 'dev::user-api|dev|'
    period: 1
    message: '{name}\n\n服务 SLA 低于 99%'

hooks:
  webhook:
    default:
      is-default: true
      urls:
        - http://localhost:11005/alerting/skywalking
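The rules above differ only in how they select entities, so the selection can be sketched separately. This is a simplified model, not SkyWalking code: `rule_applies` is a hypothetical helper, and it assumes that include-names is an exact match and that include-names-regex must match the whole entity name.

```python
import re
from typing import List, Optional


def rule_applies(entity: str,
                 include_names: Optional[List[str]] = None,
                 include_names_regex: Optional[str] = None) -> bool:
    """Return True if a rule with these filters would consider the entity."""
    if include_names and entity not in include_names:
        return False
    # Assumption: the regex has to cover the full name, hence fullmatch.
    if include_names_regex and not re.fullmatch(include_names_regex, entity):
        return False
    return True  # no filter configured means every entity of the metric matches


if __name__ == "__main__":
    custom = "GET:/test/custom in dev::user-api|dev|"
    custom1 = "GET:/test/custom1 in dev::user-api|dev|"
    for entity in (custom, custom1):
        print(entity,
              rule_applies(entity),                                 # like endpoint_sla2_rule
              rule_applies(entity, include_names=[custom]),         # like endpoint_sla3_rule
              rule_applies(entity, include_names_regex=".*custom.*"))  # like endpoint_regex_rule
```

Under these assumptions the rule without filters and the regex rule cover both endpoints, while the include-names rule covers only the endpoint it lists.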
  6. The request body received by the webhook endpoint
[
    {
        "scopeId": 3,
        "scope": "ENDPOINT",
        "name": "GET:/test/custom in dev::user-api|dev|",
        "id0": "ZGV2Ojp1c2VyLWFwaXxkZXZ8.1_R0VUOi90ZXN0L2N1c3RvbQ\u003d\u003d",
        "id1": "",
        "ruleName": "endpoint_sla2_rule",
        "alarmMessage": "GET:/test/custom in dev::user-api|dev| \\n\\n单接口异常",
        "tags": [],
        "startTime": 1709716162892,
        "hooks": [
            "webhook.default"
        ]
    },
    {
        "scopeId": 3,
        "scope": "ENDPOINT",
        "name": "GET:/test/custom1 in dev::user-api|dev|",
        "id0": "ZGV2Ojp1c2VyLWFwaXxkZXZ8.1_R0VUOi90ZXN0L2N1c3RvbTE\u003d",
        "id1": "",
        "ruleName": "endpoint_sla2_rule",
        "alarmMessage": "GET:/test/custom1 in dev::user-api|dev| \\n\\n单接口异常",
        "tags": [],
        "startTime": 1709716162892,
        "hooks": [
            "webhook.default"
        ]
    },
    {
        "scopeId": 3,
        "scope": "ENDPOINT",
        "name": "GET:/test/custom in dev::user-api|dev|",
        "id0": "ZGV2Ojp1c2VyLWFwaXxkZXZ8.1_R0VUOi90ZXN0L2N1c3RvbQ\u003d\u003d",
        "id1": "",
        "ruleName": "endpoint_sla3_rule",
        "alarmMessage": "GET:/test/custom in dev::user-api|dev| \\n\\n单接口异常include-names custom",
        "tags": [],
        "startTime": 1709716162893,
        "hooks": [
            "webhook.default"
        ]
    },
    {
        "scopeId": 3,
        "scope": "ENDPOINT",
        "name": "GET:/test/custom in dev::user-api|dev|",
        "id0": "ZGV2Ojp1c2VyLWFwaXxkZXZ8.1_R0VUOi90ZXN0L2N1c3RvbQ\u003d\u003d",
        "id1": "",
        "ruleName": "endpoint_regex_rule",
        "alarmMessage": "GET:/test/custom in dev::user-api|dev| \\n\\n正则匹配报警",
        "tags": [],
        "startTime": 1709716162893,
        "hooks": [
            "webhook.default"
        ]
    },
    {
        "scopeId": 3,
        "scope": "ENDPOINT",
        "name": "GET:/test/custom1 in dev::user-api|dev|",
        "id0": "ZGV2Ojp1c2VyLWFwaXxkZXZ8.1_R0VUOi90ZXN0L2N1c3RvbTE\u003d",
        "id1": "",
        "ruleName": "endpoint_regex_rule",
        "alarmMessage": "GET:/test/custom1 in dev::user-api|dev| \\n\\n正则匹配报警",
        "tags": [],
        "startTime": 1709716162893,
        "hooks": [
            "webhook.default"
        ]
    },
    {
        "scopeId": 3,
        "scope": "ENDPOINT",
        "name": "GET:/test/custom1 in dev::user-api|dev|",
        "id0": "ZGV2Ojp1c2VyLWFwaXxkZXZ8.1_R0VUOi90ZXN0L2N1c3RvbTE\u003d",
        "id1": "",
        "ruleName": "endpoint_sla4_rule",
        "alarmMessage": "GET:/test/custom1 in dev::user-api|dev| \\n\\n单接口异常include-names custom1",
        "tags": [],
        "startTime": 1709716162893,
        "hooks": [
            "webhook.default"
        ]
    },
    {
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "dev::user-api|dev|",
        "id0": "ZGV2Ojp1c2VyLWFwaXxkZXZ8.1",
        "id1": "",
        "ruleName": "service_success_rule",
        "alarmMessage": "dev::user-api|dev|\\n\\n服务 SLA 低于 99%",
        "tags": [],
        "startTime": 1709716162893,
        "hooks": [
            "webhook.default"
        ]
    }
]
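The id0 values in this payload look opaque but can be decoded. Judging only from the payload above, a service id is the Base64 of the service name followed by `.1`, and an endpoint id appends `_` plus the Base64 of the endpoint name; the `\u003d` sequences are just the JSON escape for `=`. The sketch below relies on that inferred layout, and `decode_id0` is a hypothetical helper.

```python
import base64
from typing import Optional, Tuple


def decode_id0(id0: str) -> Tuple[str, Optional[str]]:
    """Split an id0 into (service name, endpoint name or None)."""
    # Standard Base64 never contains '_' or '.', so both are safe separators.
    service_part, _, endpoint_part = id0.partition("_")
    service_b64 = service_part.split(".")[0]  # drop the trailing '.1' flag
    service = base64.b64decode(service_b64).decode("utf-8")
    endpoint = base64.b64decode(endpoint_part).decode("utf-8") if endpoint_part else None
    return service, endpoint


if __name__ == "__main__":
    # Values copied from the payload, with \u003d already unescaped to '='.
    print(decode_id0("ZGV2Ojp1c2VyLWFwaXxkZXZ8.1_R0VUOi90ZXN0L2N1c3RvbQ=="))
    print(decode_id0("ZGV2Ojp1c2VyLWFwaXxkZXZ8.1"))
```

Decoding the ids this way gives back the same service and endpoint names that appear in the `name` field of each alarm.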


Ref

Alerting | Apache SkyWalking

skywalking.apache.org/docs/main/v…

Analysis Native Streaming Traces and Service Mesh Traffic | Apache SkyWalking

skywalking/docs/en/setup/backend/backend-alarm.md at master · apache/skywalking (github.com)

skywalking/docs/en/api/metrics-query-expression.md at master · apache/skywalking (github.com)

Custom bot guide - Development guides - Documentation - Lark Open Platform (larksuite.com)

Table of Agent Configuration Properties | Apache SkyWalking