Integrating Microservices with the SkyWalking Monitoring and Alerting System

  • [Installation]
  • [UI Guide]
  • [Alerting]

I. Installation

Environment used in this article

  1. Microservices built on Spring Cloud Alibaba
  2. SkyWalking 8.6.0

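SkyWalking directory layout

agent: the directory that client applications point to; it contains the agent jar, which attaches to the application and collects its monitoring data
bin: server startup scripts
config: configuration files
logs: OAP server logs
oap-libs: dependencies required by the OAP server
webapp: the UI service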

Deployment steps

  1. Edit /config/application.yml: switch cluster coordination (the registry) to Nacos and the storage backend to Elasticsearch 7 (ES7)
cluster:
  selector: ${SW_CLUSTER:nacos}
  ....
  
  nacos:
    serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
    hostPort: ${SW_CLUSTER_NACOS_HOST_PORT:192.168.1.1:8848}
    # Nacos Configuration namespace
    namespace: ${SW_CLUSTER_NACOS_NAMESPACE:"13527466-83ab-47dd-a401-e19f343bf03c"}
    # Nacos auth username
    username: ${SW_CLUSTER_NACOS_USERNAME:"nacos"}
    password: ${SW_CLUSTER_NACOS_PASSWORD:"123456"}
    # Nacos auth accessKey
    accessKey: ${SW_CLUSTER_NACOS_ACCESSKEY:""}
    secretKey: ${SW_CLUSTER_NACOS_SECRETKEY:""}
    
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  ...
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:"sw_test"}
    #clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    clusterNodes: 172.21.54.145:9200,172.21.54.145:9201,172.21.54.145:9202
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    dayStep: ${SW_STORAGE_DAY_STEP:30} # Represent the number of days in the one minute/hour/day index.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:9} # Shard number of new indexes
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1} # Replicas number of new indexes
    # Super data set has been defined in the codes, such as trace segments.The following 3 config would be improve es performance when storage super size data in es.
    superDatasetDayStep: ${SW_SUPERDATASET_STORAGE_DAY_STEP:-1} # Represent the number of days in the super size dataset record index, the default value is the same as dayStep when the value is less than 0
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} #  This factor provides more shards for the super data set, shards number = indexShardsNumber * superDatasetIndexShardsFactor. Also, this factor effects Zipkin and Jaeger traces.
    superDatasetIndexReplicasNumber: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_REPLICAS_NUMBER:0} # Represent the replicas number in the super size dataset record index, the default value is 0.
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the async bulk record data every ${SW_STORAGE_ES_BULK_ACTIONS} requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    oapAnalyzer: ${SW_STORAGE_ES_OAP_ANALYZER:"{\"analyzer\":{\"oap_analyzer\":{\"type\":\"stop\"}}}"} # the oap analyzer.
    oapLogAnalyzer: ${SW_STORAGE_ES_OAP_LOG_ANALYZER:"{\"analyzer\":{\"oap_log_analyzer\":{\"type\":\"standard\"}}}"} # the oap log analyzer. It could be customized by the ES analyzer configuration to support more language log formats, such as Chinese log, Japanese log and etc.
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
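  2. Change the UI startup port under the webapp directory
  3. Start the services with ./start.sh
  4. Add the agent to the startup command of each Java service
# -Dskywalking.agent.service_name: service name shown in SkyWalking
# -Dskywalking.collector.backend_service: address of the SkyWalking OAP server
java -javaagent:/data/apache-skywalking-apm-bin/agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=admin \
  -Dskywalking.collector.backend_service=172.21.54.147:11800 \
  -jar app.jar
  5. Spring Cloud Gateway is not monitored by default; find its plugin in /agent/optional-plugins and copy it into the /agent/plugins directory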
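
The application.yml entries in step 1 are written as `${NAME:default}` placeholders: SkyWalking uses the value of the environment variable (or system property) NAME when it is set and the default otherwise. A literal value, like the clusterNodes line in the listing, bypasses that override. The sketch below only illustrates this lookup rule; `resolve` and its regular expression are hypothetical helpers written for this article, not SkyWalking's own parser.

```python
import os
import re

# Matches ${NAME:default}. The default may itself contain ':' (e.g. host:port).
# Limitation of this sketch: defaults containing '}' (such as the oapAnalyzer
# JSON string) are not handled.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_]+):([^}]*)\}")


def resolve(value, env):
    """Replace each ${NAME:default} with env[NAME], or with the default if unset."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


if __name__ == "__main__":
    # Resolved against the real environment, so the output depends on your shell.
    print(resolve("${SW_CLUSTER:nacos}", os.environ))
    print(resolve("${SW_CLUSTER_NACOS_HOST_PORT:192.168.1.1:8848}", os.environ))
```

In practice this means a deployment can keep the shipped application.yml and export variables such as SW_CLUSTER or SW_STORAGE before starting the OAP server, instead of editing the file.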


II. UI Guide

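Dashboard: shows the runtime status of the monitored services:

  • Global

    Services load: requests per minute for each service
    Slow Services: slowest-responding services, in ms
    Un-Health services (Apdex): Apdex score, where 1 is the best possible score
    Slow Endpoint: slowest-responding endpoints, in ms
    Global Response Latency: response latency at different percentiles, in ms
    Global Heatmap: heat map of service response times; the color depth reflects how many requests fell into each response-time range during the period
    Bottom bar: shows the time range of the displayed data; click it to adjust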

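  • Service

    Service Apdex (number): current Apdex score of the service
    Service Apdex (line chart): Apdex score over time
    Service Avg Response Times: average response latency, in ms
    Service Response Time Percentile: response latency at different percentiles
    Successful Rate (number): request success rate
    Successful Rate (line chart): request success rate over time
    Service Load (number): requests per minute
    Service Load (line chart): requests per minute over time
    Service Instances Load: requests per minute for each service instance
    Slow Service Instance: maximum latency of each service instance
    Service Instance Successful Rate: request success rate of each service instance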

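  • Instance

    Service Instance Load: requests per minute for the current instance
    Service Instance Successful Rate: request success rate of the current instance
    Service Instance Latency: response latency of the current instance
    JVM CPU: percentage of CPU used by the JVM
    JVM Memory: JVM memory usage, in MB
    JVM GC Time: JVM garbage-collection time, covering young GC (YGC) and old GC (OGC)
    JVM GC Count: number of JVM garbage collections, covering YGC and OGC
    JVM Thread Count: number of JVM threads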

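  • Endpoint

    Endpoint Load in Current Service: requests per minute for each endpoint
    Slow Endpoints in Current Service: slowest request time of each endpoint, in ms
    Successful Rate in Current Service: request success rate of each endpoint
    Endpoint Load: request count of the current endpoint over time
    Endpoint Avg Response Time: average response time of the current endpoint over time
    Endpoint Response Time Percentile: response-time percentiles of the current endpoint over time
    Endpoint Successful Rate: request success rate of the current endpoint over time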
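
The Apdex score on the Global and Service panels follows the standard Apdex definition: requests answered within a threshold T count as satisfied, those within 4T count as tolerating (weighted one half), and everything slower counts as frustrated. Below is a minimal sketch of that formula. The function name and the sample latencies are made up for illustration, and the 500 ms threshold is an assumption; check the threshold configured in your own deployment.

```python
def apdex(response_times_ms, threshold_ms=500):
    """Standard Apdex: (satisfied + tolerating / 2) / total.

    threshold_ms is the target response time T (500 ms is an assumed value).
    """
    if not response_times_ms:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(
        1 for t in response_times_ms if threshold_ms < t <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(response_times_ms)


if __name__ == "__main__":
    # 4 satisfied, 2 tolerating, 2 frustrated -> (4 + 1) / 8
    samples = [120, 300, 480, 450, 900, 1500, 2500, 3000]
    print(apdex(samples))  # 0.625
```

A score close to 1 means almost every request finished within T; a service dominated by slow requests drifts toward 0, which is why it shows up in the Un-Health services panel.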

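Topology: shows the relationships between services as a topology graph, which also serves as an entry point to related details
Trace: lists requests by endpoint and lets you trace the internal call chain of each one
Profile: samples an endpoint for performance analysis and shows the stack information
Alarm: list of triggered alarms, such as service failure rate and request timeouts
Reload: refreshes the data on the current page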

III. Alerting

This example uses a WeCom (Enterprise WeChat) group robot

  1. Default configuration: /config/alarm-settings.yml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
 
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
 
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

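Notes:

  • service_resp_time_rule: the service's average response time exceeded 1 second in at least 3 of the last 10 minutes
  • service_sla_rule: the service's success rate was below 80% in at least 2 of the last 10 minutes (the metric is the percentage multiplied by 100, so the threshold 8000 means 80%)
  • service_p90_sla_rule: the service's 90th-percentile response time exceeded 1 second in at least 3 of the last 10 minutes
  • service_instance_resp_time_rule: the service instance's average response time exceeded 1 second in at least 2 of the last 10 minutes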
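
How period, count and silence-period interact is easier to see in code. The sketch below is a simplified model written for this article, not SkyWalking's implementation: it assumes one metric value per minute, keeps the last `period` values, fires when at least `count` of them match the condition, and then stays quiet for `silence-period` checks. The class name `AlarmRule` and the sample values are hypothetical.

```python
from collections import deque

# Comparison operators that a rule's `op` field is assumed to select from.
OPS = {
    ">": lambda value, threshold: value > threshold,
    ">=": lambda value, threshold: value >= threshold,
    "<": lambda value, threshold: value < threshold,
    "<=": lambda value, threshold: value <= threshold,
}


class AlarmRule:
    """Simplified model of one rule in alarm-settings.yml."""

    def __init__(self, op, threshold, period, count, silence_period):
        self.matches = OPS[op]
        self.threshold = threshold
        self.count = count
        self.silence_period = silence_period
        self.window = deque(maxlen=period)  # last `period` per-minute values
        self.silence_left = 0

    def check(self, value):
        """Feed one per-minute value; return True if an alarm fires now."""
        self.window.append(value)
        if self.silence_left > 0:  # still silenced after an earlier alarm
            self.silence_left -= 1
            return False
        hits = sum(1 for v in self.window if self.matches(v, self.threshold))
        if hits >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


if __name__ == "__main__":
    # Same numbers as service_resp_time_rule: > 1000 ms, 3 hits in 10 minutes.
    rule = AlarmRule(">", 1000, period=10, count=3, silence_period=5)
    values = [1200, 800, 1500, 1100, 1300, 1300, 1300, 1300, 1300, 1300]
    print([minute for minute, v in enumerate(values) if rule.check(v)])  # [3, 9]
```

The first alarm fires at minute 3, when the third slow value arrives. Minutes 4 to 8 are silenced even though the condition still holds, and the alarm fires again at minute 9.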
  2. Send alarms to the WeCom robot by adding the following to alarm-settings.yml
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s."
      }
    }
  webhooks:
    - https://xxxx
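
To check a robot URL before wiring it into SkyWalking, you can build the same kind of payload by hand. The sketch below assumes that `%s` in textTemplate is replaced by the alarm message and that the result is POSTed as JSON to each webhook URL. `render_payload` and `send` are hypothetical helpers, and the robot URL is something you supply yourself.

```python
import json
import urllib.request

# Same shape as the textTemplate in the config; the raw string keeps "\n" as the
# two-character JSON escape, exactly as a YAML literal block does.
TEXT_TEMPLATE = r"""{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking alarm: \n %s."
  }
}"""


def render_payload(alarm_message):
    """Fill the template's %s, escaping the message so the result stays valid JSON."""
    escaped = json.dumps(alarm_message, ensure_ascii=False)[1:-1]
    payload = TEXT_TEMPLATE % escaped
    json.loads(payload)  # raises ValueError if the rendered text is not valid JSON
    return payload


def send(webhook_url, payload, timeout=5):
    """POST the payload to the robot webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only renders the payload; call send(<your robot URL>, payload) to deliver it.
    print(render_payload("Response time of service admin is more than 1000ms"))
```

Escaping the message before substitution matters because an alarm text containing a double quote would otherwise break the JSON body.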