[Installation]
[UI Guide]
[Alerting]

1. Installation

Environment used in this article

Microservices built on Spring Cloud Alibaba
SkyWalking 8.6.0

SkyWalking directory structure

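agent: the agent directory that each client application points to; it contains the jar that attaches to the application and collects its monitoring data
bin: startup scripts for the server side
config: configuration files
logs: logs of the OAP service
oap-libs: dependencies required by the OAP service
webapp: the UI service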

Deployment steps

Edit /config/application.yml: set the cluster registry to Nacos and the storage to Elasticsearch 7.

cluster:
  selector: ${SW_CLUSTER:nacos}
  ...

  nacos:
    serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
    hostPort: ${SW_CLUSTER_NACOS_HOST_PORT:192.168.1.1:8848}
    # Nacos Configuration namespace
    namespace: ${SW_CLUSTER_NACOS_NAMESPACE:"13527466-83ab-47dd-a401-e19f343bf03c"}
    # Nacos auth username
    username: ${SW_CLUSTER_NACOS_USERNAME:"nacos"}
    password: ${SW_CLUSTER_NACOS_PASSWORD:"123456"}
    # Nacos auth accessKey
    accessKey: ${SW_CLUSTER_NACOS_ACCESSKEY:""}
    secretKey: ${SW_CLUSTER_NACOS_SECRETKEY:""}
    
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  ...
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:"sw_test"}
    #clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    clusterNodes: 172.21.54.145:9200,172.21.54.145:9201,172.21.54.145:9202
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    dayStep: ${SW_STORAGE_DAY_STEP:30} # Represent the number of days in the one minute/hour/day index.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:9} # Shard number of new indexes
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1} # Replicas number of new indexes
    # Super data set has been defined in the codes, such as trace segments.The following 3 config would be improve es performance when storage super size data in es.
    superDatasetDayStep: ${SW_SUPERDATASET_STORAGE_DAY_STEP:-1} # Represent the number of days in the super size dataset record index, the default value is the same as dayStep when the value is less than 0
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} #  This factor provides more shards for the super data set, shards number = indexShardsNumber * superDatasetIndexShardsFactor. Also, this factor effects Zipkin and Jaeger traces.
    superDatasetIndexReplicasNumber: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_REPLICAS_NUMBER:0} # Represent the replicas number in the super size dataset record index, the default value is 0.
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the async bulk record data every ${SW_STORAGE_ES_BULK_ACTIONS} requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    oapAnalyzer: ${SW_STORAGE_ES_OAP_ANALYZER:"{\"analyzer\":{\"oap_analyzer\":{\"type\":\"stop\"}}}"} # the oap analyzer.
    oapLogAnalyzer: ${SW_STORAGE_ES_OAP_LOG_ANALYZER:"{\"analyzer\":{\"oap_log_analyzer\":{\"type\":\"standard\"}}}"} # the oap log analyzer. It could be customized by the ES analyzer configuration to support more language log formats, such as Chinese log, Japanese log and etc.
    advanced: ${SW_STORAGE_ES_ADVANCED:""}

Change the UI startup port in the webapp directory.
Start the services: ./start.sh
Add the agent to the startup command of each Java service:

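# skywalking.agent.service_name: the service name shown in SkyWalking
# skywalking.collector.backend_service: address of the SkyWalking OAP service
java -javaagent:/data/apache-skywalking-apm-bin/agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=admin \
  -Dskywalking.collector.backend_service=172.21.54.147:11800 \
  -jar app.jar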
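
When several services are started by a script, the same command can be assembled from parameters. The following is only a sketch: the helper name `build_agent_command` is hypothetical, and the paths and addresses are the ones used in the command above.

```python
def build_agent_command(agent_jar, service_name, backend_service, app_jar):
    """Build the java command line that attaches the SkyWalking agent.

    The JVM options (-javaagent and -D...) must come before -jar,
    otherwise they are passed to the application as program arguments.
    """
    return [
        "java",
        f"-javaagent:{agent_jar}",
        f"-Dskywalking.agent.service_name={service_name}",
        f"-Dskywalking.collector.backend_service={backend_service}",
        "-jar",
        app_jar,
    ]


if __name__ == "__main__":
    cmd = build_agent_command(
        "/data/apache-skywalking-apm-bin/agent/skywalking-agent.jar",
        "admin",
        "172.21.54.147:11800",
        "app.jar",
    )
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd) to actually start the service.
    print(" ".join(cmd))
```
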

Spring Cloud Gateway is not monitored by default. To enable it, find the gateway plugin in /agent/optional-plugins and copy it to /agent/plugins.

2. UI Guide

Dashboard: shows the running status of the monitored services:

Global

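Services load: requests per minute for each service
Slow Services: services with slow responses, in ms
Un-Health services (Apdex): Apdex performance score, where 1 is the best possible score
Slow Endpoint: endpoints with slow responses, in ms
Global Response Latency: response latency at different percentiles, in ms
Global Heatmap: heatmap of service response times; the color depth reflects how many requests fell into each response-time range during the period
Bottom bar: the time range of the displayed data; click it to adjust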
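
The Apdex score mentioned above follows the standard Apdex formula: requests at or below a threshold T count as satisfied, requests between T and 4T count as half, and slower requests count as zero. The sketch below only illustrates that formula; the 500 ms default threshold is an assumption here, and SkyWalking's own calculation may treat some cases (for example failed requests) differently.

```python
def apdex(response_times_ms, threshold_ms=500):
    """Return the Apdex score (0.0 to 1.0) for a list of response times.

    satisfied:  t <= T
    tolerating: T < t <= 4T   (counts as half)
    frustrated: t > 4T        (counts as zero)
    Returns None when there are no samples.
    """
    if not response_times_ms:
        return None
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(
        1 for t in response_times_ms if threshold_ms < t <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(response_times_ms)


if __name__ == "__main__":
    # Two satisfied, one tolerating, one frustrated request.
    print(apdex([100, 200, 600, 3000]))
```
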
Service

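Service Apdex (number): current Apdex score of the service
Service Apdex (line chart): Apdex score over time
Service Avg Response Times: average response latency, in ms
Service Response Time Percentile: response latency at different percentiles
Successful Rate (number): request success rate
Successful Rate (line chart): request success rate over time
Service Load (number): requests per minute
Service Load (line chart): requests per minute over time
Service Instances Load: requests per minute for each service instance
Slow Service Instance: maximum latency of each service instance
Service Instance Successful Rate: request success rate of each service instance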
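
Several of the panels above show percentile latency (for example p50, p90, p99). As a reminder of what such a number means, here is a minimal nearest-rank percentile sketch. It is an illustration only and is not how SkyWalking computes its percentile metrics internally; the sample latencies are made up.

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = [120, 80, 950, 200, 150, 300, 100, 90, 110, 2000]
    for p in (50, 90, 99):
        print(f"p{p} = {percentile(samples, p)} ms")
```

A p90 of 950 ms in this sample means that 90% of the requests finished in 950 ms or less, even though the average is pulled up by a single 2000 ms request.
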
Instance

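Service Instance Load: requests per minute for the current instance
Service Instance Successful Rate: request success rate of the current instance
Service Instance Latency: response latency of the current instance
JVM CPU: percentage of CPU used by the JVM
JVM Memory: JVM memory usage, in MB
JVM GC Time: JVM garbage collection time, covering both young GC and old GC
JVM GC Count: number of JVM garbage collections, covering both young GC and old GC
JVM Thread Count: number of JVM threads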
Endpoint

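Endpoint Load in Current Service: requests per minute for each endpoint
Slow Endpoints in Current Service: slowest request time of each endpoint, in ms
Successful Rate in Current Service: request success rate of each endpoint
Endpoint Load: request count of the current endpoint in each time period
Endpoint Avg Response Time: average response time of the current endpoint in each time period
Endpoint Response Time Percentile: response time percentiles of the current endpoint in each time period
Endpoint Successful Rate: request success rate of the current endpoint in each time period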

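Topology: shows the relationships between services as a topology graph, which also serves as an entry point to related information
Trace: lists traced requests and shows the internal call chain of each one
Profile: samples an endpoint for analysis and shows stack information
Alarm: the list of triggered alarms, such as service failure rate and request timeouts
Reload (auto refresh): refreshes the data on the current page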

3. Alerting

This example uses a WeCom (Enterprise WeChat) group robot.

Default configuration: /config/alarm-settings.yml

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
 
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
 
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

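Explanation:

service_resp_time_rule: the average response time of a service exceeded 1 second in at least 3 of the last 10 minutes
service_sla_rule: the success rate of a service was below 80% in at least 2 of the last 10 minutes (the threshold 8000 corresponds to 80%)
service_p90_sla_rule: the 90th percentile response time of a service exceeded 1 second in at least 3 of the last 10 minutes
service_instance_resp_time_rule: the average response time of a service instance exceeded 1 second in at least 2 of the last 10 minutes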
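
The period/count semantics of these rules can be sketched as follows. This is a simplified model written for illustration, not SkyWalking's implementation: the function name `should_alarm` is hypothetical, only the `>` and `<` operators used in the rules above are handled, and `silence-period` is not modeled.

```python
import operator

# Only the two operators that appear in the sample rules.
OPS = {">": operator.gt, "<": operator.lt}


def should_alarm(values, op, threshold, period, count):
    """Return True when at least `count` of the last `period` per-minute
    values satisfy `value <op> threshold`.

    values: per-minute metric values, oldest first.
    """
    window = values[-period:]
    matches = sum(1 for v in window if OPS[op](v, threshold))
    return matches >= count


if __name__ == "__main__":
    # service_resp_time_rule: > 1000 ms in 3 of the last 10 minutes.
    resp_times = [200, 1200, 300, 1500, 400, 1100, 500, 600, 700, 800]
    print(should_alarm(resp_times, ">", 1000, period=10, count=3))

    # service_sla_rule: success rate < 80% (8000) in 2 of the last 10 minutes.
    sla = [9900] * 8 + [7500, 7900]
    print(should_alarm(sla, "<", 8000, period=10, count=2))
```
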

Sending alarms to the WeCom robot

wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - https://xxxx
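
To check what the robot will receive, the template can be rendered locally before the configuration is deployed. The sketch below assumes the `%s` placeholder is replaced by the alarm message; the function name `render_wechat_payload` is hypothetical, and escaping the message before substitution is a precaution of this sketch, not something confirmed about SkyWalking's behavior.

```python
import json

# Same shape as the textTemplate above. A raw string keeps "\n" as the
# two characters backslash + n, which JSON later decodes to a newline.
TEXT_TEMPLATE = r'''{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking Alarm: \n %s."
  }
}'''


def render_wechat_payload(template, alarm_message):
    """Substitute the alarm message into the template and return JSON text."""
    # JSON-escape the message (quotes, backslashes) and drop the
    # surrounding double quotes added by json.dumps.
    escaped = json.dumps(alarm_message)[1:-1]
    return template % (escaped,)


if __name__ == "__main__":
    payload = render_wechat_payload(
        TEXT_TEMPLATE,
        'Response time of service "admin" is more than 1000ms',
    )
    # json.loads raises an error if the rendered payload is not valid JSON.
    print(json.loads(payload)["text"]["content"])
```
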

Integrating microservices with the SkyWalking monitoring and alerting system