- [Installation]
- [UI Guide]
- [Alerting]
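I. Installation
Environment used in this article
- Microservices built on Spring Cloud Alibaba
- SkyWalking 8.6.0
SkyWalking directory structure
agent: the directory that client applications point to; it contains a jar (skywalking-agent.jar) that attaches to the client application and collects its telemetry, such as traces and logs
bin: startup scripts for the server side
config: configuration files
logs: log directory of the OAP service
oap-libs: dependencies required by the OAP service
webapp: the UI service
Deployment steps
- Edit /config/application.yml: set the cluster coordinator to nacos and the storage backend to ES7 (elasticsearch7)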
cluster:
  selector: ${SW_CLUSTER:nacos}
  ....
  nacos:
    serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
    hostPort: ${SW_CLUSTER_NACOS_HOST_PORT:192.168.1.1:8848}
    # Nacos Configuration namespace
    namespace: ${SW_CLUSTER_NACOS_NAMESPACE:"13527466-83ab-47dd-a401-e19f343bf03c"}
    # Nacos auth username
    username: ${SW_CLUSTER_NACOS_USERNAME:"nacos"}
    password: ${SW_CLUSTER_NACOS_PASSWORD:"123456"}
    # Nacos auth accessKey
    accessKey: ${SW_CLUSTER_NACOS_ACCESSKEY:""}
    secretKey: ${SW_CLUSTER_NACOS_SECRETKEY:""}
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  ...
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:"sw_test"}
    #clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    clusterNodes: 172.21.54.145:9200,172.21.54.145:9201,172.21.54.145:9202
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    dayStep: ${SW_STORAGE_DAY_STEP:30} # Represent the number of days in the one minute/hour/day index.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:9} # Shard number of new indexes
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:1} # Replicas number of new indexes
    # Super data set has been defined in the codes, such as trace segments. The following 3 config would be improve es performance when storage super size data in es.
    superDatasetDayStep: ${SW_SUPERDATASET_STORAGE_DAY_STEP:-1} # Represent the number of days in the super size dataset record index, the default value is the same as dayStep when the value is less than 0
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} # This factor provides more shards for the super data set, shards number = indexShardsNumber * superDatasetIndexShardsFactor. Also, this factor effects Zipkin and Jaeger traces.
    superDatasetIndexReplicasNumber: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_REPLICAS_NUMBER:0} # Represent the replicas number in the super size dataset record index, the default value is 0.
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the async bulk record data every ${SW_STORAGE_ES_BULK_ACTIONS} requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    oapAnalyzer: ${SW_STORAGE_ES_OAP_ANALYZER:"{\"analyzer\":{\"oap_analyzer\":{\"type\":\"stop\"}}}"} # the oap analyzer.
    oapLogAnalyzer: ${SW_STORAGE_ES_OAP_LOG_ANALYZER:"{\"analyzer\":{\"oap_log_analyzer\":{\"type\":\"standard\"}}}"} # the oap log analyzer. It could be customized by the ES analyzer configuration to support more language log formats, such as Chinese log, Japanese log and etc.
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
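The shard settings in the elasticsearch7 block multiply together, which is easy to overlook when sizing the ES cluster. The sketch below only works through the arithmetic stated in the config comments (shards number = indexShardsNumber * superDatasetIndexShardsFactor) with the values used here; the function names are made up for illustration and are not part of SkyWalking.

```python
def super_dataset_primary_shards(index_shards_number: int, shards_factor: int) -> int:
    """Primary shards of one super-dataset index (for example trace segments)."""
    return index_shards_number * shards_factor


def shard_copies(primary_shards: int, replicas: int) -> int:
    """Primary plus replica shard copies that ES must place for one index."""
    return primary_shards * (1 + replicas)


# Values taken from the configuration above.
INDEX_SHARDS_NUMBER = 9
INDEX_REPLICAS_NUMBER = 1
SUPER_DATASET_SHARDS_FACTOR = 5
SUPER_DATASET_REPLICAS_NUMBER = 0

normal_copies = shard_copies(INDEX_SHARDS_NUMBER, INDEX_REPLICAS_NUMBER)
super_primaries = super_dataset_primary_shards(INDEX_SHARDS_NUMBER, SUPER_DATASET_SHARDS_FACTOR)
super_copies = shard_copies(super_primaries, SUPER_DATASET_REPLICAS_NUMBER)

print(f"normal index: {normal_copies} shard copies")
print(f"super dataset index: {super_primaries} primaries, {super_copies} shard copies")
```

With these values each regular index needs 18 shard copies and each super-dataset index needs 45, so a three-node ES cluster carries a noticeable shard count per index.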
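- Change the UI startup port in the webapp directory (in 8.x this is server.port in webapp.yml)
- Start the services with the script in the bin directory: ./startup.sh
- Java services need the agent added to their startup command. Here skywalking.agent.service_name is the service name shown in the SkyWalking UI, and skywalking.collector.backend_service is the address of the SkyWalking OAP service:
java -javaagent:/data/apache-skywalking-apm-bin/agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=admin \
  -Dskywalking.collector.backend_service=172.21.54.147:11800 \
  -jar app.jar
- Spring Cloud Gateway is not monitored by default; find its plugin under /agent/optional-plugins and copy it into /agent/plugins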
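The plugin copy in the last step can be scripted so that it is repeatable across hosts. This is a minimal sketch: the function name and the `*gateway*.jar` glob are assumptions for illustration, since the exact jar name depends on the SkyWalking and Spring Cloud Gateway versions; check the contents of optional-plugins before relying on the pattern.

```python
import shutil
from pathlib import Path


def enable_optional_plugins(agent_dir: str, pattern: str = "*gateway*.jar") -> list:
    """Copy matching jars from <agent_dir>/optional-plugins to <agent_dir>/plugins.

    The default pattern is a guess, not an official name; adjust it to the
    jar that actually ships with your agent version.
    """
    agent = Path(agent_dir)
    source = agent / "optional-plugins"
    target = agent / "plugins"
    copied = []
    for jar in sorted(source.glob(pattern)):
        shutil.copy2(jar, target / jar.name)  # overwrites an older copy if present
        copied.append(jar.name)
    return copied


if __name__ == "__main__":
    # Agent path taken from the java command above.
    agent_home = "/data/apache-skywalking-apm-bin/agent"
    if Path(agent_home).is_dir():
        print(enable_optional_plugins(agent_home))
```

The application has to be restarted afterwards, because the agent loads its plugins at JVM startup.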
II. UI Guide
Dashboard: shows the runtime status of the monitored services:
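- Global
  - Services Load: requests per minute for each service
  - Slow Services: services with the slowest responses, in ms
  - Un-Health Services (Apdex): Apdex performance score, where 1 is a perfect score
  - Slow Endpoint: endpoints with the slowest responses, in ms
  - Global Response Latency: response latency at different percentiles, in ms
  - Global Heatmap: heatmap of service response times; the color depth reflects how many requests fell into each response-time range during the time slot
  - Bottom bar: shows the time range of the displayed data; click it to adjust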
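- Service
  - Service Apdex (number): Apdex score of the current service
  - Service Apdex (line chart): Apdex score over time
  - Service Avg Response Times: average response latency, in ms
  - Service Response Time Percentile: response latency at different percentiles
  - Successful Rate (number): request success rate
  - Successful Rate (line chart): request success rate over time
  - Service Load (number): requests per minute
  - Service Load (line chart): requests per minute over time
  - Service Instances Load: requests per minute for each service instance
  - Slow Service Instance: maximum latency of each service instance
  - Service Instance Successful Rate: request success rate of each service instance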
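- Instance
  - Service Instance Load: requests per minute for the current instance
  - Service Instance Successful Rate: request success rate of the current instance
  - Service Instance Latency: response latency of the current instance
  - JVM CPU: percentage of CPU used by the JVM
  - JVM Memory: JVM memory usage, in MB
  - JVM GC Time: JVM garbage collection time, covering both young GC and old GC
  - JVM GC Count: number of JVM garbage collections, covering both young GC and old GC
  - JVM Thread Count: number of JVM threads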
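- Endpoint
  - Endpoint Load in Current Service: requests per minute for each endpoint
  - Slow Endpoints in Current Service: slowest request time of each endpoint, in ms
  - Successful Rate in Current Service: request success rate of each endpoint
  - Endpoint Load: request count of the current endpoint in each time slot
  - Endpoint Avg Response Time: average response time of the current endpoint in each time slot
  - Endpoint Response Time Percentile: response time percentiles of the current endpoint in each time slot
  - Endpoint Successful Rate: request success rate of the current endpoint in each time slot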
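The Apdex score shown on the Global and Service panels follows the standard Apdex definition: requests at or below a threshold T count as satisfied, requests between T and 4T count as tolerating (half weight), and the rest count as frustrated. The sketch below illustrates that formula only. The 500 ms default threshold is an assumption, so check the threshold configured for your services, and the function is not SkyWalking's own implementation.

```python
def apdex(response_times_ms, threshold_ms=500.0):
    """Standard Apdex: (satisfied + tolerating / 2) / total, in the range 0..1."""
    if not response_times_ms:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(1 for t in response_times_ms if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)


# Two satisfied (100, 200), two tolerating (600, 1900), one frustrated (2500).
samples = [100, 200, 600, 1900, 2500]
print(apdex(samples))
```

A score of 1 therefore means every sampled request finished within the threshold, which is why the panel treats 1 as a perfect score.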
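Topology: shows the relationships between services as a topology graph, which also serves as an entry point to related details
Trace: lists requests per endpoint and lets you trace the internal call chain of each one
Profile: samples an endpoint for analysis and lets you inspect the thread stacks
Alarm: lists the alarms that have been triggered, such as service failure rate or request timeouts
Reload (auto refresh): refreshes the data on the current page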
III. Alerting
This example uses a WeCom (企业微信) group bot
- Default configuration: /config/alarm-settings.yml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  #  endpoint_avg_rule:
  #    metrics-name: endpoint_avg
  #    op: ">"
  #    threshold: 1000
  #    period: 10
  #    count: 2
  #    silence-period: 5
  #    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
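The fields period, count, op and threshold in the rules above can be read as a sliding-window check: look at the last `period` minutes of a metric and raise the alarm when at least `count` of those minutes match the condition. The sketch below models only that reading; it ignores silence-period, the `Rule` class and `should_alarm` function are hypothetical names, and it is not the OAP server's actual code. Note that service_sla is expressed in units of 1/100 of a percent, so a threshold of 8000 means 80%.

```python
import operator
from dataclasses import dataclass

# Comparison operators assumed to be usable in the `op` field.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le, "==": operator.eq}


@dataclass
class Rule:
    op: str
    threshold: float
    period: int  # window length in minutes
    count: int   # matching minutes needed inside the window


def should_alarm(rule: Rule, per_minute_values: list) -> bool:
    """True when at least `count` of the last `period` values match the condition."""
    window = per_minute_values[-rule.period:]
    matches = sum(1 for value in window if OPS[rule.op](value, rule.threshold))
    return matches >= rule.count


# service_sla_rule from the file above: success rate below 80% (8000).
sla_rule = Rule(op="<", threshold=8000, period=10, count=2)
print(should_alarm(sla_rule, [9900] * 8 + [7500, 7900]))  # two bad minutes
print(should_alarm(sla_rule, [9900] * 9 + [7500]))        # only one bad minute
```

After an alarm fires, silence-period suppresses repeats of the same alarm for that many checks, which this sketch leaves out.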
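Notes (one per rule, in order):
- The average response time of a service exceeded 1 second in at least 3 of the last 10 minutes
- The success rate of a service was below 80% in at least 2 of the last 10 minutes
- The 90th-percentile response time of a service exceeded 1 second in at least 3 of the last 10 minutes
- The average response time of a service instance exceeded 1 second in at least 2 of the last 10 minutes
- Send alarms to the WeCom bot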
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking 告警: \n %s."
      }
    }
  webhooks:
    - https://xxxx
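Before relying on real alarms, it can help to render the textTemplate by hand and confirm that the result is valid JSON. The sketch below assumes the single %s placeholder is replaced by the alarm message, which is how the template reads but is not verified against the OAP source. `render_payload` and `build_request` are hypothetical helper names, and the webhook URL in the example is a placeholder for the bot's address.

```python
import json
import urllib.request

# Same template as in wechatHooks above; raw string so that \n stays a JSON escape.
TEXT_TEMPLATE = r'''{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking 告警: \n %s."
  }
}'''


def render_payload(template: str, alarm_message: str) -> bytes:
    """Fill the %s placeholder and return the UTF-8 encoded JSON body."""
    # JSON-escape the message so quotes or newlines in it cannot break the body.
    escaped = json.dumps(alarm_message, ensure_ascii=False)[1:-1]
    body = template % escaped
    json.loads(body)  # raises ValueError if the rendered template is not valid JSON
    return body.encode("utf-8")


def build_request(webhook_url: str, body: bytes) -> urllib.request.Request:
    """Build (but do not send) the POST request for the bot webhook."""
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    payload = render_payload(TEXT_TEMPLATE, "Response time of service admin is more than 1000ms")
    print(payload.decode("utf-8"))
    # To send a test message, pass the real webhook URL and uncomment:
    # urllib.request.urlopen(build_request("https://xxxx", payload), timeout=5)
```

If the test message shows up in the group chat, the same URL can go into the webhooks list of wechatHooks.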