I. Set Up the Project
II. Deploy Prometheus
1. Create the mount directory
mkdir -p /usr/local/soft/docker/prometheus
2. Configuration file
# Create the configuration file inside the mount directory
cd /usr/local/soft/docker/prometheus
vim prometheus.yml
# with the following content
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  # demo job
  - job_name: 'springboot-actuator-prometheus-test' # job name
    metrics_path: '/actuator/prometheus' # path the metrics are scraped from
    scrape_interval: 5s # scrape interval for this job (overrides the global 15s)
    basic_auth: # Spring Security basic auth
      username: 'actuator'
      password: 'actuator'
    static_configs:
      - targets: ['10.60.45.113:8080'] # address of the instance; the scheme defaults to http
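Before pointing Prometheus at the demo job, it can help to confirm by hand that the endpoint accepts the configured credentials. The following is a minimal sketch, not part of the original project: the class name ScrapeCheck and its methods are made up for illustration, and it assumes Java 11+ for java.net.http. It builds the same kind of GET request, with an HTTP Basic Authorization header, that the basic_auth block above causes Prometheus to send.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for checking the scrape target by hand.
final class ScrapeCheck {

    /** Builds the value of an HTTP Basic "Authorization" header. */
    static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds (but does not send) a GET request for http://<target><metricsPath>. */
    static HttpRequest scrapeRequest(String target, String metricsPath,
                                     String username, String password) {
        return HttpRequest.newBuilder(URI.create("http://" + target + metricsPath))
                .header("Authorization", basicAuthHeader(username, password))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Values copied from the demo job in prometheus.yml.
        HttpRequest request = scrapeRequest(
                "10.60.45.113:8080", "/actuator/prometheus", "actuator", "actuator");
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Authorization: "
                + request.headers().firstValue("Authorization").orElse(""));
    }
}
```

To actually send the request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); a 200 response with metric lines suggests the credentials and path are right, while a 401 points at the basic_auth settings.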
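3. Start the container
The --web.enable-lifecycle flag enables the HTTP endpoint used later to reload the configuration.
docker run -d -p 9090:9090 --name prom \
  -v /usr/local/soft/docker/prometheus/:/etc/prometheus/ \
  prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
4. Access
- Open http://ip:9090 (replace ip with the host's address); the Prometheus web UI should appear.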
III. Deploy Grafana
1. Start
docker run -d --name=grafana -p 3000:3000 grafana/grafana:9.3.4
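2. Access
Open http://ip:3000/login. The initial username/password is admin/admin; you will be asked to change the password on first login.
3. Configure the data source
- Under Configuration, click Add Data Source.
- Select Prometheus as the data source, enter the Prometheus access address, and click Save & Test.
4. Create a monitoring dashboard
- Select a metric under Metrics and click Run queries; the metric data appears.
- Click Visualizations to choose the visualization type.
5. Dashboard marketplace
By now you should know how to visualize a single metric. With many metrics, though, configuring every panel by hand gets tedious.
Are there general-purpose dashboard templates that work out of the box?
Go to Grafana Labs - Dashboards and search by keyword to find the dashboard you want 😎😎.
These existing dashboards are also a quick way to learn how panels are configured and how dashboards are used.
6. Import a dashboard
- First, pick a dashboard you like on Grafana Labs - Dashboards and note its ID.
- Click the Import button.
- Enter the ID and click Load.
- Select Prometheus as the data source and click Import.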
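IV. Custom Monitoring Metrics
- This part shows how to define custom metrics (for example business data; this is also called instrumentation).
Sample requirement: an order service that monitors [real-time order amount] and [order failure rate over 10 minutes].
1. Custom monitoring class
Here we define three metrics:
- requests_error_total: number of failed orders
- order_request_count: total number of orders
- order_amount_sum: order amount statistics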
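import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on Spring Boot 3.x
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class PrometheusCustomMonitor {
    /**
     * Number of failed requests
     */
    private Counter requestErrorCount;
    /**
     * Number of orders placed
     */
    private Counter orderCount;
    /**
     * Order amount statistics
     */
    private DistributionSummary amountSum;
    private final MeterRegistry registry;

    @Autowired
    public PrometheusCustomMonitor(MeterRegistry registry) {
        this.registry = registry;
    }

    // Register the three meters once the bean has been constructed.
    // Each call takes the metric name followed by one tag key/value pair.
    @PostConstruct
    private void init() {
        requestErrorCount = registry.counter("requests_error_total", "status", "error");
        orderCount = registry.counter("order_request_count", "order", "test-svc");
        amountSum = registry.summary("order_amount_sum", "orderAmount", "test-svc");
    }

    public Counter getRequestErrorCount() {
        return requestErrorCount;
    }

    public Counter getOrderCount() {
        return orderCount;
    }

    public DistributionSummary getAmountSum() {
        return amountSum;
    }
}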
2. Add the /order endpoint
When flag="1", the endpoint throws an exception to simulate a failed order. The endpoint records order_request_count and order_amount_sum.
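import java.util.Random;
import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3.x
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TestController {
    @Resource
    private PrometheusCustomMonitor monitor;

    @RequestMapping("/order")
    public String order(@RequestParam(defaultValue = "0") String flag) throws Exception {
        // Count every order attempt, including the ones that fail below
        monitor.getOrderCount().increment();
        if ("1".equals(flag)) {
            // Simulated failure; counted by the global exception handler
            throw new Exception("Something went wrong");
        }
        Random random = new Random();
        int amount = random.nextInt(100); // random amount in [0, 100)
        // Record the order amount
        monitor.getAmountSum().record(amount);
        return "Order placed, amount: " + amount;
    }
}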
3. Global exception handling
Count failed orders in requests_error_total:
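import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3.x
import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.ResponseBody;

@ControllerAdvice
public class GlobalExceptionHandler {
    @Resource
    private PrometheusCustomMonitor monitor;

    // Handles every uncaught Exception from any controller, so in this demo
    // each one is counted as a failed order.
    @ResponseBody
    @ExceptionHandler(value = Exception.class)
    public String handle(Exception e) {
        monitor.getRequestErrorCount().increment();
        return "error, message: " + e.getMessage();
    }
}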
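4. Test
Start the project, then visit http://localhost:8080/order and http://localhost:8080/order?flag=1 to simulate successful and failed orders. Then visit http://localhost:8080/actuator/prometheus; the custom metrics are now exposed by the /prometheus endpoint. Micrometer appends the _total suffix to counters, so order_request_count appears as order_request_count_total: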
# HELP requests_error_total
# TYPE requests_error_total counter
requests_error_total{application="springboot-actuator-prometheus-test",status="error",} 41.0
# HELP order_request_count_total
# TYPE order_request_count_total counter
order_request_count_total{application="springboot-actuator-prometheus-test",order="test-svc",} 94.0
# HELP order_amount_sum
# TYPE order_amount_sum summary
order_amount_sum_count{application="springboot-actuator-prometheus-test",orderAmount="test-svc",} 53.0
order_amount_sum_sum{application="springboot-actuator-prometheus-test",orderAmount="test-svc",} 2701.0
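The summary is exposed as two series, order_amount_sum_count and order_amount_sum_sum, and dividing one by the other gives the average order amount. As a rough illustration, here is a minimal sketch that reads sample lines in the format shown above. The class ExpositionSample is hypothetical, and the parser assumes one sample per line with no trailing timestamp; it is not a full parser for the exposition format.

```java
import java.util.Locale;

// Hypothetical, simplified reader for single sample lines of the text exposition format.
final class ExpositionSample {
    final String name;
    final double value;

    private ExpositionSample(String name, double value) {
        this.name = name;
        this.value = value;
    }

    /** Parses "name{labels} value" or "name value"; assumes no timestamp follows the value. */
    static ExpositionSample parse(String line) {
        String trimmed = line.trim();
        int valueStart = trimmed.lastIndexOf(' ');
        if (valueStart < 0) {
            throw new IllegalArgumentException("not a sample line: " + line);
        }
        String series = trimmed.substring(0, valueStart).trim();
        double value = Double.parseDouble(trimmed.substring(valueStart + 1));
        int brace = series.indexOf('{');
        String name = brace < 0 ? series : series.substring(0, brace);
        return new ExpositionSample(name, value);
    }

    /** Average observation of a summary: its _sum sample divided by its _count sample. */
    static double average(String sumLine, String countLine) {
        return parse(sumLine).value / parse(countLine).value;
    }

    public static void main(String[] args) {
        String labels = "{application=\"springboot-actuator-prometheus-test\",orderAmount=\"test-svc\",}";
        String countLine = "order_amount_sum_count" + labels + " 53.0";
        String sumLine = "order_amount_sum_sum" + labels + " 2701.0";
        // 2701.0 / 53.0, about 50.96
        System.out.println(String.format(Locale.ROOT,
                "average order amount: %.2f", average(sumLine, countLine)));
    }
}
```

In the sample output the numbers are consistent with the controller: 94 order attempts minus 41 errors leaves 53 recorded amounts, which is the value of order_amount_sum_count.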
5. Add the corresponding panels in Grafana
- First, create the 10-minute order failure rate:
sum(rate(requests_error_total{application="springboot-actuator-prometheus-test"}[10m])) / sum(rate(order_request_count_total{application="springboot-actuator-prometheus-test"}[10m])) * 100
- Next, the total order amount:
- Final result
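The failure-rate query above divides the per-second rate of errors by the per-second rate of orders over the same 10-minute window, so the window length cancels out and the result is roughly "errors in the window / orders in the window". The sketch below shows that arithmetic on raw counter readings. The class name and the sample numbers are invented for illustration, and it deliberately ignores what rate() does in addition (counter resets and extrapolation to the window edges).

```java
// Hypothetical illustration of the arithmetic behind the failure-rate query.
final class FailureRate {

    /**
     * Failure rate in percent between two readings of the error and order counters.
     * Returns NaN when no orders arrived in the window (mirrors 0/0 in PromQL).
     */
    static double percent(double errorsStart, double errorsEnd,
                          double ordersStart, double ordersEnd) {
        double errorIncrease = errorsEnd - errorsStart;
        double orderIncrease = ordersEnd - ordersStart;
        if (orderIncrease <= 0) {
            return Double.NaN;
        }
        return errorIncrease / orderIncrease * 100.0;
    }

    public static void main(String[] args) {
        // Made-up readings taken 10 minutes apart:
        // errors went from 31 to 41, orders from 54 to 94.
        System.out.println(percent(31, 41, 54, 94) + " %");
    }
}
```

With these readings, 10 of the 40 orders in the window failed, which is a failure rate of 25 percent.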
V. Deploy Alertmanager
Sample alert rules:
- whether the service is down
- whether the 10-minute order failure rate exceeds 10%
1. Create the mount directory
mkdir -p /usr/local/soft/docker/alertmanager/
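2. Edit the configuration file
# Create the configuration file inside the mount directory
cd /usr/local/soft/docker/alertmanager/
vim alertmanager.yml
# with the following content
global:
  resolve_timeout: 5m
  smtp_from: 'xxx@163.com' # sender address
  smtp_smarthost: 'smtp.163.com:25'
  smtp_auth_username: 'darrytao@163.com' # mailbox account used to log in to the SMTP server
  smtp_auth_password: 'xxxxxxxx' # mailbox authorization code; see your mail provider's instructions for enabling SMTP
  smtp_hello: '163.com'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: '13486307525@163.com'
3. Start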
docker run --name alertmanager -d -p 9093:9093 \
-v /usr/local/soft/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
prom/alertmanager:v0.25.0
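4. Access
Address: ip:9093
The Alerts menu shows the alerts Alertmanager has received. The Silences menu lets you create silence rules through the UI. The Status menu shows Alertmanager's configuration.
5. Define alert rules
Create test-svc-alert-rule.yml in the rule/ subdirectory of the Prometheus mount directory (/usr/local/soft/docker/prometheus/rule/, which the container sees as /etc/prometheus/rule/) with the following content: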
groups:
  - name: svc-alert-rule
    rules:
      - alert: svc-down # whether the service is down
        expr: sum(up{job="springboot-actuator-prometheus-test"}) == 0
        for: 1m
        labels: # custom labels
          severity: critical
          team: node
        annotations:
          summary: "Order service is down, please check!"
      - alert: order-error-rate-high # whether the 10-minute order failure rate exceeds 10%
        expr: sum(rate(requests_error_total{job="springboot-actuator-prometheus-test"}[10m])) / sum(rate(order_request_count_total{job="springboot-actuator-prometheus-test"}[10m])) > 0.1
        for: 1m
        labels:
          severity: major
          team: node
        annotations:
          summary: "Order service is responding abnormally!"
          description: "10-minute order error rate has exceeded 10% (current value: {{ $value }})!"
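Both rules use for: 1m, which means the expression has to stay true for a full minute of consecutive evaluations before the alert fires; until then it is shown as pending. The following is a simplified model of that behaviour, written as a sketch for illustration only. The class PendingAlert and its state names are assumptions of this example and leave out details of the real rule evaluator.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical, simplified model of how a rule's "for" clause delays firing.
final class PendingAlert {
    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;
    private Instant activeSince; // null while the condition is false

    PendingAlert(Duration holdFor) {
        this.holdFor = holdFor;
    }

    /** Call once per rule evaluation with the current result of the expression. */
    State evaluate(boolean conditionHolds, Instant now) {
        if (!conditionHolds) {
            activeSince = null; // any false evaluation resets the timer
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now;
        }
        Duration active = Duration.between(activeSince, now);
        return active.compareTo(holdFor) >= 0 ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        PendingAlert alert = new PendingAlert(Duration.ofMinutes(1));
        Instant start = Instant.EPOCH;
        // Evaluate every 15 seconds, matching evaluation_interval in prometheus.yml.
        for (int seconds = 0; seconds <= 75; seconds += 15) {
            System.out.println(seconds + "s: " + alert.evaluate(true, start.plusSeconds(seconds)));
        }
    }
}
```

In this model, a service that goes down is reported as pending for the first minute and only then as firing, which is why the alert email does not arrive the moment the service stops.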
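6. Configure Prometheus
In prometheus.yml, reference the test-svc-alert-rule.yml alert rules and enable Alertmanager. The rule_files block here replaces the empty rule_files block from the earlier configuration; a YAML file must not define the key twice.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093 # must be reachable from Prometheus; inside a container, localhost is the container itself, so use the Alertmanager host's address there
rule_files:
  - /etc/prometheus/rule/*.yml
# Reload the configuration (requires the --web.enable-lifecycle flag set at startup)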
curl -X POST localhost:9090/-/reload
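The same reload can be triggered from code, for example from a deployment script. The following is a minimal sketch assuming Java 11+ (java.net.http); the class name PrometheusReload is made up for this example. It builds the POST request that the curl command above sends and only contacts the server when a base URL is passed as an argument.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper that mirrors `curl -X POST <baseUrl>/-/reload`.
final class PrometheusReload {

    /** Builds (but does not send) an empty-bodied POST to <baseUrl>/-/reload. */
    static HttpRequest reloadRequest(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/-/reload"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No server given: just show what would be sent.
            HttpRequest preview = reloadRequest("http://localhost:9090");
            System.out.println(preview.method() + " " + preview.uri());
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(reloadRequest(args[0]), HttpResponse.BodyHandlers.ofString());
        System.out.println("status: " + response.statusCode());
    }
}
```

A non-200 status usually means either that the lifecycle endpoint is not enabled or that the new configuration failed to load; the Prometheus container log should say which.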
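7. Check the alerts
- Open http://ip:9090/alerts to see the alert state in Prometheus.
- Stop the service, then check Prometheus and Alertmanager again.
- The alert has fired at this point, so as long as the mail settings are correct you should receive an email.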