Deploying Prometheus + Grafana + AlertManager with Docker for Monitoring and Alerting


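I. Set up the project

Set up a Spring Boot Actuator project. It needs the micrometer-registry-prometheus dependency and the prometheus endpoint exposed, so that metrics are served at /actuator/prometheus.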

II. Deploy Prometheus

1. Create the mount directory

mkdir -p /usr/local/soft/docker/prometheus

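2. Configuration file

# Create the configuration file in the mount directory
vim /usr/local/soft/docker/prometheus/prometheus.yml

# Contents: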

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  # demo job
  - job_name: 'springboot-actuator-prometheus-test' # job name
    metrics_path: '/actuator/prometheus' # path the metrics are scraped from
    scrape_interval: 5s # scrape interval for this job
    basic_auth: # Spring Security basic auth
      username: 'actuator'
      password: 'actuator'
    static_configs:
    - targets: ['10.60.45.113:8080'] # address of the instance; the scheme defaults to http
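
Before starting Prometheus it can help to confirm that the demo target really answers on that path with those credentials. The following is a minimal Java sketch of the request Prometheus would send for the demo job; the class and method names are hypothetical, and it assumes Java 11+ for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper that mimics the scrape of the demo job. */
class ActuatorScrapeCheck {

    /** Builds the HTTP Basic Authorization header value for the given credentials. */
    static String basicAuthHeader(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds the GET request for the metrics endpoint of one target (host:port). */
    static HttpRequest scrapeRequest(String target, String username, String password) {
        return HttpRequest.newBuilder(URI.create("http://" + target + "/actuator/prometheus"))
                .header("Authorization", basicAuthHeader(username, password))
                .GET()
                .build();
    }

    /** Sends the request and returns the HTTP status code. */
    static int scrape(String target, String username, String password) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(scrapeRequest(target, username, password),
                        HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Calling ActuatorScrapeCheck.scrape("10.60.45.113:8080", "actuator", "actuator") should return 200 when the endpoint and credentials match the configuration above; a 401 would point at the basic_auth values, and a 404 at metrics_path.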

3. Start the container

docker run -d -p 9090:9090 --name prom \
    -v /usr/local/soft/docker/prometheus/:/etc/prometheus/ \
    prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle

4. Access

  • Open http://ip:9090 and you should see the following page:

image.png

III. Deploy Grafana

1. Start

docker run -d --name=grafana -p 3000:3000 grafana/grafana:9.3.4

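2. Access

Open http://ip:3000/login. The initial username/password is admin/admin; you are asked to change the password on first login.

3. Configure the data source

  • Click Configuration, then Add Data Source, and you will see the following page: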

image.png

  • Select Prometheus as the data source, enter the Prometheus access address (for example http://ip:9090), and click Save & Test.

image.png

image.png

image.png

4. Create a monitoring dashboard

image.png

  • Select a metric under Metrics and click Run queries; the metric data appears

image.png

  • Click Visualizations to choose the visualization type

image.png

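5. Dashboard marketplace

By now you should know how to visualize a single metric. With many metrics, though, configuring every panel by hand quickly becomes tedious.

Are there ready-made, general-purpose dashboard templates?

Go to Grafana Labs - Dashboards and search by keyword to find the dashboard you need 😎😎.

These existing dashboards are also a quick way to learn how panels are configured and how dashboards are used.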

image.png

6. Import a dashboard

image.png

  • Click the Import button

image.png

  • Enter the dashboard ID and click Load

image.png

  • Select the Prometheus data source and click Import

image.png

image.png

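IV. Custom metrics

  • This part shows how to define custom metrics (for example business data; this is also called instrumentation).

Sample requirement: an order service for which we monitor [real-time order amount] and [order failure rate over the last 10 minutes].

1. Custom monitoring class

We define three custom metrics here:

  • requests_error_total: number of failed orders
  • order_request_count: total number of orders (exposed as order_request_count_total)
  • order_amount_sum: order amount statistics
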
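import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on Spring Boot 3.x

@Component
public class PrometheusCustomMonitor {

    /**
     * Number of failed requests
     */
    private Counter requestErrorCount;

    /**
     * Number of orders placed
     */
    private Counter orderCount;

    /**
     * Order amount statistics
     */
    private DistributionSummary amountSum;

    private final MeterRegistry registry;

    @Autowired
    public PrometheusCustomMonitor(MeterRegistry registry) {
        this.registry = registry;
    }

    // Register the meters once the bean has been constructed
    @PostConstruct
    private void init() {
        requestErrorCount = registry.counter("requests_error_total", "status", "error");
        orderCount = registry.counter("order_request_count", "order", "test-svc");
        amountSum = registry.summary("order_amount_sum", "orderAmount", "test-svc");
    }

    public Counter getRequestErrorCount() {
        return requestErrorCount;
    }

    public Counter getOrderCount() {
        return orderCount;
    }

    public DistributionSummary getAmountSum() {
        return amountSum;
    }
}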

2. Add the /order endpoint

When flag="1" the endpoint throws an exception to simulate a failed order. The endpoint records order_request_count and order_amount_sum.

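import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3.x
import java.util.Random;

@RestController
public class TestController {

    @Resource
    private PrometheusCustomMonitor monitor;

    @RequestMapping("/order")
    public String order(@RequestParam(defaultValue = "0") String flag) throws Exception {
        // Count every order attempt, successful or not
        monitor.getOrderCount().increment();
        if ("1".equals(flag)) {
            throw new Exception("something went wrong");
        }
        Random random = new Random();
        int amount = random.nextInt(100);
        // Record the order amount
        monitor.getAmountSum().record(amount);
        return "Order placed, amount: " + amount;
    }
}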

3. Global exception handling

Count failed orders in requests_error_total.

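import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.ResponseBody;

import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3.x

@ControllerAdvice
public class GlobalExceptionHandler {

    @Resource
    private PrometheusCustomMonitor monitor;

    // Every unhandled exception increments the error counter
    @ResponseBody
    @ExceptionHandler(value = Exception.class)
    public String handle(Exception e) {
        monitor.getRequestErrorCount().increment();
        return "error, message: " + e.getMessage();
    }
}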

4. Test

Start the project and visit http://localhost:8080/order and http://localhost:8080/order?flag=1 to simulate successful and failed orders. Then open http://localhost:8080/actuator/prometheus; the custom metrics are now exposed by the /prometheus endpoint:

# HELP requests_error_total  
# TYPE requests_error_total counter
requests_error_total{application="springboot-actuator-prometheus-test",status="error",} 41.0
# HELP order_request_count_total  
# TYPE order_request_count_total counter
order_request_count_total{application="springboot-actuator-prometheus-test",order="test-svc",} 94.0
# HELP order_amount_sum  
# TYPE order_amount_sum summary
order_amount_sum_count{application="springboot-actuator-prometheus-test",orderAmount="test-svc",} 53.0
order_amount_sum_sum{application="springboot-actuator-prometheus-test",orderAmount="test-svc",} 2701.0
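
The summary is exposed as two series, order_amount_sum_count and order_amount_sum_sum, so the mean order amount is simply sum divided by count. As a rough illustration, the Java sketch below parses sample lines like the ones above and computes that mean. The class is hypothetical and deliberately naive: it assumes one series per metric name, no trailing timestamps, and it discards the labels.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified reader for Prometheus text exposition samples. */
class ExpositionSamples {

    /** Maps each metric name (labels stripped) to its value; '#' comment lines are skipped. */
    static Map<String, Double> parse(String text) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            int valueStart = line.lastIndexOf(' ');
            if (line.isEmpty() || line.startsWith("#") || valueStart < 0) {
                continue;
            }
            String series = line.substring(0, valueStart);
            int brace = series.indexOf('{');
            String name = brace >= 0 ? series.substring(0, brace) : series;
            samples.put(name, Double.parseDouble(line.substring(valueStart + 1)));
        }
        return samples;
    }

    /** Mean of a summary: <name>_sum divided by <name>_count (both must be present). */
    static double mean(Map<String, Double> samples, String summaryName) {
        return samples.get(summaryName + "_sum") / samples.get(summaryName + "_count");
    }

    public static void main(String[] args) {
        String text = "# TYPE order_amount_sum summary\n"
                + "order_amount_sum_count{orderAmount=\"test-svc\",} 53.0\n"
                + "order_amount_sum_sum{orderAmount=\"test-svc\",} 2701.0\n";
        System.out.println("mean order amount: " + mean(parse(text), "order_amount_sum"));
    }
}
```

With the values shown above (53 recorded orders, 2701 in total) the mean comes out at roughly 51, which is what you would expect from amounts drawn uniformly from 0 to 99.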

5. Add the corresponding panels in Grafana

  • First, the order failure rate over the last 10 minutes:
sum(rate(requests_error_total{application="springboot-actuator-prometheus-test"}[10m])) / sum(rate(order_request_count_total{application="springboot-actuator-prometheus-test"}[10m])) * 100
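
To make the query concrete, here is a minimal Java sketch of the arithmetic behind it: the per-second rate of each counter over the window, then errors divided by requests, times 100. It is a simplification with made-up names; the real rate() also handles counter resets and extrapolates to the window boundaries.

```java
/** Hypothetical illustration of the failure-rate arithmetic, not Prometheus code. */
class FailureRate {

    /** Per-second increase of a counter between two samples taken windowSeconds apart. */
    static double perSecondRate(double first, double last, double windowSeconds) {
        return (last - first) / windowSeconds;
    }

    /**
     * Failure percentage over the window. With no requests in the window the
     * division yields NaN (0/0), following IEEE 754 floating-point rules.
     */
    static double failurePercent(double errFirst, double errLast,
                                 double reqFirst, double reqLast,
                                 double windowSeconds) {
        double errorRate = perSecondRate(errFirst, errLast, windowSeconds);
        double requestRate = perSecondRate(reqFirst, reqLast, windowSeconds);
        return errorRate / requestRate * 100;
    }

    public static void main(String[] args) {
        // Counters went from 30 to 41 errors and from 50 to 94 requests in 10 minutes
        System.out.println(failurePercent(30, 41, 50, 94, 600) + " %");
    }
}
```

Because both rates use the same window, the window length cancels out and the result is just the share of failed requests in that window: 11 errors out of 44 requests in the example.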

image.png

  • Next, the total order amount:

image.png

  • Final result

image.png

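V. Deploy AlertManager

Sample alert rules:

  1. Whether the service is down
  2. Whether the order failure rate over the last 10 minutes exceeds 10%

1. Create the mount directory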

mkdir -p /usr/local/soft/docker/alertmanager/

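2. Edit the configuration file

# Create the configuration file in the mount directory
vim /usr/local/soft/docker/alertmanager/alertmanager.yml

# Contents: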

global:
  resolve_timeout: 5m
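  smtp_from: 'xxx@163.com'  # sender mailbox account
  smtp_smarthost: 'smtp.163.com:25'
  smtp_auth_username: 'darrytao@163.com' # mailbox account used for SMTP login
  smtp_auth_password: 'xxxxxxxx' # mailbox authorization code; search for how to enable the SMTP service for your mailbox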
  smtp_hello: '163.com'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '13486307525@163.com'

3. Start

docker run --name alertmanager -d -p 9093:9093 \
 -v /usr/local/soft/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager:v0.25.0

4. Access

Address: http://ip:9093

image.png

The Alerts menu shows the alerts Alertmanager has received. Under Silences you can create silence rules through the UI. The Status menu shows the Alertmanager configuration.
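
To check the mail settings without waiting for a real alert, one option is to push a synthetic alert into Alertmanager through its v2 HTTP API (POST /api/v2/alerts). The Java sketch below does that; the class name and the alert's label and annotation values are invented for this test, and the JSON is built by plain concatenation, so the inputs must not contain quotes or backslashes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper that posts one hand-made alert to Alertmanager. */
class AlertmanagerTestAlert {

    /** JSON array with a single alert; values are inserted verbatim (no escaping). */
    static String alertJson(String alertName, String severity, String summary) {
        return "[{\"labels\":{\"alertname\":\"" + alertName
                + "\",\"severity\":\"" + severity + "\"},"
                + "\"annotations\":{\"summary\":\"" + summary + "\"}}]";
    }

    /** Builds the POST request against the Alertmanager v2 alerts endpoint. */
    static HttpRequest postAlertRequest(String baseUrl, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v2/alerts"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    /** Sends the alert and returns the HTTP status code. */
    static int send(String baseUrl, String json) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(postAlertRequest(baseUrl, json), HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

For example, AlertmanagerTestAlert.send("http://ip:9093", AlertmanagerTestAlert.alertJson("manual-test", "critical", "test mail")) should make the alert show up under the Alerts menu and, after group_wait, trigger an email to the configured receiver.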

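5. Define alert rules

Under the Prometheus mount directory, create rule/test-svc-alert-rule.yml (that is, /usr/local/soft/docker/prometheus/rule/test-svc-alert-rule.yml, so that it matches the rule_files pattern configured in the next step) with the following alert rules: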

groups:
- name: svc-alert-rule
  rules:
  - alert: svc-down # whether the service is down
    expr: sum(up{job="springboot-actuator-prometheus-test"}) == 0
    for: 1m
    labels: # custom labels
      severity: critical
      team: node
    annotations:
      summary: "The order service is down, please check!"
  - alert: order-error-rate-high # whether the order failure rate over the last 10 minutes exceeds 10%
    expr: sum(rate(requests_error_total{job="springboot-actuator-prometheus-test"}[10m])) / sum(rate(order_request_count_total{job="springboot-actuator-prometheus-test"}[10m])) > 0.1
    for: 1m
    labels:
      severity: major
      team: node
    annotations:
      summary: "The order service is responding abnormally!"
      description: "Order error rate over the last 10 minutes exceeds 10% (current value: {{ $value }})!"
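
Both rules use for: 1m, which means the expression has to stay true for a full minute before the alert moves from pending to firing. The Java sketch below is a small state machine that mirrors that behavior as I understand it; it is an illustration with hypothetical names, not Prometheus code.

```java
import java.time.Duration;

/** Hypothetical model of how a rule's `for` clause delays firing. */
class AlertForClause {

    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;
    private State state = State.INACTIVE;
    private long activeSinceSeconds;

    AlertForClause(Duration holdFor) {
        this.holdFor = holdFor;
    }

    /** Feeds one rule evaluation (timestamp in seconds, expression result) and returns the new state. */
    State evaluate(long nowSeconds, boolean exprTrue) {
        if (!exprTrue) {
            // Any false evaluation resets the alert, including the pending timer
            state = State.INACTIVE;
            return state;
        }
        if (state == State.INACTIVE) {
            state = State.PENDING;
            activeSinceSeconds = nowSeconds;
        }
        if (state == State.PENDING
                && nowSeconds - activeSinceSeconds >= holdFor.getSeconds()) {
            state = State.FIRING;
        }
        return state;
    }

    public static void main(String[] args) {
        AlertForClause alert = new AlertForClause(Duration.ofMinutes(1));
        // evaluation_interval is 15s in prometheus.yml above
        for (long t = 0; t <= 60; t += 15) {
            System.out.println(t + "s -> " + alert.evaluate(t, true));
        }
    }
}
```

With the 15s evaluation interval configured earlier, a service that goes down stays pending for the evaluations at 0s to 45s and only fires at 60s; only firing alerts are sent on to Alertmanager.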

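6. Configure Prometheus

In prometheus.yml, reference the test-svc-alert-rule.yml alert rules and enable Alertmanager. Put the rule path under the rule_files key that already exists in the file (it was left empty above) rather than declaring the key a second time.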

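alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # Inside the Prometheus container, localhost is the container itself.
      # Unless both containers share a network namespace, use the host IP here.
      - localhost:9093
rule_files:
  - /etc/prometheus/rule/*.yml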

image.png

# Reload the configuration (works because the container was started with --web.enable-lifecycle)
curl -X POST localhost:9090/-/reload

7. Check the alert in action

image.png

  • Stop the service, then open Prometheus and Alertmanager to see the effect

image.png

image.png

  • The alert has fired here, so as long as the mail settings are correct, you should receive an email.