Preface
This post covers monitoring message queues (Kafka and RabbitMQ) with Prometheus, and configuring alerting on top of that.
I. Kafka monitoring
Versions used: kafka_exporter 1.3.1, Kafka 2.6.0.
1. Download
Download the kafka_exporter-1.3.1.linux-amd64.tar.gz package from github.com/danielqsj/k… and upload it to the server, or fetch it directly:
[root@localhost package]# wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.3.1/kafka_exporter-1.3.1.linux-amd64.tar.gz
[root@localhost package]# tar -xvf kafka_exporter-1.3.1.linux-amd64.tar.gz -C /opt/software/
2. Start
[root@localhost kafka_exporter]# nohup ./kafka_exporter --kafka.server=127.0.0.1:9092 &
Once it is running, verify the metrics endpoint at http://192.168.81.104:9308/metrics
3. Configure Prometheus
# Add a scrape job for the Kafka exporter
- job_name: 'kafka-exporter'
  static_configs:
    - targets: ['192.168.81.104:9308']
# Hot-reload Prometheus
curl -X POST http://192.168.81.104:9090/-/reload
4. Import Grafana dashboard template ID 7589.
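Starting the exporter with nohup means it dies with the session and is never restarted on failure. An optional hardening step is to run it under systemd instead. The sketch below is an assumption, not part of the original setup: the install directory name comes from the tarball extracted to /opt/software/ above, and the unit is written to the current directory (on a real server it belongs in /etc/systemd/system/).

```shell
# Write a minimal unit file locally; copy it to /etc/systemd/system/ on the server.
cat > kafka_exporter.service <<'EOF'
[Unit]
Description=Prometheus Kafka exporter
After=network.target

[Service]
# Path assumes the tarball was extracted to /opt/software/ as in the step above.
ExecStart=/opt/software/kafka_exporter-1.3.1.linux-amd64/kafka_exporter \
    --kafka.server=127.0.0.1:9092
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Then, on the server:
#   systemctl daemon-reload && systemctl enable --now kafka_exporter
```

With Restart=on-failure, systemd brings the exporter back up automatically if it crashes.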
II. RabbitMQ monitoring
Versions used: RabbitMQ 3.7.18 (released September 2019), Erlang 22.0.7.
Download the exporter from the rabbitmq_exporter releases page:
wget https://github.com/kbudde/rabbitmq_exporter/releases/download/v1.0.0-RC7/rabbitmq_exporter-1.0.0-RC7.linux-amd64.tar.gz
Unpack it:
[root@node102 package]# tar -zxvf rabbitmq_exporter-1.0.0-RC7.linux-amd64.tar.gz -C /opt/software/
Enter the unpacked directory and start the exporter (note: RABBIT_URL must point at the management API on port 15672, not the AMQP port 5672):
[root@node102 rabbitmq]# RABBIT_USER=admin RABBIT_PASSWORD=111111 OUTPUT_FORMAT=JSON PUBLISH_PORT=9099 RABBIT_URL=http://localhost:15672 nohup ./rabbitmq_exporter &
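Passing the password inline like this leaves it in the shell history. One alternative, sketched below with an arbitrary file name of my own choosing, is to keep the settings in an env file and source it at launch. Note that RABBIT_URL should target the management API (default port 15672) rather than the AMQP port 5672.

```shell
# Hypothetical env file; keeps credentials out of the command line and shell history.
cat > rabbitmq_exporter.env <<'EOF'
RABBIT_USER=admin
RABBIT_PASSWORD=111111
OUTPUT_FORMAT=JSON
PUBLISH_PORT=9099
# Management API, not the AMQP port 5672.
RABBIT_URL=http://localhost:15672
EOF
chmod 600 rabbitmq_exporter.env

# Launch on the server by exporting the file's variables first:
#   set -a; . ./rabbitmq_exporter.env; set +a
#   nohup ./rabbitmq_exporter &
```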
Configure Prometheus:
- job_name: 'rabbitMq'
  static_configs:
    - targets: ['ip:9099']   # replace "ip" with the RabbitMQ host's address
Import Grafana dashboard template ID 2181.
The 3.7.18 setup above turns out to be too old; re-download RabbitMQ 3.8 or later, along with a matching Erlang:
wget http://erlang.org/download/otp_src_24.0.tar.gz
[root@node103 package]# wget https://github.com/rabbitmq/rabbitmq-server/releases/download/v3.8.27/rabbitmq-server-generic-unix-3.8.27.tar.xz
III. Alerting configuration
We can now visualize the collected data in Grafana, but beyond displaying dashboards, a monitoring system also needs to raise alerts.
3.1 Email configuration
Grafana's defaults live in grafana/conf/defaults.ini; for a package install, the active configuration file is /etc/grafana/grafana.ini. Edit it as follows:
#################################### Alerting ############################
[alerting]
# Disable alerting engine & UI features
enabled = true
# Makes it possible to turn off alert rule execution but alerting UI is visible
execute_alerts = true
# Default setting for new alert rules. Defaults to categorize error and timeouts as alerting. (alerting, keep_state)
error_or_timeout = alerting
# Default setting for how Grafana handles nodata or null values in alerting. (alerting, no_data, keep_state, ok)
nodata_or_nullvalues = no_data
# Alert notifications can include images, but rendering many images at the same time can overload the server
# This limit will protect the server from render overloading and make sure notifications are sent out quickly
concurrent_render_limit = 5
# Mail server settings; adjust for your own account
[smtp]
enabled = true
host = smtp.163.com:465
user = xl_2020@163.com
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password = MORISJNNLZSGJRTO # the authorization code generated when you enabled the SMTP service
;cert_file =
;key_file =
skip_verify = true
from_address = xl_2020@163.com
from_name = Grafana
Note: this is not the mailbox login password but the SMTP authorization code.
Restart the Grafana service:
systemctl restart grafana-server
Then use the test button on the notification channel page to send a test email.
When configuring alerts on individual panels, dashboards built on template variables cannot be used directly: every variable reference has to be replaced with a concrete IP, which quickly becomes tedious.
The telltale error is: Template variables are not supported in alert queries.
Since Grafana alerting does not support variables, and the imported dashboard templates rely on them heavily, Grafana's built-in alerting is effectively unusable here.
We therefore drop Grafana alerting and implement alerts with Alertmanager instead.
3.2 Alertmanager alerting configuration
In the Prometheus directory, create the rule file:
cd /prometheus
vim rule.yml
groups:
  - name: 192.168.1.221 host monitoring
    rules:
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          team: node
        annotations:
          Alert_type: host memory alert
          Server: '{{ $labels.instance }}'
          explain: "Memory usage is above 90%; available: {{ $value }}%"
      # Low disk space
      - alert: HostDiskSpaceLow
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          team: node
        annotations:
          Alert_type: host disk space alert
          Server: '{{ $labels.instance }}'
          explain: "Disk usage is above 90%; available: {{ $value }}%"
      # Low free inodes
      - alert: HostInodeLow
        expr: node_filesystem_files_free{mountpoint="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{mountpoint="/rootfs"} == 0
        for: 2m
        labels:
          team: node
        annotations:
          Alert_type: host inode alert
          Server: '{{ $labels.instance }}'
          explain: "Inode usage is above 90%; available: {{ $value }}%"
      # High CPU load
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
        for: 0m
        labels:
          team: node
        annotations:
          Alert_type: host CPU alert
          Server: '{{ $labels.instance }}'
          explain: "CPU usage is above 90%; current: {{ $value }}%"
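To see how the memory rule's expression behaves, we can plug in sample numbers by hand. The values below are hypothetical (800 MiB available out of 16 GiB total); awk stands in for PromQL's floating-point division:

```shell
# Hypothetical sample values: MemAvailable = 800 MiB, MemTotal = 16 GiB.
avail=838860800
total=17179869184

# Same arithmetic as the PromQL expression:
#   node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
pct=$(awk -v a="$avail" -v t="$total" 'BEGIN { printf "%.1f", a / t * 100 }')
echo "available: ${pct}%"   # prints: available: 4.9%
```

Since 4.9 < 10, the expression is true; once it has held continuously for the `for: 2m` window, the alert transitions from pending to firing.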
Reference the rule file in the Prometheus config (the filename must match the rule.yml created above), then hot-reload Prometheus. Before reloading, `promtool check rules rule.yml` can catch syntax errors.
rule_files:
  - "rule.yml"
curl -X POST http://192.168.81.104:9090/-/reload
Open the Prometheus web UI (Status → Rules) to confirm the rules are loaded:
Configure Alertmanager alerting:
1. Download and unpack
[root@localhost package]# wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
[root@localhost package]# tar -zxvf alertmanager-0.22.2.linux-amd64.tar.gz -C /opt/software/
2. Edit alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:465'
  smtp_from: 'xl_2020@163.com'
  smtp_auth_username: 'xl_2020@163.com'
  smtp_auth_password: 'MORISJNNLZSGJRTO'
  smtp_require_tls: false
templates:
  - 'template/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 1m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'xl_2020@163.com'
        send_resolved: true
        # Render the custom template defined below instead of the default email body
        html: '{{ template "test.html" . }}'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
3. Create the email template
[root@lzy alertmanager]# mkdir template
[root@lzy alertmanager]# cd template/
[root@lzy template]# vim test.tmpl
{{ define "test.html" }}
<table border="1">
  <tr>
    <td>Alert</td>
    <td>Instance</td>
    <td>Details</td>
    <td>Start time</td>
  </tr>
  {{ range $i, $alert := .Alerts }}
  <tr>
    <td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Annotations "explain" }}</td>
    <td>{{ $alert.StartsAt }}</td>
  </tr>
  {{ end }}
</table>
{{ end }}
Note that the annotation key looked up here ("explain") must match a key actually set in rule.yml; a key that is never set would simply render an empty cell.
4. Edit prometheus.yml so Prometheus knows where to send alerts
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.81.104:9093']
5. Start Alertmanager
nohup ./alertmanager &
6. Hot-reload Prometheus
curl -X POST http://192.168.81.104:9090/-/reload
Finally, wait for the alert email to arrive.
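Rather than waiting for a real threshold breach, the whole pipeline can be exercised by pushing a hand-crafted alert straight into Alertmanager's v2 API. The payload below is hypothetical; its labels and annotation are placeholders, not part of the rules defined above:

```shell
# A fabricated alert payload for Alertmanager's v2 API.
cat > test_alert.json <<'EOF'
[
  {
    "labels": {
      "alertname": "ManualTest",
      "instance": "192.168.81.104"
    },
    "annotations": {
      "explain": "manually fired test alert"
    }
  }
]
EOF

# Post it to Alertmanager, then wait for the email:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d @test_alert.json http://192.168.81.104:9093/api/v2/alerts
```

If the email arrives with the table rendered from test.tmpl, the route, receiver, and template wiring are all correct.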
