《小鑫发现》: A Prometheus + Grafana Monitoring and Alerting Kit
Introduction
Recently I ran into a problem: when services like MySQL, Redis, and MongoDB fail, there is no timely notification or early warning. So this time I'll introduce a currently popular solution that is also easy to integrate, a bit like the sidecar pattern in a service mesh: Prometheus. There are plenty of solutions in this space; this is the one I could understand at a glance. In this stack, Prometheus is a time-series database that stores the service metrics, and Grafana is an excellent visualization layer with plenty of official sample dashboards in JSON, which makes things very convenient.
Versions and download links
- prometheus-2.24.0.linux-amd64.tar.gz (download link)
- grafana-7.3.6.linux-amd64.tar.gz (download link)
- alertmanager-0.21.0.linux-amd64.tar.gz (download link)
- mysqld_exporter-0.12.1.linux-amd64.tar.gz (download link)
- node_exporter-1.0.1.linux-amd64.tar.gz (download link)
- redis_exporter-v1.15.1.linux-amd64.tar.gz (download link)
This setup runs on a Linux server that already has Redis and MySQL installed, so I'll skip installing those services; pick whatever fits your own scenario.
Preparing the environment
Create two folders on the server. Don't ask why two; personal preference.
The prometheus folder holds the server-side pieces (Prometheus and, later, Alertmanager).
The node_exporter folder holds the exporters (node_exporter, redis_exporter, and friends).
If unsure, just mirror this layout for now. Extract the archives with tar -zxvf.
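The layout above can be sketched as a couple of commands. The /home base path is just the article's habit; the BASE override below is my addition so the commands work anywhere:

```shell
# Sketch of the two-folder layout; /home/... is what the article uses,
# the BASE override is a convenience so this runs anywhere.
BASE="${BASE:-$PWD}"

# Server-side pieces (Prometheus, and later Alertmanager) go under prometheus/,
# per-host collectors (node_exporter, redis_exporter, ...) under node_exporter/.
mkdir -p "$BASE/prometheus" "$BASE/node_exporter"

# Each tarball then gets extracted into its matching folder, e.g.:
# tar -zxvf prometheus-2.24.0.linux-amd64.tar.gz -C "$BASE/prometheus"
ls "$BASE"
```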
Setting up Prometheus
Configuration
Extract: tar -zxvf ./prometheus-2.24.0.linux-amd64.tar.gz
Config file: vim ./prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
Run
Start command: ./prometheus --web.enable-lifecycle
The --web.enable-lifecycle flag enables the lifecycle HTTP API (hot reload), which makes later config updates easier.
Result
Open 127.0.0.1:9090 to see the result.
The first place to look is the Targets page: the Endpoint list shows the local instance with state up, meaning it started correctly. (The other red entries come from configs I built and saved earlier; on a fresh install you won't see them yet. We'll add them one by one as we go. Why reuse the same box for everything? In a word: no spare servers.)
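While you're in the UI, the Graph tab is worth a quick try. These two expressions use only the built-in up metric, so they work on any fresh install and confirm scraping is happening:

```promql
# 1 for each target Prometheus can scrape, 0 for targets that are down
up

# the same, restricted to the self-scrape job configured above
up{job="prometheus"}
```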
Setting up Grafana
Configuration
Extract: tar -zxvf ./grafana-7.3.6.linux-amd64.tar.gz
There is nothing to configure here; go straight to starting it.
Run
./bin/grafana-server web
Result
Open 127.0.0.1:3000. The initial credentials are admin/admin; you'll be prompted to change the password. The page already shows some dashboards I configured earlier; yours will fill in as we go, so don't worry about it for now.
Next, add a data source: on the page, click
Add data source, choose Prometheus, and fill in the details.
I use 127.0.0.1 because everything runs on one machine. My personal advice: in the long run, use a hostname and manage it via the hosts file, which is more flexible; adjust the IP address to your own situation.
With the database and the dashboard layer both up, it's time to set up node_exporter, the first collector.
Setting up node_exporter
Configuration
node_exporter monitors the server itself; there is nothing to configure, so we run it directly.
Run
Run ./node_exporter; if the log shows nothing unusual, it's fine.
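A quick sanity check I like to add here (not one of the original steps): node_exporter serves plain-text metrics on port 9100, so curling it from the same box should print metric lines. node_load1 is one of its standard metrics; your value will of course differ:

```
$ curl -s http://127.0.0.1:9100/metrics | grep '^node_load1 '
node_load1 0.21
```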
Result
Now add it to Prometheus: edit with vim ./prometheus.yml. At each step I'll paste the entire config as it stands, to make it easy to follow along.
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'
The change is just the new job_name. Why linux121? Don't ask; I typed it on a whim.
Then reload Prometheus through its HTTP API; I'll just curl from the server itself:
curl -X POST http://127.0.0.1:9090/-/reload
Check the Prometheus Targets page again and the linux121 entry shows up.
Now over to Grafana to add a monitoring dashboard. The exciting part, hahaha.
Click Import and enter 8919, a dashboard ID from the official dashboard library; you can search, download, and upload these JSON dashboards there, and I'll post the address later. Click Load, remember to pick the Prometheus data source at the bottom, then click Import.
And here is the result page. Pretty, right? Panels showing No data may have a query that doesn't fit your setup, or simply not enough scraped data yet, or some reason I haven't figured out.
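When a panel shows No data, opening it with Panel, then Edit, and reading its query is the fastest way to debug. For orientation, the CPU panels in these node dashboards are typically built on an expression of roughly this shape (a common idiom, not copied verbatim from dashboard 8919):

```promql
# percent of CPU busy per instance: 100 minus the idle rate over 5 minutes
100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```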
Setting up mysqld_exporter
Configuration
Config file: vim ./.my.conf. Don't ask why it's named that; it's what the tutorial I learned from used, so let's go with it.
[client]
user=root
password=mima
There is no need to put a MySQL address here: this exporter is expected to run on the same host as the MySQL service it monitors, so the default connection settings apply and a user plus password is enough.
Run
./mysqld_exporter --config.my-cnf="./.my.conf" and, nicely, no errors.
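Before wiring it into Prometheus, one extra check I'd suggest (again, not an original step): the exporter listens on port 9104 and exposes a mysql_up gauge that reads 1 when it can reach MySQL:

```
$ curl -s http://127.0.0.1:9104/metrics | grep '^mysql_up'
mysql_up 1
```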
Result
Next, add it to Prometheus and Grafana. From here on I'll abbreviate the steps, since they mostly repeat; work through them yourself.
(Reader: just say it, you're lazy. Me: so what!)
Configure Prometheus by adding a mysql job_name:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'mysql'
    static_configs:
      - targets: ['127.0.0.1:9104']
        labels:
          instance: 'db-01'
  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'
Reload Prometheus and look at its page again: there it is, just like that.
Back in Grafana, import again, this time ID 7362. Straight to the screenshot; the numbers need a moment to fill in, since the data has to be scraped fresh.
Setting up redis_exporter
Configuration
Go to the matching directory, /home/node_exporter/redis_exporter-v1.15.1.linux-amd64. There is really nothing to configure; you only need to know your own Redis address and password, see below.
Run
./redis_exporter -redis.addr 127.0.0.1:6379 -redis.password mima
Result
Configure Prometheus, reload it as before, and check the result.
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'mysql'
    static_configs:
      - targets: ['127.0.0.1:9104']
        labels:
          instance: 'db-01'
  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'
  - job_name: 'redis'
    static_configs:
      - targets: ['127.0.0.1:9121']
        labels:
          instance: 'redis1'
Configure Grafana: this time the dashboard ID is 11835, then Import.
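If the Redis panels stay empty, a query worth running in the Prometheus Graph tab is redis_up, a standard redis_exporter gauge that reads 1 while the exporter can reach Redis (the instance label matches the one set in the scrape config):

```promql
redis_up{instance="redis1"}
```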
That's all the monitoring itself set up. There are exporters for many other services too, MongoDB among them, and plenty more; I won't try them one by one. Now comes the crucial part: alert notifications.
Setting up Alertmanager
Configuration
Extract under /home/prometheus with tar -zxvf ./alertmanager-0.21.0.linux-amd64.tar.gz, then enter the directory /home/prometheus/alertmanager-0.21.0.linux-amd64.
Now the configuration, in alertmanager.yml. Here I send from a QQ mailbox to a NetEase (163) one; how to apply for the SMTP credentials is easy to look up. I've also heard this can integrate with tools like DingTalk, which I'll look into when I get the chance.
global:
  resolve_timeout: 5m
  smtp_from: 'XXXXXXXX@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'XXXXXXXX@qq.com'
  smtp_auth_password: 'XXXXXXXX'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'XXXXXXXX@163.com'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Validate it with ./amtool check-config alertmanager.yml to confirm the file is correct:
[root@s13-224 alertmanager-0.21.0.linux-amd64]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
- global config
- route
- 1 inhibit rules
- 1 receivers
- 0 templates
Run
./alertmanager --config.file='./alertmanager.yml' --cluster.advertise-address=0.0.0.0:9093
Result
Open 127.0.0.1:9093 to see the page; this one belongs to Alertmanager.
Next, write a rule: under /home/prometheus, create rules/first_rules.yml (the path must match the rule_files glob in prometheus.yml below).
Its content:
groups:
- name: test-rules
  rules:
  - alert: InstanceDown # alert name
    expr: up == 0 # firing condition: any Prometheus query works here; this one fires when a target is down
    for: 5s # how long the condition must hold before the alert fires
    labels: # extra labels attached to the alert
      team: node
    annotations: # annotations that describe the alert in detail
      summary: "{{$labels.instance}}: has been down"
      description: "{{$labels.instance}}: job {{$labels.job}} has been down"
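Any Prometheus query can become a rule this way. Purely as an illustration (the group name, alert name, and the 80% threshold are mine, not from the original), a high-CPU rule in the same format could look like:

```yaml
groups:
- name: cpu-rules                 # hypothetical group, for illustration only
  rules:
  - alert: HighCpuLoad
    # fire when average CPU busy time on an instance stays above 80%
    expr: 100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
    for: 2m                       # the condition must hold for 2 minutes
    labels:
      team: node
    annotations:
      summary: "{{$labels.instance}}: CPU above 80% for 2 minutes"
```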
Configure Prometheus, restart it, and check the result.
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/home/prometheus/rules/*.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'mysql'
    static_configs:
      - targets: ['127.0.0.1:9104']
        labels:
          instance: 'db-01'
  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'
  - job_name: 'redis'
    static_configs:
      - targets: ['127.0.0.1:9121']
        labels:
          instance: 'redis1'
With everything running, it's time to verify alerting. Pick an exporter and kill it; I chose node_exporter. Since I started everything in foreground terminal windows, a simple Ctrl+C does it. Then wait for the email.
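While waiting for the email, you can watch the alert change state: the Alerts page on 127.0.0.1:9090 shows it turn pending and then firing, and this expression in the Graph tab lists every alert currently firing (ALERTS is a series Prometheus maintains automatically for active alerts):

```promql
ALERTS{alertstate="firing"}
```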