"Xiaoxin Discovers": a Prometheus + Grafana monitoring and alerting stack


Introduction

I recently ran into a problem: when services like MySQL, Redis, or MongoDB fail, there is no timely notification or early warning. This post introduces a currently popular solution that is easy to integrate, similar in spirit to a service-mesh sidecar: Prometheus. There are plenty of options in this space; this is the one whose tutorials I could understand at a glance. The stack is Prometheus, a time-series database that stores the service metrics, plus Grafana, an excellent visualization frontend with lots of official sample dashboards (JSON), which makes things very convenient.

Versions and download links

This walkthrough uses a Linux server that already has Redis and MySQL installed, so installing those services is skipped; pick the pieces that fit your own scenario.

Preparing the environment

Create two directories on the server. Don't ask why; personal preference. 1.png

The prometheus directory holds: 2.png

The node_exporter directory holds: 3.png

If you're unsure, just mirror this directory layout. The extraction command is tar -zxvf.
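The layout above can be sketched in a couple of commands (the Prometheus version is the one used later in this post; the node_exporter tarball name is left as a glob since its version isn't pinned here):

```shell
# Recreate the two-directory layout from the screenshots.
# A temp dir stands in for /home here; on the real server
# you would mkdir under /home directly.
base=$(mktemp -d)
mkdir -p "$base/prometheus" "$base/node_exporter"

# The tarballs then get unpacked into their respective directories, e.g.:
#   tar -zxvf prometheus-2.24.0.linux-amd64.tar.gz       -C /home/prometheus
#   tar -zxvf node_exporter-*.linux-amd64.tar.gz         -C /home/node_exporter
ls "$base"
```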

Setting up Prometheus

Configuration

Extract: tar -zxvf ./prometheus-2.24.0.linux-amd64.tar.gz, then edit the config file: vim ./prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

Running

Start command: ./prometheus --web.enable-lifecycle

The --web.enable-lifecycle flag enables the lifecycle HTTP endpoints (hot reload), which makes applying config changes later much easier.

Result

Open 127.0.0.1:9090 to check. 4.png

Normally you go to the Targets page; the Endpoint column shows this machine with state UP, meaning it started correctly. (The other red entries are from configs I built and saved earlier; on a fresh install you won't see them yet. We'll add them one by one below. Why is everything on one box? Simple: no spare servers.) 5.png
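The same up/down information is also exposed by Prometheus's HTTP API, which is handy for scripting. A sketch (the JSON below is a trimmed, illustrative sample, not live output):

```shell
# On the server you would query the API directly:
#   curl -s http://127.0.0.1:9090/api/v1/targets
# A trimmed, illustrative sample of what it returns:
resp='{"status":"success","data":{"activeTargets":[{"labels":{"job":"prometheus"},"health":"up"}]}}'
# Pull out the health field of each target:
printf '%s\n' "$resp" | grep -o '"health":"[a-z]*"'
```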

Setting up Grafana

Configuration

tar -zxvf ./grafana-7.3.6.linux-amd64.tar.gz. There is nothing to configure here; just start it.

Running

./bin/grafana-server web

Result

Open 127.0.0.1:3000. The initial login is admin/admin; follow the prompt to change it. You'll see the page, with dashboards I configured earlier; yours will appear as we go, so don't worry for now. 6.png Then add a data source: click Add data source, choose Prometheus, and fill in the details. 7.png 8.png

I use 127.0.0.1 because everything runs on one machine. My personal advice: use a hostname and map it in /etc/hosts, which is more flexible later; otherwise, adjust the IP address to your actual setup.

With the database and the dashboard both up, next is node_exporter, the host metrics exporter.

Setting up node_exporter

Configuration

node_exporter monitors the host itself; there is nothing to configure, so we run it directly.

Running

./node_exporter; if the log looks clean, it's fine. 9.png
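You can also sanity-check the exporter before wiring it into Prometheus by curling its metrics endpoint. The lines below are a small illustrative sample of the exposition format (the values are made up):

```shell
# On the server: curl -s http://127.0.0.1:9100/metrics | head
# Illustrative sample of what node_exporter serves:
sample='# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_cpu_seconds_total{cpu="0",mode="idle"} 312.4'
# Lines starting with # are HELP/TYPE metadata; the rest are the series:
printf '%s\n' "$sample" | grep -v '^#'
```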

Result

Then configure Prometheus: vim ./prometheus.yml. Each time I'll paste the entire config so far, so it's easy to follow along.

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'

The main change is the new job_name. Why linux121? No reason; it's just what I typed.

Then reload Prometheus through its HTTP endpoint; I'll curl directly from the server:

curl -X POST http://127.0.0.1:9090/-/reload

Back on the Prometheus page, the linux121 target now appears. 10.png

Now add the dashboard in Grafana; the exciting moment. 11.png Enter 8919; this is a dashboard ID from Grafana's official dashboard site, where you can search for, download, and upload dashboard JSON (link at the end). Click Load, select the Prometheus data source at the bottom, then click Import. 13.png 12.png

The finished page. Pretty, right? Panels showing No data may be due to queries in the dashboard JSON that don't match, data that hasn't been scraped yet, or some reason I haven't figured out. 14.png

Setting up mysqld_exporter

Configuration

vim ./.my.conf. Don't ask why it's called that; the tutorial I learned from used this name, so I kept it.

[client]
user=root
password=mima

There is no MySQL address to fill in, because this exporter runs on the same host as the MySQL service it monitors; the defaults apply, and user plus password are enough.
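One hedged suggestion: instead of root, the mysqld_exporter README recommends a dedicated, minimally privileged account. The user name and password below are placeholders; run the statements in your MySQL shell and put that account in .my.conf instead:

```shell
# Capture the GRANT statements suggested by the mysqld_exporter README;
# 'exporter' / 'change-me' are placeholders, pick your own.
sql=$(cat <<'SQL'
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'change-me' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
SQL
)
printf '%s\n' "$sql"
# On the real server you would feed these statements to: mysql -u root -p
```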

Running

./mysqld_exporter --config.my-cnf="./.my.conf"; nice, no errors. 15.png

Result

Continue adding it to Prometheus. From here on I'll skip the repeated steps; they're basically the same, so work them out yourself.

(Reader: admit it, you're just lazy. Me: so what!)

Configure Prometheus and add a mysql job_name:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'mysql'
    static_configs:
      - targets: ['127.0.0.1:9104']
        labels:
          instance: 'db-01'

  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'

Reload Prometheus as before, then check its page. Like magic. 16.png

In Grafana, import again, this time dashboard 7362. Straight to the result; the panels take a little while to fill in, since the data has to be scraped first. 17.png

Setting up redis_exporter

Configuration

Go into /home/node_exporter/redis_exporter-v1.15.1.linux-amd64. There is nothing to configure; you just need to know your Redis address and password, see below.

Running

./redis_exporter -redis.addr 127.0.0.1:6379 -redis.password mima 18.png
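A quick way to verify the exporter can actually reach Redis is the redis_up metric (1 means the connection works). The sample below is illustrative, not a live scrape:

```shell
# On the server: curl -s http://127.0.0.1:9121/metrics | grep '^redis_up'
# Illustrative sample of what a healthy scrape contains:
sample='redis_up 1
redis_connected_clients 3'
printf '%s\n' "$sample" | grep '^redis_up'
```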

Result

Configure Prometheus, reload, and check the result.

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'mysql'
    static_configs:
      - targets: ['127.0.0.1:9104']
        labels:
          instance: 'db-01'

  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'

  - job_name: 'redis'
    static_configs:
      - targets: ['127.0.0.1:9121']
        labels:
          instance: 'redis1'

19.png

Configure Grafana, this time importing dashboard 11835. 20.png

Everything so far has been monitoring, and it's all in place now. There are many other exporters (MongoDB and more) that I won't try one by one. Now for the key part: alerting.

Setting up Alertmanager

Configuration

Extract tar -zxvf ./alertmanager-0.21.0.linux-amd64.tar.gz and go into /home/prometheus/alertmanager-0.21.0.linux-amd64.

Now the config. I send from a QQ mailbox to a NetEase (163) mailbox; how to enable SMTP and get an authorization code for your provider is easy to search for. Alertmanager can reportedly also integrate with DingTalk and similar tools; something to explore later. Edit alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_from: 'XXXXXXXX@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: 'XXXXXXXX@qq.com'
  smtp_auth_password: 'XXXXXXXX'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'XXXXXXXX@163.com'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Validate with ./amtool check-config alertmanager.yml to confirm the file is well-formed.

[root@s13-224 alertmanager-0.21.0.linux-amd64]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 1 receivers
 - 0 templates

Running

./alertmanager --config.file='./alertmanager.yml' --cluster.advertise-address=0.0.0.0:9093

21.png

Result

Open 127.0.0.1:9093 to see the page; this one belongs to Alertmanager. 22.png

Now write a rule. Create the file first_rules.yml under /home/prometheus/rules (the directory has to match the rule_files glob configured below), with this content:

groups:
 - name: test-rules
   rules:
   - alert: InstanceDown # alert name
     expr: up == 0 # firing condition: the target's scrape is failing; see the PromQL docs for richer rules
     for: 5s # how long the condition must hold before the alert is sent
     labels: # extra labels attached to the alert
      team: node
     annotations: # annotations that describe the alert in detail
      summary: "{{$labels.instance}}: has been down"
      description: "{{$labels.instance}}: job {{$labels.job}} has been down "

Point Prometheus at Alertmanager and the rule files, restart, and check the result.

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/home/prometheus/rules/*.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'mysql'
    static_configs:
      - targets: ['127.0.0.1:9104']
        labels:
          instance: 'db-01'

  - job_name: 'linux121'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'linux-121'

  - job_name: 'redis'
    static_configs:
      - targets: ['127.0.0.1:9121']
        labels:
          instance: 'redis1'

23.png

Everything is now running, so let's trigger an alert. Pick an exporter and stop it; I chose node_exporter. Since everything here runs in foreground terminals, Ctrl+C kills it; then wait for the email. 24.png 25.png
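While node_exporter is down, the firing alert is also visible through the Prometheus API, not just in the mail. A sketch (the JSON below is a trimmed, illustrative sample):

```shell
# On the server: curl -s http://127.0.0.1:9090/api/v1/alerts
# Trimmed, illustrative sample of a firing InstanceDown alert:
resp='{"data":{"alerts":[{"labels":{"alertname":"InstanceDown"},"state":"firing"}]}}'
printf '%s\n' "$resp" | grep -o '"state":"[a-z]*"'
```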

Reference links

Prometheus official documentation · Grafana dashboard search site