Setting up monitoring: Prometheus, Grafana, node-exporter


prometheus

Download

Start / stop

# Change into the directory where the tarball was extracted
cd ~/monitor/prometheus/prometheus-2.45.0.linux-amd64
# Write the start script
echo './prometheus --config.file=./prometheus.yml &' > start.sh && chmod +x start.sh
# Write the stop script
echo 'pkill prometheus' > stop.sh && chmod +x stop.sh
# Start
nohup sh start.sh &

Configuration file

cat prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  # Scrape the metrics exposed by node exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'your application'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: [ 'localhost:9098' ]
        labels:
          application: 'your application'

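Note: the application serves its metrics over HTTPS, so the simplest option is to skip certificate verification through tls_config (insecure_skip_verify: true).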
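Before starting Prometheus with this file, it can be worth checking it for syntax errors. The following is a minimal sketch, assuming the promtool binary that ships in the same release tarball sits next to prometheus.yml; the function name check_prom_config is made up for this example.

```shell
#!/bin/sh
# check_prom_config DIR
# Runs "promtool check config" on DIR/prometheus.yml.
# Assumption: DIR is the extracted release directory, which also contains promtool.
check_prom_config() {
    dir="$1"
    if [ ! -x "$dir/promtool" ]; then
        echo "promtool not found in $dir" >&2
        return 2
    fi
    "$dir/promtool" check config "$dir/prometheus.yml"
}
```

For the layout used in this article it could be called as: check_prom_config ~/monitor/prometheus/prometheus-2.45.0.linux-amd64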

Verification

Prometheus ships with a built-in web UI. Open localhost:9090 in a browser to reach it. Remember to allow access to port 9090.
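The same check can be done from a terminal without a browser. This is a small sketch, assuming curl is installed; it relies on Prometheus' /-/healthy endpoint, and the function names http_status and prometheus_healthy are invented for illustration.

```shell
#!/bin/sh
# http_status URL
# Prints the HTTP status code returned by URL ("000" if the connection fails).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# prometheus_healthy [BASE_URL]
# Succeeds when the /-/healthy endpoint answers 200.
# BASE_URL defaults to the local instance on port 9090.
prometheus_healthy() {
    base="${1:-http://localhost:9090}"
    [ "$(http_status "$base/-/healthy")" = "200" ]
}
```

A possible use: prometheus_healthy && echo "Prometheus is up"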


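grafana

Download

  • grafana.com/grafana/dow…
  • Two package formats are available; pick either one. The tar.gz contains prebuilt binaries that run directly after extraction, and it is what this article uses. For installing and starting the rpm package, see the references at the end of this article.

Start / stop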

cd ~/monitor/grafana
nohup ./bin/grafana server &

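Access

Open localhost:3000 in a browser. You will be prompted for a username and password; both default to admin. Change the password after the first login.

Configure the data source

Select Prometheus as the data source.

Enter the Prometheus address.

Change the HTTP method to GET and save. Then choose a dashboard.

The dashboard should now display data.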

node-exporter

Download

Install / start / stop

cd ~/monitor/prometheus/node_exporter-1.6.1.linux-amd64

# Write the start script
echo './node_exporter &' > start.sh && chmod +x start.sh
# Write the stop script
echo 'pkill node_exporter' > stop.sh && chmod +x stop.sh

# Start
nohup sh start.sh &
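Once node_exporter is running, it serves its metrics as plain text on port 9100 (the target used in prometheus.yml). The helper below is a rough sketch for pulling one value out of that text; the name metric_value is made up, and it only handles samples without labels.

```shell
#!/bin/sh
# metric_value NAME
# Reads Prometheus text-format metrics on stdin and prints the value of
# every sample whose name is exactly NAME (label-less samples only).
# Comment lines (# HELP / # TYPE) never match because their first field is "#".
metric_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}
```

Assuming curl is available, it could be used as: curl -s localhost:9100/metrics | metric_value node_memory_MemTotal_bytes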

Configure prometheus.yml

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  # Scrape the metrics exposed by node exporter
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

Configure the Grafana dashboard

Copy the dashboard template ID.

  • Import the template in Grafana.

Enter the copied template ID and click Load.

Select the data source and click Import, then view the result.

alert manager

Download

Configuration

alertmanager.yml

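global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.mxhichina.com:465' # SMTP server of the enterprise mailbox; this article uses the free Alibaba Cloud enterprise mailbox
  smtp_from: 'xxx@xxxx.com'  # enterprise mailbox address
  smtp_auth_username: 'xxx@xxxx.com' # same address as above
  smtp_auth_password: 'xxxxx'   # Note: this is the mailbox authorization code (the password for third-party clients), not the login password. Only enterprise mailboxes have one.
  smtp_require_tls: false   # whether to require TLS

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m  # how long to wait before re-sending an alert; a longer value reduces mail volume
  receiver: 'mail'    # receiver that delivers the alerts

receivers:
- name: 'mail'        # receiver definition; the name must match the receiver in route
  email_configs:
  - to: 'xxxx@163.com' # recipient address; list multiple recipients on separate lines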

Alibaba Cloud enterprise mailbox

For how to apply, see: www.iplaysoft.com/free-domain…
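Once Alertmanager is running with this configuration, the mail route can be exercised without waiting for a real rule to fire. The sketch below posts one hand-made alert to Alertmanager's v2 API with curl; the function name send_test_alert and the label values TestAlert and manual-test are invented for illustration, and port 9093 is the one referenced in prometheus.yml.

```shell
#!/bin/sh
# send_test_alert [ALERTMANAGER_URL]
# Posts a single synthetic alert to POST /api/v2/alerts.
# ALERTMANAGER_URL defaults to the local instance on port 9093.
# The labels and annotation below are made-up test values.
send_test_alert() {
    base="${1:-http://localhost:9093}"
    curl -s -X POST "$base/api/v2/alerts" \
        -H 'Content-Type: application/json' \
        -d '[{"labels":{"alertname":"TestAlert","instance":"manual-test"},"annotations":{"summary":"test alert"}}]'
}
```

With group_wait set to 10s as above, a notification mail would be expected shortly after calling send_test_alert, provided the SMTP settings are correct.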

Alert rules

alert_rules.yml

groups:
- name: node
  rules:
  - alert: server_status
    expr: up{} == 0
    for: 15s
    annotations:
      summary: "Host {{ $labels.instance }} is down"
      description: "Please investigate immediately!"
  - alert: server_status
    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 70
    for: 1s
    annotations:
      summary: "Memory usage on host {{ $labels.instance }} is above 70%"
      description: "Please investigate immediately!"
  - alert: server_status
    expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance)) * 100 > 70
    for: 1s
    annotations:
      summary: "CPU usage on host {{ $labels.instance }} is above 70%"
      description: "Please investigate immediately!"
  - alert: server_status
    expr: max((node_filesystem_size_bytes{fstype=~"ext.?|xfs"}-node_filesystem_free_bytes{fstype=~"ext.?|xfs"}) *100/(node_filesystem_avail_bytes {fstype=~"ext.?|xfs"}+(node_filesystem_size_bytes{fstype=~"ext.?|xfs"}-node_filesystem_free_bytes{fstype=~"ext.?|xfs"})))by(instance) > 80
    for: 15s
    annotations:
      summary: "Partition usage on host {{ $labels.instance }} is above 80%"
      description: "Please investigate immediately!"
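The memory rule's expression can be reproduced by hand on the monitored host, which helps when judging whether the 70% threshold is sensible. This is a rough sketch, assuming a Linux host where /proc/meminfo is available (node_exporter reads its node_memory_* metrics from that file); the function name mem_used_percent is made up.

```shell
#!/bin/sh
# mem_used_percent
# Reads /proc/meminfo-style text on stdin and prints
#   100 - MemAvailable * 100 / MemTotal
# with one decimal place, i.e. the same formula as the memory alert rule.
# Returns 1 if MemTotal is missing or zero.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1
            printf "%.1f\n", 100 - (avail * 100 / total)
        }'
}
```

On the host itself it could be run as: mem_used_percent < /proc/meminfo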

Update 2023-10-15

After changing the configuration, Prometheus must be restarted.

sudo systemctl restart prometheus

Configure Prometheus

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
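rule_files:
  # Prometheus does not expand "~"; use an absolute path.
  # /home/fx is assumed here to be the home directory of user fx.
  - "/home/fx/monitor/prometheus/alertmanager-0.26.0.linux-amd64/alert_rules.yml"
  # - "second_rules.yml"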

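Start

To start it, see the systemd service setup below.

Running as systemd services

Create a service file for each component under /usr/lib/systemd/system.

For an explanation of the service file format, see: Systemd 入门教程:实战篇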

prometheus

prometheus.service

[Unit]
Description=Prometheus Service
Wants=network-online.target
After=network-online.target

[Service]
User=fx
Group=fx
Type=simple
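# systemd does not expand "~", and ExecStart needs an absolute path.
# /home/fx is assumed here to be the home directory of user fx.
ExecStart=/home/fx/monitor/prometheus/prometheus-2.45.0.linux-amd64/prometheus \
    --config.file=/home/fx/monitor/prometheus/prometheus-2.45.0.linux-amd64/prometheus.yml \
    --storage.tsdb.path=/home/fx/monitor/prometheus/prometheus-2.45.0.linux-amd64/data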

[Install]
WantedBy=multi-user.target

grafana

grafana.service

[Unit]
Description=Grafana
Wants=network-online.target
After=network-online.target

[Service]
User=fx
Group=fx
Type=simple
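# systemd does not expand "~", and ExecStart needs an absolute path.
# /home/fx is assumed here to be the home directory of user fx.
ExecStart=/home/fx/monitor/grafana/grafana-10.1.2/bin/grafana-server \
 --config=/home/fx/monitor/grafana/grafana-10.1.2/conf/defaults.ini \
 --homepath=/home/fx/monitor/grafana/grafana-10.1.2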

[Install]
WantedBy=multi-user.target

node exporter

node_exporter.service

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=fx
Group=fx
Type=simple
# systemd does not expand "~"; /home/fx is assumed to be the home directory of user fx.
ExecStart=/home/fx/monitor/prometheus/node_exporter-1.6.1.linux-amd64/node_exporter

[Install]
WantedBy=multi-user.target

alertmanager

alertmanager.service

[Unit]
Description=AlertManager
Wants=network-online.target
After=network-online.target

[Service]
User=fx
Group=fx
Type=simple
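# systemd does not expand "~", and ExecStart needs an absolute path.
# /home/fx is assumed here to be the home directory of user fx.
ExecStart=/home/fx/monitor/prometheus/alertmanager-0.26.0.linux-amd64/alertmanager \
    --config.file /home/fx/monitor/prometheus/alertmanager-0.26.0.linux-amd64/alertmanager.yml \
    --storage.path /home/fx/monitor/prometheus/alertmanager-0.26.0.linux-amd64/storage/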

[Install]
WantedBy=multi-user.target

Start

# Reload the unit files
sudo systemctl daemon-reload
# Start the services
sudo systemctl start prometheus
sudo systemctl start grafana
sudo systemctl start node_exporter
sudo systemctl start alertmanager
# Check the status
systemctl status prometheus
systemctl status grafana
systemctl status node_exporter
systemctl status alertmanager
# Stop the services
sudo systemctl stop prometheus
sudo systemctl stop grafana
sudo systemctl stop node_exporter
sudo systemctl stop alertmanager
# Enable start at boot
sudo systemctl enable prometheus
sudo systemctl enable grafana
sudo systemctl enable node_exporter
sudo systemctl enable alertmanager

If a service fails to start, run the command that follows ExecStart directly in a terminal and read the error output.
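Checking four units one at a time gets tedious, so a small loop can summarize their state. This is a sketch only, assuming the unit names used in the commands above; the function name report_services is made up, and the journalctl hint is just a suggestion for where to look next.

```shell
#!/bin/sh
# report_services NAME...
# Prints "NAME: active" or "NAME: NOT active" for each unit and
# returns non-zero if at least one unit is not active.
report_services() {
    failed=0
    for svc in "$@"; do
        if systemctl is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: NOT active (try: journalctl -u $svc -n 50)"
            failed=1
        fi
    done
    return "$failed"
}
```

For this setup it could be called as: report_services prometheus grafana node_exporter alertmanager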

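References

- Prometheus+Grafana 监控服务器CPU、磁盘、内存等信息 (Prometheus + Grafana: monitoring server CPU, disk, memory and more)

- Configure Prometheus AlertManager

- Prometheus+Grafana+Alertmanager部署教程 (Prometheus + Grafana + Alertmanager deployment tutorial)

- Systemd 入门教程:实战篇 (Systemd tutorial for beginners: practical part)

- 阿里云企业邮箱免费版 (Alibaba Cloud enterprise mailbox, free edition)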