Prometheus
Download
Start/Stop
# Change to the extraction directory
cd ~/monitor/prometheus/prometheus-2.45.0.linux-amd64
# Write the start script
echo './prometheus --config.file=./prometheus.yml &' > start.sh && chmod +x start.sh
# Write the stop script
echo 'pkill prometheus' > stop.sh && chmod +x stop.sh
# Start
nohup sh start.sh &
Configuration file
cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  # Scrape node exporter metrics
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'your application'
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: [ 'localhost:9098' ]
        labels:
          application: 'your application'
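Note: because the application serves HTTPS, the simplest option is to skip certificate verification through tls_config. insecure_skip_verify: true disables verification of the server certificate entirely, so it is best kept to test setups or self-signed certificates.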
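Before starting Prometheus with an edited prometheus.yml, it can help to validate the file first. The following is a minimal sketch, assuming the promtool binary that ships in the same tarball sits next to the config file; the function name check_prom_config and the PROMTOOL variable are made up for illustration.

```shell
# Hypothetical helper: validate a Prometheus config before (re)starting.
# PROMTOOL is an assumed variable; it defaults to ./promtool in the current directory.
PROMTOOL="${PROMTOOL:-./promtool}"

check_prom_config() {
    config_file="${1:-prometheus.yml}"
    if [ ! -f "$config_file" ]; then
        echo "config file not found: $config_file" >&2
        return 1
    fi
    # "promtool check config" also checks the rule files the config references.
    "$PROMTOOL" check config "$config_file"
}
```

Running check_prom_config prometheus.yml from the extraction directory should report syntax problems without touching the running server.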
Verification
Prometheus ships with a built-in web UI.
Open localhost:9090 in a browser. Remember to allow access to port 9090 (for example in the firewall).
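The same check can be done from the command line through the /-/ready endpoint of Prometheus. This is a rough sketch rather than a finished script; the function name wait_for_prometheus and the PROM_URL variable are assumptions, and curl is assumed to be installed.

```shell
# Hypothetical helper: succeed once Prometheus answers its readiness endpoint.
# PROM_URL is an assumption; adjust host and port to your setup.
PROM_URL="${PROM_URL:-http://localhost:9090}"

wait_for_prometheus() {
    attempts="${1:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors such as 503
        if curl -fsS "$PROM_URL/-/ready" >/dev/null 2>&1; then
            echo "Prometheus is ready"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "Prometheus did not become ready after $attempts attempts" >&2
    return 1
}
```

Calling wait_for_prometheus 30 right after the start script gives the server up to roughly 30 seconds to come up.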
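Grafana
Download
- grafana.com/grafana/dow…
- Two package formats are available; pick either one. The tar.gz is a binary archive that can be run directly after extraction, and it is what this article uses. For installing and starting from the rpm, see the references at the end.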
Start/Stop
# Grafana home directory (the same path used in grafana.service)
cd ~/monitor/grafana/grafana-10.1.2
nohup ./bin/grafana server &
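Access
Open localhost:3000 in a browser.
You will be prompted for a username and password; both default to admin. Change the password afterwards.
Configure the data source
Select Prometheus as the data source.
Enter the Prometheus address.
Change the HTTP method to GET and save.
Choose a dashboard.
You should then see the result.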
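The data source steps above can also be scripted against Grafana's HTTP API (POST /api/datasources). The sketch below is only an illustration: the helper names, the GRAFANA_URL and GRAFANA_AUTH variables, and the data source name "Prometheus" are assumptions, and admin:admin only works until the password is changed.

```shell
# Hypothetical helper: build the JSON body for Grafana's data source API.
datasource_payload() {
    prom_url="${1:-http://localhost:9090}"
    printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","jsonData":{"httpMethod":"GET"}}\n' "$prom_url"
}

# Hypothetical helper: create the data source over HTTP.
# GRAFANA_URL and GRAFANA_AUTH are assumed variables; replace admin:admin
# with the password chosen at first login.
create_datasource() {
    grafana_url="${GRAFANA_URL:-http://localhost:3000}"
    credentials="${GRAFANA_AUTH:-admin:admin}"
    # --data @- makes curl read the request body from standard input
    datasource_payload "$1" | curl -fsS -u "$credentials" \
        -H 'Content-Type: application/json' \
        -X POST --data @- "$grafana_url/api/datasources"
}
```

For example, create_datasource http://localhost:9090 registers the local Prometheus with the GET method, matching the manual steps.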
node exporter
Download
Install/Start/Stop
cd ~/monitor/prometheus/node_exporter-1.6.1.linux-amd64
# Write the start script
echo './node_exporter &' > start.sh && chmod +x start.sh
# Write the stop script
echo 'pkill node_exporter' > stop.sh && chmod +x stop.sh
# Start
nohup sh start.sh &
Configure prometheus.yml
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  # Scrape node exporter metrics
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
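To confirm that node exporter is actually serving data on the scraped port, its /metrics page can be fetched directly. A small sketch, assuming curl is available; count_node_metrics is a made-up name.

```shell
# Hypothetical helper: count the node_* samples exposed by node exporter.
count_node_metrics() {
    exporter_url="${1:-http://localhost:9100}"
    # Lines starting with "#" are HELP/TYPE comments, so match only sample lines.
    curl -fsS "$exporter_url/metrics" | grep -c '^node_'
}
```

A count of 0, or a curl error, suggests that the exporter is not running or that port 9100 is blocked.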
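Configure the Grafana dashboard
Copy the template ID.
Import the template in Grafana.
Enter the copied template ID and click Load.
Select the data source and click Import.
Check the result.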
Alertmanager
Download
Configuration
alertmanager.yml
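global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.mxhichina.com:465' # SMTP server of the enterprise mailbox; this article uses Alibaba Cloud's free enterprise mailbox
  smtp_from: 'xxx@xxxx.com' # enterprise mailbox address
  smtp_auth_username: 'xxx@xxxx.com' # same address as above
  smtp_auth_password: 'xxxxx' # note: the mailbox authorization code (the password for third-party clients), not the login password; only enterprise mailboxes have one
  smtp_require_tls: false # whether to require TLS
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m # how long to wait before re-sending an alert, to reduce the number of emails
  receiver: 'mail' # receiver that alerts are sent to
receivers:
  - name: 'mail' # receiver definition; the name must match route.receiver above
    email_configs:
      - to: 'xxxx@163.com' # recipient address; list multiple recipients on separate lines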
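Two quick checks can save time with this file: validating it with amtool (shipped in the Alertmanager tarball), and posting a hand-made alert to the v2 API to see whether the mail arrives. The sketch below assumes Alertmanager listens on localhost:9093; the helper names, the AMTOOL and AM_URL variables, and the test_alert label values are all invented for illustration.

```shell
# Hypothetical helper: validate alertmanager.yml with amtool.
# AMTOOL is an assumed variable; it defaults to ./amtool in the current directory.
AMTOOL="${AMTOOL:-./amtool}"

check_am_config() {
    "$AMTOOL" check-config "${1:-alertmanager.yml}"
}

# Hypothetical helper: post a hand-made alert to Alertmanager's v2 API so the
# mail route can be exercised without waiting for a real alert.
send_test_alert() {
    am_url="${AM_URL:-http://localhost:9093}"
    curl -fsS -X POST -H 'Content-Type: application/json' \
        --data '[{"labels":{"alertname":"test_alert","instance":"manual"},"annotations":{"summary":"test alert sent by hand"}}]' \
        "$am_url/api/v2/alerts"
}
```

With the group_wait of 10s configured above, the test mail should go out about ten seconds after send_test_alert returns, provided the SMTP settings are correct.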
Alibaba Cloud enterprise mailbox
How to apply: www.iplaysoft.com/free-domain…
Alert rules
alert_rules.yml
groups:
  - name: node
    rules:
      - alert: server_status
        expr: up{} == 0
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Please investigate immediately!"
      - alert: server_status
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 70
        for: 1s
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is above 70%"
          description: "Please investigate immediately!"
      - alert: server_status
        expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) by (instance)) * 100 > 70
        for: 1s
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is above 70%"
          description: "Please investigate immediately!"
      - alert: server_status
        expr: max((node_filesystem_size_bytes{fstype=~"ext.?|xfs"}-node_filesystem_free_bytes{fstype=~"ext.?|xfs"}) *100/(node_filesystem_avail_bytes{fstype=~"ext.?|xfs"}+(node_filesystem_size_bytes{fstype=~"ext.?|xfs"}-node_filesystem_free_bytes{fstype=~"ext.?|xfs"})))by(instance) > 80
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} partition usage is above 80%"
          description: "Please investigate immediately!"
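Before relying on a rule, it can be useful to run its expression as an instant query and look at what comes back. A minimal sketch using the Prometheus HTTP API (/api/v1/query), assuming Prometheus is reachable on localhost:9090; prom_query and PROM_URL are hypothetical names.

```shell
# Hypothetical helper: run an instant PromQL query through the HTTP API.
prom_query() {
    query="$1"
    prom_url="${PROM_URL:-http://localhost:9090}"
    # -G sends the URL-encoded query as a GET parameter.
    curl -fsS -G --data-urlencode "query=$query" "$prom_url/api/v1/query"
}
```

For instance, prom_query 'up == 0' returns a JSON document whose result list is empty while every target is up; a non-empty list contains the instances that the first rule would alert on.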
Update 2023-10-15
After the configuration is changed, Prometheus must be restarted.
sudo systemctl restart prometheus
Configure Prometheus
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # Prometheus does not expand "~"; write the absolute path of the home directory in a real config.
  - "~/monitor/prometheus/alertmanager-0.26.0.linux-amd64/alert_rules.yml"
  # - "second_rules.yml"
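Start
To start it, see the service-based startup section below.
Starting as services
Create a service file for each component under /usr/lib/systemd/system. systemd does not expand ~ in ExecStart, so replace every ~ in the files below with the absolute path of the home directory of the user that runs the service (fx in this article).
For the format of service files, see: Systemd 入门教程:实战篇 (Systemd Tutorial for Beginners: Practice)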
prometheus
prometheus.service
[Unit]
Description=Prometheus Service
Wants=network-online.target
After=network-online.target
[Service]
User=fx
Group=fx
Type=simple
ExecStart=~/monitor/prometheus/prometheus-2.45.0.linux-amd64/prometheus \
--config.file=~/monitor/prometheus/prometheus-2.45.0.linux-amd64/prometheus.yml \
--storage.tsdb.path=~/monitor/prometheus/prometheus-2.45.0.linux-amd64/data
[Install]
WantedBy=multi-user.target
grafana
grafana.service
[Unit]
Description=Grafana
Wants=network-online.target
After=network-online.target
[Service]
User=fx
Group=fx
Type=simple
ExecStart=~/monitor/grafana/grafana-10.1.2/bin/grafana-server \
--config=~/monitor/grafana/grafana-10.1.2/conf/defaults.ini \
--homepath=~/monitor/grafana/grafana-10.1.2
[Install]
WantedBy=multi-user.target
node exporter
node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=fx
Group=fx
Type=simple
ExecStart=~/monitor/prometheus/node_exporter-1.6.1.linux-amd64/node_exporter
[Install]
WantedBy=multi-user.target
alertmanager
alertmanager.service
[Unit]
Description=AlertManager
Wants=network-online.target
After=network-online.target
[Service]
User=fx
Group=fx
Type=simple
ExecStart=~/monitor/prometheus/alertmanager-0.26.0.linux-amd64/alertmanager \
--config.file ~/monitor/prometheus/alertmanager-0.26.0.linux-amd64/alertmanager.yml \
--storage.path ~/monitor/prometheus/alertmanager-0.26.0.linux-amd64/storage/
[Install]
WantedBy=multi-user.target
Start
# Reload the systemd configuration
sudo systemctl daemon-reload
# Start
sudo systemctl start prometheus
sudo systemctl start grafana
sudo systemctl start node_exporter
sudo systemctl start alertmanager
# Check the status
systemctl status prometheus
systemctl status grafana
systemctl status node_exporter
systemctl status alertmanager
# Stop the services
sudo systemctl stop prometheus
sudo systemctl stop grafana
sudo systemctl stop node_exporter
sudo systemctl stop alertmanager
# Start at boot
sudo systemctl enable prometheus
sudo systemctl enable grafana
sudo systemctl enable node_exporter
sudo systemctl enable alertmanager
If a service fails to start, run the command that follows ExecStart directly in a terminal and read the error output.
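Checking four services one by one gets tedious, so a small loop can report which ones are running and print recent log lines for the rest. This is a sketch under the assumption that the units are named as in the systemctl commands above; report_units and UNITS are made-up names, and reading the journal may require sudo.

```shell
# Unit names are taken from the systemctl commands above.
UNITS="prometheus grafana node_exporter alertmanager"

# Hypothetical helper: print the state of each unit; returns 1 if any is down.
report_units() {
    failed=0
    for unit in $UNITS; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
            # Show the last log lines to help locate the startup error.
            journalctl -u "$unit" -n 20 --no-pager
            failed=1
        fi
    done
    return "$failed"
}
```

The non-zero return value makes it easy to chain, for example report_units || echo "at least one service needs attention".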
References
- Prometheus+Grafana 监控服务器CPU、磁盘、内存等信息 (Prometheus + Grafana: monitoring server CPU, disk, memory, and more)
- Configure Prometheus AlertManager