Prometheus
1. Software installation
Server program, Prometheus version: 2.29.2.linux-amd64
wget https://github.com/prometheus/prometheus/releases/download/v2.29.2/prometheus-2.29.2.linux-amd64.tar.gz
Agent program that runs on each monitored host and collects metric data, node_exporter version: 1.2.2.linux-amd64
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
Handles alert routing and notifications, alertmanager version: 0.23.0.linux-amd64
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
Provides the graphical dashboards, grafana version: 8.5.1.linux-amd64
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.1.linux-amd64.tar.gz
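The three GitHub downloads above follow one URL pattern, so the versions can be pinned in a single place. The sketch below is one possible way to do that; release_url, fetch_and_unpack, and the RUN_DOWNLOAD switch are hypothetical names introduced here, and Grafana is left out because it is hosted at a different URL.

```shell
#!/usr/bin/env bash
# Sketch: build release URLs from pinned versions, then download and unpack.
# release_url, fetch_and_unpack and RUN_DOWNLOAD are hypothetical names.

# Print the GitHub release URL for a Prometheus-project component.
# $1 = component name, $2 = version without the leading "v"
release_url() {
  local name="$1" version="$2"
  printf 'https://github.com/prometheus/%s/releases/download/v%s/%s-%s.linux-amd64.tar.gz\n' \
    "$name" "$version" "$name" "$version"
}

# Download a tarball, unpack it, and rename the directory to the bare name.
fetch_and_unpack() {
  local name="$1" version="$2"
  wget -q "$(release_url "$name" "$version")"
  tar -xzf "${name}-${version}.linux-amd64.tar.gz"
  mv "${name}-${version}.linux-amd64" "$name"
}

# Only touch the network when explicitly asked to.
if [ "${RUN_DOWNLOAD:-0}" = "1" ]; then
  fetch_and_unpack prometheus 2.29.2
  fetch_and_unpack node_exporter 1.2.2
  fetch_and_unpack alertmanager 0.23.0
fi
```

Running the script with RUN_DOWNLOAD=1 performs the downloads; without it, only the helper functions are defined.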
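2. Starting the services
1). Extract node_exporter-1.2.2.linux-amd64.tar.gz, rename the extracted directory to node_exporter, change into it, and run ./node_exporter. Open http://ip:9100/ in a browser to view the exported metrics;
2). Extract prometheus-2.29.2.linux-amd64.tar.gz, rename the extracted directory to prometheus, change into it (see prometheus.yml below for the configuration), and run ./prometheus. Open http://ip:9090/ in a browser and check Status->Targets to see the scrape targets;
3). Extract grafana-enterprise-8.5.1.linux-amd64.tar.gz, rename the extracted directory to grafana, change into it, and run ./bin/grafana-server. Open http://ip:3000/ in a browser (see the grafana section below for setup);
4). Extract alertmanager-0.23.0.linux-amd64.tar.gz, rename the extracted directory to alertmanager, change into it (see alertmanager.yml below for the configuration), and run ./alertmanager. Open http://ip:9093/ in a browser to view the alerts;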
Note:
To check that the services are up and listening, run: ss -ntlp | grep -E 'node_exporter|grafana-server|prometheus|alertmanager'
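Besides checking the listening ports, each service can be probed over HTTP. The following is a minimal sketch, assuming the default ports used above and the standard health endpoints (/-/healthy for Prometheus and Alertmanager, /api/health for Grafana, /metrics for node_exporter); health_url and RUN_CHECK are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: map each service to an HTTP endpoint that answers when it is up.
# health_url and RUN_CHECK are hypothetical names.

# $1 = service name, $2 = host (defaults to localhost)
health_url() {
  local service="$1" host="${2:-localhost}"
  case "$service" in
    node_exporter) echo "http://${host}:9100/metrics" ;;
    prometheus)    echo "http://${host}:9090/-/healthy" ;;
    grafana)       echo "http://${host}:3000/api/health" ;;
    alertmanager)  echo "http://${host}:9093/-/healthy" ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
}

# Probe all four services only when explicitly asked to.
if [ "${RUN_CHECK:-0}" = "1" ]; then
  for s in node_exporter prometheus grafana alertmanager; do
    if curl -fsS -o /dev/null "$(health_url "$s")"; then
      echo "$s: up"
    else
      echo "$s: not responding"
    fi
  done
fi
```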
3. prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
rule_files:
  - "first_rules.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "my server"
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "docker"
    static_configs:
      - targets: ["localhost:8088"]
  - job_name: "node01"
    static_configs:
      - targets: ["localhost:9100"]
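Before starting Prometheus it can help to confirm that every scrape target actually serves /metrics. The sketch below pulls the host:port pairs out of the inline targets lists in prometheus.yml; list_targets and RUN_CHECK are hypothetical names, and the Alertmanager target is not picked up because it is written as a block list rather than an inline one.

```shell
#!/usr/bin/env bash
# Sketch: list the scrape targets declared as inline lists, e.g.
#   - targets: ["localhost:9090"]
# list_targets and RUN_CHECK are hypothetical names.

# $1 = path to prometheus.yml; prints unique host:port entries, one per line
list_targets() {
  grep -E 'targets:[[:space:]]*\[' "$1" \
    | grep -oE '"[^"]+"' \
    | tr -d '"' \
    | sort -u
}

# Probe each target only when explicitly asked to.
if [ "${RUN_CHECK:-0}" = "1" ]; then
  for t in $(list_targets prometheus.yml); do
    if curl -fsS -o /dev/null "http://${t}/metrics"; then
      echo "$t: ok"
    else
      echo "$t: unreachable"
    fi
  done
fi
```

The Prometheus tarball also ships promtool, so ./promtool check config prometheus.yml can be run in the prometheus directory to validate the file itself.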
4. first_rules.yml
Create the alert rules file first_rules.yml in the prometheus directory (https://awesome-prometheus-alerts.grep.to/ offers ready-made rules for many scenarios), with the following content:
# The severity values are kept verbatim ('告警' = warning, '严重' = critical)
# because alertmanager.yml matches on these exact strings.
groups:
  - name: 'linux_status'
    rules:
      - alert: "Host status alert"
        expr: up == 0
        for: 20s
        labels:
          team: "node"
          severity: '严重'
        annotations:
          summary: "Host [{{ $labels.instance }}]: host status alert"
          description: "Host [{{ $labels.instance }}] has been down for 20 seconds!"
          summarys: "The fault on host [{{ $labels.instance }}] has been resolved!"
          value: "{{ $value }}"
      - alert: "CPU utilization alert"
        expr: sum by(instance) (avg without(cpu) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))) > 0.6
        for: 20s
        labels:
          team: "node"
          severity: '告警'
        annotations:
          summary: "Host [{{ $labels.instance }}]: CPU utilization has reached 60%"
          description: "CPU utilization on host [{{ $labels.instance }}] has reached 60%!"
          summarys: "The alert on host [{{ $labels.instance }}] has a remediation plan!"
          value: "{{ $value }}"
      - alert: "CPU utilization alert"
        expr: sum by(instance) (avg without(cpu) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))) > 0.85
        for: 20s
        labels:
          team: "node"
          severity: '严重'
        annotations:
          summary: "Host [{{ $labels.instance }}]: CPU utilization has reached 85%"
          description: "CPU utilization on host [{{ $labels.instance }}] has reached 85%!"
          summarys: "The alert on host [{{ $labels.instance }}] has a remediation plan!"
          value: "{{ $value }}"
      - alert: "Memory utilization alert"
        expr: avg by(instance) ((1 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes ) * 100) > 70
        for: 3m
        labels:
          team: "node"
          severity: '告警'
        annotations:
          summary: "Host [{{ $labels.instance }}]: memory utilization has reached 70%"
          description: "Memory utilization on host [{{ $labels.instance }}] has reached 70%!"
          summarys: "The alert on host [{{ $labels.instance }}] has a remediation plan!"
          value: "{{ $value }}"
      - alert: "Memory utilization alert"
        expr: avg by(instance) ((1 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes ) * 100) > 90
        for: 3m
        labels:
          team: "node"
          severity: '严重'
        annotations:
          summary: "Host [{{ $labels.instance }}]: memory utilization has reached 90%"
          description: "Memory utilization on host [{{ $labels.instance }}] has reached 90%!"
          summarys: "The alert on host [{{ $labels.instance }}] has a remediation plan!"
          value: "{{ $value }}"
      - alert: "Disk utilization alert"
        expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 80
        for: 2m
        labels:
          team: "node"
          severity: '告警'
        annotations:
          summary: "Host [{{ $labels.instance }}]: disk utilization has reached 80%"
          description: "Disk utilization on host [{{ $labels.instance }}] has reached 80%!"
          summarys: "The alert on host [{{ $labels.instance }}] has a remediation plan!"
          value: "{{ $value }}"
      - alert: "Disk utilization alert"
        expr: (1 - node_filesystem_free_bytes{fstype!="rootfs",mountpoint!="",mountpoint!~"/(run|var|sys|dev).*"} / node_filesystem_size_bytes) * 100 > 90
        for: 2m
        labels:
          team: "node"
          severity: '严重'
        annotations:
          summary: "Host [{{ $labels.instance }}]: disk utilization has reached 90%"
          description: "Disk utilization on host [{{ $labels.instance }}] has reached 90%!"
          summarys: "The alert on host [{{ $labels.instance }}] has a remediation plan!"
          value: "{{ $value }}"
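To make the memory rule's arithmetic concrete, the sketch below evaluates the same formula, (1 - (MemFree + Cached + Buffers) / MemTotal) * 100, on plain numbers. mem_used_pct and RUN_LOCAL are hypothetical names; the optional part reads /proc/meminfo, which is where node_exporter takes its node_memory_* values from on Linux.

```shell
#!/usr/bin/env bash
# Sketch: the memory-utilization formula from the alert rule, outside PromQL.
# mem_used_pct and RUN_LOCAL are hypothetical names.

# $1 = MemFree, $2 = Cached, $3 = Buffers, $4 = MemTotal (same unit for all)
mem_used_pct() {
  awk -v free="$1" -v cached="$2" -v buffers="$3" -v total="$4" \
    'BEGIN { printf "%.1f\n", (1 - (free + cached + buffers) / total) * 100 }'
}

# Worked example: 4000 of 8000 units are free, cached or buffered.
mem_used_pct 1000 2000 1000 8000   # prints 50.0

# Optionally evaluate the formula for the local machine (Linux only).
if [ "${RUN_LOCAL:-0}" = "1" ] && [ -r /proc/meminfo ]; then
  mem_used_pct \
    "$(awk '/^MemFree:/  {print $2}' /proc/meminfo)" \
    "$(awk '/^Cached:/   {print $2}' /proc/meminfo)" \
    "$(awk '/^Buffers:/  {print $2}' /proc/meminfo)" \
    "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
fi
```

A result above 70 would trip the warning rule and a result above 90 the critical one, provided it persists for the 3m window.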
5. alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'aaa@qq.com'
  smtp_auth_username: 'aaa@qq.com'
  smtp_auth_password: '123455'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/usr/local/monitor/alertmanager/template/alert.tmpl'
route:
  group_by: ['alertname','team']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'aaa@qq.com'
        html: '{{ template "alert.html" . }}'
        send_resolved: true
        headers: { Subject: "[Important] Alert email from the monitoring system" }
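inhibit_rules:
  # While a '严重' (critical) alert is firing, mute the '告警' (warning) alert
  # that has the same alertname, dev and instance labels.
  - source_match:
      severity: '严重'
    target_match:
      severity: '告警'
    equal: ['alertname', 'dev', 'instance']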
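One way to exercise the routing and email settings without waiting for a real fault is to post a synthetic alert to Alertmanager's v2 HTTP API. The sketch below assumes Alertmanager listens on localhost:9093 as configured above; alert_payload, RUN_SEND, and the alert name TestAlert are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: send a hand-made alert to Alertmanager to test the email route.
# alert_payload, RUN_SEND and "TestAlert" are hypothetical names.

# $1 = alertname, $2 = severity, $3 = instance; prints a JSON array with one alert
alert_payload() {
  local name="$1" severity="$2" instance="$3"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s","team":"node"},"annotations":{"summary":"test alert from curl"}}]' \
    "$name" "$severity" "$instance"
}

# Post the alert only when explicitly asked to.
if [ "${RUN_SEND:-0}" = "1" ]; then
  curl -fsS -X POST -H 'Content-Type: application/json' \
    -d "$(alert_payload TestAlert '严重' 'localhost:9100')" \
    http://localhost:9093/api/v2/alerts
fi
```

With group_wait set to 10s, the email should be dispatched roughly ten seconds after the alert is accepted.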
6. alert.tmpl
Create a template directory under alertmanager and add the file alert.tmpl with the content below. Adding 28800e9 nanoseconds (8 hours) shifts the UTC timestamp to CST (Beijing time); "2006-01-02 15:04:05" is Go's reference layout describing the output format, not a literal time.
{{ define "alert.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
========= Fault alert ==========<br>
Alerting job: {{ .Labels.job }} <br>
Alert name: {{ .Labels.alertname }}<br>
Severity: {{ .Labels.severity }}<br>
Instance: {{ .Labels.instance }} {{ .Labels.device }}<br>
Details: {{ .Annotations.summary }}<br>
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
========= END ==========<br>
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
========= Recovery notice ==========<br>
Alerting job: {{ .Labels.job }} <br>
Alert name: {{ .Labels.alertname }}<br>
Severity: {{ .Labels.severity }}<br>
Instance: {{ .Labels.instance }}<br>
Details: {{ .Annotations.summary }}<br>
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
========= END ==========<br>
{{- end }}
{{- end }}
{{- end }}
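The 28800e9 offset in the template is simply 8 hours expressed in nanoseconds (8 * 3600 * 1e9). The sketch below reproduces the same shift in seconds so the resulting timestamp can be checked from a terminal; to_utc8 is a hypothetical name, and the -d "@..." syntax assumes GNU date.

```shell
#!/usr/bin/env bash
# Sketch: shift a UTC epoch by 8 hours (28800 s) and format it the way the
# template does. to_utc8 is a hypothetical name; GNU date is assumed.

# $1 = Unix epoch in seconds (UTC)
to_utc8() {
  date -u -d "@$(( $1 + 28800 ))" '+%Y-%m-%d %H:%M:%S'
}

to_utc8 0   # prints 1970-01-01 08:00:00
```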
7. grafana
1). Default account/password: admin/admin;
2). On General/Home, choose DATA SOURCES;
3). Select Prometheus;
4). In the URL field, enter http://ip:9090/;
5). Click Save & test;
6). Under Data Sources/Prometheus, select Prometheus 2.0 Stats and click Import;
7). Back on General/Home, open Prometheus 2.0 Stats to see the monitored resources;
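Note:
To monitor additional endpoints, first add the corresponding target to prometheus.yml, then in Grafana choose Import from the + menu and enter the dashboard id (8919: Linux server monitoring, 9276: Linux server monitoring, 193: Docker monitoring, which requires the Docker monitoring endpoint deployed as shown below).
Deploy the Docker monitoring endpoint (cAdvisor, published on host port 8088):
docker run -d \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8088:8080 \
  --name=cadvisor \
  google/cadvisor:latest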
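The data source steps 2) to 5) above can also be scripted against Grafana's HTTP API. This is a minimal sketch, assuming Grafana runs on localhost:3000 with the default admin/admin credentials and Prometheus on localhost:9090; datasource_payload and RUN_CREATE are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: register Prometheus as a Grafana data source via POST /api/datasources.
# datasource_payload and RUN_CREATE are hypothetical names.

# $1 = Prometheus base URL; prints the JSON request body
datasource_payload() {
  printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy","isDefault":true}' "$1"
}

# Call the API only when explicitly asked to.
if [ "${RUN_CREATE:-0}" = "1" ]; then
  curl -fsS -u admin:admin -H 'Content-Type: application/json' \
    -X POST -d "$(datasource_payload http://localhost:9090)" \
    http://localhost:3000/api/datasources
fi
```

If the admin password was changed at first login, the -u argument needs to be adjusted accordingly.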
Installation references
https://zhuanlan.zhihu.com/p/425304902
http://www.yunweipai.com/39494.html