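Overview
White-box monitoring: the day-to-day monitoring of host resource usage, container runtime state, and the runtime data of databases and middleware. These are the infrastructure that supports the business and its services. White-box monitoring reveals their actual internal state; by watching the metrics we can anticipate likely problems and optimize away potential sources of uncertainty.
Black-box monitoring: testing the externally visible behavior of a service from the user's point of view. Common black-box probes include HTTP, TCP, DNS, and ICMP probes, which check whether a site or service is reachable, whether it is connected, and how quickly it responds.
Comparison: the biggest difference is that black-box monitoring is failure-oriented. When a failure occurs, black-box monitoring detects it quickly, whereas white-box monitoring focuses on proactively finding or predicting latent problems. A complete monitoring setup should uncover latent problems from the white-box side and quickly detect failures that have already happened from the black-box side.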
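Prometheus Fundamentals
Description: the basic Prometheus workflow is as follows:
- Prometheus Server reads its configuration, parses the static targets (static_configs) and the service discovery rules (xxx_sd_configs), and automatically collects the endpoints to monitor.
- Prometheus Server periodically scrapes (scrape_interval) the targets, collecting metrics over HTTP in pull mode.
- When the HTTP request from Prometheus Server reaches an exporter such as Node Exporter, the exporter returns a plain-text response in which each non-comment line is one complete time-series sample: Name + Labels + Sample (a float value and an optional timestamp). The data comes from official exporters, custom SDKs (client libraries), or custom endpoints.
- Prometheus Server receives the response, applies metric relabeling (metric_relabel_configs; relabel_configs is applied to the targets before the scrape), stores the result in the TSDB, and builds an inverted index.
- A separate periodic task (evaluation_interval) evaluates the configured rules one by one against their thresholds. If a result exceeds the threshold and keeps doing so for longer than the configured duration, Prometheus fires an alert and sends it to Alertmanager, a standalone component.
- Alertmanager receives the alert and decides according to its configured policy whether a notification is needed; if so, it sends it along the configured routes, for example by email, WeChat, Slack, PagerDuty, or WebHook.
- Time-series data is queried with PromQL expressions, through the web UI or the HTTP API. After filtering, Prometheus Server returns an instant vector (N series with one sample each), a range vector (N series with M samples each), or a scalar (a single float).
- Grafana, an open-source analysis and visualization tool, is used to display the data graphically.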
Author: WeiyiGeek www.bilibili.com/read/cv1329… Source: bilibili
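The "Name + Labels + Sample" line structure in the workflow above can be illustrated with a small parser. This is only a minimal sketch, not how Prometheus itself parses scrapes: the function name parse_sample and the regular expressions are assumptions made for illustration, and escape sequences inside label values are not unescaped.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp (ms).
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One label pair inside the braces: key="value"
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value, timestamp) or None for comments/blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # "# HELP" / "# TYPE" lines and blank lines carry no sample
    match = _SAMPLE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    timestamp = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), timestamp


if __name__ == "__main__":
    print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6'))
```

The timestamp is usually absent from exporter output, in which case Prometheus assigns the scrape time; the sketch returns None for it.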
Installation and Usage
Installation
Docker installation
- Create a directory named prometheus and edit the configuration file prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
  - job_name: 'node_161'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['172.171.100.161:19100']
      - targets: ['172.171.100.157:19100']
  - job_name: springboot-minio
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /actuator/prometheus
    scheme: http
    follow_redirects: true
    static_configs:
      - targets:
          - 172.171.100.157:18099
- Start
docker run -d -p 9090:9090 -v /data/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
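- Verify: open http://<host>:9090 in a browser
Binary (tarball) installation
(1) Download the Prometheus package
prometheus.io/download/ (choose the Linux amd64 build)
(2) Extract
# tar xf prometheus-2.27.1.linux-amd64.tar.gz -C /data/
# cd /data
# mv prometheus-2.27.1.linux-amd64/ prometheus
(3) Configure the Prometheus systemd service
# vim /usr/lib/systemd/system/prometheus.service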
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
Environment="GOMAXPROCS=4"
#User=prometheus
#Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/data/prometheus/prometheus \
  --config.file=/data/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --web.console.libraries=/data/prometheus/console_libraries \
  --web.console.templates=/data/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.read-timeout=5m \
  --web.max-connections=10 \
  --query.max-concurrency=20 \
  --query.timeout=2m \
  --web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=always
[Install]
WantedBy=multi-user.target
(4) Start Prometheus
systemctl daemon-reload
systemctl start prometheus
systemctl status prometheus
# Check that the configuration file is valid
./promtool check config prometheus.yml
node-exporter
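Used to monitor server nodes. It is installed on each monitored server; once started it exposes the server's metrics, which Prometheus then scrapes.
Download node-exporter
Version: node_exporter-0.18.1.linux-amd64
Extract it and place it under /data/ (the unit below expects /data/node_exporter)
Write the service unit
# vim /etc/systemd/system/node_exporter.service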
[Unit]
Description=prometheus node_exporter Daemon
Documentation=https://github.com/prometheus/node_exporter
Requires=network.target
After=network.target
[Service]
Type=simple
WorkingDirectory=/data/node_exporter
ExecStart=/data/node_exporter/node_exporter --log.level=info --web.listen-address=:19100
TimeoutSec=30
Restart=on-failure
[Install]
WantedBy=default.target
systemctl daemon-reload && systemctl start node_exporter
systemctl enable node_exporter
systemctl stop node_exporter
systemctl restart node_exporter
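Black-box monitoring: blackbox_exporter
Monitoring is done by having Prometheus send probe requests, through blackbox_exporter, to the target servers.
Binary (tarball) installation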
# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.16.0/blackbox_exporter-0.16.0.linux-amd64.tar.gz
# tar xf blackbox_exporter-0.16.0.linux-amd64.tar.gz -C /usr/local/
# ln -s /usr/local/blackbox_exporter-0.16.0.linux-amd64/ /usr/local/blackbox_exporter
# Manage the blackbox_exporter service with systemd
# vim /usr/lib/systemd/system/blackbox_exporter.service
[Unit]
Description=blackbox_exporter
After=network.target
[Service]
User=root
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
# systemctl daemon-reload
# systemctl start blackbox_exporter.service
# systemctl enable blackbox_exporter.service
Configuration
blackbox.yml
modules:
  http_2xx:
    prober: http
  http_3xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
Integrating with Prometheus
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 172.171.100.174:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "second_rules.yml"
# Scrape configurations: Prometheus itself, the nodes, and the blackbox probes.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    file_sd_configs:
    - files:
      - targets/prometheus-*.yaml
      refresh_interval: 1m
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'nodes'
    file_sd_configs:
    - files:
      - targets/node-*.yaml
      refresh_interval: 1m
  - job_name: "http-prod-200"
    metrics_path: /probe
    params:
      module: [http_2xx] # Look for an HTTP 200 response.
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/data/prometheus/blackbox/http-prod-200.yml"
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 172.171.100.174:9115
  - job_name: "http-prod-302"
    metrics_path: /probe
    params:
      module: [http_3xx] # Probe with the http_3xx module defined in blackbox.yml.
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/data/prometheus/blackbox/http-prod-302.yml"
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 172.171.100.174:9115
  - job_name: "http-test-200"
    metrics_path: /probe
    params:
      module: [http_3xx] # Probe with the http_3xx module defined in blackbox.yml.
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/data/prometheus/blackbox/http-test-200.yml"
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 172.171.100.174:9115
http-prod-200.yml
- targets:
  - https://www.alibaba.com
  - https://www.tencent.com
  - https://www.baidu.com
  - http://www.test-nginx.com
http-prod-302.yml
- targets:
  - https://uc.lxyun.cn
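The three relabel_configs steps in the blackbox jobs are easy to misread, so here is a rough Python simulation of what they do to one target from the files above. It is a sketch for illustration only: the helper names relabel_for_probe and scrape_url are made up, and the real relabeling happens inside Prometheus.

```python
from urllib.parse import urlencode

# Address of blackbox_exporter, taken from the `replacement` field of the jobs.
BLACKBOX_ADDR = "172.171.100.174:9115"


def relabel_for_probe(target, module):
    """Mimic the three relabel steps for one file_sd target."""
    # Before relabeling: __address__ is the listed target, and each entry
    # under `params` is exposed as a __param_<name> label.
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the original address into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the `instance` label on the series.
    labels["instance"] = labels["__param_target"]
    # Step 3: send the actual scrape to blackbox_exporter instead.
    labels["__address__"] = BLACKBOX_ADDR
    return labels


def scrape_url(labels, metrics_path="/probe", scheme="http"):
    """Build the URL that ends up being scraped after relabeling."""
    query = urlencode({
        "module": labels["__param_module"],
        "target": labels["__param_target"],
    })
    return f"{scheme}://{labels['__address__']}{metrics_path}?{query}"


if __name__ == "__main__":
    result = relabel_for_probe("https://www.baidu.com", "http_2xx")
    print(result["instance"])
    print(scrape_url(result))
```

In other words, Prometheus never contacts the listed site directly: it scrapes /probe on blackbox_exporter and passes the site as the target parameter, while the instance label still shows the site being probed.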
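Integrating with Grafana
Dashboard ID: 7587
- resolve: duration of the DNS lookup
- connect: duration of establishing the TCP connection
- tls: duration of the TLS negotiation (I believe this includes the TCP connection time)
- processing: time between the connection being established and the first byte of the response being received
- transfer: duration of transferring the response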
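As a small worked example of reading these phases, the sketch below takes one set of per-phase durations and reports which phase dominates. The numbers are invented for illustration and the function name slowest_phase is hypothetical; real values would come from the probe metrics shown on the dashboard.

```python
def slowest_phase(phases):
    """Return (phase name, share of the summed phase time) for the slowest phase."""
    total = sum(phases.values())
    name = max(phases, key=phases.get)
    share = phases[name] / total if total else 0.0
    return name, share


if __name__ == "__main__":
    # Hypothetical durations in seconds for a single HTTPS probe.
    sample = {
        "resolve": 0.004,
        "connect": 0.020,
        "tls": 0.060,
        "processing": 0.110,
        "transfer": 0.006,
    }
    name, share = slowest_phase(sample)
    print(f"{name} accounts for {share:.0%} of the summed phase time")
```

A dominant processing phase points at the server side, while a dominant resolve or connect phase points at DNS or the network path.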
Monitoring with Actuator
Used to monitor applications, for example Java (Spring Boot) services.
Application configuration
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
management:
  endpoints:
    web:
      exposure:
        include: 'prometheus'
  metrics:
    tags:
      application: ${spring.application.name}
Access: http://ip:port/actuator/prometheus
Prometheus configuration
  - job_name: 'spring'
    # How often to scrape
    scrape_interval: 15s
    # Timeout for each scrape
    scrape_timeout: 10s
    # Path to scrape
    metrics_path: '/actuator/prometheus'
    # Address of the service to scrape: the server running the Spring Boot application above.
    static_configs:
    - targets: ['172.171.100.157:18099']
Grafana dashboard ID: 12900
MySQL monitoring
To monitor MySQL, install mysqld_exporter on the monitored machine.
mysqld_exporter download: prometheus.io/download/
[root@xinsz08-20 ~]# mv mysqld_exporter-0.12.1.linux-amd64 mysqld_exporter
[root@xinsz08-20 ~]# cd mysqld_exporter
# Add the configuration file
[root@xinsz08-20 mysqld_exporter]# vim .my.cnf
[client]
user=root
password=123456
# Start
[root@xinsz08-20 mysqld_exporter]# nohup ./mysqld_exporter --config.my-cnf=.my.cnf &
Check that the port (9104) is listening
[root@zmedu-17 prometheus-2.16.0.linux-amd64]# vim prometheus.yml
  - job_name: 'mysql-lxy'
    static_configs:
    - targets: ['172.171.100.172:9104']
Grafana dashboard ID: 7362