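1. Overview

1.1 What is Prometheus?

Prometheus is an open-source monitoring system and time series database. It provides a general-purpose data model together with interfaces for collecting, storing, and querying metrics.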

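Prometheus is a popular open-source monitoring and alerting tool, and many organizations and projects adopt it as their primary monitoring solution. Its main advantages are:

Open source and community support: Prometheus is fully open source and has an active community, so it is continuously improved and users are free to contribute code and feedback.
Multi-dimensional data model: every time series is identified by a metric name and a set of key-value labels, which makes it easy to slice and filter data.
Powerful query language (PromQL): PromQL lets users run complex queries and aggregations over the collected data.
Local storage: Prometheus does not depend on distributed storage. Each Prometheus instance is independent and keeps its own local storage, which simplifies deployment and scaling and improves reliability.
Pull model: Prometheus actively scrapes metrics from target services. This makes it easy to integrate with existing services and to follow targets as they change.
Flexible alerting: alerting rules are defined in Prometheus, and Alertmanager delivers the resulting alerts through notification channels such as email, Slack, and PagerDuty.
Ecosystem: Prometheus has a rich ecosystem, including Grafana (a popular open-source analytics and visualization tool that integrates well with Prometheus), a wide range of exporters (which collect data from different systems and applications), and other tools and services.
Lightweight and performant: Prometheus is designed to be lightweight and runs efficiently even in resource-constrained environments.
Easy to deploy and maintain: deploying and maintaining Prometheus is comparatively simple and does not require complex configuration, which keeps operational cost low.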

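1.2 Prometheus architecture

Prometheus Server: the central component of the monitoring system. It scrapes (pulls) metrics, stores them as time series, and serves queries. It consists of three parts:
1. Retrieval: periodically scrapes metrics from exporters and other targets.
2. TSDB: the time series database that stores the samples and serves queries over them.
3. HTTP Server: handles the various kinds of HTTP requests sent to the server.
Pushgateway: for short-lived or batch jobs that may not run long enough to be scraped. Such jobs push their metrics to the Pushgateway, and Prometheus Server scrapes them from there.
Alertmanager: the alert-handling component of Prometheus. It processes the alerts generated by Prometheus Server.
Exporter: a special agent that collects metrics from an application, service, or system and converts them into a format Prometheus understands.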
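
To make the pull model more concrete, here is a minimal sketch of an exporter written with only the Python standard library. The metric name demo_requests_total, the port 8000, and the helper names are assumptions made for illustration; real exporters are normally built on an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real exporter would read this
# from the system it monitors.
REQUEST_COUNT = {"value": 0}


def render_metrics() -> str:
    """Render the current metrics in the Prometheus text format."""
    lines = [
        "# HELP demo_requests_total Total number of handled requests.",
        "# TYPE demo_requests_total counter",
        f"demo_requests_total {REQUEST_COUNT['value']}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever and answer scrapes on /metrics (the port is an assumption)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and listing localhost:8000 as a scrape target would let Prometheus Server pull this metric at every scrape interval.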

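1.3 Prometheus data model

The Prometheus data model is multi-dimensional and built on time series. Each time series is uniquely identified by a metric name and a set of key-value labels, which is what makes querying so flexible.

A sample in the Prometheus text format: <metric name>{<label name>="<label value>", ...} <value> [<timestamp>]

metric name: the name of the metric
label name: the name of a label
label value: the value of that label
value: the sample value, a floating-point number
timestamp: the sample time in milliseconds since the Unix epoch (optional)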
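
The format can be illustrated with a small Python sketch that renders one sample as a text line. The function name format_sample and the example metric are made up for this illustration, and only the basic label escaping rules are handled.

```python
from typing import Dict, Optional


def format_sample(name: str, labels: Dict[str, str], value: float,
                  timestamp_ms: Optional[int] = None) -> str:
    """Render one sample as a line of the Prometheus text format."""
    def escape(v: str) -> str:
        # Backslash, double quote, and newline are escaped in label values.
        return v.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_part = ""
    if labels:
        # Sort labels so the output is deterministic.
        pairs = ",".join(f'{k}="{escape(v)}"' for k, v in sorted(labels.items()))
        label_part = "{" + pairs + "}"
    line = f"{name}{label_part} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


print(format_sample("http_requests_total", {"method": "GET", "code": "200"}, 1027))
```

Two samples with the same metric name but different label values belong to two different time series.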

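1.4 Prometheus metric types

Prometheus supports several metric types:

Counter: a value that only increases, typically used to count events.
Gauge: a value that can go up or down, typically used for measurements such as temperature or the current number of connections.
Histogram: counts observations in buckets to describe how events are distributed, for example request latency.
Summary: similar to a histogram, but reports quantiles that are computed on the client side.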
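
The difference between the first three types is easier to see in code. The classes below are a simplified sketch, not the API of any Prometheus client library; the class and method names are assumptions chosen for readability.

```python
import math
from bisect import bisect_left


class Counter:
    """Monotonically increasing value."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can be set to anything, up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket and tracks their sum and count."""
    def __init__(self, upper_bounds):
        # The +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Pick the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self):
        """Return (upper_bound, cumulative_count) pairs, as exposed via the le label."""
        total, out = 0, []
        for bound, n in zip(self.upper_bounds, self.counts):
            total += n
            out.append((bound, total))
        return out
```

A histogram exposes cumulative bucket counts, which is why each bucket in the output includes the observations of all smaller buckets.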

1.5 Prometheus configuration file

Prometheus is configured with a YAML file that defines the server's behavior, including global settings, alerting configuration, scrape configuration, and rule files.

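global:
  # How often to scrape targets (default: 1m)
  scrape_interval: 15s
  # How often to evaluate rules (default: 1m)
  evaluation_interval: 15s
  # Timeout for a single scrape
  scrape_timeout: 10s
  # Labels added to time series and alerts when communicating with external systems
  external_labels:
    environment: "production"
# Scrape configuration
scrape_configs:
  # Static configuration: target addresses and labels are listed directly in this file
  - job_name: 'prometheus' # job name, identifies a group of scrape targets
    static_configs:
      - targets: ['localhost:9090'] # target address to scrape
  # Service discovery: targets are discovered dynamically
  - job_name: 'node'
    dns_sd_configs:
      - names: ['tasks.node_EXPORTER'] # DNS name to query
        type: A # DNS query type
        port: 9100 # port of the discovered targets
# Alerting configuration
alerting:
  # Relabeling applied to alert labels before alerts are sent to Alertmanager
  # Note: alerts normally carry an alertname label rather than __name__,
  # so adjust source_labels and regex to match your own alerts
  alert_relabel_configs:
    - action: 'keep' # action type
      regex: 'alertmanager' # regular expression
      source_labels: [__name__] # list of source labels
  # Static Alertmanager configuration
  alertmanagers:
    - static_configs:
        - targets:
            - '127.0.0.1:9093'
# Rule files
rule_files: # paths of one or more files with alerting and recording rules, loaded at startup and on reload
  - "rules.yml"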
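
The duration values in the global section follow a simple number-plus-unit syntax, and Prometheus expects the scrape timeout not to exceed the scrape interval. The following Python sketch shows one way such a check could be done before deploying a configuration; parse_duration and check_timeout are hypothetical helpers, not part of Prometheus or promtool.

```python
import re

# Seconds per unit for the duration suffixes used in the configuration.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                 "d": 86400, "w": 604800, "y": 31536000}
_DURATION_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text: str) -> float:
    """Convert a duration such as '15s' or '1h30m' to seconds."""
    pos, total = 0, 0.0
    while pos < len(text):
        m = _DURATION_PART.match(text, pos)
        if m is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(m.group(1)) * _UNIT_SECONDS[m.group(2)]
        pos = m.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


def check_timeout(scrape_interval: str, scrape_timeout: str) -> bool:
    """Return True when the timeout does not exceed the interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


print(check_timeout("15s", "10s"))
```

For the values used in the configuration above (15s interval, 10s timeout) the check passes.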

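2. Installation

2.1 Installing on Linux

Download Prometheus: get the latest Prometheus binary release from the official website.
Extract the archive: tar -xzvf prometheus-x.x.x.linux-amd64.tar.gz
Write the configuration file: vim prometheus.yml
Run Prometheus: ./prometheus --config.file=xxx/prometheus.yml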

2.2 Installing with Docker

Pull the Prometheus image: docker pull prom/prometheus
Run the Prometheus container: docker run -itd --name=prometheus --restart=always -p 9090:9090 prom/prometheus
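
After starting the server, it can be useful to confirm that it is actually up. Prometheus exposes a health endpoint at /-/healthy; the Python sketch below polls it, assuming the server is reachable at http://localhost:9090 as in the commands above. The function names are made up for this example.

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the URL of the health endpoint for a Prometheus server."""
    return base_url.rstrip("/") + "/-/healthy"


def is_healthy(base_url: str = "http://localhost:9090", timeout: float = 3.0) -> bool:
    """Return True if the server answers the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A deployment script could call is_healthy() in a short retry loop before continuing with further setup.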

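3. PromQL

PromQL (Prometheus Query Language) is the query language of Prometheus. It is used to select and process the time series stored in Prometheus.

3.1 Exact matching

PromQL supports two exact-match operators, = and !=:

=: label="value" selects the time series whose label equals the value.
!=: label!="value" excludes the time series whose label equals the value.

For example:

1) Select the time series of prometheus_http_requests_total whose code equals 200:

prometheus_http_requests_total{code="200"}

2) Select the time series of prometheus_http_requests_total whose code does not equal 200:

prometheus_http_requests_total{code!="200"}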
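
Queries like these can also be sent programmatically through the Prometheus HTTP API (/api/v1/query). The following is a minimal Python sketch that assumes a server at http://localhost:9090; the helper names are hypothetical and error handling is kept to a minimum.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def run_instant_query(base_url: str, promql: str) -> list:
    """Execute the query and return the list of result series."""
    with urllib.request.urlopen(instant_query_url(base_url, promql), timeout=5) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

For example, run_instant_query("http://localhost:9090", 'prometheus_http_requests_total{code="200"}') would return one entry per matching time series.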

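3.2 Regular expression matching

PromQL supports two regular-expression match operators, =~ and !~:

=~: label=~"regex" selects the time series whose label matches the regular expression.
!~: label!~"regex" excludes the time series whose label matches the regular expression.

The regular expression is fully anchored, so it has to match the whole label value. A prefix match therefore needs a trailing .* in the pattern.

For example:

1) Select the time series of prometheus_http_requests_total whose handler starts with /api:

prometheus_http_requests_total{handler=~"/api.*"}

2) Select the time series of prometheus_http_requests_total whose handler does not start with /api:

prometheus_http_requests_total{handler!~"/api.*"}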
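
The anchoring behavior of =~ is a common source of surprises, so here is a small Python illustration. It uses re.fullmatch to imitate the full-match semantics; Prometheus itself uses RE2 syntax, which differs from Python's re module in some details, and the handler values are made up.

```python
import re


def label_matches(pattern: str, value: str) -> bool:
    """Imitate =~ by requiring the pattern to match the whole label value."""
    return re.fullmatch(pattern, value) is not None


handlers = ["/api", "/api/v1/query", "/metrics"]

# "/api" matches only the exact value "/api".
print([h for h in handlers if label_matches("/api", h)])

# "/api.*" matches every handler that starts with "/api".
print([h for h in handlers if label_matches("/api.*", h)])
```

The same logic applies to !~, which simply keeps the series for which the match fails.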

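3.3 Range queries

In PromQL, a range selector returns the samples recorded over a time window instead of a single value per series. The format is:

<metric_name>{<label_selector>}[<time_range>]

<metric_name>: the name of the metric to query.
<label_selector>: the label selector used to filter time series.
<time_range>: the length of the window, such as 15s, 5m, or 1h.

For example:

1) Select the samples of the past 5 minutes from prometheus_http_requests_total with code=200:

prometheus_http_requests_total{code="200"}[5m]

2) Select the samples of the past hour from prometheus_http_requests_total with handler=~"/api.*":

prometheus_http_requests_total{handler=~"/api.*"}[1h]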
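
What a range selector returns, and why it is usually combined with a function such as rate(), can be sketched in a few lines of Python. This is a deliberately simplified model: the window boundaries, counter resets, and the extrapolation that Prometheus applies in rate() are not reproduced, and the sample data is invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def select_range(samples: List[Sample], eval_time: float, window: float) -> List[Sample]:
    """Return the samples that fall inside the window ending at eval_time."""
    return [(t, v) for t, v in samples if eval_time - window <= t <= eval_time]


def simple_rate(window_samples: List[Sample]) -> float:
    """Per-second increase between the first and last sample in the window."""
    if len(window_samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = window_samples[0], window_samples[-1]
    return (v1 - v0) / (t1 - t0)


samples = [(0, 10.0), (60, 16.0), (120, 22.0), (180, 40.0)]
window = select_range(samples, eval_time=180, window=120)
print(window, simple_rate(window))
```

The selector by itself yields a list of samples per series, and a function then reduces that list to a single number per series.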

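3.4 Time offset

In PromQL, the offset modifier shifts a selector back in time, so that a query returns past data relative to the evaluation time. The format is:

<metric_name>{<label_selector>} offset <time_duration>

<metric_name>: the name of the metric to query.
<label_selector>: the label selector used to filter time series.
<time_duration>: how far to shift back, such as 15s, 5m, or 1h.

For example:

1) Query prometheus_http_requests_total as it was 10 minutes ago:

prometheus_http_requests_total offset 10m

2) Query prometheus_http_requests_total with code=200 as it was 1 hour ago:

prometheus_http_requests_total{code="200"} offset 1h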
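
Conceptually, offset moves the point in time at which the series is looked up. The Python sketch below models an instant lookup as "the most recent sample at or before the lookup time", limited by a lookback window; the 300-second value mirrors the default lookback of 5 minutes, and everything else (names, data) is an assumption for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

LOOKBACK_SECONDS = 300  # assumed lookback window, mirroring the 5m default


def value_at(samples: List[Tuple[float, float]], eval_time: float,
             offset: float = 0.0) -> Optional[float]:
    """Return the latest sample value at or before eval_time - offset."""
    t = eval_time - offset
    timestamps = [ts for ts, _ in samples]  # samples must be sorted by time
    i = bisect_right(timestamps, t)
    if i == 0:
        return None  # no sample at or before t
    ts, value = samples[i - 1]
    # Samples older than the lookback window are not returned.
    return value if t - ts <= LOOKBACK_SECONDS else None


series = [(0, 1.0), (600, 2.0), (1200, 3.0)]
print(value_at(series, 1200), value_at(series, 1200, offset=600))
```

Subtracting the two values gives a simple "now versus 10 minutes ago" comparison, which is a typical use of offset.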

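3.5 Aggregation

PromQL provides aggregation operators that combine many time series into fewer ones, which helps extract useful information from large amounts of data. Commonly used ones are:

count(): the number of time series.
sum(): the sum of the values of all time series.
avg(): the average of the values.
max() and min(): the largest and the smallest value.
stddev(): the standard deviation of the values.
topk() and bottomk(): the k time series with the largest or the smallest values.

Examples:

1) Sum of all prometheus_http_requests_total time series with code=200:

sum(prometheus_http_requests_total{code="200"})

2) Average of the prometheus_http_requests_total time series with code=200:

avg(prometheus_http_requests_total{code="200"})

3) The 5 prometheus_http_requests_total time series with code=200 that have the largest values:

topk(5, prometheus_http_requests_total{code="200"})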
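
The effect of these operators can be reproduced on a small in-memory vector. The Python sketch below is only a model of the semantics: the handler labels and values are invented, and grouping with by or without is left out.

```python
import heapq
from statistics import mean
from typing import Dict, List, Tuple

LabelSet = Tuple[Tuple[str, str], ...]

# An instant vector: one current value per label set (values are made up).
vector: Dict[LabelSet, float] = {
    (("handler", "/api/v1/query"),): 120.0,
    (("handler", "/metrics"),): 300.0,
    (("handler", "/-/healthy"),): 30.0,
}


def agg_sum(v: Dict[LabelSet, float]) -> float:
    """sum(): add up the values of all series."""
    return sum(v.values())


def agg_avg(v: Dict[LabelSet, float]) -> float:
    """avg(): arithmetic mean of the values of all series."""
    return mean(v.values())


def topk(k: int, v: Dict[LabelSet, float]) -> List[Tuple[LabelSet, float]]:
    """topk(): the k series with the largest values, largest first."""
    return heapq.nlargest(k, v.items(), key=lambda item: item[1])


print(agg_sum(vector), agg_avg(vector), topk(2, vector))
```

Note that sum() and avg() collapse the vector into a single value, while topk() keeps the original series and their labels.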

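4. Prometheus monitoring examples

4.1 Monitoring a Spring Boot application

Spring Boot provides Spring Boot Actuator for monitoring and managing Spring Boot applications. It exposes a set of endpoints that report the application's internal state, including health checks, metrics, environment information, audit events, and application configuration. Developers and operators can use these endpoints to monitor and manage the application at runtime.

Steps:

Add the dependencies: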

<dependencies>
    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Micrometer Prometheus Registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
        <scope>runtime</scope>
    </dependency>
</dependencies>

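Configuration file (for example application.yml):

# Expose all available Actuator endpoints
management:
    endpoints:
        web:
            exposure:
                include: '*'
    # Health check configuration: always show full details
    endpoint:
        health:
            show-details: ALWAYS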

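Configure and start Prometheus: download Prometheus and add a scrape job to its configuration file that points at the metrics endpoint of the Spring Boot application.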

scrape_configs:
  - job_name: 'spring-boot-application'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['localhost:8080']
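
Before wiring up Prometheus, it can help to look at what the /actuator/prometheus endpoint actually returns. The Python sketch below parses such a text response into (name, labels, value) tuples. It is a simplified parser written for this article, not an official one: escape sequences inside label values are not decoded, and lines it cannot handle are skipped.

```python
import re
from typing import Dict, List, Tuple

# name, optional {labels}, value, optional timestamp
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _LINE.match(line)
        if m is None:
            continue  # ignore lines this simplified parser cannot handle
        name, raw_labels, value = m.group(1), m.group(2) or "", m.group(3)
        samples.append((name, dict(_LABEL.findall(raw_labels)), float(value)))
    return samples
```

Feeding it the body fetched from http://localhost:8080/actuator/prometheus (the address used in the scrape job above) gives a quick way to check which metrics the application exposes.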

Visualize the metrics with Grafana: install Grafana, add Prometheus as a data source, and create dashboards to display the monitoring data.

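4.2 Monitoring a Linux system

Steps:

Install Node Exporter on the target Linux server: download it and extract it into the /usr/local directory.
Start the Node Exporter service.
Add a scrape configuration to the Prometheus configuration file prometheus.yml.

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['<Linux server IP>:9100']

Restart the Prometheus service to load the new configuration.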
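
As an alternative to a full restart, Prometheus can reload its configuration through the /-/reload HTTP endpoint, provided the server was started with the --web.enable-lifecycle flag. The Python sketch below sends that request; the base URL and the function names are assumptions for this example.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build the POST request that asks Prometheus to reload its configuration."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload",
                                  data=b"", method="POST")


def reload_config(base_url: str = "http://localhost:9090") -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=5) as resp:
        return resp.status
```

If the lifecycle API is not enabled, the request fails and restarting the service remains the way to apply the change.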

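4.3 Monitoring middleware

Example: monitoring Redis

Steps:

Install Redis Exporter on the Redis server.
Add a scrape configuration to the Prometheus configuration file prometheus.yml.

scrape_configs:
  - job_name: 'redis'
    static_configs:
      - targets: ['<Redis server IP>:9121']

Restart the Prometheus service to load the new configuration.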
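
Once the job is configured, the built-in up metric shows whether each target could be scraped (1) or not (0). The Python sketch below inspects the JSON returned by the HTTP API for a query such as up{job="redis"} and lists the targets that are down. The function name and the sample response are made up for illustration; the response layout follows the instant-query format of /api/v1/query.

```python
from typing import Dict, List


def down_targets(query_response: Dict) -> List[str]:
    """Return the instance labels of all series in the response whose value is 0."""
    if query_response.get("status") != "success":
        raise ValueError("query did not succeed")
    down = []
    for series in query_response["data"]["result"]:
        _timestamp, value = series["value"]  # the value is encoded as a string
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down


# Hypothetical response for the query up{job="redis"}.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "redis", "instance": "10.0.0.5:9121"}, "value": [1700000000, "1"]},
            {"metric": {"job": "redis", "instance": "10.0.0.6:9121"}, "value": [1700000000, "0"]},
        ],
    },
}
print(down_targets(example))
```

The same check works for the node and spring-boot-application jobs, since up is recorded for every scrape target.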