Prometheus Monitoring Deployment Plan

Introduction

Before we start, here is an analogy: think of the whole monitoring system as a supermarket chain.

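Term	Official definition	Supermarket analogy	Docker image
Prometheus Server	The monitoring core; scrapes, stores, and queries data.	Headquarters data center plus the patrol captain. Holding a duty roster (the configuration), the captain regularly visits each branch (node/Pod), copies down its sales figures (metrics), and records them in the ledger (the database).	`prom/prometheus`
Node Exporter	An agent that exposes host hardware metrics.	The branch stock clerk. Every branch (server node) must have one. The clerk tallies the store's utility bills and inventory (CPU/memory) and leaves the figures at the door for the patrol captain to copy.	`prom/node-exporter`
ConfigMap	A K8s configuration object that can be mounted as files.	Headquarters' official memo / duty roster. Instead of tattooing the roster on the captain (baking it into the image), it hangs on the wall. Changing the roster only means swapping the sheet of paper; after a restart the captain works from the new one.	(native K8s object, no image)
ServiceAccount	The identity a Pod uses to access the K8s API.	The captain's "special pass". Without it, the security guard (the K8s API) stops the captain on the way to read the branch meters.	(native K8s object)
Deployment	A controller that manages stateless applications.	The employment contract. It tells K8s: "I want to hire 1 patrol captain; if he quits (crashes), hire an identical one right away."	(K8s resource definition)
DaemonSet	A controller that ensures one Pod runs on every node.	Mandatory staffing. It tells K8s: "No matter how many branches I open, each one must have exactly one stock clerk (Node Exporter)."	(K8s resource definition)
PVC	A claim for a persistent storage volume.	The safe / archive room. It makes sure the ledger (the data) is not lost when the captain is replaced (the Pod restarts).	(native K8s object)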

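Phase 1: Environment Preparation and Image Download

Confirm the environment: make sure kubectl can reach the cluster and that the cluster has at least one node.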

# Check node status
kubectl get nodes

# Create the namespace
kubectl create namespace monitoring
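
The check above can also be scripted. The following is a minimal sketch, not part of the original procedure: `nodes_all_ready` is a hypothetical helper name, and it assumes the default `kubectl get nodes` column layout, where the second column is STATUS. A cordoned node (`Ready,SchedulingDisabled`) is treated as not ready by this sketch.

```shell
# Hypothetical helper: succeeds only if at least one node exists and every
# node reports exactly "Ready" in the STATUS column (column 2).
nodes_all_ready() {
  kubectl get nodes --no-headers | awk '
    { total++ }
    $2 != "Ready" { bad++ }
    END { exit (total > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage (requires a reachable cluster):
#   nodes_all_ready && kubectl create namespace monitoring
```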

Pre-pull the images

# Pull the Prometheus server (pin a specific version to avoid surprises from a moving latest tag)
docker pull prom/prometheus:v2.55.1

# Pull Node Exporter (collects server hardware metrics)
docker pull prom/node-exporter:v1.8.2

# Pull Grafana (displays the data)
docker pull grafana/grafana:10.4.2

Phase 2: Building the Infrastructure (RBAC & Namespace)

Prometheus needs permission to patrol the whole cluster.

Create a file named 01-rbac.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-role
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-role
subjects:
- kind: ServiceAccount
  name: prometheus-sa
  namespace: monitoring

Apply it:

kubectl apply -f 01-rbac.yaml
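
To confirm that the binding took effect, one option is to ask the API server directly by impersonating the ServiceAccount. This is a sketch under assumptions: `sa_can` is a hypothetical helper name, and your own user must be allowed to impersonate service accounts (cluster admins normally are).

```shell
# Hypothetical helper: asks the API server whether the prometheus-sa
# ServiceAccount may perform a verb on a resource, by impersonating it.
sa_can() {
  kubectl auth can-i "$1" "$2" \
    --as=system:serviceaccount:monitoring:prometheus-sa
}

# Usage (requires a reachable cluster):
#   sa_can list pods      # should answer "yes" once 01-rbac.yaml is applied
#   sa_can delete pods    # should answer "no"; the role is read-only
```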

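Phase 3: Writing the Configuration (ConfigMap)

The ConfigMap tells the Prometheus server "where to scrape data" and "when to raise an alert"; workloads are discovered by label.

Create a file named 02-configmap.yaml. Note that the scrape and alerting configuration here is deliberately minimal; adapt it to your environment in practice.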

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  # Main configuration file: prometheus.yml
  prometheus.yml: |
    global:
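      scrape_interval: 15s  # Scrape targets every 15 seconds
      evaluation_interval: 15s # Evaluate alerting rules every 15 seconds

    # Location of the alerting rule files.
    # The ConfigMap is mounted at /etc/prometheus, so alert.rules ends up
    # directly in that directory (there is no rules/ subdirectory).
    rule_files:
      - /etc/prometheus/*.rules

    # Scrape configuration (jobs)
    scrape_configs:
      # 1. Monitor Prometheus itself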
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      
      # 2. Monitor the K8s API server (core component)
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      
      # 3. Monitor all nodes (via Node Exporter)
      # Logic: discover every K8s node automatically, then scrape port 9100 on each
      - job_name: 'node-exporter'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.+):(.+)'
            replacement: '${1}:9100'  # Replace the port with 9100
            target_label: __address__
      
      # 4. Label-based workload discovery (recommended)
      # Any Pod carrying the label monitor=true is scraped automatically
      - job_name: 'kubernetes-pods-label'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_monitor]
            action: keep
            regex: true
          - source_labels: [__address__, __meta_kubernetes_pod_container_port_number]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
            replacement: ${1}

      # 5. Annotation-based workload discovery (alternative)
      # Any Pod with the annotation prometheus.io/scrape: "true" is monitored automatically
      - job_name: 'kubernetes-pods-annotation'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

  # Alerting rule file: alert.rules
  alert.rules: |
    groups:
    - name: basic_alerts
      rules:
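      # Note: kube_pod_container_status_restarts_total comes from kube-state-metrics,
      # which this guide does not deploy; until it is scraped, this rule never fires.
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted at least once in the last 15 minutes."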
      
      - alert: HighNodeCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on node {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes."

Apply it:

kubectl apply -f 02-configmap.yaml
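
The two address-rewriting relabel rules in this ConfigMap are easier to follow with concrete values. The sketch below only approximates them with `sed -E`; it is not how Prometheus evaluates them. Prometheus joins the source label values with `;` and anchors the regex at both ends. Since sed's extended regex syntax has no `(?:...)` or `\d`, an ordinary group and `[0-9]` stand in for them. The function names and sample addresses are made up for illustration.

```shell
# Approximates the pod-address rule: input is "<__address__>;<port>",
# output is the host part with the given port.
relabel_pod_address() {
  printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
}

# Approximates the node-exporter rule: keep the host, force port 9100.
relabel_node_address() {
  printf '%s\n' "$1" | sed -E 's/^(.+):(.+)$/\1:9100/'
}

relabel_pod_address '10.0.0.5:8080;8800'   # prints 10.0.0.5:8800
relabel_pod_address '10.0.0.5;8800'        # prints 10.0.0.5:8800
relabel_node_address '192.168.1.10:10250'  # prints 192.168.1.10:9100
```

As in Prometheus, an input that does not match the pattern is left unchanged.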

Phase 4: Deploying Prometheus (with Persistent Storage)

A PersistentVolumeClaim keeps the data across Pod restarts; retention is set to 15 days.

Create a file named 03-prometheus-deployment.yaml:
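
Whether the 10Gi request used in this phase is enough for 15 days depends on how many series you ingest. A rough estimate can follow the rule of thumb from the Prometheus storage documentation (retention seconds × samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). In the sketch below, `estimate_tsdb_mib` is a hypothetical helper and the 10,000 active series are an assumed example, not a measurement from this cluster.

```shell
# Rough TSDB size estimate in MiB:
#   disk = retention_seconds * samples_per_second * bytes_per_sample
# $1 = active series (assumption), $2 = scrape interval in seconds,
# $3 = retention in days, $4 = bytes per sample (1 to 2 is typical).
estimate_tsdb_mib() {
  series=$1; interval=$2; days=$3; bytes=$4
  echo $(( series * days * 86400 * bytes / interval / 1048576 ))
}

estimate_tsdb_mib 10000 15 15 2   # prints 1647
```

Under these assumptions the data would occupy about 1.6 GiB, so 10Gi leaves headroom; rerun the estimate with your own series count.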

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  # storageClassName: standard # Uncomment if you need a specific storage class
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      # 1. Use the dedicated ServiceAccount created earlier (the "pass")
      serviceAccountName: prometheus-sa
      
      containers:
      - name: prometheus
        # 2. [Key] Image name and version
        image: prom/prometheus:v2.55.1
        imagePullPolicy: IfNotPresent
        
        ports:
        - containerPort: 9090
          name: http
        
        # 3. Startup arguments: config file path and data storage path
        args:
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--storage.tsdb.retention.time=15d"
          - "--web.console.libraries=/etc/prometheus/console_libraries"
          - "--web.console.templates=/etc/prometheus/consoles"
        
        # 4. Mount the configuration (from the ConfigMap) and the data volume
        volumeMounts:
          - name: config-volume
            mountPath: /etc/prometheus
          - name: data-volume
            mountPath: /prometheus
        
        # Health checks
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 15
        
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 5
          periodSeconds: 5

        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"

      # 5. Define where the volumes come from
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: data-volume
          persistentVolumeClaim:
            claimName: prometheus-data
---
# Expose the service so the web UI can be reached
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
  type: NodePort  # NodePort makes external access easy; in production prefer LoadBalancer or Ingress

Apply it:

kubectl apply -f 03-prometheus-deployment.yaml
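
After applying, it can help to wait for the rollout and confirm the volume was bound before moving on. The helpers below are a sketch with hypothetical names (`wait_for_prometheus`, `pvc_is_bound`); the 120s default timeout is an arbitrary choice. Note that with a StorageClass that uses WaitForFirstConsumer binding, the PVC may stay Pending until the Pod is scheduled, so check the rollout first.

```shell
# Blocks until the Deployment has rolled out or the timeout expires.
# $1 = optional timeout (defaults to 120s, an arbitrary choice).
wait_for_prometheus() {
  kubectl rollout status deployment/prometheus -n monitoring --timeout="${1:-120s}"
}

# Succeeds only if the named PVC in the monitoring namespace is Bound.
pvc_is_bound() {
  phase=$(kubectl get pvc "$1" -n monitoring -o jsonpath='{.status.phase}')
  [ "$phase" = "Bound" ]
}

# Usage (requires a reachable cluster):
#   wait_for_prometheus && pvc_is_bound prometheus-data && echo "Prometheus is up"
```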

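Phase 5: Deploying the Node Exporter DaemonSet

Prometheus itself is now running, but it does not yet know the CPU/memory usage of the servers. We need to deploy node-exporter on every node.

Create a file named 04-node-exporter.yaml: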

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true # Use the host network so the real node IP is used directly
      hostPID: true     # Use the host PID namespace
      tolerations:
      - effect: NoSchedule
        operator: Exists
      containers:
      - name: node-exporter
        # 1. [Key] Node Exporter image
        image: prom/node-exporter:v1.8.2
        imagePullPolicy: IfNotPresent
        
        ports:
        - containerPort: 9100
          hostPort: 9100
        
        args:
          - "--path.procfs=/host/proc"
          - "--path.sysfs=/host/sys"
          - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)"
        
        volumeMounts:
          - name: proc
            mountPath: /host/proc
            readOnly: true
          - name: sys
            mountPath: /host/sys
            readOnly: true
        resources:
          requests:
            memory: "50Mi"
            cpu: "100m"
          limits:
            memory: "100Mi"
            cpu: "200m"
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys

Apply it:

kubectl apply -f 04-node-exporter.yaml
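
Because a DaemonSet should place one Pod on every node, a quick sanity check is to compare the ready count with the desired count. This is a sketch; `daemonset_ready` is a hypothetical helper name, and it reads the `numberReady` and `desiredNumberScheduled` fields of the DaemonSet status.

```shell
# Prints "<name>: X of Y pods ready" and succeeds only if X equals Y and Y > 0.
daemonset_ready() {
  status=$(kubectl get daemonset "$1" -n monitoring \
    -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled}')
  ready=${status%/*}      # text before the slash
  desired=${status#*/}    # text after the slash
  echo "$1: $ready of $desired pods ready"
  [ -n "$ready" ] && [ "$ready" = "$desired" ] && [ "$desired" != "0" ]
}

# Usage (requires a reachable cluster):
#   daemonset_ready node-exporter
```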

Phase 6: Installing Grafana for Visualization

Prometheus' built-in UI can only query data and is hard to read, so it is usually paired with Grafana.

To deploy Grafana, create 05-grafana.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-service.monitoring.svc:9090
        isDefault: true
        access: proxy
        editable: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:10.4.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
        env:
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "admin123" # Admin password; change it before real use
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
        - name: datasources
          mountPath: /etc/grafana/provisioning/datasources
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-data
      - name: datasources
        configMap:
          name: grafana-datasources
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
  - port: 3000
    targetPort: 3000
  type: NodePort

Apply it:

kubectl apply -f 05-grafana.yaml

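If the data source was not provisioned successfully, configure it manually as follows:

Open http://<node IP>:<Grafana port> (use kubectl get svc -n monitoring to look up the port).

Click Connections -> Data Sources -> Add data source -> select Prometheus.

In the URL field, enter the in-cluster address of Prometheus: http://prometheus-service.monitoring.svc:9090.

Click Save & Test; a green check mark means it worked.

Import dashboards:

Click Dashboards -> New -> Import.

Enter the ID of a popular community template, for example 1860 (Node Exporter Full) or 315 (Kubernetes Cluster).

Click Load, select the Prometheus data source you just configured, and click Import.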

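Phase 7: Verification and Access

Check that all components are running:

kubectl get pods -n monitoring

Expected result:
prometheus-xxxxx: status should be Running (1/1).
node-exporter-xxxxx: there should be several (one per node), all Running (1/1).
grafana-xxx: Running (1/1).
If any Pod is Pending or in CrashLoopBackOff, run kubectl describe pod <pod-name> -n monitoring to see the error.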

Get the access address:

kubectl get svc -n monitoring

Because the Services are of type NodePort, you need an externally reachable IP for the Service and the port that was assigned to it.
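
If the nodes are directly reachable from your machine, the address can be assembled from the assigned NodePort and a node IP. The sketch below makes two assumptions that may not hold in your environment: it uses the first port of the Service and the InternalIP of the first node in the list. `service_url` is a hypothetical helper name.

```shell
# Prints http://<node InternalIP>:<nodePort> for a Service in the
# monitoring namespace, using the first node and the first Service port.
service_url() {
  port=$(kubectl get svc "$1" -n monitoring \
    -o jsonpath='{.spec.ports[0].nodePort}')
  ip=$(kubectl get nodes \
    -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
  echo "http://$ip:$port"
}

# Usage (requires a reachable cluster):
#   service_url prometheus-service
#   service_url grafana-service
```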

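On Tencent Cloud, open the container service console, edit the prometheus-service configuration, switch it to public LB access, and tick "Use load balancer direct-to-Pod mode".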

Then look up the external IP of prometheus-service; in this example it is 43.145.28.54.

Open a browser and go to http://43.145.28.54:9090, then click Status -> Targets in the top menu. If jobs such as kubernetes-apiservers, node-exporter, and prometheus all show UP (green), the deployment succeeded.
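
The same check can be done from a terminal through the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports a `health` field for each active target. The sketch below counts unhealthy targets with plain grep; it assumes the compact JSON that Prometheus emits (no space after the colon), and `jq` would be more robust if it is installed. `count_down_targets` and `PROM_URL` are hypothetical names.

```shell
# Reads the JSON body of /api/v1/targets on stdin and prints how many
# targets report "health":"down".
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d ' '
}

# Usage (PROM_URL is an assumption; point it at your own address):
#   PROM_URL=http://43.145.28.54:9090
#   curl -s "$PROM_URL/api/v1/targets" | count_down_targets
```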

Phase 8: Monitoring Your Own Workloads Later

Suppose there is a service named ezcloud-mqtt-auth. There are two ways to make it show up in monitoring:

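Method A: add a label (recommended; no annotations needed in the workload YAML). Because the kubernetes-pods-label job matches on Pod labels, add monitor: "true" to the Pod template labels (spec.template.metadata.labels) of your Deployment; a label on the Deployment's own metadata.labels is not copied to its Pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ezcloud-mqtt-auth
  labels:
    app: ezcloud-mqtt-auth
spec:
  template:
    metadata:
      labels:
        app: ezcloud-mqtt-auth
        monitor: "true"  # <--- add this
  # ...

Prometheus then discovers the Pod automatically and scrapes /metrics on each container port the Pod declares.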

Method B: add annotations (alternative). In the Pod template's metadata.annotations, add:

template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8800"       # Port to scrape
      prometheus.io/path: "/actuator/prometheus" # Metrics path
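
The annotations can also be added without editing the manifest by patching the Deployment's Pod template. This is a sketch: `scrape_patch` is a hypothetical helper that only builds the patch JSON, and the `kubectl patch` call in the usage comment assumes the Deployment lives in your current namespace. Patching the Pod template triggers a rolling restart of the Pods.

```shell
# Builds a merge patch that sets the three scrape annotations on the
# Pod template. $1 = metrics port, $2 = metrics path.
scrape_patch() {
  printf '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"%s","prometheus.io/path":"%s"}}}}}\n' "$1" "$2"
}

# Usage (requires a reachable cluster):
#   kubectl patch deployment ezcloud-mqtt-auth \
#     -p "$(scrape_patch 8800 /actuator/prometheus)"
```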

After applying the workload, check the Prometheus Targets page again; the target for your workload should show as UP.