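Elasticsearch 8.x Monitoring and Alerting Handbook

Contents
General Principles
Monitoring Layers
Metric Collection
Core Metrics
Alert Severity and Response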

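Applicable versions: Elasticsearch 8.6+
Scope: production and pre-production clusters
Purpose: the standard reference for the SRE/On-Call team when onboarding monitoring, defining alert rules, and responding on call
Companion documents: "Elasticsearch Deployment and Operations Handbook", "Incident Response Runbook"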

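1. General Principles

1.1 Three Purposes of Monitoring

Early warning (trends): capacity, slow queries, and heap growth, so the team is not stuck firefighting
Alerting (sudden events): node down, disk full, write rejections; respond as soon as they fire
Postmortem (traceability): after an incident, the metrics must let you reconstruct what happened and locate the root cause

Every metric must serve at least one of these purposes; otherwise it should not be collected. "Collect it now, decide later" is prohibited: metrics are a cost too (storage, alert noise, maintenance).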

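1.2 Five Iron Rules

ID	Rule	Counterexample
M1	The monitoring cluster is physically separate from business clusters	The business cluster goes down and takes monitoring visibility with it
M2	An alert must have a recipient and a response SOP, or it does not go live	Alerts fire everywhere and nobody responds
M3	Every alert must be manually reproducible, and its recovery verifiable	False positives cannot be told apart from real ones
M4	Thresholds are set from the historical P99, not from gut feeling	"80% feels like it should alert"
M5	Silences and inhibition rules are managed explicitly and audited regularly	Inhibition rules pile up and swallow a real incident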
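
Rule M4 can be made concrete with a Prometheus recording rule that computes the 30-day P99 of a metric, so that a proposed threshold can be checked against real history. This is a minimal sketch, assuming the exporter metric names used by the rules in section 6; the group name, the recorded series name, the 30-day window, and the 5-minute subquery step are illustrative choices and not part of this handbook.

```yaml
groups:
  - name: es-threshold-baselines              # hypothetical group name
    interval: 1h                              # a 30d subquery is expensive; evaluate rarely
    rules:
      # 30-day P99 of heap usage per node, as a ratio in [0, 1].
      - record: es:jvm_heap_used_ratio:p99_30d   # hypothetical series name
        expr: |
          quantile_over_time(
            0.99,
            (
              elasticsearch_jvm_memory_used_bytes{area="heap"}
                / elasticsearch_jvm_memory_max_bytes{area="heap"}
            )[30d:5m]
          )
```

If the recorded P99 already sits above a candidate threshold, that threshold will likely fire during normal operation and should be revisited before the alert goes live.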

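1.3 The Alert Golden Path

Metric collection → Rule matching → Alert generation → Routing (IM/phone/email)
                                                          ↓
         On-Call accepts → Respond → Resolve → Postmortem → Adjust rules

Every step must itself be observable (collection latency, rule evaluation, delivery success rate); otherwise the alerting pipeline can fail "silently".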
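
The observability of the pipeline itself can be expressed as a small rule group built on the metrics that Prometheus and Alertmanager expose about themselves. The sketch below is illustrative only: the group name, alert names, thresholds, `for` durations, and severities are assumptions, and `job="elasticsearch"` is taken from the ESNodeDown rule in section 6.2.

```yaml
groups:
  - name: alerting-pipeline-health            # hypothetical group name
    rules:
      # Collection: scrapes of the ES exporter are slow.
      - alert: ESScrapeSlow
        expr: scrape_duration_seconds{job="elasticsearch"} > 5   # threshold is an assumption
        for: 5m
        labels: { severity: P2, team: sre }

      # Rule evaluation: at least one rule failed to evaluate.
      - alert: PrometheusRuleEvaluationFailing
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels: { severity: P1, team: sre }

      # Delivery: Alertmanager could not push notifications.
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 5m
        labels: { severity: P1, team: sre }

      # End-to-end heartbeat: always firing. An external receiver should
      # raise an alarm when this notification stops arriving.
      - alert: Watchdog
        expr: vector(1)
        labels: { severity: P4, team: sre }
```

The Watchdog rule covers the case the other three cannot: if Prometheus or Alertmanager is down entirely, only the absence of a heartbeat reveals it.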

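2. Monitoring Layers

2.1 Five-Layer Model

┌─────────────────────────────────────────────────────────────────┐
│ L5  Business metrics     (CTR, relevance, SLA attainment)       │  ← led by business teams
├─────────────────────────────────────────────────────────────────┤
│ L4  Application metrics  (client QPS, P99, error rate)          │  ← led by service teams
├─────────────────────────────────────────────────────────────────┤
│ L3  Cluster metrics      (query/indexing latency, shard state)  │  ← led by SRE
├─────────────────────────────────────────────────────────────────┤
│ L2  Node metrics         (JVM, thread pools, disk, IO)          │  ← led by SRE
├─────────────────────────────────────────────────────────────────┤
│ L1  Host/OS              (CPU, memory, network, disk)           │  ← led by SRE
└─────────────────────────────────────────────────────────────────┘

Responsibilities per layer:

L1 to L3 are owned by the SRE team, which decides alerting and response
L4 is owned by the business service teams, in coordination with SRE
L5 is defined by the business teams; monitoring only provides the collection channel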

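2.2 Alert Ownership

Every alert must have an explicit layer and recipient:

Alert type	Layer	Recipient	Escalation path
Node down	L1/L2	SRE on-call	SRE Lead → Architecture Group
Cluster Red	L3	SRE on-call	SRE Lead → Architecture Group → CTO
Write rejections	L3	SRE on-call + business owner	SRE Lead
Client P99 spike	L4	Service On-Call	Service Lead → SRE
Search CTR drop	L5	Business team	Product Lead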

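3. Metric Collection

3.1 Collection Architecture

3.2 Why Business Clusters Must Not Monitor Themselves

The built-in xpack.monitoring collection (deprecated) and Stack Monitoring, when they write monitoring data into the same cluster they monitor, have two fatal problems:

When the business cluster goes down, monitoring goes blind with it
Monitoring writes compete with business traffic for resources

Mandatory in production: a dedicated monitoring cluster with at least 3 nodes.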

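3.3 Collection Intervals

Metric type	Collection interval	Retention
Cluster status	10s	30 days
Node JVM / thread pools	10s	30 days
Index-level indexing/search	30s	90 days
Slow query log	real time	30 days
Audit log	real time	1 year
OS metrics (CPU/disk)	15s	30 days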
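
The intervals in the table map onto per-job `scrape_interval` settings in Prometheus. The fragment below is a sketch under assumptions: the target hostnames are hypothetical, port 9114 assumes the exporter's common default, and the `elasticsearch` job name is chosen to match the `up{job="elasticsearch"}` selector used in section 6.2.

```yaml
global:
  scrape_interval: 15s                  # default; matches the OS metrics row
  evaluation_interval: 15s

scrape_configs:
  # Cluster status, node JVM and thread pools: 10s.
  - job_name: elasticsearch
    scrape_interval: 10s
    static_configs:
      - targets: ["es-exporter-1.example.internal:9114"]   # hypothetical host

  # OS metrics from Node Exporter: 15s.
  - job_name: node
    scrape_interval: 15s
    static_configs:
      - targets: ["es-node-1.example.internal:9100"]       # hypothetical host

# Retention is not set in this file: Prometheus takes it from the
# --storage.tsdb.retention.time command-line flag (for example 30d).
```

A single exporter endpoint is scraped at one interval, so collecting index-level metrics at 30s would need a separate job or exporter instance; that split is left out of this sketch. Log retention (slow query and audit logs) is handled by the log pipeline, not by Prometheus.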

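4. Core Metrics

4.1 Cluster Level (L3)

Metric	Data source	Meaning	Key threshold
`cluster_status`	`_cluster/health`	green/yellow/red	alert on red
`unassigned_shards`	`_cluster/health`	Number of unassigned shards	> 0 for 10min
`relocating_shards`	`_cluster/health`	Number of relocating shards	abnormal if > 10 for a prolonged period
`initializing_shards`	`_cluster/health`	Number of initializing shards	abnormal if > 10 for a prolonged period
`pending_tasks`	`_cluster/pending_tasks`	Number of tasks pending on the master	> 50 for 5min
`number_of_data_nodes`	`_cluster/health`	Number of data nodes	alert when below the expected count
`task_max_waiting_in_queue_millis`	`_cluster/health`	Longest task wait time	> 30s
`active_shards_percent`	`_cluster/health`	Percentage of active shards	< 100% sustained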

4.2 Node Level (L2)

4.2.1 JVM

Metric	Threshold	Notes
`jvm.mem.heap_used_percent`	> 75% for 10min warning, > 85% critical	Heap usage too high
`jvm.gc.collectors.young.collection_time_in_millis` (growth rate)	> 1s/min	Excessive young GC time
`jvm.gc.collectors.old.collection_time_in_millis` (growth rate)	> 5s/min	Abnormal old GC time
`jvm.gc.collectors.old.collection_count` (growth rate)	> 5/min	Frequent old-generation (full) GC

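4.2.2 Thread Pools

The most critical ones:

Thread pool	Key metric	Threshold
`write`	`rejected` growth rate	> 0/min warning
`search`	`rejected` growth rate	> 0/min warning
`bulk` (legacy name; renamed to `write` in 6.x, so it does not exist in 8.x)	same as above	same as above
`management`	`queue` length	> 50
`flush`	`queue` length	> 5

rejected is the most important alert signal: every rejection means Elasticsearch refused a request, and the business is affected unless the client retries successfully.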

4.2.3 Disk / File Descriptors

Metric	Threshold
`fs.total.disk_used_percent`	> 70% warning, > 80% critical
`fs.io_stats.total.write_kilobytes` (growth rate)	compare against the historical P99
`process.open_file_descriptors / max_file_descriptors`	> 70% warning

4.2.4 OS (Node Exporter)

Metric	Threshold
`node_load1 / cpu_count`	> 0.8 sustained
`node_cpu_seconds_total{mode="iowait"}` (rate, as a share of CPU time)	> 30% sustained
`node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`	< 10%
`node_filesystem_avail_bytes / node_filesystem_size_bytes`	< 20%
`node_network_receive_drop_total` (rate)	> 0/s
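
Two of the rows above need some PromQL before they become usable thresholds: `cpu_count` is not a Node Exporter metric and has to be derived, and `node_cpu_seconds_total` is a counter that must be turned into a rate. The sketch below shows one way to write all five checks; the group name, alert names, `for` durations, and severities are assumptions, since the table only says "sustained".

```yaml
groups:
  - name: es-host-os                          # hypothetical group name
    rules:
      - alert: HostLoadPerCoreHigh
        # node_load1 divided by the number of CPU cores on the same instance.
        expr: |
          node_load1
            / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 0.8
        for: 10m
        labels: { severity: P2, team: sre }

      - alert: HostIowaitHigh
        # Share of CPU time spent in iowait, averaged over all cores.
        expr: avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.3
        for: 10m
        labels: { severity: P2, team: sre }

      - alert: HostMemoryAvailableLow
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
        labels: { severity: P2, team: sre }

      - alert: HostFilesystemSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.20
        for: 10m
        labels: { severity: P2, team: sre }

      - alert: HostNetworkReceiveDrops
        expr: rate(node_network_receive_drop_total[5m]) > 0
        for: 10m
        labels: { severity: P3, team: sre }
```

In practice the filesystem rule usually needs a `mountpoint` or `fstype` matcher so that it only covers the ES data volume; which matcher applies depends on the host layout.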

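4.3 Index Level (L3)

Metric	Meaning	Threshold
`indexing.index_total` (growth rate)	Indexing QPS	compare against history
`indexing.index_time_in_millis / index_total`	Average indexing latency	> 50ms
`indexing.index_failed` (growth rate)	Indexing failures	> 0/min
`search.query_total` (growth rate)	Query QPS	compare against history
`search.query_time_in_millis / query_total`	Average query latency	> 200ms (search) / > 1s (analytics)
`search.fetch_time_in_millis / fetch_total`	Average fetch latency	> 100ms
`refresh.total_time_in_millis` (growth rate)	Refresh time	abnormal growth
`merges.current`	Merges in progress	abnormal if > 5 for a prolonged period
`segments.count`	Segment count	> 1000 per shard
`store.size_in_bytes`	Index size	trend forecasting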

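4.4 Slow Queries (L3)

The slow query log is collected in structured form:

Field	Purpose
`took_millis`	Sort to find the Top N
`query`	Replay the original query
`source` length	Check whether the request body is too large
`index`	Locate hot indices
`id` (X-Opaque-Id)	Correlate with application traces

Alerts: alert immediately on any single query > 5s; alert when P99 exceeds the SLA for 5min.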

4.5 Client Side (L4, application instrumentation)

Metric	Type	Labels
`es_request_duration_seconds`	Histogram	operation, index, outcome
`es_request_total`	Counter	operation, index, status
`es_retry_total`	Counter	operation, reason
`es_circuit_breaker_state`	Gauge	name
`es_bulk_actions_total`	Counter	index, action, outcome
`es_bulk_queue_depth`	Gauge	writer
`es_connection_pool_active`	Gauge	host

See "Elasticsearch Production Development Standards" §9.

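5. Alert Severity and Response

5.1 P-Level Definitions

Level	Name	Trigger	Response time	Notification
P0	Fatal	Cluster completely unavailable, risk of data loss	5 minutes	Phone + IM + email, notify everyone
P1	Critical	Core functionality impaired; recoverable but business is affected	15 minutes	Phone + IM
P2	Warning	Single-node anomaly, performance degradation, capacity warning	30 minutes	IM
P3	Notice	Trend warnings, findings from routine inspection	4 hours	IM/email
P4	Info	Configuration changes, periodic reports	No response required	Email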

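5.2 Mapping ES Failures to P Levels

ES failure	Default P level
Cluster Red for > 1min	P0
Cluster Yellow for > 30min	P1
A single master node down	P1
Data node down (causes yellow)	P2
Data node down (no impact)	P3
Disk > 90%	P1
Disk > 80%	P2
Disk > 70%	P3
JVM heap > 85% for 10min	P1
JVM heap > 75% for 30min	P2
Write rejections (rejected > 0)	P2
Search rejections (rejected > 0)	P2
Single slow query > 5s	P3
Backup failed	P2
Certificate expires within 30 days	P3
Certificate expires within 7 days	P1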

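5.3 Escalation Path

Each alert is timed from the moment it fires; if it is not acknowledged within the response time, it escalates automatically:

P0: not acked within  5min → SRE Lead → 10min → Architecture Group → 15min → CTO
P1: not acked within 15min → SRE Lead → 30min → Architecture Group
P2: not acked within 30min → SRE Lead
P3: not acked within 4h    → resend via IM and email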
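
The notification channels from section 5.1 can be wired to the `severity` label in an Alertmanager routing tree. The fragment below is a sketch under assumptions: the receiver names and webhook URLs are hypothetical, and every receiver posts to an assumed internal gateway that fans out to phone, IM, or email. Alertmanager's routing tree re-sends notifications but does not model a person-to-person escalation chain, so the SRE Lead → Architecture Group → CTO steps are assumed to live in the paging tool behind that gateway.

```yaml
route:
  receiver: sre-im                      # default for anything unmatched
  group_by: [alertname, cluster]
  routes:
    - matchers: ['severity="P0"']
      receiver: sre-phone-im-email
      repeat_interval: 5m               # re-notify on the P0 response window
    - matchers: ['severity="P1"']
      receiver: sre-phone-im
      repeat_interval: 15m
    - matchers: ['severity="P2"']
      receiver: sre-im
      repeat_interval: 30m
    - matchers: ['severity="P3"']
      receiver: sre-im
      repeat_interval: 4h
    - matchers: ['severity="P4"']
      receiver: sre-email
      repeat_interval: 24h

receivers:
  # All URLs below are hypothetical placeholders.
  - name: sre-phone-im-email
    webhook_configs:
      - url: "https://alert-gateway.example.internal/phone-im-email"
  - name: sre-phone-im
    webhook_configs:
      - url: "https://alert-gateway.example.internal/phone-im"
  - name: sre-im
    webhook_configs:
      - url: "https://alert-gateway.example.internal/im"
  - name: sre-email
    webhook_configs:
      - url: "https://alert-gateway.example.internal/email"
```

The `repeat_interval` values simply mirror the response times in section 5.1 so that an unacknowledged alert is re-sent once per window; they are a starting point, not a requirement of this handbook.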

6. Alert Rules

The rules below are written in PromQL and assume collection through Prometheus + Alertmanager + es_exporter. The full rule set can be imported from infra/monitoring/elasticsearch/rules.yaml.

6.1 Cluster Status

- alert: ESClusterRed
  expr: elasticsearch_cluster_health_status{color="red"} == 1
  for: 1m
  labels: { severity: P0, team: sre }
  annotations:
    summary: "ES cluster {{ $labels.cluster }} is RED"
    runbook: "https://wiki/es-runbook#cluster-red"

- alert: ESClusterYellowProlonged
  expr: elasticsearch_cluster_health_status{color="yellow"} == 1
  for: 30m
  labels: { severity: P1, team: sre }
  annotations:
    summary: "ES cluster {{ $labels.cluster }} has been yellow for more than 30 minutes"
    runbook: "https://wiki/es-runbook#cluster-yellow"

- alert: ESUnassignedShards
  expr: elasticsearch_cluster_health_unassigned_shards > 0
  for: 10m
  labels: { severity: P1, team: sre }
  annotations:
    summary: "Unassigned shards: {{ $value }}"
    runbook: "https://wiki/es-runbook#unassigned-shards"

- alert: ESPendingTasks
  expr: elasticsearch_cluster_health_number_of_pending_tasks > 50
  for: 5m
  labels: { severity: P2, team: sre }

6.2 Node Resources

- alert: ESNodeDown
  expr: up{job="elasticsearch"} == 0
  for: 1m
  labels: { severity: P1, team: sre }
  annotations:
    summary: "ES node {{ $labels.instance }} is unreachable"

- alert: ESHeapHigh
  expr: elasticsearch_jvm_memory_used_bytes{area="heap"} 
        / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
  for: 10m
  labels: { severity: P1, team: sre }

- alert: ESHeapWarning
  expr: elasticsearch_jvm_memory_used_bytes{area="heap"} 
        / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.75
  for: 30m
  labels: { severity: P2, team: sre }

- alert: ESOldGCFrequent
  expr: rate(elasticsearch_jvm_gc_collection_seconds_sum{gc="old"}[5m]) > 0.1
  for: 10m
  labels: { severity: P2, team: sre }
  annotations:
    summary: "Node {{ $labels.instance }} spends > 10% of its time in Old GC"

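# Disk severities follow the §5.2 mapping: > 90% P1, > 80% P2, > 70% P3.
- alert: ESDiskCritical
  expr: elasticsearch_filesystem_data_used_percent > 90
  for: 5m
  labels: { severity: P1, team: sre }

- alert: ESDiskHigh
  expr: elasticsearch_filesystem_data_used_percent > 80
  for: 5m
  labels: { severity: P2, team: sre }

- alert: ESDiskWarning
  expr: elasticsearch_filesystem_data_used_percent > 70
  for: 30m
  labels: { severity: P3, team: sre }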

- alert: ESFDExhausting
  expr: elasticsearch_process_open_files_count 
        / elasticsearch_process_max_files_count > 0.7
  for: 10m
  labels: { severity: P2, team: sre }

6.3 Indexing and Search

- alert: ESBulkRejected
  expr: rate(elasticsearch_thread_pool_rejected_count{type="write"}[1m]) > 0
  for: 1m
  labels: { severity: P2, team: sre }
  annotations:
    summary: "Write thread pool rejections on {{ $labels.instance }}"
    runbook: "https://wiki/es-runbook#bulk-rejected"

- alert: ESSearchRejected
  expr: rate(elasticsearch_thread_pool_rejected_count{type="search"}[1m]) > 0
  for: 1m
  labels: { severity: P2, team: sre }

- alert: ESIndexingLatencyHigh
  expr: rate(elasticsearch_indices_indexing_index_time_seconds_total[5m])
        / rate(elasticsearch_indices_indexing_index_total[5m]) > 0.05
  for: 10m
  labels: { severity: P2, team: sre }

- alert: ESQueryLatencyHigh
  expr: rate(elasticsearch_indices_search_query_time_seconds_total[5m])
        / rate(elasticsearch_indices_search_query_total[5m]) > 0.2
  for: 10m
  labels: { severity: P2, team: sre }

- alert: ESCircuitBreakerTripped
  expr: increase(elasticsearch_breakers_tripped_total[5m]) > 0
  for: 1m
  labels: { severity: P2, team: sre }
  annotations:
    summary: "Circuit breaker tripped: {{ $labels.breaker }} on {{ $labels.instance }}"

6.4 Slow Queries

- alert: ESSlowQuerySpike
  expr: rate(elasticsearch_slowlog_query_total{level="warn"}[5m]) > 1
  for: 5m
  labels: { severity: P2, team: sre }

- alert: ESSlowQueryCritical
  # Any single query > 5s; P3 per the §5.2 mapping
  expr: elasticsearch_slowlog_max_query_seconds > 5
  for: 1m
  labels: { severity: P3, team: sre }

6.5 Backups and Certificates

- alert: ESSnapshotFailed
  expr: elasticsearch_slm_snapshots_failed_total - elasticsearch_slm_snapshots_failed_total offset 1d > 0
  for: 1m
  labels: { severity: P2, team: sre }

- alert: ESSnapshotMissing
  expr: time() - elasticsearch_slm_last_success_timestamp_seconds > 86400 * 1.5
  for: 30m
  labels: { severity: P1, team: sre }
  annotations:
    summary: "Last successful snapshot is more than 36 hours old"

- alert: ESCertExpiringSoon
  expr: elasticsearch_ssl_certificate_expiry_seconds < 30 * 86400
  for: 1h
  labels: { severity: P3, team: sre }

- alert: ESCertExpiringCritical
  expr: elasticsearch_ssl_certificate_expiry_seconds < 7 * 86400
  for: 5m
  labels: { severity: P1, team: sre }

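6.6 Client Side (application SDK instrumentation)

- alert: AppESErrorRateHigh
  # Aggregate both sides by service (the label used for team routing below).
  # Without it, the status label restricts the denominator to the same 5xx
  # series and the ratio is always 1.
  expr: sum by (service) (rate(es_request_total{status=~"5.."}[5m]))
        / sum by (service) (rate(es_request_total[5m])) > 0.01
  for: 5m
  labels: { severity: P2, team: "{{ $labels.service }}" }

- alert: AppESLatencyP99High
  # Sum the buckets per service before taking the quantile.
  expr: histogram_quantile(0.99,
          sum by (service, le) (
            rate(es_request_duration_seconds_bucket{operation="search"}[5m]))) > 1
  for: 10m
  labels: { severity: P2, team: "{{ $labels.service }}" }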

- alert: AppBulkQueueBacklog
  expr: es_bulk_queue_depth > 5000
  for: 10m
  labels: { severity: P2, team: "{{ $labels.service }}" }

- alert: AppDegradationActive
  expr: es_degradation_active == 1
  for: 5m
  labels: { severity: P1, team: "{{ $labels.service }}" }
  annotations:
    summary: "{{ $labels.service }} degradation switch has been on for 5 minutes"

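7. Dashboard Planning

7.1 Recommended Dashboards

Dashboard	Audience	Content
Cluster overview	SRE / business	status, node count, shard count, QPS, latency trends
Node details	SRE	per-node JVM, CPU, disk, thread pools
Index details	SRE / business	per-index indexing/search/segments, Top N indices
Slow queries	SRE / business	Top 20 slow queries, grouped by index and by user
Client view	Business services	application-side QPS, error rate, P99, circuit breakers
Capacity trends	SRE / architecture	30/60/90-day capacity forecast
Alerts	SRE	currently active alerts, alert distribution over the last 24h
Changes	SRE	settings changes, ILM transitions, reindex tasks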

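7.2 Required Charts on the Cluster Overview

Status light: a large green/yellow/red block
Node list: name, roles, heap%, disk%, uptime
Shard state: counts of active / unassigned / relocating / initializing
Dual-axis QPS: write QPS (blue), search QPS (orange)
Latency P50/P95/P99: one chart each for write and search
Top 5 indices: by write QPS and by store size
Pending Tasks: highlighted when > 0
Thread Pool Rejected: two lines, write and search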

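7.3 Design Principles

The first screen must let a viewer decide within 30 seconds whether the cluster is healthy
The default time range is 1 hour, switchable to 6h / 24h / 7d
Key metrics carry a "historical P99 reference line"
Alert state is linked to the charts (a chart is highlighted while its alert fires)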

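7.4 Reusing Community Resources

These can be imported directly and then adjusted to the thresholds in this handbook:

Grafana Dashboard ID 2322: Elasticsearch Stats
Grafana Dashboard ID 14191: Elasticsearch Exporter
Grafana Dashboard ID 266: Elasticsearch
Kibana Stack Monitoring (works out of the box, but must run alongside the Prometheus pipeline)

When adopting one, fork it and keep it under version control in infra/monitoring/dashboards/elasticsearch/. Editing and saving directly in the Grafana UI is prohibited because it leaves no traceable version history.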

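8. Log Collection

8.1 Logs That Must Be Collected

Log	Path	Purpose
Main log	`logs/<cluster>.log`	startup, errors, warnings
GC log	`logs/gc.log`	JVM incident postmortems
Search slow log	`logs/<cluster>_index_search_slowlog.json`	slow query governance
Indexing slow log	`logs/<cluster>_index_indexing_slowlog.json`	indexing bottleneck analysis
Audit log	`logs/<cluster>_audit.json`	security and compliance
Deprecation log	`logs/<cluster>_deprecation.json`	pre-upgrade checks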

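8.2 Logging Conventions

The main log uses JSON format (the ES 8.x default)
Slow log thresholds are configured uniformly through an Index Template
Audit logs are written to a separate location and shipped independently to the SIEM
Log retention: 7 days locally, 90 days remotely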

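8.3 Slow Query Governance Process

1. On the "Slow queries" dashboard, sort by took_millis and review the Top 20
2. Use the X-Opaque-Id to correlate with application tracing and identify the caller
3. File an issue in the owning team's code repository
4. After the owning team ships a fix, verify that the slow query no longer appears
5. Track the slow query count trend in the weekly report

Prohibited: "masking" slow queries on the ES side (for example by raising the slowlog threshold). The root cause must be fixed.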

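9. Alert Governance

9.1 Required Alert Metadata

Every alert must carry:

Metadata	Example
`severity`	P0/P1/P2/P3/P4
`team`	sre / order-service / search-platform
`runbook_url`	points to the matching section of the "Incident Response Runbook"
`summary`	a one-sentence description
`description`	affected scope, likely causes

An alert without a runbook must not go live: an alert the on-call engineer does not know how to handle is a useless alert.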
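
As an illustration of the metadata requirement, here is the ESPendingTasks rule from section 6.1 written out with every required field. The runbook anchor and the wording of `summary` and `description` are hypothetical examples, and the listed causes are typical possibilities rather than a diagnosis. Note that the rules in section 6 use the annotation key `runbook`; whichever key is chosen should be used consistently so that notification templates can rely on it.

```yaml
- alert: ESPendingTasks
  expr: elasticsearch_cluster_health_number_of_pending_tasks > 50
  for: 5m
  labels:
    severity: P2
    team: sre
  annotations:
    runbook_url: "https://wiki/es-runbook#pending-tasks"   # hypothetical anchor
    summary: "Cluster {{ $labels.cluster }} has {{ $value }} pending master tasks"
    description: >-
      The master has had more than 50 pending tasks for 5 minutes.
      Affected scope: cluster state updates such as index creation and
      mapping changes may be delayed. Possible causes: a burst of index
      or mapping changes, or an overloaded master node.
```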

9.2 Silences and Inhibition

# Example Alertmanager inhibition rules
inhibit_rules:
  # While the cluster is red, do not alert separately on unassigned shards
  - source_matchers: [ alertname="ESClusterRed" ]
    target_matchers: [ alertname="ESUnassignedShards" ]
    equal: [ cluster ]

  # While a node is down, do not alert on that node's metric anomalies
  - source_matchers: [ alertname="ESNodeDown" ]
    target_matchers: [ alertname=~"ESHeap.*|ESDisk.*|ESBulkRejected" ]
    equal: [ instance ]

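Discipline:

Every manual silence must have an expiry time (24h at most)
Inhibition rules are kept together in a single file
The silence list is audited monthly, and expired or stale entries are removed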

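9.3 Alert Noise KPIs

Metric	Target
Total alerts per week	steadily decreasing or stable
False positive rate	< 5%
Mean time to acknowledge (MTTA)	P0 < 5min, P1 < 15min
Mean time to recovery (MTTR)	P0 < 30min, P1 < 2h
Escalation rate	< 10%

The monthly SRE meeting reviews these KPIs and adjusts rules and thresholds.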

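10. On-Call Process

10.1 Scheduling

Two-person rotation: a primary and a secondary, with the secondary backing up the primary
Primary and secondary rotate weekly
The schedule is public and fixed at least 2 weeks in advance
Holiday on-call compensation follows company policy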

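10.2 Accepting an Alert

1. Alert fires → pushed to the IM channel and the primary's phone
2. The primary acks within 5min (P0) / 15min (P1)
3. If the primary cannot ack (on leave, network problems), the secondary takes over automatically
4. If neither primary nor secondary acks, escalation is triggered
5. Right after the ack, post to the IM channel ("I'm on it")
6. Post key milestones in real time while handling the incident
7. Publish the postmortem document within 24h of recovery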

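10.3 Three-Step Handling

1. Stop the bleeding
   - Restore service first; do not insist on finding the root cause in the moment
   - Examples: add replicas, add nodes, degrade, roll back

2. Isolate
   - Keep the failure from spreading
   - Examples: rate limiting, circuit breaking, traffic switching

3. Postmortem
   - Within 24h of recovery
   - Blameless culture
   - Required output: timeline, root cause, action items, Owner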

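10.4 What Not to Do

Running an L4 operation at 3 a.m. without escalating and asking for approval first
Silencing alerts during an incident (handle them, do not mask them)
Waiting more than 5 minutes during an incident "to see if it fixes itself"
Pinning responsibility on an individual in the postmortem
Leaving postmortem action items without an Owner, a deadline, or tracking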

Appendix A: Metric Quick Reference

Cluster

Metric	Source
`elasticsearch_cluster_health_status`	`_cluster/health`
`elasticsearch_cluster_health_active_primary_shards`	same as above
`elasticsearch_cluster_health_active_shards`	same as above
`elasticsearch_cluster_health_unassigned_shards`	same as above
`elasticsearch_cluster_health_relocating_shards`	same as above
`elasticsearch_cluster_health_initializing_shards`	same as above
`elasticsearch_cluster_health_number_of_pending_tasks`	same as above
`elasticsearch_cluster_health_number_of_data_nodes`	same as above

Node JVM

Metric	Source
`elasticsearch_jvm_memory_used_bytes{area="heap"}`	`_nodes/stats/jvm`
`elasticsearch_jvm_memory_max_bytes{area="heap"}`	same as above
`elasticsearch_jvm_gc_collection_seconds_sum`	same as above
`elasticsearch_jvm_gc_collection_seconds_count`	same as above
`elasticsearch_jvm_threads_current`	same as above

Node Thread Pools

Metric	Labels
`elasticsearch_thread_pool_active_count`	type
`elasticsearch_thread_pool_queue_count`	type
`elasticsearch_thread_pool_rejected_count`	type
`elasticsearch_thread_pool_completed_count`	type

Node Disk

Metric	Source
`elasticsearch_filesystem_data_size_bytes`	`_nodes/stats/fs`
`elasticsearch_filesystem_data_free_bytes`	`_nodes/stats/fs`
`elasticsearch_filesystem_data_used_percent`	computed field
`elasticsearch_filesystem_io_stats_total_operations`	`_nodes/stats/fs`

Index Reads and Writes

Metric	Labels
`elasticsearch_indices_indexing_index_total`	index
`elasticsearch_indices_indexing_index_time_seconds_total`	index
`elasticsearch_indices_indexing_index_failed_total`	index
`elasticsearch_indices_search_query_total`	index
`elasticsearch_indices_search_query_time_seconds_total`	index
`elasticsearch_indices_search_fetch_total`	index
`elasticsearch_indices_refresh_total`	index
`elasticsearch_indices_refresh_time_seconds_total`	index
`elasticsearch_indices_merges_current`	index
`elasticsearch_indices_segments_count`	index
`elasticsearch_indices_store_size_bytes`	index

Breaker

Metric	Labels
`elasticsearch_breakers_estimated_size_bytes`	breaker
`elasticsearch_breakers_limit_size_bytes`	breaker
`elasticsearch_breakers_tripped_total`	breaker

Appendix B: Alert Response Card (pocket edition for on-call)

Stick this page on your desk or save it to your phone. When an alert fires, check this card first to determine the P level, then open the Runbook.

┌────────────────────────────────────────────────────────────────────┐
│ Ack deadlines: P0=5min  P1=15min  P2=30min  P3=4h                  │
├────────────────────────────────────────────────────────────────────┤
│ Cluster Red ──────────────────────── P0 → Runbook §1               │
│ Data loss risk ───────────────────── P0 → notify CTO now           │
│ Cluster Yellow > 30min ───────────── P1 → Runbook §2               │
│ Master node down ─────────────────── P1 → Runbook §3               │
│ Data node down (causes yellow) ───── P2 → Runbook §4               │
│ Disk > 90% ───────────────────────── P1 → Runbook §5               │
│ JVM heap > 85% for 10min ─────────── P1 → Runbook §6               │
│ Write/search rejected ────────────── P2 → Runbook §7               │
│ Slow query > 5s ──────────────────── P3 → Runbook §8               │
│ Backup failed ────────────────────── P2 → Runbook §9               │
│ Backup missing ───────────────────── P1 → Runbook §9               │
│ Certificate < 7 days ─────────────── P1 → Runbook §10              │
├────────────────────────────────────────────────────────────────────┤
│ Accept → sync → stop bleeding → isolate → root cause → postmortem  │
│ Cannot handle it → escalate, do not tough it out                   │
│ Any L3+ operation → requires two-person confirmation               │
└────────────────────────────────────────────────────────────────────┘

Document Version

Version	Date	Change	Author
v1.0	2026-05-11	Initial release	-