My colleague gets up in the middle of the night to restart the service, while I roll over and go back to sleep

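After reading this article, you will know how to set up monitoring and alerting in about 5 minutes, so that you find out the moment your service goes down and can even have it recover automatically.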

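The 3 a.m. call from hell

At 3 a.m. on a Saturday, Xiaohe was woken up by his phone.

"Hello?"

"Xiaohe! Your AI image generation service is down! User complaints are flooding in!"

Xiaohe jolted awake, opened his laptop, and SSHed into the server: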

$ curl http://localhost:8000/health
curl: (7) Failed to connect to localhost port 8000: Connection refused

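The service really was down.

He checked the logs:

2025-12-06 01:23:45 | ERROR | CUDA out of memory
2025-12-06 01:23:45 | INFO | Service shut down

The service had died shortly after 1 a.m., but he only found out at 3 a.m., when users complained.

For roughly two hours, the service had been down without anyone on the team knowing.

After restarting the service, Xiaohe lay in bed, unable to fall asleep.

"If only I got notified the moment the service went down..."

Three layers of monitoring

The next day, Xiaohe started looking into monitoring options.

He found that monitoring comes in three layers: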

flowchart TB
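    L1["Business metrics<br/>generation success rate, response time<br/><i>most valuable, but most complex</i>"]
    L2["Application metrics<br/>CPU, memory, GPU memory<br/><i>key to locating problems</i>"]
    L3["Liveness checks<br/>is the service running?<br/><i>most basic, a must-have</i>"]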

    L1 --- L2 --- L3

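Xiaohe decided to start with the most basic layer: liveness checks.

"At the very least, I need to know right away when the service goes down."

Option 1: UptimeRobot (done in 5 minutes)

After some searching, Xiaohe found that the simplest option was UptimeRobot, a hosted monitoring service with a free plan.

Step 1: Expose a health check endpoint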

# app/api/endpoints/health.py
from fastapi import APIRouter
from datetime import datetime

router = APIRouter()

@router.get("/health")
async def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "timestamp": datetime.now().isoformat()
    }

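Step 2: Sign up for UptimeRobot

1. Open uptimerobot.com and create an account
2. Click "Add New Monitor"
3. Fill in the configuration:

Monitor Type: HTTP(s)
Friendly Name: AI Image Service
URL: https://your-domain.com/health
Monitoring Interval: 5 minutes

4. Choose how to be alerted: email, Telegram, and Slack are all supported

Step 3: Wait for notifications

Once configured, UptimeRobot requests /health every 5 minutes.

If the checks keep failing, you will receive an email like this: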

🔴 AI Image Service is DOWN

URL: https://your-domain.com/health
Reason: Connection Timeout
Date/Time: 2024-01-20 01:25:00 UTC

Once the service recovers, you will also receive:

🟢 AI Image Service is UP

URL: https://your-domain.com/health
Was down for: 2 hours 5 minutes
Date/Time: 2024-01-20 03:30:00 UTC
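
To make the DOWN and UP notifications less of a black box, here is a minimal sketch of how a stream of probe results can be turned into those two events. It is not UptimeRobot's actual implementation; the `DownDetector` class and its `failures_to_alert` threshold are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownDetector:
    """Turns a stream of probe results into DOWN / UP notifications."""

    failures_to_alert: int = 2  # assumed threshold, not a documented UptimeRobot setting
    consecutive_failures: int = 0
    is_down: bool = False

    def record(self, probe_ok: bool) -> Optional[str]:
        """Feed one probe result; return "DOWN", "UP", or None when nothing changed."""
        if probe_ok:
            self.consecutive_failures = 0
            if self.is_down:
                self.is_down = False
                return "UP"
            return None

        self.consecutive_failures += 1
        if not self.is_down and self.consecutive_failures >= self.failures_to_alert:
            self.is_down = True
            return "DOWN"
        return None


detector = DownDetector()
events = [detector.record(ok) for ok in [True, False, False, False, True, True]]
print(events)  # [None, None, 'DOWN', None, 'UP', None]
```

Requiring more than one failed probe before alerting trades a slightly later notification for fewer false alarms caused by a single network hiccup.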

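With everything set up, Xiaohe ran a test: he stopped the service by hand, and about 5 minutes later the alert email arrived, as expected.

"That was way too easy!"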

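What should a health check actually check?

After a few days, Xiaohe noticed a problem:

Sometimes /health returned 200 while the service was actually "half dead": the API still responded, but the AI model had crashed.

He decided to upgrade the health check: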

# app/api/endpoints/health.py
import torch
from fastapi import APIRouter, Response
from app.services.model_manager import model_manager

router = APIRouter()

@router.get("/health")
async def health_check(response: Response):
    """Deep health check: API, model, and GPU."""

    checks = {
        "api": True,
        "model": False,
        "gpu": False
    }

    # Check that the model is loaded
    try:
        checks["model"] = model_manager.pipe is not None
    except Exception:
        pass

    # Check that the GPU is usable
    try:
        if torch.cuda.is_available():
            # Allocate a tiny tensor on the GPU to confirm it still works
            test_tensor = torch.zeros(1).cuda()
            del test_tensor
            checks["gpu"] = True
    except Exception:
        pass

    # Overall verdict: healthy only if every check passed
    all_healthy = all(checks.values())

    if not all_healthy:
        response.status_code = 503  # Service Unavailable

    return {
        "status": "healthy" if all_healthy else "unhealthy",
        "checks": checks,
        "gpu_memory": get_gpu_memory() if checks["gpu"] else None
    }


def get_gpu_memory():
    """Return GPU memory statistics in GiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None

    return {
        "allocated_gb": round(torch.cuda.memory_allocated() / 1024**3, 2),
        "reserved_gb": round(torch.cuda.memory_reserved() / 1024**3, 2),
        "total_gb": round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 2)
    }

Now the health check verifies what actually matters:

{
    "status": "healthy",
    "checks": {
        "api": true,
        "model": true,
        "gpu": true
    },
    "gpu_memory": {
        "allocated_gb": 9.87,
        "reserved_gb": 10.12,
        "total_gb": 24.0
    }
}

If the model is not loaded or the GPU has failed, it returns 503:

{
    "status": "unhealthy",
    "checks": {
        "api": true,
        "model": false,
        "gpu": true
    },
    "gpu_memory": {...}
}

UptimeRobot treats an error status such as 503 as down and raises an alert.
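
The decision logic of the deep health check can be pulled out into a pure function, which makes it easy to unit test without a GPU or a running server. The sketch below is one possible way to do that; `summarize_checks` and the extra `failed` field are hypothetical additions, not part of the endpoint shown above.

```python
from typing import Dict, Tuple


def summarize_checks(checks: Dict[str, bool]) -> Tuple[int, dict]:
    """Map individual check results to an HTTP status code and a response body."""
    healthy = all(checks.values())
    status_code = 200 if healthy else 503  # 503 = Service Unavailable
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": checks,
        # Hypothetical extra field: names of the failed checks, handy in alert emails
        "failed": sorted(name for name, ok in checks.items() if not ok),
    }
    return status_code, body


code, body = summarize_checks({"api": True, "model": False, "gpu": True})
print(code, body["status"], body["failed"])  # 503 unhealthy ['model']
```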

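Option 2: Self-hosted Prometheus + Grafana

UptimeRobot was enough, but Xiaohe wanted more:

He wanted to see historical trends
He wanted to know the response time of each endpoint
He wanted to monitor GPU memory usage

That calls for a self-hosted monitoring stack.

Architecture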

flowchart LR
    subgraph App["AI service"]
        FastAPI["FastAPI"]
        Metrics["/metrics endpoint"]
    end

    subgraph Monitor["Monitoring stack"]
        Prometheus["Prometheus<br/>metrics collection"]
        Grafana["Grafana<br/>visualization"]
        AlertManager["AlertManager<br/>alerting"]
    end

    FastAPI --> Metrics
    Prometheus -->|scrapes every 15s| Metrics
    Prometheus --> Grafana
    Prometheus --> AlertManager
    AlertManager -->|email/DingTalk/Feishu| Notify["📱 Notification"]

Step 1: Expose Prometheus metrics

Install the dependencies:

pip install prometheus-client prometheus-fastapi-instrumentator

Add the metrics endpoint:

# app/main.py
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
from prometheus_client import Gauge, Counter, Histogram

# Shown here so the snippet is complete; reuse your existing app object if you have one
app = FastAPI()

# Custom metrics
GPU_MEMORY_GAUGE = Gauge(
    'gpu_memory_allocated_bytes',
    'GPU memory allocated in bytes'
)

GENERATION_COUNTER = Counter(
    'image_generation_total',
    'Total image generations',
    ['status']  # success / failed
)

GENERATION_LATENCY = Histogram(
    'image_generation_latency_seconds',
    'Image generation latency in seconds',
    buckets=[1, 2, 5, 10, 30, 60, 120]
)

# Instrument every route and expose the /metrics endpoint
Instrumentator().instrument(app).expose(app)

Record the metrics in the business code:

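# app/api/endpoints/generate.py
import time

import torch

# GENERATION_COUNTER, GENERATION_LATENCY and GPU_MEMORY_GAUGE are the metrics
# defined earlier; import them from the module where you declared them.

@router.post("/shot-image")
async def generate_shot_image(request: GenerateShotImageRequest):
    # perf_counter is monotonic, so the measured duration cannot go negative
    start_time = time.perf_counter()

    try:
        result = await adapter.generate(...)

        # Count the success
        GENERATION_COUNTER.labels(status="success").inc()

        return result

    except Exception:
        # Count the failure, then let the error propagate
        GENERATION_COUNTER.labels(status="failed").inc()
        raise

    finally:
        # Record the latency for successes and failures alike
        latency = time.perf_counter() - start_time
        GENERATION_LATENCY.observe(latency)

        # Refresh the GPU memory gauge
        if torch.cuda.is_available():
            GPU_MEMORY_GAUGE.set(torch.cuda.memory_allocated())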

Visit /metrics to see the result:

# HELP gpu_memory_allocated_bytes GPU memory allocated in bytes
# TYPE gpu_memory_allocated_bytes gauge
gpu_memory_allocated_bytes 1.0598932e+10

# HELP image_generation_total Total image generations
# TYPE image_generation_total counter
image_generation_total{status="success"} 156.0
image_generation_total{status="failed"} 3.0

# HELP image_generation_latency_seconds Image generation latency in seconds
# TYPE image_generation_latency_seconds histogram
image_generation_latency_seconds_bucket{le="1.0"} 0.0
image_generation_latency_seconds_bucket{le="2.0"} 12.0
image_generation_latency_seconds_bucket{le="5.0"} 89.0
image_generation_latency_seconds_bucket{le="10.0"} 142.0
...
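
As a quick sanity check on the sample output above, the two counter lines are enough to compute an overall failure ratio by hand. The sketch below parses just those lines with a regular expression; `failure_ratio` is a hypothetical helper, and it computes the ratio over the whole process lifetime rather than over a time window as a Prometheus query would.

```python
import re

# The two counter samples from the /metrics output above
SAMPLE = """\
image_generation_total{status="success"} 156.0
image_generation_total{status="failed"} 3.0
"""

COUNTER_LINE = re.compile(
    r'^image_generation_total\{status="(\w+)"\}\s+([0-9.eE+-]+)$'
)


def failure_ratio(metrics_text: str) -> float:
    """Return failed / total for image_generation_total, or 0.0 with no samples."""
    counts = {}
    for line in metrics_text.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = float(match.group(2))
    total = sum(counts.values())
    return counts.get("failed", 0.0) / total if total else 0.0


print(f"{failure_ratio(SAMPLE):.2%}")  # 1.89%
```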

Step 2: Deploy Prometheus + Grafana

Deploy everything with a single Docker Compose file:

# docker-compose.monitoring.yml
version: '3.8'

services:
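  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # alert_rules.yml is referenced by rule_files, so it must be mounted too
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    extra_hosts:
      # On Linux, lets host.docker.internal resolve to the host (Docker 20.10+)
      - "host.docker.internal:host-gateway"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'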

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_password

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus_data:
  grafana_data:

Prometheus configuration:

# prometheus.yml
global:
  scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'ai-service'
    static_configs:
      - targets: ['host.docker.internal:8000']  # address of your AI service

Step 3: Configure alert rules

# alert_rules.yml
groups:
  - name: ai-service-alerts
    rules:
      # Service is down
      - alert: ServiceDown
        expr: up{job="ai-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AI service is down"
          description: "The service has been down for more than 1 minute"

      # GPU memory above 90% (24 GiB card in this example; adjust for your GPU)
      - alert: GPUMemoryHigh
        expr: gpu_memory_allocated_bytes / (24 * 1024 * 1024 * 1024) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory usage is too high"
          description: "GPU memory usage has been above 90% for 5 minutes"

      # Generation failure rate above 10%
      # sum() drops the status label; without it the division only pairs
      # failed with failed and the ratio is always 1
      - alert: HighFailureRate
        expr: |
          sum(rate(image_generation_total{status="failed"}[5m]))
          / sum(rate(image_generation_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Image generation failure rate is too high"
          description: "Failure rate over the last 5 minutes is above 10%"

      # Latency too high
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(image_generation_latency_seconds_bucket[5m]))) > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Image generation latency is too high"
          description: "P95 latency is above 60 seconds"
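
The P95 in a latency rule like this is not measured directly; it is estimated from the cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that idea for illustration, not Prometheus's own code. The first four bucket counts come from the /metrics sample earlier in the article; the counts for `le=30` and above are made up so that the example is complete.

```python
import math
from typing import List, Tuple


def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return buckets[index - 1][0] if index > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


buckets = [
    (1, 0), (2, 12), (5, 89), (10, 142),            # from the sample output
    (30, 155), (60, 158), (120, 159), (math.inf, 159),  # hypothetical values
]
print(round(bucket_quantile(0.95, buckets), 1))  # 23.9
```

Because of the interpolation, the estimate can only be as precise as the bucket boundaries: with these buckets, any P95 between 10 and 30 seconds is a guess within that range, which is worth remembering when choosing bucket edges near the alert threshold.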

Step 4: Configure alert notifications

AlertManager supports many notification channels. Xiaohe set up email and Feishu:

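# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'your_password'

route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Route critical alerts to on-call and repeat them every 15 minutes
    # (matchers replaces the deprecated match syntax)
    - matchers:
        - severity="critical"
      receiver: 'critical-alerts'
      repeat_interval: 15m

receivers:
  - name: 'default'
    email_configs:
      - to: 'dev-team@example.com'

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'
    webhook_configs:
      # Feishu bot. Alertmanager posts its own JSON format, which a Feishu
      # custom bot does not parse directly, so in practice this URL usually
      # points at a small relay that reformats the payload.
      - url: 'https://open.feishu.cn/open-apis/bot/v2/hook/xxxxx'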
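
One caveat with the Feishu webhook: Alertmanager sends its own JSON structure (a top-level `alerts` list with `labels` and `annotations`), while a Feishu custom bot expects a message object with `msg_type` and `content`. A small relay in between usually does the conversion. The function below is a minimal sketch of that conversion step only, assuming the text message format; `alertmanager_to_feishu` is a hypothetical name, and the HTTP server and the POST to Feishu are left out.

```python
import json
from typing import Any, Dict


def alertmanager_to_feishu(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an Alertmanager webhook payload into a Feishu text message."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "?")
        severity = labels.get("severity", "n/a")
        summary = annotations.get("summary", "")
        lines.append(f"[{status}] {name} ({severity}): {summary}")

    text = "\n".join(lines) or "Alertmanager notification without alerts"
    return {"msg_type": "text", "content": {"text": text}}


# Trimmed-down example of what Alertmanager would POST to the relay
example_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "ServiceDown", "severity": "critical"},
            "annotations": {"summary": "AI service is down"},
        }
    ],
}
message = alertmanager_to_feishu(example_payload)
print(json.dumps(message, ensure_ascii=False))
```

With a relay like this in place, the `url` under `webhook_configs` would point at the relay, and the relay would forward the converted message to the Feishu hook URL.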

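Step 5: Grafana dashboards

Once the stack is up, open http://localhost:3000, add Prometheus as a data source (from inside the Compose network its URL is http://prometheus:9090), and create a dashboard.

Looking at the handsome charts, Xiaohe felt great.

"Finally, some visualization!"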

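Comparing the options: which one to choose?

Option	Complexity	Cost	Features	Best for
UptimeRobot	Low	Free	Liveness checks + alerts	Personal projects, MVPs
Prometheus + Grafana	High	Server resources	Full monitoring + visualization	Production
Cloud provider offering	Medium	Pay-as-you-go	Works out of the box	Ample budget

Xiaohe's advice:

Personal project → UptimeRobot, done in 5 minutes
Small team → UptimeRobot + a simple Prometheus setup
Production → Prometheus + Grafana + AlertManager
Deep pockets → use your cloud provider's monitoring service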

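Monitoring checklist

Xiaohe put together a checklist:

Check	Method	Alert threshold
Service liveness	HTTP health check	2 consecutive failures
GPU memory	nvidia-smi / torch	> 90% for 5 minutes
Response time	Prometheus Histogram	P95 > 60s
Failure rate	Counter ratio	> 10% for 5 minutes
Disk space	node_exporter	> 85%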

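One more thing: automatic restarts

Monitoring and alerting solve the "knowing it is down" problem, but one question remains: when the alert arrives at 3 a.m., do you really have to get up and restart the service?

Xiaohe added an automatic restart mechanism: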

# Manage the service with systemd
# /etc/systemd/system/ai-service.service

[Unit]
Description=AI Image Generation Service
After=network.target

[Service]
Type=simple
User=deploy
WorkingDirectory=/opt/ai-service
ExecStart=/opt/ai-service/venv/bin/python run.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

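Key settings:

Restart=always: restart automatically no matter why the process exited (except when you stop it yourself with systemctl stop)
RestartSec=10: wait 10 seconds before restarting

This way the service restarts itself after a crash, and the alert simply tells you that a failure happened.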

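Xiaohe's reflections

That 3 a.m. phone call
taught me a lesson:

Services will go down.
That is Murphy's law.

The question is not "will it go down"
but "will you know when it does".

UptimeRobot takes five minutes,
Prometheus can be running within a day,
there is no excuse to skip monitoring.

Alerts are the insurance,
automatic restarts are the safety net,
and you need both to sleep well.

Now I can finally say:
"The service is down? Let me check the alerts...
Oh, it has already recovered by itself."

Then roll over and go back to sleep.

This is how life should be.

Looking at the green curves on Grafana, Xiaohe finally got a good night's sleep.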

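End of the series

Congratulations on finishing all 6 articles in the "AI Infrastructure Deployment" series!

Here is a recap of the road we traveled:

① Environment setup → getting CUDA and PyTorch to match on the first try
② Model deployment → high-performance inference with vLLM
③ Application platform → building quickly with Dify
④ Automation → n8n workflows
⑤ GPU internals → an introduction to CUDA Tile
⑥ Monitoring and alerting → knowing the moment a service goes down

From a bare machine to a stably running AI service: that is the complete chain.

With this infrastructure in place, you can focus on building your product instead of worrying that "the service is down again".

Bookmark the AI Infrastructure Deployment collection (with a detailed reading guide) and come back whenever you run into a problem.

I share hands-on AI application development experience on the WeChat official account 『 DevJar 』.
Questions are welcome in the comments; I will do my best to reply.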