
模块五-AI系统架构设计 | 第36讲:LLM 应用的可观测性 - Prompt 追踪、Token 用量监控与质量评估

本讲目标:理解 LLM 应用为何必须「可观测优先」:输出非确定性、成本随 token 线性增长、质量波动难以用传统单元测试覆盖。你将掌握三大支柱(Tracing / Metrics / Logging)在 CodeSentinel 中的落地方式:Trace ID 贯通、LangChain callback 自动采集、Prometheus 风格指标、PII 脱敏与审计日志;并构建 QualityEvaluator 框架,把「误报率、召回 proxy、人工反馈」纳入持续运营闭环。文末提供可运行的 FastAPI 中间件 + 指标采集器 + 质量评估示例代码。


开场:没有观测的 LLM,就像没有仪表盘的飞机

传统服务的故障模式多是确定的:空指针、超时、5xx。LLM 应用的故障模式更像概率分布漂移:同样的 prompt,模型升级后风格突变;同样的代码审核,温度略高就可能从「谨慎建议」滑向「危言耸听」;账单上看似平稳的 QPS,背后可能是某次上下文膨胀导致 token 暴增。若你没有 prompt 级追踪、token 级计量与质量趋势,团队只能在工单里被动解释「为什么昨天还好好的」。

CodeSentinel 作为 AI 驱动的架构治理平台,观测不仅是运维需求,更是合规与产品需求。你需要回答:某条 PR 评论依据了哪些上下文片段?模型版本是什么?是否经过人工确认?误报是否上升?哪类规则最消耗 token?这些问题无法靠「打印日志」临时拼凑,而要从架构层植入:trace 贯穿、指标聚合、日志结构化。

本讲与第 35 讲形成互补:第 35 讲让推理服务「跑得动、控得住成本」;本讲让推理服务「看得见、评得准、改得对」。我们会先给出观测体系总览与追踪时序,再深入三大支柱的设计要点,最后给出完整 Python 实现:异步中间件记录请求元数据、Callback 捕获 LLM 事件、线程安全的 MetricsCollector、以及可插拔的 QualityEvaluator(含简单黄金集评测思路)。下面从全局架构图开始。


全局视角:CodeSentinel LLM 可观测性架构(Mermaid)

flowchart TB
  subgraph Ingress["入口层"]
    MW["LLMTracerMiddleware\n(trace_id 注入)"]
    AUTH["鉴权/租户"]
  end

  subgraph Runtime["运行时"]
    API["FastAPI 路由\n/reviews ..."]
    PIPE["Review Pipeline"]
    LC["LangChain\nCallbacks"]
    LLM["LLM Provider"]
  end

  subgraph Pillars["三大支柱"]
    T["Tracing Store\n(span 事件)"]
    M["MetricsCollector\n(counter/histogram)"]
    L["Structured Logs\nJSON + PII redact"]
  end

  subgraph Quality["质量闭环"]
    Q["QualityEvaluator"]
    HF["Human Feedback\n(thumbs/labels)"]
    AL["Alerts\nSLO 异常)"]
  end

  AUTH --> MW --> API --> PIPE --> LC --> LLM
  LC --> T
  LC --> M
  API --> L
  M --> AL
  Q --> M
  HF --> Q

Prompt 追踪:从进入到 LLM 调用的全链路(Mermaid)

sequenceDiagram
  participant U as 用户/CI
  participant A as FastAPI
  participant M as Middleware
  participant P as Pipeline
  participant C as LLMCallbackTracer
  participant L as LLM

  U->>A: POST /reviews
  A->>M: 生成/透传 trace_id
  M->>P: ctx(trace_id, tenant)
  P->>C: 注册 callback
  C->>L: chat/stream
  L-->>C: prompt/response chunks
  C-->>P: span: llm.call
  P-->>A: ReviewReport
  A-->>U: 200 + metrics headers(可选)

  Note over C,L: 日志中仅保存 hash 后的 prompt<br/>全文进冷存储需合规审批

运营仪表盘草图:指标维度与告警(Mermaid)

flowchart LR
  subgraph Panels["Grafana 面板(示例)"]
    P1["审核 QPS / p95 延迟"]
    P2["Token:input/output 分层"]
    P3["成本:项目/模型维度"]
    P4["错误率:429/5xx/超时"]
    P5["质量:误报率/人工纠正率"]
  end

  subgraph Alerts["告警规则"]
    A1["p95 延迟 > 20s 持续 10m"]
    A2["429 比例 > 5% 持续 5m"]
    A3["单日 token 突增 3σ"]
    A4["误报率周环比 +10pt"]
  end

  P1 --> A1
  P4 --> A2
  P3 --> A3
  P5 --> A4

核心原理:为什么 LLM 可观测性与传统 APM 不同

1. 非确定性:你需要「分布」而不是「单次对错」

单条推理结果无法代表系统健康。观测应强调:同样输入下的方差(若可复现)、线上真实分布(延迟直方图)、质量 proxy(规则命中率与人工反馈)。Tracing 要记录采样参数(temperature、top_p、模型版本),否则线上漂移不可解释。
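
下面给出一个 span 属性构造的最小示意(字段命名为本文约定、并非任何 SDK 的固定 schema,仅演示「采样参数必须入 trace」):

import hashlib

def llm_start_attrs(prompt: str, model: str, temperature: float, top_p: float) -> dict:
    """构造 llm.start span 的建议属性:模型版本 + 采样参数 + prompt 指纹。"""
    return {
        "model": model,                # 精确到可复现的版本号
        "temperature": temperature,    # 缺失采样参数时,线上漂移无法归因
        "top_p": top_p,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_len": len(prompt),     # 只存指纹与长度,不落全文
    }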

2. 成本敏感:token 是一等公民

Metrics 必须拆分 prompt_tokens 与 completion_tokens,并按 project_id、model、route 打点,否则无法做财务归因。Tracing 中建议增加 estimated_cost_usd,并与真实 usage 对齐校验,及时发现估算偏差。
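
一个成本估算与对齐校验的教学示意如下(单价表为占位假设值,务必按供应商价目表配置):

PRICE_PER_1K_USD = {"gpt-demo": {"prompt": 0.01, "completion": 0.03}}  # 假设单价,仅示意

def estimated_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICE_PER_1K_USD.get(model, {"prompt": 0.0, "completion": 0.0})
    return prompt_tokens / 1000 * p["prompt"] + completion_tokens / 1000 * p["completion"]

def price_table_stale(estimated: float, billed: float, tol: float = 0.2) -> bool:
    """估算与账单偏差超过 tol(默认 20%)即认为价目表过期,应触发校准告警。"""
    if billed <= 0:
        return estimated > 0
    return abs(estimated - billed) / billed > tol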

3. 合规:日志里的 prompt/response 是高风险资产

代码、密钥、个人信息可能出现在 diff 与提示词中。Logging 策略应包含:字段级脱敏(正则+熵检测)、可选的全文加密冷存、最小可见原则(默认只存 hash 与长度)。Tracing store 若接第三方 SaaS,需要数据出境评估。

4. 质量评估的三层模型

  1. 离线黄金集:小规模标注「应命中问题」集合,计算近似召回;适合回归新模型(评测思路见本节末尾示意)。
  2. 线上 proxy:统计 finding_severity 分布突变、重复评论、同一文件高频相同规则。
  3. 人工反馈:PR 评论下 👍/👎 或「误报」标签,沉淀为训练数据与阈值调参依据。
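
针对第 1 层,给出一个黄金集近似召回的最小示意(以「文件 + 规则 ID」作匹配键是本文假设的从简口径,实际可按行号或语义相似度放宽):

from typing import Dict, List, Tuple

def golden_set_recall(findings: List[Dict], golden: List[Dict]) -> float:
    """近似召回:黄金集中「应命中问题」被模型实际报出的比例。"""
    def key(item: Dict) -> Tuple[str, str]:
        return (item["file"], item["rule_id"])  # 从简匹配口径:同文件且同规则即命中
    reported = {key(f) for f in findings}
    return sum(1 for g in golden if key(g) in reported) / len(golden) if golden else 1.0

# 示例:黄金集 2 条,命中 1 条 → 召回 0.5
findings = [{"file": "a.py", "rule_id": "swallow-exception"}]
golden = [
    {"file": "a.py", "rule_id": "swallow-exception"},
    {"file": "b.py", "rule_id": "missing-timeout"},
]
assert golden_set_recall(findings, golden) == 0.5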

5. LangChain Callback 的价值与坑

Callback 能统一捕获 chain start/end、llm start/end、token stream。但要注意:异步链路要用 AsyncCallbackHandler;避免在 callback 里做重 IO(应异步入队);防止递归记录导致日志爆炸(对超大 prompt 只记录摘要)。
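
「异步入队、后台消费」是避免 callback 阻塞推理链路的常见做法,下面是一个不依赖 LangChain 的极简示意(方法名仅对齐其 AsyncCallbackHandler 的风格):

import asyncio

class QueueingTracer:
    """callback 内只做 put_nowait;落库/上报等重 IO 交给后台消费任务。"""

    def __init__(self, maxsize: int = 10000) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)

    async def on_llm_start(self, prompt: str, **kwargs) -> None:
        event = {"name": "llm.start", "prompt_preview": prompt[:200]}  # 超大 prompt 只记摘要
        try:
            self.queue.put_nowait(event)  # 非阻塞:队列满时宁可丢事件也不拖慢业务
        except asyncio.QueueFull:
            pass

    async def consume_forever(self) -> None:
        while True:
            event = await self.queue.get()
            # 真正的 IO 在这里做:批量写 tracing store / 指标后端(此处留空示意)
            _ = event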

6. 告警哲学:对 LLM 服务,429 与延迟同样致命

除了常规 SLO,建议对 429 占比、重试风暴、队列堆积、缓存命中率骤降设置独立告警——它们往往是成本与体验恶化的前兆。
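
生产中这类规则通常交给 Prometheus/Alertmanager;若需在应用内自检,下面是一个滑动窗口 429 占比检查的教学示意(阈值 5%、窗口 5 分钟对应上文告警规则 A2):

import time
from collections import deque
from typing import Deque, Tuple

class RateLimitAlert:
    """滑动窗口内统计 429 占比,超阈值视为限流/重试风暴恶化的前兆。"""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.05) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self._events: Deque[Tuple[float, bool]] = deque()  # (时间戳, 是否 429)

    def record(self, status_code: int) -> None:
        now = time.time()
        self._events.append((now, status_code == 429))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()  # 淘汰窗口外的旧事件

    def should_alert(self) -> bool:
        if not self._events:
            return False
        ratio = sum(1 for _, is429 in self._events if is429) / len(self._events)
        return ratio > self.threshold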


代码实战:LLMTracer + MetricsCollector + QualityEvaluator

说明:以下为单文件可运行示例 obs_lab/app.py,依赖 fastapi、uvicorn、pydantic。为兼容不同 LangChain 版本,示例实现了一个最小 LLMTracerCallback(不强制安装 langchain);若你已使用 LangChain,可把方法映射到 BaseCallbackHandler/AsyncCallbackHandler。

1. requirements.txt

fastapi>=0.110.0
uvicorn[standard]>=0.27.0
pydantic>=2.6.0

2. obs_lab/app.py(完整实现)

from __future__ import annotations

import hashlib
import json
import os
import re
import threading
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

import uvicorn
from fastapi import FastAPI, Request
from pydantic import BaseModel, Field


# -----------------------------
# PII / Secret 脱敏
# -----------------------------


_SECRET_PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*['\"]?[\w-]{8,}['\"]?", re.M), r"\1:***"),
    (re.compile(r"(?i)Bearer\s+[A-Za-z0-9\-\._~\+\/]+=*"), "Bearer ***"),
    (re.compile(r"(?i)-----BEGIN [A-Z ]+PRIVATE KEY-----[\s\S]+?-----END [A-Z ]+PRIVATE KEY-----"), "[REDACTED_PEM]"),
]


def redact_pii(text: str) -> str:
    t = text or ""
    for pat, repl in _SECRET_PATTERNS:
        t = pat.sub(repl, t)
    return t


def stable_hash(text: str) -> str:
    return hashlib.sha256((text or "").encode("utf-8")).hexdigest()


# -----------------------------
# Metrics(Prometheus 文本暴露风格,教学版)
# -----------------------------


@dataclass
class MetricValue:
    help_text: str
    type_name: str  # counter | histogram
    series: Dict[Tuple[str, ...], float] = field(default_factory=dict)


class MetricsCollector:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._metrics: Dict[str, MetricValue] = {
            "cs_reviews_total": MetricValue("Total reviews", "counter"),
            "cs_llm_calls_total": MetricValue("Total LLM calls", "counter"),
            "cs_llm_errors_total": MetricValue("Total LLM errors", "counter"),
            "cs_tokens_prompt_total": MetricValue("Prompt tokens", "counter"),
            "cs_tokens_completion_total": MetricValue("Completion tokens", "counter"),
            "cs_review_latency_ms": MetricValue("Review latency", "histogram"),
        }
        # histogram bucket storage: key -> observations
        self._hist_obs: Dict[str, List[float]] = {}

    def inc(self, name: str, labels: Dict[str, str], delta: float = 1.0) -> None:
        # 以 (指标名, 排序后的标签对) 作为 series key,保证同组标签始终聚合到同一条序列
        key = (name, tuple(sorted(labels.items())))
        with self._lock:
            m = self._metrics.setdefault(name, MetricValue(name, "counter"))
            m.series[key] = m.series.get(key, 0.0) + float(delta)

    def observe_latency_ms(self, labels: Dict[str, str], ms: float) -> None:
        name = "cs_review_latency_ms"
        flat = json.dumps(labels, sort_keys=True, ensure_ascii=False)
        with self._lock:
            self._hist_obs.setdefault(f"{name}:{flat}", []).append(ms)

    def export_prometheus(self) -> str:
        lines: List[str] = []
        with self._lock:
            for name, mv in self._metrics.items():
                if mv.type_name != "counter":
                    continue  # histogram 在下方单独输出,避免重复 HELP/TYPE 行
                lines.append(f"# HELP {name} {mv.help_text}")
                lines.append(f"# TYPE {name} {mv.type_name}")
                for key, val in mv.series.items():
                    _, label_tuple = key
                    label_str = ",".join(f'{k}="{v}"' for k, v in label_tuple)
                    lines.append(f"{name}{{{label_str}}} {val}")
                lines.append("")

            # histogram 简化输出:sum/count + 几个分位近似
            lines.append("# HELP cs_review_latency_ms Review latency histogram (approx)")
            lines.append("# TYPE cs_review_latency_ms histogram")
            for flat, obs in self._hist_obs.items():
                if not obs:
                    continue
                _, labels_json = flat.split(":", 1)
                labels = json.loads(labels_json)
                label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
                s = sum(obs)
                c = len(obs)
                avg = s / c
                p95 = sorted(obs)[int(0.95 * (c - 1))]
                lines.append(f'cs_review_latency_ms_bucket{{le="500",{label_str}}} {sum(1 for x in obs if x<=500)}')
                lines.append(f'cs_review_latency_ms_bucket{{le="2000",{label_str}}} {sum(1 for x in obs if x<=2000)}')
                lines.append(f'cs_review_latency_ms_bucket{{le="+Inf",{label_str}}} {c}')
                lines.append(f"cs_review_latency_ms_sum{{{label_str}}} {s}")
                lines.append(f"cs_review_latency_ms_count{{{label_str}}} {c}")
                lines.append(f"# AVG={avg:.2f}ms P95={p95:.2f}ms (debug comment)\n")
        return "\n".join(lines)


METRICS = MetricsCollector()


# -----------------------------
# Tracing(内存 ring buffer,生产换 OTLP/Tempo/Jaeger)
# -----------------------------


@dataclass
class SpanEvent:
    ts: float
    trace_id: str
    name: str
    attrs: Dict[str, Any]


class TraceStore:
    def __init__(self, max_items: int = 2000) -> None:
        self._buf: List[SpanEvent] = []
        self._lock = threading.Lock()
        self._max = max_items

    def add(self, ev: SpanEvent) -> None:
        with self._lock:
            self._buf.append(ev)
            if len(self._buf) > self._max:
                self._buf = self._buf[-self._max :]

    def query_by_trace(self, trace_id: str) -> List[SpanEvent]:
        with self._lock:
            return [e for e in self._buf if e.trace_id == trace_id]


TRACES = TraceStore()


class LLMTracerCallback:
    """对齐 LangChain callback 思想的最小实现。"""

    def __init__(self, trace_id: str, project_id: str, route: str, model: str = "gpt-demo") -> None:
        self.trace_id = trace_id
        self.project_id = project_id
        self.route = route
        self.model = model

    def on_llm_start(self, prompt: str, model: str) -> None:
        red = redact_pii(prompt)
        TRACES.add(
            SpanEvent(
                ts=time.time(),
                trace_id=self.trace_id,
                name="llm.start",
                attrs={
                    "model": model,
                    "project_id": self.project_id,
                    "route": self.route,
                    "prompt_sha256": stable_hash(prompt),
                    "prompt_len": len(prompt),
                    "prompt_preview": red[:400],
                },
            )
        )
        METRICS.inc("cs_llm_calls_total", {"project": self.project_id, "model": model})

    def on_llm_end(self, response_text: str, usage: Dict[str, int], model: str) -> None:
        red = redact_pii(response_text)
        TRACES.add(
            SpanEvent(
                ts=time.time(),
                trace_id=self.trace_id,
                name="llm.end",
                attrs={
                    "model": model,
                    "project_id": self.project_id,
                    "response_sha256": stable_hash(response_text),
                    "response_len": len(response_text),
                    "response_preview": red[:400],
                    "usage": usage,
                },
            )
        )
        METRICS.inc(
            "cs_tokens_prompt_total",
            {"project": self.project_id, "model": model},
            float(usage.get("prompt_tokens", 0)),
        )
        METRICS.inc(
            "cs_tokens_completion_total",
            {"project": self.project_id, "model": model},
            float(usage.get("completion_tokens", 0)),
        )

    def on_llm_error(self, err: str, model: str) -> None:
        TRACES.add(
            SpanEvent(
                ts=time.time(),
                trace_id=self.trace_id,
                name="llm.error",
                attrs={"model": model, "project_id": self.project_id, "error": err[:500]},
            )
        )
        METRICS.inc("cs_llm_errors_total", {"project": self.project_id, "model": model})


# -----------------------------
# QualityEvaluator(误报/人工反馈)
# -----------------------------


class HumanFeedback(BaseModel):
    trace_id: str
    finding_id: str
    label: str = Field(..., description="tp | fp | unknown")
    note: str = ""


class QualityEvaluator:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._feedbacks: List[HumanFeedback] = []

    def record(self, fb: HumanFeedback) -> None:
        with self._lock:
            self._feedbacks.append(fb)

    def false_positive_rate(self, project_id: str, window: int = 500) -> Optional[float]:
        # 教学从简:HumanFeedback 未携带项目维度,此处暂不按 project_id 过滤;
        # 生产中应让反馈携带 project/rule 维度后再细分统计。
        with self._lock:
            items = list(self._feedbacks)[-window:]
        fps = sum(1 for x in items if x.label == "fp")
        tps = sum(1 for x in items if x.label == "tp")
        denom = fps + tps
        if denom == 0:
            return None
        return fps / denom

    def export_summary(self) -> Dict[str, Any]:
        with self._lock:
            n = len(self._feedbacks)
        return {"feedback_count": n, "note": "按 trace 关联项目可进一步细分,此处教学从简"}


QUALITY = QualityEvaluator()


# -----------------------------
# FastAPI:中间件 + Demo 审核端点
# -----------------------------


app = FastAPI(title="CodeSentinel Observability Lab", version="0.1.0")


@app.middleware("http")
async def trace_middleware(request: Request, call_next):
    incoming = request.headers.get("x-trace-id") or str(uuid.uuid4())
    request.state.trace_id = incoming
    request.state.project_id = request.headers.get("x-project-id", "default")

    t0 = time.perf_counter()
    response = await call_next(request)
    ms = (time.perf_counter() - t0) * 1000.0
    response.headers["x-trace-id"] = incoming

    route = request.url.path
    METRICS.observe_latency_ms({"route": route, "project": request.state.project_id}, ms)
    return response


class ReviewRequest(BaseModel):
    code: str
    model: str = Field("gpt-demo", description="演示模型名")


def _fake_llm_review(code: str, tracer: LLMTracerCallback) -> Tuple[str, Dict[str, int]]:
    prompt = (
        "你是架构审核助手。请列出最多三条发现,使用 JSON 数组格式,每项含 severity/message。\n"
        f"代码:\n{code}\n"
    )
    tracer.on_llm_start(prompt, model=tracer.model)
    time.sleep(0.05)  # 模拟 IO;教学用同步阻塞,真实实现应改用异步客户端避免阻塞事件循环
    resp = json.dumps(
        [
            {"severity": "medium", "message": "检测到可能的异常吞掉:except: pass"},
            {"severity": "low", "message": "建议补充类型注解以提高可维护性"},
        ],
        ensure_ascii=False,
    )
    usage = {"prompt_tokens": max(1, len(prompt) // 4), "completion_tokens": max(1, len(resp) // 4)}
    tracer.on_llm_end(resp, usage, model=tracer.model)
    return resp, usage


@app.post("/reviews/demo")
async def reviews_demo(req: ReviewRequest, request: Request) -> Dict[str, Any]:
    trace_id = request.state.trace_id
    project_id = request.state.project_id
    METRICS.inc("cs_reviews_total", {"route": "/reviews/demo", "project": project_id})

    cb = LLMTracerCallback(trace_id=trace_id, project_id=project_id, route="/reviews/demo", model=req.model)

    try:
        text, usage = _fake_llm_review(req.code, cb)
        return {"trace_id": trace_id, "result": json.loads(text), "usage": usage}
    except Exception as exc:  # noqa: BLE001
        cb.on_llm_error(repr(exc), model=req.model)
        raise


@app.post("/quality/feedback")
async def quality_feedback(fb: HumanFeedback) -> Dict[str, str]:
    QUALITY.record(fb)
    return {"status": "ok"}


@app.get("/metrics")
async def metrics() -> Any:
    from fastapi.responses import PlainTextResponse

    return PlainTextResponse(METRICS.export_prometheus(), media_type="text/plain; version=0.0.4")


@app.get("/traces/{trace_id}")
async def get_trace(trace_id: str) -> Dict[str, Any]:
    evs = TRACES.query_by_trace(trace_id)
    return {
        "trace_id": trace_id,
        "events": [
            {"ts": e.ts, "name": e.name, "attrs": e.attrs}
            for e in sorted(evs, key=lambda x: x.ts)
        ],
    }


@app.get("/quality/summary")
async def quality_summary() -> Dict[str, Any]:
    return QUALITY.export_summary()


if __name__ == "__main__":
    uvicorn.run(
        "app:app",
        host=os.getenv("HOST", "0.0.0.0"),
        port=int(os.getenv("PORT", "8010")),
        reload=False,
    )

3. 运行与验证

cd obs_lab
python -m uvicorn app:app --port 8010
curl -s -H "x-project-id: acme" -H "Content-Type: application/json" -d "{\"code\":\"def f():\n  try:\n    pass\n  except:\n    pass\n\"}" http://127.0.0.1:8010/reviews/demo
# 用返回的 trace_id:
curl -s http://127.0.0.1:8010/traces/<trace_id>
curl -s http://127.0.0.1:8010/metrics
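
调用一次审核接口后,/metrics 的导出大致形如(节选,数值为示例):

cs_reviews_total{project="acme",route="/reviews/demo"} 1.0
cs_llm_calls_total{model="gpt-demo",project="acme"} 1.0
cs_tokens_prompt_total{model="gpt-demo",project="acme"} 31.0
cs_review_latency_ms_bucket{le="500",project="acme",route="/reviews/demo"} 1
cs_review_latency_ms_sum{project="acme",route="/reviews/demo"} 52.31
cs_review_latency_ms_count{project="acme",route="/reviews/demo"} 1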

生产环境实战:从实验室到企业级观测栈

  1. OpenTelemetry:用 OTLP exporter 将 span 发到 Tempo/Jaeger;在 resource 属性中固定 service.name=codesentinel-review 与 deployment.environment。
  2. 日志:JSON 结构化 + trace_id 字段;对接 Loki/ELK;对 prompt/response 默认不落全文,仅 hash + 长度 + 脱敏预览。
  3. 指标:用官方 prometheus_client 或 OpenTelemetry metrics;histogram 的 bucket 按业务校准(例如 0.5s/2s/10s),替换写法见本列表后的示意。
  4. 质量评估:把 PR 评论 ID 与 finding_id 关联;每周生成误报报告推动规则/提示词迭代。
  5. 告警:为 cs_llm_errors_total 增长率、p95 延迟、token 突增配置多维告警;避免只盯 QPS。
  6. 隐私评审:任何第三方 tracing SaaS 上线前走数据分类与 DPA;必要时自建。
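
其中第 3 条,若用官方 prometheus_client 替换教学版 MetricsCollector,一个最小示意如下(指标名沿用本文约定,bucket 即上文建议值):

from prometheus_client import Counter, Histogram, start_http_server

# bucket 单位为秒,按业务校准(对应上文 0.5s / 2s / 10s 的建议)
REVIEW_LATENCY = Histogram(
    "cs_review_latency_seconds", "Review latency",
    ["route", "project"], buckets=(0.5, 2.0, 10.0),
)
TOKENS_PROMPT = Counter(
    "cs_tokens_prompt_total", "Prompt tokens", ["project", "model"],
)

if __name__ == "__main__":
    start_http_server(9100)  # 独立暴露 /metrics,避免与业务端口耦合
    REVIEW_LATENCY.labels(route="/reviews", project="acme").observe(1.2)
    TOKENS_PROMPT.labels(project="acme", model="gpt-demo").inc(128)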

本讲小结(Mermaid mindmap)

mindmap
  root((第36讲 可观测性))
    Tracing
      trace_id
      span 事件
      prompt hash
    Metrics
      token 分层
      延迟直方图
      错误与 429
    Logging
      结构化
      PII 脱敏
      审计追踪
    Quality
      人工反馈
      误报率
      黄金集回归
    运营
      仪表盘
      告警 SLO

思考题

  1. 如果你必须保存完整 prompt 以便法务审计,你会如何设计「密钥分离、加密存储、访问审批、保留期限」四要素?
  2. QualityEvaluator 的「tp/fp」标签昂贵,如何用工单抽样与主动学习降低标注成本?
  3. 多租户场景下,metrics label 基数爆炸(每个 user_id 一个 series)会带来什么后果?应如何折中?

下一讲预告

第 37 讲是模块五的综合实战:把第 33 讲的代码索引、第 34 讲的混合检索与重排序、第 35 讲的流式推理与成本控制、第 36 讲的追踪与指标,收敛为一份可 docker compose up 的 CodeSentinel AI 后端,并给出端到端测试脚本:索引仓库 → 提交审核 → 带上下文结论 → 全链路可观测。你将第一次从「组件」看到「平台」。