Module 5: AI System Architecture Design | Lecture 36: Observability for LLM Applications - Prompt Tracing, Token Usage Monitoring, and Quality Evaluation
Lecture goals: understand why LLM applications must be "observability-first": non-deterministic outputs, cost that grows linearly with tokens, and quality fluctuations that traditional unit tests cannot cover. You will learn how the three pillars (Tracing / Metrics / Logging) are implemented in CodeSentinel: end-to-end trace IDs, automatic collection via LangChain callbacks, Prometheus-style metrics, PII redaction, and audit logs. You will also build a QualityEvaluator framework that brings false-positive rate, a recall proxy, and human feedback into a continuous operations loop. The lecture closes with runnable example code: a FastAPI middleware, a metrics collector, and a quality evaluator.
Opening: an LLM without observability is a plane without instruments
Traditional services fail in mostly deterministic ways: null pointers, timeouts, 5xx. LLM applications fail more like a drifting probability distribution: the same prompt changes style after a model upgrade; the same code review can slide from "cautious suggestions" to "alarmist warnings" with a slightly higher temperature; behind a seemingly stable QPS on the bill, a single episode of context bloat may have made token usage explode. Without prompt-level tracing, token-level metering, and quality trends, the team can only react to tickets, explaining "why it worked fine yesterday".
For CodeSentinel, an AI-driven architecture governance platform, observability is not just an ops requirement but a compliance and product requirement. You must be able to answer: which context snippets did a given PR comment rely on? Which model version produced it? Was it confirmed by a human? Is the false-positive rate rising? Which rule classes consume the most tokens? These questions cannot be answered by ad-hoc print logging; they must be built in at the architecture level: end-to-end traces, aggregated metrics, structured logs.
This lecture complements Lecture 35: there we made the inference service "run well and keep costs under control"; here we make it "visible, measurable, and improvable". We first present an overview of the observability architecture and a tracing sequence, then dig into the design of the three pillars, and finally give a complete Python implementation: an async middleware that records request metadata, a callback that captures LLM events, a thread-safe MetricsCollector, and a pluggable QualityEvaluator (including a simple golden-set evaluation approach). We start with the overall architecture diagram.
The big picture: CodeSentinel's LLM observability architecture (Mermaid)
flowchart TB
  subgraph Ingress["Ingress Layer"]
    MW["LLMTracerMiddleware\n(trace_id injection)"]
    AUTH["Auth / Tenant"]
  end
  subgraph Runtime["Runtime"]
    API["FastAPI routes\n/reviews ..."]
    PIPE["Review Pipeline"]
    LC["LangChain\nCallbacks"]
    LLM["LLM Provider"]
  end
  subgraph Pillars["Three Pillars"]
    T["Tracing Store\n(span events)"]
    M["MetricsCollector\n(counter/histogram)"]
    L["Structured Logs\nJSON + PII redact"]
  end
  subgraph Quality["Quality Loop"]
    Q["QualityEvaluator"]
    HF["Human Feedback\n(thumbs/labels)"]
    AL["Alerts\n(SLO anomalies)"]
  end
  AUTH --> MW --> API --> PIPE --> LC --> LLM
  LC --> T
  LC --> M
  API --> L
  M --> AL
  Q --> M
  HF --> Q
Prompt tracing: the full path from ingress to LLM call (Mermaid)
sequenceDiagram
  participant U as User / CI
  participant A as FastAPI
  participant M as Middleware
  participant P as Pipeline
  participant C as LLMCallbackTracer
  participant L as LLM
  U->>A: POST /reviews
  A->>M: generate / propagate trace_id
  M->>P: ctx(trace_id, tenant)
  P->>C: register callback
  C->>L: chat/stream
  L-->>C: prompt/response chunks
  C-->>P: span: llm.call
  P-->>A: ReviewReport
  A-->>U: 200 + metrics headers (optional)
  Note over C,L: Logs keep only the hashed prompt<br/>full text enters cold storage only with compliance approval
Operations dashboard sketch: metric dimensions and alerts (Mermaid)
flowchart LR
  subgraph Panels["Grafana panels (examples)"]
    P1["Review QPS / p95 latency"]
    P2["Tokens: input/output breakdown"]
    P3["Cost: by project / model"]
    P4["Error rate: 429 / 5xx / timeout"]
    P5["Quality: FP rate / human-correction rate"]
  end
  subgraph Alerts["Alert rules"]
    A1["p95 latency > 20s for 10m"]
    A2["429 ratio > 5% for 5m"]
    A3["Daily token surge > 3σ"]
    A4["FP rate +10pt week-over-week"]
  end
  P1 --> A1
  P4 --> A2
  P3 --> A3
  P5 --> A4
Core principles: why LLM observability differs from traditional APM
1. Non-determinism: you need a distribution, not a single pass/fail
A single inference result says little about system health. Observation should emphasize: variance under identical inputs (where reproducible), the real online distribution (latency histograms), and quality proxies (rule hit rates and human feedback). Tracing must record sampling parameters (temperature, top_p, model version), or online drift becomes unexplainable.
2. Cost sensitivity: tokens are first-class citizens
Metrics must split prompt_tokens and completion_tokens, labeled by project_id, model, and route; otherwise cost attribution for finance fails. In tracing, it pays to record an estimated_cost_usd and reconcile it against the provider-reported usage to surface estimation drift.
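The reconciliation step can be sketched in a few lines. The per-1K-token prices below are illustrative placeholders, not real provider rates:

```python
# Sketch: estimate cost from token counts, then reconcile the estimate
# against the provider-reported usage to surface drift.
# NOTE: prices are illustrative placeholders, not real provider rates.
PRICE_PER_1K = {
    "gpt-demo": {"prompt": 0.0015, "completion": 0.002},
}

def estimated_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICE_PER_1K.get(model, {"prompt": 0.0, "completion": 0.0})
    return prompt_tokens / 1000 * p["prompt"] + completion_tokens / 1000 * p["completion"]

def usage_drift(estimated_tokens: int, actual_tokens: int) -> float:
    """Relative error of our token estimate vs provider-reported usage."""
    if actual_tokens == 0:
        return 0.0
    return abs(estimated_tokens - actual_tokens) / actual_tokens
```

A drift that stays above, say, 10% usually means the local tokenizer estimate has diverged from the provider's counting and the price table or estimator needs updating.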
3. Compliance: prompts/responses in logs are high-risk assets
Code, secrets, and personal data can all appear in diffs and prompts. The logging strategy should include field-level redaction (regex plus entropy detection), optional encrypted cold storage of full text, and a least-visibility default (store only hash and length). If the tracing store is a third-party SaaS, a cross-border data transfer assessment is required.
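The entropy half of "regex plus entropy detection" can be sketched with Shannon entropy over candidate tokens; the 4.0 bits/char threshold and the token pattern are assumed starting points to tune on your own corpus:

```python
import math
import re

# Long runs of base64/hex-like characters are candidate secrets.
_TOKEN_RE = re.compile(r"[A-Za-z0-9+/_\-]{20,}")

def shannon_entropy(s: str) -> float:
    """Bits per character; random keys score high, natural prose scores low."""
    if not s:
        return 0.0
    freq = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in freq)

def flag_high_entropy(text: str, threshold: float = 4.0) -> list:
    """Return long tokens whose per-character entropy exceeds the threshold."""
    return [t for t in _TOKEN_RE.findall(text) if shannon_entropy(t) > threshold]
```

Entropy detection complements the regex rules in the main example: regexes catch known key formats, entropy catches unknown ones at the cost of some false positives on hashes and UUIDs.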
4. A three-layer model for quality evaluation
- Offline golden set: a small labeled collection of "issues that should be flagged"; compute approximate recall from it; well suited to regression-testing new models.
- Online proxies: watch for sudden shifts in the finding_severity distribution, duplicated comments, and the same rule firing at high frequency on one file.
- Human feedback: 👍/👎 or "false positive" labels under PR comments, accumulated as training data and as evidence for threshold tuning.
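The offline golden-set layer can be sketched as a tiny recall computation. The data shapes here are assumptions: a set of rule IDs per labeled sample, for both the labels and the model's output:

```python
# Sketch: approximate recall over an offline golden set.
# Assumed shapes: sample id -> set of rule ids that should fire (golden)
# and that the model actually flagged (predicted).

def recall_proxy(golden: dict, predicted: dict) -> float:
    expected = sum(len(rules) for rules in golden.values())
    if expected == 0:
        return 1.0  # nothing to find counts as full recall
    hit = sum(len(golden[k] & predicted.get(k, set())) for k in golden)
    return hit / expected
```

Run this against each candidate model version before rollout; a drop in the proxy is a cheap early warning even though the golden set is too small to be statistically conclusive.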
5. LangChain callbacks: value and pitfalls
Callbacks capture chain start/end, LLM start/end, and token streams uniformly. But beware: use AsyncCallbackHandler on async paths; avoid heavy IO inside callbacks (enqueue asynchronously instead); and prevent log explosions from recursive recording (store only a summary of very large prompts).
6. Alerting philosophy: for LLM services, 429s are as fatal as latency
Beyond the usual SLOs, set dedicated alerts for the 429 ratio, retry storms, queue buildup, and sudden drops in cache hit rate; these are usually the leading indicators of cost and experience degradation.
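Alert rule A2 above (429 ratio > 5%) would normally live in Prometheus; as an in-process illustration, a sliding-window check might look like this (window size and threshold are assumed values):

```python
from collections import deque

class RateLimitAlert:
    """Sketch: fire when the share of 429s in the last N responses exceeds
    a threshold, mirroring the '429 ratio > 5%' rule. In production this
    belongs in the metrics backend; window/threshold here are assumptions."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self._recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status_code: int) -> bool:
        """Record one response; return True when the alert condition fires."""
        self._recent.append(status_code)
        ratio = sum(1 for s in self._recent if s == 429) / len(self._recent)
        return ratio > self.threshold
```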
Hands-on code: LLMTracer + MetricsCollector + QualityEvaluator
Note: the following is a runnable single-file example, obs_lab/app.py, depending on fastapi, uvicorn, and pydantic. For compatibility across LangChain versions, the example implements a minimal LLMTracerCallback (installing langchain is not required); if you already use LangChain, map these methods onto BaseCallbackHandler/AsyncCallbackHandler.
1. requirements.txt
fastapi>=0.110.0
uvicorn[standard]>=0.27.0
pydantic>=2.6.0
2. obs_lab/app.py (full implementation)
from __future__ import annotations
import hashlib
import json
import os
import re
import threading
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple
import uvicorn
from fastapi import FastAPI, Request
from pydantic import BaseModel, Field
# -----------------------------
# PII / secret redaction
# -----------------------------
_SECRET_PATTERNS = [
(re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*['\"]?[\w-]{8,}['\"]?", re.M), r"\1:***"),
(re.compile(r"(?i)Bearer\s+[A-Za-z0-9\-\._~\+\/]+=*"), "Bearer ***"),
(re.compile(r"(?i)-----BEGIN [A-Z ]+PRIVATE KEY-----[\s\S]+?-----END [A-Z ]+PRIVATE KEY-----"), "[REDACTED_PEM]"),
]
def redact_pii(text: str) -> str:
t = text or ""
for pat, repl in _SECRET_PATTERNS:
t = pat.sub(repl, t)
return t
def stable_hash(text: str) -> str:
return hashlib.sha256((text or "").encode("utf-8")).hexdigest()
# -----------------------------
# Metrics (Prometheus text-exposition style, teaching version)
# -----------------------------
@dataclass
class MetricValue:
help_text: str
type_name: str # counter | histogram
series: Dict[Tuple[str, ...], float] = field(default_factory=dict)
buckets: Dict[Tuple[str, ...], List[float]] = field(default_factory=dict)
class MetricsCollector:
def __init__(self) -> None:
self._lock = threading.Lock()
self._metrics: Dict[str, MetricValue] = {
"cs_reviews_total": MetricValue("Total reviews", "counter"),
"cs_llm_calls_total": MetricValue("Total LLM calls", "counter"),
"cs_llm_errors_total": MetricValue("Total LLM errors", "counter"),
"cs_tokens_prompt_total": MetricValue("Prompt tokens", "counter"),
"cs_tokens_completion_total": MetricValue("Completion tokens", "counter"),
"cs_review_latency_ms": MetricValue("Review latency", "histogram"),
}
# histogram bucket storage: key -> observations
self._hist_obs: Dict[str, List[float]] = {}
    def inc(self, name: str, labels: Dict[str, str], delta: float = 1.0) -> None:
        key = (name, tuple(sorted(labels.items())))
        with self._lock:
            m = self._metrics.setdefault(name, MetricValue(name, "counter"))
            m.series[key] = m.series.get(key, 0.0) + float(delta)
def observe_latency_ms(self, labels: Dict[str, str], ms: float) -> None:
name = "cs_review_latency_ms"
flat = json.dumps(labels, sort_keys=True, ensure_ascii=False)
with self._lock:
self._hist_obs.setdefault(f"{name}:{flat}", []).append(ms)
def export_prometheus(self) -> str:
lines: List[str] = []
with self._lock:
for name, mv in self._metrics.items():
lines.append(f"# HELP {name} {mv.help_text}")
lines.append(f"# TYPE {name} {mv.type_name}")
if mv.type_name == "counter":
for key, val in mv.series.items():
_, label_tuple = key
label_str = ",".join(f'{k}="{v}"' for k, v in label_tuple)
lines.append(f"{name}{{{label_str}}} {val}")
lines.append("")
            # simplified histogram output: sum/count plus a few approximate buckets
lines.append("# HELP cs_review_latency_ms Review latency histogram (approx)")
lines.append("# TYPE cs_review_latency_ms histogram")
for flat, obs in self._hist_obs.items():
if not obs:
continue
_, labels_json = flat.split(":", 1)
labels = json.loads(labels_json)
label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
s = sum(obs)
c = len(obs)
avg = s / c
p95 = sorted(obs)[int(0.95 * (c - 1))]
lines.append(f'cs_review_latency_ms_bucket{{le="500",{label_str}}} {sum(1 for x in obs if x<=500)}')
lines.append(f'cs_review_latency_ms_bucket{{le="2000",{label_str}}} {sum(1 for x in obs if x<=2000)}')
lines.append(f'cs_review_latency_ms_bucket{{le="+Inf",{label_str}}} {c}')
lines.append(f"cs_review_latency_ms_sum{{{label_str}}} {s}")
lines.append(f"cs_review_latency_ms_count{{{label_str}}} {c}")
lines.append(f"# AVG={avg:.2f}ms P95={p95:.2f}ms (debug comment)\n")
return "\n".join(lines)
METRICS = MetricsCollector()
# -----------------------------
# Tracing (in-memory ring buffer; swap for OTLP/Tempo/Jaeger in production)
# -----------------------------
@dataclass
class SpanEvent:
ts: float
trace_id: str
name: str
attrs: Dict[str, Any]
class TraceStore:
def __init__(self, max_items: int = 2000) -> None:
self._buf: List[SpanEvent] = []
self._lock = threading.Lock()
self._max = max_items
def add(self, ev: SpanEvent) -> None:
with self._lock:
self._buf.append(ev)
if len(self._buf) > self._max:
self._buf = self._buf[-self._max :]
def query_by_trace(self, trace_id: str) -> List[SpanEvent]:
with self._lock:
return [e for e in self._buf if e.trace_id == trace_id]
TRACES = TraceStore()
class LLMTracerCallback:
    """Minimal implementation in the spirit of LangChain callbacks."""
def __init__(self, trace_id: str, project_id: str, route: str, model: str = "gpt-demo") -> None:
self.trace_id = trace_id
self.project_id = project_id
self.route = route
self.model = model
def on_llm_start(self, prompt: str, model: str) -> None:
red = redact_pii(prompt)
TRACES.add(
SpanEvent(
ts=time.time(),
trace_id=self.trace_id,
name="llm.start",
attrs={
"model": model,
"project_id": self.project_id,
"route": self.route,
"prompt_sha256": stable_hash(prompt),
"prompt_len": len(prompt),
"prompt_preview": red[:400],
},
)
)
METRICS.inc("cs_llm_calls_total", {"project": self.project_id, "model": model})
def on_llm_end(self, response_text: str, usage: Dict[str, int], model: str) -> None:
red = redact_pii(response_text)
TRACES.add(
SpanEvent(
ts=time.time(),
trace_id=self.trace_id,
name="llm.end",
attrs={
"model": model,
"project_id": self.project_id,
"response_sha256": stable_hash(response_text),
"response_len": len(response_text),
"response_preview": red[:400],
"usage": usage,
},
)
)
METRICS.inc(
"cs_tokens_prompt_total",
{"project": self.project_id, "model": model},
float(usage.get("prompt_tokens", 0)),
)
METRICS.inc(
"cs_tokens_completion_total",
{"project": self.project_id, "model": model},
float(usage.get("completion_tokens", 0)),
)
def on_llm_error(self, err: str, model: str) -> None:
TRACES.add(
SpanEvent(
ts=time.time(),
trace_id=self.trace_id,
name="llm.error",
attrs={"model": model, "project_id": self.project_id, "error": err[:500]},
)
)
METRICS.inc("cs_llm_errors_total", {"project": self.project_id, "model": model})
# -----------------------------
# QualityEvaluator (false positives / human feedback)
# -----------------------------
class HumanFeedback(BaseModel):
trace_id: str
finding_id: str
label: str = Field(..., description="tp | fp | unknown")
note: str = ""
class QualityEvaluator:
def __init__(self) -> None:
self._lock = threading.Lock()
self._feedbacks: List[HumanFeedback] = []
def record(self, fb: HumanFeedback) -> None:
with self._lock:
self._feedbacks.append(fb)
    def false_positive_rate(self, window: int = 500) -> Optional[float]:
        # Global rate over the last `window` labels; a per-project breakdown would need trace linkage.
with self._lock:
items = list(self._feedbacks)[-window:]
fps = sum(1 for x in items if x.label == "fp")
tps = sum(1 for x in items if x.label == "tp")
denom = fps + tps
if denom == 0:
return None
return fps / denom
def export_summary(self) -> Dict[str, Any]:
with self._lock:
n = len(self._feedbacks)
        return {"feedback_count": n, "note": "link feedback to projects via traces for a finer breakdown; simplified here for teaching"}
QUALITY = QualityEvaluator()
# -----------------------------
# FastAPI: middleware + demo review endpoint
# -----------------------------
app = FastAPI(title="CodeSentinel Observability Lab", version="0.1.0")
@app.middleware("http")
async def trace_middleware(request: Request, call_next):
incoming = request.headers.get("x-trace-id") or str(uuid.uuid4())
request.state.trace_id = incoming
request.state.project_id = request.headers.get("x-project-id", "default")
t0 = time.perf_counter()
response = await call_next(request)
ms = (time.perf_counter() - t0) * 1000.0
response.headers["x-trace-id"] = incoming
route = request.url.path
METRICS.observe_latency_ms({"route": route, "project": request.state.project_id}, ms)
return response
class ReviewRequest(BaseModel):
code: str
    model: str = Field("gpt-demo", description="demo model name")
def _fake_llm_review(code: str, tracer: LLMTracerCallback) -> Tuple[str, Dict[str, int]]:
    prompt = (
        "You are an architecture review assistant. List at most three findings "
        "as a JSON array; each item has severity/message.\n"
        f"Code:\n{code}\n"
    )
tracer.on_llm_start(prompt, model=tracer.model)
    time.sleep(0.05)  # simulate IO latency
    resp = json.dumps(
        [
            {"severity": "medium", "message": "Possible swallowed exception: except: pass"},
            {"severity": "low", "message": "Consider adding type annotations for maintainability"},
        ],
        ensure_ascii=False,
    )
usage = {"prompt_tokens": max(1, len(prompt) // 4), "completion_tokens": max(1, len(resp) // 4)}
tracer.on_llm_end(resp, usage, model=tracer.model)
return resp, usage
@app.post("/reviews/demo")
async def reviews_demo(req: ReviewRequest, request: Request) -> Dict[str, Any]:
trace_id = request.state.trace_id
project_id = request.state.project_id
METRICS.inc("cs_reviews_total", {"route": "/reviews/demo", "project": project_id})
cb = LLMTracerCallback(trace_id=trace_id, project_id=project_id, route="/reviews/demo", model=req.model)
try:
text, usage = _fake_llm_review(req.code, cb)
return {"trace_id": trace_id, "result": json.loads(text), "usage": usage}
except Exception as exc: # noqa: BLE001
cb.on_llm_error(repr(exc), model=req.model)
raise
@app.post("/quality/feedback")
async def quality_feedback(fb: HumanFeedback) -> Dict[str, str]:
QUALITY.record(fb)
return {"status": "ok"}
@app.get("/metrics")
async def metrics() -> Any:
from fastapi.responses import PlainTextResponse
return PlainTextResponse(METRICS.export_prometheus(), media_type="text/plain; version=0.0.4")
@app.get("/traces/{trace_id}")
async def get_trace(trace_id: str) -> Dict[str, Any]:
evs = TRACES.query_by_trace(trace_id)
return {
"trace_id": trace_id,
"events": [
{"ts": e.ts, "name": e.name, "attrs": e.attrs}
for e in sorted(evs, key=lambda x: x.ts)
],
}
@app.get("/quality/summary")
async def quality_summary() -> Dict[str, Any]:
return QUALITY.export_summary()
if __name__ == "__main__":
uvicorn.run(
"app:app",
host=os.getenv("HOST", "0.0.0.0"),
port=int(os.getenv("PORT", "8010")),
reload=False,
)
3. Run and verify
cd obs_lab
python -m uvicorn app:app --port 8010
curl -s -H "x-project-id: acme" -H "Content-Type: application/json" -d "{\"code\":\"def f():\n try:\n pass\n except:\n pass\n\"}" http://127.0.0.1:8010/reviews/demo
# with the returned trace_id:
curl -s http://127.0.0.1:8010/traces/<trace_id>
curl -s http://127.0.0.1:8010/metrics
Production: from the lab to an enterprise observability stack
- OpenTelemetry: export spans via OTLP to Tempo/Jaeger; pin service.name=codesentinel-review and deployment.environment in resource attributes.
- Logging: structured JSON with a trace_id field, shipped to Loki/ELK; never store full prompt/response text by default, only hash + length + redacted preview.
- Metrics: use the official prometheus_client or OpenTelemetry metrics; calibrate histogram buckets to the business (e.g. 0.5s/2s/10s).
- Quality evaluation: link PR comment IDs to finding_id; produce a weekly false-positive report to drive rule and prompt iteration.
- Alerting: configure multi-dimensional alerts on cs_llm_errors_total growth rate, p95 latency, and token surges; do not watch QPS alone.
- Privacy review: any third-party tracing SaaS must pass data classification and a DPA before launch; self-host when necessary.
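The structured-logging item above can be sketched with the standard library alone; the logger name and field set are assumptions to adapt to your own schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying trace_id as a field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload, ensure_ascii=False)

def build_logger() -> logging.Logger:
    logger = logging.getLogger("codesentinel")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# usage: build_logger().info("review done", extra={"trace_id": "abc-123"})
```

Passing trace_id via `extra` makes every log line joinable with the tracing store, which is the whole point of the structured format.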
Lecture recap (Mermaid mindmap)
mindmap
  root((Lecture 36: Observability))
    Tracing
      trace_id
      span events
      prompt hash
    Metrics
      token breakdown
      latency histogram
      errors & 429
    Logging
      structured
      PII redaction
      audit trail
    Quality
      human feedback
      false-positive rate
      golden-set regression
    Operations
      dashboards
      alerts & SLOs
Thought questions
- If you must retain full prompts for legal audit, how would you design the four elements: key separation, encrypted storage, access approval, and retention period?
- QualityEvaluator's tp/fp labels are expensive to collect; how would you use ticket sampling and active learning to reduce annotation cost?
- In a multi-tenant setup, what are the consequences of metrics label-cardinality explosion (one series per user_id), and what trade-off would you make?
Preview of the next lecture
Lecture 37 is the capstone for Module 5: we converge the code indexing from Lecture 33, the hybrid retrieval and reranking from Lecture 34, the streaming inference and cost control from Lecture 35, and the tracing and metrics from this lecture into a CodeSentinel AI backend that runs with a single docker compose up, plus an end-to-end test script: index a repository → submit a review → get a context-backed verdict → observe the full chain. For the first time you will see the "platform" rather than the "components".