Context Engineering · 07 · Concrete Algorithms for Compaction and Assembly

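Part 7 of the series. For the main document, see 智能体上下文工程实现.md.

The earlier parts covered why things are done this way and the policy-level semantics. This part covers the algorithm level: how each request is assembled mechanically, rule by rule, and which steps compaction runs when it is triggered, in what order, and over which data structures.

Note: the algorithms below are an engineering description reverse-engineered from behavior that I (Claude Code) can observe. Anthropic's internal details may differ, but the contracts and observable effects are the same, which is enough of a reference for anyone designing their own agent.

0. How the two algorithms relate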

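Each model call:
  ┌──────────────────────────────────────┐
  │  1. Assembly (assemble)              │  ← runs every turn
  │  ┌────────────────────────────┐      │
  │  │ 2. Compaction (compact)    │      │  ← runs only if assembly ends up over the limit
  │  └────────────────────────────┘      │
  │  3. Compute cache_control positions  │  ← final step, after compaction
  │  4. Send the request                 │
  └──────────────────────────────────────┘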

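Assembly is deterministic (same input → same output). Compaction is lossy: it may compact different segments depending on when it runs, and the LLM call that generates the summary will not necessarily converge on the same text. Both must complete serially within the same turn.

1. Data model

First, the data structures the algorithms operate on.

1.1 Block and Message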

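from __future__ import annotations  # lets Block refer to BlockMeta before it is defined

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Block:
    type: str              # "text" | "tool_use" | "tool_result" | "thinking"
    content: str | dict    # depends on type
    metadata: BlockMeta    # see below

@dataclass
class BlockMeta:
    tier: int              # 0..3 (trust tier, see Part 02)
    source: str            # "system" | "user" | "assistant" | "tool" | "hook" | "reminder"
    created_at: datetime
    tool_use_id: str | None  # required for tool_result; points back to its tool_use
    token_count: int       # estimated or exact
    pinned: bool = False   # never compacted (e.g. system_prompt, the most recent N turns)
    compactable: bool = True

@dataclass
class Message:
    role: str              # "user" | "assistant"
    content: list[Block]   # array of blocks
    turn_id: int           # which turn this message belongs to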

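1.2 Full conversation state

@dataclass
class ConversationState:
    system_blocks: list[Block]      # the blocks that make up the System Prompt
    tools: list[ToolSchema]         # tool definitions (part of the cache prefix)
    messages: list[Message]         # message stream in chronological order
    pending_injections: list[Block] # system-reminder and similar blocks waiting to be injected
    config: AssembleConfig

pending_injections is the key piece: before each assembly the harness queues up some blocks (reminder, ide_selection, hook output). They must be merged into this turn's user message rather than sent as separate messages.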

2. Assembly algorithm (Assemble)

2.1 Overall flow

def assemble_request(state: ConversationState, user_input: str) -> APIRequest:
    # Step 1: build this turn's user message (pending injections + user input)
    user_msg = build_user_message(state.pending_injections, user_input)
    state.messages.append(user_msg)
    state.pending_injections.clear()

    # Step 2: estimate the total token count
    total_tokens = estimate_tokens(state)

    # Step 3: trigger compaction if needed (should_compact is defined in §3.1)
    if should_compact(total_tokens, state.config):
        state = compact(state)  # see §3

    # Step 4: compute the cache_control anchors
    cache_anchors = compute_cache_anchors(state)

    # Step 5: serialize into an API request
    return serialize(state, cache_anchors)

2.2 build_user_message in detail

This is the step most prone to mistakes: information from several sources is merged into a single user-role message.

def build_user_message(injections: list[Block], user_input: str | None) -> Message:
    blocks = []

    # Order matters: tool_result first, injections in the middle, the user's own words last

    # (a) tool_result for each tool_use in the previous assistant turn
    for tu in get_pending_tool_uses():
        result = run_tool(tu)  # actually execute the tool
        blocks.append(Block(
            type="tool_result",
            content=result,
            metadata=BlockMeta(
                tier=3,
                source="tool",
                created_at=datetime.now(),
                tool_use_id=tu.id,
                token_count=0,  # 0 = not counted yet; estimate_tokens fills it in
            )
        ))

    # (b) system injections: reminder, ide_selection, hook output
    for inj in injections:
        # every injection is wrapped in an XML tag so the model can recognize it
        wrapped = wrap_with_tag(inj)
        blocks.append(wrapped)

    # (c) the user's own words (if any)
    if user_input:
        blocks.append(Block(
            type="text",
            content=user_input,
            metadata=BlockMeta(tier=1, source="user", created_at=datetime.now(),
                               tool_use_id=None, token_count=0)
        ))

    return Message(role="user", content=blocks, turn_id=next_turn_id())

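Note the invariants here:

tool_result must come before every other block, because the API requires it to immediately follow the matching tool_use
Injected blocks must carry a tag, see Part 02
When there is no user_input (for example in an autonomous agent loop), the message contains only tool_result blocks and injections

2.3 The wrap_with_tag algorithm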

TAG_BY_SOURCE = {
    "reminder":      "system-reminder",
    "ide_selection": "ide_selection",
    "ide_opened":    "ide_opened_file",
    "hook_prompt":   "user-prompt-submit-hook",
    "hook_pre_tool": "pre-tool-use-hook",
    # ...
}

def wrap_with_tag(block: Block) -> Block:
    tag = TAG_BY_SOURCE[block.metadata.source]
    return Block(
        type="text",
        content=f"<{tag}>\n{block.content}\n</{tag}>",
        metadata=block.metadata,
    )

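Why XML tags rather than JSON or markdown?

XML appears frequently in Anthropic's training data, so the model recognizes its boundaries most reliably
It nests well (a hook output that contains <some-other-tag> does not break the outer tag)
It stays human-readable, which helps debugging

2.4 Token estimation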

def estimate_tokens(state: ConversationState) -> int:
    total = 0

    # System Prompt and tools use exact counts (usually already cached)
    total += sum(b.metadata.token_count for b in state.system_blocks)
    total += sum(t.token_count for t in state.tools)

    # Message stream: count any block that has not been counted yet
    for msg in state.messages:
        for blk in msg.content:
            if blk.metadata.token_count == 0:
                blk.metadata.token_count = tokenizer.count(blk.content)
            total += blk.metadata.token_count

    return total

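In practice, use the tokenizer Anthropic provides or a conservative estimate (about 1 token per 4 characters, about 1.5 tokens per Chinese character). A conservative estimate that runs high is safer than one that runs low: too high costs at most one extra compaction, too low really does overflow the window.

3. Compaction algorithm (Compact)

3.1 Trigger conditions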
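The rule of thumb above can be sketched as a small fallback estimator. The function name and the CJK range check are assumptions made for illustration; the two ratios come from the paragraph above, and the result is rounded up so the estimate errs on the high side.

```python
import math

def estimate_tokens_conservative(text: str) -> int:
    """Rough estimate: ~1.5 tokens per CJK character, ~1 token per 4 other characters."""
    # Assumption: only the basic CJK Unified Ideographs range is treated as "Chinese".
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    # Round up so that we compact slightly early rather than overflow the window.
    return math.ceil(cjk * 1.5 + other / 4)
```

A real tokenizer count should replace this whenever one is available; the heuristic is only meant as a cheap first pass.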

@dataclass
class AssembleConfig:
    context_window: int        # model window, e.g. 1_000_000
    soft_threshold: float = 0.70  # soft threshold
    hard_threshold: float = 0.90  # hard threshold
    keep_recent_turns: int = 4    # the most recent N turns are never compacted
    keep_recent_tokens: int = 20_000  # or the most recent N tokens

def should_compact(total: int, cfg: AssembleConfig) -> str | None:
    if total > cfg.context_window * cfg.hard_threshold:
        return "hard"   # must compact
    if total > cfg.context_window * cfg.soft_threshold:
        return "soft"   # should compact
    return None

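How soft and hard compaction differ:

Soft: try to get below 50%, with the goal of leaving headroom for later turns
Hard: must compact until this turn's request fits, plus some buffer

3.2 Marking compactable regions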
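The two targets can be turned into a token count to free. The sketch below is one possible reading, not the actual implementation: `Thresholds`, `soft_target`, and `hard_buffer_tokens` are hypothetical names, and the size of the buffer is an assumed value.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:  # hypothetical stand-in for the relevant AssembleConfig fields
    context_window: int
    soft_threshold: float = 0.70
    hard_threshold: float = 0.90
    soft_target: float = 0.50        # "below 50%" from the text above
    hard_buffer_tokens: int = 8_000  # assumed buffer size

def compute_target(total: int, cfg: Thresholds) -> int:
    """Return how many tokens compaction should free (0 if none)."""
    hard_limit = int(cfg.context_window * cfg.hard_threshold)
    soft_limit = int(cfg.context_window * cfg.soft_threshold)
    if total > hard_limit:
        # Hard: get back under the hard limit with some buffer to spare.
        return total - (hard_limit - cfg.hard_buffer_tokens)
    if total > soft_limit:
        # Soft: aim for the 50% mark to leave headroom for later turns.
        return total - int(cfg.context_window * cfg.soft_target)
    return 0
```

With a 1M window, a request at 850k tokens falls in the soft band and needs to free at least 350k tokens to reach the 50% mark.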

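def mark_compactable_regions(state: ConversationState) -> list[Region]:
    cfg = state.config

    # Walk backwards to find where the protected "recent" window starts.
    # The boundary is recomputed on every call instead of being stored in
    # `pinned`, so blocks age out of the window on later turns.
    budget = cfg.keep_recent_tokens
    first_recent = len(state.messages)
    while first_recent > 0 and budget > 0:
        first_recent -= 1
        budget -= sum(b.metadata.token_count
                      for b in state.messages[first_recent].content)

    # Scan the older messages in chronological order so that each region
    # keeps its blocks in their original order.
    regions = []
    current = []
    for msg in state.messages[:first_recent]:
        for blk in msg.content:
            if not blk.metadata.compactable or blk.metadata.pinned:
                # hit a non-compactable block: close the current region
                if current:
                    regions.append(Region(blocks=current))
                    current = []
            else:
                current.append(blk)

    if current:
        regions.append(Region(blocks=current))

    return regions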

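Which blocks are pinned or not compactable?

Block type	Default	Rationale
System Prompt	pinned	Never compacted
Tool schema	pinned	Part of the cache prefix
User's own words (bare text)	compactable=False	Needed to judge intent
`<system-reminder>`	compactable	Can be compacted (injections are transient)
`tool_result` (small, <1KB)	compactable	Low priority (little to gain)
`tool_result` (large, >10KB)	compactable	High priority (large savings)
`assistant` text output	compactable=False	It is the model's "self-state"
`tool_use`	must be kept as long as its tool_result is	Otherwise the tool_result loses its back-reference

3.3 Select and summarize: the core compaction step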

def compact(state: ConversationState) -> ConversationState:
    target_reduction = compute_target(state)  # how many tokens need to be freed
    regions = mark_compactable_regions(state)

    # Sort by token count, largest first
    regions.sort(key=lambda r: r.token_count, reverse=True)

    saved = 0
    for region in regions:
        if saved >= target_reduction:
            break

        summary = summarize_region(region)  # see §3.4
        replace_region_with_summary(state, region, summary)
        saved += region.token_count - summary.metadata.token_count

    return state

3.4 summarize_region: using an LLM to compact LLM context

This is the most critical and most subtle sub-algorithm. The summary is generated by a separate LLM call:

SUMMARIZE_PROMPT = """\
You are compacting a software engineering agent's conversation history.
Below are several turns that need to be replaced with a shorter summary.

Preserve:
- File paths, line numbers, function names mentioned
- Decisions made by the agent (what approach was chosen)
- Key facts the agent learned (test results, error messages, configuration values)
- User's stated requirements or constraints

Drop:
- Tool call mechanics (which tool, what params) — keep only the findings
- Repeated content
- Internal deliberation

Write under {target_tokens} tokens. Use bullet points.

<conversation_segment>
{region_content}
</conversation_segment>
"""

def summarize_region(region: Region) -> Block:
    target = region.token_count // 5  # aim for 1/5 of the original size
    prompt = SUMMARIZE_PROMPT.format(
        region_content=serialize_blocks(region.blocks),
        target_tokens=target,
    )
    summary_text = llm_call(
        model="claude-haiku-4-5",  # a small, cheap, fast model
        prompt=prompt,
        max_tokens=target + 200,    # leave a little headroom
    )
    content = f"[compacted summary of turns {region.first_turn}-{region.last_turn}]\n{summary_text}"
    return Block(
        type="text",
        content=content,
        metadata=BlockMeta(
            tier=3,                  # the summary is also Tier 3, since it contains tool output
            source="compactor",
            created_at=datetime.now(),
            tool_use_id=None,
            token_count=tokenizer.count(content),
        ),
    )

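A few engineering points:

Use Haiku to compact Opus's context: the cost is far lower than "rerunning Opus"
The 1/5 target ratio is empirical: too aggressive (1/10) loses key facts, too gentle (1/2) saves little
Tell the summarizer explicitly what to keep: the "Preserve" list in the prompt above is a product decision
Keep the turn-range label: it makes it easy to map back to the original conversation when debugging later

3.5 Side effect of replacement: tool_use ↔ tool_result consistency

The most dangerous edge case: compaction may remove a tool_result but not its tool_use (or the reverse), and the API then rejects the request.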

def replace_region_with_summary(state, region, summary):
    # Collect every tool_use_id inside the region
    tool_use_ids_in_region = collect_tool_use_ids(region)

    # Check whether the tool_result for each of these tool_use blocks is also in the region
    for tu_id in tool_use_ids_in_region:
        result = find_tool_result(state, tu_id)
        if result and result not in region.blocks:
            # The pair is split: either pull the result into the region too, or skip this region
            extend_region(region, result)

    # Now the replacement is safe
    do_replace(state, region, summary)

The simple rule: a tool_use and its tool_result live and die together. Either both stay, or both are absorbed into the summary.

3.6 No repeated compaction
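A pairing invariant like this is easy to check mechanically. Below is a minimal sketch that works on messages already serialized into the API's role/content shape. The function names are hypothetical, and it assumes every `content` is a list of block dicts and that the list is about to be sent (so no tool_use is still waiting for its result).

```python
def _ids(msg: dict, block_type: str, key: str) -> set:
    """Collect the given id field from all blocks of one type in a message."""
    return {b[key] for b in msg["content"] if b.get("type") == block_type}

def check_tool_pairing(messages: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the invariant holds."""
    problems = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            uses = _ids(msg, "tool_use", "id")
            nxt = messages[i + 1] if i + 1 < len(messages) else None
            results = (_ids(nxt, "tool_result", "tool_use_id")
                       if nxt and nxt["role"] == "user" else set())
            for tid in sorted(uses - results):
                problems.append(f"message {i}: tool_use {tid} has no tool_result in the next message")
        else:
            results = _ids(msg, "tool_result", "tool_use_id")
            prev = messages[i - 1] if i > 0 else None
            uses = (_ids(prev, "tool_use", "id")
                    if prev and prev["role"] == "assistant" else set())
            for tid in sorted(results - uses):
                problems.append(f"message {i}: tool_result {tid} has no tool_use in the previous message")
    return problems
```

Running a check of this kind after every operation that rewrites the message stream (compaction, truncation) catches a split pair before the API does.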

def compact(state):
    ...
    summary_block.metadata.compactable = False  # the summary itself is never compacted again

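If summaries could themselves be compacted, long sessions would produce "a summary of a summary of a summary" → information decays exponentially → the model loses its memory entirely. So summary blocks are marked as not compactable.

The cost: a very long session eventually has nothing left to compact. At that point the user should start a new session and carry on via Memory (see Part 04).

4. Computing the cache_control anchors

Once compaction is done, the last step is to decide where to place cache_control.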
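To make the exponential decay concrete, here is a small arithmetic sketch. It assumes every pass shrinks the text by the same 1/5 ratio used in §3.4; the function name is hypothetical.

```python
def tokens_after_passes(original: int, passes: int, divisor: int = 5) -> int:
    """Tokens left after `passes` rounds of compaction at a 1/divisor ratio."""
    return original // divisor ** passes

# A 100k-token region shrinks to 20k, then 4k, then 800 tokens.
sizes = [tokens_after_passes(100_000, k) for k in range(4)]
```

After three passes less than 1% of the original token budget is left to carry the same history, which is why summaries are excluded from further compaction.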

def compute_cache_anchors(state: ConversationState) -> list[tuple[str, int]]:
    anchors = []

    # Anchor 1: end of System (after the environment context)
    anchors.append(("system", len(state.system_blocks) - 1))

    # Anchor 2 (optional): the last stable "milestone" tool_result
    #   What counts as a milestone? Usually large, referenced, rarely changing content
    milestone = find_last_milestone(state.messages)
    if milestone:
        anchors.append(("messages", milestone.index))

    # Anchor 3: the message just before the latest user message
    #   (so that everything before user_input is cached)
    if len(state.messages) >= 2:
        anchors.append(("messages", len(state.messages) - 2))

    # At most 4 breakpoints: rank by expected benefit and keep the top 4
    anchors = rank_anchors_by_expected_hit_rate(anchors)
    return anchors[:4]

4.1 Heuristics for ranking anchors

Not every anchor is worth using. Rank them by "expected hit rate × cached content size":

def expected_value(anchor):
    cached_size = tokens_before(anchor)        # how many tokens this anchor can cache
    hit_probability = estimate_hit_prob(anchor) # hit rate on the next request
    return cached_size * hit_probability

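Heuristics for estimating hit_probability:

The more stable the content before the anchor → the higher the probability
The more future turns can still hit the anchor → the higher the probability
How long ago the anchor was placed (skip anchors close to the 5-minute TTL)

5. Serialization and sending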
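The ranking can be sketched as follows. `AnchorCandidate`, `pick_anchors`, and the 30-second margin are assumptions for illustration; the hit probabilities are taken as given inputs, since how to estimate them is exactly the heuristic part described above.

```python
from dataclasses import dataclass

CACHE_TTL_SECONDS = 300  # the 5-minute TTL mentioned above

@dataclass
class AnchorCandidate:          # hypothetical structure for this sketch
    position: tuple[str, int]   # e.g. ("system", 2) or ("messages", 17)
    cached_tokens: int          # tokens covered by the prefix ending here
    hit_probability: float      # estimated chance the next request reuses it
    idle_seconds: float = 0.0   # time since this prefix was last written or read

def pick_anchors(cands: list[AnchorCandidate], limit: int = 4,
                 ttl_margin: float = 30.0) -> list[tuple[str, int]]:
    # Skip prefixes that are about to expire anyway.
    live = [c for c in cands if c.idle_seconds < CACHE_TTL_SECONDS - ttl_margin]
    # Rank by expected value = cached size × hit probability.
    ranked = sorted(live, key=lambda c: c.cached_tokens * c.hit_probability,
                    reverse=True)
    return [c.position for c in ranked[:limit]]
```

A large prefix with a mediocre hit probability can still outrank a small, almost certain one, which is the point of multiplying the two factors.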

def serialize(state, cache_anchors) -> APIRequest:
    req = {
        "model": state.config.model,
        "system": [],
        "tools": state.tools,
        "messages": [],
    }

    # System
    for i, blk in enumerate(state.system_blocks):
        item = {"type": "text", "text": blk.content}
        if ("system", i) in cache_anchors:
            item["cache_control"] = {"type": "ephemeral"}
        req["system"].append(item)

    # Messages
    for i, msg in enumerate(state.messages):
        content = []
        for blk in msg.content:
            content.append(serialize_block(blk))
        msg_obj = {"role": msg.role, "content": content}
        if ("messages", i) in cache_anchors:
            # cache_control goes on the last block of the content array
            msg_obj["content"][-1]["cache_control"] = {"type": "ephemeral"}
        req["messages"].append(msg_obj)

    return req

6. Full sequence diagram

┌─────────┐              ┌─────────┐                    ┌──────────┐
│  User   │              │ Harness │                    │ Anthropic│
└────┬────┘              └────┬────┘                    └────┬─────┘
     │ "Fix the login bug"    │                              │
     │───────────────────────>│                              │
     │                        │                              │
     │                        │ 1. build_user_message        │
     │                        │  - collect pending_inj       │
     │                        │  - add ide_selection         │
     │                        │  - add the user's text       │
     │                        │                              │
     │                        │ 2. estimate_tokens           │
     │                        │  total = 850k / 1M           │
     │                        │  → triggers soft compaction  │
     │                        │                              │
     │                        │ 3. compact()                 │
     │                        │  - mark pinned               │
     │                        │  - pick largest region       │
     │                        │  - Haiku call for summary    │
     │                        │  - replace + mark uncompact  │
     │                        │  total → 350k                │
     │                        │                              │
     │                        │ 4. compute_cache_anchors     │
     │                        │                              │
     │                        │ 5. serialize + send          │
     │                        │─────────────────────────────>│
     │                        │                              │
     │                        │      response (with tool_use)│
     │                        │<─────────────────────────────│
     │                        │                              │
     │                        │ 6. run tools → next turn     │
     │                        │                              │
     │      final reply       │                              │
     │<───────────────────────│

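7. Key edge cases

7.1 A single message exceeds the window

If the user pastes an 800k-token log in one go, a single user message alone comes close to the window.

Handling:

Write the oversized user message to a temporary file and have the user message reference only the path
Read fragments on demand with the Read tool
Or call a summarize tool to condense it into a summary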

def handle_oversize_input(text: str) -> str:
    if len(text) > LARGE_INPUT_THRESHOLD:
        path = save_to_tmpfile(text)
        return f"[Large input saved to {path}, {len(text)} chars. Use Read tool to access.]"
    return text

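7.2 Tool output explosion

A single Bash call returns 10MB of output → concatenating it directly would blow up the context.

Handling (this belongs in the tool layer):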

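import subprocess

def run_bash_tool(cmd):
    # capture_output=True is required; without it .stdout is None.
    # shell=True assumes cmd is a single command string.
    output = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    if len(output) > BASH_MAX_OUTPUT:
        truncated = output[:BASH_MAX_OUTPUT]
        return f"{truncated}\n\n[Output truncated, {len(output)} chars total. Save to file with > if you need full output.]"
    return output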

This is why tool descriptions carry limits such as head_limit (default 250 lines): output is throttled at the source.

7.3 The LLM call fails during compaction

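def compact_with_fallback(state):
    try:
        return compact(state)
    except LLMError:
        # Fallback: crude truncation instead of a smart summary
        target_reduction = compute_target(state)
        return truncate_oldest_compactable(state, target_reduction)

The fallback simply drops the oldest compactable segments and leaves a marker, "[Earlier turns truncated due to compactor error]". Ugly, but it works.

7.4 Detecting cache_control failure

If the cache misses several turns in a row (the cache_read_input_tokens returned by the API stays at 0):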
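The fallback can be sketched on a simplified block list. Everything here is illustrative: blocks are plain dicts with `text`, `tokens`, and `compactable` keys, the notice's token count is a rough placeholder, and a real version would also have to keep tool_use and tool_result pairs together (§3.5).

```python
NOTICE = "[Earlier turns truncated due to compactor error]"

def truncate_oldest(blocks: list[dict], target_reduction: int) -> list[dict]:
    """Drop the oldest compactable blocks until enough tokens are freed.

    `blocks` is in chronological order. A single notice block is left
    where the first dropped block used to be.
    """
    kept = []
    freed = 0
    notice_added = False
    for blk in blocks:
        if blk["compactable"] and freed < target_reduction:
            freed += blk["tokens"]
            if not notice_added:
                # Placeholder token count; recount with a tokenizer in practice.
                kept.append({"text": NOTICE, "tokens": 12, "compactable": False})
                notice_added = True
        else:
            kept.append(blk)
    return kept
```

Non-compactable blocks in the truncated range (such as the user's own words) are kept in place, so the model still sees what was asked even though the intermediate work is gone.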

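def adapt_cache_strategy(state):
    recent_misses = count_recent_cache_misses(state)
    if recent_misses >= 3:
        log.warning("Cache thrashing detected, investigating")
        # Possible causes:
        # - a dynamic field (such as a timestamp) was wrongly injected into the System Prompt
        # - the tool set keeps changing
        # - the cache_control positions are badly chosen
        diagnose_cache_thrash(state)

This is "health monitoring" at the context-engineering layer, and it should be standard equipment for a production agent.

8. Simplified pseudocode (directly implementable)

The logic split up above, gathered into a minimal runnable prototype: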
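Counting misses only needs the usage data the API already returns. The sketch below assumes the harness keeps one usage dict per response; the function name and the `lookback` parameter are hypothetical, while `cache_read_input_tokens` is the field named above.

```python
def count_trailing_cache_misses(usages: list[dict], lookback: int = 3) -> int:
    """Count how many of the most recent responses in a row read nothing from cache."""
    misses = 0
    for usage in reversed(usages[-lookback:]):
        # A missing or null field is treated the same as 0.
        if (usage.get("cache_read_input_tokens") or 0) > 0:
            break
        misses += 1
    return misses
```

The very first request of a session always counts as a miss because nothing has been written to the cache yet, so a threshold of 1 would be too sensitive.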

def agent_turn(state, user_input):
    # 1. Assemble
    state.messages.append(build_user_message(state.pending_injections, user_input))
    state.pending_injections.clear()

    # 2. Compact (if needed)
    while estimate_tokens(state) > state.config.context_window * 0.9:
        regions = mark_compactable_regions(state)
        if not regions:
            raise ContextOverflow("Cannot compact further; start new session")
        biggest = max(regions, key=lambda r: r.token_count)
        summary = summarize_region(biggest)
        replace_region_with_summary(state, biggest, summary)

    # 3. Cache anchors
    anchors = compute_cache_anchors(state)

    # 4. Call the API (client = anthropic.Anthropic(); max_tokens is required)
    response = client.messages.create(max_tokens=MAX_OUTPUT_TOKENS, **serialize(state, anchors))

    # 5. Handle tool_use
    state.messages.append(Message("assistant", response.content, turn_id=next_turn_id()))
    if response.stop_reason == "tool_use":
        # The next build_user_message call handles these tool_use blocks
        return agent_turn(state, None)  # recurse without user input
    return response

The core fits in about 200 lines. The complexity of a production implementation lies in edge cases, error recovery, monitoring, and concurrency.

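9. How this maps to the earlier parts

Mapping the algorithms back to the concepts:

Algorithm step	Concept	Source
`build_user_message` multi-block merge	Merging information from multiple sources	Main document §1.5
`wrap_with_tag`	Trust tiers + tagging	Part 02
`mark_compactable_regions`	The pinned (never compacted) list	Main document §1.6.2
`summarize_region` with a small model	Active vs passive compaction	Part 03
`compute_cache_anchors`	5-minute TTL + 4 breakpoints	Part 01
`keep_recent_turns` not compacted	Preserving the recent window	Main document §1.6.2
`summary.compactable=False`	Preventing exponential decay	Main document §1.6.3
`handle_oversize_input`	Information density of tool results	Main document §2.2

Every concept has a corresponding place in the code. That is the mark of two-way traceability between concept and implementation.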

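10. Transferable rules for agent implementers

If you are implementing these algorithms from scratch:

Build out the Block metadata first: tier, source, tool_use_id, token_count, and pinned. None of the five can be skipped, because every later algorithm depends on them.
Keep token estimates conservative: better to compact one turn early than to overflow.
Enforce the tool_use ↔ tool_result pairing invariant: write a checker function and run it after every operation on the message stream.
Never compact a summary block a second time: bake this into the schema.
Compact with a cheap small model: using the same large model for both reasoning and compaction is a bad deal on cost and latency.
Give compaction a fallback path: truncation beats a crash.
Rank cache_control anchors by "coverage size × hit probability" rather than placing them at random.
Health monitoring: logs should carry at least three metrics, cache_hit_rate, compact_count, and token_usage.
Throttle in the tool layer: handling head_limit, max_output and the like at the source is far cheaper than handling them after assembly.
Cover the edge cases in tests: a single oversized message, a very long session, a broken pairing after compaction, cache thrash. Skip these and production will hit them.

11. One-sentence summary

Assembly is a deterministic multi-source merge; compaction is a lossy select-summarize-replace. Neither algorithm is complicated, but every step has invariants and edge cases to guard. Writing them as 200 lines of clean code is not hard. Running them reliably in production takes the engineering work "outside the algorithm": logging, fallbacks, and health checks.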

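Full series index

#	Topic
Main	智能体上下文工程实现 (Implementing Context Engineering for Agents)
01	Prompt Cache and cost
02	Injection and trust boundaries
03	Subagent isolation
04	Plan Mode and the Todo state machine
05	Hooks and external signals
06	Knowledge cutoff and time awareness
07	Concrete algorithms for compaction and assembly (this part)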