Advanced Agent Development (Part 11): Errors Are Not Exceptions, They Are a Normal Branch of the Main Loop

This is the eleventh article in the "Building a Coding Agent from Scratch" series. It is for developers who want their Agent to handle errors gracefully.

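A question first

What happens when your Agent runs into one of these situations?

The model's output is cut off halfway
The context is too long and the request fails outright
The network hiccups and the API times out or rate-limits you

If your Agent crashes or simply stops, you need an error recovery mechanism.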

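The Agent's "crash dilemma"

By this stage, your Agent already has a range of capabilities:

The core loop
Tool use and dispatch
In-session planning
Subagents
Skill loading
Context compaction
A permission system
A hook system
A memory system
System prompt assembly

But as the system gains features, it also gains ways to fail:

Model output can be truncated by the token limit
The context can exceed the model's window
The network can be temporarily unstable
The API can rate-limit requests

Without error recovery, the main loop stops dead at the first error. Users conclude that the Agent is unstable, or even blame the model.

In practice, many failures do not mean the task has really failed. They only mean:

This turn needs a different way to continue.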

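The core design of error recovery: classify, then pick a recovery path

The recovery workflow as a diagram:

LLM call
  |
  +-- stop_reason == "max_tokens"
  |      -> inject a continuation prompt
  |      -> try again
  |
  +-- prompt too long
  |      -> compact the older context
  |      -> try again
  |
  +-- timeout / rate limit / transient API error
         -> wait a moment
         -> try again

There is only one key point:

Classify the error first, run the recovery next, and expose the failure to the user only as a last resort.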

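A few concepts you need to understand

Recovery

Recovery does not mean hiding every error.

Recovery means:

First decide whether the problem is temporary
If it is, attempt a remedial action a limited number of times
If the remedy fails, tell the user clearly that it failed

Retry Budget

A retry budget is simply "the maximum number of tries".

For example:

At most 3 continuations
At most 3 network reconnects

Without this budget, the program could loop forever.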
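
To make the budget idea concrete, here is a minimal sketch of a budgeted retry helper. The names `call_with_budget` and `BudgetExhausted` are hypothetical and not part of this series' code, and a real loop would retry only errors it has already classified as transient rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExhausted(Exception):
    """Raised when every attempt allowed by the budget has failed."""


def call_with_budget(
    action: Callable[[], T],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> T:
    """Run `action` until it succeeds or the budget is used up."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception as exc:  # illustration only: retries every error
            last_error = exc
            if attempt < max_attempts - 1:
                time.sleep(wait_seconds)
    # The budget is spent: surface the failure instead of looping forever.
    raise BudgetExhausted(f"gave up after {max_attempts} attempts") from last_error


attempts = {"count": 0}


def flaky_call() -> str:
    """Fails twice with a simulated timeout, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_budget(flaky_call, max_attempts=3)
print(result, attempts["count"])  # ok 3
```

The important property is the final `raise`: once the budget is gone, the caller hears about it explicitly.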

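State Machine

"State machine" sounds like a big term, but the idea is simple:

A thing switches between a few well-defined states according to rules.

In this chapter, the main loop goes from plain "normal execution" to:

Normal execution
Continuation recovery
Compaction recovery
Backoff retry
Final failure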
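
The five states above can be written down explicitly. The sketch below is one possible encoding, not the article's implementation: `LoopState`, `TRANSITIONS`, and `transition` are hypothetical names, and the transition table assumes that every recovery state either returns to normal execution or ends in final failure.

```python
from enum import Enum


class LoopState(Enum):
    RUNNING = "running"            # normal execution
    CONTINUING = "continuing"      # continuation recovery
    COMPACTING = "compacting"      # compaction recovery
    BACKING_OFF = "backing_off"    # backoff retry
    FAILED = "failed"              # final failure


# Assumed rules: recovery states lead back to RUNNING or to FAILED,
# and FAILED is terminal.
TRANSITIONS = {
    LoopState.RUNNING: {
        LoopState.RUNNING,
        LoopState.CONTINUING,
        LoopState.COMPACTING,
        LoopState.BACKING_OFF,
        LoopState.FAILED,
    },
    LoopState.CONTINUING: {LoopState.RUNNING, LoopState.FAILED},
    LoopState.COMPACTING: {LoopState.RUNNING, LoopState.FAILED},
    LoopState.BACKING_OFF: {LoopState.RUNNING, LoopState.FAILED},
    LoopState.FAILED: set(),
}


def transition(current: LoopState, target: LoopState) -> LoopState:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


state = LoopState.RUNNING
state = transition(state, LoopState.BACKING_OFF)  # a timeout happened
state = transition(state, LoopState.RUNNING)      # the retry goes back to normal
print(state.name)  # RUNNING
```

The minimal implementation in this chapter keeps these states implicit in `if` branches; spelling them out like this is mainly useful for testing and logging.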

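Minimal implementation

1. The recovery decider

def choose_recovery(stop_reason: str | None, error_text: str | None) -> dict:
    """Pick a recovery strategy based on the kind of failure."""
    # Normalize case so callers can pass str(e) directly.
    error_text = error_text.lower() if error_text else None

    # The output hit the token limit: ask the model to continue.
    if stop_reason == "max_tokens":
        return {"kind": "continue", "reason": "output truncated"}

    # The request did not fit in the context window: compact and retry.
    if error_text and "prompt" in error_text and "long" in error_text:
        return {"kind": "compact", "reason": "context too large"}

    # Likely a temporary transport problem: wait and retry.
    if error_text and any(word in error_text for word in [
        "timeout", "rate", "unavailable", "connection", "limit"
    ]):
        return {"kind": "backoff", "reason": "transient transport failure"}

    return {"kind": "fail", "reason": "unknown or non-recoverable error"}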
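
Matching on error text is simple but brittle, since wording changes between providers and versions. If your client library exposes the HTTP status code of a failed request, classifying on that code may be sturdier. The sketch below is only an illustration: `classify_http_status` is a hypothetical helper, and the rule for over-long prompts (a 400 or 413 whose message contains "too long") is an assumption you should check against your provider's documentation.

```python
# Standard HTTP codes that usually signal a temporary condition:
# 408 Request Timeout, 429 Too Many Requests, and common 5xx server errors.
TRANSIENT_STATUS_CODES = {408, 429, 500, 502, 503, 504}


def classify_http_status(status_code: int, message: str = "") -> str:
    """Map an HTTP status code (plus message) to a recovery kind."""
    text = message.lower()

    if status_code in TRANSIENT_STATUS_CODES:
        return "backoff"

    # Assumption: over-long prompts arrive as 400 or 413 with "too long"
    # somewhere in the message.
    if status_code in (400, 413) and "too long" in text:
        return "compact"

    return "fail"


print(classify_http_status(429))                        # backoff
print(classify_http_status(400, "Prompt is too long"))  # compact
print(classify_http_status(401, "invalid api key"))     # fail
```

Authentication and permission errors fall through to "fail" on purpose: retrying them only wastes the budget.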

2. Recovery state management

class RecoveryState:
    """Tracks how many times each recovery path has been attempted."""

    def __init__(self):
        self.continuation_attempts = 0
        self.compact_attempts = 0
        self.transport_attempts = 0
        # Shared retry budget: each path may be tried at most this many times.
        self.max_attempts = 3

    def can_continue(self) -> bool:
        return self.continuation_attempts < self.max_attempts

    def can_compact(self) -> bool:
        return self.compact_attempts < self.max_attempts

    def can_backoff(self) -> bool:
        return self.transport_attempts < self.max_attempts

    def increment_continuation(self):
        self.continuation_attempts += 1

    def increment_compact(self):
        self.compact_attempts += 1

    def increment_transport(self):
        self.transport_attempts += 1

3. Recovery actions

import time
import random

# Continuation prompt
CONTINUE_MESSAGE = (
    "Output limit hit. Continue directly from where you stopped. "
    "Do not restart or repeat."
)

def auto_compact(messages: list) -> list:
    """Replace the whole history with a single summary message."""
    # Naive summary: keep the first 50 characters of each text message.
    # A real project could ask the model for a smarter summary.
    def summarize_messages(msgs):
        labels = {"user": "User", "assistant": "Assistant"}
        summary = []
        for msg in msgs:
            label = labels.get(msg.get("role"))
            content = msg.get("content", "")
            # Skip other roles and non-text content such as tool-result lists.
            if label is None or not isinstance(content, str):
                continue
            text = content if len(content) <= 50 else f"{content[:50]}..."
            summary.append(f"{label}: {text}")
        return "\n".join(summary)

    summary = summarize_messages(messages)
    return [{
        "role": "user",
        "content": f"This session was compacted. Continue from this summary:\n{summary}",
    }]

def backoff_delay(attempt: int) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... capped at 30s, plus up to 1s of random jitter."""
    return min(1.0 * (2 ** attempt), 30.0) + random.uniform(0, 1)
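
To see what the backoff formula produces, the sketch below computes the same schedule with the random jitter left out so the numbers are reproducible. `base_delay` is a hypothetical helper that mirrors the deterministic part of `backoff_delay`.

```python
def base_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Deterministic part of the backoff: base * 2^attempt, capped."""
    return min(base * (2 ** attempt), cap)


# Delays for attempts 0..5 before jitter is added.
schedule = [base_delay(n) for n in range(6)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]

# With a budget of 3 attempts and at most 1s of jitter per attempt,
# the loop sleeps for at most (1 + 2 + 4) + 3 = 10 seconds in total.
worst_case_total = sum(base_delay(n) + 1.0 for n in range(3))
print(worst_case_total)  # 10.0
```

The jitter matters when many clients fail at the same moment: without it they would all retry on the same schedule and hit the API together again.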

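4. Wiring it into the main loop

def agent_loop_with_recovery(state):
    """Main loop with error recovery.

    client, MODEL, and run_tool are assumed to be defined as in earlier chapters.
    """
    recovery_state = RecoveryState()
    last_error = ""

    while True:
        try:
            # Call the model
            response = client.messages.create(
                model=MODEL,
                system=state.get("system", ""),
                messages=state.get("messages", []),
                tools=state.get("tools", []),
                max_tokens=1000
            )

            # A successful call needs recovery only when the output was truncated.
            # Any other stop_reason would make choose_recovery return "fail".
            if response.stop_reason == "max_tokens":
                decision = choose_recovery(response.stop_reason, None)
            else:
                decision = {"kind": "none", "reason": ""}

        except Exception as e:
            # Python unbinds `e` when the except block ends, so keep the text.
            last_error = str(e)
            decision = choose_recovery(None, last_error.lower())
            response = None

        # Apply the recovery strategy
        if decision["kind"] == "continue":
            if recovery_state.can_continue():
                print(f"[Recovery] Output truncated, continuing... (attempt {recovery_state.continuation_attempts + 1}/{recovery_state.max_attempts})")
                recovery_state.increment_continuation()
                # Keep the partial output so the model can see where it stopped.
                # This assumes the truncated content holds no incomplete tool_use block.
                state["messages"].append({"role": "assistant", "content": response.content})
                state["messages"].append({"role": "user", "content": CONTINUE_MESSAGE})
                continue
            else:
                print("[Recovery] Continuation attempts exhausted")
                return "Error: output recovery failed, maximum attempts reached"

        if decision["kind"] == "compact":
            if recovery_state.can_compact():
                print(f"[Recovery] Context too long, compacting... (attempt {recovery_state.compact_attempts + 1}/{recovery_state.max_attempts})")
                recovery_state.increment_compact()
                state["messages"] = auto_compact(state["messages"])
                continue
            else:
                print("[Recovery] Compaction attempts exhausted")
                return "Error: context compaction failed, maximum attempts reached"

        if decision["kind"] == "backoff":
            if recovery_state.can_backoff():
                delay = backoff_delay(recovery_state.transport_attempts)
                print(f"[Recovery] Network error, backing off... (waiting {delay:.2f}s, attempt {recovery_state.transport_attempts + 1}/{recovery_state.max_attempts})")
                recovery_state.increment_transport()
                time.sleep(delay)
                continue
            else:
                print("[Recovery] Backoff attempts exhausted")
                return f"Error: network recovery failed, maximum attempts reached: {last_error}"

        if decision["kind"] == "fail":
            print(f"[Recovery] Cannot recover, giving up: {decision['reason']}")
            return f"Error: {decision['reason']}"

        # Normal handling of tool calls
        if response and response.stop_reason == "tool_use":
            # Record the assistant turn first: tool results must follow its tool_use blocks.
            state["messages"].append({"role": "assistant", "content": response.content})
            results = []
            for block in response.content:
                if hasattr(block, "type") and block.type == "tool_use":
                    # Run the tool
                    output = run_tool(block.name, block.input)
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": output
                    })

            if results:
                state["messages"].append({"role": "user", "content": results})
        else:
            # Normal completion
            return response.content if response else "No response"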

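The three recovery paths in detail

1. Output truncated - continuation recovery

Scenario: the model runs out of tokens partway through its output

Recovery actions:

Inject a continuation prompt
Tell the model not to repeat itself or summarize again
Continue directly from the point of interruption

Key code:

if response.stop_reason == "max_tokens":
    # Keep the partial output in the history, then ask for the rest
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": CONTINUE_MESSAGE})
    continue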
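
Even with a clear continuation prompt, a model may occasionally repeat the last few words before carrying on. When you stitch the pieces back together, one possible safeguard is to drop a repeated overlap. The sketch below works on plain strings; `merge_with_overlap` and its `min_overlap` threshold are hypothetical choices, and the threshold exists so that short accidental matches are not trimmed by mistake.

```python
def merge_with_overlap(
    previous: str,
    continuation: str,
    min_overlap: int = 10,
    max_overlap: int = 200,
) -> str:
    """Join two output segments, removing text repeated at the seam."""
    limit = min(len(previous), len(continuation), max_overlap)
    # Try the longest candidate overlap first, down to the minimum size.
    for size in range(limit, min_overlap - 1, -1):
        if previous.endswith(continuation[:size]):
            return previous + continuation[size:]
    # No sufficiently long repeat found: plain concatenation.
    return previous + continuation


first = "The quick brown fox jumps over"
second = "fox jumps over the lazy dog"
print(merge_with_overlap(first, second))
# The quick brown fox jumps over the lazy dog
```

If the model follows the prompt and does not repeat anything, the function falls back to simple concatenation.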

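2. Context too long - compaction recovery

Scenario: the request is too large to fit in the model's window

Recovery actions:

Compact the older context
Generate a summary
Replace the original text with the summary

Key code:

if "prompt too long" in error_text:
    messages = auto_compact(messages)
    continue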
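
The minimal `auto_compact` in this chapter replaces the entire history with a summary. A common variation is to summarize only the older messages and keep the most recent ones verbatim. The sketch below shows that idea under stated assumptions: `compact_keep_recent` and `keep_last` are hypothetical names, messages are plain dicts with string content, and it does not try to keep tool_use and tool_result pairs together, which a real implementation would need to handle.

```python
def compact_keep_recent(messages: list, keep_last: int = 4) -> list:
    """Summarize older messages and keep the last `keep_last` unchanged."""
    split = max(len(messages) - keep_last, 0)
    if split == 0:
        # Nothing old enough to compact.
        return list(messages)

    old, recent = messages[:split], messages[split:]

    # Naive summary: role plus the first 50 characters of each text message.
    lines = []
    for msg in old:
        content = msg.get("content", "")
        if isinstance(content, str):
            lines.append(f"{msg.get('role', 'unknown')}: {content[:50]}")

    summary_message = {
        "role": "user",
        "content": "This session was compacted. Continue from this summary:\n"
        + "\n".join(lines),
    }
    return [summary_message] + recent


history = [
    {"role": "user", "content": f"message {i}"} if i % 2 == 0
    else {"role": "assistant", "content": f"reply {i}"}
    for i in range(6)
]
compacted = compact_keep_recent(history, keep_last=2)
print(len(compacted))  # 3
```

Keeping the recent turns intact usually preserves the details the model needs for its very next step, at the cost of a smaller size reduction.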

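3. Network hiccup - backoff retry

Scenario: network timeout, API rate limiting, or a temporarily unavailable service

Recovery actions:

Wait for a while (back off)
Try again

Key code:

if "timeout" in error_text or "rate limit" in error_text:
    time.sleep(backoff_delay(attempt))
    continue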

The 5 mistakes beginners make most often

1. Treating every error as the same kind of error

# ❌ Wrong
try:
    response = client.messages.create(...)
except Exception as e:
    # Every error gets retried
    time.sleep(1)
    continue

# ✅ Right
try:
    response = client.messages.create(...)
except Exception as e:
    decision = choose_recovery(None, str(e).lower())
    if decision["kind"] == "backoff":
        time.sleep(backoff_delay(attempt))
        continue
    elif decision["kind"] == "compact":
        messages = auto_compact(messages)
        continue
    else:
        return f"Error: {str(e)}"

2. No retry budget

# ❌ Wrong
while True:
    try:
        response = client.messages.create(...)
        break
    except:
        time.sleep(1)
        # Retries forever

# ✅ Right
recovery_state = RecoveryState()
response = None
while recovery_state.can_backoff():
    try:
        response = client.messages.create(...)
        break
    except Exception:
        time.sleep(backoff_delay(recovery_state.transport_attempts))
        recovery_state.increment_transport()
# If the budget ran out, response is still None here and must be reported as a failure

3. A continuation prompt that is too vague

# ❌ Wrong
CONTINUE_MESSAGE = "Continue"

# ✅ Right
CONTINUE_MESSAGE = (
    "Output limit hit. Continue directly from where you stopped. "
    "Do not restart or repeat."
)

4. Not telling the model that this is a resumed session after compaction

# ❌ Wrong
def auto_compact(messages):
    summary = summarize(messages)
    return [{"role": "user", "content": summary}]

# ✅ Right
def auto_compact(messages):
    summary = summarize(messages)
    return [{
        "role": "user",
        "content": f"This session was compacted. Continue from this summary:\n{summary}"
    }]

5. No logging at all during recovery

# ❌ Wrong
if decision["kind"] == "backoff":
    time.sleep(backoff_delay(attempt))
    continue

# ✅ Right
if decision["kind"] == "backoff":
    delay = backoff_delay(attempt)
    print(f"[Recovery] Network error, retrying in {delay:.2f}s...")
    time.sleep(delay)
    continue
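
The examples in this chapter use `print` to keep things short. In a longer-running Agent, Python's standard `logging` module may be a better fit, because it adds levels and timestamps and can be redirected to a file. The sketch below is one way to wrap it; the logger name `agent.recovery` and the helper `log_recovery` are hypothetical.

```python
import logging

# Timestamps and levels come for free once logging is configured.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("agent.recovery")


def log_recovery(kind: str, attempt: int, max_attempts: int, detail: str = "") -> None:
    """Record one recovery attempt in a consistent, searchable format."""
    logger.warning(
        "recovery=%s attempt=%d/%d %s", kind, attempt, max_attempts, detail
    )


# Example: the first backoff attempt out of a budget of three.
log_recovery("backoff", 1, 3, "waiting 1.50s")
```

A fixed `key=value` layout like this makes it easy to count later how often each recovery path fired.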

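Why this matters

Because a truly reliable system should not fall over at the first small problem.

An error recovery mechanism lets you:

Improve stability: recover automatically from temporary problems
Improve the user experience: a network hiccup or similar glitch no longer interrupts the user's work
Increase reliability: the system can handle a range of edge cases
Reduce manual intervention: users do not have to restart by hand after a minor problem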

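How error recovery relates to later chapters

s11 Error recovery: answers how to keep going when something goes wrong
s12 Task system: uses error recovery to protect longer task flows
s13 Background tasks: need error recovery to handle long-running tasks
s17 Autonomous agents: rely on error recovery to achieve greater autonomy

That makes error recovery a key component of a reliable Agent system.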

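Coming up next

With error recovery in place, your Agent can now cope with a range of edge cases. In the next chapter we look at the task system, which lets the Agent manage more complex, longer-running tasks.

In one sentence: errors are not exceptions; they are a normal branch that the main loop must reserve room for.

If you found this helpful, feel free to follow. I will keep updating the "Building a Coding Agent from Scratch" series.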
