Anthropic经典文章:构建高效智能体

Building effective agents

构建高效智能体

Published Dec 19, 2024

Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.

过去一年,我们和几十个正在搭建大语言模型(LLM)智能体的团队紧密合作,跨多个行业。反复观察到一个规律:做得最出色的那些项目,反而没有去套用复杂的框架或专用的库,而是用一些简单、可灵活组合的模式一步步搭建起来的。

In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.

这篇文章里,我们想分享两点:一是我们和客户合作、自己动手构建智能体的过程中积累的经验,二是给正在搭建高效智能体的开发者一些实操建议。

What are agents?

什么是智能体?

"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:

“智能体”这个词有几种不同的用法。有的客户把它理解为完全自主的系统:长时间独立运转,调用各种工具去完成复杂任务;另一些客户则用这个词来指代那些更循规蹈矩的实现,即遵循预先设定好的工作流。在 Anthropic,我们把所有这些变体统称为智能体系统(agentic systems),但在架构上做了一个关键区分,把工作流(workflows)和智能体(agents)拆开来看:

  • Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
  • 工作流(workflows) :由 LLM 和工具按照预设好的代码路径协同完成的系统。
  • Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.
  • 智能体(agents) :由 LLM 自主决定处理流程和工具调用方式,并始终掌控任务执行路径的系统。

Below, we will explore both types of agentic systems in detail. In Appendix 1 (“Agents in Practice”), we describe two domains where customers have found particular value in using these kinds of systems.

接下来我们会逐一展开这两种智能体系统。附录 1(“智能体实践”)里,我们还介绍了客户在使用这类系统时觉得特别有价值的两个领域。

When (and when not) to use agents

什么时候用,什么时候不该用

When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.

用 LLM 构建应用时,我们一贯的建议是:从最简单的方案入手,只在确实必要的时候才引入复杂度。有时候,最合适的做法甚至是不构建任何智能体系统——因为这类系统往往是用更高的延迟和成本来换取任务表现的提升,你需要判断这种取舍在你的场景下是否值得。

When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.

确实需要更复杂的方案时,如果任务定义清晰、追求稳定一致,工作流是更好的选择;而当你需要在规模化场景下灵活应变、让模型自主决策时,智能体更为合适。不过对很多应用来说,做好单次 LLM 调用(配合检索和上下文示例)通常就已经足够了。

When and how to use frameworks

什么时候用框架、怎么用框架

There are many frameworks that make agentic systems easier to implement, including:

市面已经有很多框架能简化智能体系统的实现,常见的包括:

  • Rivet, a drag and drop GUI LLM workflow builder; and
  • Rivet,一个拖拽式的 GUI LLM 工作流搭建工具;以及
  • Vellum, another GUI tool for building and testing complex workflows.
  • Vellum,另一个用于搭建和测试复杂工作流的 GUI 工具。

These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.

这些框架把调用 LLM、定义和解析工具、把多个调用串起来这些底层工作都封装好了,上手很容易。但代价是多了一层抽象,把底层真实的 prompt 和 response 都盖住了,出了问题更难定位。它们还有个诱惑:本来简单方案就够了,你却可能忍不住往里面堆复杂度。

We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error.

我们的建议是,开发者先从直接调用 LLM API 入手——不少模式其实几行代码就能写出来。如果一定要用框架,也请先把框架底下的代码看清楚。对底层机制的误解,是出 bug 最常见的源头之一。

See our cookbook for some sample implementations.

我们整理了一些示例实现,放在 cookbook 里供你参考。

Building blocks, workflows, and agents

基本单元、工作流与智能体

In this section, we’ll explore the common patterns for agentic systems we’ve seen in production. We'll start with our foundational building block—the augmented LLM—and progressively increase complexity, from simple compositional workflows to autonomous agents.

这一节我们梳理生产中常见的智能体系统模式。从最基础的基本单元——增强型 LLM——讲起,再沿着复杂度递增的路径一路展开:先是简单可组合的工作流,最后到能自主决策的智能体。

Building block: The augmented LLM

基本单元:增强型 LLM

The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities—generating their own search queries, selecting appropriate tools, and determining what information to retain.

智能体系统的最小单元,是一个具备检索、工具调用、记忆等扩展能力的 LLM。我们现在的模型已经能主动驾驭这些能力——自主生成搜索查询、挑选合适的工具、判断哪些信息值得保留。

The augmented LLM

增强型 LLM

We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM. While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.

实现层面有两点我们建议多花点心思:一是把能力针对你的具体业务场景做定制,二是把这些能力以易用、文档完善的方式暴露给你的 LLM。具体到实现方法,最近我们发布的 Model Context Protocol(模型上下文协议)就是一条值得考虑的路径:开发者只要写一个简单的客户端实现,就能接入不断壮大的第三方工具生态。

For the remainder of this post, we'll assume each LLM call has access to these augmented capabilities.

后文讨论中,我们都默认每次 LLM 调用都具备这些增强能力。

Workflow: Prompt chaining

工作流:提示链(Prompt chaining)

Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (see “gate” in the diagram below) on any intermediate steps to ensure that the process is still on track.

提示链的做法,是把一个任务拆成一连串步骤,每一步的 LLM 调用都基于上一步的输出。你还可以在中间任意步骤加入程序化的检查(见下图中的 “gate”),随时确认整个流程没跑偏。

The prompt chaining workflow

提示链工作流

When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.

何时使用这种工作流: 这种工作流适合任务能被轻松、干净地拆解成一组固定的子任务的场景。核心目标是用延迟换取更高的准确度——把每一次 LLM 调用都变得更简单。

Examples where prompt chaining is useful:

适合使用提示链的例子:

  • Generating marketing copy, then translating it into a different language.
  • 先写一段营销文案,再翻译成另一种语言。
  • Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
  • 先列出一份文档大纲,检查大纲是否满足某些要求,再根据大纲把文档写出来。
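
The outline example above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `LLM` is a hypothetical callable that takes a prompt and returns text (standing in for whatever API client you use), and `outline_gate` is an assumed, deliberately simple programmatic check.

```python
from typing import Callable

# Hypothetical stand-in for a model call: prompt in, completion text out.
LLM = Callable[[str], str]


def outline_gate(outline: str, min_sections: int = 3) -> bool:
    """Programmatic gate: require at least `min_sections` non-empty lines."""
    sections = [line for line in outline.splitlines() if line.strip()]
    return len(sections) >= min_sections


def write_document(llm: LLM, topic: str) -> str:
    """Chain two LLM calls, with a gate between the outline and the draft."""
    outline = llm(f"Write a short outline, one section per line, for: {topic}")
    if not outline_gate(outline):
        # Stop early instead of building a draft on a weak outline.
        raise ValueError("Outline failed the gate; stopping before the draft step.")
    return llm(f"Write the document following this outline:\n{outline}")
```

Each call has one narrow job, and the gate is plain code, so a failure is caught at the step where it happens rather than in the final output.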

Workflow: Routing

工作流:路由(Routing)

Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.

路由工作流先对输入进行分类,再分发给对应的专门后续任务。这种工作流实现了关注点分离,可以为不同类别构建更有针对性的 prompt。如果没有这层路由,针对某一类输入做的优化,往往会拖累其他输入上的表现。

The routing workflow

路由工作流

When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.

何时使用这种工作流: 当任务较为复杂、能清晰划分出几类、且最好分开处理时,路由很合适;同时这些类别能被准确分拣——既可以用 LLM 来分类,也可以用更传统的分类模型或算法。

Examples where routing is useful:

适合使用路由的例子:

  • Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
  • 把不同类型的客服咨询(一般问题、退款请求、技术支持)分流到不同的下游流程、prompt 和工具里。
  • Routing easy/common questions to smaller, cost-efficient models like Claude Haiku 4.5 and hard/unusual questions to more capable models like Claude Sonnet 4.5 to optimize for best performance.
  • 把简单、常见的问题路由到 Claude Haiku 4.5 这样更小、更经济的模型上,把难、罕见的问题路由到 Claude Sonnet 4.5 这样能力更强的模型上,从而在整体上拿到最好的表现。
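
The customer service example above might look like the following sketch. The category names, the system prompts in `ROUTES`, and the `LLM` callable type are all assumptions made for illustration; the classifier could just as well be a traditional model.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a model call: prompt in, completion text out.
LLM = Callable[[str], str]

# Assumed categories, each with its own specialized prompt.
ROUTES: Dict[str, str] = {
    "general": "You answer general product questions.",
    "refund": "You handle refund requests and explain the refund policy.",
    "technical": "You troubleshoot technical problems step by step.",
}


def classify(llm: LLM, query: str) -> str:
    """Ask the classifier for a label; fall back to 'general' if it is unknown."""
    labels = ", ".join(ROUTES)
    answer = llm(f"Classify the query as one of: {labels}.\nQuery: {query}\nLabel:")
    label = answer.strip().lower()
    return label if label in ROUTES else "general"


def route(classifier: LLM, handler: LLM, query: str) -> str:
    """Classify the query, then answer it with the prompt for that category."""
    label = classify(classifier, query)
    return handler(f"{ROUTES[label]}\n\nCustomer query: {query}")
```

Passing `classifier` and `handler` separately leaves room to use a smaller model for classification and a more capable one for the answer.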

Workflow: Parallelization

工作流:并行(Parallelization)

LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:

LLM 有时可以同时处理同一个任务,再用程序把多份输出汇总起来。这种工作流叫“并行”,主要表现为两种形式:

  • Sectioning: Breaking a task into independent subtasks run in parallel.
  • 分段(Sectioning) :把一个任务拆成多个互不依赖的子任务,并行执行。
  • Voting: Running the same task multiple times to get diverse outputs.
  • 投票(Voting) :把同一个任务跑多次,拿到多种不同输出。

The parallelization workflow

并行工作流

When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.

什么时候适合用这个工作流: 并行化在两种场景下特别管用——一是子任务彼此独立、可以并行跑以加快速度;二是需要多个视角或多次尝试,投票出更可信的结果。对于涉及多项考量的复杂任务也是如此:让 LLM 把每个考量单独拆成一次调用,能让它的注意力更集中,整体效果往往更好。

Examples where parallelization is useful:

并行化在哪些场景下特别好用:

  • Sectioning:
  • 分段(Sectioning):
    - Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
    - 搭建内容护栏时,可以一个模型实例专门处理用户查询,另一个实例专门筛选不当内容或请求。这种做法通常比让同一次 LLM 调用既要跑护栏又要给主答复效果更好。
    - Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
    - 自动化评估 LLM 表现时,可以让每次 LLM 调用只评估模型在某条 prompt 上某一个方面的表现。
  • Voting:
  • 投票(Voting):
    - Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
    - 审查一段代码里的安全漏洞时,可以让多个不同的 prompt 各自审一遍,发现问题就标出来。
    - Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
    - 判断一段内容是否不当时,可以用多个 prompt 各自评估不同维度,或者设置不同的投票通过阈值,在误杀和漏放之间找到平衡。
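
The voting variation with a configurable threshold could be sketched as below. The `LLM` callable and the YES/NO answer convention are assumptions for the sake of the example; the same `run_parallel` helper would also serve for sectioning, where each prompt covers a different subtask.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical stand-in for a model call: prompt in, completion text out.
LLM = Callable[[str], str]


def run_parallel(llm: LLM, prompts: List[str]) -> List[str]:
    """Run one LLM call per prompt concurrently; results keep the prompt order."""
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        return list(pool.map(llm, prompts))


def flag_by_vote(llm: LLM, content: str, reviewer_prompts: List[str], threshold: int) -> bool:
    """Flag content when at least `threshold` reviewers answer YES."""
    prompts = [
        f"{prompt}\nAnswer YES or NO.\n\nContent:\n{content}"
        for prompt in reviewer_prompts
    ]
    votes = run_parallel(llm, prompts)
    yes_count = sum(vote.strip().upper().startswith("YES") for vote in votes)
    return yes_count >= threshold
```

A low threshold catches more problems at the cost of more false positives; raising it trades the other way.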

Workflow: Orchestrator-workers

工作流:编排者-工作者(Orchestrator-workers)

In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.

在编排者-工作者(orchestrator-workers)这种工作流里,有一个核心 LLM 负责把任务动态拆开,分派给若干个“worker”LLM 去执行,最后再把它们的产出汇总到一起。

The orchestrator-workers workflow

编排者-工作者(orchestrator-workers)工作流

When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). While it’s topologically similar to parallelization, the key difference is its flexibility: subtasks aren't pre-defined, but determined by the orchestrator based on the specific input.

什么时候适合用这个工作流: 这种工作流特别适合处理那些无法提前预判要拆成哪些子任务的复杂场景——比如写代码时,需要改哪几个文件、每个文件怎么改,往往要看具体任务才知道。它的整体形状看着跟并行化很像,但最关键的区别是它更灵活:子任务不是预先定义好的,而是由编排者(orchestrator)根据当下的输入临时决定的。

Examples where orchestrator-workers is useful:

编排者-工作者特别有用的场景:

  • Coding products that make complex changes to multiple files each time.
  • 每次都要同时改多个文件、且改动比较复杂的编程类产品。
  • Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
  • 需要在多个来源里广撒网地搜集信息,再逐一分析判断是否相关的搜索类任务。
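
A rough sketch of the pattern follows. It assumes the orchestrator can be prompted to reply with a JSON array of subtask strings; that reply format, the prompts, and the `LLM` callable are illustrative choices, not something the workflow prescribes.

```python
import json
from typing import Callable, List

# Hypothetical stand-in for a model call: prompt in, completion text out.
LLM = Callable[[str], str]


def plan_subtasks(orchestrator: LLM, task: str) -> List[str]:
    """Let the orchestrator decide the subtasks at run time."""
    raw = orchestrator(
        "Break the task into independent subtasks. "
        "Reply with a JSON array of strings only.\n"
        f"Task: {task}"
    )
    subtasks = json.loads(raw)
    if not isinstance(subtasks, list) or not all(isinstance(s, str) for s in subtasks):
        raise ValueError("Orchestrator did not return a JSON array of strings.")
    return subtasks


def orchestrate(orchestrator: LLM, worker: LLM, task: str) -> str:
    """Plan subtasks, delegate each to a worker call, then synthesize the results."""
    subtasks = plan_subtasks(orchestrator, task)
    results = [worker(f"Overall task: {task}\nYour subtask: {s}") for s in subtasks]
    report = "\n".join(f"- {s}: {r}" for s, r in zip(subtasks, results))
    return orchestrator(f"Synthesize a final answer for '{task}' from these results:\n{report}")
```

The list of subtasks comes from the model's reply rather than from the code, which is what separates this from a fixed parallel fan-out.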

Workflow: Evaluator-optimizer

工作流:评估者-优化者(Evaluator-optimizer)

In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

在评估者-优化者(evaluator-optimizer)这种工作流里,一次 LLM 调用负责生成答复,另一次 LLM 调用负责给出评估和反馈,两者循环往复。

The evaluator-optimizer workflow

评估者-优化者(evaluator-optimizer)工作流

When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.

什么时候适合用这个工作流: 当我们手头有清晰的评估标准,并且反复打磨确实能带来可量化的提升时,这种工作流就特别合适。两个判断是否契合的信号:一是 LLM 的回答在人类给出明确反馈后能明显变好,二是 LLM 自己也能给出有价值的反馈——这有点像人类作者把一篇文档打磨成稿时要经历的那种多轮迭代。

Examples where evaluator-optimizer is useful:

评估者-优化者适用的场景:

  • Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
  • 文学翻译:细微之处很多,翻译 LLM 一开始未必抓得全,而评估者 LLM 能够给出有价值的反馈。
  • Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
  • 复杂搜索任务:需要多轮搜索和分析才能收集到充分的信息,由评估者判断是否还需要继续查下去。
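
The generate-evaluate loop can be sketched as follows. The `ACCEPT` convention, the prompt wording, and the `max_rounds` cap are assumptions chosen for the example; `LLM` is again a hypothetical prompt-to-text callable.

```python
from typing import Callable

# Hypothetical stand-in for a model call: prompt in, completion text out.
LLM = Callable[[str], str]


def refine(generator: LLM, evaluator: LLM, task: str, max_rounds: int = 3) -> str:
    """Generate a draft, then revise it until the evaluator accepts or rounds run out."""
    draft = generator(f"Task: {task}")
    for _ in range(max_rounds):
        feedback = evaluator(
            f"Task: {task}\nDraft:\n{draft}\n"
            "Reply with exactly ACCEPT if the draft meets the criteria, "
            "otherwise explain what to improve."
        )
        if feedback.strip() == "ACCEPT":
            break
        # Feed the critique back to the generator for the next revision.
        draft = generator(
            f"Task: {task}\nPrevious draft:\n{draft}\nFeedback:\n{feedback}\nRevise."
        )
    return draft
```

The round cap keeps cost bounded when the evaluator never accepts.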

Agents

智能体

Agents are emerging in production as LLMs mature in key capabilities: understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it's crucial for the agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.

随着 LLM 在几项关键能力上越来越成熟——理解复杂输入、做推理和规划、稳定地使用工具、从错误中恢复——智能体开始真正在生产场景中落地。一个智能体的工作通常以人类用户下达的指令,或者与人类用户的一轮交互讨论作为起点。任务明确之后,智能体会自主地规划和执行,期间可能会回到人类那里补充信息或寻求判断。整个执行过程中有一点很关键:智能体每走一步都要从环境里拿到“真实情况”(比如 tool 调用的返回结果,或者 code 的执行输出),据此评估自己的进度。在检查点、或者撞上卡点时,智能体可以停下来向人类征询反馈。任务一般跑到完成就结束,不过实际中常常会加上停止条件(比如最大迭代次数)来兜底,确保可控。

Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop. It is therefore crucial to design toolsets and their documentation clearly and thoughtfully. We expand on best practices for tool development in Appendix 2 ("Prompt Engineering your Tools").

智能体虽然能处理复杂任务,实现起来其实并不复杂,本质上往往就是 LLM 拿着工具、依据环境反馈跑循环。正因如此,工具集和它的文档得仔细设计、写得清清楚楚。工具开发上的最佳实践,我们在附录 2《为你的工具做 prompt 工程》里展开讨论。
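
To make "LLMs using tools based on environmental feedback in a loop" concrete, here is a minimal sketch. It assumes a made-up reply protocol in which the model returns a JSON object, either `{"tool": ..., "input": ...}` or `{"final": ...}`; a real API would return structured tool-use blocks instead. `LLM`, `Tool`, and `max_steps` are illustrative names.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-ins: a model call and a tool, both text in, text out.
LLM = Callable[[str], str]
Tool = Callable[[str], str]


def run_agent(llm: LLM, tools: Dict[str, Tool], task: str, max_steps: int = 5) -> str:
    """Loop: ask the model, run the tool it picks, feed the result back."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        # Assumed protocol: the reply is JSON with either "final" or "tool" + "input".
        reply = json.loads(llm(transcript))
        if "final" in reply:
            return reply["final"]
        name, arg = reply["tool"], reply["input"]
        if name in tools:
            try:
                observation = tools[name](arg)
            except Exception as exc:
                # Report the failure so the model can try to recover.
                observation = f"error: {exc}"
        else:
            observation = f"error: unknown tool {name!r}"
        # The tool result is the "ground truth" the next step is based on.
        transcript += f"\nCalled {name}({arg!r}) -> {observation}"
    # Stopping condition to keep the loop under control.
    return "Stopped: reached max_steps without a final answer."
```

Tool errors are returned to the model as observations rather than raised, so the loop can continue, and `max_steps` bounds how long it runs.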

Autonomous agent

自主智能体

When to use agents: Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.

什么时候用智能体: 智能体适合那些开放性的问题——需要多少步事先说不清,也没法写死一条固定路径。LLM 可能会自己跑上很多轮,所以你得对它的判断力有一定信任。智能体的自主性,正好让它们在受信环境里把任务规模铺开。

The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.

不过也正因自主,智能体往往成本更高,错误还会层层叠加。建议先在沙盒环境里做充分的测试,再配上相应的护栏。

Examples where agents are useful:

智能体适用的场景:

The following examples are from our own implementations:

下面这些例子都来自我们自己的实现:

  • A coding agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
  • 一个 coding agent,用来解决 SWE-bench 任务:这类任务需要根据一段任务描述修改多个文件;

High-level flow of a coding agent

coding agent 的高层流程

Combining and customizing these patterns

组合与定制这些模式

These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.

这些基本单元并不是一成不变的范式,而是常见的模式,开发者可以按需改造、组合,去贴合自己的业务场景。成功的关键和所有 LLM 特性一样:去测 performance、反复打磨 implementation。说到底,只有当复杂度真能带来明显收益时,才值得加上去。

Summary

总结

Success in the LLM space isn't about building the most sophisticated system. It's about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.

在 LLM 领域,成功从来不取决于系统有多复杂,而在于系统是否契合你的需求。先从简单的 prompt 入手,用充分的评估反复打磨;只有当更轻量的方案确实力有不逮,才考虑引入多步骤的智能体系统。

When implementing agents, we try to follow three core principles:

构建智能体时,我们尽量遵循三条核心原则:

  1. Maintain simplicity in your agent's design.
  1. 保持智能体设计的简洁性。
  2. Prioritize transparency by explicitly showing the agent’s planning steps.
  2. 把透明性放在首位,清楚地呈现智能体的规划步骤。
  3. Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.
  3. 通过扎实的工具文档与测试,用心设计你的智能体-计算机接口(ACI)。

Frameworks can help you get started quickly, but don't hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.

框架能帮你快速起步,但一旦进入生产阶段,就该果断剥掉多余的抽象,回到基础组件去构建。把握住这些原则,做出来的智能体不仅能力出众,也能可靠、可维护、值得用户信赖。

Acknowledgements

致谢

Written by Erik S. and Barry Zhang. This work draws upon our experiences building agents at Anthropic and the valuable insights shared by our customers, for which we're deeply grateful.

本文由 Erik S. 与 Barry Zhang 撰写。文章内容凝结了我们在 Anthropic 构建智能体的实践经验,也吸收了来自客户的真知灼见,我们对此满怀感激。

Appendix 1: Agents in practice

附录 1:实践中的智能体

Our work with customers has revealed two particularly promising applications for AI agents that demonstrate the practical value of the patterns discussed above. Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.

在与客户的合作中,我们发现了 AI 智能体两类格外有前景的应用场景,它们直观地印证了上文这些模式的实际价值。两者的共同点在于:任务既需要对话又需要行动、成功标准清晰、能够形成反馈闭环,并允许有意义的人类监督——这些恰是智能体最能发挥价值的地方。

A. Customer support

A. 客户支持

Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration. This is a natural fit for more open-ended agents because:

客户支持本身就基于对话界面,再叠加工具集成带来的能力扩展,天然适合采用更开放式的智能体,原因如下:

  • Support interactions naturally follow a conversation flow while requiring access to external information and actions;
  • 支持对话天然沿对话流推进,过程中又需要访问外部信息、调用具体操作;
  • Tools can be integrated to pull customer data, order history, and knowledge base articles;
  • 可以借助工具拉取客户资料、订单记录和知识库文章;
  • Actions such as issuing refunds or updating tickets can be handled programmatically; and
  • 发放退款、更新工单之类的操作,都可以通过程序自动完成;并且
  • Success can be clearly measured through user-defined resolutions.
  • 成效如何,可以通过用户定义的问题解决标准来明确衡量。

Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents' effectiveness.

已经有公司用“按使用量计费”的定价模式——只对真正解决的问题才收费——验证了这条路可行,足见它们对自家智能体效果的底气。

B. Coding agents

B. 编码智能体

The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:

软件开发是 LLM 能力释放得最充分的领域之一,功能从最初的代码补全一路演进到自主解决问题。智能体在这里格外能打,原因如下:

  • Code solutions are verifiable through automated tests;
  • 代码方案可以通过自动化测试来验证;
  • Agents can iterate on solutions using test results as feedback;
  • 智能体可以根据测试反馈不断迭代方案;
  • The problem space is well-defined and structured; and
  • 问题空间定义清晰且结构化;并且
  • Output quality can be measured objectively.
  • 输出质量可以被客观地衡量。

In our own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.

在我们自己的实现里,智能体现在仅凭 pull request 的描述,就能在 SWE-bench Verified 基准上解决真实的 GitHub issue。自动化测试虽然能帮助验证功能,但人工审查仍然不可或缺——它是确保方案与整体系统要求对齐的最后一道关卡。

Appendix 2: Prompt engineering your tools

附录 2:针对工具的提示工程

No matter which agentic system you're building, tools will likely be an important part of your agent. Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition in our API. When Claude responds, it will include a tool use block in the API response if it plans to invoke a tool. Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts. In this brief appendix, we describe how to prompt engineer your tools.

无论你要构建哪类智能体系统,工具大概率都是智能体的重要组成。Tools 让 Claude 能与外部服务和 API 交互——你在 API 中把它们的结构和定义描述清楚就行。当 Claude 决定调用某个工具时,它的响应里会带上一段 tool use block。工具的定义和说明,和你整体提示词一样值得花时间打磨。这一节简短的附录会聊聊怎么对工具做提示工程。

There are often several ways to specify the same action. For instance, you can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, you can return code inside markdown or inside JSON. In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other. However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.

同一个动作往往有多种描述方式。举两个例子:表达一次文件编辑,你可以写一个 diff,也可以把整个文件重写一遍;表达结构化输出,你既可以把代码放在 markdown 里,也可以塞进 JSON。落到软件工程上,这类差异都是表面文章,相互之间可以无损转换。但对 LLM 来说,不同格式的写作难度天差地别——写 diff 得先在 chunk 头里数清楚改了几行再动笔;把代码塞进 JSON 还要为换行和引号额外做转义。
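
The escaping cost described above is easy to measure. The sketch below uses only Python's standard `json` module; the `snippet` and the helper name `json_escape_overhead` are made up for illustration.

```python
import json


def json_escape_overhead(code: str) -> int:
    """Count the extra characters needed to embed `code` in a JSON string."""
    encoded = json.dumps(code)  # adds surrounding quotes and backslash escapes
    return len(encoded) - len(code) - 2  # ignore the two surrounding quotes


snippet = 'def greet(name):\n    print(f"Hello, {name}!")\n'

# Two newlines and two double quotes each need one extra backslash.
print(json_escape_overhead(snippet))  # 4

# The round trip is lossless for software, even though it is harder to write.
assert json.loads(json.dumps({"code": snippet}))["code"] == snippet
```

Four extra characters look trivial for two lines, but the model has to place every one of them correctly, and the count grows with the length of the code.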

Our suggestions for deciding on tool formats are the following:

关于如何选择工具格式,我们的建议如下:

  • Give the model enough tokens to "think" before it writes itself into a corner.
  • 给模型留出足够的 token,让它在把自己逼进死胡同之前先想清楚。
  • Keep the format close to what the model has seen naturally occurring in text on the internet.
  • 尽量让格式贴近模型在互联网文本中自然见到的样子。
  • Make sure there's no formatting "overhead" such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.
  • 别让格式本身成为负担——比如让模型精确数几千行代码,又或者对自己写的代码做字符串转义。

One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI). Here are some thoughts on how to do so:

有个经验法则值得参考:人类在 human-computer interfaces (HCI) 上花了多少心思,你最好也在 agent-computer interfaces (ACI) 上花同样多。下面是一些具体建议:

  • Put yourself in the model's shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it? If so, then it’s probably also true for the model. A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools.
  • 试着站到模型的角度想:看 description 和参数,这个工具的用法是不是一目了然?如果你自己都得琢磨一番,模型多半也会。优秀的工具定义通常会给出使用示例、边界情形、输入格式要求,并与其他工具划清边界。
  • How can you change parameter names or descriptions to make things more obvious? Think of this as writing a great docstring for a junior developer on your team. This is especially important when using many similar tools.
  • 怎么调整参数名和描述才能让含义更直白?把这件事当成在给团队里的新人写一段精彩的 docstring。当你同时提供多个相似工具时,这一点尤为关键。
  • Test how the model uses your tools: Run many example inputs in our workbench to see what mistakes the model makes, and iterate.
  • 测试模型是怎么用你的工具的:在我们的 workbench 里跑大量示例输入,观察模型在哪些地方犯错,然后迭代改进。
  • Poka-yoke your tools. Change the arguments so that it is harder to make mistakes.
  • 给你的工具做Poka-yoke(防错设计)。调整参数,让错误更难发生。

While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths—and we found that the model used this method flawlessly.

在为我们这套 SWE-bench 智能体做实现时,我们花在优化工具上的时间甚至比打磨整体提示词还多。举个例子:模型在智能体离开根目录后,调用那些使用相对路径的工具时频繁出错。我们的解法是把工具改成强制要求绝对路径,结果模型用起来一次都没出过错。
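
The absolute-path fix might look like the sketch below. The tool name `read_file`, its description, and the handler are hypothetical, and the `name` / `description` / `input_schema` layout is an assumption about how a tool is declared; check your API's documentation for the exact shape.

```python
import os
from typing import Dict

# Hypothetical tool definition. The description states the path rule and
# gives an example, so the model does not have to guess.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file and return its contents. "
        "`path` must be absolute, e.g. /repo/src/app.py. "
        "Relative paths such as src/app.py are rejected."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Absolute path to the file."}
        },
        "required": ["path"],
    },
}


def read_file(arguments: Dict[str, str]) -> str:
    """Tool handler: refuse relative paths instead of resolving them silently."""
    path = arguments["path"]
    if not os.path.isabs(path):
        return f"error: {path!r} is relative; pass an absolute path instead."
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        return f"error: {exc}"
```

Rejecting a relative path with a clear message, rather than resolving it against whatever the current directory happens to be, removes the class of mistake described above and tells the model how to correct the call.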