使用 OpenAI Agents SDK 构建智能体——代理系统管理随着你的代理系统日益复杂，如何让它们“可管、可观、可

随着你的代理系统日益复杂，如何让它们“可管、可观、可测、可证”就和实现核心逻辑一样重要。 多代理系统里常有大量代理、工具与交接（handoff）彼此作用，路径并不直观。为管理这类复杂性，OpenAI Agents SDK 提供了可视化、护栏（guardrails）、可观测性与测试等能力。

本章你将学到：

代理可视化：生成多代理系统的图形化结构图，展示代理、工具及其交互，便于澄清设计并调试
护栏（Guardrails） ：实现输入/输出护栏，拦截不安全、无关或违反策略的内容，防止其进入或离开系统
日志、追踪与可观测性：了解 Traces 模块如何记录模型调用、工具调用、交接与护栏触发，并学习添加自定义 trace 与 span
代理测试：掌握端到端与单元测试的方法，即使在非确定性行为下也能验证系统可靠性

读完本章，你将会管理、监控并验证代理式系统。

技术要求

请按照第 3 章的步骤完成环境配置。
本书每章的实践示例与完整代码可在配套 GitHub 仓库获取：github.com/PacktPublis…。
建议你克隆仓库，复用/改造示例代码，并在学习过程中随时参考。

代理可视化

在前几章里我们看到，多代理系统可能包含多个组件：代理、工具、交接以及 MCP 服务器等。随着数量增加，这些元素会让整体结构变得难以把握。幸运的是，OpenAI Agents SDK 提供了可视化工具，可生成系统的图形表示，清晰地描绘代理、工具及其关系。

下面我们直接动手创建可视化图表。首先安装依赖，在激活的环境中运行：

$ pip install "openai-agents[viz]"

接着，复用此前章节构建过的分层代理系统，并添加若干工具以观察它们在图中的呈现。新建 visualization.py，从创建工具开始：

from agents import Agent, Runner, SQLiteSession, trace, function_tool
from agents.extensions.visualization import draw_graph

# Create tools
@function_tool
def calculate_physics_equation(equation):
    pass

@function_tool
def perform_culture_survey(goal):
    pass

这里定义了两个示例工具：一个用于求解物理方程，一个用于开展文化调研。稍后会把它们挂到相应代理上。

接着定义各领域的专业代理：

# Create our agents
# Specialized science agents
physics_agent = Agent(name="Physics Agent", instructions="Answer questions about physics.", tools=[calculate_physics_equation])
chemistry_agent = Agent(name="Chemistry Agent", instructions="Answer questions about chemistry.")
medical_agent = Agent(name="Medical Agent", instructions="Answer questions about medical science.")

# Specialized history agents
politics_agent = Agent(name="Politics Agent", instructions="Answer questions about political history.")
warfare_agent = Agent(name="Warfare Agent", instructions="Answer questions about wars and military history.")
culture_agent = Agent(name="Culture Agent", instructions="Answer questions about cultural history.", tools=[perform_culture_survey])

上面创建了科学与历史两个子域的一组专业代理；部分代理绑定了工具，其他则仅依赖指令。

然后创建对应的管理者（Manager）代理，用于编排各自域内的路由：

# Manager agents with handoffs to their respective domains
science_manager = Agent(
    name="Science Manager",
    instructions="Manage science-related queries and route them to the appropriate subdomain agent.",
    handoffs=[physics_agent, chemistry_agent, medical_agent]
)

history_manager = Agent(
    name="History Manager",
    instructions="Manage history-related queries and route them to the appropriate subdomain agent.",
    handoffs=[politics_agent, warfare_agent, culture_agent]
)

管理者代理负责协调调度；它们本身不回答问题，而是把任务转交给合适的专业代理。

最后，定义顶层分诊（triage）代理并输出结构图：

# Top-level triage agent
triage_agent = Agent(
    name="Research Triage Agent",
    instructions="Triage the user's question and decide whether it's science or history related, and route accordingly.",
    handoffs=[science_manager, history_manager]
)

# Draw agent graph
draw_graph(triage_agent, filename="graph_visualization")

该顶层代理负责接收用户问题，判断属于科学还是历史，并路由到对应管理者。agents.extensions.visualization 中的 draw_graph 会以任意一个代理为输入，绘制整个多代理系统的可视化，并保存为项目根目录下的 graph_visualization.png：

图 8.1：可视化示例图

小技巧：需要查看高清版本？请在下一代 Packt Reader 中打开本书，或在 PDF/ePub 版本中查看。
购买本书可免费使用下一代 Packt Reader。扫描二维码或访问 packtpub.com/unlock，在搜索栏输入书名并确认版本。

在可视化图中，代理以方框（节点）表示，工具以椭圆表示；箭头指示交互关系（实线箭头为代理间的 handoff，虚线箭头为代理调用工具）。图中还会包含一个起始节点与一个或多个结束节点，限定代理流的可能路径。

这套工具（双关：tool）对于大型多代理系统的管理、澄清与调试非常有用。通过检查图谱，我们可以核对系统结构是否符合预期——例如某个工具是否已连接、某个代理是否具备预期的 handoff；若缺失，图会一眼暴露配置问题。同时，它也可作为面向协作者、干系人或未来维护者的文档，为他们提供系统交互的一目了然的总览。

Guardrails（护栏）

护栏是 OpenAI Agents SDK 的另一种有用原语，用于通过执行校验来支持多代理系统。校验既可以发生在用户输入进入代理系统之前，也可以发生在代理输出返回给用户之前。

在代理系统中加入护栏的好处是让系统更健壮。护栏像一层防护，确保无效、不安全或不期望的输入与输出在造成问题前被拦截。这能防止有害回复、落实合规规则，并保持一致的用户体验。在更复杂的系统里，护栏还能让代理遵循组织政策与领域约束，而无需把验证逻辑塞进主代理里——也就是说，主代理专注业务，护栏负责边界情况、策略执行与安全问题。

输入护栏和输出护栏遵循相同模式：

先定义一个护栏函数，返回 GuardrailFunctionOutput 对象。它可以接收上下文、触发护栏的代理，以及用户提示/代理输出。GuardrailFunctionOutput 含有布尔值 tripwire_triggered，指示是否触发了“绊线”（tripwire）。
在护栏函数内部，编写逻辑判断是否应触发绊线。可用硬编码逻辑（例如：如果用户提示含有单词 “negative”，就触发并中止）或代理式逻辑（再建一个专门判断是否应触发绊线的代理）。
最后，优雅处理绊线（会抛出特定异常），并向用户输出合适的信息。

下面先看输入护栏，再看输出护栏。

输入护栏

把输入护栏想象成登机口的乘务员：只让持票乘客上机、拒绝其他人。它是第一道防线，确保只有相关的用户提示才能进入你的代理系统。比如，你可以用输入护栏检测用户请求是否违用政策，或判断请求是否超出代理职责。阻止误用还能节省成本——在请求跑进代理系统前就被拦下，省下 token 与计算资源。

回到之前做过的客户服务示例，这次我们加一个输入护栏。为演示方便，先用很朴素的规则：如果提示里包含“complaint”就触发绊线（当然也可以换成任何词）。

创建 input_guardrail.py，先导入依赖并加载环境变量：

# Required imports
import os
from dotenv import load_dotenv
from agents import Agent, Runner, function_tool, trace
from agents import GuardrailFunctionOutput, InputGuardrailTripwireTriggered, input_guardrail, RunContextWrapper, TResponseInputItem

# Load environment variables from the .env file
load_dotenv()
# Access the API key
api_key = os.getenv("OPENAI_API_KEY")

接着做一个简单工具：查询订单状态（作为代理的有用功能）：

# Create a tool
@function_tool()
def get_order_status(orderID: int) -> str:
    """
    Returns the order status given an order ID
    Args:
        orderID (int) - Order ID of the customer's order
    Returns:
        string - Status message of the customer's order
    """
    if orderID in (100, 101):
        return "Delivered"
    elif orderID in (200, 201):
        return "Delayed"
    elif orderID in (300, 301):
        return "Cancelled"

定义护栏：检查提示里是否包含 “complaint”，若包含则触发绊线：

# Create a guardrail
@input_guardrail
def complaint_detector_guardrail(
    ctx: RunContextWrapper[None],
    agent: Agent,
    prompt: str | list[TResponseInputItem]
) -> GuardrailFunctionOutput:
    tripwire_triggered = False
    if "complaint" in prompt:
        tripwire_triggered = True
    return GuardrailFunctionOutput(
        output_info="The word Complaint has been detected",
        tripwire_triggered=tripwire_triggered,
    )

定义代理，并在 input_guardrails 参数上附加护栏：

# Define an agent
agent = Agent(
    name="Customer service agent",
    instructions="You are an AI Agent that helps respond to customer queries for a local paper company",
    model="gpt-4o",
    tools=[get_order_status],
    input_guardrails=[complaint_detector_guardrail]
)

最后写个循环与代理交互。护栏会在每次输入进入代理前检查：

with trace("Input Guardrails"):
    while True:
        question = input("You: ")
        result = Runner.run_sync(agent, question)
        print("Agent: ", result.final_output)

工作机制回顾：首先我们定义了护栏函数（complaint_detector_guardrail），接收 RunContextWrapper、代理和用户提示，必须返回 GuardrailFunctionOutput，指示是否触发绊线。其次，在函数内部写检测逻辑：这里是检索关键词 “complaint”。若命中则置 tripwire_triggered=True 并报告被触发。注意这只是非常简单的逻辑；实际常会在此扫描政策违规或恶意输入等更复杂情形。

当绊线被触发时，SDK 会抛出 InputGuardrailTripwireTriggered 异常，中断正常流程，阻止请求继续进入代理，并把错误抛出。此时我们还没做异常处理，体验一般，但达到了“拦截”的目的。

试运行：

You: What's the status of my order? My order ID is 200
Agent: The status of your order with ID 200 is: Delayed. If you have any further questions or need assistance, please let me know!

因为未包含 “complaint”，未触发绊线。再试一次，故意触发：

You: I have a complaint
InputGuardrailTripwireTriggered error

现在加上更温和的异常处理：

with trace("Input Guardrails"):
    while True:
        try:
            question = input("You: ")
            result = Runner.run_sync(agent, question)
            print("Agent: ", result.final_output)
        except InputGuardrailTripwireTriggered:
            print("The tripwire has been triggered. Please call us instead to register complaints.")

再次输入：

You: I have a complaint
The tripwire has been triggered. Please call us instead to register complaints.

可以在 Traces 模块中看到这次输入护栏的记录：

图 8.2：Traces 模块中的输入护栏

注意
输入护栏只在多代理系统的第一个代理上执行。也就是说，它作为整个工作流的入口筛子，在输入流向下游代理前先做把关。

上面的护栏触发逻辑很初级：如果用户不用 “complaint” 这个词，就触发不到。更通用的做法是使用另一个轻量代理（便宜模型）来判断是否应触发绊线。这样可以先用低成本代理做筛查，再把合格请求交给昂贵的多代理系统。

我们据此升级示例：创建一个专门判断“是否与客服相关”的护栏代理；若不相关就触发绊线。

创建 input_guardrail_agent.py，导入包并设置环境：

# Required imports
import os
from dotenv import load_dotenv
from agents import Agent, Runner, function_tool, trace
from agents import GuardrailFunctionOutput, InputGuardrailTripwireTriggered, input_guardrail, RunContextWrapper, TResponseInputItem
from pydantic import BaseModel

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

仍然需要订单状态工具：

@function_tool()
def get_order_status(orderID: int) -> str:
    """
    Returns the order status given an order ID
    """
    if orderID in (100, 101):
        return "Delivered"
    elif orderID in (200, 201):
        return "Delayed"
    elif orderID in (300, 301):
        return "Cancelled"

用 Pydantic 定义护栏代理的输出结构：

class GuardrailTrueFalse(BaseModel):
    is_relevant_to_customer_service_orders: bool

创建护栏代理：唯一职责是判断“是否与客服/订单相关”：

guardrail_agent = Agent(
    name="Guardrail check",
    instructions="You are an AI agent that checks if the user's prompt is relevant to answering customer service and order related questions",
    output_type=GuardrailTrueFalse,
)

编写护栏函数：调用护栏代理，若不相关则触发绊线：

@input_guardrail
async def relevant_detector_guardrail(
    ctx: RunContextWrapper[None],
    agent: Agent,
    prompt: str | list[TResponseInputItem]
) -> GuardrailFunctionOutput:
    result = await Runner.run(guardrail_agent, input=prompt)
    tripwire_triggered = False
    if result.final_output.is_relevant_to_customer_service_orders == False:
        tripwire_triggered = True
    return GuardrailFunctionOutput(
        output_info="The word Complaint has been detected",
        tripwire_triggered=tripwire_triggered
    )

最后定义主客服代理并附加护栏，运行：

agent = Agent(
    name="Customer service agent",
    instructions="You are an AI Agent that helps respond to customer queries for a local paper company",
    model="gpt-4o",
    tools=[get_order_status],
    input_guardrails=[relevant_detector_guardrail]
)

with trace("Input Guardrails"):
    while True:
        try:
            question = input("You: ")
            result = Runner.run_sync(agent, question)
            print("Agent: ", result.final_output)
        except InputGuardrailTripwireTriggered:
            print("This comment is irrelevant to customer service.")

变化点：不再用硬编码关键词，而是调用护栏代理判断相关性。该代理有明确指令，输出类型用 Pydantic（GuardrailTrueFalse）约束。护栏函数里异步调用护栏代理，若判定“无关”，则置 tripwire_triggered=True。这种模式更健壮、更可扩展：用轻量低价模型做过滤/校验，把昂贵高能模型保留给真正的客服对话。

运行并输入无关问题：

You: What's the meaning of life?
This comment is irrelevant to customer service

注意
这里的护栏函数被定义为 async，因为它内部异步调用了另一个代理（Runner.run）。凡是在护栏里要调用其他代理时，就需要用异步护栏函数。

现在我们已经掌握了输入护栏，接下来把焦点切到输出护栏。

输出护栏（Output guardrails）

输出护栏与输入护栏目的相似，但它们不是校验进入代理系统的内容，而是校验代理系统产出的内容。可以把它们想成协助旅客下机的乘务员：确保下机有序、且没有不安全的东西被带出机舱。实际使用中，输出护栏是在响应返回用户之前的最后一道关。它能强制你的代理系统满足诸如格式合规、敏感信息脱敏，或确保输出遵循策略等约束。

在我们的客服场景里，假设我们希望代理的最终回复总是包含有效的配送状态陈述（例如：“您的订单 #5474 正在派送，将于明日送达。”）。如果代理输出了无关内容（比如只道歉不含状态），甚至幻觉出的信息，我们就希望系统能在到达客户之前拦截它。

与输入护栏类似，输出护栏以返回 GuardrailFunctionOutput 的函数实现，其中包含判定是否触发“绊线（tripwire）”的逻辑。如果输出无效或不安全，绊线会阻止响应返回给用户。不同之处在于，输出可能是结构化对象（当代理定义了 output_type 时），因此护栏函数会接收该输出对象。

下面通过一个示例来说明。创建 output_guardrail_agent.py 并运行以下程序。首先导入依赖并加载环境变量：

# Required imports
import os
from dotenv import load_dotenv
from agents import Agent, Runner, function_tool, trace
from agents import GuardrailFunctionOutput, OutputGuardrailTripwireTriggered, output_guardrail, RunContextWrapper
from pydantic import BaseModel
# Load environment variables from the .env file
load_dotenv()
# Access the API key
api_key = os.getenv("OPENAI_API_KEY")

接着，为输出定义简单的 Pydantic 模型：

class MessageOutput(BaseModel):
    response: str
class GuardrailTrueFalse(BaseModel):
    is_relevant_to_customer_service: bool

这是一个护栏代理，用于检查主代理的响应是否对客服场景有效：

# Create a guardrail agent
guardrail_agent = Agent(
    name="Guardrail check",
    instructions="You are an AI agent that checks if the agent response is relevant to answering a customer service question and not hallucinating",
    output_type=GuardrailTrueFalse
)

编写护栏函数，落实护栏逻辑：

# Create a guardrail
@output_guardrail
async def relevant_detector_guardrail(
    ctx: RunContextWrapper[None],
    agent: Agent,
    output: MessageOutput
) -> GuardrailFunctionOutput:
  
    result = await Runner.run(guardrail_agent, input=output)
    tripwire_triggered = False
    if result.final_output.is_relevant_to_customer_service == False:
         tripwire_triggered = True
    return GuardrailFunctionOutput(
        output_info="",
        tripwire_triggered=tripwire_triggered
    )

为了演示护栏如何拦截无效输出，我们故意定义一个**“爱幻觉”的客服代理**：

# Define an agent
agent = Agent(name="Customer service agent",
              instructions="You are an AI Agent that outputs random song lines and poems", # to force model to hallucinate and trigger the output guardrail
              output_guardrails=[relevant_detector_guardrail])

最后，循环运行代理，并捕获护栏触发的情况：

with trace("Output Guardrails"):
    while True:
        try:
            question = input("You: ")
            result = Runner.run_sync(agent, question)
            print("Agent: ", result.final_output)
        except OutputGuardrailTripwireTriggered:
            print ("The agent system did not produce an output. Please try again")

在这个示例中，我们创建了输出护栏。用 @output_guardrail 装饰 relevant_detector_guardrail 表示该护栏会在主代理生成响应之后运行。在护栏内部，我们异步调用护栏代理来检查主代理输出；若判定无效，护栏会通过抛出 OutputGuardrailTripwireTriggered 异常终止执行。

一旦绊线被触发，异常被捕获，我们不会展示代理的幻觉/无关回复，而是给用户一个安全的回退消息：“The agent system did not produce an output. Please try again.”

运行程序后，无论你输入什么都会触发输出护栏（因为我们故意让主代理产生幻觉）：

You: what's the status of my return?
The agent system did not produce an output. Please try again

输出护栏可以按需定制到各种场景：保证每次响应都含有明确有效的订单状态；校验结果是否符合特定模式/架构；或自动移除敏感信息（如个人可识别信息）。把这些检查放在管线末端，可确保返回用户的内容严格满足应用所需的标准。

把输出护栏当作最后的安全网：即便前面的组件偶有不可预测行为，输出护栏仍能确保最终返回给用户的内容安全、合规，并与业务需求一致。

日志、追踪与可观测性（Logging, tracing, and observability）

管理代理不仅仅是实现护栏；还需要良好的可观测性基础设施，帮助你充分理解代理在做什么。正如前几章所示，OpenAI Agents SDK 自带强大的 **Traces（追踪）**模块，会在一次代理运行期间记录事件序列（模型调用、工具调用、交接、护栏触发等）。

追踪默认对所有代理运行启用，可在 OpenAI 控制台查看（本书多处已演示）。它开箱即用，用于调试与监控，并捕获丰富的事件。这些记录的事件会作为**span（跨度）存储在某次运行的总体trace（追踪）**内。理解 trace 与 span 的区别很有用：

Trace（追踪） ：代表你的代理系统一次完整的执行流程。就像某个用户请求从开始到结束发生的一切的时间线。与该次运行相关的所有事件都归在这条 trace 下。
Span（跨度） ：追踪中的单个事件或操作，具有开始与结束时间。span 可以嵌套，并可携带用于调试的附加属性数据。

把 trace 想成一次用户请求的完整“比赛回放”，而 span 是回放里的每个“回合/动作”。

这个模型很强大：你可以沿着复杂序列逐步查看各操作耗时及其关系。例如，某条追踪总耗时 3.2 秒，其中 1.5 秒是 LLM 思考，0.5 秒是数据库工具调用，等等。

Traces 控制台会以可视化顺序展示这些事件，你可展开查看细节，如提示词、工具入参与出参。这在开发期尤其有用，便于逐步“走读”代理内部做了什么。

示例

创建 basic_trace.py 并运行：

from agents import Agent, Runner
from dotenv import load_dotenv
load_dotenv()
# Create an agent
agent = Agent(
    name="QuestionAnswerAgent",
    instructions="You are an AI agent that answers questions in as few words as possible"
)
result = Runner.run_sync(agent, "Where is the Eiffel Tower?")
print(result.final_output)

无需额外代码，SDK 会自动把日志写入 Traces 模块。打开控制台即可看到该 trace 及对应的 spans。

（图 8.3：Traces 模块中的 spans）

自定义追踪与跨度（Custom traces and spans）

可以用 trace 给追踪设置自定义属性（如自定义名称）。凡是在 trace 上下文中发生的代码执行、代理运行等，都会记录到这条追踪下。示例：创建 custom_trace.py：

from agents import Agent, Runner, trace
from dotenv import load_dotenv
load_dotenv()
# Create an agent
agent = Agent(
    name="QuestionAnswerAgent",
    instructions="You are an AI agent that answers questions in as few words as possible"
)
with trace("Henry's Workflow"):
    result = Runner.run_sync(agent, "Where is the Eiffel Tower?")
    print(result.final_output)

这里我们把追踪命名为 “Henry's Workflow” ，便于在 Traces 模块中快速找到。

（图 8.4：Traces 模块中的日志）

注：也可使用 traces.start() 与 traces.finish()，但不推荐。

虽然代理交接、工具调用等会自动生成 span，但你也可以用自定义 span来记录特定步骤，并看到其耗时。创建 custom_span.py：

from agents import Agent, Runner, trace, custom_span
from dotenv import load_dotenv
import time
load_dotenv()
# Create an agent
agent = Agent(
    name="QuestionAnswerAgent",
    instructions="You are an AI agent that answers questions in as few words as possible"
)
with trace("Henry's Workflow"):
    with custom_span("Task 1"):
        time.sleep(5)
    with custom_span("Task 2"):
        result = Runner.run_sync(agent, "Where is the Eiffel Tower?")
    with custom_span("Task 3"):
        time.sleep(5)
    with custom_span("Task 4"):
        time.sleep(5)

上例我们创建了多个自定义 span。目前它们要么运行一次代理，要么 sleep 5 秒。你会在 Traces 控制台看到对应日志与耗时。

（图 8.5：Traces 模块中的 Tasks）

把自定义 span 放在系统关键位置，可以把复杂流程拆解成更小、可度量的步骤，精确定位耗时与瓶颈（多次工具调用与推理步骤尤为有用）。

将多次追踪与跨度分组

有时你希望把多次代理运行合并到同一条追踪。默认分别调用两次 Runner.run 会产生两条追踪，但语义上它们可能属于同一工作流。使用 trace() 上下文可把它们绑在一起。创建 multiple_agents_in_one_trace.py：

from agents import Agent, Runner, trace, custom_span
from dotenv import load_dotenv
import time
load_dotenv()
# Create an agent
agent = Agent(
    name="QuestionAnswerAgent",
    instructions="You are an AI agent that answers questions in as few words as possible"
)
with trace("Henry's Workflow"):
    with custom_span("Task 1"):
        result = Runner.run_sync(agent, "Where is the Statue of Liberty?")
    with custom_span("Task 2"):
        result = Runner.run_sync(agent, "Where is the Eiffel Tower?")
    with custom_span("Task 3"):
        result = Runner.run_sync(agent, "Where is the Notre Dame?")
    with custom_span("Task 4"):
        result = Runner.run_sync(agent, "Where is the Burj Khalifa?")

在 Traces 模块里，这些运行将出现在同一条追踪下。

（图 8.6：Traces 模块中的多任务）

跨不同 Python 进程/程序也可以分组：在 trace() 中传入同一个 trace_id。创建 multiple_agents_in_one_trace_2.py，运行三次模拟三次调用：

from agents import Agent, Runner, trace, custom_span
from dotenv import load_dotenv
import time
load_dotenv()
# Create an agent
agent = Agent(
    name="QuestionAnswerAgent",
    instructions="You are an AI agent that answers questions in as few words as possible"
)
with trace("Henry's Workflow", trace_id="A1B2C3"):
    with custom_span("Task 1"):
        result = Runner.run_sync(agent, "Where is the Statue of Liberty?")

因为指定了 trace_id，无论该程序分开运行多少次，它们都会在 Traces 模块中归拢到同一条追踪下。

（图 8.7：同一 trace ID 下的多任务）

这对长流程或分布式工作流很有帮助：流程的不同片段在不同时间甚至不同机器上执行，通过统一的 trace_id 就能把多段活动“缝合”为一个完整追踪，便于查看全生命周期。

与追踪类似，span 也能分组与嵌套。假设你的工作流有两大块：调研与文本生成。每块都有自己的代理与工具。用自定义 span 可以把它们分别归组，在 Traces 模块中“合并显示”。创建 nested_spans.py：

from agents import Agent, Runner, trace, custom_span, function_tool
from dotenv import load_dotenv
import time
load_dotenv()
@function_tool
def get_fun_facts():
    return "The Eiffel Tower is in Paris"
@function_tool
def clean_up_poem(poem_string: str):
    return poem_string.upper()
# Create the research agent
research_agent = Agent(
    name="Research",
    instructions="You are an AI agent that performs research",
    tools=[get_fun_facts]
)
# Create the text generation agent
text_generation_agent = Agent(
    name="Text Generation",
    instructions="You are an AI agent that pertakes research that's performed and writes a poem",
    tools=[clean_up_poem]
)
with trace("Henry's Research Workflow"):
    with custom_span("Research Task"):
        result = Runner.run_sync(research_agent, "The Eiffel Tower")
    with custom_span("Text Generation Task"):
        result = Runner.run_sync(text_generation_agent,
            result.final_output)
    print(result.final_output)

这样就把“调研任务”和“文本生成任务”分别成组展示。

（图 8.8：Traces 中将多对象成组）

这让你更容易定位任务，并衡量每段耗时，对调试与运维管理都很有价值。

禁用追踪

有时你可能需要禁用追踪。原因可能是合规要求不得保存任何日志或数据，或者日志中可能包含你不希望存储的敏感信息。在这种情况下，可以在 Python 脚本顶部设置环境变量 OPENAI_AGENTS_DISABLE_TRACING 来关闭追踪：

import os
os.environ["OPENAI_AGENTS_DISABLE_TRACING"] = "1"

代理测试

代理管理的另一项重点是测试，用于确认代理按预期工作并在一段时间内保持可靠性。当代理连接到更广的工作流，或直接面向终端用户时，这点尤为重要。挑战在于代理常常非确定性（同一输入可能产生不同输出），因此比传统软件更难验证。好在 OpenAI Agents SDK 提供了结构化方法，为测试过程引入严谨性与一致性。

我们将讨论两类关键测试：

端到端测试（End-to-end） ：整个多代理系统是否产出预期行为/结果？
单元测试（Unit） ：系统中的单个组件是否按预期工作？

让我们开始吧！

端到端测试

端到端测试评估整套代理系统是否产生“可接受/理想”的输出。以客服代理为例，一项端到端测试可以模拟真实用户问题，检查代理是否给出有用回答、是否正确调用工具、是否恰当进行交接。

传统做法是定义输入与期望输出，并校验系统输出是否匹配。对非确定性的代理系统，这虽更困难但并非不可能：一种方式是人工验证；另一种更自动化的方式是让**LLM（甚至另一个代理）**来判定系统输出是否“可接受”。

下面为一个示例：对“根据订单号返回订单状态”的客服代理做端到端测试。创建 test_end_to_end.py：

# Required imports
import os
from dotenv import load_dotenv
from agents import Agent, Runner, function_tool
from pydantic import BaseModel

# Load environment variables from the .env file
load_dotenv()
# Access the API key
api_key = os.getenv("OPENAI_API_KEY")

# Create a tool
@function_tool(
        name_override="Get Status of Current Order",
        description_override="Returns the status of an order given the customer's Order ID",
        docstring_style="Args: Order ID in Integer format"
)
def get_order_status(orderID: int) -> str:
    """
    Returns the order status given an order ID
    Args:
        orderID (int) - Order ID of the customer's order
    Returns:
        string - Status message of the customer's order
    """
    if orderID in (100, 101):
        return "Delivered"
    elif orderID in (200, 201):
        return "Delayed"
    elif orderID in (300, 301):
        return "Cancelled"

# Define an agent
agent = Agent(name="Customer service agent",
              instructions="You are an AI Agent that helps respond to customer queries for a local paper company",
              model="gpt-4o",
              tools=[get_order_status])

# Run the Control Logic Framework
result = Runner.run_sync(agent, "What's the status of my order? My Order ID is 200")
# Print the result
print(result.final_output)

现在为该系统编写端到端测试：为每个场景定义输入与期望输出，逐个执行并比较结果。将以下内容加到脚本底部：

# create Scenario class
class Scenario(BaseModel):
    scenario: str
    input: str
    expected_output: str

list_of_scenarios  = [
    Scenario(
        scenario="Delivered example",
        input="Hi there, could you check my customer order? It's 101",
        expected_output="The order is delivered"
    ),
    Scenario(
        scenario="Delayed",
        input="My order ID is two hundred, why has my package not been delivered yet?",
        expected_output="The order is delayed"
    ),
    Scenario(
        scenario="Order does not exist",
        input="What's the status of my Order? Its number is 400",
        expected_output="No status or order can be found"
    )
]

# create output type
class OutputTrueFalse(BaseModel):
    test_succeeded: bool

# create testing agent
testing_agent = Agent(name="Testing agent",
              instructions="You are an AI Agent that tests expected outputs from desired outputs of an agentic AI system",
              output_type=OutputTrueFalse)

# Run test
for scenario in list_of_scenarios:
    print(f"Running scenario {scenario.scenario}")
    result = Runner.run_sync(testing_agent, f"Input: {scenario.input} ||| Expected Output: {scenario.expected_output}")
    print(result.final_output)
    print('---')

运行程序，会迭代各场景并进行端到端测试（由测试代理/LLM做对比判断）：

Running scenario Delivered example
test_succeeded=True
---
Running scenario Delayed
test_succeeded=True
---
Running scenario Order does not exist
test_succeeded=True
---

若之后修改了代理系统，只需重新运行即可重复相同测试。全部成功意味着代理成功调用了工具并把结果写入响应。若失败，可能是未调用工具（提示词问题）、或输出格式与预期不符。真实场景中应分析失败原因并改进代理（调整指令或工具实现等）。

单元测试

单元测试评估系统中某一具体行为或组件的表现。对代理系统而言，可以检查是否调用了某些工具、是否发生了正确的代理交接、是否触发了护栏等。我们可以利用 SDK 的 result 与 context 对象进行检查，以满足这些期望。

下面接前文脚本，测试：当我们向代理发问时，function_tool 是否被调用。思路是检查 result 对象，确认 get_order_status 工具确实被调用。添加如下代码：

from agents import ToolCallItem

# Run a unit test to check if the function_tool was called
result = Runner.run_sync(agent, "Please provide me the status of order 101")

# Inspect items in the result
items = result.new_items
print("Tool calls made during this run:")
for item in items:
    if isinstance(item, ToolCallItem):
        print(f"- {item.raw_item.name} was called")

# Assert that get_order_status was called
if any(item.raw_item.name == "get_order_status" for item in items if isinstance(item, ToolCallItem)):
    print("get_order_status was called as expected")
else:
    print("get_order_status was not called")

运行后可见：

Tool calls made during this run:
- get_order_status was called
get_order_status was called as expected

这类单元测试的价值在于：不仅验证最终输出，还能检查中间步骤，如是否调用了正确工具、是否发生了正确的交接、是否触发了护栏等。更细的可见性有助于快速定位问题，并在迭代中保持系统可靠性。

总结

本章聚焦如何使用 OpenAI Agents SDK 来管理、监控并验证代理式系统。我们先从可视化入手，学习如何生成展示代理、工具与交接流程的图表，使系统架构更易理解与调试。随后介绍了护栏（guardrails）机制，涵盖输入与输出护栏，作为保护层以强制执行策略，防止不安全或无关的交互。接着，我们探索了 Traces 模块，它记录每次运行的 trace 与 span，帮助你深入了解代理的内部行为。最后，我们讨论了测试方法，包括端到端测试与单元测试，以在代理固有的非确定性下，系统性地验证其可靠性。

这些能力共同构成了代理管理的基础。不仅帮助你构建强大的代理式系统，也能在系统扩展时保持其安全、可观测且可信赖。下一章我们将把本书至今所学融会贯通，构建完整的端到端真实场景代理系统。