🤖 The AI Agent Evaluation Playbook: A Practical Guide from Beginner to Pro


📌 01 The Limits of Traditional Testing

Traditional software testing relies on determinism: input A always yields output B. AI agents are different. They reason more like people do, and the same prompt can produce a different answer each time.

The core difference is this: traditional code testing only checks whether the result is right or wrong. AI agent evaluation also examines the process. How did the agent arrive at its answer? Did it take detours along the way? All of that has to be taken into account.

📌 02 Building a Multi-Dimensional Evaluation Mindset

Many people assume evaluation just means measuring accuracy. That view is far too narrow. Truly professional evaluation spans multiple dimensions.

Response quality is the most basic dimension, covering factual accuracy, fluency, and grammatical correctness. More important still is whether the agent genuinely understood the user's intent.

Performance metrics matter just as much: latency, resource consumption, token usage. These numbers directly affect user experience and cost control. An agent that responds slowly is hard to accept even when its answers are correct.

Advanced evaluation techniques are the finishing touch. Trajectory analysis lets us see the agent's reasoning process, and LLM-as-a-Judge can assess subjective qualities, such as whether an answer is actually helpful.

📌 03 A Practical Toolbox: From Code to Frameworks

No amount of theory helps if you never get hands-on. The code examples in the documentation taught me a great deal.

Simple string comparison is long outdated. Today we use semantic similarity analysis, which judges whether an agent's answer is correct in meaning rather than matching it word for word.

Monitoring data should not live only in console output. Write it to structured logs and time-series databases so you can track performance over the long term and catch problems early.

Token usage tracking is especially important because it ties directly to cost control. How many tokens did each request consume? Which features are the most resource-hungry? This data helps us optimize the system.

📌 04 Industry Standards and Trends

The Google ADK evaluation framework offers three approaches: interactive testing through a web UI, programmatic integration via pytest, and automated evaluation from the command line. Together they cover the full workflow from development to deployment.

China is moving quickly as well. The agent security evaluation framework released by CAICT (China Academy of Information and Communications Technology, 中国信通院) covers six modules: perception, memory, planning, tools, behavior, and communication, aligning closely with leading international practice.

📌 05 Keys to Evaluating Multi-Agent Systems

Evaluating multiple agents is like managing a team. **How is each member performing? Is collaboration smooth? How efficient is the whole?** All of these must be considered.

Collaboration effectiveness is critical. Did the flight-booking agent pass the correct dates and destination to the hotel-booking agent? One broken handoff and the whole task fails.

Plan execution must be checked. Does the agent follow the agreed steps? Does it skip important stages? A hotel agent that starts booking before the flight is confirmed, for example, is an obvious process error.

Task allocation must be sensible. When a user asks about the weather, the system should call a specialized weather agent rather than letting a general-purpose agent improvise. Choosing the right tool is how the job gets done well.

📌 06 From Agents to Advanced Contractors

This concept is excellent. A traditional agent is like a temp worker: it acts on a quick instruction, and when things go wrong there is no one to hold accountable. An advanced contractor is like an established firm: it signs a contract, defines responsibilities, and guarantees quality.

A formalized contract is the first step. Task requirements, delivery standards, and deadlines are all written down, so the agent knows what to do and the user knows what to expect.

Dynamic negotiation matters. When an agent finds a task ambiguous, it proactively asks clarifying questions. That kind of communication prevents a great deal of misunderstanding and rework.

Quality-oriented iterative execution is the core. The agent does not rush to deliver; it repeatedly checks, tests, and improves until its work meets the standard the contract requires.

📌 07 My Takeaways

Evaluation is not just a technical problem; it is also a matter of responsibility. A poorly evaluated AI system can create safety hazards and erode user trust.

AI agent evaluation is systems engineering. Technical metrics, business value, safety and compliance, and user experience all have to be weighed; any single dimension on its own is not enough.

The evolution from agents to contractors is the inevitable path. Only by building trustworthy, controllable, and verifiable AI systems can we deploy them in high-stakes domains.

Continuous learning matters. AI technology changes by the day, so stay on top of the latest standards and practices and keep sharpening your skills.


Summary

AI agent evaluation looks complicated, but once you have the method it is quite manageable: build a multi-dimensional mindset, use the right tools and techniques, follow industry trends, and focus on practical application.

Give it a try today! Pick an agent you know well and run a complete evaluation using the methods covered here. You may find that evaluation can actually be fun.

Feel free to share your own evaluation experience in the comments so we can improve together!


Frequently Asked Questions

Q1: Why don't traditional testing methods work for AI agents? A: Traditional testing is built on deterministic logic, while AI agents are probabilistic. You need to evaluate the full trajectory, not just the final result.

Q2: What matters most when evaluating multi-agent systems? A: Collaboration effectiveness, plan execution, and task allocation. As with managing a team, watch both individual performance and overall coordination.

Q3: How should I start learning AI agent evaluation? A: Begin with simple response-accuracy evaluation, then expand to performance monitoring and trajectory analysis. Practicing on real projects matters most.


If you found this useful, give it a like! Follow me for more hands-on AI content next time!


Chapter 19: Evaluation and Monitoring | 第19章:评估与监控

English Version | 中文版


This chapter examines methodologies that allow intelligent agents to systematically assess their performance, monitor progress toward goals, and detect operational anomalies. While Chapter 11 outlines goal setting and monitoring, and Chapter 17 addresses reasoning mechanisms, this chapter focuses on the continuous, often external, measurement of an agent's effectiveness, efficiency, and compliance with requirements. This includes defining metrics, establishing feedback loops, and implementing reporting systems to ensure agent performance aligns with expectations in operational environments (see Fig. 1).

本章探讨使智能体能够系统评估其性能、监控目标进展并检测操作异常的方法论。虽然第11章概述了目标设定和监控,第17章论述了推理机制,但本章重点关注对智能体有效性、效率和合规性的持续性、通常是外部的测量。这包括定义指标、建立反馈循环和实施报告系统,以确保智能体性能在操作环境中与期望保持一致(见图1)。


Fig. 1: Best practices for evaluation and monitoring | 图1:评估与监控的最佳实践


Practical Applications & Use Cases | 实际应用与用例

English Version | 中文版

Most Common Applications and Use Cases:

最常见的应用和用例:

  • Performance Tracking in Live Systems: Continuously monitoring the accuracy, latency, and resource consumption of an agent deployed in a production environment (e.g., a customer service chatbot's resolution rate, response time).

  • **实时系统性能跟踪:**持续监控部署在生产环境中的智能体的准确性、延迟和资源消耗(例如,客户服务聊天机器人的解决率、响应时间)。

  • A/B Testing for Agent Improvements: Systematically comparing the performance of different agent versions or strategies in parallel to identify optimal approaches (e.g., trying two different planning algorithms for a logistics agent).

  • **智能体改进的A/B测试:**系统地并行比较不同智能体版本或策略的性能,以识别最优方法(例如,为物流智能体尝试两种不同的规划算法)。

  • Compliance and Safety Audits: Generating automated audit reports that track an agent's compliance with ethical guidelines, regulatory requirements, and safety protocols over time. These reports can be verified by a human-in-the-loop or another agent, and can generate KPIs or trigger alerts upon identifying issues.

  • **合规性与安全审计:**生成自动化审计报告,跟踪智能体对道德准则、监管要求和安全协议的合规性。这些报告可由人在回路中或另一个智能体验证,并可在发现问题时生成KPI或触发警报。

  • Enterprise systems: To govern Agentic AI in corporate systems, a new control instrument, the AI "Contract," is needed. This dynamic agreement codifies the objectives, rules, and controls for AI-delegated tasks.

  • **企业系统:**为了管理企业系统中的智能体AI,需要一种新的控制工具——AI"合同"。这份动态协议将AI委托任务的目标、规则和控制措施编纂成文。

  • Drift Detection: Monitoring the relevance or accuracy of an agent's outputs over time, detecting when its performance degrades due to changes in input data distribution (concept drift) or environmental shifts.

  • **漂移检测:**监控智能体输出的相关性或准确性随时间的变化,检测其性能是否因输入数据分布变化(概念漂移)或环境变化而下降。

  • Anomaly Detection in Agent Behavior: Identifying unusual or unexpected actions taken by an agent that might indicate an error, a malicious attack, or an emergent undesired behavior.

  • **智能体行为异常检测:**识别智能体采取的异常或意外行为,这些行为可能表明存在错误、恶意攻击或突发的非期望行为。

  • Learning Progress Assessment: For agents designed to learn, tracking their learning curve, improvement in specific skills, or generalization capabilities over different tasks or data sets.

  • **学习进展评估:**对于设计用于学习的智能体,跟踪其学习曲线、特定技能的改进或跨不同任务或数据集的泛化能力。
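The drift-detection use case above can be sketched with a rolling window over per-request quality scores. The following is a minimal illustration; the class name, window size, and tolerance threshold are invented for this example and not taken from any monitoring library:

```python
from collections import deque
from statistics import mean

class DriftDetector:
    """Flags performance drift when a recent window's mean score
    falls below the baseline mean by more than `tolerance`."""

    def __init__(self, baseline_scores, window_size=50, tolerance=0.1):
        self.baseline = mean(baseline_scores)
        self.window = deque(maxlen=window_size)
        self.tolerance = tolerance

    def record(self, score: float) -> bool:
        """Record a new score; return True if drift is detected."""
        self.window.append(score)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data yet
        return mean(self.window) < self.baseline - self.tolerance

# Example: baseline accuracy around 0.9, recent traffic degrading to ~0.7
detector = DriftDetector(baseline_scores=[0.9, 0.92, 0.88, 0.9],
                         window_size=5, tolerance=0.1)
drifted = False
for s in [0.7, 0.72, 0.68, 0.71, 0.7]:
    drifted = detector.record(s)
print(drifted)  # True once the window fills with degraded scores
```

A production detector would typically compare score distributions (not just means) and distinguish concept drift in the inputs from regression in the agent itself, but the windowed comparison above captures the core idea.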


Hands-On Code Example | 实际代码示例

English Version | 中文版

Developing a comprehensive evaluation framework for AI agents is a challenging endeavor, comparable to an academic discipline or a substantial publication in its complexity. This difficulty stems from the multitude of factors to consider, such as model performance, user interaction, ethical implications, and broader societal impact. Nevertheless, for practical implementation, the focus can be narrowed to critical use cases essential for the efficient and effective functioning of AI agents.

为AI智能体开发全面的评估框架是一项具有挑战性的工作,其复杂性堪比一个学术学科或重要出版物。这种困难源于需要考虑众多因素,如模型性能、用户交互、道德影响以及更广泛的社会影响。然而,为了实际实施,可以将重点缩小到对AI智能体高效有效运行至关重要的关键用例上。

Agent Response Assessment: This core process is essential for evaluating the quality and accuracy of an agent's outputs. It involves determining if the agent delivers pertinent, correct, logical, unbiased, and accurate information in response to given inputs. Assessment metrics may include factual correctness, fluency, grammatical precision, and adherence to the user's intended purpose.

**智能体响应评估:**这一核心过程对于评估智能体输出的质量和准确性至关重要。它涉及确定智能体是否针对给定输入提供相关、正确、逻辑清晰、无偏见且准确的信息。评估指标可能包括事实正确性、流畅性、语法精确性以及对用户预期目的的遵循程度。

def evaluate_response_accuracy(agent_output: str, expected_output: str) -> float:
    """Calculates a simple accuracy score for agent responses."""
    # This is a very basic exact match; real-world would use more sophisticated metrics
    return 1.0 if agent_output.strip().lower() == expected_output.strip().lower() else 0.0

# Example usage
agent_response = "The capital of France is Paris."
ground_truth = "Paris is the capital of France."
score = evaluate_response_accuracy(agent_response, ground_truth)
print(f"Response accuracy: {score}")

The Python function evaluate_response_accuracy calculates a basic accuracy score for an AI agent's response by performing an exact, case-insensitive comparison between the agent's output and the expected output, after removing leading or trailing whitespace. It returns a score of 1.0 for an exact match and 0.0 otherwise, representing a binary correct or incorrect evaluation. This method, while straightforward for simple checks, does not account for variations like paraphrasing or semantic equivalence.

Python函数evaluate_response_accuracy通过执行精确的、不区分大小写的比较来计算AI智能体响应的基本准确性得分,比较前会去除前导或尾随空格。对于精确匹配,它返回1.0分,否则返回0.0分,表示二元正确或错误的评估。虽然这种方法对于简单检查来说很直接,但它没有考虑释义或语义等价性等变化。

The problem lies in its method of comparison. The function performs a strict, character-for-character comparison of the two strings. In the example provided:

问题在于其比较方法。该函数对两个字符串执行严格的逐字符比较。在提供的示例中:

  • agent_response: "The capital of France is Paris."
  • ground_truth: "Paris is the capital of France."

Even after removing whitespace and converting to lowercase, these two strings are not identical. As a result, the function will incorrectly return an accuracy score of 0.0, even though both sentences convey the same meaning.

即使在去除空格并转换为小写后,这两个字符串也不完全相同。因此,该函数将错误地返回准确性得分0.0,尽管两个句子传达的含义相同。

A straightforward comparison falls short in assessing semantic similarity, only succeeding if an agent's response exactly matches the expected output. A more effective evaluation necessitates advanced Natural Language Processing (NLP) techniques to discern the meaning between sentences. For thorough AI agent evaluation in real-world scenarios, more sophisticated metrics are often indispensable. These metrics can encompass String Similarity Measures like Levenshtein distance and Jaccard similarity, Keyword Analysis for the presence or absence of specific keywords, Semantic Similarity using cosine similarity with embedding models, LLM-as-a-Judge Evaluations (discussed later for assessing nuanced correctness and helpfulness), and RAG-specific Metrics such as faithfulness and relevance.

直接的比较在评估语义相似性方面存在不足,只有在智能体的响应与预期输出完全匹配时才有效。更有效的评估需要先进的自然语言处理(NLP)技术来识别句子之间的含义。对于现实世界场景中的全面AI智能体评估,更复杂的指标通常是必不可少的。这些指标可以包括字符串相似性度量(如Levenshtein距离和Jaccard相似性)、针对特定关键词存在与否的关键词分析、使用嵌入模型余弦相似性的语义相似性、LLM作为法官的评估(后文讨论,用于评估细微的正确性和有用性),以及RAG特定指标(如忠实度和相关性)。
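As a rough sketch of the string-similarity measures mentioned above, the two paraphrases that defeated exact matching can be compared with token-level Jaccard similarity and a character-level ratio from the standard library (an edit-distance-style measure). True semantic similarity would instead use embedding models and cosine similarity, which is omitted here:

```python
import string
from difflib import SequenceMatcher

def tokens(s: str) -> set:
    """Lowercase, split on whitespace, and strip punctuation."""
    return {w.strip(string.punctuation) for w in s.lower().split()}

def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |intersection| / |union|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def char_ratio(a: str, b: str) -> float:
    """Character-level similarity ratio (difflib's gestalt matching)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

agent_response = "The capital of France is Paris."
ground_truth = "Paris is the capital of France."

print(f"Jaccard:    {jaccard_similarity(agent_response, ground_truth):.2f}")  # 1.00 -- identical word sets
print(f"Char ratio: {char_ratio(agent_response, ground_truth):.2f}")
```

Where the earlier exact-match function scored this pair 0.0, the Jaccard measure scores it 1.0 because the two sentences use exactly the same words; it would still fail on true paraphrases with different vocabulary, which is where embedding-based metrics come in.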

Latency Monitoring: Latency Monitoring for Agent Actions is crucial in applications where the speed of an AI agent's response or action is a critical factor. This process measures the duration required for an agent to process requests and generate outputs. Elevated latency can adversely affect user experience and the agent's overall effectiveness, particularly in real-time or interactive environments. In practical applications, simply printing latency data to the console is insufficient. Logging this information to a persistent storage system is recommended. Options include structured log files (e.g., JSON), time-series databases (e.g., InfluxDB, Prometheus), data warehouses (e.g., Snowflake, BigQuery, PostgreSQL), or observability platforms (e.g., Datadog, Splunk, Grafana Cloud).

**延迟监控:**对于AI智能体操作的延迟监控在智能体响应或操作速度是关键因素的应用中至关重要。此过程测量智能体处理请求和生成输出所需的持续时间。高延迟可能会对用户体验和智能体的整体有效性产生不利影响,特别是在实时或交互式环境中。在实际应用中,仅仅将延迟数据打印到控制台是不够的。建议将此信息记录到持久化存储系统中。选项包括结构化日志文件(例如JSON)、时间序列数据库(例如InfluxDB、Prometheus)、数据仓库(例如Snowflake、BigQuery、PostgreSQL)或可观测性平台(例如Datadog、Splunk、Grafana Cloud)。
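As a minimal sketch of the structured-logging recommendation above, the snippet below times a stand-in agent call with `time.perf_counter` and emits one JSON record per request. In practice the record would be shipped to one of the sinks listed (a time-series database or observability platform); the field names here are invented for illustration:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def timed_agent_call(agent_fn, request: str) -> str:
    """Run an agent call and emit a structured latency record."""
    start = time.perf_counter()
    response = agent_fn(request)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        "event": "agent_call",
        "request_chars": len(request),
        "latency_ms": round(latency_ms, 2),
        "timestamp": time.time(),
    }
    # In production, ship this JSON line to a log aggregator or TSDB
    # instead of (or in addition to) the local logger.
    logging.info(json.dumps(record))
    return response

# Example with a stand-in "agent"
response = timed_agent_call(lambda q: f"Echo: {q}", "What is the capital of France?")
print(response)
```

Because each record is a single JSON object, it can be parsed downstream for percentile latency dashboards (p50/p95/p99) rather than only averages.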

Tracking Token Usage for LLM Interactions: For LLM-powered agents, tracking token usage is crucial for managing costs and optimizing resource allocation. Billing for LLM interactions often depends on the number of tokens processed (input and output). Therefore, efficient token usage directly reduces operational expenses. Additionally, monitoring token counts helps identify potential areas for improvement in prompt engineering or response generation processes.

**LLM交互的令牌使用跟踪:**对于由LLM驱动的智能体,跟踪令牌使用对于管理成本和优化资源配置至关重要。LLM交互的计费通常取决于处理的令牌数量(输入和输出)。因此,高效的令牌使用直接降低了运营费用。此外,监控令牌数量有助于识别提示工程或响应生成过程中潜在的改进领域。
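Token counts map directly to spend. A back-of-the-envelope cost estimator might look like the following, where the per-1,000-token prices are illustrative placeholders rather than any provider's actual rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float = 0.0005,
                  output_price_per_1k: float = 0.0015) -> float:
    """Estimate the dollar cost of one LLM request from token counts.
    Default prices are made-up placeholders, not real vendor pricing."""
    return (input_tokens / 1000) * input_price_per_1k + \
           (output_tokens / 1000) * output_price_per_1k

cost = estimate_cost(input_tokens=1200, output_tokens=800)
print(f"Estimated cost: ${cost:.6f}")  # Estimated cost: $0.001800
```

Note that output tokens are often billed at a higher rate than input tokens, which is why prompt-engineering efforts that shorten responses can save more than efforts that shorten prompts.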

# This is conceptual as actual token counting depends on the LLM API
class LLMInteractionMonitor:
    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def record_interaction(self, prompt: str, response: str):
        # In a real scenario, use LLM API's token counter or a tokenizer
        input_tokens = len(prompt.split())  # Placeholder
        output_tokens = len(response.split())  # Placeholder
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        print(f"Recorded interaction: Input tokens={input_tokens}, Output tokens={output_tokens}")

    def get_total_tokens(self):
        return self.total_input_tokens, self.total_output_tokens

# Example usage
monitor = LLMInteractionMonitor()
monitor.record_interaction("What is the capital of France?", "The capital of France is Paris.")
monitor.record_interaction("Tell me a joke.", "Why don't scientists trust atoms? Because they make up everything!")
input_t, output_t = monitor.get_total_tokens()
print(f"Total input tokens: {input_t}, Total output tokens: {output_t}")

This section introduces a conceptual Python class, LLMInteractionMonitor, developed to track token usage in large language model interactions. The class incorporates counters for both input and output tokens. Its record_interaction method simulates token counting by splitting the prompt and response strings. In a practical implementation, specific LLM API tokenizers would be employed for precise token counts. As interactions occur, the monitor accumulates the total input and output token counts. The get_total_tokens method provides access to these cumulative totals, essential for cost management and optimization of LLM usage.

本节介绍了一个概念性的Python类LLMInteractionMonitor,用于跟踪大型语言模型交互中的令牌使用。该类包含输入和输出令牌的计数器。其record_interaction方法通过分割提示和响应字符串来模拟令牌计数。在实际实现中,将使用特定的LLM API分词器进行精确的令牌计数。随着交互的进行,监控器累积总的输入和输出令牌计数。get_total_tokens方法提供对这些累积总数的访问,这对于LLM使用的成本管理和优化至关重要。

Custom Metric for "Helpfulness" using LLM-as-a-Judge: Evaluating subjective qualities like an AI agent's "helpfulness" presents challenges beyond standard objective metrics. A potential framework involves using an LLM as an evaluator. This LLM-as-a-Judge approach assesses another AI agent's output based on predefined criteria for "helpfulness." Leveraging the advanced linguistic capabilities of LLMs, this method offers nuanced, human-like evaluations of subjective qualities, surpassing simple keyword matching or rule-based assessments. Though in development, this technique shows promise for automating and scaling qualitative evaluations.

**使用LLM作为法官的"有用性"自定义指标:**评估AI智能体"有用性"等主观品质带来了超越标准客观指标的挑战。一个潜在的框架涉及使用LLM作为评估器。这种LLM作为法官的方法基于预定义的"有用性"标准来评估另一个AI智能体的输出。利用LLM先进的语言能力,这种方法提供了对主观品质的细致、类似人类的评估,超越了简单的关键词匹配或基于规则的评估。尽管仍在开发中,这项技术显示出自动化和扩展定性评估的前景。

import google.generativeai as genai
import os
import json
import logging
from typing import Optional

# --- Configuration ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Set your API key as an environment variable to run this script
# For example, in your terminal: export GOOGLE_API_KEY='your_key_here'
try:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
except KeyError:
    logging.error("Error: GOOGLE_API_KEY environment variable not set.")
    exit(1)

# --- LLM-as-a-Judge Rubric for Legal Survey Quality ---
LEGAL_SURVEY_RUBRIC = """
You are an expert legal survey methodologist and a critical legal reviewer. Your task is to evaluate the quality of a given legal survey question. Provide a score from 1 to 5 for overall quality, along with a detailed rationale and specific feedback. Focus on the following criteria:

1.  **Clarity & Precision (Score 1-5):**
    * 1: Extremely vague, highly ambiguous, or confusing.
    * 3: Moderately clear, but could be more precise.
    * 5: Perfectly clear, unambiguous, and precise in its legal terminology (if applicable) and intent.

2.  **Neutrality & Bias (Score 1-5):**
    * 1: Highly leading or biased, clearly influencing the respondent towards a specific answer.
    * 3: Slightly suggestive or could be interpreted as leading.
    * 5: Completely neutral, objective, and free from any leading language or loaded terms.

3.  **Relevance & Focus (Score 1-5):**
    * 1: Irrelevant to the stated survey topic or out of scope.
    * 3: Loosely related but could be more focused.
    * 5: Directly relevant to the survey's objectives and well-focused on a single concept.

4.  **Completeness (Score 1-5):**
    * 1: Omits critical information needed to answer accurately or provides insufficient context.
    * 3: Mostly complete, but minor details are missing.
    * 5: Provides all necessary context and information for the respondent to answer thoroughly.

5.  **Appropriateness for Audience (Score 1-5):**
    * 1: Uses jargon inaccessible to the target audience or is overly simplistic for experts.
    * 3: Generally appropriate, but some terms might be challenging or oversimplified.
    * 5: Perfectly tailored to the assumed legal knowledge and background of the target survey audience.

**Output Format:**
Your response MUST be a JSON object with the following keys:
* `overall_score`: An integer from 1 to 5 (average of criterion scores, or your holistic judgment).
* `rationale`: A concise summary of why this score was given, highlighting major strengths and weaknesses.
* `detailed_feedback`: A bullet-point list detailing feedback for each criterion (Clarity, Neutrality, Relevance, Completeness, Audience Appropriateness). Suggest specific improvements.
* `concerns`: A list of any specific legal, ethical, or methodological concerns.
* `recommended_action`: A brief recommendation (e.g., "Revise for neutrality", "Approve as is", "Clarify scope").
"""

class LLMJudgeForLegalSurvey:
    """A class to evaluate legal survey questions using a generative AI model."""

    def __init__(self, model_name: str = 'gemini-1.5-flash-latest', temperature: float = 0.2):
        """
        Initializes the LLM Judge.

        Args:
            model_name (str): The name of the Gemini model to use.
                              'gemini-1.5-flash-latest' is recommended for speed and cost.
                              'gemini-1.5-pro-latest' offers the highest quality.
            temperature (float): The generation temperature. Lower is better for deterministic evaluation.
        """
        self.model = genai.GenerativeModel(model_name)
        self.temperature = temperature

    def _generate_prompt(self, survey_question: str) -> str:
        """Constructs the full prompt for the LLM judge."""
        return f"{LEGAL_SURVEY_RUBRIC}\n\n---\n**LEGAL SURVEY QUESTION TO EVALUATE:**\n{survey_question}\n---"

    def judge_survey_question(self, survey_question: str) -> Optional[dict]:
        """
        Judges the quality of a single legal survey question using the LLM.

        Args:
            survey_question (str): The legal survey question to be evaluated.

        Returns:
            Optional[dict]: A dictionary containing the LLM's judgment, or None if an error occurs.
        """
        full_prompt = self._generate_prompt(survey_question)

        try:
            logging.info(f"Sending request to '{self.model.model_name}' for judgment...")
            response = self.model.generate_content(
                full_prompt,
                generation_config=genai.types.GenerationConfig(
                    temperature=self.temperature,
                    response_mime_type="application/json"
                )
            )
            # Check for content moderation or other reasons for an empty response.
            if not response.parts:
                safety_ratings = response.prompt_feedback.safety_ratings
                logging.error(f"LLM response was empty or blocked. Safety Ratings: {safety_ratings}")
                return None

            return json.loads(response.text)
        except json.JSONDecodeError:
            logging.error(f"Failed to decode LLM response as JSON. Raw response: {response.text}")
            return None
        except Exception as e:
            logging.error(f"An unexpected error occurred during LLM judgment: {e}")
            return None

# --- Example Usage ---
if __name__ == "__main__":
    judge = LLMJudgeForLegalSurvey()
    
    # --- Good Example ---
    good_legal_survey_question = """
    To what extent do you agree or disagree that current intellectual property laws in Switzerland adequately protect emerging AI-generated content, assuming the content meets the originality criteria established by the Federal Supreme Court?
    (Select one: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)
    """
    print("\n--- Evaluating Good Legal Survey Question ---")
    judgment_good = judge.judge_survey_question(good_legal_survey_question)
    if judgment_good:
        print(json.dumps(judgment_good, indent=2))

    # --- Biased/Poor Example ---
    biased_legal_survey_question = """
    Don't you agree that overly restrictive data privacy laws like the FADP are hindering essential technological innovation and economic growth in Switzerland?
    (Select one: Yes, No)
    """
    print("\n--- Evaluating Biased Legal Survey Question ---")
    judgment_biased = judge.judge_survey_question(biased_legal_survey_question)
    if judgment_biased:
        print(json.dumps(judgment_biased, indent=2))

    # --- Ambiguous/Vague Example ---
    vague_legal_survey_question = """
    What are your thoughts on legal tech?
    """
    print("\n--- Evaluating Vague Legal Survey Question ---")
    judgment_vague = judge.judge_survey_question(vague_legal_survey_question)
    if judgment_vague:
        print(json.dumps(judgment_vague, indent=2))

The Python code defines a class LLMJudgeForLegalSurvey designed to evaluate the quality of legal survey questions using a generative AI model. It utilizes the google.generativeai library to interact with Gemini models.

Python代码定义了一个LLMJudgeForLegalSurvey类,旨在使用生成式AI模型评估法律调查问题的质量。它利用google.generativeai库与Gemini模型进行交互。

The core functionality involves sending a survey question to the model along with a detailed rubric for evaluation. The rubric specifies five criteria for judging survey questions: Clarity & Precision, Neutrality & Bias, Relevance & Focus, Completeness, and Appropriateness for Audience. For each criterion, a score from 1 to 5 is assigned, and a detailed rationale and feedback are required in the output. The code constructs a prompt that includes the rubric and the survey question to be evaluated.

核心功能涉及将调查问题与详细的评估标准一起发送给模型。评估标准指定了判断调查问题的五个标准:清晰度与精确性、中立性与偏见、相关性与焦点、完整性,以及对受众的适宜性。对于每个标准,分配1到5分的评分,并要求在输出中提供详细的理由和反馈。代码构建了一个包含评估标准和待评估调查问题的提示。

The judge_survey_question method sends this prompt to the configured Gemini model, requesting a JSON response formatted according to the defined structure. The expected output JSON includes an overall score, a summary rationale, detailed feedback for each criterion, a list of concerns, and a recommended action. The class handles potential errors during the AI model interaction, such as JSON decoding issues or empty responses. The script demonstrates its operation by evaluating examples of legal survey questions, illustrating how the AI assesses quality based on the predefined criteria.

judge_survey_question方法将此提示发送给配置的Gemini模型,请求按照定义结构格式化的JSON响应。预期的输出JSON包括总体评分、总结理由、每个标准的详细反馈、关注点列表和建议操作。该类处理与AI模型交互过程中可能出现的错误,如JSON解码问题或空响应。脚本通过评估法律调查问题的示例来演示其操作,说明AI如何基于预定义标准评估质量。

Before we conclude, let's examine various evaluation methods, considering their strengths and weaknesses.

在结束之前,让我们检查各种评估方法,考虑它们的优势和劣势。

| Evaluation Method | Strengths | Weaknesses |
| --- | --- | --- |
| Human Evaluation | Captures subtle behavior | Difficult to scale, expensive, and time-consuming, as it considers subjective human factors. |
| LLM-as-a-Judge | Consistent, efficient, and scalable. | Intermediate steps may be overlooked. Limited by LLM capabilities. |
| Automated Metrics | Scalable, efficient, and objective | Potential limitation in capturing complete capabilities. |

| 评估方法 | 优势 | 劣势 |
| --- | --- | --- |
| 人工评估 | 捕捉细微行为 | 难以扩展,昂贵且耗时,因为它考虑了主观人为因素。 |
| LLM作为评判者 | 一致、高效且可扩展。 | 中间步骤可能被忽视。受LLM能力限制。 |
| 自动化指标 | 可扩展、高效且客观 | 在捕捉完整能力方面可能存在局限性。 |

Agent trajectories

智能体轨迹

Evaluating agents' trajectories is essential, as traditional software tests are insufficient. Standard code yields predictable pass/fail results, whereas agents operate probabilistically, necessitating qualitative assessment of both the final output and the agent's trajectory—the sequence of steps taken to reach a solution. Evaluating multi-agent systems is challenging because they are constantly in flux. This requires developing sophisticated metrics that go beyond individual performance to measure the effectiveness of communication and teamwork. Moreover, the environments themselves are not static, demanding that evaluation methods, including test cases, adapt over time.

评估智能体轨迹是必不可少的,因为传统的软件测试是不够的。标准代码产生可预测的通过/失败结果,而智能体以概率方式运行,需要对最终输出和智能体轨迹(达到解决方案所采取的步骤序列)进行定性评估。评估多智能体系统具有挑战性,因为它们不断变化。这需要开发复杂的指标,超越个体表现来衡量沟通和团队协作的有效性。此外,环境本身不是静态的,要求评估方法(包括测试用例)随时间适应。

This involves examining the quality of decisions, the reasoning process, and the overall outcome. Implementing automated evaluations is valuable, particularly for development beyond the prototype stage. Analyzing trajectory and tool use includes evaluating the steps an agent employs to achieve a goal, such as tool selection, strategies, and task efficiency. For example, an agent addressing a customer's product query might ideally follow a trajectory involving intent determination, database search tool use, result review, and report generation. The agent's actual actions are compared to this expected, or ground truth, trajectory to identify errors and inefficiencies. Comparison methods include exact match (requiring a perfect match to the ideal sequence), in-order match (correct actions in order, allowing extra steps), any-order match (correct actions in any order, allowing extra steps), precision (measuring the relevance of predicted actions), recall (measuring how many essential actions are captured), and single-tool use (checking for a specific action). Metric selection depends on specific agent requirements, with high-stakes scenarios potentially demanding an exact match, while more flexible situations might use an in-order or any-order match.

这涉及检查决策质量、推理过程和整体结果。实施自动化评估是有价值的,特别是对于原型阶段之后的开发。分析轨迹和工具使用包括评估智能体为实现目标所采取的步骤,如工具选择、策略和任务效率。例如,处理客户产品查询的智能体理想情况下应遵循涉及意图确定、数据库搜索工具使用、结果审查和报告生成的轨迹。智能体的实际动作与此预期的真实值轨迹进行比较,以识别错误和效率低下。比较方法包括精确匹配(要求与理想序列完美匹配)、顺序匹配(按顺序的正确动作,允许额外步骤)、任意顺序匹配(任意顺序的正确动作,允许额外步骤)、精确率(衡量预测动作的相关性)、召回率(衡量捕获了多少基本动作)和单工具使用(检查特定动作)。指标选择取决于具体的智能体要求,高风险场景可能需要精确匹配,而更灵活的情况可能使用顺序或任意顺序匹配。
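The comparison methods just listed are simple enough to sketch directly, assuming a trajectory is recorded as a list of action names; the example trajectory mirrors the customer-query scenario above, with one hypothetical extra logging step:

```python
def exact_match(actual, expected):
    """Trajectory matches the ideal sequence perfectly."""
    return actual == expected

def in_order_match(actual, expected):
    """Expected actions appear in order; extra steps are allowed."""
    it = iter(actual)
    return all(step in it for step in expected)  # subsequence check

def any_order_match(actual, expected):
    """All expected actions appear, order ignored; extras allowed."""
    return set(expected) <= set(actual)

def precision(actual, expected):
    """Fraction of predicted actions that were actually expected."""
    return sum(a in expected for a in actual) / len(actual) if actual else 0.0

def recall(actual, expected):
    """Fraction of essential actions the agent actually took."""
    return sum(e in actual for e in expected) / len(expected) if expected else 1.0

expected = ["determine_intent", "search_database", "review_results", "generate_report"]
actual = ["determine_intent", "search_database", "log_event",
          "review_results", "generate_report"]

print(exact_match(actual, expected))     # False: the extra step breaks exact match
print(in_order_match(actual, expected))  # True: expected steps occur in order
print(f"precision={precision(actual, expected):.2f}, "
      f"recall={recall(actual, expected):.2f}")  # precision=0.80, recall=1.00
```

This illustrates the metric-selection point in the text: the same trajectory fails the high-stakes exact-match criterion while passing the more permissive in-order criterion.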

Evaluation of AI agents involves two primary approaches: using test files and using evalset files. Test files, in JSON format, represent single, simple agent-model interactions or sessions and are ideal for unit testing during active development, focusing on rapid execution and simple session complexity. Each test file contains a single session with multiple turns, where a turn is a user-agent interaction including the user's query, expected tool use trajectory, intermediate agent responses, and final response. For example, a test file might detail a user request to "Turn off device_2 in the Bedroom," specifying the agent's use of a set_device_info tool with parameters like location: Bedroom, device_id: device_2, and status: OFF, and an expected final response of "I have set the device_2 status to off." Test files can be organized into folders and may include a test_config.json file to define evaluation criteria.

Evalset files utilize a dataset called an "evalset" to evaluate interactions, containing multiple potentially lengthy sessions suited for simulating complex, multi-turn conversations and integration tests. An evalset file comprises multiple "evals," each representing a distinct session with one or more "turns" that include user queries, expected tool use, intermediate responses, and a reference final response. An example evalset might include a session where the user first asks "What can you do?" and then says "Roll a 10 sided dice twice and then check if 9 is a prime or not," defining expected roll_die tool calls and a check_prime tool call, along with the final response summarizing the dice rolls and the prime check.

AI智能体的评估涉及两种主要方法:使用测试文件和使用评估集文件。测试文件采用JSON格式,表示单个、简单的智能体-模型交互或会话,适用于活跃开发期间的单元测试,专注于快速执行和简单的会话复杂性。每个测试文件包含一个具有多个轮次的会话,其中一个轮次是用户-智能体交互,包括用户的查询、预期的工具使用轨迹、中间智能体响应和最终响应。例如,测试文件可能详细说明用户请求"关闭卧室中的device_2",指定智能体使用带有location: Bedroom、device_id: device_2和status: OFF等参数的set_device_info工具,以及预期的最终响应"我已将device_2状态设置为关闭"。测试文件可以组织到文件夹中,并且可以包含test_config.json文件来定义评估标准。

评估集文件利用称为"评估集"的数据集来评估交互,包含多个可能较长的会话,适用于模拟复杂的多轮次对话和集成测试。评估集文件包含多个"评估",每个评估代表一个具有一个或多个"轮次"的不同会话,这些轮次包括用户查询、预期的工具使用、中间响应和参考最终响应。评估集的示例可能包括一个会话,用户首先询问"你能做什么?"然后说"掷一个10面骰子两次,然后检查9是否为质数",定义预期的roll_die工具调用和check_prime工具调用,以及总结骰子滚动和质数检查的最终响应。
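Putting the smart-home example into concrete form, a single-turn test file might look roughly like this; the field names approximate the ADK-style schema described in the text and may differ from the framework's exact format:

```json
{
  "eval_id": "turn_off_bedroom_device",
  "conversation": [
    {
      "user_query": "Turn off device_2 in the Bedroom.",
      "expected_tool_use": [
        {
          "tool_name": "set_device_info",
          "tool_input": {
            "location": "Bedroom",
            "device_id": "device_2",
            "status": "OFF"
          }
        }
      ],
      "reference_response": "I have set the device_2 status to off."
    }
  ]
}
```

An evalset file would hold a list of several such sessions, each potentially spanning multiple turns, which is what makes it suitable for integration-level testing.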

Multi-agents: Evaluating a complex AI system with multiple agents is much like assessing a team project. Because there are many steps and handoffs, its complexity is an advantage, allowing you to check the quality of work at each stage. You can examine how well each individual "agent" performs its specific job, but you must also evaluate how the entire system is performing as a whole.

多智能体: 评估具有多个智能体的复杂AI系统就像评估团队项目一样。由于有许多步骤和交接,其复杂性是一个优势,允许您在每个阶段检查工作质量。您可以检查每个单独的"智能体"如何执行其特定工作,但您还必须评估整个系统的整体表现。

To do this, you ask key questions about the team's dynamics, supported by concrete examples:

为此,您需要询问有关团队动态的关键问题,并以具体示例为支持:

  • Are the agents cooperating effectively? For instance, after a 'Flight-Booking Agent' secures a flight, does it successfully pass the correct dates and destination to the 'Hotel-Booking Agent'? A failure in cooperation could lead to a hotel being booked for the wrong week.

  • 智能体是否有效协作?例如,在"航班预订智能体"确保航班后,它是否成功地将正确的日期和目的地传递给"酒店预订智能体"?协作失败可能导致酒店预订错误的周次。

  • Did they create a good plan and stick to it? Imagine the plan is to first book a flight, then a hotel. If the 'Hotel Agent' tries to book a room before the flight is confirmed, it has deviated from the plan. You also check if an agent gets stuck, for example, endlessly searching for a "perfect" rental car and never moving on to the next step.

  • 他们是否制定了好的计划并坚持执行?想象计划是先预订航班,然后预订酒店。如果"酒店智能体"在航班确认前试图预订房间,它就偏离了计划。您还需要检查智能体是否卡住,例如,无休止地搜索"完美"的租车而从不进入下一步。

  • Is the right agent being chosen for the right task? If a user asks about the weather for their trip, the system should use a specialized 'Weather Agent' that provides live data. If it instead uses a 'General Knowledge Agent' that gives a generic answer like "it's usually warm in summer," it has chosen the wrong tool for the job.

  • 是否为正确的任务选择了正确的智能体?如果用户询问他们旅行的天气,系统应该使用提供实时数据的专业"天气智能体"。如果它反而使用给出通用答案(如"夏天通常很温暖")的"通用知识智能体",那么它就为工作选择了错误的工具。

  • Finally, does adding more agents improve performance? If you add a new 'Restaurant-Reservation Agent' to the team, does it make the overall trip-planning better and more efficient? Or does it create conflicts and slow the system down, indicating a problem with scalability?

  • 最后,添加更多智能体是否能提高性能?如果您向团队添加新的"餐厅预订智能体",它是否使整体旅行规划更好更高效?还是它会产生冲突并减慢系统速度,表明存在可扩展性问题?

From Agents to Advanced Contractors

从智能体到高级承包商

Recently, an evolution from simple AI agents to advanced "contractors" has been proposed (Agent Companion, Gulli et al.), moving from probabilistic, often unreliable systems to more deterministic and accountable ones designed for complex, high-stakes environments (see Fig. 2).

最近,有人提出了从简单的AI智能体向高级"承包商"的演进(Agent Companion,Gulli等人),从概率性的、通常不可靠的系统转向更确定性、更负责任的系统,专为复杂、高风险环境设计(见图2)。

Today's common AI agents operate on brief, underspecified instructions, which makes them suitable for simple demonstrations but brittle in production, where ambiguity leads to failure. The "contractor" model addresses this by establishing a rigorous, formalized relationship between the user and the AI, built upon a foundation of clearly defined and mutually agreed-upon terms, much like a legal service agreement in the human world. This transformation is supported by four key pillars that collectively ensure clarity, reliability, and robust execution of tasks that were previously beyond the scope of autonomous systems.

当今常见的AI智能体在简短、指定不明确的指令下运行,这使它们适用于简单演示,但在生产环境中却很脆弱,其中模糊性会导致失败。"承包商"模型通过建立用户与AI之间的严格、正式关系来解决这个问题,这种关系建立在明确定义和双方同意的条款基础上,类似于人类世界中的法律服务协议。这种转变由四个关键支柱支持,共同确保清晰度、可靠性和稳健的任务执行,这些任务以前超出了自主系统的范围。

First is the pillar of the Formalized Contract, a detailed specification that serves as the single source of truth for a task. It goes far beyond a simple prompt. For example, a contract for a financial analysis task wouldn't just say "analyze last quarter's sales"; it would demand "a 20-page PDF report analyzing European market sales from Q1 2025, including five specific data visualizations, a comparative analysis against Q1 2024, and a risk assessment based on the included dataset of supply chain disruptions." This contract explicitly defines the required deliverables, their precise specifications, the acceptable data sources, the scope of work, and even the expected computational cost and completion time, making the outcome objectively verifiable.

首先是正式合同的支柱,这是一个详细的规范,作为任务的唯一真实来源。它远远超出了简单的提示。例如,财务分析任务的合同不会只说"分析上一季度的销售情况";它会要求"一份20页的PDF报告,分析2025年第一季度的欧洲市场销售情况,包括五个特定的数据可视化、与2024年第一季度的对比分析,以及基于包含的供应链中断数据集的风险评估。"该合同明确定义了所需的可交付成果、其精确规范、可接受的数据源、工作范围,甚至预期的计算成本和完成时间,使结果可以客观验证。
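One way to make such a contract machine-checkable is to represent it as a structured record rather than a free-form prompt. The sketch below uses a plain dataclass; the field names are hypothetical illustrations based on the financial-analysis example above, not part of any specific framework:

```python
from dataclasses import dataclass, field

@dataclass
class TaskContract:
    """Illustrative formalized contract for an agent task.
    Field names are invented for this example."""
    task: str
    deliverables: list
    data_sources: list
    scope: str
    max_cost_usd: float
    deadline_hours: float
    acceptance_criteria: list = field(default_factory=list)

    def is_verifiable(self) -> bool:
        """A contract is objectively verifiable only if it names
        concrete deliverables and acceptance criteria."""
        return bool(self.deliverables and self.acceptance_criteria)

contract = TaskContract(
    task="Analyze European market sales for Q1 2025",
    deliverables=["20-page PDF report", "5 data visualizations"],
    data_sources=["internal_sales_db"],
    scope="Q1 2025 vs Q1 2024 comparative analysis plus risk assessment",
    max_cost_usd=50.0,
    deadline_hours=24.0,
    acceptance_criteria=[
        "all visualizations sourced from approved data",
        "risk assessment covers supply chain disruptions",
    ],
)
print(contract.is_verifiable())  # True
```

A prompt like "analyze last quarter's sales" would translate into a contract with empty `acceptance_criteria`, which this check would flag before any work begins.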

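As a rough sketch, such a contract can be represented as a structured object rather than free-form prompt text. The field names below (`deliverables`, `max_cost_usd`, `deadline_hours`) are illustrative, not part of any specific framework:

```python
from dataclasses import dataclass, field

@dataclass
class Deliverable:
    """One verifiable output item required by the contract."""
    description: str
    format: str           # e.g. "PDF", "CSV"
    acceptance_test: str  # how the deliverable will be checked

@dataclass
class TaskContract:
    """Illustrative formal contract: the single source of truth for a task."""
    objective: str
    deliverables: list[Deliverable] = field(default_factory=list)
    data_sources: list[str] = field(default_factory=list)
    max_cost_usd: float = 0.0
    deadline_hours: float = 0.0

    def is_complete(self) -> bool:
        # A contract is only actionable once every section is specified.
        return bool(self.objective and self.deliverables
                    and self.data_sources and self.deadline_hours > 0)

contract = TaskContract(
    objective="Analyze Q1 2025 European market sales",
    deliverables=[Deliverable(
        description="20-page report with five data visualizations",
        format="PDF",
        acceptance_test="page count == 20 and chart count >= 5")],
    data_sources=["internal_sales_db"],
    max_cost_usd=50.0,
    deadline_hours=24.0,
)
print(contract.is_complete())  # True
```

Because every requirement is an explicit field, the outcome becomes objectively verifiable: a reviewer (human or agent) can check each deliverable against its `acceptance_test`.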
Second is the pillar of a Dynamic Lifecycle of Negotiation and Feedback. The contract is not a static command but the start of a dialogue. The contractor agent can analyze the initial terms and negotiate. For instance, if a contract demands the use of a specific proprietary data source the agent cannot access, it can return feedback stating, "The specified XYZ database is inaccessible. Please provide credentials or approve the use of an alternative public database, which may slightly alter the data's granularity." This negotiation phase, which also allows the agent to flag ambiguities or potential risks, resolves misunderstandings before execution begins, preventing costly failures and ensuring the final output aligns perfectly with the user's actual intent.

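The negotiation step can be sketched as a pre-execution review that returns clarification requests instead of starting work on unsatisfiable terms. Function and field names here are hypothetical:

```python
def review_contract(contract: dict, accessible_sources: set[str]) -> list[str]:
    """Pre-execution review: return clarification requests instead of
    failing mid-task. An empty list means the agent accepts the terms."""
    issues = []
    for source in contract.get("data_sources", []):
        if source not in accessible_sources:
            issues.append(f"The specified source '{source}' is inaccessible. "
                          "Please provide credentials or approve an alternative.")
    if not contract.get("deliverables"):
        issues.append("No deliverables are specified; please define the expected outputs.")
    return issues

contract = {"data_sources": ["xyz_proprietary_db"], "deliverables": ["Q1 report"]}
for issue in review_contract(contract, accessible_sources={"public_db"}):
    print(issue)
```

Only once `review_contract` returns an empty list (all terms accepted) would execution begin, which is what resolves misunderstandings before they become costly failures.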
Fig. 2: Contract execution example among agents

The third pillar is Quality-Focused Iterative Execution. Unlike agents designed for low-latency responses, a contractor prioritizes correctness and quality. It operates on a principle of self-validation and correction. For a code generation contract, for example, the agent would not just write the code; it would generate multiple algorithmic approaches, compile and run them against a suite of unit tests defined within the contract, score each solution on metrics like performance, security, and readability, and only submit the version that passes all validation criteria. This internal loop of generating, reviewing, and improving its own work until the contract's specifications are met is crucial for building trust in its outputs.

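A minimal sketch of this generate-validate-select loop, using a toy "sorting function" contract in place of real code generation (all names and the placeholder scoring function are illustrative):

```python
def best_validated(candidates, unit_tests, score):
    """Return the highest-scoring candidate that passes every contract test,
    or None if nothing meets the validation criteria (so nothing is submitted)."""
    passing = [c for c in candidates if all(t(c) for t in unit_tests)]
    return max(passing, key=score, default=None)

# Toy stand-in for a code-generation contract: produce a sorting function.
candidates = [
    sorted,                          # correct approach
    lambda xs: list(reversed(xs)),   # wrong approach
    lambda xs: xs,                   # no-op approach
]
unit_tests = [lambda f: f([3, 1, 2]) == [1, 2, 3]]  # tests defined in the contract
score = lambda f: 1.0  # placeholder for performance/security/readability scoring
winner = best_validated(candidates, unit_tests, score)
print(winner([2, 1]))  # [1, 2]
```

The key property is that nothing leaves the loop until it passes all validation criteria: a candidate that fails any contract test is never submitted, regardless of how quickly it was produced.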
Finally, the fourth pillar is Hierarchical Decomposition via Subcontracts. For tasks of significant complexity, a primary contractor agent can act as a project manager, breaking the main goal into smaller, more manageable sub-tasks. It achieves this by generating new, formal "subcontracts." For example, a master contract to "build an e-commerce mobile application" could be decomposed by the primary agent into subcontracts for "designing the UI/UX," "developing the user authentication module," "creating the product database schema," and "integrating a payment gateway." Each of these subcontracts is a complete, independent contract with its own deliverables and specifications, which could be assigned to other specialized agents. This structured decomposition allows the system to tackle immense, multifaceted projects in a highly organized and scalable manner, marking the transition of AI from a simple tool to a truly autonomous and reliable problem-solving engine.

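Decomposition into subcontracts can be sketched as follows; the even split of the deadline and the field names are simplifying assumptions for illustration:

```python
def decompose(master: dict, subtasks: list[str]) -> list[dict]:
    """Split a master contract into independent subcontracts, each a complete
    contract of its own; the naive even deadline split is for illustration."""
    return [{"objective": task,
             "parent": master["objective"],
             "deadline_hours": master["deadline_hours"] / len(subtasks)}
            for task in subtasks]

master = {"objective": "build an e-commerce mobile application",
          "deadline_hours": 400.0}
subcontracts = decompose(master, [
    "design the UI/UX",
    "develop the user authentication module",
    "create the product database schema",
    "integrate a payment gateway",
])
print(len(subcontracts), subcontracts[0]["deadline_hours"])  # 4 100.0
```

Each resulting dict is itself a contract that could be handed to a specialized agent, which is what makes the scheme recursive and scalable.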
Ultimately, this contractor framework reimagines AI interaction by embedding principles of formal specification, negotiation, and verifiable execution directly into the agent's core logic. This methodical approach elevates artificial intelligence from a promising but often unpredictable assistant into a dependable system capable of autonomously managing complex projects with auditable precision. By solving the critical challenges of ambiguity and reliability, this model paves the way for deploying AI in mission-critical domains where trust and accountability are paramount.

Google's ADK

Before concluding, let's look at a concrete example of a framework that supports evaluation. Agent evaluation with Google's ADK (see Fig.3) can be conducted via three methods: a web-based UI (adk web) for interactive evaluation and dataset generation, programmatic integration using pytest for incorporation into testing pipelines, and a direct command-line interface (adk eval) for automated evaluations suitable for regular build and verification processes.

Fig.3: Evaluation Support for Google ADK

The web-based UI enables interactive session creation and saving into existing or new eval sets, displaying evaluation status. Pytest integration allows running test files as part of integration tests by calling AgentEvaluator.evaluate, specifying the agent module and test file path.

The command-line interface facilitates automated evaluation by providing the agent module path and eval set file, with options to specify a configuration file or print detailed results. Specific evals within a larger eval set can be selected for execution by listing them after the eval set filename, separated by commas.

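Concretely, the invocations described above might look like the lines below; the agent path, eval-set filename, and eval names are placeholders, and the exact syntax may differ between ADK versions (consult the ADK documentation for your release):

```shell
# Run every eval in an eval set (paths illustrative)
adk eval path/to/my_agent my_evals.evalset.json

# Select specific evals by listing them after the eval set
# filename, separated by commas
adk eval path/to/my_agent my_evals.evalset.json:eval_case_1,eval_case_2
```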
At a Glance

What: Agentic systems and LLMs operate in complex, dynamic environments where their performance can degrade over time. Their probabilistic and non-deterministic nature means that traditional software testing is insufficient for ensuring reliability. Evaluating dynamic multi-agent systems is a significant challenge because the constantly changing nature of both the agents and their environments demands adaptive testing methods and sophisticated metrics that can measure collaborative success beyond individual performance. Problems like data drift, unexpected interactions, tool-calling errors, and deviations from intended goals can arise after deployment. Continuous assessment is therefore necessary to measure an agent's effectiveness, efficiency, and adherence to operational and safety requirements.

Why: A standardized evaluation and monitoring framework provides a systematic way to assess and ensure the ongoing performance of intelligent agents. This involves defining clear metrics for accuracy, latency, and resource consumption, like token usage for LLMs. It also includes advanced techniques such as analyzing agentic trajectories to understand the reasoning process and employing an LLM-as-a-Judge for nuanced, qualitative assessments. By establishing feedback loops and reporting systems, this framework allows for continuous improvement, A/B testing, and the detection of anomalies or performance drift, ensuring the agent remains aligned with its objectives.

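A minimal sketch of such metric collection around an agent call, using whitespace-separated word counts as a stand-in for real tokenizer counts (all class and function names are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    latency_s: float
    prompt_tokens: int
    completion_tokens: int

@dataclass
class Monitor:
    """Collects per-request metrics so performance drift can be spotted over time."""
    records: list = field(default_factory=list)

    def observe(self, agent_fn, prompt: str) -> str:
        start = time.perf_counter()
        reply = agent_fn(prompt)
        latency = time.perf_counter() - start
        # Whitespace word counts stand in for a real tokenizer here.
        self.records.append(
            EvalRecord(latency, len(prompt.split()), len(reply.split())))
        return reply

    def avg_latency_s(self) -> float:
        return sum(r.latency_s for r in self.records) / len(self.records)

monitor = Monitor()
monitor.observe(lambda p: "Paris is the capital of France.", "capital of France?")
print(len(monitor.records), monitor.records[0].completion_tokens)  # 1 6
```

In a real deployment the records would feed a structured log or time-series database rather than an in-memory list, enabling the A/B comparisons and drift detection described above.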
Rule of thumb: Use this pattern when deploying agents in live, production environments where real-time performance and reliability are critical. Additionally, use it when needing to systematically compare different versions of an agent or its underlying models to drive improvements, and when operating in regulated or high-stakes domains requiring compliance, safety, and ethical audits. This pattern is also suitable when an agent's performance may degrade over time due to changes in data or the environment (drift), or when evaluating complex agentic behavior, including the sequence of actions (trajectory) and the quality of subjective outputs like helpfulness.

Visual summary

Fig.4: Evaluation and Monitoring design pattern

Key Takeaways

  • Evaluating intelligent agents goes beyond traditional tests to continuously measure their effectiveness, efficiency, and adherence to requirements in real-world environments.

  • Practical applications of agent evaluation include performance tracking in live systems, A/B testing for improvements, compliance audits, and detecting drift or anomalies in behavior.

  • Basic agent evaluation involves assessing response accuracy, while real-world scenarios demand more sophisticated metrics like latency monitoring and token usage tracking for LLM-powered agents.

  • Agent trajectories, the sequence of steps an agent takes, are crucial for evaluation, comparing actual actions against an ideal, ground-truth path to identify errors and inefficiencies.

  • The ADK provides structured evaluation methods through individual test files for unit testing and comprehensive evalset files for integration testing, both defining expected agent behavior.

  • Agent evaluations can be executed via a web-based UI for interactive testing, programmatically with pytest for CI/CD integration, or through a command-line interface for automated workflows.

  • In order to make AI reliable for complex, high-stakes tasks, we must move from simple prompts to formal "contracts" that precisely define verifiable deliverables and scope. This structured agreement allows the agent to negotiate, clarify ambiguities, and iteratively validate its own work, transforming it from an unpredictable tool into an accountable and trustworthy system.

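Trajectory comparison against a ground-truth path, mentioned in the takeaways above, can be sketched as follows; the metric names and the toy travel-booking tool sequence are illustrative:

```python
def trajectory_score(actual: list[str], expected: list[str]) -> dict:
    """Compare an agent's tool-call sequence with a ground-truth path.

    Two illustrative metrics: exact match, and in-order recall (fraction of
    expected steps that appear in the actual trajectory, in order)."""
    remaining = iter(actual)
    # `in` on an iterator consumes it, so later expected steps must
    # appear after earlier ones -- this enforces ordering.
    matched = sum(1 for step in expected if step in remaining)
    return {"exact_match": actual == expected,
            "in_order_recall": matched / len(expected)}

expected = ["search_flights", "book_flight", "book_hotel"]
actual = ["search_flights", "check_weather", "book_flight", "book_hotel"]
print(trajectory_score(actual, expected))
# {'exact_match': False, 'in_order_recall': 1.0}
```

Here the agent took an extra (possibly wasteful) step but preserved the required order, so a strict exact-match metric flags it while the recall metric does not; real evaluations typically report several such views side by side.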
Conclusions

In conclusion, effectively evaluating AI agents requires moving beyond simple accuracy checks to a continuous, multi-faceted assessment of their performance in dynamic environments. This involves practical monitoring of metrics like latency and resource consumption, as well as sophisticated analysis of an agent's decision-making process through its trajectory. For nuanced qualities like helpfulness, innovative methods such as the LLM-as-a-Judge are becoming essential, while frameworks like Google's ADK provide structured tools for both unit and integration testing. The challenge intensifies with multi-agent systems, where the focus shifts to evaluating collaborative success and effective cooperation.

To ensure reliability in critical applications, the paradigm is shifting from simple, prompt-driven agents to advanced "contractors" bound by formal agreements. These contractor agents operate on explicit, verifiable terms, allowing them to negotiate, decompose tasks, and self-validate their work to meet rigorous quality standards. This structured approach transforms agents from unpredictable tools into accountable systems capable of handling complex, high-stakes tasks. Ultimately, this evolution is crucial for building the trust required to deploy sophisticated agentic AI in mission-critical domains.

References

Relevant research includes:

  1. ADK Web: github.com/google/adk-…
  2. ADK Evaluate: google.github.io/adk-docs/ev…
  3. Survey on Evaluation of LLM-based Agents, arxiv.org/abs/2503.16…
  4. Agent-as-a-Judge: Evaluate Agents with Agents, arxiv.org/abs/2410.10…
  5. Agent Companion, Gulli et al.: www.kaggle.com/whitepaper-…