Learning and Adaptation in AI Agents: Notes from Theory to Practice
Lately people keep asking me: "Can AI agents really learn the way humans do?" That is exactly what this post is about.
I recently dug into the learning and adaptation mechanisms of AI agents, the core capability that lets an AI system evolve on its own. Working through "Chapter 9: Learning and Adaptation" and combining it with my own project experience, I came away with a much clearer picture of how agents improve themselves through accumulated experience and interaction with their environment.
📌 1. Learning and Adaptation: The Core Evolution Mechanism of AI Agents
Learning and adaptation let an AI agent move beyond its preset parameters and keep refining its behavior through accumulated experience. This mirrors human learning: as Thorndike's trial-and-error theory puts it, learning is the gradual formation of stimulus-response connections through repeated trial and error.
In AI, this process shows up as several learning paradigms:
🔹 Reinforcement Learning
The agent learns optimal behavior through rewards and penalties, which is especially well suited to decision-making in dynamic environments. While building a game AI I used PPO (Proximal Policy Optimization); its "clipping" mechanism keeps policy updates inside a "trust region" so they never become too aggressive, and it noticeably improved training stability.
🔹 Few-Shot / Zero-Shot Learning
Agents built on **large language models (LLMs)** can adapt to new tasks quickly, which reminds me of human observational learning. As Bandura's social cognitive theory notes, people can acquire new behaviors just by watching a model, and LLM-based agents show a similarly fast kind of adaptation.
🔹 Memory Mechanisms
In real projects I have found that memory is critical to learning efficiency. An agent that can recall past experience can use it to adjust its current actions, much the way humans lean on experience when making decisions.
💡 Practical result: when designing a dialogue system I added a memory store so the agent could tailor its replies to each user's interaction history; user satisfaction rose by more than 30%. A simplified sketch of the idea follows.
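To make the idea concrete, here is a tiny sketch of a per-user memory store. This is an illustration only, not the production system; the class and method names are made up for this post.

```python
from collections import defaultdict, deque

class MemoryStore:
    """Keep the most recent turns per user and fold them into the next prompt."""
    def __init__(self, max_turns=20):
        self.history = defaultdict(lambda: deque(maxlen=max_turns))

    def remember(self, user_id, user_msg, agent_reply):
        self.history[user_id].append((user_msg, agent_reply))

    def build_prompt(self, user_id, new_msg):
        # Prepend past turns so the model can personalize its reply
        past = "\n".join(f"User: {u}\nAgent: {a}" for u, a in self.history[user_id])
        return f"Previous interactions:\n{past}\n\nCurrent message: {new_msg}"
```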
📌 2. Self-Improving AI Agents: Lessons from SICA
**SICA (the Self-Improving Coding Agent)** demonstrates an AI system that can modify its own source code, a genuine instance of "meta-learning". SICA does not just change its external behavior; it directly improves its own underlying implementation.
It reminds me of something Tencent's Chief Scientist Zhang Zhengyou has said: stay humble toward the unknown and keep a sense of awe.
🔸 A Real-World Example
When my team built an automated testing tool, we borrowed ideas from SICA's architecture:
- We set up an evaluation framework that lets the agent analyze the performance of its own test scripts
- The agent proposed improvements and surfaced optimizations we had not thought of ourselves
- For example, reordering certain API calls to speed up execution
🔸 The Advantages of a Modular Architecture
SICA's modular architecture is especially worth borrowing:
- Specialized sub-agents (a coding agent, a problem-solving agent, and a reasoning agent) decompose complex tasks
- This both keeps the LLM's context length under control and improves overall system efficiency
- We took a similar approach in our project, splitting one large agent into several narrowly focused, collaborating agents
✅ Outcome: debugging efficiency improved noticeably and team collaboration got smoother.
📌 3. Challenges and Solutions in Practice
Applying learning and adaptation mechanisms in real projects raised several key challenges:
🔹 Environmental Uncertainty
In dynamic environments an agent can run into situations it never saw during training. We adopted online learning so the agent keeps updating its knowledge from new data; a minimal sketch follows below.
This echoes the spirit of "failure studies": as the American museum of failed products shows, facing failure squarely and learning from it is the key to progress.
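As a minimal sketch of that online-learning loop, assuming scikit-learn and a binary label such as "user accepted the suggestion" (the feature pipeline and event stream are omitted):

```python
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()            # a linear model that supports incremental updates
classes = [0, 1]                   # e.g., suggestion ignored vs. accepted
seen_first_batch = False

def update_on_new_data(X_batch, y_batch):
    """Incrementally update the model as fresh interaction data streams in."""
    global seen_first_batch
    if not seen_first_batch:
        # partial_fit needs the full label set on the first call
        model.partial_fit(X_batch, y_batch, classes=classes)
        seen_first_batch = True
    else:
        model.partial_fit(X_batch, y_batch)
```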
🔹 Designing an Evaluation Framework
Accurately measuring how well an agent has learned is a real challenge. We borrowed from AlphaEvolve's automated evaluation system:
- Define multi-dimensional metrics for agent behavior
- Look beyond task completion to efficiency, resource consumption, and other factors
- Build a comprehensive performance evaluation pipeline
🔹 Safety and Controllability
A self-improving agent can behave unpredictably. We introduced an "overseer" mechanism similar to SICA's:
- Periodically review the agent's behavior to make sure it stays within preset boundaries
- Run the agent inside Docker containers to provide the necessary isolation
- Put explicit safety guardrails in place
🔹 An E-commerce Recommendation System in Practice
For an e-commerce recommender we combined supervised learning with online learning:
- Initial phase: train a base model on a sufficient amount of labeled data
- After launch: the agent adapts to shifting user preferences through online learning
- Result: recommendation click-through rate rose by 25%
📌 4. Comparing Agent Learning with Human Cognition
Studying how AI agents learn made me think about how it compares with human learning:
🔸 Similarities
- Thorndike's trial-and-error theory resembles reinforcement learning's exploration-exploitation trade-off
- Bandura's observational learning has much in common with few-shot learning in AI
- Both show that observing others (or data) can greatly improve learning efficiency
🔸 Fundamental Differences
Still, agent learning and human learning differ in essential ways:
- Humans have intrinsic motivation and curiosity and actively explore the unknown
- Most AI agents remain bound to preset objectives
- Human learning is more creative and more flexible
🧠 As Dr. Zhang Zhengyou puts it, "the more you know, the more you realize you don't know." A truly intelligent agent should carry that same curiosity about, and respect for, the unknown.
📌 5. Looking Ahead
Based on what we have seen in practice, I expect learning and adaptation in AI agents to develop along these lines:
🔹 More Efficient Meta-Learning
As SICA shows, agents that can improve themselves will only grow in importance. We need stronger frameworks that let agents learn not just tasks but also how to optimize their own learning process.
🔹 Blending Multiple Learning Paradigms
Hybrid approaches that combine reinforcement, supervised, and unsupervised learning will become mainstream. Like DeepMind's agents that can negotiate and reach agreements, flexibly combining learning mechanisms lets agents handle more complex tasks.
🔹 More Emphasis on Safety and Transparency
As agents become more autonomous, keeping their behavior aligned with human values is critical. Methods such as **Direct Preference Optimization (DPO)** adjust model behavior directly from human preference data.
💎 Closing Thoughts
Digging into how AI agents learn and adapt convinced me that this is the key to building genuinely intelligent systems. As the chapter puts it, learning and adaptation allow agents to evolve from simply following instructions into systems that get smarter over time.
Applying these principles successfully in real projects means balancing autonomy against controllability, and exploration against efficiency.
Future AI systems will lean ever more heavily on the ability to evolve autonomously, which means we as engineers must grasp not only the algorithms but also the design philosophy behind them. As Yotaro Hatamura put it, a failure that is carefully analyzed and summarized becomes a milestone on the road to success.
Whether for humans or for AI, the ability to learn from experience is the fundamental engine of progress.
🤔 Over to You
Q1: What could agents that learn change about your own day-to-day work?
Share your thoughts in the comments!
📢 Action Items
- Try it today: add a simple learning mechanism to one of your projects
- Follow along: the next post will walk through more hands-on AI case studies
- Share this post: help more people see how AI agents evolve
Chapter 9: Learning and Adaptation
Learning and adaptation are pivotal for enhancing the capabilities of artificial intelligence agents. These processes enable agents to evolve beyond predefined parameters, allowing them to improve autonomously through experience and environmental interaction. By learning and adapting, agents can effectively manage novel situations and optimize their performance without constant manual intervention. This chapter explores the principles and mechanisms underpinning agent learning and adaptation in detail.
The big picture
Agents learn and adapt by changing their thinking, actions, or knowledge based on new experiences and data. This allows agents to evolve from simply following instructions to becoming smarter over time.
- Reinforcement Learning: Agents try actions and receive rewards for positive outcomes and penalties for negative ones, learning optimal behaviors in changing situations. Useful for agents controlling robots or playing games (see the sketch after this list).
- Supervised Learning: Agents learn from labeled examples, connecting inputs to desired outputs, enabling tasks like decision-making and pattern recognition. Ideal for agents sorting emails or predicting trends.
- Unsupervised Learning: Agents discover hidden connections and patterns in unlabeled data, aiding in insights, organization, and creating a mental map of their environment. Useful for agents exploring data without specific guidance.
- Few-Shot/Zero-Shot Learning with LLM-Based Agents: Agents leveraging LLMs can quickly adapt to new tasks with minimal examples or clear instructions, enabling rapid responses to new commands or situations.
- Online Learning: Agents continuously update knowledge with new data, essential for real-time reactions and ongoing adaptation in dynamic environments. Critical for agents processing continuous data streams.
- Memory-Based Learning: Agents recall past experiences to adjust current actions in similar situations, enhancing context awareness and decision-making. Effective for agents with memory recall capabilities.
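To make the reinforcement-learning entry concrete, here is a minimal tabular Q-learning sketch; the environment, the state/action encoding, and the hyperparameter values are assumptions chosen for illustration.

```python
from collections import defaultdict
import random

Q = defaultdict(float)                  # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration rate

def choose_action(state, actions):
    # Explore with probability epsilon, otherwise exploit the best-known action
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    # Move Q(state, action) toward the reward plus the discounted best next value
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```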
Agents adapt by changing strategy, understanding, or goals based on learning. This is vital for agents in unpredictable, changing, or new environments.
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm used to train agents in environments with a continuous range of actions, like controlling a robot's joints or a character in a game. Its main goal is to reliably and stably improve an agent's decision-making strategy, known as its policy.
The core idea behind PPO is to make small, careful updates to the agent's policy. It avoids drastic changes that could cause performance to collapse. Here's how it works:
- Collect Data: The agent interacts with its environment (e.g., plays a game) using its current policy and collects a batch of experiences (state, action, reward).
- Evaluate a "Surrogate" Goal: PPO calculates how a potential policy update would change the expected reward. However, instead of just maximizing this reward, it uses a special "clipped" objective function.
- The "Clipping" Mechanism: This is the key to PPO's stability. It creates a "trust region" or a safe zone around the current policy. The algorithm is prevented from making an update that is too different from the current strategy. This clipping acts like a safety brake, ensuring the agent doesn't take a huge, risky step that undoes its learning.
In short, PPO balances improving performance with staying close to a known, working strategy, which prevents catastrophic failures during training and leads to more stable learning.
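To make the clipping idea concrete, here is a minimal sketch of PPO's clipped surrogate loss in PyTorch. The tensor names (new_log_probs, old_log_probs, advantages) stand for quantities gathered during rollouts; this is a sketch of the objective only, not a full training loop.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that collected the data
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # The "safety brake": the ratio is not allowed to leave [1 - eps, 1 + eps]
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the pessimistic (minimum) objective; return its negative as a loss
    return -torch.min(unclipped, clipped).mean()
```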
Direct Preference Optimization (DPO) is a more recent method designed specifically for aligning Large Language Models (LLMs) with human preferences. It offers a simpler, more direct alternative to using PPO for this task.
To understand DPO, it helps to first understand the traditional PPO-based alignment method:
- The PPO Approach (Two-Step Process):
  - Train a Reward Model: First, you collect human feedback data where people rate or compare different LLM responses (e.g., "Response A is better than Response B"). This data is used to train a separate AI model, called a reward model, whose job is to predict what score a human would give to any new response.
  - Fine-Tune with PPO: Next, the LLM is fine-tuned using PPO. The LLM's goal is to generate responses that get the highest possible score from the reward model. The reward model acts as the "judge" in the training game.

This two-step process can be complex and unstable. For instance, the LLM might find a loophole and learn to "hack" the reward model to get high scores for bad responses.
- The DPO Approach (Direct Process): DPO skips the reward model entirely. Instead of translating human preferences into a reward score and then optimizing for that score, DPO uses the preference data directly to update the LLM's policy. It works by using a mathematical relationship that directly links preference data to the optimal policy. It essentially teaches the model: "Increase the probability of generating responses like the preferred one and decrease the probability of generating ones like the disfavored one."

In essence, DPO simplifies alignment by directly optimizing the language model on human preference data. This avoids the complexity and potential instability of training and using a separate reward model, making the alignment process more efficient and robust.
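The "direct" update can be written as a single loss over preference pairs. Below is a minimal sketch, assuming you already have summed log-probabilities of the preferred and dispreferred responses under the policy being tuned and under a frozen reference model; beta is a temperature-like hyperparameter and the argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards": how much more likely each response became vs. the reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred response's implicit reward above the dispreferred one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```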
Practical Applications & Use Cases
Adaptive agents exhibit enhanced performance in variable environments through iterative updates driven by experiential data.
- Personalized assistant agents refine interaction protocols through longitudinal analysis of individual user behaviors, ensuring highly optimized response generation.
- Trading bot agents optimize decision-making algorithms by dynamically adjusting model parameters based on high-resolution, real-time market data, thereby maximizing financial returns and mitigating risk factors.
- Application agents optimize user interface and functionality through dynamic modification based on observed user behavior, resulting in increased user engagement and system intuitiveness.
- Robotic and autonomous vehicle agents enhance navigation and response capabilities by integrating sensor data and historical action analysis, enabling safe and efficient operation across diverse environmental conditions.
- Fraud detection agents improve anomaly detection by refining predictive models with newly identified fraudulent patterns, enhancing system security and minimizing financial losses.
- Recommendation agents improve content selection precision by employing user preference learning algorithms, providing highly individualized and contextually relevant recommendations.
- Game AI agents enhance player engagement by dynamically adapting strategic algorithms, thereby increasing game complexity and challenge.
- Knowledge Base Learning Agents: Agents can leverage Retrieval Augmented Generation (RAG) to maintain a dynamic knowledge base of problem descriptions and proven solutions (see Chapter 14). By storing successful strategies and challenges encountered, the agent can reference this data during decision-making, enabling it to adapt to new situations more effectively by applying previously successful patterns or avoiding known pitfalls. A minimal sketch of such a store follows this list.
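A hypothetical sketch of such a knowledge-base store follows. The embedding function is assumed to be any text-embedding callable; the class and method names are illustrative and do not refer to a specific library.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SolutionMemory:
    def __init__(self, embed):
        self.embed = embed            # any text-embedding callable
        self.entries = []             # (embedding, problem, solution) triples

    def store(self, problem, solution):
        self.entries.append((self.embed(problem), problem, solution))

    def recall(self, problem, k=3):
        query = self.embed(problem)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        # Return the k most similar past problems with their proven solutions
        return [(p, s) for _, p, s in ranked[:k]]
```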
Case Study: The Self-Improving Coding Agent (SICA)
The Self-Improving Coding Agent (SICA), developed by Maxime Robeyns, Laurence Aitchison, and Martin Szummer, represents an advancement in agent-based learning, demonstrating the capacity for an agent to modify its own source code. This contrasts with traditional approaches where one agent might train another; SICA acts as both the modifier and the modified entity, iteratively refining its code base to improve performance across various coding challenges.
SICA's self-improvement operates through an iterative cycle (see Fig.1). Initially, SICA reviews an archive of its past versions and their performance on benchmark tests. It selects the version with the highest performance score, calculated based on a weighted formula considering success, time, and computational cost. This selected version then undertakes the next round of self-modification. It analyzes the archive to identify potential improvements and then directly alters its codebase. The modified agent is subsequently tested against benchmarks, with the results recorded in the archive. This process repeats, facilitating learning directly from past performance. This self-improvement mechanism allows SICA to evolve its capabilities without requiring traditional training paradigms.
Fig.1: SICA's self-improvement, learning and adapting based on its past versions
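The cycle can be summarized in pseudocode. The sketch below is a hypothetical reading of the description above, not the actual SICA implementation; the utility weights and the self_modify and run_benchmarks callables are placeholder names.

```python
def utility(result, w_success=1.0, w_time=0.1, w_cost=0.1):
    # Weighted score favoring benchmark success and penalizing time and compute cost
    return (w_success * result["success_rate"]
            - w_time * result["wall_time"]
            - w_cost * result["llm_cost"])

def meta_improvement_loop(archive, run_benchmarks, iterations=10):
    for _ in range(iterations):
        # 1. Select the best-performing past version from the archive
        best = max(archive, key=lambda v: utility(v["result"]))
        # 2. That version inspects the archive and edits its own codebase
        candidate = best["agent"].self_modify(archive)
        # 3. Benchmark the modified agent and record the outcome for the next round
        result = run_benchmarks(candidate)
        archive.append({"agent": candidate, "result": result})
    return max(archive, key=lambda v: utility(v["result"]))
```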
SICA underwent significant self-improvement, leading to advancements in code editing and navigation. Initially, SICA utilized a basic file-overwriting approach for code changes. It subsequently developed a "Smart Editor" capable of more intelligent and contextual edits. This evolved into a "Diff-Enhanced Smart Editor," incorporating diffs for targeted modifications and pattern-based editing, and a "Quick Overwrite Tool" to reduce processing demands.
SICA further implemented "Minimal Diff Output Optimization" and "Context-Sensitive Diff Minimization," using Abstract Syntax Tree (AST) parsing for efficiency. Additionally, a "SmartEditor Input Normalizer" was added. In terms of navigation, SICA independently created an "AST Symbol Locator," using the code's structural map (AST) to identify definitions within the codebase. Later, a "Hybrid Symbol Locator" was developed, combining a quick search with AST checking. This was further optimized via "Optimized AST Parsing in Hybrid Symbol Locator" to focus on relevant code sections, improving search speed (see Fig. 2).
Fig.2: Performance across iterations. Key improvements are annotated with their corresponding tool or agent modifications. (courtesy of Maxime Robeyns, Martin Szummer, Laurence Aitchison)
SICA's architecture comprises a foundational toolkit for basic file operations, command execution, and arithmetic calculations. It includes mechanisms for result submission and the invocation of specialized sub-agents (coding, problem-solving, and reasoning). These sub-agents decompose complex tasks and manage the LLM's context length, especially during extended improvement cycles.
An asynchronous overseer, another LLM, monitors SICA's behavior, identifying potential issues such as loops or stagnation. It communicates with SICA and can intervene to halt execution if necessary. The overseer receives a detailed report of SICA's actions, including a callgraph and a log of messages and tool actions, to identify patterns and inefficiencies.
SICA's LLM organizes information within its context window, its short-term memory, in a structured manner crucial to its operation. This structure includes a System Prompt defining agent goals, tool and sub-agent documentation, and system instructions. A Core Prompt contains the problem statement or instruction, content of open files, and a directory map. Assistant Messages record the agent's step-by-step reasoning, tool and sub-agent call records and results, and overseer communications. This organization facilitates efficient information flow, enhancing LLM operation and reducing processing time and costs. Initially, file changes were recorded as diffs, showing only modifications and periodically consolidated.
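In chat-completion terms, that layout might be assembled roughly like this; the function and field names are placeholders for illustration, not SICA's actual code.

```python
def build_context(system_prompt, tool_docs, instructions,
                  problem, open_files, directory_map, history):
    messages = [
        # System Prompt: agent goals, tool/sub-agent documentation, system instructions
        {"role": "system",
         "content": "\n\n".join([system_prompt, tool_docs, instructions])},
        # Core Prompt: the problem statement, open file contents, and a directory map
        {"role": "user",
         "content": f"{problem}\n\nOpen files:\n{open_files}\n\nDirectory map:\n{directory_map}"},
    ]
    # Assistant Messages: step-by-step reasoning, tool/sub-agent calls and results,
    # and overseer communications accumulated so far
    messages.extend(history)
    return messages
```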
SICA: A Look at the Code: Delving deeper into SICA's implementation reveals several key design choices that underpin its capabilities. As discussed, the system is built with a modular architecture, incorporating several sub-agents, such as a coding agent, a problem-solver agent, and a reasoning agent. These sub-agents are invoked by the main agent, much like tool calls, serving to decompose complex tasks and efficiently manage context length, especially during those extended meta-improvement iterations.
The project is actively developed and aims to provide a robust framework for those interested in post-training LLMs on tool use and other agentic tasks, with the full code available for further exploration and contribution at the github.com/MaximeRobey… GitHub repository.
For security, the project strongly emphasizes Docker containerization, meaning the agent runs within a dedicated Docker container. This is a crucial measure, as it provides isolation from the host machine, mitigating risks like inadvertent file system manipulation given the agent's ability to execute shell commands.
To ensure transparency and control, the system features robust observability through an interactive webpage that visualizes events on the event bus and the agent's callgraph. This offers comprehensive insights into the agent's actions, allowing users to inspect individual events, read overseer messages, and collapse sub-agent traces for clearer understanding.
In terms of its core intelligence, the agent framework supports LLM integration from various providers, enabling experimentation with different models to find the best fit for specific tasks. Finally, a critical component is the asynchronous overseer, an LLM that runs concurrently with the main agent. This overseer periodically assesses the agent's behavior for pathological deviations or stagnation and can intervene by sending notifications or even cancelling the agent's execution if necessary. It receives a detailed textual representation of the system's state, including a callgraph and an event stream of LLM messages, tool calls, and responses, which allows it to detect inefficient patterns or repeated work.
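A highly simplified sketch of such an asynchronous overseer is shown below; the agent and judge_llm interfaces are invented for illustration and do not correspond to SICA's API.

```python
import asyncio

async def overseer(agent, judge_llm, interval_s=30):
    # Periodically review a textual report of the agent's state and intervene if needed
    while agent.is_running():
        report = agent.render_state()     # e.g., callgraph plus recent event stream
        verdict = await judge_llm.ask(
            "Is this agent looping, stagnating, or violating its constraints? "
            "Answer INTERVENE or CONTINUE.\n\n" + report
        )
        if "INTERVENE" in verdict:
            agent.cancel("Overseer detected a pathological pattern.")
            break
        await asyncio.sleep(interval_s)
```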
A notable challenge in the initial SICA implementation was prompting the LLM-based agent to independently propose novel, innovative, feasible, and engaging modifications during each meta-improvement iteration. This limitation, particularly in fostering open-ended learning and authentic creativity in LLM agents, remains a key area of investigation in current research.
AlphaEvolve and OpenEvolve
AlphaEvolve is an AI agent developed by Google designed to discover and optimize algorithms. It utilizes a combination of LLMs, specifically Gemini models (Flash and Pro), automated evaluation systems, and an evolutionary algorithm framework. This system aims to advance both theoretical mathematics and practical computing applications.
AlphaEvolve employs an ensemble of Gemini models. Flash is used for generating a wide range of initial algorithm proposals, while Pro provides more in-depth analysis and refinement. Proposed algorithms are then automatically evaluated and scored based on predefined criteria. This evaluation provides feedback that is used to iteratively improve the solutions, leading to optimized and novel algorithms.
In practical computing, AlphaEvolve has been deployed within Google's infrastructure. It has demonstrated improvements in data center scheduling, resulting in a 0.7% reduction in global compute resource usage. It has also contributed to hardware design by suggesting optimizations for Verilog code in upcoming Tensor Processing Units (TPUs). Furthermore, AlphaEvolve has accelerated AI performance, including a 23% speed improvement in a core kernel of the Gemini architecture and up to 32.5% optimization of low-level GPU instructions for FlashAttention.
In the realm of fundamental research, AlphaEvolve has contributed to the discovery of new algorithms for matrix multiplication, including a method for 4x4 complex-valued matrices that uses 48 scalar multiplications, surpassing previously known solutions. In broader mathematical research, it has rediscovered existing state-of-the-art solutions to over 50 open problems in 75% of cases and improved upon existing solutions in 20% of cases, with examples including advancements in the kissing number problem.
OpenEvolve is an evolutionary coding agent that leverages LLMs (see Fig.3) to iteratively optimize code. It orchestrates a pipeline of LLM-driven code generation, evaluation, and selection to continuously enhance programs for a wide range of tasks. A key aspect of OpenEvolve is its capability to evolve entire code files, rather than being limited to single functions. The agent is designed for versatility, offering support for multiple programming languages and compatibility with OpenAI-compatible APIs for any LLM. Furthermore, it incorporates multi-objective optimization, allows for flexible prompt engineering, and is capable of distributed evaluation to efficiently handle complex coding challenges.
Fig. 3: The OpenEvolve internal architecture is managed by a controller. This controller orchestrates several key components: the program sampler, Program Database, Evaluator Pool, and LLM Ensembles. Its primary function is to facilitate their learning and adaptation processes to enhance code quality.
This code snippet uses the OpenEvolve library to perform evolutionary optimization on a program. It initializes the OpenEvolve system with paths to an initial program, an evaluation file, and a configuration file. The evolve.run(iterations=1000) line starts the evolutionary process, running for 1000 iterations to find an improved version of the program. Finally, it prints the metrics of the best program found during the evolution, formatted to four decimal places.
import asyncio
from openevolve import OpenEvolve

async def main():
    # Initialize the system with the seed program, evaluator, and configuration
    evolve = OpenEvolve(
        initial_program_path="path/to/initial_program.py",
        evaluation_file="path/to/evaluator.py",
        config_path="path/to/config.yaml"
    )
    # Run the evolution for 1000 iterations
    best_program = await evolve.run(iterations=1000)
    # Report the metrics of the best program found during the evolution
    print("Best program metrics:")
    for name, value in best_program.metrics.items():
        print(f"  {name}: {value:.4f}")

asyncio.run(main())
At a Glance
What: AI agents often operate in dynamic and unpredictable environments where pre-programmed logic is insufficient. Their performance can degrade when faced with novel situations not anticipated during their initial design. Without the ability to learn from experience, agents cannot optimize their strategies or personalize their interactions over time. This rigidity limits their effectiveness and prevents them from achieving true autonomy in complex, real-world scenarios.
Why: The standardized solution is to integrate learning and adaptation mechanisms, transforming static agents into dynamic, evolving systems. This allows an agent to autonomously refine its knowledge and behaviors based on new data and interactions. Agentic systems can use various methods, from reinforcement learning to more advanced techniques like self-modification, as seen in the Self-Improving Coding Agent (SICA). Advanced systems like Google's AlphaEvolve leverage LLMs and evolutionary algorithms to discover entirely new and more efficient solutions to complex problems. By continuously learning, agents can master new tasks, enhance their performance, and adapt to changing conditions without requiring constant manual reprogramming.
Rule of thumb: Use this pattern when building agents that must operate in dynamic, uncertain, or evolving environments. It is essential for applications requiring personalization, continuous performance improvement, and the ability to handle novel situations autonomously.
Visual summary
Fig.4: Learning and adapting pattern
Key Takeaways
- Learning and Adaptation are about agents getting better at what they do and handling new situations by using their experiences.
- "Adaptation" is the visible change in an agent's behavior or knowledge that comes from learning.
- SICA, the Self-Improving Coding Agent, self-improves by modifying its code based on past performance. This led to tools like the Smart Editor and AST Symbol Locator.
- Having specialized "sub-agents" and an "overseer" helps these self-improving systems manage big tasks and stay on track.
- The way an LLM's "context window" is set up (with system prompts, core prompts, and assistant messages) is super important for how efficiently agents work.
- This pattern is vital for agents that need to operate in environments that are always changing, uncertain, or require a personal touch.
- Building agents that learn often means dealing with challenges in fostering creativity and open-ended learning in LLM-based systems.
- Building agents that learn often means hooking them up with machine learning tools and managing how data flows.
- An agent system, equipped with basic coding tools, can autonomously edit itself, and thereby improve its performance on benchmark tasks.
- AlphaEvolve is Google's AI agent that leverages LLMs and an evolutionary framework to autonomously discover and optimize algorithms, significantly enhancing both fundamental research and practical computing applications.
Conclusion
This chapter examines the crucial roles of learning and adaptation in Artificial Intelligence. AI agents enhance their performance through continuous data acquisition and experience. The Self-Improving Coding Agent (SICA) exemplifies this by autonomously improving its capabilities through code modifications.
We have reviewed the fundamental components of agentic AI, including architecture, applications, planning, multi-agent collaboration, memory management, and learning and adaptation. Learning principles are particularly vital for coordinated improvement in multi-agent systems. To achieve this, tuning data must accurately reflect the complete interaction trajectory, capturing the individual inputs and outputs of each participating agent.
These elements contribute to significant advancements, such as Google's AlphaEvolve. This AI system independently discovers and refines algorithms using LLMs, automated assessment, and an evolutionary approach, driving progress in scientific research and computational techniques. Such patterns can be combined to construct sophisticated AI systems. Developments like AlphaEvolve demonstrate that autonomous algorithmic discovery and optimization by AI agents are attainable.
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arxiv.org/abs/1707.06…
- Robeyns, M., Aitchison, L., & Szummer, M. (2025). A Self-Improving Coding Agent. arXiv:2504.15228v2. arxiv.org/pdf/2504.15… github.com/MaximeRobey…
- AlphaEvolve blog, deepmind.google/discover/bl…
- OpenEvolve, github.com/codelion/op…