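From 3 Billion Tokens to Fully Autonomous AGI: I Spent Two Days Building an AI Testing System (Architecture Design Included)

Because of CSDN's restrictions on regular users, I am moving everything to Juejin as of today!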

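After burning through 3 billion+ tokens and more than 100,000 API calls "raising lobsters" (养龙虾), I have switched completely to a fully autonomous AGI architecture built on harness engineering, with Hermes as the executor.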

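Many people get good results refactoring code with AI, but they overlook a key question: who guarantees quality after the refactor?

My experience: I built a Refactor Agent and, on the first attempt, had it handle testing as well. By hour 5 it was missing boundary conditions, by hour 8 it was skipping performance regression checks, and by hour 10 its reasoning had become incoherent. The code it finally delivered carried 12 undetected regression defects.

Core conclusion: when one agent handles both refactoring and testing, it ends up overloaded with responsibilities.

This post is a hands-on technical record. It walks through the whole process, pitfalls included, from discovering the problem to designing the architecture to populating the knowledge base. All verification scripts are reproducible.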

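I. Why a Refactor Agent Shouldn't Double as the Test Agent

1.1 Cognitive Overload: Context Pollution

The Refactor Agent's head is full of "how do I make this more elegant". When it switches into testing mode, the refactoring mindset contaminates the test design: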

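# Refactoring mindset
refactor_thinking = {
    "priority": "code readability and design patterns",
    "self_assessment": "This extraction is elegant, so it must be fine",
    "blind_spot": "ignores boundary conditions and exception paths"
}

# Testing mindset
test_thinking = {
    "priority": "boundary conditions and failure scenarios",
    "objective": "Will this function crash on empty input?",
    "verification": "Is a 15% performance drop within the acceptable range?"
}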

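Measured result: the two mindsets interfered with each other, and test coverage fell from the expected 85% to an actual 62%.

1.2 Self-Confirmation Bias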

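Scenario	Independent Test Agent	Refactorer testing its own work
Missing boundary condition found	Flagged as a defect immediately	"This scenario is unlikely to occur"
Performance drops 15%	Regression alert triggered	"Acceptable for the sake of readability"
Security vulnerability	High-priority ticket created	"The outer layer already validates this"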
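
An independent checker can apply rules like these mechanically, which leaves no room for "it's probably fine". A minimal sketch of such a gate follows. The 10% tolerance, the millisecond unit, and the function name `is_perf_regression` are all illustrative assumptions, not part of the actual Test Agent:

```python
def is_perf_regression(before_ms: float, after_ms: float, tolerance: float = 0.10) -> bool:
    """Return True when the slowdown exceeds the tolerance (0.10 means 10%)."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    slowdown = (after_ms - before_ms) / before_ms
    return slowdown > tolerance


if __name__ == "__main__":
    # A 15% slowdown trips the gate no matter how readable the new code is.
    print(is_perf_regression(100.0, 115.0))
    # A 5% slowdown stays under the assumed tolerance.
    print(is_perf_regression(100.0, 105.0))
```

The point of the design is that the threshold is fixed before the refactor happens, so the party that wrote the code cannot renegotiate it afterwards.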

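1.3 Resource Contention

A single agent has a limited number of execution turns (max_turns=90). If 60% of that budget goes to refactoring:

Unit tests: ✅ completed
Integration tests: ⚠️ run in simplified form
Performance regression: ❌ skipped
Security scanning: ❌ skipped
Chaos engineering: ❌ skipped entirely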
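
The squeeze is easy to see with some arithmetic. The sketch below takes the 90-turn limit and the 60% refactoring share from the text and assumes, purely for illustration, that the leftover turns are split evenly across the five test stages. The stage names and the even split are assumptions:

```python
MAX_TURNS = 90          # per-agent turn limit quoted above
REFACTOR_SHARE = 0.60   # share of the budget spent on refactoring, as in the text

TEST_STAGES = ["unit", "integration", "performance", "security", "chaos"]


def remaining_test_budget(max_turns: int, refactor_share: float) -> int:
    """Turns left for testing once refactoring has taken its share."""
    return max_turns - round(max_turns * refactor_share)


def turns_per_stage(max_turns: int, refactor_share: float, stages: list) -> dict:
    """Split the leftover turns evenly; any remainder goes to the earliest stages."""
    budget = remaining_test_budget(max_turns, refactor_share)
    base, extra = divmod(budget, len(stages))
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(stages)}


if __name__ == "__main__":
    print(remaining_test_budget(MAX_TURNS, REFACTOR_SHARE))
    print(turns_per_stage(MAX_TURNS, REFACTOR_SHARE, TEST_STAGES))
```

With 36 turns left, each stage gets seven or eight turns, which helps explain why the later stages end up simplified or skipped.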

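II. Test Agent Architecture Design

2.1 Overall Architecture

Test Agent Architecture
├── Skills (6 core skills)
│   ├── behavior-verification-skill    # behavior baseline generation and comparison
│   ├── test-generation-skill          # automatic test generation from code analysis
│   ├── performance-testing-skill      # load / stress / benchmark testing
│   ├── security-testing-skill         # SAST / DAST / dependency scanning
│   ├── chaos-engineering-skill        # fault injection and resilience verification
│   └── e2e-testing-skill              # end-to-end automated testing
├── Knowledge Base (47 modules, 158 files)
│   ├── Testing fundamentals (5): test pyramid, test generation, test data, orchestration, reporting
│   ├── Web/API (3), Client (4), Backend (6)
│   ├── Game/Graphics (4), Security (6), Performance/Reliability (5)
│   ├── Data/Compliance (3), Emerging tech (4), Compatibility/Engineering (7)
│   └── Total lines: 45,163
└── Collaboration with Hermes: verification is triggered automatically through a dedicated route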

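2.2 Collaboration Flow with the Refactor Agent

# Collaboration flow (pseudocode)
def refactor_and_test_flow(module_name):
    # 1. The Refactor Agent performs the refactor
    refactor_result = refactor_agent.refactor(module_name)

    # 2. Hermes notifies the Test Agent that verification is needed
    hermes.notify(test_agent, "Refactor complete, please verify")

    # 3. The Test Agent verifies independently
    #    (the baseline describes the module's pre-refactor behavior)
    baseline = test_agent.generate_baseline(module_name)
    behavior_diff = test_agent.compare_behavior(baseline, refactor_result)

    # 4. Generate the verification report
    report = test_agent.generate_report(behavior_diff)

    return report

Key design point: the two agents never talk to each other directly; Hermes coordinates between them. The Test Agent cannot see the Refactor Agent's Repo Map, which keeps the verification objective.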
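
One way to picture this isolation is a coordinator that forwards only the public facts of a refactor. The sketch below is a hypothetical stand-in, not Hermes's actual API: `RefactorResult`, `VerificationRequest`, `Coordinator`, and `on_refactor_done` are names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RefactorResult:
    """Everything the refactor agent knows, including its private state."""
    module: str
    changed_files: list
    repo_map: dict = field(default_factory=dict)  # internal to the refactor agent
    notes: str = ""                               # the refactor agent's own reasoning


@dataclass
class VerificationRequest:
    """The only fields the test agent is allowed to see."""
    module: str
    changed_files: list


class Coordinator:
    """Hypothetical stand-in for Hermes: the only component that talks to both agents."""

    def __init__(self, verify):
        self._verify = verify  # callable supplied by the test agent

    def on_refactor_done(self, result: RefactorResult):
        # Forward public facts only; repo_map and notes never leave this method.
        request = VerificationRequest(result.module, list(result.changed_files))
        return self._verify(request)


if __name__ == "__main__":
    def fake_verify(request):
        return {"module": request.module, "saw_repo_map": hasattr(request, "repo_map")}

    result = RefactorResult("billing", ["billing/core.py"], repo_map={"x": 1}, notes="looks elegant")
    print(Coordinator(fake_verify).on_refactor_done(result))
```

Because the request type has no field for the Repo Map, leaking it would require changing the type, which is harder to do by accident than passing one extra dictionary key.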

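III. Populating the Knowledge Base: A Full Record of the Pitfalls

This was the most time-consuming part of the whole project. With 47 modules, 158 documents, and 45,000+ lines of content, the work had to be filled in by multiple agents in parallel. The multi-agent setup, however, brought entirely new challenges.

3.1 Pitfall 1: Concurrency Limit

Expected: launch 7-10 agents to populate in parallel.
Actual:

[error] delegate 5 parallel tasks 0.0s [error]
✗ [2/2] Create 6 backend testing modules (30 files total): Timeout

Root cause: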

import os

# Internal limit in Hermes's delegate_task
def _get_max_concurrent_children(self):
    # By default, at most 3-5 child agents
    return min(5, os.cpu_count() // 2)

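Solution: run in batches

from concurrent.futures import ThreadPoolExecutor

# Wrong: launch 10 agents at once
tasks = [fill_module(i) for i in range(10)]  # fails with timeouts

# Right: batches of at most 3
# (fill_module and validate are defined elsewhere)
def fill_modules_batch(module_indices, batch_size=3):
    results = []
    for i in range(0, len(module_indices), batch_size):
        batch = module_indices[i:i+batch_size]
        # Run the batch in parallel; a plain list comprehension would run sequentially
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            batch_results = list(pool.map(fill_module, batch))
        # Validate before starting the next batch
        validate(batch_results)
        results.extend(batch_results)
    return results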

3.2 Pitfall 2: Model Output Truncation

Expected: each agent generates complete content and writes it to a file.
Actual:

[subagent-1] ⚠️ Response truncated (finish_reason='length')
[subagent-1] ⚠️ Truncated tool call detected — refusing to execute incomplete tool arguments.

Root cause:

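# The context window is 204.8K, but max output tokens is a separate limit
config = {
    "context_length": 204800,  # context length
    "max_output_tokens": 8192  # output cap; anything longer is truncated
}

# When a response contains many code examples, the output is cut off,
# the WriteFile arguments arrive incomplete, and the write fails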

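Solution: reduce task complexity

# Wrong: one large task that generates 5 modules
task = "Populate 5 modules, 500+ lines per file"  # truncated, fails

# Right: small tasks, one module and one file at a time
task = "Populate 1 module, 200-300 lines per file"  # succeeds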
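
The same idea can be turned into a planning step that runs before any agent starts. The sketch below is an estimate, not a measurement: the 15 tokens per line and the 20% headroom are assumptions you should calibrate on your own content, and the file names in the demo are hypothetical.

```python
MAX_OUTPUT_TOKENS = 8192   # output cap from the config above
TOKENS_PER_LINE = 15       # assumption: rough average for code-heavy markdown
HEADROOM = 0.8             # assumption: keep 20% spare for tool-call overhead


def fits_in_one_response(lines: int) -> bool:
    """Estimate whether a file of `lines` lines fits under the output cap."""
    return lines * TOKENS_PER_LINE <= MAX_OUTPUT_TOKENS * HEADROOM


def split_into_tasks(modules: dict, target_lines: int) -> list:
    """Emit one task per file so that a truncation can cost at most one file."""
    if not fits_in_one_response(target_lines):
        raise ValueError(f"{target_lines} lines per file is likely to be truncated")
    return [
        {"module": module, "file": name, "target_lines": target_lines}
        for module, files in modules.items()
        for name in files
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    plan = split_into_tasks({"test-orchestration": ["overview.md", "checklist.md"]}, 300)
    for task in plan:
        print(task)
```

Under these assumed numbers a 300-line file fits comfortably while a 500-line file does not, which is consistent with what I saw in practice.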

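3.3 Pitfall 3: Wrong File Write Path

Expected: files are written to /knowledge_base/{module}/
Actual: the agent wrote to /{module}/ (missing the knowledge_base/ level)

Impact: 43 modules were reported as "completed" but were actually empty

Verification script: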

#!/bin/bash
# Verify where the knowledge base files actually landed

AGENT_ROOT="/Volumes/A/Agents/test-agent"
KB_ROOT="$AGENT_ROOT/knowledge_base"

echo "=== Verifying knowledge base file locations ==="

# Count the files in the expected location
expected_count=$(find "$KB_ROOT" -name "*.md" 2>/dev/null | wc -l)
echo "md files under the knowledge base directory: $expected_count"

# Check the likely wrong location
wrong_location="$AGENT_ROOT"
wrong_count=$(find "$wrong_location" -maxdepth 1 -name "*.md" 2>/dev/null | wc -l)
echo "md files in the agent root (should be 0): $wrong_count"

if [ "$wrong_count" -gt 0 ]; then
    echo "⚠️ Misplaced files found!"
    find "$wrong_location" -maxdepth 1 -name "*.md" -exec ls -lh {} \;
fi

# Check each module directory (glob inside KB_ROOT, not the current directory)
for module_dir in "$KB_ROOT"/*/; do
    if [ -d "$module_dir" ]; then
        file_count=$(find "$module_dir" -name "*.md" 2>/dev/null | wc -l)
        echo "$(basename "$module_dir"): $file_count files"
    fi
done

echo "=== Verification complete ==="

Fix script:

#!/usr/bin/env python3
"""Fix misplaced knowledge base files."""

import shutil
from pathlib import Path

AGENT_ROOT = Path("/Volumes/A/Agents/test-agent")
KB_ROOT = AGENT_ROOT / "knowledge_base"

def find_misplaced_files():
    """Find md files that landed outside knowledge_base/."""
    misplaced = []

    # md files directly in the agent root
    for item in AGENT_ROOT.iterdir():
        if item.is_file() and item.suffix == '.md':
            misplaced.append(item)

    # Module directories created at the wrong level.
    # Caution: every non-hidden directory other than knowledge_base/ is treated
    # as a misplaced module, so exclude any legitimate sibling directories first.
    for item in AGENT_ROOT.iterdir():
        if item.is_dir() and not item.name.startswith('.'):
            if item.name != KB_ROOT.name:
                for md_file in item.rglob("*.md"):
                    misplaced.append(md_file)

    return misplaced

def fix_misplaced_files():
    """Copy misplaced files to the correct location (originals are left in place)."""
    misplaced = find_misplaced_files()

    for src in misplaced:
        # Try to infer the correct location
        # e.g. /test-agent/test-orchestration/xxx.md
        # -> /test-agent/knowledge_base/test-orchestration/xxx.md
        module_name = src.parent.name
        if src.parent == AGENT_ROOT:  # md files in the agent root
            continue  # these are probably a README or similar

        dst_dir = KB_ROOT / module_name
        dst = dst_dir / src.name

        if dst.exists():
            print(f"⚠️ Target already exists, skipping: {dst}")
            continue

        dst_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        print(f"✅ Copied: {src} -> {dst}")

if __name__ == "__main__":
    fix_misplaced_files()
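
Repairing files after the fact works, but the same mistake can also be caught at write time. The sketch below shows one possible guard, assuming the harness lets you intercept the path before the write happens. `checked_write_path` is a hypothetical helper, not part of Hermes.

```python
from pathlib import Path


def checked_write_path(kb_root: Path, requested: str) -> Path:
    """Resolve a requested path and reject anything outside kb_root."""
    root = kb_root.resolve()
    target = Path(requested)
    if not target.is_absolute():
        target = root / target           # relative paths are anchored at kb_root
    target = target.resolve()
    if not target.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"refusing to write outside {root}: {target}")
    return target


if __name__ == "__main__":
    import tempfile

    agent_root = Path(tempfile.mkdtemp())
    kb_root = agent_root / "knowledge_base"

    # A relative path is placed under knowledge_base/ automatically.
    print(checked_write_path(kb_root, "test-orchestration/intro.md"))

    # An absolute path that skips the knowledge_base/ level is rejected.
    try:
        checked_write_path(kb_root, str(agent_root / "test-orchestration" / "intro.md"))
    except ValueError as exc:
        print("rejected:", exc)
```

Failing loudly here turns a silent "completed but empty" module into an immediate error that the agent, or the person watching it, has to deal with.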

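3.4 Pitfall 4: The Agent Claims Completion but Hasn't Finished

Symptom: the log shows completion, but the files are empty

✓ [1/2] Create the test-orchestration/ module (4 files)
✓ [2/2] Create the test-reporting/ module (4 files)

Root cause: the agent generated the content but the write failed (truncation); the "completion confirmation" logic is decoupled from the "actual write"

Solution: don't rely on the agent's self-report; enforce verification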

#!/usr/bin/env python3
"""Knowledge base integrity check."""

from pathlib import Path

KB_ROOT = Path("/Volumes/A/Agents/test-agent/knowledge_base")

# Expected module list (47 in total)
EXPECTED_MODULES = [
    # Testing fundamentals (5)
    "test-pyramid", "test-generation", "test-data", "test-orchestration", "test-reporting",
    # Web/API (3)
    "web-testing", "e2e-testing", "api-testing",
    # Client (4)
    "mobile-testing", "desktop-testing", "terminal-testing", "cross-platform-testing",
    # Backend (6)
    "server-testing", "database-testing", "message-queue-testing",
    "event-driven-testing", "microservices-testing", "serverless-testing",
    # ... more modules
]

def validate_knowledge_base():
    """Check the knowledge base for missing, empty, or thin modules."""
    results = {
        "total_modules": 0,
        "total_files": 0,
        "total_lines": 0,
        "missing_modules": [],
        "empty_modules": [],
        "low_content_modules": []
    }

    # Check every expected module
    for module_name in EXPECTED_MODULES:
        module_path = KB_ROOT / module_name
        results["total_modules"] += 1

        if not module_path.exists():
            results["missing_modules"].append(module_name)
            continue

        # Count the markdown files
        md_files = list(module_path.rglob("*.md"))
        file_count = len(md_files)
        results["total_files"] += file_count

        if file_count == 0:
            results["empty_modules"].append(module_name)
            continue

        # Count the lines
        total_lines = 0
        for f in md_files:
            with open(f, 'r', encoding='utf-8', errors='ignore') as fp:
                total_lines += sum(1 for _ in fp)

        results["total_lines"] += total_lines

        # Flag modules with too little content (average of at most 50 lines per file)
        if total_lines < file_count * 50:
            results["low_content_modules"].append({
                "module": module_name,
                "files": file_count,
                "lines": total_lines
            })

    return results

if __name__ == "__main__":
    print("=== Knowledge base integrity check ===")
    results = validate_knowledge_base()

    print(f"Modules checked: {results['total_modules']}")
    print(f"Total files: {results['total_files']}")
    print(f"Total lines: {results['total_lines']}")

    if results['missing_modules']:
        print(f"\n⚠️ Missing modules ({len(results['missing_modules'])}):")
        for m in results['missing_modules']:
            print(f"  - {m}")

    if results['empty_modules']:
        print(f"\n⚠️ Empty modules ({len(results['empty_modules'])}):")
        for m in results['empty_modules']:
            print(f"  - {m}")

    if results['low_content_modules']:
        print(f"\n⚠️ Low-content modules ({len(results['low_content_modules'])}):")
        for m in results['low_content_modules']:
            print(f"  - {m['module']}: {m['files']} files, {m['lines']} lines")

    if not any([results['missing_modules'], results['empty_modules'], results['low_content_modules']]):
        print("\n✅ Check passed! All modules are complete")
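
The check becomes more useful when its output feeds straight back into the next round of work. A minimal sketch, assuming the `results` dictionary has the shape produced by `validate_knowledge_base()` above; `modules_to_refill` is a hypothetical helper name and the sample data is made up:

```python
def modules_to_refill(results: dict) -> list:
    """Collect every module that is missing, empty, or too thin, without duplicates."""
    names = list(results["missing_modules"])
    names += results["empty_modules"]
    names += [entry["module"] for entry in results["low_content_modules"]]
    return list(dict.fromkeys(names))  # dict keys keep insertion order


if __name__ == "__main__":
    # Made-up sample in the same shape as the validation results.
    sample = {
        "missing_modules": ["api-testing"],
        "empty_modules": ["web-testing"],
        "low_content_modules": [{"module": "test-data", "files": 4, "lines": 90}],
    }
    print(modules_to_refill(sample))
```

The returned list can then be handed to the batching function from Pitfall 1, so that only the failed modules are regenerated.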

IV. Final Execution Strategy

After several rounds of pitfalls, this turned out to be the most stable execution strategy:

# Final execution strategy
EXECUTION_STRATEGY = {
    "batches": [
        # Batch 1 (3 agents in parallel)
        {
            "agents": 3,
            "modules": ["Testing fundamentals (5)", "Web/API (3)", "Client (4)"],
            "verification": "filesystem_check"
        },
        # Batch 2 (3 agents in parallel)
        {
            "agents": 3,
            "modules": ["Backend (6)", "Game/Graphics (4)", "Security (6)"],
            "verification": "filesystem_check"
        },
        # Batch 3 (3 agents in parallel)
        {
            "agents": 3,
            "modules": ["Performance/Reliability (5)", "Data/Compliance (3)", "Emerging tech (4)"],
            "verification": "filesystem_check"
        },
        # Batch 4 (1 agent)
        {
            "agents": 1,
            "modules": ["Compatibility/Engineering (7)"],
            "verification": "filesystem_check"
        }
    ],
    "total_time": "8 hours",
    "stability": "100% (47/47 modules completed)"
}
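
The batch layout follows directly from the category list and the three-agent cap, so it can be derived rather than written by hand. A minimal sketch, using the category sizes from the architecture overview; `plan_batches` is a hypothetical helper name:

```python
# (category, module count) pairs, taken from the knowledge base overview
CATEGORIES = [
    ("Testing fundamentals", 5), ("Web/API", 3), ("Client", 4),
    ("Backend", 6), ("Game/Graphics", 4), ("Security", 6),
    ("Performance/Reliability", 5), ("Data/Compliance", 3), ("Emerging tech", 4),
    ("Compatibility/Engineering", 7),
]


def plan_batches(categories: list, agents_per_batch: int = 3) -> list:
    """Group categories into consecutive batches, one agent per category."""
    return [
        categories[i:i + agents_per_batch]
        for i in range(0, len(categories), agents_per_batch)
    ]


if __name__ == "__main__":
    batches = plan_batches(CATEGORIES)
    for number, batch in enumerate(batches, start=1):
        names = ", ".join(f"{name} ({count})" for name, count in batch)
        print(f"Batch {number}: {len(batch)} agent(s): {names}")
    print("Total modules:", sum(count for _, count in CATEGORIES))
```

Deriving the plan also gives a cheap sanity check: the category sizes have to add up to the 47 modules the validation script expects.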

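V. Final Results

5.1 Test Agent on Its Own

Dimension	Value
Knowledge base modules	47
Knowledge base files	158
Total lines	45,163
Code example coverage	95%
Checklist coverage	90%
Languages covered	Python, Go, JavaScript, Java, C/C++, C#, Rust
Test types	Unit / integration / E2E / performance / security / chaos / contract / observability

5.2 Refactor Agent + Test Agent Combined

Dimension	Refactor Agent	Test Agent	Combined
Knowledge base modules	15	47	62
Knowledge base files	84	158	242
Total lines	27,427	45,163	72,590
Skills	5	6	11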

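VI. Golden Rules of Agent Architecture

Rule 1: Single Responsibility Principle (SRP)

# Wrong architecture
class CombinedAgent:
    def refactor(self): pass      # conflicting responsibilities
    def test(self): pass          # context pollution
    def verify(self): pass        # self-confirmation bias

# Right architecture
class RefactorAgent:
    def refactor(self): pass      # refactoring only
    def submit(self): pass        # hands off to testing

class TestAgent:
    def verify(self): pass        # verification only
    def report(self): pass        # objective reporting

Rule 2: Independent Context

# Wrong: the Test Agent sees the Refactor Agent's inner thoughts
test_context = {
    "repo_map": agent.repo_map,           # biases its judgment
    "task_decomposition": agent.tasks,    # reveals where corners were cut
    "intent": "Refactor this module"      # creates preconceptions
}

# Right: it sees only inputs and outputs
test_context = {
    "input": "function signatures + docs",
    "output": "run results + behavior",
    "behavior_diff": "differences from the baseline"
}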

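Rule 3: Verifiability

# Example verification command: check that behavior is consistent before and after the refactor
python3 skills/behavior-verification-skill/scripts/compare-behavior.py \
  --before workspace/baseline/before.json \
  --after workspace/baseline/after.json \
  --report workspace/behavior-diff.md

Rule 4: Incremental Stability

❌ Wrong: 10 agents at once → timeouts, truncation, path errors, 0% completed
✅ Right: 4 batches (3+3+3+1) → 8 hours, 100% completed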
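
To make the idea concrete, here is a simplified sketch of what a behavior comparison can look like. It is not the contents of compare-behavior.py. It assumes, hypothetically, that each JSON snapshot maps a test-case id to the output recorded for that case.

```python
import json


def compare_behavior(before: dict, after: dict) -> dict:
    """Diff two snapshots that map a test-case id to its recorded output."""
    shared = set(before) & set(after)
    missing = sorted(set(before) - set(after))   # cases that disappeared
    added = sorted(set(after) - set(before))     # cases with no baseline
    changed = sorted(case for case in shared if before[case] != after[case])
    return {
        "missing": missing,
        "added": added,
        "changed": changed,
        "consistent": not missing and not changed,
    }


if __name__ == "__main__":
    # Inline stand-ins for before.json and after.json (assumed layout).
    before = json.loads('{"empty_input": [], "two_items": [1, 2], "negative": "ValueError"}')
    after = json.loads('{"empty_input": [], "two_items": [2, 1], "unicode": "ok"}')
    print(json.dumps(compare_behavior(before, after), indent=2))
```

New cases are reported but do not count as inconsistencies, since a refactor is allowed to add coverage; a changed or missing case is what should block the hand-off.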

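VII. Summary

From this hands-on work, these are my core lessons for populating a knowledge base with multiple agents:

Run in batches: at most 3 agents per batch, with mandatory verification after each one
Keep tasks small: one task populates one module, which avoids output truncation
Verify first: don't rely on the agent's self-report; check the actual results with scripts
Be explicit about paths: repeat the full path 3 times in the prompt
Stabilize incrementally: don't chase one-shot perfection; small, fast steps are more reliable

The same approach applies to other multi-agent scenarios as well.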

Disclaimer: these are personal technical observations; for actual behavior, defer to the official documentation.