OpenClaw Sub-Agent Lifecycle Management in Depth: How a Subtask Is Registered, Tracked, and Cleaned Up

This article is an in-depth analysis of src/agents/subagent-registry.ts (1,486 lines, the largest single file in the agents directory) and its 8 split-out modules (subagent-registry-state.ts, subagent-registry-queries.ts, subagent-registry-cleanup.ts, subagent-registry-completion.ts, subagent-registry.types.ts, subagent-lifecycle-events.ts, subagent-depth.ts, subagent-announce-queue.ts).

I. Why Sub-Agents Exist

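In OpenClaw's architecture, the main conversation agent (Pi) can create independent sub-agents through a tool call (spawn_subagent) while working on a complex task. Each sub-agent has its own:

Session: its own sessionKey and its own conversation history file
Workspace: its own workspaceDir, so file operations do not interfere with each other
Model configuration: a different model can be specified
Timeout budget: its own runTimeoutSeconds
Attachments directory: attachmentsDir holds files produced by tool calls

Calling spawn_subagent does not block the parent agent's conversation flow. The parent registers a Run Record and keeps responding to the user; the sub-agent's result is delivered asynchronously to the parent's session through the Announce mechanism (as a new message).

This end-to-end chain of "register → wait asynchronously → deliver result → clean up" is what subagent-registry.ts is responsible for.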

II. Core Data Structure: SubagentRunRecord

Every sub-agent's state is represented by a SubagentRunRecord (subagent-registry.types.ts L6-58):

export type SubagentRunRecord = {
  // Identity
  runId: string;                    // Globally unique; the run ID assigned by the gateway
  childSessionKey: string;          // Session key of the sub-agent
  controllerSessionKey?: string;    // Controller session (defaults to requesterSessionKey)
  requesterSessionKey: string;      // Parent session that issued the request
  requesterOrigin?: DeliveryContext; // Origin channel info of the parent session (used to deliver the announce)
  requesterDisplayKey: string;      // Requester identifier for display
  task: string;                     // Task description (used in the announce message body)
  label?: string;                   // Optional label
  model?: string;                   // Model in use
  spawnMode?: SpawnSubagentMode;    // "run" | "session"

  // Timestamps
  createdAt: number;
  startedAt?: number;
  endedAt?: number;

  // Outcome and cleanup
  cleanup: "delete" | "keep";       // Whether to delete the sub-session after completion
  outcome?: SubagentRunOutcome;     // { status: "ok" | "error" | "timeout"; error?: string }
  endedReason?: SubagentLifecycleEndedReason;
  cleanupHandled?: boolean;         // Whether the cleanup flow has been entered (re-entrancy guard)
  cleanupCompletedAt?: number;      // Timestamp at which cleanup finished
  archiveAtMs?: number;             // Absolute timestamp for automatic archiving (deletion)

  // Announce delivery
  expectsCompletionMessage?: boolean; // Whether to wait for a completion message
  announceRetryCount?: number;        // Number of retries so far
  lastAnnounceRetryAt?: number;       // Time of the most recent retry
  suppressAnnounceReason?: "steer-restart" | "killed"; // Why the announce is suppressed

  // Frozen result
  frozenResultText?: string | null;           // Snapshot of the sub-agent's final reply
  frozenResultCapturedAt?: number;
  fallbackFrozenResultText?: string | null;   // Backup snapshot used on steer restart
  fallbackFrozenResultCapturedAt?: number;

  // Hook deduplication
  endedHookEmittedAt?: number;    // Timestamp at which the subagent_ended hook fired

  // Attachments
  attachmentsDir?: string;
  attachmentsRootDir?: string;
  retainAttachmentsOnKeep?: boolean;
  wakeOnDescendantSettle?: boolean;
};

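The semantics of a few key fields:

runId vs childSessionKey: runId identifies this particular execution (it can be replaced, and it changes after a steer restart), while childSessionKey is the sub-agent's persistent session identifier (it never changes)
cleanup: "delete" | "keep": delete mode removes the sub-session's transcript after a successful announce; keep mode retains the session (for "persistent sub-agent" scenarios)
spawnMode: "run" | "session": a session-mode sub-agent has a thread binding, and keepThreadBinding preserves it in every case except killed; run mode does not preserve it
frozenResultText: the final reply captured when the sub-agent completes, and the main source of the announce message content. It is capped at 100KB (FROZEN_RESULT_TEXT_MAX_BYTES); anything beyond that is truncated and a notice is appended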

III. Lifecycle Overview

A sub-agent's lifecycle can be divided into six phases:

Registration (registerSubagentRun)
    │
    ├─ Start waiting (waitForSubagentCompletion via gateway RPC)
    │   └─ In parallel: in-process lifecycle listener (onAgentEvent)
    │
    ▼
Running (running)
    │
    ├─ Normal completion → completeSubagentRun(reason=complete)
    ├─ Error termination → schedulePendingLifecycleError → completeSubagentRun(reason=error)
    ├─ Forced termination → markSubagentRunTerminated(reason=killed)
    └─ Steer restart → markSubagentRunForSteerRestart → replaceSubagentRunAfterSteer
    │
    ▼
Result freeze (freezeRunResultAtCompletion)
    │
    ▼
Announce delivery (startSubagentAnnounceCleanupFlow)
    │
    ├─ Delivered → finalizeSubagentCleanup(didAnnounce=true)
    └─ Not delivered → resolveDeferredCleanupDecision → retry / defer / give-up
    │
    ▼
Cleanup complete (completeCleanupBookkeeping)
    │
    ├─ delete mode → remove run record + notify context engine + trigger sibling announce retries
    └─ keep mode → keep run record, set cleanupCompletedAt
    │
    ▼
Auto-archive sweeper (archiveAtMs reached → sweepSubagentRuns)

IV. Registration: registerSubagentRun

registerSubagentRun (subagent-registry.ts L1159-1219) is the starting point of the whole lifecycle:

// src/agents/subagent-registry.ts L1159-1219

export function registerSubagentRun(params: { ... }) {
  const now = Date.now();
  const cfg = loadConfig();
  const archiveAfterMs = resolveArchiveAfterMs(cfg);  // defaults to 60 minutes
  const spawnMode = params.spawnMode === "session" ? "session" : "run";
  // session mode does not set an auto-archive time
  const archiveAtMs =
    spawnMode === "session" ? undefined : archiveAfterMs ? now + archiveAfterMs : undefined;

  subagentRuns.set(params.runId, {
    runId: params.runId,
    childSessionKey: params.childSessionKey,
    controllerSessionKey: params.controllerSessionKey ?? params.requesterSessionKey,
    // ...all remaining fields initialized
    createdAt: now,
    startedAt: now,
    archiveAtMs,
    cleanupHandled: false,
  });

  ensureListener();      // start the in-process lifecycle event listener
  persistSubagentRuns(); // persist to disk immediately
  if (archiveAtMs) {
    startSweeper();      // start the cleanup sweeper (60s interval)
  }
  void waitForSubagentCompletion(params.runId, waitTimeoutMs);  // start the asynchronous wait
}

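Right after registration, two parallel "completion detection" paths are started:

Path 1: Gateway RPC agent.wait (primary path)

waitForSubagentCompletion (L1221-1284) long-polls via callGateway({ method: "agent.wait", ... }) until the sub-agent completes. Because the Gateway spans processes, this path can detect completion of a sub-agent running in any process (including another worker):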

const wait = await callGateway<{ status?, startedAt?, endedAt?, error? }>({
  method: "agent.wait",
  params: { runId, timeoutMs },
  timeoutMs: timeoutMs + 10_000,  // extra 10s as a network buffer
});
// a status of "ok" | "error" | "timeout" leads into completeSubagentRun

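Path 2: In-process lifecycle events (fallback)

ensureListener (L765-820) subscribes to onAgentEvent and handles phase=start, phase=end, and phase=error events. This is the fast path for embedded runs (the sub-agent runs in the same process): it does not have to wait for the Gateway RPC to time out.

V. Lifecycle Event Handling: Three Phases

The event-handling logic inside ensureListener (L770-819):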

// src/agents/subagent-registry.ts L770-819

listenerStop = onAgentEvent((evt) => {
  void (async () => {
    if (!evt || evt.stream !== "lifecycle") return;
    const phase = evt.data?.phase;
    const entry = subagentRuns.get(evt.runId);

    if (!entry) {
      // No matching run record, but on phase=end refresh the frozenResult (for other pending runs)
      if (phase === "end" && typeof evt.sessionKey === "string") {
        await refreshFrozenResultFromSession(evt.sessionKey);
      }
      return;
    }

    if (phase === "start") {
      clearPendingLifecycleError(evt.runId);
      if (startedAt) { entry.startedAt = startedAt; persistSubagentRuns(); }
      return;
    }

    if (phase === "error") {
      // ⚠️ Do not terminate immediately! Delay 15 seconds in case a start/end event follows
      schedulePendingLifecycleError({ runId, endedAt, error });
      return;
    }

    // phase === "end"
    clearPendingLifecycleError(evt.runId);
    const outcome = evt.data?.aborted ? { status: "timeout" } : { status: "ok" };
    await completeSubagentRun({ runId, endedAt, outcome, reason: COMPLETE, ... });
  })();
});

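Key design: delayed handling of phase=error

An embedded run can emit a transient error event during a model/provider retry and then emit a new start event as it begins again. Handling the error immediately would mark a sub-agent that is still retrying as terminated.

schedulePendingLifecycleError (L279-313) defers the work with setTimeout(LIFECYCLE_ERROR_RETRY_GRACE_MS = 15000): if a new start or end event arrives within 15 seconds, clearPendingLifecycleError cancels the timer; only if no follow-up event arrives within 15 seconds is the error completion actually triggered.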
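The grace-period idea can be illustrated with a small sketch. This is not the code in subagent-registry.ts; the class PendingLifecycleErrors and its method names are hypothetical, and only the constant LIFECYCLE_ERROR_RETRY_GRACE_MS and the 15-second value come from the article.

```typescript
// Hypothetical sketch of a cancellable error grace period.
// Only the constant name and its 15s value are taken from the article.
const LIFECYCLE_ERROR_RETRY_GRACE_MS = 15_000;

type TimerHandle = ReturnType<typeof setTimeout>;

class PendingLifecycleErrors {
  private readonly timers = new Map<string, TimerHandle>();
  private readonly graceMs: number;
  private readonly onConfirmed: (runId: string, error?: string) => void;

  constructor(graceMs: number, onConfirmed: (runId: string, error?: string) => void) {
    this.graceMs = graceMs;
    this.onConfirmed = onConfirmed;
  }

  // Called on phase=error: arm (or re-arm) the grace timer instead of failing the run.
  schedule(runId: string, error?: string): void {
    this.clear(runId);
    const handle = setTimeout(() => {
      this.timers.delete(runId);
      this.onConfirmed(runId, error);
    }, this.graceMs);
    this.timers.set(runId, handle);
  }

  // Called on phase=start or phase=end: the error was transient, so cancel the timer.
  clear(runId: string): boolean {
    const handle = this.timers.get(runId);
    if (handle === undefined) return false;
    clearTimeout(handle);
    this.timers.delete(runId);
    return true;
  }

  get pendingCount(): number {
    return this.timers.size;
  }
}
```

Keeping one timer per runId means a second error for the same run simply restarts the window, while a start or end event removes the pending error entirely.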

VI. Completion Flow: completeSubagentRun

completeSubagentRun (L451-530) is the most complex function in the lifecycle. It handles everything from "run ended" to "cleanup started":

// src/agents/subagent-registry.ts L451-530

async function completeSubagentRun(params: {
  runId: string;
  endedAt?: number;
  outcome: SubagentRunOutcome;
  reason: SubagentLifecycleEndedReason;
  sendFarewell?: boolean;
  accountId?: string;
  triggerCleanup: boolean;
}) {
  clearPendingLifecycleError(params.runId);
  const entry = subagentRuns.get(params.runId);
  if (!entry) return;

  // Special case: a complete event that arrives later may override an earlier killed mark
  if (params.reason === SUBAGENT_ENDED_REASON_COMPLETE &&
      entry.suppressAnnounceReason === "killed" &&
      (entry.cleanupHandled || typeof entry.cleanupCompletedAt === "number")) {
    entry.suppressAnnounceReason = undefined;
    entry.cleanupHandled = false;
    entry.cleanupCompletedAt = undefined;
    mutated = true;
  }

  // Update endedAt, outcome, endedReason
  // ...

  // Freeze the final reply content
  if (await freezeRunResultAtCompletion(entry)) { mutated = true; }

  if (mutated) { persistSubagentRuns(); }

  // Decide whether to fire the ended hook
  const suppressedForSteerRestart = suppressAnnounceForSteerRestart(entry);
  const shouldEmitEndedHook = !suppressedForSteerRestart && shouldEmitEndedHookForRun({ entry, reason });
  const shouldDeferEndedHook = shouldEmitEndedHook && params.triggerCleanup &&
    entry.expectsCompletionMessage === true && !suppressedForSteerRestart;

  if (!shouldDeferEndedHook && shouldEmitEndedHook) {
    await emitSubagentEndedHookForRun({ entry, reason, sendFarewell, accountId });
  }

  if (!params.triggerCleanup || suppressedForSteerRestart) return;
  startSubagentAnnounceCleanupFlow(params.runId, entry);
}

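There is a subtle hook-deferral rule here: when expectsCompletionMessage === true (the sub-agent needs to send a completion message) and cleanup is being triggered, shouldDeferEndedHook is true and the hook is not fired here. It fires later in emitCompletionEndedHookIfNeeded, after the announce completes. The reason is that the parent session only really learns that the sub-agent has ended once the announce flow has delivered the full completion result.

VII. Result Freeze: the Frozen Result Mechanism

After the sub-agent completes and before the announce is delivered, the system tries to capture the sub-agent's final reply text and stores it as frozenResultText.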

// src/agents/subagent-registry.ts L379-391

async function freezeRunResultAtCompletion(entry: SubagentRunRecord): Promise<boolean> {
  if (entry.frozenResultText !== undefined) {
    return false;  // already frozen, do not repeat
  }
  try {
    const captured = await captureSubagentCompletionReply(entry.childSessionKey);
    entry.frozenResultText = captured?.trim() ? capFrozenResultText(captured) : null;
  } catch {
    entry.frozenResultText = null;
  }
  entry.frozenResultCapturedAt = Date.now();
  return true;
}

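capFrozenResultText (L99-115) limits the size of the captured text: beyond 100KB it truncates the text and appends a note of the form [truncated: frozen completion output exceeded 100KB (...KB)], which keeps memory and disk usage in check.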
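As a rough illustration of byte-based capping, the sketch below truncates on a UTF-8 byte budget without splitting a multi-byte character. The function name capFrozenResultTextSketch, the blank-line separator before the notice, and the way the sizes are rounded are assumptions for illustration; only the 100KB limit and the notice wording come from the article.

```typescript
// Hypothetical sketch: cap text to a UTF-8 byte budget and append a notice.
const FROZEN_RESULT_TEXT_MAX_BYTES = 100 * 1024;

function capFrozenResultTextSketch(
  text: string,
  maxBytes: number = FROZEN_RESULT_TEXT_MAX_BYTES,
): string {
  const encoder = new TextEncoder();
  const totalBytes = encoder.encode(text).length;
  if (totalBytes <= maxBytes) return text;

  // Walk code points so a multi-byte character is never cut in half.
  let usedBytes = 0;
  let endIndex = 0;
  for (const ch of text) {
    const size = encoder.encode(ch).length;
    if (usedBytes + size > maxBytes) break;
    usedBytes += size;
    endIndex += ch.length;
  }

  const limitKb = Math.round(maxBytes / 1024);
  const totalKb = Math.ceil(totalBytes / 1024);
  // Separator and rounding are assumptions, not taken from the source file.
  return `${text.slice(0, endIndex)}\n\n[truncated: frozen completion output exceeded ${limitKb}KB (${totalKb}KB)]`;
}
```

Measuring bytes rather than string length matters for CJK or emoji-heavy replies, where one character can occupy three or four bytes on disk.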

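frozenResultText serves two purposes:

It is passed to runSubagentAnnounceFlow as roundOneReply, so the announce has content to show right away (without waiting for the sub-session's live stream)
After a process crash and restart, the run record is restored from disk and announce delivery continues without losing the completion result

fallbackFrozenResultText is the backup snapshot for steer restarts: when the previous run had already produced a frozenResultText, but the new run after the steer restart finishes with only NO_REPLY (no meaningful content), the fallback is used as the content of the final announce.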
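The selection rule described above can be sketched as follows. This is an assumption-laden illustration: pickAnnounceTextSketch and hasUsableReply are hypothetical names, and treating the literal string "NO_REPLY" as the sentinel is inferred from the prose rather than from the source file.

```typescript
// Hypothetical sketch of choosing between the frozen result and its fallback.
const NO_REPLY_TOKEN = "NO_REPLY"; // assumed sentinel value

type FrozenTexts = {
  frozenResultText?: string | null;
  fallbackFrozenResultText?: string | null;
};

// A reply is usable when it is a non-empty string other than the sentinel.
function hasUsableReply(text: string | null | undefined): text is string {
  if (typeof text !== "string") return false;
  const trimmed = text.trim();
  return trimmed.length > 0 && trimmed !== NO_REPLY_TOKEN;
}

function pickAnnounceTextSketch(record: FrozenTexts): string | undefined {
  const primary = record.frozenResultText;
  if (hasUsableReply(primary)) return primary;
  const fallback = record.fallbackFrozenResultText;
  if (hasUsableReply(fallback)) return fallback;
  return undefined; // nothing worth announcing
}
```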

VIII. Announce Delivery Flow: startSubagentAnnounceCleanupFlow

startSubagentAnnounceCleanupFlow (L532-579) starts the combined announce + cleanup flow:

// src/agents/subagent-registry.ts L532-579

function startSubagentAnnounceCleanupFlow(runId: string, entry: SubagentRunRecord): boolean {
  if (!beginSubagentCleanup(runId)) return false;  // re-entrancy guard

  const finalizeAnnounceCleanup = (didAnnounce: boolean) => {
    void finalizeSubagentCleanup(runId, entry.cleanup, didAnnounce).catch((err) => {
      // Cleanup failed: reset cleanupHandled so a resume can retry
      const current = subagentRuns.get(runId);
      if (!current || current.cleanupCompletedAt) return;
      current.cleanupHandled = false;
      persistSubagentRuns();
    });
  };

  void runSubagentAnnounceFlow({
    childSessionKey: entry.childSessionKey,
    childRunId: entry.runId,
    requesterSessionKey: entry.requesterSessionKey,
    requesterOrigin,
    task: entry.task,
    timeoutMs: SUBAGENT_ANNOUNCE_TIMEOUT_MS,  // 120s
    roundOneReply: entry.frozenResultText ?? undefined,
    fallbackReply: entry.fallbackFrozenResultText ?? undefined,
    waitForCompletion: false,
    // ...
  })
    .then((didAnnounce) => finalizeAnnounceCleanup(didAnnounce))
    .catch((error) => finalizeAnnounceCleanup(false));

  return true;
}

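beginSubagentCleanup (L1033-1047) uses entry.cleanupHandled as a re-entrancy guard: once it is set to true and persisted, any later retry detects it and skips, which prevents multiple processes or repeated callbacks from running cleanup at the same time.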
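A minimal check-then-set guard along these lines might look like the sketch below. beginCleanupSketch, the CleanupState type, and the injected persist callback are hypothetical; the real function may check more conditions.

```typescript
// Hypothetical sketch of a check-then-set cleanup guard.
type CleanupState = { cleanupHandled?: boolean; cleanupCompletedAt?: number };

function beginCleanupSketch(
  runs: Map<string, CleanupState>,
  runId: string,
  persist: () => void,
): boolean {
  const entry = runs.get(runId);
  if (!entry) return false;                                      // unknown run
  if (typeof entry.cleanupCompletedAt === "number") return false; // already finished
  if (entry.cleanupHandled) return false;                        // someone else is cleaning up
  entry.cleanupHandled = true;
  persist(); // make the claim visible to later readers of the persisted state
  return true;
}
```

Within a single process the check and the set happen synchronously, so two callbacks cannot both win. Across processes the guarantee depends on how the persisted state is read back, which this sketch does not model.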

IX. Cleanup Decision: resolveDeferredCleanupDecision

When runSubagentAnnounceFlow returns false (the announce was not delivered), finalizeSubagentCleanup (L860-955) has to decide what to do next. That decision is made by resolveDeferredCleanupDecision (subagent-registry-cleanup.ts L33-74):

// src/agents/subagent-registry-cleanup.ts L33-74

export function resolveDeferredCleanupDecision(params: { ... }): DeferredCleanupDecision {
  const endedAgo = resolveEndedAgoMs(params.entry, params.now);
  const isCompletionMessageFlow = params.entry.expectsCompletionMessage === true;
  const completionHardExpiryExceeded = isCompletionMessageFlow && endedAgo > params.announceCompletionHardExpiryMs;

  // Case 1: wait for descendant sub-agents to finish (completion flow only)
  if (isCompletionMessageFlow && params.activeDescendantRuns > 0) {
    if (completionHardExpiryExceeded) {
      return { kind: "give-up", reason: "expiry" };
    }
    return { kind: "defer-descendants", delayMs: params.deferDescendantDelayMs };
  }

  // Case 2: retry limit or expiry exceeded → give up
  const retryCount = (params.entry.announceRetryCount ?? 0) + 1;
  const expiryExceeded = isCompletionMessageFlow
    ? completionHardExpiryExceeded
    : endedAgo > params.announceExpiryMs;  // 5 minutes for the regular flow
  if (retryCount >= params.maxAnnounceRetryCount || expiryExceeded) {
    return {
      kind: "give-up",
      reason: retryCount >= params.maxAnnounceRetryCount ? "retry-limit" : "expiry",
      retryCount,
    };
  }

  // Case 3: keep retrying
  return {
    kind: "retry",
    retryCount,
    resumeDelayMs: isCompletionMessageFlow
      ? params.resolveAnnounceRetryDelayMs(retryCount)
      : undefined,
  };
}

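The three decisions:

Decision	Trigger	Behavior
defer-descendants	completion flow with descendant sub-agents still active	Sets wakeOnDescendantSettle=true and tries again once the descendants finish
give-up	More than 3 retries, or more than 5 minutes (regular flow) / 30 minutes (completion flow)	Abandons the announce and forces cleanup to complete
retry	Everything else	Calls resumeSubagentRun again after exponential backoff

Retry backoff (resolveAnnounceRetryDelayMs L117-123): starts at 1s, doubles each time, and is capped at 8s (MAX_ANNOUNCE_RETRY_DELAY_MS). At most 3 retries are attempted (MAX_ANNOUNCE_RETRY_COUNT).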
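The backoff schedule can be written out as a short sketch. The constants MAX_ANNOUNCE_RETRY_DELAY_MS and the 1s starting point come from the article; the function name resolveAnnounceRetryDelayMsSketch, the constant MIN_ANNOUNCE_RETRY_DELAY_MS, and the assumption that retryCount is 1-based are illustrative.

```typescript
// Hypothetical sketch of a doubling backoff with a ceiling.
const MIN_ANNOUNCE_RETRY_DELAY_MS = 1_000; // assumed name for the 1s starting delay
const MAX_ANNOUNCE_RETRY_DELAY_MS = 8_000;

function resolveAnnounceRetryDelayMsSketch(retryCount: number): number {
  // Treat anything that is not a finite count >= 1 as the first attempt.
  const attempt = Number.isFinite(retryCount) ? Math.max(1, Math.floor(retryCount)) : 1;
  const delay = MIN_ANNOUNCE_RETRY_DELAY_MS * 2 ** (attempt - 1);
  return Math.min(delay, MAX_ANNOUNCE_RETRY_DELAY_MS);
}
```

Under these assumptions the sequence is 1s, 2s, 4s, 8s, 8s, and so on. Given the retry limit of 3 and the >= comparison in resolveDeferredCleanupDecision, only the first couple of delays would be reached in practice before the give-up branch takes over.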

X. Forced Termination: markSubagentRunTerminated

When the user terminates a sub-agent with the /kill command, or the parent agent terminates it on timeout, markSubagentRunTerminated (L1369-1428) is called:

// src/agents/subagent-registry.ts L1369-1428

export function markSubagentRunTerminated(params: {
  runId?: string;
  childSessionKey?: string;
  reason?: string;
}): number {
  // Lookup by runId or childSessionKey (several runIds may map to the same childSessionKey)
  const runIds = new Set<string>();
  // ...collect every matching runId...

  for (const runId of runIds) {
    const entry = subagentRuns.get(runId);
    if (typeof entry.endedAt === "number") continue;  // skip runs that already ended
    entry.endedAt = now;
    entry.outcome = { status: "error", error: reason };
    entry.endedReason = SUBAGENT_ENDED_REASON_KILLED;
    entry.cleanupHandled = true;       // skip the announce flow
    entry.cleanupCompletedAt = now;    // cleanup completes immediately
    entry.suppressAnnounceReason = "killed";  // suppress the announce
  }

  // Fire the subagent_ended hook once per distinct childSessionKey
  for (const entry of entriesByChildSessionKey.values()) {
    void emitSubagentEndedHookOnce({ ..., reason: KILLED, outcome: KILLED, ... });
  }
}

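What makes the killed state special is that it completes cleanup immediately (it sets cleanupCompletedAt) and sends no announce (suppressAnnounceReason = "killed"). The parent agent does not receive a "sub-agent was killed" message, which matches the semantics of a silent termination.

There is one exception: if a complete event arrives after the kill (from an embedded subtask that was still running), completeSubagentRun detects this and clears the killed mark so the announce proceeds normally (L467-478):

// Allow complete to override killed (when complete is the more accurate outcome)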
if (params.reason === SUBAGENT_ENDED_REASON_COMPLETE &&
    entry.suppressAnnounceReason === "killed" &&
    (entry.cleanupHandled || typeof entry.cleanupCompletedAt === "number")) {
  entry.suppressAnnounceReason = undefined;
  entry.cleanupHandled = false;
  entry.cleanupCompletedAt = undefined;
  mutated = true;
}

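XI. Steer Restart: Preserving the Sub-Agent's Identity

"Steer" is the OpenClaw operation that redirects a running agent to a new task. When the parent agent steers a running sub-agent, it does not simply kill the old one and create a new one. Instead, replaceSubagentRunAfterSteer (L1089-1157) reuses the sub-agent with the same childSessionKey and replaces only the runId: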

// src/agents/subagent-registry.ts L1089-1157

export function replaceSubagentRunAfterSteer(params: {
  previousRunId: string;
  nextRunId: string;
  fallback?: SubagentRunRecord;
  runTimeoutSeconds?: number;
  preserveFrozenResultFallback?: boolean;
}) {
  const previous = subagentRuns.get(previousRunId);
  const source = previous ?? params.fallback;  // fallback covers the process-restart case

  subagentRuns.delete(previousRunId);

  const next: SubagentRunRecord = {
    ...source,            // inherit all metadata (task, label, requesterSessionKey, etc.)
    runId: nextRunId,     // the new runId
    startedAt: now,
    endedAt: undefined,   // reset the ended state
    endedReason: undefined,
    outcome: undefined,
    frozenResultText: undefined,          // clear the old frozen result
    fallbackFrozenResultText: preserveFrozenResultFallback
      ? source.frozenResultText           // keep the previous result as the fallback
      : undefined,
    cleanupCompletedAt: undefined,
    cleanupHandled: false,
    suppressAnnounceReason: undefined,
    announceRetryCount: undefined,
    spawnMode: "run",     // the run after a steer does not keep the session binding
  };

  subagentRuns.set(nextRunId, next);
  ensureListener();
  persistSubagentRuns();
  void waitForSubagentCompletion(nextRunId, waitTimeoutMs);
}

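The three stages of the steer flow:

markSubagentRunForSteerRestart(runId): sets suppressAnnounceReason = "steer-restart" so the old run does not emit an announce (its result is not final yet)
Old run ends: the lifecycle end event arrives, but suppressAnnounceReason keeps the cleanup flow from starting
replaceSubagentRunAfterSteer: creates the new run, inherits the old run's metadata, and clears the steer-restart flag

If the steer fails (the new run was not created), clearSubagentRunSteerRestart (L1066-1087) clears the suppress flag and triggers a resume, so the old run does not stay suppressed forever.

XII. Descendant Counting: Descendant Queries and Cascading Waits

subagent-registry-queries.ts implements a BFS-based (breadth-first search) traversal of descendants: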

// src/agents/subagent-registry-queries.ts L155-185

function forEachDescendantRun(
  runs: Map<string, SubagentRunRecord>,
  rootSessionKey: string,
  visitor: (runId: string, entry: SubagentRunRecord) => void,
): boolean {
  const root = rootSessionKey.trim();
  const pending = [root];
  const visited = new Set<string>([root]);
  for (let index = 0; index < pending.length; index += 1) {
    const requester = pending[index];
    for (const [runId, entry] of runs.entries()) {
      if (entry.requesterSessionKey !== requester) continue;
      visitor(runId, entry);
      const childKey = entry.childSessionKey.trim();
      if (!childKey || visited.has(childKey)) continue;
      visited.add(childKey);
      pending.push(childKey);  // add the sub-agent's session to the BFS queue as well
    }
  }
  return true;
}

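Built on this BFS, the three public functions mean the following:

countActiveDescendantRuns: the number of descendants with no endedAt (still running)
countPendingDescendantRuns: the number of descendants with no endedAt or no cleanupCompletedAt (running, or cleanup not finished)
countActiveRunsForSession: among the runs directly controlled by a session, the ones that are active (including runs that have ended but whose descendants are still being cleaned up)

countPendingDescendantRuns is designed so that resolveDeferredCleanupDecision can tell that, even when a sub-agent has finished running, the parent agent's completion flow must keep waiting as long as that sub-agent's announce cleanup is unfinished. This preserves announce ordering: the parent agent's completion message never reaches the user before the sub-agent's announce.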
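The difference between "active" and "pending" is easiest to see on a small tree. The sketch below re-implements the traversal in simplified form; RunLite, collectDescendants, countActiveSketch, and countPendingSketch are hypothetical names, and the predicates follow the prose definitions above rather than the actual source.

```typescript
// Hypothetical sketch: active vs. pending descendant counts on a tiny run tree.
type RunLite = {
  requesterSessionKey: string;
  childSessionKey: string;
  endedAt?: number;
  cleanupCompletedAt?: number;
};

// Breadth-first walk from the root session through requester → child links.
function collectDescendants(runs: Map<string, RunLite>, rootSessionKey: string): RunLite[] {
  const pending = [rootSessionKey];
  const visited = new Set<string>(pending);
  const found: RunLite[] = [];
  for (let i = 0; i < pending.length; i += 1) {
    for (const entry of runs.values()) {
      if (entry.requesterSessionKey !== pending[i]) continue;
      found.push(entry);
      if (!visited.has(entry.childSessionKey)) {
        visited.add(entry.childSessionKey);
        pending.push(entry.childSessionKey);
      }
    }
  }
  return found;
}

function countActiveSketch(runs: Map<string, RunLite>, root: string): number {
  return collectDescendants(runs, root).filter((e) => typeof e.endedAt !== "number").length;
}

function countPendingSketch(runs: Map<string, RunLite>, root: string): number {
  return collectDescendants(runs, root).filter(
    (e) => typeof e.endedAt !== "number" || typeof e.cleanupCompletedAt !== "number",
  ).length;
}

// root → child-a (ended, cleaned up) → child-b (running); root → child-c (ended, cleanup pending)
const runs = new Map<string, RunLite>([
  ["run-a", { requesterSessionKey: "root", childSessionKey: "child-a", endedAt: 100, cleanupCompletedAt: 120 }],
  ["run-b", { requesterSessionKey: "child-a", childSessionKey: "child-b" }],
  ["run-c", { requesterSessionKey: "root", childSessionKey: "child-c", endedAt: 110 }],
]);
```

In this example only run-b is active, while both run-b and run-c are pending, so a parent waiting on the pending count would also wait for run-c's announce cleanup.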

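XIII. Announce Queue: subagent-announce-queue.ts

Several sub-agents may finish at the same time and send announce messages to the same parent session. subagent-announce-queue.ts provides a per-session queue that prevents out-of-order delivery and message flooding: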

// src/agents/subagent-announce-queue.ts L22-36

export type AnnounceQueueItem = {
  announceId?: string;  // deduplication ID
  prompt: string;       // announce message body
  summaryLine?: string; // summary line used when the queue overflows
  enqueuedAt: number;
  sessionKey: string;
  origin?: DeliveryContext;
  // ...
};

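The queue has two modes (from queue-helpers.ts):

direct mode: each item is delivered on its own, one after another in order
collect mode: items accumulated during the debounce window are merged into a single message

The queue has a capacity limit (20 by default). When it is exceeded, the dropPolicy decides what happens:

summarize (default): overflowed items are compressed into their summaryLine, and a single summary message is eventually sent
new: drop the incoming item (keep what is already queued)

Failure backoff (L192-209): each failed drain increments consecutiveFailures, and the wait grows exponentially (2s, 4s, 8s, ... capped at 60s); it resets to zero on success.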
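The two drop policies can be contrasted with a small sketch. Everything here is hypothetical (the AnnounceQueueSketch shape, enqueueAnnounceSketch, the droppedCount and summaryLines fields), and one point is an explicit assumption: under summarize, this sketch evicts the oldest queued items to make room, which the article does not state.

```typescript
// Hypothetical sketch of capacity handling with two drop policies.
type DropPolicy = "summarize" | "new";
type QueuedAnnounce = { prompt: string; summaryLine?: string };
type AnnounceQueueSketch = {
  items: QueuedAnnounce[];
  cap: number;
  dropPolicy: DropPolicy;
  droppedCount: number;
  summaryLines: string[];
};

// Returns true when the item was queued, false when it was rejected.
function enqueueAnnounceSketch(queue: AnnounceQueueSketch, item: QueuedAnnounce): boolean {
  if (queue.items.length >= queue.cap) {
    if (queue.dropPolicy === "new") return false; // keep existing items, drop the newcomer
    // "summarize": ASSUMPTION - evict the oldest items and keep only a one-line summary of each.
    const overflow = queue.items.length - queue.cap + 1;
    for (const dropped of queue.items.splice(0, overflow)) {
      queue.droppedCount += 1;
      queue.summaryLines.push(dropped.summaryLine ?? dropped.prompt.slice(0, 80));
    }
  }
  queue.items.push(item);
  return true;
}
```

With summarize, nothing is lost silently: the full prompt is gone, but a summary line survives and can be folded into one digest message when the queue drains.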

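XIV. Recovery After Process Restart: Persistence and Resume

One important design goal of subagent-registry.ts is to be transparent to process restarts. Every run record is persisted to disk by persistSubagentRunsToDisk (subagent-registry-state.ts L7-13) at ~/.openclaw/agents/.../subagent-registry.json, and after a restart it is restored through initSubagentRegistry (L1483-1485):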

// src/agents/subagent-registry.ts L660-690

function restoreSubagentRunsOnce() {
  if (restoreAttempted) return;
  restoreAttempted = true;
  try {
    const restoredCount = restoreSubagentRunsFromDisk({ runs: subagentRuns, mergeOnly: true });
    if (restoredCount === 0) return;
    if (reconcileOrphanedRestoredRuns()) { persistSubagentRuns(); }
    if (subagentRuns.size === 0) return;
    ensureListener();
    if ([...subagentRuns.values()].some((entry) => entry.archiveAtMs)) { startSweeper(); }
    for (const runId of subagentRuns.keys()) {
      resumeSubagentRun(runId);
    }
  } catch { /* ignore */ }
}

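Orphaned-run detection (reconcileOrphanedRestoredRuns): after a restart, each restored run is checked to see whether its childSessionKey still exists in the session store. If the session has been deleted (for example, the user cleared the data manually), the run is an orphan and is immediately marked as error and removed (L184-225).

resumeSubagentRun (L581-658): decides the next step for each restored run: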

function resumeSubagentRun(runId: string) {
  if (resumedRuns.has(runId)) return;  // re-entrancy guard (already being handled in this process)
  const entry = subagentRuns.get(runId);

  // Check for orphan status
  const orphanReason = resolveSubagentRunOrphanReason({ entry });
  if (orphanReason) { reconcileOrphanedRun(...); return; }

  if (entry.cleanupCompletedAt) return;  // cleanup already completed

  // Retry limit exceeded → give up
  if ((entry.announceRetryCount ?? 0) >= MAX_ANNOUNCE_RETRY_COUNT) { /* give-up */ }

  // Regular run older than 5 minutes → give up
  if (!isCompletionMessageFlow && Date.now() - entry.endedAt > ANNOUNCE_EXPIRY_MS) { /* give-up */ }

  // Backoff wait still pending → retry after the delay
  if (isCompletionMessageFlow && now < earliestRetryAt) {
    setTimeout(() => resumeSubagentRun(runId), waitMs);
    return;
  }

  if (typeof entry.endedAt === "number" && entry.endedAt > 0) {
    // Ended but announce not finished → try the announce again
    startSubagentAnnounceCleanupFlow(runId, entry);
    return;
  }

  // Not ended yet → wait for completion again
  void waitForSubagentCompletion(runId, waitTimeoutMs);
}

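Read-time disk merge (getSubagentRunsSnapshotForRead L37-56): outside test environments, query-style interfaces (such as countActiveRunsForSession) merge in-memory and on-disk state, so that in a multi-process setup worker A can see runs registered by worker B.

XV. Depth Control: subagent-depth.ts

A sub-agent can spawn sub-agents of its own, forming nested levels. subagent-depth.ts maintains a mechanism that reads spawnDepth from the session store to prevent unbounded recursion: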

// src/agents/subagent-depth.ts L124-176

export function getSubagentDepthFromSessionStore(
  sessionKey: string | undefined | null,
  opts?: { cfg?: OpenClawConfig; store?: Record<string, SessionDepthEntry>; },
): number {
  const fallbackDepth = getSubagentDepth(raw);  // parsed from the session key format
  // ...

  const depthFromStore = (key: string): number | undefined => {
    const entry = resolveEntryForSessionKey({ sessionKey: key, cfg, store, cache });
    const storedDepth = normalizeSpawnDepth(entry?.spawnDepth);
    if (storedDepth !== undefined) return storedDepth;

    const spawnedBy = normalizeSessionKey(entry?.spawnedBy);
    if (!spawnedBy) return undefined;

    const parentDepth = depthFromStore(spawnedBy);  // recurse up to the parent session
    if (parentDepth !== undefined) return parentDepth + 1;

    return getSubagentDepth(spawnedBy) + 1;
  };

  return depthFromStore(raw) ?? fallbackDepth;
}

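Depth is traced along the spawnedBy chain in the session store: a sub-agent's depth = its parent's depth + 1. A visited Set (elided from the excerpt above) guards against infinite recursion caused by circular references. The system checks this depth at spawn time and refuses to create a sub-agent once the limit is exceeded (the limit itself is implemented in subagent-spawn.ts).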
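Because the excerpt omits the cycle guard, the sketch below shows one way the spawnedBy walk and a visited set could fit together. resolveDepthSketch, DepthEntry, and the default fallback of depth 0 are assumptions for illustration; the real code derives its fallback from the session key format.

```typescript
// Hypothetical sketch: resolve spawn depth by walking spawnedBy with a cycle guard.
type DepthEntry = { spawnDepth?: number; spawnedBy?: string };

function resolveDepthSketch(
  sessionKey: string,
  store: Record<string, DepthEntry | undefined>,
  fallbackDepth: (key: string) => number = () => 0, // assumed fallback
): number {
  const visited = new Set<string>();

  const depthFromStore = (key: string): number | undefined => {
    if (visited.has(key)) return undefined; // cycle detected: stop recursing
    visited.add(key);
    const entry = store[key];
    if (entry?.spawnDepth !== undefined) return entry.spawnDepth; // explicit depth wins
    const parent = entry?.spawnedBy;
    if (!parent) return undefined; // no parent recorded
    const parentDepth = depthFromStore(parent);
    return (parentDepth ?? fallbackDepth(parent)) + 1;
  };

  return depthFromStore(sessionKey) ?? fallbackDepth(sessionKey);
}
```

The visited set bounds the recursion at one visit per session key, so even a corrupted store in which two sessions name each other as parent yields a finite depth.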

XVI. Auto-Archiving: the Sweeper

For sub-agents with spawnMode === "run", registration sets archiveAtMs = now + archiveAfterMs (60 minutes later by default). sweepSubagentRuns (L726-763) runs every 60 seconds:

// src/agents/subagent-registry.ts L726-763

async function sweepSubagentRuns() {
  const now = Date.now();
  let mutated = false;
  for (const [runId, entry] of subagentRuns.entries()) {
    if (!entry.archiveAtMs || entry.archiveAtMs > now) continue;
    clearPendingLifecycleError(runId);
    // Notify the context engine (the memory system)
    void notifyContextEngineSubagentEnded({
      childSessionKey: entry.childSessionKey,
      reason: "swept",
      workspaceDir: entry.workspaceDir,
    });
    subagentRuns.delete(runId);
    mutated = true;
    await safeRemoveAttachmentsDir(entry);
    // Delete the session through the gateway RPC
    await callGateway({
      method: "sessions.delete",
      params: { key: entry.childSessionKey, deleteTranscript: true, emitLifecycleHooks: false },
      timeoutMs: 10_000,
    });
  }
  if (subagentRuns.size === 0) stopSweeper();  // no runs left, so stop the timer
}

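archiveAtMs is not set for session-mode sub-agents (archiveAtMs = undefined), because a session-mode sub-agent has a persistent thread binding and should not be reclaimed automatically. The sweeper stops itself (stopSweeper()) once every run has been cleaned up.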
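The selection step of the sweep can be isolated into a pure function, which makes the session-mode exemption easy to see. selectExpiredRunIdsSketch and ArchivableRun are hypothetical names; the skip condition mirrors the one in the excerpt above.

```typescript
// Hypothetical sketch: pick the runs whose archive deadline has passed.
type ArchivableRun = { childSessionKey: string; archiveAtMs?: number };

function selectExpiredRunIdsSketch(runs: Map<string, ArchivableRun>, now: number): string[] {
  const expired: string[] = [];
  for (const [runId, entry] of runs) {
    // No deadline (session mode) or a deadline still in the future: leave the run alone.
    if (!entry.archiveAtMs || entry.archiveAtMs > now) continue;
    expired.push(runId);
  }
  return expired;
}
```

A run without archiveAtMs is never selected, no matter how old it is, which is how session-mode sub-agents stay out of the sweeper's reach.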

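XVII. Safe Deletion of the Attachments Directory

Attachments produced by a sub-agent's tool calls are stored in attachmentsDir. Cleanup has to delete them safely and avoid path traversal: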

// src/agents/subagent-registry.ts L822-858

async function safeRemoveAttachmentsDir(entry: SubagentRunRecord): Promise<void> {
  if (!entry.attachmentsDir || !entry.attachmentsRootDir) return;

  const [rootReal, dirReal] = await Promise.all([
    resolveReal(entry.attachmentsRootDir),
    resolveReal(entry.attachmentsDir),
  ]);

  const rootBase = rootReal ?? path.resolve(entry.attachmentsRootDir);
  const rootWithSep = rootBase.endsWith(path.sep) ? rootBase : `${rootBase}${path.sep}`;
  // dirBase is elided in the original excerpt; reconstructed here by analogy with rootBase (assumption)
  const dirBase = dirReal ?? path.resolve(entry.attachmentsDir);

  // Safety check: attachmentsDir must be a subdirectory of attachmentsRootDir
  if (!dirBase.startsWith(rootWithSep)) return;

  await fs.rm(dirBase, { recursive: true, force: true });
}

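Symbolic links are resolved with fs.realpath, and then dirBase is checked to confirm that it starts with rootBase + path.sep. Even if attachmentsDir is a symlink pointing at a system directory, realpath exposes the real path, so nothing outside the root directory is ever deleted.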
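The reason for appending the separator before the prefix check is worth a tiny example. isInsideRootSketch is a hypothetical, pure-string version of the containment test; it assumes both paths have already been resolved to real paths.

```typescript
// Hypothetical sketch: containment check on already-resolved paths.
function isInsideRootSketch(rootReal: string, dirReal: string, sep: string = "/"): boolean {
  // Appending the separator stops "/data/att-evil" from matching the root "/data/att".
  const rootWithSep = rootReal.endsWith(sep) ? rootReal : `${rootReal}${sep}`;
  return dirReal.startsWith(rootWithSep);
}
```

A plain startsWith(rootReal) would accept a sibling directory whose name merely begins with the root's name. With the separator appended, the root itself is also rejected, so the check only ever approves strict subdirectories.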

XVIII. Overall State Machine Summary

        ┌───────────────────────┐
        │  registerSubagentRun  │
        └───────────────────────┘
                    │
                    ▼
        ┌───────────────────────┐
        │        RUNNING        │  ← waitForSubagentCompletion / lifecycle listener
        └───────────────────────┘
          /        |         \
         /         |          \
        ▼          ▼           ▼
   [complete]  [error]      [killed]
        │          │              │
        │   schedulePendingLifecycleError (15s delay)
        │          │              │
        │          ▼              │
        └──→ completeSubagentRun ─┘
                    │
                    ├─ suppressAnnounceReason = steer-restart → stop and wait for the steer to finish
                    │
                    ▼
           freezeRunResultAtCompletion
                    │
                    ▼
         startSubagentAnnounceCleanupFlow
                    │
              runSubagentAnnounceFlow
              /              \
         [success]          [false]
             │                  │
             │         resolveDeferredCleanupDecision
             │           /       |         \
             │    [defer-desc] [retry]  [give-up]
             │         │         │          │
             │    wakeOnDescend  │     force complete
             │    =true + retry  │
             │                  └─> setTimeout → resumeSubagentRun
             │
             ▼
      finalizeSubagentCleanup (didAnnounce=true)
             │
      completeCleanupBookkeeping
             │
     ┌───────┴───────┐
  [delete]        [keep]
     │                │
  remove record   set cleanupCompletedAt
  notify context  notify context engine
  engine          persist
  retryDeferred
  Announces
             │
             ▼
      [archiveAtMs reached] → sweeper → sessions.delete

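XIX. Design Philosophy Summary

Looking back over the whole sub-agent lifecycle management system, a few core design principles stand out:

1. Dual-path completion detection, each path a fallback for the other. The Gateway RPC agent.wait can observe completion across processes, and in-process lifecycle events respond quickly to embedded runs. Both paths may trigger completeSubagentRun, but the function has built-in idempotency protection (runOutcomesEqual + state checks).

2. cleanupHandled as an atomic re-entrancy guard. beginSubagentCleanup checks first, then sets cleanupHandled = true and persists it. Even if several processes trigger startSubagentAnnounceCleanupFlow concurrently, only the first actually enters the cleanup flow.

3. Disk persistence + recovery, so progress survives process restarts. Every key state change (registration, completion, cleanup) calls persistSubagentRuns immediately. After a restart, restoreSubagentRunsOnce restores all pending runs and continues with resumeSubagentRun.

4. Announce delivery is decoupled from run cleanup. The announce is "best effort" (at most 3 retries, 5-minute expiry), and a failed announce does not stop the run from being marked complete. The completion flow has its own 30-minute hard expiry to prevent waiting forever.

5. Errors do not terminate immediately, leaving a buffer for provider retries. LIFECYCLE_ERROR_RETRY_GRACE_MS = 15s delays the handling of error events, so transient errors during model fallback do not wrongly terminate a sub-agent.

6. Identity continuity across steer restarts. Through the combination of suppressAnnounceReason, replaceSubagentRunAfterSteer, and fallbackFrozenResultText, a sub-agent remains "the same task" in the parent agent's eyes after a steer restart: it does not announce twice, and the previous execution result is not lost.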

Appendix: Core Source File Index

The main source files covered in this article: