OpenClaw Auth Profiles and Multi-Key Cooldown Isolation in Depth: How an API Key Is Selected, Tracked, and Rotated

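This article is a deep dive based on the OpenClaw source code (the src/agents/auth-profiles/ directory, 23 files and roughly 5,800 lines in total). It covers the credential storage structure, multi-key round-robin scheduling, the cooldown isolation algorithm, OAuth token refresh, session-level profile binding, and cross-agent credential inheritance.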

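1. Where the problem starts: why one API key is not enough

By design, a simple AI chat client needs only one setting: OPENAI_API_KEY=sk-xxx. Real deployments are far more complicated:

A team shares several Anthropic keys, each with its own independent rate-limit quota
The current key is suspended over a billing problem, and the client must switch to a backup key automatically
A key is rate limited (HTTP 429) for some period and should be skipped temporarily while other keys carry on
OAuth credentials expire and must be refreshed silently in the background
The same provider (for example Anthropic) grants different permissions under different accounts, so a session must be bound to a specific account

These requirements gave rise to OpenClaw's Auth Profile system. It abstracts "one API key" into "a set of Auth Profiles", where each profile is an independent credential entity with its own cooldown state, usage statistics, and position in the rotation.

The core of the system is the src/agents/auth-profiles/ directory. The entry point is src/agents/auth-profiles.ts, which re-exports what the 23 internal files provide as a single surface for upper-layer callers.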

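2. Data structures: the full shape of AuthProfileStore

2.1 The three-way credential type tree

Every credential is represented by the AuthProfileCredential union type, which has three variants: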

// src/agents/auth-profiles/types.ts L5-36

export type ApiKeyCredential = {
  type: "api_key";
  provider: string;
  key?: string;           // plaintext key or a ${ENV_VAR} reference
  keyRef?: SecretRef;     // external secret reference (keychain / secret manager)
  email?: string;
  metadata?: Record<string, string>;
};

export type TokenCredential = {
  type: "token";
  provider: string;
  token?: string;
  tokenRef?: SecretRef;
  expires?: number;       // ms epoch; when absent, the token never expires
  email?: string;
};

export type OAuthCredential = OAuthCredentials & {
  type: "oauth";
  provider: string;
  clientId?: string;
  email?: string;
  // OAuthCredentials comes from @mariozechner/pi-ai and includes access/refresh/expires/projectId, etc.
};

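How the three types differ:

Type	Typical use	Refresh mechanism	Expiry check
`api_key`	OpenAI key, Anthropic key	None (treated as permanently valid)	None
`token`	GitHub PAT, static bearer token	None (must be updated manually)	`expires` field (optional)
`oauth`	Anthropic Console OAuth, Qwen Portal	Automatic refresh (refresh token)	`expires` field (required)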

2.2 Profile naming convention: `provider:suffix`

Every profile has a unique ID in the form {provider}:{suffix}:

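anthropic:default          # compatibility ID for migrated legacy credentials
anthropic:user@email.com   # named after the account email (generated automatically after OAuth login)
openai:sk-prod             # user-defined suffix
amazon-bedrock:default     # Bedrock default
anthropic:claude-cli       # synced from the external Claude CLI
openai-codex:codex-cli     # synced from the OpenAI Codex CLI
qwen-portal:qwen-cli       # synced from the Qwen Code CLI
minimax-portal:minimax-cli # synced from the MiniMax CLI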

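The CLI-synced IDs are constants defined in constants.ts (L7-10). The naming rule is a convention rather than something the code enforces, but suggestOAuthProfileIdForLegacyDefault (repair.ts L23-82) infers a profile ID from the email suffix when repairing an OAuth login, so email-based names carry special meaning.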
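To make the {provider}:{suffix} convention concrete, here is a minimal sketch of how such an ID could be split and how an email-style suffix could be detected. Both parseProfileId and looksLikeEmailSuffix are hypothetical helpers written for illustration; they are not taken from the OpenClaw source, and splitting on the first colon is an assumption.

```typescript
// Hypothetical helpers for illustration only; not part of the OpenClaw source.
type ParsedProfileId = { provider: string; suffix: string };

// Split on the FIRST colon (an assumption), so suffixes may themselves contain colons.
function parseProfileId(profileId: string): ParsedProfileId | null {
  const idx = profileId.indexOf(":");
  // Require a non-empty provider and a non-empty suffix.
  if (idx <= 0 || idx === profileId.length - 1) return null;
  return { provider: profileId.slice(0, idx), suffix: profileId.slice(idx + 1) };
}

// Rough check for an email-style suffix such as "user@email.com".
function looksLikeEmailSuffix(profileId: string): boolean {
  const parsed = parseProfileId(profileId);
  return parsed !== null && /^[^@\s]+@[^@\s]+$/.test(parsed.suffix);
}
```

Under these assumptions, "anthropic:user@email.com" parses into provider "anthropic" with an email suffix, while "anthropic:default" parses but is not email-style.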

2.3 AuthProfileStore: the on-disk format

// src/agents/auth-profiles/types.ts L61-73

export type AuthProfileStore = {
  version: number;
  profiles: Record<string, AuthProfileCredential>;  // profileId -> credential
  order?: Record<string, string[]>;    // provider -> profileId list (user/store-level order override)
  lastGood?: Record<string, string>;   // provider -> most recently successful profileId
  usageStats?: Record<string, ProfileUsageStats>;  // profileId -> usage statistics
};

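This structure is serialized directly to the file ~/.openclaw/agents/main/agent/auth-profiles.json (or the corresponding path for a sub-agent). A few things are worth noting:

profiles: the core credential data, and the only persisted part that holds secrets
order: the priority ordering, which can be overridden at runtime
lastGood: records the last profile that succeeded, used for inference during OAuth repair
usageStats: the state carrier for the cooldown system, holding all of its timestamps

2.4 ProfileUsageStats: the carrier of cooldown state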

// src/agents/auth-profiles/types.ts L51-58

export type ProfileUsageStats = {
  lastUsed?: number;          // time of last use (ms epoch)
  cooldownUntil?: number;     // end of the transient cooldown (rate_limit / overloaded)
  disabledUntil?: number;     // end of the long-term disable window (billing / auth_permanent)
  disabledReason?: AuthProfileFailureReason;  // why the profile was disabled
  errorCount?: number;        // cumulative error count (input to the exponential backoff)
  failureCounts?: Partial<Record<AuthProfileFailureReason, number>>;  // error count per reason
  lastFailureAt?: number;     // time of the last failure (used for failure-window decay)
};

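The difference between cooldownUntil and disabledUntil is critical:

cooldownUntil: a short-term cooldown for transient errors such as rate limiting and overload. Once the cooldown passes, the profile recovers automatically and errorCount is reset to zero
disabledUntil: a long-term disable window for persistent errors such as billing problems and permanent authentication failures. The backoff starts at 5 hours, is capped at 24 hours, and grows exponentially (doubling with each repeated failure)

The two fields are managed independently and do not interfere with each other. A profile can carry both a cooldownUntil (already expired) and a disabledUntil (still active); in that case resolveProfileUnusableUntil takes the Math.max of the two.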
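The "take the later of the two windows" rule can be sketched as follows. This is an illustrative reimplementation rather than the real resolveProfileUnusableUntil: the names carry a Sketch suffix, and returning null when neither field is set is an assumption.

```typescript
// Illustrative sketch; the real resolveProfileUnusableUntil may differ in details.
type UsageWindows = { cooldownUntil?: number; disabledUntil?: number };

// Returns the later of the two window ends, or null when neither is set (assumed).
function resolveProfileUnusableUntilSketch(stats: UsageWindows): number | null {
  const values = [stats.cooldownUntil, stats.disabledUntil].filter(
    (v): v is number => typeof v === "number" && Number.isFinite(v) && v > 0,
  );
  return values.length === 0 ? null : Math.max(...values);
}

// A profile is unusable while "now" is still before that combined end time.
function isUnusableAtSketch(stats: UsageWindows, now: number): boolean {
  const until = resolveProfileUnusableUntilSketch(stats);
  return until !== null && now < until;
}
```

With cooldownUntil = 1000 and disabledUntil = 5000, the profile stays unusable at t = 2000 even though the short cooldown has already passed, because the disable window is still active.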

3. Store loading: a three-layer merge strategy

3.1 File path layout

// src/agents/auth-profiles/paths.ts (referenced)

resolveAuthStorePath()           // ~/.openclaw/agents/main/agent/auth-profiles.json
resolveAuthStorePath(agentDir)   // ~/.openclaw/agents/{name}/agent/auth-profiles.json
resolveLegacyAuthStorePath()     // ~/.openclaw/agents/main/agent/auth.json (legacy)

Every agent instance has its own auth-profiles.json, so a sub-agent and the main agent can have different credential configurations.

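3.2 Three paths at load time

loadAuthProfileStore (store.ts L346-372) is the simplest loader. It takes no agent and is used only outside the runtime (for example by CLI configuration commands). Runtime code uses loadAuthProfileStoreForRuntime (L443-456):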

// src/agents/auth-profiles/store.ts L443-456

export function loadAuthProfileStoreForRuntime(
  agentDir?: string,
  options?: LoadAuthProfileStoreOptions,
): AuthProfileStore {
  const store = loadAuthProfileStoreForAgent(agentDir, options);
  const authPath = resolveAuthStorePath(agentDir);
  const mainAuthPath = resolveAuthStorePath();
  if (!agentDir || authPath === mainAuthPath) {
    return store;
  }

  // Sub-agent: the main agent's store is the base, the sub-agent's store is the override
  const mainStore = loadAuthProfileStoreForAgent(undefined, options);
  return mergeAuthProfileStores(mainStore, store);
}

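Merge strategy (mergeAuthProfileStores L258-277): entries in the sub-agent's profiles/order/lastGood/usageStats override the main agent's entries with the same key (the override wins). This means that if the sub-agent's auth-profiles.json is empty, it simply inherits all of the main agent's credentials.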
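A minimal sketch of a base-plus-override merge with these semantics is shown below. It assumes a key-by-key merge inside each field (which is what makes inheritance from an empty sub-agent store work); the types are simplified and the function names are illustrative, not the real mergeAuthProfileStores.

```typescript
// Simplified store shape for illustration; the real AuthProfileStore has richer types.
type StoreSketch = {
  version: number;
  profiles: Record<string, { type: string; provider: string }>;
  order?: Record<string, string[]>;
  lastGood?: Record<string, string>;
  usageStats?: Record<string, { lastUsed?: number }>;
};

// Shallow key-by-key merge: keys in the override replace keys in the base.
function mergeRecordSketch<T>(
  base?: Record<string, T>,
  override?: Record<string, T>,
): Record<string, T> | undefined {
  if (!base && !override) return undefined;
  return { ...(base ?? {}), ...(override ?? {}) };
}

function mergeStoresSketch(base: StoreSketch, override: StoreSketch): StoreSketch {
  return {
    version: Math.max(base.version, override.version), // assumption: keep the newer version
    profiles: { ...base.profiles, ...override.profiles },
    order: mergeRecordSketch(base.order, override.order),
    lastGood: mergeRecordSketch(base.lastGood, override.lastGood),
    usageStats: mergeRecordSketch(base.usageStats, override.usageStats),
  };
}
```

Under this sketch, a sub-agent store with an empty profiles object yields exactly the main agent's profiles, and a sub-agent entry with the same profileId replaces the main agent's entry.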

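3.3 Legacy migration: auth.json → auth-profiles.json

Before a certain version, OpenClaw used auth.json, a flat Record<string, AuthProfileCredential> with no version/profiles/usageStats fields. coerceLegacyStore (L166-186) recognizes this old format, and applyLegacyStore (L305-339) converts each key of the old format into a profileId of the form {provider}:default.

Once migration finishes, the old file is deleted (L426-438) so that the next startup does not migrate again and overwrite newer OAuth credentials (this bug was fixed in PR #368, which corresponds to issue #363).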
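The shape of the migration can be sketched as below. This is a simplified illustration, not the real coerceLegacyStore/applyLegacyStore: the detection rule (no top-level profiles field), the version value of 1, and falling back to the record key as the provider are all assumptions.

```typescript
// Simplified legacy shapes for illustration; field sets are reduced.
type LegacyCredentialSketch = { type: "api_key" | "token" | "oauth"; provider?: string; key?: string };
type LegacyStoreSketch = Record<string, LegacyCredentialSketch>;
type MigratedStoreSketch = {
  version: number;
  profiles: Record<string, LegacyCredentialSketch & { provider: string }>;
};

// Assumed detection rule: the new format always has a top-level "profiles" object.
function looksLikeLegacyStore(raw: unknown): raw is LegacyStoreSketch {
  if (!raw || typeof raw !== "object" || Array.isArray(raw)) return false;
  return !("profiles" in raw);
}

// Each legacy key becomes a "{provider}:default" profile ID.
function migrateLegacyStoreSketch(legacy: LegacyStoreSketch): MigratedStoreSketch {
  const profiles: MigratedStoreSketch["profiles"] = {};
  for (const [providerKey, cred] of Object.entries(legacy)) {
    const provider = cred.provider ?? providerKey; // assumption: the key names the provider
    profiles[`${provider}:default`] = { ...cred, provider };
  }
  return { version: 1, profiles }; // version value is an assumption
}
```

A legacy file such as { "anthropic": { "type": "api_key", "key": "sk-..." } } would, under these assumptions, become a store whose only profile is anthropic:default.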

3.4 Runtime snapshot: an in-process cache

// src/agents/auth-profiles/store.ts L21-74

const runtimeAuthStoreSnapshots = new Map<string, AuthProfileStore>();

export function replaceRuntimeAuthProfileStoreSnapshots(
  entries: Array<{ agentDir?: string; store: AuthProfileStore }>,
): void {
  runtimeAuthStoreSnapshots.clear();
  for (const entry of entries) {
    runtimeAuthStoreSnapshots.set(
      resolveRuntimeStoreKey(entry.agentDir),
      cloneAuthProfileStore(entry.store),
    );
  }
}

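At gateway startup, replaceRuntimeAuthProfileStoreSnapshots is called to write a snapshot of every agent's store into this in-process Map. After that, ensureAuthProfileStore (L462-482) reads from the Map first, which avoids re-reading the disk on every request.

Note: every read performs a structuredClone (L27-29), so each caller gets an independent copy and a state change made by one agent cannot leak into another agent's in-memory view.

3.5 Write lock: cross-process concurrency safety

All writes go through updateAuthProfileStoreWithLock (L80-99), which calls withFileLock (from src/infra/file-lock.js) underneath: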

// src/agents/auth-profiles/store.ts L80-99

export async function updateAuthProfileStoreWithLock(params: {
  agentDir?: string;
  updater: (store: AuthProfileStore) => boolean;
}): Promise<AuthProfileStore | null> {
  const authPath = resolveAuthStorePath(params.agentDir);
  ensureAuthStoreFile(authPath);

  try {
    return await withFileLock(authPath, AUTH_STORE_LOCK_OPTIONS, async () => {
      // Re-read from disk inside the lock so the updater sees the latest state
      const store = ensureAuthProfileStore(params.agentDir);
      const shouldSave = params.updater(store);
      if (shouldSave) {
        saveAuthProfileStore(store, params.agentDir);
      }
      return store;
    });
  } catch {
    return null;
  }
}

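Lock parameters (constants.ts L12-21): exponential backoff, at most 10 retries, a base timeout of 100 ms, a maximum timeout of 10 seconds, and stale locks released automatically after 30 seconds. These settings handle several sub-agents updating the same store concurrently (for example, when they finish requests at the same time and all call markAuthProfileUsed).

Two-path write strategy: the write path first attempts a locked write (re-read from disk → update → write back). If the lock fails (the catch returns null), the caller operates directly on its in-memory store and calls saveAuthProfileStore. This is a deliberate degradation: an occasional minor concurrent overwrite is preferable to losing cooldown state entirely because of lock contention.

4. Credential eligibility: the eligible decision chain

Before round-robin ordering begins, the system filters out profiles that are currently unusable. resolveAuthProfileEligibility (order.ts L30-65) does this: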

export function resolveAuthProfileEligibility(params: {
  cfg?: OpenClawConfig;
  store: AuthProfileStore;
  provider: string;
  profileId: string;
  now?: number;
}): AuthProfileEligibility {
  const providerAuthKey = normalizeProviderIdForAuth(params.provider);
  const cred = params.store.profiles[params.profileId];

  // 1. The profile does not exist
  if (!cred) return { eligible: false, reasonCode: "profile_missing" };

  // 2. Provider mismatch (normalizeProviderIdForAuth matches by equivalence class)
  if (normalizeProviderIdForAuth(cred.provider) !== providerAuthKey)
    return { eligible: false, reasonCode: "provider_mismatch" };

  // 3. openclaw.json has an explicit profileConfig whose mode is incompatible
  const profileConfig = params.cfg?.auth?.profiles?.[params.profileId];
  if (profileConfig) {
    if (normalizeProviderIdForAuth(profileConfig.provider) !== providerAuthKey)
      return { eligible: false, reasonCode: "provider_mismatch" };
    if (profileConfig.mode !== cred.type) {
      const oauthCompatible = profileConfig.mode === "oauth" && cred.type === "token";
      if (!oauthCompatible)
        return { eligible: false, reasonCode: "mode_mismatch" };
    }
  }

  // 4. Validity of the credential itself (is the key present, has the token expired, etc.)
  const credentialEligibility = evaluateStoredCredentialEligibility({ credential: cred, now: params.now });
  return { eligible: credentialEligibility.eligible, reasonCode: credentialEligibility.reasonCode };
}

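Note the oauth/token compatibility in step 3: when the config declares mode: "oauth" but the stored credential is type: "token" (both are bearer-token style), the profile is allowed through. This accommodates format differences in historical Claude CLI credential data.

evaluateStoredCredentialEligibility (credential-state.ts L34-74) checks whether a credential is actually present for each of the three types:

api_key: a key string or a keyRef reference exists
token: a token string or a tokenRef exists, and the token has not expired
oauth: an access or refresh field exists (expires is not checked here for OAuth, because resolveApiKeyForProfile refreshes it on demand)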
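The three presence checks above can be sketched as one function over a discriminated union. This is an illustration only: the SecretRef shape and the reason codes "ok", "missing_credential", and "expired" are placeholders chosen for this sketch, not the codes used by the real evaluateStoredCredentialEligibility.

```typescript
// Shapes simplified for illustration; SecretRef's real fields are not shown in this article.
type SecretRefSketch = { source: string; id: string };
type CredentialSketch =
  | { type: "api_key"; key?: string; keyRef?: SecretRefSketch }
  | { type: "token"; token?: string; tokenRef?: SecretRefSketch; expires?: number }
  | { type: "oauth"; access?: string; refresh?: string; expires?: number };

// Reason codes here are placeholders, not the real ones.
type EligibilitySketch = { eligible: boolean; reasonCode: "ok" | "missing_credential" | "expired" };

const hasText = (value: string | undefined): boolean =>
  typeof value === "string" && value.trim().length > 0;

function evaluateCredentialSketch(cred: CredentialSketch, now: number): EligibilitySketch {
  switch (cred.type) {
    case "api_key":
      return hasText(cred.key) || cred.keyRef !== undefined
        ? { eligible: true, reasonCode: "ok" }
        : { eligible: false, reasonCode: "missing_credential" };
    case "token": {
      if (!hasText(cred.token) && cred.tokenRef === undefined) {
        return { eligible: false, reasonCode: "missing_credential" };
      }
      // Only tokens are rejected for expiry at this stage.
      if (typeof cred.expires === "number" && cred.expires <= now) {
        return { eligible: false, reasonCode: "expired" };
      }
      return { eligible: true, reasonCode: "ok" };
    }
    case "oauth":
      // expires is deliberately ignored: an expired OAuth credential can still be refreshed later.
      return hasText(cred.access) || hasText(cred.refresh)
        ? { eligible: true, reasonCode: "ok" }
        : { eligible: false, reasonCode: "missing_credential" };
  }
}
```

The asymmetry is the point of the sketch: an expired token is ineligible, while an expired OAuth credential that still has a refresh token stays eligible.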

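5. The round-robin ordering algorithm: resolveAuthProfileOrder

This is the central scheduling function of the whole Auth Profile system, located at order.ts L67-160. It takes a provider name and returns an ordered list of profileIds for the caller to try from first to last.

5.1 Four sources of ordering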

// src/agents/auth-profiles/order.ts L82-95

const storedOrder = findNormalizedProviderValue(store.order, providerKey);
const configuredOrder = findNormalizedProviderValue(cfg?.auth?.order, providerKey);
const explicitOrder = storedOrder ?? configuredOrder;
const explicitProfiles = cfg?.auth?.profiles ? Object.entries(cfg.auth.profiles)
  .filter(([, profile]) => normalizeProviderIdForAuth(profile.provider) === providerAuthKey)
  .map(([profileId]) => profileId)
  : [];
const baseOrder =
  explicitOrder ??
  (explicitProfiles.length > 0 ? explicitProfiles : listProfilesForProvider(store, provider));

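Priority from highest to lowest:

Store-level order (store.order[provider]): set at runtime through setAuthProfileOrder; highest priority
Config-level order (auth.order[provider] in openclaw.json): the explicit ordering from the user's config file
Profiles declared in config (auth.profiles in openclaw.json): filtered by provider and used as the candidate pool
All profiles in the store: the fallback, scanned by provider from auth-profiles.json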
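The four-level priority can be restated as a small pure function. The sketch below mirrors the ?? chain from the excerpt (so an explicitly configured empty array still counts as "set"); the input shape and the source labels are made up for illustration.

```typescript
// Hypothetical input shape: each field stands for one of the four sources listed above.
type OrderInputsSketch = {
  storeOrder?: string[];    // store.order[provider]
  configOrder?: string[];   // cfg.auth.order[provider]
  configProfiles: string[]; // IDs declared in cfg.auth.profiles for this provider
  storeProfiles: string[];  // every profile ID in the store for this provider
};

function resolveBaseOrderSketch(input: OrderInputsSketch): { source: string; order: string[] } {
  // Same semantics as `storedOrder ?? configuredOrder`: only undefined falls through.
  if (input.storeOrder !== undefined) return { source: "store.order", order: input.storeOrder };
  if (input.configOrder !== undefined) return { source: "config.order", order: input.configOrder };
  if (input.configProfiles.length > 0) return { source: "config.profiles", order: input.configProfiles };
  return { source: "store.profiles", order: input.storeProfiles };
}
```

For example, when both a store-level and a config-level order exist, the store-level one is returned and the config-level one is ignored.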

5.2 Eligibility filtering and profile ID drift repair

// src/agents/auth-profiles/order.ts L97-114

const isValidProfile = (profileId: string): boolean =>
  resolveAuthProfileEligibility({ cfg, store, provider: providerAuthKey, profileId, now }).eligible;
let filtered = baseOrder.filter(isValidProfile);

// When the profileIds configured in config cannot be found in the store, fall back to valid credentials that do exist in the store
const allBaseProfilesMissing = baseOrder.every((profileId) => !store.profiles[profileId]);
if (filtered.length === 0 && explicitProfiles.length > 0 && allBaseProfilesMissing) {
  const storeProfiles = listProfilesForProvider(store, provider);
  filtered = storeProfiles.filter(isValidProfile);
}

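This code handles a common user predicament: the user configured auth.profiles.anthropic:default in openclaw.json, but after an OAuth login the new profile ID became anthropic:user@email.com, so the profileId in config can no longer be found in the store. In that situation the system scans the store for valid credentials automatically instead of exiting with an error.

5.3 Explicit order mode: keep the user's intent, but respect cooldowns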

// src/agents/auth-profiles/order.ts L121-148

if (explicitOrder && explicitOrder.length > 0) {
  const available: string[] = [];
  const inCooldown: Array<{ profileId: string; cooldownUntil: number }> = [];

  for (const profileId of deduped) {
    if (isProfileInCooldown(store, profileId)) {
      const cooldownUntil = resolveProfileUnusableUntil(store.usageStats?.[profileId] ?? {}) ?? now;
      inCooldown.push({ profileId, cooldownUntil });
    } else {
      available.push(profileId);
    }
  }

  // Profiles in cooldown are appended to the end, sorted by expiry ascending (soonest to recover first)
  const cooldownSorted = inCooldown
    .toSorted((a, b) => a.cooldownUntil - b.cooldownUntil)
    .map((entry) => entry.profileId);

  return [...available, ...cooldownSorted];
}

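When the user specifies an explicit order, the system honors it, but not blindly. Profiles in cooldown are moved to the end of the list and sorted by cooldown expiry in ascending order (the one that recovers soonest comes first), so that when every profile is cooling down, the first one the caller gets is still the one that becomes usable soonest.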
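A self-contained sketch of this partition-then-append step follows. It takes the cooldown deadlines as a plain map instead of reading usageStats, and the function name is illustrative.

```typescript
// profileId -> unusable-until timestamp (ms epoch); undefined means no cooldown recorded.
type CooldownMapSketch = Record<string, number | undefined>;

function applyCooldownToExplicitOrder(
  order: string[],
  unusableUntil: CooldownMapSketch,
  now: number,
): string[] {
  const available: string[] = [];
  const cooling: Array<{ profileId: string; until: number }> = [];
  for (const profileId of order) {
    const until = unusableUntil[profileId];
    if (typeof until === "number" && until > now) {
      cooling.push({ profileId, until });
    } else {
      available.push(profileId); // keeps the user's relative order
    }
  }
  // Soonest-to-recover first among the cooling profiles.
  cooling.sort((a, b) => a.until - b.until);
  return [...available, ...cooling.map((entry) => entry.profileId)];
}
```

With order [a, b, c], a cooling for 10 more minutes and c for 1 more minute, the result is [b, c, a]: the user's order is kept for available profiles, and cooling ones follow by recovery time.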

5.4 Automatic round-robin mode: type first, then least recently used

When there is no explicit order, control passes to orderProfilesByMode (order.ts L162-208):

// src/agents/auth-profiles/order.ts L177-208

const scored = available.map((profileId) => {
  const type = store.profiles[profileId]?.type;
  const typeScore = type === "oauth" ? 0 : type === "token" ? 1 : type === "api_key" ? 2 : 3;
  const lastUsed = store.usageStats?.[profileId]?.lastUsed ?? 0;
  return { profileId, typeScore, lastUsed };
});

// Primary sort: type priority (oauth > token > api_key)
// Secondary sort: oldest lastUsed first (this yields round-robin)
const sorted = scored
  .toSorted((a, b) => {
    if (a.typeScore !== b.typeScore) return a.typeScore - b.typeScore;
    return a.lastUsed - b.lastUsed;  // oldest first
  })
  .map((entry) => entry.profileId);

// Profiles in cooldown are appended to the end
const cooldownSorted = inCooldown
  .map((profileId) => ({
    profileId,
    cooldownUntil: resolveProfileUnusableUntil(store.usageStats?.[profileId] ?? {}) ?? now,
  }))
  .toSorted((a, b) => a.cooldownUntil - b.cooldownUntil)
  .map((entry) => entry.profileId);

return [...sorted, ...cooldownSorted];

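This ordering achieves two goals:

Type priority: OAuth credentials always come before tokens, and tokens before API keys. This reflects a practical judgment: OAuth credentials usually correspond to a "free tier" or an "enterprise account" with a higher permission ceiling, whereas API keys hit rate limits more easily
Oldest first (round-robin): among profiles of the same type, the one with the smallest lastUsed (unused for the longest) comes first, which spreads requests evenly. Note that a code comment states explicitly "lastGood is NOT prioritized": the last profile that succeeded gets no priority for that, because it would defeat the rotation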
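To see why "oldest lastUsed first" produces round-robin behavior, the sketch below reimplements the two-level sort and then simulates a few requests, stamping lastUsed after each pick. The simulation helper and its logical clock are illustrative additions, not part of the source.

```typescript
type ProfileTypeSketch = "oauth" | "token" | "api_key";
type CandidateSketch = { profileId: string; type: ProfileTypeSketch; lastUsed?: number };

const TYPE_SCORE: Record<ProfileTypeSketch, number> = { oauth: 0, token: 1, api_key: 2 };

// Primary key: credential type; secondary key: least recently used first.
function orderByTypeThenOldest(candidates: CandidateSketch[]): string[] {
  return [...candidates]
    .sort((a, b) => {
      const byType = TYPE_SCORE[a.type] - TYPE_SCORE[b.type];
      if (byType !== 0) return byType;
      return (a.lastUsed ?? 0) - (b.lastUsed ?? 0);
    })
    .map((c) => c.profileId);
}

// Simulates N requests: always use the first profile, then stamp its lastUsed.
function simulateRoundRobin(candidates: CandidateSketch[], requests: number): string[] {
  const state = candidates.map((c) => ({ ...c }));
  const picks: string[] = [];
  for (let tick = 1; tick <= requests; tick++) {
    const first = orderByTypeThenOldest(state)[0];
    if (first === undefined) break;
    picks.push(first);
    const used = state.find((c) => c.profileId === first);
    if (used) used.lastUsed = tick; // a logical clock stands in for Date.now()
  }
  return picks;
}
```

Three never-used API keys k1, k2, k3 are picked in the sequence k1, k2, k3, k1 over four requests. If an OAuth profile is added to the pool, it is picked every time regardless of lastUsed, which is the type-priority rule at work.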

6. The cooldown isolation algorithm: two exponential backoff strategies

6.1 Classifying failure reasons

// src/agents/auth-profiles/types.ts L38-48

export type AuthProfileFailureReason =
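  | "auth"           // 401/403, temporary authentication failure
  | "auth_permanent" // permanent authentication failure (account banned, etc.)
  | "format"         // malformed response
  | "overloaded"     // service overloaded (503)
  | "rate_limit"     // rate limited (429)
  | "billing"        // billing problem (payment required / subscription expired)
  | "timeout"        // request timed out
  | "model_not_found"// model does not exist
  | "session_expired"// session expired
  | "unknown";       // unclassified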

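These codes are produced by resolveFailoverReasonFromError in src/agents/failover-error.ts, which classifies the error, and are written into usageStats through markAuthProfileFailure.

6.2 Short-term cooldown: exponential backoff capped at 1 hour

Every failure other than billing/auth_permanent takes the cooldownUntil path: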

// src/agents/auth-profiles/usage.ts L276-282

export function calculateAuthProfileCooldownMs(errorCount: number): number {
  const normalized = Math.max(1, errorCount);
  return Math.min(
    60 * 60 * 1000,         // 1-hour cap
    60 * 1000 * 5 ** Math.min(normalized - 1, 3),
  );
}

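Cooldown schedule (computed value in seconds, effective cooldown in minutes):

errorCount	Computed value	Effective cooldown
1	60 * 5^0 = 60s	1 minute
2	60 * 5^1 = 300s	5 minutes
3	60 * 5^2 = 1500s	25 minutes
≥4	60 * 5^3 = 7500s (capped at 3600s)	60 minutes (cap)

Exponential backoff with base 5 is more aggressive than the usual doubling, and this is deliberate: for a rate-limited key, backing off quickly gives a better user experience.

6.3 Long-term disable: the separate path for billing and auth_permanent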

// src/agents/auth-profiles/usage.ts L430-444

if (params.reason === "billing" || params.reason === "auth_permanent") {
  const billingCount = failureCounts[params.reason] ?? 1;
  const backoffMs = calculateAuthProfileBillingDisableMsWithConfig({
    errorCount: billingCount,
    baseMs: params.cfgResolved.billingBackoffMs,  // default: 5 hours
    maxMs: params.cfgResolved.billingMaxMs,       // default: 24 hours
  });
  // Update only when there is no active disable window (so repeated failures do not keep pushing the end time out)
  updatedStats.disabledUntil = keepActiveWindowOrRecompute({
    existingUntil: params.existing.disabledUntil,
    now: params.now,
    recomputedUntil: params.now + backoffMs,
  });
  updatedStats.disabledReason = params.reason;
}

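Computing the disable duration (calculateAuthProfileBillingDisableMsWithConfig L334-345):

function calculateAuthProfileBillingDisableMsWithConfig(params: {
  errorCount: number;
  baseMs: number;  // default 5h = 18,000,000ms
  maxMs: number;   // default 24h = 86,400,000ms
}): number {
  const { errorCount, baseMs, maxMs } = params;
  // Treat a count below 1 as the first failure (mirrors calculateAuthProfileCooldownMs)
  const normalized = Math.max(1, errorCount);
  const exponent = Math.min(normalized - 1, 10);
  const raw = baseMs * 2 ** exponent;  // exponential backoff, doubling each time
  return Math.min(maxMs, raw);
}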

Disable schedule (default configuration):

Failures	Disable duration
1	5 hours
2	10 hours
3	20 hours
≥4	24 hours (cap)

Users can customize these parameters through auth.cooldowns in openclaw.json:

{
  "auth": {
    "cooldowns": {
      "billingBackoffHours": 3,
      "billingMaxHours": 12,
      "failureWindowHours": 48,
      "billingBackoffHoursByProvider": {
        "anthropic": 8
      }
    }
  }
}

billingBackoffHoursByProvider allows a per-provider setting (usage.ts L304-315), matched after normalization with normalizeProviderId.
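As a worked example of how these settings might combine, the sketch below resolves the config into milliseconds and computes the disable duration for a given failure count. It is written under assumptions: the per-provider value is assumed to take precedence over the global billingBackoffHours, the defaults (5 h, 24 h, 24 h) are the ones quoted in this article, and normalizeSketch is a simple stand-in for normalizeProviderId.

```typescript
type CooldownConfigSketch = {
  billingBackoffHours?: number;
  billingMaxHours?: number;
  failureWindowHours?: number;
  billingBackoffHoursByProvider?: Record<string, number>;
};

const HOUR_MS = 60 * 60 * 1000;

// Stand-in for normalizeProviderId; the real normalization rules are not shown here.
const normalizeSketch = (provider: string): string => provider.trim().toLowerCase();

function resolveCooldownConfigSketch(cfg: CooldownConfigSketch, provider: string) {
  const wanted = normalizeSketch(provider);
  const override = Object.entries(cfg.billingBackoffHoursByProvider ?? {}).find(
    ([key]) => normalizeSketch(key) === wanted,
  )?.[1];
  // Assumption: a per-provider value beats the global one, which beats the default.
  const backoffHours = override ?? cfg.billingBackoffHours ?? 5;
  return {
    billingBackoffMs: backoffHours * HOUR_MS,
    billingMaxMs: (cfg.billingMaxHours ?? 24) * HOUR_MS,
    failureWindowMs: (cfg.failureWindowHours ?? 24) * HOUR_MS,
  };
}

// Disable duration in hours after the given number of billing failures.
function billingDisableHours(cfg: CooldownConfigSketch, provider: string, failures: number): number {
  const { billingBackoffMs, billingMaxMs } = resolveCooldownConfigSketch(cfg, provider);
  const exponent = Math.min(Math.max(1, failures) - 1, 10);
  return Math.min(billingMaxMs, billingBackoffMs * 2 ** exponent) / HOUR_MS;
}
```

With the sample config above (3 h base, 12 h cap, 8 h for anthropic), this sketch gives 8 h then 12 h for anthropic, and 3 h, 6 h, 12 h for any other provider.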

6.4 keepActiveWindowOrRecompute: preventing a cooldown window from being extended

// src/agents/auth-profiles/usage.ts L385-394

function keepActiveWindowOrRecompute(params: {
  existingUntil: number | undefined;
  now: number;
  recomputedUntil: number;
}): number {
  const { existingUntil, now, recomputedUntil } = params;
  const hasActiveWindow =
    typeof existingUntil === "number" && Number.isFinite(existingUntil) && existingUntil > now;
  return hasActiveWindow ? existingUntil : recomputedUntil;
}

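This small function solves a subtle problem. Suppose a profile is in the middle of a 5-hour disable window (4 hours left) and another billing error arrives (say a retry was rejected). If disabledUntil were simply set to now + 5h, the window would be stretched from 4 remaining hours to 5. keepActiveWindowOrRecompute makes sure the timestamp is updated only when no disable window is currently active; once a window is active it is left untouched.

6.5 Failure window decay: automatic reset of errorCount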

// src/agents/auth-profiles/usage.ts L396-421

function computeNextProfileUsageStats(params: { ... }) {
  const windowMs = params.cfgResolved.failureWindowMs;  // default 24h
  const windowExpired =
    typeof params.existing.lastFailureAt === "number" &&
    params.existing.lastFailureAt > 0 &&
    params.now - params.existing.lastFailureAt > windowMs;

  // If the previous cooldown has already expired, reset errorCount to zero (the circuit breaker's half-open → closed logic)
  const unusableUntil = resolveProfileUnusableUntil(params.existing);
  const previousCooldownExpired = typeof unusableUntil === "number" && params.now >= unusableUntil;

  const shouldResetCounters = windowExpired || previousCooldownExpired;
  const baseErrorCount = shouldResetCounters ? 0 : (params.existing.errorCount ?? 0);
  const nextErrorCount = baseErrorCount + 1;
  // ...
}

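This is the half-open → closed logic of a circuit breaker:

Failure window expired (no new failure within 24 hours) → errorCount is reset to zero, the next failure counts as the first, and the backoff restarts at 1 minute
Cooldown already expired (cooldownUntil/disabledUntil is in the past) → reset in the same way

Without this reset, a key that was rate limited once and then recovered would, if rate limited again a few days later, jump straight to a 25-minute or 1-hour cooldown (because errorCount still held its old value). That was the root cause of the "profile appears stuck" behavior reported in issue #3604.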
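The effect of the reset can be traced with a small simulation. The sketch below combines the reset rule with the short-term cooldown formula from section 6.2. It is a simplification: it handles only the transient cooldown path, always recomputes cooldownUntil, and omits the keepActiveWindowOrRecompute guard and per-reason counts.

```typescript
type FailureStatsSketch = {
  errorCount?: number;
  lastFailureAt?: number;
  cooldownUntil?: number;
  disabledUntil?: number;
};

const DAY_MS = 24 * 60 * 60 * 1000;

// Same formula as calculateAuthProfileCooldownMs quoted in section 6.2.
function cooldownMsSketch(errorCount: number): number {
  return Math.min(60 * 60 * 1000, 60 * 1000 * 5 ** Math.min(Math.max(1, errorCount) - 1, 3));
}

// Records one transient failure and returns the next stats (simplified).
function recordFailureSketch(
  existing: FailureStatsSketch,
  now: number,
  windowMs: number = DAY_MS,
): FailureStatsSketch {
  const windowExpired =
    typeof existing.lastFailureAt === "number" &&
    existing.lastFailureAt > 0 &&
    now - existing.lastFailureAt > windowMs;
  const unusableUntil = Math.max(existing.cooldownUntil ?? 0, existing.disabledUntil ?? 0);
  const previousCooldownExpired = unusableUntil > 0 && now >= unusableUntil;
  // Either condition resets the counter, so the next failure counts as the first.
  const base = windowExpired || previousCooldownExpired ? 0 : (existing.errorCount ?? 0);
  const errorCount = base + 1;
  return { ...existing, errorCount, lastFailureAt: now, cooldownUntil: now + cooldownMsSketch(errorCount) };
}
```

A second failure that arrives while the first cooldown is still active escalates to a 5-minute cooldown, whereas a failure three days later starts again at errorCount = 1 with a 1-minute cooldown.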

6.6 clearExpiredCooldowns: proactive cleanup before ordering

// src/agents/auth-profiles/usage.ts L187-238

export function clearExpiredCooldowns(store: AuthProfileStore, now?: number): boolean {
  // ...
  for (const [profileId, stats] of Object.entries(usageStats)) {
    const cooldownExpired = /* cooldownUntil has passed */;
    const disabledExpired = /* disabledUntil has passed */;

    if (cooldownExpired) { stats.cooldownUntil = undefined; profileMutated = true; }
    if (disabledExpired) {
      stats.disabledUntil = undefined;
      stats.disabledReason = undefined;
      profileMutated = true;
    }

    // Once every cooldown has been cleared, reset errorCount and failureCounts
    if (profileMutated && !resolveProfileUnusableUntil(stats)) {
      stats.errorCount = 0;
      stats.failureCounts = undefined;
    }
  }
  // ...
}

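resolveAuthProfileOrder calls clearExpiredCooldowns first on every invocation (order.ts L81). This guarantees that ordering uses up-to-date cooldown state, so a profile that has already recovered is not still placed at the end because the in-memory snapshot is stale.

One important detail: clearExpiredCooldowns modifies only the in-memory store and does not persist to disk directly; "disk persistence happens lazily on the next store write" (quoted from the code comment). The next markAuthProfileUsed or markAuthProfileFailure write persists these clears along the way.

6.7 Exemption for two special providers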

// src/agents/auth-profiles/usage.ts L23-26

function isAuthCooldownBypassedForProvider(provider: string | undefined): boolean {
  const normalized = normalizeProviderId(provider ?? "");
  return normalized === "openrouter" || normalized === "kilocode";
}

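Profiles for OpenRouter and Kilocode never enter cooldown. Both are aggregating routing services that handle downstream model failures and retries themselves; adding another cooldown layer in OpenClaw would cause double backoff and reduce availability.

7. OAuth token refresh: locked asynchronous updates

7.1 When a refresh is triggered

resolveApiKeyForProfile (oauth.ts L309-491) is the entry point where a credential is finally resolved; it is called from resolveApiKeyForProvider in model-auth.ts. For a type: "oauth" credential:

First check cred.expires > Date.now(); if the token has not expired, return access directly as the API key
If it has expired, call refreshOAuthTokenWithLock to refresh it

7.2 refreshOAuthTokenWithLock: file lock plus a second check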

// src/agents/auth-profiles/oauth.ts L158-215

async function refreshOAuthTokenWithLock(params: {
  profileId: string;
  agentDir?: string;
}): Promise<{ apiKey: string; newCredentials: OAuthCredentials } | null> {
  const authPath = resolveAuthStorePath(params.agentDir);
  ensureAuthStoreFile(authPath);

  return await withFileLock(authPath, AUTH_STORE_LOCK_OPTIONS, async () => {
    const store = ensureAuthProfileStore(params.agentDir);
    const cred = store.profiles[params.profileId];
    if (!cred || cred.type !== "oauth") return null;

    // Second check: another process may have refreshed already
    if (Date.now() < cred.expires) {
      return { apiKey: buildOAuthApiKey(cred.provider, cred), newCredentials: cred };
    }

    // Dispatch to the provider-specific refresh logic
    // (oauthCreds is not defined in this excerpt; its definition is elided)
    const result =
      String(cred.provider) === "chutes"
        ? await refreshChutesTokens({ credential: cred })
        : String(cred.provider) === "qwen-portal"
          ? await refreshQwenPortalCredentials(cred)
          : await getOAuthApiKey(resolveOAuthProvider(cred.provider), oauthCreds);

    if (!result) return null;
    // Update the store and write it to disk
    store.profiles[params.profileId] = { ...cred, ...result.newCredentials, type: "oauth" };
    saveAuthProfileStore(store, params.agentDir);
    return result;
  });
}

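The second check (double-checked locking) is the core of this design. When several concurrent requests all notice that the token has expired and try to refresh it, only the first process to acquire the file lock actually sends the HTTP refresh request. The others acquire the lock afterwards, find that cred.expires > Date.now() (the first one has already refreshed), and return the new token directly without refreshing again.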
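The value of the second check is easiest to see in a deterministic simulation. The sketch below models only the code that runs while the lock is held, and plays out two callers that both saw an expired token before queuing for the lock. The file lock itself, the shared store, and the remote refresh are all simulated; none of the names are from the OpenClaw source.

```typescript
type OAuthCredSketch = { access: string; expires: number };
type CredStoreSketch = { cred: OAuthCredSketch };

// Models the body that runs while the (simulated) file lock is held.
function refreshUnderLockSketch(
  store: CredStoreSketch,
  now: number,
  remoteRefresh: (now: number) => OAuthCredSketch,
): string {
  // Second check: a previous lock holder may already have refreshed the token.
  if (now < store.cred.expires) return store.cred.access;
  store.cred = remoteRefresh(now);
  return store.cred.access;
}

// Two callers both observed an expired token, then take the lock one after the other.
let refreshCalls = 0;
const sharedStore: CredStoreSketch = { cred: { access: "old-token", expires: 1_000 } };
const fakeRemoteRefresh = (now: number): OAuthCredSketch => {
  refreshCalls += 1;
  return { access: `token-${refreshCalls}`, expires: now + 3_600_000 };
};
const tokenA = refreshUnderLockSketch(sharedStore, 2_000, fakeRemoteRefresh);
const tokenB = refreshUnderLockSketch(sharedStore, 2_000, fakeRemoteRefresh);
```

Both callers end up with the same fresh token, and the simulated remote endpoint is hit only once. Without the second check, the second caller would send another refresh request, which can invalidate the first caller's token with providers that rotate refresh tokens.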

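7.3 Provider-specific dispatch

The provider dispatch inside refreshOAuthTokenWithLock:

chutes: calls refreshChutesTokens (custom refresh logic)
qwen-portal: calls refreshQwenPortalCredentials (refresh specific to the Qwen (Tongyi Qianwen) Portal)
Everything else: calls getOAuthApiKey (from @mariozechner/pi-ai/oauth, which supports standard OAuth providers such as Anthropic and OpenAI Codex)

7.4 Cross-agent OAuth inheritance: adoptNewerMainOAuthCredential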

// src/agents/auth-profiles/oauth.ts L121-156

function adoptNewerMainOAuthCredential(params: { store, profileId, agentDir, cred }) {
  if (!params.agentDir) return null;
  try {
    const mainStore = ensureAuthProfileStore(undefined);  // main agent's store
    const mainCred = mainStore.profiles[params.profileId];
    if (
      mainCred?.type === "oauth" &&
      mainCred.provider === params.cred.provider &&
      Number.isFinite(mainCred.expires) &&
      (!Number.isFinite(params.cred.expires) || mainCred.expires > params.cred.expires)
    ) {
      // The main agent has a newer credential: borrow it and save it into the sub-agent's store
      params.store.profiles[params.profileId] = { ...mainCred };
      saveAuthProfileStore(params.store, params.agentDir);
      return mainCred;
    }
  } catch { /* best-effort */ }
  return null;
}

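This fixes OAuth tokens drifting out of sync between a sub-agent and the main agent: the main agent completes a token refresh while the sub-agent's copy of the store is still stale. Each time resolveApiKeyForProfile runs for a sub-agent, it first checks whether the main agent holds a newer token that can be borrowed directly, which spares the sub-agent an unnecessary duplicate refresh request.

7.5 Layered fallbacks when refresh fails

When refreshOAuthTokenWithLock fails, resolveApiKeyForProfile (L389-491) tries the following in order:

Reload the store: re-read the disk after the failure, since another process may already have refreshed the token
Legacy profile ID repair: suggestOAuthProfileIdForLegacyDefault infers the equivalent new profileId (the email-based ID), and resolution is retried with it
Fresh credentials from the main agent: for a sub-agent, take fresh OAuth credentials directly from the main agent's store and copy them over
OpenAI Codex special fallback: if the error message matches the "extract accountid from token" pattern, continue in degraded mode with the existing access token (shouldUseOpenaiCodexRefreshFallback L95-110)
Complete failure: throw a friendly error message that includes the formatAuthDoctorHint output

8. Session-level profile binding: resolveSessionAuthProfileOverride

The resolveAuthProfileOrder described above returns a global ordering, but a real conversation also needs a binding that records which profile the current session is using, so that a long conversation does not switch profiles frequently (which would hurt context consistency). session-override.ts implements this.

8.1 The three possible sources of a session override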

// src/agents/auth-profiles/session-override.ts L110-118

const source =
  sessionEntry.authProfileOverrideSource ??
  (typeof sessionEntry.authProfileOverrideCompactionCount === "number"
    ? "auto"
    : current
      ? "user"
      : undefined);

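user: the user chose a profile manually with the /profile command. This binding stays in effect (it does not rotate on compaction) until the user clears it
auto: assigned by the system, and may rotate automatically under the conditions below
undefined: no override has been recorded for the session yet

8.2 Conditions that trigger automatic rotation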

// src/agents/auth-profiles/session-override.ts L121-128

let next = current;
if (isNewSession) {
  // New session: pick the next available profile (rotating forward from the previous session's profile)
  next = current ? pickNextAvailable(current) : pickFirstAvailable();
} else if (current && compactionCount > storedCompaction) {
  // Switch after a compaction (rotate the profile once each time the conversation history is compacted)
  next = pickNextAvailable(current);
} else if (!current || isProfileInCooldown(store, current)) {
  // The current profile has entered cooldown: switch immediately
  next = pickFirstAvailable();
}

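Three switching scenarios:

New session: when a new conversation starts, pick the next available profile after the current one. Together with the oldest-first rule in resolveAuthProfileOrder, this spreads load evenly across sessions
Compaction: each time the context is compacted (the conversation history exceeds the token threshold and is compressed), switch profiles; this provides a natural rotation point
Current profile in cooldown: if the bound profile is marked as cooling down during the conversation (for example, it just received a 429), switch to the first available one immediately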
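The three scenarios can be sketched as a pure decision function. The article does not show pickNextAvailable or pickFirstAvailable, so the versions below are assumptions: "next" is taken to mean the next non-cooling profile after the current one in the order, wrapping around, and "first" falls back to the head of the list when everything is cooling.

```typescript
type InCooldownFn = (profileId: string) => boolean;

// Assumed behavior: first non-cooling profile, else the head of the list.
function pickFirstAvailableSketch(order: string[], inCooldown: InCooldownFn): string | undefined {
  return order.find((id) => !inCooldown(id)) ?? order[0];
}

// Assumed behavior: next non-cooling profile after `current`, wrapping around.
function pickNextAvailableSketch(order: string[], current: string, inCooldown: InCooldownFn): string | undefined {
  const start = order.indexOf(current);
  if (start < 0) return pickFirstAvailableSketch(order, inCooldown);
  for (let step = 1; step <= order.length; step++) {
    const candidate = order[(start + step) % order.length];
    if (candidate !== undefined && !inCooldown(candidate)) return candidate;
  }
  return current; // everything is cooling: stay put
}

type RotationInputSketch = {
  order: string[];
  current?: string;
  isNewSession: boolean;
  compactionCount: number;
  storedCompaction: number;
  inCooldown: InCooldownFn;
};

// Mirrors the if / else-if chain from the excerpt above.
function resolveNextProfileSketch(input: RotationInputSketch): string | undefined {
  const { order, current, inCooldown } = input;
  if (input.isNewSession) {
    return current
      ? pickNextAvailableSketch(order, current, inCooldown)
      : pickFirstAvailableSketch(order, inCooldown);
  }
  if (current && input.compactionCount > input.storedCompaction) {
    return pickNextAvailableSketch(order, current, inCooldown);
  }
  if (!current || inCooldown(current)) {
    return pickFirstAvailableSketch(order, inCooldown);
  }
  return current;
}
```

With order [a, b, c] and nothing cooling, a new session that previously used a moves to b, a compaction moves a to b, and an ordinary turn stays on a.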

8.3 Persisting the override

// src/agents/auth-profiles/session-override.ts L133-148

const shouldPersist =
  next !== sessionEntry.authProfileOverride ||
  sessionEntry.authProfileOverrideSource !== "auto" ||
  sessionEntry.authProfileOverrideCompactionCount !== compactionCount;
if (shouldPersist) {
  sessionEntry.authProfileOverride = next;
  sessionEntry.authProfileOverrideSource = "auto";
  sessionEntry.authProfileOverrideCompactionCount = compactionCount;
  sessionEntry.updatedAt = Date.now();
  sessionStore[sessionKey] = sessionEntry;
  if (storePath) {
    await updateSessionStore(storePath, (store) => { store[sessionKey] = sessionEntry; });
  }
}

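The override and compactionCount are stored together in the session store (sessions.json), so even after a process restart a resumed conversation knows which profile the session used last and how many compactions have occurred.

9. External CLI credential sync: syncExternalCliCredentials

OpenClaw is not the only tool that holds credentials for these providers. Other CLI tools (Qwen Code CLI, MiniMax CLI, Claude CLI, OpenAI Codex CLI) can log in independently and store their own credentials. external-cli-sync.ts syncs those external credentials into the store each time it is loaded.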

// src/agents/auth-profiles/external-cli-sync.ts L89-135

export function syncExternalCliCredentials(store: AuthProfileStore): boolean {
  let mutated = false;
  const now = Date.now();

  // Sync Qwen Code CLI credentials into qwen-portal:qwen-cli
  const shouldSyncQwen = !existingQwen || existingQwen.provider !== "qwen-portal" ||
    !isExternalProfileFresh(existingQwen, now);
  const qwenCreds = shouldSyncQwen
    ? readQwenCliCredentialsCached({ ttlMs: EXTERNAL_CLI_SYNC_TTL_MS })
    : null;
  if (qwenCreds) { /* update the store */ mutated = true; }

  // Sync MiniMax CLI credentials into minimax-portal:minimax-cli
  if (syncExternalCliCredentialsForProvider(store, MINIMAX_CLI_PROFILE_ID, "minimax-portal", ...)) {
    mutated = true;
  }

  return mutated;
}

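Sync policy (isExternalProfileFresh L33-47): if the existing credential still has more than 10 minutes of validity left (EXTERNAL_CLI_NEAR_EXPIRY_MS = 10 * 60 * 1000), it is not re-read from the external CLI, which avoids frequent file system access. The TTL is 15 minutes (EXTERNAL_CLI_SYNC_TTL_MS = 15 * 60 * 1000).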
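The freshness test described above comes down to one comparison against a near-expiry margin. The sketch below is an approximation of isExternalProfileFresh, not a copy: the handling of credentials without an expires value (treated here as not fresh) and the restriction to oauth/token types are assumptions.

```typescript
const NEAR_EXPIRY_MS_SKETCH = 10 * 60 * 1000; // mirrors EXTERNAL_CLI_NEAR_EXPIRY_MS

type ExternalCredSketch = { type: "oauth" | "token" | "api_key"; expires?: number };

// True when the stored credential is still comfortably valid, so no re-read is needed.
function isExternalProfileFreshSketch(cred: ExternalCredSketch | undefined, now: number): boolean {
  if (!cred) return false;
  if (cred.type !== "oauth" && cred.type !== "token") return false; // assumption
  if (typeof cred.expires !== "number" || !Number.isFinite(cred.expires)) return false; // assumption
  return cred.expires > now + NEAR_EXPIRY_MS_SKETCH;
}
```

A credential that expires in 30 minutes counts as fresh, while one that expires in 5 minutes triggers a re-read from the external CLI.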

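The sync logic for the Claude CLI (anthropic:claude-cli) and the OpenAI Codex CLI (openai-codex:codex-cli) does not appear in the external-cli-sync.ts excerpt above. They are presumably handled by their own reader functions, analogous to readQwenCliCredentialsCached and readMiniMaxCliCredentialsCached; their profile ID constants are defined in constants.ts L7-10.

10. resolveProfilesUnavailableReason: diagnosing why every profile is unavailable

When every candidate profile for a provider is in cooldown or otherwise unusable, the system needs to give the user a meaningful error message. resolveProfilesUnavailableReason (usage.ts L70-141) infers the most likely cause through a voting mechanism: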

// src/agents/auth-profiles/usage.ts L70-141

export function resolveProfilesUnavailableReason(params: {
  store: AuthProfileStore;
  profileIds: string[];
  now?: number;
}): AuthProfileFailureReason | null {
  const scores = new Map<AuthProfileFailureReason, number>();

  for (const profileId of params.profileIds) {
    const stats = params.store.usageStats?.[profileId];

    const disabledActive = isActiveUnusableWindow(stats.disabledUntil, now);
    if (disabledActive && stats.disabledReason) {
      // disabledReason carries a weight of 1000 (high confidence)
      addScore(stats.disabledReason, 1_000);
      continue;
    }

    const cooldownActive = isActiveUnusableWindow(stats.cooldownUntil, now);
    if (!cooldownActive) continue;

    // failureCounts votes (low confidence)
    for (const [reason, count] of Object.entries(stats.failureCounts ?? {})) {
      addScore(reason as AuthProfileFailureReason, count);
    }
  }

  // Pick the best by score (higher wins), then by reason priority (lower index wins)
  // ...
}

FAILURE_REASON_PRIORITY (L7-17) defines the tie-break when scores are equal: auth_permanent > auth > billing > format > model_not_found > overloaded > timeout > rate_limit > unknown.

This design avoids an earlier bug: when a profile had a cooldownUntil but no failureCounts (for example, one marked through the old markAuthProfileCooldown API), the old code defaulted to "rate_limit", so users saw a misleading "rate limit reached" message. It now returns "unknown" (L116-118).
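The voting and tie-break rules can be put together in one runnable sketch. It follows the structure of the excerpt (weight 1000 for an active disabledReason, per-reason counts for an active cooldown) and uses the priority order quoted above; voting "unknown" with weight 1 when a cooldown has no failureCounts is an assumption about how the "unknown" result is produced.

```typescript
type ReasonSketch =
  | "auth_permanent" | "auth" | "billing" | "format" | "model_not_found"
  | "overloaded" | "timeout" | "rate_limit" | "unknown";

// Lower index wins a tie; order as quoted from FAILURE_REASON_PRIORITY.
const PRIORITY_SKETCH: ReasonSketch[] = [
  "auth_permanent", "auth", "billing", "format", "model_not_found",
  "overloaded", "timeout", "rate_limit", "unknown",
];

type VoteStatsSketch = {
  cooldownUntil?: number;
  disabledUntil?: number;
  disabledReason?: ReasonSketch;
  failureCounts?: Partial<Record<ReasonSketch, number>>;
};

function resolveUnavailableReasonSketch(allStats: VoteStatsSketch[], now: number): ReasonSketch | null {
  const scores = new Map<ReasonSketch, number>();
  const add = (reason: ReasonSketch, value: number): void => {
    scores.set(reason, (scores.get(reason) ?? 0) + value);
  };
  for (const stats of allStats) {
    if ((stats.disabledUntil ?? 0) > now && stats.disabledReason) {
      add(stats.disabledReason, 1_000); // high-confidence vote
      continue;
    }
    if ((stats.cooldownUntil ?? 0) <= now) continue; // no active window: no vote
    let voted = false;
    for (const reason of PRIORITY_SKETCH) {
      const count = stats.failureCounts?.[reason] ?? 0;
      if (count > 0) { add(reason, count); voted = true; }
    }
    if (!voted) add("unknown", 1); // assumption: cooldown without counts votes "unknown"
  }
  let best: ReasonSketch | null = null;
  let bestScore = -1;
  // Walking in priority order with a strict ">" gives ties to the lower index.
  for (const reason of PRIORITY_SKETCH) {
    const score = scores.get(reason);
    if (score !== undefined && score > bestScore) { best = reason; bestScore = score; }
  }
  return best;
}
```

One billing-disabled profile outvotes any number of rate-limit counts on other profiles, and a tie between timeout and rate_limit resolves to timeout because it ranks higher in the priority list.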

11. SecretRef sanitization in saveAuthProfileStore

// src/agents/auth-profiles/store.ts L484-509

export function saveAuthProfileStore(store: AuthProfileStore, agentDir?: string): void {
  const profiles = Object.fromEntries(
    Object.entries(store.profiles).map(([profileId, credential]) => {
      // If both key and keyRef are present, drop the plaintext key when saving
      if (credential.type === "api_key" && credential.keyRef && credential.key !== undefined) {
        const sanitized = { ...credential } as Record<string, unknown>;
        delete sanitized.key;
        return [profileId, sanitized];
      }
      // Same for token
      if (credential.type === "token" && credential.tokenRef && credential.token !== undefined) {
        const sanitized = { ...credential } as Record<string, unknown>;
        delete sanitized.token;
        return [profileId, sanitized];
      }
      return [profileId, credential];
    }),
  ) as AuthProfileStore["profiles"];
  // ...
}

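When a credential has both a plaintext value (key) and a reference (keyRef, such as a keychain or secret manager reference), the plaintext value is removed on persistence and only the reference is saved. This keeps the secret from leaking through auth-profiles.json; at runtime, keyRef is resolved dynamically through resolveSecretRefString (oauth.ts L274-306).

12. End-to-end call chain summary

A typical automatic multi-key switch proceeds as follows: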

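Provider layer (model-auth.ts)
  └─ resolveApiKeyForProvider
       ├─ Load the store: ensureAuthProfileStore(agentDir)
       ├─ Get the ordered candidate list: resolveAuthProfileOrder({ cfg, store, provider })
       │    ├─ clearExpiredCooldowns(store)     ← proactively clear expired cooldowns
       │    ├─ Filter eligible profiles
       │    └─ Sort (type priority + oldest lastUsed first)
       ├─ Walk the ordered list, trying each in turn:
       │    └─ resolveApiKeyForProfile(cfg, store, profileId)
       │         ├─ api_key: return the key directly (possibly resolving a SecretRef)
       │         ├─ token:   check expiry, return the token
       │         └─ oauth:   check expiry → refresh under lock → layered fallbacks
       └─ Return the first apiKey that resolves

After a request completes:
  └─ markAuthProfileUsed({ store, profileId, agentDir })
       └─ Update lastUsed, reset errorCount to zero (locked write to disk)

When a request fails (called after failover-error.ts classifies the reason):
  └─ markAuthProfileFailure({ store, profileId, reason })
       ├─ computeNextProfileUsageStats()
       │    ├─ billing / auth_permanent → update disabledUntil (exponential backoff, 5h → 24h)
       │    └─ anything else → update cooldownUntil (exponential backoff, 1min → 1h)
       └─ Locked write to disk

Session layer (session-override.ts):
  └─ resolveSessionAuthProfileOverride
       ├─ New session → pick the next available profile after the current one in the order
       ├─ compactionCount increased → rotate the profile
       └─ Current profile enters cooldown → switch to the first available immediately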

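13. Design philosophy summary

Looking back over the whole Auth Profile system, a few core design principles stand out:

1. Two cooldown layers with separate semantics. cooldownUntil (transient, minutes) and disabledUntil (persistent, hours) are managed separately and map to two very different situations, "temporarily rate limited" and "account problem", so a minor rate limit cannot trigger a billing-grade long lockout.

2. Cooldown does not mean exclusion. A profile in cooldown still appears in the list returned by resolveAuthProfileOrder; it is just moved to the end. When every profile is cooling down, the system still returns the one that recovers soonest and lets the caller decide whether to wait or report an error, rather than discarding profiles at the ordering layer.

3. Lock degradation. Writes try the file lock first; if the lock fails, no error is raised and the write goes directly to memory and then to disk. This is a deliberate trade-off: cooldown state being overwritten occasionally is more acceptable than a write blocking the whole request.

4. Self-healing error counts. clearExpiredCooldowns proactively clears expired state before every ordering pass, and computeNextProfileUsageStats checks, when a new failure occurs, whether the old cooldown has expired and resets errorCount. Together the two mechanisms ensure that a profile is never permanently "dragged down" by historical error counts.

5. Session-level stability. On top of global rotation, the session-level authProfileOverride keeps a conversation on the same profile between rotation points (it changes only when the profile enters cooldown or, for auto overrides, when a compaction occurs). This avoids context inconsistency, while rotation at new-session and compaction boundaries still spreads load evenly.

Appendix: core source file index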

File	Lines	Responsibility
`src/agents/auth-profiles/types.ts`	82	Data structure definitions (AuthProfileStore, credential types, ProfileUsageStats)
`src/agents/auth-profiles/store.ts`	510	Store reads and writes, locking, legacy migration, runtime snapshot
`src/agents/auth-profiles/usage.ts`	606	Core of the cooldown algorithm (markAuthProfileFailure, calculateAuthProfileCooldownMs, clearExpiredCooldowns)
`src/agents/auth-profiles/order.ts`	209	Profile ordering (resolveAuthProfileOrder, orderProfilesByMode)
`src/agents/auth-profiles/oauth.ts`	492	OAuth token refresh, resolveApiKeyForProfile
`src/agents/auth-profiles/credential-state.ts`	75	Credential validity evaluation (evaluateStoredCredentialEligibility)
`src/agents/auth-profiles/session-override.ts`	152	Session-level profile binding and automatic rotation
`src/agents/auth-profiles/profiles.ts`	116	Profile CRUD operations, listProfilesForProvider
`src/agents/auth-profiles/repair.ts`	165	OAuth profile ID drift repair
`src/agents/auth-profiles/external-cli-sync.ts`	136	External CLI credential sync (Qwen, MiniMax)
`src/agents/auth-profiles/constants.ts`	27	Constants (file names, lock parameters, TTLs)
`src/agents/auth-profiles.ts`	55	Unified export entry point