Frontend Agent Workflow Orchestration with LangChain.js: Tool Registration, Chain-of-Thought Visualization, and Real-Time DAG Rendering of Multi-Step Reasoning

By the time the Promise from AgentExecutor.invoke() resolves, your user has already been staring at a blank page for 40 seconds.

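This is not a performance problem. It is a product-level flaw: LLM Agents are inherently slow at reasoning. A moderately complex task routinely takes 3 to 5 rounds of tool.call(), and every round has to wait for the model to finish emitting tokens, parse the structured output, run the external call, and push the result back into the messages array for the next turn. The whole chain starts at ten-plus seconds. Hide all of that behind one loading spinner and the user's patience probably won't survive the second round. So the real problem is not "how do I get the Agent running"; it is how to render its think-while-running process in real time and in a structured way (that is the ideal, at least).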

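Tool selection, argument assembly, intermediate results, retry decisions: all of it has to be laid out for the user. Put bluntly, you are building a visual stage for the LLM's inner monologue, so the user knows it is actually working and not frozen. Getting a demo running is not hard. The hard part is keeping this thing from falling over in production; in one word, it has to be rugged.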

User input
  ↓
LLM decision (pick Tool + generate arguments)
  ↓                    ↓
Tool A runs          Tool B runs (in parallel)
  ↓                    ↓
Merge results → LLM decides again
               ↓
          Tool C runs
               ↓
          Final output

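Draw this flow out and it looks like a DAG. At runtime, though, it grows dynamically: at step one you have no idea how many branches will sprout later, which Tool will time out, or which one will return an unexpected format that blows up JSON.parse on the LLM output. This article is built around that tension: how to design a frontend architecture where Tools are pluggable, chain-of-thought state is traceable, and the DAG renders in real time, without turning the code into a mess nobody wants to maintain.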

Tool registration: don't let your Agent turn into one giant switch-case

Start with the problem. The standard way to register a Tool in LangChain.js looks roughly like this:

import { DynamicStructuredTool } from '@langchain/core/tools'
import { z } from 'zod'

const searchTool = new DynamicStructuredTool({
  name: 'web_search',
  description: 'Search the web for up-to-date information',
  schema: z.object({
    query: z.string().describe('Search keywords'),
    maxResults: z.number().optional().default(5),
  }),
  func: async ({ query, maxResults }) => {
    const res = await fetch(`/api/search?q=${encodeURIComponent(query)}&limit=${maxResults}`)
    const data = await res.json()
    return JSON.stringify(data.results.slice(0, maxResults))
  },
})

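One Tool written like this is fine. Three is tolerable. Fifteen?

In real projects the number of Tools an Agent calls easily reaches double digits: search, calculation, db.query(), file I/O, external REST API calls, sandboxed code execution. Each has its own schema, error handling, retry policy, and permission checks. Stuff them all into one file and you get an 800-line tools.ts that nobody dares touch three months later.

You need a registry pattern.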

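// tool-registry.ts
// Core idea: each Tool knows what it is; the registry only collects and dispatches
import type { DynamicStructuredTool } from '@langchain/core/tools'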

type ToolMeta = {
  category: 'search' | 'compute' | 'io' | 'external'
  requiresAuth: boolean
  timeout: number  // in milliseconds; give up on the call once exceeded
  retryable: boolean
}

class ToolRegistry {
  private tools = new Map<string, DynamicStructuredTool>()
  private meta = new Map<string, ToolMeta>()

  register(tool: DynamicStructuredTool, meta: ToolMeta) {
    if (this.tools.has(tool.name)) {
      // Duplicate registration under the same name: fail loudly, the earlier this bug surfaces the better
      throw new Error(`Tool "${tool.name}" already registered`)
    }
    this.tools.set(tool.name, tool)
    this.meta.set(tool.name, meta)
  }

  getTools(filter?: { category?: ToolMeta['category'] }): DynamicStructuredTool[] {
    let entries = [...this.tools.entries()]
    if (filter?.category) {
      entries = entries.filter(([name]) => 
        this.meta.get(name)?.category === filter.category
      )
    }
    return entries.map(([, tool]) => tool)
  }

  getMeta(name: string): ToolMeta | undefined {
    return this.meta.get(name)
  }
}

export const registry = new ToolRegistry()

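Then each Tool lives in its own file and self-registers at the bottom of that file; the side effect of importing it is that it attaches itself to the registry:

// tools/web-search.ts
import { DynamicStructuredTool } from '@langchain/core/tools'
import { z } from 'zod'
import { registry } from '../tool-registry'

const tool = new DynamicStructuredTool({
  name: 'web_search',
  description: 'Search the web for up-to-date information',
  schema: z.object({ query: z.string() }),
  func: async ({ query }) => {
    // ...actual logic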
  },
})

registry.register(tool, {
  category: 'search',
  requiresAuth: false,
  timeout: 10000,
  retryable: true,
})

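This pattern has a hidden pitfall: self-registration only happens if something actually imports the tool file. A module that nothing imports never runs, so its Tool silently never reaches the registry. With Vite, one way to make sure every file gets loaded is import.meta.glob: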

const toolModules = import.meta.glob('./tools/*.ts', { eager: true })
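// eager: true → modules are loaded synchronously, so registration happens before the Agent is created
// The return value is not needed; the import side effect has already done the registration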

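Static registration is done.

But there is another layer at runtime: lifecycle hooks around Tool execution. You need to know when a Tool starts, when it finishes, what it returned, and whether it threw. This information is not just the data source for the chain-of-thought visualization later; it is the skeleton of the chain of thought itself. Without this event stream there is no DAG to draw.

Right. Moving on.

LangChain.js natively provides a callbacks mechanism for this. But its callback design has, how shall I put it, a bit of a Java flavor: handleToolStart, handleToolEnd, handleToolError, a pile of method signatures in your face, with parameter types that often don't match the docs (which I find odd for a TypeScript project). My approach is to wrap a proxy at the registry layer that intercepts the Tool's func: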

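// Inside ToolRegistry: register() with lifecycle events.
// Assumes ToolRegistry also exposes emit()/on(), e.g. by extending an event emitter.
register(tool: DynamicStructuredTool, meta: ToolMeta) {
  if (this.tools.has(tool.name)) {
    throw new Error(`Tool "${tool.name}" already registered`)
  }

  const originalFunc = tool.func.bind(tool)

  const wrappedFunc = async (input: any, runManager?: any) => {
    const startTime = Date.now()
    const executionId = crypto.randomUUID()
    let timer: ReturnType<typeof setTimeout> | undefined

    this.emit('tool:start', { 
      executionId, 
      toolName: tool.name, 
      input, 
      timestamp: startTime 
    })

    try {
      // Whichever settles first wins: the real call or the timeout rejection
      const result = await Promise.race([
        originalFunc(input, runManager),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error(`Tool ${tool.name} timeout`)), meta.timeout)
        }),
      ])

      this.emit('tool:end', { 
        executionId, 
        toolName: tool.name, 
        result, 
        duration: Date.now() - startTime 
      })
      return result
    } catch (err) {
      this.emit('tool:error', { 
        executionId, 
        toolName: tool.name, 
        error: err, 
        duration: Date.now() - startTime,
        retryable: meta.retryable,
      })
      throw err
    } finally {
      // Don't leave the timer pending once the race has settled
      clearTimeout(timer)
    }
  }

  // Swap in the wrapped func so every execution goes through the hooks above
  ;(tool as any).func = wrappedFunc
  this.tools.set(tool.name, tool)
  this.meta.set(tool.name, meta)
}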

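One detail in this code is worth pausing on. Putting a setTimeout inside Promise.race as a timeout fallback is a common idiom, but in a LangChain Tool it has a trap: after the timeout rejects, the original fetch or database query is still running. Your Agent has already received the error and moved on, while a request keeps hanging in the background and eating resources. "High concurrency on the frontend" sounds odd, right? One user runs one Agent at a time. But think it through: if the Agent supports parallel Tool calls and fires 3 or 4 fetches at once, and the user has several conversation tabs open with each one running, this leak stops being theoretical. AbortController is the right fix, but DynamicStructuredTool makes it awkward to pass an AbortSignal into func, so you have to stash one in a closure yourself. It doesn't look pretty, so I'm leaving it as debt for now.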
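For reference, here is a minimal sketch of what the AbortController version could look like. runWithTimeout is a hypothetical helper, not part of LangChain.js, and it only cancels work that actually honors the signal it is handed (fetch does; an arbitrary database client may not).

```typescript
// Hypothetical helper (not a LangChain.js API): runs `task` with a deadline
// and aborts the underlying work when the deadline passes.
async function runWithTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  label = 'task',
): Promise<T> {
  const controller = new AbortController()
  let timer: ReturnType<typeof setTimeout> | undefined

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort() // cancels any fetch() that received this signal
      reject(new Error(`${label} timed out after ${timeoutMs}ms`))
    }, timeoutMs)
  })

  try {
    return await Promise.race([task(controller.signal), deadline])
  } finally {
    clearTimeout(timer) // no pending timer after a fast result
  }
}
```

Inside a Tool's func this would be used roughly as runWithTimeout(signal => fetch(url, { signal }).then(r => r.json()), meta.timeout, tool.name), which is the "stash it in a closure" approach: the signal never travels through DynamicStructuredTool itself.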


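What really makes the registry pattern pay for itself is dynamic Tool sets: different user roles and different conversation scenarios give the Agent different Tools to call. Admins can use db_query; regular users don't get anywhere near it. code_executor is loaded for coding questions and left out of casual chat.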

function getToolsForContext(user: User, conversationType: string) {
  const tools = registry.getTools()

  return tools.filter(tool => {
    const meta = registry.getMeta(tool.name)!
    if (meta.requiresAuth && !user.permissions.includes(tool.name)) {
      return false
    }
    if (conversationType === 'casual' && meta.category === 'compute') {
      return false
    }
    return true
  })
}

const agent = await createOpenAIFunctionsAgent({
  llm,
  tools: getToolsForContext(currentUser, 'technical'),
  prompt,
})

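This filter looks plain, but what it really does is decouple Tool registration from Tool usage.

That said, the biggest beneficiary of this registry is not the runtime. It is the DAG rendering later on: once events like tool:start and tool:end are flowing, the chain of thought has its data source.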
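The tool:start / tool:end events assume the registry can emit and subscribe, which the ToolRegistry class shown earlier does not define on its own. One possible shape is a small typed emitter that ToolRegistry extends. The names TypedEmitter and ToolEvents below are hypothetical; the payload fields simply mirror the objects emitted in the register() wrapper.

```typescript
// Payload shapes mirror what the register() wrapper emits.
type ToolEvents = {
  'tool:start': { executionId: string; toolName: string; input: unknown; timestamp: number }
  'tool:end': { executionId: string; toolName: string; result: unknown; duration: number }
  'tool:error': {
    executionId: string; toolName: string; error: unknown; duration: number; retryable: boolean
  }
}

// Hypothetical minimal emitter; ToolRegistry could extend TypedEmitter<ToolEvents>.
class TypedEmitter<Events extends Record<string, unknown>> {
  private handlers = new Map<keyof Events, Set<(payload: any) => void>>()

  // Registers a handler and returns an unsubscribe function.
  on<K extends keyof Events>(event: K, handler: (payload: Events[K]) => void): () => void {
    const set = this.handlers.get(event) ?? new Set<(payload: any) => void>()
    this.handlers.set(event, set)
    set.add(handler)
    return () => { set.delete(handler) }
  }

  // Calls every handler registered for `event`, in registration order.
  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    this.handlers.get(event)?.forEach(handler => handler(payload))
  }
}
```

Typing the event map means a handler for 'tool:error' sees retryable without a cast, which helps later when the DAG layer decides whether to draw a retry affordance.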

Chain-of-thought state management: turning the LLM's inner monologue into a traceable tree

What is AgentExecutor actually doing once it's running?

Just a loop:

while (true) {
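  1. Send the current messages array to the LLM
  2. The LLM returns either a Tool call (which Tool, which arguments) or the final answer
  3. Final answer → break
  4. Tool call → execute → push the result back into messages → go to 1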
}

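Every turn of the loop is one node on the chain of thought. The problem is that LangChain's callbacks can tell you these events happened, but they don't give you a structured state object that expresses the topology of the whole chain. What you get is a pile of loose events that you have to assemble into a tree yourself.

The data structure. I over-designed it at first, then cut and cut until nothing more could go; this is the version it converged on after getting burned a few times: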

type ThinkingNodeType = 'llm_call' | 'tool_call' | 'tool_result' | 'final_answer' | 'error'

type ThinkingNodeStatus = 'pending' | 'running' | 'completed' | 'failed'

interface ThinkingNode {
  id: string
  type: ThinkingNodeType
  status: ThinkingNodeStatus
  parentId: string | null
  label: string
  data: Record<string, any>
  startedAt: number
  completedAt: number | null
  children: string[]
  streamTokens?: string[]
}

interface ThinkingChain {
  sessionId: string
  rootId: string
  nodes: Map<string, ThinkingNode>
  currentNodeId: string | null
}

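ThinkingNode uses parentId and children to form a tree. Wait, wasn't this supposed to be a DAG? Right: in theory, if the results of two Tools feed into the next round of LLM decision-making together, that is a DAG, not a tree. But in the current AgentExecutor implementation in LangChain.js (note: AgentExecutor, not langgraph), the results of parallel Tool calls end up concatenated into a single message that is fed back, so a tree is good enough for modeling the intermediate state. A strict DAG is a topic for later.

The manager, which maintains this tree and hooks into LangChain's callback system: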
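If you do end up needing a strict DAG, one possible direction (a sketch, not what the code in this article does) is to replace the single parentId with a parentIds array and derive each node's depth from the longest path to a root, so a merge node always sits below every branch feeding it. DagNode and computeDepths are hypothetical names.

```typescript
interface DagNode {
  id: string
  parentIds: string[] // empty for a root; more than one entry for a merge node
}

// Depth = length of the longest path from any root to the node.
function computeDepths(nodes: Map<string, DagNode>): Map<string, number> {
  const depths = new Map<string, number>()
  const visiting = new Set<string>()

  const depthOf = (id: string): number => {
    const cached = depths.get(id)
    if (cached !== undefined) return cached
    if (visiting.has(id)) throw new Error(`Cycle detected at node "${id}"`)
    visiting.add(id)

    const node = nodes.get(id)
    if (!node) throw new Error(`Unknown node "${id}"`)
    const depth = node.parentIds.length === 0
      ? 0
      : Math.max(...node.parentIds.map(depthOf)) + 1

    visiting.delete(id)
    depths.set(id, depth)
    return depth
  }

  for (const id of nodes.keys()) depthOf(id)
  return depths
}
```

With this shape, an "LLM thinking #2" node whose parents are a depth-2 result and a depth-1 tool call gets depth 3, which is what a top-to-bottom renderer needs to place it under both.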

class ThinkingChainManager {
  private chain: ThinkingChain
  private listeners = new Set<(chain: ThinkingChain) => void>()

  constructor(sessionId: string) {
    const rootId = crypto.randomUUID()
    this.chain = {
      sessionId,
      rootId,
      nodes: new Map(),
      currentNodeId: null,
    }
  }

  addNode(
    type: ThinkingNodeType,
    label: string,
    parentId: string | null,
    data: Record<string, any> = {}
  ): string {
    const id = crypto.randomUUID()
    const node: ThinkingNode = {
      id, type, status: 'pending', parentId, label, data,
      startedAt: Date.now(), completedAt: null, children: [],
    }

    this.chain.nodes.set(id, node)

    if (parentId && this.chain.nodes.has(parentId)) {
      this.chain.nodes.get(parentId)!.children.push(id)
    }

    this.notify()
    return id
  }

  updateStatus(nodeId: string, status: ThinkingNodeStatus) {
    const node = this.chain.nodes.get(nodeId)
    if (!node) return
    node.status = status
    if (status === 'completed' || status === 'failed') {
      node.completedAt = Date.now()
    }
    if (status === 'running') {
      this.chain.currentNodeId = nodeId
    }
    this.notify()
  }

  appendStreamToken(nodeId: string, token: string) {
    const node = this.chain.nodes.get(nodeId)
    if (!node) return
    if (!node.streamTokens) node.streamTokens = []
    node.streamTokens.push(token)
    // Deliberately no notify() here
  }

  subscribe(listener: (chain: ThinkingChain) => void) {
    this.listeners.add(listener)
    // Return a void cleanup so it can be used directly as a useEffect destructor
    return () => {
      this.listeners.delete(listener)
    }
  }

  private notify() {
    this.listeners.forEach(fn => fn(this.chain))
  }

  getSnapshot(): ThinkingChain {
    return this.chain
  }
}

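Why doesn't appendStreamToken trigger notify()?

Because GPT-4 and Claude emit roughly 30 to 80 tokens per second, and can exceed 100 when short tokens are flying. If every token triggered a React re-render, your UI thread would turn into a slideshow. The right approach is to throttle on the consumer side; flushing once per frame with requestAnimationFrame is enough: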

useEffect(() => {
  const unsub = chainManager.subscribe(chain => {
    setDisplayChain(structuredClone(chain))
  })

  let rafId: number
  const tickStream = () => {
    setDisplayChain(structuredClone(chainManager.getSnapshot()))
    rafId = requestAnimationFrame(tickStream)
  }
  rafId = requestAnimationFrame(tickStream)

  return () => {
    unsub()
    cancelAnimationFrame(rafId)
  }
}, [chainManager])

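structuredClone is a bit of a luxury here. With many nodes, cloning the whole tree every frame is not cheap (though honestly, cloning a 20-node object takes on the order of microseconds). A better approach is to maintain an immutable structure with immer, but getting it working first beats optimizing prematurely.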
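A cheaper middle ground than immer, sketched here under the assumption that something tells you when a token arrived: only clone on frames where there is actually new data. createFrameBatcher is a hypothetical helper; the scheduler is injected so the same code works with requestAnimationFrame in the browser and with a fake scheduler in tests.

```typescript
type Schedule = (callback: () => void) => void

// Coalesces any number of requestFlush() calls into one flush per scheduled frame.
function createFrameBatcher(flush: () => void, schedule: Schedule): () => void {
  let scheduled = false

  return function requestFlush(): void {
    if (scheduled) return // a flush is already queued for this frame
    scheduled = true
    schedule(() => {
      scheduled = false
      flush()
    })
  }
}
```

In the component this might be wired as createFrameBatcher(() => setDisplayChain(structuredClone(chainManager.getSnapshot())), cb => requestAnimationFrame(cb)), with requestFlush() called wherever tokens are appended. Idle frames then cost nothing, instead of cloning the tree 60 times a second.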


Next, ThinkingChainManager has to be hooked up to LangChain's callbacks. Extend BaseCallbackHandler and override a bunch of handle* methods:

import { BaseCallbackHandler } from '@langchain/core/callbacks/base'

class ThinkingChainCallbackHandler extends BaseCallbackHandler {
  name = 'ThinkingChainHandler'
  private manager: ThinkingChainManager
  private runNodeMap = new Map<string, string>()
  private currentLlmNodeId: string | null = null

  constructor(manager: ThinkingChainManager) {
    super()
    this.manager = manager
  }

  async handleLLMStart(llm: any, prompts: string[], runId: string) {
    const parentId = this.getParentNodeId()
    const nodeId = this.manager.addNode(
      'llm_call',
      'Thinking...',
      parentId,
      { model: llm?.modelName || 'unknown' }
    )
    this.runNodeMap.set(runId, nodeId)
    this.currentLlmNodeId = nodeId
    this.manager.updateStatus(nodeId, 'running')
  }

  async handleLLMNewToken(token: string) {
    if (this.currentLlmNodeId) {
      this.manager.appendStreamToken(this.currentLlmNodeId, token)
    }
  }

  async handleLLMEnd(output: any, runId: string) {
    const nodeId = this.runNodeMap.get(runId)
    if (nodeId) {
      this.manager.updateStatus(nodeId, 'completed')
    }
    this.currentLlmNodeId = null
  }

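  async handleToolStart(tool: any, input: string, runId: string) {
    const parentId = this.currentLlmNodeId || this.getParentNodeId()
    // input arrives as a string and is not always valid JSON, so parse defensively
    let parsedInput: unknown = input
    try {
      parsedInput = JSON.parse(input || '{}')
    } catch {
      // keep the raw string when it is not JSON
    }
    const nodeId = this.manager.addNode(
      'tool_call',
      `Calling ${tool.name || 'Tool'}`,
      parentId,
      { toolName: tool.name, input: parsedInput }
    )
    this.runNodeMap.set(runId, nodeId)
    this.manager.updateStatus(nodeId, 'running')
  }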

  async handleToolEnd(output: string, runId: string) {
    const nodeId = this.runNodeMap.get(runId)
    if (!nodeId) return

    const resultNodeId = this.manager.addNode(
      'tool_result',
      'Result returned',
      nodeId,
      { output: output.slice(0, 500) }
    )
    this.manager.updateStatus(resultNodeId, 'completed')
    this.manager.updateStatus(nodeId, 'completed')
  }

  async handleToolError(err: any, runId: string) {
    const nodeId = this.runNodeMap.get(runId)
    if (nodeId) {
      this.manager.updateStatus(nodeId, 'failed')
      this.manager.addNode('error', `Error: ${err.message}`, nodeId, { error: err })
    }
  }

  private getParentNodeId(): string | null {
    return this.manager.getSnapshot().currentNodeId
  }
}

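This handler exposes something LangChain doesn't do well: the second parameter of handleToolStart, input, is a string rather than a structured object. You have to JSON.parse it yourself, and sometimes what you get is not valid JSON. Not a bug. A "feature". (I have seen no fewer than ten people complain about this in GitHub issues, and it still hasn't been changed.)

Wiring it together. The startup code: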

const chainManager = new ThinkingChainManager(sessionId)
const callbackHandler = new ThinkingChainCallbackHandler(chainManager)

const executor = AgentExecutor.fromAgentAndTools({
  agent,
  tools: getToolsForContext(currentUser, conversationType),
  callbacks: [callbackHandler],
  // Note: the `streaming` flag (set on the llm, not here) is named streaming,
  // but what it actually controls is callback granularity: without it handleLLMNewToken never fires
})

registry.on('tool:start', (event) => {
  // Attach extra meta info: expected duration, whether it is retryable, and so on
})

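At this point the chain-of-thought data flow is connected: every reasoning step and every Tool call produces a corresponding node in ThinkingChainManager.

Back to rendering.

DAG rendering: drawing a dynamically growing graph on screen

This is the part of the whole design that is easiest to get working, and also the easiest to get wrong.

First, be clear about what needs to be rendered: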

[User question]
    ↓
[LLM thinking #1] ──→ [call web_search("weather")] ──→ [result: 晴 25°C]
    ↓                                                        ↓
[LLM thinking #2] ←──────────────────────────────────────────┘
    ↓
    ├──→ [call calculator("25 * 9/5 + 32")] ──→ [result: 77°F]
    │
    └──→ [call translator("晴", "en")] ──→ [result: "Sunny"]
              ↓                                    ↓
[LLM thinking #3] ←────────────────────────────────┘
    ↓
[Final answer: "Sunny today, 25°C (77°F)"]

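Node types are heterogeneous: llm_call, tool_call, tool_result, final_answer. Edges all point one way, but there are parallel branches. And the whole graph grows while it runs, which is the painful part.

Which library? The code below uses @xyflow/react (React Flow).

The core challenge is not rendering. It is the layout algorithm.

Every time a node is added, the layout of the whole graph may need to be recomputed. If you use dagre for automatic layout (the approach the react-flow docs recommend), every addNode recomputes the x/y coordinates of all nodes, and existing nodes jump. The user is staring at a node and it suddenly hops somewhere else. Terrible experience.

My solution is incremental layout. A new node is positioned relative to its parent, and existing nodes don't move at all: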

import { useCallback, useRef } from 'react'

const LAYOUT = {
  nodeWidth: 240,
  nodeHeight: 80,
  horizontalGap: 60,
  verticalGap: 100,
} as const

function useIncrementalLayout() {
  const positionCache = useRef(new Map<string, { x: number; y: number }>())
  const depthCounters = useRef(new Map<number, number>())

  const getNodePosition = useCallback((
    nodeId: string,
    parentId: string | null,
    depth: number
  ): { x: number; y: number } => {
    if (positionCache.current.has(nodeId)) {
      return positionCache.current.get(nodeId)!
    }

    const currentCount = depthCounters.current.get(depth) || 0
    depthCounters.current.set(depth, currentCount + 1)

    let x: number, y: number

    if (!parentId) {
      x = 400
      y = 50
    } else {
      const parentPos = positionCache.current.get(parentId)
      if (parentPos) {
        x = parentPos.x + (currentCount * (LAYOUT.nodeWidth + LAYOUT.horizontalGap))
        y = parentPos.y + LAYOUT.verticalGap
        
        const siblings = currentCount
        if (siblings > 0) {
          x = parentPos.x + ((siblings - 0.5) * (LAYOUT.nodeWidth + LAYOUT.horizontalGap) / 2)
        }
      } else {
        x = currentCount * (LAYOUT.nodeWidth + LAYOUT.horizontalGap)
        y = depth * LAYOUT.verticalGap
      }
    }

    const pos = { x, y }
    positionCache.current.set(nodeId, pos)
    return pos
  }, [])

  return { getNodePosition }
}

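Frankly, this layout code is rough. The horizontal spreading of parallel branches isn't quite right, and with three or more parallel Tools the nodes bunch up. But it's good enough for 80% of cases. Leave it for later. Perfect DAG layout is an academic-grade problem; actually implementing the Sugiyama approach takes several hundred lines, and chasing perfection in this business context is a waste of your life. Your users care about "what is the Agent doing", "which step is it on", and "which step failed", not whether the graph's margins are symmetric.

A custom node component that renders a different style per ThinkingNodeType: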
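One cheap way to reduce the bunching without giving up the "existing nodes never move" rule, sketched under my own assumptions: count children per parent instead of per depth, and fan them out alternately right and left of the parent. SiblingLayout and siblingSlot are hypothetical names, and this still does nothing about children of different parents colliding on the same row.

```typescript
const NODE_WIDTH = 240
const HORIZONTAL_GAP = 60

// Maps the n-th child of a parent to a horizontal slot: 0, +1, -1, +2, -2, ...
function siblingSlot(index: number): number {
  if (index === 0) return 0
  const distance = Math.ceil(index / 2)
  return index % 2 === 1 ? distance : -distance
}

class SiblingLayout {
  private childCounts = new Map<string, number>()

  // Returns the x coordinate for the next child of `parentId`.
  // Earlier children keep their slots, so nothing already on screen moves.
  nextChildX(parentId: string, parentX: number): number {
    const index = this.childCounts.get(parentId) ?? 0
    this.childCounts.set(parentId, index + 1)
    return parentX + siblingSlot(index) * (NODE_WIDTH + HORIZONTAL_GAP)
  }
}
```

The first child lands directly under the parent, the second one slot to the right, the third one slot to the left, and so on, so three parallel Tools end up centered on the LLM node that spawned them.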

function ThinkingNodeComponent({ data }: { data: ThinkingNode }) {
  const statusColor = {
    pending: '#94a3b8',
    running: '#3b82f6',
    completed: '#22c55e',
    failed: '#ef4444',
  }[data.status]

  return (
    <div 
      className={`thinking-node thinking-node--${data.type}`}
      style={{ borderLeftColor: statusColor, borderLeftWidth: 4 }}
    >
      <div className="thinking-node__header">
        <span className="thinking-node__icon">{getIcon(data.type)}</span>
        <span>{data.label}</span>
        {data.status === 'running' && <PulseIndicator />}
      </div>
      
      {data.streamTokens && data.status === 'running' && (
        <div className="thinking-node__stream">
          {data.streamTokens.join('')}
          <BlinkingCursor />
        </div>
      )}
      
      {data.type === 'tool_call' && data.data.input && (
        <Collapsible title="Arguments">
          <pre>{JSON.stringify(data.data.input, null, 2)}</pre>
        </Collapsible>
      )}
      
      {data.type === 'tool_result' && (
        <Collapsible title="Result">
          <pre>{data.data.output}</pre>
        </Collapsible>
      )}
    </div>
  )
}

Convert the ThinkingChain into the nodes and edges arrays that @xyflow/react expects, using a BFS traversal that computes depth along the way:

function chainToFlowElements(
  chain: ThinkingChain,
  getPosition: (id: string, parentId: string | null, depth: number) => { x: number; y: number }
) {
  const nodes: Node[] = []
  const edges: Edge[] = []

  const queue: Array<{ nodeId: string; depth: number }> = []
  const visited = new Set<string>()

  for (const [id, node] of chain.nodes) {
    if (!node.parentId) {
      queue.push({ nodeId: id, depth: 0 })
    }
  }

  while (queue.length > 0) {
    const { nodeId, depth } = queue.shift()!
    if (visited.has(nodeId)) continue
    visited.add(nodeId)

    const thinkingNode = chain.nodes.get(nodeId)!
    const position = getPosition(nodeId, thinkingNode.parentId, depth)

    nodes.push({
      id: nodeId,
      type: 'thinkingNode',
      position,
      data: thinkingNode,
    })

    if (thinkingNode.parentId) {
      edges.push({
        id: `${thinkingNode.parentId}-${nodeId}`,
        source: thinkingNode.parentId,
        target: nodeId,
        animated: thinkingNode.status === 'running',
        style: { stroke: thinkingNode.status === 'failed' ? '#ef4444' : '#64748b' },
      })
    }

    for (const childId of thinkingNode.children) {
      queue.push({ nodeId: childId, depth: depth + 1 })
    }
  }

  return { nodes, edges }
}

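The final React component:

import { useEffect, useMemo, useState } from 'react'
import { ReactFlow, Background, Controls, useReactFlow } from '@xyflow/react'

function AgentDAGViewer({ chainManager }: { chainManager: ThinkingChainManager }) {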
  const [chain, setChain] = useState<ThinkingChain | null>(null)
  const { getNodePosition } = useIncrementalLayout()

  useEffect(() => {
    return chainManager.subscribe(newChain => {
      setChain(structuredClone(newChain))
    })
  }, [chainManager])

  const { nodes, edges } = useMemo(() => {
    if (!chain) return { nodes: [], edges: [] }
    return chainToFlowElements(chain, getNodePosition)
  }, [chain, getNodePosition])

  const reactFlowInstance = useReactFlow()
  useEffect(() => {
    if (chain?.currentNodeId) {
      const pos = getNodePosition(chain.currentNodeId, null, 0)
      reactFlowInstance.setCenter(pos.x, pos.y, { duration: 300, zoom: 1 })
    }
  }, [chain?.currentNodeId])

  return (
    <ReactFlow
      nodes={nodes}
      edges={edges}
      nodeTypes={{ thinkingNode: ThinkingNodeComponent }}
      fitView={false}
      panOnDrag
      zoomOnScroll
      minZoom={0.3}
      maxZoom={1.5}
    >
      <Background />
      <Controls />
    </ReactFlow>
  )
}

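Pitfall warning: useReactFlow() must be called inside a <ReactFlowProvider>, otherwise it throws. And the Provider cannot be rendered by the same component that calls the hook and renders <ReactFlow>; it has to wrap that component one level up. The docs mention this, but not prominently, and nine and a half out of ten people trip over it.

Design trade-offs and boundaries

langgraph or AgentExecutor? An unavoidable choice.

The LangChain team itself is pushing langgraph as the next-generation approach to Agent orchestration, and AgentExecutor is, in a sense, already in maintenance mode. langgraph is natively a graph structure (a StateGraph with nodes and edges), which naturally fits DAG visualization better than AgentExecutor's while-loop model: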

import { StateGraph } from '@langchain/langgraph'

const workflow = new StateGraph({ channels: agentState })
  .addNode('agent', callModel)
  .addNode('tools', callTools)
  .addEdge('__start__', 'agent')
  .addConditionalEdges('agent', shouldContinue, {
    continue: 'tools',
    end: '__end__',
  })
  .addEdge('tools', 'agent')

const app = workflow.compile()

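But langgraph is no silver bullet either. Its learning curve is a good deal steeper than AgentExecutor's: StateGraph, channels, conditional edges, checkpointer, a pile of new concepts all at once, and the JS version currently lags the Python version in features. If your scenario is just a simple ReAct loop, AgentExecutor plus the callback machinery above is already enough; don't bring in unnecessary complexity for the sake of architectural "correctness". It runs. That's enough.

On performance, the biggest bottleneck is not frontend rendering at all.

On state persistence: ThinkingChainManager's data is currently in memory only, and one location.reload() wipes it. If you need to replay the reasoning process of past conversations (a fairly common requirement in enterprise settings, for audit and compliance), you have to serialize the whole ThinkingChain to the backend, give every event a timestamp, and replay in timestamp order. Covering that properly would take another full article.

After most of a year in production, the biggest lesson from this design fits in one sentence: don't try to get there in one step. I changed the ThinkingNode type enum four times, added fields to ToolMeta three times, and went through two DAG layout approaches. Start with AgentExecutor, the most basic callback, and a simple list-style rendering; once you have confirmed the product direction is right, layer on the fancier parts step by step: DAG visualization, incremental layout, streaming tokens. Waiting until you can do it all at once gets you nothing.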
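A tiny sketch of the serialization half, with assumptions stated up front: StoredNode is a trimmed-down, hypothetical version of ThinkingNode, and replay order is taken from startedAt. The one detail worth showing is that the nodes Map has to be flattened first, because JSON.stringify turns a Map into an empty object.

```typescript
// Hypothetical persisted shape: a subset of ThinkingNode that is plain JSON.
interface StoredNode {
  id: string
  parentId: string | null
  type: string
  label: string
  startedAt: number
  completedAt: number | null
}

interface StoredChain {
  sessionId: string
  rootId: string
  nodes: StoredNode[] // array instead of Map so it survives JSON.stringify
}

function serializeChain(chain: {
  sessionId: string
  rootId: string
  nodes: Map<string, StoredNode>
}): string {
  const stored: StoredChain = {
    sessionId: chain.sessionId,
    rootId: chain.rootId,
    nodes: Array.from(chain.nodes.values()),
  }
  return JSON.stringify(stored)
}

// Returns nodes in the order they should be re-added during a replay.
function replayOrder(json: string): StoredNode[] {
  const stored = JSON.parse(json) as StoredChain
  return [...stored.nodes].sort((a, b) => a.startedAt - b.startedAt)
}
```

A replay would then walk replayOrder(json) and feed each node back through addNode, pacing by the gaps between timestamps if you want the original timing.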