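FastAPI + LangGraph Workflow API Optimization

How a complex agent application went from frequent timeouts to stable operation: an optimization log

I have recently been working on an agent project built on FastAPI + Gunicorn + LangGraph. I assumed an async architecture would make everything smooth, but problems piled up as soon as we went live: workflow initialization was slow, workers were regularly killed on timeout, and the user experience was terrible. This post records, step by step, how we solved them.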

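Background: the trouble that comes with complexity

Our agent architecture is fairly complex:

Main workflow (the agent itself): built with LangGraph and made up of multiple execution nodes
Sub-workflows (tools): some tools are themselves standalone workflows built with LangGraph
AI components: vector stores, knowledge-base retrieval, multiple LLMs, and so on
Execution profile: workflow initialization takes 3-5 seconds, and some nodes can run for more than 30 seconds

Most of the code is asynchronous, but we cannot guarantee it is 100% asynchronous, especially inside some third-party dependencies. That leads to a fatal problem: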

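# gunicorn config
timeout = 30  # Gunicorn's default: a worker silent for 30 seconds is killed and restarted

Once a request keeps a worker busy for more than 30 seconds, Gunicorn forcibly terminates the worker process: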

WORKER TIMEOUT (pid:12345)
Worker timeout signal received. Printing call stack...

What the user sees is a timed-out request, and the workflow that was running is cut off mid-execution.
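
To make the "not 100% asynchronous" problem concrete, here is a minimal sketch that assumes the workers run an asyncio event loop. The names `blocking_call`, `heartbeat` and `count_ticks` are made up for illustration. A synchronous call made directly inside a coroutine freezes every other task on that loop, while the same call pushed to a thread with `asyncio.to_thread` leaves the loop responsive.

```python
import asyncio
import time


def blocking_call(seconds: float) -> str:
    """Stand-in for a synchronous third-party call (hypothetical)."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: list, interval: float) -> None:
    """Record a tick every `interval` seconds while the event loop is free."""
    while True:
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def count_ticks(offload: bool) -> int:
    """Run the blocking call inline or in a thread and count heartbeat ticks."""
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks, 0.01))
    await asyncio.sleep(0)  # give the heartbeat task a chance to start
    if offload:
        await asyncio.to_thread(blocking_call, 0.3)  # loop keeps running
    else:
        blocking_call(0.3)  # the whole event loop is stuck here
    count = len(ticks)
    beat.cancel()
    await asyncio.gather(beat, return_exceptions=True)
    return count


if __name__ == "__main__":
    print("ticks while blocking inline:", asyncio.run(count_ticks(offload=False)))
    print("ticks while offloaded:", asyncio.run(count_ticks(offload=True)))
```

Offloading keeps the loop alive but does not make a 30-second node any shorter, which is why the rest of this post moves the execution out of the web worker altogether.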

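Step 1: Multi-level cache optimization

2.1 First attempt: a timestamp scheme

Workflow initialization is slow mainly because it has to build a CompiledGraph object, which involves node parsing, tool loading, dependency checks, and so on. Caching is the natural optimization.

But there is a catch: a CompiledGraph object cannot be serialized.

# This raises an error
compiled_graph = workflow.compile()
redis.set("graph:123", pickle.dumps(compiled_graph))  # TypeError: cannot pickle

Since a distributed cache is out, we fall back to a local cache, but in a cluster the local copies have to stay consistent. We designed a timestamp-based multi-level cache: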

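from datetime import datetime
from typing import Any

# Assumption: SimpleMemoryCache and RedisCache are aiocache-style async caches
from aiocache import RedisCache, SimpleMemoryCache


class MultiLevelCache:
    def __init__(self):
        self.local_cache = SimpleMemoryCache()
        self.remote_cache = RedisCache()
    
    async def set(self, key: str, value: Any, expire: int = 60):
        expire_time = datetime.now().timestamp() + expire
        
        # Store in local memory
        await self.local_cache.set(f"{key}:data", value, ttl=expire)
        await self.local_cache.set(f"{key}:expire", expire_time, ttl=expire)
        
        # Set the remote timestamp (only if it does not exist yet)
        if not await self.remote_cache.get(f"{key}:timestamp"):
            await self.remote_cache.set(f"{key}:timestamp", 
                                       datetime.now().timestamp(), ttl=None)
    
    async def get(self, key: str):
        # Check the local expiry time
        local_expire = await self.local_cache.get(f"{key}:expire")
        if not local_expire:
            return None
        
        # Fetch the remote timestamp
        remote_timestamp = await self.remote_cache.get(f"{key}:timestamp") or 0.0
        
        # Compare the two timestamps
        if float(local_expire) < remote_timestamp:
            # Local entry is considered stale: clear it
            await self._clear_local(key)
            return None
        
        return await self.local_cache.get(f"{key}:data")
    
    async def evict(self, key: str):
        # Update the remote timestamp to broadcast the invalidation
        await self.remote_cache.set(f"{key}:timestamp", 
                                   datetime.now().timestamp(), ttl=None)
        await self._clear_local(key)
    
    async def _clear_local(self, key: str):
        # Helper not shown in the original; assumed to drop both local entries
        await self.local_cache.delete(f"{key}:data")
        await self.local_cache.delete(f"{key}:expire")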

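2.2 Problems with the timestamp scheme

The scheme worked fine in the test environment, but in production we found a serious problem: cached entries often survived a call to evict.

Investigation pointed to clock synchronization. In a distributed environment the system clocks of different servers can drift apart, which makes timestamp comparisons unreliable. For example:

Server A's clock runs 30 seconds ahead of true time
Server B's clock runs 20 seconds behind true time
The timestamp A writes on evict may be smaller than the expiry time B recorded earlier on set

Clock drift like this is common in cloud environments, especially between instances in different availability zones.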
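
A small worked example helps show how fragile the comparison is. The sketch below replays the check from `get` with assumed numbers: server B caches at true time 1000 s, server A evicts ten seconds later, and the TTL is 60 s. The function names and offsets are illustrative only. The real-world order of events is the same in every call, yet the answer changes with the clock offsets.

```python
def looks_evicted(local_expire: float, remote_timestamp: float) -> bool:
    """The comparison used by get() in the timestamp scheme."""
    return local_expire < remote_timestamp


def simulate(offset_a: float, offset_b: float, ttl: float = 60.0) -> bool:
    """Server B caches at true time 1000 s; server A evicts at true time 1010 s.

    offset_a / offset_b say how far each server's clock runs ahead of true time
    (negative means behind). Returns True if B notices the eviction on get().
    """
    set_true_time, evict_true_time = 1000.0, 1010.0
    local_expire = (set_true_time + offset_b) + ttl   # written with B's clock
    remote_timestamp = evict_true_time + offset_a     # written with A's clock
    return looks_evicted(local_expire, remote_timestamp)


if __name__ == "__main__":
    print(simulate(offset_a=30, offset_b=-20))  # False: the stale entry survives
    print(simulate(offset_a=90, offset_b=-20))  # True: only a larger skew flips it
```

With A 30 s fast and B 20 s slow, B's recorded expiry is 1040 and A's eviction stamp is also 1040, so the check `1040 < 1040` fails and the stale entry is served. The result depends on the offsets and the TTL, not on which event happened first, which is the property a version counter restores.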

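2.3 The version-number scheme

To eliminate the clock-synchronization problem entirely, we switched to a synchronization mechanism based on version numbers: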

class MultiLevelCache:
    async def get(self, key: str):
        # Read the local version number
        local_version = await self.local_cache.get(f"{key}:version")
        if local_version is None:  # version 0 is valid, so test for None explicitly
            return None
        
        # Read the remote version number
        remote_version = await self.remote_cache.get(f"{key}:version") or 0
        if local_version < remote_version:
            # Local version is behind: clear the local cache
            await self._clear_local(key)
            return None
        
        return await self.local_cache.get(f"{key}:data")
    
    async def set(self, key: str, value: Any, expire: int = 60):
        # Read the current version number
        current_version = await self.remote_cache.get(f"{key}:version") or 0
        
        # Store locally, tagged with that version
        await self.local_cache.set(f"{key}:data", value, ttl=expire)
        await self.local_cache.set(f"{key}:version", current_version, ttl=expire)
    
    async def evict(self, key: str):
        # Bump the version number (note: this read-then-write is not atomic)
        current_version = await self.remote_cache.get(f"{key}:version") or 0
        await self.remote_cache.set(f"{key}:version", current_version + 1, ttl=None)
        await self._clear_local(key)

Architecture of the multi-level cache:

graph TB
    subgraph "App instance A"
        A1[Local in-memory cache L1]
        A2[CompiledGraph object]
        A3[Version: v1]
    end
    
    subgraph "App instance B"  
        B1[Local in-memory cache L1]
        B2[CompiledGraph object]
        B3[Version: v1]
    end
    
    subgraph "Redis (L2)"
        C1[Version sync: v1]
        C2[Cannot store CompiledGraph ❌]
    end
    
    A1 <--> C1
    B1 <--> C1
    
    D[Workflow updated] --> E[Version bumped: v2]
    E --> C1
    C1 --> F[Other instances detect the version change]
    F --> G[Clear local cache]
    
    style A2 fill:#ccffcc
    style B2 fill:#ccffcc  
    style C2 fill:#ffcccc

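Advantages of the version-number scheme:

It does not depend on system clocks, so clock-synchronization problems disappear
Every update bumps the version number, so the invalidation signal propagates reliably
It is simple to implement and easy to reason about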
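
The behaviour can be checked without Redis. The sketch below is a dependency-free model, not the production class: a plain dict stands in for the Redis version keys, and `VersionedLocalCache` is a made-up name. Two "instances" share that dict, and an eviction on one makes the other drop its local copy on the next read.

```python
from typing import Any, Dict, Optional


class VersionedLocalCache:
    """One app instance's local cache, validated against shared version numbers."""

    def __init__(self, shared_versions: Dict[str, int]) -> None:
        self.shared_versions = shared_versions  # stand-in for Redis
        self.data: Dict[str, Any] = {}
        self.versions: Dict[str, int] = {}

    def set(self, key: str, value: Any) -> None:
        # Remember which shared version this local copy was built against.
        self.data[key] = value
        self.versions[key] = self.shared_versions.get(key, 0)

    def get(self, key: str) -> Optional[Any]:
        if key not in self.data:
            return None
        if self.versions[key] < self.shared_versions.get(key, 0):
            # Another instance bumped the version: drop the stale local copy.
            self._clear_local(key)
            return None
        return self.data[key]

    def evict(self, key: str) -> None:
        # With real Redis this bump could be a single atomic INCR.
        self.shared_versions[key] = self.shared_versions.get(key, 0) + 1
        self._clear_local(key)

    def _clear_local(self, key: str) -> None:
        self.data.pop(key, None)
        self.versions.pop(key, None)


if __name__ == "__main__":
    shared: Dict[str, int] = {}
    a, b = VersionedLocalCache(shared), VersionedLocalCache(shared)
    a.set("graph:123", "compiled-by-A")
    b.set("graph:123", "compiled-by-B")
    a.evict("graph:123")
    print(b.get("graph:123"))  # None: B rebuilds on the next request
```

Because only an integer crosses the network, the unpicklable CompiledGraph never has to leave the process that built it.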

This change cut initialization time on a cache hit from several seconds to 100-200 ms, but the root problem of worker timeouts remained.

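Step 2: Asynchronous task architecture

3.1 Restructuring the architecture

Caching only treats the symptom. The real problem is that the time-consuming workflow execution runs inside the web request. The fix: move execution to a background TaskIQ worker and let the frontend receive results through an event stream.

Problems with the original architecture: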

graph TD
    A[Client request] --> B[FastAPI endpoint]
    B --> C[Initialize CompiledGraph]
    C --> D[Run workflow via astream]
    D --> E[Return result]
    E --> A
    
    F[Gunicorn Worker] --> G{Execution time > 30s?}
    G -->|Yes| H[Worker killed ❌]
    G -->|No| I[Normal response]
    
    style C fill:#ffcccc
    style D fill:#ffcccc
    style H fill:#ff6666

The new asynchronous architecture:

graph TD
    A[Client request] --> B[FastAPI endpoint]
    B --> C[Generate execution_id]
    C --> D[Submit TaskIQ task]
    D --> E[Return execution_id immediately]
    E --> A
    
    F[TaskIQ Worker] --> G[Initialize CompiledGraph]
    G --> H[Run workflow via astream]
    H --> I[Publish events to Redis Stream]
    
    J[Client SSE connection] --> K[Subscribe to Redis Stream]
    K --> L[Receive execution progress in real time]
    
    style F fill:#ccffcc
    style I fill:#66cc66
    style L fill:#66cc66

The API endpoint becomes very simple:

@router.post("/workflow/execute/{workflow_id}")
async def execute_workflow(workflow_id: int, params: dict):
    execution_id = generate_id()
    
    # Submit the asynchronous task
    await execute_workflow_task.kiq(
        workflow_id=workflow_id,
        execution_id=execution_id,
        input_params=params
    )
    
    return {"execution_id": execution_id, "status": "queued"}

The core execution logic lives in the TaskIQ worker:

@taskiq_broker.task
async def execute_workflow_task(workflow_id: int, execution_id: str, input_params: dict):
    try:
        await publish_event("start", execution_id)
        
        # Load the workflow, using the cache
        workflow_graph = await load_graph(workflow_id, use_cache=True)
        
        # Stream the execution with astream
        async for chunk in workflow_graph.graph.astream(input_params):
            await check_cancellation(execution_id)
            
            for node_id, output in chunk.items():
                await publish_event("node_end", execution_id, {
                    "node_id": node_id, 
                    "output": output
                })
        
        await publish_event("complete", execution_id)
        
    except Exception as e:
        await publish_event("error", execution_id, {"error": str(e)})
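
The submit-then-work split can be tried without TaskIQ or Redis. In the sketch below an `asyncio.Queue` stands in for the broker and a dict of lists for the event store. `submit`, `worker`, `demo` and the step names are hypothetical. The point it illustrates is that the caller gets its `execution_id` back before any node has run.

```python
import asyncio
import uuid
from typing import Dict, List

events: Dict[str, List[dict]] = {}  # stand-in for the per-execution event store


async def submit(queue: asyncio.Queue, params: dict) -> str:
    """What the HTTP handler does: enqueue and return immediately."""
    execution_id = uuid.uuid4().hex
    events[execution_id] = []
    await queue.put((execution_id, params))
    return execution_id


async def worker(queue: asyncio.Queue) -> None:
    """What the background worker does: run steps and record events."""
    while True:
        execution_id, params = await queue.get()
        log = events[execution_id]
        log.append({"type": "start"})
        try:
            for step in params["steps"]:
                await asyncio.sleep(0.01)  # stand-in for a slow node
                log.append({"type": "node_end", "node_id": step})
            log.append({"type": "complete"})
        except Exception as exc:
            log.append({"type": "error", "error": str(exc)})
        finally:
            queue.task_done()


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue))
    execution_id = await submit(queue, {"steps": ["plan", "search", "answer"]})
    at_submit = list(events[execution_id])  # still empty: nothing has run yet
    await queue.join()                      # wait for the background work
    worker_task.cancel()
    await asyncio.gather(worker_task, return_exceptions=True)
    return [at_submit, [e["type"] for e in events[execution_id]]]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the real setup the queue and the worker live in separate processes, so a slow node can no longer trip the web worker's timeout.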

3.2 First attempt: Redis PubSub

For the frontend to follow execution progress in real time, we need an event push mechanism. Our first choice was Redis PubSub:

sequenceDiagram
    participant C as Client
    participant API as FastAPI
    participant W as TaskIQ Worker
    participant P as Redis PubSub
    
    C->>API: POST /workflow/execute
    API->>W: Submit async task
    API-->>C: Return execution_id
    
    C->>API: GET /workflow/events (SSE)
    API->>P: Subscribe to channel
    
    W->>P: Publish start event
    P-->>API: Push event
    API-->>C: SSE: start
    
    W->>P: Publish node_end event
    P-->>API: Push event
    API-->>C: SSE: node_end
    
    Note over P: Problem: messages are not persisted

# Publish an event
async def publish_event(event_type: str, execution_id: str, data: dict = None):
    channel = f"workflow:{execution_id}"
    event = {"type": event_type, "data": data, "timestamp": time.time()}
    await redis.publish(channel, json.dumps(event))

# Subscribe to events
@router.get("/workflow/events/{execution_id}")
async def get_events(execution_id: str):
    async def event_generator():
        channel = f"workflow:{execution_id}"
        pubsub = redis.pubsub()
        await pubsub.subscribe(channel)
        
        async for message in pubsub.listen():
            if message['type'] == 'message':
                yield f"data: {message['data']}\n\n"
    
    return EventSourceResponse(event_generator())

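3.3 The problem with PubSub: lost messages

The PubSub approach quickly showed its weakness: messages are not persisted. A client that is not connected at the moment an event is published misses that event for good. In particular:

A page refresh loses all earlier events
Network jitter drops the connection
A client that connects late misses the start event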
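
The late-subscriber case is easy to reproduce with a toy fan-out. This is an in-memory model of fire-and-forget delivery, not Redis itself, and `InMemoryPubSub` and `late_subscriber_demo` are made-up names. A message published before anyone subscribes is delivered to zero receivers and then it is gone.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class InMemoryPubSub:
    """Fire-and-forget fan-out: a message reaches only the current subscribers."""

    def __init__(self) -> None:
        self.subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callable[[str], None]) -> None:
        self.subscribers[channel].append(callback)

    def publish(self, channel: str, message: str) -> int:
        for callback in self.subscribers[channel]:
            callback(message)
        # Like Redis PUBLISH, report how many subscribers received the message.
        return len(self.subscribers[channel])


def late_subscriber_demo() -> Tuple[int, List[str]]:
    bus = InMemoryPubSub()
    received: List[str] = []
    delivered = bus.publish("workflow:abc", "start")   # nobody is listening yet
    bus.subscribe("workflow:abc", received.append)     # client connects late
    bus.publish("workflow:abc", "node_end")
    return delivered, received


if __name__ == "__main__":
    print(late_subscriber_demo())  # (0, ['node_end'])
```

Nothing is stored anywhere, so there is no way to ask for the `start` event afterwards.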

3.4 Improvement: PubSub + queue

To fix message loss, we added a queue that stores the message history:

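graph TD
    A[Worker publishes event] --> B[Redis PubSub]
    A --> C[Redis List stores history]
    
    D[Client connects] --> E[Read history first]
    E --> F[Then subscribe to live messages]
    
    B --> G[Live message push]
    C --> E
    
    H[Problem 1: duplicate messages] --> I[History and live feed overlap]
    J[Problem 2: race condition] --> K[Gap between reading history and subscribing]
    L[Problem 3: manual cleanup] --> M[Risk of memory leaks]
    
    style H fill:#ffcccc
    style J fill:#ffcccc
    style L fill:#ffcccc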

async def publish_event(event_type: str, execution_id: str, data: dict = None):
    event = {"type": event_type, "data": data, "timestamp": time.time()}
    event_json = json.dumps(event)
    
    # Publish and store at the same time
    await redis.publish(f"workflow:{execution_id}", event_json)
    await redis.lpush(f"history:{execution_id}", event_json)
    await redis.expire(f"history:{execution_id}", 3600)  # expire after 1 hour

async def get_events(execution_id: str):
    async def event_generator():
        # Send the history first
        history = await redis.lrange(f"history:{execution_id}", 0, -1)
        for event_json in reversed(history):
            yield f"data: {event_json}\n\n"
        
        # Then subscribe to live messages
        pubsub = redis.pubsub()
        await pubsub.subscribe(f"workflow:{execution_id}")
        
        async for message in pubsub.listen():
            if message['type'] == 'message':
                yield f"data: {message['data']}\n\n"
    
    return EventSourceResponse(event_generator())

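But this approach still has problems:

Duplicate messages: the history and the live feed can overlap
No resumption: after a reconnect the client does not know where to resume reading
Manual cleanup: the history has to be cleaned up by hand, which invites memory leaks
Race condition: messages published in the window between reading the history and subscribing can be missed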
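
The duplicate and race problems are two sides of the same ordering choice. The sketch below is an in-memory model with a hypothetical `Channel` class, in which one event lands between the two client steps. Reading history first loses that event, and subscribing first delivers it twice.

```python
from typing import List


class Channel:
    """A history list plus live subscribers, mimicking 'store and publish'."""

    def __init__(self) -> None:
        self.history: List[str] = []
        self.live: List[List[str]] = []  # one inbox per subscriber

    def publish(self, event: str) -> None:
        self.history.append(event)
        for inbox in self.live:
            inbox.append(event)

    def subscribe(self) -> List[str]:
        inbox: List[str] = []
        self.live.append(inbox)
        return inbox


def read_then_subscribe(ch: Channel, event_in_gap: str) -> List[str]:
    seen = list(ch.history)    # 1. replay history
    ch.publish(event_in_gap)   #    an event lands in the gap
    inbox = ch.subscribe()     # 2. subscribe, too late for that event
    ch.publish("complete")
    return seen + inbox


def subscribe_then_read(ch: Channel, event_in_gap: str) -> List[str]:
    inbox = ch.subscribe()     # 1. subscribe first
    ch.publish(event_in_gap)   #    the event arrives on the live feed ...
    seen = list(ch.history)    # 2. ... and is also in the history
    ch.publish("complete")
    return seen + inbox


if __name__ == "__main__":
    first, second = Channel(), Channel()
    first.publish("start")
    second.publish("start")
    print(read_then_subscribe(first, "node_end"))   # node_end is missing
    print(subscribe_then_read(second, "node_end"))  # node_end appears twice
```

Neither order is correct without a per-message position the client can compare, which is exactly what stream entry IDs provide.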

3.5 Solving it cleanly with Redis Stream

Redis Stream is designed for exactly this kind of problem:

sequenceDiagram
    participant C as Client
    participant API as FastAPI
    participant W as TaskIQ Worker
    participant S as Redis Stream
    
    C->>API: POST /workflow/execute
    API->>W: Submit async task
    API-->>C: Return execution_id
    
    C->>API: GET /workflow/events (SSE)
    API->>S: XRANGE reads history
    S-->>API: Return past events
    API-->>C: SSE: past events
    
    W->>S: XADD publishes new event
    API->>S: XREAD blocks for new events
    S-->>API: Deliver new event
    API-->>C: SSE: new event
    
    Note over C: Network interruption
    C->>API: Reconnect (from last_id)
    API->>S: XREAD from last_id
    S-->>API: Resume with unread messages
    API-->>C: SSE: resumed messages
    
    Note over S: Supports persistence, resumption, de-duplication

async def publish_event(event_type: str, execution_id: str, data: dict = None):
    stream_key = f"workflow:execution:{execution_id}"
    event = {"type": event_type, "data": json.dumps(data), "timestamp": time.time()}
    
    # Append to the Stream
    await redis.xadd(stream_key, event)
    
    # Cap the length and refresh the expiry
    await redis.xtrim(stream_key, maxlen=1000, approximate=True)
    await redis.expire(stream_key, 3600)

async def stream_events(execution_id: str):
    stream_key = f"workflow:execution:{execution_id}"
    
    # Read all past events first
    last_id = "0-0"
    messages = await redis.xrange(stream_key)
    
    for message_id, fields in messages:
        last_id = message_id
        yield parse_event(fields)
    
    # Then block waiting for new events
    while True:
        try:
            streams = await redis.xread({stream_key: last_id}, block=1000, count=10)
            if streams:
                for stream, messages in streams:
                    for message_id, fields in messages:
                        last_id = message_id
                        yield parse_event(fields)
        except Exception as e:
            logger.error(f"XREAD failed: {e}")
            await asyncio.sleep(1)
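
Resumption works because every entry has an ID and the reader asks for "everything after the last ID I saw". The sketch below models that with an in-memory log, using plain integers where Redis uses IDs such as `1700000000000-0`. `InMemoryStream`, `to_sse` and `replay` are hypothetical names. It also shows one way to carry the ID to the browser: an SSE frame with an `id:` line, which an EventSource client sends back in the `Last-Event-ID` header when it reconnects.

```python
import json
from typing import Dict, List, Optional, Tuple


class InMemoryStream:
    """Append-only log with increasing IDs, loosely modelled on XADD / XREAD."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, Dict[str, str]]] = []

    def add(self, fields: Dict[str, str]) -> int:
        entry_id = len(self.entries) + 1
        self.entries.append((entry_id, fields))
        return entry_id

    def read_after(self, last_id: int) -> List[Tuple[int, Dict[str, str]]]:
        """Return entries whose ID is strictly greater than last_id."""
        return [(i, f) for i, f in self.entries if i > last_id]


def to_sse(entry_id: int, fields: Dict[str, str]) -> str:
    """Format one SSE frame; the id line lets the client resume later."""
    return f"id: {entry_id}\ndata: {json.dumps(fields)}\n\n"


def replay(stream: InMemoryStream, last_event_id: Optional[str]) -> List[str]:
    """Frames a client should get, given the Last-Event-ID it sent (if any)."""
    start = int(last_event_id) if last_event_id else 0
    return [to_sse(i, f) for i, f in stream.read_after(start)]


if __name__ == "__main__":
    stream = InMemoryStream()
    for event_type in ("start", "node_end", "complete"):
        stream.add({"type": event_type})
    print(len(replay(stream, None)))  # a fresh client gets all three frames
    print(replay(stream, "2"))        # a reconnecting client gets only entry 3
```

Since the position travels with the client, the server needs no per-connection state to avoid duplicates or gaps.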

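Redis Stream vs PubSub comparison

Feature	PubSub	PubSub + Queue	Redis Stream
Message persistence	❌	✅	✅
History retrieval	❌	✅	✅
Resume after reconnect	❌	❌	✅
Message de-duplication	❌	❌	✅
Memory management	Automatic	Manual	Built-in trimming
Implementation complexity	Low	Medium	Medium
Redis version required	2.0+	2.0+	5.0+

Redis Stream requires Redis 5.0+, but most environments meet that today, and the benefits far outweigh the cost.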

3.6 Task cancellation

Agent applications usually need to support cancellation. We implement it with a cancel flag stored in Redis:

import asyncio

async def cancel_task(execution_id: str):
    """Set the cancel flag."""
    await redis.setex(f"cancel:{execution_id}", 300, "1")

async def check_cancellation(execution_id: str):
    """Check the cancel flag."""
    if await redis.exists(f"cancel:{execution_id}"):
        await publish_event("cancelled", execution_id)
        raise asyncio.CancelledError("task cancelled")

# Check the cancel flag at key points
async for chunk in workflow_graph.graph.astream(input_params):
    await check_cancellation(execution_id)
    # handle chunk...
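
The same cooperative pattern can be exercised without Redis. In this sketch a set stands in for the `cancel:{execution_id}` keys, and a dedicated exception replaces `asyncio.CancelledError`. That swap is my own illustration choice rather than the original code: `CancelledError` derives from `BaseException` since Python 3.8, so an `except Exception` handler does not see it. `WorkflowCancelled` and `run_steps` are hypothetical names.

```python
import asyncio
from typing import Callable, List, Optional, Set, Tuple

cancel_flags: Set[str] = set()  # stand-in for the Redis cancel:{execution_id} keys


class WorkflowCancelled(Exception):
    """Raised between steps once a cancel flag is seen (hypothetical name)."""


async def run_steps(
    execution_id: str,
    steps: List[str],
    on_step_done: Optional[Callable[[str], None]] = None,
) -> Tuple[str, List[str]]:
    done: List[str] = []
    try:
        for step in steps:
            if execution_id in cancel_flags:  # checked once per step
                raise WorkflowCancelled(execution_id)
            await asyncio.sleep(0)            # stand-in for node work
            done.append(step)
            if on_step_done:
                on_step_done(step)
        return "complete", done
    except WorkflowCancelled:
        return "cancelled", done
    finally:
        cancel_flags.discard(execution_id)    # clean up the flag either way


def cancel_after_first_step(step: str) -> None:
    """Simulates a user pressing cancel right after step 'a' finishes."""
    if step == "a":
        cancel_flags.add("exec-1")


if __name__ == "__main__":
    print(asyncio.run(run_steps("exec-1", ["a", "b", "c"], cancel_after_first_step)))
    print(asyncio.run(run_steps("exec-2", ["a", "b", "c"])))
```

Cancellation only takes effect at the next check, so a node that is already running finishes first. Long nodes therefore need their own checks if a faster reaction matters.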

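Results

Performance before and after the full rework:

Initialization performance:

Cold start: 3-5 seconds (unchanged)
Cache hit: 100-200 ms (about 95% faster)
Cache hit rate: 85%+

System stability:

Worker timeouts: fully resolved
Task success rate: 70% → 99%+
Concurrent capacity: 5x higher
Response time: seconds → milliseconds

User experience:

Real-time progress display
Reconnection and history replay
Task cancellation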

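Pitfalls we hit

Redis connection leaks

Under high concurrency the connection pool is easily exhausted, so remember to close connections promptly: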

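# Wrong
async def publish_event(...):
    await redis.xadd(stream_key, event)
    # the connection is never closed

# Right
# (assumes `redis` is a client opened for this call; a shared pooled client
#  would instead be closed once, at application shutdown)
async def publish_event(...):
    try:
        await redis.xadd(stream_key, event)
    finally:
        await redis.close()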
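
Another place where a connection can stay open is a streaming generator whose client goes away mid-stream. Whether that was the source of our leak is not established here, so treat this as a sketch of the general pattern: `FakeSubscription` is a made-up stand-in for any connection-like resource. A `try/finally` inside the async generator releases the resource when the consumer stops iterating.

```python
import asyncio
from typing import AsyncIterator, Tuple


class FakeSubscription:
    """Stand-in for a pub/sub or stream connection that must be released."""

    open_count = 0

    def __init__(self) -> None:
        FakeSubscription.open_count += 1
        self.closed = False

    async def next_message(self) -> str:
        await asyncio.sleep(0)
        return "event"

    async def close(self) -> None:
        if not self.closed:
            self.closed = True
            FakeSubscription.open_count -= 1


async def event_stream() -> AsyncIterator[str]:
    sub = FakeSubscription()
    try:
        while True:
            yield await sub.next_message()
    finally:
        # Runs when the generator is closed, including when the consumer stops early.
        await sub.close()


async def demo() -> Tuple[int, int]:
    stream = event_stream()
    await stream.__anext__()
    while_streaming = FakeSubscription.open_count
    await stream.aclose()  # the consumer stops iterating before the stream ends
    return while_streaming, FakeSubscription.open_count


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (1, 0)
```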

Memory leak monitoring

CompiledGraph objects are large, so memory usage needs monitoring:

from pympler import asizeof

def monitor_memory(workflow_graph):
    # asizeof follows references and returns the deep size of the object in bytes
    size = asizeof.asizeof(workflow_graph)
    logger.info(f"WorkflowGraph memory usage: {size / 1024:.2f} KB")
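
If adding a dependency is not an option, the standard library's `tracemalloc` gives a rough view of how much memory building an object costs. This is a sketch under the assumption that the builder is a plain callable. `measure_allocation` is a hypothetical helper, and it only sees allocations made through Python's allocators.

```python
import tracemalloc
from typing import Any, Callable


def measure_allocation(build: Callable[[], Any]) -> int:
    """Return the bytes still allocated by build() right after it returns."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    obj = build()                      # keep a reference so nothing is freed yet
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del obj
    return after - before


if __name__ == "__main__":
    size = measure_allocation(lambda: [bytes(1000) for _ in range(100)])
    print(f"allocated: {size / 1024:.2f} KB")
```

Calling it around the graph-building step on each cache miss, and logging the result, makes a steadily growing footprint visible.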

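Summary

This round of optimization solved two core problems:

Initialization bottleneck: a multi-level cache synchronized by version numbers sped up responses substantially while preserving consistency. The key was designing a cache strategy around the fact that CompiledGraph cannot be serialized.
Worker timeouts: an asynchronous task architecture moved the time-consuming work out of the web request, which eliminated the timeout kills.

On technology choice, Redis Stream has clear advantages over PubSub for event streaming, especially message persistence and resumption after a reconnect.

For similar AI applications, plan for asynchronous processing at the architecture design stage. The cache strategy also has to be designed around the specific serialization constraints rather than copied from conventional approaches.

Most importantly, optimizing a complex system usually takes a combination of measures from several angles; a single fix rarely solves the problem completely.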