Advanced LLM Streaming Output Engineering for 2026: A Complete Hands-On Guide from SSE to Multimodal Real-Time Rendering

Introduction: Why Streaming Output Matters

How users experience waiting for an AI reply comes down largely to two metrics: TTFT (time to first token) and perceived fluency. An interface that takes 30 seconds to show a complete reply feels far worse than one that starts "typing" within a few hundred milliseconds, even when the final content is identical.

Streaming output is the core technique for fixing this experience. But in engineering practice, streaming is not just "chunking the response to the frontend". From choosing SSE as the transport, through reconnection handling and frontend rendering optimization, to mixed multimodal streams, there is a lot of detail worth digging into.


1. Protocol Fundamentals of Streaming Output

1.1 SSE vs. WebSocket: How to Choose

| Dimension | SSE (Server-Sent Events) | WebSocket |
| --- | --- | --- |
| Direction | One-way (server → client) | Two-way |
| Protocol | HTTP/1.1 or HTTP/2 | WS/WSS |
| Browser support | Native, no extra library needed | Native |
| Load balancing | Friendly (plain HTTP) | Needs special handling (sticky sessions) |
| Reconnection | Built-in automatic reconnect | Must be implemented manually |
| Best fit | LLM streaming output (recommended) | Bidirectional real-time communication |

Conclusion: SSE is the recommended transport for LLM streaming output, for these reasons:

  1. LLM output flows one way; WebSocket's duplex capability is unnecessary
  2. SSE has reconnection built in (the Last-Event-ID mechanism)
  3. Over HTTP/2, SSE streams can be multiplexed, which is more efficient
  4. Middleware (nginx, CDNs) handles SSE more gracefully, since it is plain HTTP

1.2 The SSE Protocol Format

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"token": "你好", "index": 0}\n\n

data: {"token": ",我是", "index": 1}\n\n

data: {"token": "AI助手", "index": 2}\n\n

data: [DONE]\n\n

Key rules:

  • Each message is terminated by a blank line (two newline characters, \n\n)
  • The data: field carries the actual payload
  • The id: field enables resumption after a dropped connection
  • The event: field declares a custom event type
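
For example, a single event that uses id: and a custom event type looks like this on the wire (retry:, also defined by the spec, tells the browser how many milliseconds to wait before auto-reconnecting):

id: 42
event: token
retry: 3000
data: {"token": "hello", "index": 42}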

2. Backend: Building a High-Performance Streaming API

2.1 An SSE Implementation with FastAPI

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import openai
import json
import asyncio

app = FastAPI()
client = openai.AsyncOpenAI()

async def generate_stream(messages: list, model: str = "gpt-4o"):
    """Core streaming generator."""
    
    # Note: the .stream() helper surface varies across openai SDK versions;
    # older releases expose it under client.beta.chat.completions
    async with client.chat.completions.stream(
        model=model,
        messages=messages,
        temperature=0.7,
    ) as stream:
        async for text in stream.text_stream:
            # Frame each delta as an SSE event
            data = json.dumps({"token": text, "type": "text"}, ensure_ascii=False)
            yield f"data: {data}\n\n"
        
        # Emit a completion signal plus token usage stats
        # (usage can be None if the server doesn't report it)
        usage = (await stream.get_final_completion()).usage
        final_data = json.dumps({
            "type": "done",
            "usage": {
                "prompt_tokens": usage.prompt_tokens if usage else None,
                "completion_tokens": usage.completion_tokens if usage else None,
            }
        })
        yield f"data: {final_data}\n\n"

@app.post("/api/chat/stream")
async def chat_stream(request: Request):
    body = await request.json()
    messages = body.get("messages", [])
    
    return StreamingResponse(
        generate_stream(messages),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # 禁用 nginx 缓冲
            "Connection": "keep-alive",
        }
    )
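
A quick way to smoke-test the endpoint is a minimal Python client built on httpx; the URL and payload below are placeholders for a local run:

import asyncio
import httpx

async def consume():
    async with httpx.AsyncClient(timeout=None) as http:
        async with http.stream(
            "POST",
            "http://localhost:8000/api/chat/stream",  # assumed local address
            json={"messages": [{"role": "user", "content": "Hello"}]},
        ) as resp:
            # Print each SSE payload as it arrives
            async for line in resp.aiter_lines():
                if line.startswith("data: "):
                    print(line[6:], flush=True)

asyncio.run(consume())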

2.2 Handling Disconnects and Timeouts

async def generate_stream_safe(messages: list, session_id: str):
    """Streaming generation with an overall timeout and structured error events."""
    
    try:
        # Hard cap on total generation time (asyncio.timeout requires Python 3.11+)
        async with asyncio.timeout(30):
            async with client.chat.completions.stream(
                model="gpt-4o",
                messages=messages,
            ) as stream:
                token_index = 0
                async for text in stream.text_stream:
                    data = json.dumps({
                        "token": text,
                        "index": token_index,
                        "session_id": session_id
                    }, ensure_ascii=False)
                    # The id: field lets a reconnecting client resume from here
                    yield f"id: {token_index}\ndata: {data}\n\n"
                    token_index += 1
                    
    except asyncio.TimeoutError:
        error_data = json.dumps({"type": "error", "message": "Generation timed out, please retry"})
        yield f"data: {error_data}\n\n"
    
    except openai.APIError as e:
        error_data = json.dumps({"type": "error", "message": f"API error: {str(e)}"})
        yield f"data: {error_data}\n\n"
    
    finally:
        yield "data: [DONE]\n\n"

2.3 Token Buffering Strategies

Emitting every token as its own SSE event produces a flood of tiny packets, wasting bandwidth and framing overhead. A buffering strategy trades a few milliseconds of latency for fewer, larger writes:

async def generate_stream_buffered(messages: list, buffer_size: int = 5):
    """Batch several tokens per SSE event to reduce network overhead."""
    
    buffer = []
    
    async with client.chat.completions.stream(model="gpt-4o", messages=messages) as stream:
        async for text in stream.text_stream:
            buffer.append(text)
            
            # Flush when the buffer is full or the token contains sentence-ending punctuation
            if len(buffer) >= buffer_size or any(c in text for c in "。!?\n"):
                chunk = "".join(buffer)
                data = json.dumps({"chunk": chunk}, ensure_ascii=False)
                yield f"data: {data}\n\n"
                buffer = []
        
        # Flush whatever is left in the buffer
        if buffer:
            chunk = "".join(buffer)
            data = json.dumps({"chunk": chunk}, ensure_ascii=False)
            yield f"data: {data}\n\n"
    
    yield "data: [DONE]\n\n"

3. Frontend: Smooth Real-Time Rendering

3.1 Receiving the Stream with fetch + ReadableStream

class StreamingChat {
    constructor(apiUrl) {
        this.apiUrl = apiUrl;
    }
    
    async sendMessage(messages, onToken, onDone, onError) {
        // fetch + ReadableStream instead of EventSource:
        // EventSource only supports GET and cannot carry a JSON body
        const response = await fetch(this.apiUrl, {
        const response = await fetch(this.apiUrl, {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'Accept': 'text/event-stream',
            },
            body: JSON.stringify({ messages }),
        });
        
        if (!response.ok) {
            onError(new Error(`HTTP ${response.status}`));
            return;
        }
        
        const reader = response.body.getReader();
        const decoder = new TextDecoder();
        let buffer = '';
        
        while (true) {
            const { done, value } = await reader.read();
            if (done) break;
            
            buffer += decoder.decode(value, { stream: true });
            
            // Parse SSE frames line by line
            const lines = buffer.split('\n');
            buffer = lines.pop(); // Keep any incomplete trailing line in the buffer
            
            for (const line of lines) {
                if (line.startsWith('data: ')) {
                    const data = line.slice(6);
                    if (data === '[DONE]') {
                        onDone();
                        return;
                    }
                    try {
                        const parsed = JSON.parse(data);
                        if (parsed.token) onToken(parsed.token);
                        if (parsed.type === 'error') onError(new Error(parsed.message));
                    } catch (e) {
                        console.warn('Failed to parse SSE payload:', e);
                    }
                }
            }
        }
    }
}

3.2 A React Streaming-Render Component

import { useState, useCallback, useRef, useEffect } from 'react';

function StreamingMessage() {
    const [content, setContent] = useState('');
    const [isStreaming, setIsStreaming] = useState(false);
    const contentRef = useRef('');  // Accumulate outside state to avoid a re-render per token
    const rafRef = useRef(null);
    
    // Cancel any pending animation frame if the component unmounts mid-stream
    useEffect(() => () => {
        if (rafRef.current) cancelAnimationFrame(rafRef.current);
    }, []);
    
    const flushContent = useCallback(() => {
        setContent(contentRef.current);
        rafRef.current = null;
    }, []);
    
    const handleToken = useCallback((token) => {
        contentRef.current += token;
        
        // Batch DOM updates with requestAnimationFrame so re-renders happen at most once per frame
        if (!rafRef.current) {
            rafRef.current = requestAnimationFrame(flushContent);
        }
    }, [flushContent]);
    
    const sendMessage = useCallback(async (messages) => {
        setIsStreaming(true);
        contentRef.current = '';
        setContent('');
        
        const chat = new StreamingChat('/api/chat/stream');
        
        await chat.sendMessage(
            messages,
            handleToken,
            () => setIsStreaming(false),
            (err) => {
                console.error(err);
                setIsStreaming(false);
            }
        );
    }, [handleToken]);
    
    return (
        <div className="message-container">
            <div className="message-content">
                {content}
                {isStreaming && <span className="cursor-blink">|</span>}
            </div>
        </div>
    );
}

4. Optimizing Real-Time Markdown Rendering

LLM output is usually Markdown, and rendering it while it streams has a particular failure mode: a code block that is still being typed flips between plain text and a fenced block on every token, making the page flicker.

import { marked } from 'marked';
import DOMPurify from 'dompurify';

class StreamingMarkdownRenderer {
    constructor() {
        this.buffer = '';
    }
    
    addToken(token) {
        this.buffer += token;
        return this.render();
    }
    
    render() {
        // Count fence markers to detect an unterminated code block
        const codeBlockMatches = (this.buffer.match(/```/g) || []).length;
        const inCodeBlock = codeBlockMatches % 2 !== 0;
        
        if (inCodeBlock) {
            // Temporarily close the open fence so marked parses it as a code block
            const rendered = marked(this.buffer + '\n```');
            return DOMPurify.sanitize(rendered);
        }
        
        return DOMPurify.sanitize(marked(this.buffer));
    }
}

5. Production Considerations

5.1 Nginx Configuration

location /api/chat/stream {
    proxy_pass http://backend;
    proxy_buffering off;           # Disable buffering so chunks are forwarded immediately
    proxy_cache off;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding on;
    proxy_read_timeout 120s;       # Streaming responses need a generous read timeout
}
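
Even with this config, some proxies and load balancers drop connections that go quiet, for example during a long tool call. The SSE spec allows comment lines starting with a colon, which clients silently ignore, so a heartbeat can keep the pipe warm. A minimal wrapper sketch (the 15-second interval is an assumption):

import asyncio

async def with_heartbeat(source, interval: float = 15.0):
    """Forward chunks from `source`, emitting an SSE comment during idle gaps."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()

    async def pump():
        async for chunk in source:
            await queue.put(chunk)
        await queue.put(done)

    task = asyncio.create_task(pump())
    try:
        while True:
            try:
                item = await asyncio.wait_for(queue.get(), timeout=interval)
            except asyncio.TimeoutError:
                yield ": keep-alive\n\n"  # comment line, ignored by SSE clients
                continue
            if item is done:
                break
            yield item
    finally:
        task.cancel()

# Usage: StreamingResponse(with_heartbeat(generate_stream(messages)), ...)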

5.2 Reconnection Handling

function connectWithRetry(url, options, maxRetries = 3) {
    // Note: EventSource only supports GET; the fetch-based client in
    // section 3.1 needs its own retry loop around sendMessage()
    let retries = 0;
    
    function connect() {
        const eventSource = new EventSource(url);
        
        eventSource.onopen = () => {
            retries = 0; // Reset the retry budget once a connection succeeds
        };
        
        eventSource.onerror = () => {
            eventSource.close();
            if (retries < maxRetries) {
                retries++;
                const delay = Math.pow(2, retries) * 1000; // Exponential backoff
                setTimeout(connect, delay);
            }
        };
        
        return eventSource;
    }
    
    return connect();
}
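
On reconnect, EventSource automatically sends the last id it saw in a Last-Event-ID request header, which is what makes the id: field from section 1.2 useful. A server-side resume sketch (the in-memory token_cache and the GET endpoint shape are hypothetical; production would keep this state in Redis or similar):

from fastapi import Request
from fastapi.responses import StreamingResponse

# Hypothetical per-session cache of chunks already produced
token_cache: dict[str, list[str]] = {}

@app.get("/api/chat/stream/{session_id}")
async def resume_stream(session_id: str, request: Request):
    last_id = request.headers.get("last-event-id")
    start = int(last_id) + 1 if last_id is not None else 0

    async def replay():
        # Re-send everything the client missed, keeping the same ids
        for i, chunk in enumerate(token_cache.get(session_id, [])[start:], start=start):
            yield f"id: {i}\ndata: {chunk}\n\n"
        # ...then continue streaming newly generated tokens...

    return StreamingResponse(replay(), media_type="text/event-stream")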

Conclusion

Streaming output engineering is central to optimizing the experience of LLM applications. From picking the right transport protocol, through frontend rendering optimization, to timeout and reconnection handling in production, every link in the chain directly shapes the "AI response speed" users perceive.

A good streaming implementation makes the AI feel like it is thinking in real time; a poor one makes it feel like the AI is dumping everything at once, even when the same model sits behind both. That is exactly where the value of streaming output engineering lies.