Multimodal RAG in Practice: A Complete Implementation of a Mixed Image-Text Retrieval System

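Text-only RAG is now standard. But real-world knowledge is often not pure text: product manuals contain flowcharts, technical documents contain architecture diagrams, financial reports contain data tables, and medical reports contain scans. A RAG system that can only process text will never be able to read this information.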

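Multimodal RAG aims to let retrieval and answer generation understand images, tables, and text together. This article walks through the full build path from an engineering perspective.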

Core Challenges of Multimodal RAG

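Before writing any code, it helps to understand the nature of the problem:

Challenge 1: Images cannot be embedded the traditional way. Traditional text RAG uses a text embedding model to turn text into vectors. An image has no direct "text embedding"; its content has to be understood before it can be retrieved.

Challenge 2: Cross-modal semantic alignment. Users ask in text, but the relevant content may live in an image. How can a text query find an image? The "text vectors" and "image vectors" need to sit in the same semantic space.

Challenge 3: Mixed image-text context. The retrieved content may contain both images and text. How should they be combined into input that an LLM can understand?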

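Three Mainstream Architectures

Architecture 1: Image to Text (Caption-based)

The simplest approach: use a Vision LLM to describe each image in text, then run ordinary text RAG:

Image → Vision LLM generates a description → text embedding → vector store
Query → text embedding → vector search → matching description text → answer

Pros: simple to implement; no multimodal embedding model required
Cons: the description may lose image detail; at answer time the model sees only the description, not the original image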

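Architecture 2: Multimodal Embedding (CLIP-based)

Use a CLIP-style model, whose image and text encoders share one embedding space:

Image → CLIP image encoder → vector store
Text → CLIP text encoder → query vector
→ cross-modal retrieval (text-to-image, image-to-image) → return the original image → multimodal LLM answers

Pros: text queries can find images directly, and retrieval quality is high
Cons: requires a multimodal embedding model, which costs more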

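Architecture 3: Hybrid (Recommended for Production)

Document parsing
  ├── Text chunks → text embedding → vector store (text index)
  └── Image chunks → CLIP embedding + caption → vector store (image index)

Query
  ├── Text retrieval → Top-K text chunks
  └── Image retrieval → Top-K relevant images
  ↓
Merge + rerank
  ↓
Multimodal LLM (receives text + original image files)
  ↓
Final answer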
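
The "merge + rerank" step is not spelled out in this article. One common, model-free option is reciprocal rank fusion (RRF), which combines ranked lists using ranks alone, so scores from different encoders never have to be on the same scale. The sketch below is an assumption rather than the method used here: the function name `rrf_merge`, the constant `k=60`, and the idea of feeding it three lists (text hits, caption hits, CLIP hits) are all illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_merge(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher fused score ranks first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical IDs: an image can be hit both through its caption and through CLIP
text_hits = ["text-7", "text-2"]
caption_hits = ["img-3", "img-1"]
clip_hits = ["img-1", "img-3", "img-9"]

merged = rrf_merge([text_hits, caption_hits, clip_hits])
print(merged)
```

Images found by both the caption search and the CLIP search rise to the top, which is usually the behavior you want from a hybrid index. A cross-encoder or LLM-based reranker could replace this step when latency budgets allow.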

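Implementation: Document Parsing and Indexing

Install dependencies

pip install "unstructured[all-docs]"  # document parsing (quoted so shells like zsh do not expand the brackets)
pip install pillow pdf2image          # image handling
pip install sentence-transformers     # text embeddings
pip install open-clip-torch           # CLIP multimodal embeddings
pip install qdrant-client             # vector store
pip install openai                    # multimodal LLM

Document parsing: extracting images and text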

import os, base64, hashlib
from pathlib import Path
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from unstructured.partition.pdf import partition_pdf
from PIL import Image
import io

@dataclass
class TextChunk:
    content: str
    page_num: int
    source_file: str
    chunk_id: str
    metadata: Dict[str, Any]

@dataclass
class ImageChunk:
    image_data: bytes           # raw image bytes
    image_b64: str              # base64-encoded image (passed to the LLM)
    caption: Optional[str]      # image description, filled in later
    page_num: int
    source_file: str
    chunk_id: str
    metadata: Dict[str, Any]


def parse_pdf_multimodal(pdf_path: str, output_dir: str = "/tmp/pdf_images") -> tuple[List[TextChunk], List[ImageChunk]]:
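    """Parse a PDF and extract text chunks and image chunks separately."""
    os.makedirs(output_dir, exist_ok=True)
    
    # Parse with unstructured and extract embedded images at the same time.
    # Note: newer unstructured releases deprecate extract_images_in_pdf and
    # image_output_dir_path in favor of extract_image_block_types and
    # extract_image_block_output_dir; adjust to your installed version.
    elements = partition_pdf(
        filename=pdf_path,
        extract_images_in_pdf=True,
        infer_table_structure=True,
        strategy="hi_res",          # high-quality, layout-aware parsing
        image_output_dir_path=output_dir,
    )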
    
    text_chunks = []
    image_chunks = []
    
    for element in elements:
        elem_type = type(element).__name__
        page_num = element.metadata.page_number or 0
        
        # Text-like elements
        if elem_type in ("NarrativeText", "Title", "ListItem", "Table"):
            content = str(element)
            if len(content.strip()) < 20:  # skip very short fragments
                continue
            
            chunk_id = hashlib.md5(f"{pdf_path}:{page_num}:{content[:50]}".encode()).hexdigest()[:12]
            text_chunks.append(TextChunk(
                content=content,
                page_num=page_num,
                source_file=pdf_path,
                chunk_id=chunk_id,
                metadata={
                    "type": elem_type,
                    "is_table": elem_type == "Table",
                }
            ))
        
        # Image elements
        elif elem_type == "Image":
            img_path = element.metadata.image_path
            if img_path and os.path.exists(img_path):
                with open(img_path, "rb") as f:
                    img_data = f.read()
                
                img_b64 = base64.b64encode(img_data).decode()
                # Include the image bytes in the hash so that several images
                # on the same page get distinct IDs
                chunk_id = hashlib.md5(f"{pdf_path}:{page_num}:".encode() + img_data).hexdigest()[:12]
                
                image_chunks.append(ImageChunk(
                    image_data=img_data,
                    image_b64=img_b64,
                    caption=None,       # generated later
                    page_num=page_num,
                    source_file=pdf_path,
                    chunk_id=chunk_id,
                    metadata={"image_path": img_path}
                ))
    
    return text_chunks, image_chunks

Generating captions for images

import asyncio
from openai import AsyncOpenAI

async def generate_image_captions(
    image_chunks: List[ImageChunk],
    client: AsyncOpenAI,
    batch_size: int = 5
) -> List[ImageChunk]:
    """Generate captions for images in concurrent batches."""
    
    async def caption_one(chunk: ImageChunk) -> ImageChunk:
        try:
            resp = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": [
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{chunk.image_b64}",
                                "detail": "high"
                            }
                        },
                        {
                            "type": "text",
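                            "text": """Please describe this image in detail, including:
1. The image type (flowchart, architecture diagram, table, screenshot, photo, etc.)
2. The main content and key information
3. Key data, labels, or text shown in the image
4. The main point the image conveys

Write the description in Chinese and make it as detailed as possible so it can be used for retrieval later."""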
                        }
                    ]
                }],
                max_tokens=500,
            )
            chunk.caption = resp.choices[0].message.content
        except Exception as e:
            print(f"Caption generation failed for {chunk.chunk_id}: {e}")
            chunk.caption = "Image content could not be parsed"
        return chunk
    
    # Process in concurrent batches
    results = []
    for i in range(0, len(image_chunks), batch_size):
        batch = image_chunks[i:i+batch_size]
        batch_results = await asyncio.gather(*[caption_one(c) for c in batch])
        results.extend(batch_results)
        await asyncio.sleep(1)  # crude rate limiting between batches
    
    return results
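
Fixed batches plus a one-second sleep are easy to reason about, but every batch waits for its slowest request. A possible alternative, sketched below under stated assumptions, is to bound the number of in-flight requests with `asyncio.Semaphore`. The helper `map_with_concurrency` and the stand-in `fake_caption` are hypothetical names introduced for illustration; in the real pipeline the worker would be the per-image captioning coroutine.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_with_concurrency(
    items: List[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrency: int = 5,
) -> List[R]:
    """Run worker over items with at most max_concurrency calls in flight.

    Results come back in the same order as items.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        # Wait for a free slot, then run the worker while holding it
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_caption(image_id: str) -> str:
    # Hypothetical stand-in for the Vision LLM call
    await asyncio.sleep(0.01)
    return f"caption for {image_id}"


if __name__ == "__main__":
    captions = asyncio.run(map_with_concurrency(["a", "b", "c"], fake_caption, 2))
    print(captions)
```

With this shape a new request starts as soon as any earlier one finishes, so throughput stays close to the concurrency limit. It does not replace proper handling of rate-limit errors from the API.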

Implementation: Multimodal Embedding and Indexing

import open_clip
import torch
import numpy as np
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct
)


class MultimodalIndexer:
    def __init__(self, qdrant_url: str = "http://localhost:6333"):
        self.client = QdrantClient(url=qdrant_url)
        
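        # Text embedding model
        self.text_encoder = SentenceTransformer("BAAI/bge-large-zh-v1.5")
        self.text_dim = 1024
        
        # CLIP multimodal embedding (text and images share one space).
        # Note: the OpenAI CLIP checkpoints were trained mainly on English captions,
        # so non-English queries may need a multilingual CLIP checkpoint instead.
        self.clip_model, _, self.clip_preprocess = open_clip.create_model_and_transforms(
            "ViT-L-14", pretrained="openai"
        )
        self.clip_tokenizer = open_clip.get_tokenizer("ViT-L-14")
        self.clip_model.eval()
        self.clip_dim = 768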
        
        self._init_collections()
    
    def _init_collections(self):
        """Create the vector collections if they do not exist yet."""
        existing = {c.name for c in self.client.get_collections().collections}
        
        # Text collection
        if "text_chunks" not in existing:
            self.client.create_collection(
                "text_chunks",
                vectors_config=VectorParams(size=self.text_dim, distance=Distance.COSINE)
            )
        
        # Image collection (CLIP vectors)
        if "image_chunks" not in existing:
            self.client.create_collection(
                "image_chunks",
                vectors_config=VectorParams(size=self.clip_dim, distance=Distance.COSINE)
            )
    
    def index_text_chunks(self, chunks: List[TextChunk]):
        """Index text chunks."""
        if not chunks:
            return
        contents = [c.content for c in chunks]
        vectors = self.text_encoder.encode(
            contents, 
            batch_size=32, 
            show_progress_bar=True,
            normalize_embeddings=True
        )
        
        points = [
            PointStruct(
                # Stable ID derived from the hex chunk_id; enumerate() indices would
                # collide with, and overwrite, points from previously indexed documents
                id=int(c.chunk_id, 16),
                vector=v.tolist(),
                payload={
                    "content": c.content,
                    "page_num": c.page_num,
                    "source_file": c.source_file,
                    "chunk_id": c.chunk_id,
                    **c.metadata,
                }
            )
            for c, v in zip(chunks, vectors)
        ]
        
        self.client.upsert("text_chunks", points=points)
        print(f"Indexed {len(points)} text chunks")
    
    def index_image_chunks(self, chunks: List[ImageChunk]):
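        """Index image chunks with the CLIP image encoder."""
        points = []
        for chunk in chunks:
            try:
                # Encode the image with CLIP
                img = Image.open(io.BytesIO(chunk.image_data)).convert("RGB")
                img_tensor = self.clip_preprocess(img).unsqueeze(0)
                
                with torch.no_grad():
                    img_vector = self.clip_model.encode_image(img_tensor)
                    img_vector = img_vector / img_vector.norm(dim=-1, keepdim=True)
                
                # Stable ID from source, page, and image bytes, so indexing another
                # document does not overwrite unrelated points
                digest = hashlib.md5(
                    f"{chunk.source_file}:{chunk.page_num}:".encode() + chunk.image_data
                ).hexdigest()
                points.append(PointStruct(
                    id=int(digest[:16], 16),
                    vector=img_vector[0].tolist(),
                    payload={
                        "caption": chunk.caption or "",
                        "page_num": chunk.page_num,
                        "source_file": chunk.source_file,
                        "chunk_id": chunk.chunk_id,
                        "image_b64": chunk.image_b64,  # keep the original image for answering
                    }
                ))
            except Exception as e:
                print(f"Failed to index image {chunk.chunk_id}: {e}")
        
        if points:  # skip the call when no image could be encoded
            self.client.upsert("image_chunks", points=points)
        print(f"Indexed {len(points)} images")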
    
    def search_text(self, query: str, top_k: int = 5) -> List[Dict]:
        """Semantic text retrieval."""
        query_vector = self.text_encoder.encode(
            query, normalize_embeddings=True
        ).tolist()
        
        results = self.client.search(
            "text_chunks", 
            query_vector=query_vector, 
            limit=top_k,
            with_payload=True,
        )
        return [{"score": r.score, **r.payload} for r in results]
    
    def search_images(self, query: str, top_k: int = 3) -> List[Dict]:
        """Find relevant images from a text query (cross-modal retrieval)."""
        # Encode the query with the CLIP text encoder
        text_tokens = self.clip_tokenizer([query])
        with torch.no_grad():
            text_vector = self.clip_model.encode_text(text_tokens)
            text_vector = text_vector / text_vector.norm(dim=-1, keepdim=True)
        
        results = self.client.search(
            "image_chunks",
            query_vector=text_vector[0].tolist(),
            limit=top_k,
            with_payload=True,
        )
        return [{"score": r.score, **r.payload} for r in results]

Implementation: Multimodal Answer Generation

class MultimodalRAG:
    def __init__(self, indexer: MultimodalIndexer, openai_client: AsyncOpenAI):
        self.indexer = indexer
        self.client = openai_client
    
    async def answer(
        self, 
        question: str,
        text_top_k: int = 5,
        image_top_k: int = 3,
        image_score_threshold: float = 0.25,  # CLIP score threshold
    ) -> Dict:
        # 1. Retrieve text and images
        text_results = self.indexer.search_text(question, top_k=text_top_k)
        image_results = self.indexer.search_images(question, top_k=image_top_k)
        
        # 2. Filter out low-scoring image results
        relevant_images = [
            r for r in image_results 
            if r["score"] >= image_score_threshold
        ]
        
        # 3. Build the multimodal message
        message_content = []
        
        # Add the text context first
        if text_results:
            context_text = "\n\n".join([
                f"[Document excerpt {i+1} (page {r['page_num']})]\n{r['content']}"
                for i, r in enumerate(text_results)
            ])
            message_content.append({
                "type": "text",
                "text": f"## Relevant document content\n\n{context_text}\n\n"
            })
        
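        # Add the relevant images
        for i, img in enumerate(relevant_images):
            message_content.append({
                "type": "text",
                # Missing captions are stored as "", so fall back with `or` rather than a .get() default
                "text": f"\n## Relevant image {i+1} (page {img['page_num']}, relevance: {img['score']:.2f})\nCaption: {img.get('caption') or 'none'}\n"
            })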
            message_content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{img['image_b64']}",
                    "detail": "high"
                }
            })
        
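        # Add the question
        message_content.append({
            "type": "text",
            "text": f"\n## User question\n{question}\n\nUsing the document content and images above, give an accurate, detailed answer. If any information comes from an image, say so."
        })
        
        # 4. Call the multimodal LLM
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "You are a professional document analysis assistant who understands both text and images and gives accurate answers."
                },
                {"role": "user", "content": message_content}
            ],
            max_tokens=2000,
        )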
        
        answer_text = response.choices[0].message.content
        
        return {
            "answer": answer_text,
            "text_sources": text_results,
            "image_sources": relevant_images,
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            }
        }

Complete Usage Example

async def main():
    client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    indexer = MultimodalIndexer(qdrant_url="http://localhost:6333")
    
    # 1. Parse and index the document
    pdf_path = "technical_doc.pdf"
    text_chunks, image_chunks = parse_pdf_multimodal(pdf_path)
    
    print(f"Text chunks extracted: {len(text_chunks)}")
    print(f"Images extracted: {len(image_chunks)}")
    
    # Generate captions for the images
    image_chunks = await generate_image_captions(image_chunks, client)
    
    # Index
    indexer.index_text_chunks(text_chunks)
    indexer.index_image_chunks(image_chunks)
    
    # 2. Multimodal Q&A
    rag = MultimodalRAG(indexer, client)
    
    questions = [
        "How are the modules in the system architecture diagram connected?",
        "What is the complete data processing flow?",
        "Which option performs best in the performance comparison table?",
    ]
    
    for q in questions:
        print(f"\nQuestion: {q}")
        result = await rag.answer(q)
        print(f"Answer: {result['answer'][:200]}...")
        print(f"Images cited: {len(result['image_sources'])}")
        print(f"Token usage: {result['usage']}")

if __name__ == "__main__":
    asyncio.run(main())

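Performance Optimization Tips

1. Image preprocessing pipeline. Do not wait until query time to process images. Finish all preprocessing during indexing: image compression (to shrink the base64 payload), caption generation, and index construction.

2. Tiered caching

Cache caption results (never caption the same image twice)
Cache CLIP vectors (if the image content has not changed, neither has its vector)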
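
One way to realize the caption cache is to key it on a hash of the image bytes, so an identical image appearing in several documents is captioned only once. The sketch below is a minimal illustration, not part of the pipeline above: the class `CaptionCache`, the JSON file format, and the synchronous `make_caption` callback are all assumptions made for brevity (the captioning code in this article is async). The same keying scheme would work for cached CLIP vectors.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


class CaptionCache:
    """Disk-backed cache mapping image content hashes to captions."""

    def __init__(self, path: str):
        self.path = Path(path)
        self._data: Dict[str, str] = {}
        if self.path.exists():
            # Reload captions saved by an earlier run
            self._data = json.loads(self.path.read_text(encoding="utf-8"))

    @staticmethod
    def key_for(image_data: bytes) -> str:
        # Content hash: the same bytes always map to the same key
        return hashlib.sha256(image_data).hexdigest()

    def get_or_create(self, image_data: bytes, make_caption: Callable[[bytes], str]) -> str:
        """Return the cached caption, calling make_caption only on a miss."""
        key = self.key_for(image_data)
        if key not in self._data:
            self._data[key] = make_caption(image_data)
            # Persist after every miss so a crash does not lose paid-for captions
            self.path.write_text(
                json.dumps(self._data, ensure_ascii=False), encoding="utf-8"
            )
        return self._data[key]
```

Rewriting the whole JSON file on each miss is fine for a few thousand images; a larger corpus would probably want SQLite or a key-value store instead.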

3. Limit the number of images passed to the LLM. Keep each GPT-4o request to at most 3 images; otherwise token cost rises sharply (roughly 1,000 to 2,000 tokens per image).
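
The per-image cost can be estimated before sending a request. The sketch below follows the tile-based rule OpenAI has published for GPT-4o image inputs: a flat 85 tokens in low-detail mode; in high-detail mode the image is fitted into a 2048 x 2048 square, shrunk so its shortest side is 768 px, and billed at 170 tokens per 512 px tile plus a base of 85. Treat the constants as assumptions that may change and that differ for other models (gpt-4o-mini uses different values); the function name `estimate_image_tokens` and the choice not to upscale small images are also assumptions of this sketch.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o input tokens for one image under the tile-based rule."""
    if detail == "low":
        return 85  # flat cost in low-detail mode

    # Step 1: fit within a 2048 x 2048 square, keeping the aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: shrink so the shortest side is at most 768 px
    # (assumption: images already smaller than that are not upscaled)
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: 170 tokens per 512 px tile, plus a base of 85
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


for size in [(512, 512), (1024, 1024), (2048, 4096)]:
    print(size, estimate_image_tokens(*size))
```

Under this rule a 1024 x 1024 image at high detail comes to 765 tokens and a 2048 x 4096 image to 1105, so downscaling images during indexing directly lowers the per-query cost.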

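4. Tune the relevance threshold. A CLIP similarity of 0.25 is a starting point; adjust it for your document types. For technical diagrams, 0.3 or higher is suggested; for photos, 0.2 is enough.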
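
Rather than guessing, the threshold can be picked from a small hand-labeled sample of (score, relevant?) pairs taken from your own queries. The sketch below sweeps the observed scores and keeps the one that maximizes F1. The function name `best_threshold`, the use of F1 as the objective, and the sample values are all illustrative assumptions; a precision-weighted objective may suit you better if irrelevant images are expensive.

```python
from typing import List, Tuple


def best_threshold(labeled: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """Return (threshold, f1) maximizing F1 over labeled (score, is_relevant) pairs.

    An image is kept when score >= threshold. Candidate thresholds are the
    observed scores themselves.
    """
    total_relevant = sum(1 for _, rel in labeled if rel)
    best = (0.0, 0.0)
    for candidate in sorted({score for score, _ in labeled}):
        kept = [rel for score, rel in labeled if score >= candidate]
        true_pos = sum(kept)
        if true_pos == 0:
            continue  # nothing relevant survives this threshold
        precision = true_pos / len(kept)
        recall = true_pos / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (candidate, f1)
    return best


# Made-up sample: CLIP scores for retrieved images, labeled by hand
samples = [(0.31, True), (0.28, True), (0.26, False), (0.22, False), (0.19, False)]
print(best_threshold(samples))
```

A few dozen labeled pairs per document type are usually enough to see whether diagrams and photos really need different cutoffs.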

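Summary

The engineering core of Multimodal RAG is a division of labor across three layers:

Parsing layer: use a tool such as Unstructured to split documents into text chunks and image chunks
Indexing layer: embed text with a text model such as BGE and images with a multimodal model such as CLIP, and build a separate index for each
Answering layer: after hybrid retrieval, pass the images and text together to a multimodal LLM such as GPT-4o

If your business handles a large volume of mixed image-text documents (product manuals, technical documents, research reports), this architecture can noticeably improve retrieval quality and let your RAG system actually "read" that content.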