Embeddings Engineering in Practice, 2026: A Complete Guide from Text Vectorization to Multimodal Retrieval


What Is an Embedding, and Why It Is the Heart of RAG

In a RAG system, most engineers pour their effort into LLM selection and prompt design. In practice, though, retrieval quality is determined first and foremost by embedding quality. Pick the wrong embedding model and even the best LLM cannot save you: if the relevant documents are never retrieved, the LLM has nothing to work with.

At its core, an embedding maps arbitrary semantic content (text, images, audio) into a high-dimensional vector space, such that semantically similar content ends up close together in that space.

# A quick demonstration
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-m3')

sentences = [
    "How to handle JSON data in Python",
    "Ways to parse JSON in Python",     # semantically similar
    "What's the weather like today",    # semantically unrelated
]

embeddings = model.encode(sentences)
# cosine similarity of [0] and [1] >> cosine similarity of [0] and [2]
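To check that comparison numerically, a small cosine-similarity helper is all it takes (later snippets in this article reuse the same cosine_sim function):

import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim(embeddings[0], embeddings[1]))  # high: near-duplicate meaning
print(cosine_sim(embeddings[0], embeddings[2]))  # much lower: unrelated topic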

Choosing Among the Mainstream Embedding Models of 2026

Text embedding models compared

| Model | Dimensions | Context length | Highlights | Recommended for |
| --- | --- | --- | --- | --- |
| text-embedding-3-large | 3072 | 8191 | OpenAI's flagship, strongest quality | English-heavy, cost no object |
| text-embedding-3-small | 1536 | 8191 | Excellent price/performance | English, large-scale workloads |
| BAAI/bge-m3 | 1024 | 8192 | Strongest open-source multilingual model | Chinese, multilingual, self-hosted |
| BAAI/bge-large-zh | 1024 | 512 | Chinese-specialized | Pure-Chinese short text |
| Cohere Embed v3 | 1024 | 512 | Supports text and images | First steps into multimodal |
| voyage-3 | 1024 | 32000 | Long-document retrieval | RAG over very long documents |

Recommended choices for 2026

  • Pure Chinese: BAAI/bge-m3 (free, strong quality, supports up to 8192 tokens)
  • Mixed Chinese and English: text-embedding-3-small (simple API, low cost)
  • Very long documents: voyage-3 or bge-m3 (both support long contexts)

Using the text-embedding-3 series

from openai import OpenAI

client = OpenAI()

def embed_text(text: str, model="text-embedding-3-small") -> list[float]:
    """Embed a single text."""
    response = client.embeddings.create(
        input=text,
        model=model,
        # Optional: shrink the vector to save storage
        dimensions=512  # 1536 -> 512, at a cost of roughly 10% accuracy
    )
    return response.data[0].embedding

def embed_batch(texts: list[str], model="text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of texts (more efficient)."""
    response = client.embeddings.create(
        input=texts,  # at most 2048 items per request
        model=model
    )
    # Restore the original order (each result carries its input index)
    return [e.embedding for e in sorted(response.data, key=lambda x: x.index)]
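Because a single request tops out at 2048 inputs, a larger corpus has to be fed in slices. A minimal wrapper around embed_batch:

def embed_corpus(texts: list[str], batch_size: int = 2048) -> list[list[float]]:
    """Embed an arbitrarily long list by slicing it into API-sized batches."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        embeddings.extend(embed_batch(texts[i:i + batch_size]))
    return embeddings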

Using BGE-M3 (self-hosted)

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel(
    'BAAI/bge-m3',
    use_fp16=True,  # saves GPU memory
    device='cuda'
)

# BGE-M3's distinctive trick: three vector representations from a single pass
sentences = ["A guide to choosing a vector database", "How to pick the right vector storage"]

output = model.encode(
    sentences,
    batch_size=12,
    max_length=512,
    return_dense=True,    # dense vectors (semantic retrieval)
    return_sparse=True,   # sparse vectors (keyword retrieval)
    return_colbert_vecs=True  # ColBERT vectors (fine-grained reranking)
)

dense_embeddings = output['dense_vecs']
sparse_embeddings = output['lexical_weights']  # sparse representation as dicts
colbert_vecs = output['colbert_vecs']
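How do the three representations compare two sentences? The scoring helpers below follow the BGE-M3 README; exact method names may vary across FlagEmbedding versions:

# Dense vectors come back L2-normalized, so a dot product is a cosine similarity
dense_score = dense_embeddings[0] @ dense_embeddings[1]

# Overlap score between the two sparse (lexical) representations
lexical_score = model.compute_lexical_matching_score(
    sparse_embeddings[0], sparse_embeddings[1]
)

# Late-interaction ColBERT score, typically the most precise of the three
colbert_score = model.colbert_score(colbert_vecs[0], colbert_vecs[1])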

The Key Engineering Details of Embeddings

1. Document chunking strategy

Embedding quality is tightly coupled to the chunking strategy. Chunks that are too short lose context; chunks that are too long dilute the semantics.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Basic chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # ~512 characters per chunk (the default length function counts characters, not tokens)
    chunk_overlap=64,     # 64-character overlap between neighbors, so no thought is cut in half
    separators=["\n\n", "\n", "。", "!", "?", " ", ""]
)
chunks = splitter.split_text(document)

# Advanced: parent-child chunking
# Retrieve with small chunks (precise), feed the LLM large chunks (rich context)
small_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=16)
large_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)

small_chunks = small_splitter.split_text(document)
large_chunks = large_splitter.split_text(document)

# Map each small chunk back to an enclosing large chunk
def map_small_to_large(small_chunk, large_chunks):
    for large in large_chunks:
        if small_chunk in large:
            return large
    return small_chunk  # fallback: overlap can leave a chunk with no exact parent
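Wiring the two halves together at query time looks roughly like this; the vector_store insert/search interface here is schematic, not a specific library's API:

# Index: embed the small chunks, stash each one's parent alongside it
for small in small_chunks:
    vector_store.insert({
        "embedding": embed_text(small),
        "content": small,
        "parent": map_small_to_large(small, large_chunks),
    })

# Query: match on small chunks, hand the deduplicated parents to the LLM
def retrieve_parents(query: str, k: int = 5) -> list[str]:
    hits = vector_store.similarity_search(embed_text(query), k=k)
    return list(dict.fromkeys(hit["parent"] for hit in hits))  # dedupe, keep order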

2. Embedding normalization

import numpy as np

def normalize_embedding(embedding: list[float]) -> list[float]:
    """L2-normalize so that cosine similarity equals the dot product (faster to compute)."""
    arr = np.array(embedding)
    norm = np.linalg.norm(arr)
    if norm == 0:
        return embedding
    return (arr / norm).tolist()

# Most vector databases normalize for you, but doing it explicitly is safer
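A quick sanity check makes the equivalence concrete:

a, b = np.random.rand(1536), np.random.rand(1536)
an, bn = normalize_embedding(a.tolist()), normalize_embedding(b.tolist())

dot = np.dot(an, bn)                                          # plain dot product
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # full cosine formula
assert abs(dot - cos) < 1e-9  # identical up to floating-point error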

3. Query enhancement

Using the user's raw input as the retrieval key is rarely optimal; query enhancement can significantly improve recall:

async def enhance_query_for_retrieval(user_query: str) -> list[str]:
    """Expand a user query into several retrieval angles."""
    
    prompt = f"""User question: "{user_query}"
    
Generate 3-5 search phrases for document retrieval, covering:
1. Rephrasings with the same meaning but different wording
2. Broader umbrella concepts
3. Likely related technical terms

Output only the phrases, one per line:"""
    
    response = await llm.complete(prompt)  # llm: any async LLM client
    queries = [q.strip() for q in response.split('\n') if q.strip()]
    queries.insert(0, user_query)  # keep the original query
    return queries

async def multi_query_retrieve(user_query: str, vector_store, k=5):
    """Multi-query retrieval; results are deduplicated and merged."""
    queries = await enhance_query_for_retrieval(user_query)
    
    all_results = []
    seen_ids = set()
    
    for q in queries:
        embedding = embed_text(q)
        results = vector_store.similarity_search(embedding, k=k)
        for r in results:
            if r.id not in seen_ids:
                all_results.append(r)
                seen_ids.add(r.id)
    
    # Rerank against the original query
    return rerank(user_query, all_results)[:k]
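The rerank() call above is left undefined. A minimal sketch with a cross-encoder; the model name is one common pairing for bge-m3, not this article's prescription, and r.content is an assumed attribute of your result objects:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')

def rerank(query: str, results: list) -> list:
    """Order candidates by cross-encoder relevance to the query."""
    scores = reranker.predict([(query, r.content) for r in results])
    ranked = sorted(zip(results, scores), key=lambda pair: pair[1], reverse=True)
    return [r for r, _ in ranked]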

4. Hard negative mining

If you plan to fine-tune your own embedding model, hard negatives (documents that look relevant but are not) are the key to improving its discriminative power:

def mine_hard_negatives(query: str, positives: list[str], corpus: list[str]):
    """Mine hard negatives from a corpus."""
    # Score the corpus with the current model
    query_emb = embed_text(query)
    all_scores = [(doc, cosine_sim(query_emb, embed_text(doc))) for doc in corpus]
    
    # Sort by similarity
    ranked = sorted(all_scores, key=lambda x: x[1], reverse=True)
    
    # Hard negatives = highly ranked documents that are not positives
    positives_set = set(positives)
    hard_negatives = [
        doc for doc, score in ranked[:50]  # look within the top 50
        if doc not in positives_set
    ][:10]  # keep the 10 hardest to tell apart
    
    return hard_negatives
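The mined negatives then slot into the usual contrastive-training data format. The JSONL layout below (query / pos / neg keys) follows FlagEmbedding's fine-tuning convention; adjust it to whatever your training stack actually expects:

import json

def build_training_example(query: str, positives: list[str], corpus: list[str]) -> str:
    """One JSONL line in the query/pos/neg format used for embedding fine-tuning."""
    return json.dumps({
        "query": query,
        "pos": positives,
        "neg": mine_hard_negatives(query, positives, corpus),
    }, ensure_ascii=False)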

Multimodal Embeddings: Image-Text Retrieval

Image-text retrieval with CLIP

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Embed an image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features[0].numpy().tolist()

def embed_text_for_clip(text: str) -> list[float]:
    """Embed text into the same space as the images."""
    inputs = processor(text=text, return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0].numpy().tolist()

# Find images with a text query
def image_search_by_text(query: str, image_db: list) -> list:
    query_emb = embed_text_for_clip(query)
    scores = [(img, cosine_sim(query_emb, img['embedding'])) for img in image_db]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:5]

Joint multimodal retrieval

class MultimodalRetriever:
    """Joint image-text retrieval over a single store."""
    
    def __init__(self, vector_store):
        self.store = vector_store
    
    def index_document_with_images(self, doc_id: str, text: str, images: list):
        """Index a document that contains images."""
        # Text chunk
        text_emb = embed_text(text)
        self.store.insert({
            "id": f"{doc_id}_text",
            "embedding": text_emb,
            "content": text,
            "type": "text"
        })
        
        # Image chunks (boosted with alt-text)
        for i, img_path in enumerate(images):
            img_emb = embed_image(img_path)
            # Extract a caption (optional, improves retrieval); assumed helper,
            # e.g. backed by a captioning model
            alt_text = extract_image_caption(img_path)
            # Blend in CLIP space: both vectors must come from the same encoder,
            # so the caption goes through the CLIP text tower, not embed_text
            combined_emb = blend_embeddings(img_emb, embed_text_for_clip(alt_text), alpha=0.7)
            
            self.store.insert({
                "id": f"{doc_id}_img_{i}",
                "embedding": combined_emb,
                "content": img_path,
                "alt_text": alt_text,
                "type": "image"
            })
    
    def search(self, query: str, modality: str = "both") -> list:
        # NOTE: the text chunks (text-embedding space) and image chunks (CLIP
        # space) live in different vector spaces, so in production they belong
        # in separate collections or named vector fields, each queried with its
        # own encoder; a plain embed_text query only matches the text side.
        query_emb = embed_text(query)
        results = self.store.search(query_emb, k=10)
        
        if modality == "text_only":
            return [r for r in results if r["type"] == "text"]
        elif modality == "image_only":
            return [r for r in results if r["type"] == "image"]
        return results
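extract_image_caption and blend_embeddings above are placeholders. The blend half is simple enough to spell out; a weighted average followed by re-normalization is one common choice:

import numpy as np

def blend_embeddings(a: list[float], b: list[float], alpha: float = 0.7) -> list[float]:
    """Weighted mix of two same-space embeddings, re-normalized to unit length."""
    mixed = alpha * np.asarray(a) + (1 - alpha) * np.asarray(b)
    return (mixed / np.linalg.norm(mixed)).tolist()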

Vector Storage and Retrieval Optimization

Choosing the right index type

# Qdrant example: the trade-offs behind different index configurations
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff

client = QdrantClient("localhost", port=6333)

# HNSW configuration (a good fit for most workloads)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE,
    ),
    hnsw_config=HnswConfigDiff(
        m=16,           # max links per node (higher = more accurate, slower to build)
        ef_construct=100,  # search breadth during construction (higher = more accurate)
        full_scan_threshold=10000  # below this many vectors, just brute-force scan
    )
)
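These parameters only shape index construction. Recall at query time is governed separately by hnsw_ef, which can be set per request (shown with the classic search API; newer clients favor query_points):

from qdrant_client.models import SearchParams

# Higher hnsw_ef = better recall, higher latency; tune it per workload.
# The query embedding must match the collection's configured size (1536 here).
hits = client.search(
    collection_name="documents",
    query_vector=embed_text("how to choose a vector database"),
    search_params=SearchParams(hnsw_ef=128),
    limit=10,
)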

Hybrid retrieval (vector + keyword)

from qdrant_client import models

async def hybrid_search(query: str, collection: str, k: int = 10):
    """Fuse semantic retrieval with keyword retrieval."""
    
    # Dense vector (semantic retrieval)
    dense_emb = embed_text(query)
    
    # Sparse vector (keyword retrieval, produced by BGE-M3)
    sparse_result = bge_m3.encode([query], return_sparse=True)  # bge_m3: the BGEM3FlagModel from earlier
    lexical_weights = sparse_result['lexical_weights'][0]  # dict: token id -> weight
    sparse_emb = models.SparseVector(
        indices=[int(i) for i in lexical_weights],
        values=[float(v) for v in lexical_weights.values()],
    )
    
    # Run the hybrid query
    results = client.query_points(
        collection_name=collection,
        prefetch=[
            # Dense leg
            models.Prefetch(
                query=dense_emb,
                using="dense",
                limit=20,
            ),
            # Sparse leg
            models.Prefetch(
                query=sparse_emb,
                using="sparse",
                limit=20,
            ),
        ],
        # RRF fusion (Reciprocal Rank Fusion)
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=k,
    )
    return results
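RRF deserves a sentence: each candidate scores the sum of 1/(rrf_k + rank_i) across the result lists it appears in, with rrf_k = 60 by convention, so documents that rank well in several lists win without any need to calibrate dense scores against sparse ones. Qdrant computes this internally; a reference implementation is tiny:

def rrf_fuse(result_lists: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over several ranked lists of document ids."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)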

Evaluating Embedding Quality

Don't trust public benchmarks blindly; only an evaluation on your own data is meaningful:

class EmbeddingEvaluator:
    """Evaluate an embedding model's retrieval quality on a specific domain."""
    
    def __init__(self, vector_store):
        self.vector_store = vector_store
    
    def evaluate_retrieval(self, test_pairs: list[dict], k_values=[1, 5, 10]) -> dict:
        """
        test_pairs: [{"query": "...", "relevant_docs": ["doc_id1", ...]}]
        """
        results = {f"recall@{k}": 0 for k in k_values}
        
        for pair in test_pairs:
            query_emb = embed_text(pair["query"])
            retrieved = self.vector_store.search(query_emb, k=max(k_values))
            retrieved_ids = [r.id for r in retrieved]
            
            for k in k_values:
                top_k = retrieved_ids[:k]
                # Recall@K
                relevant_found = len(set(top_k) & set(pair["relevant_docs"]))
                results[f"recall@{k}"] += relevant_found / len(pair["relevant_docs"])
        
        # Average over all queries
        n = len(test_pairs)
        return {k: v/n for k, v in results.items()}
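Recall@K treats every relevant document equally; nDCG@K additionally rewards ranking the relevant ones higher. With binary relevance it drops straight into the loop above:

import math

def ndcg_at_k(retrieved_ids: list[str], relevant_docs: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of this ranking over DCG of the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)  # positions are 0-based, hence +2
        for i, doc_id in enumerate(retrieved_ids[:k])
        if doc_id in relevant_docs
    )
    ideal_hits = min(len(relevant_docs), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0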

Closing Thoughts

Embedding engineering is the layer of a RAG system most worth going deep on. In 2026, BGE-M3 has pulled the ceiling of open-source embeddings level with commercial APIs, and multimodal retrieval has moved from the lab into production.

Put your time into four things: a sound chunking strategy, domain-adapted fine-tuning, a hybrid retrieval implementation, and continuous quality evaluation. Get those right and your RAG system's retrieval quality will sit in the top tier.