基于 Python 的 RAG 开发手册——RAG 系统的高效检索

引言

检索是任何 RAG 系统的骨架,因为它决定了模型能够使用哪些信息来为其响应提供事实支撑。没有有效的搜索,再先进的语言模型也有可能生成不完整、不相关,甚至产生幻觉的答案。本章将探讨那些能够让检索既准确又高效的策略与技术,从而确保为生成模型找出正确的证据。我们将涵盖传统方法,如 BM25 关键词搜索;现代方法,如稠密检索与混合检索;以及专门化技术,包括分块、层次化检索、语义过滤、元数据感知检索和时间感知检索。这些方法共同构成了 RAG 中证据选择的基础,弥合了非结构化数据源与可靠、具备上下文依据的生成之间的鸿沟。

结构

本章将涵盖以下主题:

  • 软件要求
  • 稠密检索
  • BM25 关键词搜索
  • 混合检索
  • 带扩展的混合检索
  • 语义过滤
  • 层次化检索
  • 分块策略
  • 批量检索
  • 元数据感知检索
  • 时间感知检索

学习目标

到本章结束时,读者将能够全面理解支撑 RAG 系统的各种检索技术,并认识到高效搜索如何直接影响生成响应的准确性、可靠性以及上下文 grounding(语境锚定)。通过考察传统方法(如 BM25 关键词搜索)以及高级方法(如稠密检索、混合检索、层次化检索和时间感知检索),本章旨在展示每种技术的优势与局限。同时,本章还强调分块、语义过滤和元数据感知检索等实用增强手段,说明它们如何提升精度、召回率和效率。目标是让实践者具备设计和实现最适合其数据与应用需求的检索策略的能力,并为将这些方法整合进完整的 RAG 流水线打下基础。

软件要求

本书中的每个概念后面都会配有相应的 recipe,也就是用 Python 编写的可运行代码。所有 recipe 都带有代码注释,逐行解释代码的作用。

运行这些 recipe 需要以下软件环境:

  • 系统:至少 16.0 GB RAM 的计算机
  • 操作系统:Windows
  • Python:Python 3.13.3 或更高版本
  • LangChain:1.0.5
  • LLM 模型:Ollama 的 llama3.2:3b
  • 程序输入文件:程序中使用的输入文件可在本书的 Git 仓库中找到

要运行程序,请先使用命令 pip install <package name> 安装 recipe 中提到的软件包。安装完成后,在你的开发环境中运行 recipe 对应的 Python 脚本(.py 文件)即可。

请参考下图以进一步了解:

图 9.1:搜索


稠密检索

稠密检索是一种神经检索方法,它将查询和文档都表示为共享嵌入空间中的稠密向量,这些表示通常是通过基于 Transformer 的模型学习得到的。它不依赖精确的关键词匹配,而是利用语义相似性:文档的检索依据是其嵌入向量与查询嵌入向量之间的接近程度,通常使用余弦相似度来度量。这使系统能够捕捉超越表层词汇重叠的意义,因此对同义词、释义表达和上下文变化具有较强鲁棒性。对于 RAG 系统而言,稠密检索能够提供更准确、上下文更相关的证据,显著降低无关匹配的风险,并增强生成响应的 grounding(事实锚定能力)。
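
在进入 recipe 之前,可以先用一个极简的 NumPy 片段说明这里的度量关系:对 L2 归一化后的向量而言,内积就等于余弦相似度,这也是下文使用 FAISS 的 IndexFlatIP(内积索引)配合归一化向量来实现余弦相似度搜索的原因。以下为示意代码:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: cosine of the angle between two vectors, in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, similarity should be 1

# After L2 normalization, the inner product equals cosine similarity
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(cosine_sim(a, b), float(np.dot(a_n, b_n)))
```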

Recipe 91

本 recipe 演示如何实现稠密检索:

  • 使用一个简短的示例文档。你可以替换为自己的文档。
  • 使用预训练的 sentence-transformer 模型获取稠密嵌入。
  • 对嵌入进行归一化,以使用余弦相似度。
  • 构建一个 FAISS 索引,以便快速执行相似度搜索,这里使用 IndexFlatIP 来实现余弦相似度搜索(配合归一化向量)。
  • 使用一个示例查询测试搜索函数。
  • 搜索与查询字符串最相似的 top-k 文档。
  • 在 FAISS 索引中执行搜索。
  • 返回 top-k 结果及其分数和文本。
  • 打印搜索结果。

安装所需软件包:

pip install sentence-transformers faiss-cpu

dense_retrieval.py

请参考以下代码:

# dense_retrieval.py
# Example of dense retrieval using FAISS and Sentence-Transformers
from typing import List, Tuple
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. These are short example documents
# You can replace this with your own documents
DOCS = [
    "RAG combines retrieval with generation to ground answers in evidence.",
    "Dense retrieval uses vector embeddings instead of keyword matching.",
    "FAISS is a fast vector similarity library from Facebook/Meta.",
    "BM25 is a sparse retrieval method based on term frequency statistics.",
    "Sentence-Transformers provides easy embedding models for sentences.",
    "Vector databases store embeddings to enable semantic search.",
]

# 2. Use a pre-trained Sentence-Transformer model to get dense embeddings
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

# 3. Normalize embeddings to use cosine similarity
embeddings = model.encode(DOCS, convert_to_numpy=True, normalize_embeddings=True)
dim = embeddings.shape[1]

# 4. Build a FAISS index for fast similarity search
# Using IndexFlatIP for cosine similarity (with normalized vectors)
index = faiss.IndexFlatIP(dim)
index.add(embeddings)

def search(query: str, k: int = 3) -> List[Tuple[float, str]]:
    # 6. Search for the top-k most similar documents to the query string
    q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)

    # 7. Perform the search in the FAISS index
    scores, idxs = index.search(q_emb, k)  # scores are cosine similarities

    # 8. Return the top-k results with their scores and texts
    return [(float(scores[0][i]), DOCS[int(idxs[0][i])]) for i in range(k)]

if __name__ == "__main__":
    # 5. Example query to test the search function
    query = "How does RAG reduce hallucinations?"
    results = search(query, k=3)

    # 9. Print the search results
    print(f"Query: {query}\n")
    for rank, (score, text) in enumerate(results, start=1):
        print(f"{rank}. score={score:.4f}  |  {text}")

输出:

Query: How does RAG reduce hallucinations?
1. score=0.3682  |  RAG combines retrieval with generation to ground answers in evidence.
2. score=0.0452  |  BM25 is a sparse retrieval method based on term frequency statistics.
3. score=-0.0133  |  Dense retrieval uses vector embeddings instead of keyword matching.

BM25 关键词搜索

BM25 是一种经典的稀疏检索方法,它通过使用概率评分函数,将查询关键词与文档集合中的词项进行匹配,从而为文档排序。它相较于更早期的词袋模型方法有两项重要改进:其一是词频饱和(term frequency saturation) ,它能够防止同一个词在单篇文档中多次重复出现而带来过高分数;其二是逆文档频率(inverse document frequency) ,它会给予稀有且具有区分性的词更高权重。此外,BM25 还会对文档长度进行归一化,避免较长文档获得不公平优势。在 RAG 系统中,BM25 常被用作基线检索器或补充型检索器,因为它简单、高效、可解释;但它的局限在于只能处理表层关键词重叠,无法像稠密检索那样捕捉更深层的语义。
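
词频饱和与逆文档频率这两点,可以用一个教学用的极简实现直观展示(仅为示意,并非 rank-bm25 库的实际实现;k1、b 取常见默认值):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Simplified, illustrative BM25 for one document (pre-tokenized input)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N          # average document length
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)        # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rare terms weigh more
        tf = doc_terms.count(t)                      # term frequency in this doc
        # Denominator grows with tf, so the score saturates instead of
        # increasing linearly; b controls document-length normalization
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [["rag", "retrieval"], ["cooking", "recipes"], ["rag", "evidence"]]
print(bm25_score(["rag"], corpus[0], corpus))
```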

Recipe 92

本 recipe 演示如何在 RAG 系统中实现 BM25 关键词搜索。步骤如下:

  • 为演示起见,我们使用一小组示例文档。
  • 使用简单分词和小写化对文档进行预处理。
  • 使用分词后的文档初始化 BM25。
  • 使用示例查询测试搜索函数。
  • 使用与文档相同的分词方法对查询进行预处理。
  • 获取 BM25 分数,并基于分数为文档排序。
  • 返回 top-k 结果及其分数和原始文本。
  • 展示结果,包括排名、分数和文本。

安装所需软件包:

pip install rank-bm25

bm25_search.py

请参考以下代码:

# bm25_search.py
from rank_bm25 import BM25Okapi

# 1. For demonstration, we use a small set of example documents
DOCS = [
    "RAG combines retrieval with generation to ground answers in evidence.",
    "Dense retrieval uses vector embeddings instead of keyword matching.",
    "FAISS is a fast vector similarity library from Facebook/Meta.",
    "BM25 is a sparse retrieval method based on term frequency statistics.",
    "Sentence-Transformers provides easy embedding models for sentences.",
    "Vector databases store embeddings to enable semantic search.",
]

# 2. Preprocess documents using simple tokenization and lowercasing
tokenized_docs = [doc.lower().split() for doc in DOCS]

# 3. Initialize BM25 with the tokenized documents
bm25 = BM25Okapi(tokenized_docs)

def search(query: str, k: int = 3):
    # 5. Preprocess the query using the same tokenization method as documents
    tokenized_query = query.lower().split()

    # 6. Get BM25 scores and rank documents based on these scores
    scores = bm25.get_scores(tokenized_query)
    ranked = sorted(list(enumerate(scores)), key=lambda x: x[1], reverse=True)

    # 7. Return the top-k results with their scores and original text
    return [(score, DOCS[idx]) for idx, score in ranked[:k]]

if __name__ == "__main__":
    # 4. Sample query to test the search function
    query = "How does RAG reduce hallucinations?"
    results = search(query, k=3)
    print(f"Query: {query}\n")

    # 8. Display results with rank, score, and text
    for rank, (score, text) in enumerate(results, start=1):
        print(f"{rank}. score={score:.4f}  |  {text}")

输出:

Query: How does RAG reduce hallucinations?
1. score=1.2374  |  RAG combines retrieval with generation to ground answers in evidence.
2. score=0.0000  |  Dense retrieval uses vector embeddings instead of keyword matching.
3. score=0.0000  |  FAISS is a fast vector similarity library from Facebook/Meta.

混合检索

混合检索结合了稀疏检索方法(如 BM25)与基于神经嵌入的稠密检索方法的优势,从而提供更稳健、更全面的检索结果。BM25 擅长精确的关键词匹配,并且能够高效处理稀有术语或领域专有词;而稠密检索能够捕捉语义含义,即使查询与文档之间没有精确词汇匹配,也能找回上下文相关的文档。通过合并或重排这两类方法的结果,混合检索在精度与召回率之间取得平衡,既保证重要的关键词匹配不会被遗漏,又利用稠密向量的语义表达能力。在 RAG 系统中,这种混合策略通常能够带来更可靠的证据检索,从而提升生成响应的 grounding 质量与整体质量。
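
除了下文 recipe 采用的分数加权求和,另一种常见的合并方式是倒数排名融合(Reciprocal Rank Fusion,RRF):它只使用各检索器给出的名次,无需对分数做归一化。以下是一个示意实现:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: ranked lists of doc ids from different retrievers.
    RRF score = sum of 1 / (k + rank); k is a smoothing constant (often 60)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

bm25_rank = ["doc1", "doc2", "doc3"]     # illustrative sparse ranking
dense_rank = ["doc1", "doc4", "doc2"]    # illustrative dense ranking
fused = reciprocal_rank_fusion([bm25_rank, dense_rank])
print(fused)
```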

Recipe 93

本 recipe 演示如何实现混合检索:

  • 创建一个小型文档语料库。实际中可使用更大的文档集或向量数据库。为了演示,我们使用一小组示例文档。
  • 使用简单分词和小写化为 BM25 索引构建文档分词结果。
  • 使用 sentence-transformers 的嵌入构建 FAISS 索引,用于稠密检索。
  • 使用一个示例查询分别测试 BM25、稠密检索和混合检索方法。
  • 执行 BM25 搜索并打印带分数的结果。
  • 执行稠密搜索并打印带分数的结果。
  • 执行混合搜索并打印带分数的结果。

安装所需软件包:

pip install rank-bm25 sentence-transformers faiss-cpu

hybrid_search.py

请参考以下代码:

# hybrid_search.py
# Hybrid search combining BM25 and Dense Retrieval using FAISS and
# Sentence-Transformers
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Create a small document corpus
# In practice, use a larger document or a vector database
# For demonstration, we use a small set of example documents
DOCS = [
    "RAG combines retrieval with generation to ground answers in evidence.",
    "Dense retrieval uses vector embeddings instead of keyword matching.",
    "FAISS is a fast vector similarity library from Facebook/Meta.",
    "BM25 is a sparse retrieval method based on term frequency statistics.",
    "Sentence-Transformers provides easy embedding models for sentences.",
    "Vector databases store embeddings to enable semantic search.",
]

# 2. Tokenize documents for BM25 index using simple tokenization and
# lowercasing
tokenized_docs = [doc.lower().split() for doc in DOCS]
bm25 = BM25Okapi(tokenized_docs)

# 3. Create FAISS index for dense retrieval using
# Sentence-Transformers embeddings
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)
embeddings = model.encode(DOCS, convert_to_numpy=True, normalize_embeddings=True)
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(embeddings)

def bm25_search(query: str, k: int = 3):
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    ranked = sorted(list(enumerate(scores)), key=lambda x: x[1], reverse=True)
    return [(float(score), DOCS[idx]) for idx, score in ranked[:k]]

def dense_search(query: str, k: int = 3):
    q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    scores, idxs = index.search(q_emb, k)
    return [(float(scores[0][i]), DOCS[int(idxs[0][i])]) for i in range(k)]

def hybrid_search(query: str, k: int = 3, alpha: float = 0.5):
    """alpha=0 → BM25 only, alpha=1 → Dense only"""
    tokens = query.lower().split()
    bm25_scores = np.array(bm25.get_scores(tokens))
    q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    dense_scores, _ = index.search(q_emb, len(DOCS))
    dense_scores = dense_scores[0]

    # Normalize
    def normalize(x): return (x - x.min()) / (x.max() - x.min() + 1e-8)
    bm25_norm = normalize(bm25_scores)
    dense_norm = normalize(dense_scores)

    # Hybrid scoring
    hybrid_scores = alpha * dense_norm + (1 - alpha) * bm25_norm
    ranked = np.argsort(-hybrid_scores)
    return [(float(hybrid_scores[i]), DOCS[i]) for i in ranked[:k]]

# Run Demo
if __name__ == "__main__":
    # 4. Example query to test the search functions using BM25, Dense,
    # and Hybrid methods
    query = "How does RAG reduce hallucinations?"
    print(f"\nQuery: {query}\n")

    print(" BM25 Results:")
    # 5. Perform BM25 search and Print results with scores
    for score, text in bm25_search(query):
        print(f"  score={score:.4f} | {text}")

    print("\n Dense Results:")
    # 6. Perform Dense search and Print results with scores
    # for score, text in dense_search(query):
        print(f"  score={score:.4f} | {text}")

    print("\n Hybrid Results (alpha=0.5):")
    # 7. Perform Hybrid search and Print results with scores
    # for score, text in hybrid_search(query, alpha=0.5):
        print(f"  score={score:.4f} | {text}")

输出:

  BM25 Results:
  score=1.2374 | RAG combines retrieval with generation to ground answers in evidence.
  score=0.0000 | Dense retrieval uses vector embeddings instead of keyword matching.
  score=0.0000 | FAISS is a fast vector similarity library from Facebook/Meta.

  Dense Results:
  score=0.3682 | RAG combines retrieval with generation to ground answers in evidence.
  score=0.0452 | BM25 is a sparse retrieval method based on term frequency statistics.
  score=-0.0133 | Dense retrieval uses vector embeddings instead of keyword matching.

  Hybrid Results (alpha=0.5):
  score=1.0000 | RAG combines retrieval with generation to ground answers in evidence.
  score=0.1570 | Dense retrieval uses vector embeddings instead of keyword matching.
  score=0.0949 | FAISS is a fast vector similarity library from Facebook/Meta.

带扩展的混合检索

带扩展的混合检索是在混合稀疏检索与稠密检索的基础上进一步增强的一种方法,它通过在检索前丰富查询内容,以覆盖更广泛的相关证据。在这种方法中,原始查询会被改写或扩展,加入同义词、释义表达或相关术语,这些扩展可以是人工整理的,也可以由语言模型自动生成。然后,扩展后的查询会同时用于 BM25(精确关键词匹配)和稠密检索(语义相似性),最终再对结果进行合并或重排,以最大化覆盖范围。这种方法降低了由于词汇不匹配而错失相关文档的风险,同时保留了语义深度与关键词精度。对于 RAG 系统而言,带扩展的混合检索能够增强召回率,确保模型建立在更丰富、更多样的证据之上,从而提升最终答案质量。
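
查询扩展最简单的形式是人工整理的同义词词典(下文 recipe 则演示基于嵌入相似度自动挑选扩展词)。以下示意代码中的词典内容为假设示例:

```python
# Hypothetical, hand-curated synonym table for illustration only;
# in practice expansions may come from curated lists or an LLM
SYNONYM_TABLE = {
    "reduce": ["mitigate", "decrease"],
    "hallucinations": ["fabrications", "unsupported claims"],
}

def expand_with_synonyms(query: str, table=SYNONYM_TABLE) -> str:
    """Append synonyms of matched query terms to the end of the query."""
    terms = query.lower().rstrip("?").split()
    extra = [syn for t in terms for syn in table.get(t, [])]
    return query + " " + " ".join(extra) if extra else query

print(expand_with_synonyms("How does RAG reduce hallucinations?"))
```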

Recipe 94

本 recipe 演示如何实现带扩展的混合检索:

  • 使用示例文档;实际中应使用更大的语料库或向量数据库。
  • 准备查询扩展候选词列表(预先确定)。
  • 加载 sentence transformer 模型,用于嵌入与相似度搜索。
  • 构建 BM25 索引,用于稀疏检索。
  • 构建 FAISS 索引,用于稠密检索。
  • 使用一个示例查询测试系统,并将扩展词加入查询字符串以获得更好的结果。
  • 使用 top 3 扩展词扩展查询。
  • 基于扩展后的查询词执行 BM25 搜索。
  • 基于扩展后的查询词执行稠密搜索。
  • 执行一个混合搜索函数,将 BM25 分数与稠密检索分数结合起来。

安装所需软件包:

pip install rank-bm25 sentence-transformers faiss-cpu

hybrid_search_with_expansion.py

请参考以下代码:

# hybrid_search_with_expansion.py
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import faiss
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Sample documents, in practice, use a larger corpus or a vector DB
DOCS = [
    "RAG combines retrieval with generation to ground answers in evidence.",
    "Dense retrieval uses vector embeddings instead of keyword matching.",
    "FAISS is a fast vector similarity library from Facebook/Meta.",
    "BM25 is a sparse retrieval method based on term frequency statistics.",
    "Sentence-Transformers provides easy embedding models for sentences.",
    "Vector databases store embeddings to enable semantic search.",
]

# 2. Candidate expansion terms for query expansion (determined
# beforehand)
EXPANSION_CANDIDATES = [
    "retrieval", "search", "semantic search", "dense retrieval",
    "vector database", "BM25", "information retrieval", "keyword search",
    "ranking", "query understanding", "knowledge base", "embeddings",
    "document matching", "relevance", "natural language processing",
]

# 3. Load Sentence Transformer model for embeddings and similarity
# search
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

# 4. Build BM25 index for sparse retrieval
tokenized_docs = [doc.lower().split() for doc in DOCS]
bm25 = BM25Okapi(tokenized_docs)

# 5. Build FAISS index for dense retrieval
embeddings = model.encode(DOCS, convert_to_numpy=True, normalize_embeddings=True)
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(embeddings)

# --- Query Expansion ---
def expand_query(query: str, top_k: int = 3):
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    cand_embs = model.encode(EXPANSION_CANDIDATES, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, cand_embs)[0]
    top_indices = scores.argsort(descending=True)[:top_k]
    return [EXPANSION_CANDIDATES[i] for i in top_indices]

def build_expanded_query(query: str, expansions):
    return query + " " + " ".join(expansions)

# 8. bm25 search function using the expanded query terms
def bm25_search(query: str, k: int = 3):
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    ranked = sorted(list(enumerate(scores)), key=lambda x: x[1], reverse=True)
    return [(float(score), DOCS[idx]) for idx, score in ranked[:k]]

# 9. dense search function using the expanded query terms
def dense_search(query: str, k: int = 3):
    q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    scores, idxs = index.search(q_emb, k)
    return [(float(scores[0][i]), DOCS[int(idxs[0][i])]) for i in range(k)]

# 10. hybrid search function combining BM25 and dense scores using the
# expanded query terms
def hybrid_search(query: str, k: int = 3, alpha: float = 0.5):
    tokens = query.lower().split()
    bm25_scores = np.array(bm25.get_scores(tokens))
    q_emb = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    dense_scores, _ = index.search(q_emb, len(DOCS))
    dense_scores = dense_scores[0]

    # Normalize
    def normalize(x): return (x - x.min()) / (x.max() - x.min() + 1e-8)
    bm25_norm = normalize(bm25_scores)
    dense_norm = normalize(dense_scores)

    # Weighted sum
    hybrid_scores = alpha * dense_norm + (1 - alpha) * bm25_norm
    ranked = np.argsort(-hybrid_scores)
    return [(float(hybrid_scores[i]), DOCS[i]) for i in ranked[:k]]

# --- Run Demo ---
if __name__ == "__main__":
    # 6. Example query to test the system with expansion terms
    # added in the query string for better results
    query = "How does RAG reduce hallucinations?"

    # 7. Expand query with top 3 expansion terms
    expansions = expand_query(query, top_k=3)
    expanded_query = build_expanded_query(query, expansions)

    # Display results
    print(f"\nOriginal Query: {query}")
    print(f"Expanded Terms: {', '.join(expansions)}")
    print(f"Expanded Query String: {expanded_query}\n")

    print(" BM25 Results:")
    for score, text in bm25_search(expanded_query):
        print(f"  score={score:.4f} | {text}")

    print("\n Dense Results:")
    for score, text in dense_search(expanded_query):
        print(f"  score={score:.4f} | {text}")

    print("\n Hybrid Results (alpha=0.5):")
    for score, text in hybrid_search(expanded_query, alpha=0.5):
        print(f"  score={score:.4f} | {text}")

输出:

Original Query: How does RAG reduce hallucinations?
Expanded Terms: document matching, query understanding, retrieval
Expanded Query String: How does RAG reduce hallucinations? document matching query understanding retrieval

 BM25 Results:
  score=1.2374 | RAG combines retrieval with generation to ground answers in evidence.
  score=0.0000 | Dense retrieval uses vector embeddings instead of keyword matching.
  score=0.0000 | FAISS is a fast vector similarity library from Facebook/Meta.

 Dense Results:
  score=0.5879 | RAG combines retrieval with generation to ground answers in evidence.
  score=0.4707 | Dense retrieval uses vector embeddings instead of keyword matching.
  score=0.3435 | BM25 is a sparse retrieval method based on term frequency statistics.

 Hybrid Results (alpha=0.5):
  score=1.0000 | RAG combines retrieval with generation to ground answers in evidence.
  score=0.3642 | Dense retrieval uses vector embeddings instead of keyword matching.
  score=0.2169 | FAISS is a fast vector similarity library from Facebook/Meta.

语义过滤

语义过滤是一种后检索精炼技术,它通过在查询与候选结果之间额外执行一次语义相似度检查,从而提升检索结果的质量。在初始检索之后——无论是通过 BM25、稠密检索还是混合检索——系统都会利用基于嵌入的相似度评分对已检索出的文档重新评估,过滤掉那些虽然在主题上相关、但实际上并不真正匹配的文档。这一步有助于剔除噪声或边缘匹配,避免误导生成模型。在 RAG 系统中,语义过滤相当于一道质量控制层,确保只有上下文最匹配的证据被传递给生成器,从而减少幻觉并增强响应的事实 grounding。
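
过滤这一步本身非常轻量:拿到候选文档与相似度分数后,只需一次阈值比较加排序即可。以下为示意实现(阈值 0.5 与下文 recipe 一致):

```python
def filter_by_threshold(similarities, docs, threshold=0.5):
    """Post-retrieval semantic filter: keep only candidates whose similarity
    reaches the threshold, returned in descending score order."""
    kept = [(float(s), d) for s, d in zip(similarities, docs) if s >= threshold]
    return sorted(kept, key=lambda x: x[0], reverse=True)

sims = [0.62, -0.02, 0.51]             # illustrative embedding similarities
docs = ["doc A", "doc B", "doc C"]
print(filter_by_threshold(sims, docs))
```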

Recipe 95

本 recipe 演示如何实现语义过滤:

  • 创建一些示例文档,以便基于语义相似度进行过滤。
  • 加载预训练 SentenceTransformer 模型。
  • 计算查询与文档之间的余弦相似度并据此过滤,只保留高于阈值(例如 0.5,表示中等相似度)的文档。
  • 仅展示高于 0.5 阈值的过滤结果。

安装所需软件包:

pip install sentence-transformers

semantic_filtering.py

请参考以下代码:

# semantic_filtering.py
from sentence_transformers import SentenceTransformer, util
from torch.nn.functional import normalize

# 1. Create sample documents to filter based on semantic similarity
DOCS = [
    "RAG reduces hallucinations by grounding answers in retrieved evidence.",
    "Dense retrieval uses vector embeddings instead of keyword matching.",
    "FAISS is a fast vector similarity library from Meta.",
    "BM25 is a sparse retrieval method based on term frequency statistics.",
    "Sentence-Transformers provides easy embedding models for sentences.",
    "Vector databases store embeddings to enable semantic search.",
    "Cooking recipes often require precise measurements and timing.",
    "Traveling to new countries helps you learn about culture and history."
]

# 2. Load pre-trained SentenceTransformer model
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(MODEL_NAME)

# 3. Compute using cosine similarity and filter documents
# Keep only those above a certain threshold
# (e.g., 0.5 for moderate similarity)
def semantic_filter(query: str, docs, threshold: float = 0.5):
    """
    Compute cosine similarity between query and docs,
    normalize embeddings to avoid negative/weird values,
    and filter by threshold.
    """
    # Encode and normalize
    query_emb = model.encode(query, convert_to_tensor=True)
    query_emb = normalize(query_emb, p=2, dim=0)

    doc_embs = model.encode(docs, convert_to_tensor=True)
    doc_embs = normalize(doc_embs, p=2, dim=1)

    # Cosine similarity
    similarities = util.cos_sim(query_emb, doc_embs)[0]

    print("\n--- All Docs with Scores ---")
    for i, score in enumerate(similarities):
        print(f"{score:.4f} | {docs[i]}")

    # Keep only docs above threshold
    filtered = [(docs[i], float(similarities[i])) for i in range(len(docs)) if similarities[i] >= threshold]
    filtered.sort(key=lambda x: x[1], reverse=True)
    return filtered

if __name__ == "__main__":
    query = "How does RAG reduce hallucinations?"
    results = semantic_filter(query, DOCS, threshold=0.5)

    # 4. Display filtered results only above threshold 0.5
    print(f"\nQuery: {query}\n")
    print("Filtered Results (similarity > 0.5):")
    for text, score in results:
        print(f"  similarity={score:.4f} | {text}")

输出:

--- All Docs with Scores ---
0.6190 | RAG reduces hallucinations by grounding answers in retrieved evidence.
-0.0185 | Dense retrieval uses vector embeddings instead of keyword matching.
0.0139 | FAISS is a fast vector similarity library from Meta.
-0.0189 | BM25 is a sparse retrieval method based on term frequency statistics.
-0.0296 | Sentence-Transformers provides easy embedding models for sentences.
-0.0428 | Vector databases store embeddings to enable semantic search.
-0.0015 | Cooking recipes often require precise measurements and timing.
0.0383 | Traveling to new countries helps you learn about culture and history.

Query: How does RAG reduce hallucinations?

Filtered Results (similarity > 0.5):
similarity=0.6190 | RAG reduces hallucinations by grounding answers in retrieved evidence.

层次化检索

层次化检索是一种多阶段检索策略,它在不同粒度层级上组织和搜索信息,例如文档级、章节级和段落级。系统不会直接从整个语料中检索细小的文本块,而是先识别最相关的高层级单元(例如文档或章节),然后再在这些单元内部进一步缩小范围,找出更细粒度的片段。这种分层方式能够减少噪声、提升效率,并确保检索到的证据在上下文上保持连贯。在 RAG 系统中,层次化检索尤其适用于大型或结构化数据集,例如书籍、研究论文或技术手册,因为在这些场景中,grounding 所需要的并不仅仅是孤立的句子,而是与查询相匹配、并且上下文丰富的片段。
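
下文 recipe 以“粗召回 + 精排”体现这一思想;为了更直观地展示“先文档、后段落”的层级结构,这里给出一个不依赖外部库的极简示意(打分仅用查询词重叠计数,纯属演示):

```python
def overlap_score(query: str, text: str) -> int:
    # Illustrative scoring: number of query words that appear in the text
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query, documents, top_docs=1, top_passages=2):
    # Level 1: rank whole documents against the query, keep top_docs
    ranked_docs = sorted(
        documents,
        key=lambda d: overlap_score(query, " ".join(d["passages"])),
        reverse=True,
    )
    results = []
    # Level 2: rank passages only inside the selected documents
    for doc in ranked_docs[:top_docs]:
        ranked = sorted(doc["passages"],
                        key=lambda p: overlap_score(query, p), reverse=True)
        results.extend(ranked[:top_passages])
    return results

documents = [
    {"passages": ["rag grounds answers in evidence",
                  "retrieval finds relevant documents"]},
    {"passages": ["cooking requires timing",
                  "travel teaches culture"]},
]
print(hierarchical_retrieve("rag evidence retrieval", documents))
```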

Recipe 96

本 recipe 演示如何编写一个层次化检索程序:

  • 使用一个示例语料库进行演示。
  • 使用按词切分后的语料文档初始化 BM25。
  • 初始化 cross-encoder,用于精细重排序。
  • 使用 BM25 先粗召回 top-5 文档。
  • 使用 cross-encoder 对粗召回结果进行精细重排。
  • 打印粗检索结果。
  • 打印精细重排后的结果。

安装所需软件包:

pip install sentence-transformers rank-bm25

hierarchical_retrieval.py

请参考以下代码:

# hierarchical_retrieval.py
# Example of hierarchical retrieval using BM25 and Cross-
# Encoder reranking
from sentence_transformers import CrossEncoder
from rank_bm25 import BM25Okapi
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Sample corpus for demonstration purposes
CORPUS = [
    "RAG reduces hallucinations by grounding answers in retrieved evidence.",
    "Dense retrieval uses vector embeddings instead of keyword matching.",
    "BM25 is a sparse retrieval method based on term frequency statistics.",
    "FAISS is a fast vector similarity library from Meta.",
    "Vector databases store embeddings to enable semantic search.",
    "Cooking recipes often require precise measurements and timing.",
    "Traveling to new countries helps you learn about culture and history."
]

# 2. Initialize BM25 with the corpus documents tokenized into words
tokenized_corpus = [doc.lower().split() for doc in CORPUS]
bm25 = BM25Okapi(tokenized_corpus)

# 3. Initialize Cross-Encoder for fine reranking
cross_encoder_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
cross_encoder = CrossEncoder(cross_encoder_model)

# 4. Coarse retrieval function to get top-k documents using bm25
def coarse_retrieve(query, top_k=5):
    tokenized_query = query.lower().split()
    doc_scores = bm25.get_scores(tokenized_query)
    top_indices = doc_scores.argsort()[-top_k:][::-1]
    return [CORPUS[i] for i in top_indices]

# 5. Fine reranking to refine the coarse results using Cross-Encoder
def fine_rerank(query, candidates):
    pairs = [(query, doc) for doc in candidates]
    scores = cross_encoder.predict(pairs)
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked

if __name__ == "__main__":
    # Example query
    query = "How does RAG reduce hallucinations?"
    coarse_results = coarse_retrieve(query, top_k=5)

    # 6. Print coarse retrieval results.
    print(f"Coarse Retrieval ({len(coarse_results)} results):")
    for doc in coarse_results:
        print(f"  {doc}")

    fine_results = fine_rerank(query, coarse_results)

    # 7. Print fine reranked results.
    print(f"\nFine Reranked Results:")
    for doc, score in fine_results:
        print(f"  score={score:.4f} | {doc}")

输出:

Coarse Retrieval (5 results):
  RAG reduces hallucinations by grounding answers in retrieved evidence.
  Cooking recipes often require precise measurements and timing.
  Traveling to new countries helps you learn about culture and history.
  Vector databases store embeddings to enable semantic search.
  FAISS is a fast vector similarity library from Meta.

Fine Reranked Results:
  score=8.4985 | RAG reduces hallucinations by grounding answers in retrieved evidence.
  score=-11.3435 | Traveling to new countries helps you learn about culture and history.
  score=-11.3601 | Cooking recipes often require precise measurements and timing.
  score=-11.3980 | FAISS is a fast vector similarity library from Meta.
  score=-11.4207 | Vector databases store embeddings to enable semantic search.

分块策略

分块策略是指在 RAG 系统中,将长文档切分为更小、更易管理的文本片段的过程,以便进行有效索引与检索。由于语言模型和向量数据库通常更适合处理长度受限的输入,分块可以保证每个片段既足够短,能够满足嵌入模型和上下文窗口的限制,又保留足够的语义信息。常见方法包括固定长度分块、带重叠窗口的滑动分块(用于保持连续性),或者基于自然文本边界(如段落、标题或语篇单元)的语义分块。一个良好的分块策略需要在粒度与连贯性之间取得平衡:块太小会丢失上下文,块太大则会稀释相关性。在 RAG 系统中,高质量的分块是实现准确检索的基础,因为它直接影响被送入生成器的证据质量。
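
下文 recipe 演示的是带重叠的滑动窗口分块;作为对照,下面给出按自然段落边界分块的极简示意(超长段落回退为按词数切分,max_words 为假设参数):

```python
def chunk_by_paragraph(text: str, max_words: int = 50):
    """Split on blank lines; keep short paragraphs whole (preserving natural
    boundaries), and fall back to word-count splitting for long ones."""
    chunks = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = para.split()
        if len(words) <= max_words:
            chunks.append(para)          # natural boundary, coherent context
        else:
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
    return chunks

sample = "A short paragraph.\n\n" + " ".join(["word"] * 120)
print(len(chunk_by_paragraph(sample)))
```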

Recipe 97

本 recipe 演示如何编写一个分块策略程序:

  • 使用一些示例文档进行演示。这些文档可以是更长的文档。
  • 对所有文档进行分块,并创建一个扁平化的块列表。
  • 初始化 SentenceTransformer 模型以生成嵌入。
  • 执行语义搜索,找到 top-k 最相关的块。
  • 初始化 cross-encoder 用于重排。
  • 使用 cross-encoder 对 top-k 块进行重排。
  • 展示语义搜索得到的 top 块及其分数。
  • 展示重排后的结果及其分数。

安装所需软件包:

pip install nltk sentence-transformers

chunk_search_rerank.py

请参考以下代码:

# chunk_search_rerank.py
from sentence_transformers import SentenceTransformer, util
import torch

# 1. Sample documents for demonstration purposes
# These can be longer documents that will be chunked into smaller pieces
DOCUMENTS = [
    """
    Retrieval-Augmented Generation (RAG) is a method that combines retrieval and generation.
    It reduces hallucinations by grounding answers in retrieved documents.
    Dense retrieval uses vector embeddings for semantic similarity.
    BM25 is a sparse retrieval method based on keyword frequency.
    Vector databases store embeddings for fast similarity search.
    """,
    """
    Cooking recipes require precise measurements and timing.
    Ingredients must be prepared in advance.
    Following instructions carefully ensures good results.
    """
]

# Chunking Strategy: Split documents into overlapping chunks of words
def chunk_document(doc, chunk_size=30, overlap=5):
    """
    Split a document into word-based chunks with overlap.
    """
    words = doc.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(' '.join(words[start:end]))
        start += chunk_size - overlap
    return chunks

# 2. Chunk all documents and create a flat list of chunks
all_chunks = []
for doc in DOCUMENTS:
    all_chunks.extend(chunk_document(doc, chunk_size=30, overlap=5))

# 3. Initialize SentenceTransformer model for embeddings
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(MODEL_NAME)
chunk_embeddings = model.encode(all_chunks, convert_to_tensor=True, normalize_embeddings=True)

# 4. Perform semantic search to find top-k relevant chunks
def semantic_search(query, top_k=3):
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    cos_scores = util.cos_sim(query_emb, chunk_embeddings)[0]
    top_results = torch.topk(cos_scores, k=min(top_k, len(all_chunks)))
    return [(all_chunks[idx], float(score)) for score, idx in zip(top_results.values, top_results.indices)]

# 5. Initialize Cross-Encoder for reranking
from sentence_transformers import CrossEncoder
cross_encoder_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
cross_encoder = CrossEncoder(cross_encoder_model)

# 6. Re-rank the top-k chunks using Cross-Encoder
def rerank(query, candidates):
    pairs = [(query, cand) for cand in candidates]
    scores = cross_encoder.predict(pairs)
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked

if __name__ == "__main__":
    # Example Query
    query = "How does RAG reduce hallucinations?"
    search_results = semantic_search(query, top_k=5)

    # 7. Display top chunks from semantic search with scores
    print("Top Chunks from Semantic Search:")
    for chunk, score in search_results:
        print(f"  score={score:.4f} | {chunk}")

    chunks_only = [chunk for chunk, _ in search_results]
    reranked_results = rerank(query, chunks_only)

    # 8. Display reranked results with scores
    print("\nReranked Results:")
    for chunk, score in reranked_results:
        print(f"  score={score:.4f} | {chunk}")

输出:

Top Chunks from Semantic Search:
score=0.2636 | Retrieval-Augmented Generation (RAG) is a method that combines retrieval and generation. It reduces hallucinations by grounding answers in retrieved documents. Dense retrieval uses vector embeddings for semantic similarity. BM25 is
score=0.0191 | Cooking recipes require precise measurements and timing. Ingredients must be prepared in advance. Following instructions carefully ensures good results.
score=0.0029 | for semantic similarity. BM25 is a sparse retrieval method based on keyword frequency. Vector databases store embeddings for fast similarity search.

Reranked Results:
score=7.6566 | Retrieval-Augmented Generation (RAG) is a method that combines retrieval and generation. It reduces hallucinations by grounding answers in retrieved documents. Dense retrieval uses vector embeddings for semantic similarity. BM25 is
score=-11.3524 | Cooking recipes require precise measurements and timing. Ingredients must be prepared in advance. Following instructions carefully ensures good results.
score=-11.3673 | for semantic similarity. BM25 is a sparse retrieval method based on keyword frequency. Vector databases store embeddings for fast similarity search.

批量检索

批量检索是一种在单次操作中同时处理多个查询的检索方式,而不是逐条查询地处理。这种技术在 RAG 系统中特别有用,例如处理多轮对话、需要被拆解为多个子查询的复杂问题,或在大规模推理场景下同时处理大量输入。通过将查询分组成批,系统可以更好地利用向量数据库和嵌入模型的并行能力,从而提高效率并减少计算开销。此外,批量检索还支持在相关查询之间对结果进行比较与融合,有助于捕捉多样视角或互补证据。对于 RAG 系统而言,这种方法不仅加速了检索过程,也通过拓宽相关信息覆盖面增强了系统鲁棒性。
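
批量检索的效率来源可以用一次矩阵乘法直观说明:将所有查询嵌入堆叠为矩阵,与文档嵌入矩阵的转置相乘,即可一次性得到全部查询-文档相似度(以下用随机向量示意,并假设嵌入已做 L2 归一化):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 6 documents, 3 queries, embedding dim 8, L2-normalized
doc_embs = rng.normal(size=(6, 8))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)
query_embs = rng.normal(size=(3, 8))
query_embs /= np.linalg.norm(query_embs, axis=1, keepdims=True)

# One matrix multiplication yields all similarities at once,
# with shape [num_queries, num_docs]
sims = query_embs @ doc_embs.T
top_k = np.argsort(-sims, axis=1)[:, :3]   # top-3 doc indices per query
print(sims.shape, top_k.shape)
```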

Recipe 98

本 recipe 演示如何实现批量检索:

  • 使用一些示例文档进行演示。这些文档可以是更长的文档,并被切分成更小片段。
  • 使用一组示例查询,表示一次同时检索多个查询。
  • 初始化 SentenceTransformer 模型以生成嵌入。
  • 预先计算文档嵌入以提升效率。
  • 为批量中的每个查询检索 top-k 相关文档。
  • 展示每个查询的结果及其分数。

安装所需软件包:

pip install sentence-transformers torch

batch_retrieval.py

请参考以下代码:

# batch_retrieval.py
# Example of batch retrieval using SentenceTransformer embeddings and
# cosine similarity
from sentence_transformers import SentenceTransformer, util
import torch

# 1. Sample documents for demonstration purposes
# These can be longer documents that will be chunked into smaller pieces
DOCUMENTS = [
    "RAG reduces hallucinations by grounding answers in retrieved evidence.",
    "Dense retrieval uses vector embeddings instead of keyword matching.",
    "BM25 is a sparse retrieval method based on term frequency statistics.",
    "FAISS is a fast vector similarity library from Meta.",
    "Vector databases store embeddings to enable semantic search.",
    "Cooking recipes require precise measurements and timing.",
    "Traveling to new countries helps you learn about culture and history."
]

# 2. Sample batch of queries
# These can be multiple queries to retrieve documents at once
QUERIES = [
    "How does RAG reduce hallucinations?",
    "Best methods for dense retrieval?",
    "How to cook precise recipes?"
]

# 3. Initialize SentenceTransformer model for embeddings
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(MODEL_NAME)

# 4. Precompute document embeddings for efficiency
doc_embeddings = model.encode(DOCUMENTS, convert_to_tensor=True, normalize_embeddings=True)

# 5. Retrieve top-k relevant documents for each query in batch
def batch_retrieve(queries, top_k=3):
    """
    Retrieve top_k relevant documents for each query in batch.
    """
    query_embeddings = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)

    # Compute cosine similarity in batch
    cos_scores = util.cos_sim(query_embeddings, doc_embeddings)  # shape: [num_queries, num_docs]

    results = []
    for i, query in enumerate(queries):
        top_results = torch.topk(cos_scores[i], k=min(top_k, len(DOCUMENTS)))
        retrieved = [(DOCUMENTS[idx], float(score)) for score, idx in zip(top_results.values, top_results.indices)]
        results.append((query, retrieved))
    return results

if __name__ == "__main__":
    batch_results = batch_retrieve(QUERIES, top_k=3)

    # 6. Display results for each query with scores
    for query, retrieved_docs in batch_results:
        print(f"\nQuery: {query}")
        print("Top Documents:")
        for doc, score in retrieved_docs:
            print(f"  similarity={score:.4f} | {doc}")

输出:

Query: How does RAG reduce hallucinations?
Top Documents:
similarity=0.6190 | RAG reduces hallucinations by grounding answers in retrieved evidence.
similarity=0.0383 | Traveling to new countries helps you learn about culture and history.
similarity=0.0139 | FAISS is a fast vector similarity library from Meta.

Query: Best methods for dense retrieval?
Top Documents:
similarity=0.5661 | Dense retrieval uses vector embeddings instead of keyword matching.
similarity=0.5094 | BM25 is a sparse retrieval method based on term frequency statistics.
similarity=0.3360 | Vector databases store embeddings to enable semantic search.

Query: How to cook precise recipes?
Top Documents:
similarity=0.6743 | Cooking recipes require precise measurements and timing.
similarity=0.1190 | FAISS is a fast vector similarity library from Meta.
similarity=0.1149 | RAG reduces hallucinations by grounding answers in retrieved evidence.
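
前文提到，批量检索还支持在相关查询之间对结果进行融合。下面给出一个假设性的最小示例（纯 Python，不依赖嵌入模型），用 Reciprocal Rank Fusion（RRF）把多个子查询各自的排名合并为一个总排名。其中 `rrf_fuse` 函数与示例文档 id 均为本示例虚构，仅作说明之用：

```python
# rrf_fusion_sketch.py
# 假设性示例：用 Reciprocal Rank Fusion (RRF) 融合批量子查询的检索排名

def rrf_fuse(ranked_lists, k=60):
    """ranked_lists: 每个子查询返回的文档 id 排名列表（按相关性降序）。
    k: RRF 平滑常数，越大则排名差异的影响越平缓。
    返回按融合分数降序排列的 (doc_id, score) 列表。"""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            # 每个文档在每个排名列表中贡献 1 / (k + rank + 1)
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # 假设三个子查询各自检索得到的文档 id 排名
    sub_query_rankings = [
        ["doc_rag", "doc_dense", "doc_bm25"],
        ["doc_dense", "doc_rag", "doc_vectordb"],
        ["doc_rag", "doc_vectordb", "doc_bm25"],
    ]
    for doc_id, score in rrf_fuse(sub_query_rankings):
        print(f"fused_score={score:.4f} | {doc_id}")
```

RRF 只依赖排名位置而不依赖原始分数，因此可以安全地融合来自不同子查询（甚至不同检索器）、分数尺度互不可比的结果。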

元数据感知检索

元数据感知检索通过利用结构化信息(如作者、日期、来源类型、领域或标签)来增强检索过程,而不仅仅依赖文本内容本身。它不只是基于关键词或语义相似度进行搜索,而是通过元数据过滤或加权机制,引导检索更精准地命中那些在语义上相关、且在上下文上也合适的文档。通过把语义检索与元数据约束结合起来,RAG 系统能够对证据选择进行更细粒度的控制,减少噪声并提高可信度。这使得元数据感知检索在领域型应用中尤为有价值,因为这些场景通常对精确性、来源可追溯性和上下文对齐要求很高。

Recipe 99

本 recipe 演示如何实现元数据感知检索:

  • 使用带有元数据(标签)的示例文档进行演示。
  • 加载预训练 sentence transformer 模型。
  • 预先计算文档嵌入,并连同元数据一起保存,以便检索。
  • 实现元数据感知检索:先按元数据过滤,再检索 top-k 相关文档。
    示例元数据过滤条件如:{"category": "AI", "author": "Alice"}
    它返回既匹配元数据、又在语义上与查询最相似的文档。
    如果没有任何文档满足元数据过滤条件，则返回空列表；如果 metadata_filters 为 None，则在全部文档中检索。
  • 打印结果,并将相似度分数格式化为小数点后 4 位。

安装所需软件包:

pip install sentence-transformers torch

metadata_aware_retrieval.py

请参考以下代码:

# metadata_aware_retrieval.py
# Example of metadata-aware dense retrieval using Sentence Transformers
from sentence_transformers import SentenceTransformer, util
import torch

# 1. Sample documents with metadata (tags) for demonstration purposes
DOCUMENTS = [
    {"text": "RAG reduces hallucinations by grounding answers in retrieved evidence.",
     "category": "AI", "author": "Alice", "year": 2023},
    {"text": "Dense retrieval uses vector embeddings instead of keyword matching.",
     "category": "AI", "author": "Bob", "year": 2022},
    {"text": "Cooking recipes require precise measurements and timing.",
     "category": "Cooking", "author": "Carol", "year": 2021},
    {"text": "Vector databases store embeddings to enable semantic search.",
     "category": "AI", "author": "Alice", "year": 2022},
    {"text": "Traveling to new countries helps you learn about culture and history.",
     "category": "Travel", "author": "Dave", "year": 2023}
]

# 2. Load pre-trained Sentence Transformer model
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(MODEL_NAME)

# 3. Pre-compute document embeddings and store with metadata for retrieval
doc_texts = [doc["text"] for doc in DOCUMENTS]
doc_embeddings = model.encode(doc_texts, convert_to_tensor=True, normalize_embeddings=True)

# 4. Metadata-aware retrieval to filter by metadata and retrieve
# top-k relevant documents
# Example metadata filters used: {"category": "AI", "author": "Alice"}
# It returns documents matching metadata and most similar to
# query based on cosine similarity
# If no documents match metadata, returns empty list
# If metadata_filters is None, retrieves from all documents
def metadata_aware_retrieve(query, metadata_filters=None, top_k=3):
    """
    Retrieve top-k documents relevant to query and metadata filters.
    metadata_filters: dict, e.g., {"category": "AI", "author": "Alice"}
    """
    # Filter documents by metadata
    if metadata_filters:
        filtered_docs = []
        filtered_embeddings = []
        for doc, emb in zip(DOCUMENTS, doc_embeddings):
            match = all(doc.get(key) == value for key, value in metadata_filters.items())
            if match:
                filtered_docs.append(doc["text"])
                filtered_embeddings.append(emb)
        if not filtered_docs:
            return []  # No document matches metadata
        embeddings_tensor = torch.stack(filtered_embeddings)
    else:
        filtered_docs = doc_texts
        embeddings_tensor = doc_embeddings

    # Encode query
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

    # Compute cosine similarity
    cos_scores = util.cos_sim(query_emb, embeddings_tensor)[0]

    # Get top-k results
    top_results = torch.topk(cos_scores, k=min(top_k, len(filtered_docs)))
    return [(filtered_docs[idx], float(score)) for score, idx in zip(top_results.values, top_results.indices)]

if __name__ == "__main__":
    query = "How does RAG reduce hallucinations?"
    filters = {"category": "AI", "author": "Alice"}
    results = metadata_aware_retrieve(query, metadata_filters=filters, top_k=3)

    # 5. Print results with similarity scores formatted to 4 decimal places
    print(f"Query: {query}")
    print(f"Metadata Filters: {filters}")
    print("Top Documents:")
    for doc, score in results:
        print(f"  similarity={score:.4f} | {doc}")

输出:

Query: How does RAG reduce hallucinations?
Metadata Filters: {'category': 'AI', 'author': 'Alice'}
Top Documents:
similarity=0.6190 | RAG reduces hallucinations by grounding answers in retrieved evidence.
similarity=-0.0428 | Vector databases store embeddings to enable semantic search.
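
本 recipe 演示的是元数据硬过滤；前文还提到了加权机制。下面是一个假设性的最小重排序示例（纯 Python，相似度分数为手工给定的示例值，实际应来自嵌入模型），对命中偏好元数据的文档加分，而不是直接剔除其余文档。`boost_by_metadata` 及其 `boost` 参数均为本示例虚构：

```python
# metadata_boost_sketch.py
# 假设性示例：元数据软加权（soft boosting），作为硬过滤的替代方案

def boost_by_metadata(candidates, preferred, boost=0.1):
    """candidates: (text, similarity, metadata) 三元组列表。
    preferred: 偏好的元数据键值对；每命中一项额外加 boost 分。
    返回按加权后分数降序排列的 (text, score) 列表。"""
    reranked = []
    for text, sim, meta in candidates:
        hits = sum(1 for key, value in preferred.items() if meta.get(key) == value)
        reranked.append((text, sim + boost * hits))
    return sorted(reranked, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # 相似度为示例值，实际应来自余弦相似度计算
    candidates = [
        ("Cooking recipes require precise measurements.", 0.55,
         {"category": "Cooking", "author": "Carol"}),
        ("RAG reduces hallucinations via retrieval.", 0.50,
         {"category": "AI", "author": "Alice"}),
    ]
    preferred = {"category": "AI", "author": "Alice"}
    for text, score in boost_by_metadata(candidates, preferred):
        print(f"score={score:.4f} | {text}")
```

与硬过滤相比，软加权在过滤条件过严、可能导致空结果的场景下更稳健：不匹配的文档只是被降权，而不会被完全排除。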

时间感知检索

时间感知检索是一种专门的检索策略,它将时间信息(例如发布时间、事件时间线,或查询对时效性的要求)纳入证据选择过程。它不会将所有文档一视同仁,而是根据时间相关性来优先排序结果,从而确保响应所依据的信息不仅是正确的,而且是时效上恰当的。例如,在回答有关当前事件、最新研究或持续变化法规的问题时,时间感知检索可以优先选择更新的文档,同时仍允许访问较早材料以提供历史背景。在 RAG 系统中,这种方法能够避免过时或失效证据误导生成模型,从而在信息变化迅速的动态领域中提升答案的准确性、新鲜度和可信度。

Recipe 100

本 recipe 演示如何实现时间感知检索:

  • 使用带时间戳(YYYY-MM-DD)的示例文档进行演示。
  • 加载预训练 sentence transformer 模型。
  • 预先计算文档嵌入。
  • 实现时间感知检索：支持可选的时间范围过滤（start_date 和 end_date）。
    其中 query 是输入查询字符串,top_k 指定返回多少个 top 文档。
    返回值是 (document_text, similarity_score) 元组列表。
    如果没有任何文档匹配时间过滤条件,则返回空列表。
  • 打印查询结果、时间范围以及带相似度分数的 top 文档。

安装所需软件包:

pip install sentence-transformers torch

time_aware_retrieval.py

请参考以下代码:

# time_aware_retrieval.py
# Example of time-aware document retrieval using embeddings and timestamps
from sentence_transformers import SentenceTransformer, util
import torch
from datetime import datetime

# 1. Sample documents with timestamps
# (YYYY-MM-DD) for demonstration purposes
DOCUMENTS = [
    {"text": "RAG reduces hallucinations by grounding answers in retrieved evidence.",
     "timestamp": "2023-05-10"},
    {"text": "Dense retrieval uses vector embeddings instead of keyword matching.",
     "timestamp": "2022-08-15"},
    {"text": "Cooking recipes require precise measurements and timing.",
     "timestamp": "2021-12-01"},
    {"text": "Vector databases store embeddings to enable semantic search.",
     "timestamp": "2022-11-20"},
    {"text": "Traveling to new countries helps you learn about culture and history.",
     "timestamp": "2023-01-05"}
]

# 2. Load pre-trained sentence transformer model
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
model = SentenceTransformer(MODEL_NAME)

# 3. Pre-compute embeddings for documents
doc_texts = [doc["text"] for doc in DOCUMENTS]
doc_embeddings = model.encode(doc_texts, convert_to_tensor=True, normalize_embeddings=True)

# 4. Time-aware retrieval function
# Query with optional time range filtering with start_date and end_date,
# query is the input query string and top_k which specifies how many
# top documents to return
# Returns list of (document_text, similarity_score) tuples
# If no documents match the time filter, returns an empty list
def time_aware_retrieve(query, start_date=None, end_date=None, top_k=3):
    """
    Retrieve top-k documents relevant to query and within the time range.
    start_date, end_date: string in 'YYYY-MM-DD' format
    """
    # Parse the time range once, then filter documents by timestamp
    start_dt = datetime.strptime(start_date, "%Y-%m-%d") if start_date else None
    end_dt = datetime.strptime(end_date, "%Y-%m-%d") if end_date else None

    filtered_docs = []
    filtered_embeddings = []
    for doc, emb in zip(DOCUMENTS, doc_embeddings):
        doc_date = datetime.strptime(doc["timestamp"], "%Y-%m-%d")
        if start_dt and doc_date < start_dt:
            continue
        if end_dt and doc_date > end_dt:
            continue
        filtered_docs.append(doc["text"])
        filtered_embeddings.append(emb)

    if not filtered_docs:
        return []  # No document matches the time filter

    embeddings_tensor = torch.stack(filtered_embeddings)

    # Encode query
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

    # Compute cosine similarity
    cos_scores = util.cos_sim(query_emb, embeddings_tensor)[0]

    # Get top-k results
    top_results = torch.topk(cos_scores, k=min(top_k, len(filtered_docs)))
    return [(filtered_docs[idx], float(score)) for score, idx in zip(top_results.values, top_results.indices)]

if __name__ == "__main__":
    # Example Query
    query = "How does RAG reduce hallucinations?"
    start_date, end_date = "2023-01-01", "2023-12-31"
    results = time_aware_retrieve(query, start_date=start_date, end_date=end_date, top_k=3)

    # 5. Print query results, time range, and top documents with
    # similarity scores
    print(f"Query: {query}")
    print(f"Time Range: {start_date} to {end_date}")
    print("Top Documents:")
    for doc, score in results:
        print(f"  similarity={score:.4f} | {doc}")

输出:

Query: How does RAG reduce hallucinations?
Time Range: 2023-01-01 to 2023-12-31
Top Documents:
similarity=0.6190 | RAG reduces hallucinations by grounding answers in retrieved evidence.
similarity=0.0383 | Traveling to new countries helps you learn about culture and history.
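
本 recipe 使用的是硬性时间范围过滤；如前文所述，时间感知检索也可以在不排除旧文档的前提下优先选择新文档。下面是一个假设性的最小示例（纯 Python），用指数半衰期衰减对相似度分数加权。`time_decayed_score` 函数与 `half_life_days` 参数均为本示例虚构，相似度分数为手工给定的示例值：

```python
# time_decay_sketch.py
# 假设性示例：用指数时间衰减（半衰期）对相似度加权，而非硬性时间过滤
from datetime import datetime

def time_decayed_score(similarity, timestamp, now, half_life_days=180.0):
    """每经过 half_life_days 天，权重减半。
    timestamp: 'YYYY-MM-DD' 格式的文档时间戳；返回衰减后的分数。"""
    age_days = (now - datetime.strptime(timestamp, "%Y-%m-%d")).days
    decay = 0.5 ** (max(age_days, 0) / half_life_days)
    return similarity * decay

if __name__ == "__main__":
    now = datetime(2023, 6, 1)
    # (text, raw similarity, timestamp)：相似度为示例值
    docs = [
        ("Older but slightly more similar document.", 0.60, "2021-06-01"),
        ("Recent document on the same topic.", 0.50, "2023-05-01"),
    ]
    ranked = sorted(
        ((text, time_decayed_score(sim, ts, now)) for text, sim, ts in docs),
        key=lambda item: item[1], reverse=True,
    )
    for text, score in ranked:
        print(f"decayed_score={score:.4f} | {text}")
```

半衰期越短，系统越偏向新文档；这种软性加权也可以与本 recipe 的硬过滤结合使用，先限定大致时间范围，再在范围内按新鲜度加权。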

结论

高效搜索是 RAG 的核心,因为被检索到的证据质量将直接决定生成响应的准确性、连贯性和可信度。在本章中,我们探讨了一系列检索策略,从经典方法(如 BM25 关键词搜索)到高级技术(如稠密检索、混合检索和层次化检索)。我们还考察了查询扩展、语义过滤、分块、批量检索,以及元数据感知与时间感知等实用增强手段,它们分别针对召回率、精度、效率与上下文对齐中的特定挑战。所有这些方法共同构成了一套工具箱,可以根据具体 RAG 系统的独特需求进行定制,以确保生成模型始终建立在高质量、强相关的证据之上。

在已经打下坚实搜索基础之后，我们接下来将进一步看看这些检索方法如何被整合进真实世界的 RAG 工作流。下一章将演示如何借助编排框架，把检索器与语言模型连接起来。我们将学习搜索流水线是如何被构建、组合与优化的，从而支撑端到端的 RAG 应用。