RAG Development Handbook with Python: Vector Stores for Semantic Retrieval

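Introduction

As the volume of digital information grows at an unprecedented pace, finding the right piece of knowledge quickly has become one of the biggest challenges in modern applications. Traditional keyword-based search engines rely on lexical matching, and they tend to perform poorly when a query is phrased differently from the target text.

This limitation gave rise to semantic retrieval, a paradigm in which both documents and queries are represented as vectors in a high-dimensional space, so that similarity is measured by meaning rather than by exact words. The core infrastructure behind this paradigm is the vector store: a database designed specifically to store, index, and query embeddings efficiently.

Vector stores provide critical infrastructure for RAG systems and other AI-driven applications. By embedding text chunks (and even multimodal data such as images and audio) into a vector space and indexing them, these stores enable retrieval that is fast, accurate, and aligned with user intent.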

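Structure

This chapter covers the following topics:

Software requirements
FAISS vector store
Loading and querying a persisted vector store
Hybrid search with metadata filtering
Batch embedding
ChromaDB as a vector store
Combining vector stores with document splitters
Semantic search over auto-summarized chunks
Speeding up vector store creation with asynchronous indexing
Evaluating vector store retrieval performance
Multi-vector retrieval combining dense and sparse

Learning objectives

By the end of this chapter, readers will understand how vector stores act as the backbone of semantic retrieval systems by efficiently storing, managing, and searching high-dimensional embeddings. Readers will understand the role vector databases play in fast, accurate similarity search, how they integrate with embedding models, and the trade-offs among scalability, indexing method, and retrieval performance. Readers will also gain a clearer picture of how to select, implement, and optimize a vector store for real-world semantic search and RAG applications.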

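Software requirements

Every concept in this book is followed by a recipe, that is, runnable code written in Python. Each recipe includes code comments that explain what every step of the code does.

Running the recipes requires the following software environment:

System configuration: a system with at least 16.0 GB of RAM
Operating system: Windows
Python: Python 3.13.3 or later
LangChain: 1.0.5
LLM model: llama3.2:3b from Ollama
Program input files: the input files used by the programs are available in the book's Git repository

To run a program, install the dependencies listed in the recipe with pip install <package names>. Once installation completes, run the recipe's Python script (.py file) in your development environment.

Figure 5.1 shows a vector store:

Figure 5.1: Vector store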

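FAISS vector store

Facebook AI Similarity Search (FAISS) is one of the most widely used libraries for efficient similarity search and clustering of dense vectors. In a RAG system, FAISS can serve as a powerful backend for storing and retrieving embeddings generated from text documents. Unlike simple keyword search, FAISS matches queries by semantic meaning, which makes it an important building block of an intelligent retrieval pipeline.

In this section, we explore how to convert raw text documents into embeddings and store them in a FAISS vector index. The workflow usually has three steps:

Document preparation: load and preprocess the raw text, turning it into structured chunks suitable for embedding.
Embedding generation: use a language model's embedding function to convert the text chunks into high-dimensional vectors.
Indexing in FAISS: store the generated vectors in a FAISS index that supports fast similarity search at scale.

With FAISS, you can scale semantic search efficiently to millions of documents while keeping retrieval accuracy high.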
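
The three steps above can be illustrated without any library. The following is a minimal sketch, not FAISS itself: the three-dimensional vectors are hypothetical hand-written stand-ins for real embeddings (a real model emits hundreds of dimensions), and the brute-force scan over every stored vector is the simplest possible form of similarity search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings": hand-picked 3-d vectors, not model output.
index = {
    "FAISS is a library for efficient similarity search.": [0.9, 0.1, 0.0],
    "LangChain supports multiple vector stores.": [0.2, 0.9, 0.1],
    "Embeddings transform text into vectors.": [0.1, 0.2, 0.9],
}

def search(query_vector, k=2):
    """Rank every stored text by similarity to the query and keep the top k."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Pretend this vector is the embedding of a question about similarity search.
query_vector = [0.8, 0.3, 0.1]
for text in search(query_vector, k=2):
    print(text)
```

A library such as FAISS replaces the linear scan with optimized index structures, which is what makes search over millions of vectors practical.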

Recipe 52

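This recipe demonstrates how to create a FAISS vector store from text documents:

Prepare your text content. It will be indexed in the FAISS vector store.

Convert the texts to Document objects. LangChain needs this to handle the text properly.

Load the HuggingFaceEmbeddings embedding model.

Create the FAISS vector store. It indexes the documents using the specified embedding model.

Save the FAISS index to disk. This persists the index so you can load it later without re-indexing.

Perform a similarity search. This finds the most relevant documents for a given query.

Install the required dependencies:

pip install langchain langchain-community langchain-huggingface faiss-cpu sentence-transformers

faiss_vector_store_from_text.py

Refer to the following code: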

# faiss_vector_store_from_text.py
# Step 1: import necessary libraries
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# Step 2: Prepare your text documents
# These documents will be indexed in the FAISS vector store.
texts = [
    "Retrieval Augmented Generation enhances LLMs output by injecting external knowledge.",
    "LangChain supports multiple vector stores including FAISS and Chroma.",
    "FAISS is a library for efficient similarity search.",
    "Embeddings transform text into numerical vector representations."
]

# Step 3: Convert texts to Document objects
# This is necessary for LangChain to handle the text properly.
documents = [Document(page_content=text) for text in texts]

# Step 4: Load the embedding model
# You can choose a different model if needed, but this one is
# efficient for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Step 5: Create the FAISS vector store
# This will index the documents using the specified embedding model.
vectorstore = FAISS.from_documents(documents, embedding_model)

# Step 6: Save the FAISS index to disk
# This allows you to persist the index and load it later without
# re-indexing.
vectorstore.save_local("chapter5_faiss_index")

# Step 7: Perform a similarity search
# This will find the most relevant documents for a given query.
query = "What is LangChain?"
results = vectorstore.similarity_search(query, k=2)

# Step 8: Output the results
# This will print the content of the retrieved documents.
for i, res in enumerate(results):
    print(f"Result {i+1}: {res.page_content}")

Output the results. This prints the content of the retrieved documents.

Output:

Result 1: LangChain supports multiple vector stores including FAISS and Chroma.
Result 2: FAISS is a library for efficient similarity search.

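Loading and querying a persisted vector store

Once a vector store has been built and saved, the next key step is being able to reload it and query it efficiently. Persisting the store means you do not have to re-embed and re-index your documents every time the application runs. Instead, you load the saved FAISS, Chroma, or other supported vector index and start semantic retrieval right away.

The process usually has two main steps:

Loading the store: reinitialize the vector database from disk or a remote location. This step restores the vector index together with its associated metadata (such as document titles, sources, or IDs), so that queries keep their contextual meaning.
Querying the store: once loaded, the store can be queried through similarity search or hybrid retrieval methods. The query is first embedded as a vector, then compared against the stored embeddings, and the closest matches are returned along with their metadata. Users can retrieve relevant documents immediately, with no index rebuild.

Persisted vector stores are especially valuable in production, where document collections are usually large and updated relatively infrequently.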
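
To make the save and load cycle concrete, here is a minimal sketch that persists vectors and metadata as JSON and restores them. It is an illustration only, assuming a hypothetical record layout (id, vector, metadata) and a hypothetical file name; FAISS and Chroma use their own on-disk formats.

```python
import json
import tempfile
from pathlib import Path

def save_store(path: Path, records: list[dict]) -> None:
    """Write vectors and their metadata to disk in one JSON file."""
    path.write_text(json.dumps(records), encoding="utf-8")

def load_store(path: Path) -> list[dict]:
    """Restore the records exactly as they were saved."""
    return json.loads(path.read_text(encoding="utf-8"))

# Hypothetical records: tiny 2-d vectors with source metadata.
records = [
    {"id": "doc-1", "vector": [0.9, 0.1], "metadata": {"source": "intro"}},
    {"id": "doc-2", "vector": [0.2, 0.8], "metadata": {"source": "db"}},
]

with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "toy_store.json"  # hypothetical file name
    save_store(store_path, records)
    restored = load_store(store_path)

# The restored store carries both vectors and metadata, so it can be
# queried without re-embedding anything.
print(restored == records)
```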

Recipe 53

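This recipe demonstrates how to load and query a persisted vector store:

Create and persist the vector store:

Load the HuggingFaceEmbeddings embedding model.

Prepare your text documents. They will be indexed in the FAISS vector store.

Create the FAISS vector store. It indexes the documents using the HuggingFaceEmbeddings embedding model.

Save the FAISS index to disk. This persists the index so you can load it later without re-indexing.

Install the required dependencies:

pip install langchain langchain-community langchain-huggingface faiss-cpu sentence-transformers

load_faiss_vectorstore.py

Refer to the following code: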

# load_faiss_vectorstore.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Load the embedding model
# You can choose a different model if needed, but this one is
# efficient for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# 2. Prepare your text documents
# These documents will be indexed in the FAISS vector store.
documents = [
    Document(page_content="LangChain is a framework for building LLM-powered apps."),
    Document(page_content="FAISS is a vector database for fast similarity search."),
    Document(page_content="Transformers from Hugging Face are widely used in NLP."),
]

# 3. Create the FAISS vector store
# This will index the documents using the specified embedding
# model.
vectorstore = FAISS.from_documents(documents, embedding=embedding_model)

# 4. Save the FAISS index to disk
# This allows you to persist the index and load it later without
# re-indexing.
save_dir = "chapter5_faiss_store"
vectorstore.save_local(folder_path=save_dir)

print("Vector Store saved to:", save_dir)

Output:

Vector Store saved to: chapter5_faiss_store

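Query a persisted vector store:

Load the FAISS vector store. Make sure the FAISS index has already been created and saved in the specified directory. Because the saved store is deserialized with pickle, set allow_dangerous_deserialization=True only for index files you created yourself or otherwise trust.

Define your query. This is the question you want answered with the help of the retrieved context.

query_faiss_vectorstore.py

Refer to the following code: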

# query_faiss_vectorstore.py
# This code snippet demonstrates how to query a FAISS vector store
# using LangChain and Hugging Face embeddings.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Load the FAISS vector store
# Ensure the FAISS index is already created and saved in the
# specified directory.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.load_local(
    folder_path="chapter5_faiss_store",
    embeddings=embedding_model,
    allow_dangerous_deserialization=True
)

# 2. Define your query
# This is the question you want to answer using the retrieved
# context.
query = "What is LangChain?"
results = vectorstore.similarity_search(query, k=2)

# 3. Output the results
# This will print the content of the retrieved documents.
print("\nTop Matches:")
for i, doc in enumerate(results, start=1):
    print(f"{i}. {doc.page_content}")

Output the results. The program prints the content of the retrieved documents.

Output:

Top Matches:
1. LangChain is a framework for building LLM-powered apps.
2. Transformers from Hugging Face are widely used in NLP.

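Hybrid search with metadata filtering

In many retrieval systems, returning semantically similar results is not enough; users often need to apply filters that restrict results by specific document attributes. This is where metadata filtering becomes powerful in vector search.

Metadata is structured information about a document, for example:

Source type: a website, research paper, customer email, or Slack message.
Author or creator: an employee name, department, or publisher.
Date or time: for example, return only documents published after 2023, or only those from the last month.
Category or tags: for example finance, legal, medical, or project-specific tags.

Combining semantic similarity with metadata filtering makes retrieval more precise and more context-aware. For example, a user searching for a project roadmap can keep only documents tagged Q1 2025 or written by the product team. The results are then not only semantically relevant but also matched to the user's context.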
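
The roadmap example can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the Record class, the two-dimensional vectors, and the team and tag fields are all hypothetical, and the filter is applied before ranking (some vector stores instead retrieve a candidate set first and filter afterwards).

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)

def dot(a, b):
    """Dot product used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

def filtered_search(records, query_vector, metadata_filter, k=2):
    """Keep records whose metadata matches every filter key, then rank them."""
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(key=lambda r: dot(query_vector, r.vector), reverse=True)
    return candidates[:k]

# Hypothetical documents with hand-written vectors and tags.
records = [
    Record("Q1 2025 roadmap draft", [0.9, 0.1], {"team": "product", "tag": "Q1 2025"}),
    Record("Q4 2024 roadmap review", [0.8, 0.2], {"team": "product", "tag": "Q4 2024"}),
    Record("Q1 2025 budget forecast", [0.3, 0.7], {"team": "finance", "tag": "Q1 2025"}),
]

hits = filtered_search(records, [1.0, 0.0], {"tag": "Q1 2025"}, k=2)
for hit in hits:
    print(hit.text, hit.metadata)
```

The Q4 2024 review is semantically close to the query vector but is excluded because its tag does not match, which is exactly the effect metadata filtering is meant to have.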

Recipe 54

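This recipe demonstrates how to run a hybrid search with metadata filtering:

Prepare your text documents. They will be indexed in the FAISS vector store. Each document can carry metadata for filtering at search time.

Load the HuggingFaceEmbeddings embedding model. You can choose a different model if needed, but this one is efficient for many tasks.

Create the FAISS vector store. It indexes the documents using the specified embedding model.

Run a hybrid search with metadata filtering. This finds the most relevant documents for a given query by combining semantic similarity with metadata.

Install the required dependencies:

pip install langchain langchain-community langchain-huggingface faiss-cpu sentence-transformers

hybrid_search_with_metadata.py

Refer to the following code: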

# hybrid_search_with_metadata.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Prepare your text documents
# These documents will be indexed in the FAISS vector store.
# Each document can have metadata for filtering during search.
docs = [
    Document(page_content="LangChain is a tool to build LLM applications.", metadata={"source": "intro", "topic": "LLM"}),
    Document(page_content="FAISS is used for vector similarity search.", metadata={"source": "db", "topic": "Vector DB"}),
    Document(page_content="OpenAI provides GPT models.", metadata={"source": "api", "topic": "LLM"}),
    Document(page_content="FAISS and Chroma are vector databases.", metadata={"source": "db", "topic": "Vector DB"}),
]

# 2. Load the embedding model
# You can choose a different model if needed, but this one is
# efficient for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# 3. Create the FAISS vector store
# This will index the documents using the specified embedding model.
vectorstore = FAISS.from_documents(docs, embedding=embedding_model)

# 4. Perform a hybrid search with metadata filtering
# This will find the most relevant documents for a given query,
# considering both semantic similarity and metadata
query = "Which tools help with semantic search?"
results = vectorstore.similarity_search(
    query,
    k=2,
    filter={"topic": "Vector DB"}  # Keep only documents whose metadata matches
)

# 5. Output the results
# This will print the content of the retrieved documents along with
# their metadata.
print("\nHybrid Search Results (Semantic + Metadata):")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content} | Metadata: {doc.metadata}")

Output the results. The program prints the content of the retrieved documents along with their metadata.

Output:

Hybrid Search Results (Semantic + Metadata):
1. FAISS is used for vector similarity search. | Metadata: {'source': 'db', 'topic': 'Vector DB'}
2. FAISS and Chroma are vector databases. | Metadata: {'source': 'db', 'topic': 'Vector DB'}

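Batch embedding

For a small document collection, embedding and indexing items one at a time is usually enough. Real-world applications, however, often deal with massive collections, such as millions of articles, research papers, customer tickets, or enterprise documents, and must process them more efficiently. Embedding datasets of that size calls for a batching strategy that balances performance, memory use, and reliability.

Batch embedding means splitting the corpus into manageable pieces, generating embeddings for each batch, and writing them into the vector store incrementally. The approach has the following advantages:

Efficiency: processing documents in batches reduces repeated overhead and makes fuller use of GPU/CPU resources.
Scalability: large datasets can be processed without exhausting memory or hitting API rate limits.
Fault tolerance: if a batch fails, only that small portion of the data needs to be retried, not the entire pipeline.
Parallelization: batches can be distributed across multiple workers or machines to increase throughput.

The workflow is typically as follows:

Preprocessing: split the raw corpus into text chunks and attach metadata.
Batching: group the chunks into batches (for example, 100 to 1,000 per batch).
Embedding: generate embeddings for each batch with a language model.
Insertion: store the resulting vectors in the vector store incrementally.

In an enterprise RAG pipeline, batch embedding is essential because efficiency and scale matter as much as retrieval accuracy. It lets the system handle large, continuously evolving knowledge bases without sacrificing performance.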
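
The batching and fault-tolerance ideas can be sketched with the standard library alone. The code below is a minimal sketch: fake_embed_batch is a hypothetical stand-in for a real embedding call, and the retry policy (retry a failed batch up to max_retries times on RuntimeError) is an assumption chosen for illustration.

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list],
    batch_size: int = 100,
    max_retries: int = 2,
) -> list:
    """Embed chunks batch by batch, retrying only the batch that failed."""
    vectors = []
    for batch in batched(chunks, batch_size):
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_batch(batch))
                break  # this batch succeeded, move on to the next one
            except RuntimeError:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
    return vectors

def fake_embed_batch(batch: Sequence[str]) -> list:
    """Hypothetical embedder: one 1-d vector per text (its length)."""
    return [[float(len(text))] for text in batch]

chunks = [f"chunk {i}" for i in range(250)]
vectors = embed_in_batches(chunks, fake_embed_batch, batch_size=100)
print(len(vectors))  # 250 vectors produced from batches of 100, 100, and 50
```

Because each batch is retried on its own, a transient failure costs at most one batch of repeated work rather than a full rerun.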

Recipe 55

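This recipe demonstrates how to embed a large corpus into a vector store in batches:

Define a custom embedding class for batching.

Prepare your large text corpus.

Split the documents into chunks. This step is crucial for large documents because it keeps embedding and indexing efficient.

Create the FAISS vector store with batched embeddings. It indexes the chunks using the custom embedding class.

Save the vector store to disk. It can then be loaded later without re-embedding the documents.

Load the vector store and perform a similarity search. This shows how to retrieve documents similar to a query.

Install the required dependencies:

pip install langchain langchain-community langchain-text-splitters faiss-cpu sentence-transformers tqdm

batch_embedding_large_content.py

Refer to the following code: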

# batch_embedding_large_content.py
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.embeddings import Embeddings
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Define a custom embedding class for batching
class BatchedSentenceTransformerEmbedding(Embeddings):
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", batch_size: int = 64):
        self.model = SentenceTransformer(model_name)
        self.batch_size = batch_size

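    def embed_documents(self, texts):
        """Embed texts batch by batch and return plain Python lists."""
        embeddings = []
        for i in tqdm(range(0, len(texts), self.batch_size), desc="Embedding Batches"):
            batch = texts[i:i + self.batch_size]
            batch_embeddings = self.model.encode(batch, show_progress_bar=False)
            # encode() returns a NumPy array; convert it to lists of floats
            # to match the Embeddings interface.
            embeddings.extend(batch_embeddings.tolist())
        return embeddings

    def embed_query(self, text):
        """Embed a single query string."""
        return self.model.encode([text])[0].tolist()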

# 2. Prepare your large corpus of text documents
# This is a simulated large dataset for demonstration purposes.
corpus = [
    "LangChain enables LLM applications.",
    "FAISS performs fast vector searches."
] * 10  # Simulated large dataset

documents = [Document(page_content=doc) for doc in corpus]

# 3. Split Documents into Chunks
# This step is crucial for large documents to ensure efficient
# embedding and indexing.
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=30)
chunked_docs = []
for doc in documents:
    chunks = splitter.split_text(doc.page_content)
    chunked_docs.extend([Document(page_content=chunk) for chunk in chunks])

# 4. Create the FAISS vector store with batched embeddings
# This will index the documents using the custom embedding model.
embedding_model = BatchedSentenceTransformerEmbedding()
vectorstore = FAISS.from_documents(chunked_docs, embedding=embedding_model)

# 5. Save the vector store to disk
# This allows for later retrieval without needing to
# re-embed the documents
vectorstore.save_local("faiss_large_content_index")

# 6. Load the vector store and perform a similarity search
# This demonstrates how to retrieve documents similar to a query.
vectorstore = FAISS.load_local(
    "faiss_large_content_index",
    embedding_model,
    allow_dangerous_deserialization=True
)
results = vectorstore.similarity_search("What does FAISS do?", k=2)

# print the search results
for r in results:
    print(r.page_content)

Print the search results.

Output:

FAISS performs fast vector searches.
FAISS performs fast vector searches.

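ChromaDB as a vector store

ChromaDB is an open-source vector database designed for machine learning and RAG workflows. Unlike a general-purpose database, ChromaDB is optimized specifically for storing, querying, and managing embeddings, which makes it well suited to intelligent applications that depend on semantic search.

A standout feature of ChromaDB is its developer-friendly design and built-in persistence. It supports an in-memory mode for quick experiments as well as persistent storage for production, so vector indexes can be reused across sessions without re-embedding. ChromaDB also integrates smoothly with mainstream machine learning frameworks and the LLM ecosystem, which makes it easy for developers to adopt.

Its key features include:

Ease of use: a simple API for creating, updating, and querying embedding collections.
Metadata support: additional structured data (such as source, author, tags, timestamps) can be stored alongside embeddings to support more powerful filtering.
Hybrid search capabilities: dense embeddings can be combined with metadata constraints for more precise retrieval.
Persistence: collections can be saved and reloaded for long-term use in production pipelines.
Scalability: efficient indexing and querying for large corpora.

In a typical workflow, documents are split into chunks, converted into embeddings, and stored in a Chroma collection together with their metadata. A query is then embedded and compared against the stored vectors to return the most semantically relevant results. Metadata filtering can refine the results further so that answers fit the context more closely.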

Recipe 56

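This recipe demonstrates how to use ChromaDB as a vector store:

Load a pretrained SentenceTransformer model. It generates embeddings for the text documents.

Initialize ChromaDB as the vector store. This creates a local ChromaDB instance for storing and retrieving embeddings.

Create some example documents to add to the vector store.

Add the documents to the vector store. This generates embeddings for the documents and stores them in ChromaDB.

Perform a similarity search. This finds documents similar to the query based on their embeddings.

Install the required dependencies:

pip install langchain langchain-community langchain-chroma langchain-huggingface chromadb sentence-transformers

chromadb_as_vector_store.py

Refer to the following code: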

# chromadb_as_vector_store.py
# This script demonstrates how to use ChromaDB as a vector store
# for text embeddings.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Load a pre-trained embedding model
# This model is used to generate embeddings for the text documents.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# 2. Initialize ChromaDB as a vector store
# This will create a local ChromaDB instance to store and retrieve
# embeddings.
persist_directory = "./chroma_db_offline"
vectorstore = Chroma(
    collection_name="offline_collection",
    embedding_function=embeddings,
    persist_directory=persist_directory
)

# 3. Create some example documents
# These documents will be added to the vector store.
docs = [
    Document(page_content="ChromaDB is an open-source vector database for AI applications.", metadata={"source": "wiki"}),
    Document(page_content="LangChain helps developers build context-aware applications with LLMs.", metadata={"source": "wiki"}),
    Document(page_content="Python is a popular programming language for AI and data science.", metadata={"source": "wiki"})
]

# 4. Add documents to the vector store
# This will generate embeddings for the documents and store them
# in ChromaDB.
vectorstore.add_documents(docs)
print("Documents added and saved (offline).")

# 5. Perform a similarity search
# This will search for documents similar to the query based on
# their embeddings.
query = "Tell me about ChromaDB"
results = vectorstore.similarity_search(query, k=2)

# 6. Print the search results
# Displaying the results of the similarity search.
print("\nSearch Results:")
for i, doc in enumerate(results, start=1):
    print(f"{i}. {doc.page_content} (source: {doc.metadata['source']})")

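Print the search results, that is, display what the similarity search returned.

Output (the duplicate entry most likely comes from running the script more than once against the same persist directory, since every run adds the same documents to the collection again):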

Documents added and saved (offline).

Search Results:
1. ChromaDB is an open-source vector database for AI applications. (source: wiki)
2. ChromaDB is an open-source vector database for AI applications. (source: wiki)

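Combining vector stores with document splitters

The effectiveness of a vector store depends heavily on the documents it contains. If documents are too large, embeddings may fail to capture fine-grained semantics; if they are too small, retrieval may lose coherence and context. This is where document splitters come in: they break large text sources into manageable chunks before embedding and indexing.

Document splitters keep each chunk balanced between semantic richness and retrieval precision. For example, once a long research paper is split into segments of 500 to 1,000 tokens, each part can be embedded independently, so the retriever can return the specific section relevant to a query instead of the whole document.

Combining document splitters with a vector store brings the following advantages:

Improved recall: queries match smaller, semantically denser chunks instead of competing with large amounts of irrelevant text in a long document.
Context preservation: overlap between chunks preserves continuity across long passages.
Scalability: large corpora can be embedded incrementally chunk by chunk and stored efficiently in the vector index.
Granular metadata: each chunk can carry its own metadata (such as section title or page number), which makes filtered search more powerful.

The typical workflow is as follows:

Load documents: import raw text, PDFs, or structured files.
Split into chunks: use a character-based, token-based, or semantics-aware splitter.
Embed chunks: convert each chunk into an embedding.
Store in vector database: insert the embeddings and their metadata into the chosen vector store.

This combination forms the foundation of a robust RAG pipeline, keeping search results both semantically accurate and contextually relevant. With splitting, the retriever no longer works against a coarse index of whole documents but against a fine-grained index that is better suited to natural-language queries.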
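
To show how overlap preserves continuity between neighboring chunks, here is a minimal character-based splitter. It is a simplified sketch, not the algorithm used by any particular library splitter; the function name and the tiny alphabet input are chosen only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share
    `overlap` characters with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, overlap=3)
for i, chunk in enumerate(chunks):
    # The index could be stored as chunk-level metadata next to the embedding.
    print(i, chunk)
# 0 abcdefghij
# 1 hijklmnopq
# 2 opqrstuvwx
# 3 vwxyz
```

Each chunk begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk as long as it is shorter than the overlap.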

Recipe 57

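This recipe demonstrates how to use a vector store together with a document splitter:

Choose your file (PDF or TXT).

Load the document. A different loader is used depending on the file type.

Split the document into chunks. This creates smaller pieces of text that are easier to process and store.

Load the SentenceTransformer embedding model. It generates embeddings for the text chunks.

Initialize ChromaDB as the vector store. This creates a local ChromaDB instance for storing and retrieving embeddings.

Add the documents to the vector store. This stores the embeddings in ChromaDB.

Perform a similarity search. This finds documents similar to the query based on their embeddings.

Install the required dependencies:

pip install langchain langchain-community langchain-chroma langchain-huggingface langchain-text-splitters chromadb sentence-transformers pypdf

document_split_vector_store.py

Refer to the following code: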

# document_split_vector_store.py
# This script demonstrates how to split a document into chunks and
# store them in ChromaDB as a vector store.
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Choose your file (PDF or TXT)
file_path = "RAG.pdf"  # change to your file

# 2. Load the document
# Depending on the file type, we use different loaders.
if file_path.lower().endswith(".pdf"):
    loader = PyPDFLoader(file_path)
elif file_path.lower().endswith(".txt"):
    loader = TextLoader(file_path)
else:
    raise ValueError("Unsupported file type. Use PDF or TXT.")

documents = loader.load()
print(f"Loaded {len(documents)} pages from {file_path}")

# 3. Split the document into chunks
# This will create smaller chunks of text for better processing
# and storage.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # characters per chunk
    chunk_overlap=50 # overlap between chunks
)
docs = text_splitter.split_documents(documents)
print(f"Split into {len(docs)} chunks.")

# 4. Load SentenceTransformer embeddings
# This model is used to generate embeddings for the text chunks.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# 5. Initialize ChromaDB as a vector store
# This will create a local ChromaDB instance to store and retrieve
# embeddings.
persist_directory = "./chapter5_chroma_db_docs"
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory=persist_directory
)

# 6. Confirm storage
# Chroma.from_documents above has already embedded the chunks and
# stored them in ChromaDB.
print("Documents stored in ChromaDB.")

# 7. Perform a similarity search
# This will search for documents similar to the query based on
# their embeddings.
query = "Explain RAG"
results = vectorstore.similarity_search(query, k=1)

# 8. Display the search results
print("\nSearch Results:")
for i, doc in enumerate(results, start=1):
    print(f"{i}. {doc.page_content.strip()[:200]}...")  # Print first 200 characters of each result

Display the search results.

Output:

Documents stored in ChromaDB.

Search Results:
1. Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and q...

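Semantic search over auto-summarized chunks

One challenge in a RAG pipeline is balancing granularity against context at retrieval time. Splitting documents into smaller chunks helps recall, but smaller chunks can also lose key context and make answers fragmented or less meaningful. An effective remedy is to pair each chunk with an automatically generated summary, which enables more accurate, context-rich semantic search.

In this approach, each document chunk goes through two processes in parallel:

Embed the chunk: the original text is converted into an embedding and stored in the vector database.
Generate a summary: a concise summary is produced for the chunk, capturing its core idea in compressed form.

Both the original chunk and its summary can be embedded, stored, and linked through metadata. At query time the retriever can search over the summary embeddings as well as the chunk embeddings, which raises the chance of finding relevant context even when the query uses different vocabulary from the original text.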
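
One way to link chunk and summary embeddings through metadata is sketched below. Everything here is an assumption made for illustration: the parent_id and kind fields, the two-dimensional vectors, and the rule that a chunk's score is the better of its own score and its summary's score.

```python
def dot(a, b):
    """Dot product used as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical index: each chunk contributes two entries, one for the
# original text and one for its summary, linked by parent_id.
entries = [
    {"parent_id": "c1", "kind": "chunk", "vector": [0.9, 0.1]},
    {"parent_id": "c1", "kind": "summary", "vector": [0.6, 0.4]},
    {"parent_id": "c2", "kind": "chunk", "vector": [0.1, 0.9]},
    {"parent_id": "c2", "kind": "summary", "vector": [0.5, 0.5]},
]
chunks = {
    "c1": "Full text of the first chunk.",
    "c2": "Full text of the second chunk.",
}

def search_chunks(query_vector, k=2):
    """Score chunk and summary entries, keep the best score per parent
    chunk, and return the full chunk texts in ranked order."""
    best = {}
    for entry in entries:
        score = dot(query_vector, entry["vector"])
        if score > best.get(entry["parent_id"], float("-inf")):
            best[entry["parent_id"]] = score
    ranked = sorted(best, key=best.get, reverse=True)
    return [chunks[parent_id] for parent_id in ranked[:k]]

print(search_chunks([0.0, 1.0], k=2))
```

Returning the full chunk text even when the summary produced the match keeps the generation step supplied with the complete context.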

Recipe 58

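This recipe demonstrates how to write a program for semantic search over auto-summarized chunks:

Choose your file (PDF or TXT).

Load the document. A different loader is used depending on the file type.

Split the document into chunks. This creates smaller pieces of text that are easier to process and store.

Initialize the summarization pipeline. It summarizes each text chunk.

Create a list to hold the summarized documents. Each summarized chunk is saved as a Document object.

Summarize each chunk and create the Document objects.

Load the SentenceTransformer embedding model. In this recipe it embeds the original chunk text, while the summary travels with each chunk as metadata.

Initialize ChromaDB as the vector store. This creates a local ChromaDB instance for storing and retrieving embeddings.

Add the summarized documents to the vector store. This stores the embeddings in ChromaDB.

Query the vector store.

Install the required dependencies:

pip install langchain langchain-community langchain-chroma langchain-huggingface langchain-text-splitters chromadb sentence-transformers pypdf transformers

semantic_search_auto_summarized_chunks.py

Refer to the following code: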

# semantic_search_auto_summarized_chunks.py
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import pipeline
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Choose your file (PDF or TXT)
file_path = "RAG.pdf"  # Change to your file

# 2. Load the document
if file_path.lower().endswith(".pdf"):
    loader = PyPDFLoader(file_path)
elif file_path.lower().endswith(".txt"):
    loader = TextLoader(file_path)
else:
    raise ValueError("Unsupported file type. Use PDF or TXT.")

documents = loader.load()
print(f"Loaded {len(documents)} pages from {file_path}")

# 3. Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")

# 4. Initialize the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# 5. Create a list to hold summarized documents
summarized_docs = []

# 6. Summarize each chunk and create Document objects
for chunk in chunks:
    try:
        summary_text = summarizer(
            chunk.page_content,
            max_length=40,
            min_length=10,
            do_sample=False
        )[0]['summary_text']
    except Exception:
        summary_text = "Summary not available"

    summarized_docs.append(
        Document(
            page_content=chunk.page_content,
            metadata={
                "summary": summary_text,
                "source": chunk.metadata.get("source", "unknown")
            }
        )
    )

print("Summaries added to chunks.")

# 7. Load SentenceTransformer embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 8. Initialize ChromaDB as a vector store
persist_directory = "./chroma_db_summarized"
vectorstore = Chroma.from_documents(
    documents=summarized_docs,
    embedding=embeddings,
    persist_directory=persist_directory
)

# 9. Confirm storage
# Chroma.from_documents above has already embedded and stored the chunks.
print("Stored summarized chunks in ChromaDB.")

# 10. Query the vector store
query = "Explain RAG"
results = vectorstore.similarity_search(query, k=3)

# 11. Display search results
print("\nSearch Results:")
for i, doc in enumerate(results, start=1):
    print(f"{i}. SUMMARY: {doc.metadata.get('summary')}")
    print(f"   CONTENT: {doc.page_content.strip()[:200]}...\n")

Display the search results.

Input:

RAG.pdf is the input file used by this program. Its content is as follows:

Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate

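Output (the three results are identical, most likely because the script was run several times against the same persist directory, so the collection holds duplicate chunks):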

Stored summarized chunks in ChromaDB.

Search Results:
1. SUMMARY: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual
   CONTENT: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large
language models (LLMs) with a retrieval system to enhance the factual accuracy,
contextual relevance, and q...

2. SUMMARY: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual
   CONTENT: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large
language models (LLMs) with a retrieval system to enhance the factual accuracy,
contextual relevance, and q...

3. SUMMARY: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual
   CONTENT: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large
language models (LLMs) with a retrieval system to enhance the factual accuracy,
contextual relevance, and q...

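Speeding up vector store creation with asynchronous indexing

Building a vector store from a large document collection can be both compute-intensive and slow, especially when the embeddings are generated by a model. A sequential pipeline that handles one document at a time (load, embed, write to the vector store) quickly becomes a bottleneck. To address this, we can use asynchronous (async) indexing, which speeds up vector store creation by running tasks concurrently.

Async indexing uses concurrency to handle multiple embedding requests at once. Rather than waiting for one document to finish before moving to the next, the system uses batching or pipelined scheduling so that I/O operations (such as API calls) overlap with CPU-bound preprocessing (such as splitting and cleaning documents). This can cut total indexing time significantly, particularly for large corpora.

The key advantages of async indexing include:

Speed: multiple embedding tasks are generated in parallel, which reduces bottlenecks.
Efficiency: CPU and network resources are used more fully because tasks overlap.
Scalability: large datasets can be processed faster, without indexing time growing linearly.
Fault tolerance: failed tasks can be retried independently without blocking the whole pipeline.

The typical workflow is as follows:

Load documents: collect raw text files, PDFs, or web pages.
Split into chunks: apply a document splitter to prepare text segments.
Send embedding requests asynchronously: issue embedding requests to the model or API concurrently.
Store results: insert vectors and metadata into the vector store incrementally.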
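
The concurrent embedding step can be sketched with asyncio alone. This is a minimal sketch under stated assumptions: fake_embed is a hypothetical coroutine whose sleep stands in for the network latency of a real embedding API, and the concurrency cap of four is an arbitrary illustrative value.

```python
import asyncio

async def fake_embed(chunk: str) -> list[float]:
    """Hypothetical async embedding call; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return [float(len(chunk))]

async def embed_concurrently(chunks, max_concurrency: int = 4):
    """Embed all chunks concurrently while capping in-flight requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: str) -> list[float]:
        # The semaphore keeps at most max_concurrency requests running,
        # which helps stay under API rate limits.
        async with semaphore:
            return await fake_embed(chunk)

    # gather() returns results in the same order as the input chunks.
    return await asyncio.gather(*(guarded(chunk) for chunk in chunks))

chunks = ["alpha", "beta", "gamma", "delta"]
vectors = asyncio.run(embed_concurrently(chunks))
print(vectors)  # [[5.0], [4.0], [5.0], [5.0]]
```

Because the requests overlap while each one waits on I/O, total time approaches that of the slowest request in each group rather than the sum of all requests. A CPU-bound local model would not benefit in the same way.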

Recipe 59

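This recipe demonstrates how to write a program that uses asynchronous indexing to speed up vector store creation:

Define the constants for the script. These include the input file directory, the vector store directory, and the embedding model.

Define a function that splits text into smaller pieces. It takes a string and splits it into chunks of a given size with a given overlap.

Write an asynchronous file-reading function. It offloads the blocking read to a worker thread and returns the content as a string.

Process files asynchronously. This function reads a file and chunks its content.

Write the main function to gather all files and process them asynchronously. It collects all text files, processes them, and stores the final chunks in the vector store.

Print the number of files found and start asynchronous processing.

Flatten the list of lists into a single list of chunks, that is, merge the chunks from every file into one list.

Print the total number of chunks created.

Create the vector store using Chroma and Hugging Face embeddings. This initializes the vector store with the processed text chunks. Note that in this recipe only file reading and chunking run concurrently; embedding happens in a single add_texts call.

Print a success message indicating that the vector store has been created.

Install the required dependencies:

pip install langchain langchain-community langchain-chroma langchain-huggingface chromadb sentence-transformers

async_indexing_for_speed.py

Refer to the following code: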

# async_indexing_for_speed.py
import os
import glob
import asyncio
from typing import List
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Define constants for the script
INPUT_DIR = "documents"                  # Folder containing text files
PERSIST_DIR = "chapter5_chroma_store"   # Chroma DB folder
EMBED_MODEL = "all-MiniLM-L6-v2"        # Fast embedding model
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# 2. Function to chunk text into smaller pieces
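def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if overlap >= size:
        # Without this guard the loop below would never advance.
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + size)
        chunks.append(text[start:end])
        start += size - overlap
    return chunks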

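# 3. Asynchronous file reading function
def _read_text(path: str) -> str:
    """Blocking read; the context manager closes the file handle."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

async def read_file(path: str) -> str:
    """Read a file without blocking the event loop."""
    # asyncio.to_thread runs the blocking read in a worker thread.
    return await asyncio.to_thread(_read_text, path)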

# 4. Asynchronous processing of files
async def process_file(path: str) -> List[str]:
    """Read and chunk file."""
    text = await read_file(path)
    return chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP)

# 5. Main function to gather all files and process them asynchronously
async def main():
    files = glob.glob(os.path.join(INPUT_DIR, "*.txt"))
    if not files:
        print("No supported files found.")
        return

    # 6. Print the number of files found and start processing them
    # asynchronously
    print(f"Found {len(files)} files. Processing asynchronously...")
    results = await asyncio.gather(*(process_file(f) for f in files))

    # 7. Flatten the list of lists into a single list of chunks
    all_chunks = [chunk for chunks in results for chunk in chunks]

    # 8. Print the total number of chunks created
    print(f"Total chunks: {len(all_chunks)}")

    # 9. Create a vector store using Chroma and the HuggingFace embeddings
    embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
    db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embeddings)
    db.add_texts(all_chunks)

    # 10. Print a success message indicating the vector store has been
    # created
    print("Vector store created successfully.")

if __name__ == "__main__":
    asyncio.run(main())

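Print the number of files found and start asynchronous processing.

Print the total number of chunks created.

Print a success message indicating that the vector store was created successfully.

Output: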

Found 1 files. Processing asynchronously...
Total chunks: 2
Vector store created successfully.

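Evaluating vector store retrieval performance

Building a vector store is only the first step; it is just as important to make sure it returns accurate, relevant results efficiently. Without systematic evaluation, it is hard to know whether the retrieval pipeline is supplying high-quality context to downstream tasks such as question answering or summarization.

Evaluating a vector store's retrieval performance involves both quantitative metrics and qualitative checks. The goal is to measure how well the retriever returns documents that match user intent, and how efficiently it does so at scale.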
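
Two common quantitative metrics are hit rate at k and mean reciprocal rank (MRR). The sketch below computes both over hypothetical document IDs; the metric choice and the sample data are illustrative assumptions, not measurements from this chapter's recipes.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose relevant document appears in the top k."""
    hits = sum(1 for docs, rel in zip(retrieved, relevant) if rel in docs[:k])
    return hits / len(retrieved)

def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the relevant document (0 when it is missing)."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)
    return total / len(retrieved)

# Hypothetical results for two queries: ranked document IDs per query,
# plus the one ID considered relevant for each query.
retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = ["d2", "d9"]

print(hit_rate_at_k(retrieved, relevant, k=3))    # 0.5
print(mean_reciprocal_rank(retrieved, relevant))  # 0.25
```

Hit rate only asks whether the relevant document was found, while MRR also rewards placing it near the top, so the two are usually reported together.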

Recipe 60

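This recipe demonstrates how to evaluate a vector store's retrieval performance:

Define constants: the program uses the persist directory created in the previous recipe.

Define the evaluation data: a list of queries and expected answers used to evaluate the vector store's retrieval performance.

Load the vector store: this initializes the Chroma vector store with the specified directory and embedding model.

Evaluate the vector store's retrieval performance: the program iterates over the evaluation data, performs similarity searches, and checks whether the expected answer appears in the retrieved results.

Install the required dependencies:

pip install langchain langchain-community langchain-chroma langchain-huggingface chromadb sentence-transformers

evaluate_vector_store_retrieval_performance.py

Refer to the following code: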

# evaluate_vector_store_retrieval_performance.py
import time
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# 1. define constants
# This is a persistent directory created in last recipe
PERSIST_DIR = "chapter5_chroma_store"
EMBED_MODEL = "all-MiniLM-L6-v2"
K = 3

# 2. Define evaluation data
# This is a list of queries and expected answers to evaluate the
# vector store's retrieval performance
EVAL_DATA = [
    {"query": "What is RAG?", "answer": "RAG"},
    {"query": "What is the purpose of vector stores?", "answer": "vector stores"},
]

# 3. Load the vector store
# This initializes the Chroma vector store with the specified
# directory and embedding model
embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
db = Chroma(persist_directory=PERSIST_DIR, embedding_function=embeddings)

# 4. Evaluate the vector store's retrieval performance
# This iterates through the evaluation data, performs
# similarity searches, and checks if the expected answer
# is present in the retrieved results
correct = 0
latencies = []

for example in EVAL_DATA:
    query, expected = example["query"], example["answer"]

    # Time only the similarity search; perf_counter is a monotonic,
    # high-resolution clock intended for measuring short durations
    start = time.perf_counter()
    results = db.similarity_search(query, k=K)
    latency = time.perf_counter() - start
    latencies.append(latency)

    # A query counts as correct if the expected keyword appears
    # (case-insensitively) anywhere in the retrieved text
    retrieved_texts = " ".join(doc.page_content for doc in results)
    is_correct = expected.lower() in retrieved_texts.lower()
    if is_correct:
        correct += 1

    print(f"\nQuery: {query}")
    print(f"Expected answer keyword: {expected}")
    print(f"Top-{K} Retrieved: {[doc.page_content[:50] + '...' for doc in results]}")
    print(f"Correct? {'Yes' if is_correct else 'No'}")
    print(f"Latency: {latency:.3f}s")

# 5. Print evaluation summary
# This calculates and prints the accuracy of the retrieval
# performance and the average latency
accuracy = correct / len(EVAL_DATA)
avg_latency = sum(latencies) / len(latencies)

print("\n==== Evaluation Summary ====")
print(f"Accuracy@{K}: {accuracy:.2f}")
print(f"Average Latency: {avg_latency:.3f}s")

Print the evaluation summary. The program computes and prints the retrieval accuracy and the average latency.

Output:

Query: What is RAG?
Expected answer keyword: RAG
Top-3 Retrieved: ['Retrieval Augmented Generation (RAG) is an archite...', 'Retrieval Augmented Generation (RAG) is an archite...', 'Retrieval Augmented Generation (RAG) is an archite...']
Correct? Yes
Latency: 0.137s

Query: What is the purpose of vector stores?
Expected answer keyword: vector stores
Top-3 Retrieved: [' mitigates this by augmenting the generation proce...', ' mitigates this by augmenting the generation proce...', ' mitigates this by augmenting the generation proce...']
Correct? No
Latency: 0.017s

==== Evaluation Summary ====
Accuracy@3: 0.50
Average Latency: 0.077s
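
In the sample output, the first query takes noticeably longer than the second (0.137s versus 0.017s), so the average is dominated by a single measurement. One plausible explanation, which is an assumption and not something the recipe verifies, is one-time warm-up work on the first call. The sketch below shows one way to summarize latencies while optionally skipping the first few measurements; `summarize_latencies` and its `warmup` parameter are hypothetical names introduced only for this illustration.

```python
import statistics


def summarize_latencies(latencies, warmup=1):
    """Summarize latencies in seconds, skipping the first `warmup` measurements.

    If skipping would leave nothing to summarize, all measurements are used.
    """
    steady = latencies[warmup:] or latencies
    return {
        "count": len(steady),
        "mean": statistics.mean(steady),
        "median": statistics.median(steady),
        "max": max(steady),
    }


# The two latencies reported in the sample output above
summary = summarize_latencies([0.137, 0.017])
print(f"Mean latency after warm-up: {summary['mean']:.3f}s")  # 0.017s
```

With only two queries, any such summary is rough; a larger evaluation set gives more stable latency figures.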

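Multi-vector retrieval combining dense and sparse methods

No single retrieval strategy is perfect. Dense retrieval, driven by embeddings, excels at capturing semantic meaning and can match a query to a document even when the two are worded differently. Sparse retrieval uses traditional keyword-based techniques; it guarantees exact term matching but tends to perform poorly on paraphrases and synonyms.

To overcome the limitations of either method on its own, modern retrieval systems often use multi-vector retrieval, which combines dense (semantic) vectors and sparse (lexical) vectors in a hybrid pipeline. This dual strategy offers the strengths of both:

Dense retrieval (embeddings): captures contextual meaning, paraphrases, and conceptual similarity.
Sparse retrieval (keywords): retains high precision for exact matches, rare terms, and domain-specific terminology.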
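
One widely used way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (k + rank) over every list it appears in. The following is a minimal sketch under stated assumptions: the document IDs and rankings are hypothetical, the constant k=60 is a conventional default, and this is not necessarily the fusion logic used in the recipe in this section.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document receives 1 / (k + rank) from every list that contains it,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the result is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a dense and a sparse retriever
dense_ranking = ["doc_b", "doc_c", "doc_a"]
sparse_ranking = ["doc_b", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_c', 'doc_d']
```

Because RRF relies only on ranks, it avoids having to reconcile embedding similarities and BM25 scores, which live on different scales.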

Recipe 61

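This recipe demonstrates how to write a multi-vector retrieval program that combines dense and sparse retrieval:

Define the documents to retrieve from. These texts are used for both dense and sparse retrieval.

Create a dense retriever using Hugging Face embeddings.

Create a sparse retriever using BM25.

Create an ensemble retriever that combines the dense and sparse retrievers.

Test the ensemble retriever with a query.

Install the required dependencies: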

pip install langchain langchain-community langchain-chroma langchain-huggingface chromadb sentence-transformers rank-bm25

multivector_retrieval_dense_sparse.py

Refer to the following code:

# multivector_retrieval_dense_sparse.py
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.runnables import RunnableParallel
from langchain_core.documents import Document
from operator import itemgetter
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# -----------------------------
# 1. Define documents
# -----------------------------
docs = [
    Document(page_content="Artificial intelligence is the simulation of human intelligence by machines."),
    Document(page_content="Deep learning is a subset of machine learning using neural networks."),
    Document(page_content="Machine learning enables computers to learn from data without explicit programming."),
    Document(page_content="Reinforcement learning is based on rewards and punishments."),
]

# -----------------------------
# 2. Dense Retriever (Chroma + HF Embeddings)
# -----------------------------
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
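# Note: because persist_directory is set, re-running this script adds the
# same documents to the existing collection again, which can lead to
# duplicate entries in the dense results
dense_store = Chroma.from_documents(
    docs,
    embedding=embeddings,
    persist_directory="./chroma_multi"
)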
dense_retriever = dense_store.as_retriever(search_kwargs={"k": 2})

# -----------------------------
# 3. Sparse Retriever (BM25)
# -----------------------------
sparse_retriever = BM25Retriever.from_documents(docs)
sparse_retriever.k = 2

# -----------------------------
# 4. Ensemble Retriever using LCEL
# -----------------------------
parallel_retriever = RunnableParallel(
    dense=itemgetter("query") | dense_retriever,
    sparse=itemgetter("query") | sparse_retriever
)

def combine_results(results, weights=(0.5, 0.5)):
    dense_docs = results["dense"]
    sparse_docs = results["sparse"]

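    # Tag each document with the weight of the retriever that returned it.
    # This is a constant per retriever, not a per-document relevance score.
    for d in dense_docs:
        d.metadata["score"] = weights[0]
    for s in sparse_docs:
        s.metadata["score"] = weights[1]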

    merged = dense_docs + sparse_docs

    # Deduplicate using document content
    unique = {}
    for doc in merged:
        key = doc.page_content.strip()
        if key not in unique:
            unique[key] = doc
        else:
            # Keep document with higher score
            if doc.metadata["score"] > unique[key].metadata["score"]:
                unique[key] = doc

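    # Convert back to a list and sort by score. The sort is stable, so with
    # equal weights the original order (dense results first) is preserved.
    return sorted(unique.values(), key=lambda x: x.metadata["score"], reverse=True)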

# -----------------------------
# 5. Query
# -----------------------------
query = "What are neural networks?"
results = parallel_retriever.invoke({"query": query})
final_docs = combine_results(results)

# -----------------------------
# 6. Print Results
# -----------------------------
print(f"Query: {query}\n")
for i, doc in enumerate(final_docs, start=1):
    print(f"Result {i}: {doc.page_content}")

Print the query results.

Output:

Query: What are neural networks?

Result 1: Deep learning is a subset of machine learning using neural networks.
Result 2: Reinforcement learning is based on rewards and punishments.
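
The recipe's `combine_results` function tags every document with its retriever's weight, so the final order does not reflect how relevant each individual document is. As a hedged alternative, the sketch below combines per-document scores after min-max normalization. It assumes each retriever can supply a score per document where higher means more relevant; the function names, document IDs, and score values are hypothetical and not taken from the recipe.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every document as equally relevant
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(dense_scores, sparse_scores, weights=(0.5, 0.5)):
    """Combine normalized dense and sparse scores with the given weights."""
    dense_norm = min_max_normalize(dense_scores)
    sparse_norm = min_max_normalize(sparse_scores)
    fused = {}
    for doc_id in set(dense_norm) | set(sparse_norm):
        # A document missing from one retriever contributes 0 for that side
        fused[doc_id] = (
            weights[0] * dense_norm.get(doc_id, 0.0)
            + weights[1] * sparse_norm.get(doc_id, 0.0)
        )
    # Highest fused score first; ties are broken by document ID
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores (higher is better) from each retriever
dense_scores = {"doc_a": 0.82, "doc_b": 0.40}
sparse_scores = {"doc_b": 7.1, "doc_c": 2.3}

for doc_id, score in weighted_fusion(dense_scores, sparse_scores, weights=(0.7, 0.3)):
    print(f"{doc_id}: {score:.2f}")
```

Normalization matters here because embedding similarities and BM25 scores are on different scales; without it, one retriever would dominate regardless of the chosen weights.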

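Conclusion

In this chapter, we explored how vector stores form the foundation of semantic retrieval. By converting documents into embeddings and storing them in a specialized database, a system can retrieve information based on meaning rather than keywords. We also examined several strategies, including persistence, metadata filtering, hybrid search, batch indexing, and multi-vector retrieval, that turn a vector store from a simple repository into a powerful, production-grade knowledge system.

The central takeaway of this chapter is that vector stores are not merely a storage layer; they are intelligent retrieval engines that bridge raw document collections and user intent. By combining chunking, embeddings, filtering, and evaluation appropriately, a vector store can deliver results that are both accurate and context-aware, making it the backbone of any RAG pipeline. However, building the vector store is only half the journey. As datasets grow and queries become more complex, retrieval must remain fast, scalable, and efficient.

The next chapter, "Efficient retrieval from vector stores", focuses on performance optimization techniques, including index structures, caching, and query optimization. Only by mastering both semantic accuracy and retrieval efficiency can you design RAG systems that are both intelligent and able to handle real-world scale.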