Introduction
This chapter introduces the core principles of Retrieval-Augmented Generation (RAG). RAG enhances the capabilities of large language models (LLMs) by giving them access to external knowledge sources.
The chapter covers the key components of RAG, including document loading, splitting, embedding, retrieval, and generation.
Through practical explanations and hands-on examples, you will build a foundational understanding of how to implement a simple RAG pipeline using modern Python tools such as LangChain and vector stores such as FAISS or Chroma. The chapter lays the groundwork for the deeper treatment of the individual RAG components and their more complex implementations later in the book.
Structure
This chapter covers the following topics:
- Software requirements
- Loading documents
- Splitting documents
- Embedding
- Storing
- Retrieval
- Generation
Objectives
By the end of this chapter, readers will have a solid understanding of the core concepts behind RAG and will be able to put them into practice by writing programs that implement these ideas.
The chapter first explores the key components of a RAG pipeline, including embedding models, vector stores, document chunking, query construction, and contextual grounding during generation. Each component plays a crucial role in ensuring that the pipeline can retrieve and use relevant information effectively.
Readers will also deepen their understanding of the central idea of RAG: by combining a retrieval system with generative AI, it bridges the gap between traditional language models and external knowledge sources. This combination enables AI models to produce output that is both contextually relevant and knowledge-rich.
The chapter also highlights the distinction between retrieval and generation. It explains how the retrieval component identifies and fetches relevant contextual information from a document store or vector database, and how the generative model then uses that information to produce accurate, grounded answers.
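The retrieval-then-generation flow described above can be sketched in a few lines of plain Python. This is a toy illustration only: retrieval here is naive word overlap and the generator is a stub, whereas the recipes in this chapter use vector search and a real LLM.

```python
# Toy document store; in a real RAG pipeline these would be embedded
# chunks stored in a vector database
docs = [
    "RAG combines retrieval with generation.",
    "Embeddings map text to numeric vectors.",
    "FAISS supports fast similarity search.",
]

def retrieve(query, k=1):
    # Toy retrieval: rank documents by how many words they share with
    # the query (real retrieval compares embedding vectors instead)
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(words & set(d.lower().split())))[:k]

def generate(query, context):
    # Stub for the generation step: a real system would prompt an LLM
    # with the retrieved context
    return f"Answer to '{query}', grounded in: {context[0]}"

context = retrieve("how does RAG use retrieval")
print(generate("how does RAG use retrieval", context))
```

Even in this toy form, the division of labor is visible: retrieval selects the context, and generation answers the query using that context.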
Software requirements
Every concept in this book is followed by a recipe, that is, runnable code written in Python. Throughout the recipes you will find code comments explaining what each line of code does.
The following software environment is required to run the recipes:
- System configuration: a system with at least 16.0 GB of RAM
- Operating system: Windows
- Python: Python 3.13.3 or later
- LangChain: 1.0.5
- LLM model: llama3.2:3b from Ollama
- Program input files: the input files used in the programs are available in the book's Git repository
To run a program, execute the Python command pip install <package names> to install the packages mentioned in the recipe. Once installation is complete, run the Python script (.py file) mentioned in the recipe in your development environment.
Figure 1.1 shows the document loading process of RAG:
Figure 1.1: Document loading process of RAG
Loading documents
The first step in any RAG pipeline is to prepare a knowledge source for the system. In this section, you will learn how to load external documents such as PDF, Word, and text files into a format suitable for retrieval and augmentation.
You will be able to understand the following key RAG concepts:
- The different document loaders available in frameworks such as LangChain
- How to extract clean, structured content from unstructured files
- How to work with common file formats such as PDF, DOCX, and TXT
Recipe 1
This recipe illustrates how to load PDF, text, and DOCX files using LangChain's document loaders:
Load the PDF file and store it in all_docs_list, where all_docs_list is a list of Document objects. PyPDFLoader is used to load PDF files.
Extend all_docs_list with the loaded PDF documents. This adds the PDF content of the file to all_docs_list as Document objects. Next, load the TXT file and store it in all_docs_list. TextLoader is used to load text files. Extend all_docs_list with the loaded text content, which adds the content of the text file to the list as Document objects.
Then load the DOCX file and store it in all_docs_list. UnstructuredWordDocumentLoader is used to load DOCX files. Extend all_docs_list with the loaded DOCX documents, which adds the content of the DOCX file to all_docs_list as Document objects.
Finally, print the total number of documents loaded and the content of the loaded documents.
Install the required dependencies:
pip install langchain langchain-community pypdf python-docx unstructured
load_document.py
Refer to the following code:
# This code is part of the LangChain framework and is used to load
# documents from various formats.
# It demonstrates how to load PDF, TXT, and DOCX files into a
# list of document objects.
from langchain_core.documents import Document
from langchain_community.document_loaders import PyPDFLoader, TextLoader, UnstructuredWordDocumentLoader
# Initialize list to store all documents which will be loaded from
# different formats
all_docs_list = []
# 1. Load PDF file and store it in all_docs_list where all_docs_list
# is a list of document objects
# PyPDFLoader is used to load PDF files
pdf_loader = PyPDFLoader("RAG.pdf")
# pdf_doc is of type list of document where each document has
# page_content and metadata attributes
# If the PDF has multiple pages, it will return a list of document
# objects, where each document corresponds to a page in the PDF.
pdf_doc = pdf_loader.load()
# Extend the all_docs_list with the loaded PDF documents
# This will add the PDF content of the file as a document object to
# the all_docs_list
all_docs_list.extend(pdf_doc)
# 2. Load TXT file and store it in all_docs
# TextLoader is used to load text files
txt_loader = TextLoader("RAG.txt")
# TextLoader reads the text file and splits it into paragraphs
# Each paragraph will be treated as a separate document object.
# txt_doc is of type list of document where each document has
# page_content and metadata attributes
# If the text file has multiple paragraphs, it will return a list of
# document objects, where each document corresponds to a paragraph
# in the text file.
txt_doc = txt_loader.load()
# Extend all_docs_list with the loaded text documents
# This will add the text content of the file as a document object
# to all_docs_list
all_docs_list.extend(txt_doc)
# 3. Load DOCX file and store it in all_docs
# UnstructuredWordDocumentLoader is used to load DOCX files
# It is useful for loading word documents that may contain complex
# formatting.
docx_loader = UnstructuredWordDocumentLoader("RAG.docx")
# docx_doc is of type list of document where each document has page
# content and metadata attributes
# If the DOCX file has multiple sections, it will return a list of
# document objects
docx_doc = docx_loader.load()
# Extend the all_docs_list with the loaded DOCX documents
# This will add the DOCX content of the file as a document object
# to the all_docs_list
all_docs_list.extend(docx_doc)
# Print the total number of documents loaded
print(f"\nTotal documents loaded: {len(all_docs_list)}\n")
# Print the content of the loaded three types of documents
# This will print the first 275 characters of each document's
# content
for i, doc in enumerate(all_docs_list):
    print(f"--- Document {i+1} ---")
    print(doc.page_content[:275])
    print("\n")
Output:
Total documents loaded: 3
--- Document 1 ---
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large
language models (LLMs) with a retrieval system to enhance the factual accuracy,
contextual relevance, and quality of generated response against the query raised by user to
a RAG sys
--- Document 2 ---
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system
--- Document 3 ---
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system
Splitting documents
As described in the previous section, once documents have been loaded, the next key step in the RAG pipeline is to split them into manageable chunks. For the LLM to produce contextually grounded and efficient output, it is important to feed the model relevant, concise, focused input. Splitting techniques ensure that each piece of content stays within an appropriate context window.
The following recipes use different splitting techniques to illustrate how document splitting is applied in a RAG pipeline.
Recipe 2
This recipe illustrates how to split a document into chunks using RecursiveCharacterTextSplitter:
Load the DOCX document. The loader reads the DOCX file and returns a list of Document objects. Each Document object contains the text content and metadata of one page or section.
Define the chunk size, chunk overlap, and separators in characters, then split the document into smaller chunks.
Display the number of chunks created and their character counts. For each chunk, show its first 300 characters.
Install the required dependencies:
pip install langchain langchain-community langchain-text-splitters unstructured python-docx
split_document.py
Refer to the following code:
# Load and split a DOCX document into smaller chunks
# This example uses the UnstructuredWordDocumentLoader to load a DOC
# file and the RecursiveCharacterTextSplitter to split it into
# manageable chunks
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# 1. Load the DOCX document
docx_loader = UnstructuredWordDocumentLoader("RAG.docx")
# The loader reads the DOCX file and returns a list of document
# objects.
# Each document object contains the text content and metadata of a
# page or section.
documents = docx_loader.load()
# 2. Initialize the splitter with chunk size, overlap, and separators
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,  # max characters (≈ 120-150 tokens) per chunk
    chunk_overlap=50,  # characters of overlap to preserve context
    separators=["\n\n", "\n", ".", " ", ""],  # split on these separators
)
# 3. Split the document into smaller chunks
chunks = splitter.split_documents(documents)
# 4. Display the number of chunks created and preview
print(f"\nTotal chunks created: {len(chunks)}\n")
for i, chunk in enumerate(chunks):  # iterate through each chunk
    # Print the chunk number and its character count
    print(f"--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content.strip()[:300])  # preview first 300 chars
    print()
Output:
Total chunks created: 4
--- Chunk 1 (276 chars) ---
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
--- Chunk 2 (282 chars) ---
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
--- Chunk 3 (166 chars) ---
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data
Recipe 3
This recipe uses token-based splitting to optimize the LLM's context usage:
Initialize the TokenTextSplitter. It splits text into chunks based on token count.
Split the text into chunks. The split_text method returns a list of text chunks.
Display the number of chunks created and preview each chunk. Print the total number of chunks and the content of each one.
Install the required dependencies:
pip install langchain langchain-text-splitters tiktoken
token_doc_splitting.py
Refer to the following code:
# This script demonstrates how to split a long text into smaller chunks
# using TokenTextSplitter from Langchain
from langchain_text_splitters import TokenTextSplitter
# Example text to be split
text = """
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
"""
# 1. Initialize the TokenTextSplitter
# This will split the text into chunks based on token count
splitter = TokenTextSplitter(
    chunk_size=30,  # max tokens per chunk
    chunk_overlap=10  # token overlap between chunks
)
# 2. Split the text into chunks
# The split_text method will return a list of text chunks
chunks = splitter.split_text(text)
# 3. Display the number of chunks created and preview the chunks
# This will print the total number of chunks and the content of each
# chunk
print(f"Total Chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
    print()
Output:
Total Chunks: 3
--- Chunk 1 ---
Retrieval Augmented Generation (RAG) is an architecture that combines the ability
of large language models (LLMs) with a retrieval system
--- Chunk 2 ---
language models (LLMs) with a retrieval system to enhance the factual
accuracy, contextual relevance, and quality of generated response against the
query
--- Chunk 3 ---
, and quality of generated response against the
query raised by user to a RAG system.
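The overlap between consecutive chunks in the output above (Chunk 2 repeats the tail of Chunk 1) comes from the sliding-window behavior of token-based splitting. The following toy sketch imitates that behavior using whitespace-separated words as stand-in "tokens"; a real TokenTextSplitter uses a tokenizer such as tiktoken, so its boundaries differ.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    # Slide a window of chunk_size tokens, stepping by
    # chunk_size - chunk_overlap so consecutive chunks share tokens
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens), step)
            if tokens[i:i + chunk_size]]

tokens = "one two three four five six seven eight".split()
chunks = split_tokens(tokens, chunk_size=4, chunk_overlap=2)
for c in chunks:
    print(c)
# The last two tokens of each chunk reappear at the start of the next
```

The overlap preserves context across chunk boundaries, at the cost of some redundancy in the index.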
Recipe 4
This recipe demonstrates how to split a document into smaller chunks using LangChain's text splitter:
Create a .txt file and place it in a local directory. Load the document using TextLoader, which loads the text file as a list of Document objects.
Initialize the text splitter. It splits the text into chunks according to the specified parameters.
Split the document into chunks. These chunks are smaller text segments that are easier to process and retrieve later.
Install the required dependencies:
pip install langchain langchain-community langchain-text-splitters
text_splitter.py
Refer to the following code:
# It demonstrates how to split a document into smaller chunks using
# LangChain's text splitter.
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
# 1. Load the document
# Replace 'RAG.txt' with your actual text file path.
# Ensure the file exists in the specified path.
loader = TextLoader("RAG.txt")
documents = loader.load()
# 2. Initialize the text splitter
# This will split the text into chunks based on specified parameters
text_splitter = CharacterTextSplitter(
    separator="\n",  # Splits at newline characters
    chunk_size=300,  # Max characters per chunk
    chunk_overlap=50,  # Overlap to preserve context
    length_function=len  # Optional, default is len()
)
# 3. Split the documents into chunks
# This will create smaller text segments for better processing and
# retrieval
split_docs = text_splitter.split_documents(documents)
# 4. Output the results
# This will print each chunk to the console.
for i, doc in enumerate(split_docs):
    print(f"\n== Chunk {i+1} ==")
    print(doc.page_content)
Output. The program prints each chunk to the console:
== Chunk 1 ==
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
== Chunk 2 ==
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
== Chunk 3 ==
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.
Recipe 5
This recipe demonstrates how to use LangChain's NLTKTextSplitter to split a text document along natural sentence boundaries using NLTK's tokenizer:
Load the document. Make sure the file exists in the specified path.
Initialize the NLTKTextSplitter. It splits the text into chunks according to the specified parameters.
Split the document into chunks. This produces smaller text segments for better processing and retrieval.
Install the required dependencies:
pip install langchain langchain-community langchain-text-splitters nltk
nltktext_splitter.py
Refer to the following code:
# This code demonstrates how to split a text document into smaller
# chunks using LangChain's NLTKTextSplitter.
# It uses the NLTK library for tokenization and is suitable for
# processing large text files
import nltk
from langchain_text_splitters import NLTKTextSplitter
from langchain_community.document_loaders import TextLoader
# Ensure you have the NLTK tokenizer downloaded
nltk.download("punkt")
nltk.download("punkt_tab")  # required by newer NLTK releases
# Step 1: Load the document
# Replace 'RAG.txt' with your actual text file path.
# Ensure the file exists in the specified path.
loader = TextLoader("RAG.txt") # Replace with your actual text file
documents = loader.load()
# Step 2: Initialize the NLTKTextSplitter
# This will split the text into chunks based on specified parameters
text_splitter = NLTKTextSplitter(
    chunk_size=300,  # Max characters per chunk
    chunk_overlap=50  # Overlap between chunks
)
# Step 3: Split the documents into chunks
# This will create smaller text segments for better processing and
# retrieval.
split_docs = text_splitter.split_documents(documents)
# Step 4: Output the results
# This will print each chunk to the console.
for i, chunk in enumerate(split_docs):
    print(f"\n=== Chunk {i + 1} ===")
    print(chunk.page_content)
Output. The program prints each chunk to the console:
=== Chunk 1 ===
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
=== Chunk 2 ===
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge.
RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
=== Chunk 3 ===
Traditional generative models laid the foundation for today’s LLMs.
They helped us understand how to model processes represent knowledge, user input and generate data.
=== Chunk 4 ===
However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.
Embedding
After the documents have been split into chunks, the next step is to convert those chunks into numerical representations. The process of converting text into high-dimensional vectors that capture its semantic meaning is called embedding.
Recipe 6
This recipe shows how to generate embeddings for a set of text inputs using the HuggingFaceEmbeddings model in LangChain:
Load the embedding model.
Prepare the text inputs. Replace the sample texts with your own text or data if needed.
Generate the embeddings. This converts the list of texts into their corresponding vector representations.
Install the required dependencies:
pip install langchain langchain-huggingface sentence-transformers
create_embeddings.py
Refer to the following code:
# This code generates embeddings for a list of text inputs using the
# HuggingFaceEmbeddings model from LangChain.
from langchain_huggingface import HuggingFaceEmbeddings
# 1. Load the embedding model (any from sentence-transformers)
# You can choose a different model if needed, but this one is efficient
# for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 2. Your text input
# Replace these with your own texts or data if needed.
texts = [
    "Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system.",
    "Traditional generative models rely solely on internal parameters for producing responses.",
    "RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources."
]
# 3. Generate embeddings
# This will convert the list of texts into their corresponding vector
# embeddings.
embeddings = embedding_model.embed_documents(texts)
# 4. Output the embeddings
# Print the number of embeddings and their dimensions for verification.
print(f"Generated {len(embeddings)} embeddings.")
print(f"Each vector has {len(embeddings[0])} dimensions.\n")
# Display the first 5 dimensions of each embedding
for i, vector in enumerate(embeddings):
    print(f"--- Embedding {i+1} (first 5 dims) ---")
    print(vector[:5])  # Show only first 5 dimensions for brevity
    print()
Output the embeddings. Print the number of embeddings and their dimensions for verification, and display the first 5 dimensions of each embedding vector.
Output:
Generated 3 embeddings.
Each vector has 384 dimensions.
--- Embedding 1 (first 5 dims) ---
[-0.08739510923624039, -0.03305370360612869, -0.012390639632940292, 0.02596159651875496, -0.05413474142551422]
--- Embedding 2 (first 5 dims) ---
[-0.040841661393642426, -0.03903619945049286, 0.07830005139112473, 0.06659548729658127, -0.00704152463003993]
--- Embedding 3 (first 5 dims) ---
[-0.11565263569355011, -0.013607499189674854, -0.03747229650616646, 0.03100057691335678, 0.022551823407411575]
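These vectors are useful because geometric closeness tracks semantic similarity, typically measured with cosine similarity. The following is a minimal sketch using made-up 3-dimensional vectors (real models such as all-MiniLM-L6-v2 produce 384 dimensions, as the output above shows):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, 0.3, 0.5]   # toy "embedding" vectors, not real model output
v2 = [0.1, 0.3, 0.5]
v3 = [-0.5, 0.2, -0.1]

print(cosine_similarity(v1, v2))  # identical vectors score (about) 1.0
print(cosine_similarity(v1, v3))  # dissimilar vectors score lower
```

Vector stores such as FAISS and Chroma apply this kind of comparison, at scale, to find the stored chunks closest to a query vector.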
Recipe 7
This recipe shows how to create a FAISS index from text embeddings using LangChain's HuggingFaceEmbeddings. It enables efficient similarity search over the indexed texts:
Prepare the sample texts to embed.
Load the embedding model. You can choose a different model, but the sentence-transformers models are efficient for many tasks.
Create a FAISS index from the embeddings. This converts the list of texts into their corresponding vector representations and builds an index.
Perform a similarity search.
Install the required dependencies:
pip install langchain langchain-community langchain-huggingface sentence-transformers faiss-cpu
index_embeddings_using_faiss.py
Refer to the following code:
# This code creates a FAISS index from text embeddings using LangChain's HuggingFaceEmbeddings.
# It allows for efficient similarity search on the indexed texts.
# Ensure you have the required libraries installed:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# 1. Sample texts to embed
texts = [
    "Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy.",
    "Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge.",
    "RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.",
    "Traditional generative models are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability."
]
# 2. Load the embedding model
# You can choose a different model if needed, but this one is efficient
# for many tasks.
# This model is from the sentence-transformers library, which is
# commonly used for generating text embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 3. Create a FAISS index from the embeddings
# This will convert the list of texts into their corresponding vector
# embeddings and create an index.
vector_store = FAISS.from_texts(texts, embedding=embedding_model)
# 4. Perform a similarity search
query = "benefit of using RAG?"
results = vector_store.similarity_search(query, k=3) # top 3 matches
# 5. Output the results
# Print the query and the top results from the FAISS index.
print(f"\n Query: {query}\n")
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content}")
Output. The program prints the query and the top results from the FAISS index:
Query: benefit of using RAG?
Result 1: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy.
Result 2: RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Result 3: Traditional generative models are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.
Storing
Once the document chunks have been embedded as vectors, the next step in the pipeline is to store them in a vector store. This is a key step for retrieval in RAG, because the vector store organizes and manages the vectorized data to enable fast, similarity-based access.
The following recipes show how vector stores support fast semantic retrieval over embeddings.
Recipe 8
This recipe shows how to create a Chroma vector store using LangChain's HuggingFaceEmbeddings. It likewise enables efficient similarity search over the indexed texts:
Prepare the sample texts to embed. Replace them with your own texts if needed.
Load the embedding model. You can choose a different model, but this one is efficient for many tasks.
Create a Chroma vector store from the embeddings. This converts the list of texts into their corresponding vector representations and builds an index.
Persist the vector store. With persist_directory set, Chroma saves the store to disk.
Perform a similarity search. This finds the top-k texts most similar to the query based on their embeddings.
Install the required dependencies:
pip install langchain langchain-chroma langchain-huggingface chromadb sentence-transformers
chroma_vector_store.py
Refer to the following code:
# This code creates a Chroma vector store from text embeddings using
# LangChain's HuggingFaceEmbeddings.
# It allows for efficient similarity search on the indexed texts.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
# 1. Sample texts to embed
# Replace these with your own texts if needed.
texts = [
    "Chroma is a popular vector database used to store embeddings (vectors).",
    "We are using sentence transformers for generating embeddings.",
    "RAG is a popular framework to make Agentic AI applications.",
    "LangChain is a framework for building applications using LLM.",
]
# 2. Load the embedding model
# You can choose a different model if needed, but this one is efficient
# for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 3. Create a Chroma vector store from the embeddings
# This will convert the list of texts into their corresponding vector
# embeddings and create an index.
vector_store = Chroma.from_texts(
    texts,
    embedding=embedding_model,
    persist_directory="chroma_vector_store"
)
# 4. Persist the vector store
# With persist_directory set, langchain_chroma persists the store to
# disk automatically, so no explicit save or re-add call is needed.
# 5. Perform a similarity search
# This will find the top-k most similar texts to the query based on
# their embeddings.
query = "Which database is used to store embeddings?"
results = vector_store.similarity_search(query, k=1)
# 6. Output the results
# Print the query and the top results from the Chroma vector store
print(f"\n Query: {query}\n")
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content}")
Output. The program prints the query and the top match from the Chroma vector store:
Query: Which database is used to store embeddings?
Result 1: Chroma is a popular vector database used to store embeddings (vectors).
Recipe 9
This recipe shows how to create a FAISS vector store using LangChain and Hugging Face embeddings:
Load the document. Replace 'RAG.txt' with the path of your actual text file.
Split the document into smaller chunks. This is important for efficient processing and retrieval.
Create embeddings for the document chunks to enable semantic understanding.
Create a FAISS vector store from the document chunks and embeddings to enable efficient similarity search.
Perform a similarity search. This finds the chunks most relevant to the query.
Install the required dependencies:
pip install langchain langchain-community langchain-text-splitters langchain-huggingface sentence-transformers faiss-cpu
store_faiss_langchain.py
Refer to the following code:
# This code demonstrates how to create a FAISS vector store using
# LangChain and HuggingFace embeddings.
# It loads a text document, splits it into chunks, creates
# embeddings, and stores them
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# 1. Load the document
# Replace 'RAG.txt' with your actual text file path.
loader = TextLoader("RAG.txt") # Replace with your file
documents = loader.load()
# 2. Split the documents into smaller chunks
# This is important for efficient processing and retrieval.
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
docs = splitter.split_documents(documents)
# 3. Create embeddings for the document chunks
# Using HuggingFace embeddings for semantic understanding.
embeddings = HuggingFaceEmbeddings()
# 4. Create a FAISS vector store from the document chunks and
# embeddings
# This allows for efficient similarity search.
faiss_index = FAISS.from_documents(docs, embeddings)
# 5. Perform a similarity search
# This will find the most relevant chunks based on a query.
query = "What is RAG?"
results = faiss_index.similarity_search(query, k=3)
# 6. Output the results
# This will print the top matching chunks to the console.
print("\n Top Matches:")
for i, doc in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(doc.page_content)
Output. The program prints the top matching chunks to the console:
Top Matches:
Result 1:
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Result 2:
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Result 3:
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.
Retrieval
Retrieval is the process of fetching the document chunks that are semantically most relevant to a user query. This section focuses on how vector similarity search achieves this using the embeddings stored in a vector database.
Recipe 10
This recipe shows how to retrieve text chunks from a Chroma vector store using LangChain's Hugging Face embedding model:
Load the embedding model.
Initialize the Chroma vector store. Make sure the persist directory matches the one used when the vector store was saved.
Define the query. This is the question you want to answer using the retrieved context.
Perform a similarity search. This finds the text chunks most relevant to the given query.
Install the required dependencies:
pip install langchain langchain-chroma langchain-huggingface chromadb sentence-transformers
retrieve_from_vector_store.py
Refer to the following code:
# Retrieve from Chroma Vector Store
# This code retrieves text chunks from a Chroma vector store using
# LangChain's HuggingFace
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
# 1. Load the embedding model
# You can choose a different model if needed, but this one is
# efficient for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 2. Initialize the Chroma vector store
# Ensure the persist_directory matches where you saved your
# vector store
vector_store = Chroma(
    persist_directory="chroma_vector_store",
    embedding_function=embedding_model
)
# 3. Define your query
# This is the question you want to answer using the retrieved
# context
query = "What is RAG?"
# 4. Perform a similarity search
# This will find the most relevant text chunks for the given query.
results = vector_store.similarity_search(query, k=1)
# 5. Output the results
# This will print the content of the retrieved documents.
print(f"\nQuery: {query}\n")
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content}\n")
Output. The program prints the content of the retrieved documents:
Query: What is RAG?
Result 1: RAG is a popular framework to make Agentic AI applications.
Recipe 11
This recipe demonstrates how to perform a similarity search with scores using LangChain and FAISS:
Load the document. Replace 'RAG.txt' with the path of your actual text file and make sure the file exists.
Split the document into smaller chunks, which is important for efficient processing and retrieval.
Create embeddings for the document chunks to enable semantic understanding.
Create a FAISS vector store from the document chunks and embeddings to enable efficient similarity search.
Perform a similarity search with scores. This finds the chunks most relevant to the query and returns their similarity scores.
Install the required dependencies:
pip install langchain langchain-community langchain-text-splitters langchain-huggingface sentence-transformers faiss-cpu
retrieve_similarity_search_with_score.py
Refer to the following code:
# This code demonstrates how to perform a similarity search with scores
# using LangChain and FAISS.
# It loads a text document, splits it into chunks, creates embeddings,
# and performs a similarity
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# 1. Load the document
# Replace 'RAG.txt' with your actual text file path.
# Ensure the file exists in the specified path.
loader = TextLoader("RAG.txt") # Replace with your file path
documents = loader.load()
# 2. Split the documents into smaller chunks
# This is important for efficient processing and retrieval.
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
docs = splitter.split_documents(documents)
# 3. Create embeddings for the document chunks
# Using HuggingFace embeddings for semantic understanding.
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"  # Lightweight, good-quality model
)
# 4. Create a FAISS vector store from the document chunks and embeddings
# This allows for efficient similarity search.
faiss_index = FAISS.from_documents(docs, embedding_model)
# 5. Perform a similarity search with scores
# This will find the most relevant chunks based on a query and return
# their similarity scores.
query = "What is RAG?"
results_with_score = faiss_index.similarity_search_with_score(query, k=3)
# 6. Output the results with scores
# This will print the top matching chunks and their similarity scores to
# the console.
print("\nTop Matches with Similarity Scores:")
for i, (doc, score) in enumerate(results_with_score, 1):
    print(f"\nResult {i}:")
    print(f"Score: {score:.4f}")
    print("Content:")
    print(doc.page_content)
Output. The program prints the top matching chunks and their similarity scores to the console:
Top Matches with Similarity Scores:
Result 1:
Score: 0.9408
Content:
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Result 2:
Score: 1.3505
Content:
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Result 3:
Score: 2.0506
Content:
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.
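Note that the scores above increase as relevance decreases: with FAISS's default flat index, the score behaves like a Euclidean (L2) distance, so lower means more similar. The ranking can be sketched with made-up 2-dimensional vectors (not real embeddings; the exact distance variant FAISS reports, squared or unsquared, may differ):

```python
def squared_l2(a, b):
    # Sum of squared coordinate differences; smaller means closer
    return sum((x - y) ** 2 for x, y in zip(a, b))

query_vec = [0.2, 0.1]
doc_vecs = {
    "close chunk": [0.2, 0.2],
    "far chunk": [0.9, -0.5],
}
# Rank documents by ascending distance to the query vector
ranked = sorted(doc_vecs, key=lambda name: squared_l2(query_vec, doc_vecs[name]))
for name in ranked:
    print(name, squared_l2(query_vec, doc_vecs[name]))
```

This is why the best match in the output above carries the smallest score, unlike cosine similarity, where higher values mean closer matches.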
Recipe 12
This recipe shows how to retrieve relevant text chunks from a Chroma vector store and use an LLM to answer a query based on them. It combines retrieval and generation capabilities to provide context-aware answers:
Download llama3.2:3b from https://ollama.com/.
Start the llama3.2:3b model in Ollama.
If you have installed Ollama and are ready to use llama3.2:3b, run the following command on the command line:
ollama run llama3.2:3b
Load the embedding model. You can choose a different model if needed, but this one is efficient for many tasks.
Load the persisted Chroma vector store. Make sure the directory path matches the one used when the vector store was saved.
Define the query. This is the question you want to answer using the retrieved context.
Retrieve the relevant documents (the top 3 chunks). This finds the text chunks most relevant to the query.
Create a prompt template. This template formats the context and question for the LLM.
Load the LLM. You can choose a different model if needed, but this one is efficient for many tasks.
Create the RAG chain. This chain combines the prompt and the LLM to generate an answer based on the context.
Run the chain with the context and query. This generates an answer based on the retrieved context.
Install the required dependencies:
pip install langchain langchain-chroma langchain-huggingface langchain-ollama chromadb sentence-transformers
basic_rag_single_prompt.py
Refer to the following code:
# Retrieve and Answer with RAG using LangChain and Ollama
# This code retrieves relevant text chunks from a Chroma vector
# store and uses them to answer a query using LLM.
# It combines retrieval and generation capabilities to provide
# context-aware answers.
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
# 1. Load the embedding model
# You can choose a different model if needed, but this one is
# efficient for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 2. Load the persisted Chroma vector store
# Ensure the directory matches where you saved your vector store
vector_store = Chroma(
    persist_directory="chroma_vector_store",
    embedding_function=embedding_model
)
# 3. Define your query
# This is the question you want to answer using the retrieved
# context
query = "What is LangChain used for?"
# 4. Retrieve relevant documents (top 3 chunks)
# This will find the most relevant text chunks for the
# given query.
# Adjust 'k' to retrieve more or fewer documents as needed.
retrieved_docs = vector_store.similarity_search(query, k=3)
context = "\n\n".join([doc.page_content for doc in retrieved_docs])
# 5. Create a prompt template
# This template will format the context and question for the LLM.
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an expert assistant. Use the context below to answer the
question.
Context:
{context}
Question:
{question}
Answer:
"""
)
# 6. Load the LLM model
# You can choose a different model if needed, but this one is
# efficient for many tasks.
llm = ChatOllama(model="llama3.2:3b", temperature=0.3) # type: ignore
# 7. Create Runnable pipeline
parser = StrOutputParser()
rag_chain = prompt | llm | parser
# 8. Invoke chain
answer = rag_chain.invoke({"context": context, "question": query})
# 9. Print the Query and the answer generated by the LLM based on
# the retrieved context.
print("Query:", query)
print("Answer:", answer)
打印查询语句,以及 LLM 基于检索上下文生成的答案。
输出:
Query: What is LangChain used for?
Answer: LangChain is a framework for building applications using Large Language Models (LLMs).
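为了帮助理解上面 `similarity_search` 在概念上做了什么,下面给出一个不依赖任何向量库的极简示意:检索的核心就是在查询向量与文档向量之间计算相似度(这里用余弦相似度),再取分数最高的前 k 个。示例中的向量与文档内容均为虚构,仅用于说明原理,并非真实嵌入:

```python
# 概念示意:similarity_search 的本质是计算向量相似度并取 top-k。
# 以下向量和文档均为虚构示例,真实嵌入向量通常为 384 维或更高。
import math

def cosine_similarity(a, b):
    # 余弦相似度 = 点积 / (两个向量模长的乘积)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# 假设这些是文档块经嵌入模型得到的向量
doc_vectors = {
    "LangChain is a framework for building LLM applications.": [0.9, 0.1, 0.2],
    "FAISS enables efficient similarity search.": [0.2, 0.8, 0.1],
    "Chroma is a vector database.": [0.3, 0.2, 0.9],
}

def similarity_search(query_vector, k=2):
    # 对每个文档计算与查询向量的相似度,按分数从高到低取前 k 个
    scored = [(cosine_similarity(query_vector, v), text)
              for text, v in doc_vectors.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

query_vector = [0.85, 0.15, 0.25]  # 假设的查询向量
for text in similarity_search(query_vector, k=2):
    print(text)
```

真实的向量存储(如 Chroma、FAISS)在此基础上增加了索引结构,使得在百万级向量中也能高效完成同样的 top-k 检索。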
Recipe 13
本 recipe 演示带元数据过滤的相似度搜索。
本 recipe 演示如何使用 LangChain 和 FAISS 执行带元数据过滤的相似度搜索。
它包含以下步骤:加载文档、在保留元数据的前提下进行切分、创建嵌入,并执行带过滤条件的相似度搜索。
- 用于程序处理的原始数据:这是一个字典列表,用于模拟待处理内容。每个字典包含一个 text 键用于存放内容,一个 metadata 键用于存放附加信息。这模拟了你有多段内容、且每段内容都带有关联元数据的场景。
- 在保留元数据的前提下切分:这一步会把文档切分成更小的块,同时保留其元数据。这对于高效处理和检索非常重要。
- 使用 Hugging Face 创建嵌入:这一步使用预训练模型,将文本块转换为数值向量表示。
- 带元数据地存储到 FAISS 中:这一步会根据文档块及其嵌入创建一个 FAISS 向量存储。
- 带元数据过滤的相似度搜索:这一步会在 FAISS 索引上执行相似度搜索,并根据元数据对结果进行过滤。

安装所需依赖:
pip install langchain faiss-cpu sentence-transformers
retrieve_metadata_filtering.py
请参考以下代码:
# This code demonstrates how to perform a similarity search with
# metadata filtering using LangChain and FAISS.
# It includes steps to load documents, split them while preserving
# metadata, create embeddings, and perform a filtered similarity search.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
# 1. Raw data for processing of the program
# This is a list of dictionaries simulating the content to be processed.
# Each dictionary contains a 'text' key for the content and a 'metadata'
# key for additional information.
# This simulates a scenario where you have multiple pieces of content
# with associated metadata.
raw_docs = [
{
"text": "In RAG LangChain supports various vector stores including FAISS and Chroma.",
"metadata": {"source": "content1", "category": "LangChain"}
},
{
"text": "In RAG LangChain is used to build RAG applications.",
"metadata": {"source": "content2", "category": "LangChain"}
},
{
"text": "In RAG FAISS is a library for efficient similarity search.",
"metadata": {"source": "content3", "category": "FAISS"}
},
]
documents = [Document(page_content=doc["text"], metadata=doc["metadata"]) for doc in raw_docs]
# 2. Split with metadata preserved
# This step splits the documents into smaller chunks while preserving
# their metadata.
# This is important for efficient processing and retrieval.
splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10)
split_docs = splitter.split_documents(documents)
# 3. Create embeddings using Hugging Face
# This step converts the text chunks into numerical vector
# representations using a pre-trained model.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# 4. Store in FAISS with metadata
# This step creates a FAISS vector store from the document chunks and
# their embeddings.
faiss_index = FAISS.from_documents(split_docs, embeddings)
# 5. Filtered similarity search (filter by metadata)
# This step performs a similarity search on the FAISS index, filtering
# results based on metadata.
query = "What is RAG?"
results = faiss_index.similarity_search(
query=query,
k=5,
filter={"category": "LangChain"} # Filtering condition
)
# 6. Print the results
# This will print the content of the retrieved documents along with
# their metadata.
# This is useful for understanding which documents were retrieved and
# their associated metadata.
print("\nFiltered Similarity Search Results:")
for i, doc in enumerate(results, 1):
print(f"\nResult {i}:")
print(f"Source: {doc.metadata.get('source')}")
print(f"Category: {doc.metadata.get('category')}")
print(f"Content: {doc.page_content}")
打印结果:这将输出检索到的文档内容及其元数据,有助于理解哪些文档被检索出来,以及它们所附带的元数据信息。
输出:
Filtered Similarity Search Results:
Result 1:
Source: content2
Category: LangChain
Content: In RAG LangChain is used to build RAG applications.
Result 2:
Source: content1
Category: LangChain
Content: In RAG LangChain supports various vector stores including FAISS and Chroma.
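元数据过滤的本质,是在按相似度排序取 top-k 之前,先用元数据条件筛掉不匹配的候选文档。下面用纯 Python 给出一个概念示意(文档内容与相似度分数均为虚构,仅用于说明过滤逻辑,并非 FAISS 的真实实现):

```python
# 概念示意:先按元数据过滤候选文档,再按相似度分数取 top-k。
# 文档与分数均为虚构示例,仅用于说明带过滤的检索逻辑。
docs = [
    {"text": "LangChain supports FAISS and Chroma.",
     "metadata": {"category": "LangChain"}, "score": 0.82},
    {"text": "LangChain is used to build RAG applications.",
     "metadata": {"category": "LangChain"}, "score": 0.91},
    {"text": "FAISS is a library for similarity search.",
     "metadata": {"category": "FAISS"}, "score": 0.95},
]

def filtered_search(docs, filter_by, k=5):
    # 1) 只保留元数据完全匹配过滤条件的文档
    candidates = [d for d in docs
                  if all(d["metadata"].get(key) == value
                         for key, value in filter_by.items())]
    # 2) 按相似度分数从高到低排序并取前 k 个
    candidates.sort(key=lambda d: d["score"], reverse=True)
    return candidates[:k]

results = filtered_search(docs, filter_by={"category": "LangChain"}, k=5)
for d in results:
    print(d["text"])
# 注意:FAISS 类别的文档虽然分数最高,但被过滤条件排除在外
```

这也解释了上面 recipe 的输出:只有 category 为 LangChain 的两个文档块被返回,而与查询在语义上同样相关的 FAISS 文档块被过滤条件挡在结果之外。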
生成
在检索到相关信息之后,RAG 的最后一个步骤就是生成。此时,语言模型会利用检索到的信息来生成回答。也正是在这里,整个流程的各个组件被整合起来,最终产出准确、可溯源且具备上下文感知能力的结果。
Recipe 14
本 recipe 演示了一个使用 LangChain 和 Ollama 构建的简单 RAG 流水线。它从 Chroma 向量存储中检索相关文本块,并利用这些文本来回答查询。
该程序展示了如何在 Python 中使用 LangChain 构建一个最小可运行的 RAG 流水线,并执行以下操作:
- 加载 PDF 文件
- 将文档切分为多个块
- 创建嵌入
- 存储到向量数据库(Chroma)
- 检索相关文本块
- 使用 Ollama(llama3.2:3b)生成答案
请按以下步骤执行:
启动 Ollama 中的 llama3.2:3b 模型。
如果你已经安装了 Ollama 并准备使用 llama3.2:3b,请在命令行中运行以下命令:
ollama run llama3.2:3b
加载 PDF 文档。请确保当前目录下有一个名为 RAG.pdf 的 PDF 文件。
将文档切分成适合处理的块。这有助于创建更小的文本片段,以便更好地进行检索和处理。
加载嵌入模型。你也可以根据需要选择其他模型,但该模型对于很多任务都比较高效。
根据文档块创建一个 Chroma 向量存储。这会将文本块转换成对应的向量表示并建立索引。
定义你的查询。这是你希望通过检索上下文来回答的问题。
创建提示模板。这个模板将对上下文与问题进行格式化,以供 LLM 使用。
加载 LLM 模型,将提示模板与 LLM 组合成 RAG 链,并运行该链,基于检索到的上下文生成答案。
安装所需依赖:
pip install langchain langchain-community chromadb pypdf sentence-transformers
simple_langchain_rag_pipeline.py
请参考以下代码:
# Basic RAG Pipeline with LangChain and Ollama
# This code demonstrates a simple Retrieval Augmented Generation
# (RAG) pipeline using LangChain and Ollama.
# It retrieves relevant text chunks from a Chroma vector store
# and uses them to answer a query
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_ollama import ChatOllama
# 1. Load the PDF document
# Ensure you have a PDF file named "RAG.pdf" in the current
# directory.
loader = PyPDFLoader("RAG.pdf")
pages = loader.load()
# 2. Split the document into manageable chunks
# This helps in creating smaller text segments for better
# retrieval and processing.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(pages)
# 3. Load the embedding model
# You can choose a different model if needed, but this one is
# efficient for many tasks.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 4. Create a Chroma vector store from the document chunks
# This will convert the list of text chunks into their corresponding
# vector embeddings and create an index
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embedding_model,
persist_directory="chroma_vector_store"
)
# 5. Define your query
# This is the question you want to answer using the
# retrieved context
query = "What is the document about?"
docs = vector_store.similarity_search(query, k=3)
context = "\n\n".join([doc.page_content for doc in docs])
# 6. Create a prompt template
# This template will format the context and question for the LLM.
prompt = PromptTemplate(
input_variables=["context", "question"],
template="""
You are an AI assistant. Use the context below to answer the question
accurately.
Context:
{context}
Question:
{question}
Answer:
"""
)
# 7. Load the LLM model
llm = ChatOllama(model="llama3.2:3b", temperature=0.3)
rag_chain = prompt | llm
answer = rag_chain.invoke({"context": context, "question": query})
# 8. Output the results
# This will print the query and the answer generated by the
# LLM based on the retrieved context
print("Question:", query)
print("Answer:", answer.content)
输出结果。程序会打印查询问题,以及 LLM 基于检索上下文生成的答案。
输出:
Question: What is the document about?
Answer: The document appears to be about LangChain, a framework for building applications using Large Language Models (LLMs).
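Recipe 中 RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50) 的核心思想,可以用一个极简的滑动窗口切分器来说明。注意这只是概念示意:真实的 RecursiveCharacterTextSplitter 还会优先在段落、句子等分隔符处断开,而不是机械地按字符数切:

```python
# 概念示意:带重叠的滑动窗口切分。
# 真实的 RecursiveCharacterTextSplitter 会优先在段落/句子边界处切分,
# 这里仅演示 chunk_size 与 chunk_overlap 两个参数的基本含义。
def split_text(text, chunk_size=500, chunk_overlap=50):
    chunks = []
    step = chunk_size - chunk_overlap  # 每个窗口相对前一个窗口前进的步长
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "0123456789" * 12  # 120 个字符的示例文本
chunks = split_text(text, chunk_size=50, chunk_overlap=10)
print([len(c) for c in chunks])  # → [50, 50, 40]
# 相邻块之间重叠 10 个字符,避免语义信息被硬切断
print(chunks[0][-10:] == chunks[1][:10])  # → True
```

重叠(chunk_overlap)的作用在于:当一句话恰好落在两个块的边界上时,后一个块仍包含前一个块末尾的内容,检索时不会因为切分位置而丢失上下文。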
结论
在这一基础章节中,我们探讨了 RAG 的核心构建模块。RAG 是一种强大的架构,它将大语言模型的能力与实时的、具备上下文的信息检索结合起来。我们理解了 RAG 是如何通过在生成过程中注入检索到的知识,来增强纯生成式能力,从而生成既相关又有事实依据的输出。
我们学习了文档如何被切分、如何被嵌入成向量、如何被存储进向量数据库,以及如何基于用户查询进行检索。我们还强调了嵌入模型、相似度搜索,以及元数据在上下文过滤中的重要作用。
现在,随着对 RAG 概念的清晰理解,我们已经准备好进一步深入各个组件的细节。下一章将详细讲解文档加载器的概念与技术,并通过 recipe 提供可动手实践的学习体验。