Introduction
The true power of RAG emerges when we move beyond a simple "retrieve + respond" pipeline and begin orchestrating multiple steps through chains. Chains provide a structured way to connect retrieval, reasoning, and generation, allowing us to build workflows that are more controllable, more explainable, and better suited to specific tasks. By implementing RAG as chains, we can integrate modular components such as query rewriting, retrieval strategies, tool invocation, summarization, and answer verification into a single coherent system. This chapter introduces the core concepts of chains, explains why they are key to building reliable RAG applications, and demonstrates how to design, implement, and evaluate a variety of practical chain-based recipes that improve both accuracy and usability.
Structure
This chapter covers the following topics:
- Software requirements
- Knowledge retrieval QA chain
- Conversational RAG chain
- Summarization chain
- Cited answer chain
- Stuff documents chain
- Tool-augmented RAG chain
- Source-aware chain
- Hybrid dense + sparse chain
- Metadata-filtered self-query chain
- Rerank chain
- Step-back chain
Learning objectives
After completing this chapter, readers will have the knowledge and practical skills needed to design and implement RAG workflows using chains. By the end of the chapter, readers will understand how chains enable the modular composition of retrieval, reasoning, and response steps, bringing greater flexibility and control to RAG systems. The chapter focuses on building and experimenting with a variety of practical chain-based recipes, such as the cited answer chain, self-query chain, hybrid retrieval chain, and summarization chain, which improve factual accuracy, context awareness, and user trust. The ultimate goal is to help readers progress from basic retrieval pipelines to structured, production-ready RAG applications that realize the full potential of chaining.
Software requirements
Each concept in this book is followed by a corresponding hands-on recipe, a runnable piece of Python code. All recipes include code comments that explain what each line of code does.
The following software requirements must be met to run these recipes:
- System: a computer with at least 16.0 GB of RAM
- Operating system: Windows
- Python: Python 3.13.3 or later
- LangChain: 1.0.5
- LLM model: llama3.2:3b from Ollama
- Program input files: the input files used by the programs in this chapter are available in the book's Git repository
To run the programs, first execute the Python command pip install <package names> to install the dependencies mentioned in each recipe. Once installed, run the .py Python scripts provided in the recipes in your development environment.
Refer to the following figure, which illustrates chains in RAG:
Figure 10.1: Chains in RAG
Knowledge retrieval QA chain
The knowledge retrieval QA chain is a core component of the RAG pipeline, combining document retrieval with question-answering capability. It works by first fetching the most relevant documents or text chunks from a vector store based on the user query, then passing that context to a language model to generate an accurate, context-aware response. Because answers are grounded in the retrieved documents, a retrieval QA chain reduces hallucinations and ensures that responses are backed by real evidence. This approach is well suited to knowledge-intensive applications with high demands on accuracy, traceability, and relevance.
Recipe 101
This recipe demonstrates how to implement a knowledge retrieval QA chain:
- Prepare sample documents to populate the vector store.
- Create the vector store and retriever with FAISS and Hugging Face embeddings.
- Load a local HuggingFace model (e.g., Flan-T5) and create an LLM wrapper.
- Run a query and print the result.
Install the required dependencies:
pip install langchain langchain-community langchain-huggingface faiss-cpu sentence-transformers transformers
retrievalQA_chain.py
Refer to the following code:
# retrievalQA_chain.py
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFacePipeline
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
from transformers import pipeline
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Sample documents to populate the vector store
docs = [
    Document(page_content="RAG reduces hallucinations by grounding answers in retrieved evidence."),
    Document(page_content="Dense retrieval uses vector embeddings instead of keyword matching."),
]

# 2. Create vector store and retriever with FAISS and HuggingFace
# embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
db = FAISS.from_documents(docs, embedding_model)
retriever = db.as_retriever(search_kwargs={"k": 2})

# 3. Load local HuggingFace model (e.g., Flan-T5) and create LLM wrapper
hf_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=hf_pipeline)

# 4. Define a simple prompt formatter
def format_prompt(context: str, question: str) -> str:
    return f"""
Use the following context to answer the question.
If you don't know the answer, just say "I don't know".
Context:
{context}
Question: {question}
Answer:
"""

# 5. Run a query
query = "How does RAG reduce hallucinations?"

# Retrieve top documents via the public Runnable interface
retrieved_docs = retriever.invoke(query)

# Concatenate context
context_text = "\n".join([doc.page_content for doc in retrieved_docs])

# Prepare prompt
input_prompt = format_prompt(context_text, query)

# Generate the answer using HuggingFacePipeline
answer_text = llm.invoke(input_prompt)

# Print results
print(f"Query: {query}")
print(f"Answer: {answer_text}")
Output:
Query: How does RAG reduce hallucinations?
Answer: grounding answers in retrieved evidence
Conversational RAG chain
The conversational RAG chain builds on basic retrieval QA by adding memory and context management to support multi-turn interactions. Rather than treating each question as independent, it tracks the flow of the conversation, remembers previous questions and answers, and continues to retrieve relevant documents across turns. By combining retrieval with conversational memory, these chains support natural, context-sensitive dialogue, letting the model refine, clarify, or expand its answers based on earlier exchanges. This approach is particularly useful for chatbots, virtual assistants, customer-service systems, and other applications that require coherent conversational ability.
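The memory mechanism described above hinges on one step: before retrieval, the follow-up question is condensed together with the chat history into a standalone query. The sketch below is a minimal, dependency-free illustration of that step; the `condense_question` helper and its rule-based rewrite are illustrative stand-ins for the LLM rewrite that ConversationalRetrievalChain performs internally.

```python
# A minimal sketch of the question-condensing step performed inside a
# conversational RAG chain. The rule-based rewrite below is a stand-in
# for the LLM call that produces a standalone question in practice.
def condense_question(chat_history, follow_up):
    """Rewrite a context-dependent follow-up into a standalone question."""
    if not chat_history:
        return follow_up
    last_q, _last_a = chat_history[-1]
    # Stand-in for the LLM rewrite: attach the prior topic as context
    return f"{follow_up} (in the context of: {last_q})"

history = [("How does RAG reduce hallucinations?",
            "By grounding answers in retrieved evidence.")]
standalone = condense_question(history, "And what method does it use for retrieval?")
print(standalone)
```

The standalone question, not the raw follow-up, is what the chain sends to the retriever, which is why multi-turn references like "it" still retrieve the right documents.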
Recipe 102
This recipe demonstrates how to implement a conversational RAG chain:
- Prepare sample documents to populate the vector store.
- Create the vector store and retriever with FAISS and Hugging Face embeddings.
- Load a local HuggingFace model (e.g., Flan-T5) and create an LLM wrapper.
- Create a ConversationalRetrievalChain from the components. ConversationalRetrievalChain manages chat history internally, so no custom prompt is needed.
- Maintain the chat history and run multiple queries.
- Run the first query through the chain and print the answer.
- Run a follow-up query and print the answer.
Install the required dependencies:
pip install langchain langchain-community langchain-huggingface faiss-cpu sentence-transformers transformers
conversational_rag_chain.py
Refer to the following code:
# conversational_rag_chain.py
# Example of a conversational RAG chain using LangChain with a local
# HuggingFace model and FAISS vector store.
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_classic.chains import ConversationalRetrievalChain
from langchain_huggingface import HuggingFacePipeline
from langchain_core.documents import Document
from transformers.pipelines import pipeline
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Sample documents to populate the vector store
docs = [
    Document(page_content="RAG reduces hallucinations by grounding answers in retrieved evidence."),
    Document(page_content="Dense retrieval uses vector embeddings instead of keyword matching."),
    Document(page_content="BM25 is a sparse retrieval method based on term frequency statistics."),
    Document(page_content="Vector databases store embeddings to enable semantic search."),
]

# 2. Create vector store and retriever with FAISS and HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
db = FAISS.from_documents(docs, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 2})

# 3. Load local HuggingFace model
# (e.g., Flan-T5) and create LLM wrapper
hf_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=hf_pipeline)

# 4. Create ConversationalRetrievalChain with components
# ConversationalRetrievalChain manages chat history internally, hence
# no custom prompt needed
qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever)

# 5. Maintain chat history and run multiple queries
chat_history = []

# First query
query1 = "How does RAG reduce hallucinations?"

# 6. Run the first query through the chain and print the answer
result1 = qa_chain.invoke({"question": query1, "chat_history": chat_history})
print(f"Q1: {query1}")
print("A1:", result1["answer"])
chat_history.append((query1, result1["answer"]))

# Follow-up query (depends on context)
query2 = "And what method does it use for retrieval?"

# 7. Run the follow-up query through the chain and print the answer
result2 = qa_chain.invoke({"question": query2, "chat_history": chat_history})
print(f"\nQ2: {query2}")
print("A2:", result2["answer"])
Output:
Q1: How does RAG reduce hallucinations?
A1: grounding answers in retrieved evidence
Q2: And what method does it use for retrieval?
A2: BM25 is a sparse retrieval method based on term frequency statistics. Dense retrieval uses vector embeddings instead of keyword matching.
Summarization chain
The goal of a summarization chain is to condense long documents or large amounts of retrieved content into concise, coherent summaries. Rather than answering a specific question, it focuses on distilling the core ideas, key details, and overall structure of the input text. This makes it especially valuable for long reports, research papers, or multi-document collections, where users want to grasp the main points quickly without being overwhelmed by the source material. By combining retrieval with summarization, these chains help users quickly understand the essence of complex information without reading all of it.
Recipe 103
This recipe demonstrates how to implement a summarization chain:
- Choose a factual summarization model. facebook/bart-large-cnn is known for concise, accurate summaries.
- Load the tokenizer and model, explicitly setting model_max_length to avoid truncation warnings.
- Select the device (GPU if available, otherwise CPU).
- Create the summarization pipeline. pipeline handles tokenization, model inference, and decoding.
- Prepare the input text to summarize.
- Clean and normalize the input text, removing extra newlines and spaces for better summarization.
- Dynamically calculate the summary length limits so the output is proportional to the input length, producing more consistent results.
- Generate the summary, using deterministic decoding (beam search) to improve factual consistency.
- Display the results.
Install the required dependencies:
pip install langchain transformers sentence-transformers
summarization_chain.py
Refer to the following code:
# summarization_chain.py
# Import required libraries
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch

# 1. Choose a factual summarization model
# 'facebook/bart-large-cnn' is known for concise, accurate summaries
model_name = "facebook/bart-large-cnn"

# 2. Load tokenizer and model
# Explicitly set 'model_max_length' to prevent truncation warnings
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=1024,  # Maximum token length the model can handle
    truncation=True
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 3. Select device (GPU if available, else CPU)
device = 0 if torch.cuda.is_available() else -1
print(f"Device set to: {'GPU' if device == 0 else 'CPU'}")

# 4. Create the summarization pipeline
# 'pipeline' handles tokenization, model inference, and decoding
summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    device=device,
    truncation=True
)

# 5. Input text to summarize
text = """Retrieval-Augmented Generation (RAG) is a technique
that reduces hallucinations in large language models by grounding answers
in external evidence. Instead of generating responses purely from
parameters, RAG retrieves relevant context from a knowledge base and
conditions the generation on it. This improves factual accuracy and
reduces the chances of fabricated information."""

# 6. Clean and normalize the input text
# Removes unnecessary newlines or spaces for better summarization
text = " ".join(line.strip() for line in text.splitlines() if line.strip())

# 7. Dynamically calculate summary length limits
# Ensures output is proportional to input size for consistent results
input_length = len(text.split())
max_len = min(100, max(30, int(input_length * 0.6)))  # upper limit for summary tokens
min_len = int(max_len * 0.4)  # lower limit for summary tokens
print(f"Dynamic length settings: min={min_len}, max={max_len}")

# 8. Generate summary
# Uses deterministic decoding (beam search) to ensure factual output
summary = summarizer(
    text,
    max_length=max_len,
    min_length=min_len,
    truncation=True,
    do_sample=False,    # Disable random sampling for consistent output
    num_beams=4,        # Explore multiple candidate summaries
    length_penalty=1.8  # Encourage shorter, more concise summaries
)[0]['summary_text']

# 9. Display results
print("\n=== Original Text ===")
print(text)
print("\n=== Faithful Summary ===")
print(summary)
Output:
Device set to: CPU
Device set to use cpu
Dynamic length settings: min=12, max=30
=== Original Text ===
Retrieval-Augmented Generation (RAG) is a technique that reduces hallucinations in large language models by grounding answers in external evidence. Instead of generating responses purely from parameters, RAG retrieves relevant context from a knowledge base and conditions the generation on it. This improves factual accuracy and reduces the chances of fabricated information.
=== Faithful Summary ===
Retrieval-Augmented Generation (RAG) is a technique that reduces hallucinations in large language models by grounding answers in external evidence.
Cited answer chain
The cited answer chain strengthens the reliability of a RAG system by producing answers alongside the sources that support them. Rather than only generating a response, it explicitly labels the source documents that back that response. This improves transparency, lets users verify where information came from, and builds trust in the system's output. Because answers are explicitly anchored to citable evidence, these chains reduce the risk of hallucination and are especially suitable for research, compliance, knowledge management, and other scenarios with strict requirements for credibility and traceability.
Recipe 104
This recipe demonstrates how to implement a cited answer chain in a RAG system:
- Prepare sample documents with source information, containing a mix of relevant and irrelevant content.
- Set up the embedding and retrieval flow with FAISS for a simple demonstration.
- Load a text generation model for answering with citations.
- Retrieve the top-k documents whose similarity to the query exceeds a threshold.
- Construct the context with source information for the generation step.
- Prompt the model to answer with citations.
- Generate the cited response.
- Display the final answer with its citations.
Install the required dependencies:
pip install sentence-transformers faiss-cpu transformers torch
cited_answer_chain.py
Refer to the following code:
# cited_answer_chain.py
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss

# 1. Sample documents with sources, a mix of relevant and
# irrelevant content
docs = [
    {"content": "Intermittent fasting improves insulin sensitivity.", "source": "Source 1"},
    {"content": "It may help with weight loss by reducing calorie intake.", "source": "Source 2"},
    {"content": "It supports autophagy, the body’s cell repair process.", "source": "Source 3"},
    # Irrelevant documents
    {"content": "Paris is the capital of France.", "source": "Irrelevant 1"},
    {"content": "Python is a programming language for AI and data science.", "source": "Irrelevant 2"},
]

# 2. Embedding and retrieval setup using FAISS for simplicity
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embs = embedder.encode([d["content"] for d in docs])
index = faiss.IndexFlatL2(embs.shape[1])
index.add(embs)

# 3. Load a text generation model for answering with citations
gen = pipeline("text2text-generation", model="google/flan-t5-base")

# 4. Retrieve top documents above the similarity threshold for a query
def retrieve(query, k=5, threshold=0.5):
    q_emb = embedder.encode([query])
    distances, idxs = index.search(q_emb, k)
    results = []
    for dist, idx in zip(distances[0], idxs[0]):
        sim = 1 / (1 + dist)  # crude similarity from L2 distance
        if sim >= threshold:
            results.append(docs[idx])
    return results

def cited_answer(query, k=5, threshold=0.5):
    retrieved = retrieve(query, k, threshold)
    if not retrieved:
        return "No relevant information found."
    # 5. Construct context with sources for generation
    context = "\n".join([f"{d['content']} ({d['source']})" for d in retrieved])
    # 6. Prompt the model to answer with citations
    prompt = f"""
Answer the question using ALL the relevant context facts below.
List each fact as a separate bullet point and always include its source
in parentheses.
Do not skip or merge facts. Ignore irrelevant content.
Question: {query}
Context:
{context}
Answer:"""
    # 7. Generate the answer with citations
    out = gen(prompt, max_new_tokens=200)[0]["generated_text"]
    # Fall back to a deterministic bulleted answer if the model
    # drops citations or facts
    if "(" not in out or len(out.splitlines()) < len(retrieved):
        out = "\n".join([f"- {d['content']} ({d['source']})" for d in retrieved])
    return out

if __name__ == "__main__":
    query = "What are the benefits of intermittent fasting?"
    k = 5  # number of docs to retrieve
    answer = cited_answer(query, k)
    # 8. Display the answer with citations
    print("\n=== Cited Answer Chain Output ===")
    print(answer)
Output:
=== Cited Answer Chain Output ===
- Intermittent fasting improves insulin sensitivity. (Source 1)
- It may help with weight loss by reducing calorie intake. (Source 2)
Stuff documents chain
The stuff documents chain is the simplest way to combine retrieval with a language model. In this approach, all retrieved documents or text chunks are "stuffed" directly into the model's prompt together with the user query. It is easy to implement and works well for small amounts of text, but it quickly runs into the model's context window limit when documents are long or there are many retrieved chunks. Even so, the stuff documents chain is a good starting point for understanding how RAG works, and it is commonly used in applications where the amount of retrieved information is relatively small.
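The "stuffing" step itself can be sketched in a few lines, independent of any framework. In this minimal, dependency-free sketch, the `stuff_prompt` helper, its prompt template, and the `max_chars` budget are all illustrative assumptions, not a fixed API; real implementations count tokens rather than characters.

```python
# A minimal sketch of the "stuff" strategy: concatenate every retrieved
# chunk into one prompt, guarding against the context window limit.
def stuff_prompt(docs, question, max_chars=2000):
    # Join all retrieved chunks into a single context block
    context = "\n\n".join(docs)
    if len(context) > max_chars:
        # Stuffing fails on large inputs; other strategies
        # (map-reduce, refine) exist for that case
        raise ValueError("Context exceeds the model's window.")
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = ["RAG grounds answers in retrieved evidence.",
        "Dense retrieval uses vector embeddings."]
print(stuff_prompt(docs, "How does RAG reduce hallucinations?"))
```

The explicit size check makes the chain's main limitation visible: everything retrieved must fit into a single prompt.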
Recipe 105
This recipe demonstrates how to implement a stuff documents chain:
- Prepare sample documents with metadata for filtering.
- Create the embeddings and vector store.
- Create an LLM pipeline for text generation.
- Wrap the pipeline in a LangChain LLM.
- Create a retriever from the vector store.
- Prepare example queries with metadata filtering.
- Execute the queries and print the results with metadata filtering.
Install the required dependencies:
pip install langchain langchain-community chromadb sentence-transformers transformers torch
stuff_documents_chain.py
Refer to the following code:
# stuff_documents_chain.py
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFacePipeline
from langchain_core.documents import Document
from transformers import pipeline
import warnings
import transformers

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")
transformers.logging.set_verbosity_error()

# 1. Sample documents with metadata for filtering
docs = [
    Document(
        page_content="Intermittent fasting improves insulin sensitivity.",
        metadata={"author": "Alice", "category": "health", "year": 2022}
    ),
    Document(
        page_content="The 2008 financial crisis impacted global markets.",
        metadata={"author": "Bob", "category": "finance", "year": 2008}
    ),
    Document(
        page_content="The French Revolution began in 1789.",
        metadata={"author": "Charles", "category": "history", "year": 1789}
    ),
]

# 2. Embed both content and metadata
def combine_metadata(doc: Document):
    meta_text = " ".join(f"{k}: {v}" for k, v in doc.metadata.items())
    return Document(page_content=f"{meta_text}. {doc.page_content}", metadata=doc.metadata)

docs_with_meta = [combine_metadata(doc) for doc in docs]

# 3. Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(docs_with_meta, embeddings)

# 4. Create LLM pipeline for text generation
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    device=-1,  # CPU
    max_length=256
)

# 5. Wrap the pipeline in a LangChain LLM
llm = HuggingFacePipeline(pipeline=generator)

# 6. Create a retriever from the vector store
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

# 7. Helper to generate LLM input and get response
def run_llm(context: str, question: str) -> str:
    """
    Build input prompt from context and question, then invoke the LLM
    """
    input_text = f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    return llm.invoke(input_text)

# 8. RAG query function
def rag_query(query: str):
    """
    Retrieve documents, combine context, and run LLM
    """
    retrieved_docs = retriever.invoke(query)
    context_text = "\n".join([doc.page_content for doc in retrieved_docs])
    answer = run_llm(context_text, query)
    return answer, retrieved_docs

# 9. Example queries
queries = [
    "What did Alice write about health?",
    "Tell me about financial events after 2005",
    "History events before 1900"
]

# 10. Execute queries
for q in queries:
    ans, docs_used = rag_query(q)
    print(f"\nQuery: {q}")
    print("Answer:", ans)
    print("Documents used:")
    for d in docs_used:
        print(f"- {d.page_content}")
Output:
Query: What did Alice write about health?
Answer: Intermittent fasting improves insulin sensitivity
Documents used:
- author: Alice category: health year: 2022. Intermittent fasting improves insulin sensitivity.
Query: Tell me about financial events after 2005
Answer: The 2008 financial crisis impacted global markets.
Documents used:
- author: Bob category: finance year: 2008. The 2008 financial crisis impacted global markets.
Query: History events before 1900
Answer: The French Revolution began in 1789.
Documents used:
- author: Charles category: history year: 1789. The French Revolution began in 1789.
Tool-augmented RAG chain
The tool-augmented RAG chain extends the basic retrieval flow by allowing the language model to call external tools rather than relying on retrieved documents alone. These tools can be APIs, calculators, search engines, or domain-specific services that provide up-to-date information, factual results, or computationally intensive operations. By combining document retrieval with tool invocation, these chains produce answers that are more accurate, fresher, and more actionable. They are especially suitable for applications that need access to real-time data, complex reasoning, or interaction with external systems beyond static knowledge sources.
Recipe 106
This recipe demonstrates how to implement a tool-augmented RAG chain:
- Initialize a local LLM using Hugging Face's Transformers pipeline.
- Prepare sample documents for the RAG tool. In a real-world scenario, these would be replaced with a larger and more relevant dataset.
- Initialize the embeddings and create the vector store for the RAG tool.
- Construct example queries for the tool-augmented RAG chain.
- Execute the tool-augmented RAG chain and print the results.
Install the required dependencies:
pip install langchain torch transformers sentence-transformers faiss-cpu numexpr
tool_augmented_rag_chain.py
Refer to the following code:
# tool_augmented_rag_chain.py
# This code demonstrates a tool-augmented retrieval-augmented generation (RAG) chain
import re
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document

# 1. Initialize the local LLM using HuggingFace's transformers pipeline
local_pipe = pipeline("text2text-generation", model="google/flan-t5-base")
llm = HuggingFacePipeline(pipeline=local_pipe)

# Define the calculator tool
def calculator_tool(query: str) -> str:
    # '-' is escaped so the character class is not parsed as a range
    expr_match = re.findall(r"[0-9.+\-*/()]+", query)
    if not expr_match:
        return "No math expression found."
    expr = "".join(expr_match)
    allowed_names = {"abs": abs, "round": round, "pow": pow, "min": min, "max": max}
    try:
        # eval is limited to the sanitized expression and a small whitelist
        return str(eval(expr, {"__builtins__": {}}, allowed_names))
    except Exception as e:
        return f"Error: {e}"

# 2. Sample documents for RAG tool
# In a real-world scenario, these would be replaced with a larger and
# more relevant dataset.
docs = [
    Document(page_content="Alice wrote about intermittent fasting and health."),
    Document(page_content="The 2008 financial crisis impacted global markets."),
    Document(page_content="The French Revolution began in 1789.")
]

# 3. Initialize embeddings and create the vector store for RAG tool
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embedding=embeddings)

# RAG tool function to retrieve relevant documents and generate a response
def rag_tool(query: str) -> str:
    # Return only the most relevant document
    results = vectorstore.similarity_search(query, k=1)
    if not results:
        return "No relevant documents found."
    return results[0].page_content

# Combined tool agent function
def tool_agent(query: str) -> str:
    if any(op in query for op in ["+", "-", "*", "/", "**", "(", ")"]):
        return calculator_tool(query)
    else:
        return rag_tool(query)

# 4. Example queries for the tool-augmented RAG chain
queries = [
    "What is 25 * 12?",
    "Calculate 100 / 4 + 7",
    "Tell me about financial events after 2005",
    "History events before 1900"
]

# 5. Execute the tool-augmented RAG chain with example queries and print the results
for q in queries:
    print("\nQuery:", q)
    answer = tool_agent(q)
    print("Answer:", answer)
Output:
Query: What is 25 * 12?
Answer: 300
Query: Calculate 100 / 4 + 7
Answer: 32.0
Query: Tell me about financial events after 2005
Answer: The 2008 financial crisis impacted global markets.
Query: History events before 1900
Answer: The French Revolution began in 1789.
Source-aware chain
A source-aware chain returns not just an answer but also a clear record of which documents or text snippets were used to produce it. Unlike a standard retrieval pipeline that returns only the final response, these chains maintain a mapping between the output and its supporting sources. This mechanism improves transparency, supports auditing, and helps users verify the reliability of the system's output. Source tracking is especially important in legal research, healthcare, enterprise knowledge management, and other settings where the verifiability of information sources matters as much as the answer itself.
Recipe 107
This recipe demonstrates how to implement a source-aware chain:
- Create a local LLM pipeline using transformers.
- Create sample documents with metadata to support source tracking.
- Set up the embeddings and vector store with a lightweight embeddings model.
- Prepare example queries. Each query returns an answer along with the sources of the documents used.
- Execute the queries and print the results with source information.
Install the required dependencies:
pip install langchain sentence-transformers faiss-cpu transformers torch
source_tracked_chain.py
Refer to the following code:
# source_tracked_chain.py
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

# --------------------------------------
# 1. Initialize LLM
# --------------------------------------
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-large",
    max_length=256,
    do_sample=False
)
llm = HuggingFacePipeline(pipeline=generator)

# --------------------------------------
# 2. Embeddings
# --------------------------------------
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# --------------------------------------
# 3. Documents WITH SOURCES
# --------------------------------------
docs = [
    Document(
        page_content="Intermittent fasting improves metabolic health.",
        metadata={"source": "doc1"}
    ),
    Document(
        page_content="The 2008 financial crisis impacted global markets.",
        metadata={"source": "doc2"}
    ),
    Document(
        page_content="The French Revolution began in 1789 and reshaped Europe.",
        metadata={"source": "doc3"}
    ),
]

# --------------------------------------
# 4. Vectorstore + Retriever
# --------------------------------------
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

# --------------------------------------
# 5. Prompt Template (Source-Aware Answering)
# --------------------------------------
answer_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a helpful assistant. Answer ONLY using the context.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n\n"
        "Answer:"
    )
)

# --------------------------------------
# 6. The Source-Aware Chain Function
# --------------------------------------
def source_aware_chain(question: str):
    # Step 1: Retrieve documents
    retrieved_docs = retriever.invoke(question)
    # Step 2: Combine context text
    context_text = "\n".join([doc.page_content for doc in retrieved_docs])
    # Step 3: Generate answer via LCEL
    answer = (answer_prompt | llm).invoke({
        "context": context_text,
        "question": question
    }).strip()
    # Step 4: Extract sources
    sources = [doc.metadata.get("source", "unknown") for doc in retrieved_docs]
    return {
        "query": question,
        "answer": answer,
        "sources": sources
    }

# --------------------------------------
# 7. Test Queries
# --------------------------------------
queries = [
    "What did Alice write about health?",
    "Tell me about financial events after 2005",
    "History events before 1900"
]

# --------------------------------------
# 8. Run the chain and print output
# --------------------------------------
for q in queries:
    result = source_aware_chain(q)
    print("\nQuery:", result["query"])
    print("Answer: Content:", result["answer"])
    print("Sources:", result["sources"])
Output:
Query: What did Alice write about health?
Answer: Content: Intermittent fasting and health
Sources: ['doc1']
Query: Tell me about financial events after 2005
Answer: Content: The 2008 financial crisis impacted global markets
Sources: ['doc2']
Query: History events before 1900
Answer: Content: The French Revolution began in 1789.
Sources: ['doc3']
Hybrid dense + sparse chain
The hybrid dense + sparse chain combines the strengths of embedding-based (dense) retrieval and keyword-based (sparse) retrieval. Dense retrieval captures semantics, retrieving relevant results by similarity in a vector space, while sparse retrieval excels at exact keyword matching and handling rare terms. By combining the two approaches, a hybrid chain improves both recall and precision, avoiding relevant documents being missed due to vocabulary mismatch while preserving semantic relevance. This approach fits real-world RAG applications well, because user queries may be natural-language questions or may contain highly technical domain keywords.
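A common way to merge a dense ranking with a sparse ranking in practice is Reciprocal Rank Fusion (RRF), which combines positions rather than raw scores, since dense and sparse scores live on incomparable scales. The sketch below is a minimal, dependency-free illustration; the two input rankings and the constant `k=60` are illustrative (60 is the value commonly cited for RRF).

```python
# A minimal sketch of Reciprocal Rank Fusion (RRF): each document's
# fused score is the sum of 1/(k + rank) over every ranking it appears in.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Sort document ids by fused score, best first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc2", "doc1", "doc3"]   # illustrative semantic ranking
sparse = ["doc1", "doc3", "doc2"]  # illustrative keyword (e.g., BM25) ranking
print(rrf([dense, sparse]))  # → ['doc1', 'doc2', 'doc3']
```

Because only ranks matter, RRF needs no score normalization, which is why it is a popular default for fusing heterogeneous retrievers.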
Recipe 108
This recipe demonstrates how to implement a hybrid dense + sparse chain:
- Prepare sample documents organized by topic to simulate dense and sparse retrieval.
- Prepare topic keywords for the dense retrieval simulation.
- Set up the dense, sparse, and hybrid retrievers.
- Prepare example queries.
- Build a simple QA chain that formats answers with sources from the retriever results.
- Execute the queries and print the results with sources.
Install the required dependencies:
pip install langchain faiss-cpu torch
hybrid_dense_sparse_chain.py
Refer to the following code:
# hybrid_dense_sparse_chain.py
# 1. Example documents with topics for dense and sparse retrieval
documents = [
    {"id": "doc1", "topic": "health", "text": "Alice wrote about intermittent fasting and health."},
    {"id": "doc2", "topic": "finance", "text": "The 2008 financial crisis impacted global markets."},
    {"id": "doc3", "topic": "history", "text": "The French Revolution began in 1789."},
]

# 2. Keywords for topics in the dense retrieval simulation
topic_keywords = {
    "health": ["health", "medical", "wellness", "intermittent fasting"],
    "finance": ["finance", "financial", "market", "crisis"],
    "history": ["history", "revolution", "before 1900"]
}

# Dense retriever simulates semantic search using topic keywords
class DenseRetriever:
    def __init__(self, docs):
        self.docs = docs

    def retrieve(self, query: str, top_k: int = 5):
        query_lower = query.lower()
        relevant_docs = []
        for doc in self.docs:
            keywords = topic_keywords.get(doc["topic"], [])
            if any(k.lower() in query_lower for k in keywords):
                relevant_docs.append({"id": doc["id"], "text": doc["text"], "score": 1.0})
        return relevant_docs[:top_k]

# Sparse retriever simulates keyword matching
class SparseRetriever:
    def __init__(self, docs):
        self.docs = docs

    def retrieve(self, query: str, top_k: int = 5):
        # simple word matching
        query_words = set(query.lower().split())
        relevant_docs = []
        for doc in self.docs:
            doc_words = set(doc["text"].lower().split())
            if query_words & doc_words:  # intersection
                relevant_docs.append({"id": doc["id"], "text": doc["text"], "score": 1.0})
        return relevant_docs[:top_k]

# Hybrid retriever combining dense and sparse results
class HybridRetriever:
    def __init__(self, dense_retriever, sparse_retriever, top_k=2):
        self.dense = dense_retriever
        self.sparse = sparse_retriever
        self.top_k = top_k

    def retrieve(self, query: str):
        dense_results = self.dense.retrieve(query, top_k=self.top_k)
        sparse_results = self.sparse.retrieve(query, top_k=self.top_k)
        # Merge by unique doc id
        seen = set()
        merged = []
        for r in dense_results + sparse_results:
            if r["id"] not in seen:
                merged.append(r)
                seen.add(r["id"])
        return merged[:self.top_k]

# 3. Set up the dense, sparse and hybrid retrievers
dense_retriever = DenseRetriever(documents)
sparse_retriever = SparseRetriever(documents)
hybrid_retriever = HybridRetriever(dense_retriever, sparse_retriever, top_k=2)

# 4. Prepare example queries
queries = [
    "Tell me about health",
    "Financial events after 2005",
    "History events before 1900"
]

# 5. Simple QA chain that formats answers with sources from retriever results
def qa_chain(query: str, retriever: HybridRetriever):
    results = retriever.retrieve(query)
    if not results:
        return "No relevant documents found."
    output = ""
    for r in results:
        output += f"Answer: {r['text']} (Source: {r['id']})\n"
    return output.strip()

# 6. Execute queries and print results with sources
for q in queries:
    print(f"Query: {q}")
    answer = qa_chain(q, hybrid_retriever)
    print(answer)
    print()
Output:
Query: Tell me about health
Answer: Alice wrote about intermittent fasting and health. (Source: doc1)
Query: Financial events after 2005
Answer: The 2008 financial crisis impacted global markets. (Source: doc2)
Query: History events before 1900
Answer: The French Revolution began in 1789. (Source: doc3)
Metadata-filtered self-query chain
The metadata-filtered self-query chain enhances retrieval by having the model automatically rewrite the user query into a structured retrieval request containing both semantic conditions and metadata constraints. Rather than relying on text similarity alone, it uses attributes such as author, date, document type, or domain tags to narrow the search. This makes the retrieved results not only semantically relevant but also precisely scoped by contextual constraints, targeting the most appropriate subset of documents. Combining self-querying with metadata filtering makes this chain especially valuable in enterprise, legal, and scientific settings, where constraints and context often matter as much as the content itself.
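The core move of a self-query chain is the parse step: splitting one natural-language query into a semantic search string plus a structured metadata filter. In practice an LLM performs this parse (LangChain's SelfQueryRetriever, for example); the regex-based `parse_query` below is an illustrative stand-in so the mechanism is visible, and its `$gt`/`$lt` filter syntax is an assumption modeled on common vector-store filter formats.

```python
# A minimal sketch of the self-query parse step: extract a metadata
# constraint ("after 2005" -> year > 2005) and leave a clean semantic
# search string behind. A real chain uses an LLM for this parse.
import re

def parse_query(query):
    filters = {}
    m = re.search(r"(after|before)\s+(\d{4})", query, re.IGNORECASE)
    if m:
        op = "$gt" if m.group(1).lower() == "after" else "$lt"
        filters["year"] = {op: int(m.group(2))}
        # Remove the constraint phrase from the semantic part
        query = re.sub(r"(after|before)\s+\d{4}", "", query, flags=re.IGNORECASE).strip()
    return {"search": query, "filter": filters}

print(parse_query("Financial events after 2005"))
# → {'search': 'Financial events', 'filter': {'year': {'$gt': 2005}}}
```

The `search` string goes to the vector store for similarity search, while the `filter` dict is passed to the store's metadata filter, so both halves of the query constrain the result set.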
Recipe 109
This recipe demonstrates how to implement a metadata-filtered self-query chain:
- Prepare sample documents with metadata for filtering.
- Create local embeddings using SentenceTransformers.
- Load the documents with metadata into a FAISS vector store.
- Prepare example queries with their expected categories.
- Execute the queries and print the results with metadata filtering.
Install the required dependencies:
pip install langchain langchain-community faiss-cpu sentence-transformers
metadata_filtered_self_query_chain.py
Refer to the following code:
# metadata_filtered_self_query_chain.py
# Example of metadata-filtered retrieval using FAISS and local embeddings
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Sample documents with metadata for filtering
docs = [
    Document(page_content="Alice wrote about intermittent fasting and health.", metadata={"category": "health"}),
    Document(page_content="The 2008 financial crisis impacted global markets.", metadata={"category": "finance"}),
    Document(page_content="The French Revolution began in 1789.", metadata={"category": "history"})
]

# 2. Create local embeddings using SentenceTransformers
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # Fully local

# 3. Create FAISS vector store from documents with metadata
vectorstore = FAISS.from_documents(docs, embedding)

# Filter function by metadata
def filter_by_category(category):
    return vectorstore.similarity_search(
        query="",
        filter={"category": category},
        k=3
    )

# 4. Example queries with expected categories
queries = [
    ("Tell me about health", "health"),
    ("Financial events after 2005", "finance"),
    ("History events before 1900", "history")
]

# 5. Execute queries and print results with metadata filtering
for q_text, category in queries:
    results = filter_by_category(category)
    print(f"Query: {q_text}")
    for res in results:
        print(f"Answer: {res.page_content} (Source category: {res.metadata['category']})")
    print("\n" + "-"*50 + "\n")
Output:
Query: Tell me about health
Answer: Alice wrote about intermittent fasting and health. (Source category: health)
--------------------------------------------------
Query: Financial events after 2005
Answer: The 2008 financial crisis impacted global markets. (Source category: finance)
--------------------------------------------------
Query: History events before 1900
Answer: The French Revolution began in 1789. (Source category: history)
--------------------------------------------------
Rerank chain
A rerank chain improves retrieval quality by adding a second-stage ranking step after the initial retrieval. Rather than relying on vector similarity alone, it uses a more precise scoring model, typically a cross-encoder or reranker, to reorder the candidate documents so that their true relevance to the query is reflected more accurately. This ensures that the most useful, most contextually appropriate documents are ranked first, producing more accurate and trustworthy answers. Rerank chains are especially valuable when the document collection is large and initial recall is high but precision requirements are also demanding.
Recipe 110
This recipe demonstrates how to implement a rerank chain:
- Load or create a FAISS index built with a local embeddings model.
- If the FAISS index already exists, load it; otherwise, build it from a small sample corpus.
- Create a retriever from the FAISS index.
- Initialize a CrossEncoder for reranking.
- Execute the rerank chain with an example query and print the results.
Install the required dependencies:
pip install langchain sentence-transformers faiss-cpu
rerank_chain.py
Refer to the following code:
# rerank_chain.py
# Example of a rerank chain using FAISS and CrossEncoder for reranking
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from sentence_transformers import CrossEncoder
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")
# 1. Load or create FAISS index with local embeddings model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 2. If FAISS index already exists, load it
# Otherwise, build from a small sample corpus
try:
    vectorstore = FAISS.load_local(
        "faiss_index", embedding_model, allow_dangerous_deserialization=True
    )
except Exception:
    texts = [
        "Intermittent fasting improves insulin sensitivity.",
        "It may help with weight loss by reducing calorie intake.",
        "Fasting can reduce inflammation and support cellular repair.",
        "Drinking water during fasting helps with hydration.",
        "Exercise combined with intermittent fasting can boost fat loss.",
    ]
    vectorstore = FAISS.from_texts(texts, embedding_model)
    vectorstore.save_local("faiss_index")
# 3. Create retriever from FAISS index
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 4. Initialize CrossEncoder for reranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Rerank function using CrossEncoder
def rerank_results(query, docs):
    pairs = [(query, d.page_content) for d in docs]
    scores = reranker.predict(pairs)
    for i, doc in enumerate(docs):
        doc.metadata["score"] = float(scores[i])
    reranked = sorted(docs, key=lambda d: d.metadata["score"], reverse=True)
    return reranked
# 5. Execute the rerank chain with an example query and print results
def run_rerank_chain(query):
    docs = retriever.invoke(query)
    reranked_docs = rerank_results(query, docs)
    print("\n=== Top Reranked Documents ===")
    for d in reranked_docs:
        print(f"Score: {d.metadata['score']:.4f} | {d.page_content}")
    return reranked_docs
# Example query
run_rerank_chain("What are the benefits of intermittent fasting?")
Output:
=== Top Reranked Documents ===
Score: 8.5342 | Intermittent fasting improves insulin sensitivity.
Score: 6.0997 | Exercise combined with intermittent fasting can boost fat loss.
Score: 3.8877 | Fasting can reduce inflammation and support cellular repair.
Score: 1.2349 | Drinking water during fasting helps with hydration.
Score: -6.5435 | It may help with weight loss by reducing calorie intake.
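Note that the raw CrossEncoder scores above are unbounded logits (one is even negative), which makes them awkward to compare across queries or to use with a fixed cutoff. A common trick, shown here as a small standalone sketch rather than part of the recipe, is to squash the logits through a sigmoid so that every score lands in (0, 1):

```python
import math

def sigmoid(x: float) -> float:
    """Map an unbounded cross-encoder logit into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))

# Logits taken from the sample output above
raw_scores = [8.5342, 6.0997, 3.8877, 1.2349, -6.5435]
probs = [sigmoid(s) for s in raw_scores]

# The sigmoid is monotonic, so the ranking is unchanged, but a cutoff
# such as prob >= 0.5 now gives an interpretable relevance threshold
for s, p in zip(raw_scores, probs):
    print(f"{s:8.4f} -> {p:.4f}")
```

With normalized scores, documents below a chosen threshold can be dropped before they ever reach the prompt, which keeps the context window focused on genuinely relevant passages.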
Step-back Chain
A step-back chain is a retrieval-and-reasoning technique that improves answer accuracy by decomposing a question into simpler intermediate steps. Instead of querying the knowledge base or the model with the original question directly, it first "steps back" and reformulates the question into broader, more fundamental sub-questions or clarifying questions. Each of these is then handled by the retrieval mechanism or a reasoning module, and the intermediate results are aggregated into a precise, context-aware final answer. This approach is particularly effective for complex queries, where answering directly is more error-prone: the step-back mechanism progressively refines the understanding of the question and synthesizes supporting context from multiple sources, making the result more reliable and traceable.
Recipe 111
This recipe demonstrates how to implement a step-back chain:
- Create a sample query.
- Create a text generation pipeline with a small model for demonstration purposes.
- Create a vector store from the documents using Hugging Face embeddings.
- Build a simple dataset for demonstration purposes.
- Create a prompt for generating the step-back question.
- Create a prompt for question answering.
- Use the step-back chain to reformulate the original query.
- Run retrieval with the reformulated step-back question.
- Print the original question, the step-back question, the retrieved context, and the final answer.
Install the required dependencies:
pip install langchain langchain-community transformers sentence-transformers faiss-cpu
step_back_chain.py
Refer to the following code:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document
from transformers import pipeline
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")
# 2. Create a text generation pipeline using a small model for
# demonstration purposes
generator = pipeline("text2text-generation", model="google/flan-t5-base", max_length=256)
llm = HuggingFacePipeline(pipeline=generator)
# 3. Create HuggingFace embeddings; the vector store itself is built
# from the documents in the next step
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# 4. Create a simple dataset for demonstration purposes
docs = [
Document(page_content="The Eiffel Tower is located in Paris, France."),
Document(page_content="The Great Wall of China is a historic fortification."),
Document(page_content="The Colosseum is an ancient amphitheater in Rome."),
]
vectorstore = FAISS.from_documents(docs, embedding_model)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
# 5. Step-Back question generator prompt
step_back_prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "You are a reasoning assistant. Reformulate the user's question into a broader, "
        "more general step-back question that captures its essence.\n\n"
        "Examples:\n"
        "Q: Who discovered penicillin?\n"
        "Step-back: What are important discoveries in the history of medicine?\n\n"
        "Q: When was the iPhone first released?\n"
        "Step-back: What are major product launch events in technology history?\n\n"
        "Q: {question}\n"
        "Step-back:"
    ),
)
step_back_chain = step_back_prompt | llm
# 6. Create a QA prompt
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following context to answer the user's question.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n\n"
        "Answer:"
    ),
)
qa_chain = qa_prompt | llm
# --- Orchestration function ---
def step_back_rag(question: str):
    # 7. Reformulate the original query
    step_back_question = step_back_chain.invoke({"question": question}).strip()
    # 8. Retrieve documents using the step-back question
    retrieved_docs = retriever.invoke(step_back_question)
    context = "\n".join([doc.page_content for doc in retrieved_docs])
    # Generate the final answer
    final_answer = qa_chain.invoke({"context": context, "question": question}).strip()
    return {
        "original_question": question,
        "step_back_question": step_back_question,
        "retrieved_context": context,
        "final_answer": final_answer,
    }
# --- Run example ---
if __name__ == "__main__":
    query = "Where is the Eiffel Tower located?"
    result = step_back_rag(query)
    # 9. Print the original question, step-back question, context, and answer
    print("\n=== Step-Back Chain Execution ===")
    print("Original Question:", result["original_question"])
    print("Step-Back Reformulation:", result["step_back_question"])
    print("Retrieved Context:", result["retrieved_context"])
    print("Final Answer:", result["final_answer"])
Output:
=== Step-Back Chain Execution ===
Original Question: Where is the Eiffel Tower located?
Step-Back Reformulation: What is the location of the Eiffel Tower?
Retrieved Context: The Eiffel Tower is located in Paris, France.
The Great Wall of China is a historic fortification.
Final Answer: Paris, France
Conclusion
In this chapter, we explored how chains bring structure, modularity, and reliability to RAG pipelines. By implementing a variety of chain-based recipes, we learned how to control the retrieval process, strengthen factual grounding, and improve overall response quality. Chains act as building blocks that elevate RAG from a simple query-and-answer system into a more flexible application that can reason and adapt to user needs. However, while static chains provide consistency, they cannot adjust their retrieval strategy dynamically based on context or task demands.
This sets the stage for the next chapter, where we move beyond static orchestration and introduce agent-like behavior, enabling RAG systems to adapt their retrieval and reasoning flows at runtime and unlocking greater flexibility and intelligence.