基于 Python 的 RAG 开发手册——RAG 系统中基于大语言模型的响应生成引言在 RAG 系统中，检索的真正价

引言

在 RAG 系统中，检索的真正价值会在响应生成阶段得到体现：在这一阶段，LLM 会将检索到的知识转化为有意义的答案。这个阶段至关重要，因为它直接塑造了用户对整个系统的感知。无论检索本身多么有效，最终返回的回答仍然必须清晰、准确且值得信赖。高质量的响应生成不只是“语言流畅”这么简单；它还要求系统能够有意识地整合上下文、应用推理能力，并精心组织呈现方式，以确保答案的清晰性与可靠性。

安全的响应生成需要在系统架构中设置一个专门的层，并且与检索、排序和事实核查相互独立，以便强制落实隐私、安全以及品牌一致性要求。

结构

本章将涵盖以下主题：

软件要求
直接回答式响应生成
结构化响应生成
链式思维引导的响应生成
带引用的响应生成
结合多向量与多提示词的混合响应生成
批判性验证响应生成
带置信度评分的响应生成
查询分解式响应
上下文增强型响应生成
渐进式披露响应生成

学习目标

到本章结束时，读者将能够理解 LLM 如何在 RAG 系统中生成连贯、上下文相关且有事实依据的响应。本章重点强调 prompt engineering 的作用、如何让回答锚定在已检索知识之上、如何处理歧义，以及如何缓解幻觉。通过本章学习，读者将掌握一系列技术与最佳实践，用于让 LLM 输出与检索证据保持一致，从而在实际应用中实现可靠、以用户为中心的响应生成。

软件要求

本书中的每个概念后面都会配有相应的 recipe，也就是用 Python 编写的可运行代码。你会在所有 recipe 中看到代码注释，这些注释将逐行解释每一行代码的作用。

运行这些 recipe 需要以下软件环境：

系统配置：至少 16.0 GB 内存的系统
Python：Python 3.13.3 或更高版本
LangChain：1.0.5

要运行程序，请执行 Python 命令 pip install <packages name> 安装 recipe 中提到的依赖包。安装完成后，在你的开发环境中运行 recipe 中提到的 Python 脚本（.py 文件）即可。

图 7.1 展示了响应生成：

图 7.1：响应生成

直接回答式响应生成

在这种方法中，语言模型会接收检索到的文档以及用户查询，然后直接生成一个简洁答案。与多步骤方法或结构化方法不同，它不会额外加入推理层或格式化层。它的目标是提供清晰、即时的回应。

这种方法非常适合处理直接的事实型问题，例如定义、日期或名称之类的问题，因为此时检索到的证据通常已经足够，而且几乎不需要额外解释。不过，它不太擅长处理复杂问题或多部分问题，因为它缺乏更深层的推理与拆解机制。如果文档很长，也不推荐使用这种方法；在生产系统中，这种方式通常也并不是最佳选择。

Recipe 72

本 recipe 演示如何实现直接回答式响应生成：

加载文档。请确保与脚本位于同一目录下有一个名为 RAG.txt 的文件。

切分文档；你可以自定义 chunk size 和 chunk overlap。

使用本地嵌入模型创建向量存储。

使用本地 text2text 模型初始化 LLM。

使用 stuff chain type 创建一个 RetrievalQA 链来做直接回答生成。stuff 方法在文档较短时效果较好；否则，存在超出模型 token 限制的风险。

创建一个查询并获得答案。

安装所需依赖：

pip install langchain langchain-community faiss-cpu sentence-transformers transformers

direct_answer_generation.py

请参考以下代码：

# direct_answer_generation.py
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline
from langchain.chains import RetrievalQA

# 1. Load the document
# make sure you have a file named 'RAG.txt' in the same directory
loader = TextLoader("RAG.txt")   # put your docs here
docs = loader.load()

# 2. Split documents
# you can customize chunk_size and chunk_overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(docs)

# 3. Create vector store
# using a local embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 4. Initialize LLM
# using a local text2text model
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",    # runs fully local
    tokenizer="google/flan-t5-base",
    max_length=256
)

# Wrap the pipeline in a LangChain LLM
llm = HuggingFacePipeline(pipeline=generator)

# 5. Create RetrievalQA chain
# using "stuff" chain type for direct answer generation
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"  # directly stuffs docs into prompt for answer
)

# 6. Create a query to get an answer
query = "What is Retrieval-Augmented Generation?"
answer = qa.run(query)

# 7. Print the answer
print("Question:", query)
print("Answer:", answer)

打印答案。

输出：

Question: What is Retrieval-Augmented Generation?
Answer: an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system

结构化响应生成

RAG 系统中的结构化响应生成，强调的是以组织良好的格式来输出答案，而不是自由文本。与单段叙述不同，模型会按照预定义结构输出信息，例如项目符号列表、编号列表、表格，甚至 JSON 对象。这样能够提高一致性，使回答更易于浏览，也更方便集成到外部系统中。

虽然通过 prompt 可以引导模型生成 JSON 之类的结构化格式，但这并不能保证输出在语法上一定有效。如果没有 JSON mode、function calling 或解析器级校验机制，模型很可能只会生成“看起来像 JSON”的文本，而这些内容在下游系统中并不能被真正解析。在生产环境中，应优先使用受约束的结构化输出接口，而不是只依赖 prompt。

Recipe 73

本 recipe 演示如何在 RAG 系统中实现结构化响应生成：

加载文档。请确保与脚本位于同一目录下有一个名为 chapter7_RAG.txt 的文本文件。你也可以替换为自己的文档。

将文档切分为多个 chunk。你可以根据需要调整 chunk 大小和 overlap。

使用较小的嵌入模型创建 embeddings 和向量存储。

初始化 LLM。

定义一个获取结构化响应的函数。一个好的 RAG 系统在响应生成时必须使用对话历史（chat history） ，以便 LLM 能够正确理解追问问题。

定义程序要处理的查询。

安装所需依赖：

pip install langchain langchain-community faiss-cpu sentence-transformers transformers

structured_response_generation.py

请参考以下代码：

from typing import Dict, Any
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline
from langchain_core.prompts import PromptTemplate
import warnings, json

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# -------------------------------
# 1. Load documents
# -------------------------------
loader = TextLoader("chapter7_RAG.txt")
docs = loader.load()

# -------------------------------
# 2. Split documents
# -------------------------------
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(docs)

# -------------------------------
# 3. Vector store
# -------------------------------
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# -------------------------------
# 4. LLM
# -------------------------------
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=256,
    device=-1  # CPU
)
llm = HuggingFacePipeline(pipeline=generator)

# -------------------------------
# 5. Prompt template
# -------------------------------
prompt = PromptTemplate.from_template(
    "Use the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\nAnswer:"
)

# -------------------------------
# 6. RAG function
# -------------------------------
def structured_answer(query: str) -> Dict[str, Any]:
    # Retrieve top 3 relevant documents
    docs = vectorstore.similarity_search(query, k=3)

    # Combine content for context
    context = "\n\n".join([d.page_content for d in docs])

    # Format prompt
    prompt_text = prompt.format(context=context, question=query)

    # Generate answer
    answer_text = llm.invoke(prompt_text)

    # Gather sources
    sources = list({doc.metadata.get("source", "unknown") for doc in docs})

    # Estimate confidence
    confidence = round(min(1.0, len(answer_text) / 300), 2)

    return {
        "question": query,
        "answer": answer_text,
        "sources": sources,
        "confidence": confidence
    }

# -------------------------------
# 7. Run query
# -------------------------------
query = "What is Retrieval-Augmented Generation?"
response = structured_answer(query)

# -------------------------------
# 8. Print structured response
# -------------------------------
print(json.dumps(response, indent=2))

以 JSON 形式打印结构化响应。

输出：

{
  "question": "What is Retrieval-Augmented Generation?",
  "answer": "an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.",
  "sources": [
    "chapter7_RAG.txt"
  ],
  "confidence": 0.79
}

链式思维引导的响应生成

链式思维引导的响应生成，会把推理过程引入 RAG 工作流。模型不会从检索结果直接跳到最终答案，而是先给出一系列中间推理步骤，展示检索到的证据是如何一步步导向结论的。这会让回答更透明、更可解释，使用户能够理解为什么系统会得出某个答案。这种方法尤其适用于需要推理、综合分析或问题求解的任务。

Recipe 74

本 recipe 演示如何实现链式思维引导的响应生成：

加载待处理文档。

将文档切分为 500 个字符大小的 chunk，并设置 50 个字符的 overlap。

使用 HuggingFaceEmbeddings 和 FAISS 从文档中创建 embeddings 与向量存储。

使用本地模型通过纯采样方式生成候选答案。

定义一个函数，用来计算两个向量 a 和 b 之间的余弦相似度。

定义函数 chain_of_thought_response(query: str)，它接收一个查询字符串，并返回一个结构化结果，其中包括最佳答案、得分、候选数、所有候选及其得分，以及检索文档的来源。

准备一个查询，使用链式思维提示生成答案。

安装所需依赖：

pip install langchain langchain-community faiss-cpu sentence-transformers transformers numpy

chain_of_thought_guided_response.py

请参考以下代码：

# chain_of_thought_guided_response.py
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers.pipelines import pipeline
import numpy as np
import json

# 1. Load the document
loader = TextLoader("chapter7_RAG.txt")   # replace with your docs
docs = loader.load()

# 2. Split the documents into chunks of 500 characters with 50 character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(docs)

# 3. Create embeddings and vector store from documents using HuggingFaceEmbeddings and FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 4. Generate candidate answers using local model with pure sampling
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=256,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=5,
    num_beams=1
)

# 5. Define function to compute cosine similarity between two vectors a and b
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 6. Define function chain_of_thought_response
def chain_of_thought_response(query: str):
    # retrieve context
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    retrieved_docs = retriever.get_relevant_documents(query)
    context = "\n".join([doc.page_content for doc in retrieved_docs])

    # reasoning prompt
    prompt = f"""Question: {query}
Relevant Context:
{context}
Think step by step and provide a clear, factual answer.
"""

    # generate 5 candidate answers
    candidates = generator(prompt)

    # embed query+context for scoring
    reference_text = query + " " + context
    reference_vec = embeddings.embed_query(reference_text)

    # score each candidate with cosine similarity
    scored = []
    for cand in candidates:
        cand_text = cand["generated_text"]
        cand_vec = embeddings.embed_query(cand_text)
        score = cosine_similarity(reference_vec, cand_vec)
        scored.append((cand_text, score))

    # pick the best candidate
    best_answer, best_score = max(scored, key=lambda x: x[1])

    # return structured result
    return {
        "question": query,
        "answer": best_answer,
        "score": round(float(best_score), 3),
        "candidates_considered": len(candidates),
        "candidates": [
            {"text": text, "score": round(float(score), 3)}
            for text, score in scored
        ],
        "sources": list({doc.metadata.get("source", "unknown") for doc in retrieved_docs})
    }

# 7. Query which will use chain of thought prompting to generate an answer
query = "Explain Retrieval-Augmented Generation in simple terms."
response = chain_of_thought_response(query)

# 8. print the structured response in JSON format
print(json.dumps(response, indent=2))

以 JSON 格式打印结构化响应。

输出：

{
  "question": "Explain Retrieval-Augmented Generation in simple terms.",
  "answer": "Retrieval Augmented Generation combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system. So the answer is RAG.",
  "score": 0.906,
  "candidates_considered": 5,
  "candidates": [
    {
      "text": "Retrieval Augmented Generation combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system. So the answer is RAG.",
      "score": 0.906
    },
    {
      "text": "LLMs have been used by RAG systems to improve the factual accuracy, contextual relevance, and quality of a response against a query. RAG is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system. So the answer is RAG.",
      "score": 0.811
    },
    {
      "text": "RAG is an architecture that combines the ability of large language models with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system. So the final answer is RAG.",
      "score": 0.776
    },
    {
      "text": "Retrieval Augmented Generation is architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system. So the answer is Retrieval Augmented Generation.",
      "score": 0.865
    },
    {
      "text": "RAG is an architecture that combines the ability of large language models (LLMs) with a retrieval system. So the final answer is RAG.",
      "score": 0.773
    }
  ],
  "sources": [
    "chapter7_RAG.txt"
  ]
}

带引用的响应生成

带引用的响应生成通过为答案附加支持该结论的检索来源，来增强 RAG 输出。系统不只是给出最终回答，还会明确展示这些信息来自哪个文档、哪段内容，或者哪个 URL。这会显著提升信任度、可验证性与透明度，尤其适用于医疗、法律、学术等对准确性要求极高的领域。通过让回答锚定到可识别来源，带引用的响应能帮助用户判断其可信度，并降低误导性信息的风险。

Recipe 75

本 recipe 演示如何实现带引用的响应生成：

加载待处理文档。

将文档切分为 500 个字符大小的 chunk，并设置 50 个字符的 overlap。

使用 HuggingFaceEmbeddings 和 FAISS 从文档中创建 embeddings 与向量存储。

使用本地模型创建文本生成 pipeline。

定义函数 cited_response(query: str)，它接收一个查询字符串，并返回一个结构化结果，其中包含答案以及检索文档的来源。

准备一个查询，使用带引用响应生成来获得答案。

通过 cited_response 函数生成响应。

安装所需依赖：

pip install langchain langchain-community faiss-cpu sentence-transformers transformers

cited_response_generation.py

请参考以下代码：

# cited_response_generation.py
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers.pipelines import pipeline
import json

# 1. Load the document
loader = TextLoader("chapter7_RAG.txt")   # replace with your docs
docs = loader.load()

# 2. Split the documents into chunks of 500 characters with 50 character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(docs)

# 3. Create embeddings and vector store from documents using HuggingFaceEmbeddings and FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

# 4. Create a text generation pipeline using a local model
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=256
)

# 5. Define function cited_response
def cited_response(query: str):
    # retrieve top documents
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    retrieved_docs = retriever.get_relevant_documents(query)

    # build context with numbered references
    numbered_context = ""
    for i, doc in enumerate(retrieved_docs, 1):
        source = doc.metadata.get("source", f"doc{i}")
        numbered_context += f"[{i}] ({source}): {doc.page_content}\n"

    prompt = f"""Question: {query}
Use the following references to answer, and cite them like [1], [2].
References:
{numbered_context}
Answer with citations:
"""

    # generate answer
    answer = generator(prompt, max_length=256)[0]["generated_text"]

    # return structured result
    return {
        "question": query,
        "answer": answer,
        "sources": [
            {
                "id": i+1,
                "source": doc.metadata.get("source", "unknown"),
                "content": doc.page_content[:200] + "..."
            }
            for i, doc in enumerate(retrieved_docs)
        ]
    }

# 6. Query which will use cited response generation to generate an answer
query = "What is Retrieval-Augmented Generation?"

# 7. Generate the response using cited_response function
response = cited_response(query)

# 8. Print the structured response in JSON format
print(json.dumps(response, indent=2))

以 JSON 格式打印结构化响应。

输出：

{
  "question": "What is Retrieval-Augmented Generation?",
  "answer": "architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system",
  "sources": [
    {
      "id": 1,
      "source": "chapter7_RAG.txt",
      "content": "Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and qua..."
    }
  ]
}

结合多向量与多提示词的混合响应生成

混合响应生成会结合多种检索策略与多种提示词策略，以生成更丰富、更可靠的输出。在这种方法中，系统会从多种向量表示中检索信息，然后基于不同 prompt 生成多个候选响应。最终输出可以是从这些候选中选出的最佳答案，也可以是对这些候选进行综合后的结果，从而获得更广的覆盖面和更高的准确性。

这种方法特别适合处理复杂、含糊，或跨越多个知识维度的查询。通过结合多样化的检索信号与 prompt 设计，它可以减少遗漏关键信息的风险，并在精确性与完整性之间取得平衡。不过，这种方式也会消耗更多计算资源，并且需要谨慎编排，以避免内容冗余。

在生成最终答案之前，通常还会使用 cross-encoder 对候选进行重新打分与重排序。这是高质量 RAG 系统中的标准步骤，因为它通常能带来更准确的结果。

Recipe 76

本 recipe 演示如何在 RAG 中基于多向量与多提示词实现响应生成：

加载待处理文档。

将文档切分为 400 个字符大小的 chunk，并设置 50 个字符的 overlap。

使用 HuggingFaceEmbeddings 和 FAISS 从文档中创建 embeddings 与向量存储。

使用 BM25 创建一个稀疏检索器。

使用本地模型创建生成 pipeline。该模型会根据检索到的上下文生成响应。

创建一个 hybrid_retriever 方法，将稠密检索与稀疏检索结合起来。

创建一个 hybrid_response 函数，利用多个 prompt 生成响应。

创建一个查询，并使用 hybrid_response 函数生成混合响应。

安装所需依赖：

pip install langchain langchain-community faiss-cpu sentence-transformers rank-bm25 transformers

hybrid_response_generation.py

请参考以下代码：

# hybrid_response_generation.py
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
from transformers.pipelines import pipeline
import json

# 1. Load the document
loader = TextLoader("chapter7_RAG.txt")   # replace with your docs
docs = loader.load()

# 2. Split the documents into chunks of 400 characters with 50 character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
documents = splitter.split_documents(docs)

# 3. Create embeddings and vector store from documents using HuggingFaceEmbeddings and FAISS
dense_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
dense_store = FAISS.from_documents(documents, dense_embeddings)

# 4. Create sparse retriever using BM25
sparse_retriever = BM25Retriever.from_documents(documents)

# 5. Generate pipeline using a local model
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    tokenizer="google/flan-t5-base",
    max_length=256
)

# 6. Create a hybrid retriever method that combines dense and sparse retrieval
def hybrid_retrieve(query: str, k=3):
    dense_docs = dense_store.as_retriever(search_kwargs={"k": k}).get_relevant_documents(query)
    sparse_docs = sparse_retriever.get_relevant_documents(query)

    # merge unique docs by content
    seen, merged = set(), []
    for doc in dense_docs + sparse_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)
    return merged[:k]

# 7. Create hybrid response function to generate response using multiple prompts
def hybrid_response(query: str):
    retrieved_docs = hybrid_retrieve(query, k=3)
    context = "\n".join([doc.page_content for doc in retrieved_docs])

    prompts = [
        f"Question: {query}\nContext:\n{context}\n\nAnswer briefly and clearly:",
        f"Question: {query}\nContext:\n{context}\n\nGive a detailed explanation with reasoning:",
        f"Question: {query}\nContext:\n{context}\n\nExplain in simple terms, as if teaching a beginner:"
    ]

    responses = []
    for p in prompts:
        out = generator(p)[0]["generated_text"]
        responses.append(out)

    # merge responses
    final_answer = "\n\n---\n\n".join(responses)

    return {
        "question": query,
        "final_answer": final_answer,
        "retrieved_sources": [doc.metadata.get("source", "unknown") for doc in retrieved_docs],
        "num_prompts": len(prompts),
        "num_sources": len(retrieved_docs)
    }

# 8. Create a query and generate the hybrid response using hybrid_response function
query = "Explain Retrieval-Augmented Generation (RAG)."
response = hybrid_response(query)

# 9. Print the structured response in JSON format
print(json.dumps(response, indent=2))

以 JSON 格式打印结构化响应。

输出：

{
  "question": "Explain Retrieval-Augmented Generation (RAG).",
  "final_answer": "RAG is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.\n\n---\n\nRetrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.\n\n---\n\nRAG is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.",
  "retrieved_sources": [
    "chapter7_RAG.txt"
  ],
  "num_prompts": 3,
  "num_sources": 1
}

批判性验证响应生成

批判性验证响应生成会在 RAG 工作流中增加一个额外的验证步骤。在模型生成答案之后，系统会把该答案与检索到的证据进行交叉核对，以确保其在事实层面是一致且对齐的。这个过程能通过过滤或修订那些无法被验证的输出，来降低幻觉或无证据支持断言的概率。

这种方法在医学、法律、金融等对事实准确性要求极高的领域尤其有用。虽然它能显著提升可靠性和用户信任，但也会带来更高延迟和计算开销，因为答案在返回之前必须先被验证。

Recipe 77

本 recipe 演示如何编写一个在响应生成前执行批判性验证的程序：

加载待处理文档。

将文档切分为 500 个字符大小的 chunk，并设置 50 个字符的 overlap。

使用 HuggingFaceEmbeddings 和 FAISS 从文档中创建 embeddings 与向量存储。

创建一个查询并执行批判性验证。

安装所需依赖：

pip install langchain langchain-community faiss-cpu sentence-transformers transformers torch

critical_verification_response.py

请参考以下代码：

# critical_verification_response.py
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers.pipelines import pipeline
import json

# 1. Load the documents
loader = TextLoader("chapter7_RAG.txt")   # replace with your docs
docs = loader.load()

# 2. Split the document into chunks with chunk size as 500 and overlap as 50
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(docs)

# 3. Create embeddings and vector store from documents using HuggingFaceEmbeddings and FAISS.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)
nli_model = pipeline("text-classification", model="facebook/bart-large-mnli")

# Claim Verification Function
def verify_claim(claim, retrieved_docs):
    scores = []
    for doc in retrieved_docs:
        text = doc.page_content[:500]  # check snippet
        result = nli_model(f"{claim} </s> {text}", truncation=True)[0]
        scores.append(result)
    best = max(scores, key=lambda x: x["score"])
    return best["label"], float(f"{best['score']:.2f}")

def critical_verification_response(query: str, answer: str):
    # retrieve evidence
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    retrieved_docs = retriever.get_relevant_documents(query)

    claims = [c.strip() for c in answer.split(".") if c.strip()]
    results = []
    for claim in claims:
        label, score = verify_claim(claim, retrieved_docs)
        results.append({
            "claim": claim,
            "verdict": label,
            "confidence": score
        })

    support_ratio = sum(1 for r in results if r["verdict"] == "ENTAILMENT") / len(results)
    contradiction_ratio = sum(1 for r in results if r["verdict"] == "CONTRADICTION") / len(results)

    return {
        "question": query,
        "answer": answer,
        "verification": results,
        "support_ratio": float(f"{support_ratio:.2f}"),
        "contradiction_ratio": float(f"{contradiction_ratio:.2f}")
    }

# 4. Create query
query = "What is Retrieval-Augmented Generation?"
answer = "Retrieval-Augmented Generation is a method that combines document retrieval with text generation. It improves factual accuracy."

# 5. Perform critical verification
response = critical_verification_response(query, answer)

# 6. Print the response after performing critical verification
print(json.dumps(response, indent=2))

打印经过批判性验证后的响应。

输出：

{
  "question": "What is Retrieval-Augmented Generation?",
  "answer": "Retrieval-Augmented Generation is a method that combines document retrieval with text generation. It improves factual accuracy.",
  "verification": [
    {
      "claim": "Retrieval-Augmented Generation is a method that combines document retrieval with text generation",
      "verdict": "neutral",
      "confidence": 0.48
    },
    {
      "claim": "It improves factual accuracy",
      "verdict": "neutral",
      "confidence": 0.85
    }
  ],
  "support_ratio": 0.0,
  "contradiction_ratio": 0.0
}

带置信度评分的响应生成

带置信度评分的响应生成通过给每个答案附加一个置信度水平，来增强 RAG 输出。系统不只是给出回答，还会表达它对该结果“有多确信”。这种置信度通常来自检索相关性、模型确定性，或多个模型/策略之间的一致性。这样的透明层可以帮助用户判断答案的可靠程度，并决定是否需要进一步核验。

置信度评分尤其适用于决策支持系统、研究场景以及企业应用，因为这些场景中信任与可解释性都很重要。不过，如果打分机制本身不够稳健，那么对置信度评分的过度依赖也可能会误导用户。

Recipe 78

本 recipe 演示如何编写一个带置信度评分的响应生成程序：

将原始文本文件作为 LangChain 文档加载。请将 chapter7_RAG.txt 替换为你的文件路径。

将文档切分为更小、带重叠的 chunk，以便更好地检索。每个 chunk 最多 300 个字符，且带 50 个字符的 overlap。

初始化 sentence-transformer 嵌入模型。

基于文档 chunk 及其 embeddings 构建一个 FAISS 向量存储。

使用 Flan-T5 初始化文本生成 pipeline。

定义置信度评分函数。

对向量存储发起带置信度评分的查询。

使用 FAISS 检索器获取 top-k 相关 chunk（这里是 k=2）。

把检索到文档的文本内容合并成一个上下文字符串。

提取原始 FAISS 分数（距离）。

将相似度分数转换为归一化的置信度得分（0–1）。

利用上下文和用户查询构建输入 prompt。

使用 Flan-T5 模型生成响应。

安装所需依赖：

pip install langchain-community faiss-cpu sentence-transformers transformers numpy

confidence_scored_response.py

请参考以下代码：

# confidence_scored_response.py
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers.pipelines import pipeline
import numpy as np

# 1. Load and split the document
loader = TextLoader("chapter7_RAG.txt")   # Replace with your document
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
splits = text_splitter.split_documents(docs)

# 2. Create embeddings and FAISS vectorstore
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(splits, embeddings)
generator = pipeline("text2text-generation", model="google/flan-t5-base")

# 3. Define confidence scoring function
def get_confidence(scores):
    """Normalize similarity scores to a 0–1 confidence scale."""
    if not scores:
        return 0.0

    # FAISS similarity is cosine distance → lower = closer
    # Convert to similarity by inverting
    sims = [1 - s for s in scores]
    sims = np.clip(sims, 0, 1)  # ensure valid range
    return float(np.mean(sims))

# 4. Query with confidence scoring
query = "Explain what Retrieval-Augmented Generation (RAG) is."
docs_with_scores = vectorstore.similarity_search_with_score(query, k=2)

# 5. Extract context and scores
context = "\n".join([doc.page_content for doc, _ in docs_with_scores])
scores = [score for _, score in docs_with_scores]
confidence = get_confidence(scores)

# 6. Generate response using context and query
prompt = f"Context:\n{context}\n\nQuestion: {query}"
response = generator(prompt, num_return_sequences=1)

# 7. Output response and confidence score
print("\n--- RESPONSE ---")
print(response[0]['generated_text'])
print("\n--- CONFIDENCE SCORE ---")
print(f"{confidence:.2f}")

打印最终答案及置信度得分。

输出：

--- RESPONSE ---
architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system

--- CONFIDENCE SCORE ---
0.68

查询分解式响应

查询分解式响应是一种 RAG 策略，它会把复杂或多层次的用户查询拆分成更小、更易处理的子查询。每个子查询都会独立地在知识库中进行检索，以获得更聚焦、更相关的信息，然后再将这些结果综合成一个连贯的最终回答。这种方法有助于系统处理多层问题、含糊请求，或者需要逐步推理的任务。通过对查询进行拆解，RAG 系统可以降低遗漏关键信息的风险、提高检索准确率，并产出更有结构、更富上下文的回答。这种方法尤其适用于研究、故障排查和多跳问答等领域，因为这些场景中的用户通常希望获得详细且逻辑连贯的输出。

Recipe 79

本 recipe 演示如何编写一个查询分解式响应程序：

创建一个内存中的知识库文档集合。在真实系统中，这里通常会被真正的文档存储或向量数据库（例如 FAISS）替代。

将每份原始文档切分为单独句子，以支持更细粒度的检索。

创建一个用于标准化和相似度计算的工具函数。

使用启发式规则（例如在连接词处分割）把复杂查询拆解为更简单的子查询。

为每个子查询执行检索，并找出最相关的片段。

为子查询生成答案：利用检索到的片段构建简明答案，并收集证据（引用）。

组合最终的结构化响应：将各子答案按顺序整合，并附上证据。

运行演示：

创建一个需要分解的复杂查询。

将其拆解为多个子查询。

为每个子查询生成答案。

组合并打印最终结构化响应。

安装所需依赖：

无需额外依赖

query_decomposition_response.py

请参考以下代码：

# query_decomposition_response.py
from collections import defaultdict
import re
import textwrap

# 1. Simple in-memory knowledge base of documents
DOCS = {
    "doc1": """
Retrieval-Augmented Generation (RAG) combines a retriever with a generator.
The retriever pulls relevant passages from a knowledge base.
The generator uses those passages to produce grounded answers and reduce hallucinations.
""",
    "doc2": """
Query decomposition breaks a complex question into simpler sub-questions.
It improves coverage, lets you retrieve per sub-question, and merge answers.
It is useful when a question asks for multiple facts or steps.
""",
    "doc3": """
A simple retriever can score passages by token overlap (e.g., Jaccard similarity).
More advanced retrievers use TF-IDF, BM25, or dense embeddings.
""",
    "doc4": """
To merge answers, order sub-answers logically and keep citations if available.
When unsure, say so and surface the evidence used to answer.
""",
}

# Splitting sentences
def split_sentences(text: str):
    # Simple sentence splitter; good enough for the demo
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s.strip() for s in parts if s.strip()]

KB = []
for doc_id, raw in DOCS.items():
    for sent in split_sentences(raw):
        KB.append({"doc": doc_id, "text": sent})

# Normalize text for token overlap
def normalize(text: str):
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

# Simple Jaccard similarity for token overlap
def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Decompose query into sub-queries
def decompose_query(query: str):
    """
    Super-simple heuristic decomposition:
    Split on connectors (and/also/then/;/.), question marks
    Keep non-empty, trimmed parts
    """
    chunks = re.split(r'\b(?:and then|and|also|then)\b|[?;]|(?<=.)', query, flags=re.I)
    subs = [c.strip(" .") for c in chunks if c and c.strip(" .")]
    return subs or [query.strip()]

# Retrieval method based on token overlap
def retrieve_snippets(subquery: str, k: int = 3):
    q_tokens = normalize(subquery)
    scored = []
    for item in KB:
        s_tokens = normalize(item["text"])
        score = jaccard(q_tokens, s_tokens)
        if score > 0:
            scored.append((score, item))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [it for _, it in scored[:k]]

# Answer a subquery using retrieved snippets
def answer_subquery(subquery: str):
    hits = retrieve_snippets(subquery, k=3)
    if not hits:
        return {
            "subquery": subquery,
            "answer": "No direct match found in the knowledge base.",
            "evidence": [],
        }

    # Stitch the top snippets as a concise answer
    unique = []
    seen = set()
    for h in hits:
        if h["text"] not in seen:
            unique.append(h)
            seen.add(h["text"])

    evidence = [f'{h["text"]} (from {h["doc"]})' for h in unique]
    answer_text = " ".join(h["text"] for h in unique)

    return {"subquery": subquery, "answer": answer_text, "evidence": evidence}

# Compose final structured response
def compose_response(query: str, subanswers):
    lines = [f"Original query: {query}", "", "Subquery:"]
    for i, a in enumerate(subanswers, 1):
        lines.append(f"{i}. {a['subquery']}")
    lines.append("")
    lines.append("Subquery Answers:")
    for i, a in enumerate(subanswers, 1):
        wrapped = textwrap.fill(a["answer"], width=88)
        lines.append(f"{i}) {wrapped}")
    lines.append("")
    lines.append("Evidence used:")
    for i, a in enumerate(subanswers, 1):
        for ev in a["evidence"]:
            lines.append(f"- [{i}] {ev}")
    return "\n".join(lines)

# Demo run
if __name__ == "__main__":
    # 1. Create a complex query that needs decomposition
    complex_query = (
        "What is RAG and why is query decomposition helpful, "
        "also mention a simple retrieval method and how to merge the answers?"
    )

    # 2. Decompose the query into sub-queries
    subqs = decompose_query(complex_query)

    # 3. Answer each subquery
    subanswers = [answer_subquery(sq) for sq in subqs]

    # 4. Compose and print the final structured response
    print(compose_response(complex_query, subanswers))

输出：

Original query: What is RAG and why is query decomposition helpful, also mention a simple retrieval method and how to merge the answers?

Subquery:
1. What is RAG
2. why is query decomposition helpful,
3. mention a simple retrieval method
4. how to merge the answers

Subquery Answers:
1) Retrieval-Augmented Generation (RAG) combines a retriever with a generator. It is useful
when a question asks for multiple facts or steps.

2) Query decomposition breaks a complex question into simpler sub-questions. It is useful
when a question asks for multiple facts or steps.

3) Retrieval-Augmented Generation (RAG) combines a retriever with a generator. A simple
retriever can score passages by token overlap (e.g., Jaccard similarity). The retriever
pulls relevant passages from a knowledge base.

4) To merge answers, order sub-answers logically and keep citations if available. The
generator uses those passages to produce grounded answers and reduce hallucinations.
When unsure, say so and surface the evidence used to answer.

Evidence used:
- [1] Retrieval-Augmented Generation (RAG) combines a retriever with a generator. (from doc1)
- [1] It is useful when a question asks for multiple facts or steps. (from doc2)
- [2] Query decomposition breaks a complex question into simpler sub-questions. (from doc2)
- [2] It is useful when a question asks for multiple facts or steps. (from doc2)
- [3] Retrieval-Augmented Generation (RAG) combines a retriever with a generator. (from doc1)
- [3] A simple retriever can score passages by token overlap (e.g., Jaccard similarity). (from doc3)
- [3] The retriever pulls relevant passages from a knowledge base. (from doc1)
- [4] To merge answers, order sub-answers logically and keep citations if available. (from doc4)
- [4] The generator uses those passages to produce grounded answers and reduce hallucinations. (from doc1)
- [4] When unsure, say so and surface the evidence used to answer. (from doc4)

上下文增强型响应生成

上下文增强型响应生成是一种 RAG 技术，它不仅仅是“检索后回答”，而是会把额外的上下文信号嵌入到响应中。这些信号可以包括元数据，例如文档来源、作者、时间戳、领域特定注释，也可以包括结构化信息，如表格、项目符号列表或高亮内容，用来帮助说明相关性。通过让响应携带这些上下文线索，系统不仅提升了事实锚定能力，也增强了透明度，使用户能够理解为什么会生成这个答案，以及它来自哪里。这种方法可以提高信任、降低歧义，并使回答更具可执行性，尤其适用于医疗、金融、法律研究等知识密集型领域，因为这些场景下不仅要求准确，还要求信息来源清晰可追溯。

Recipe 80

本 recipe 演示如何实现上下文增强型响应生成：

把文档切分为句子级 chunk。

加载一个轻量级嵌入模型。

将所有句子编码为向量嵌入。

构建一个 FAISS 索引，以支持快速最近邻搜索。

运行演示：

使用示例文档。

用示例文本初始化 responder。

安装所需依赖：

pip install sentence-transformers faiss-cpu numpy

context_enriched_response_generation.py

请参考以下代码：

# context_enriched_response_generation.py
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

class ContextEnrichedResponder:
    def __init__(self, text: str):
        # Split into chunks (sentences for simplicity)
        self.sentences = [s.strip() for s in text.split(".") if s.strip()]

        # Embedding model (local, no API key needed)
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

        # Encode sentences into embeddings
        self.embeddings = self.model.encode(self.sentences)

        # Build FAISS index for similarity search
        d = self.embeddings.shape[1]
        self.index = faiss.IndexFlatL2(d)
        self.index.add(np.array(self.embeddings).astype("float32"))

    def retrieve(self, query: str, k: int = 2):
        """Find top-k most relevant sentences"""
        q_emb = self.model.encode([query])
        D, I = self.index.search(np.array(q_emb).astype("float32"), k)
        return [self.sentences[i] for i in I[0]]

    def respond(self, query: str, k: int = 2) -> str:
        # Retrieve relevant context
        retrieved = self.retrieve(query, k=k)
        enriched_answer = " ".join(retrieved) if retrieved else "No relevant info found."
        return enriched_answer

if __name__ == "__main__":
    # Example document
    sample_text = """
Retrieval-Augmented Generation (RAG) is an AI technique that combines
information retrieval with text generation. It improves factual accuracy
by grounding answers in external knowledge sources. RAG reduces
hallucinations, which are false or made-up outputs generated by AI models.
This makes it especially useful for applications like chatbots, search
assistants, and knowledge-intensive question answering.
"""
    bot = ContextEnrichedResponder(sample_text)

    # Example query
    print("Q: What is RAG?")
    print("A:", bot.respond("What is RAG?", k=2))

打印示例查询及其响应。

输出：

Q: What is RAG?
A: RAG reduces hallucinations, which are false or made-up outputs generated by AI models Retrieval-Augmented Generation (RAG) is an AI technique that combines information retrieval with text generation

渐进式披露响应生成

渐进式披露响应生成是一种 RAG 方法，它不会一次性把答案全部抛给用户，而是按步骤、逐层展示。系统通常会先给出一个简明摘要或直接回答，然后在用户继续请求时，或者随着交互推进，再逐步展开更深层细节。这种方式不仅提高了可读性，也能适应不同用户偏好：只想快速得到答案的人可以在前面就停下，而希望获得完整洞见的人则可以继续深入阅读。通过把响应设计成渐进式结构，RAG 系统能够提升交互体验、鼓励用户主动探索，并确保用户在不过载的前提下，获得恰到好处的细节深度。在真正的生产系统中，必须使用 LangChain 的 CallbackHandler 来实现模型 token 的实时流式输出。

Recipe 81

本 recipe 演示如何实现渐进式披露响应生成：

准备示例文档文本。你可以替换为任何相关文档。

定义一个从文档中检索上下文的方法。

定义一个结合上下文获取模型响应的方法。

对响应内容进行渐进式披露。

为程序创建示例查询。

对每个查询执行渐进式披露，并打印响应。

安装所需依赖：

pip install transformers torch

progressive_disclosure_response.py

请参考以下代码：

# progressive_disclosure_response.py
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import re
import time

# 1. Load model and tokenizer (local, no API key needed)
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2. Sample document text
document_text = """
Retrieval-Augmented Generation (RAG) is an AI technique that combines information
retrieval with natural language generation to improve the accuracy of answers. In RAG,
when a question is asked, the system first searches for relevant documents or data
sources, then uses a language model to generate a response based on the retrieved
information. This method reduces hallucinations by grounding the model's output in real
data. RAG is widely used in chatbots, customer support, and knowledge management systems
to provide factual and context-aware responses.
"""

# 3. Define the method to retrieve context from the document
def retrieve_context(document: str, query: str, window: int = 200) -> str:
    query_words = query.lower().split()
    doc_lower = document.lower()
    for word in query_words:
        idx = doc_lower.find(word)
        if idx != -1:
            start = max(idx - window, 0)
            end = min(idx + window, len(document))
            return document[start:end]
    return document[:400]

# 4. Define the method to get model response with context
def get_model_response_with_context(query: str, context: str) -> str:
    prompt = f"Answer the question using the following context:\n{context}\n\nQuestion: {query}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=600,
        do_sample=True,
        top_p=0.95,
        temperature=0.7
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    if not response:
        response = "Sorry, I could not generate a response based on the document."
    return response

# 5. Method to progressively disclose response
def progressive_disclosure_auto(response: str, chunk_size: int = 2, delay: float = 2):
    """
    Automatically reveals response in chunks with a time delay.
    """
    sentences = re.split(r'(?<=[.!?]) +', response)
    total = len(sentences)
    start = 0
    while start < total:
        end = min(start + chunk_size, total)
        chunk = " ".join(sentences[start:end])
        print("\n" + chunk + "\n")
        start = end
        time.sleep(delay)  # pause before next chunk

# 6. Example queries
queries = [
    "Explain RAG in AI",
    "How does RAG reduce hallucinations?"
]

# 7. Run the progressive disclosure for each query and print the response
for query in queries:
    print(f"\nQuery: {query}\n")
    context = retrieve_context(document_text, query)
    full_response = get_model_response_with_context(query, context)
    print("Bot is revealing response automatically:\n")
    progressive_disclosure_auto(full_response, chunk_size=2, delay=2)
    print("\n--- End of response ---\n")

输出：

Query: Explain RAG in AI
Bot is revealing response automatically:

Retrieval-Augmented Generation is an AI technique that combines information retrieval with natural language generation

--- End of response ---

Query: How does RAG reduce hallucinations?
Bot is revealing response automatically:

natural language generation

--- End of response ---

结论

在 RAG 系统中，基于 LLM 的响应生成并不仅仅是为了输出流畅答案，更关键的是要确保这些答案准确、具备上下文感知，并且值得信赖。诸如置信度评分、查询分解、上下文增强和渐进式披露等技术，有助于弥合“原始检索结果”与“真正面向用户的有意义回答”之间的鸿沟，使 RAG 系统变得更可靠、更易用。不过，这些响应的质量，很大程度上仍然取决于模型输入是否被有效引导。

Agentic RAG 则进一步让 LLM 参与到检索控制中。在生成初步答案后，模型会评估当前检索到的文档是否提供了足够且相关的证据；如果没有，它就会重写查询，或发出更有针对性的子查询，并重新执行检索（可选地使用不同检索器、过滤器或时间窗口）。这个过程会持续迭代，直到满足停止条件（如置信度阈值、最大迭代次数、token/预算上限）。这种模式能够减少幻觉，并且在查询含糊、或文档质量参差不齐时，尤其适合用于生产级 RAG 系统。

这也自然引向下一章：我们将探讨如何通过精心设计的 prompt，来塑造模型行为、优化对检索知识的利用，并最终提升整个 RAG 流水线的表现。