引言
提示工程(Prompt Engineering)是高效 RAG 系统的核心,它决定了大语言模型如何理解用户查询、如何与检索到的内容交互,以及如何生成有依据的回答。检索负责获取相关文档,生成负责产出最终输出,而提示词则是连接两者的桥梁。设计良好的提示词可以减少幻觉、强化事实一致性、提升推理质量,并引导模型生成更透明、更值得信赖的答案。从简单的上下文锚定指令,到链式思维、带引用回答或结构化输出提示等高级策略,提示词的设计会直接影响 RAG 工作流的质量与可靠性。本章将介绍提示工程的关键原则与技术,为构建不仅准确,而且可解释、稳健并贴合现实需求的 RAG 系统奠定基础。
结构
本章将涵盖以下主题:
- 软件要求
- 上下文锚定提示
- 链式思维提示
- 查询改写提示
- 多提示词集成
- 置信度感知提示
- 带引用回答提示
- 结构化输出提示
- 摘要提示
- 证据高亮提示
学习目标
到本章结束时,读者将理解提示工程如何通过引导 LLM 生成准确、相关且具备上下文感知的回答,来提升 RAG 系统的效果。本章涵盖了提示词结构设计、如何引入检索证据、如何处理多轮交互,以及如何减少幻觉。学习完成后,读者将掌握设计提示词的实用技能,从而优化检索与生成之间的协同,使 RAG 应用更可靠、更符合用户需求。
软件要求
本书中的每个概念后面都会配有相应的 recipe,也就是用 Python 编写的可运行代码。你会在所有 recipe 中看到代码注释,这些注释会逐行解释每一行代码的作用。
运行这些 recipe 需要以下软件环境:
- 系统配置:至少 16.0 GB 内存的系统
- 操作系统:Windows
- Python:Python 3.13.3 或更高版本
- LangChain:1.0.5
- LLM 模型:Ollama 的 llama3.2:3b
- 程序输入文件:本书 Git 仓库中提供了程序使用的输入文件
要运行程序,请先在命令行执行 pip install <package name> 安装 recipe 中提到的依赖包。安装完成后,在你的开发环境中运行 recipe 中提到的 Python 脚本(.py 文件)即可。
图 8.1 展示了提示工程:
图 8.1:提示工程
上下文锚定提示
上下文锚定提示(Context-grounded Prompting)是一种 RAG 技术,它会明确要求模型只能依据检索到的文档来生成答案,而不能依赖自身的内部记忆。通过把回答锚定在给定证据上,这种方式能够减少幻觉,并确保事实一致性。提示词本身在这里就扮演了“护栏”的角色,提醒模型只能从上下文中严格引用、转述或总结内容,避免无依据的猜测。在实际应用中,上下文锚定提示能够显著提升可靠性,尤其适用于法律、金融、科学等知识密集型领域,因为这些场景中对来源可追溯性的要求非常高。
Recipe 82
本 recipe 演示如何实现上下文锚定提示:
1. 将上下文文本切分为句子,以便更细粒度地检索。
2. 加载适合 CPU 运行的小型嵌入模型 sentence-transformers/all-MiniLM-L6-v2。
3. 预先计算上下文 embeddings。这里使用 convert_to_tensor=True 进行高效相似度搜索,以加速查询时的检索。
4. 加载一个适合 CPU 运行的小型模型,作为轻量级 LLM 来生成文本。
5. 为相似度搜索对查询进行嵌入。
6. 通过语义搜索检索 top-k 最相关的上下文句子。
7. 将检索到的句子合并为一个上下文字符串。
8. 创建一个强约束 prompt,结合检索到的上下文,并强调回答要清晰、完整。
9. 使用 LLM 生成答案,并限制回答长度以保持简洁。
10. 为了便于理解,打印查询、检索到的上下文和最终答案。
安装所需依赖:
pip install sentence-transformers transformers torch
context_grounded_prompting.py
请参考以下代码:
# context_grounded_prompting.py
# Example of context-grounded prompting using lightweight models suitable for CPU
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# Initialize models and context using a small embedding model for efficiency
# Pre-computing embeddings for context to speed up retrieval and
# splitting context into manageable chunks (sentences)
class ContextGroundedResponder:
    def __init__(self, context_text: str):
        # 1. Split context text into sentences for finer retrieval granularity
        self.context_sentences = [s.strip() for s in context_text.split(".") if s.strip()]
        # 2. Load small embedding model sentence-transformers/all-MiniLM-L6-v2 suitable for CPU
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        # 3. Pre-compute context embeddings. This speeds up retrieval during queries
        # using convert_to_tensor=True for efficient similarity search
        self.context_embeddings = self.model.encode(self.context_sentences, convert_to_tensor=True)
        # 4. Load a lightweight LLM for text generation using a small model suitable for CPU
        self.llm = pipeline("text2text-generation", model="google/flan-t5-base", device=-1)

    # Method to answer queries using retrieved context to ground the response
    # and stronger prompt to ensure complete sentences
    # Limiting response length for conciseness
    def answer(self, query: str, top_k: int = 2, max_length: int = 150):
        # 5. Embed the query for similarity search
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        # 6. Retrieve top-k relevant context sentences using semantic search
        hits = util.semantic_search(query_embedding, self.context_embeddings, top_k=top_k)[0]
        # 7. Combine retrieved sentences into a single context string
        retrieved_context = " ".join([self.context_sentences[hit['corpus_id']] for hit in hits])
        # 8. Create a strong prompt with retrieved context
        # Emphasizing clarity and completeness in the response
        prompt = f"""You are a helpful assistant.
Use the following context to answer the question in one or two complete sentences.
Do not just repeat words — provide a clear explanation.
Context: {retrieved_context}
Question: {query}
Answer:"""
        # 9. Generate the answer using the LLM
        # Limiting the length of the response for conciseness
        llm_output = self.llm(prompt, max_length=max_length, num_return_sequences=1)
        answer_text = llm_output[0]["generated_text"]
        # 10. Print the results for clarity
        # Print the query, retrieved context, and final answer
        print("Query:", query)
        print("Retrieved Context:", retrieved_context)
        print("Answer:", answer_text, "\n")
        return answer_text

# Example usage
if __name__ == "__main__":
    document_text = """
    Retrieval-Augmented Generation (RAG) is a technique that improves
    AI responses by combining document retrieval with language generation.
    It reduces hallucinations by grounding answers in retrieved documents
    instead of relying only on the model's memory.
    """
    bot = ContextGroundedResponder(document_text)
    query = "How does RAG reduce hallucinations?"
    bot.answer(query)
输出:
Query: How does RAG reduce hallucinations?
Retrieved Context: It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation
Answer: Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation. It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory.
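上下文锚定提示的概念部分还强调:当上下文中没有答案时,模型应当拒绝猜测。上面的 recipe 并未实现这一点。下面给出一个示意性的草稿,根据检索相似度在 prompt 中加入"拒答护栏"。其中阈值 0.4 与函数名 build_grounded_prompt 均为示例假设,并非固定 API:

```python
# A sketch of a refusal guardrail for context-grounded prompting.
# The threshold value (0.4) and the function name are illustrative assumptions.

def build_grounded_prompt(query: str, context: str, top_score: float,
                          threshold: float = 0.4) -> str:
    """Build a prompt that instructs the model to refuse when retrieval is weak."""
    if top_score < threshold:
        # Retrieval is too weak: tell the model to decline explicitly
        instruction = ("If the context does not contain the answer, "
                       "reply exactly: 'I cannot answer from the given context.'")
    else:
        instruction = "Answer ONLY from the context below, in one or two sentences."
    return (f"You are a helpful assistant.\n{instruction}\n"
            f"Context: {context}\nQuestion: {query}\nAnswer:")

# Usage: a weak retrieval score triggers the refusal instruction
weak_prompt = build_grounded_prompt("What is X?", "Unrelated text.", top_score=0.1)
strong_prompt = build_grounded_prompt("What is RAG?",
                                      "RAG combines retrieval with generation.",
                                      top_score=0.8)
```

在实际系统中,top_score 可以直接取 util.semantic_search 返回的最高相似度分数。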
链式思维提示
链式思维提示(Chain-of-thought Prompting)是一种引导模型先逐步推理,再给出最终答案的策略。在 RAG 场景下,这意味着鼓励模型先分析检索到的文档、拆解证据,然后再综合生成一个基于这些来源的回答。模型不会直接输出答案,而是会先展示其中间推理过程,例如识别相关段落、比较不同细节,或指出证据中的冲突。这种结构化推理能够提升事实准确性、透明度和可解释性,同时降低幻觉概率。对于复杂的多步骤查询,链式思维提示能让 RAG 系统表现出类似分析型思考的能力,确保结论不仅正确,而且可解释,并且能回溯到检索上下文。
Recipe 83
本 recipe 演示如何在 RAG 系统中实现链式思维提示:
1. 将上下文切分为句子,以支持更细粒度的检索。
2. 加载一个适合 CPU 运行的小型嵌入模型。
3. 预计算上下文 embeddings。这里使用 convert_to_tensor 来提高查询时相似度检索的效率。
4. 对问题进行嵌入,以支持相似度搜索。
5. 检索相关上下文句子。
6. 构造一个带排序的上下文字符串。
7. 创建链式思维风格的 prompt,鼓励模型在给出最终答案前先逐步推理。
8. 使用 LLM 生成答案,并限制答案长度以保持简洁。
9. 确保最终答案被清晰标识出来。
安装所需依赖:
pip install sentence-transformers transformers torch accelerate
chain_of_thought_style_prompting.py
请参考以下代码:
# chain_of_thought_style_prompting.py
# Example of chain-of-thought style prompting using lightweight models suitable for CPU
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

class ChainOfThoughtResponder:
    def __init__(self, context_text: str):
        # 1. Split context into sentences for finer retrieval granularity
        self.context_sentences = [s.strip() for s in context_text.split(".") if s.strip()]
        # 2. Load small embedding model for CPU
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        # 3. Pre-compute context embeddings. This speeds up retrieval during queries
        self.context_embeddings = self.embedder.encode(self.context_sentences, convert_to_tensor=True)
        # 4. Load a lightweight LLM for text generation
        # Using a small model suitable for CPU
        self.llm = pipeline("text2text-generation", model="google/flan-t5-base", device=-1)

    def answer(self, question: str, top_k: int = 2, max_length: int = 200):
        # 5. Embed the question for similarity search
        query_embedding = self.embedder.encode(question, convert_to_tensor=True)
        # 6. Retrieve top-k relevant context sentences
        hits = util.semantic_search(query_embedding, self.context_embeddings, top_k=top_k)[0]
        retrieved = [self.context_sentences[hit['corpus_id']] for hit in hits]
        # 7. Create ranked context string
        ranked_context = "\n".join([f" Top {i+1}: {sent}" for i, sent in enumerate(retrieved)])
        # 8. Create a chain-of-thought style prompt
        # Encouraging step-by-step reasoning before the final answer
        prompt = f"""You are a reasoning assistant.
Use the context below to answer the question.
Context:
{ranked_context}
Question: {question}
Think step by step:
- Write 2-3 bullet points of reasoning based on the context.
- Then write the final answer clearly after 'Final Answer:'.
Format:
- Reasoning bullets
---
Final Answer: <short answer>
"""
        # 9. Generate the answer using the LLM
        # Limiting the length of the response for conciseness
        llm_output = self.llm(prompt, max_length=max_length, num_return_sequences=1)
        answer_text = llm_output[0]["generated_text"]
        # 10. Ensure the final answer is clearly marked
        if "Final Answer:" not in answer_text:
            answer_text += f"\nFinal Answer: {retrieved[0]}"
        # 11. Print the results for clarity
        # Print the question, ranked context, and final answer
        print("Question:", question)
        print("Retrieved Context (ranked):")
        print(ranked_context)
        print("\nModel output:\n")
        print(answer_text)
        return answer_text

# Example usage
if __name__ == "__main__":
    document_text = """
    Retrieval-Augmented Generation (RAG) is a technique that improves
    AI responses by combining document retrieval with language generation.
    It reduces hallucinations by grounding answers in retrieved documents
    instead of relying only on the model's memory.
    """
    bot = ChainOfThoughtResponder(document_text)
    bot.answer("How does RAG reduce hallucinations?")
输出:
Question: How does RAG reduce hallucinations?
Retrieved Context (ranked):
Top 1: It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory
Top 2: Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation
Model output:
Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation
Final Answer: It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory
查询改写提示
查询改写提示(Query Reformulation Prompting)是一种用于提升检索质量的 RAG 技术。它通过把用户查询重写或扩展成更贴近知识库语言和结构的表达形式,来提高召回效果。由于用户查询往往存在歧义、不完整,或者与文档中的表达方式不一致,查询改写可以有效弥合这种差距。通过提示模型生成替代表达、同义改写,或者拆解后的子查询,系统便能检索到更广、更相关的文档集合。这降低了遗漏关键证据的风险,并提高了答案建立在最有价值上下文之上的概率。
Recipe 84
本 recipe 演示如何实现查询改写提示:
1. 将上下文切分为句子,以支持更细粒度的检索。
2. 使用嵌入模型 all-MiniLM-L6-v2(仅 CPU)预计算上下文 embeddings。这里使用 convert_to_tensor 以提高相似度检索效率。
3. 使用适合 CPU 的小型模型加载一个轻量级 LLM 来生成文本。
4. 对查询进行改写,以提高清晰度和具体性。
5. 对改写后的查询做嵌入,以支持相似度搜索。
6. 通过语义搜索检索 top-k 最相关的上下文句子。
7. 构造一个 prompt,鼓励模型先逐步推理,再给出最终的简洁答案。
8. 使用 LLM 生成答案,并限制回答长度以保持简洁。
9. 为了便于理解,打印原始查询、改写后的查询、检索到的上下文和最终答案。
安装所需依赖:
pip install sentence-transformers transformers torch
query_reformulation_prompting.py
请参考以下代码:
# query_reformulation_prompting.py
# Example of query reformulation prompting using lightweight models suitable for CPU
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

class QueryReformulationResponder:
    def __init__(self, context_text: str):
        # 1. Split context into sentences for finer retrieval granularity
        self.context_sentences = [s.strip() for s in context_text.split(".") if s.strip()]
        # 2. Pre-compute context embeddings with embedding model (all-MiniLM-L6-v2, CPU only)
        # This speeds up retrieval during queries using convert_to_tensor for efficient similarity search
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.context_embeddings = self.embedding_model.encode(self.context_sentences, convert_to_tensor=True)
        # 3. Load a lightweight LLM for text generation using a small model suitable for CPU
        self.generator = pipeline("text2text-generation", model="google/flan-t5-base", device=-1)

    def reformulate_query(self, query: str) -> str:
        # Simple reformulation (could be more complex)
        return query.replace("RAG", "Retrieval-Augmented Generation (RAG)")

    def answer(self, query: str, top_k: int = 2):
        # 4. Reformulate the query for clarity and specificity
        reformulated_query = self.reformulate_query(query)
        # 5. Embed the reformulated query for similarity search
        query_embedding = self.embedding_model.encode(reformulated_query, convert_to_tensor=True)
        # 6. Retrieve top-k relevant context sentences using semantic search
        hits = util.semantic_search(query_embedding, self.context_embeddings, top_k=top_k)[0]
        retrieved_contexts = [self.context_sentences[hit['corpus_id']] for hit in hits]
        # 7. Create a prompt that encourages step-by-step reasoning and a final concise answer
        prompt = (
            f"Question: {reformulated_query}\n"
            f"Context:\n- {retrieved_contexts[0]}\n- {retrieved_contexts[1]}\n\n"
            "Answer step by step using the context, "
            "then give a concise conclusion prefixed with 'Final Answer:'."
        )
        # 8. Generate the answer using the LLM
        model_output = self.generator(prompt, clean_up_tokenization_spaces=True)[0]["generated_text"]
        # 9. Print the results for clarity
        return f"Original Query: {query}\nReformulated Query: {reformulated_query}\n" \
               f"Retrieved Context (ranked):\n Top 1: {retrieved_contexts[0]}\n Top 2: {retrieved_contexts[1]}\n\n" \
               f"Model output:\n\n{model_output}"

# Example usage
if __name__ == "__main__":
    document_text = """
    Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining
    document retrieval with language generation.
    It reduces hallucinations by grounding answers in retrieved documents
    instead of relying only on the model's memory.
    """
    bot = QueryReformulationResponder(document_text)
    query = "Why does RAG help AI give better answers?"
    print(bot.answer(query))
输出:
Original Query: Why does RAG help AI give better answers?
Reformulated Query: Why does Retrieval-Augmented Generation (RAG) help AI give better answers?
Retrieved Context (ranked):
Top 1: Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining
document retrieval with language generation
Top 2: It reduces hallucinations by grounding answers in retrieved documents
instead of relying only on the model's memory
Model output:
It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory
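recipe 中的 reformulate_query 只做了最简单的术语展开。概念部分提到的"生成替代表达、同义改写或子查询"可以用更系统的方式实现。下面是一个示意性的草稿,基于同义词表为一个查询生成多个变体;实际系统中每个变体都会分别做嵌入检索,再把结果去重合并。其中同义词表 SYNONYMS 与函数名 reformulate 均为示例假设:

```python
# A sketch of multi-variant query reformulation using a synonym table.
# The synonym table and function name are illustrative assumptions.

SYNONYMS = {
    "RAG": ["Retrieval-Augmented Generation", "retrieval-augmented generation"],
    "hallucinations": ["fabricated answers", "ungrounded statements"],
}

def reformulate(query: str) -> list:
    """Expand a query into several variants to improve retrieval recall."""
    variants = [query]
    for term, alts in SYNONYMS.items():
        if term in query:
            for alt in alts:
                variants.append(query.replace(term, alt))
    return variants

variants = reformulate("How does RAG reduce hallucinations?")
# Each variant would then be embedded and searched separately,
# and the retrieved sentences merged (deduplicated) before prompting.
```

也可以把改写交给 LLM 本身完成,例如用 flan-t5 以"Rewrite this query in three different ways"为 prompt 生成变体,思路相同。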
多提示词集成
多提示词集成(Multi-prompt Ensemble)是一种 RAG 提示策略,它会并行使用多个不同版本的 prompt,并将它们的输出组合起来,以获得更准确、更稳健、更平衡的回答。与只依赖单一 prompt 形式不同,集成方法利用不同措辞、不同推理风格或不同摘要方式带来的多样性。例如,一个 prompt 可能偏向抽取式回答,另一个鼓励链式推理,第三个则侧重面向特定查询的摘要。随后,可以通过排序、多数投票或置信度评分等方式,将这些输出聚合为最终的有依据回答。这种方式可以缓解单一 prompt 的局限,提高相关信息覆盖度,并显著降低幻觉风险。
Recipe 85
本 recipe 演示如何实现多提示词集成:
1. 准备上下文句子。
2. 使用 SentenceTransformer 和较小的 all-MiniLM-L6-v2 模型计算上下文 embeddings,以提高效率。
3. 使用小型 text2text-generation 模型创建文本生成 pipeline。
4. 检索与查询相关的上下文。
5. 将多个上下文合并为一个字符串。
6. 使用多个 prompt 生成多个输出。
7. 将这些输出集成为最终答案。
8. 返回最终答案,以及检索到的上下文和每个 prompt 的单独输出。
9. 打印集成后的答案。
安装所需依赖:
pip install sentence-transformers transformers torch
multi_prompt_ensemble.py
请参考以下代码:
# multi_prompt_ensemble.py
# Demonstrates multi-prompting and ensemble techniques using a small LLM and sentence embeddings.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

class MultiPromptEnsembleResponder:
    def __init__(self, context_text: str):
        # 1. Prepare context sentences
        self.context_sentences = [s.strip() for s in context_text.split(".") if s.strip()]
        # 2. Context embeddings using SentenceTransformer
        # using a smaller model all-MiniLM-L6-v2 for efficiency
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.context_embeddings = self.embedding_model.encode(self.context_sentences, convert_to_tensor=True)
        # 3. Pipeline for text generation using a small model text2text-generation
        self.generator = pipeline("text2text-generation", model="google/flan-t5-base", device=-1)

    def retrieve_context(self, query: str, top_k: int = 2):
        # Semantic search to find top_k relevant contexts
        query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(query_embedding, self.context_embeddings, top_k=top_k)[0]
        return [self.context_sentences[hit['corpus_id']] for hit in hits]

    def generate_with_prompts(self, query: str, contexts: str):
        # Multiple prompting strategies
        prompts = [
            f"Answer the question directly.\nQuestion: {query}\nContext: {contexts}",
            f"Explain step by step before giving the final answer.\nQuestion: {query}\nContext: {contexts}",
            f"Give a concise one-line answer.\nQuestion: {query}\nContext: {contexts}"
        ]
        outputs = []
        for i, prompt in enumerate(prompts, 1):
            result = self.generator(prompt, clean_up_tokenization_spaces=True)[0]["generated_text"]
            outputs.append(f"Prompt {i} output: {result}")
        return outputs

    def ensemble_answer(self, query: str, top_k: int = 2):
        # 4. Retrieve relevant contexts for the query
        retrieved_contexts = self.retrieve_context(query, top_k)
        # 5. Combine multiple contexts into a single string
        contexts_text = " ".join(retrieved_contexts)
        # 6. Generate outputs using multiple prompts
        prompt_outputs = self.generate_with_prompts(query, contexts_text)
        # 7. Ensemble the outputs into a final answer
        final_prompt = (
            f"Here are multiple answers to the same question:\n\n"
            f"{prompt_outputs[0]}\n\n{prompt_outputs[1]}\n\n{prompt_outputs[2]}\n\n"
            f"Now combine them into a single best answer.\nFinal Answer:"
        )
        final_answer = self.generator(final_prompt, clean_up_tokenization_spaces=True)[0]["generated_text"]
        # 8. Return the final answer along with retrieved contexts and individual prompt outputs
        return f"Query: {query}\nRetrieved Contexts:\n- {retrieved_contexts[0]}\n- {retrieved_contexts[1]}\n\n" \
               f"Multi-Prompt Outputs:\n" + "\n".join(prompt_outputs) + f"\n\nEnsemble Final Answer:\n{final_answer}"

# Example usage
if __name__ == "__main__":
    document_text = """
    Retrieval-Augmented Generation (RAG) is a technique that improves AI responses
    by combining document retrieval with language generation.
    It reduces hallucinations by grounding answers in retrieved documents
    instead of relying only on the model's memory.
    """
    bot = MultiPromptEnsembleResponder(document_text)
    query = "How does RAG reduce hallucinations?"
    # 9. Print the ensemble answer
    print(bot.ensemble_answer(query))
输出:
Query: How does RAG reduce hallucinations?
Retrieved Contexts:
- It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory
- Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation
Multi-Prompt Outputs:
Prompt 1 output: grounding answers in retrieved documents
Prompt 2 output: Grounding answers in retrieved documents instead of relying only on the model's memory. RAG is a technique that improves AI responses by combining document retrieval with language generation. The final answer: grounding answers in retrieved documents instead of relying only on the model's memory.
Prompt 3 output: grounding answers in retrieved documents
Ensemble Final Answer:
grounding answers in retrieved documents
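上面的 recipe 用 LLM 把多个输出融合成最终答案;概念部分还提到了多数投票这种更简单、更可控的聚合方式。下面是一个示意性的草稿,对轻度归一化后的输出做投票(函数名 majority_vote 为示例假设):

```python
from collections import Counter

# A sketch of majority-vote aggregation over multiple prompt outputs.
def majority_vote(outputs: list) -> str:
    """Pick the most frequent answer after light normalization."""
    normalized = [o.strip().lower().rstrip(".") for o in outputs]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the original casing of the first output matching the winner
    for o in outputs:
        if o.strip().lower().rstrip(".") == winner:
            return o.strip()
    return outputs[0]

votes = [
    "grounding answers in retrieved documents",
    "Grounding answers in retrieved documents.",
    "combining retrieval with generation",
]
best = majority_vote(votes)  # the grounding answer wins 2-to-1
```

相比 LLM 融合,多数投票不增加额外的生成开销,也更易于审计,但要求各输出在表述上足够接近才能形成有效多数。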
置信度感知提示
置信度感知提示(Confidence-aware Prompting)是一种 RAG 技术,它不仅要求模型基于检索文档生成回答,还要求模型在回答中表达自己的确定性水平。模型不再总是给出一个看似绝对的结论,而是会被提示去反思现有证据的强弱、指出信息缺口,或者在无法找到答案时明确说明。这能够减少模型“过度自信”的问题,并帮助用户更好地判断输出结果的可靠性。置信度感知提示通常会把事实回答与“高置信度 / 中等置信度 / 低置信度”等标记结合起来,有时还会关联引用或支持性文本。在医疗、法律、金融等需要严肃决策的现实场景中,这种策略尤其重要,因为承认不确定性有时和给出答案本身一样重要。
Recipe 86
本 recipe 演示如何实现置信度感知提示:
1. 准备上下文句子。
2. 使用 SentenceTransformer 和较小的 all-MiniLM-L6-v2 模型计算上下文 embeddings,以提高效率。
3. 使用小型 text2text-generation 模型创建文本生成 pipeline。
4. 检索与查询相关的上下文。
5. 将多个上下文合并为一个字符串。
6. 基于查询和检索到的上下文构造 prompt。
7. 生成模型输出,其中设置 clean_up_tokenization_spaces=True,以获得更好的输出质量。
8. 基于相似度得分估计检索置信度。
9. 返回答案、置信度以及检索到的上下文。
10. 打印带置信度分数的答案。
安装所需依赖:
pip install sentence-transformers transformers torch
confidence_aware_prompting.py
请参考以下代码:
# confidence_aware_prompting.py
# Demonstrates confidence-aware prompting using a small LLM and sentence embeddings.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

class ConfidenceAwareResponder:
    def __init__(self, context_text: str):
        # 1. Prepare context sentences
        self.context_sentences = [s.strip() for s in context_text.split(".") if s.strip()]
        # 2. Context embeddings using SentenceTransformer
        # using a smaller model all-MiniLM-L6-v2 for efficiency
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.context_embeddings = self.embedding_model.encode(self.context_sentences, convert_to_tensor=True)
        # 3. Pipeline for text generation using a small model text2text-generation
        self.generator = pipeline("text2text-generation", model="google/flan-t5-base", device=-1)

    def retrieve_context(self, query: str, top_k: int = 2):
        query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(query_embedding, self.context_embeddings, top_k=top_k)[0]
        return [(self.context_sentences[hit['corpus_id']], float(hit['score'])) for hit in hits]

    def answer(self, query: str, top_k: int = 2):
        # 4. Retrieve relevant contexts for the query
        retrieved = self.retrieve_context(query, top_k)
        # 5. Combine multiple contexts into a single string
        context_text = " ".join([ctx for ctx, _ in retrieved])
        # 6. Create a prompt with the query and retrieved context
        prompt = (
            f"Question: {query}\n"
            f"Context: {context_text}\n\n"
            f"Answer clearly and concisely."
        )
        # 7. Generate model output
        model_output = self.generator(prompt, clean_up_tokenization_spaces=True)[0]["generated_text"]
        # 8. Estimate retrieval confidence based on similarity scores
        avg_score = sum(score for _, score in retrieved) / len(retrieved)
        if avg_score > 0.7:
            confidence = "High"
        elif avg_score > 0.5:
            confidence = "Medium"
        else:
            confidence = "Low"
        # 9. Return the answer along with confidence and retrieved contexts
        return f"""
Query: {query}
Retrieved Contexts:
""" + "\n".join([f"- {ctx} (score={score:.3f})" for ctx, score in retrieved]) + f"""
Model output:
{model_output}
Confidence: {confidence} (avg similarity={avg_score:.3f})
"""

# Example usage
if __name__ == "__main__":
    document_text = """
    Retrieval-Augmented Generation (RAG) is a technique that improves AI responses
    by combining document retrieval with language generation.
    It reduces hallucinations by grounding answers in retrieved documents
    instead of relying only on the model's memory.
    """
    bot = ConfidenceAwareResponder(document_text)
    query = "How does RAG reduce hallucinations?"
    # 10. Print the answer with confidence score
    print(bot.answer(query))
输出:
Query: How does RAG reduce hallucinations?
Retrieved Contexts:
- It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory (score=0.391)
- Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation (score=0.169)
Model output:
grounding answers in retrieved documents
Confidence: Low (avg similarity=0.280)
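上面的 recipe 只在检索侧估计置信度;概念部分还提到,可以提示模型在回答中自行表达确定性水平。下面的草稿把检索置信度写入 prompt,并要求模型在回答末尾附上置信度标记。提示词措辞与函数名 build_confidence_prompt 均为示例假设:

```python
# A sketch of injecting a confidence instruction into the prompt itself,
# so the model is asked to express its own certainty level.
def build_confidence_prompt(query: str, context: str, avg_score: float) -> str:
    # Map retrieval score to an evidence-strength hint (thresholds match the recipe)
    if avg_score > 0.7:
        hint = "The retrieved evidence is strong."
    elif avg_score > 0.5:
        hint = "The retrieved evidence is moderate."
    else:
        hint = "The retrieved evidence is weak; acknowledge uncertainty if needed."
    return (f"Question: {query}\nContext: {context}\n{hint}\n"
            "Answer concisely, then append one of "
            "'Confidence: High', 'Confidence: Medium', or 'Confidence: Low'.")

p = build_confidence_prompt("How does RAG reduce hallucinations?",
                            "It grounds answers in retrieved documents.", 0.28)
```

这样检索侧的信号与模型自报的置信度可以相互印证:两者不一致时,往往意味着需要人工复核。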
带引用回答提示
带引用回答提示(Cited Response Prompting)是一种 RAG 技术,它确保每个生成的答案都能明确回链到对应的检索来源。系统不会把信息作为“漂浮在空中的自由文本”直接给出,而是会引导模型在输出中加入简短引用、参考标记或行内引用,直接指向支持该答案的证据。这样能够强化事实锚定、减少幻觉,并为用户提供一条透明的验证路径,使其可以直接对照原始文档核实答案。在学术、法律、医疗和政策等知识密集型领域,这种提示方式尤其有价值,因为它把“准确性”和“可追责性”结合了起来。通过在输出中嵌入引用,RAG 系统不仅能给出正确的信息,还能通过清楚展示信息来源来建立用户信任。
Recipe 87
本 recipe 演示如何编写一个带引用回答提示程序:
1. 准备文档文本。这里用一个小示例做演示。
2. 为每个文档分配唯一 ID,用于后续引用。
3. 使用句向量构建一个用于文档检索的 FAISS 索引。
4. 初始化 FAISS 索引并添加文档 embeddings。
5. 生成带引用的答案。
6. 基于上下文与查询构造 prompt。
7. 打印带引用的答案。
安装所需依赖:
pip install sentence-transformers faiss-cpu numpy
cited_response_prompting.py
请参考以下代码:
# cited_response_prompting.py
# Demonstrates cited response prompting using a small LLM and sentence embeddings with FAISS.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Prepare a document text, here we use a small example for demo
documents = [
    "RAG reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory.",
    "It improves AI responses by combining document retrieval with language generation."
]
# 2. Assign unique IDs to each document for citation
doc_ids = [f"[{i+1}]" for i in range(len(documents))]
# 3. Build FAISS index for document retrieval using sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(documents)
# 4. Initialize FAISS index and add document embeddings
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings))

# Function to retrieve top_k documents for a query
def retrieve(query, top_k=2):
    query_emb = model.encode([query])
    distances, indices = index.search(np.array(query_emb), top_k)
    retrieved = [(doc_ids[i], documents[i]) for i in indices[0]]
    return retrieved

# Function to build context string with citations
def build_context(retrieved):
    context = ""
    for doc_id, doc_text in retrieved:
        context += f"{doc_id} {doc_text}\n"
    return context

# 5. Generate answer with citations
def generate_answer(query):
    retrieved = retrieve(query)
    context = build_context(retrieved)
    # 6. Create a prompt with context and query
    prompt = f"""
Context:
{context}
Question: {query}
Answer clearly, and cite sources using their IDs like [1], [2].
"""
    # For demonstration, the answer is hardcoded instead of being generated
    # from the prompt above; in a real system the prompt would be passed
    # to a text-generation model.
    answer = "RAG reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory [1]. Additionally, it improves AI responses by combining document retrieval with language generation [2]."
    return answer

# Example usage
query = "How does RAG reduce hallucinations?"
# 7. Print the answer with citations
print(generate_answer(query))
输出:
RAG reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory [1]. Additionally, it improves AI responses by combining document retrieval with language generation [2].
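上面的 recipe 为演示方便直接返回了写死的带引用答案,构造好的 prompt 并未真正交给模型。下面的草稿展示一种把生成步骤抽象为可注入函数的写法,便于接入 flan-t5 等真实模型;这里用桩函数代替真实的检索与生成调用,函数名与接口均为示例假设:

```python
# A sketch that feeds the cited-response prompt to an injectable generator
# instead of returning a hardcoded answer. Function names are illustrative.

def generate_cited_answer(query: str, retrieve_fn, generate_fn) -> str:
    """retrieve_fn(query) -> list of (doc_id, text); generate_fn(prompt) -> str."""
    retrieved = retrieve_fn(query)
    context = "\n".join(f"{doc_id} {text}" for doc_id, text in retrieved)
    prompt = (f"Context:\n{context}\n\nQuestion: {query}\n"
              "Answer clearly, and cite sources using their IDs like [1], [2].")
    return generate_fn(prompt)

# Stub functions standing in for the FAISS retriever and a real LLM call
# (e.g., a transformers text2text-generation pipeline)
fake_retrieve = lambda q: [("[1]", "RAG grounds answers in retrieved documents.")]
fake_generate = lambda prompt: "RAG grounds answers in retrieved documents [1]."
answer = generate_cited_answer("How does RAG work?", fake_retrieve, fake_generate)
```

把 retrieve 与 generate 作为参数传入,也方便在测试中用桩函数替换真实模型。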
结构化输出提示
结构化输出提示(Structured Output Prompting)是一种 RAG 技术,它会引导模型以预定义、机器可读的格式来生成回答,例如 JSON、表格或项目符号结构。模型不再输出自由文本,而是被明确要求把信息组织到固定槽位、字段或类别中,这使得结果更容易解析、集成并用于下游系统。该方法尤其适合构建那些对一致性要求较高的应用,例如把文档摘要成键值字段、从法律文本中抽取实体,或把洞察整理成可供 dashboard 使用的格式。通过约束模型遵循某种结构,开发者能够降低歧义、提高自动化程度,并确保 RAG 输出不仅准确,而且可以立即投入使用。因此,结构化输出提示是连接自然语言生成与现实数据工作流的关键基石。
Recipe 88
本 recipe 演示如何编写一个结构化输出提示程序:
1. 使用较小的 flan-t5-base 模型加载模型与 tokenizer 进行演示。
2. 准备问题与检索到的上下文。
3. 将上下文合并为一个字符串。
4. 构造一个强制要求 JSON 输出的 prompt。
5. 生成响应。
6. 解析 JSON 响应。如果解析失败,就返回原始文本。
7. 打印结构化 JSON 输出。
安装所需依赖:
pip install torch transformers
structured_output_prompting.py
请参考以下代码:
# structured_output_prompting.py
# Demonstrates structured output prompting using a small LLM and strict JSON format.
import json
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load model and tokenizer using a small model flan-t5-base for demonstration
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="cpu")
# 2. Question and retrieved contexts
query = "How does RAG reduce hallucinations?"
contexts = [
    "It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory.",
    "Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation."
]
# 3. Combine contexts into a single string
context_text = "\n".join(contexts)
# 4. Create a prompt enforcing JSON output
prompt = f"""
You are a helpful assistant.
Answer the question strictly in the following JSON format:
{{
  "query": "...",
  "retrieved_context": ["...", "..."],
  "answer": "..."
}}
Question: {query}
Context:
{context_text}
Answer:
"""
# 5. Generate the response
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=200)
raw_answer = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
# 6. Parse the JSON response
# Attempt to parse the response; if it fails, return raw text
try:
    parsed = json.loads(raw_answer)
except json.JSONDecodeError:
    parsed = {
        "query": query,
        "retrieved_context": contexts,
        "answer": raw_answer
    }
# 7. Print structured JSON output
print(json.dumps(parsed, indent=2))
输出:
{
"query": "How does RAG reduce hallucinations?",
"retrieved_context": [
"It reduces hallucinations by grounding answers in retrieved documents instead of relying only on the model's memory.",
"Retrieval-Augmented Generation (RAG) is a technique that improves AI responses by combining document retrieval with language generation."
],
"answer": "grounding answers in retrieved documents"
}
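需要注意,解析成功并不代表结构完整:模型可能返回合法 JSON 但缺少字段,或返回的不是对象。下面的草稿在解析之后再校验必需键,并为缺失键补默认值。键名沿用 recipe 中的约定,helper 名称 validate_structured_output 为示例假设:

```python
import json

REQUIRED_KEYS = ("query", "retrieved_context", "answer")

# A sketch of post-parse validation for structured LLM output.
def validate_structured_output(raw: str, query: str, contexts: list) -> dict:
    """Parse model output as JSON and fill in any missing required keys."""
    try:
        parsed = json.loads(raw)
        if not isinstance(parsed, dict):
            raise ValueError("top-level JSON must be an object")
    except (json.JSONDecodeError, ValueError):
        parsed = {}
    defaults = {"query": query, "retrieved_context": contexts,
                "answer": raw.strip()}
    for key in REQUIRED_KEYS:
        parsed.setdefault(key, defaults[key])
    return parsed

# The model returned valid JSON but omitted two required keys;
# validation fills them in from the known query and contexts.
result = validate_structured_output('{"answer": "grounding in documents"}',
                                    "How does RAG work?", ["ctx1"])
```

这一层校验使下游系统可以无条件地按固定 schema 消费输出,而不必逐处做防御性判断。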
摘要提示
摘要提示(Summarization Prompting)是一种 RAG 技术,它会先引导模型把检索到的文档压缩成简洁、聚焦的摘要,再生成最终回答。与其让用户面对大段原始文本,这种方法会驱动系统只提取并压缩最相关的信息。摘要可以有多种形式:抽象式摘要(用模型自己的话重写)、抽取式摘要(挑选关键句子),或面向查询的摘要(围绕用户问题进行定制化总结)。通过减少噪声并突出核心细节,摘要提示不仅提高了效率,也提升了准确性。它特别适用于法律判例、研究论文、财务报告等文档又长、又复杂、数量又多的场景。
Recipe 89
本 recipe 演示如何实现摘要提示:
1. 加载要被摘要的文档。
2. 将文档切分为多个 chunk,以控制上下文长度。
3. 将多个 chunk 合并,以便给模型提供足够的上下文。
4. 使用专门的摘要模型初始化一个 summarization pipeline。
5. 定义用户查询,用来控制摘要的聚焦方向。
6. 将查询与文本组合成一个清晰的摘要 prompt。
7. 调整参数以生成多句式摘要。
8. 打印用户查询与最终摘要结果。
安装所需依赖:
pip install transformers torch langchain_community langchain-text-splitters
summarization_prompting.py
请参考以下代码:
# summarization_prompting.py
from transformers import pipeline
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the document to be summarized
loader = TextLoader("chapter8_RAG.txt")  # Replace with your document
docs = loader.load()
# 2. Split into chunks to manage context length
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)
# 3. Merge chunks to give the model enough context
full_text = " ".join([chunk.page_content for chunk in splits])
# 4. Initialize a summarization pipeline using a dedicated summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# 5. User query to control summarization focus
query = "Summarize this text focusing on how RAG reduces hallucinations."
# 6. Combine query and text in a clear prompt for summarization
prompt_text = f"{query}\n\n{full_text}"
# 7. Generate the summary with adjusted parameters for multi-sentence output
summary = summarizer(
    prompt_text,
    max_length=84,
    min_length=80,
    do_sample=False
)
final_summary = summary[0]["summary_text"]
# 8. Print the query and final summary results
print("\n--- USER QUERY ---")
print(query)
print("\n--- DOCUMENT SUMMARY ---")
print(final_summary)
输出:
--- USER QUERY ---
Summarize this text focusing on how RAG reduces hallucinations.
--- DOCUMENT SUMMARY ---
Retrieval-Augmented Generation (RAG) is a method that combines retrieval and generation. It reduces hallucinations by grounding answers in evidence. It is especially useful in chatbots, search assistants, and education. RAG can be used to build chatbots and search assistants for example. For more information, visit the RAG website or read the book RAG: A Guide to Augmented Reality.
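概念部分提到的抽取式摘要也可以不依赖生成模型:直接按与查询的相关度挑选原文句子。下面的草稿用简单的词重叠打分代替向量相似度,仅作示意;实际系统中可以换成 recipe 里的 sentence-transformers 嵌入相似度,函数名 extractive_summary 为示例假设:

```python
import re

# A sketch of query-focused extractive summarization using word overlap
# as a lightweight stand-in for embedding similarity.
def extractive_summary(text: str, query: str, top_k: int = 2) -> str:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    query_words = set(re.findall(r"\w+", query.lower()))
    # Rank sentences by how many query words they share
    scored = sorted(
        sentences,
        key=lambda s: len(query_words & set(re.findall(r"\w+", s.lower()))),
        reverse=True,
    )
    # Keep top_k sentences, restoring the original document order
    chosen = scored[:top_k]
    return " ".join(s for s in sentences if s in chosen)

doc = ("RAG combines retrieval with generation. "
       "It reduces hallucinations by grounding answers in documents. "
       "The weather today is sunny.")
summary = extractive_summary(doc, "How does RAG reduce hallucinations?")
```

抽取式摘要保证输出句子原样来自文档,天然不会产生幻觉,代价是表达不如抽象式摘要流畅。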
证据高亮提示
证据高亮提示(Evidence Highlighting Prompting)是一种 RAG 策略,它会引导模型显式标出或强调那些支撑答案的检索文档片段。系统不再把证据“隐形地揉进回答里”,而是会被要求对关键短语进行下划线、加粗或注释,直接指出哪些内容构成了其推理依据。这种方法能够提升透明度和可解释性,使用户不仅能快速看到答案是什么,也能看到答案为什么成立。在研究、法律、合规等领域,这种提示方式尤其有价值,因为这些场景中的决策必须具备清晰的证据链。它会把 RAG 输出从“简单回答”提升为“可审计、可解释、便于人工核验的回答”。
Recipe 90
本 recipe 演示如何实现证据高亮提示:
1. 加载需要分析并做证据高亮的文档。
2. 使用基于字符的切分器,按照设定的 chunk size 和 overlap 对文档切分。
3. 将 chunk 合并,以便给模型足够的上下文做筛选。
4. 清洗文本,修复常见编码问题。
5. 将文本切分为句子,以便更细粒度筛选。
6. 定义筛选相关证据的关键词。
7. 筛选出同时包含所有关键词或任一关键词的句子。
8. 使用专门的摘要模型初始化一个 summarization pipeline。
9. 对筛选出的证据做摘要,生成简洁高亮结果。
10. 对摘要后的句子进一步筛选,确保相关性。
11. 打印用户查询与高亮证据。
安装所需依赖:
pip install transformers torch langchain_community langchain-text-splitters
evidence_highlighting_prompting.py
请参考以下代码:
# evidence_highlighting_prompting.py
# Demonstrates evidence highlighting in a document using a small LLM and keyword filtering.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import pipeline
import re
import warnings

warnings.filterwarnings("ignore", category=FutureWarning, module="torch")

# 1. Load the document to be analyzed for evidence highlighting
loader = TextLoader("chapter8_RAG.txt")  # Replace with your document
docs = loader.load()
# 2. Split into chunks using character-based splitter with chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)
# 3. Merge chunks to give the model enough context for filtering
full_text = " ".join([chunk.page_content for chunk in splits])
# 4. Clean text to fix common encoding issues
clean_text = full_text.replace("â€TM", "'").replace("–", "-")
# 5. Split text into sentences for finer filtering
sentences = re.split(r'(?<=[.!?])\s+', clean_text)
# 6. Define keywords for filtering relevant evidence
keywords_all = ['RAG', 'hallucinations']  # Must include
keywords_any = ['reduces', 'grounding', 'factual accuracy']  # Optional but helpful
# 7. Filter sentences containing all keywords and any keywords
filtered_evidence = [
    s for s in sentences
    if all(k.lower() in s.lower() for k in keywords_all)
    or any(k.lower() in s.lower() for k in keywords_any)
]
# 8. Initialize a summarization pipeline using a dedicated summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# 9. Summarize the filtered evidence to create concise highlights
if filtered_evidence:
    evidence_text = " ".join(filtered_evidence)
    summary = summarizer(
        evidence_text,
        max_length=11,
        min_length=5,
        do_sample=False
    )
    summary_text = summary[0]['summary_text']
else:
    summary_text = "No relevant evidence found."
# 10. Further filter summary sentences to ensure relevance
final_sentences = [
    s for s in re.split(r'(?<=[.!?])\s+', summary_text)
    if any(k.lower() in s.lower() for k in keywords_all + keywords_any)
]
bullet_points = "\n".join(f"- {s.strip()}" for s in final_sentences)
# 11. Print the user query and highlighted evidence
query = "Highlight sentences in the text that provide evidence on how RAG reduces hallucinations."
print("\n--- USER QUERY ---")
print(query)
print("\n--- HIGHLIGHTED EVIDENCE ---")
print(bullet_points)
输出:
--- USER QUERY ---
Highlight sentences in the text that provide evidence on how RAG reduces hallucinations.
--- HIGHLIGHTED EVIDENCE ---
- This reduces hallucinations by grounding answers in evidence.
- This reduces hallucination by grounding questions in evidence, rather than relying on a single source of information.
- It can also reduce hallucinations by making it easier for people to remember what they have been told.
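概念部分提到可以对关键短语加粗或下划线,而上面的 recipe 只做了句子级筛选。下面的草稿用 Markdown 的 ** ** 标记对证据句中的关键词做行内高亮,关键词表沿用 recipe 的设置,函数名 highlight_keywords 为示例假设:

```python
import re

# A sketch of inline keyword highlighting using Markdown bold markers.
def highlight_keywords(sentence: str, keywords: list) -> str:
    for kw in keywords:
        # \b word boundaries keep partial matches (e.g., 'RAGged') untouched
        sentence = re.sub(rf"\b({re.escape(kw)})\b", r"**\1**", sentence,
                         flags=re.IGNORECASE)
    return sentence

evidence = "This reduces hallucinations by grounding answers in evidence."
marked = highlight_keywords(evidence, ["hallucinations", "grounding"])
# marked: "This reduces **hallucinations** by **grounding** answers in evidence."
```

对筛选出的每个证据句调用该函数,即可让用户一眼看到答案依据的具体短语,而不只是整句证据。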
结论
提示工程在增强 RAG 系统能力方面发挥着基础性作用,因为它决定了模型如何与检索文档交互、如何基于证据推理,以及如何组织和呈现最终回答。从确保事实对齐的上下文锚定提示,到鼓励结构化推理的链式思维提示,再到提升透明度与信任的带引用提示和证据高亮提示,每一种策略都提供了减少幻觉、提升可靠性的独特机制。这些方法共同表明:RAG 系统的质量并不只由检索或生成单独决定,而是由如何精心设计连接两者的提示词所决定。
进入下一章后,重点将从“模型如何处理检索到的内容”转向“这些内容最初是如何被检索到的”。一个设计良好的搜索策略至关重要,因为哪怕提示工程再精巧,也无法弥补检索质量差或结果不相关的问题。