Building a RAG Application from Scratch (Part 4): Indexing

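The previous article, Building a RAG Application from Scratch (Part 3): Routing, covered routing. To build a flexible, powerful, and reusable RAG application you need routing, because it dynamically plans which data sources to query.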

Indexing is an equally important part of RAG, and it is the subject of this article.

Chunking

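When RAG retrieves information, it has to pass the valuable retrieved passages to the model before the model can generate reliable, useful content. Chunking is the process of splitting a whole text into smaller segments. When we embed content for an LLM application, chunking helps improve the accuracy of what is recalled from the vector database, so chunk quality is one of the more important links in a RAG pipeline.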

Common chunking methods are:

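1. Fixed-length splitting

How: split the text into blocks by character count or word count. For example, splitting a document every 500 characters yields chunks of 500 characters each (the last chunk may be shorter).
Pros: simple to implement and fast to run.
Cons: can cut through context and lose important semantic information.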
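
As a minimal sketch of fixed-length splitting, assuming we count characters rather than words, the helper below slices the text every `chunk_size` characters. The function name `fixed_length_chunks` is hypothetical and only serves this illustration.

```python
def fixed_length_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last slice may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "a" * 1200
chunks = fixed_length_chunks(text, chunk_size=500)
print([len(c) for c in chunks])  # [500, 500, 200]
```

Note that nothing here looks at sentence boundaries, which is exactly where the context breaks mentioned above come from.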

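2. Sentence-based splitting

How: split at sentence granularity, for example at periods and other sentence-ending punctuation.
Pros: keeps every sentence intact and its context coherent.
Cons: a very long sentence becomes an oversized chunk in which details can get lost. Punctuation-based boundaries can also be wrong, which hurts retrieval.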
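
A rough sketch of sentence-based splitting follows, assuming sentences end at `.`, `!`, `?` or their Chinese counterparts, and assuming we then pack whole sentences greedily into chunks of at most `max_chars` characters. The name `sentence_chunks` and the packing step are illustrative choices, not something the text above prescribes. Decimals and abbreviations such as "3.14" would be split incorrectly by this simple pattern.

```python
import re

# A sentence is a run of non-terminators followed by its terminators.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def sentence_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks without cutting a sentence in half."""
    sentences = [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "RAG retrieves text. Chunks should be coherent! Is this one sentence?"
for chunk in sentence_chunks(text, max_chars=40):
    print(repr(chunk))
```

A single sentence longer than `max_chars` is kept whole here, which illustrates the oversized-chunk drawback listed above.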

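3. Sliding-window splitting

How: slide an overlapping window over the text, for example with a window size of 500 and a step of 100.
Pros: reduces the information loss that fixed-length or sentence-boundary splitting can introduce.
Cons: the overlapping context duplicates information and increases computation. A window can still start or end in the middle of a sentence or phrase, which breaks semantic coherence.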
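
The sliding window can be sketched as below, assuming character-based windows. With a window of 500 and a step of 100, consecutive chunks share 400 characters. The function name `sliding_window_chunks` is hypothetical, and the example uses tiny numbers so the overlap is visible.

```python
def sliding_window_chunks(text: str, window_size: int = 500, step: int = 100) -> list[str]:
    """Return overlapping chunks: each window starts `step` characters after the previous one."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window_size])
        # Stop once a window has reached the end of the text.
        if start + window_size >= len(text):
            break
    return chunks


print(sliding_window_chunks("abcdefghij", window_size=4, step=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

When `step` is smaller than `window_size` the chunks overlap, and the amount of stored text grows accordingly, which is the extra computation mentioned above.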

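4. Topic-based splitting

How: split at the points where the topic of the text changes.
Pros: keeps semantic coherence high; well suited to well-structured text.
Cons: does not work on poorly structured text.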

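5. Semantic-similarity splitting

How: use a model to score the semantic similarity between pieces of text, and split wherever the similarity drops below a threshold.
Pros: keeps each chunk semantically consistent and improves retrieval.
Cons: requires a highly accurate model.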
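
The threshold rule can be sketched as follows. This is only an illustration: a bag-of-words vector stands in for a real embedding model (an assumption made so the example runs without any service), and the names `embed`, `cosine`, and `semantic_chunks` as well as the threshold of 0.2 are hypothetical.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent sentences are less similar than the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < threshold:
            groups.append([current])
        else:
            groups[-1].append(current)
    return [" ".join(group) for group in groups]


sentences = [
    "The agent stores memory in a vector database.",
    "The vector database returns memory for the agent.",
    "Bananas are rich in potassium.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
# The first two sentences stay together; the third starts a new chunk.
```

The quality of the split depends entirely on how well the similarity function reflects meaning, which is why the accuracy of the model matters so much for this method.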

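6. Document-structure splitting

How: split along the document's own structure; Markdown splitters are the typical example.
Pros: semantically coherent.
Cons: a question may span several sections, which a single chunk cannot cover. The generation model also has a token limit, and a structural section may exceed it.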
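
A minimal sketch of structure-based splitting for Markdown, assuming ATX-style headings (`#` to `######`) mark the section boundaries. The function name `markdown_sections` is hypothetical, and the sketch does not handle `#` lines inside code fences, which a production splitter would need to skip.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _section(heading: str, lines: list[str]) -> dict:
    return {"heading": heading, "content": "\n".join(lines).strip()}


def markdown_sections(markdown: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading."""
    sections, heading, lines = [], "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            sections.append(_section(heading, lines))
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    sections.append(_section(heading, lines))
    # Drop an empty preamble (no heading and no text before the first heading).
    return [s for s in sections if s["heading"] or s["content"]]


doc = "# Title\nintro\n\n## Part A\ntext a\n## Part B\ntext b"
for section in markdown_sections(doc):
    print(section)
```

Each section becomes one chunk regardless of its length, so a long section can still exceed the model's token limit, as noted above.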

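7. Chunk-summary splitting

How: after splitting the document, use summarization to extract the key information of each chunk.
Pros: condenses the key information while preserving it.
Cons: the accuracy of the summarization method directly determines the overall result.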

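Factors to consider when chunking:

1. What is the nature of the content being indexed?

Are you working with long texts (books or articles) or with shorter content? Different scenarios call for different chunking strategies.

2. Different embedding models perform differently at different chunk sizes.

3. The length and complexity of the queries have a strong bearing on how to chunk.

Are user queries short and specific, or long and complex?

4. How will the retrieved results be used in the specific application?

With an LLM, for example, the token limit caps the chunk size.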

Environment Setup

 ! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube python-dotenv beautifulsoup4

Set up LangSmith by following its documentation: docs.smith.langchain.com/

 # LangSmith environment configuration
 import os
 os.environ['LANGCHAIN_TRACING_V2'] = 'true'
 os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
 os.environ['LANGCHAIN_API_KEY'] = '<your-api-key>'  # replace with your own key

Multi-representation Indexing

Related paper: arxiv.org/abs/2312.06…

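In summary, the paper's abstract says the following. Dense retrieval has become a prominent method for obtaining relevant context or world knowledge in open-domain NLP tasks. When a learned dense retriever is applied to a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, for example document, passage, or sentence. The authors find that the choice of retrieval unit significantly affects the performance of both retrieval and downstream tasks. Unlike the typical approach of using passages or sentences, they introduce a novel retrieval unit for dense retrieval: the proposition. Propositions are defined as atomic expressions within text, each encapsulating a distinct fact and presented in a concise, self-contained natural-language format. The authors compare different retrieval granularities empirically. The results show that proposition-based retrieval significantly outperforms the traditional passage-based or sentence-based methods in dense retrieval. Retrieval by proposition also improves downstream QA performance, because the retrieved text is denser in question-relevant information, which reduces the need for lengthy input tokens and minimizes the inclusion of extraneous, irrelevant information.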

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

Loading the pages prints this warning:

USER_AGENT environment variable not set, consider setting it to identify your requests.

import uuid, os, dotenv

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI

dotenv.load_dotenv("../.env")

llm = AzureChatOpenAI(
    azure_deployment=os.getenv("AZURE_DEPLOYMENT_NAME_GPT35"),
    temperature=0
)

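# The prompt below is in Chinese. It reads: "Summarize the following document: {doc}. Answer in Chinese."
chain = (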
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("总结以下文件:\n\n{doc}, 使用中文回答。")
    | llm
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})
summaries

Output (the first summary came back in English, the second in Chinese):

['This file is a detailed summary of a blog post titled "LLM Powered Autonomous Agents" by Lilian Weng. The post discusses the concept of building agents with LLM (large language model) as its core controller. It covers the components of a LLM-powered autonomous agent system, including planning, memory, and tool use. The post also includes case studies and proof-of-concept examples, as well as challenges and references related to the topic. The post provides a comprehensive overview of the potential of LLM-powered autonomous agents.',
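 '该文件总结了有关高质量人类数据的思考。作者指出高质量的数据对于现代深度学习模型训练至关重要，而大部分任务特定的标记数据来自人类注释。文件讨论了人类数据收集的操作步骤，包括任务设计、选择和培训评分员以及数据收集和聚合。作者还讨论了“众声之智”概念，即通过众包来进行人类评估，以及评分员之间的一致性和不一致性。文件还介绍了一些方法来衡量数据质量，包括影响函数、训练过程中的预测变化和嘈杂的交叉验证。最后，文件提供了一些引用和参考文献，以及作者的联系方式和网站链接。']

In English, the second summary reads: "This document summarizes thoughts on high-quality human data. The author points out that high-quality data is essential for training modern deep learning models, and that most task-specific labeled data comes from human annotation. The document discusses the operational steps of human data collection, including task design, selecting and training raters, and collecting and aggregating data. The author also discusses the 'wisdom of the crowd' concept, that is, human evaluation through crowdsourcing, as well as agreement and disagreement among raters. The document also introduces several methods for measuring data quality, including influence functions, prediction changes during training, and noisy cross-validation. Finally, the document provides citations and references, along with the author's contact details and website links."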

from langchain.storage import InMemoryByteStore
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_openai import AzureOpenAIEmbeddings

# Instantiate the embedding model, AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
    model=os.environ.get("AZURE_EMBEDDING_TEXT_MODEL")
)

# Vector store that indexes the child chunks (here, the summaries)
vectorstore = Chroma(collection_name="summaries",embedding_function=embeddings)

# Storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever: MultiVectorRetriever links each small chunk to its larger parent chunk.
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Link each summary to its document through the shared doc_id
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Running this cell emits a LangChainDeprecationWarning: the class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 0.4. An updated version of the class exists in the langchain-chroma package; to use it, run `pip install -U langchain-chroma` and import it as `from langchain_chroma import Chroma`.
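
To make the mechanism concrete, here is a library-free sketch of the same idea: search runs over the summaries, but the caller receives the full parent document linked by a shared id. Everything in it is an assumption for illustration: the class name `TinyMultiVectorRetriever` is hypothetical, a bag-of-words vector stands in for the embedding model, and a plain dict plays the role of the document store.

```python
import math
import re
import uuid
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts instead of a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyMultiVectorRetriever:
    def __init__(self):
        self.index = []      # (summary embedding, doc_id) pairs, searched at query time
        self.docstore = {}   # doc_id -> full parent document, returned to the caller

    def add(self, summary: str, parent: str) -> str:
        doc_id = str(uuid.uuid4())
        self.index.append((embed(summary), doc_id))
        self.docstore[doc_id] = parent
        return doc_id

    def invoke(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.index, key=lambda item: cosine(q, item[0]), reverse=True)
        # Follow the doc_id of each matching summary back to its parent document.
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


retriever_sketch = TinyMultiVectorRetriever()
retriever_sketch.add("Summary of a post on LLM agents: planning, memory, and tool use.",
                     "FULL TEXT of the agent post ...")
retriever_sketch.add("Summary of a post on human data quality and annotation.",
                     "FULL TEXT of the data quality post ...")
print(retriever_sketch.invoke("memory in agents"))
```

The short summary is what gets matched, while the long original is what reaches the model, which is the point of keeping two representations per document.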

# Use vector search to find the relevant summary:
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query,k=1)
print(sub_docs[0])

Document(metadata={'doc_id': '5478665d-56b4-4296-88a4-1404503f979e'}, page_content='This file is a detailed summary of a blog post titled "LLM Powered Autonomous Agents" by Lilian Weng. The post discusses the concept of building agents with LLM (large language model) as its core controller. It covers the components of a LLM-powered autonomous agent system, including planning, memory, and tool use. The post also includes case studies and proof-of-concept examples, as well as challenges and references related to the topic. The post provides a comprehensive overview of the potential of LLM-powered autonomous agents.')

# invoke replaces get_relevant_documents, which is deprecated since langchain-core 0.1.46
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]

Chroma logs the following line during the search, because 4 results are requested while the index holds only the two summaries:

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2

Output:

"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLil'Log\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPosts\n\n\n\n\nArchive\n\n\n\n\nSearch\n\n\n\n\nTags\n\n\n\n\nFAQ\n\n\n\n\nemojisearch.app\n\n\n\n\n\n\n\n\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\n \n\n\nTable of Contents\n\n\n\nAgent System Overview\n\nComponent One: Planning\n\nTask Decomposition\n\nSelf-Reflection\n\n\nComponent Two: Memory\n\nTypes of Memory\n\nMaximum Inner Product Search (MIPS)\n\n"

ColBERT

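ColBERT generates a context-aware vector for every token in a passage.

ColBERT likewise generates a vector for every token in the query.

The score of each document is then the sum, over all query embeddings, of the maximum similarity between that query embedding and any of the document's embeddings.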
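
Written out, the scoring rule just described is score(q, d) = sum over query tokens i of max over document tokens j of sim(q_i, d_j). The sketch below computes it with plain lists, assuming the dot product as the similarity and using toy 2-D vectors in place of real token embeddings. The names `dot` and `maxsim_score` are hypothetical.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """For each query token take its best-matching document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings: two query tokens, two tokens per document.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]
doc_b = [[0.0, 1.0], [0.0, -1.0]]

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
best = max(scores, key=scores.get)
print(scores, best)
```

Here doc_a wins because both query tokens find a close match in it, while doc_b only matches the second query token.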
