Introduction
In a RAG system, effective information retrieval begins with breaking large documents into smaller, more manageable units that can be indexed, embedded, and searched. This process, known as document splitting, is critical because both language models and vector databases operate under context-length and memory constraints. How a document is split directly affects retrieval quality: chunks that are too small can lose important context, while chunks that are too large can dilute relevance or exceed model limits. In this chapter, we explore a range of document splitting techniques, from simple fixed-length and sliding-window methods to more advanced semantic, structural, and hierarchical splitting. By understanding these strategies, we can design chunks that balance coherence with granularity, laying a solid foundation for accurate, context-rich retrieval in RAG systems.
Structure
This chapter covers the following topics:
- Software requirements
- N-gram text splitter
- Topic-based document splitting
- Regex-based splitting
- Markdown text splitting
- Grouping and splitting documents by metadata
- Time-based splitting
- HTML tag-based splitting
- Table-based splitting
- Page-based splitting
- Custom separator splitting
Objectives
By the end of this chapter, the reader will have a clear understanding of how to split documents effectively for RAG systems, ensuring that both retrieval and generation are grounded in well-structured, contextually meaningful units of text. By exploring different splitting strategies, from fixed-length and overlapping chunking to semantic and structure-aware approaches, the chapter highlights how each method affects retrieval accuracy, context preservation, and overall system performance. The goal is to equip practitioners to select or design splitting techniques that balance efficiency with context preservation, ultimately enabling RAG systems to retrieve more relevant evidence and generate responses that are both precise and well grounded.
Text splitting strategies generally fall into two broad categories: structural splitting and semantic splitting.
Structural splitting methods (for example, Markdown-based, regex-based, or page-based splitters) preserve the document's original organization. They rely on visible cues such as headings, sections, or page boundaries to divide text logically according to the document's format.
Semantic splitting methods (for example, topic-based or embedding-driven approaches), by contrast, focus on meaning rather than structure. They reorganize or divide text according to conceptual similarity, topical coherence, or shifts in context rather than the document's layout. Understanding this distinction helps the reader choose the right approach: structural methods suit well-formatted documents, while semantic methods work better for unstructured or concept-dense text, where meaning matters more than formatting.
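As a minimal sketch of this contrast (illustrative only: the "semantic" pass below uses a toy word-overlap score rather than the embedding-based methods introduced later in this chapter), the same short text can be split both ways:

```python
import re

text = """## Intro
RAG retrieves documents. RAG improves answers.
## Setup
Install Python. Python runs the pipeline."""

# Structural split: cut at the visible '## ' heading markers
structural_chunks = [c.strip() for c in re.split(r"(?=^## )", text, flags=re.MULTILINE) if c.strip()]

# Semantic split (toy): start a new chunk when consecutive sentences share no words
def words(s):
    return set(re.findall(r"[a-z]+", s.lower()))

body = re.sub(r"^## .*$", "", text, flags=re.MULTILINE).strip()
sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", body) if s.strip()]
semantic_chunks = [[sentences[0]]]
for prev, cur in zip(sentences, sentences[1:]):
    if words(prev) & words(cur):
        semantic_chunks[-1].append(cur)
    else:
        semantic_chunks.append([cur])

print(structural_chunks)
print(semantic_chunks)
```

Here both passes happen to find the same two groups, but for different reasons: the structural pass sees heading markers, while the semantic pass sees that the sentences stop sharing vocabulary.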
This chapter contains sets of recipes demonstrating both structural and semantic splitting methods.
Software requirements
Each concept in this book is followed by a corresponding recipe, that is, runnable code written in Python. Throughout the recipes you will find code comments that explain what each line of code does.
The following software environment is required to run the recipes:
- System configuration: a system with at least 16.0 GB of RAM
- Operating system: Windows
- Python: Python 3.13.3 or later
- LangChain: 1.0.5
- LLM model: Ollama's llama3.2:3b
- Program input files: the input files used in the programs are available in the book's Git repository
To run a program, execute the Python command pip install <packages name> to install the packages mentioned in the recipe. Once installed, run the Python script (.py file) mentioned in the recipe in your development environment.
Figure 3.1 shows the document splitting process:
Figure 3.1: The document splitting process
N-gram text splitter
This recipe uses NGramTextSplitter, a custom class that extends LangChain's TextSplitter. Based on the concept of n-grams, it splits text into overlapping chunks and demonstrates how to implement custom splitting logic that is compatible with the LangChain framework. An n-gram is a contiguous sequence of n items (words, characters, or tokens) drawn from a given text.
For example, consider the following sentence:
RAG with Python
Splitting it into grams of size 2 (n=2) produces:
RAG with, with Python
Using NGramTextSplitter in RAG helps with the following:
- Generating overlapping chunks of text based on the n-gram size;
- Better preserving semantics across chunks.
Together, these two benefits allow NGramTextSplitter to improve retrieval granularity during the retrieval process.
The key parameters of the n-gram text splitter are:
- n: the size of the n-gram, that is, the number of words per chunk (for example, 3 for trigrams).
- overlap: the number of words that consecutive chunks share, which helps carry context across chunk boundaries.
A richer implementation could also expose options such as a custom word separator, whether to keep the separator in the output chunks, or whether to prepend a start token (for example, [START]) to each chunk; the recipe below keeps the interface minimal.
The following are some basic reasons for splitting documents in a RAG pipeline:
- LLMs have token limits, and consuming more tokens than necessary increases usage cost.
- Chunking supports efficient storage, retrieval, and semantic matching, and makes content easier to manage throughout the RAG process.
- Sensible chunking improves retrieval quality.
Let us look at some common challenges:
- If chunks are too small, context is lost.
- If chunks are too large, they exceed token limits.
- Poorly chosen overlap or incorrect sentence boundaries lose word-level semantics.
Recipe 27
This recipe demonstrates how to use NGramTextSplitter to split text into n-grams:
Prepare a sample input to be split into n-grams.
Initialize the n-gram splitter with the desired parameters. Here, n=8 means each chunk will contain 8 words, and overlap=3.
Split the sample text into n-grams. The create_documents method returns a list of Document objects, each containing a chunk of text generated according to the n-gram parameters.
Print the resulting n-gram chunks. Each chunk is printed with its content.
Install the required dependencies:
pip install langchain
ngram_splitter_usage_example.py
Refer to the following code:
# ngram_splitter_usage_example.py
# It demonstrates how to use the NGramTextSplitter to split text into
# n-grams.
from ngram_splitter import NGramTextSplitter
# 1. Sample text to be split into n-grams
sample_text = (
    "Retrieval Augmented Generation (RAG) is an architecture that combines "
    "the ability of large language models (LLMs) with a retrieval system to "
    "enhance the factual accuracy, contextual relevance, and quality of "
    "generated response against the query raised by user to a RAG system."
)
# 2. Initialize the n-gram splitter with desired parameters
# Here, n=8 means each chunk will have 8 words, and overlap=3
splitter = NGramTextSplitter(n=8, overlap=3)
# 3. Split the sample text into n-grams
# The create_documents method returns a list of Document objects;
# each Document contains a chunk of text as specified by the n-gram
# parameters
docs = splitter.create_documents([sample_text])
# 4. Print the resulting n-gram chunks
# Each chunk is printed with its content
for i, doc in enumerate(docs, 1):
    print(f"\n--- Chunk {i} ---\n{doc.page_content}")
Refer to the following ngram_splitter.py program, which is used by ngram_splitter_usage_example.py:
# ngram_splitter.py
# This module defines a custom text splitter that divides text into
# n-grams with overlap.
from typing import List
from langchain.schema import Document
from langchain.text_splitter import TextSplitter
# n-gram text splitter class
class NGramTextSplitter(TextSplitter):
    # Initialize with n-gram size and overlap
    def __init__(self, n: int = 10, overlap: int = 2):
        super().__init__()
        self.n = n
        self.overlap = overlap
    # Split text into n-grams with the specified overlap
    def split_text(self, text: str) -> List[str]:
        words = text.split()
        chunks = []
        step = self.n - self.overlap
        for i in range(0, len(words) - self.n + 1, step):
            chunk = " ".join(words[i:i + self.n])
            chunks.append(chunk)
        return chunks
    # Create LangChain-style documents from the texts
    def create_documents(self, texts: List[str], metadata: List[dict] = None) -> List[Document]:
        documents = []
        metadata = metadata or [{} for _ in texts]
        for text, meta in zip(texts, metadata):
            splits = self.split_text(text)
            for chunk in splits:
                documents.append(Document(page_content=chunk, metadata=meta))
        return documents
Output:
--- Chunk 1 ---
Retrieval Augmented Generation (RAG) is an architecture that
--- Chunk 2 ---
an architecture that combines the ability of large
--- Chunk 3 ---
ability of large language models (LLMs) with a
--- Chunk 4 ---
(LLMs) with a retrieval system to enhance the
--- Chunk 5 ---
to enhance the factual accuracy, contextual relevance, and
--- Chunk 6 ---
contextual relevance, and quality of generated response against
--- Chunk 7 ---
generated response against the query raised by user
Topic-based document splitting
Topic-based splitting is an advanced document segmentation technique that divides content by semantic coherence rather than by length or structure. In this approach, each chunk represents a distinct concept, theme, or topic.
Instead of cutting a document arbitrarily at fixed lengths or structural breaks, this approach splits it according to the topics in the content, ensuring that no chunk loses the context of the topic it belongs to.
Topic-based splitting is a semantically aware strategy that improves how documents are prepared for retrieval. It aligns well with the strengths of modern LLMs and ensures the knowledge fed into the system is relevant, coherent, and well structured.
Let us look at some scenarios where it applies:
- When topics matter more than the order of content;
- When you want to improve retrieval accuracy through concept matching;
- When you use dense vector embeddings that rely on content similarity.
The following guidelines help make topic splitting efficient:
- Section headings (Markdown, DOCX, HTML): use explicit headings as indicators of a new topic;
- Keyword grouping: group paragraphs or sentences that share the same keywords/topics;
- Heading tags (such as <h2>, ##, and so on): especially useful in HTML or Markdown documents;
- Annotation: tag the documents.
Recipe 28
This recipe demonstrates how to split text according to the topics in its content. It is useful when you want each chunk to form a coherent section around a single idea.
For brevity and clarity, this recipe uses a hard-coded list of cluster centers and a fixed similarity threshold of 0.5. In real-world applications, you may want to replace this with a more dynamic, data-driven clustering method, such as K-means or agglomerative clustering, for better adaptability and robustness. The steps are as follows:
Initialize the SentenceTransformer model. This model is used to encode sentences into embedding vectors.
Prepare the sample text to be split into topic chunks. The text contains multiple topics that we want to identify and separate.
Split the text into sentences. A simple newline-based split is used here, but you can use more sophisticated methods.
Encode the sentences to obtain embeddings. Each sentence is converted into a vector representation so that similarity between sentences can be measured.
Compute the cosine similarity matrix. Each cell (i, j) in this matrix holds the similarity between sentences i and j.
Cluster the sentences based on similarity. Here we use a simple heuristic to group the sentences into 3 topics.
Define a center for each topic based on sentence indices.
Iterate over each center and assign sufficiently similar sentences to the corresponding topic.
Create the topic-based chunks. Each cluster represents a topic, and the sentences in each cluster are joined into a chunk. Finally, print the topic chunks.
Install the required dependencies:
pip install langchain sentence-transformers
topic_based_splitting.py
Refer to the following code:
# topic_based_splitting.py
# This script demonstrates how to split text into topic-based chunks
# using sentence embeddings and clustering.
"""This recipe uses a hard-coded list of cluster centers and a fixed
similarity threshold of 0.5 for simplicity and clarity.
In real-world applications, you may want to replace this with a
more dynamic and data-driven clustering method such as K-Means
or Agglomerative Clustering to achieve better adaptability and
robustness."""
from sentence_transformers import SentenceTransformer, util
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="torch")
# 1. Initialize the SentenceTransformer model
# This model will be used to encode sentences into embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Sample text to be split into topic-based chunks
# This text contains multiple topics that we want to identify and split
text = """
LangChain is a powerful framework for working with large language models.
It provides tools for loading documents, creating embeddings, and building retrieval pipelines.
You can integrate it with FAISS, Chroma, and other vector stores.
Transformers are deep learning models that understand language context.
Popular models include BERT, GPT, and T5.
They are used for text generation, classification, and summarization.
Python is a popular programming language.
It is used in machine learning, web development, and automation.
Python supports libraries like Pandas, NumPy, and Scikit-learn.
"""
# 3. Split text into sentences
# This is a simple split by new lines, but you can use more
# sophisticated methods if needed
sentences = [s.strip() for s in text.strip().split("\n") if s.strip()]
# 4. Encode sentences to get embeddings
# Each sentence is converted into a vector representation so that
# similarity between sentences can be measured
embeddings = model.encode(sentences)
# 5. Calculate the cosine similarity matrix
# Each cell (i, j) holds the similarity between sentences i and j
similarity_matrix = util.cos_sim(embeddings, embeddings).numpy()
# 6. Cluster sentences based on similarity
# Here we use a simple heuristic to group sentences into 3 topics
clusters = [[] for _ in range(3)]
assigned = set()
# 7. Define centers for each topic based on sentence indices
centers = [0, 3, 6]  # sentence indices starting each topic block
# 8. Iterate over each center and assign sentences that are
# similar enough
for i, center in enumerate(centers):
    for j in range(len(sentences)):
        if j not in assigned and similarity_matrix[center][j] > 0.5:
            clusters[i].append(sentences[j])
            assigned.add(j)
# 9. Create topic-based chunks
# Each cluster represents a topic; sentences in a cluster are joined
# to form a chunk
topic_chunks = [" ".join(cluster) for cluster in clusters]
# 10. Print the topic-based chunks
for idx, chunk in enumerate(topic_chunks, 1):
    print(f"--- Topic Chunk {idx} ---\n{chunk}\n")
Output:
--- Topic Chunk 1 ---
LangChain is a powerful framework for working with large language models.
--- Topic Chunk 2 ---
Transformers are deep learning models that understand language context.
--- Topic Chunk 3 ---
Python is a popular programming language. Python supports libraries like Pandas, NumPy, and Scikit-learn.
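As noted above, the hard-coded centers can be replaced with a data-driven clustering step. The following sketch uses a small, self-contained k-means over hand-written 2-D vectors that stand in for sentence embeddings; in a real pipeline you would cluster the embeddings array returned by model.encode(sentences) instead, or simply use scikit-learn's KMeans:

```python
def dist2(a, b):
    # Squared Euclidean distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    # Farthest-point initialization: deterministic and well spread out
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Move each center to the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels

sentences = [
    "LangChain builds retrieval pipelines.", "LangChain loads documents.",
    "Transformers model language context.", "BERT is a transformer.",
    "Python is a popular language.", "Python has many libraries.",
]
# Hand-written 2-D vectors standing in for sentence embeddings
embeddings = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0], [9.9, 0.2]]
labels = kmeans(embeddings, k=3)
clusters = [[] for _ in range(3)]
for sent, label in zip(sentences, labels):
    clusters[label].append(sent)
topic_chunks = [" ".join(cluster) for cluster in clusters if cluster]
print(topic_chunks)
```

Unlike the fixed centers = [0, 3, 6] of the recipe, this variant adapts to however the sentences actually group in embedding space, at the cost of having to choose k.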
Regex-based splitting
Regex-based splitting uses regular expressions (regex) to divide a document into chunks according to specific patterns in the text. Unlike size-based or semantic splitting, this approach splits the document at recognizable, recurring symbols or formatting, such as dates, headings, or bullet points.
It is especially useful for structured or semi-structured documents whose content boundaries follow recognizable patterns.
Some of its applications include legal documents, transcript splitting, meeting minutes, bulleted lists, and paginated reports.
Let us look at some advantages of regex-based splitting:
- It allows us to implement custom splitting;
- It can handle non-standard formats;
- It works well for logs, transcripts, lists, and forms;
- It is compatible with OCR or parsing preprocessing pipelines.
The process of regex-based splitting is as follows:
- Define a regex pattern: choose a pattern that matches the split points. For example:
  - ^[6-9]\d{9}$: for phone numbers
  - #\w+: for hashtags
  - [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}: for email addresses
- Search the text: the system searches the document for every occurrence of the chosen pattern.
- Split at the matches: the text is divided at the positions where the regex matches, producing cleanly bounded chunks around the recognized formatting.
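The three steps above can be sketched with Python's built-in re module; the dated-log format here is made up for illustration:

```python
import re

# A made-up dated log; the date lines act as split points
log = """2024-01-05
Kickoff meeting notes.
Action items assigned.
2024-01-12
Progress review.
Blockers discussed."""

# 1. Define a regex pattern for the split points (ISO dates on their own line)
date_pattern = r"^\d{4}-\d{2}-\d{2}$"

# 2-3. Search the text and split at every match; the zero-width
# lookahead keeps each date attached to the chunk it introduces
chunks = [c.strip() for c in re.split(rf"(?m)(?={date_pattern})", log) if c.strip()]

for chunk in chunks:
    print(chunk)
    print("---")
```

Wrapping the pattern in a lookahead, rather than splitting on the pattern itself, is what keeps the matched date inside the chunk instead of discarding it.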
Recipe 29
Regex-based document splitting is helpful when your content contains structural delimiters such as headings, bullets, or custom separators, and you want to split around those markers. This recipe demonstrates how to split text with regular expressions using Python's built-in re library. It offers a flexible way to segment text based on custom patterns without relying on LangChain's TextSplitter classes:
Prepare a sample text organized into sections by headers. The text contains multiple sections that we want to identify and split.
Define a regex pattern to match the headers. The pattern matches lines that start with ## followed by the header text.
Create chunks by pairing each header with its content. This produces a list in which every chunk contains a header and its corresponding content.
Print the resulting chunks. Each chunk is printed with a header indicating its index.
Install the required dependencies:
pip install langchain
regex_based_splitting.py
Refer to the following code:
# regex_based_splitting.py
# This script demonstrates how to split text into chunks based on
# regex patterns.
import re
# 1. Sample text to be split into sections based on headers
# This text contains multiple sections that we want to identify and
# split
text = """
## What is RAG?
RAG stands for Retrieval-Augmented Generation. It enhances language
models by retrieving relevant information from external sources before
generating responses.
## Components of RAG
- Retriever: Finds relevant documents.
- Generator: Uses the retrieved context to generate answers.
- Vector Store: Stores document embeddings for efficient search.
## Benefits
- Improved accuracy
- Current information access
- Cost-effective context handling
"""
# 2. Define a regex pattern to match headers
# This pattern matches lines that start with '## ' followed by the
# header text
pattern = r"^##\s+(.*)$"
sections = re.split(pattern, text, flags=re.MULTILINE)
# 3. Create chunks by pairing headers with their content
# This will create a list of chunks where each chunk contains a
# header and its corresponding content
chunks = []
for i in range(1, len(sections), 2):
    header = sections[i].strip()
    content = sections[i + 1].strip()
    chunks.append(f"{header}\n{content}")
# 4. Print the resulting chunks
# Each chunk is printed with a header indicating its index
for i, chunk in enumerate(chunks, 1):
    print(f"\n--- Chunk {i} ---\n{chunk}")
Output:
--- Chunk 1 ---
What is RAG?
RAG stands for Retrieval-Augmented Generation. It enhances language models by retrieving relevant information from external sources before generating responses.
--- Chunk 2 ---
Components of RAG
- Retriever: Finds relevant documents.
- Generator: Uses the retrieved context to generate answers.
- Vector Store: Stores document embeddings for efficient search.
--- Chunk 3 ---
Benefits
- Improved accuracy
- Current information access
- Cost-effective context handling
Markdown text splitting
Markdown-based splitting is a document chunking technique that uses Markdown syntax (for example, headings #, ##, and so on) to divide content into meaningful sections. Because much technical documentation, as well as blogs, instructions, and knowledge base articles, is written in Markdown, this technique preserves semantic boundaries and improves retrieval relevance in RAG systems.
Recipe 30
This script demonstrates how to split Markdown text into chunks based on headers using regular expressions. It is useful for processing structured text files:
Prepare a sample Markdown text with headers that we want to split into chunks.
Use regex to split the Markdown text into chunks based on the headers. The pattern matches headers starting with ## or ###.
Post-process the chunks to clean up whitespace and ensure they are not empty.
Print only the non-empty chunks.
No additional packages need to be installed.
markdown_header_splitting.py
Refer to the following code:
# This script demonstrates how to split markdown text into chunks
# based on headers using regex.
# It is useful for processing structured text files.
import re
# 1. This is a sample markdown text with headers that we want to
# split into chunks.
markdown_text = """
# RAG with Python
This explains use of RAG
## What is RAG?
RAG (Retrieval-Augmented Generation) enhances LLMs by injecting external information
into prompts.
## Components
Components of RAG
### Retriever
This fetches relevant documents from a knowledge base.
### Generator
The LLM uses retrieved docs to answer the query.
## Use Cases
- Customer support bots
- Legal document assistants
- Research assistants
"""
# 2. Use regex to split the markdown text into chunks based on headers.
# The pattern matches headers starting with '##' or '###'.
header_pattern = r'(?=^#{2,3} .*)'
chunks = re.split(header_pattern, markdown_text, flags=re.MULTILINE)
# 3. Post-process the chunks to clean up whitespace and ensure they
# are not empty
for i, chunk in enumerate(chunks, 1):
    cleaned = chunk.strip()
    # 4. Only print non-empty chunks
    if cleaned:
        print(f"\n--- Chunk {i} ---\n{cleaned}")
Output:
--- Chunk 1 ---
# RAG with Python
This explains use of RAG
--- Chunk 2 ---
## What is RAG?
RAG (Retrieval-Augmented Generation) enhances LLMs by injecting external information into prompts.
--- Chunk 3 ---
## Components
Components of RAG
--- Chunk 4 ---
### Retriever
This fetches relevant documents from a knowledge base.
--- Chunk 5 ---
### Generator
The LLM uses retrieved docs to answer the query.
--- Chunk 6 ---
## Use Cases
- Customer support bots
- Legal document assistants
- Research assistants
Grouping and splitting documents by metadata
Metadata-based document splitting is a technique that divides documents into chunks according to the metadata fields assigned to them (for example, source, author, date, topic, and so on).
Recipe 31
This script demonstrates how to split documents using their metadata. The technique lets you split documents according to metadata such as source, title, or other tags. It is useful when you want chunking to follow document attributes:
Create a list of sample documents with different metadata.
Initialize the RecursiveCharacterTextSplitter with the desired parameters. Here, chunk_size=60 means each chunk contains at most 60 characters.
Group the documents by their source metadata.
Split each document into chunks, group by group. This ensures that chunks from the same source are processed together, while producing smaller chunks that preserve context.
Print the resulting chunks with their metadata. Each chunk retains the source metadata so that context is preserved.
Install the required dependencies:
pip install langchain
metadata_based_document_splitting.py
Refer to the following code:
# metadata_based_document_splitting.py
# This script demonstrates how to split documents based on their
# metadata using LangChain's RecursiveCharacterTextSplitter.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
# 1. Create a list of sample documents with different metadata
docs = [
    Document(
        page_content="RAG combines retrieval with generation for better answers.",
        metadata={"source": "rag_with_python_cookbook.txt"}
    ),
    Document(
        page_content="LangChain supports many loaders and chunking strategies.",
        metadata={"source": "langchain_guide.md"}
    ),
    Document(
        page_content="You can use FAISS or Chroma for vector storage.",
        metadata={"source": "vector_stores.pdf"}
    ),
]
# 2. Initialize the RecursiveCharacterTextSplitter with desired
# parameters
# Here, chunk_size=60 means each chunk will have a maximum of
# 60 characters
splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=0)
# 3. Group the documents by their source metadata
grouped_by_source = {}
for doc in docs:
    source = doc.metadata["source"]
    grouped_by_source.setdefault(source, []).append(doc)
# 4. Split each document into chunks, group by group
# This ensures that chunks from the same source are
# processed together
all_chunks = []
for source, group in grouped_by_source.items():
    for doc in group:
        chunks = splitter.split_documents([doc])
        all_chunks.extend(chunks)
# 5. Print the resulting chunks with their metadata
# Each chunk retains the source metadata for context
for i, chunk in enumerate(all_chunks, 1):
    print(f"\n--- Chunk {i} (Source: {chunk.metadata['source']}) ---")
    print(chunk.page_content)
Output:
--- Chunk 1 (Source: rag_with_python_cookbook.txt) ---
RAG combines retrieval with generation for better answers.
--- Chunk 2 (Source: langchain_guide.md) ---
LangChain supports many loaders and chunking strategies.
--- Chunk 3 (Source: vector_stores.pdf) ---
You can use FAISS or Chroma for vector storage.
Time-based splitting
Time-based document splitting is a technique that divides content into chunks according to timestamps, time intervals, or dates embedded in the document.
It is especially useful for transcripts, meeting logs, chat history, call center data, or event logs, where each part of the content is tied to a point in time and organizing content by time is often meaningful.
Recipe 32
This script demonstrates time-based document splitting. The approach splits content by its timestamps and is particularly useful for meeting transcripts, lecture recordings, or podcasts.
Prepare a sample transcript with time metadata (in seconds).
Define the chunk duration in seconds. Here, we create chunks of 2 minutes (120 seconds) each.
Iterate through the documents and group them into chunks based on the defined duration.
Print the resulting chunks with their start-time metadata. Each chunk retains its start time so that context is preserved.
Install the required dependencies:
pip install langchain
time_based_splitting.py
Refer to the following code:
# time_based_splitting.py
# This script demonstrates how to split documents based on time metadata
from langchain.schema import Document
from datetime import timedelta
# 1. Sample transcript with time metadata (in seconds)
docs = [
    Document(page_content="Intro to RAG and its use cases.", metadata={"timestamp": 0}),
    Document(page_content="LangChain framework and components.", metadata={"timestamp": 60}),
    Document(page_content="Loading and splitting documents.", metadata={"timestamp": 130}),
    Document(page_content="Creating embeddings using HuggingFace.", metadata={"timestamp": 190}),
    Document(page_content="Indexing and retrieval techniques.", metadata={"timestamp": 250}),
    Document(page_content="Building a full RAG pipeline.", metadata={"timestamp": 310}),
]
# 2. Define the chunk duration in seconds
# Here, we create chunks of 2 minutes (120 seconds)
chunk_duration = 120
chunks = []
current_chunk = []
current_start_time = 0
# 3. Iterate through the documents and group them into chunks based on
# the defined duration
for doc in docs:
    if doc.metadata["timestamp"] < current_start_time + chunk_duration:
        current_chunk.append(doc)
    else:
        # Close the current chunk
        combined_text = "\n".join([d.page_content for d in current_chunk])
        chunks.append(Document(page_content=combined_text, metadata={"start_time": str(timedelta(seconds=current_start_time))}))
        # Start a new chunk
        current_start_time += chunk_duration
        current_chunk = [doc]
# Add the last chunk, if any
if current_chunk:
    combined_text = "\n".join([d.page_content for d in current_chunk])
    chunks.append(Document(page_content=combined_text, metadata={"start_time": str(timedelta(seconds=current_start_time))}))
# 4. Print the resulting chunks with their start time metadata
# Each chunk retains the start time metadata for context
for i, chunk in enumerate(chunks, 1):
    print(f"\n--- Chunk {i} (Start Time: {chunk.metadata['start_time']}) ---")
    print(chunk.page_content)
Output:
--- Chunk 1 (Start Time: 0:00:00) ---
Intro to RAG and its use cases.
LangChain framework and components.
--- Chunk 2 (Start Time: 0:02:00) ---
Loading and splitting documents.
Creating embeddings using HuggingFace.
--- Chunk 3 (Start Time: 0:04:00) ---
Indexing and retrieval techniques.
Building a full RAG pipeline.
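One caveat about the recipe's loop: it opens a new chunk only when a timestamp crosses the current window and advances the window by a single step, so a transcript with a gap longer than one window can end up with mislabeled start times. A more robust variant (a sketch, not part of the recipe) derives each entry's bucket directly from its timestamp by integer division:

```python
from datetime import timedelta

# (timestamp_in_seconds, text) pairs, adapted from the recipe's data
# with a long gap after the 130 s entry
entries = [
    (0, "Intro to RAG and its use cases."),
    (60, "LangChain framework and components."),
    (130, "Loading and splitting documents."),
    (400, "Building a full RAG pipeline."),
]

chunk_duration = 120
buckets = {}
for ts, text in entries:
    # The bucket index depends only on the timestamp itself,
    # not on where the previous chunk happened to end
    buckets.setdefault(ts // chunk_duration, []).append(text)

chunks = [
    (str(timedelta(seconds=idx * chunk_duration)), "\n".join(texts))
    for idx, texts in sorted(buckets.items())
]
for start, body in chunks:
    print(f"--- Chunk (Start Time: {start}) ---\n{body}")
```

Here the 400-second entry lands in the window starting at 0:06:00, whereas the sequential loop would have labeled it 0:02:00.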
HTML tag-based splitting
HTML tag-based splitting divides an HTML document according to specific HTML elements such as <p>, <h1>, <div>, <section>, or even custom tags.
Recipe 33
This script demonstrates how to split an HTML document based on HTML tags such as <div>, <p>, <h1>, and so on. It helps extract the desired content as a list, guided by the document's structural tags:
Create a sample HTML document. The HTML contains various tags that we want to extract and split into chunks.
Parse the HTML and extract text from the specified tags. Here we extract text from the <h1> and <p> tags.
Split the extracted text into chunks using RecursiveCharacterTextSplitter, cutting the documents into smaller pieces.
Print the resulting chunks. Each chunk retains the text extracted from the HTML tags.
Install the required dependencies:
pip install langchain beautifulsoup4
html_tag_based_splitting.py
Refer to the following code:
# html_tag_based_splitting.py
# This code demonstrates how to split HTML content into chunks using
# LangChain's text splitter.
# It extracts text from HTML tags and splits it into
# manageable chunks.
from langchain_core.documents import Document
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter
# 1. Sample HTML document
# This HTML contains various tags that we want to extract and
# split into chunks.
html_doc = """
<html>
<body>
<h1>RAG Pipeline</h1>
<p>Load → Split → Embed → Retrieve → Generate</p>
<p>Used in chatbots, document search, and more.</p>
</body>
</html>
"""
# 2. Parse the HTML and extract text from specific tags
# Here, we will extract text from <h1> and <p> tags.
soup = BeautifulSoup(html_doc, "html.parser")
paragraphs = soup.find_all(["h1", "p"])
documents = [Document(page_content=tag.get_text()) for tag in paragraphs]
# 3. Split the extracted text into chunks
# Using RecursiveCharacterTextSplitter to split the documents into
# smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_documents(documents)
# 4. Print the resulting chunks
# Each chunk retains the text extracted from the HTML tags.
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")
Output:
Chunk 1:
RAG Pipeline
Chunk 2:
Load → Split → Embed → Retrieve → Generate
Chunk 3:
Used in chatbots, document search, and more.
Table-based splitting
Table-based splitting extracts tabular data (from HTML, PDF, Excel, or Markdown) and divides it into separate chunks for downstream tasks such as embedding and retrieval, which is particularly useful in RAG pipelines.
Recipe 34
This recipe shows how to implement table-based splitting:
Prepare sample tabular data. The data contains product information that we want to convert into documents.
Create the sample data as a pandas DataFrame.
Convert each row of the DataFrame into a Document. Each document contains a product name and description.
Print the resulting documents. Each document represents a product with its description.
Install the required dependencies:
pip install langchain pandas
table_based_splitting.py
Refer to the following code:
# table_based_splitting.py
# This code demonstrates how to split documents based on tabular data
import pandas as pd
from langchain.schema import Document
# 1. Sample tabular data
# This data contains product information that we want to convert
# into documents.
# 2. The sample data is created as a pandas DataFrame.
data = pd.DataFrame({
    "Product": ["A", "B", "C"],
    "Description": ["Book A", "Book B", "Book C"]
})
# 3. Convert each row of the DataFrame into a Document
# Each document contains the product name and description.
documents = [Document(page_content=f"{row.Product}: {row.Description}") for row in data.itertuples()]
# 4. Print the resulting documents
# Each document represents a product with its description.
for i, doc in enumerate(documents):
    print(f"Chunk {i+1}:\n{doc.page_content}\n")
Output:
Chunk 1:
A: Book A
Chunk 2:
B: Book B
Chunk 3:
C: Book C
Page-based splitting
Page-based splitting divides a PDF into individual pages, with each page becoming a separate document chunk. This is especially useful for large, multi-topic PDFs such as reports, books, manuals, legal documents, and meeting minutes.
Recipe 35
This recipe shows how to implement page-based splitting:
Load the PDF file. Make sure the path to the PDF file is correct.
Print the number of pages loaded and show a preview of each page's content. This helps verify that the PDF has been split correctly into individual pages.
Display the content of each page. Here we print the first 100 characters of each page as a preview.
Install the required dependencies:
pip install langchain pypdf
splitting_pdf_into_pages.py
Refer to the following code:
# splitting_pdf_into_pages.py
# This code demonstrates how to split a PDF into individual pages using
# LangChain's PyPDFLoader.
from langchain_community.document_loaders import PyPDFLoader
# 1. Load the PDF file
# Ensure the PDF file path is correct.
loader = PyPDFLoader("RAG_3pages.pdf")
docs = loader.load()
# 2. Print the number of pages loaded and a preview of each page's
# content
# This will help verify that the PDF has been split correctly into
# individual pages.
print(f"Total Pages: {len(docs)}")
# 3. Display the content of each page
# Here we print the first 100 characters of each page as a preview.
for i, doc in enumerate(docs):
    print(f"\n--- Page {i+1} ---")
    print(doc.page_content[:100])  # print first 100 characters
Output:
--- Page 1 ---
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large
language
--- Page 2 ---
Traditional generative models rely solely on internal parameters for producing responses,
which lim
--- Page 3 ---
Traditional generative models laid the foundation for today’s LLMs. They helped us
understand how t
Custom separator splitting
Custom separator keyword splitting is an approach that splits a document at specific keywords, phrases, or symbols that serve as natural boundaries, for example, Chapter, ## (Markdown headings), ### Section, or ---END---.
Recipe 36
This recipe shows how to implement custom separator splitting:
Define a custom text containing specific keywords to split on. The text is structured to contain multiple chunks.
Split the text into chunks using the custom keywords and create a Document object for each chunk. The text is first split at START, and from each fragment the content up to the next END keyword is extracted.
Print the resulting Document objects. Each Document represents a piece of text between the custom keywords.
Install the required dependencies:
pip install langchain
custom_separator_keywords_splitter.py
Refer to the following code:
# custom_separator_keywords_splitter.py
# This code splits a text into chunks based on custom keywords "START"
# and "END".
from langchain_core.documents import Document
# 1. Define a custom text with specific keywords to split on.
# The text is structured such that it contains multiple chunks
text = "START This is first chunk. END START This is a second chunk. END"
# 2. Split the text into chunks using the custom keywords
# and create Document objects for each chunk.
# The text is split at "START" and then each chunk is processed
# to extract content up to the next "END" keyword.
chunks = text.split("START")
documents = []
for chunk in chunks:
    if "END" in chunk:
        content = chunk.split("END")[0].strip()
        documents.append(Document(page_content=content))
# 3. Print the resulting Document objects
# Each Document represents a chunk of text between the custom keywords.
for i, doc in enumerate(documents, 1):
    print(f"[Chunk {i}] {doc.page_content}")
Output:
[Chunk 1] This is first chunk.
[Chunk 2] This is a second chunk.
Recipe 37
This recipe shows how to split the contents of a JSON file:
Prepare a sample JSON Lines string.
Split the JSON Lines string into individual lines. Parse each line as a JSON object and create a corresponding Document object.
Print the resulting Document objects. Each Document represents a JSON object with its content and metadata.
Install the required dependencies:
pip install langchain
json_line_chunk.py
Refer to the following code:
# json_line_chunk.py
# It demonstrates how to split a JSON Lines formatted string into
# Document objects.
import json
from langchain_core.documents import Document
# 1. Sample JSON Lines string
json_lines = """
{"id": 1, "text": "First entry"}
{"id": 2, "text": "Second entry"}
"""
# 2. Split the JSON Lines string into individual lines,
# parse each line as a JSON object, and create Document objects.
chunks = []
for line in json_lines.strip().splitlines():
    item = json.loads(line)
    chunks.append(Document(page_content=item["text"], metadata={"id": item["id"]}))
# 3. Print the resulting Document objects
# Each Document represents a JSON object with its content and metadata.
for doc in chunks:
    print(f"[ID {doc.metadata['id']}] {doc.page_content}")
Output:
[ID 1] First entry
[ID 2] Second entry
Recipe 38
This recipe shows how to split slide deck content using a custom separator:
Define a sample text with slide separators. The text is structured to contain multiple slides separated by ---.
Split the text into chunks using the custom separator ---. Each chunk corresponds to one slide of the presentation.
Print the resulting Document objects. Each Document represents a slide, with metadata indicating the slide number.
Install the required dependencies:
pip install langchain
slide_deck_splitting.py
Refer to the following code:
# slide_deck_splitting.py
from langchain_core.documents import Document
# 1. Define a sample text with slide separators
# The text is structured such that it contains multiple slides
# separated by "---"
text = """Title Slide
---
Overview of Project
---
Results and Discussion"""
# 2. Split the text into chunks using the custom separator "---"
# Each chunk represents a slide in the presentation.
slides = text.split('---')
chunks = [Document(page_content=slide.strip(), metadata={"slide": i + 1}) for i, slide in enumerate(slides)]
# 3. Print the resulting Document objects
# Each Document represents a slide with its content and metadata
# indicating the slide number.
for doc in chunks:
    print(f"[Slide {doc.metadata['slide']}] {doc.page_content}")
Output:
[Slide 1] Title Slide
[Slide 2] Overview of Project
[Slide 3] Results and Discussion
Recipe 39
This recipe shows how to split a transcript based on the speakers in a conversation:
Define a sample transcript with speaker identifiers. The transcript is structured so that each line starts with a speaker's name followed by a colon.
Split the transcript into chunks based on the speaker identifiers. Each chunk represents a message from a specific speaker.
Print the resulting Document objects. Each Document represents a message from a speaker, with metadata indicating the speaker's name.
Install the required dependencies:
pip install langchain
split_based_on_speaker.py
Refer to the following code:
# split_based_on_speaker.py
# This code demonstrates how to split a transcript based on speakers
# and create Document objects for each speaker's message.
from langchain.schema import Document
# 1. Define a sample transcript with speaker identifiers
# The transcript is structured such that each line starts with a
# speaker's name followed by a colon
transcript = """Alice: Let's begin the meeting.
Bob: Sure, the agenda today is RAG implementation.
Alice: Great, I have some updates."""
# 2. Split the transcript into chunks based on speaker identifiers
# Each chunk represents a message from a specific speaker.
chunks = []
for line in transcript.splitlines():
    if ':' in line:
        speaker, message = line.split(':', 1)
        chunks.append(Document(page_content=message.strip(), metadata={"speaker": speaker.strip()}))
# 3. Print the resulting Document objects
# Each Document represents a message from a speaker with its
# content and metadata indicating the speaker's name
for doc in chunks:
    print(f"[{doc.metadata['speaker']}] {doc.page_content}")
Output:
[Alice] Let's begin the meeting.
[Bob] Sure, the agenda today is RAG implementation.
[Alice] Great, I have some updates.
Recipe 40
This recipe shows how to split a DOCX document by paragraph:
Load the .docx file. Replace RAG.docx with the path to your .docx file.
Extract the paragraphs from the document. Each paragraph is treated as a separate chunk.
Create a Document object for each paragraph. Each Document represents a paragraph with its content and metadata.
Print the resulting Document objects. Each Document represents a paragraph, with metadata indicating the source file and paragraph number.
Install the required dependencies:
pip install python-docx langchain
docs_paragraph_splitting.py
Refer to the following code:
# docs_paragraph_splitting.py
# This code demonstrates how to split a .docx file into individual
# paragraphs and create Document objects for each paragraph using the
# langchain library.
from docx import Document
from langchain_core.documents import Document as LangChainDocument
# 1. Load the .docx file
# Replace 'RAG.docx' with the path to your .docx file
doc = Document("RAG.docx")
# 2. Extract paragraphs from the document
# Each paragraph is treated as a separate chunk
paragraphs = [para.text.strip() for para in doc.paragraphs if para.text.strip()]
# 3. Create Document objects for each paragraph
# Each Document represents a paragraph with its content and
# metadata.
split_docs = [
    LangChainDocument(page_content=para, metadata={"source": "RAG.docx", "paragraph_num": idx + 1})
    for idx, para in enumerate(paragraphs)
]
# 4. Print the resulting Document objects
# Each Document represents a paragraph with its content and metadata
# indicating the source and paragraph number.
for i, d in enumerate(split_docs):
    print(f"\nParagraph {i+1}:\n{d.page_content}")
Output:
Paragraph 1:
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Paragraph 2:
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Paragraph 3:
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.
Conclusion
This chapter discussed document splitting techniques. They are a foundational step in building an effective RAG system, because they ensure information is divided into chunks that preserve context while remaining retrievable. Proper splitting directly improves retrieval quality, reduces noise, and sharpens answer precision. Splitting, however, is only one piece of the puzzle; real effectiveness also depends on converting those chunks into rich numerical representations that capture their underlying semantics.
This leads naturally to the next chapter, where we explore how embeddings turn split text into vectors, enabling precise semantic matching and scalable retrieval.