A Python-Based RAG Cookbook: Document Loaders in the RAG Pipeline


Introduction

In a RAG pipeline, the journey of knowledge begins with documents. Raw data exists in many forms: PDF reports, research papers, spreadsheets, slide decks, websites, and even live APIs. Without a structured way to bring this data into the system, retrieval becomes fragmented and unreliable. Document loaders bridge the gap between these diverse data sources and the uniform representation required for embedding and semantic search. They not only extract and normalize text but also enrich it with metadata that supports filtering and context-aware retrieval. This chapter explores the role document loaders play in building an effective RAG pipeline, covering the common loader types, their strengths and limitations, and practical strategies for extending loaders to meet domain-specific needs.

Structure

This chapter covers the following topics:

  • Core components
  • Supported document types
  • Software requirements
  • Loading documents from common source types
  • Metadata extraction and management
  • Preprocessing and text normalization
  • Bulk loading with directory ingestion
  • Custom document loaders

Learning objectives

By the end of this chapter, readers will have a foundational understanding of document loading in RAG pipelines. They will understand the concepts below and learn how to write programs for them, gaining a clearer picture of the role and importance of document loaders in a RAG architecture.

Readers will also learn how to load and convert a variety of document formats (such as PDF, DOCX, TXT, HTML, JSON, CSV, Excel, Markdown, and web URLs) into structured text. The chapter examines how document loaders handle metadata extraction and how they enable advanced filtering and retrieval; it also shows how to implement and extend custom loaders for domain-specific or uncommon formats, and covers best practices for preprocessing before embedding, including text cleaning, splitting, and normalization. Hands-on code examples using LangChain and other tools give readers practical experience in building a robust document ingestion layer.

Core components

The core components of a document loader are described below:

Document: A Document is the fundamental unit of data passed through a RAG pipeline. It encapsulates text content along with contextual information, ready for subsequent embedding, chunking, or retrieval.

Metadata: Metadata is contextual key-value data associated with a document. It enhances filtering and retrieval and provides traceability.

Loader interface: A document loader is a component that implements a standard interface for ingesting external data and converting it into Document objects. It reads file or web content, parses it into text, and returns a List[Document] containing page content and metadata.
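Conceptually, a Document is little more than text plus a metadata dictionary. A minimal, dependency-free sketch of the shape a loader returns (the real class lives in langchain_core.documents; the file name and metadata values here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for the loader's basic data unit: text plus context."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# A loader would build objects like this, one per file, page, row, or section
doc = Document(
    page_content="RAG combines retrieval and generation.",
    metadata={"source": "notes.txt", "page": 1},
)
print(doc.metadata["source"])  # notes.txt
```

Every loader in this chapter ultimately produces a list of such objects, which is why downstream splitting and embedding steps can stay format-agnostic.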

Supported document types

The following document formats are common in real-world RAG pipelines and are supported by document loaders:

  • Plain text
  • Markdown
  • CSV and Excel
  • PDF documents
  • Word documents
  • HTML and web pages
  • Images and scanned files
  • JSON and structured logs

Software requirements

Every concept in this book is followed by a corresponding recipe: runnable code written in Python. Each recipe includes code comments explaining what every line does.

The following software environment is required to run the recipes:

  • System configuration: a machine with at least 16.0 GB of RAM
  • Operating system: Windows
  • Python: Python 3.13.3 or later
  • LangChain: 1.0.5
  • LLM model: Ollama's llama3.2:3b
  • Program input files: the input files used by the programs are available in the book's Git repository

To run a program, execute the Python command pip install <package name> to install the packages mentioned in the recipe. Once installed, run the Python script (.py file) mentioned in the recipe in your development environment.

Figure 2.1 shows the document loader flow:


Figure 2.1: Document loader flow

Loading documents from common source types

This chapter covers loading the following document types:

Plain text: Its structure is simple and requires little complex parsing, making it ideal for logs, transcripts, or notes.

Markdown files: Markdown is a semi-structured format with headings and sections, widely used for technical documentation, knowledge bases, and wikis.

CSV and Excel: Tabular documents usually need to be read row by row or column by column, and suit structured records such as meeting logs, customer tickets, or financial data.

PDF documents: PDFs are very common but complex to parse because of their varied layouts, which require handling multi-column text, headers, footnotes, and so on.

Word documents (DOCX): These rich-text files contain paragraphs, tables, and metadata, and are especially important for contracts, policies, and reports.

HTML and web pages: Web documents often contain rich content and nested structure.

JSON: JSON is typically used when data is stored as structured logs, configuration files, or results returned by external APIs.
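To make the HTML case concrete: the essence of what an HTML loader does is strip tags, skip style and script blocks, and keep only the visible text (BSHTMLLoader, used later in this chapter, does this via BeautifulSoup). A standard-library-only sketch of that idea:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <style>/<script>, the way HTML loaders do."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth counter for style/script nesting

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

parser = TextExtractor()
parser.feed("<html><head><style>h1{color:red}</style></head>"
            "<body><h1>Title</h1><p>Hello world.</p></body></html>")
text = "\n".join(parser.parts)
print(text)
```

The CSS rule never reaches the extracted text, which is exactly the behavior you want before embedding web content.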

The following sections present the related recipes along with their implementation code.

Recipe 15

This recipe shows how to load a Markdown file using LangChain's UnstructuredMarkdownLoader:

Specify the path to the Markdown file and make sure the file exists at that location.

Load the Markdown file. UnstructuredMarkdownLoader parses the Markdown file and converts it into a list of documents.

Print the content of the loaded documents, where each document corresponds to a section of the Markdown file.

Install the required dependencies:

pip install langchain unstructured markdown

load_markdown_file.py

Refer to the following code:

# Load a Markdown file using LangChain's UnstructuredMarkdownLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# 1. Specify the path to your Markdown file
# Replace 'sample_markdown_file.md' with the actual path to your
# Markdown file, if required
# Ensure the file exists in the specified location
file_path = "sample_markdown_file.md"

# 2. Load the Markdown file
# This will create a loader instance and load the content of the file
# The UnstructuredMarkdownLoader will parse the Markdown file and
# convert it into a list of documents
loader = UnstructuredMarkdownLoader(file_path)
docs = loader.load()

# 3. Print the content of the loaded documents
# Each document corresponds to a section in the Markdown file
for i, doc in enumerate(docs):
    print(f"\n--- Document {i+1} ---\n")
    print(doc.page_content)

输入:

sample_markdown_file.md 是本程序使用的输入文件,其内容如下:

# Getting Started with Retrieval-Augmented Generation (RAG)
RAG combines **retrieval** and **generation** to produce answers grounded in external knowledge.

## Prerequisites
Before you begin, make sure you have the following installed:
* Python 3.10 or above
* `langchain`
* `chromadb` or `faiss`

Install dependencies:
```bash
pip install langchain chromadb faiss-cpu
```

Output:

--- Document 1 ---
Getting Started with Retrieval-Augmented Generation (RAG)
RAG combines retrieval and generation to produce answers grounded in external knowledge.
Prerequisites
Before you begin, make sure you have the following installed:
Python 3.10 or above
langchain
chromadb or faiss
Install dependencies:
```bash pip install langchain chromadb faiss-cpu

Recipe 16

This recipe shows how to write a program that loads a CSV file using LangChain's CSVLoader:

Specify the path to the CSV file. Make sure to replace sample_csv_file.csv with the actual path to your CSV file.

Load the CSV file. This creates a loader instance and loads the content of the file.

Load the documents from the CSV file. This parses the CSV and creates a list of documents.

Print the content of the loaded documents.

Install the required dependencies:

pip install langchain pandas

load_csv_file.py

Refer to the following code:

# Load a CSV file using LangChain's CSVLoader
from langchain_community.document_loaders import CSVLoader

# 1. Specify the path to your CSV file
# Make sure to replace 'sample_csv_file.csv' with the actual path to
# your CSV file
file_path = "sample_csv_file.csv"

# 2. Load the CSV file
# This will create a loader instance and load the content of the file
loader = CSVLoader(file_path)

# 3. Load the documents from the CSV file
# This will parse the CSV and create a list of documents
documents = loader.load()

# 4. Print the content of the loaded documents
# Each document corresponds to a row in the CSV file
for i, doc in enumerate(documents):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content)

Input:

sample_csv_file.csv is the input file used by this program, and its content is as follows:

Name,Department,Role,Location,Joining Date
Anil Sharma,Engineering,Software Engineer,Bangalore,10/01/2022
Raj Kumar,HR,HR Manager,Mumbai,15/03/2021
Neha Kohli,Marketing,Content Writer,Delhi,22/05/2023
Ashok Singh,Finance,Accountant,Pune,01/07/2020

Output:

--- Document 1 ---
Name: Anil Sharma
Department: Engineering
Role: Software Engineer
Location: Bangalore
Joining Date: 10/01/2022

--- Document 2 ---
Name: Raj Kumar
Department: HR
Role: HR Manager
Location: Mumbai
Joining Date: 15/03/2021

--- Document 3 ---
Name: Neha Kohli
Department: Marketing
Role: Content Writer
Location: Delhi
Joining Date: 22/05/2023

--- Document 4 ---
Name: Ashok Singh
Department: Finance
Role: Accountant
Location: Pune
Joining Date: 01/07/2020

Recipe 17

This recipe shows how to write a program that loads an Excel file using LangChain's UnstructuredExcelLoader:

Specify the path to the Excel file.

Load the Excel file using UnstructuredExcelLoader. This creates a loader instance and loads the content of the file.

Load the documents from the Excel file using the loader. This parses the Excel file and creates a list of documents.

Print the content of the loaded documents. Each document corresponds to a sheet in the Excel file.

Install the required dependencies:

pip install langchain unstructured openpyxl msoffcrypto-tool

load_excel_file.py

Refer to the following code:

# Load_excel_file.py
# Load an Excel file using LangChain's UnstructuredExcelLoader
from langchain_community.document_loaders import UnstructuredExcelLoader

# 1. Specify the path to your Excel file
file_path = "sample_excel_file.xlsx"

# 2. Load the Excel file
# This will create a loader instance and load the content of the file
loader = UnstructuredExcelLoader(file_path)

# 3. Load the documents from the Excel file
# This will parse the Excel and create a list of documents
documents = loader.load()

# 4. Print the content of the loaded documents
# Each document corresponds to a sheet in the Excel file
for i, doc in enumerate(documents):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content)

Input:

sample_excel_file.xlsx is the input file used by this program, and its content is as follows:

ID Name Department Age
1 Jatin Kumar HR 34
2 Abhishek Kumar IT 28
3 Rajeev Kumar Finance 42

Output:

--- Document 1 ---
ID Name Department Age 1 Jatin Kumar HR 34 2 Abhishek Kumar IT 28 3 Rajeev Kumar Finance 42

Recipe 18

This recipe shows how to write a program that loads an HTML file using LangChain's BSHTMLLoader:

Specify the path to the HTML file. You can replace sample.html with the actual path to your HTML file, then obtain a loader instance via BSHTMLLoader.

Load the HTML file using the loader. This creates a loader instance and loads the content of the file.

Print the content of the loaded documents. Each document corresponds to a section of the HTML file.

Install the required dependencies:

pip install langchain beautifulsoup4 html2text

load_html.py

Refer to the following code:

# Load_html.py
# Load a web page using LangChain's BSHTMLLoader
# This example assumes you have a local HTML file named 'sample.html'
from langchain_community.document_loaders import BSHTMLLoader

# 1. Specify the path to your HTML file
# You can replace 'sample.html' with the actual path to your HTML file
# Get loader instance using BSHTMLLoader
loader = BSHTMLLoader("sample.html")

# 2. Load the HTML file using the loader
# This will create a loader instance and load the content of the file
docs = loader.load()

# 3. Print the content of the loaded documents
# Each document corresponds to a section in the HTML file
print(f"Loaded {len(docs)} HTML documents")
print(docs[0].page_content[:500])

Input:

sample.html is the input file used by this program, and its content is as follows:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sample HTML Page</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 40px;
background-color: #f4f4f4;
}
h1 {
color: #333;
}
p {
color: #555;
}
a {
color: #0066cc;
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
</style>
</head>
<body>
<h1>Welcome to My Sample Page</h1>
<p>This is a simple HTML page to demonstrate basic structure and
tags.</p>
<h2>Sections</h2>
<ul>
<li><a href="#about">About</a></li>
<li><a href="#contact">Contact</a></li>
</ul>
<h2 id="about">About</h2>
<p>This section provides some basic information about the site.</p>
<h2 id="contact">Contact</h2>
<p>You can reach me at <a href="mailto:someone@example.com">someone@example.com</a>.</p>
</body>
</html>

Output:

Loaded 1 HTML documents
Sample HTML Page
Welcome to My Sample Page
This is a simple HTML page to demonstrate basic structure and tags.
Sections
About
Contact
About
This section provides some basic information about the site.
Contact
You can reach me at someone@example.com.

Recipe 19

This recipe loads JSON documents, which can be structured or unstructured, and prints their content and metadata:

Load the JSON documents from a file; modify the file path and jq_schema as needed.

Load the documents using the loader. This returns a list of Document objects.

Print the loaded documents. Each document has page_content and metadata attributes.

Install the required dependencies:

pip install langchain jq

load_json.py

Refer to the following code:

# This script will load JSON document, which can be structured or
# unstructured,
# and print their content and metadata.
from langchain_community.document_loaders import JSONLoader

# 1. Load JSON documents from a file
# Modify the file path and jq_schema as needed
loader = JSONLoader(
    file_path="sample.json",
    jq_schema=".",  # "." means load the whole list of objects
    text_content=False  # We'll get structured documents instead of raw strings
)

# 2. Load the documents using the loader
# This will return a list of Document objects
docs = loader.load()

# 3. Print the loaded documents
# Each document will have a page content and metadata attributes
for i, doc in enumerate(docs):
    print(f"Document {i+1}:")
    print("Content:", doc.page_content)
    print("Metadata:", doc.metadata)
    print("---")

Input:

sample.json is the input file used by this program, and its content is as follows:

[
  {
    "id": 1,
    "name": "Pankaj Kumar",
    "email": "pankaj@example.com",
    "interests": ["reading", "hiking", "cooking"]
  },
  {
    "id": 2,
    "name": "Pratap Singh",
    "email": "pratap@example.com",
    "interests": ["gaming", "cycling"]
  },
  {
    "id": 3,
    "name": "Neeraj Kumar",
    "email": "neeraj@example.com",
    "interests": ["photography", "travel", "music"]
  }
]

Output:

Document 1:
Content: [{"id": 1, "name": "Pankaj Kumar", "email": "pankaj@example.com", "interests": ["reading", "hiking", "cooking"]}, {"id": 2, "name": "Pratap Singh", "email": "pratap@example.com", "interests": ["gaming", "cycling"]}, {"id": 3, "name": "Neeraj Kumar", "email": "neeraj@example.com", "interests": ["photography", "travel", "music"]}]
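Note that with jq_schema="." the whole array becomes a single document, which is why only one Document is printed above. A schema such as ".[]" (standard jq syntax for iterating a top-level array) would instead yield one document per record. A dependency-free sketch of that per-record behavior, using the same sample data trimmed down:

```python
import json

# Two records from the sample file, shortened for the demonstration
records = json.loads(
    '[{"id": 1, "name": "Pankaj Kumar"}, {"id": 2, "name": "Pratap Singh"}]'
)

# One "document" per top-level record, as an iterating jq schema would produce
docs = [json.dumps(rec, ensure_ascii=False) for rec in records]
print(len(docs))  # 2
```

Choosing between one document for the whole file and one per record matters later: per-record documents give finer-grained retrieval and let each record carry its own metadata.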

Recipe 20

This recipe demonstrates how to load a web page using LangChain's WebBaseLoader. It fetches the content from a specified URL and prints the first 1000 characters:

Obtain a loader instance via WebBaseLoader. Specify the URL of the web page you want to load.

Load the web page using the loader. This creates a loader instance and loads the content of the page.

Print the first 1000 characters of the content. Each document corresponds to the content of the web page.

Install the required dependencies:

pip install langchain beautifulsoup4

load_web_page.py

Refer to the following code:

# This script demonstrates how to load a web page using LangChain's
# WebBaseLoader
# It fetches the content from a specified URL and prints the first 1000
# characters of
from langchain_community.document_loaders import WebBaseLoader

# 1. Get Loader instance using WebBaseLoader
# Specify the URL of the web page you want to load
loader = WebBaseLoader("https://example.com/")

# 2. Load the web page using the loader
# This will create a loader instance and load the content of the
# web page
docs = loader.load()

# 3. Print the 1000 characters from the content
# Each document corresponds to the content of the web page
for i, doc in enumerate(docs):
    print(f"--- Document {i+1} ---")
    print(doc.page_content[:1000])  # Print first 1000 characters
    print("\nMetadata:", doc.metadata)

Input:

https://example.com/

Output:

--- Document 1 ---
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...

Metadata: {'source': 'https://example.com/', 'title': 'Example Domain', 'language': 'No language found.'}

Metadata extraction and management

Metadata supports contextual retrieval, for example by author, title, file path, creation date, tags, domain, or source URL. Common applications include efficient filtering, source attribution, and content routing in multi-source systems.
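Metadata filtering can be illustrated with plain dictionaries before any vector store is involved. The sketch below is hypothetical (the documents and field names are made up); production vector stores expose equivalent metadata filters at query time:

```python
# Hypothetical corpus: each entry mirrors a Document's content + metadata shape
docs = [
    {"page_content": "Q3 revenue grew 12%.",
     "metadata": {"source": "report.pdf", "year": 2024}},
    {"page_content": "Setup guide for the API.",
     "metadata": {"source": "docs.md", "year": 2023}},
    {"page_content": "Q4 forecast.",
     "metadata": {"source": "report.pdf", "year": 2023}},
]

def filter_by_metadata(docs, **criteria):
    """Keep only documents whose metadata matches every key/value in criteria."""
    return [d for d in docs
            if all(d["metadata"].get(k) == v for k, v in criteria.items())]

# Restrict retrieval to one source and one year before any similarity search
hits = filter_by_metadata(docs, source="report.pdf", year=2023)
print(len(hits))  # 1
```

Filtering first and searching second narrows the candidate set, which improves both precision and latency in multi-source systems.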

The following sections discuss the related recipes along with their Python implementations.

Recipe 21

This recipe demonstrates how to add custom metadata to documents while loading them from a text file with LangChain:

Load a text file using LangChain's TextLoader.

Load the documents from the text file using the loader. This creates a loader instance and loads the content of the file.

Add custom metadata to each document. Custom metadata can be any key-value pairs you want to associate with a document; in this example it includes source, category, and author.

Print the content and metadata of the loaded documents. Each document carries the added custom metadata.

Install the required dependencies:

pip install langchain

custom_metadata_during_loading.py

Refer to the following code:

# Custom Metadata During Document Loading
# This script demonstrates how to add custom metadata to documents
# loaded from a text file using LangChain
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document

# 1. Load a text file using LangChain's TextLoader
loader = TextLoader("RAG.txt")

# 2. Load the documents from the text file using the loader
# This will create a loader instance and load the content of the file
raw_docs = loader.load()

# 3. Add custom metadata to each document
# Custom metadata can be any key-value pair you want to associate with the document
# In this example this includes source, category, author, etc.
custom_docs = []
for doc in raw_docs:
    doc.metadata["source"] = "local_file"
    doc.metadata["category"] = "tutorial"
    doc.metadata["author"] = "Deepak"
    custom_docs.append(doc)

# 4. Print the content and metadata of the loaded documents
# Each document will have the custom metadata added
for i, doc in enumerate(custom_docs):
    print(f"\n--- Document {i+1} ---")
    print("Content:", doc.page_content[:105])
    print("Metadata:", doc.metadata)

Input:

RAG.txt is the input file used by this program, and its content is as follows:

Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.

Output:

--- Document 1 ---
Content: Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language model
Metadata: {'source': 'local_file', 'category': 'tutorial', 'author': 'Deepak'}

Preprocessing and text normalization

Preprocessing and text normalization is the stage where raw document chunks are cleaned, standardized, and prepared for embedding. Even after loading and splitting, text may still contain inconsistencies, formatting issues, or irrelevant data that degrade retrieval quality. The following sections discuss the related recipes.

Recipe 22

This script demonstrates how to preprocess text documents while loading them with LangChain:

Download the NLTK stopword corpus if it has not been downloaded yet.

Load a text file using LangChain's TextLoader.

Load the documents from the text file using the loader. This creates a loader instance and loads the content of the file.

Define the stopword set, used to filter out common words that contribute little meaning.

Define a preprocessing function that cleans the text by lowercasing, removing stopwords, and normalizing.

Apply the preprocessing to each document. This produces a new list of documents with cleaned content.

Print the content of the cleaned documents. Each document contains the preprocessed text.

Install the required dependencies:

pip install langchain nltk

preprocess_during_loading.py

Refer to the following code:

# Preprocess During Document Loading
# This script demonstrates how to preprocess text documents during
# loading using LangChain
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
import nltk
from nltk.corpus import stopwords
import re

# 1. Download NLTK stopwords if not already downloaded
nltk.download("stopwords")

# 2. Load a text file using LangChain's TextLoader
loader = TextLoader("RAG.txt")

# 3. Load the documents from the text file using the loader
# This will create a loader instance and load the content of the file
docs = loader.load()

# 4. Define stopwords
# This will be used to filter out common words that do not contribute to
# the meaning
stop_words = set(stopwords.words("english"))

# 5. Define a preprocessing function
# This function will clean the text by lowercasing, removing stopwords,
# and normalizing
def preprocess(text):
    # 1. Lowercase
    text = text.lower()

    # 2. Remove stopwords
    words = text.split()
    filtered = [word for word in words if word not in stop_words]
    text = ' '.join(filtered)

    # 3. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# 6. Apply preprocessing to each document
# This will create a new list of documents with cleaned content
processed_docs = []
for doc in docs:
    cleaned = preprocess(doc.page_content)
    processed_doc = Document(
        page_content=cleaned,
        metadata=doc.metadata  # Preserve metadata
    )
    processed_docs.append(processed_doc)

# 7. Print the content of the cleaned documents
# Each document will have the preprocessed content
for doc in processed_docs:
    print("\nCleaned Document:\n", doc.page_content[:300])

Input:

RAG.txt is the input file used by this program; an excerpt of its unevenly formatted content is shown below:

Retrieval Augmented Generation (RAG) is an
architecture that combines the ability of large language models (LLMs) with a retrieval
system to enhance the factual accuracy, contextual relevance, and quality of generated
response against the query raised by user to a RAG system.

Output:

Cleaned Document:
retrieval augmented generation (rag) architecture combines ability large language models (llms) retrieval system enhance factual accuracy, contextual relevance, quality generated response query raised user rag system. traditional generative models rely solely internal parameters producing responses,

Recipe 23

This recipe demonstrates how to load a document with the unstructured loader from LangChain Community, which supports many file formats such as PDF, DOCX, and HTML. It also shows how to output the semantically grouped document chunks:

Specify the file path. It can be a text file, PDF, DOCX, HTML, and so on. Make sure the file exists at the specified path.

Load the document using UnstructuredLoader. The loader automatically handles the file format and extracts the content.

Output the loaded documents. Each document carries metadata and content that can be used for further processing.

Install the required dependencies:

pip install langchain langchain-community unstructured "unstructured[all-docs]" langchain_unstructured

loader_with_semantic_grouping.py

Refer to the following code:

# This script demonstrates how to load a document using LangChain's
# UnstructuredLoader
from langchain_unstructured import UnstructuredLoader
from langchain_core.documents import Document

# 1. Specify the path to your file. This can be a text file, PDF, DOCX,
# HTML, etc.
# Ensure the file exists in the specified path.
file_path = "Unstructured RAG.txt"

# 2. Load the document using UnstructuredLoader
# This loader will automatically handle the file format and extract the
# content.
loader = UnstructuredLoader(file_path)
docs = loader.load()

# 3. Output the loaded documents
# Each document will have metadata and content, which can be used for
# further processing.
for i, doc in enumerate(docs):
    print(f"\n Section {i+1}")
    print("Metadata:", doc.metadata)
    print("Content Preview:\n", doc.page_content[:300], "\n---")

Input:

Unstructured RAG.txt is the input file used by this program, and its content is as follows:

Retrieval Augmented Generation (RAG) is an
architecture that combines the ability
of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Traditional generative models rely solely
on internal parameters for producing
responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Traditional generative
models laid the foundation for today’s LLMs.
They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.

Output:

Section 1
Metadata: {'source': 'Unstructured RAG.txt', 'last_modified': '2025-07-29T23:00:52', 'languages': ['eng'], 'filename': 'Unstructured RAG.txt', 'filetype': 'text/plain', 'category': 'NarrativeText', 'element_id': '0c03d287784ecbed4faa5f6cd9d5c989'}
Content Preview:
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
---

Section 2
Metadata: {'source': 'Unstructured RAG.txt', 'last_modified': '2025-07-29T23:00:52', 'languages': ['eng'], 'filename': 'Unstructured RAG.txt', 'filetype': 'text/plain', 'category': 'NarrativeText', 'element_id': '35548ff4e0b0df036569862a71194320'}
Content Preview:
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
---

Section 3
Metadata: {'source': 'Unstructured RAG.txt', 'last_modified': '2025-07-29T23:00:52', 'languages': ['eng'], 'filename': 'Unstructured RAG.txt', 'filetype': 'text/plain', 'category': 'NarrativeText', 'element_id': 'aadaa6675dbdc0f20bcc4c636c67f922'}
Content Preview:
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherenc
---

Bulk loading with directory ingestion

Bulk loading is the process of reading and processing a large number of documents from a directory in one pass. In RAG pipelines, especially with knowledge bases, archives, or continuously updated content repositories, ingesting documents in batches is usually more efficient than processing them one at a time.
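LangChain also ships a DirectoryLoader for exactly this pattern. To show the underlying idea without any dependency, here is a sketch that scans a directory for supported extensions and reads each file into a (filename, text) pair; the file names and contents are made up for the demonstration:

```python
import os
import tempfile

def load_directory(path, exts={".txt"}):
    """Read every supported file in a directory into (filename, text) pairs."""
    docs = []
    for name in sorted(os.listdir(path)):
        if os.path.splitext(name)[1].lower() in exts:
            with open(os.path.join(path, name), encoding="utf-8") as f:
                docs.append((name, f.read()))
    return docs

# Build a throwaway directory with two supported files and one unsupported one
with tempfile.TemporaryDirectory() as d:
    for name, text in [("a.txt", "alpha"), ("b.txt", "beta"), ("c.log", "skip me")]:
        with open(os.path.join(d, name), "w", encoding="utf-8") as f:
            f.write(text)
    docs = load_directory(d)

print(len(docs))  # 2 -- the .log file is filtered out
```

The two recipes that follow build on this same scan-and-filter idea, adding parallelism and token-aware batching respectively.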

The following sections discuss the related recipes along with their implementation code.

Recipe 24

This recipe demonstrates how to load documents in parallel using LangChain's document loaders. It supports loading PDF, TXT, and DOCX files concurrently to improve efficiency:

In the get_loader function, obtain the appropriate loader based on the file extension. The function returns the correct loader for a given file type.

Load the documents in parallel. The function uses a ThreadPoolExecutor to load documents concurrently. It looks for files with supported extensions in the directory and loads them. The directory contains files in three supported formats: PDF, TXT, and DOCX.

Get the file paths from the directory. The function retrieves the paths of all supported files in the specified directory.

Print the number of loaded documents. This outputs the total number of documents loaded from the specified directory.

The main execution block runs when the script is executed directly. It retrieves the file paths from the specified directory and loads the documents in parallel.

Install the required dependencies:

pip install langchain tqdm pypdf python-docx

parallel_document_loading.py

Refer to the following code:

# This code demonstrates how to load documents in parallel using
# LangChain's document loaders
# It supports loading PDF, TXT, and DOCX files concurrently to
# improve efficiency.
import os
from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# 1. Get the appropriate loader based on file extension
# This function returns the correct loader for the given
# file type.
def get_loader(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".pdf":
        return PyPDFLoader(file_path)
    elif ext == ".txt":
        return TextLoader(file_path)
    elif ext in [".docx", ".doc"]:
        return UnstructuredWordDocumentLoader(file_path)
    else:
        return None

# 2. Load documents in parallel
# This function uses ThreadPoolExecutor to load documents concurrently.
# In the directory, it will look for files with supported extensions
# and load them.
# In the directory, there are 3 files with supported extensions
# like PDF, TXT, and DOCX.
def load_documents_parallel(file_paths, max_workers=4):
    documents = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(get_loader(fp).load): fp
            for fp in file_paths if get_loader(fp)
        }
        for future in tqdm(as_completed(futures), total=len(futures), desc="Loading"):
            try:
                result = future.result()
                documents.extend(result)
            except Exception as e:
                print(f"Failed to load {futures[future]}: {e}")
    return documents

# 3. Get file paths from a directory
# This function retrieves all supported file paths from a specified
# directory.
def get_file_paths(directory):
    supported_exts = {".pdf", ".txt", ".docx"}
    return [
        os.path.join(directory, f)
        for f in os.listdir(directory)
        if os.path.splitext(f)[1].lower() in supported_exts
    ]

# Main execution block -
# This part of the code runs when the script is executed directly.
# It retrieves file paths from a specified directory and loads the
# documents in parallel.
# Change 'documents/' to the directory containing your files.
if __name__ == "__main__":
    folder_path = "documents/"  # change this
    files = get_file_paths(folder_path)
    all_docs = load_documents_parallel(files)

    # 4. Print the number of loaded documents
    # This will output the total number of documents loaded from the
    # specified directory.
    print(f"\nLoaded {len(all_docs)} documents.")

Input:

The documents folder contains three files, with the extensions .pdf, .txt, and .docx.

The content of RAG.pdf, RAG.txt, and RAG.docx is as follows:

Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate

Output:

Loaded 3 documents.

Recipe 25

This recipe demonstrates how to group many small text files into logical batches based on token count, and then split those batches into smaller chunks for embedding:

Initialize the tokenizer. You can use any tokenizer compatible with your embedding model.

Load the text documents from a folder. The function loads all text files from the specified folder and returns a list of Document objects.

Batch the documents based on token count. The function groups documents into batches subject to a maximum token limit.

Split the batches into smaller chunks. The function uses RecursiveCharacterTextSplitter to split each batch into smaller pieces, letting you specify the chunk size and overlap for better context retention.

The main execution method. The script loads many small text files, groups them into logical batches based on token count, and then splits those batches into small chunks ready for embedding.

Install the required dependencies:

pip install langchain tiktoken

batching_before_splitting.py

Refer to the following code:

# This script demonstrates how to batch small text files into logical
# groups
# based on token count, and then split those batches into smaller
# chunks for embedding.
import os
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

# 1. Initialize the tokenizer
# You can use any tokenizer compatible with your embedding model
tokenizer = tiktoken.get_encoding("cl100k_base")  # or use tokenizer.encode(text)

# 2. Load text documents from a folder
# This function loads all text files from a specified folder and
# returns a list of Document objects
def load_text_documents(folder_path):
    documents = []
    for file in os.listdir(folder_path):
        if file.endswith(".txt"):
            loader = TextLoader(os.path.join(folder_path, file))
            documents.extend(loader.load())
    return documents

# 3. Batch documents based on token count
# This function groups documents into batches based on a maximum token
# count
def batch_documents_by_tokens(documents, max_tokens=200):
    batches = []
    current_batch = ""
    for doc in documents:
        content = doc.page_content.strip()
        if not content:
            continue
        # Join with a newline only when the batch already has content,
        # so batches never start with a stray separator
        candidate = current_batch + "\n" + content if current_batch else content
        if len(tokenizer.encode(candidate)) <= max_tokens:
            current_batch = candidate
        else:
            if current_batch:  # avoid emitting an empty batch
                batches.append(Document(page_content=current_batch))
            current_batch = content
    if current_batch:
        batches.append(Document(page_content=current_batch))
    return batches

# 4. Split batches into smaller chunks
# This function splits each batch into smaller chunks using
# RecursiveCharacterTextSplitter
# It allows you to specify chunk size and overlap for better
# context retention
def split_batches(batched_docs, chunk_size=200, chunk_overlap=20):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_documents(batched_docs)

# 5. main method to execute the batching and splitting
# This script will load small text files, batch them into logical
# groups based on token count,
# and then split those batches into smaller chunks ready for
# embedding
if __name__ == "__main__":
    folder = "text_file_batch/"  # your folder of small files
    raw_docs = load_text_documents(folder)
    print(f" 1) Loaded {len(raw_docs)} small documents")
    batched_docs = batch_documents_by_tokens(raw_docs, max_tokens=300)
    print(f" 2) Grouped into {len(batched_docs)} logical batches")
    split_docs = split_batches(batched_docs)
    print(f" 3) Split into {len(split_docs)} chunks ready for embedding")

Input:

The text_file_batch folder contains the following three files, which the program uses: RAG1.txt, RAG2.txt, and RAG3.txt.

RAG1.txt

Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.

RAG2.txt

Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.

RAG3.txt

Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.

Output:

1) Loaded 3 small documents
2) Grouped into 1 logical batches
3) Split into 6 chunks ready for embedding

Custom document loaders

Custom document loaders let you handle data sources that do not fit standard formats such as PDF, CSV, or DOCX. In a RAG pipeline, you may need to load information from APIs, databases, proprietary file formats, or unusual text structures. That is when you create a custom document loader.
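As a taste of what "unusual text structures" means in practice, here is a hypothetical sketch: a custom loader for a made-up "key=value" log format, producing one document per log line. The Document class below is a plain-Python stand-in for the shape LangChain loaders return, and the log format and field names are invented for the demonstration:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Stand-in mirroring the content + metadata shape of LangChain documents."""
    page_content: str
    metadata: dict = field(default_factory=dict)

class KeyValueLogLoader:
    """Parse lines like 'level=INFO msg=started' into one Document per line."""
    def __init__(self, text, source="app.log"):
        self.text = text
        self.source = source

    def load(self):
        docs = []
        for n, line in enumerate(self.text.splitlines(), start=1):
            if not line.strip():
                continue
            # Each whitespace-separated token is a key=value pair
            fields = dict(pair.split("=", 1) for pair in line.split())
            docs.append(Document(
                page_content=fields.get("msg", line),
                metadata={"source": self.source, "line": n, **fields},
            ))
        return docs

docs = KeyValueLogLoader("level=INFO msg=started\nlevel=ERROR msg=disk_full").load()
print(docs[1].metadata["level"])  # ERROR
```

The pattern is always the same: parse the source into text, attach whatever structure you extracted as metadata, and return a list of documents.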

The following recipe shows how to implement a document loader that attaches custom metadata to the content.

Recipe 26

This recipe defines a custom document loader for loading text files:

The main function runs when the script is executed directly. It creates an instance of CustomTextLoader with the specified file path and metadata.

CustomTextLoader is a class that extends BaseLoader to load text files. It reads the content of a text file and returns it as a single Document object.

Print the loaded documents. The content and metadata of each document are written to the console.

Install the required dependencies:

pip install langchain

custom_document_loader.py

Refer to the following code:

# This script defines a custom document loader for loading text files
# It uses LangChain's BaseLoader to create a custom loader that reads
# text files
from langchain_core.documents import Document
from langchain_community.document_loaders.base import BaseLoader
import os

# 2. CustomTextLoader is a class that extends BaseLoader to load text
# files
# It reads the content of a text file and returns it as a
# document object
class CustomTextLoader(BaseLoader):
    def __init__(self, file_path: str, metadata: dict = None):
        self.file_path = file_path
        self.metadata = metadata or {}

    def load(self):
        with open(self.file_path, "r", encoding="utf-8") as f:
            text = f.read()
        return [Document(page_content=text, metadata=self.metadata)]

# 1. Main is executed when the script is run directly
# It creates an instance of CustomTextLoader with a specified file
# path and metadata
if __name__ == "__main__":
    file_path = "RAG.txt"
    custom_metadata = {
        "source": file_path,
        "category": "session_notes",
        "author": "Deepak",
        "tags": ["custom", "demo", "loader"]
    }
    loader = CustomTextLoader(file_path=file_path, metadata=custom_metadata)
    docs = loader.load()

    # 3. Print the loaded documents
    # Each document will have its content and metadata printed to the
    # console
    for doc in docs:
        print("Text Content:\n", doc.page_content)
        print("\nMetadata:\n", doc.metadata)

Input:

RAG.txt is the input file used by this program, and its content is as follows:

Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.

Output:

Text Content:
Retrieval Augmented Generation (RAG) is an architecture that combines the ability of large language models (LLMs) with a retrieval system to enhance the factual accuracy, contextual relevance, and quality of generated response against the query raised by user to a RAG system.
Traditional generative models rely solely on internal parameters for producing responses, which limits their ability to provide up-to-date or domain-specific knowledge. RAG mitigates this by augmenting the generation process with real-time retrieval from external knowledge sources.
Traditional generative models laid the foundation for today’s LLMs. They helped us understand how to model processes represent knowledge, user input and generate data. However, they are now mostly replaced or augmented by deep learning-based transformer models, which offer greater accuracy, coherence, and scalability.

Metadata:
{'source': 'RAG.txt', 'category': 'session_notes', 'author': 'Deepak', 'tags': ['custom', 'demo', 'loader']}

Conclusion

In this chapter, we explored the role of document loaders in a RAG pipeline. We learned what a document loader is and how it converts raw, unstructured files into structured document objects. The chapter also covered loading documents from many formats and sources, such as PDFs, CSVs, Markdown files, Excel sheets, JSON, and web pages.

We saw why metadata enrichment during loading matters, since it enables better filtering and retrieval. We now have best practices for choosing the right loader based on data source, scale, and structure, and a clear understanding of how to ingest different kinds of data into a RAG-ready format, so the system starts from clean, organized, retrievable documents.

Loading, however, is only the first step. Large documents are usually too big to embed directly and must be split into smaller, context-preserving chunks to improve retrieval accuracy, reduce cost, and support better LLM responses.

In the next chapter, Document Splitting Techniques, we will learn how to break loaded documents apart, from simple fixed-size splitting to semantic and topic-based strategies, preparing them for embedding and storage in a vector database.
