UnstructuredLoader: Study and Usage Notes

UnstructuredLoader

Overview

UnstructuredLoader is a document loader in the LangChain ecosystem for unstructured documents. It extracts text and metadata from files in many formats and converts them into LangChain Document objects.

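Key Features

Multi-format support: handles PDF, Word, HTML, images, and other document formats
Local and hosted processing: partitions documents locally or through the Unstructured API
Batch processing: accepts a list of files in a single loader
Asynchronous loading: provides async document loading
URL support: loads content directly from a web page URL
Post-processors: accepts custom text post-processing functions
Rich metadata: extracts detailed metadata for each document element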

Installation and Configuration

Installing Dependencies

pip install -U langchain-unstructured

Environment Variables

To process documents through the API, set these environment variables:

export UNSTRUCTURED_API_KEY="your-api-key"
export UNSTRUCTURED_URL="https://api.unstructuredapp.io/general/v0/general"  # optional; this is the default
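
The same settings can also be resolved in Python before a loader is constructed. The sketch below assumes a precedence of explicit argument, then environment variable, then the default URL quoted above. `resolve_api_settings` and `DEFAULT_URL` are hypothetical names used for illustration, not part of langchain-unstructured.

```python
import os
from typing import Optional

# Default endpoint quoted in the export example above.
DEFAULT_URL = "https://api.unstructuredapp.io/general/v0/general"


def resolve_api_settings(
    api_key: Optional[str] = None,
    url: Optional[str] = None,
) -> tuple[str, str]:
    """Return (api_key, url), preferring explicit arguments over environment variables.

    Hypothetical helper; the precedence order is an assumption of this sketch.
    """
    resolved_key = api_key or os.environ.get("UNSTRUCTURED_API_KEY", "")
    resolved_url = url or os.environ.get("UNSTRUCTURED_URL") or DEFAULT_URL
    if not resolved_key:
        # Fail early with a clear message instead of at the first API call.
        raise RuntimeError("UNSTRUCTURED_API_KEY is not set and no api_key was passed")
    return resolved_key, resolved_url
```

Failing early on a missing key keeps the error close to its cause, rather than surfacing later as an authentication failure.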

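Core Parameters

Constructor Parameters

file_path: Optional[str | Path | list[str] | list[Path]]
- Path of the file to load; accepts a single path or a list of paths
file: Optional[IO[bytes] | list[IO[bytes]]]
- File object or list of file objects; mutually exclusive with file_path
partition_via_api: bool = False
- Whether to partition documents through the Unstructured API instead of locally
post_processors: Optional[list[Callable[[str], str]]]
- List of text post-processing functions
api_key: Optional[str]
- Unstructured API key
client: Optional[UnstructuredClient]
- Custom UnstructuredClient instance
url: Optional[str]
- Custom API endpoint URL
web_url: Optional[str]
- URL of a web page to load
kwargs: additional parameters passed through to Unstructured (for example strategy or chunking_strategy)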

Usage

1. Basic File Loading

from langchain_unstructured import UnstructuredLoader

# Load a single file
loader = UnstructuredLoader(
    file_path="example.pdf",
    partition_via_api=True,
    chunking_strategy="by_title",
    strategy="fast",
)

# Synchronous load
docs = loader.load()
print(docs[0].page_content[:100])
print(docs[0].metadata)

2. Loading Multiple Files

# Load several files with one loader
loader = UnstructuredLoader(
    file_path=["file1.pdf", "file2.docx", "file3.html"],
    partition_via_api=True
)

docs = loader.load()
for doc in docs:
    print(f"Source: {doc.metadata.get('source')}")
    print(f"Content: {doc.page_content[:50]}...")

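3. Lazy Loading (saves memory)

def process_document(doc):
    """Placeholder for your own per-document handling."""
    print(doc.metadata.get("source"), len(doc.page_content))

loader = UnstructuredLoader(
    file_path=["large_file1.pdf", "large_file2.pdf"],
    partition_via_api=True
)

# Lazy loading yields documents one at a time instead of building a full list
for doc in loader.lazy_load():
    process_document(doc)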

4. Asynchronous Loading

import asyncio

async def load_documents():
    loader = UnstructuredLoader(
        file_path="example.pdf",
        partition_via_api=True
    )

    docs = await loader.aload()
    return docs

# Run the async load
docs = asyncio.run(load_documents())

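5. Loading from a URL

# Load content from a web page
# partition_via_api defaults to False, so this runs locally and needs the unstructured package
loader = UnstructuredLoader(
    web_url="https://www.example.com/"
)

docs = loader.load()
for doc in docs:
    print(f"Category: {doc.metadata.get('category')}")
    print(f"Content: {doc.page_content}")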

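6. Using Post-Processors

def clean_text(text: str) -> str:
    """Strip surrounding whitespace and collapse double newlines."""
    return text.strip().replace('\n\n', '\n')

def remove_headers(text: str) -> str:
    """Drop the first and last line as a naive header/footer filter."""
    lines = text.split('\n')
    # Texts of one or two lines have no body to keep, so return them unchanged
    if len(lines) <= 2:
        return text
    # Replace this slice with your own header/footer detection logic
    return '\n'.join(lines[1:-1])

loader = UnstructuredLoader(
    file_path="document.pdf",
    partition_via_api=True,
    post_processors=[clean_text, remove_headers]
)

docs = loader.load()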
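
To reason about what a list of post-processors does to a piece of text, it can help to run the chain outside the loader. The sketch below assumes the processors are applied in list order, each receiving the previous one's output. `apply_post_processors`, `collapse_blank_lines`, and `lowercase` are hypothetical names, not library functions.

```python
from functools import reduce
from typing import Callable

PostProcessor = Callable[[str], str]


def apply_post_processors(text: str, processors: list[PostProcessor]) -> str:
    """Run each processor on the output of the previous one, in list order."""
    return reduce(lambda current, fn: fn(current), processors, text)


def collapse_blank_lines(text: str) -> str:
    """Strip outer whitespace and collapse double newlines into single ones."""
    return text.strip().replace("\n\n", "\n")


def lowercase(text: str) -> str:
    """Lowercase the whole text."""
    return text.lower()


raw = "  Quarterly Report\n\nRevenue grew.  "
print(apply_post_processors(raw, [collapse_blank_lines, lowercase]))
```

Because each function takes and returns a plain string, processors can be unit-tested in isolation before they are handed to the loader.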

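7. Local Processing (requires extra dependencies)

# Install the local processing dependencies first; PDF support needs the pdf extra
# pip install "unstructured[pdf]"

loader = UnstructuredLoader(
    file_path="document.pdf",
    partition_via_api=False,  # partition locally
    strategy="hi_res",  # high-resolution strategy; slower than "fast"
    infer_table_structure=True  # infer table structure
)

docs = loader.load()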

8. Custom Client Configuration

from unstructured_client import UnstructuredClient

# Create a custom client
custom_client = UnstructuredClient(
    api_key_auth="your-api-key",
    server_url="https://your-custom-endpoint.com"
)

loader = UnstructuredLoader(
    file_path="document.pdf",
    client=custom_client,
    partition_via_api=True
)

Document Metadata

Each loaded document carries metadata. Which fields are present depends on the file type and the partitioning options, so read them with .get():

docs = loader.load()
metadata = docs[0].metadata

# Common metadata fields
print(f"Source file: {metadata.get('source')}")
print(f"File type: {metadata.get('filetype')}")
print(f"Category: {metadata.get('category')}")
print(f"Page number: {metadata.get('page_number')}")
print(f"Languages: {metadata.get('languages')}")
print(f"Last modified: {metadata.get('last_modified')}")
print(f"Element ID: {metadata.get('element_id')}")
print(f"Coordinates: {metadata.get('coordinates')}")

Common Element Categories

Title: title or heading
NarrativeText: narrative text
UncategorizedText: uncategorized text
Table: table
ListItem: list item
Header: page header
Footer: page footer
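
The category value is convenient for filtering loaded documents, for example to drop page headers and footers before indexing. The sketch below uses a minimal `Doc` dataclass as a stand-in for a LangChain Document so that it runs without any dependencies. `Doc`, `group_by_category`, and `drop_page_furniture` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_category(docs: list[Doc]) -> dict[str, list[Doc]]:
    """Bucket documents by metadata['category']; missing values go under 'Unknown'."""
    grouped: dict[str, list[Doc]] = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("category", "Unknown")].append(doc)
    return dict(grouped)


def drop_page_furniture(docs: list[Doc]) -> list[Doc]:
    """Keep everything except Header and Footer elements."""
    return [d for d in docs if d.metadata.get("category") not in {"Header", "Footer"}]


sample = [
    Doc("Annual Report", {"category": "Title"}),
    Doc("Page 1", {"category": "Header"}),
    Doc("Revenue grew this year.", {"category": "NarrativeText"}),
]
print({name: len(items) for name, items in group_by_category(sample).items()})
print([d.page_content for d in drop_page_furniture(sample)])
```

The same two functions should work on real Document objects, since they only read `page_content` and `metadata`.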

Best Practices

1. Choose the Right Processing Mode

# For simple documents, partition locally
loader_local = UnstructuredLoader(
    file_path="simple.txt",
    partition_via_api=False
)

# For complex documents, partition through the API
loader_api = UnstructuredLoader(
    file_path="complex.pdf",
    partition_via_api=True,
    strategy="hi_res"
)

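2. Processing Large Batches

from pathlib import Path

def load_documents_from_directory(directory: str):
    """Load every PDF in a directory."""
    pdf_files = list(Path(directory).glob("*.pdf"))

    loader = UnstructuredLoader(
        file_path=pdf_files,
        partition_via_api=True,
        chunking_strategy="by_title"
    )

    # Iterate lazily so progress can be reported as documents arrive.
    # Collecting into a list still keeps everything in memory; process each
    # document inside the loop instead if memory is the constraint.
    documents = []
    for doc in loader.lazy_load():
        documents.append(doc)

        # Progress tracking: counts Document objects, not files
        if len(documents) % 10 == 0:
            print(f"Loaded {len(documents)} documents")

    return documents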
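
The function above only picks up PDFs. When a directory mixes formats, the file list can be built separately and then passed as file_path. The sketch below is one way to do that with pathlib. `collect_files` is a hypothetical helper, and the default extension list is only an example.

```python
from pathlib import Path


def collect_files(
    directory: str,
    extensions: tuple[str, ...] = (".pdf", ".docx", ".html"),
    recursive: bool = False,
) -> list[Path]:
    """Return files under `directory` whose suffix matches, sorted by path."""
    root = Path(directory)
    pattern = "**/*" if recursive else "*"
    wanted = {ext.lower() for ext in extensions}
    # Compare suffixes case-insensitively so "REPORT.PDF" is also picked up.
    return sorted(
        path
        for path in root.glob(pattern)
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Sorting the result makes runs reproducible, which helps when a batch fails partway and has to be resumed.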

3. Error Handling

def safe_load_documents(file_paths: list[str]):
    """Load files one by one, collecting failures instead of aborting the batch."""
    successful_docs = []
    failed_files = []

    for file_path in file_paths:
        try:
            loader = UnstructuredLoader(
                file_path=file_path,
                partition_via_api=True
            )
            docs = loader.load()
            successful_docs.extend(docs)
        except Exception as e:
            print(f"Failed to load {file_path}: {e}")
            failed_files.append(file_path)

    return successful_docs, failed_files
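
Some API failures are transient (timeouts, rate limiting), so retrying before recording a file as failed can be worthwhile. The sketch below is a generic retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, and catching every Exception is a simplification: real code would retry only errors known to be transient, since an authentication failure will not fix itself.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Re-raise on the final attempt so the caller sees the real error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Inside a function like safe_load_documents, the load call could be wrapped as `with_retries(lambda: loader.load())`. The `sleep` parameter exists so tests can substitute a recorder instead of waiting.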

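Advanced Configuration

Chunking Strategy

loader = UnstructuredLoader(
    file_path="document.pdf",
    partition_via_api=True,
    chunking_strategy="by_title",  # start a new chunk at each title
    max_characters=1000,  # hard limit on chunk size, in characters
    new_after_n_chars=800,  # soft limit: close a chunk once it reaches this length
    overlap=200  # characters repeated from the end of the previous chunk
)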
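
To build intuition for how a size limit and an overlap interact, the sketch below splits a single string into fixed-size windows. It is a deliberate simplification and not Unstructured's chunking algorithm, which works on document elements and section boundaries. `split_with_overlap` is a hypothetical function that only borrows the parameter names max_characters and overlap.

```python
def split_with_overlap(text: str, max_characters: int, overlap: int) -> list[str]:
    """Split text into windows of at most max_characters, each sharing
    `overlap` characters with the window before it."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")

    step = max_characters - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        # Stop once the current window already reaches the end of the text.
        if start + max_characters >= len(text):
            break
        start += step
    return chunks


print(split_with_overlap("abcdefghij", max_characters=4, overlap=1))
```

A larger overlap preserves more context across chunk boundaries but produces more chunks for the same text.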

Table Handling

loader = UnstructuredLoader(
    file_path="spreadsheet.xlsx",
    partition_via_api=True,
    infer_table_structure=True,  # infer table structure
    skip_infer_table_types=["png", "jpg"]  # skip table inference for these file types
)

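Notes

API limits: the Unstructured API enforces request-rate and file-size limits
Memory management: prefer lazy_load() when processing many documents
File formats: confirm that the document format is supported
Network connection: API processing requires a stable network connection
Cost: API calls may incur charges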

Troubleshooting

Common Errors and Fixes

ModuleNotFoundError: No module named 'unstructured'
```
pip install unstructured
```

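API authentication failure

export UNSTRUCTURED_API_KEY="your-valid-api-key"

File read permission error

# Confirm that the path exists and that the current user can read it
import os
print(os.path.exists("your-file.pdf"))
print(os.access("your-file.pdf", os.R_OK))

Out of memory

# Use lazy loading so documents are handled one at a time
for doc in loader.lazy_load():
    process_document(doc)  # placeholder for your own handling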

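Summary

UnstructuredLoader is a flexible document loader suited to a wide range of unstructured-document scenarios. With appropriate parameters and the right processing mode, it converts documents in many formats into Document objects that LangChain can work with.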