A Hierarchical Splitting Solution for PDF Documents in Retrieval-Augmented Generation

I. Goals and Background

  • Goal: extract the hierarchical structure of a PDF and split it into self-contained document chunks suitable for retrieval-augmented generation (RAG), for vector storage and semantic retrieval.
  • Background
    • RAG systems depend on semantic segmentation of documents: each chunk should carry self-contained meaning and be of moderate length (typically 200-1000 tokens).
    • Hierarchical splitting preserves the PDF's chapter/section structure and improves retrieval precision.
  • Output: a JSON list of document chunks, each carrying its hierarchy path, title, and content.

II. Overall Workflow

  1. Text extraction: extract the full text of the PDF as a fallback.
  2. Hierarchical parsing: use the outline-parsing code below to extract the PDF outline and per-section content, producing a hierarchical structure.
  3. Document splitting
    • Split by hierarchy level (chapters, subsections) into independent chunks.
    • Further split overly long chunks with a sliding window.
  4. Output construction: build chunks with metadata and save them as JSON. A minimal usage preview follows this list.
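
For orientation, here is a minimal driver sketch, assuming the full listing in the next section is saved as pdf_rag_split.py (the module name and file paths are illustrative assumptions):

from pdf_rag_split import build_rag_chunks

# Expected artifacts: rag_chunks/example.yaml (hierarchy) and
# rag_chunks/example_chunks.json (RAG chunks).
build_rag_chunks("document/example.pdf", "rag_chunks", max_length=1000)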

III. Full Code Implementation

import yaml
import json
from pathlib import Path
from typing import Any, Optional
from collections import OrderedDict
from pdfminer.pdfparser import PDFParser, PDFSyntaxError
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import PDFObjRef
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from enum import Enum, auto
from io import StringIO

# Original outline-parsing code
class PDFRefType(Enum):
    """Kinds of references an outline destination can point to."""
    PDF_OBJ_REF = auto()
    DICTIONARY = auto()
    LIST = auto()
    NAMED_REF = auto()
    UNK = auto()

class RefPageNumberResolver:
    """Resolves outline destination references to 1-based page numbers."""
    def __init__(self, document: PDFDocument):
        self.document = document
        self.objid_to_pagenum = {
            page.pageid: page_num
            for page_num, page in enumerate(PDFPage.create_pages(document), 1)
        }

    @classmethod
    def get_ref_type(cls, ref: Any) -> 'PDFRefType':
        if isinstance(ref, PDFObjRef):
            return PDFRefType.PDF_OBJ_REF
        elif isinstance(ref, dict) and "D" in ref:
            return PDFRefType.DICTIONARY
        elif isinstance(ref, list) and any(isinstance(e, PDFObjRef) for e in ref):
            return PDFRefType.LIST
        elif isinstance(ref, bytes):
            return PDFRefType.NAMED_REF
        else:
            return PDFRefType.UNK

    @classmethod
    def is_ref_page(cls, ref: Any) -> bool:
        from pdfminer.pdfpage import LITERAL_PAGE
        return isinstance(ref, dict) and "Type" in ref and ref["Type"] is LITERAL_PAGE

    def resolve(self, ref: Any) -> Optional[int]:
        ref_type = self.get_ref_type(ref)
        if ref_type is PDFRefType.PDF_OBJ_REF and self.is_ref_page(ref.resolve()):
            return self.objid_to_pagenum.get(ref.objid)
        elif ref_type is PDFRefType.PDF_OBJ_REF:
            return self.resolve(ref.resolve())
        if ref_type is PDFRefType.DICTIONARY:
            return self.resolve(ref["D"])
        if ref_type is PDFRefType.LIST:
            return self.resolve(next(filter(lambda e: isinstance(e, PDFObjRef), ref)))
        if ref_type is PDFRefType.NAMED_REF:
            return self.resolve(self.document.get_dest(ref))
        return None

def extract_page_text(document: PDFDocument, page_num: int) -> str:
    rsrcmgr = PDFResourceManager()
    output_string = StringIO()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    
    for i, page in enumerate(PDFPage.create_pages(document), 1):
        if i == page_num:
            interpreter.process_page(page)
            break
    
    text = output_string.getvalue()
    device.close()
    output_string.close()
    return text.strip()

def build_hierarchy(outlines, document: PDFDocument) -> OrderedDict:
    """Build a nested title/content/children tree from the flat outline entries."""
    hierarchy = OrderedDict()
    resolver = RefPageNumberResolver(document)
    stack = [(hierarchy, 1)]  # (node, expected_level)

    for level, title, dest, a, se in outlines:
        page_num = None
        if dest:
            page_num = resolver.resolve(dest)
        elif a:
            page_num = resolver.resolve(a)
        elif se:
            page_num = resolver.resolve(se)

        content = extract_page_text(document, page_num) if page_num else ""

        node = OrderedDict([('title', title)])
        if content:
            node['content'] = content

        while stack and stack[-1][1] > level:
            stack.pop()
        parent = stack[-1][0]
        
        if 'children' not in parent:
            parent['children'] = []
        parent['children'].append(node)
        stack.append((node, level + 1))

    return hierarchy

def pdf_to_yaml(pdf_path: str, output_path: str = None) -> Optional[OrderedDict]:
    if output_path is None:
        output_path = Path(pdf_path).with_suffix('.yaml')

    try:
        with open(pdf_path, 'rb') as fp:
            parser = PDFParser(fp)
            document = PDFDocument(parser)
            outlines = list(document.get_outlines())
            
            if not outlines:
                print(f"No outlines found in {pdf_path}")
                return None
            
            hierarchy = build_hierarchy(outlines, document)
            
            # Register a plain-mapping representer for OrderedDict; otherwise the
            # default Dumper writes a !!python tag that yaml.safe_load (used in
            # hierarchical_split below) cannot read back.
            yaml.add_representer(OrderedDict, lambda d, data: d.represent_mapping(
                'tag:yaml.org,2002:map', data.items()))
            with open(output_path, 'w', encoding='utf-8') as yaml_file:
                yaml.dump(hierarchy, yaml_file, allow_unicode=True,
                          default_flow_style=False, sort_keys=False)
            
            print(f"Conversion complete! YAML saved to: {output_path}")
            return hierarchy

    except PDFNoOutlines:
        print(f"No outlines found in {pdf_path}")
        return None
    except PDFSyntaxError:
        print(f"Corrupted PDF or non-PDF file: {pdf_path}")
        return None
    except Exception as e:
        print(f"发生错误:{str(e)}")
        return None
    finally:
        try:
            parser.close()
        except NameError:
            pass

# Extension: hierarchical splitting for RAG

# 1. Extract the full text (fallback)
def extract_full_text(pdf_path: str) -> str:
    try:
        with open(pdf_path, 'rb') as fp:
            parser = PDFParser(fp)
            document = PDFDocument(parser)
            rsrcmgr = PDFResourceManager()
            output_string = StringIO()
            device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
            
            full_text = output_string.getvalue().strip()
            device.close()
            output_string.close()
            return full_text
    except Exception as e:
        print(f"提取全文失败:{str(e)}")
        return ""

# 2. Sliding-window splitting (for overly long text blocks)
def sliding_window_split(text: str, max_length: int = 1000, stride: int = 500) -> list:
    """Split long text into RAG-sized blocks using a sliding window."""
    text_blocks = []
    text_length = len(text)
    
    for start in range(0, text_length, stride):
        end = min(start + max_length, text_length)
        block = text[start:end]
        if end < text_length:
            # Prefer to cut at a sentence boundary in the second half of the window
            last_period = block.rfind('.')
            if last_period > max_length // 2:
                block = block[:last_period + 1]
        block = block.strip()
        if block:
            text_blocks.append(block)
        if end == text_length:
            # Stop once the window reaches the end of the text, so the tail is
            # not re-emitted as smaller blocks contained in the previous one
            break
    
    return text_blocks

# 3. Hierarchical splitting and recombination
def hierarchical_split(yaml_path: str, pdf_path: str, max_length: int = 1000) -> list:
    """从YAML生成层次化文档块"""
    with open(yaml_path, 'r', encoding='utf-8') as f:
        hierarchy = yaml.safe_load(f)
    
    doc_blocks = []
    doc_id = 0
    
    def traverse_node(node, level=0, parent_titles=[]):
        nonlocal doc_id
        title = node.get('title', '')
        content = node.get('content', '')
        hierarchy_path = parent_titles + [title] if title else parent_titles
        
        # Combine title and content
        block_text = f"{title}\n{content}".strip() if title and content else content or title
        
        # Skip empty nodes (e.g. the root wrapper); otherwise emit one or more chunks
        if block_text:
            # If the block is too long, split it further with the sliding window
            if len(block_text) > max_length:
                sub_blocks = sliding_window_split(block_text, max_length)
                for i, sub_block in enumerate(sub_blocks):
                    doc_blocks.append({
                        "id": f"doc_{doc_id}_{i}",
                        "hierarchy": " > ".join(hierarchy_path),
                        "level": level,
                        "content": sub_block
                    })
                doc_id += 1
            else:
                doc_blocks.append({
                    "id": f"doc_{doc_id}",
                    "hierarchy": " > ".join(hierarchy_path),
                    "level": level,
                    "content": block_text
                })
                doc_id += 1
        
        # Recurse into child nodes
        for child in node.get('children', []):
            traverse_node(child, level + 1, hierarchy_path)
    
    traverse_node(hierarchy)
    
    # If the hierarchy produced no chunks, fall back to splitting the full text
    if not doc_blocks:
        full_text = extract_full_text(pdf_path)
        if full_text:
            text_blocks = sliding_window_split(full_text, max_length)
            for i, block in enumerate(text_blocks):
                doc_blocks.append({
                    "id": f"doc_{i}",
                    "hierarchy": "Full Document",
                    "level": 0,
                    "content": block
                })
    
    return doc_blocks

# 4. Main entry point: build the RAG document chunks
def build_rag_chunks(pdf_path: str, output_dir: str, max_length: int = 1000):
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True)
    
    pdf_name = Path(pdf_path).stem
    yaml_file = output_dir / f"{pdf_name}.yaml"
    chunks_file = output_dir / f"{pdf_name}_chunks.json"
    
    # Step 1: hierarchical parsing (returns None when the PDF has no usable outline)
    hierarchy = pdf_to_yaml(pdf_path, str(yaml_file))
    
    # Step 2: split into document chunks; fall back to full-text splitting
    # when no outline/YAML could be produced
    if hierarchy is not None:
        doc_chunks = hierarchical_split(str(yaml_file), pdf_path, max_length)
    else:
        full_text = extract_full_text(pdf_path)
        doc_chunks = [
            {"id": f"doc_{i}", "hierarchy": "Full Document", "level": 0, "content": block}
            for i, block in enumerate(sliding_window_split(full_text, max_length))
        ] if full_text else []
    
    # Step 3: save as JSON
    with open(chunks_file, 'w', encoding='utf-8') as f:
        json.dump(doc_chunks, f, ensure_ascii=False, indent=2)
    print(f"RAG chunks saved to: {chunks_file}")

# Main workflow
if __name__ == "__main__":
    import glob
    pdf_files = glob.glob("document/*.pdf")
    output_dir = "rag_chunks"
    
    for pdf_file in pdf_files:
        build_rag_chunks(pdf_file, output_dir)

IV. Step-by-Step Walkthrough

1. Text extraction
  • Function: extract_full_text
  • What it does: extracts the complete text of the PDF, used as a fallback when there is no outline.
  • Why: guarantees that document chunks can be produced even when the PDF has no hierarchical structure.
2. Hierarchical parsing
  • Function: pdf_to_yaml (the original outline-parsing code)
  • What it does: extracts the PDF outline and per-section content into a hierarchical YAML file.
  • Output: a YAML file with a nested structure of titles and content (see the sample after this list).
3. Document splitting
  • Function: hierarchical_split
  • Logic
    • Traverse the YAML hierarchy and emit a document chunk per node.
    • Each chunk contains:
      • id: a unique identifier.
      • hierarchy: the hierarchy path (e.g. "Chapter 1 > Section 1.1").
      • level: the depth in the hierarchy.
      • content: the title combined with the content.
    • If a chunk exceeds max_length (default 1000 characters), it is split further with sliding_window_split.
  • Fallback strategy: if there is no outline, the full text is split with the sliding window instead.
4. Output construction
  • Function: build_rag_chunks
  • Output: a JSON file containing all document chunks.
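
To make steps 2 and 3 concrete, the YAML that hierarchical_split consumes has the following nested shape (an illustrative fragment matching the example in the next section; real files contain the full extracted page text):

children:
- title: Chapter 1
  content: This is a long chapter text exceeding 1000 characters...
  children:
  - title: Section 1.1
    content: Short text here.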

V. Output Example

Suppose the PDF contains:

  • Chapter 1: "This is a long chapter text exceeding 1000 characters..."
  • Section 1.1: "Short text here."

Output (example_chunks.json):

[
  {
    "id": "doc_0_0",
    "hierarchy": "Chapter 1",
    "level": 1,
    "content": "This is a long chapter text exceeding 1000 characters... [第一部分]"
  },
  {
    "id": "doc_0_1",
    "hierarchy": "Chapter 1",
    "level": 1,
    "content": "[第二部分]..."
  },
  {
    "id": "doc_1",
    "hierarchy": "Chapter 1 > Section 1.1",
    "level": 2,
    "content": "Section 1.1\nShort text here."
  }
]

VI. Integrating with a RAG System

  • Vector storage: embed each chunk's content (e.g. with Sentence-Transformers) and store it in a vector database (e.g. FAISS or Chroma); a minimal sketch follows this list.
  • Metadata: hierarchy and level can be used to filter or re-rank retrieval results.
  • Length control: max_length=1000 fits the input limits of common embedding models.
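
As a rough sketch only (assuming sentence-transformers and faiss-cpu are installed; the model name, file path, and query are illustrative assumptions, not part of the original solution):

import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Load the chunks produced by build_rag_chunks (example path).
with open("rag_chunks/example_chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)

# Embed each chunk's content; the model choice is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.asarray(model.encode([c["content"] for c in chunks]), dtype="float32")

# Index the embeddings with a flat L2 index and run a sample query.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

query = np.asarray(model.encode(["What does Section 1.1 say?"]), dtype="float32")
distances, ids = index.search(query, 3)

for rank, idx in enumerate(ids[0], 1):
    if idx < 0:  # FAISS pads with -1 when there are fewer results than k
        continue
    hit = chunks[idx]
    print(rank, hit["hierarchy"], "->", hit["content"][:80])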

VII. Advantages

  1. Hierarchy preservation: the outline-parsing code keeps the PDF's semantic structure.
  2. Flexibility: handles PDFs with or without an outline and switches the splitting strategy automatically.
  3. RAG-ready: the output format feeds directly into embedding and retrieval.

VIII. Further Improvements

  • Sentence-boundary handling: use NLP tooling (e.g. nltk) for more accurate sentence boundaries; a sketch follows this list.
  • Multilingual support: handle PDFs that mix Chinese and English.
  • Parallel processing: process large batches of PDFs concurrently.
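
For the sentence-boundary improvement, a minimal sketch assuming nltk is installed (newer nltk versions may require the "punkt_tab" resource instead of "punkt"; punkt does not segment Chinese, so mixed-language PDFs would still need a separate splitter):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the sentence model

def sentence_aware_split(text: str, max_length: int = 1000) -> list:
    """Greedily pack whole sentences into blocks of at most max_length characters."""
    blocks, current = [], ""
    for sentence in sent_tokenize(text):
        # Start a new block when adding this sentence would exceed max_length.
        if current and len(current) + len(sentence) + 1 > max_length:
            blocks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        blocks.append(current.strip())
    return blocks

# This could replace sliding_window_split inside hierarchical_split for cleaner
# boundaries; single sentences longer than max_length are kept whole here.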