How to Split Text by HTML Sections


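Introduction

When working with large HTML documents, it pays to split the content into semantically related sections. Doing so makes retrieval more efficient and preserves the context carried by the document's structure. This article introduces HTMLSectionSplitter, a "structure-aware" splitter that splits text at the element level and adds metadata to each section.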

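Main Content

What Is HTMLSectionSplitter

HTMLSectionSplitter is a splitter built specifically for HTML documents. It detects the sections in a document, splits on them, and attaches relevant metadata to each section. Based on the tags you provide, it can transform the HTML with an XSLT stylesheet, referenced by path, so that sections are easier to detect.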
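
To make the idea of splitting on header tags concrete, here is a minimal sketch that uses only Python's standard-library html.parser. It is not how HTMLSectionSplitter is implemented. The names SectionCollector and split_by_headers are hypothetical, and the sketch assumes each section starts at one of the configured header tags and runs until the next one.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Collect text into sections that start at the chosen header tags."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}
        self.header_map = dict(headers_to_split_on)
        self.sections = []      # one dict per section, in document order
        self.in_header = False  # True while inside a split header tag

    def handle_starttag(self, tag, attrs):
        if tag in self.header_map:
            # A configured header tag opens a new section
            self.sections.append({"tag": tag, "header": "", "text": []})
            self.in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_map:
            self.in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and any text before the first header
        current = self.sections[-1]
        if self.in_header:
            current["header"] += text
        else:
            current["text"].append(text)


def split_by_headers(html, headers_to_split_on):
    """Return one dict per section with page_content and metadata."""
    parser = SectionCollector(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return [
        {
            "page_content": " ".join([s["header"]] + s["text"]),
            "metadata": {parser.header_map[s["tag"]]: s["header"]},
        }
        for s in parser.sections
    ]


sample = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>Body.</p><h3>Sub</h3><p>More.</p>"
for section in split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(section)
```

Because h3 is not in the configured list, the text under it stays inside the enclosing h2 section, which is the same behavior the first LangChain example in this article shows for the "Bar" subsections.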

How to Use HTMLSectionSplitter

1. Splitting an HTML String
from langchain_text_splitters import HTMLSectionSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

Output:

[Document(page_content='Foo \n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),
 Document(page_content='Bar main section \n Some intro text about Bar. \n Bar subsection 1 \n Some text about the first subtopic of Bar. \n Bar subsection 2 \n Some text about the second subtopic of Bar.', metadata={'Header 2': 'Bar main section'}),
 Document(page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]

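2. Limiting Section Size

HTMLSectionSplitter can be combined with other text splitters such as RecursiveCharacterTextSplitter: when a section exceeds the configured chunk size, the recursive character splitter breaks it down further.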

from langchain_text_splitters import HTMLSectionSplitter, RecursiveCharacterTextSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split the sections into size-limited chunks
splits = text_splitter.split_documents(html_header_splits)
splits

Output:

[Document(page_content='Foo \n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),
 Document(page_content='Bar main section \n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),
 Document(page_content='Bar subsection 1 \n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),
 Document(page_content='Bar subsection 2 \n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),
 Document(page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]
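
Every section in the output above is shorter than 500 characters, so none of them is split further. To show what happens when a section is too large, here is a simplified sketch that cuts text into fixed-width character windows with overlap and copies the section metadata onto every piece. This is a stand-in technique for illustration only: RecursiveCharacterTextSplitter splits on a hierarchy of separators rather than at fixed offsets. The function name split_sections and the dict layout are assumptions.

```python
def split_sections(sections, chunk_size, chunk_overlap):
    """Cut oversized sections into overlapping pieces, copying metadata."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for section in sections:
        text = section["page_content"]
        if len(text) <= chunk_size:
            chunks.append(section)  # small sections pass through unchanged
            continue
        for start in range(0, len(text), step):
            chunks.append(
                {
                    "page_content": text[start : start + chunk_size],
                    # Each piece keeps the header metadata of its section
                    "metadata": dict(section["metadata"]),
                }
            )
            if start + chunk_size >= len(text):
                break  # this window already reached the end of the text
    return chunks


sections = [
    {"page_content": "abcdefghij", "metadata": {"Header 1": "Foo"}},
    {"page_content": "xy", "metadata": {"Header 2": "Bar"}},
]
for chunk in split_sections(sections, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The point to take away is that the header metadata survives the second split, so each small chunk can still be traced back to the section it came from.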

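Common Problems and Solutions

Common Problems

  1. Splitting fails, or section tags are not recognized correctly
    • Check that the XSLT path you provide is correct, and make sure the target HTML has been properly preprocessed.
  2. The chunks are unsatisfactory, with no logical continuity between sections
    • Adjust the tags to split on and their metadata names so they better reflect the document's structure.

Solutions

  • Transform the HTML with XSLT to ensure sections are identified accurately.
  • Tune the parameters of HTMLSectionSplitter and RecursiveCharacterTextSplitter to fit your actual needs.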
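
When section tags are not being recognized, one quick check is whether the tags you configured appear in the HTML at all. The sketch below uses only the standard library. TagCounter and missing_split_tags are hypothetical helper names and are not part of LangChain.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag opens in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        self.counts[tag] += 1


def missing_split_tags(html, headers_to_split_on):
    """Return the configured tags that never appear in the HTML."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return [tag for tag, _ in headers_to_split_on if counter.counts[tag] == 0]


html = "<h1>Foo</h1><h2>Bar</h2><h3>Sub</h3>"
print(missing_split_tags(html, [("h1", "Header 1"), ("h4", "Header 4")]))
```

A tag reported as missing is not necessarily a problem (the second example in this article lists h4 even though the sample HTML has none), but an unexpectedly missing tag usually points to a typo in the configuration or to HTML that needs preprocessing.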

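Summary and Further Learning Resources

With HTMLSectionSplitter, we can split HTML documents into semantically related sections while keeping the key information carried by the document's structure. Combining it with another text splitter such as RecursiveCharacterTextSplitter keeps oversized sections within a chosen chunk size.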

Further Learning Resources

  1. LangChain Text Splitters
  2. W3Schools XSLT Tutorial
