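How to split text by HTML sections
Introduction
When working with large HTML documents, it is important to split the content into semantically related sections. Doing so makes retrieval more efficient and preserves the context carried by the document's structure. This article introduces HTMLSectionSplitter, a "structure-aware" chunker that splits text at the element level and attaches metadata to each section.
Main content
What is HTMLSectionSplitter
HTMLSectionSplitter is a chunker designed for HTML documents. It detects the sections in a document, splits on them, and attaches the relevant header metadata to each one. Given the header tags you provide, it can also transform the HTML with an XSLT file, specified by path, so that sections become easier to detect.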
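To make "structure-aware" splitting concrete, here is a minimal sketch of the idea using only Python's standard library. It is not how HTMLSectionSplitter is implemented: the class SectionSketch and the function split_sections are hypothetical names invented for this illustration, and the sketch ignores XSLT entirely. It simply starts a new section whenever it meets one of the configured header tags and records that header's text as metadata.

```python
from html.parser import HTMLParser


class SectionSketch(HTMLParser):
    """Toy section splitter (hypothetical): starts a new section at each chosen header tag."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.sections = []       # each item: {"metadata": {...}, "text": [...]}
        self.open_header = None  # header tag currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self.open_header = tag
            self.sections.append({"metadata": {self.labels[tag]: ""}, "text": []})

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and any text before the first header
        section = self.sections[-1]
        if self.open_header:
            # Text inside the header tag also becomes the metadata value.
            section["metadata"][self.labels[self.open_header]] += text
        section["text"].append(text)


def split_sections(html, headers_to_split_on):
    """Return a list of dicts with 'page_content' and 'metadata' keys."""
    parser = SectionSketch(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return [
        {"page_content": " ".join(s["text"]), "metadata": s["metadata"]}
        for s in parser.sections
    ]


sample = "<h1>Foo</h1><p>Intro.</p><h2>Bar</h2><p>Body.</p><h3>Sub</h3><p>More.</p>"
sections = split_sections(sample, [("h1", "Header 1"), ("h2", "Header 2")])
for section in sections:
    print(section)
```

Because h3 is not in the configured list, the "Sub" heading and its paragraph stay inside the "Bar" section. The library shows the same behavior in the first example that follows, where the h3 subsections remain part of "Bar main section".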
How to use HTMLSectionSplitter
1. Splitting an HTML string
from langchain_text_splitters import HTMLSectionSplitter
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Foo</h1>
<p>Some intro text about Foo.</p>
<div>
<h2>Bar main section</h2>
<p>Some intro text about Bar.</p>
<h3>Bar subsection 1</h3>
<p>Some text about the first subtopic of Bar.</p>
<h3>Bar subsection 2</h3>
<p>Some text about the second subtopic of Bar.</p>
</div>
<div>
<h2>Baz</h2>
<p>Some text about Baz</p>
</div>
<br>
<p>Some concluding text about Foo</p>
</div>
</body>
</html>
"""
# Each pair is (HTML tag to split on, metadata key under which its text is stored)
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
Output:
[Document(page_content='Foo \n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),
Document(page_content='Bar main section \n Some intro text about Bar. \n Bar subsection 1 \n Some text about the first subtopic of Bar. \n Bar subsection 2 \n Some text about the second subtopic of Bar.', metadata={'Header 2': 'Bar main section'}),
Document(page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]
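2. Limiting section size
HTMLSectionSplitter can be combined with other text splitters such as RecursiveCharacterTextSplitter, which further splits any section whose size exceeds the configured chunk size. In the example below, every section is already shorter than chunk_size, so the recursive splitter leaves the sections unchanged. The extra splits in the output come from adding h3 to headers_to_split_on.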
from langchain_text_splitters import HTMLSectionSplitter, RecursiveCharacterTextSplitter
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Foo</h1>
<p>Some intro text about Foo.</p>
<div>
<h2>Bar main section</h2>
<p>Some intro text about Bar.</p>
<h3>Bar subsection 1</h3>
<p>Some text about the first subtopic of Bar.</p>
<h3>Bar subsection 2</h3>
<p>Some text about the second subtopic of Bar.</p>
</div>
<div>
<h2>Baz</h2>
<p>Some text about Baz</p>
</div>
<br>
<p>Some concluding text about Foo</p>
</div>
</body>
</html>
"""
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]
html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split any section longer than chunk_size into smaller chunks
splits = text_splitter.split_documents(html_header_splits)
splits
Output:
[Document(page_content='Foo \n Some intro text about Foo.', metadata={'Header 1': 'Foo'}),
Document(page_content='Bar main section \n Some intro text about Bar.', metadata={'Header 2': 'Bar main section'}),
Document(page_content='Bar subsection 1 \n Some text about the first subtopic of Bar.', metadata={'Header 3': 'Bar subsection 1'}),
Document(page_content='Bar subsection 2 \n Some text about the second subtopic of Bar.', metadata={'Header 3': 'Bar subsection 2'}),
Document(page_content='Baz \n Some text about Baz \n \n \n Some concluding text about Foo', metadata={'Header 2': 'Baz'})]
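To make the effect of chunk_size and chunk_overlap visible on a small input, the sketch below applies a size limit to sections that have already been split. It is a deliberately simplified stand-in: it cuts fixed-width character windows, whereas RecursiveCharacterTextSplitter chooses split points by separators. The function limit_section_size and the plain-dict section format are assumptions made for this illustration only.

```python
def limit_section_size(sections, chunk_size, chunk_overlap):
    """Cut oversized sections into overlapping windows, copying their metadata.

    Hypothetical helper: fixed-width windows, not the recursive strategy.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for section in sections:
        text = section["page_content"]
        if len(text) <= chunk_size:
            chunks.append(section)  # small sections pass through unchanged
            continue
        for start in range(0, len(text), step):
            chunks.append(
                {
                    "page_content": text[start:start + chunk_size],
                    "metadata": dict(section["metadata"]),  # keep the header
                }
            )
            if start + chunk_size >= len(text):
                break  # this window already reached the end of the text
    return chunks


sections = [
    {"page_content": "Foo intro", "metadata": {"Header 1": "Foo"}},
    {"page_content": "abcdefghijklmnopqrstuvwxyz", "metadata": {"Header 2": "Bar"}},
]
chunks = limit_section_size(sections, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

The short section passes through untouched, while the 26-character section becomes three windows that each repeat the last two characters of the previous one. Every window keeps the "Header 2" metadata, which is the main reason to split by section first and by size second.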
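Common problems and solutions
Common problems
- Splitting fails or section tags are not recognized:
  - Check that the XSLT path you supplied is correct, and make sure the target HTML has been preprocessed appropriately.
- The chunks are unsatisfactory, with sections that do not hang together logically:
  - Adjust the tags to split on and their metadata keys so that they better reflect the document's structure.
Solutions
- Use an XSLT transformation on the HTML to make section detection accurate.
- Tune the parameters of HTMLSectionSplitter and RecursiveCharacterTextSplitter to fit your needs.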
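When section tags are not being recognized, a quick first check is whether the tags you configured actually occur in the HTML. The sketch below is one way to do that with the standard library. TagCounter and missing_headers are hypothetical helper names, not part of langchain_text_splitters.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def missing_headers(html, headers_to_split_on):
    """Return the configured header tags that never occur in the HTML."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return [tag for tag, _ in headers_to_split_on if counter.counts[tag] == 0]


html = "<div><h1>Foo</h1><span class='title'>Bar</span><p>Text</p></div>"
missing = missing_headers(html, [("h1", "Header 1"), ("h2", "Header 2")])
print(missing)
```

If a configured tag is reported as missing, the document probably marks its headings up in some other way, such as the styled span in this sample. In that situation you would either change the tags you split on or transform the markup before splitting.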
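Summary
With HTMLSectionSplitter, an HTML document can be split into semantically related sections while keeping the key information carried by its structure. Combining it with another text splitter such as RecursiveCharacterTextSplitter additionally keeps each chunk within a chosen size.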