
# A Practical Guide to Improving AI Data Extraction with Reference Examples

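## Introduction

As large language models (LLMs) have matured, extracting structured data from unstructured text has become a key challenge. Providing reference examples can noticeably improve extraction quality. This guide shows how to build few-shot examples of tool calls to steer the behavior of data extraction and similar applications.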

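## Main content

### 1. Introducing reference examples

With tool-calling models, reference examples can improve extraction results. LangChain exposes a `tool_calls` attribute on the AI messages an LLM returns, and a reference example is built as a sequence of messages:

- **HumanMessage**: contains the example input.
- **AIMessage**: contains the example tool calls.
- **ToolMessage**: contains the example tool outputs.

The following code builds the prompt template: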

```python
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked "
            "to extract, return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"),
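        ("human", "{text}"),
    ]
)
```

### 2. Define the extraction schema

Define a schema for the extraction task, for example:

```python
from typing import List, Optional

from langchain_core.pydantic_v1 import BaseModel, Field


class Person(BaseModel):
    """Information about a person."""

    # Field(...) makes each attribute required but nullable: the model must
    # return the key, and may set it to null when the value is unknown.
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERs")


class Data(BaseModel):
    """Extracted data about people."""

    people: List[Person]
```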
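To make the shape of the expected output concrete, here is a minimal sketch of the JSON arguments a tool call might carry under this schema. It uses standard-library dataclasses as a stand-in for the Pydantic models so that it runs without LangChain installed; `PersonRecord`, `DataRecord`, and `to_tool_args` are hypothetical names introduced only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PersonRecord:
    # Hypothetical stand-in for the Person schema; unknown attributes stay None.
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


@dataclass
class DataRecord:
    # Hypothetical stand-in for the Data schema.
    people: List[PersonRecord] = field(default_factory=list)


def to_tool_args(data: DataRecord) -> str:
    """Serialize a record to a JSON string, with keys sorted for stable output."""
    return json.dumps(asdict(data), sort_keys=True)


# A text that mentions nobody should yield an empty list, not an invented person.
no_people = to_tool_args(DataRecord())
# A text that names a person but gives no other details yields nulls for the rest.
one_person = to_tool_args(DataRecord(people=[PersonRecord(name="Fiona")]))

print(no_people)
print(one_person)
```

The two cases mirror what good reference examples tend to cover: an input with nothing to extract, and an input where only some attributes are known.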

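### 3. Build the reference examples

Create reference examples that pair an input with the expected output, and convert each one into a message sequence:

```python
import uuid
from typing import List, TypedDict

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage


class Example(TypedDict):
    """One reference example: an input text and the expected tool calls."""

    input: str  # the example text
    tool_calls: List[BaseModel]  # schema instances (BaseModel is imported in section 2)


def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    """Convert an example into a HumanMessage, an AIMessage, and ToolMessages."""
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    tool_calls = []
    for tool_call in example["tool_calls"]:
        tool_calls.append(
            {
                "id": str(uuid.uuid4()),  # links the call to its ToolMessage
                "args": tool_call.dict(),
                "name": tool_call.__class__.__name__,  # schema class name is the tool name
            }
        )
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    # "tool_outputs" is an optional key; default to a generic confirmation per call.
    tool_outputs = example.get("tool_outputs") or [
        "You have correctly called this tool."
    ] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages
```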

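## Code example

Set up and invoke the model. The `messages` list passed to the prompt is built with `tool_example_to_messages`; the two examples below are illustrative.

```python
from langchain_openai import ChatOpenAI

# Illustrative examples: each pairs an input text with the expected Data output.
examples = [
    ("The ocean is vast and blue.", Data(people=[])),
    ("Fiona traveled far from France to Spain.", Data(people=[Person(name="Fiona", hair_color=None, height_in_meters=None)])),
]
messages = []
for example_text, expected in examples:
    messages.extend(tool_example_to_messages({"input": example_text, "tool_calls": [expected]}))

llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)
runnable = prompt | llm.with_structured_output(schema=Data, method="function_calling", include_raw=False)

# Invoke the chain several times on the same text to check that the output is stable.
text = "The solar system is large, but earth has only 1 moon."
for _ in range(5):
    print(runnable.invoke({"text": text, "examples": messages}))
```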

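## Common problems and solutions

### Challenges

- **Insufficient extraction accuracy**: without reference examples, the model may fail even on simple tests.
- **Heavy reliance on tool calling**: some models depend heavily on tool calling.

### Solutions

- **Provide enough reference examples**: well-chosen reference examples can noticeably improve extraction accuracy in complex scenarios.
- **Tune model parameters**: adjust the temperature and other parameters to suit the task and improve stability.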
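One way to judge whether examples or parameter changes actually improve stability is to run the same input several times and count the distinct outputs. The sketch below does this with the standard library only; `run_repeatedly` and `fake_extract` are hypothetical helpers, and `fake_extract` stands in for a real chain call so the snippet runs without an API key.

```python
import json
from collections import Counter
from typing import Any, Callable, Dict


def run_repeatedly(
    extract: Callable[[str], Dict[str, Any]], text: str, runs: int = 5
) -> Counter:
    """Call `extract` several times and count each distinct result."""
    counts: Counter = Counter()
    for _ in range(runs):
        result = extract(text)
        # Serialize with sorted keys so equal results map to the same counter key.
        counts[json.dumps(result, sort_keys=True)] += 1
    return counts


def fake_extract(text: str) -> Dict[str, Any]:
    # Hypothetical stand-in for a real chain call; always reports no people.
    return {"people": []}


counts = run_repeatedly(
    fake_extract, "The solar system is large, but earth has only 1 moon."
)
# The extraction is stable on this input if every run produced the same result.
is_stable = len(counts) == 1
print(counts, is_stable)
```

With a real chain, `extract` could wrap the invocation and convert the returned model to a dictionary. Comparing the counts with and without reference examples gives a rough signal of their effect.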

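## Summary and further learning resources

Using the approach described here, you can improve data extraction quality by supplying reference examples. To learn more, see the following resources:

### References

- LangChain documentation: langchain.docs
- Pydantic: pydantic-docs.helpmanual.io/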
