Study Notes 5: The Pydantic (JSON) Parser in Practice

The Pydantic (JSON) parser is probably the most commonly used and most important of the output parsers. I'll walk you through using it to refactor the flower copywriting program.

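Pydantic is a Python library for data validation and settings management, built mainly on Python type hints. Although it is not designed specifically for JSON, JSON is a common data format in modern web applications and API interactions, so Pydantic is particularly useful for processing and validating JSON data.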

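Step 1: Create the model instance

First set the OpenAI API key through an environment variable, then use the LangChain library to create an OpenAI model instance, selecting gpt-3.5-turbo-instruct as the large language model.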

------Part 1

# Set the OpenAI API key (replace the placeholder with your own key)
import os
os.environ["OPENAI_API_KEY"] = '你的OpenAI API Key'

# Create the model instance
from langchain import OpenAI
model = OpenAI(model_name='gpt-3.5-turbo-instruct')

Step 2: Define the output data format

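Create an empty DataFrame to store the descriptions generated by the model. Then define a Pydantic BaseModel class named FlowerDescription to specify the expected data format, that is, the structure of the data.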

------Part 2

# Create an empty DataFrame to store the results
import pandas as pd
df = pd.DataFrame(columns=["flower_type", "price", "description", "reason"])

# Prepare the data
flowers = ["玫瑰", "百合", "康乃馨"]  # rose, lily, carnation
prices = ["50", "30", "20"]

# Define the data format we want to receive
from pydantic import BaseModel, Field
class FlowerDescription(BaseModel):
    flower_type: str = Field(description="鲜花的种类")  # the kind of flower
    price: int = Field(description="鲜花的价格")  # the flower's price
    description: str = Field(description="鲜花的描述文案")  # the marketing copy for the flower
    reason: str = Field(description="为什么要这样写这个文案")  # why the copy was written this way
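As a quick check of what this model gives us, the sketch below builds a FlowerDescription from a price passed as a string, as in the prices list above, and then from a price that is not a number. The class is repeated only to keep the snippet self-contained, and the sample description and reason values are made up for illustration.

```python
from pydantic import BaseModel, Field, ValidationError

class FlowerDescription(BaseModel):
    flower_type: str = Field(description="鲜花的种类")
    price: int = Field(description="鲜花的价格")
    description: str = Field(description="鲜花的描述文案")
    reason: str = Field(description="为什么要这样写这个文案")

# The numeric string "50" is converted to the integer 50.
item = FlowerDescription(flower_type="玫瑰", price="50",
                         description="sample copy", reason="sample reason")
print(type(item.price).__name__, item.price)

# A value that cannot be read as an integer is rejected.
try:
    FlowerDescription(flower_type="玫瑰", price="fifty",
                      description="sample copy", reason="sample reason")
    rejected = False
except ValidationError:
    rejected = True
print("invalid price rejected:", rejected)
```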

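Pydantic's main features include:

Data validation: it automatically checks that input data matches the declared types and any other validation conditions.
Data conversion: it can convert data automatically, for example turning a numeric string into an integer.
Ease of use: you specify each field's type in the class definition using ordinary Python type annotations.
JSON support: Pydantic class instances can easily be created from JSON data, and an instance's data can be converted back to JSON.

Step 3: Create the output parser

Use the PydanticOutputParser from the LangChain library to create the output parser. It will parse the model's output and make sure it conforms to the FlowerDescription format. Then call the parser's get_format_instructions method to obtain the output format instructions.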

------Part 3

# Create the output parser
from langchain.output_parsers import PydanticOutputParser
output_parser = PydanticOutputParser(pydantic_object=FlowerDescription)

# Get the output format instructions
format_instructions = output_parser.get_format_instructions()

# Print the format instructions (the label means "Output format:")
print("输出格式：", format_instructions)

The program prints:

输出格式： The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{"properties": {"flower_type": {"title": "Flower Type", "description": "\u9c9c\u82b1\u7684\u79cd\u7c7b", "type": "string"}, "price": {"title": "Price", "description": "\u9c9c\u82b1\u7684\u4ef7\u683c", "type": "integer"}, "description": {"title": "Description", "description": "\u9c9c\u82b1\u7684\u63cf\u8ff0\u6587\u6848", "type": "string"}, "reason": {"title": "Reason", "description": "\u4e3a\u4ec0\u4e48\u8981\u8fd9\u6837\u5199\u8fd9\u4e2a\u6587\u6848", "type": "string"}}, "required": ["flower_type", "price", "description", "reason"]}

Next, we pass these instructions into the model's prompt as well, so that what the prompt asks for matches what the output parser expects.
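The \uXXXX sequences in the schema above are just the Chinese field descriptions in escaped form. As a rough illustration (not LangChain's actual code), Python's standard json module produces the same kind of escapes by default when it serializes a schema-like dict, and decoding restores the original text. The schema dict here is a hand-written, shortened stand-in.

```python
import json

# Hand-written stand-in for part of the schema shown above.
schema = {
    "properties": {
        "flower_type": {"title": "Flower Type", "description": "鲜花的种类", "type": "string"},
        "price": {"title": "Price", "description": "鲜花的价格", "type": "integer"},
    },
    "required": ["flower_type", "price"],
}

escaped = json.dumps(schema)                       # default: non-ASCII escaped as \uXXXX
readable = json.dumps(schema, ensure_ascii=False)  # keeps the Chinese text as-is

print(escaped)
print(readable)

# Both strings decode back to the same Python object.
assert json.loads(escaped) == json.loads(readable) == schema
```

Either form is valid JSON, so the escaped descriptions do not change what the schema means.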

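Step 4: Create the prompt template

Define a prompt template that will be used to generate the input prompt for the model. The template contains the variables to be filled in for each request (the price and the kind of flower) as well as the output format instructions obtained earlier.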

------Part 4

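# Create the prompt template
# (the text asks a professional flower-shop copywriter for a short, attractive
# Chinese description of a {flower} priced at {price} yuan)
from langchain import PromptTemplate
prompt_template = """您是一位专业的鲜花店文案撰写员。
对于售价为 {price} 元的 {flower} ，您能提供一个吸引人的简短中文描述吗？
{format_instructions}"""

# Create the prompt from the template, adding the output parser's instructions to it
prompt = PromptTemplate.from_template(prompt_template,
    partial_variables={"format_instructions": format_instructions})

# Print the prompt (the label means "Prompt:")
print("提示：", prompt)

Output: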

提示： input_variables=['flower', 'price']

output_parser=None

partial_variables={'format_instructions': 'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\n As an example, for the schema { "properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}\n the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\n Here is the output schema:\n\n {"properties": { "flower_type": {"title": "Flower Type", "description": "\\u9c9c\\u82b1\\u7684\\u79cd\\u7c7b", "type": "string"}, "price": {"title": "Price", "description": "\\u9c9c\\u82b1\\u7684\\u4ef7\\u683c", "type": "integer"}, "description": {"title": "Description", "description": "\\u9c9c\\u82b1\\u7684\\u63cf\\u8ff0\\u6587\\u6848", "type": "string"}, "reason": {"title": "Reason", "description": "\\u4e3a\\u4ec0\\u4e48\\u8981\\u8fd9\\u6837\\u5199\\u8fd9\\u4e2a\\u6587\\u6848", "type": "string"}}, "required": ["flower_type", "price", "description", "reason"]}\n'}

template='您是一位专业的鲜花店文案撰写员。 \n对于售价为 {price} 元的 {flower} ，您能提供一个吸引人的简短中文描述吗？\n {format_instructions}'

template_format='f-string'

validate_template=True

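Overall, the prompt template is a tool for generating model inputs. You define the input variables it needs along with the format and structure of the template string, then use the template to generate a description request for each kind of flower.

Later, we loop over the actual data and pass it into the template to produce the individual concrete prompts. Let's continue.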
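To make the split between partial variables and input variables concrete, here is a minimal plain-Python sketch of the idea, not LangChain's implementation: one value is fixed once when the template is set up, and the rest are supplied on each call. The helper make_prompt and the shortened format_instructions string are hypothetical.

```python
def make_prompt(template: str, **partial_variables):
    """Return a function that fills the remaining variables of `template`."""
    def format_prompt(**input_variables) -> str:
        return template.format(**partial_variables, **input_variables)
    return format_prompt

template = (
    "您是一位专业的鲜花店文案撰写员。\n"
    "对于售价为 {price} 元的 {flower} ，您能提供一个吸引人的简短中文描述吗？\n"
    "{format_instructions}"
)

# Shortened stand-in for the parser's real format instructions.
format_instructions = 'Return a JSON object such as {"flower_type": "...", "price": 0}.'

# Fix format_instructions once; flower and price are supplied per call.
prompt = make_prompt(template, format_instructions=format_instructions)

flowers = ["玫瑰", "百合", "康乃馨"]
prices = ["50", "30", "20"]
prompts = [prompt(flower=f, price=p) for f, p in zip(flowers, prices)]
print(prompts[0])
```

The braces inside format_instructions cause no trouble here because str.format inserts substituted values verbatim and does not re-parse them.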

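Step 5: Generate the prompts, pass them to the model, and parse the output

Loop over all the flowers and their prices. For each flower, create the input from the prompt template and get the model's output. Then parse the output with the parser created earlier and add the parsed result to the DataFrame.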

------Part 5

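for flower, price in zip(flowers, prices):
    # Build the model input from the prompt template
    input_prompt = prompt.format(flower=flower, price=price)
    # Print the prompt (the label means "Prompt:")
    print("提示：", input_prompt)

    # Get the model's output
    output = model(input_prompt)

    # Parse the model's output
    parsed_output = output_parser.parse(output)
    parsed_output_dict = parsed_output.dict()  # convert the Pydantic object to a dict

    # Append the parsed output to the DataFrame
    df.loc[len(df)] = parsed_output_dict

# Print the results as a list of dicts (the label means "Output data:")
print("输出的数据：", df.to_dict(orient='records'))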

Concretely, one of the printed prompts looks like this:

提示：您是一位专业的鲜花店文案撰写员。对于售价为 20 元的康乃馨，您能提供一个吸引人的简短中文描述吗？

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}

the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{"properties": {"flower_type": {"title": "Flower Type", "description": "\u9c9c\u82b1\u7684\u79cd\u7c7b", "type": "string"}, "price": {"title": "Price", "description": "\u9c9c\u82b1\u7684\u4ef7\u683c", "type": "integer"}, "description": {"title": "Description", "description": "\u9c9c\u82b1\u7684\u63cf\u8ff0\u6587\u6848", "type": "string"}, "reason": {"title": "Reason", "description": "\u4e3a\u4ec0\u4e48\u8981\u8fd9\u6837\u5199\u8fd9\u4e2a\u6587\u6848", "type": "string"}}, "required": ["flower_type", "price", "description", "reason"]}

Next, the program parses the model's output. In this step, the output parser defined earlier (output_parser) turns the model's output into a FlowerDescription instance. FlowerDescription is the Pydantic class defined earlier; it holds the flower's type, price, description, and the reason for the description.

Then the parsed output is added to the DataFrame: the FlowerDescription instance is converted to a dict, and that dict is appended to the DataFrame that stores all the flower descriptions.

Finally, all the results are printed, and you can optionally save them to a CSV file.

输出的数据： [{'flower_type': 'Rose', 'price': 50, 'description': '玫瑰是最浪漫的花，它具有柔和的粉红色，有着浓浓的爱意，价格实惠，50元就可以拥有一束玫瑰。', 'reason': '玫瑰代表着爱情，是最浪漫的礼物，以实惠的价格，可以让您尽情体验爱的浪漫。'}, {'flower_type': '百合', 'price': 30, 'description': '这支百合，柔美的花蕾，在你的手中摇曳，仿佛在与你深情的交谈', 'reason': '营造浪漫氛围'}, {'flower_type': 'Carnation', 'price': 20, 'description': '艳丽缤纷的康乃馨，带给你温馨、浪漫的气氛，是最佳的礼物选择！', 'reason': '康乃馨是一种颜色鲜艳、芬芳淡雅、具有浪漫寓意的鲜花，非常适合作为礼物，而且20元的价格比较实惠。'}]
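For the optional CSV step, here is a minimal sketch using only the standard library. The records are stand-ins in the same shape as the parsed output above, and the data is written to an in-memory buffer so nothing touches the disk. With the DataFrame from this section, pandas' df.to_csv would do the same job; the file name is up to you.

```python
import csv
import io

# Stand-in records in the same shape as the parsed output above.
records = [
    {"flower_type": "玫瑰", "price": 50, "description": "sample copy 1", "reason": "sample reason 1"},
    {"flower_type": "百合", "price": 30, "description": "sample copy 2", "reason": "sample reason 2"},
]

# To write a real file, replace the buffer with
# open("flowers.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["flower_type", "price", "description", "reason"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```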

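The Auto-Fixing Parser (OutputFixingParser) in Practice

The auto-fixing parser is mainly used to correct small formatting errors. When the output is badly formatted, it tries to repair the formatting rather than regenerate the output.

First, let's construct an error that occurs during parsing.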

# Import the required libraries and modules
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

# Use Pydantic to define a data format that represents a flower
class Flower(BaseModel):
    name: str = Field(description="name of a flower")
    colors: List[str] = Field(description="the colors of this flower")

# Define a query for getting the list of colors of a flower
flower_query = "Generate the characters for a random flower."

# Define a misformatted output (single quotes instead of double quotes)
misformatted = "{'name': '康乃馨', 'colors': ['粉红色','白色','红色','紫色','黄色']}"

# Create a Pydantic parser for the output; we want it parsed into the Flower format
parser = PydanticOutputParser(pydantic_object=Flower)

# Parse the misformatted output with the Pydantic parser
parser.parse(misformatted)

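Running this code raises an error.

langchain.schema.output_parser.OutputParserException: Failed to parse Flower from completion {'name': '康乃馨', 'colors': ['粉红色','白色']}. Got: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

As the message says, JSON requires property names to be enclosed in double quotes, whereas the misformatted string uses single quotes. The obvious fix would be to correct the quotes by hand, but that is not how I want to solve the problem here. Instead, let's try OutputFixingParser and let it resolve this kind of formatting error for us automatically.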
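The underlying failure can be reproduced with the standard library alone, which shows that the problem lies in the quoting and not in the Flower model. This sketch assumes nothing beyond the json module.

```python
import json

misformatted = "{'name': '康乃馨', 'colors': ['粉红色','白色','红色','紫色','黄色']}"

# Single-quoted keys and strings are not valid JSON, so decoding fails.
try:
    json.loads(misformatted)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)
print(error_message)

# The same data with double quotes is valid JSON.
well_formatted = '{"name": "康乃馨", "colors": ["粉红色", "白色", "红色", "紫色", "黄色"]}'
flower = json.loads(well_formatted)
print(flower["name"], len(flower["colors"]))
```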

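# Import the required modules from langchain
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import OutputFixingParser

# Set the OpenAI API key (replace the placeholder with your own key)
import os
os.environ["OPENAI_API_KEY"] = '你的OpenAI API Key'

# Use OutputFixingParser to create a new parser that can correct misformatted output
new_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())

# Parse the misformatted output with the new parser
result = new_parser.parse(misformatted)  # the error is fixed automatically
print(result)  # print the parsed result

Using new_parser in place of parser, you will find that the JSON formatting problem is resolved and the program no longer fails.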

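The output is:

name='Rose' colors=['red', 'pink', 'white']

The secret is that OutputFixingParser first calls the original PydanticOutputParser internally. If that succeeds, it returns the result; if it fails, it passes the misformatted output together with the format instructions to the LLM and asks it to make the repair.

Neat, isn't it? The LLM not only supplies knowledge but can also help analyze and resolve errors in our programs on the fly.

To recap the example: a misformatted JSON string makes PydanticOutputParser raise an error, and OutputFixingParser is then used to try to repair the formatting automatically.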
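The try-then-repair flow can be sketched in plain Python. This illustrates the control flow only and is not LangChain's implementation: fake_llm_fix is a hypothetical stand-in for the LLM call (it simply swaps single quotes for double quotes), and parse_flower stands in for the Pydantic parser.

```python
import json
from typing import Callable

def parse_flower(text: str) -> dict:
    """Strict parser: valid JSON object with a 'name' string and a 'colors' list."""
    data = json.loads(text)
    if (not isinstance(data, dict)
            or not isinstance(data.get("name"), str)
            or not isinstance(data.get("colors"), list)):
        raise ValueError("missing or invalid 'name' / 'colors'")
    return data

def fake_llm_fix(bad_output: str, instructions: str, error: str) -> str:
    """Hypothetical stand-in for an LLM asked to repair the output."""
    return bad_output.replace("'", '"')

def parse_with_fix(text: str, parser: Callable[[str], dict],
                   fix: Callable[[str, str, str], str], instructions: str) -> dict:
    try:
        return parser(text)            # first attempt: the original parser
    except ValueError as exc:          # json.JSONDecodeError is a ValueError
        repaired = fix(text, instructions, str(exc))
        return parser(repaired)        # second attempt on the repaired text

misformatted = "{'name': '康乃馨', 'colors': ['粉红色','白色','红色','紫色','黄色']}"
result = parse_with_fix(misformatted, parse_flower, fake_llm_fix,
                        instructions="Return JSON with 'name' and 'colors'.")
print(result)
```

In this sketch the repair step receives only the bad output, the format instructions, and the error text. It never sees the question that produced the output.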

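The Retry Parser (RetryWithErrorOutputParser) in Practice

The retry parser tries to generate a new output when the model's first output does not meet expectations. It interacts with the model again and relies on the model's reasoning ability to recover the relevant information, making the output more complete and closer to what is expected.

As before, we start by constructing an error in the parsing process.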

# Define a template string that will be used to generate the question
template = """Based on the user question, provide an Action and Action Input for what step should be taken.
{format_instructions}
Question: {query}
Response:"""

# Define a Pydantic data format describing an "Action" class and its attributes
from pydantic import BaseModel, Field
class Action(BaseModel):
    action: str = Field(description="action to take")
    action_input: str = Field(description="input to the action")

# Initialize an output parser with the Action format
from langchain.output_parsers import PydanticOutputParser
parser = PydanticOutputParser(pydantic_object=Action)

# Define a prompt template that will be used to question the model
from langchain.prompts import PromptTemplate
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
prompt_value = prompt.format_prompt(query="What are the colors of Orchid?")

# Define a badly formed response
bad_response = '{"action": "search"}'
parser.parse(bad_response)  # parsing it directly raises an error

Because bad_response provides only the action field and not the action_input field, it does not match what the Action data format expects, so parsing fails.
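Note that this failure is about a missing required field, not about JSON syntax. A standard-library analogue using a dataclass makes the distinction visible. The dataclass is an assumption made for illustration; Pydantic reports the same situation as a validation error instead of a TypeError.

```python
import json
from dataclasses import dataclass

@dataclass
class Action:
    action: str
    action_input: str

bad_response = '{"action": "search"}'
data = json.loads(bad_response)       # valid JSON, so this step succeeds
print(data)

try:
    Action(**data)                    # fails: action_input was never supplied
    construction_error = None
except TypeError as exc:
    construction_error = str(exc)
print(construction_error)
```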

We first try to resolve this error with OutputFixingParser.

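from langchain.output_parsers import OutputFixingParser
from langchain.chat_models import ChatOpenAI
fix_parser = OutputFixingParser.from_llm(parser=parser, llm=ChatOpenAI())
parse_result = fix_parser.parse(bad_response)
print('OutputFixingParser的parse结果:', parse_result)

Let's look at what this attempt solved and what it did not.

Problems solved:

Incomplete data: the original bad_response provided only the action field and no action_input field. In this run, OutputFixingParser filled the gap by giving action_input the value 'query'.

Problems not solved:

Specificity: although OutputFixingParser supplied the value 'query' for action_input, that value is not descriptive. The real query is "What are the colors of Orchid?", so the fix only supplies a generic value and does not actually address the user's question.
Potentially misleading: 'query' could be read as an instruction to go and query something further, rather than as the actual query input.

Of course, there is a more robust option. Finally, let's try the RetryWithErrorOutputParser.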

# Initialize RetryWithErrorOutputParser, which asks the model again to get a correct output
from langchain.output_parsers import RetryWithErrorOutputParser
from langchain.llms import OpenAI
retry_parser = RetryWithErrorOutputParser.from_llm(
    parser=parser, llm=OpenAI(temperature=0)
)
parse_result = retry_parser.parse_with_prompt(bad_response, prompt_value)
print('RetryWithErrorOutputParser的parse结果:', parse_result)

This example shows the retry parser at work: the response is incomplete and OutputFixingParser cannot fully repair it, so RetryWithErrorOutputParser is used to regenerate a complete output. It does not disappoint. It restores the format and, drawing on the original prompt that was passed in, also recovers the content of the action_input field.

RetryWithErrorOutputParser的parse结果： action='search' action_input='colors of Orchid'
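What separates the two parsers is the information available during the repair. The sketch below is a plain-Python illustration, not LangChain's implementation: fake_llm_retry is a hypothetical stand-in for the LLM, and it receives the original prompt together with the bad output and the error, so it can fill action_input from the question itself.

```python
import json

REQUIRED = ("action", "action_input")

def parse_action(text: str) -> dict:
    """Strict parser: valid JSON containing every required field."""
    data = json.loads(text)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data

def fake_llm_retry(prompt: str, bad_output: str, error: str) -> str:
    """Hypothetical LLM stand-in: keeps the bad output and reads the query from the prompt."""
    data = json.loads(bad_output)
    query = prompt.strip().splitlines()[-1]   # the last prompt line holds the question
    data["action_input"] = query
    return json.dumps(data)

def parse_with_prompt(bad_output: str, prompt: str) -> dict:
    try:
        return parse_action(bad_output)
    except ValueError as exc:
        # Retry with the original prompt, the bad output, and the error message.
        return parse_action(fake_llm_retry(prompt, bad_output, str(exc)))

prompt = "Answer the user query.\nWhat are the colors of Orchid?\n"
result = parse_with_prompt('{"action": "search"}', prompt)
print(result)
```

Without the prompt, a repair step has nothing to derive action_input from, which is consistent with the generic 'query' value seen in the OutputFixingParser attempt.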

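Summary

Both the structured parser and the Pydantic parser aim to get formatted output from a large language model. The structured parser is better suited to simple text responses, while the Pydantic parser supports complex data structures and types. Which one to choose depends on the specific needs of the application and the complexity of the output.

The auto-fixing parser is mainly suited to correcting small formatting errors, while the retry parser can handle more complex problems, including both formatting errors and missing content.

When choosing a parser, consider the specific application scenario. If formatting is the only problem, the auto-fixing parser may be enough; if the completeness and accuracy of the output are critical, the retry parser is likely the better choice.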