[Possibly the Smoothest LangChain Tutorial on the Web] 11. Advanced LangChain: Output Parsers


We fight on not to change the world, but to keep the world from changing us.

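Series articles

[Possibly the Smoothest LangChain Tutorial on the Web] 1. Introduction to LangChain - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 2. Installing LangChain - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 3. Quick Start: LLMChain - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 4. Quick Start: Retrieval Chain - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 5. Quick Start: Conversation Retrieval Chain - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 6. Quick Start: Agent - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 7. LCEL (LangChain Expression Language) - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 8. Advanced LangChain: LLM - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 9. Advanced LangChain: Chat Model - Juejin (juejin.cn)

[Possibly the Smoothest LangChain Tutorial on the Web] 10. Advanced LangChain: Prompts - Juejin (juejin.cn)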

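01 Introduction to Output Parsers

Output parsers are a key LangChain component. Their job is to convert the text a model generates into structured data that is easier to use and analyze.

Put simply: model output ------> structured data.

If you build applications on top of large models, or plan to, you already know how much structured output matters.

02 How It Works

To really learn a technology you have to get to its essence; once you have it, you can cope with whatever changes. Let's take the CSV parser as the example.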

from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain_core.prompts import PromptTemplate

"""
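1. This is the CSV output parser
2. CSV data is comma-separated by default
"""
output_parser = CommaSeparatedListOutputParser()

"""
1. Call this method to generate an instruction string
2. The string is: Your response should be a list of comma separated values, eg: `foo, bar, baz` or `foo,bar,baz`
"""
format_instructions = output_parser.get_format_instructions()

"""
1. Build the input prompt
2. subject is the user's input; format_instructions is described above
3. partial_variables lets you set a value once and reuse it on every call, so you do not have to pass it each time
4. The Chinese template below reads: "List 5 classic quotes related to {subject}."
"""
prompt = PromptTemplate(
    template="列举5条与{subject}有关的经典语录。\n{format_instructions}",
    input_variables=["subject"],
    partial_variables={"format_instructions": format_instructions},
)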

"""
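1. Chained invocation (an LCEL expression)
2. prompt --- model --- output parser
"""
# model is any LangChain LLM or chat model instance created beforehand
chain = prompt | model | output_parser

"""
1. The final input sent to the model is:
    列举5条与江湖有关的经典语录。\nYour response should be a list of comma separated values, eg: `foo, bar, baz` or `foo,bar,baz`
   (the Chinese sentence reads: "List 5 classic quotes related to jianghu", the martial-arts world)
2. The model's output is a single string: 刀光剑影,快意恩仇, 一诺千金, 路见不平,拔刀相助, 人在江湖,身不由己, 血雨腥风,笑傲江湖。
3. The output parser turns it into a list: ['刀光剑影,快意恩仇', '一诺千金', '路见不平,拔刀相助', '人在江湖,身不由己', '血雨腥风,笑傲江湖。']
   (each item keeps its full-width commas; only the ASCII commas act as separators)
"""
chain.invoke({"subject": "江湖"})

"""
Output: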
['刀光剑影,快意恩仇', '一诺千金', '路见不平,拔刀相助', '人在江湖,身不由己', '血雨腥风,笑傲江湖。']
"""

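How does the CSV parser actually parse? The answer is simple: in the source shown here it is a single method with a one-line body (newer releases of the library may implement it differently).

def parse(self, text: str) -> List[str]:
    """Parse the output of an LLM call."""
    return [part.strip() for part in text.split(",")]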

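So, as you read this, can you guess how the other output parsers are implemented? Leave me a comment and let me know~

03 Basic Usage

Here I'll walk through the commonly used output parsers currently available in LangChain.

CSV parser

The CSV output parser splits the model output on commas and parses it into a list. Its usage was covered above, so I won't repeat it here.

Datetime parser

As the name suggests, it parses the date and time in the model output into a Python datetime.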

from langchain.output_parsers import DatetimeOutputParser
from langchain_core.prompts import PromptTemplate

output_parser = DatetimeOutputParser()
template = """Answer the users question:

{question}

{format_instructions}"""
prompt = PromptTemplate.from_template(
    template,
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

chain = prompt | model | output_parser

output = chain.invoke({"question": "In what year was the People's Republic of China founded?"})

print(output)

In essence, the model receives this prompt: Answer the users question:\n\nIn what year was xxxxxx founded?\n\nWrite a datetime string that matches the following pattern: '%Y-%m-%dT%H:%M:%S.%fZ'.\n\nExamples: 0018-02-08T10:24:18.419248Z, 0638-03-23T17:42:46.562176Z, 0093-01-29T12:37:35.194184Z\n\nReturn ONLY this string, no other words!

The parsing method is:

format: str = "%Y-%m-%dT%H:%M:%S.%fZ"
"""The string value that used as the datetime format."""

def parse(self, response: str) -> datetime:
    try:
        return datetime.strptime(response.strip(), self.format)
    except ValueError as e:
        raise OutputParserException(
            f"Could not parse datetime string: {response}"
        ) from e
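
The strictness of `strptime` is easy to check in isolation. The sketch below uses only the standard library. `parse_datetime` is a hypothetical helper that mirrors the method above, and it raises a plain `ValueError` rather than LangChain's `OutputParserException`.

```python
from datetime import datetime

# Same pattern the format instructions ask the model to follow.
FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_datetime(response: str) -> datetime:
    """Strip surrounding whitespace, then parse with the fixed pattern."""
    try:
        return datetime.strptime(response.strip(), FORMAT)
    except ValueError as e:
        raise ValueError(f"Could not parse datetime string: {response}") from e


parsed = parse_datetime(" 1949-10-01T00:00:00.000000Z\n")
print(parsed.year)  # 1949

# Anything that deviates from the pattern is rejected, even a bare year.
try:
    parse_datetime("1949")
except ValueError as err:
    print(err)
```

This is why the prompt ends with "Return ONLY this string, no other words!": a single extra word in the reply makes the parse fail.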

Enum parser

As the name suggests, it parses the model output into an enum member.

from langchain.output_parsers.enum import EnumOutputParser
from enum import Enum


class Colors(Enum):
    RED = "红色"  # red
    GREEN = "绿色"  # green
    BLUE = "蓝色"  # blue


parser = EnumOutputParser(enum=Colors)

from langchain_core.prompts import PromptTemplate

# The Chinese prompt below reads: "Answer which color the people of this country like."
prompt = PromptTemplate.from_template(
    """请回答这个国家的人喜欢什么颜色。

> Country: {country}

Instructions: {instructions}"""
).partial(instructions=parser.get_format_instructions())
chain = prompt | model | parser

chain.invoke({"country": "中国"})  # "中国" means China

"""
Output:
<Colors.RED: '红色'>
"""

In essence it is this code, which converts the string into an enum member:

def parse(self, response: str) -> Any:
    try:
        return self.enum(response.strip())
    except ValueError:
        raise OutputParserException(
            f"Response '{response}' is not one of the "
            f"expected values: {self._valid_values}"
        )
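
The lookup `self.enum(response.strip())` relies on a standard Python feature: calling an `Enum` class with a value returns the matching member. A minimal standalone sketch, reusing the `Colors` enum from the example; `parse_color` is a hypothetical helper, not a LangChain function.

```python
from enum import Enum


class Colors(Enum):
    RED = "红色"  # red
    GREEN = "绿色"  # green
    BLUE = "蓝色"  # blue


def parse_color(response: str) -> Colors:
    """Look up the enum member whose value equals the stripped response."""
    return Colors(response.strip())


print(parse_color(" 红色\n"))  # Colors.RED

# A value that is not defined on the enum raises ValueError,
# which the real parser wraps in OutputParserException.
try:
    parse_color("黄色")  # yellow, not a member
except ValueError as err:
    print(err)
```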

JSON parser

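Parses the model output into a JSON object (a Python dict).

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


# This is the key part: define the shape of your JSON here.
# Give every field a description; it goes into the schema that is sent to the model.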
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


joke_query = "Tell me a joke."

parser = JsonOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": joke_query})

"""
Output:
{'setup': "Why don't scientists trust atoms?",
 'punchline': 'Because they make up everything.'}
"""

In essence, the model receives this prompt: Answer the user query.\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{\"properties\": {\"setup\": {\"title\": \"Setup\", \"description\": \"question to set up a joke\", \"type\": \"string\"}, \"punchline\": {\"title\": \"Punchline\", \"description\": \"answer to resolve the joke\", \"type\": \"string\"}}, \"required\": [\"setup\", \"punchline\"]}\n```\nTell me a joke.

The parsing methods are:

def parse_result(self, result: List[Generation], *, partial: bool = False) -> Any:
    text = result[0].text
    text = text.strip()
    if partial:
        try:
            return parse_json_markdown(text)
        except JSONDecodeError:
            return None
    else:
        try:
            return parse_json_markdown(text)
        except JSONDecodeError as e:
            msg = f"Invalid json output: {text}"
            raise OutputParserException(msg, llm_output=text) from e

def parse_json_markdown(
    json_string: str, *, parser: Callable[[str], Any] = parse_partial_json
) -> dict:
    """
    Parse a JSON string from a Markdown string.

    Args:
        json_string: The Markdown string.

    Returns:
        The parsed JSON object as a Python dictionary.
    """
    try:
        return _parse_json(json_string, parser=parser)
    except json.JSONDecodeError:
        # Try to find JSON string within triple backticks
        match = re.search(r"```(json)?(.*)", json_string, re.DOTALL)

        # If no match found, assume the entire string is a JSON string
        if match is None:
            json_str = json_string
        else:
            # If match found, use the content within the backticks
            json_str = match.group(2)
    return _parse_json(json_str, parser=parser)
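
The two-step idea above (try the raw text first, then look inside a Markdown code fence) can be tried out without LangChain. The following is a simplified sketch, not the library's implementation: it uses the strict `json.loads` instead of LangChain's partial-JSON parser, and the name `parse_json_from_markdown` is made up for illustration.

```python
import json
import re

# Build the fence marker programmatically so this snippet contains no literal fence.
FENCE = "`" * 3


def parse_json_from_markdown(text: str) -> dict:
    """Parse JSON that may be wrapped in a Markdown code fence."""
    text = text.strip()
    try:
        # Step 1: the whole reply may already be valid JSON.
        return json.loads(text)
    except json.JSONDecodeError:
        # Step 2: look for content between a pair of fences.
        pattern = FENCE + r"(?:json)?(.*?)" + FENCE
        match = re.search(pattern, text, re.DOTALL)
        candidate = match.group(1) if match else text
        return json.loads(candidate.strip())


reply = (
    "Sure, here is a joke:\n"
    + FENCE + "json\n"
    + '{"setup": "Why?", "punchline": "Because."}\n'
    + FENCE
)
print(parse_json_from_markdown(reply))
print(parse_json_from_markdown('{"setup": "Why?", "punchline": "Because."}'))
```

Chat models often wrap JSON in a fence or add a sentence before it, which is why the fallback step matters in practice.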

JsonOutputFunctionsParser

Specific to OpenAI: it uses OpenAI function calling to structure the output and returns the result as JSON.

from langchain_community.utils.openai_functions import (
    convert_pydantic_to_openai_function,
)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


openai_functions = [convert_pydantic_to_openai_function(Joke)]

prompt = ChatPromptTemplate.from_messages(
    [("system", "You are helpful assistant"), ("user", "{input}")]
)

from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

parser = JsonOutputFunctionsParser()

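# LCEL expression
chain = prompt | model.bind(functions=openai_functions) | parser

chain.invoke({"input": "tell me a joke"})

"""
Output:
{'setup': "Why don't scientists trust atoms?",
 'punchline': 'Because they make up everything!'}
"""

JsonKeyOutputFunctionsParser

Returns only the value stored under a given key of the function-call JSON. In this example the key is joke, so the result is a JSON array.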
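
Conceptually, this amounts to parsing the function-call arguments and picking one key. Here is a minimal sketch in plain Python, assuming the arguments arrive as a JSON string (as they do in OpenAI function calls). `extract_key` and the sample payload, including the `funniness_level` value, are made up for illustration.

```python
import json
from typing import Any


def extract_key(function_call_arguments: str, key_name: str) -> Any:
    """Parse the arguments JSON and return only the value under key_name."""
    return json.loads(function_call_arguments)[key_name]


# Hypothetical arguments string, shaped like the Jokes model in this section.
arguments = json.dumps(
    {
        "joke": [
            {"setup": "Why don't scientists trust atoms?",
             "punchline": "Because they make up everything!"},
        ],
        "funniness_level": 7,
    }
)

jokes = extract_key(arguments, "joke")
print(jokes)
```

The other fields, such as `funniness_level`, are simply dropped from the result.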

from typing import List
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser


class Jokes(BaseModel):
    """Jokes to tell user."""

    joke: List[Joke]
    funniness_level: int


parser = JsonKeyOutputFunctionsParser(key_name="joke")

openai_functions = [convert_pydantic_to_openai_function(Jokes)]
chain = prompt | model.bind(functions=openai_functions) | parser

chain.invoke({"input": "tell me two jokes"})

"""
Output:
[{'setup': "Why don't scientists trust atoms?",
  'punchline': 'Because they make up everything!'},
 {'setup': 'Why did the scarecrow win an award?',
  'punchline': 'Because he was outstanding in his field!'}]
"""

PydanticOutputFunctionsParser

Returns the data directly as a Pydantic object.

from pydantic.v1 import validator
from langchain.output_parsers.openai_functions import PydanticOutputFunctionsParser


class Joke(BaseModel):
    """Joke to tell user."""

    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


parser = PydanticOutputFunctionsParser(pydantic_schema=Joke)

openai_functions = [convert_pydantic_to_openai_function(Joke)]
chain = prompt | model.bind(functions=openai_functions) | parser

chain.invoke({"input": "tell me a joke"})

"""
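Output:
Joke(setup="Why don't scientists trust atoms?", punchline='Because they make up everything!')
"""

Output-fixing parser

What is it in essence? When parsing the model output fails, it passes the badly formatted output together with the format instructions to the LLM, lets the LLM repair it, and then parses again.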

from typing import List
from langchain.output_parsers import OutputFixingParser
from langchain.output_parsers import PydanticOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field


class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")


actor_query = "Generate the filmography for a random actor."

parser = PydanticOutputParser(pydantic_object=Actor)

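from langchain_core.exceptions import OutputParserException

# Single quotes make this invalid JSON.
misformatted = "{'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}"

# The plain parser fails on this input and raises OutputParserException.
try:
    parser.parse(misformatted)
except OutputParserException as e:
    print(e)

# llm is a model instance created beforehand; it is asked to repair the output.
new_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)

new_parser.parse(misformatted)

The final output of the code above is: Actor(name='Tom Hanks', film_names=['Forrest Gump']).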
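
The parse, repair, parse-again loop can be sketched without any model at all. In the example below, `fake_llm_fix` is a hypothetical stand-in for the LLM call (it just swaps the quote characters), and `parse_with_fix` is an illustrative helper rather than LangChain's implementation.

```python
import json
from typing import Callable


def parse_with_fix(
    text: str, fix_fn: Callable[[str, str], str], max_retries: int = 1
) -> dict:
    """Try to parse; on failure, ask fix_fn to repair the text and try again."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if attempt == max_retries:
                raise
            # A real fixer would send the bad text, the error, and the
            # format instructions to a model.
            text = fix_fn(text, str(err))
    raise RuntimeError("unreachable")


def fake_llm_fix(bad_text: str, error: str) -> str:
    """Hypothetical stand-in for an LLM: turn single quotes into double quotes."""
    return bad_text.replace("'", '"')


misformatted = "{'name': 'Tom Hanks', 'film_names': ['Forrest Gump']}"
fixed = parse_with_fix(misformatted, fake_llm_fix)
print(fixed)
```

Each repair costs an extra model call, so a small retry limit keeps latency and spend bounded.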

Pydantic parser

This output parser lets you specify any Pydantic model and converts the LLM output into an instance of that model.

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


joke_query = "Tell me a joke."

parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": joke_query})

"""
Output:
Joke(setup="Why don't scientists trust atoms?", punchline='Because they make up everything.')
"""

Retry parser

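Similar to OutputFixingParser, but RetryOutputParser passes the original prompt (along with the original output) back to the model and retries to get a better response.

from langchain.output_parsers import RetryOutputParser

# parser is an output parser such as the PydanticOutputParser above; llm is a model instance.
# bad_response is the malformed model output; prompt_value is the formatted prompt
# that produced it (for example, the result of prompt.format_prompt(...)).
retry_parser = RetryOutputParser.from_llm(parser=parser, llm=llm)
retry_parser.parse_with_prompt(bad_response, prompt_value)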
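
Why pass the prompt as well? When the output is incomplete rather than merely misformatted, the bad output alone does not say what the missing part should be. The sketch below illustrates that idea in plain Python. The required keys, `parse_with_prompt`, and `fake_llm_retry` are all hypothetical and only mimic the shape of the real API.

```python
import json
from typing import Callable

REQUIRED_KEYS = ("action", "action_input")


def parse_action(text: str) -> dict:
    """Parse JSON and insist that every required key is present."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


def parse_with_prompt(
    completion: str,
    prompt: str,
    ask_fn: Callable[[str, str], str],
    max_retries: int = 1,
) -> dict:
    """On failure, re-ask with the original prompt plus the bad completion."""
    for attempt in range(max_retries + 1):
        try:
            return parse_action(completion)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if attempt == max_retries:
                raise
            completion = ask_fn(prompt, completion)
    raise RuntimeError("unreachable")


def fake_llm_retry(prompt: str, bad_completion: str) -> str:
    """Hypothetical stand-in for an LLM that answers the prompt again."""
    return '{"action": "search", "action_input": "capital of France"}'


result = parse_with_prompt(
    '{"action": "search"}', "What is the capital of France?", fake_llm_retry
)
print(result)
```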

Structured output parser

Similar to the Pydantic/JSON parsers, but this one is more useful for less capable models.

from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain_core.prompts import PromptTemplate

response_schemas = [
    ResponseSchema(name="answer", description="answer to the user's question"),
    ResponseSchema(
        name="source",
        description="source used to answer the user's question, should be a website.",
    ),
]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="answer the users question as best as possible.\n{format_instructions}\n{question}",
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions},
)

chain = prompt | model | output_parser

chain.invoke({"question": "what's the capital of france?"})

In essence, the model receives this prompt: answer the users question as best as possible.\nThe output should be a markdown code snippet formatted in the following schema, including the leading and trailing \"```json\" and \"```\":\n\n```json\n{\n\t\"answer\": string  // answer to the user's question\n\t\"source\": string  // source used to answer the user's question, should be a website.\n}\n```\nwhat's the capital of france?

XML parser

As the name suggests, it parses data written as XML tags and returns it as structured data.

from langchain.output_parsers import XMLOutputParser
from langchain_core.prompts import PromptTemplate

actor_query = "Generate the shortened filmography for Tom Hanks."
output = model.invoke(
    f"""{actor_query}
Please enclose the movies in <movie></movie> tags"""
)

parser = XMLOutputParser()

# Parse XML output without specifying any tags
# prompt = PromptTemplate(
#     template="""{query}\n{format_instructions}""",
#     input_variables=["query"],
#     partial_variables={"format_instructions": parser.get_format_instructions()},
# )
# 
# chain = prompt | model | parser
# output = chain.invoke({"query": actor_query})

"""
Parse XML output and specify the tags to use.
Specifying tags is optional.
"""
parser = XMLOutputParser(tags=["movies", "actor", "film", "name", "genre"])
prompt = PromptTemplate(
    template="""{query}\n{format_instructions}""",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

output = chain.invoke({"query": actor_query})

"""
Output:
{'movies': [{'actor': [{'film': [{'name': 'The Philadelphia Story (1980)'}]}, {'film': [{'name': 'Sleepless in Seattle (1993)'}]}, {'film': [{'name': 'Forrest Gump (1994)'}]}, {'film': [{'name': 'Cast Away (2000)'}]}, {'film': [{'name': 'Saving Private Ryan (1998)'}]}, {'film': [{'name': 'Toy Story (1995)'}]}, {'film': [{'name': 'Toy Story 2 (1999)'}]}, {'film': [{'name': 'The Da Vinci Code (2006)'}]}, {'film': [{'name': 'Apollo 13 (1995)'}]}, {'film': [{'name': 'Captain Phillips (2013)'}]}]}]}
"""

In essence, the model receives this prompt: Generate the shortened filmography for Tom Hanks.\nThe output should be formatted as a XML file.\n1. Output should conform to the tags below. \n2. If tags are not given, make them on your own.\n3. Remember to always open and close all the tags.\n\nAs an example, for the tags [\"foo\", \"bar\", \"baz\"]:\n1. String \"<foo>\n   <bar>\n      <baz></baz>\n   </bar>\n</foo>\" is a well-formatted instance of the schema. \n2. String \"<foo>\n   <bar>\n   </foo>\" is a badly-formatted instance.\n3. String \"<foo>\n   <tag>\n   </tag>\n</foo>\" is a badly-formatted instance.\n\nHere are the output tags:\n```\n['movies', 'actor', 'film', 'name', 'genre']\n```
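
On the parsing side, the nested result shown in the output has a simple shape: an element with children becomes `{tag: [children...]}` and a leaf becomes `{tag: text}`. The sketch below reproduces that shape with the standard library's `xml.etree.ElementTree`. `element_to_dict` is a hypothetical helper written for this illustration, not LangChain's implementation, and the sample XML is hand-written.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def element_to_dict(elem: ET.Element) -> Dict[str, Any]:
    """Convert an element into {tag: text} or {tag: [child dicts]}."""
    children = list(elem)
    if not children:
        return {elem.tag: elem.text}
    return {elem.tag: [element_to_dict(child) for child in children]}


xml_text = (
    "<movies><actor>"
    "<film><name>Forrest Gump (1994)</name></film>"
    "<film><name>Cast Away (2000)</name></film>"
    "</actor></movies>"
)

result = element_to_dict(ET.fromstring(xml_text))
print(result)
```

Because every level is wrapped in a list, repeated tags such as `film` are preserved in order instead of overwriting one another.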

YAML parser

Parses YAML-formatted output and returns it as an instance of the given Pydantic model.

from langchain.output_parsers import YamlOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")


joke_query = "Tell me a joke."

parser = YamlOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": joke_query})

"""
Output:
Joke(setup='Why did the tomato turn red?', punchline='Because it saw the salad dressing!')
"""

In essence, the model receives this prompt: Answer the user query.\nThe output should be formatted as a YAML instance that conforms to the given JSON schema below.\n\n# Examples\n## Schema\n```\n{\"title\": \"Players\", \"description\": \"A list of players\", \"type\": \"array\", \"items\": {\"$ref\": \"#/definitions/Player\"}, \"definitions\": {\"Player\": {\"title\": \"Player\", \"type\": \"object\", \"properties\": {\"name\": {\"title\": \"Name\", \"description\": \"Player name\", \"type\": \"string\"}, \"avg\": {\"title\": \"Avg\", \"description\": \"Batting average\", \"type\": \"number\"}}, \"required\": [\"name\", \"avg\"]}}}\n```\n## Well formatted instance\n```\n- name: John Doe\n  avg: 0.3\n- name: Jane Maxfield\n  avg: 1.4\n```\n\n## Schema\n```\n{\"properties\": {\"habit\": { \"description\": \"A common daily habit\", \"type\": \"string\" }, \"sustainable_alternative\": { \"description\": \"An environmentally friendly alternative to the habit\", \"type\": \"string\"}}, \"required\": [\"habit\", \"sustainable_alternative\"]}\n```\n## Well formatted instance\n```\nhabit: Using disposable water bottles for daily hydration.\nsustainable_alternative: Switch to a reusable water bottle to reduce plastic waste and decrease your environmental footprint.\n``` \n\nPlease follow the standard YAML formatting conventions with an indent of 2 spaces and make sure that the data types adhere strictly to the following JSON schema: \n```\n{\"properties\": {\"setup\": {\"title\": \"Setup\", \"description\": \"question to set up a joke\", \"type\": \"string\"}, \"punchline\": {\"title\": \"Punchline\", \"description\": \"answer to resolve the joke\", \"type\": \"string\"}}, \"required\": [\"setup\", \"punchline\"]}\n```\n\nMake sure to always enclose the YAML output in triple backticks (```). Please do not add anything other than valid YAML output!\nTell me a joke.

Parsing is done with a regex match plus a third-party library (PyYAML):

pattern: re.Pattern = re.compile(
    r"^```(?:ya?ml)?(?P<yaml>[^`]*)", re.MULTILINE | re.DOTALL
)

def parse(self, text: str) -> T:
    try:
        # Greedy search for 1st yaml candidate.
        match = re.search(self.pattern, text.strip())
        yaml_str = ""
        if match:
            yaml_str = match.group("yaml")
        else:
            # If no backticks were present, try to parse the entire output as yaml.
            yaml_str = text

        json_object = yaml.safe_load(yaml_str)
        return self.pydantic_object.parse_obj(json_object)

    except (yaml.YAMLError, ValidationError) as e:
        name = self.pydantic_object.__name__
        msg = f"Failed to parse {name} from completion {text}. Got: {e}"
        raise OutputParserException(msg, llm_output=text) from e
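
The regex step can be exercised on its own. The sketch below covers only the extraction half, using the same pattern and the standard library. Loading the extracted string with PyYAML and validating it with Pydantic would follow, and are left out to keep the example dependency-free. `extract_yaml` is a hypothetical helper name.

```python
import re

# Build the fence marker programmatically so this snippet contains no literal fence.
FENCE = "`" * 3

# Same pattern as in the parser: an opening fence at the start of a line,
# an optional yaml/yml tag, then everything up to the next backtick.
PATTERN = re.compile(
    r"^" + FENCE + r"(?:ya?ml)?(?P<yaml>[^`]*)", re.MULTILINE | re.DOTALL
)


def extract_yaml(text: str) -> str:
    """Return the fenced YAML if a fence is present, else the whole text."""
    match = PATTERN.search(text.strip())
    return match.group("yaml") if match else text


reply = (
    "Here you go:\n"
    + FENCE + "yaml\n"
    + "setup: Why did the tomato turn red?\n"
    + "punchline: Because it saw the salad dressing!\n"
    + FENCE
)

print(extract_yaml(reply))
print(extract_yaml("setup: a\npunchline: b"))
```

Since the capture group stops at the first backtick, YAML values that themselves contain a backtick would be cut short; that is a limitation of this pattern worth keeping in mind.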

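04 Summary

My one-line summary of output parsers: prompt instructions + regex matching + third-party libraries.

That one line captures the essence of output parsers. Is it complicated? Not really. Is it simple? Those prompts are actually quite hard to write; most people could not come up with them on the spot.

If you could give me a free follow, that would be the greatest encouragement for me.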


The content above is based on the official documentation: python.langchain.com/docs/module…

Peace Guys~