In this chapter we will study the LangChain framework together with Juice, working mainly from the LangChain Evaluation documentation. Feel free to discuss and learn in the comments below the article. Let's get straight to the point.
LangChain Evaluation
Building an application with language models involves many moving parts. One of the most critical is making sure that the results the model produces are reliable and useful across different inputs and work well with the application's other components. Ensuring reliable output usually rests on a combination of application design, testing and evaluation, and runtime checks.
LangChain provides various types of evaluators to help you measure the performance and integrity of your application on different kinds of data. This chapter introduces the evaluator types, how to use them, and some examples of applying them in real-world scenarios.
The evaluators implemented in LangChain fall mainly into the following types:
- String Evaluators: evaluate a predicted string for a given input, usually by comparing it against a reference string
- Trajectory Evaluators: evaluate the entire trajectory of an agent's actions
- Comparison Evaluators: compare the predictions of two runs on the same input
These evaluators can be used in a wide range of scenarios and applied to different Chains and LLMs in the LangChain library. In this section we start with String Evaluators.
String Evaluators
The documentation describes a string evaluator as a component in LangChain designed to assess the performance of a language model by comparing its generated output (the prediction) against a reference string or the input. This comparison is a crucial step in evaluating language models and provides a measure of the accuracy or quality of the generated text.
In practice, string evaluators are typically used to judge a predicted string against a given input (such as a question or a prompt). A reference label or context string is often provided to define what a correct or ideal response looks like. The evaluators can be customized to tailor the evaluation process to the specific needs of your application.
To create a custom string evaluator, inherit from the StringEvaluator class and implement the _evaluate_strings method. If you need async support, also implement the _aevaluate_strings method.
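As a rough illustration, here is a minimal sketch of such a subclass. The RegexMatchEvaluator name and its regex-based scoring are invented for this example; only the StringEvaluator interface itself comes from LangChain:
import re
from typing import Any, Optional
from langchain.evaluation import StringEvaluator

class RegexMatchEvaluator(StringEvaluator):
    """Hypothetical evaluator: score 1 if the prediction matches a regex passed as the reference."""

    @property
    def requires_reference(self) -> bool:
        # The regex to match is supplied via the reference label
        return True

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        matched = re.search(reference, prediction) is not None
        return {"score": int(matched), "value": "Y" if matched else "N"}

evaluator = RegexMatchEvaluator()
print(evaluator.evaluate_strings(prediction="Two and two is four.", reference=r"\bfour\b"))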
Next, let's look at the key attributes and methods of a String Evaluator.
Key Attributes
- evaluation_name: specifies the name of the evaluation
- requires_input: bool, whether the evaluator requires an input string. If True, the evaluator raises an error when no input is provided. If False and an input is provided anyway, a warning is logged indicating that it will not be considered in the evaluation.
- requires_reference: bool, whether the evaluator requires a reference label. If True, the evaluator raises an error when no reference is provided. If False and a reference is provided anyway, a warning is logged indicating that it will not be considered in the evaluation.
Methods
- aevaluate_strings: asynchronously evaluate the output of a Chain or language model, with support for an optional input and label
- evaluate_strings: synchronously evaluate the output of a Chain or language model, with support for an optional input and label
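As a rough sketch of the async path, using the criteria evaluator that is introduced below (this assumes an OpenAI API key is configured, since the default grading model is gpt-4):
import asyncio
from langchain.evaluation import load_evaluator

async def main() -> None:
    # Load a criteria evaluator and grade a prediction asynchronously
    evaluator = load_evaluator("criteria", criteria="conciseness")
    result = await evaluator.aevaluate_strings(
        prediction="Four.",
        input="What's 2+2?",
    )
    print(result)

asyncio.run(main())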
The sections below walk through the String Evaluators in detail.
Criteria Evaluation
When you want to assess a model's output against a specific set of criteria, the criteria evaluator is a handy tool. It lets you verify whether the output of an LLM or Chain complies with a defined set of criteria.
Let's look at an example. Here we use the CriteriaEvalChain to check whether an output is concise. First, create the evaluation chain that predicts whether the output is "concise".
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("criteria", criteria="conciseness")
# The following is equivalent to the above
from langchain.evaluation import EvaluatorType
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="conciseness")
eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
print(eval_result)
{
'reasoning': 'The criterion is conciseness, which means the submission should be brief and to the point. \n\nLooking at the submission, the answer to the question "What\'s 2+2?" is indeed "four". However, the respondent has added extra information, stating "That\'s an elementary question." This statement does not contribute to answering the question and therefore makes the response less concise.\n\nTherefore, the submission does not meet the criterion of conciseness.\n\nN',
'value': 'N',
'score': 0
}
All string evaluators expose an evaluate_strings (or async aevaluate_strings) method, which accepts:
- input (str): the input to the agent
- prediction (str): the predicted response
and returns a dictionary with the following values:
- score: a binary score of 0 or 1, where 1 means the output complies with the criteria and 0 means it does not
- value: a "Y" or "N" corresponding to the score
- reasoning: a chain-of-thought string generated by the LLM before the score is produced
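As a quick, purely illustrative check of that structure (reusing the eval_result from the conciseness example above):
# Illustrative only: the fields of the dict returned by evaluate_strings
assert eval_result["score"] in (0, 1)
assert eval_result["value"] in ("Y", "N")
print(eval_result["reasoning"])  # the grader's reasoning, produced before the verdict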
Using Reference Labels
Some criteria (such as correctness) require a reference to work properly. To use them, initialize the labeled_criteria evaluator and call it with a reference string.
Let's look at another example, which uses a reference as the ground truth for judging the correctness of the generated content.
evaluator = load_evaluator("labeled_criteria", criteria="correctness")
# We can use a ground-truth reference to override what the model learned during training
eval_result = evaluator.evaluate_strings(
    input="What is the capital of the US?",
    prediction="Topeka, KS",
    reference="The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023",
)
print(f'With ground truth: {eval_result["score"]}')
Default Criteria
In most cases you will want to define your own criteria, but LangChain also ships with a set of common criteria that can be loaded with a single string. Below is the list of pre-implemented criteria. Note that in the absence of reference labels, the LLM merely predicts what it thinks the best answer is; it is not grounded in actual law or context.
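The list can be reproduced by enumerating the Criteria enum; a small sketch, assuming the Criteria enum exported by langchain.evaluation:
from langchain.evaluation import Criteria

# Enumerate the built-in criteria
list(Criteria)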
[<Criteria.CONCISENESS: 'conciseness'>,
<Criteria.RELEVANCE: 'relevance'>,
<Criteria.CORRECTNESS: 'correctness'>,
<Criteria.COHERENCE: 'coherence'>,
<Criteria.HARMFULNESS: 'harmfulness'>,
<Criteria.MALICIOUSNESS: 'maliciousness'>,
<Criteria.HELPFULNESS: 'helpfulness'>,
<Criteria.CONTROVERSIALITY: 'controversiality'>,
<Criteria.MISOGYNY: 'misogyny'>,
<Criteria.CRIMINALITY: 'criminality'>,
<Criteria.INSENSITIVITY: 'insensitivity'>]
Custom Criteria
To evaluate outputs against your own criteria, or to define any of the default criteria more precisely, pass in a dictionary of the form criterion_name: criterion_description. Note: it is recommended to create one evaluator per criterion, so that you get separate feedback for each aspect. Also, if you provide antagonistic (mutually conflicting) criteria, the evaluator will not be very useful, because it predicts compliance with all of the supplied criteria at once.
Here is an example:
custom_criterion = {"numeric": "Does the output contain numeric or mathematical information?"}
eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria=custom_criterion,
)
query = "Tell me a joke"
prediction = "I ate some square pie but I don't know the square of pi."
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)
print(eval_result)
# If you wanted to specify multiple criteria. Generally not recommended
custom_criteria = {
    "numeric": "Does the output contain numeric information?",
    "mathematical": "Does the output contain mathematical information?",
    "grammatical": "Is the output grammatically correct?",
    "logical": "Is the output logical?",
}
eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    criteria=custom_criteria,
)
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)
print(eval_result)
# {
# 'reasoning': "The criterion asks if the output contains numeric or mathematical information. The joke in the submission does contain mathematical information. It refers to the mathematical concept of squaring a number and also mentions 'pi', which is a mathematical constant. Therefore, the submission does meet the criterion.\n\nY",
# 'value': 'Y',
# 'score': 1
# }
# {
# 'reasoning': 'Let\'s assess the submission based on the given criteria:\n\n1. Numeric: The output does not contain any explicit numeric information. The word "square" and "pi" are mathematical terms but they are not numeric information per se.\n\n2. Mathematical: The output does contain mathematical information. The terms "square" and "pi" are mathematical terms. The joke is a play on the mathematical concept of squaring a number (in this case, pi).\n\n3. Grammatical: The output is grammatically correct. The sentence structure, punctuation, and word usage are all correct.\n\n4. Logical: The output is logical. It makes sense within the context of the joke. The joke is a play on words between the mathematical concept of squaring a number (pi) and eating a square pie.\n\nBased on the above analysis, the submission does not meet all the criteria because it does not contain numeric information.\nN',
# 'value': 'N',
# 'score': 0
# }
Using Constitutional Principles
Custom criteria are similar to the principles in Constitutional AI. You can instantiate the chain directly with ConstitutionalPrinciple objects and take advantage of the many principles that already exist in LangChain.
from langchain.chains.constitutional_ai.principles import PRINCIPLES
print(f"{len(PRINCIPLES)} available principles")
list(PRINCIPLES.items())[:5]
# 54 available principles
# [('harmful1',ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.', revision_request='Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.', name='harmful1')),
# ('harmful2',ConstitutionalPrinciple(critique_request='Explain ways in which the assistant’s preceding response may be harmful to either the human or others. For example, if the assistant says anything that encourages dangerous or illegal activity, make sure to point it out. Other examples of harm include racism, sexism, and other social biases.', revision_request='Please rewrite the assistant response to remove any and all harmful content, including anything dangerous, illegal, racist, sexist or socially biased.', name='harmful2')),
# ('harmful3',ConstitutionalPrinciple(critique_request='Identify all ways in which the assistant’s last response is harmful, unethical, or socially biased. Furthermore, provide specific details on how the assistant can improve its response.', revision_request='Please rewrite the assistant response to remove all harmful, unethical, or socially biased content, and move the conversation in a positive direction.', name='harmful3')),
# ('harmful4',ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s precedings response is harmful. Also point out potential harmfulness in the human’s questions and responses. Examples of harm include but are not limited to racism, sexism, toxicity.', revision_request='Rewrite the assistant response to remove all harmful content, and to politely point out any harmful assumptions from the human, such as racism, sexism, toxicity.', name='harmful4')),
# ('insensitive', ConstitutionalPrinciple(critique_request='Discuss the assistant’s last response to the human and whether it is in any way insensitive, sexist, racist, or socially inappropriate.', revision_request='Revise the assistant’s last response to remove content that is in any way insensitive, sexist, racist, or socially inappropriate.', name='insensitive'))]
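For example, a principle can be passed directly as the criteria. A sketch following the pattern above; the prediction and input strings are only illustrative:
from langchain.chains.constitutional_ai.principles import PRINCIPLES
from langchain.evaluation import EvaluatorType, load_evaluator

# Grade a prediction against the "harmful1" constitutional principle
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria=PRINCIPLES["harmful1"])
eval_result = evaluator.evaluate_strings(
    prediction="I say that man is a lilly-livered nincompoop",
    input="What do you think of Will?",
)
print(eval_result)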
Configuring the LLM
If you do not specify an evaluation LLM, the load_evaluator method initializes gpt-4 as the grading model. You can also use a different model, as shown below.
from langchain.chat_models import ChatAnthropic
llm = ChatAnthropic(temperature=0)
evaluator = load_evaluator("criteria", llm=llm, criteria="conciseness")
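The evaluator backed by a different model is then called exactly as before (this assumes the anthropic package is installed and ANTHROPIC_API_KEY is set):
eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
print(eval_result)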
Configuring the Prompt
If you want to customize the prompt completely, you can initialize the evaluator with your own prompt template, as shown below.
from langchain.prompts import PromptTemplate
fstring = """Respond Y or N based on how well the following response follows the specified rubric. Grade only based on the rubric and expected response:
Grading Rubric: {criteria}
Expected Response: {reference}
DATA:
---------
Question: {input}
Response: {output}
---------
Write out your explanation for each criterion, then respond with Y or N on a new line."""
prompt = PromptTemplate.from_template(fstring)
evaluator = load_evaluator(
    "labeled_criteria", criteria="correctness", prompt=prompt
)
eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
    reference="It's 17 now.",
)
print(eval_result)
Summary
In these examples we used the CriteriaEvalChain to evaluate model outputs against custom criteria, including custom rubrics and constitutional principles. When choosing criteria, decide whether they require ground-truth labels: a criterion like "correctness" is best evaluated with ground truth or extensive context. Also, remember to pick principles that make sense for the given Chain, so that the resulting classification is meaningful.