Evaluating Large Language Models with C-Eval
Conversational LLMs are practically everywhere now, and every vendor claims its model is closing in on ChatGPT, so a reasonably objective benchmark for comparison is badly needed.
This post introduces C-Eval, an evaluation suite built jointly by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh, and walks through how to use it.
C-Eval contains 13,948 multiple-choice questions spanning 52 subjects and four difficulty levels. To a certain extent it helps developers track progress during model development and analyze the strengths and weaknesses of the model being built.
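The exam is also published on the Hugging Face Hub as ceval/ceval-exam (the same data the download step below fetches), so you can peek at its structure before running the full pipeline. Below is a minimal sketch, assuming the datasets library is installed; computer_network is just one of the 52 subjects.

```python
# Minimal sketch: browse one C-Eval subject from the Hugging Face Hub.
# Assumes `pip install datasets`; "computer_network" is one of the 52 subjects.
from datasets import load_dataset

dataset = load_dataset("ceval/ceval-exam", name="computer_network")
print(dataset)            # three splits: dev (few-shot examples), val (with answers), test (answers withheld)
print(dataset["val"][0])  # fields: id, question, A, B, C, D, answer, explanation
```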
Zero-shot model accuracy
Model | STEM | Social Science | Humanities | Other | Average |
---|---|---|---|---|---|
ChatGLM-6B | 36.3 | 46.2 | 40.5 | 38.8 | 39.8 |
ChatGLM2-6B | 24.2 | 21.5 | 18.7 | 26.0 | 23.1 |
Chinese-Alpaca-Plus-7B | 33.3 | 40.4 | 38.9 | 36.7 | 36.8 |
Chinese-Alpaca-Plus-13B | 33.3 | 41.4 | 45.9 | 39.3 | 43.5 |
Five-shot model accuracy
Model | STEM | Social Science | Humanities | Other | Average | scst |
---|---|---|---|---|---|---|
ChatGLM-6B | 32.1 | 41.1 | 40.9 | 38.5 | 37.4 | - |
ChatGLM-6B (test set) | 29.9 | 39.0 | 37.8 | 34.1 | 34.2 | - |
ChatGLM2-6B | 47.2 | 63.3 | 57.6 | 50.0 | 53.3 | - |
ChatGLM2-6B (test set) | 47.0 | 63.3 | 51.1 | 47.6 | 51.1 | - |
Chinese-Alpaca-Plus-7B | 31.9 | 36.7 | 31.5 | 32.8 | 33.1 | - |
Chinese-Alpaca-Plus-13B | 39.1 | 47.3 | 43.2 | 41.7 | 42.3 | - |
Baichuan-7B | 25.1 | 25.8 | 26.8 | 25.0 | 25.6 | - |
Baichuan-7B (test set) | 38.2 | 52.0 | 46.2 | 39.3 | 42.8 | - |
Baichuan-7B(sft) | 39.1 | 51.8 | 46.4 | 40.1 | 43.3 | - |
ChatGLM2-6B(sft-1000) | 51.0 | 46.2 | 39.5 | 49.8 | 45.7 | 60.0 |
ChatGLM2-6B(sft-8000) | 43.5 | 40.5 | 35.8 | 43.2 | 40.2 | 47.5 |
The tables above list the zero-shot and five-shot accuracy of the evaluated models. Notably, for some instruction-tuned models the zero-shot results are better than the few-shot results.
Steps
1. Clone the repository
git clone https://github.com/SJTU-LIT/ceval.git
2. Download the exam data
cd ceval
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
# unzip
unzip ceval-exam.zip -d data
3. Evaluate on the validation set
python3 code/evaluator_series/eval_all.py \
--ntrain 5 \
--cuda_device 0 \
--model_name /home/user/code/models/chatglm2-6b \
--few_shot
4. Evaluate on the test set (slower)
python3 code/evaluator_series/eval_all.py \
--ntrain 5 \
--cuda_device 0 \
--model_name /home/user/code/models/chatglm2-6b \
--few_shot \
--do_test
Parameter descriptions
- model_name: the model to evaluate; can be a local path (the evaluator is picked by substring match, e.g. a path containing "chatglm")
- cot: whether to use chain-of-thought prompting
- few_shot: whether to use few-shot prompting
- ntrain: number of few-shot examples when few_shot is set (5-shot: ntrain=5); ignored when few_shot is not set
- cuda_device: which GPU to use
- do_test: if set, evaluate on the test split (12,342 questions); otherwise the val split (1,346 questions) is used
5. Inspect the results
The final results are written under the logs folder: summary.json holds the aggregated, per-category scores, and submission.json collects all answers produced in this run.
root@f63647a53020:/home/user/code/scripts/ceval/logs/take0# ls
accountant_test.csv discrete_mathematics_test.csv law_test.csv operating_system_test.csv
advanced_mathematics_test.csv education_science_test.csv legal_professional_test.csv physician_test.csv
art_studies_test.csv electrical_engineer_test.csv logic_test.csv plant_protection_test.csv
basic_medicine_test.csv environmental_impact_assessment_engineer_test.csv mao_zedong_thought_test.csv probability_and_statistics_test.csv
business_administration_test.csv fire_engineer_test.csv marxism_test.csv professional_tour_guide_test.csv
chinese_language_and_literature_test.csv high_school_biology_test.csv metrology_engineer_test.csv sports_science_test.csv
civil_servant_test.csv high_school_chemistry_test.csv middle_school_biology_test.csv submission.json
clinical_medicine_test.csv high_school_chinese_test.csv middle_school_chemistry_test.csv summary.json
college_chemistry_test.csv high_school_geography_test.csv middle_school_geography_test.csv tax_accountant_test.csv
college_economics_test.csv high_school_history_test.csv middle_school_history_test.csv teacher_qualification_test.csv
college_physics_test.csv high_school_mathematics_test.csv middle_school_mathematics_test.csv urban_and_rural_planner_test.csv
college_programming_test.csv high_school_physics_test.csv middle_school_physics_test.csv veterinary_medicine_test.csv
computer_architecture_test.csv high_school_politics_test.csv middle_school_politics_test.csv
computer_network_test.csv ideological_and_moral_cultivation_test.csv modern_chinese_history_test.csv
⚠️ When predicting on the test set (i.e. with do_test), score and correct will be 0 because the test labels are not released; this is expected. To obtain test-set results, submit the submission.json file to the C-Eval team, following the official C-Eval submission process.
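Once a run finishes, the grouped scores can be checked quickly without opening the files by hand. A minimal sketch, assuming a finished run directory such as the logs/take0 folder shown above; summary.json has the shape produced by the eval_all.py script below:

```python
# Minimal sketch: print the per-category and overall C-Eval scores of a finished run.
# The run directory is an assumption -- point it at your own logs/<run> folder.
import json
from pathlib import Path

run_dir = Path("logs/take0")  # hypothetical run directory
summary = json.loads((run_dir / "summary.json").read_text())

for group, info in summary["grouped"].items():
    print(f"{group:>15}: {info['score']:.1f}  ({info['num']} questions)")
print(f"{'All':>15}: {summary['All']['score']:.1f}")
```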
Script improvements
Create a new script, eval_all.py, that evaluates every subject in one run:
import os
import json
import argparse
import pandas as pd
import torch
from evaluators.chatgpt import ChatGPT_Evaluator
from evaluators.moss import Moss_Evaluator
from evaluators.chatglm import ChatGLM_Evaluator
from evaluators.minimax import MiniMax_Evaluator
import time
choices = ["A", "B", "C", "D"]
def main(args):
model_type = ''
if "turbo" in args.model_name or "gpt-4" in args.model_name:
evaluator = ChatGPT_Evaluator(
choices=choices,
k=args.ntrain,
api_key=args.openai_key,
model_name=args.model_name
)
model_type = "ChatGPT"
elif "moss" in args.model_name:
evaluator = Moss_Evaluator(
choices=choices,
k=args.ntrain,
model_name=args.model_name
)
model_type = "Moss"
elif "chatglm" in args.model_name:
if args.cuda_device:
os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device
device = torch.device("cuda")
evaluator = ChatGLM_Evaluator(
choices=choices,
k=args.ntrain,
model_name=args.model_name,
device=device
)
model_type = "ChatGLM"
elif "minimax" in args.model_name:
evaluator = MiniMax_Evaluator(
choices=choices,
k=args.ntrain,
group_id=args.minimax_group_id,
api_key=args.minimax_key,
model_name=args.model_name
)
model_type = "MiniMax"
else:
print("Unknown model name")
return -1
    # make sure the subject mapping file exists
assert os.path.exists("subject_mapping.json"), "subject_mapping.json not found!"
with open("subject_mapping.json") as f:
subject_mapping = json.load(f)
print(f'{"*"*50} 测评类别 {"*"*50}')
print(json.dumps(subject_mapping, sort_keys=True, indent=2, ensure_ascii=False, separators=(',', ': ')))
filenames = os.listdir("data/val")
subject_list = [val_file.replace("_val.csv", "") for val_file in filenames]
accuracy, summary, all_answers = {}, {}, {}
    # create the results directory
if not os.path.exists(r"logs"):
os.mkdir(r"logs")
run_date = time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time()))
save_result_dir = os.path.join(r"logs", f"{model_type}_{run_date}")
os.mkdir(save_result_dir)
print('args.do_test------>', args.do_test, args.do_test == True)
    # evaluate each subject in turn
for index, subject_name in enumerate(subject_list):
run_subject_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
print(f"{run_subject_time} [{index+1}/{len(subject_list)}] 测评 {model_type} 模型的 [{subject_mapping[subject_name][1]}] 类别!")
val_file_path = os.path.join('data/val', f'{subject_name}_val.csv')
dev_file_path = os.path.join('data/dev', f'{subject_name}_dev.csv')
test_file_path = os.path.join('data/test', f'{subject_name}_test.csv')
        val_df = pd.read_csv(test_file_path) if args.do_test else pd.read_csv(val_file_path)
dev_df = pd.read_csv(dev_file_path) if args.few_shot else None
correct_ratio, answers = evaluator.eval_subject(
subject_name=subject_name, test_df=val_df, dev_df=dev_df, few_shot=args.few_shot,
save_result_dir=save_result_dir, cot=args.cot, do_test=args.do_test
)
print(f"Subject: {subject_mapping[subject_name][1]}, Acc: {correct_ratio}")
print()
accuracy[subject_name] = correct_ratio
summary[subject_name] = {"score": correct_ratio, "num": len(val_df), "correct": correct_ratio*len(val_df)/100}
all_answers[subject_name] = answers
    # save all answers for submission
json.dump(all_answers, open(save_result_dir+'/submission.json', 'w'), ensure_ascii=False, indent=4)
print(f'{"*"*50} 准确性汇总 {"*"*50}')
for k, v in accuracy.items():
print(f"{k}({subject_mapping[k][1]}): {v}")
print()
total_num = 0
total_correct = 0
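    # roll the per-subject counts up into the four C-Eval categories defined in subject_mapping.json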
summary['grouped'] = {
"STEM": {"correct": 0.0, "num": 0},
"Social Science": {"correct": 0.0, "num": 0},
"Humanities": {"correct": 0.0, "num": 0},
"Other": {"correct": 0.0, "num": 0}
}
for subj, info in subject_mapping.items():
group = info[2]
summary['grouped'][group]["num"] += summary[subj]['num']
summary['grouped'][group]["correct"] += summary[subj]['correct']
for group, info in summary['grouped'].items():
info['score'] = info["correct"] / info["num"] * 100
total_num += info["num"]
total_correct += info["correct"]
summary['All'] = {"score": total_correct / total_num * 100, "num": total_num, "correct": total_correct}
json.dump(summary, open(save_result_dir+'/summary.json', 'w'), ensure_ascii=False, indent=2)
print(f'{"*"*50} 最终评分 {"*"*50}')
print(json.dumps(summary['grouped'], sort_keys=True, indent=2, ensure_ascii=False, separators=(',', ': ')))
print()
print(json.dumps(summary['All'], sort_keys=True, indent=2, ensure_ascii=False, separators=(',', ': ')))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--ntrain", "-k", type=int, default=5)
parser.add_argument("--openai_key", type=str, default="xxx")
parser.add_argument("--minimax_group_id", type=str, default="xxx")
parser.add_argument("--minimax_key", type=str, default="xxx")
parser.add_argument("--few_shot", action="store_true")
parser.add_argument("--model_name", type=str)
parser.add_argument("--cot", action="store_true")
parser.add_argument("--subject", "-s", type=str, default="operating_system")
parser.add_argument("--cuda_device", type=str)
parser.add_argument("--do_test", action="store_true")
args = parser.parse_args()
main(args)
Modify the chatglm.py evaluator script as follows:
import os
import re
from tqdm import tqdm
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.generation.logits_process import LogitsProcessor
from transformers.generation.utils import LogitsProcessorList
from evaluators.evaluator import Evaluator
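# Replaces NaN/inf logits with a safe distribution during generation (the same safeguard ChatGLM's own generation code applies).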
class InvalidScoreLogitsProcessor(LogitsProcessor):
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
if torch.isnan(scores).any() or torch.isinf(scores).any():
scores.zero_()
scores[..., 5] = 5e4
return scores
class ChatGLM_Evaluator(Evaluator):
def __init__(self, choices, k, model_name, device):
super(ChatGLM_Evaluator, self).__init__(choices, model_name, k)
# try adding 'mirror="tuna"' and 'resume_download=True' if facing the 'read timed out' problem
# or directly clone the model
self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, mirror="tuna")
self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True, mirror="tuna", resume_download=True).half().to(device)
def eval_subject(self, subject_name, test_df, dev_df=None, few_shot=False, cot=False, save_result_dir=None, do_test=False):
correct_num = 0
if save_result_dir:
if few_shot:
result = []
score = []
if few_shot:
history = self.generate_few_shot_prompt(subject_name, dev_df, cot=cot)
else:
history = []
all_answers = {}
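        # the test split ships without labels, so fall back to 'NA' placeholders there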
answers = ['NA'] * len(test_df) if do_test else list(test_df['answer'])
for row_index, row in tqdm(test_df.iterrows(), total=len(test_df)):
question = self.format_example(row, include_answer=False, cot=cot)
if few_shot:
response, _ = self.model.chat(self.tokenizer, question, do_sample=False, history=history)
response = response.strip()
# For ChatGLM, we use answer extraction in answer-only mode too.
ans, direct_extract = self.extract_cot_answer(row, response)
else: # zero-shot by extracting answer from distribution
ans = self.generate_dist(self.model, self.tokenizer, question, do_sample=False, max_length=2048, history=history)
if ans == answers[row_index]:
correct_num += 1
correct = 1
else:
correct = 0
if save_result_dir:
if few_shot:
result.append(response)
score.append(correct)
all_answers[str(row_index)] = ans
correct_ratio = 100*correct_num/len(answers)
if save_result_dir:
if few_shot:
test_df['model_output'] = result
test_df['correctness'] = score
test_df.to_csv(os.path.join(save_result_dir, f'{subject_name}_test.csv'))
return correct_ratio, all_answers
def generate_few_shot_prompt(self, subject, dev_df, cot=False):
message = []
k = self.k
if self.k == -1:
k = dev_df.shape[0]
message.append(self.format_example(dev_df.iloc[0, :], cot=cot, add_prompt=f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n"))
for i in range(1, k):
message.append(self.format_example(dev_df.iloc[i, :], cot=cot))
return message
def format_example(self, line, include_answer=True, cot=False, add_prompt=''):
example = add_prompt + line['question']
# print(example)
for choice in self.choices:
example += f'\n{choice}. {line[f"{choice}"]}'
example += '\n答案:'
if include_answer:
if cot:
ans = "让我们一步一步思考,\n" + line["explanation"] + f"\n所以答案是{line['answer']}。"
else:
ans = line["answer"]
m = (example, ans)
return m
return example
def extract_cot_answer(self, line, gen_ans):
m = re.findall(r'所以答案是(.+?)。', gen_ans, re.M)
if len(m) > 0 and m[-1] in self.choices:
return m[-1], True
answer_patterns = [
r'([ABCD])是正确的',
r'选项([ABCD])正确',
r'答案为([ABCD])',
r'答案是([ABCD])',
r'答案([ABCD])',
r'选择([ABCD])',
r'答案:([ABCD])',
r'选择答案([ABCD])'
]
# RE extraction
for answer_pattern in answer_patterns:
m = re.search(answer_pattern, gen_ans, re.M)
if m:
answer = m.group(1)
return answer, False
# only containing one choice-character
m = re.findall(r'[ABCD]', gen_ans, re.M)
if len(m) == 1:
answer = m[0]
return answer, False
answer_word_counter = 0
# only containing one choice-context
for c in self.choices:
if str(line[f'{c}']) in gen_ans:
answer = c
answer_word_counter += 1
if answer_word_counter == 1:
return answer, False
return '-', False
def generate_dist(self, model, tokenizer, query, history, num_beams=1, max_length=2048,
do_sample=False, top_p=0.7, temperature=0.95, logits_processor=None, **kwargs):
if history is None:
history = []
if logits_processor is None:
logits_processor = LogitsProcessorList()
logits_processor.append(InvalidScoreLogitsProcessor())
gen_kwargs = {"num_beams": num_beams, "do_sample": do_sample, "top_p": top_p, "max_length": 2048,
"temperature": temperature, "logits_processor": logits_processor, **kwargs}
if not history:
prompt = query
else:
prompt = ""
for i, (old_query, response) in enumerate(history):
prompt += "[Round {}]\n问:{}\n答:{}\n".format(i, old_query, response)
prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)
inputs = tokenizer([prompt], return_tensors="pt")
inputs = inputs.to(model.device)
outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True, **gen_kwargs)
score = outputs.scores[0][0].tolist()
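        # 167/333/251/416 are assumed to be the ChatGLM vocabulary ids of the tokens 'A'/'B'/'C'/'D';
        # the predicted answer is the choice whose token gets the highest first-step score (re-check these ids for other tokenizers)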
choice_score = [score[167], score[333], score[251], score[416]]
ranked_index = [index for index, value in sorted(list(enumerate(choice_score)), key=lambda x:x[1], reverse=True)]
return self.choices[ranked_index[0]]