Evaluating Large Language Models with C-Eval



Nowadays almost every team has its own conversational large model, and each claims to be closing in on ChatGPT, so a reasonably objective benchmark for comparison is badly needed.

This article introduces C-Eval, an evaluation suite produced jointly by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh, and shows how to use it.

It contains 13,948 multiple-choice questions covering 52 distinct disciplines and four difficulty levels, as shown in the figure below. To a certain extent it helps developers track progress during model development and analyze a model's strengths and weaknesses.

(Figure: overview of C-Eval's 52 subjects across four difficulty levels)
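If you want to inspect the data before running the full harness, the dataset is also published on the Hugging Face Hub. A minimal sketch, assuming the datasets library is installed (computer_network is just one of the 52 subjects):

from datasets import load_dataset

# Each C-Eval subject ships three splits: dev (few-shot examples with
# explanations), val (labeled), and test (unlabeled).
dataset = load_dataset("ceval/ceval-exam", name="computer_network")
print(dataset)            # DatasetDict with 'dev', 'val' and 'test' splits
print(dataset["val"][0])  # one multiple-choice question with choices A-D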

Zero-shot model accuracy (%)

| Model | STEM | Social Science | Humanities | Other | Average |
| --- | --- | --- | --- | --- | --- |
| ChatGLM-6B | 36.3 | 46.2 | 40.5 | 38.8 | 39.8 |
| ChatGLM2-6B | 24.2 | 21.5 | 18.7 | 26.0 | 23.1 |
| Chinese-Alpaca-Plus-7B | 33.3 | 40.4 | 38.9 | 36.7 | 36.8 |
| Chinese-Alpaca-Plus-13B | 33.3 | 41.4 | 45.9 | 39.3 | 43.5 |

Five-shot model accuracy (%)

| Model | STEM | Social Science | Humanities | Other | Average | scst |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGLM-6B | 32.1 | 41.1 | 40.9 | 38.5 | 37.4 | - |
| ChatGLM-6B (test set) | 29.9 | 39.0 | 37.8 | 34.1 | 34.2 | - |
| ChatGLM2-6B | 47.2 | 63.3 | 57.6 | 50.0 | 53.3 | - |
| ChatGLM2-6B (test set) | 47.0 | 63.3 | 51.1 | 47.6 | 51.1 | - |
| Chinese-Alpaca-Plus-7B | 31.9 | 36.7 | 31.5 | 32.8 | 33.1 | - |
| Chinese-Alpaca-Plus-13B | 39.1 | 47.3 | 43.2 | 41.7 | 42.3 | - |
| Baichuan-7B | 25.1 | 25.8 | 26.8 | 25.0 | 25.6 | - |
| Baichuan-7B (test set) | 38.2 | 52.0 | 46.2 | 39.3 | 42.8 | - |
| Baichuan-7B (sft) | 39.1 | 51.8 | 46.4 | 40.1 | 43.3 | - |
| ChatGLM2-6B (sft-1000) | 51.0 | 46.2 | 39.5 | 49.8 | 45.7 | 60.0 |
| ChatGLM2-6B (sft-8000) | 43.5 | 40.5 | 35.8 | 43.2 | 40.2 | 47.5 |

The tables above list the zero-shot and five-shot accuracy of the evaluated models; rows marked "test set" were scored on the test split, the rest on the validation split. Notably, for some instruction-tuned models the zero-shot results are better than the few-shot ones.

Usage

1. Clone the repository

git clone https://github.com/SJTU-LIT/ceval.git

2. Download the evaluation data

cd ceval
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
# unzip into data/
unzip ceval-exam.zip -d data
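After unzipping you should have data/dev, data/val, and data/test, one CSV per subject. A quick sanity check with pandas, assuming the standard C-Eval column layout (computer_network is just an example file):

import pandas as pd

# val files carry: id, question, A, B, C, D, answer;
# dev files additionally have an explanation column used for CoT prompts
val_df = pd.read_csv("data/val/computer_network_val.csv")
print(val_df.columns.tolist())
print(val_df.iloc[0])  # one question, its four choices, and the gold answer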

3. Evaluate on the validation set

python3 code/evaluator_series/eval_all.py \
    --ntrain 5 \
    --cuda_device 0 \
    --model_name /home/user/code/models/chatglm2-6b \
    --few_shot

4. Evaluate on the test set (slower)

python3 code/evaluator_series/eval_all.py \
    --ntrain 5 \
    --cuda_device 0 \
    --model_name /home/user/code/models/chatglm2-6b \
    --few_shot \
    --do_test

Argument reference

  1. model_name: the model to evaluate; may be a local path
  2. cot: whether to use chain-of-thought prompting
  3. few_shot: whether to use few-shot prompting (omit it for a zero-shot run)
  4. ntrain: number of few-shot examples when few_shot is set (5-shot: ntrain=5); has no effect without few_shot
  5. cuda_device: which GPU to run on
  6. do_test: if set, evaluate on the test set (12,342 questions); otherwise on the validation set (1,346 questions)

5. Inspect the results

The final results are written to the logs folder: summary.json contains the scores aggregated by category, and submission.json collects all of this run's answers.

root@f63647a53020:/home/user/code/scripts/ceval/logs/take0# ls
accountant_test.csv                       discrete_mathematics_test.csv                      law_test.csv                        operating_system_test.csv
advanced_mathematics_test.csv             education_science_test.csv                         legal_professional_test.csv         physician_test.csv
art_studies_test.csv                      electrical_engineer_test.csv                       logic_test.csv                      plant_protection_test.csv
basic_medicine_test.csv                   environmental_impact_assessment_engineer_test.csv  mao_zedong_thought_test.csv         probability_and_statistics_test.csv
business_administration_test.csv          fire_engineer_test.csv                             marxism_test.csv                    professional_tour_guide_test.csv
chinese_language_and_literature_test.csv  high_school_biology_test.csv                       metrology_engineer_test.csv         sports_science_test.csv
civil_servant_test.csv                    high_school_chemistry_test.csv                     middle_school_biology_test.csv      submission.json
clinical_medicine_test.csv                high_school_chinese_test.csv                       middle_school_chemistry_test.csv    summary.json
college_chemistry_test.csv                high_school_geography_test.csv                     middle_school_geography_test.csv    tax_accountant_test.csv
college_economics_test.csv                high_school_history_test.csv                       middle_school_history_test.csv      teacher_qualification_test.csv
college_physics_test.csv                  high_school_mathematics_test.csv                   middle_school_mathematics_test.csv  urban_and_rural_planner_test.csv
college_programming_test.csv              high_school_physics_test.csv                       middle_school_physics_test.csv      veterinary_medicine_test.csv
computer_architecture_test.csv            high_school_politics_test.csv                      middle_school_politics_test.csv
computer_network_test.csv                 ideological_and_moral_cultivation_test.csv         modern_chinese_history_test.csv
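To consume the scores programmatically rather than eyeballing the JSON, here is a small sketch based on the summary.json layout that eval_all.py (below) writes; logs/take0 is just this run's folder:

import json

with open("logs/take0/summary.json") as f:
    summary = json.load(f)

# 'grouped' holds the four category aggregates, 'All' the overall score
for group, info in summary["grouped"].items():
    print(f'{group}: {info["score"]:.1f} ({info["correct"]:.0f}/{info["num"]})')
print(f'All: {summary["All"]["score"]:.1f}')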

⚠️ When predicting on the test set (with do_test specified), score and correct will be 0 because the test labels are not released; this is expected. Test-set results require submitting the submission.json file to the C-Eval team; see the official C-Eval submission process for details.
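For reference, submission.json maps each subject to a dict from question index to the predicted letter. This shape follows from how eval_all.py below collects all_answers; the values here are purely illustrative:

import json

# illustrative shape only; a real file contains every subject and question
example_submission = {
    "computer_network": {"0": "A", "1": "C"},
    "operating_system": {"0": "B"},
}
print(json.dumps(example_submission, indent=2))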

Script improvements

Create an eval_all.py script that evaluates every category:

import os
import json
import argparse
import pandas as pd
import torch
from evaluators.chatgpt import ChatGPT_Evaluator
from evaluators.moss import Moss_Evaluator
from evaluators.chatglm import ChatGLM_Evaluator
from evaluators.minimax import MiniMax_Evaluator

import time
choices = ["A", "B", "C", "D"]


def main(args):
    model_type = ''
    if "turbo" in args.model_name or "gpt-4" in args.model_name:
        evaluator = ChatGPT_Evaluator(
            choices=choices,
            k=args.ntrain,
            api_key=args.openai_key,
            model_name=args.model_name
        )
        model_type = "ChatGPT"
    elif "moss" in args.model_name:
        evaluator = Moss_Evaluator(
            choices=choices,
            k=args.ntrain,
            model_name=args.model_name
        )
        model_type = "Moss"
    elif "chatglm" in args.model_name:
        if args.cuda_device:
            os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device
        device = torch.device("cuda")
        evaluator = ChatGLM_Evaluator(
            choices=choices,
            k=args.ntrain,
            model_name=args.model_name,
            device=device
        )
        model_type = "ChatGLM"
    elif "minimax" in args.model_name:
        evaluator = MiniMax_Evaluator(
            choices=choices,
            k=args.ntrain,
            group_id=args.minimax_group_id,
            api_key=args.minimax_key,
            model_name=args.model_name
        )
        model_type = "MiniMax"
    else:
        print("Unknown model name")
        return -1

    # make sure the subject mapping file exists
    assert os.path.exists("subject_mapping.json"), "subject_mapping.json not found!"
    with open("subject_mapping.json") as f:
        subject_mapping = json.load(f)
    print(f'{"*"*50} subjects under evaluation {"*"*50}')
    print(json.dumps(subject_mapping, sort_keys=True, indent=2, ensure_ascii=False, separators=(',', ': ')))

    filenames = os.listdir("data/val")
    subject_list = [val_file.replace("_val.csv", "") for val_file in filenames]
    accuracy, summary, all_answers = {}, {}, {}

    # per-run results directory under logs/
    if not os.path.exists(r"logs"):
        os.mkdir(r"logs")
    run_date = time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time()))
    save_result_dir = os.path.join(r"logs", f"{model_type}_{run_date}")
    os.mkdir(save_result_dir)
    print('do_test:', args.do_test)
    # evaluate every subject in turn
    for index, subject_name in enumerate(subject_list):
        run_subject_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        print(f"{run_subject_time}   [{index+1}/{len(subject_list)}] evaluating {model_type} on [{subject_mapping[subject_name][1]}]")

        val_file_path = os.path.join('data/val', f'{subject_name}_val.csv')
        dev_file_path = os.path.join('data/dev', f'{subject_name}_dev.csv')
        test_file_path = os.path.join('data/test', f'{subject_name}_test.csv')

        # read the unlabeled test split when --do_test is set, else the labeled val split
        val_df = pd.read_csv(test_file_path) if args.do_test else pd.read_csv(val_file_path)
        dev_df = pd.read_csv(dev_file_path) if args.few_shot else None

        correct_ratio, answers = evaluator.eval_subject(
            subject_name=subject_name, test_df=val_df, dev_df=dev_df, few_shot=args.few_shot,
            save_result_dir=save_result_dir, cot=args.cot, do_test=args.do_test
        )

        print(f"Subject: {subject_mapping[subject_name][1]}, Acc: {correct_ratio}")
        print()
        accuracy[subject_name] = correct_ratio
        summary[subject_name] = {"score": correct_ratio, "num": len(val_df), "correct": correct_ratio*len(val_df)/100}
        all_answers[subject_name] = answers

    # save all per-question answers for submission
    json.dump(all_answers, open(save_result_dir+'/submission.json', 'w'), ensure_ascii=False, indent=4)
    print(f'{"*"*50} accuracy summary {"*"*50}')
    for k, v in accuracy.items():
        print(f"{k}({subject_mapping[k][1]}): {v}")
    print()

    total_num = 0
    total_correct = 0
    summary['grouped'] = {
        "STEM": {"correct": 0.0, "num": 0},
        "Social Science": {"correct": 0.0, "num": 0},
        "Humanities": {"correct": 0.0, "num": 0},
        "Other": {"correct": 0.0, "num": 0}
    }
    for subj, info in subject_mapping.items():
        group = info[2]
        summary['grouped'][group]["num"] += summary[subj]['num']
        summary['grouped'][group]["correct"] += summary[subj]['correct']
    for group, info in summary['grouped'].items():
        info['score'] = info["correct"] / info["num"] * 100
        total_num += info["num"]
        total_correct += info["correct"]
    summary['All'] = {"score": total_correct / total_num * 100, "num": total_num, "correct": total_correct}

    json.dump(summary, open(save_result_dir+'/summary.json', 'w'), ensure_ascii=False, indent=2)

    print(f'{"*"*50} final scores {"*"*50}')
    print(json.dumps(summary['grouped'], sort_keys=True, indent=2, ensure_ascii=False, separators=(',', ': ')))
    print()
    print(json.dumps(summary['All'], sort_keys=True, indent=2, ensure_ascii=False, separators=(',', ': ')))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ntrain", "-k", type=int, default=5)
    parser.add_argument("--openai_key", type=str, default="xxx")
    parser.add_argument("--minimax_group_id", type=str, default="xxx")
    parser.add_argument("--minimax_key", type=str, default="xxx")
    parser.add_argument("--few_shot", action="store_true")
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--cot", action="store_true")
    parser.add_argument("--subject", "-s", type=str, default="operating_system")
    parser.add_argument("--cuda_device", type=str)
    parser.add_argument("--do_test", action="store_true")
    args = parser.parse_args()
    main(args)

Then adjust the chatglm.py evaluator:

import os
import re
from tqdm import tqdm
import torch
from transformers import AutoTokenizer, AutoModel
from transformers.generation.logits_process import LogitsProcessor
from transformers.generation.utils import LogitsProcessorList
from evaluators.evaluator import Evaluator

# replaces NaN/Inf logits so generation cannot crash (mirrors the ChatGLM reference code)
class InvalidScoreLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if torch.isnan(scores).any() or torch.isinf(scores).any():
            scores.zero_()
            scores[..., 5] = 5e4
        return scores

class ChatGLM_Evaluator(Evaluator):
    def __init__(self, choices, k, model_name, device):
        super(ChatGLM_Evaluator, self).__init__(choices, model_name, k)
        # try adding 'mirror="tuna"' and 'resume_download=True' if facing the 'read timed out' problem
        # or directly clone the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, mirror="tuna")
        self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True, mirror="tuna", resume_download=True).half().to(device)

    def eval_subject(self, subject_name, test_df, dev_df=None, few_shot=False, cot=False, save_result_dir=None, do_test=False):
        correct_num = 0
        if save_result_dir:
            if few_shot:
                result = []
            score = []
        if few_shot:
            history = self.generate_few_shot_prompt(subject_name, dev_df, cot=cot)
        else:
            history = []

        all_answers = {}
        answers = ['NA'] * len(test_df) if do_test else list(test_df['answer'])
        for row_index, row in tqdm(test_df.iterrows(), total=len(test_df)):
            question = self.format_example(row, include_answer=False, cot=cot)
            if few_shot:
                response, _ = self.model.chat(self.tokenizer, question, do_sample=False, history=history)
                response = response.strip()
                # For ChatGLM, we use answer extraction in answer-only mode too.
                ans, direct_extract = self.extract_cot_answer(row, response)
            else:   # zero-shot by extracting answer from distribution
                ans = self.generate_dist(self.model, self.tokenizer, question, do_sample=False, max_length=2048, history=history)
            if ans == answers[row_index]:
                correct_num += 1
                correct = 1
            else:
                correct = 0
            if save_result_dir:
                if few_shot:
                    result.append(response)
                score.append(correct)
            all_answers[str(row_index)] = ans
        correct_ratio = 100*correct_num/len(answers)
        
        if save_result_dir:
            if few_shot:
                test_df['model_output'] = result
            test_df['correctness'] = score
            test_df.to_csv(os.path.join(save_result_dir, f'{subject_name}_test.csv'))

        return correct_ratio, all_answers
    
    def generate_few_shot_prompt(self, subject, dev_df, cot=False):
        message = []
        k = self.k
        if self.k == -1:
            k = dev_df.shape[0]
        message.append(self.format_example(dev_df.iloc[0, :], cot=cot, add_prompt=f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n"))
        for i in range(1, k):
            message.append(self.format_example(dev_df.iloc[i, :], cot=cot))
        return message
        
    def format_example(self, line, include_answer=True, cot=False, add_prompt=''):
        example = add_prompt + line['question']
        # print(example)
        for choice in self.choices:
            example += f'\n{choice}. {line[f"{choice}"]}'
        example += '\n答案:'
        if include_answer:
            if cot:
                ans = "让我们一步一步思考,\n" + line["explanation"] + f"\n所以答案是{line['answer']}。"
            else:
                ans = line["answer"]
            m = (example, ans)
            return m
        return example
    
    def extract_cot_answer(self, line, gen_ans):
        m = re.findall(r'所以答案是(.+?)。', gen_ans, re.M)
        if len(m) > 0 and m[-1] in self.choices:
            return m[-1], True
        answer_patterns = [
            r'([ABCD])是正确的',
            r'选项([ABCD])正确',
            r'答案为([ABCD])',
            r'答案是([ABCD])',
            r'答案([ABCD])',
            r'选择([ABCD])',
            r'答案:([ABCD])',
            r'选择答案([ABCD])'
        ]
        # RE extraction
        for answer_pattern in answer_patterns:
            m = re.search(answer_pattern, gen_ans, re.M)
            if m:
                answer = m.group(1)
                return answer, False
        # only containing one choice-character
        m = re.findall(r'[ABCD]', gen_ans, re.M)
        if len(m) == 1:
            answer = m[0]
            return answer, False
        answer_word_counter = 0
        # only containing one choice-context
        for c in self.choices:
            if str(line[f'{c}']) in gen_ans:
                answer = c
                answer_word_counter += 1
        if answer_word_counter == 1:
            return answer, False
        return '-', False
    
    def generate_dist(self, model, tokenizer, query, history, num_beams=1, max_length=2048,
                      do_sample=False, top_p=0.7, temperature=0.95, logits_processor=None, **kwargs):
        if history is None:
            history = []
        if logits_processor is None:
            logits_processor = LogitsProcessorList()
        logits_processor.append(InvalidScoreLogitsProcessor())
        gen_kwargs = {"num_beams": num_beams, "do_sample": do_sample, "top_p": top_p, "max_length": max_length,
                      "temperature": temperature, "logits_processor": logits_processor, **kwargs}
        if not history:
            prompt = query
        else:
            prompt = ""
            for i, (old_query, response) in enumerate(history):
                prompt += "[Round {}]\n问:{}\n答:{}\n".format(i, old_query, response)
            prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)
        inputs = tokenizer([prompt], return_tensors="pt")
        inputs = inputs.to(model.device)
        outputs = model.generate(**inputs, return_dict_in_generate=True, output_scores=True, **gen_kwargs)
        
        # logits of the first generated token; pick the choice whose letter token scores highest
        score = outputs.scores[0][0].tolist()
        choice_score = [score[167], score[333], score[251], score[416]]  # ids of "A"/"B"/"C"/"D" in the ChatGLM-6B vocab
        ranked_index = [index for index, value in sorted(list(enumerate(choice_score)), key=lambda x:x[1], reverse=True)]
        return self.choices[ranked_index[0]]
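One caveat on generate_dist: the four hard-coded indices (167, 333, 251, 416) are the ids of the letter tokens A-D in the ChatGLM-6B vocabulary, so they will not be right for other checkpoints. A hedged sketch for looking them up from the tokenizer instead, assuming each letter encodes to a single trailing token (the model path is the example one used earlier):

from transformers import AutoTokenizer

# example checkpoint path from the commands above; swap in your own
tokenizer = AutoTokenizer.from_pretrained(
    "/home/user/code/models/chatglm2-6b", trust_remote_code=True)
choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1]
              for c in ["A", "B", "C", "D"]]
print(choice_ids)  # use these in place of the hard-coded literals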