I am currently taking part in the "书生大模型实战营" (InternLM Practical Camp), a hands-on course that teaches participants how to develop and apply large models.
To keep a cleaner record of my progress, I extracted the core steps from the official tutorial and stripped out the detailed background explanations, so this write-up works better as a quick-reference manual.
That said, I recommend reading the original documentation while studying: it is very thorough and complete, explains the reason behind each step and the principles involved, and that deeper understanding is what builds solid practical skill.
Basic Island, Level 6
Local environment: Win11.
Record of the steps taken for each task
Task 1: Evaluate the PuYu API
Goal: use OpenCompass to evaluate the PuYu (浦语) API; record the reproduction process and take screenshots.
Time spent: about 3 hours, a good part of it on packages that turned out to be missing at runtime.
Steps:
1. Configure the environment:
conda create -n opencompass python=3.10
conda activate opencompass
cd /root
git clone -b 0.3.3 https://github.com/open-compass/opencompass
cd opencompass
pip install -e .
pip install -r requirements.txt
pip install huggingface_hub==0.25.2
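Before moving on, it is worth confirming the editable install actually imports. A minimal sketch (assuming the package exposes __version__, as recent releases do):
# Sketch: confirm OpenCompass imports from the editable install.
import opencompass
print(opencompass.__version__, opencompass.__file__)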
2. Configure the model:
Create the file /root/opencompass/configs/models/openai/puyu_api.py with the following code.
import os
from opencompass.models import OpenAISDK

internlm_url = 'https://internlm-chat.intern-ai.org.cn/puyu/api/v1/'  # the API service address obtained earlier
internlm_api_key = os.getenv('INTERNLM_API_KEY')  # set your own API key in this environment variable

models = [
    dict(
        # abbr='internlm2.5-latest',
        type=OpenAISDK,
        path='internlm2.5-latest',     # model name used in requests
        key=internlm_api_key,          # API key
        openai_api_base=internlm_url,  # service address
        rpm_verbose=True,              # print the request rate
        query_per_second=0.16,         # request rate limit
        max_out_len=1024,              # maximum output length
        max_seq_len=4096,              # maximum input length
        temperature=0.01,              # generation temperature
        batch_size=1,                  # batch size
        retry=3,                       # number of retries
    )
]
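Before launching the full run, it can save time to confirm the endpoint and key work at all. A minimal connectivity sketch, assuming the openai Python SDK is available (the OpenAISDK wrapper above depends on it) and INTERNLM_API_KEY is exported:
# Sketch: one chat request against the PuYu endpoint (assumes openai>=1.x).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv('INTERNLM_API_KEY'),
    base_url='https://internlm-chat.intern-ai.org.cn/puyu/api/v1/',
)
resp = client.chat.completions.create(
    model='internlm2.5-latest',
    messages=[{'role': 'user', 'content': 'Say hello in one sentence.'}],
    max_tokens=32,
)
print(resp.choices[0].message.content)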
3. Configure the dataset:
Create the file /root/opencompass/configs/datasets/demo/demo_cmmlu_chat_gen.py with the following code.
from mmengine.config import read_base

with read_base():
    from ..cmmlu.cmmlu_gen_c13365 import cmmlu_datasets

# Take only the first sample of each dataset for a quick evaluation
for d in cmmlu_datasets:
    d['abbr'] = 'demo_' + d['abbr']
    d['reader_cfg']['test_range'] = '[0:1]'  # one sample per dataset keeps the run fast
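To preview what this slicing does without starting an evaluation, the config can be loaded with mmengine the same way run.py loads it. A sketch, assuming it is executed from /root/opencompass:
# Sketch: inspect the demo dataset config without running an evaluation.
from mmengine.config import Config

cfg = Config.fromfile('configs/datasets/demo/demo_cmmlu_chat_gen.py')
for d in cfg['cmmlu_datasets'][:3]:
    print(d['abbr'], d['reader_cfg']['test_range'])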
4. Run:
Running the command from the official tutorial as-is fails with missing packages, so install these first:
pip install importlib_metadata
conda install -c conda-forge rouge
Then run in the terminal:
python run.py --models puyu_api.py --datasets demo_cmmlu_chat_gen.py --debug
Task 2: Evaluate a local model (native deployment)
Goal: use OpenCompass to evaluate the internlm2.5-chat-1.8b model on the ceval dataset; record the reproduction process and take screenshots.
Time spent: close to half a day. The final run step kept failing, and it took a lot of searching plus digging through the group chat history to resolve.
Steps:
1. Configure the environment:
cd /root/opencompass
conda activate opencompass
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia -y
apt-get update
apt-get install cmake
pip install protobuf==4.25.3
pip install huggingface-hub==0.23.2
2. Download the dataset:
cp /share/temp/datasets/OpenCompassData-core-20231110.zip /root/opencompass/
unzip OpenCompassData-core-20231110.zip
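The zip should unpack into a data directory under /root/opencompass. A quick sanity check (a sketch; the exact subdirectory names depend on the bundle):
# Sketch: confirm the benchmark data unpacked into ./data.
import os
print(sorted(os.listdir('/root/opencompass/data'))[:10])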
3. Configure the local model:
Create the file /root/opencompass/configs/models/hf_internlm/hf_internlm2_5_1_8b_chat.py with the following code.
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='internlm2_5-1_8b-chat-hf',
        path='/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat/',
        max_out_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
# python run.py --datasets ceval_gen --models hf_internlm2_5_1_8b_chat --debug
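If the later run fails, it helps to first rule out the checkpoint itself. A load check with the transformers version installed above (a sketch; InternLM checkpoints need trust_remote_code=True):
# Sketch: verify the local checkpoint loads before a full evaluation.
from transformers import AutoTokenizer, AutoModelForCausalLM

path = '/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat/'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True, device_map='auto')
print(type(model).__name__, 'loaded')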
4. Run:
Following the official tutorial directly hits several problems (my dev machine image is Cuda12.2-conda):
- numpy and transformers version conflict; fix: pip install transformers==4.41.2
- importlib_metadata missing; fix: pip install importlib_metadata
- rouge missing; fix: conda install -c conda-forge rouge
Thanks to the experts in the InternLM (书生) community group for the pointers.
Then run in the terminal:
python run.py --datasets ceval_gen --models hf_internlm2_5_1_8b_chat --debug
Task 3: Subjective evaluation
Goal: run a subjective evaluation with OpenCompass.
Time spent: about 3 hours, most of it on finding and studying reference material.
Steps:
The official tutorial does not cover this part separately; the following references helped:
- Subjective Evaluation Guide (主观评测指引) — OpenCompass 0.3.6 documentation
- Subjective evaluation of compare-type datasets · open-compass/opencompass · Discussion #1575
- The OpenCompass source also ships an example: /root/opencompass/configs/eval_subjective.py
1. Configure the environment:
Reuse the environment set up in the previous tasks.
2. Configure the subjective evaluation file:
The evaluation config involves three model definitions:
- The model under test; here it is chatglm3-6b, the same model the source example uses.
- The baseline model, usually some version of GPT-4; this one should not be changed.
- The judge model, which compares the outputs of the two models above. As I understand it, this is normally chosen by the evaluation organizer; I used internlm2_5-7b (a local copy).
from mmengine.config import read_base
from opencompass.models import HuggingFacewithChatTemplate

with read_base():
    from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import alpacav2_datasets
    from opencompass.configs.datasets.subjective.compassarena.compassarena_compare import compassarena_datasets
    from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare import arenahard_datasets
    from opencompass.configs.datasets.subjective.compassbench.compassbench_compare import compassbench_datasets
    from opencompass.configs.datasets.subjective.fofo.fofo_judge import fofo_datasets
    from opencompass.configs.datasets.subjective.wildbench.wildbench_pair_judge import wildbench_datasets
    from opencompass.configs.datasets.subjective.multiround.mtbench_single_judge_diff_temp import mtbench_datasets
    from opencompass.configs.datasets.subjective.multiround.mtbench101_judge import mtbench101_datasets

from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import SubjectiveSummarizer

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)

# ------------- Inference Stage ----------------------------------------
# For subjective evaluation, sampling is usually enabled for the models
models = [
    dict(
        type=HuggingFaceChatGLM3,
        abbr='chatglm3-6b-hf',
        path='THUDM/chatglm3-6b',
        tokenizer_path='THUDM/chatglm3-6b',
        model_kwargs=dict(
            device_map='auto',
            trust_remote_code=True,
        ),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        generation_kwargs=dict(
            do_sample=True,  # enable sampling for subjective evaluation
        ),
        meta_template=api_meta_template,
        max_out_len=2048,
        max_seq_len=4096,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]

datasets = [*alignbench_datasets, *alpacav2_datasets, *arenahard_datasets, *compassarena_datasets,
            *compassbench_datasets, *fofo_datasets, *mtbench_datasets, *mtbench101_datasets,
            *wildbench_datasets]  # add the datasets you want

infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
)

# ------------- Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
judge_models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='internlm2_5-7b-chat-hf',
        path='/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat/',  # local judge model
        max_out_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
## ------------- Evaluation Configuration
eval = dict(
    partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'
3. Run:
opencompass configs/eval_subjective.py --debug
Task 4: Evaluate a local model served with LMDeploy
Goal: use OpenCompass to evaluate the internlm2.5-chat-1.8b model, served with LMDeploy, on the ceval dataset.
Time spent: about half an hour; reusing the earlier configuration made this quick.
Steps:
1. Configure the environment:
pip install lmdeploy==0.6.1 openai==1.52.0
2. Start the LMDeploy server:
lmdeploy serve api_server /share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat/ --server-port 23333
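The path field in the next config must match the model name the server registered (per the comment in the config below, that is the path the model was served from). A quick way to list the registered names (a sketch; assumes the server above is running and openai==1.52.0 is installed):
# Sketch: list the model names registered by the local LMDeploy server.
from openai import OpenAI

client = OpenAI(api_key='sk-123456', base_url='http://0.0.0.0:23333/v1')
print([m.id for m in client.models.list().data])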
3. Configure the model:
Create the config script /root/opencompass/configs/models/hf_internlm/hf_internlm2_5_1_8b_chat_api.py.
from opencompass.models import OpenAI

api_meta_template = dict(round=[
    dict(role='HUMAN', api_role='HUMAN'),
    dict(role='BOT', api_role='BOT', generate=True),
])

models = [
    dict(
        abbr='InternLM-2.5-1.8B-Chat',
        type=OpenAI,
        path='/share/new_models/Shanghai_AI_Laboratory/internlm2_5-1_8b-chat/',  # model name registered with the server
        key='sk-123456',  # placeholder key for the local server
        openai_api_base='http://0.0.0.0:23333/v1/chat/completions',
        meta_template=api_meta_template,
        query_per_second=1,
        max_out_len=2048,
        max_seq_len=4096,
        batch_size=8,
    ),
]
4. Run:
My run complained that scikit_learn==1.5.0 was required, so install it first:
pip install scikit_learn==1.5.0
opencompass --models hf_internlm2_5_1_8b_chat_api --datasets ceval_gen --debug