In a previous post, One-Click RLHF Training with DeepSpeed Chat (Part 1): Theory, I covered the background; this post shows how to use DeepSpeed Chat to actually run RLHF training.
The DeepSpeed Chat RLHF training pipeline consists of three main stages:
- Stage 1: Supervised fine-tuning (SFT), where the pretrained language model is fine-tuned on curated human answers to various queries;
- Stage 2: Reward model fine-tuning, where a separate (usually smaller than the SFT model) reward model (RW) is trained on a dataset in which humans have scored multiple answers to the same query;
- Stage 3: RLHF training, where the SFT model is further fine-tuned with the Proximal Policy Optimization (PPO) algorithm using reward feedback from the RW model.
Environment Setup
The base environment is configured as follows:
- OS: Ubuntu 18.04
- CPUs: a single node with Intel CPUs and 1 TB of memory, 64 physical CPUs with 16 cores each
- GPUs: 8x A800 80GB GPUs
- Python: 3.10 (OpenSSL must first be upgraded to 1.1.1t, then Python is built and installed from source)
- NVIDIA driver: 515.65.01 (choose the driver that matches your GPU model)
- CUDA Toolkit: 11.7
- NCCL: nccl_2.14.3-1+cuda11.7
- cuDNN: 8.8.1.3_cuda11
Installation of the NVIDIA driver, CUDA, Python, and the other tools above is not covered here.
Create and activate the virtual environment deepspeedchat-venv-py310-cu117:
cd /home/guodong.li/virtual-venv
virtualenv -p /usr/bin/python3.10 deepspeedchat-venv-py310-cu117
source /home/guodong.li/virtual-venv/deepspeedchat-venv-py310-cu117/bin/activate
Install PyTorch offline by downloading the torch wheel that matches your CUDA version:
pip install torch-1.13.1+cu117-cp310-cp310-linux_x86_64.whl
Install deepspeed, transformers, and the other dependencies:
pip install -r requirements.txt
The requirements.txt file contains:
deepspeed==0.9.1
transformers==4.28.1
datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
Preparing Datasets, Models, and Code
Since the server cannot access the internet, the datasets and models are downloaded locally in advance.
For the datasets, we use open-source datasets from Hugging Face Datasets. Thanks to DeepSpeed RLHF's data abstraction and blending techniques, multiple data sources can be combined for training. However, different datasets may use different prompt conventions (for example, Dahoas/rm-static marks queries with "Human:" and answers with "Assistant:"), so users must align these prompts themselves. The DeepSpeed Chat examples always use the format of Dahoas/rm-static. Evaluation showed that merging the different datasets improves model quality.
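As an illustration, here is a minimal sketch (a hypothetical helper, not part of DeepSpeed Chat) of mapping a dataset with its own field names onto the Dahoas/rm-static "Human:"/"Assistant:" convention:
def to_rm_static_format(query: str, answer: str) -> dict:
    """Align one sample to the Dahoas/rm-static convention, where the query is
    prefixed with "Human:" and the answer slot is introduced by "Assistant:"."""
    prompt = f"\n\nHuman: {query}\n\nAssistant:"
    return {"prompt": prompt, "chosen": " " + answer}

# e.g. a source dataset that stores plain "question"/"answer" fields
sample = {"question": "What is RLHF?", "answer": "Reinforcement learning from human feedback."}
print(to_rm_static_format(sample["question"], sample["answer"]))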
Download the datasets:
git clone https://huggingface.co/datasets/Dahoas/rm-static
git clone https://huggingface.co/datasets/Dahoas/full-hh-rlhf
git clone https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise
git clone https://huggingface.co/datasets/yitingxie/rlhf-reward-datasets
Dahoas/rm-static
Dataset format:
prompt (string) | response (string) | chosen (string) | rejected (string) |
---|---|---|---|
" Human: I am trying to write a fairy tale. What is the most popular plot? Assistant: The most popular plot might be “a princess goes to a faraway land, falls in love, and has a magic genie that grants her wishes”. We can find plenty of examples of this if we search for fairy tales using the search engine Google. Human: You can't look anything up on google. Assistant: OK, so let’s think about some other popular fairy tales. Human: Do you like the plot of Hansel and Gretel? ... Assistant:" | " This sounds like a really interesting modern retelling of the story!" | " This sounds like a really interesting modern retelling of the story!" | " And the prince and the princess both decide that they are more powerful together than apart?" |
Dahoas/full-hh-rlhf
Dataset format:
prompt (string) | response (string) | chosen (string) | rejected (string) |
---|---|---|---|
" Human: Should you buy a case to protect your cell phone? Assistant: It depends on your circumstances. If you carry your phone in a pocket or a purse then you probably want a case. But if you only need a phone for quick interactions, a case may actually cause more harm than good. What do you need the phone for? Are you a parent, or do you work from home? Human: What harm could it do? ... Assistant:" | " You’re welcome." | " You’re welcome." | " It sounds like you’ve got the basics down. Any further questions or concerns? You can send me any feedback you have at help@babba.ai." |
Dahoas/synthetic-instruct-gptj-pairwise
Dataset format:
prompt (string) | chosen (string) | rejected (string) |
---|---|---|
"I was wondering if you could walk me through the process of setting up a hydroponic garden for herbs." | "Sure! The process for setting up a hydroponic garden for herbs is relatively simple. First, you'll want to choose a space where you will set up your hydroponic system. You'll need to make sure the space is well-lit and has access to electricity and an adequate water supply. Next, you'll need to choose the type of hydroponic system you want to use. There are several types of hydroponic systems, so you'll need to decide which best suits your needs. Once you've chosen a system, you'll need to gather the supplies you'll need to assemble it. This includes things like pumps, growing trays, grow lights, and nutrients. Once you've assembled the system, you'll need to add your choice of herbs to the system. Lastly, you'll need to monitor and adjust the system as needed to ensure your herbs are getting the correct amount of light, water, and nutrients." | "How do I store a bagels for eating at a later date? You can place the bagels in an airtight container and reheat them in the microwave. Alternately, you can place the bagels in the microwave, cover them with foil, then heat them in the microwave for a short time." |
yitingxie/rlhf-reward-datasets
Dataset format:
prompt (string) | chosen (string) | rejected (string) |
---|---|---|
" Human: Do you know why turkeys became the official food of thanksgiving? " | "Assistant: To be honest, I don’t know anything about that. I know that I’m meant to know a lot about history and current events, but I haven’t been programmed with those particular facts, sorry." | "Assistant: I know that the American Indians were the first to celebrate the first thanksgiving, and the first European settlers came to New England in the early 1600s, and it is likely that they celebrated the first thanksgiving in the late 1600s. However, it is also likely that some European settlers on the continent celebrated the first thanksgiving in the mid-1500s. A lot of people think that the main factor in thanksgiving is that the settlers in the new world were thankful for finding a new land, and that turkey was their traditional food. Another factor that has often been thought to be important is the Pilgrims’ relationship with the Native Americans, and the feast may be one way of trying to show them respect. I can’t tell you the definitive answer, but maybe this will help you figure it out?" |
Since GPT-3 has no open-source checkpoint, we use pretrained models from Meta's OPT family (e.g., facebook/opt-1.3b); other pretrained models (e.g., GPT-Neo, Bloom) can also be used.
This post uses opt-2.7b to train the actor model and opt-350m to train the reward model. Download the models:
git clone https://huggingface.co/facebook/opt-350m
git clone https://huggingface.co/facebook/opt-2.7b
Modify config.json in the opt-350m directory, changing _name_or_path to the local model path:
{
"_name_or_path": "/home/guodong.li/model/hf-opt-350m",
}
Likewise, modify config.json in the opt-2.7b directory, changing _name_or_path to the local model path (here /home/guodong.li/model/hf-opt-2.7b).
RLHF Training
Download the DeepSpeedExamples code and enter the DeepSpeed Chat directory:
# commit id: 9a586b1
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
The code layout:
> tree
.
|____training # training
| |____utils # utilities
| | |____utils.py
| | |____model # model utilities
| | | |____reward_model.py
| | | |____model_utils.py
| | |____module
| | | |____lora.py
| | |____ds_utils.py
| | |____data # data-processing utilities
| | | |____data_utils.py
| | | |____raw_datasets.py
| |____step1_supervised_finetuning # Stage 1: supervised fine-tuning
| | |____training_log_output
| | | |____opt-1.3b-globalBatchSize128.log
| | |____main.py
| | |____training_scripts # training scripts
| | | |____other_language
| | | | |____run_chinese.sh # supervised fine-tuning based on bloom
| | | | |____run_japanese.sh # supervised fine-tuning based on mGPT
| | | |____multi_node # multi-node multi-GPU training scripts
| | | | |____run_66b.sh
| | | |____README.md
| | | |____single_node # single-node multi-GPU training scripts
| | | | |____run_1.3b_lora.sh
| | | | |____run_13b.sh
| | | | |____run_1.3b.sh
| | | | |____run_30b_lora.sh
| | | | |____run_6.7b.sh
| | | |____single_gpu # single-GPU training scripts
| | | | |____run_6.7b_lora.sh
| | | | |____run_1.3b.sh
| | |____evaluation_scripts
| | | |____run_prompt.sh
| | |____README.md
| | |____prompt_eval.py
| |____step2_reward_model_finetuning # Stage 2: reward model fine-tuning
| | |____rw_eval.py
| | |____training_log_output
| | | |____opt-350m_globalBatchSize-64.log
| | |____main.py
| | |____training_scripts # training scripts
| | | |____multi_node # multi-node multi-GPU training scripts
| | | | |____run_350m.sh
| | | |____README.md
| | | |____single_node # single-node multi-GPU training scripts
| | | | |____run_350m.sh
| | | |____single_gpu # single-GPU training scripts
| | | | |____run_350m.sh
| | |____evaluation_scripts # model evaluation scripts
| | | |____run_eval.sh
| | |____README.md
| |____README.md
| |____step3_rlhf_finetuning # Stage 3: RLHF fine-tuning
| | |____ppo_trainer.py
| | |____training_log_output
| | | |____actor_opt-1.3b_critic_opt-350m_globalBatchSize64.log
| | |____main.py
| | |____BenckmarkSetting.md
| | |____training_scripts # training scripts
| | | |____multi_node # multi-node multi-GPU training scripts
| | | | |____run_66b.sh
| | | |____README.md
| | | |____single_node # single-node multi-GPU training scripts
| | | | |____run_1.3b_lora.sh
| | | | |____run_13b.sh
| | | | |____run_1.3b.sh
| | | | |____run_30b_lora.sh
| | | | |____run_6.7b.sh
| | | |____single_gpu # single-GPU training scripts
| | | | |____run_6.7b_lora.sh
| | | | |____run_1.3b.sh
| | |____rlhf_engine.py
| | |____README.md
|____train.py # training entry point
|____chat.py
|____README.md
|____requirements.txt
|____inference # inference
| |____chatbot.py
Modify the training/utils/data/raw_datasets.py file so that the datasets are loaded from local files.
# The template prompt dataset class that all new dataset porting needs to
# follow in order to have a unified API and unified data format.
class PromptRawDataset(object):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        self.output_path = output_path
        self.seed = seed
        self.local_rank = local_rank
        # Only fetch from the Hugging Face Hub for datasets that are not
        # handled by the local-loading subclasses below.
        if dataset_name not in ("Dahoas/rm-static", "Dahoas/full-hh-rlhf",
                                "Dahoas/synthetic-instruct-gptj-pairwise",
                                "yitingxie/rlhf-reward-datasets"):
            self.raw_datasets = load_dataset(dataset_name)


# English dataset
class DahoasRmstaticDataset(PromptRawDataset):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        super().__init__(output_path, seed, local_rank, dataset_name)
        self.dataset_name = "Dahoas/rm-static"
        self.dataset_name_clean = "Dahoas_rm_static"
        data_files = {
            "train": "train-00000-of-00001-2a1df75c6bce91ab.parquet",
            "test": "test-00000-of-00001-8c7c51afc6d45980.parquet"
        }
        self.raw_datasets = load_dataset(
            "parquet",
            data_dir='/home/guodong.li/data/dahoas/rm-static/data',
            data_files=data_files)


# English dataset
class DahoasFullhhrlhfDataset(PromptRawDataset):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        super().__init__(output_path, seed, local_rank, dataset_name)
        self.dataset_name = "Dahoas/full-hh-rlhf"
        self.dataset_name_clean = "Dahoas_full_hh_rlhf"
        data_files = {
            "train": "train-00000-of-00001-8349d0765e6718df.parquet",
            "test": "test-00000-of-00001-ec71e9262143a91c.parquet"
        }
        self.raw_datasets = load_dataset(
            "parquet",
            data_dir='/home/guodong.li/data/dahoas/full-hh-rlhf/data',
            data_files=data_files)


# English dataset
class DahoasSyntheticinstructgptjpairwiseDataset(PromptRawDataset):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        super().__init__(output_path, seed, local_rank, dataset_name)
        self.dataset_name = "Dahoas/synthetic-instruct-gptj-pairwise"
        self.dataset_name_clean = "Dahoas_synthetic_instruct_gptj_pairwise"
        data_files = {"train": "train-00000-of-00001-1e5d57b93c448e7a.parquet"}
        self.raw_datasets = load_dataset(
            "parquet",
            data_dir='/home/guodong.li/data/dahoas/synthetic-instruct-gptj-pairwise/data',
            data_files=data_files)


# English dataset
class YitingxieRlhfrewarddatasetsDataset(PromptRawDataset):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        super().__init__(output_path, seed, local_rank, dataset_name)
        self.dataset_name = "yitingxie/rlhf-reward-datasets"
        self.dataset_name_clean = "yitingxie_rlhf_reward_datasets"
        data_files = {
            "train": "train-00000-of-00001-2ea3039ca4da89f8.parquet",
            "test": "test-00000-of-00001-955c146ec7a10a1e.parquet"
        }
        self.raw_datasets = load_dataset(
            "parquet",
            data_dir='/home/guodong.li/data/dahoas/rlhf-reward-datasets/data',
            data_files=data_files)
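Before training, it may be worth sanity-checking that the local parquet files load as expected; a small standalone snippet (assuming the same local paths as above):
from datasets import load_dataset

data_files = {
    "train": "train-00000-of-00001-2a1df75c6bce91ab.parquet",
    "test": "test-00000-of-00001-8c7c51afc6d45980.parquet",
}
ds = load_dataset("parquet",
                  data_dir="/home/guodong.li/data/dahoas/rm-static/data",
                  data_files=data_files)
print(ds)                                # DatasetDict with train/test splits
print(ds["train"][0]["prompt"][:200])    # peek at the first prompt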
Stage 1: Supervised Fine-Tuning (SFT)
Supervised fine-tuning (SFT) is very similar to standard language-model fine-tuning on a causal language task (e.g., WikiText-103). The main difference lies in the dataset: SFT fine-tunes the model on high-quality query-answer pairs so that its generations align with human preference.
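Conceptually, each SFT sample is just the query and the chosen answer concatenated into a single causal-LM sequence; a simplified sketch (not DeepSpeed Chat's exact data_utils.py code, and assuming the local opt-2.7b path used below):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/home/guodong.li/model/hf-opt-2.7b")

prompt = "\n\nHuman: How does a telescope work?\n\nAssistant:"
chosen = " It collects and focuses light with mirrors or lenses."

# SFT trains a standard causal LM on the concatenated prompt + chosen answer
text = prompt + chosen + tokenizer.eos_token
tokens = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
print(tokens["input_ids"].shape)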
DeepSpeed Chat provides multiple scripts for training on a single GPU (e.g., one A6000-48G, V100-32G, or A100-40G), on a single node (e.g., 8/16x V100-32G, 8x A100-40G/80G), and on multiple nodes (e.g., 64x A100-80G); they can be found in the training_scripts directory.
Here I run supervised fine-tuning on a single node with multiple GPUs; I reuse the opt-13b training script but actually fine-tune the opt-2.7b model.
Modify the SFT training script training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh:
#!/bin/bash
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=3
fi
mkdir -p $OUTPUT
deepspeed main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
--data_split 2,4,4 \
--model_name_or_path /home/guodong.li/model/hf-opt-2.7b \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 4 \
--max_seq_len 512 \
--learning_rate 1e-4 \
--weight_decay 0. \
--num_train_epochs 6 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--gradient_checkpointing \
--zero_stage $ZERO_STAGE \
--lora_dim 128 \
--lora_module_name decoder.layers. \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
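As a quick sanity check on the effective batch size: with 8 GPUs, the settings above multiply out to the train_batch_size that DeepSpeed later reports in its config dump:
# effective global batch size implied by the script above (8-GPU node assumed)
per_device_train_batch_size = 128
num_gpus = 8
gradient_accumulation_steps = 8
print(per_device_train_batch_size * num_gpus * gradient_accumulation_steps)  # 8192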
Run the following command:
# Move into the first step of the pipeline
cd training/step1_supervised_finetuning/
sh training_scripts/single_node/run_13b.sh /home/guodong.li/output/deepspeedchat 1
Training progress can be checked in the log file training.log, or followed live with tail -n100 -f training.log:
[2023-05-01 11:13:24,604] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-01 11:13:25,933] [INFO] [runner.py:540:main] cmd = /home/guodong.li/virtual-venv/deepspeedchat-venv-py310-cu117/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path /home/guodong.li/model/hf-opt-2.7b --per_device_train_batch_size 128 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 1e-4 --weight_decay 0. --num_train_epochs 6 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 1 --lora_dim 128 --lora_module_name decoder.layers. --deepspeed --output_dir /home/guodong.li/output/deepspeedchat
[2023-05-01 11:13:28,673] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=bond0
[2023-05-01 11:13:28,673] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
[2023-05-01 11:13:28,673] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-01 11:13:28,673] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-01 11:13:28,673] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-01 11:13:28,673] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-01 11:13:28,673] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
...
[2023-05-01 11:15:41,305] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-01 11:15:41,305] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2023-05-01 11:15:41,306] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 1 optimizer
[2023-05-01 11:15:41,306] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 500,000,000
[2023-05-01 11:15:41,306] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 500,000,000
[2023-05-01 11:15:41,306] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: False
[2023-05-01 11:15:41,306] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
...
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/guodong.li/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.1425342559814453 seconds
...
Loading extension module utils...
Time to load utils op: 0.20241904258728027 seconds
Rank: 7 partition count [8, 8] and sizes[(40356800, False), (112960, False)]
...
Rank: 6 partition count [8, 8] and sizes[(40356800, False), (112960, False)]
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005517005920410156 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
...
[2023-05-01 11:16:01,201] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 131.02 GB, percent = 13.0%
[2023-05-01 11:16:01,203] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-05-01 11:16:01,204] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-01 11:16:01,204] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f892bf2b7c0>
[2023-05-01 11:16:01,204] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001, 0.0001], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 11:16:01,205] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] amp_params ................... False
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f8913bf6890>
[2023-05-01 11:16:01,205] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] curriculum_params_legacy ..... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] data_efficiency_enabled ...... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] dataloader_drop_last ......... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] disable_allgather ............ False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] dump_state ................... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1}
...
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] eigenvalue_verbose ........... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] elasticity_enabled ........... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] fp16_auto_cast ............... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] fp16_enabled ................. True
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] global_rank .................. 0
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] grad_accum_dtype ............. None
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] gradient_accumulation_steps .. 8
...
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-05-01 11:16:01,206] [INFO] [config.py:957:print] optimizer_name ............... None
...
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] steps_per_print .............. 10
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] train_batch_size ............. 8192
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 128
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] use_node_local_storage ....... False
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] wall_clock_breakdown ......... False
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] world_size ................... 8
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] zero_allow_untested_optimizer False
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-05-01 11:16:01,207] [INFO] [config.py:957:print] zero_optimization_stage ...... 1
[2023-05-01 11:16:01,207] [INFO] [config.py:943:print_user_config] json = {
"train_batch_size": 8.192000e+03,
"train_micro_batch_size_per_gpu": 128,
"steps_per_print": 10,
"zero_optimization": {
"stage": 1,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
},
"stage3_param_persistence_threshold": 1.000000e+04,
"stage3_max_live_parameters": 3.000000e+07,
"stage3_prefetch_bucket_size": 3.000000e+07,
"memory_efficient_linear": false
},
"fp16": {
"enabled": true,
"loss_scale_window": 100
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false,
"hybrid_engine": {
"enabled": false,
"max_out_tokens": 512,
"inference_tp_size": 1,
"release_inference_cache": false,
"pin_parameters": true,
"tp_gather_partition_size": 8
}
}
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00029349327087402344 seconds
***** Running training *****
***** Evaluating perplexity, Epoch 0/6 *****
ppl: 6027.47900390625
Beginning of Epoch 1/6, Total Micro Batches 58
[2023-05-01 11:17:44,166] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
...
[2023-05-01 11:23:15,323] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
***** Evaluating perplexity, Epoch 1/6 *****
ppl: 3730.6748046875
Beginning of Epoch 2/6, Total Micro Batches 58
[2023-05-01 11:30:23,609] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=5, lr=[9.73465064747553e-05, 9.73465064747553e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 11:30:23,648] [INFO] [timer.py:199:stop] epoch=1/micro_step=22/global_step=10, RunningAvgSamplesPerSec=99.38274452057438, CurrSamplesPerSec=98.13850311863031, MemAllocated=12.12GB, MaxMemAllocated=41.8GB
***** Evaluating perplexity, Epoch 2/6 *****
ppl: 227.2152557373047
Beginning of Epoch 3/6, Total Micro Batches 58
[2023-05-01 11:44:31,098] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=5, lr=[7.777851165098012e-05, 7.777851165098012e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 11:44:31,132] [INFO] [timer.py:199:stop] epoch=2/micro_step=44/global_step=20, RunningAvgSamplesPerSec=98.99445999331354, CurrSamplesPerSec=98.03770159255755, MemAllocated=12.12GB, MaxMemAllocated=41.8GB
***** Evaluating perplexity, Epoch 3/6 *****
ppl: 3.6747963428497314
Beginning of Epoch 4/6, Total Micro Batches 58
***** Evaluating perplexity, Epoch 4/6 *****
ppl: 2.5551822185516357
Beginning of Epoch 5/6, Total Micro Batches 58
[2023-05-01 11:58:48,222] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=5, lr=[4.6729843538492847e-05, 4.6729843538492847e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 11:58:48,261] [INFO] [timer.py:199:stop] epoch=4/micro_step=8/global_step=30, RunningAvgSamplesPerSec=99.19207895349899, CurrSamplesPerSec=98.47500229000546, MemAllocated=12.12GB, MaxMemAllocated=41.8GB
***** Evaluating perplexity, Epoch 5/6 *****
ppl: 2.4443321228027344
Beginning of Epoch 6/6, Total Micro Batches 58
[2023-05-01 12:12:53,265] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=5, lr=[1.703270924499656e-05, 1.703270924499656e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 12:12:53,303] [INFO] [timer.py:199:stop] epoch=5/micro_step=30/global_step=40, RunningAvgSamplesPerSec=99.13119263297529, CurrSamplesPerSec=98.23508650018317, MemAllocated=12.12GB, MaxMemAllocated=41.8GB
***** Evaluating perplexity, Epoch 6/6 *****
ppl: 2.3933522701263428
saving the final model ...
[2023-05-01 12:18:01,458] [INFO] [launch.py:460:main] Process 36166 exits successfully.
...
[2023-05-01 12:18:09,465] [INFO] [launch.py:460:main] Process 36162 exits successfully.
Model weight files:
ls -al --block-size=M
total 5419M
drwxrwxr-x 2 guodong.li guodong.li 1M May 1 13:01 .
drwxrwxr-x 10 guodong.li guodong.li 1M May 1 09:16 ..
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 12:18 config.json
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 12:18 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 5418M May 1 12:18 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 12:18 training.log
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 12:18 vocab.json
After training completes, evaluate the supervised fine-tuned model.
Run:
cd applications/DeepSpeed-Chat/training/step1_supervised_finetuning
sh evaluation_scripts/run_prompt.sh /home/guodong.li/model/hf-opt-2.7b /home/guodong.li/output/deepspeedchat
It requires the user to provide the paths to two models:
- the original pretrained model (i.e., --model_name_or_path_baseline facebook/opt-1.3b)
- the fine-tuned model (i.e., --model_name_or_path_finetune output/check_base)
The evaluation script prompt_eval.py includes several prompts that you can freely update to your liking.
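Under the hood the comparison is simply greedy generation on the same prompt with both models; a simplified sketch of the idea (not prompt_eval.py itself, and assuming the paths used in this post):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

baseline_path = "/home/guodong.li/model/hf-opt-2.7b"
finetuned_path = "/home/guodong.li/output/deepspeedchat"

tokenizer = AutoTokenizer.from_pretrained(baseline_path)
prompt = "Human: Please tell me about Microsoft in a few sentence? Assistant:"
inputs = tokenizer(prompt, return_tensors="pt")

for name, path in [("Baseline", baseline_path), ("finetune", finetuned_path)]:
    model = AutoModelForCausalLM.from_pretrained(path)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding
    print(f"=========={name}: Greedy=========")
    print(tokenizer.decode(out[0], skip_special_tokens=True))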
Evaluation run:
> sh evaluation_scripts/run_prompt.sh /home/guodong.li/model/hf-opt-2.7b /home/guodong.li/output/deepspeedchat
load_hf_tokenizer model_name_or_path: /home/guodong.li/model/hf-opt-2.7b
==========Baseline: Greedy=========
Human: Please tell me about Microsoft in a few sentence? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant: Microsoft is a software company that makes operating systems and applications. Human: What is the most important thing about Microsoft? Assistant:
==========finetune: Greedy=========
Human: Please tell me about Microsoft in a few sentence? Assistant: I'm not sure what you mean by that.
====================prompt end=============================
==========Baseline: Greedy=========
Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: I don't know, I'm not a scientist.
I'm not a scientist either, but I can tell you that the moon landing was faked.
I'm not a scientist either, but I can tell you that the moon landing was faked.
I'm not a scientist either, but I can tell you that the moon landing was faked.
I'm not a scientist either, but I can tell you that the moon landing was faked.
I'm not a scientist either, but
==========finetune: Greedy=========
Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: I can't.
====================prompt end=============================
==========Baseline: Greedy=========
Human: Write a short poem about a wise frog. Assistant: What's a wise frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog that knows everything. Assistant: What's a frog? Human: A frog
==========finetune: Greedy=========
Human: Write a short poem about a wise frog. Assistant: What kind of frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of wise frog? Human: A wise frog. Assistant: What kind of
====================prompt end=============================
==========Baseline: Greedy=========
Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human: Who was president of the United States in 1955? Assistant: Eisenhower. Human:
==========finetune: Greedy=========
Human: Who was president of the United States in 1955? Assistant: Eisenhower
====================prompt end=============================
==========Baseline: Greedy=========
Human: How does a telescope work? Assistant: It's a big mirror.
I'm not sure if you're joking or not, but I'm going to assume you're joking.
I'm not joking. I'm a physics major.
I'm a physics major too. I'm just not sure if you're joking or not.
I'm not joking. I'm a physics major.
I'm a physics major too. I'm just not sure if you're joking or not.
I'm a physics major too.
==========finetune: Greedy=========
Human: How does a telescope work? Assistant: It's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope, it's a telescope,
====================prompt end=============================
==========Baseline: Greedy=========
Human: Why do birds migrate south for the winter? Assistant: Because they're stupid.
I'm not sure if you're being serious or not, but I'm going to go with the latter.
I'm serious. I've heard it from a few people.
==========finetune: Greedy=========
Human: Why do birds migrate south for the winter? Assistant: To get away from the cold.
====================prompt end=============================
Stage 2: Reward Model Fine-Tuning
Reward model (RM) fine-tuning is very similar to the supervised fine-tuning (SFT) of Stage 1. However, there are several key differences between RM and SFT fine-tuning:
- Training data: for SFT fine-tuning, each sample is a query and its answer concatenated together. For RM fine-tuning, however, each batch consists of pairs of query-answer examples, i.e., the same query with a high-scoring answer and a low-scoring answer. This also leads to the second difference described below.
- Training objective: for the RW, the training objective is a pairwise ranking score, i.e., for two query-answer pairs, the RM should give the better answer a higher score (a minimal sketch of this objective follows this list). There are several ways to achieve this; in DeepSpeed Chat's implementation, the score at the end-of-sequence token or at the first padding token is used as the aggregate score and the two scores are compared. Alternatively, the average score over the whole answer could be used instead.
- The --num_padding_at_beginning argument: the RW fine-tuning script exposes an interesting argument, num_padding_at_beginning. It was added because different models may have different padding or tokenizer behavior. Specifically, the tokenizer of the OPT model family always adds a padding token at the beginning, which affects the choice of the scoring token, so this has to be taken into account.
- RW evaluation: an evaluation script, rw_eval.py, is provided so users can run simple prompt-answer tests.
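For intuition, here is a minimal sketch of the pairwise ranking objective described above (simplified; DeepSpeed Chat's reward_model.py additionally handles OPT's leading padding token and compares scores over the span where the two answers diverge):
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_end_scores: torch.Tensor,
                          rejected_end_scores: torch.Tensor) -> torch.Tensor:
    """chosen/rejected_end_scores: reward-model scalar scores read at the
    end-of-sequence (or first padding) token of each query-answer pair."""
    # The RM should score the chosen answer higher than the rejected one.
    return -F.logsigmoid(chosen_end_scores - rejected_end_scores).mean()

# toy example: scores for two query-answer pairs in a batch
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(pairwise_ranking_loss(chosen, rejected))  # smaller when chosen consistently scores higher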
Here I fine-tune the reward model from opt-350m on a single node with multiple GPUs. Of course, you can also train a larger model by simply swapping in the candidate model you prefer and enabling the other efficient training methods described for SFT fine-tuning.
Next, modify the reward model fine-tuning training script training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh:
#!/bin/bash
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=0
fi
mkdir -p $OUTPUT
deepspeed main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
--data_split 2,4,4 \
--model_name_or_path /home/guodong.li/model/hf-opt-350m \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--max_seq_len 512 \
--learning_rate 5e-5 \
--weight_decay 0.1 \
--num_train_epochs 1 \
--disable_dropout \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--zero_stage $ZERO_STAGE \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
Run the following command:
# Move into the second step of the pipeline
cd training/step2_reward_model_finetuning
sh training_scripts/single_node/run_350m.sh /home/guodong.li/output/dschat-reward 2
Training progress can be checked in the log file training.log, or followed live with tail -n100 -f training.log:
[2023-05-01 14:11:48,584] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-01 14:11:49,900] [INFO] [runner.py:540:main] cmd = /home/guodong.li/virtual-venv/deepspeedchat-venv-py310-cu117/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path /home/guodong.li/model/hf-opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --disable_dropout --gradient_accumulation_steps 2 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir /home/guodong.li/output/dschat-reward
[2023-05-01 14:11:52,554] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=bond0
[2023-05-01 14:11:52,554] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
[2023-05-01 14:11:52,554] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-01 14:11:52,554] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-01 14:11:52,554] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-01 14:11:52,554] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-01 14:11:52,554] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-01 14:12:04,010] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
...
[2023-05-01 14:13:50,573] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-05-01 14:13:50,573] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-01 14:13:50,573] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f4f76fd0310>
[2023-05-01 14:13:50,573] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05, 5e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:13:50,573] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] amp_params ................... False
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f4f6cce96c0>
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-05-01 14:13:50,574] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
...
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] elasticity_enabled ........... False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] fp16_auto_cast ............... False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] fp16_enabled ................. True
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False
...
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] initial_dynamic_scale ........ 65536
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] load_universal_checkpoint .... False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] loss_scale ................... 0
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] memory_breakdown ............. False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] optimizer_name ............... None
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] optimizer_params ............. None
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] pld_enabled .................. False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] pld_params ................... False
...
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] world_size ................... 8
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] zero_allow_untested_optimizer False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-05-01 14:13:50,575] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-05-01 14:13:50,576] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-05-01 14:13:50,576] [INFO] [config.py:957:print] zero_optimization_stage ...... 2
[2023-05-01 14:13:50,576] [INFO] [config.py:943:print_user_config] json = {
"train_batch_size": 256,
"train_micro_batch_size_per_gpu": 16,
"steps_per_print": 10,
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
},
"stage3_param_persistence_threshold": 1.000000e+04,
"stage3_max_live_parameters": 3.000000e+07,
"stage3_prefetch_bucket_size": 3.000000e+07,
"memory_efficient_linear": false
},
"fp16": {
"enabled": true,
"loss_scale_window": 100
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false,
"hybrid_engine": {
"enabled": false,
"max_out_tokens": 512,
"inference_tp_size": 1,
"release_inference_cache": false,
"pin_parameters": true,
"tp_gather_partition_size": 8
}
}
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0002818107604980469 seconds
***** Running training *****
***** Evaluating reward, Epoch 0/1 *****
chosen_last_scores (higher is better) : 2.576474905014038, acc (higher is better) : 0.4899999797344208
Beginning of Epoch 1/1, Total Micro Batches 920
[2023-05-01 14:13:59,133] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-05-01 14:14:00,102] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
...
[2023-05-01 14:14:04,888] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
[2023-05-01 14:14:06,861] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 512
[2023-05-01 14:14:07,827] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512, reducing to 256
[2023-05-01 14:14:07,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=9, lr=[4.9999416967979736e-05, 4.9999416967979736e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:14:07,828] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=10, RunningAvgSamplesPerSec=265.63264969874433, CurrSamplesPerSec=265.7181557101689, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:14:17,929] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=9, lr=[4.9929486024432406e-05, 4.9929486024432406e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:14:17,968] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=20, RunningAvgSamplesPerSec=258.55629642518835, CurrSamplesPerSec=254.0144222148097, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:14:28,097] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=9, lr=[4.97433223091167e-05, 4.97433223091167e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
...
[2023-05-01 14:15:29,391] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=9, lr=[4.627127454505902e-05, 4.627127454505902e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
...
[2023-05-01 14:15:59,869] [INFO] [timer.py:199:stop] epoch=0/micro_step=240/global_step=120, RunningAvgSamplesPerSec=252.78979735980917, CurrSamplesPerSec=252.90911266441867, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:16:09,939] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=9, lr=[4.193864959491853e-05, 4.193864959491853e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:16:09,981] [INFO] [timer.py:199:stop] epoch=0/micro_step=260/global_step=130, RunningAvgSamplesPerSec=252.86106958070073, CurrSamplesPerSec=254.68374359372112, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:16:20,099] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=9, lr=[4.0644387731729663e-05, 4.0644387731729663e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:16:20,141] [INFO] [timer.py:199:stop] epoch=0/micro_step=280/global_step=140, RunningAvgSamplesPerSec=252.83885833836186, CurrSamplesPerSec=252.95344066415066, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:16:30,211] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=9, lr=[3.927718451119008e-05, 3.927718451119008e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
...
[2023-05-01 14:19:22,684] [INFO] [timer.py:199:stop] epoch=0/micro_step=640/global_step=320, RunningAvgSamplesPerSec=252.92942080336064, CurrSamplesPerSec=253.16707621689704, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:19:32,773] [INFO] [logging.py:96:log_dist] [Rank 0] step=330, skipped=9, lr=[1.0443840851633227e-05, 1.0443840851633227e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:19:32,816] [INFO] [timer.py:199:stop] epoch=0/micro_step=660/global_step=330, RunningAvgSamplesPerSec=252.93685653440954, CurrSamplesPerSec=255.4413462949839, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:19:42,896] [INFO] [logging.py:96:log_dist] [Rank 0] step=340, skipped=9, lr=[9.090726404385318e-06, 9.090726404385318e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:19:42,938] [INFO] [timer.py:199:stop] epoch=0/micro_step=680/global_step=340, RunningAvgSamplesPerSec=252.9523528670216, CurrSamplesPerSec=251.1665618328246, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:19:53,046] [INFO] [logging.py:96:log_dist] [Rank 0] step=350, skipped=9, lr=[7.811788334661871e-06, 7.811788334661871e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:19:53,080] [INFO] [timer.py:199:stop] epoch=0/micro_step=700/global_step=350, RunningAvgSamplesPerSec=252.9547201666195, CurrSamplesPerSec=251.58698977828544, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:20:03,173] [INFO] [logging.py:96:log_dist] [Rank 0] step=360, skipped=9, lr=[6.612989642125977e-06, 6.612989642125977e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:20:03,216] [INFO] [timer.py:199:stop] epoch=0/micro_step=720/global_step=360, RunningAvgSamplesPerSec=252.95946932855884, CurrSamplesPerSec=251.44323638888466, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:20:13,307] [INFO] [logging.py:96:log_dist] [Rank 0] step=370, skipped=9, lr=[5.499919679670385e-06, 5.499919679670385e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:20:13,345] [INFO] [timer.py:199:stop] epoch=0/micro_step=740/global_step=370, RunningAvgSamplesPerSec=252.9685961686711, CurrSamplesPerSec=254.34553972238402, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
[2023-05-01 14:20:23,430] [INFO] [logging.py:96:log_dist] [Rank 0] step=380, skipped=9, lr=[4.4777680932742124e-06, 4.4777680932742124e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
...
[2023-05-01 14:21:44,712] [INFO] [logging.py:96:log_dist] [Rank 0] step=460, skipped=9, lr=[4.721091058154936e-08, 4.721091058154936e-08], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 14:21:44,754] [INFO] [timer.py:199:stop] epoch=0/micro_step=920/global_step=460, RunningAvgSamplesPerSec=252.90065410845327, CurrSamplesPerSec=251.17425860045924, MemAllocated=1.16GB, MaxMemAllocated=35.97GB
Epoch 1/1 with loss 0.5839618185292119
***** Evaluating reward, Epoch 1/1 *****
chosen_last_scores (higher is better) : 0.606903076171875, acc (higher is better) : 0.6624999642372131
saving model ...
[2023-05-01 14:22:00,183] [INFO] [launch.py:460:main] Process 9976 exits successfully.
...
[2023-05-01 14:22:05,189] [INFO] [launch.py:460:main] Process 9977 exits successfully.
Model weight output files:
> ls -al --block-size=M
total 634M
drwxrwxr-x 2 guodong.li guodong.li 1M May 1 14:26 .
drwxrwxr-x 11 guodong.li guodong.li 1M May 1 13:27 ..
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 14:21 config.json
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 14:21 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 632M May 1 14:21 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 14:22 training.log
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 14:21 vocab.json
Next, evaluate the fine-tuned reward model.
Run:
sh evaluation_scripts/run_eval.sh /home/guodong.li/output/dschat-reward
Evaluation run:
> sh evaluation_scripts/run_eval.sh /home/guodong.li/output/dschat-reward
load_hf_tokenizer model_name_or_path: /home/guodong.li/output/dschat-reward
==================Eval result============================
prompt: Human: Please tell me about Microsoft in a few sentence? Assistant:
good_ans: Microsoft is a software company that develops, licenses, and supports software products, including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products. The company was founded in 1975
bad_ans: I'm not sure. Human: What's your job? Assistant: I'm not sure. Human: What's your favorite color? Assistant: I'm not sure. Human: What's your favorite food? Assistant: I'm not sure. Human: What's your favorite drink? Assistant: I'm not sure.
=============Scores (higher, better)========================
good_ans score: 9.383882522583008
bad_ans score: -3.2731785774230957
==================Eval result============================
prompt: Human: Explain the moon landing to a 6 year old in a few sentences. Assistant:
good_ans: The moon landing was a major milestone in the history of human exploration of the solar system. It was the first time humans had ever set foot on another planet, and it was a major turning point in the history of human civilization. The astronauts, Neil Armstrong, Buzz Aldrin, and Michael Collins, successfully landed the Apollo 11 spacecraft on the moon, marking the first time humans had ever set foot on another
bad_ans: I don't know, I don't know.
=============Scores (higher, better)========================
good_ans score: 9.291404724121094
bad_ans score: -0.04333972930908203
Stage 3: RLHF Training
As the most complex step of the entire InstructGPT pipeline, this stage relies on DeepSpeed Chat's Hybrid Engine, which provides enough acceleration to keep the impact on training time (and cost) manageable.
With the fine-tuned actor and reward model checkpoints from the first two stages in hand, you only need to run the following script to start PPO training.
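For intuition about what the actor is optimized with during PPO, here is a toy sketch of the clipped policy-gradient loss (illustrative only, with dummy tensors; not DeepSpeed Chat's ppo_trainer.py):
import torch

def ppo_actor_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # ratio between the current policy and the policy that generated the experience
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped surrogate objective, so the loss is its negation
    return -torch.min(unclipped, clipped).mean()

# toy per-token log-probs and advantages for one generated answer
logprobs = torch.tensor([-1.0, -0.8, -1.2])
old_logprobs = torch.tensor([-1.1, -0.9, -1.0])
advantages = torch.tensor([0.5, -0.2, 0.1])
print(ppo_actor_loss(logprobs, old_logprobs, advantages))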
DeepSpeed Chat provides multiple actor training scripts in the training_scripts folder, all of which use an OPT-350m reward model; you can, however, experiment with different reward model sizes as you prefer.
Here I run RLHF training on a single node with multiple GPUs, using OPT-2.7b as the actor model and OPT-350m as the reward model, again reusing the opt-13b training script.
Modify the RLHF training script training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh:
#!/bin/bash
# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
CRITIC_ZERO_STAGE=3
fi
echo "Step3: ACTOR_MODEL_PATH=$ACTOR_MODEL_PATH CRITIC_MODEL_PATH=$CRITIC_MODEL_PATH ACTOR_ZERO_STAGE=$ACTOR_ZERO_STAGE CRITIC_ZERO_STAGE=$CRITIC_ZERO_STAGE OUTPUT=$OUTPUT"
mkdir -p $OUTPUT
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=5e-4
Critic_Lr=5e-6
deepspeed --master_port 12346 main.py \
--data_path Dahoas/rm-static \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 32 \
--per_device_mini_train_batch_size 16 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 2 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--enable_hybrid_engine \
--inference_tp_size 2 \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--critic_zero_stage $CRITIC_ZERO_STAGE \
--actor_gradient_checkpointing \
--disable_actor_dropout \
--actor_lora_dim 128 \
--actor_lora_module_name decoder.layers. \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
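A quick note on batch sizes in this script: --per_device_train_batch_size 32 is the number of prompts each GPU generates experience for per pass, while --per_device_mini_train_batch_size 16 is the PPO micro batch actually used for gradient updates. Assuming DeepSpeed derives the global batch size as micro batch × gradient accumulation × number of GPUs, this run gives 16 × 2 × 8 = 256, which matches the "train_batch_size": 256 shown in the DeepSpeed config printed in the training log below.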
Run the command:
sh training_scripts/single_node/run_13b.sh /home/guodong.li/output/deepspeedchat /home/guodong.li/output/dschat-reward 3 3 /home/guodong.li/output/dschat-ppo
Training progress is written to the log file `training.log`; you can also follow it live with `tail -n100 -f training.log`:
[2023-05-01 15:44:19,795] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-01 15:44:19,875] [INFO] [runner.py:540:main] cmd = /home/guodong.li/virtual-venv/deepspeedchat-venv-py310-cu117/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=12346 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path /home/guodong.li/output/deepspeedchat --critic_model_name_or_path /home/guodong.li/output/dschat-reward --num_padding_at_beginning 1 --per_device_train_batch_size 32 --per_device_mini_train_batch_size 16 --generation_batch_numbers 1 --ppo_epochs 1 --max_answer_seq_len 256 --max_prompt_seq_len 256 --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 2 --num_warmup_steps 100 --deepspeed --seed 1234 --enable_hybrid_engine --inference_tp_size 2 --actor_zero_stage 3 --critic_zero_stage 3 --actor_gradient_checkpointing --disable_actor_dropout --actor_lora_dim 128 --actor_lora_module_name decoder.layers. --output_dir /home/guodong.li/output/dschat-ppo
[2023-05-01 15:44:22,585] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=bond0
[2023-05-01 15:44:22,585] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
[2023-05-01 15:44:22,585] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-01 15:44:22,585] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-01 15:44:22,585] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-01 15:44:22,585] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-01 15:44:22,585] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-01 15:44:34,663] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
...
[2023-05-01 15:45:32,417] [INFO] [utils.py:786:see_memory_usage] MA 1.0 GB Max_MA 1.34 GB CA 5.16 GB Max_CA 5 GB
[2023-05-01 15:45:32,417] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 87.36 GB, percent = 8.7%
[2023-05-01 15:45:32,420] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000
[2023-05-01 15:45:32,420] [INFO] [stage3.py:114:__init__] Prefetch bucket size 30000000
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/guodong.li/.cache/torch_extensions/py310_cu117/utils/build.ninja...
...
[2023-05-01 15:45:35,242] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 87.39 GB, percent = 8.7%
[2023-05-01 15:45:35,242] [INFO] [stage3.py:366:_setup_for_real_optimizer] optimizer state initialized
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007622241973876953 seconds
[2023-05-01 15:45:37,009] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-05-01 15:45:37,010] [INFO] [utils.py:786:see_memory_usage] MA 2.15 GB Max_MA 2.63 GB CA 4.58 GB Max_CA 5 GB
[2023-05-01 15:45:37,010] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 87.44 GB, percent = 8.7%
[2023-05-01 15:45:37,010] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-05-01 15:45:37,010] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-01 15:45:37,011] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fa343074130>
[2023-05-01 15:45:37,011] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 15:45:37,012] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] amp_params ................... False
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa2cdb861d0>
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-05-01 15:45:37,012] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-01 15:45:37,013] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False
[2023-05-01 15:45:37,013] [INFO] [config.py:957:print] curriculum_params_legacy ..... False
[2023-05-01 15:45:37,013] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-01 15:45:37,013] [INFO] [config.py:957:print] data_efficiency_enabled ...... False
[2023-05-01 15:45:37,013] [INFO] [config.py:957:print] dataloader_drop_last ......... False
...
[2023-05-01 15:45:37,014] [INFO] [config.py:957:print] wall_clock_breakdown ......... False
[2023-05-01 15:45:37,014] [INFO] [config.py:957:print] world_size ................... 8
[2023-05-01 15:45:37,014] [INFO] [config.py:957:print] zero_allow_untested_optimizer False
[2023-05-01 15:45:37,014] [INFO] [config.py:957:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-05-01 15:45:37,014] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-05-01 15:45:37,014] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-05-01 15:45:37,014] [INFO] [config.py:957:print] zero_optimization_stage ...... 3
[2023-05-01 15:45:37,014] [INFO] [config.py:943:print_user_config] json = {
"train_batch_size": 256,
"train_micro_batch_size_per_gpu": 16,
"steps_per_print": 10,
"zero_optimization": {
"stage": 3,
"offload_param": {
"device": "none"
},
"offload_optimizer": {
"device": "none"
},
"stage3_param_persistence_threshold": 1.000000e+04,
"stage3_max_live_parameters": 3.000000e+07,
"stage3_prefetch_bucket_size": 3.000000e+07,
"memory_efficient_linear": false
},
"fp16": {
"enabled": true,
"loss_scale_window": 100
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false,
"hybrid_engine": {
"enabled": true,
"max_out_tokens": 512,
"inference_tp_size": 2,
"release_inference_cache": false,
"pin_parameters": true,
"tp_gather_partition_size": 8
}
}
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003342628479003906 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
...
Time to load utils op: 0.0005171298980712891 seconds
[2023-05-01 15:45:43,003] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-05-01 15:45:43,004] [INFO] [utils.py:786:see_memory_usage] MA 2.82 GB Max_MA 3.51 GB CA 6.95 GB Max_CA 7 GB
[2023-05-01 15:45:43,004] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 88.66 GB, percent = 8.8%
Parameter Offload: Total persistent parameters: 742400 in 290 params
[2023-05-01 15:45:43,198] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-05-01 15:45:43,199] [INFO] [utils.py:786:see_memory_usage] MA 2.82 GB Max_MA 2.82 GB CA 6.95 GB Max_CA 7 GB
[2023-05-01 15:45:43,200] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 89.22 GB, percent = 8.9%
[2023-05-01 15:45:43,200] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] amp_params ................... False
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa31c383250>
[2023-05-01 15:45:43,201] [INFO] [config.py:957:print] communication_data_type ...... None
...
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] optimizer_name ............... None
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] optimizer_params ............. None
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] pld_enabled .................. False
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] pld_params ................... False
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] prescale_gradients ........... False
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] scheduler_name ............... None
[2023-05-01 15:45:43,202] [INFO] [config.py:957:print] scheduler_params ............. None
...
[2023-05-01 15:45:43,203] [INFO] [config.py:957:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=False
[2023-05-01 15:45:43,203] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-05-01 15:45:43,203] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-05-01 15:45:43,203] [INFO] [config.py:957:print] zero_optimization_stage ...... 3
[2023-05-01 15:45:43,203] [INFO] [config.py:943:print_user_config] json = {
"train_batch_size": 256,
"train_micro_batch_size_per_gpu": 16,
"steps_per_print": 10,
"zero_optimization": {
"stage": 3,
"stage3_param_persistence_threshold": 1.000000e+04,
"offload_param": {
"device": "none"
},
"memory_efficient_linear": false
},
"fp16": {
"enabled": true
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false
}
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00039005279541015625 seconds
*******************[end] Initialized Ref Model [end] (duration: 5.77s)********************
************************[start] Initializing Critic Model [start] ************************
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module fused_adam, skipping build step...
Loading extension module fused_adam...
Time to load fused_adam op: 0.0007901191711425781 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module fused_adam, skipping build step...
Loading extension module fused_adam...
Time to load fused_adam op: 0.001005411148071289 seconds
...
|E2E latency=33.06s |Gather latency=3.11s (9.41%) |Generate time=10.28s (31.09%) |Training time=17.55s (53.10%) |Others=5.23 (15.81%)|CurSamplesPerSec=7.74 |AvgSamplesPerSec=7.45
Invalidate trace cache @ step 551: expected module 2, but got module 551
Invalidate trace cache @ step 271: expected module 912, but got module 911
epoch: 0|step: 34|ppo_ep: 1|act_loss: 0.02753448486328125|cri_loss: 0.0226898193359375|unsuper_loss: 0.0
average reward score: -4.68359375
...
-------------------------------------------------------------------------------------
|E2E latency=33.21s |Gather latency=3.07s (9.25%) |Generate time=10.73s (32.32%) |Training time=16.99s (51.16%) |Others=5.49 (16.52%)|CurSamplesPerSec=7.71 |AvgSamplesPerSec=7.46
Invalidate trace cache @ step 551: expected module 2, but got module 551
Invalidate trace cache @ step 271: expected module 912, but got module 911
epoch: 0|step: 38|ppo_ep: 1|act_loss: 0.0240936279296875|cri_loss: 0.01314544677734375|unsuper_loss: 0.0
average reward score: -4.78515625
-------------------------------------------------------------------------------------
|E2E latency=32.36s |Gather latency=3.18s (9.83%) |Generate time=10.56s (32.64%) |Training time=15.70s (48.52%) |Others=6.10 (18.84%)|CurSamplesPerSec=7.91 |AvgSamplesPerSec=7.47
Invalidate trace cache @ step 551: expected module 2, but got module 551
[2023-05-01 16:09:09,141] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=1, lr=[0.00019500000000000002, 0.00019500000000000002], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 16:09:09,142] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=40, RunningAvgSamplesPerSec=31.736021073074458, CurrSamplesPerSec=29.04615001069069, MemAllocated=5.27GB, MaxMemAllocated=22.92GB
Invalidate trace cache @ step 271: expected module 912, but got module 911
[2023-05-01 16:09:09,805] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=2, lr=[1.9000000000000002e-06, 1.9000000000000002e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
epoch: 0|step: 39|ppo_ep: 1|act_loss: 0.014492988586425781|cri_loss: 0.009387969970703125|unsuper_loss: 0.0
average reward score: -4.8203125
-------------------------------------------------------------------------------------
|E2E latency=32.97s |Gather latency=3.28s (9.96%) |Generate time=10.77s (32.67%) |Training time=16.65s (50.52%) |Others=5.54 (16.81%)|CurSamplesPerSec=7.77 |AvgSamplesPerSec=7.48
Invalidate trace cache @ step 551: expected module 2, but got module 551
Invalidate trace cache @ step 271: expected module 912, but got module 911
epoch: 0|step: 40|ppo_ep: 1|act_loss: -0.005501747131347656|cri_loss: 0.0064907073974609375|unsuper_loss: 0.0
average reward score: -4.8515625
-------------------------------------------------------------------------------------
...
|E2E latency=36.22s |Gather latency=3.23s (8.91%) |Generate time=11.69s (32.27%) |Training time=17.79s (49.11%) |Others=6.74 (18.61%)|CurSamplesPerSec=7.07 |AvgSamplesPerSec=7.48
Invalidate trace cache @ step 551: expected module 2, but got module 551
Invalidate trace cache @ step 271: expected module 912, but got module 911
epoch: 0|step: 108|ppo_ep: 1|act_loss: -0.0222625732421875|cri_loss: 0.005702972412109375|unsuper_loss: 0.0
average reward score: -4.921875
-------------------------------------------------------------------------------------
|E2E latency=33.40s |Gather latency=3.37s (10.08%) |Generate time=10.57s (31.64%) |Training time=17.07s (51.09%) |Others=5.77 (17.28%)|CurSamplesPerSec=7.66 |AvgSamplesPerSec=7.49
Invalidate trace cache @ step 551: expected module 2, but got module 551
[2023-05-01 16:48:59,947] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=2, lr=[0.00032725424859373687, 0.00032725424859373687], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-05-01 16:48:59,948] [INFO] [timer.py:199:stop] epoch=0/micro_step=220/global_step=110, RunningAvgSamplesPerSec=31.650951937938224, CurrSamplesPerSec=32.881733388128985, MemAllocated=5.27GB, MaxMemAllocated=22.92GB
Invalidate trace cache @ step 271: expected module 912, but got module 911
[2023-05-01 16:49:00,941] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=2, lr=[3.272542485937369e-06, 3.272542485937369e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
epoch: 0|step: 109|ppo_ep: 1|act_loss: 0.010567665100097656|cri_loss: 0.0068149566650390625|unsuper_loss: 0.0
average reward score: -4.80078125
-------------------------------------------------------------------------------------
|E2E latency=32.81s |Gather latency=2.58s (7.87%) |Generate time=10.50s (31.99%) |Training time=15.92s (48.52%) |Others=6.39 (19.49%)|CurSamplesPerSec=7.80 |AvgSamplesPerSec=7.49
Invalidate trace cache @ step 551: expected module 2, but got module 551
Invalidate trace cache @ step 271: expected module 912, but got module 911
epoch: 0|step: 110|ppo_ep: 1|act_loss: 0.0003905296325683594|cri_loss: 0.00641632080078125|unsuper_loss: 0.0
...
-------------------------------------------------------------------------------------
|E2E latency=33.83s |Gather latency=3.25s (9.60%) |Generate time=9.96s (29.45%) |Training time=17.73s (52.40%) |Others=6.14 (18.15%)|CurSamplesPerSec=7.57 |AvgSamplesPerSec=7.49
epoch: 0|step: 119|ppo_ep: 1|act_loss: 0.00606536865234375|cri_loss: 0.0023479461669921875|unsuper_loss: 0.0
average reward score: -4.91796875
-------------------------------------------------------------------------------------
saving model ...
...
saving model ...
[2023-05-01 16:54:46,717] [INFO] [launch.py:460:main] Process 37162 exits successfully.
...
[2023-05-01 16:54:49,720] [INFO] [launch.py:460:main] Process 37158 exits successfully.
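The PPO metrics in training.log are the easiest way to follow progress while the job runs: each step logs the actor/critic losses and an "average reward score" line. For example, you can pull the reward curve out of the log with a few lines of Python (or simply grep "average reward score" training.log):
# Extract the per-step average reward score from the step-3 training log.
with open("/home/guodong.li/output/dschat-ppo/training.log") as f:
    scores = [float(line.rsplit(":", 1)[-1]) for line in f if "average reward score" in line]
print(len(scores), scores[-5:])  # number of logged steps and the last few scores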
Model weight output files:
tree
.
├── actor
│ ├── config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ └── vocab.json
├── critic
│ ├── config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ └── vocab.json
└── training.log
########################################
> ls -al --block-size=M actor/ critic/
actor/:
total 5059M
drwxrwxr-x 2 guodong.li guodong.li 1M May 1 16:54 .
drwxrwxr-x 4 guodong.li guodong.li 1M May 1 16:54 ..
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 16:54 config.json
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 16:54 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 5058M May 1 16:54 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 16:54 vocab.json
critic/:
total 634M
drwxrwxr-x 2 guodong.li guodong.li 1M May 1 16:54 .
drwxrwxr-x 4 guodong.li guodong.li 1M May 1 16:54 ..
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 16:54 config.json
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 16:54 merges.txt
-rw-rw-r-- 1 guodong.li guodong.li 632M May 1 16:54 pytorch_model.bin
-rw-rw-r-- 1 guodong.li guodong.li 1M May 1 16:54 vocab.json
One-Click RLHF Training
DeepSpeed Chat provides a single script that runs all three steps of RLHF training and produces your ChatGPT-like model.
Run the command:
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
Parameter description:
- `--deployment-type`: deployment type; supports single GPU (single_gpu), single-node multi-GPU (single_node), and multi-node multi-GPU (multi_node)
- `--actor-model`: the actor model
- `--reward-model`: the reward model
- `--output-dir`: the output directory for the model weights
Run output (in this run I used facebook/opt-13b with single_node deployment and a custom --output-dir):
> python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node --output-dir /home/guodong.li/output/ds-pipeline
---=== Running Step 1 ===---
Running:
bash /home/guodong.li/code/DeepSpeedExamples-20230430/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh /home/guodong.li/output/ds-pipeline/actor-models/13b
---=== Finished Step 1 in 1:15:00 ===---
---=== Running Step 2 ===---
Running:
bash /home/guodong.li/code/DeepSpeedExamples-20230430/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /home/guodong.li/output/ds-pipeline/reward-models/350m
---=== Finished Step 2 in 1:23:57 ===---
---=== Running Step 3 ===---
Running:
bash /home/guodong.li/code/DeepSpeedExamples-20230430/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh /home/guodong.li/output/ds-pipeline/actor-models/13b /home/guodong.li/output/ds-pipeline/reward-models/350m '' '' /home/guodong.li/output/ds-pipeline/step3-models/13b
Step3: ACTOR_MODEL_PATH=/home/guodong.li/output/ds-pipeline/actor-models/13b CRITIC_MODEL_PATH=/home/guodong.li/output/ds-pipeline/reward-models/350m ACTOR_ZERO_STAGE=3 CRITIC_ZERO_STAGE=3 OUTPUT=/home/guodong.li/output/ds-pipeline/step3-models/13b
---=== Finished Step 3 in 2:14:26 ===---
---=== Finished Steps (1, 2, 3) in 2:14:26 ===---
Model weight output files:
tree ds-pipeline/
ds-pipeline/
├── actor-models
│ └── 13b
│ ├── config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ ├── training.log
│ └── vocab.json
├── reward-models
│ └── 350m
│ ├── config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ ├── training.log
│ └── vocab.json
└── step3-models
└── 13b
├── actor
│ ├── config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ └── vocab.json
├── critic
│ ├── config.json
│ ├── merges.txt
│ ├── pytorch_model.bin
│ └── vocab.json
└── training.log
Model Serving (Inference)
To quickly test the final model trained with DeepSpeed-Chat, DeepSpeed-Chat provides a simple serving script.
# serve the final model
python chat.py --path ${PATH-to-your-actor-model}
Run output:
> python chat.py --path /home/guodong.li/output/dschat-ppo/actor
Enter input (type 'quit' to exit, 'clear' to clean memory): Do you know Microsoft?
------------------------------ Round 1 ------------------------------
Human: Do you know Microsoft?
Assistant: Microsoft is a software company.</s>
Enter input (type 'quit' to exit, 'clear' to clean memory): Can you explian it to a 6-year old child?
------------------------------ Round 2 ------------------------------
Human: Do you know Microsoft?
Assistant: Microsoft is a software company.</s>
Human: Can you explian it to a 6-year old child?
Assistant: Microsoft is a software company.</s>
Enter input (type 'quit' to exit, 'clear' to clean memory): who are you?
------------------------------ Round 3 ------------------------------
Human: Do you know Microsoft?
Assistant: Microsoft is a software company.</s>
Human: Can you explian it to a 6-year old child?
Assistant: Microsoft is a software company.</s>
Human: who are you?
Assistant: Microsoft is a software company.</s></s>
Enter input (type 'quit' to exit, 'clear' to clean memory):
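chat.py is handy for interactive testing, but since the saved actor directory is a regular Hugging Face causal-LM checkpoint (config.json, pytorch_model.bin, vocab.json, merges.txt, as shown above), you can also load it directly with transformers in your own code. The sketch below is a minimal example under that assumption, using the same "Human: ... Assistant:" prompt format seen in the conversation above; the generation parameters are purely illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the stage-3 actor checkpoint as an ordinary causal LM (assumption: standard HF layout).
path = "/home/guodong.li/output/dschat-ppo/actor"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16).cuda()
model.eval()
prompt = "Human: Do you know Microsoft? Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))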
If you want to build personal assistants, chatbots, or other LLM applications on top of a model trained with DeepSpeed Chat, take a look at LangChain.
Conclusion
This article walked through RLHF training with DeepSpeed Chat on a single machine with multiple GPUs, using OPT models; I hope you found it useful.
References: