Fine-tuning ChatGLM-6B with DeepSpeed/P-Tuning v2

I previously tried parameter-efficient fine-tuning of ChatGLM-6B with LoRA. This article covers fine-tuning ChatGLM-6B with DeepSpeed and with P-Tuning v2.

Introduction to ChatGLM-6B

For an introduction to ChatGLM-6B, see the earlier article; it is not repeated here.

Introduction to P-Tuning v2

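P-Tuning is a parameter-efficient fine-tuning method built on continuous prompts, not on pruning: the pretrained weights stay frozen and only a small set of added prompt parameters is trained, which can cut the number of parameters to fine-tune to roughly 0.1% of the original. P-Tuning v2 is the successor to P-Tuning v1; the main change is that trainable prompts are added at every Transformer layer instead of only at the input embedding layer.

Concretely, P-Tuning v2 prepends a sequence of trainable prefix vectors to the attention keys and values of each layer (deep prompt tuning). During fine-tuning, gradients update only these prefix parameters, so the original model is neither pruned nor compressed, and the full frozen model is still loaded at inference time.

Overall, the core idea of P-Tuning v2 is to make fine-tuning cheap while keeping task performance close to that of full fine-tuning. Because only the prefix is updated, gradient and optimizer-state memory shrink substantially, and each task needs only a small prefix checkpoint. Inference does not get faster, since the base model is unchanged and the prefix slightly lengthens the attention context.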
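To make the parameter saving concrete, the sketch below counts the trainable prefix parameters for a model with the shape reported in the ChatGLM-6B config printed later in this article (28 layers, hidden size 4096). It assumes one key vector and one value vector per layer for each prefix position and no prefix projection MLP. The prefix length of 128 is an assumed example value, and the total parameter count is inferred from the partition sizes in the run log, so it may include a little padding.

```python
def prefix_param_count(pre_seq_len: int, num_layers: int, hidden_size: int) -> int:
    """Trainable parameters in a deep prefix without a projection MLP.

    Each prefix position carries one key vector and one value vector
    (hidden_size elements each) for every Transformer layer.
    """
    return pre_seq_len * num_layers * 2 * hidden_size


# Model shape taken from the ChatGLM-6B config shown in the run log.
NUM_LAYERS = 28
HIDDEN_SIZE = 4096

# Assumption: an example prefix length; the article does not fix one here.
PRE_SEQ_LEN = 128

# Inferred from the logged per-rank partition sizes times 8 ranks.
TOTAL_PARAMS = 8 * (771_473_408 + 187_392)

prefix_params = prefix_param_count(PRE_SEQ_LEN, NUM_LAYERS, HIDDEN_SIZE)
fraction = prefix_params / TOTAL_PARAMS

print(f"prefix parameters: {prefix_params:,}")
print(f"share of the full model: {fraction:.3%}")
```

Under this formula the share scales linearly with the prefix length, so a shorter prefix moves it toward the 0.1% figure and a longer one away from it.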

Environment setup

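The base environment is as follows:

Operating system: Ubuntu 18.04
CPUs: a single node with Intel CPUs and 1 TB of memory; 64 physical CPUs, 16 cores per CPU
GPUs: 8 x A800 80GB
Python: 3.10 (OpenSSL must first be upgraded to 1.1.1t, after which Python is compiled and installed from source)
NVIDIA driver version: 515.65.01 (choose the driver that matches your GPU model)
CUDA toolkit: 11.7
NCCL: nccl_2.14.3-1+cuda11.7
cuDNN: 8.8.1.3_cuda11

Installing the NVIDIA driver, CUDA, Python, and the other tools listed above is not covered step by step here.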

Create and activate the virtual environment chatglm-ptuningv2-venv-py310-cu117:

cd /home/guodong.li/virtual-venv
virtualenv -p /usr/bin/python3.10 chatglm-ptuningv2-venv-py310-cu117
source /home/guodong.li/virtual-venv/chatglm-ptuningv2-venv-py310-cu117/bin/activate

Install PyTorch offline: download the torch and torchvision wheels that match your CUDA version, then install them.

pip install torch-1.13.1+cu117-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.14.1+cu117-cp310-cp310-linux_x86_64.whl

Install the remaining dependencies.

pip install -r requirements.txt

The contents of requirements.txt are as follows:

protobuf
transformers==4.28.0
cpm_kernels
gradio
mdtex2html
sentencepiece
rouge_chinese
nltk
jieba
datasets
deepspeed
accelerate

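Note:

The official documentation specifies transformers 4.27.1. When ChatGLM loads the model, it calls the get_class_in_module function in transformers/dynamic_module_utils.py, which can fail with a file-not-found error when several processes load the model concurrently. Upgrading transformers to 4.28.0 avoids this problem.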
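A quick guard at the top of a launch script can catch an environment that still has the older release. The sketch below is only an illustration: the helper names are hypothetical, the 4.28.0 threshold comes from the note above, and whether every later release is free of the issue is not verified here.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.28.0' (or '4.28.0.dev0', '1.13.1+cu117') into (major, minor, patch)."""
    public = version.split("+")[0]  # drop a local suffix such as '+cu117'
    return tuple(int(n) for n in re.findall(r"\d+", public)[:3])


def is_older_than(version: str, minimum: str = "4.28.0") -> bool:
    """True when `version` sorts before `minimum` on its numeric components."""
    return version_tuple(version) < version_tuple(minimum)


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
    else:
        if is_older_than(installed):
            print(f"transformers {installed} is older than 4.28.0; consider upgrading")
        else:
            print(f"transformers {installed} is at or above 4.28.0")
```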

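Data preparation

The ADGEN (advertisement generation) dataset is used below as an example to walk through fine-tuning.

The ADGEN task is to generate a piece of ad copy (summary) from the input (content). The format is as follows: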

{
    "content": "类型#上衣*版型#宽松*版型#显瘦*图案#线条*衣样式#衬衫*衣袖型#泡泡袖*衣款式#抽绳",
    "summary": "这件衬衫的款式非常的宽松，利落的线条可以很好的隐藏身材上的小缺点，穿在身上有着很好的显瘦效果。领口装饰了一个可爱的抽绳，漂亮的绳结展现出了十足的个性，配合时尚的泡泡袖型，尽显女性甜美可爱的气息。"
}
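The content field packs attribute/value pairs into a single string, with `*` between pairs and `#` between an attribute and its value. The sketch below shows one way to read the records and split that field. It assumes the files store one JSON object per line, which is consistent with the line counts reported by wc for the dataset but is not stated explicitly; the helper names are hypothetical.

```python
import json


def parse_content(content: str) -> list[tuple[str, str]]:
    """Split 'key#value*key#value' into a list of (key, value) pairs."""
    pairs = []
    for item in content.split("*"):
        key, _, value = item.partition("#")
        pairs.append((key, value))
    return pairs


def load_records(path: str) -> list[dict]:
    """Read a file with one JSON object per line (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# A shortened version of the sample record shown above.
sample_line = '{"content": "类型#上衣*版型#宽松*衣样式#衬衫", "summary": "这件衬衫的款式非常的宽松"}'
record = json.loads(sample_line)

for key, value in parse_content(record["content"]):
    print(f"{key}: {value}")
```

The training script itself consumes the raw content string through --prompt_column, so this parsing is only useful for inspecting or filtering the data.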

Download the ADGEN dataset from the official site (or use the alternative download link) and extract it into the AdvertiseGen directory.

tar -zxvf AdvertiseGen.tar.gz

Check the dataset size:

> wc -l AdvertiseGen/*
    1070 AdvertiseGen/dev.json
  114599 AdvertiseGen/train.json
  115669 total

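Full-parameter fine-tuning of ChatGLM-6B with DeepSpeed DP + ZeRO

We start by using DeepSpeed to run full-parameter fine-tuning of ChatGLM-6B.

Download the source code and, to keep the code consistent with this article, check out the corresponding commit: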
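As a rough guide to why ZeRO matters for a 6B-parameter model, the sketch below estimates the static per-GPU memory for fp16 training with Adam under ZeRO stage 2, the stage reported in the run log further down. It uses the per-rank partition sizes printed in that log and assumes each GPU holds a full fp16 copy of the weights, an fp32 master copy of its own shard, and fp32 momentum and variance for that shard. Gradients, activations, and allocator overhead are left out, and the logged "GB" values are assumed to be GiB.

```python
GIB = 1024 ** 3  # assumption: the logged "GB" figures are bytes / 1024**3


def zero2_static_memory_gib(partition_sizes, world_size):
    """Estimate per-GPU static memory before and after Adam state allocation."""
    params_per_rank = sum(partition_sizes)        # elements owned by this rank
    total_params = params_per_rank * world_size   # may include a little padding

    fp16_weights = 2 * total_params               # full fp16 copy on every GPU
    fp32_master = 4 * params_per_rank             # fp32 copy of this rank's shard
    adam_states = 2 * 4 * params_per_rank         # fp32 momentum and variance

    before = (fp16_weights + fp32_master) / GIB
    after = (fp16_weights + fp32_master + adam_states) / GIB
    return before, after


# Partition sizes as printed per rank in the run log; 8 GPUs.
before, after = zero2_static_memory_gib([771_473_408, 187_392], world_size=8)
print(f"before optimizer states: {before:.2f} GiB")
print(f"after optimizer states:  {after:.2f} GiB")
```

Under these assumptions the estimate comes out at about 14.37 and 20.12, which lines up with the MA values the log reports before and after the optimizer states are initialized. This is one reading that reproduces the logged numbers, not a statement of how DeepSpeed accounts for memory internally.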

git clone https://github.com/THUDM/ChatGLM-6B.git
cd ChatGLM-6B
git checkout 8633db1
cd ptuning

Modify the ds_train_finetune.sh script so that it runs full-parameter fine-tuning with DeepSpeed.

LR=1e-4

MASTER_PORT=$(shuf -n 1 -i 10000-65535)

deepspeed --num_gpus=8 --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file /data/nfs/llm/data/AdvertiseGen/train.json \
    --test_file /data/nfs/llm/data/AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /data/nfs/llm/model/chatglm-6b \
    --output_dir /home/guodong.li/output/adgen-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --predict_with_generate \
    --num_train_epochs 2 \
    --logging_steps 10 \
    --save_steps 300 \
    --learning_rate $LR \
    --fp16
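The batch-related flags combine through the relation DeepSpeed uses for its own configuration: global batch size = micro batch per GPU x gradient accumulation steps x number of GPUs. The sketch below applies it to the script flags and to the values the DeepSpeed engine prints in the run log, then derives an approximate number of optimizer steps per epoch from the 114599 training samples. The function names are hypothetical, and the step counts are simple ceilings; the trainer may round slightly differently.

```python
import math


def global_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


def approx_steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Approximate optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


TRAIN_SAMPLES = 114_599  # line count of train.json

# Values from the script flags: batch 24 per GPU, accumulation 2, 8 GPUs.
from_script = global_batch_size(24, 2, 8)

# Values printed by the DeepSpeed engine in the log: accumulation 1.
from_log = global_batch_size(24, 1, 8)

print("script flags:", from_script, approx_steps_per_epoch(TRAIN_SAMPLES, from_script))
print("engine log:  ", from_log, approx_steps_per_epoch(TRAIN_SAMPLES, from_log))
```

The engine log reports train_batch_size 192 with gradient_accumulation_steps 1, while the script passes --gradient_accumulation_steps 2. Which of the two governs the actual optimizer step depends on how the trainer and DeepSpeed share accumulation in the installed versions, and that is not settled here; adding an explicit gradient_accumulation_steps entry to deepspeed.json would be one way to remove the ambiguity.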

Run log:

> sh ds_train_finetune.sh
[2023-04-14 18:01:33,206] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-14 18:01:33,417] [INFO] [runner.py:540:main] cmd = /home/guodong.li/virtual-venv/chatglm-ptuningv2-venv-py310-cu117/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=44148 --enable_each_rank_log=None main.py --deepspeed deepspeed.json --do_train --train_file /data/nfs/llm/data/AdvertiseGen/train.json --test_file /data/nfs/llm/data/AdvertiseGen/dev.json --prompt_column content --response_column summary --overwrite_cache --model_name_or_path /data/nfs/llm/model/chatglm-6b --output_dir /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4 --overwrite_output_dir --max_source_length 64 --max_target_length 64 --per_device_train_batch_size 24 --per_device_eval_batch_size 1 --gradient_accumulation_steps 2 --predict_with_generate --num_train_epochs 2 --logging_steps 10 --save_steps 300 --learning_rate 1e-4 --fp16
[2023-04-14 18:01:35,945] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=bond0
[2023-04-14 18:01:35,945] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
[2023-04-14 18:01:35,945] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-04-14 18:01:35,945] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-14 18:01:35,945] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-14 18:01:35,945] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-14 18:01:35,945] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-14 18:01:40,133] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/14/2023 18:01:41 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
...
04/14/2023 18:01:41 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
04/14/2023 18:01:41 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=deepspeed.json,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=2,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/runs/Apr14_18-01-40_ai-app-2-46,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
optim=adamw_hf,
optim_args=None,
output_dir=/home/guodong.li/output/adgen-chatglm-6b-ft-1e-4,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=24,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/home/guodong.li/output/adgen-chatglm-6b-ft-1e-4,
save_on_each_node=False,
save_safetensors=False,
save_steps=300,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
04/14/2023 18:03:01 - WARNING - datasets.builder - Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-386448e4f2983a9a/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 184.03it/s]
04/14/2023 18:03:01 - WARNING - datasets.builder - Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-386448e4f2983a9a/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
[WARNING|configuration_auto.py:925] 2023-04-14 18:03:01,664 >> Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
04/14/2023 18:03:01 - WARNING - datasets.builder - Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-386448e4f2983a9a/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
  0%|                                                                                                                                                                                   | 0/2 [00:00<?, ?it/s][WARNING|tokenization_auto.py:675] 2023-04-14 18:03:01,675 >> Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 240.57it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 197.48it/s]
[INFO|configuration_utils.py:666] 2023-04-14 18:03:01,678 >> loading configuration file /data/nfs/llm/model/chatglm-6b/config.json
[WARNING|configuration_auto.py:925] 2023-04-14 18:03:01,678 >> Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|configuration_auto.py:925] 2023-04-14 18:03:01,679 >> Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|configuration_utils.py:666] 2023-04-14 18:03:01,685 >> loading configuration file /data/nfs/llm/model/chatglm-6b/config.json
04/14/2023 18:03:01 - WARNING - datasets.builder - Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-386448e4f2983a9a/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
[INFO|configuration_utils.py:720] 2023-04-14 18:03:01,687 >> Model config ChatGLMConfig {
  "_name_or_path": "/data/nfs/llm/model/chatglm-6b",
  "architectures": [
    "ChatGLMModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
  },
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "gmask_token_id": 130001,
  "hidden_size": 4096,
  "inner_hidden_size": 16384,
  "layernorm_epsilon": 1e-05,
  "mask_token_id": 130000,
  "max_sequence_length": 2048,
  "model_type": "chatglm",
  "num_attention_heads": 32,
  "num_layers": 28,
  "pad_token_id": 3,
  "position_encoding_2d": true,
  "pre_seq_len": null,
  "prefix_projection": false,
  "quantization_bit": 0,
  "torch_dtype": "float16",
  "transformers_version": "4.28.0",
  "use_cache": true,
  "vocab_size": 130528
}

  0%|                                                                                                                                                                                   | 0/2 [00:00<?, ?it/s][WARNING|tokenization_auto.py:675] 2023-04-14 18:03:01,688 >> Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|tokenization_auto.py:675] 2023-04-14 18:03:01,689 >> Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|tokenization_utils_base.py:1807] 2023-04-14 18:03:01,694 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1807] 2023-04-14 18:03:01,694 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1807] 2023-04-14 18:03:01,694 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1807] 2023-04-14 18:03:01,694 >> loading file tokenizer_config.json
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 285.37it/s]
[INFO|modeling_utils.py:2531] 2023-04-14 18:03:01,992 >> loading weights file /data/nfs/llm/model/chatglm-6b/pytorch_model.bin.index.json
[INFO|configuration_utils.py:575] 2023-04-14 18:03:01,993 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

Loading checkpoint shards:   0%|                                                                                                                                                        | 0/8 [00:00<?, ?it/s][WARNING|auto_factory.py:456] 2023-04-14 18:03:02,077 >> Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[WARNING|auto_factory.py:456] 2023-04-14 18:03:02,109 >> Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00,  1.70s/it]
[INFO|modeling_utils.py:3190] 2023-04-14 18:03:15,622 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[INFO|modeling_utils.py:3198] 2023-04-14 18:03:15,622 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /data/nfs/llm/model/chatglm-6b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
Loading checkpoint shards:  25%|████████████████████████████████████                                                                                                            | 2/8 [00:13<00:40,  6.73s/it][INFO|modeling_utils.py:2839] 2023-04-14 18:03:15,703 >> Generation config file not found, using a generation config created from the model config.
...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:34<00:00,  4.32s/it]
input_ids [5, 65421, 61, 67329, 32, 98339, 61, 72043, 32, 65347, 61, 70872, 32, 69768, 61, 68944, 32, 67329, 64103, 61, 96914, 130001, 130004, 5, 87052, 96914, 81471, 64562, 65759, 64493, 64988, 6, 65840, 65388, 74531, 63825, 75786, 64009, 63823, 65626, 63882, 64619, 65388, 6, 64480, 65604, 85646, 110945, 10, 64089, 65966, 87052, 67329, 65544, 6, 71964, 70533, 64417, 63862, 89978, 63991, 63823, 77284, 88473, 64219, 63848, 112012, 6, 71231, 65099, 71252, 66800, 85768, 64566, 64338, 100323, 75469, 63823, 117317, 64218, 64257, 64051, 74197, 6, 63893, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
inputs 类型#裤*版型#宽松*风格#性感*图案#线条*裤型#阔腿裤 宽松的阔腿裤这两年真的吸粉不少,明星时尚达人的心头爱。毕竟好穿时尚,谁都能穿出腿长2米的效果宽松的裤腿,当然是遮肉小能手啊。上身随性自然不拘束,面料亲肤舒适贴身体验感棒棒哒。系带部分增加设计看点,还
...
label_ids [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 87052, 96914, 81471, 64562, 65759, 64493, 64988, 6, 65840, 65388, 74531, 63825, 75786, 64009, 63823, 65626, 63882, 64619, 65388, 6, 64480, 65604, 85646, 110945, 10, 64089, 65966, 87052, 67329, 65544, 6, 71964, 70533, 64417, 63862, 89978, 63991, 63823, 77284, 88473, 64219, 63848, 112012, 6, 71231, 65099, 71252, 66800, 85768, 64566, 64338, 100323, 75469, 63823, 117317, 64218, 64257, 64051, 74197, 6, 63893, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
labels 宽松的阔腿裤这两年真的吸粉不少,明星时尚达人的心头爱。毕竟好穿时尚,谁都能穿出腿长2米的效果宽松的裤腿,当然是遮肉小能手啊。上身随性自然不拘束,面料亲肤舒适贴身体验感棒棒哒。系带部分增加设计看点,还
[2023-04-14 18:06:30,469] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-04-14 18:06:30,470] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-04-14 18:06:30,470] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-04-14 18:06:30,483] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-04-14 18:06:30,484] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'transformers.optimization.AdamW'>
[2023-04-14 18:06:30,484] [WARNING] [engine.py:1118:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2023-04-14 18:06:30,484] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-04-14 18:06:30,484] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 500000000
[2023-04-14 18:06:30,484] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 500000000
[2023-04-14 18:06:30,484] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: False
[2023-04-14 18:06:30,484] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/guodong.li/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Time to load utils op: 0.10171675682067871 seconds
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/guodong.li/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.18768668174743652 seconds
...
Loading extension module utils...
Time to load utils op: 0.3021426200866699 seconds
Rank: 2 partition count [8, 8] and sizes[(771473408, False), (187392, False)]
...
Rank: 4 partition count [8, 8] and sizes[(771473408, False), (187392, False)]
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Time to load utils op: 0.0005774497985839844 seconds
...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0011382102966308594 seconds
[2023-04-14 18:06:48,321] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-04-14 18:06:48,321] [INFO] [utils.py:786:see_memory_usage] MA 14.37 GB         Max_MA 14.37 GB         CA 14.39 GB         Max_CA 14 GB
[2023-04-14 18:06:48,322] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 50.56 GB, percent = 5.0%
04/14/2023 18:06:48 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
...
04/14/2023 18:06:48 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[2023-04-14 18:06:48,431] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-04-14 18:06:48,434] [INFO] [utils.py:786:see_memory_usage] MA 20.12 GB         Max_MA 25.87 GB         CA 25.9 GB         Max_CA 26 GB
[2023-04-14 18:06:48,435] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 50.84 GB, percent = 5.0%
[2023-04-14 18:06:48,435] [INFO] [stage_1_and_2.py:489:__init__] optimizer state initialized
[2023-04-14 18:06:48,512] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-04-14 18:06:48,513] [INFO] [utils.py:786:see_memory_usage] MA 20.12 GB         Max_MA 20.12 GB         CA 25.9 GB         Max_CA 26 GB
[2023-04-14 18:06:48,513] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 51.29 GB, percent = 5.1%
[2023-04-14 18:06:48,515] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-04-14 18:06:48,515] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-14 18:06:48,515] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f172c367a30>
[2023-04-14 18:06:48,515] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001, 0.0001], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 18:06:48,515] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   amp_enabled .................. False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   amp_params ................... False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   bfloat16_enabled ............. False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   checkpoint_parallel_write_pipeline  False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   checkpoint_tag_validation_enabled  True
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   checkpoint_tag_validation_fail  False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f172843d6f0>
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   communication_data_type ...... None
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   curriculum_enabled_legacy .... False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   curriculum_params_legacy ..... False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   data_efficiency_enabled ...... False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   dataloader_drop_last ......... False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   disable_allgather ............ False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   dump_state ................... False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   eigenvalue_enabled ........... False
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   eigenvalue_gas_boundary_resolution  1
[2023-04-14 18:06:48,516] [INFO] [config.py:957:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   eigenvalue_layer_num ......... 0
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   eigenvalue_max_iter .......... 100
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   eigenvalue_stability ......... 1e-06
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   eigenvalue_tol ............... 0.01
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   eigenvalue_verbose ........... False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   elasticity_enabled ........... False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   fp16_auto_cast ............... False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   fp16_enabled ................. True
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   fp16_master_weights_and_gradients  False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   global_rank .................. 0
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   grad_accum_dtype ............. None
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   gradient_accumulation_steps .. 1
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   gradient_clipping ............ 0.0
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   gradient_predivide_factor .... 1.0
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   initial_dynamic_scale ........ 65536
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   load_universal_checkpoint .... False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   loss_scale ................... 0
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   memory_breakdown ............. False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   optimizer_legacy_fusion ...... False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   optimizer_name ............... None
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   optimizer_params ............. None
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   pld_enabled .................. False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   pld_params ................... False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   prescale_gradients ........... False
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   scheduler_name ............... None
[2023-04-14 18:06:48,517] [INFO] [config.py:957:print]   scheduler_params ............. None
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   sparse_attention ............. None
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   sparse_gradients_enabled ..... False
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   steps_per_print .............. 10
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   train_batch_size ............. 192
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   train_micro_batch_size_per_gpu  24
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   use_node_local_storage ....... False
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   wall_clock_breakdown ......... False
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   world_size ................... 8
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   zero_allow_untested_optimizer  True
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=True
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   zero_enabled ................. True
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   zero_force_ds_cpu_optimizer .. True
[2023-04-14 18:06:48,518] [INFO] [config.py:957:print]   zero_optimization_stage ...... 2
[2023-04-14 18:06:48,518] [INFO] [config.py:943:print_user_config]   json = {
    "train_micro_batch_size_per_gpu": 24,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5.000000e+08,
        "overlap_comm": false,
        "reduce_scatter": true,
        "reduce_bucket_size": 5.000000e+08,
        "contiguous_gradients": true
    }
}
Using /home/guodong.li/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00031948089599609375 seconds
  0%|                                                                                                                                                                                 | 0/596 [00:00<?, ?it/s]04/14/2023 18:06:48 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[2023-04-14 18:06:53,718] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2023-04-14 18:06:55,883] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
  0%|▎                                                                                                                                                                      | 1/596 [00:07<1:13:02,  7.37s/it][2023-04-14 18:06:57,948] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2023-04-14 18:07:00,007] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
  0%|▌                                                                                                                                                                        | 2/596 [00:11<54:01,  5.46s/it][2023-04-14 18:07:06,332] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
  1%|▊                                                                                                                                                                        | 3/596 [00:17<57:51,  5.85s/it][2023-04-14 18:07:08,383] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
  1%|█▏                                                                                                                                                                       | 4/596 [00:24<59:20,  6.01s/it][2023-04-14 18:07:18,876] [INFO] [loss_scaler.py:181:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
[2023-04-14 18:07:18,876] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=7, lr=[9.949664429530202e-05, 9.949664429530202e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 18:07:18,877] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=66.98818896434254, CurrSamplesPerSec=93.79590019766518, MemAllocated=21.59GB, MaxMemAllocated=28.8GB
  1%|█▍                                                                                                                                                                     | 5/596 [00:30<1:00:11,  6.11s/it]
...
[2023-04-14 18:47:55,207] [INFO] [logging.py:96:log_dist] [Rank 0] step=590, skipped=12, lr=[3.02013422818792e-06, 3.02013422818792e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 18:47:57,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=590/global_step=590, RunningAvgSamplesPerSec=45.931193758598916, CurrSamplesPerSec=45.63412532914195, MemAllocated=21.59GB, MaxMemAllocated=28.8GB
 50%|███████████████████████████████████████████████████████████████████████████████████▊                                                                                   | 299/596 [41:42<41:37,  8.41s/it][2023-04-14 18:48:37,273] [INFO] [logging.py:96:log_dist] [Rank 0] step=600, skipped=12, lr=[1.3422818791946309e-06, 1.3422818791946309e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 18:48:39,453] [INFO] [timer.py:199:stop] epoch=0/micro_step=600/global_step=600, RunningAvgSamplesPerSec=45.92850276413307, CurrSamplesPerSec=45.66031263997641, MemAllocated=21.59GB, MaxMemAllocated=28.8GB
{'loss': 13.3487, 'learning_rate': 1.3422818791946309e-06, 'epoch': 1.01}
 50%|████████████████████████████████████████████████████████████████████████████████████                                                                                   | 300/596 [41:50<41:30,  8.41s/it]Saving the whole model
[INFO|configuration_utils.py:457] 2023-04-14 18:48:39,458 >> Configuration saved in /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/config.json
[INFO|configuration_utils.py:362] 2023-04-14 18:48:39,459 >> Configuration saved in /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/generation_config.json
[INFO|modeling_utils.py:1855] 2023-04-14 18:49:03,951 >> The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/pytorch_model.bin.index.json.
[INFO|tokenization_utils_base.py:2171] 2023-04-14 18:49:03,953 >> tokenizer config file saved in /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/tokenizer_config.json
[INFO|tokenization_utils_base.py:2178] 2023-04-14 18:49:03,953 >> Special tokens file saved in /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/special_tokens_map.json
[2023-04-14 18:49:03,983] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step600 is about to be saved!
[2023-04-14 18:49:03,988] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/global_step600/mp_rank_00_model_states.pt
[2023-04-14 18:49:03,988] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/global_step600/mp_rank_00_model_states.pt...
[2023-04-14 18:49:15,934] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/global_step600/mp_rank_00_model_states.pt.
[2023-04-14 18:49:15,937] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/global_step600/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-04-14 18:49:28,049] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/global_step600/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-04-14 18:49:28,049] [INFO] [engine.py:3125:_save_zero_checkpoint] zero checkpoint saved /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4/checkpoint-300/global_step600/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-04-14 18:49:28,049] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step600 is ready now!
 51%|████████████████████████████████████████████████████████████████████████████████████▏                                                                                | 304/596 [43:14<1:05:51, 13.53s/it][2023-04-14 18:50:09,137] [INFO] [logging.py:96:log_dist] [Rank 0] step=610, skipped=12, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 18:50:11,316] [INFO] [timer.py:199:stop] epoch=0/micro_step=610/global_step=610, RunningAvgSamplesPerSec=45.926876625767875, CurrSamplesPerSec=45.66709917655267, MemAllocated=21.59GB, MaxMemAllocated=28.8GB
 52%|██████████████████████████████████████████████████████████████████████████████████████▌                                                                                | 309/596 [43:56<44:16,  9.26s/it][2023-04-14 18:50:51,114] [INFO] [logging.py:96:log_dist] [Rank 0] step=620, skipped=12, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 18:50:53,302] [INFO] [timer.py:199:stop] epoch=0/micro_step=620/global_step=620, RunningAvgSamplesPerSec=45.92462533252217, CurrSamplesPerSec=45.55552426651123, MemAllocated=21.59GB, MaxMemAllocated=28.8GB
{'loss': 13.3202, 'learning_rate': 0.0, 'epoch': 1.04}
...
 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████  | 589/596 [1:23:07<00:58,  8.41s/it][2023-04-14 19:30:02,654] [INFO] [logging.py:96:log_dist] [Rank 0] step=1180, skipped=12, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 19:30:04,820] [INFO] [timer.py:199:stop] epoch=0/micro_step=1180/global_step=1180, RunningAvgSamplesPerSec=45.85904109663022, CurrSamplesPerSec=45.73521852038509, MemAllocated=21.59GB, MaxMemAllocated=28.8GB
{'loss': 13.3537, 'learning_rate': 0.0, 'epoch': 1.98}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍| 594/596 [1:23:49<00:16,  8.41s/it][2023-04-14 19:30:44,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=1190, skipped=12, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-04-14 19:30:47,022] [INFO] [timer.py:199:stop] epoch=0/micro_step=1190/global_step=1190, RunningAvgSamplesPerSec=45.856487437478386, CurrSamplesPerSec=45.579988341622055, MemAllocated=21.59GB, MaxMemAllocated=28.8GB
{'train_runtime': 5046.8863, 'train_samples_per_second': 45.414, 'train_steps_per_second': 0.118, 'train_loss': 13.905431555421561, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 596/596 [1:24:06<00:00,  8.47s/it]
***** train metrics *****
  epoch                    =        2.0
  train_loss               =    13.9054
  train_runtime            = 1:24:06.88
  train_samples            =     114599
  train_samples_per_second =     45.414
  train_steps_per_second   =      0.118
[2023-04-14 19:30:58,560] [INFO] [launch.py:460:main] Process 35198 exits successfully.
[2023-04-14 19:30:58,561] [INFO] [launch.py:460:main] Process 35192 exits successfully.
[2023-04-14 19:30:58,561] [INFO] [launch.py:460:main] Process 35193 exits successfully.
[2023-04-14 19:30:58,561] [INFO] [launch.py:460:main] Process 35195 exits successfully.
[2023-04-14 19:30:58,561] [INFO] [launch.py:460:main] Process 35191 exits successfully.
[2023-04-14 19:30:59,562] [INFO] [launch.py:460:main] Process 35194 exits successfully.
[2023-04-14 19:30:59,563] [INFO] [launch.py:460:main] Process 35197 exits successfully.
[2023-04-14 19:31:00,564] [INFO] [launch.py:460:main] Process 35196 exits successfully.
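
The OVERFLOW messages near the start of the log come from fp16 dynamic loss scaling. With the logged settings (initial_scale_power 16, hysteresis 2, min_loss_scale 1), the scale starts at 65536, the first overflow only uses up the hysteresis, and each further overflow halves the scale. The sketch below is a simplified model of that behavior written for illustration; it is not DeepSpeed's implementation, and the function name and the overflow sequence passed in are assumptions.

```python
def simulate_loss_scale(overflows, initial_scale_power=16, hysteresis=2,
                        min_loss_scale=1.0, loss_scale_window=1000):
    """Simplified model of fp16 dynamic loss scaling (illustration only).

    overflows: iterable of booleans, True when a step produced inf/NaN gradients.
    Returns the loss scale in effect after each step.
    """
    scale = 2.0 ** initial_scale_power
    remaining_hysteresis = hysteresis
    good_steps = 0
    history = []
    for overflow in overflows:
        if overflow:
            good_steps = 0
            if remaining_hysteresis > 1:
                # Tolerate this overflow: skip the step but keep the scale.
                remaining_hysteresis -= 1
            else:
                # Halve the scale, never going below the configured minimum.
                scale = max(scale / 2, min_loss_scale)
        else:
            good_steps += 1
            if good_steps % loss_scale_window == 0:
                # A long run of clean steps lets the scale grow again.
                scale *= 2
        history.append(scale)
    return history


# Seven overflowing steps in a row, as in the first ten steps of the log.
history = simulate_loss_scale([True] * 7)
print(history)
```

Under these assumptions the scale ends at 1024 after seven skipped steps, which is consistent with the "skipped=7" entry and the final "reducing to 1024" message in the log.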

GPU memory usage:

Fri Apr 14 18:27:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   59C    P0    92W / 300W |  36539MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   61C    P0    96W / 300W |  38395MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80G...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   63C    P0    93W / 300W |  38395MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80G...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   65C    P0   102W / 300W |  38347MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A800 80G...  Off  | 00000000:9B:00.0 Off |                    0 |
| N/A   64C    P0   108W / 300W |  38347MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A800 80G...  Off  | 00000000:9C:00.0 Off |                    0 |
| N/A   64C    P0   105W / 300W |  38395MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A800 80G...  Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   58C    P0    97W / 300W |  36433MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   59C    P0    92W / 300W |  38347MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     35191      C   ...nv-py310-cu117/bin/python    36537MiB |
|    1   N/A  N/A     35192      C   ...nv-py310-cu117/bin/python    38393MiB |
|    2   N/A  N/A     35193      C   ...nv-py310-cu117/bin/python    38393MiB |
|    3   N/A  N/A     35194      C   ...nv-py310-cu117/bin/python    38345MiB |
|    4   N/A  N/A     35195      C   ...nv-py310-cu117/bin/python    38345MiB |
|    5   N/A  N/A     35196      C   ...nv-py310-cu117/bin/python    38393MiB |
|    6   N/A  N/A     35197      C   ...nv-py310-cu117/bin/python    36431MiB |
|    7   N/A  N/A     35198      C   ...nv-py310-cu117/bin/python    38345MiB |
+-----------------------------------------------------------------------------+

Output files:

 tree /home/guodong.li/output/adgen-chatglm-6b-ft-1e-4
/home/guodong.li/output/adgen-chatglm-6b-ft-1e-4
├── all_results.json
├── checkpoint-300
│   ├── config.json
│   ├── configuration_chatglm.py
│   ├── generation_config.json
│   ├── global_step600
│   │   ├── mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_3_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_4_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_5_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_6_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_7_mp_rank_00_optim_states.pt
│   ├── ice_text.model
│   ├── latest
│   ├── modeling_chatglm.py
│   ├── pytorch_model-00001-of-00002.bin
│   ├── pytorch_model-00002-of-00002.bin
│   ├── pytorch_model.bin.index.json
│   ├── quantization.py
│   ├── rng_state_0.pth
│   ├── rng_state_1.pth
│   ├── rng_state_2.pth
│   ├── rng_state_3.pth
│   ├── rng_state_4.pth
│   ├── rng_state_5.pth
│   ├── rng_state_6.pth
│   ├── rng_state_7.pth
│   ├── special_tokens_map.json
│   ├── tokenization_chatglm.py
│   ├── tokenizer_config.json
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── zero_to_fp32.py
├── trainer_state.json
└── train_results.json

2 directories, 36 files

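When training finishes, the final model weights are not saved; only the checkpoints written during training are kept. To save the final weights, add a trainer.save_model() call to the code.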
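
A minimal sketch of that change, assuming the script holds a Hugging Face Trainer (or Seq2SeqTrainer) instance. The helper name train_and_save is hypothetical and is not part of main.py; it only shows where the save_model() call would go relative to train().

```python
def train_and_save(trainer, resume_from_checkpoint=None):
    """Run training, then persist the final weights.

    `trainer` is expected to behave like transformers.Trainer: it must
    provide train(), save_model() and save_state().
    """
    result = trainer.train(resume_from_checkpoint=resume_from_checkpoint)
    # Without this call only the intermediate checkpoint-* directories exist.
    trainer.save_model()   # writes the weights to the trainer's output_dir
    trainer.save_state()   # writes trainer_state.json next to them
    return result
```

In main.py this would amount to calling trainer.save_model() right after the existing trainer.train(...) line.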

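Full fine-tuning with DeepSpeed needs a lot of GPU memory and trains slowly. The next section therefore tries the parameter-efficient P-Tuning v2 fine-tuning provided by the official repository.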

Parameter-efficient fine-tuning of ChatGLM-6B with P-Tuning v2

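Here the ChatGLM-6B model is fine-tuned with P-Tuning v2. This reduces the number of parameters that need to be fine-tuned to roughly 0.1% of the original; combined with model quantization, gradient checkpointing and similar techniques, it can run with as little as 7 GB of GPU memory.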
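
As a rough sanity check on the trainable parameter count, the prefix size can be estimated from the settings used in this article: pre_seq_len is 128 in train.sh, and the model config printed in the run log reports num_layers 28, hidden_size 4096 and prefix_projection false. The sketch assumes the prefix is a single embedding table with one key vector and one value vector per layer for each prefix position, and it assumes a total of about 6.2 billion model parameters; both are assumptions, not values read from the training output.

```python
def prefix_param_count(pre_seq_len, num_layers, hidden_size):
    """Trainable values in a prefix without projection (assumed layout):
    one key and one value vector per layer for every prefix position."""
    return pre_seq_len * num_layers * hidden_size * 2


PRE_SEQ_LEN = 128       # from train.sh
NUM_LAYERS = 28         # from the printed ChatGLMConfig
HIDDEN_SIZE = 4096      # from the printed ChatGLMConfig
TOTAL_PARAMS = 6.2e9    # assumption: approximate size of ChatGLM-6B

trainable = prefix_param_count(PRE_SEQ_LEN, NUM_LAYERS, HIDDEN_SIZE)
fraction = trainable / TOTAL_PARAMS
print(f"{trainable:,} trainable prefix parameters ({fraction:.2%} of the model)")
```

Under these assumptions the prefix holds about 29 million values, well under 1% of the model, and the share shrinks further with a smaller pre_seq_len.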

First, edit the train.sh script, mainly the train_file, validation_file, model_name_or_path and output_dir parameters:

PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=0 python3 main.py \
--do_train \
--train_file /data/nfs/llm/data/AdvertiseGen/train.json \
--validation_file /data/nfs/llm/data/AdvertiseGen/dev.json \
--prompt_column content \
--response_column summary \
--overwrite_cache \
--model_name_or_path /data/nfs/llm/model/chatglm-6b \
--output_dir /home/guodong.li/output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
--overwrite_output_dir \
--max_source_length 64 \
--max_target_length 64 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--predict_with_generate \
--max_steps 3000 \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate $LR \
--pre_seq_len $PRE_SEQ_LEN \
--quantization_bit 4

Run log:

  0%|                  | 0/3000 [00:00<?, ?it/s]
...  
{'loss': 4.2962, 'learning_rate': 0.0196, 'epoch': 0.01}
{'loss': 4.3112, 'learning_rate': 0.019533333333333333, 'epoch': 0.01}
  2%|███▊             | 70/3000 [03:20<2:17:06,  2.81s/it]

GPU memory usage:

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   71C    P0   300W / 300W |   6291MiB / 81920MiB |     74%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

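GPU memory usage is indeed low (6291 MiB), but even with parameter-efficient P-Tuning v2 fine-tuning, training is still slow, at about 2.8 s per optimization step.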

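Edit train.sh to increase the batch size (per_device_train_batch_size from 1 to 128), train for one epoch instead of a fixed 3000 steps, and run again.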

PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=0 python3 main.py \
    --do_train \
    --train_file /data/nfs/llm/data/AdvertiseGen/train.json \
    --validation_file /data/nfs/llm/data/AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /data/nfs/llm/model/chatglm-6b \
    --output_dir /home/guodong.li/output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4

Run log:

sh train.sh
04/14/2023 19:46:38 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: Fals
04/14/2023 19:46:38 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.02,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/guodong.li/output/adgen-chatglm-6b-pt-128-2e-2/runs/Apr14_19-46-38_ai-app-2-46,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_hf,
optim_args=None,
output_dir=/home/guodong.li/output/adgen-chatglm-6b-pt-128-2e-2,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=128,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/home/guodong.li/output/adgen-chatglm-6b-pt-128-2e-2,
save_on_each_node=False,
save_safetensors=False,
save_steps=100,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
04/14/2023 19:47:58 - WARNING - datasets.builder - Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-1cf934bed8e233e6e)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
[INFO|configuration_utils.py:666] 2023-04-14 19:47:58,671 >> loading configuration file /data/nfs/llm/model/chatglm-6b/config.json
[WARNING|configuration_auto.py:925] 2023-04-14 19:47:58,671 >> Explicitly passing a `revision` is encouraged when loading a configuratio a newer revision.
[INFO|configuration_utils.py:666] 2023-04-14 19:47:58,679 >> loading configuration file /data/nfs/llm/model/chatglm-6b/config.json
[INFO|configuration_utils.py:720] 2023-04-14 19:47:58,681 >> Model config ChatGLMConfig {
  "_name_or_path": "/data/nfs/llm/model/chatglm-6b",
  "architectures": [
    "ChatGLMModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
  },
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "gmask_token_id": 130001,
  "hidden_size": 4096,
  "inner_hidden_size": 16384,
  "layernorm_epsilon": 1e-05,
  "mask_token_id": 130000,
  "max_sequence_length": 2048,
  "model_type": "chatglm",
  "num_attention_heads": 32,
  "num_layers": 28,
  "pad_token_id": 3,
  "position_encoding_2d": true,
  "pre_seq_len": null,
  "prefix_projection": false,
  "quantization_bit": 0,
  "torch_dtype": "float16",
  "transformers_version": "4.28.0",
  "use_cache": true,
  "vocab_size": 130528
}

[WARNING|tokenization_auto.py:675] 2023-04-14 19:47:58,683 >> Explicitly passing a `revision` is encouraged when loading a model with curevision.
[INFO|tokenization_utils_base.py:1807] 2023-04-14 19:47:58,692 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1807] 2023-04-14 19:47:58,692 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1807] 2023-04-14 19:47:58,692 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1807] 2023-04-14 19:47:58,692 >> loading file tokenizer_config.json
[WARNING|auto_factory.py:456] 2023-04-14 19:47:59,089 >> Explicitly passing a `revision` is encouraged when loading a model with custom ion.
[INFO|modeling_utils.py:2531] 2023-04-14 19:47:59,115 >> loading weights file /data/nfs/llm/model/chatglm-6b/pytorch_model.bin.index.jso
[INFO|configuration_utils.py:575] 2023-04-14 19:47:59,117 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████
[INFO|modeling_utils.py:3190] 2023-04-14 19:48:08,508 >> All model checkpoint weights were used when initializing ChatGLMForConditionalG

[WARNING|modeling_utils.py:3192] 2023-04-14 19:48:08,508 >> Some weights of ChatGLMForConditionalGeneration were not initialized from thtialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-04-14 19:48:08,548 >> Generation config file not found, using a generation config created from the mo
Quantized to 4 bit
input_ids [5, 65421, 61, 67329, 32, 98339, 61, 72043, 32, 65347, 61, 70872, 32, 69768, 61, 68944, 32, 67329, 64103, 61, 96914, 130001, 15388, 74531, 63825, 75786, 64009, 63823, 65626, 63882, 64619, 65388, 6, 64480, 65604, 85646, 110945, 10, 64089, 65966, 87052, 67329, 65564219, 63848, 112012, 6, 71231, 65099, 71252, 66800, 85768, 64566, 64338, 100323, 75469, 63823, 117317, 64218, 64257, 64051, 74197, 6, 6 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
inputs 类型#裤*版型#宽松*风格#性感*图案#线条*裤型#阔腿裤 宽松的阔腿裤这两年真的吸粉不少,明星时尚达人的心头爱。毕竟好穿时尚,谁都能穿出腿长适贴身体验感棒棒哒。系带部分增加设计看点,还
label_ids [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,65840, 65388, 74531, 63825, 75786, 64009, 63823, 65626, 63882, 64619, 65388, 6, 64480, 65604, 85646, 110945, 10, 64089, 65966, 87052, 67 88473, 64219, 63848, 112012, 6, 71231, 65099, 71252, 66800, 85768, 64566, 64338, 100323, 75469, 63823, 117317, 64218, 64257, 64051, 741-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100
labels 宽松的阔腿裤这两年真的吸粉不少,明星时尚达人的心头爱。毕竟好穿时尚,谁都能穿出腿长2米的效果宽松的裤腿,当然是遮肉小能手啊。上身随性自
/home/guodong.li/virtual-venv/chatglm-ptuningv2-venv-py310-cu117/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWain a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warn
  warnings.warn(
  0%|                                                                                                                                   04/14/2023 19:51:19 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - `use_cache=True` is incompatible with gradient checkp
{'loss': 6.0246, 'learning_rate': 0.016428571428571428, 'epoch': 0.18}
{'loss': 7.8721, 'learning_rate': 0.012857142857142859, 'epoch': 0.36}
{'loss': 8.2653, 'learning_rate': 0.009285714285714286, 'epoch': 0.54}
{'loss': 8.6636, 'learning_rate': 0.005714285714285714, 'epoch': 0.71}
{'loss': 8.5985, 'learning_rate': 0.002142857142857143, 'epoch': 0.89}
{'train_runtime': 4868.4062, 'train_samples_per_second': 23.539, 'train_steps_per_second': 0.012, 'train_loss': 7.956800188337054, 'epoc
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     7.9568
  train_runtime            = 1:21:08.40
  train_samples            =     114599
  train_samples_per_second =     23.539
  train_steps_per_second   =      0.012

GPU memory usage:

Sun Apr 16 19:53:00 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   71C    P0   281W / 300W |  63275MiB / 81920MiB |     92%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     20126      C   python3                         63273MiB |
+-----------------------------------------------------------------------------+

Output files:

> ls -al  /home/guodong.li/output/adgen-chatglm-6b-pt-128-2e-2
total 12
drwxrwxr-x 2 guodong.li guodong.li   98 Apr 14 21:12 .
drwxrwxr-x 8 guodong.li guodong.li  177 Apr 14 17:12 ..
-rw-rw-r-- 1 guodong.li guodong.li  195 Apr 14 21:12 all_results.json
-rw-rw-r-- 1 guodong.li guodong.li 1185 Apr 14 21:12 trainer_state.json
-rw-rw-r-- 1 guodong.li guodong.li  195 Apr 14 21:12 train_results.json

As shown above, increasing batch_size raised both GPU memory usage (63275MiB of 81920MiB) and GPU utilization (92%).

To run data-parallel training with DeepSpeed, a command like the following can be used:

PRE_SEQ_LEN=128
LR=2e-2

deepspeed --include localhost:1,2,3 --master_port 29001 main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file /data/nfs/llm/data/AdvertiseGen/train.json \
    --validation_file /data/nfs/llm/data/AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /data/nfs/llm/model/chatglm-6b \
    --output_dir /home/guodong.li/output/adgen-chatglm-6b-pt \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --num_train_epochs 10 \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN
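
As a rough sanity check on the flags in this command, the number of samples consumed per optimizer step under data parallelism is the per-device batch size times the gradient accumulation steps times the number of GPUs. The sketch below is only an illustration: the helper name `effective_batch_size` is made up here, the GPU count of 3 is read off `--include localhost:1,2,3`, and 114599 is the train_samples value reported in the training metrics above.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer step under plain data parallelism."""
    return per_device_batch * grad_accum_steps * num_gpus


# Values from the command above: 3 GPUs, batch size 128, 16 accumulation steps.
global_batch = effective_batch_size(128, 16, 3)

# Rough estimate only; the Trainer may report a slightly different count
# because of how it rounds partial batches.
approx_steps_per_epoch = math.ceil(114599 / global_batch)

print(global_batch, approx_steps_per_epoch)
```

With so few optimizer steps per epoch, values such as --logging_steps and --save_steps are worth checking against the total step count before launching a long run.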

Model Evaluation

Edit evaluate.sh and set parameters such as model_name_or_path (the base model path) and ptuning_checkpoint (the path to the weights produced by P-Tuning v2 fine-tuning):

PRE_SEQ_LEN=128
CHECKPOINT=adgen-chatglm-6b-pt-128-2e-2
STEP=3000

CUDA_VISIBLE_DEVICES=1 python3 main.py \
    --do_predict \
    --validation_file /data/nfs/llm/data/AdvertiseGen/dev.json \
    --test_file /data/nfs/llm/data/AdvertiseGen/dev.json \
    --overwrite_cache \
    --prompt_column content \
    --response_column summary \
    --model_name_or_path /data/nfs/llm/model/chatglm-6b \
    --ptuning_checkpoint /home/guodong.li/output/adgen-chatglm-6b-pt-128-2e-2/checkpoint-500 \
    --output_dir /home/guodong.li/output/adgen-chatglm-6b-pt-128-2e-2/checkpoint-500 \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_eval_batch_size 1 \
    --predict_with_generate \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4

Run log:

sh evaluate.sh
04/16/2023 20:18:01 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
04/16/2023 20:18:01 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
...
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
...
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Downloading and preparing dataset json/default to /home/guodong.li/.cache/huggingface/datasets/json/default-df42438b5ccb0b44/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3419.73it/s]
Extracting data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 196.48it/s]
Dataset json downloaded and prepared to /home/guodong.li/.cache/huggingface/datasets/json/default-df42438b5ccb0b44/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 326.85it/s]
[INFO|configuration_utils.py:666] 2023-04-16 20:19:21,784 >> loading configuration file /data/nfs/llm/model/chatglm-6b/config.json
[WARNING|configuration_auto.py:925] 2023-04-16 20:19:21,785 >> Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|configuration_utils.py:666] 2023-04-16 20:19:21,792 >> loading configuration file /data/nfs/llm/model/chatglm-6b/config.json
[INFO|configuration_utils.py:720] 2023-04-16 20:19:21,795 >> Model config ChatGLMConfig {
  "_name_or_path": "/data/nfs/llm/model/chatglm-6b",
  "architectures": [
    "ChatGLMModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
  },
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "gmask_token_id": 130001,
  "hidden_size": 4096,
  "inner_hidden_size": 16384,
  "layernorm_epsilon": 1e-05,
  "mask_token_id": 130000,
  "max_sequence_length": 2048,
  "model_type": "chatglm",
  "num_attention_heads": 32,
  "num_layers": 28,
  "pad_token_id": 3,
  "position_encoding_2d": true,
  "pre_seq_len": null,
  "prefix_projection": false,
  "quantization_bit": 0,
  "torch_dtype": "float16",
  "transformers_version": "4.28.0",
  "use_cache": true,
  "vocab_size": 130528
}

[WARNING|tokenization_auto.py:675] 2023-04-16 20:19:21,797 >> Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|tokenization_utils_base.py:1807] 2023-04-16 20:19:21,805 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1807] 2023-04-16 20:19:21,805 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1807] 2023-04-16 20:19:21,805 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1807] 2023-04-16 20:19:21,805 >> loading file tokenizer_config.json
[WARNING|auto_factory.py:456] 2023-04-16 20:19:22,186 >> Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO|modeling_utils.py:2531] 2023-04-16 20:19:22,222 >> loading weights file /data/nfs/llm/model/chatglm-6b/pytorch_model.bin.index.json
[INFO|configuration_utils.py:575] 2023-04-16 20:19:22,224 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:08<00:00,  1.04s/it]
[INFO|modeling_utils.py:3190] 2023-04-16 20:19:30,912 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[WARNING|modeling_utils.py:3192] 2023-04-16 20:19:30,912 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/nfs/llm/model/chatglm-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-04-16 20:19:30,967 >> Generation config file not found, using a generation config created from the model config.
Quantized to 4 bit
input_ids [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 65421, 61, 75898, 32, 68554, 61, 77257, 64555, 32, 65107, 61, 66268, 32, 65347, 61, 71689, 32, 69768, 61, 85428, 32, 65173, 73942, 61, 70984, 32, 65173, 70936, 61, 64703, 65509, 130001, 130004]
inputs 类型#上衣*材质#牛仔布*颜色#白色*风格#简约*图案#刺绣*衣样式#外套*衣款式#破洞
label_ids [5, 71689, 66561, 67061, 77257, 70984, 6, 72194, 65173, 64290, 64622, 81549, 63823, 65173, 64290, 83343, 63832, 63912, 65209, 64703, 65509, 64051, 6, 69418, 78598, 87019, 6, 64257, 71319, 66069, 74197, 63823, 65173, 72265, 64880, 64131, 63832, 73416, 85428, 66261, 6, 65594, 87834, 6, 73412, 105145, 65388, 63823, 130001, 130004]
labels 简约而不简单的牛仔外套,白色的衣身十分百搭。衣身多处有做旧破洞设计,打破单调乏味,增加一丝造型看点。衣身后背处有趣味刺绣装饰,丰富层次感,彰显别样时尚。
04/16/2023 20:21:30 - INFO - __main__ - *** Predict ***
[INFO|configuration_utils.py:575] 2023-04-16 20:21:30,090 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

  0%|                                                                                                                                                                                | 0/1070 [00:00<?, ?it/s][INFO|configuration_utils.py:575] 2023-04-16 20:21:34,430 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

  0%|▎                                                                                                                                                                       | 2/1070 [00:02<25:39,  1.44s/it][INFO|configuration_utils.py:575] 2023-04-16 20:21:37,311 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

  0%|▍                                                                                                                                                                       | 3/1070 
...
  1%|█▎                                                                                                                                                                      | 8/1070 [00:20<50:13,  2.84s/it][INFO|configuration_utils.py:575] 2023-04-16 20:21:55,233 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

  1%|█▍                                                                                                                                                                      | 9/1070 [00:23<50:24,  2.85s/it][INFO|configuration_utils.py:575] 2023-04-16 20:21:58,112 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

  1%|█▌                                                                                                                                                                     | 10/1070 [00:26<50:30,  2.86s/it][INFO|configuration_utils.py:575] 2023-04-16 20:22:00,990 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

  1%|█▋                                                                                                                                                                     | 11/1070 [00:29<50:37,  2.87s/it][INFO|configuration_utils.py:575] 2023-04-16 20:22:03,880 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

  1%|█▊                                                                                                                                                                     | 12/1070 [00:32<50:38,  2.87s/it][INFO|configuration_utils.py:575] 2023-04-16 20:22:06,761 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}
...
[INFO|configuration_utils.py:575] 2023-04-16 21:13:16,240 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 1069/1070 [51:44<00:02,  2.92s/it][INFO|configuration_utils.py:575] 2023-04-16 21:13:19,107 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.28.0"
}

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1070/1070 [51:47<00:00,  2.90s/it]Building prefix dict from the default dictionary ...
04/16/2023 21:13:22 - DEBUG - jieba - Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
04/16/2023 21:13:22 - DEBUG - jieba - Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.634 seconds.
04/16/2023 21:13:22 - DEBUG - jieba - Loading model cost 0.634 seconds.
Prefix dict has been built successfully.
04/16/2023 21:13:22 - DEBUG - jieba - Prefix dict has been built successfully.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1070/1070 [51:53<00:00,  2.91s/it]
***** predict metrics *****
  predict_bleu-4             =     0.7846
  predict_rouge-1            =     8.8941
  predict_rouge-2            =     1.3703
  predict_rouge-l            =    16.4982
  predict_runtime            = 0:51:57.77
  predict_samples            =       1070
  predict_samples_per_second =      0.343
  predict_steps_per_second   =      0.343
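
The throughput figures in these metrics can be cross-checked from the runtime and the sample count. The snippet below is a minimal sketch with a hypothetical helper, `runtime_to_seconds`, that assumes the runtime is always printed as H:MM:SS.ss as it is in the log above.

```python
def runtime_to_seconds(runtime):
    """Convert an 'H:MM:SS.ss' runtime string to seconds."""
    hours, minutes, seconds = runtime.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# predict_runtime and predict_samples are copied from the metrics above.
predict_seconds = runtime_to_seconds("0:51:57.77")
samples_per_second = round(1070 / predict_seconds, 3)

print(predict_seconds, samples_per_second)
```

Because evaluation ran on a single GPU with --per_device_eval_batch_size 1, each step processes one sample, which is why predict_steps_per_second equals predict_samples_per_second.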

Model Inference

Create a new file, inference.py:

import os
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer


MODEL_PATH = "/data/nfs/llm/model/chatglm-6b"
CHECKPOINT_PATH = "/home/guodong.li/output/adgen-chatglm-6b-pt-128-2e-2/checkpoint-500"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load the base model; pre_seq_len must match the value used for P-Tuning v2 training
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained(MODEL_PATH, config=config, trust_remote_code=True).cuda()

# Load the trained prefix-encoder weights and strip the "transformer.prefix_encoder." key prefix
prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
new_prefix_state_dict = {}

for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)

# Quantize the base model to 4 bits; the prefix encoder is kept in fp32
print("Quantized to 4 bit")
model = model.quantize(4)
model = model.half().cuda()
model.transformer.prefix_encoder.float()
model = model.eval()


print("User: 你好\n")
response, history = model.chat(tokenizer, "你好", history=[])
print("ChatGLM-6B:\n", response)
print("\n------------------------------------------------\nUser:")

# Keep chatting until an empty line is entered
line = input()
while line:
    response, history = model.chat(tokenizer, line, history=history)
    print("ChatGLM-6B:\n", response)
    print("\n------------------------------------------------\nUser:")
    line = input()
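
The key-renaming loop in inference.py can be tried in isolation. The sketch below uses a toy dictionary with strings in place of tensors; the helper name `extract_prefix_encoder_state` and the `lm_head.weight` entry are hypothetical and only show that keys outside the prefix are dropped, while `embedding.weight` matches the parameter name that appears in the evaluation log.

```python
PREFIX = "transformer.prefix_encoder."


def extract_prefix_encoder_state(state_dict, prefix=PREFIX):
    """Keep only the keys under `prefix` and strip the prefix from each key."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy stand-in for a checkpoint: strings replace tensors (hypothetical values).
toy_checkpoint = {
    "transformer.prefix_encoder.embedding.weight": "prefix-weights",
    "lm_head.weight": "unrelated-weights",
}

print(extract_prefix_encoder_state(toy_checkpoint))
```

The stripped keys are what `model.transformer.prefix_encoder.load_state_dict` expects, since that submodule sees its parameters without the `transformer.prefix_encoder.` qualifier.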

Run command:

CUDA_VISIBLE_DEVICES=0 python3 inference.py

Conclusion

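In this article we used DeepSpeed data parallelism with ZeRO for full-parameter fine-tuning of ChatGLM-6B. When GPU resources are limited, P-Tuning v2 can be used instead for parameter-efficient fine-tuning.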

References:

P-Tuning v2