Private LLM Deployment in Practice (3): Multi-Node, Multi-GPU Training with DeepSpeed


Introduction to DeepSpeed

  • DeepSpeed is an open-source deep learning optimization library developed by Microsoft, designed to accelerate training and inference of large models. Through memory optimizations, distributed training techniques, and efficient communication strategies, it significantly reduces the compute resources and time needed to train large models. Its ZeRO (Zero Redundancy Optimizer) technology sharply cuts per-GPU memory usage, making it feasible to train very large models. DeepSpeed also provides mixed-precision training, gradient accumulation, model parallelism, and more, and works across hardware setups from a single machine to multi-node clusters. It is widely used in NLP, computer vision, and other fields, and is a powerful tool for training and deploying large models.
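As a concrete illustration of what those knobs look like, here is a minimal DeepSpeed `ds_config.json` sketch enabling ZeRO stage 2 with CPU optimizer offload. The values are placeholders, and this is not the config used later in this article (the run below drives DeepSpeed through an Accelerate YAML instead):

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 2,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```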

DeepSpeed in Practice

  • I had installed CUDA 11.8 back when setting up the vLLM distributed Ray cluster, so the first step was upgrading to CUDA 12.5. Note that for cuDNN, PyTorch uses its own bundled libraries, usually located inside the PyTorch installation directory.
nvcc --version

[Image: nvcc.png — output of nvcc --version]

Upgrading CUDA (optional; skip if you are already on CUDA 12.x)
  • Note: the distillation workflow covered later also requires the CUDA upgrade, so doing it now saves some trouble.
  • After running the commands below, you should see a cuda-12.5 directory under /usr/local
cd /usr/local
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-5
  • Then update the environment variables
vim ~/.bashrc
export PATH=/usr/local/cuda-12.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.5/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda-12.5/lib64:$LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=1  # synchronous kernel launches, easier debugging; remove for performance
source ~/.bashrc
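After `source ~/.bashrc`, `nvcc --version` should report release 12.5. If you want to check the toolkit version from a script rather than by eye, a small stdlib sketch could parse the nvcc banner (the `parse_nvcc_version` helper and the sample string are illustrative, not part of any CUDA tooling):

```python
import re

def parse_nvcc_version(nvcc_output: str) -> tuple[int, int]:
    """Extract the (major, minor) CUDA release from `nvcc --version` output."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if m is None:
        raise ValueError("could not find a CUDA release string")
    return int(m.group(1)), int(m.group(2))

# Abbreviated nvcc banner after installing CUDA 12.5:
sample = "Cuda compilation tools, release 12.5, V12.5.40"
print(parse_nvcc_version(sample))  # (12, 5)
```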
Upgrading gcc (optional)
  • After the upgrade, gcc reports version 12.3
sudo apt-get purge gcc g++
sudo apt install gcc-12
sudo ln -s /usr/bin/gcc-12 /usr/bin/gcc
# check the version
gcc --version

[Image: gcc升级.png — gcc --version output after the upgrade]

Reinstalling the NVIDIA driver
  • Run the commands below, then reboot. In my case this broke my conda activate xxx environment, and I had to delete and recreate it before things worked again.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install nvidia-driver-555

DeepSpeed Multi-Node, Multi-GPU Deployment

Reinstalling torch and related packages
  • Make sure your torch and related package versions are compatible with DeepSpeed
pip show torch
pip show deepspeed

[Image: torch.png — pip show torch output]

[Image: deepspeed.png — pip show deepspeed output]

Passwordless SSH
  • I used two machines for the DeepSpeed multi-node, multi-GPU test, so passwordless SSH has to be set up between them. My SSH port is 20022 rather than the default 22; adjust for your own port.
ssh-keygen -t rsa
# append our own public key locally
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# pull the other machine's public key and append it locally
ssh -p 20022 root@xxx cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# push the combined authorized_keys back to the other machine
scp -P 20022 ~/.ssh/authorized_keys root@xxx:~/.ssh/authorized_keys
cd ~/.ssh
sudo chmod 755 ~/.ssh 
sudo chmod 644 authorized_keys
ssh-add

# update the SSH client config so all hosts default to port 20022
vim ~/.ssh/config
Host *
    Port 20022

# test the connection
ssh -p 20022 root@xxx
  • Install pdsh (DeepSpeed uses it to launch processes on remote nodes)
sudo apt install pdsh
# add the line below to the file
vim /etc/pdsh/rcmd_default
ssh
# verify the content was added correctly; cat should print "ssh"
cat /etc/pdsh/rcmd_default

  • Check that the Port parameter in /etc/ssh/sshd_config matches the port used here (20022); if so, pdsh can connect.
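Since pdsh shells out to ssh, it picks up the port from ~/.ssh/config; the check above confirms the server side is listening on the same port. If you want to read that value programmatically, a stdlib sketch (the `sshd_port` helper and sample text are illustrative):

```python
import re

def sshd_port(config_text: str, default: int = 22) -> int:
    """Return the first active `Port` directive from sshd_config text."""
    for line in config_text.splitlines():
        m = re.match(r"\s*Port\s+(\d+)", line)  # commented-out "#Port 22" won't match
        if m:
            return int(m.group(1))
    return default

sample = "#Port 22\nPort 20022\nPermitRootLogin yes\n"
print(sshd_port(sample))  # 20022
```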
Quick test: is NCCL reachable between nodes?
  • The script below has every rank contribute to an all-reduce; the per-node launcher commands follow the code.

# nccl_test.py

import os
import torch
import torch.distributed as dist

def run():
    # rank/world size are injected by the launcher via environment variables
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    local_rank = int(os.environ['LOCAL_RANK'])

    dist.init_process_group("nccl")

    # every rank contributes a tensor of ones; after a SUM all-reduce,
    # each rank should hold the world size
    tensor = torch.ones(1).to(local_rank)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    if rank == 0:
        print(f"All-reduce result: {tensor.item()}")
        print(f"Expected result: {world_size}")
        if tensor.item() == world_size:
            print("NCCL test passed successfully!")
        else:
            print("NCCL test failed!")

    dist.destroy_process_group()

if __name__ == "__main__":
    run()

# run on the main node; set nproc_per_node to match each machine's GPU count
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=NODEA --master_port=29500 /usr/local/nccl_test.py

# run on the secondary node
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=NODEA --master_port=29500 /usr/local/nccl_test.py
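The pass/fail check in nccl_test.py works because each of the N ranks contributes `torch.ones(1)` to a SUM all-reduce, so every rank ends up holding the world size (here 2 + 1 = 3 ranks, expected result 3.0). A CPU-only stdlib sketch of that semantics, for intuition only (`simulate_allreduce_sum` is a hypothetical helper, not a torch API):

```python
def simulate_allreduce_sum(rank_tensors):
    """Stand-in for dist.all_reduce(SUM): after the collective, every
    rank holds the elementwise sum of all ranks' tensors."""
    total = [sum(vals) for vals in zip(*rank_tensors)]
    return [list(total) for _ in rank_tensors]

# Three ranks (2 on the main node + 1 on the secondary), each contributing ones:
print(simulate_allreduce_sum([[1.0], [1.0], [1.0]]))  # [[3.0], [3.0], [3.0]]
```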

[Image: 开始时.png — main node waiting at startup]

[Image: 备机加入后.png — after the secondary node joins]

Running multi-node, multi-GPU acceleration

  • Note: once the environments are set up, all paths must be identical across machines — everything, including the miniconda installation path. The model's 64 attention heads don't divide evenly across my three GPUs, so I ended up running on only two GPUs across the two machines.
  • During testing I also ran the exports below; ens3 is my network interface. The gloo setting in theory had no effect here — I'm just recording it for reference.
export CUDA_VISIBLE_DEVICES="0,1"; 
export NCCL_SOCKET_IFNAME="ens3"; 
export NCCL_TIMEOUT="60";
export NCCL_IB_DISABLE="1"; 
export NCCL_DEBUG="INFO"; 
export NCCL_P2P_DISABLE="1"; 
export NCCL_P2P_LEVEL="NVL"; 
export NCCL_IBEXT_DISABLE="1"; 
export TORCH_DISTRIBUTED_BACKEND="gloo"; 
  • Since my GPU resources were limited, I used DeepSpeed's ZeRO stage 2; choose whichever stage suits you. Below is my ds_zero2_v2.yaml config (used with accelerate launch)
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
    deepspeed_hostfile: /root/hostfile
    deepspeed_multinode_launcher: pdsh
    offload_optimizer_device: cpu
    zero3_init_flag: false
    zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: NODEA
main_process_port: 21011
main_training_function: main
mixed_precision: 'bf16'
num_machines: 2
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
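The ZeRO stage determines how much of the model state each GPU must hold. As a back-of-the-envelope sketch, based on the ZeRO paper's accounting for mixed-precision Adam (activations, buffers, and fragmentation are excluded, so treat the result as a lower bound, not a prediction):

```python
def zero2_bytes_per_gpu(num_params: float, num_gpus: int) -> float:
    """Rough per-GPU memory for model states under ZeRO stage 2:
      - fp16/bf16 parameters: 2 bytes/param, replicated on every GPU
      - fp16/bf16 gradients:  2 bytes/param, partitioned across GPUs
      - Adam optimizer states (fp32 params + momentum + variance):
        12 bytes/param, partitioned across GPUs"""
    replicated = 2 * num_params
    partitioned = (2 + 12) * num_params / num_gpus
    return replicated + partitioned

# A 1B-parameter model on my 3 GPUs:
print(round(zero2_bytes_per_gpu(1e9, 3) / 1e9, 2))  # 6.67 (GB per GPU)
```

With `offload_optimizer_device: cpu`, the 12-bytes/param optimizer share moves to host RAM, shrinking the GPU footprint further at the cost of PCIe traffic.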

  • The /root/hostfile file — replace NODEA with your main node's IP and NODEB with the secondary node's IP; slots is the number of GPUs on that machine
NODEA slots=2
NODEB slots=1
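Each hostfile line follows the `<hostname> slots=<gpu_count>` shape, and the total slot count should match `num_processes` in the Accelerate config above (2 + 1 = 3 here). A small stdlib sketch of how such a file is interpreted (`parse_hostfile` is an illustrative helper, not DeepSpeed's actual parser):

```python
def parse_hostfile(text: str) -> dict[str, int]:
    """Parse '<host> slots=<n>' lines into {host: slots}."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        host, slots_field = line.split()
        hosts[host] = int(slots_field.split("=")[1])
    return hosts

sample = "NODEA slots=2\nNODEB slots=1\n"
hosts = parse_hostfile(sample)
print(hosts, sum(hosts.values()))  # {'NODEA': 2, 'NODEB': 1} 3
```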

  • Launch DeepSpeed acceleration from the main or secondary node; set CUDA_VISIBLE_DEVICES=0,1 according to your GPU count
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file /usr/local/src/LLM-Dojo/rlhf/ds_config/ds_zero2_v2.yaml xxx.py \
--model_name_or_path /usr/local/Llama-3.2-1B-Instruct/Llama-3.2-1B-Instruct \
-- xxxx various other options
  • Run the following on both nodes — you should see GPU memory being used
nvidia-smi
  • Below is the DeepSpeed multi-node launch log from my KD distillation run. You can see that pdsh is used and the exported NCCL variables are all applied correctly (I have simplified the formatting)
cmd = pdsh -S -f 1024 -w NODEA,NODEB
export PYTHONPATH="/root";
export SHELL="/bin/bash";
export CONDA_EXE="/root/miniconda3/bin/conda";
export PWD="/root";
export LOGNAME="root";
export XDG_SESSION_TYPE="tty";
export CONDA_PREFIX="/root/miniconda3/envs/llmdojo";
export MOTD_SHOWN="pam";
export HOME="/root";
export LANG="en_US.UTF-8";
export XDG_SESSION_CLASS="user";
export TERM="xterm";
export USER="root";
export CONDA_SHLVL="1";
export DISPLAY="localhost:10.0";
export SHLVL="1";
export XDG_SESSION_ID="4885";
export CONDA_PYTHON_EXE="/root/miniconda3/bin/python";
export XDG_RUNTIME_DIR="/run/user/0";
export CONDA_DEFAULT_ENV="llmdojo";
export XDG_DATA_DIRS="/usr/local/share:/usr/share:/var/lib/snapd/desktop";
export PATH="/root/miniconda3/envs/llmdojo/bin:/usr/local/cuda-12.5/bin:/root/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/root/miniconda3/bin";
export DBUS_SESSION_BUS_ADDRESS="unix:path=/run/user/0/bus";
export SSH_TTY="/dev/pts/0";
export _="/root/miniconda3/envs/llmdojo/bin/accelerate";
export CUDA_MODULE_LOADING="LAZY";
export ACCELERATE_MIXED_PRECISION="bf16";
export ACCELERATE_CONFIG_DS_FIELDS="deepspeed_hostfile,deepspeed_multinode_launcher,offload_optimizer_device,zero3_init_flag,zero_stage,mixed_precision";
export ACCELERATE_USE_DEEPSPEED="true";
export ACCELERATE_DEEPSPEED_ZERO_STAGE="2";
export ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE="cpu";
export ACCELERATE_DEEPSPEED_ZERO3_INIT="false";
export CUDA_LAUNCH_BLOCKING="1";
export LIBRARY_PATH="/usr/local/cuda-12.5/lib64:";
export LD_LIBRARY_PATH="/usr/local/cuda-12.5/lib64:";
export CUDA_VISIBLE_DEVICES="0,1";
export NCCL_SOCKET_IFNAME="ens3";
export NCCL_TIMEOUT="60";
export NCCL_IB_DISABLE="1";
export NCCL_DEBUG="INFO";
export NCCL_P2P_DISABLE="1";
export NCCL_P2P_LEVEL="NVL";
export NCCL_IBEXT_DISABLE="1";
export TORCH_DISTRIBUTED_BACKEND="gloo";
cd /root;
/root/miniconda3/envs/llmdojo/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxODMuNi4yMTEuMTk4IjogWzBdLCAiMTIxLjMyLjIzNi4yMzgiOiBbMF19 --node_rank=%n --master_addr=NODEA --master_port=21011 --no_local_rank /usr/local/src/LLM-Dojo/rlhf/train_gkd.py --model_name_or_path /usr/local/Llama-3.2-1B-Instruct/Llama-3.2-1B-Instruct --teacher_model_name_or_path /usr/local/llama-3-7B-lora-trans-export --dataset_name /usr/local/src/LLM-Dojo/converted_data.json --learning_rate 2e-5 --per_device_train_batch_size 3 --gradient_accumulation_steps 2 --output_dir gkd-model2 --logging_steps 2 --dataset_batch_size 3 --num_train_epochs 1 --gradient_checkpointing --lmbda 0.5 --beta 0.5 --use_peft --lora_r 8 --lora_alpha 16 --trust_remote_code --bf16 --save_strategy steps --save_steps 180 --save_total_limit 5 --warmup_steps 0 --lr_scheduler_type cosine --seq_kd True --torch_dtype auto
