Configuring an LLM Fine-Tuning Training Environment on Ubuntu




NVIDIA Driver Installation Guide

  • Update your system:

    sudo apt update && sudo apt upgrade -y

  • List the recommended drivers for your GPU:

    ubuntu-drivers devices

  • Install the recommended driver automatically:

    sudo ubuntu-drivers autoinstall

    (Alternatively, install a specific version, e.g., sudo apt install nvidia-driver-550)

  • Reboot your system:

    sudo reboot

  • Verify the installation:

    nvidia-smi

    (This should show GPU details if the driver is working.)
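For provisioning scripts, the same check can be done without assuming nvidia-smi is already installed. A minimal sketch (the fallback messages are illustrative; the query flags are standard nvidia-smi options):

```shell
# Scripted driver check: report the GPU name and driver version if
# available, otherwise a diagnostic message instead of a hard failure.
if command -v nvidia-smi >/dev/null 2>&1; then
  STATUS=$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
           || echo "driver installed but no GPU visible")
else
  STATUS="NVIDIA driver tools not found"
fi
echo "$STATUS"
```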

3. CUDA Toolkit Installation

Note: On Ubuntu 24.04, install libtinfo5 first:

wget http://archive.ubuntu.com/ubuntu/pool/main/n/ncurses/libtinfo5_6.1-1ubuntu1_amd64.deb
sudo dpkg -i libtinfo5_6.1-1ubuntu1_amd64.deb

Download CUDA 12.4 from the NVIDIA developer download page; note that this step also installs the bundled driver:

Installer Type: deb (local)

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4

4. cuDNN Installation

cuDNN download page

wget https://developer.download.nvidia.com/compute/cudnn/9.6.0/local_installers/cudnn-local-repo-ubuntu2404-9.6.0_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2404-9.6.0_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2404-9.6.0/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn

5. Environment Variable Configuration

nano ~/.bashrc
# Append at the end of the file:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH  # so runtime libraries are found
# Save with Ctrl+O, Enter, then exit with Ctrl+X, and run:
source ~/.bashrc
# Verify the installation:
nvcc -V
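As a quick sanity sketch, a script can also confirm that the shell actually picks up nvcc from the new PATH entry before later steps rely on it (the fallback message below is illustrative):

```shell
# Check that nvcc is reachable on PATH after sourcing ~/.bashrc.
if command -v nvcc >/dev/null 2>&1; then
  NVCC_INFO=$(nvcc --version | tail -n 2)   # keep just the release lines
else
  NVCC_INFO="nvcc not found on PATH; re-check the export line in ~/.bashrc"
fi
echo "$NVCC_INFO"
```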

6. Miniconda Installation

Miniconda documentation

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Reopen the terminal after installation
conda --version  # verify the installation

7. Python Environment Setup

# Create the environment
conda create --name unsloth_env python=3.11
conda activate unsloth_env

# Install PyTorch (CUDA 12.4)
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124

# Verify that CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

8. Flash Attention 2 Installation

# Download the wheel from the GitHub releases page (make sure to pick the cxx11abiFALSE build)
pip install flash_attn-2.7.1.post1+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
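The wheel filename encodes the flash-attn version plus the CUDA, PyTorch, C++ ABI, and Python tags it was built for. As a sketch, you can compose the expected filename from your environment before searching the releases page (all tag values below are examples; adjust them to your own setup):

```shell
# Compose the expected flash-attn wheel filename from environment tags.
FA_VERSION="2.7.1.post1"
CUDA_TAG="cu12"
TORCH_TAG="torch2.5"
ABI_TAG="cxx11abiFALSE"   # use FALSE if torch.compiled_with_cxx11_abi() is False
PY_TAG="cp311"            # CPython 3.11
WHEEL="flash_attn-${FA_VERSION}+${CUDA_TAG}${TORCH_TAG}${ABI_TAG}-${PY_TAG}-${PY_TAG}-linux_x86_64.whl"
echo "$WHEEL"
```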

9. Git Installation (if needed)

sudo apt update
sudo apt install git

10. Unsloth Installation

pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"

11. LLaMA-Factory Installation

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" -i https://pypi.tuna.tsinghua.edu.cn/simple

12. GPU Monitoring

watch -n 1 nvidia-smi  # monitor GPU status in real time (refreshes every second)

13. vLLM Environment Setup

conda create --name vllm_env python=3.11
conda activate vllm_env
pip install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple

14. API Service Launch Script (run_vllm.sh)

#!/bin/bash

# Activate the conda environment.
# Plain `conda activate` fails in non-interactive scripts, so source conda's
# shell hook first (assumes the default ~/miniconda3 install prefix).
source ~/miniconda3/etc/profile.d/conda.sh
conda activate vllm_env

# Start the API service (in the background)
python -m vllm.entrypoints.openai.api_server \
  --model /home/hzw/Documents/Meta-Llama-3.1-8B1212 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 16384 \
  --uvicorn-log-level debug &

PYTHON_PID=$!  # capture the server's process ID

# Run the client
/home/hzw/Documents/AIRequestClient

# Check the client's exit status
if [ $? -eq 0 ]; then
    echo "Client finished successfully"
else
    echo "Client failed"
    kill $PYTHON_PID  # don't leave the API server running on failure
    exit 1
fi

# Stop the API service
kill $PYTHON_PID
if [ $? -eq 0 ]; then
    echo "API service stopped"
else
    echo "Failed to stop the API service"
fi
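The script starts the server in the background and immediately runs the client, which can race with model loading on large models. One way to avoid that is a readiness wait before the client call; a minimal sketch, where `wait_for` is a hypothetical helper and the example assumes vLLM's default port 8000 and its /health endpoint:

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the given number of attempts is exhausted.
wait_for() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 1
  done
  return 1
}

# Example: wait up to 120 s for the vLLM health endpoint before running
# the client (assumes the default port 8000):
# wait_for 120 curl -sf http://localhost:8000/health || exit 1
```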

15. Script Permissions

chmod +x run_vllm.sh  # add execute permission
./run_vllm.sh         # run the script

Key notes:

  1. Every -i https://pypi.tuna.tsinghua.edu.cn/simple flag tells pip to use the Tsinghua mirror to speed up downloads.
  2. Choose packages that match your actual CUDA version (e.g., 12.4 → cu124) and PyTorch version (e.g., 2.5.0).
  3. Replace the path /home/hzw/Documents/ with the actual location of your model.
  4. Use watch -n 1 nvidia-smi to monitor GPU memory usage in real time.