Deploying a Large Language Model on WSL2

Goals

Install CUDA
Install transformers
Download a model
Run a prompt

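Prerequisites

Have an NVIDIA GPU. This walkthrough uses a 2060 card with 8 GB of VRAM.
Install WSL 2. Ubuntu is the distribution used here. When reinstalling, run wsl --unregister Ubuntu first to clear the old configuration.
Configure a system proxy.
1. Run cat /etc/resolv.conf to get the Windows host IP (the nameserver entry).
2. Add the environment variable export ALL_PROXY=http://172.30.208.1:10809, replacing the IP and port with your own host IP and proxy port.
Update the system: sudo apt update && sudo apt upgrade.
Connect VSCode to the WSL system.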
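The proxy step above can be scripted. The following is a minimal sketch, assuming the host IP is the first nameserver entry in /etc/resolv.conf and that the proxy listens on port 10809 as in this walkthrough. The helper names host_ip_from_resolv and proxy_url are hypothetical, and the sample text stands in for the real file.

```python
import re


def host_ip_from_resolv(text):
    """Return the first nameserver address found in resolv.conf text, or None."""
    for line in text.splitlines():
        match = re.match(r"\s*nameserver\s+(\S+)", line)
        if match:
            return match.group(1)
    return None


def proxy_url(host_ip, port=10809):
    """Build the proxy URL; 10809 is the port used in this walkthrough."""
    return f"http://{host_ip}:{port}"


# Sample content; on a real system read /etc/resolv.conf instead.
sample = "# generated by WSL\nnameserver 172.30.208.1\n"
print(f"export ALL_PROXY={proxy_url(host_ip_from_resolv(sample))}")
```

To use it on a real system, pass the output of open("/etc/resolv.conf").read() instead of the sample string and paste the printed line into your shell profile.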

Install CUDA

Remove the old GPG key
```
sudo apt-key del 7fa2af80
```

Installation of Linux x86 CUDA Toolkit using WSL-Ubuntu Package

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-6

Configure environment variables

export CUDA_HOME=/usr/local/cuda-12.6
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64

Verify the installation

nvcc -V should print the CUDA compiler version.

nvidia-smi should print the driver version and the GPU's status.
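If either command is reported as missing, the PATH setting is the usual suspect. The following is a small sketch that checks whether both tools are visible on the current PATH. The helper name find_tools is hypothetical; it only looks up the executables and does not run them.

```python
import shutil


def find_tools(names=("nvcc", "nvidia-smi")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```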

Install transformers

Official documentation: huggingface.co/docs/transf…

Prepare the Python environment
1. Install pip
```
sudo apt install python3-pip
```
2. Switch pip's global index to a mirror
```
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple
pip config set install.trusted-host mirrors.aliyun.com
```
3. Create the project directory

  Connect VSCode to WSL. Create the project directory ~/llm101, open it, create the file transformers_run.py, and create a Python virtual environment.
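The virtual environment can also be created from Python's standard venv module rather than through the editor. This is a minimal sketch; the helper name create_env and the .venv directory name are assumptions, not something the steps above prescribe.

```python
import venv
from pathlib import Path


def create_env(project_dir, with_pip=True):
    """Create a virtual environment in <project_dir>/.venv and return its path."""
    env_dir = Path(project_dir).expanduser() / ".venv"
    # venv.create makes any missing directories before populating the environment.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir
```

Calling create_env("~/llm101") and then running source ~/llm101/.venv/bin/activate in the shell would activate it. On Ubuntu, with_pip=True may require the python3-venv package to be installed first.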

Install transformers

Run the following inside the project's virtual environment.

pip install torch
pip install transformers datasets evaluate accelerate
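Before moving on, it can help to confirm that the installed torch build actually sees the GPU. The following is a small sketch using torch.cuda.is_available and torch.cuda.get_device_name; the helper name cuda_report is hypothetical, and it falls back gracefully when torch is not installed.

```python
def cuda_report():
    """Summarize whether torch is importable and whether it can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "device": None}
    available = torch.cuda.is_available()
    # Only query the device name when CUDA is usable, otherwise the call raises.
    device = torch.cuda.get_device_name(0) if available else None
    return {"torch_installed": True, "cuda_available": available, "device": device}


print(cuda_report())
```

If cuda_available is False even though nvidia-smi works, the installed torch wheel may be a CPU-only build.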

Prepare the model

Official documentation: huggingface.co/docs/huggin…

Configure environment variables

Use a mirror site to speed up downloads.
```
export HF_ENDPOINT=https://hf-mirror.com
```
Install huggingface-cli
```
pip install -U huggingface_hub
```

Download the model

huggingface-cli download facebook/opt-125m
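The same download can be done from Python with snapshot_download from huggingface_hub. This is a sketch under the assumption that the mirror endpoint from the previous step should be used when HF_ENDPOINT is not already set; the wrapper name download_model is hypothetical.

```python
import os


def download_model(repo_id="facebook/opt-125m", endpoint="https://hf-mirror.com"):
    """Download a model snapshot into the local cache and return its directory."""
    # HF_ENDPOINT is read when huggingface_hub is first imported,
    # so set it before the import below.
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling print(download_model()) shows the cache directory that holds the model files. If huggingface_hub was already imported elsewhere in the process, the endpoint setting here has no effect.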

Run a prompt

Reference documentation: huggingface.co/docs/transf…

Use pipeline to run text-generation.

from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m", clean_up_tokenization_spaces=False)
pipe = pipeline(task="text-generation", model="facebook/opt-125m", device="cuda", tokenizer=tokenizer)

# Define a pirate greeting to prepend to user input
pirate_greeting = "Avast, matey! How be ye askin' about..."

# Combine user message with pirate greeting
user_message = pirate_greeting + " How many helicopters can a human eat in one sitting?"

# Generate response using the combined message
response = pipe(user_message, max_new_tokens=128)[0]['generated_text']

# Print the full generated text (the prompt followed by the model's continuation)
print(response)
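By default the text-generation pipeline returns the prompt together with the continuation. If only the newly generated part is wanted, one option is to pass return_full_text=False to the pipe call. Another is to trim the prompt afterwards, as in this small sketch; the helper name continuation_only is hypothetical.

```python
def continuation_only(prompt: str, generated_text: str) -> str:
    """Return the generated text with the leading prompt removed, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # The prompt was not echoed back, so there is nothing to trim.
    return generated_text


print(continuation_only("Hello,", "Hello, world"))
```

In the script above this would be called as continuation_only(user_message, response).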