Environment
- First, you need an RTX 4090 or a better GPU.
- Install the NVIDIA driver; run nvidia-smi to confirm it is installed correctly.
- Install a Python 3 environment; mine is version 3.12.3.
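Once the driver and Python environment are in place, a quick sanity check from Python confirms that PyTorch can see the GPU (run this inside the venv after the pip install step below; a minimal sketch):

import torch

# Verify the driver and CUDA runtime are visible to PyTorch
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # expect your RTX 4090 (or better)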
Steps
Download the model
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B. I recommend downloading it directly in the browser; I couldn't get huggingface-cli to work.
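If you'd rather script the download, the snapshot_download helper from huggingface_hub (already pinned in requirements.txt) may work where the CLI did not. A sketch; the local_dir below assumes the project layout created in the next step, and users in mainland China may need to point the HF_ENDPOINT environment variable at a mirror first:

from huggingface_hub import snapshot_download

# Download all model files into the path the script below expects
snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    local_dir="./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
)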
Create the project
mkdir deepseek-r1-14B
cd ./deepseek-r1-14B
# copy the model you downloaded into ./deepseek-r1-14B/models/deepseek-ai/
python3 -m venv .venv
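Assuming you follow the paths used in this article, the project should end up looking like this once the two files below are added (sketch):

deepseek-r1-14B/
├── .venv/
├── models/
│   └── deepseek-ai/
│       └── DeepSeek-R1-Distill-Qwen-14B/
├── requirements.txt
└── deepseek_r1_14B.py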
Copy the files
requirements.txt
accelerate==1.3.0
aiofiles==23.2.1
aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiohttp-cors==0.7.0
aiosignal==1.3.2
airportsdata==20241001
annotated-types==0.7.0
anyio==4.8.0
astor==0.8.1
asttokens==3.0.0
attrs==25.1.0
backcall==0.2.0
beautifulsoup4==4.12.3
bitsandbytes==0.45.1
blake3==1.0.4
bleach==6.2.0
cachetools==5.5.1
certifi==2024.12.14
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
colorful==0.5.6
compressed-tensors==0.9.0
decorator==5.1.1
defusedxml==0.7.1
depyf==0.18.0
dill==0.3.9
diskcache==5.6.3
distlib==0.3.9
distro==1.9.0
docopt==0.6.2
einops==0.8.0
executing==2.2.0
fastapi==0.115.7
fastjsonschema==2.21.1
ffmpy==0.5.0
filelock==3.17.0
frozenlist==1.5.0
fsspec==2024.12.0
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
googleapis-common-protos==1.66.0
gradio==5.13.1
gradio_client==1.6.0
grpcio==1.70.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.6.1
iniconfig==2.0.0
interegular==0.3.3
ipython==8.12.3
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyterlab_pygments==0.3.0
lark==1.2.2
lm-format-enforcer==0.10.9
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mistral_common==1.5.2
mistune==3.1.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.60.2
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python-headless==4.11.0.86
orjson==3.10.15
outlines==0.1.11
outlines_core==0.1.26
packaging==24.2
pandas==2.2.3
pandocfilters==1.5.1
parso==0.8.4
partial-json-parser==0.2.1.1.post5
pexpect==4.9.0
pickleshare==0.7.5
pillow==10.4.0
pip-autoremove==0.10.0
platformdirs==4.3.6
pluggy==1.5.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.2.1
proto-plus==1.26.0
protobuf==5.29.3
psutil==6.1.1
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
py-spy==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycountry==24.6.1
pydantic==2.10.6
pydantic_core==2.27.2
pydub==0.25.1
Pygments==2.19.1
pytest==8.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
ray==2.41.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.22.3
rsa==4.9
ruff==0.9.3
safehttpx==0.1.6
safetensors==0.5.2
semantic-version==2.10.0
sentencepiece==0.2.0
setuptools==75.8.0
shellingham==1.5.4
six==1.17.0
smart-open==7.1.0
sniffio==1.3.1
soupsieve==2.6
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.7.0
tinycss2==1.4.0
tokenizers==0.21.0
tomlkit==0.13.2
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.48.1
triton==3.1.0
typer==0.15.1
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
virtualenv==20.29.1
watchfiles==1.0.4
wcwidth==0.2.13
webencodings==0.5.1
websockets==14.2
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.11
yarg==0.1.9
yarl==1.18.3
zipp==3.21.0
deepseek_r1_14B.py
import gradio as gr
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
import torch

# Quantization config: 4-bit NF4 fits the 14B model into 24 GB of VRAM
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model
model_path = './models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B'
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_path,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to eos_token as pad_token

def generate_response(prompt, temperature=0.7, max_tokens=512):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=0.9,
        do_sample=True,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Build the Gradio UI
with gr.Blocks(theme=gr.themes.Soft(), title="伊邦 DeepSeek-R1-14B Chat") as demo:
    gr.Markdown("# 伊邦 DeepSeek-R1-14B Chat")
    with gr.Row():
        with gr.Column():
            input_prompt = gr.Textbox(label="Prompt", lines=5)
            temp_slider = gr.Slider(0.1, 1.0, value=0.7, label="Temperature")
            max_token_slider = gr.Slider(128, 2048, value=512, step=128, label="Max new tokens")
            submit_btn = gr.Button('Generate', variant="primary")
        with gr.Column():
            output_text = gr.Textbox(label="Model response", interactive=False, lines=25)
    submit_btn.click(
        fn=generate_response,
        inputs=[input_prompt, temp_slider, max_token_slider],
        outputs=output_text
    )

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=8888, share=False)
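Two optional refinements, sketched with the same model and tokenizer as above: generate_response decodes the full sequence, so the reply echoes the prompt back, and DeepSeek-R1 distills are normally prompted through the tokenizer's chat template rather than raw text. A hypothetical generate_chat_response combining both:

def generate_chat_response(prompt, temperature=0.7, max_tokens=512):
    # Route the prompt through the tokenizer's chat template (recommended for R1 distills)
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Slice off the prompt tokens so only the newly generated reply is returned
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)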
Run the project
Assuming you have a single GPU:
source .venv/bin/activate
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
CUDA_VISIBLE_DEVICES=0 python deepseek_r1_14B.py
Results
In testing, it consumes about 10 GB of VRAM and runs fairly smoothly.
Other model sizes
The 1.5B, 7B, 8B, 14B, 32B, and 70B variants all have BF16 tensor types, so below, "full precision" means loading the weights in BF16, "half precision" means loading them in 8-bit, and "quarter precision" means loading them in 4-bit; the half- and quarter-precision paths both require a quantization config.
Loading the model at full precision
model_path = './models/deepseek-ai/DeepSeek-R1-Distill-xxxx'
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
Loading the model at half precision (8-bit)
# Quantization config
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
# Load the model
model_path = './models/deepseek-ai/DeepSeek-R1-Distill-xxxx'
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_path,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True
)
Loading the model at quarter precision (4-bit)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# Load the model
model_path = './models/deepseek-ai/DeepSeek-R1-Distill-xxxx'
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_path,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True
)
Estimating VRAM
For a model whose tensor type is BF16, you can roughly assume each parameter occupies two bytes, so the minimum VRAM required is:
- Full precision: 14B × 2 bytes = 28 GB
- Half precision: 14B × 1 byte = 14 GB
- Quarter precision: 14B × 0.5 bytes = 7 GB
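As a worked check of this arithmetic (weights only; the KV cache and activations come on top), a minimal helper:

def min_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    # Lower bound: model weights only, ignoring KV cache and activations
    return params_billions * bytes_per_param

print(min_vram_gb(14, 2.0))  # full precision (BF16)    -> 28.0 GB
print(min_vram_gb(14, 1.0))  # half precision (8-bit)   -> 14.0 GB
print(min_vram_gb(14, 0.5))  # quarter precision (4-bit) -> 7.0 GB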
On my 24 GB RTX 4090, the 14B model deployed at half precision works well; after answering several questions it occupies about 20 GB of VRAM.