L1G3000 | LMDeploy: Efficient Deployment and Quantization in Practice


Challenge task

Reproduce the deployment and quantization of InternVL and InternLM, and submit the results as an online Feishu document.

Submission address: aicarrier.feishu.cn/share/base/…

Installation

1. Install via pip

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

If an error or version conflict occurs, you can pin a specific version:

export LMDEPLOY_VERSION=0.7.2.post1
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
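The wheel filename in that URL is assembled from the two environment variables; a small Python sketch of the same string composition makes it easy to adapt the command to another Python or lmdeploy version:

```python
# Reconstruct the wheel filename requested by the install command above,
# from the same two variables.
LMDEPLOY_VERSION = "0.7.2.post1"
PYTHON_VERSION = "38"  # CPython 3.8 ABI tag

wheel = (
    f"lmdeploy-{LMDEPLOY_VERSION}+cu118"
    f"-cp{PYTHON_VERSION}-cp{PYTHON_VERSION}-manylinux2014_x86_64.whl"
)
print(wheel)  # lmdeploy-0.7.2.post1+cu118-cp38-cp38-manylinux2014_x86_64.whl
```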

2. Install from source

conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .

Quick start

1. Local inference

Upgrade transformers first if needed:

pip install --upgrade transformers
# Successfully installed tokenizers-0.20.3 transformers-4.46.3

import lmdeploy
pipe = lmdeploy.pipeline("internlm/InternVL2-1B")
response = pipe(prompts=["Hi, pls intro yourself", "Shanghai is"],
                gen_config=GenerationConfig(max_new_tokens=1024,
                                            top_p=0.8,
                                            top_k=40,
                                            temperature=0.6))
print(response)

To use InternVL2, download the model from Hugging Face (or ModelScope) first:

#huggingface
huggingface-cli download OpenGVLab/InternVL2_5-1B --local-dir internlm/InternVL2_5-1B

#modelscope
modelscope download OpenGVLab/InternVL3-14B --local_dir /share/new_models/InternVL3/InternVL3-14B

Output:

python localinfer.py


Error: NameError: name 'GenerationConfig' is not defined

First attempt: import it via from transformers import GenerationConfig

Version error: The current version of transformers is transformers==4.46.3, which is lower than the required version transformers==4.48.3. Please upgrade to the required version.
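That check is a plain version comparison. A minimal sketch (naive x.y.z parsing, ignoring pre-release suffixes) of why 4.46.3 fails against the required 4.48.3:

```python
def parse_version(v: str) -> tuple:
    """Split a plain x.y.z version string into a comparable integer tuple."""
    return tuple(int(part) for part in v.split("."))

installed = parse_version("4.46.3")
required = parse_version("4.48.3")
print(installed < required)  # True -> upgrade is needed
```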

If using the locally shared InternVL3 model:

/root/share/new_models/InternVL3/InternVL3-1B


Fix attempts

Delete the virtual environment:

conda remove -n lmdeploy --all

Use Python 3.10.

Upgrade pip: python -m pip install --upgrade pip

Switch to installing from source (method 2 above):

conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .

Note: in general, run the inference script from the top-level lmdeploy directory.

It finally turned out that GenerationConfig cannot be imported from transformers here.

Note: import it from lmdeploy instead: from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig

The corrected code:

import lmdeploy
# GenerationConfig must come from lmdeploy, not from transformers
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
#pipe = lmdeploy.pipeline("/root/InternVL2_5-1B")
pipe = lmdeploy.pipeline("/root/share/new_models/InternVL3/InternVL3-1B")
response = pipe(prompts=["Hi, pls intro yourself", "Shanghai is"],
                gen_config=GenerationConfig(max_new_tokens=1024,
                                            top_p=0.8,
                                            top_k=40,
                                            temperature=0.6))
print(response)


Specifying the inference engine

### TurbomindEngineConfig inference engine

from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline('/root/share/new_models/InternVL3/InternVL3-1B',
                backend_config=TurbomindEngineConfig(
                    max_batch_size=32,
                    enable_prefix_caching=True,
                    cache_max_entry_count=0.8,
                    session_len=8192,
                ))
### PytorchEngineConfig inference engine


from lmdeploy import pipeline, PytorchEngineConfig
pipe = pipeline('/root/share/new_models/InternVL3/InternVL3-1B',
                backend_config=PytorchEngineConfig(
                    max_batch_size=32,
                    enable_prefix_caching=True,
                    cache_max_entry_count=0.8,
                    session_len=8192,
                ))
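In both configs, cache_max_entry_count=0.8 caps the KV cache at roughly 80% of the GPU memory left after weights are loaded, and session_len=8192 bounds each sequence. A back-of-the-envelope estimate of per-sequence KV-cache size (the layer/head numbers below are illustrative assumptions, not InternVL3-1B's actual architecture):

```python
# Rough KV-cache footprint for one full-length sequence (FP16).
num_layers = 24        # hypothetical layer count
num_kv_heads = 8       # hypothetical KV heads (GQA)
head_dim = 128
session_len = 8192     # matches session_len in the configs above
bytes_per_value = 2    # FP16

# Factor of 2: both K and V are cached per layer.
kv_bytes = 2 * num_layers * session_len * num_kv_heads * head_dim * bytes_per_value
print(f"{kv_bytes / 1024**3:.2f} GiB per full-length sequence")
```

With these numbers a single 8192-token sequence holds 0.75 GiB of cache, which is why the memory fraction and batch size must be tuned together.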


Tip: you may see the message "AMP: CUDA graph is not supported", probably because automatic mixed precision is in use.

VLM inference (Visual Language Models)

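lmdeploy's pipeline also accepts (prompt, image) inputs for vision-language models. A minimal sketch, assuming the same local InternVL3 path as above; the image URL is a placeholder, not from the original notes, and running this requires a GPU with the model weights available:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image  # helper for fetching/opening images

pipe = pipeline('/root/share/new_models/InternVL3/InternVL3-1B')

# A (prompt, image) tuple triggers the vision-language path.
image = load_image('https://example.com/tiger.jpeg')  # placeholder URL
response = pipe(('describe this image', image))
print(response)
```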

Summary

The essence of lmdeploy lies in the turbomind inference engine, and the main build difficulty is also in compiling turbomind.

Deploying large language models (LLMs)

1. Offline deployment

conda activate lmdeploy3

import lmdeploy
from lmdeploy import GenerationConfig
pipe = lmdeploy.pipeline("/root/share/new_models/internlm3/internlm3-8b-instruct")
response = pipe(prompts=["Hi, pls intro yourself", "Shanghai is"],
                gen_config=GenerationConfig(max_new_tokens=1024,
                                            top_p=0.8,
                                            top_k=40,
                                            temperature=0.6))
print(response)


Deploying an OpenAI-compatible service
  • Using the lmdeploy CLI tool:
lmdeploy serve api_server /root/share/new_models/InternVL3/InternVL3-1B --server-port 23333


  • Using an OpenAI-style client

    • (i) Without API-key authentication:
from openai import OpenAI

client = OpenAI(
    api_key="none",  # if authentication is not enabled, any value (e.g. "none") works
    base_url="http://0.0.0.0:23333/v1",
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Provide three suggestions about time management"},
    ],
    temperature=0.8,
    top_p=0.8,
)
print(response)
