Challenge Task
Reproduce the deployment and quantization of InternVL and InternLM, and submit the results as an online Feishu document.
Submission link: aicarrier.feishu.cn/share/base/…
Installation
1. Install via pip
conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy
If an error or version conflict occurs, you can pin a specific version:
export LMDEPLOY_VERSION=0.7.2.post1
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
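The wheel URL above is composed from the two environment variables. As an illustration only (the versions are just the ones pinned above), the same URL can be built in Python:

```python
# Compose the lmdeploy wheel URL from the version variables,
# mirroring the shell command above (illustrative only).
LMDEPLOY_VERSION = "0.7.2.post1"
PYTHON_VERSION = "38"  # CPython 3.8 -> "cp38" ABI tag

url = (
    f"https://github.com/InternLM/lmdeploy/releases/download/"
    f"v{LMDEPLOY_VERSION}/lmdeploy-{LMDEPLOY_VERSION}+cu118"
    f"-cp{PYTHON_VERSION}-cp{PYTHON_VERSION}-manylinux2014_x86_64.whl"
)
print(url)
```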
2. Install from source
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
Quick Start
1. Local inference
The install log ends with:
Successfully installed tokenizers-0.20.3 transformers-4.46.3
If transformers is too old (see the version error below), upgrade it:
pip install --upgrade transformers
import lmdeploy
pipe = lmdeploy.pipeline("internlm/InternVL2-1B")
response = pipe(prompts=["Hi, pls intro yourself", "Shanghai is"],
gen_config=GenerationConfig(max_new_tokens=1024,
top_p=0.8,
top_k=40,
temperature=0.6))
print(response)
To use InternVL2, first download the model from Hugging Face (or ModelScope):
#huggingface
huggingface-cli download OpenGVLab/InternVL2_5-1B --local-dir internlm/InternVL2_5-1B
#modelscope
modelscope download OpenGVLab/InternVL3-14B --local_dir /share/new_models/InternVL3/InternVL3-14B
Output
python localinfer.py
Error:
NameError: name 'GenerationConfig' is not defined
First attempt: import it with from transformers import GenerationConfig
Version error:
The current version of transformers is transformers==4.46.3, which is lower than the required version transformers==4.48.3. Please upgrade to the required version.
If using the locally shared InternVL3 model:
/root/share/new_models/InternVL3/InternVL3-1B
Output
Fix attempts
- Delete the virtual environment:
conda remove -n lmdeploy --all
- Use Python 3.10
- Upgrade pip: python -m pip install --upgrade pip
- Reinstall from source (option 2, source install):
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
Note: in general, run the inference script from the top-level lmdeploy directory.
The root cause turned out to be that GenerationConfig should not be imported from transformers here.
The import must instead be: from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
Updated code:
import lmdeploy
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
#pipe = lmdeploy.pipeline("/root/InternVL2_5-1B")
pipe = lmdeploy.pipeline("/root/share/new_models/InternVL3/InternVL3-1B")
response = pipe(prompts=["Hi, pls intro yourself", "Shanghai is"],
gen_config=GenerationConfig(max_new_tokens=1024,
top_p=0.8,
top_k=40,
temperature=0.6))
print(response)
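The gen_config knobs above (top_k, top_p) control how the next token is sampled. A minimal pure-Python sketch of what top-k/top-p (nucleus) filtering does to a probability distribution follows; it is an illustration of the idea, not lmdeploy's actual implementation, and temperature rescaling is omitted:

```python
# Illustrative top-k / top-p filtering over next-token probabilities.
def filter_top_k_top_p(probs, top_k, top_p):
    # Sort token probabilities, highest first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # top_k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving tokens.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(filter_top_k_top_p(probs, top_k=40, top_p=0.8))
# -> {'the': 0.625, 'a': 0.375}
```

With top_p=0.8 only "the" and "a" survive (their cumulative mass reaches 0.8), and the engine then samples from the renormalized pair.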
Specifying the inference engine
### TurbomindEngineConfig inference engine
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline('/root/share/new_models/InternVL3/InternVL3-1B',
backend_config=TurbomindEngineConfig(
max_batch_size=32,
enable_prefix_caching=True,
cache_max_entry_count=0.8,
session_len=8192,
))
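enable_prefix_caching=True lets requests that share a prompt prefix (e.g. the same system prompt) reuse already-computed KV entries. A toy sketch of the idea, assuming token lists stand in for KV-cache blocks (lmdeploy's real implementation works on cache blocks, not Python strings):

```python
# Toy prefix cache: find the longest already-processed prefix of a new
# prompt so only the remaining suffix needs a forward pass.
class PrefixCache:
    def __init__(self):
        self.cached = set()  # prefixes we pretend are already computed

    def insert(self, tokens):
        # Cache every prefix of a processed prompt.
        for i in range(1, len(tokens) + 1):
            self.cached.add(tuple(tokens[:i]))

    def longest_hit(self, tokens):
        # Longest cached prefix length for the incoming prompt.
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self.cached:
                return i
        return 0

cache = PrefixCache()
cache.insert(["You", "are", "a", "helpful", "assistant", "."])
# A second request sharing the first four tokens reuses them.
print(cache.longest_hit(["You", "are", "a", "helpful", "bot"]))  # -> 4
```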
### PytorchEngineConfig inference engine
from lmdeploy import pipeline, PytorchEngineConfig
pipe = pipeline('/root/share/new_models/InternVL3/InternVL3-1B',
backend_config=PytorchEngineConfig(
max_batch_size=32,
enable_prefix_caching=True,
cache_max_entry_count=0.8,
session_len=8192,
))
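cache_max_entry_count=0.8 tells the engine to reserve roughly 80% of the free GPU memory (after the weights are loaded) for the KV cache. A back-of-the-envelope sketch; the GPU and model numbers here are made up for illustration, and lmdeploy computes the real figures internally:

```python
# Back-of-the-envelope KV-cache budget (hypothetical numbers).
free_gpu_mem_gb = 20.0        # free GPU memory after loading weights (assumed)
cache_max_entry_count = 0.8   # fraction of that memory given to the KV cache
kv_cache_budget_gb = free_gpu_mem_gb * cache_max_entry_count  # 16.0 GiB

# Per-token KV size for a made-up fp16 model: K and V for every layer.
num_layers, hidden_size, bytes_per_elem = 24, 2048, 2
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem

max_cached_tokens = int(kv_cache_budget_gb * 1024**3 // kv_bytes_per_token)
print(f"{kv_cache_budget_gb:.0f} GiB cache ~= {max_cached_tokens} tokens")
```

Lowering the fraction leaves more headroom for activations; raising it fits more concurrent sessions in the cache.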
Tip: you may see the warning "AMP: CUDA graph is not supported", which likely means automatic mixed precision (AMP) is in use.
VLM (Visual Language Models) Inference
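When a VLM such as InternVL is served behind the OpenAI-compatible api_server (see the deployment section below), images are passed as content parts inside the chat messages. A sketch of building such a request body; the image URL is a placeholder and the model name depends on what is actually served:

```python
# OpenAI-style multimodal chat request body (placeholder URL/model).
request_body = {
    "model": "InternVL3-1B",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    "max_tokens": 512,
}
# The text part and the image part travel in the same user message.
print([part["type"] for part in request_body["messages"][0]["content"]])
```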
Summary
The essence of lmdeploy lies in its TurboMind inference engine, and the hard part of building from source is likewise compiling TurboMind.
Large Language Model (LLM) Deployment
1. Offline deployment
import lmdeploy
from lmdeploy import GenerationConfig
pipe = lmdeploy.pipeline("/root/share/new_models/internlm3/internlm3-8b-instruct")
response = pipe(prompts=["Hi, pls intro yourself", "Shanghai is"],
gen_config=GenerationConfig(max_new_tokens=1024,
top_p=0.8,
top_k=40,
temperature=0.6))
print(response)
conda activate lmdeploy3
2. Deploying an OpenAI-compatible service
- Using the lmdeploy CLI tool:
lmdeploy serve api_server /root/share/new_models/InternVL3/InternVL3-1B --server-port 23333
- Using an OpenAI-style client
- (i) Without API-key authentication
from openai import OpenAI
client = OpenAI(
api_key='none',  # if auth is not enabled, any value works (e.g. "none")
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": " provide three suggestions about time management"},
],
temperature=0.8,
top_p=0.8
)
print(response)
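The response object mirrors the chat-completions JSON. A sketch of pulling the reply text out of a sample payload; the JSON here is a hand-written illustration, not real server output:

```python
import json

# Hand-written sample of a chat-completions response body; a real
# response has more fields (usage, created, ...) but the same shape.
sample = json.loads("""
{
  "id": "chatcmpl-1",
  "object": "chat.completion",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "1. Plan your day. 2. Batch similar tasks. 3. Take breaks."},
     "finish_reason": "stop"}
  ]
}
""")

# With the openai client, the same path is response.choices[0].message.content.
reply = sample["choices"][0]["message"]["content"]
print(reply)
```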