1 Multimodal Large Models
Multimodal large language models can process and fuse multiple data types such as text, images, audio, and video. Built on deep learning, they understand and generate data across modalities and show strong capability in complex application scenarios. A central research problem is aligning the feature spaces of different modalities.
- BLIP2: uses a Q-Former (Transformer-based) block to align visual tokens to the text space.
- MiniGPT4: uses BLIP2's ViT as the image encoder and aligns the visual space to the text space through the Q-Former plus a linear layer, with Vicuna as the language model; the linear layer is learnable.
- LLaVA: aligns visual features to the text space with a simple linear layer, which has few parameters yet works well (see the sketch below). It is tuned on visual instruction tasks: the input image is passed through the encoder, concatenated with the instruction embeddings, and fed to the model to predict the output.
- LLaVA1.5-HD: to handle different resolutions, splits a large image into small tiles and also resizes the whole image to keep global information.
- LLaVA-NeXT: uses a dynamic-resolution strategy, scaling and tiling the input according to its similarity to a set of predefined aspect ratios.
The Q-Former has many parameters, which slows convergence, and the performance gain is not obvious; a simple MLP works better.
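To make the LLaVA-style alignment above concrete, here is a minimal PyTorch sketch (the module name and the dimensions 1024/4096 are illustrative assumptions, not the actual LLaVA code): frozen ViT patch features are projected by a learnable linear layer into the LLM embedding space and concatenated with the instruction token embeddings.

import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Illustrative sketch: map ViT patch features into the LLM embedding space."""
    def __init__(self, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vit_dim, llm_dim)  # the only trainable alignment module

    def forward(self, vit_feats):          # (B, num_patches, vit_dim)
        return self.proj(vit_feats)        # (B, num_patches, llm_dim)

# Toy usage with random tensors standing in for real ViT features / text embeddings.
B, num_patches, vit_dim, llm_dim = 2, 256, 1024, 4096
vit_feats = torch.randn(B, num_patches, vit_dim)    # frozen ViT output
text_embeds = torch.randn(B, 32, llm_dim)           # embedded instruction tokens
visual_tokens = LlavaStyleProjector(vit_dim, llm_dim)(vit_feats)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # sequence fed to the LLM
print(llm_inputs.shape)  # torch.Size([2, 288, 4096])

LLaVA-1.5 replaces the single linear layer with a two-layer MLP, and the InternVL2 architecture below likewise uses an MLP projector.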
2 InternVL2
InternVL2 adopts a LLaVA-style architecture (ViT-MLP-LLM), combining InternLM2-20B with InternViT-6B. By scaling the ViT up to 6B parameters and drawing on CLIP-style contrastive learning, the vision encoder is aligned directly with the LLM. InternVL2 also uses dynamic resolution and high-quality data.
- Dynamic high resolution: to strengthen the visual feature representation, the input image is resized to a multiple of 448 and then cropped into regions according to predefined aspect ratios.
- PixelShuffle: an upsampling method that rearranges pixels from the channel dimension into the spatial dimension, turning a low-resolution, high-channel feature map into a higher-resolution one. In InternVL it is applied with r = 0.5, which is effectively downsampling: H and W are each halved and the channel count grows 4x (see the sketch after this list).
- Multi-task output: task-specific embeddings (image generation, segmentation, detection) are initialized and, following the VisionLLMv2 technique, a task routing token is added to train the downstream task-specific embeddings. When a routing token is generated, the task embedding is appended after the routing embedding and fed to the LLM to obtain hidden states, which are then sent to the corresponding decoder to produce images / bounding boxes / masks.
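A small sketch of the pixel-shuffle trick above, assuming PyTorch tensors in (B, C, H, W) layout (the shapes are illustrative and the channel ordering may differ from InternVL's own implementation): "PixelShuffle with r = 0.5" behaves like pixel_unshuffle with factor 2, halving H and W and quadrupling the channels, so the number of visual tokens drops to 1/4.

import torch
import torch.nn.functional as F

# ViT patch features reshaped into a feature map; the numbers are made up for illustration.
x = torch.randn(1, 1024, 32, 32)

# Downsampling direction (r = 0.5): spatial dims halve, channels grow 4x.
y = F.pixel_unshuffle(x, downscale_factor=2)
print(x.shape, '->', y.shape)  # torch.Size([1, 1024, 32, 32]) -> torch.Size([1, 4096, 16, 16])

# The standard PixelShuffle is the inverse rearrangement (upsampling).
z = F.pixel_shuffle(y, upscale_factor=2)
print(z.shape)  # torch.Size([1, 1024, 32, 32])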
Training procedure:
- Pre-training: only the MLP is trained, across a range of visual tasks, to obtain an initial vision-text alignment (see the sketch after this list).
- Visual instruction tuning: all modules are trained to improve the model's ability to follow visual instructions.
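A minimal sketch of how the two stages could be expressed as parameter freezing, assuming three placeholder nn.Module objects for the ViT, the MLP projector, and the LLM (this is not the actual InternVL2 training code):

import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(vit: nn.Module, mlp_projector: nn.Module, llm: nn.Module, stage: int) -> None:
    if stage == 1:
        # Pre-training: only the MLP projector learns the vision-text alignment.
        set_trainable(vit, False)
        set_trainable(llm, False)
        set_trainable(mlp_projector, True)
    else:
        # Visual instruction tuning: all modules are updated.
        set_trainable(vit, True)
        set_trainable(mlp_projector, True)
        set_trainable(llm, True)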
3 Deployment with LMDeploy
3.1 Environment Setup
First, set up the LMDeploy environment:
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy==0.6.1 gradio==4.44.1 timm==1.0.9
# pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu11torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
3.2 Basic Usage
First, test the lmdeploy pipeline interface:
## 1. Import the required packages
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image
## 2. Initialize the inference pipeline with your model
model_path = "/root/share/new_models/OpenGVLab/InternVL2-2B/"
pipe = pipeline(model_path,
                backend_config=TurbomindEngineConfig(session_len=8192))
## 3. Load the image (reading it with PIL also works)
image = load_image('1.jpg')
## 4. Configure the generation parameters
gen_config = GenerationConfig(top_p=0.8, temperature=0.8)
## 5. Chat through the pipeline.chat interface, passing the generation config
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
## 6. Later turns must pass the previous session so the model sees the history
sess = pipe.chat('What is the cat doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
The inference result:
The image features a cat standing in front of a vertical striped background that resembles a ruler. The cat is wearing a white apron over its chest and is holding a piece of paper with text written in Thai script. The text on the paper appears to be a message or a statement. The cat's expression is neutral, and it is looking directly at the camera. The background and the cat's attire give the image a humorous and playful tone
3.3 API
Launch an OpenAI-compatible API server:
lmdeploy serve api_server /root/share/new_models/OpenGVLab/InternVL2-2B/ --server-port 23333 --cache-max-entry-count 0.4
Call the API with the following code:
from openai import OpenAI
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the image please',
        }, {
            'type': 'image_url',
            'image_url': {
                'url': '1.jpg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
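Note that the server may not be able to resolve a bare local path such as '1.jpg'. A common workaround, shown here as an assumption rather than something from the original text, is to embed the image as a base64 data URI in the image_url field (reusing client and model_name from the snippet above):

import base64

# Encode the local image as a data URI so the server does not need access to the client's filesystem.
with open('1.jpg', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode('utf-8')

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe the image please'},
            {'type': 'image_url', 'image_url': {'url': f'data:image/jpeg;base64,{b64}'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)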
3.4 gradio
First, clone the InternVL2 web UI code and launch it:
cd /root
git clone https://github.com/Control-derek/InternVL2-Tutorial.git
cd InternVL2-Tutorial
conda activate lmdeploy
python demo.py
The demo code is as follows:
import os
import random
import numpy as np
import torch
import torch.backends.cudnn as cudnn
import gradio as gr
from utils import load_json, init_logger
from demo import ConversationalAgent, CustomTheme
FOOD_EXAMPLES = "demo/food_for_demo.json"
OUTPUT_PATH = "./outputs"
MODEL_PATH = "/root/share/new_models/OpenGVLab/InternVL2-2B"
def setup_seeds():
    seed = 42

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    cudnn.benchmark = False
    cudnn.deterministic = True

def main():
    setup_seeds()
    # logging
    init_logger(OUTPUT_PATH)
    # food examples
    food_examples = load_json(FOOD_EXAMPLES)

    agent = ConversationalAgent(model_path=MODEL_PATH,
                                outputs_dir=OUTPUT_PATH)
    theme = CustomTheme()

    titles = [
        """<center><B><font face="Comic Sans MS" size=10>书生大模型实战营</font></B></center>"""  ## Kalam:wght@700
        """<center><B><font face="Courier" size=5>「进阶岛」InternVL 多模态模型部署微调实践</font></B></center>"""
    ]
    language = """Language: 中文 and English"""

    with gr.Blocks(theme) as demo_chatbot:
        for title in titles:
            gr.Markdown(title)
        # gr.Markdown(article)
        gr.Markdown(language)

        with gr.Row():
            with gr.Column(scale=3):
                start_btn = gr.Button("Start Chat", variant="primary", interactive=True)
                clear_btn = gr.Button("Clear Context", interactive=True)

                image = gr.Image(type="pil", interactive=False)
                upload_btn = gr.Button("🖼️ Upload Image", interactive=True)

                with gr.Accordion("Generation Settings"):
                    top_p = gr.Slider(minimum=0, maximum=1, step=0.1,
                                      value=0.8,
                                      interactive=True,
                                      label='top-p value',
                                      visible=True)
                    temperature = gr.Slider(minimum=0, maximum=1.5, step=0.1,
                                            value=0.8,
                                            interactive=True,
                                            label='temperature',
                                            visible=True)

            with gr.Column(scale=7):
                chat_state = gr.State()
                chatbot = gr.Chatbot(label='InternVL2', height=800, avatar_images=((os.path.join(os.path.dirname(__file__), 'demo/user.png')), (os.path.join(os.path.dirname(__file__), "demo/bot.png"))))
                text_input = gr.Textbox(label='User', placeholder="Please click the <Start Chat> button to start chat!", interactive=True)

                gr.Markdown("### 输入示例")

                def on_text_change(text):
                    return gr.update(interactive=True)

                text_input.change(fn=on_text_change, inputs=text_input, outputs=text_input)

                gr.Examples(
                    examples=[["图片中的食物通常属于哪个菜系?"],
                              ["如果让你简单形容一下品尝图片中的食物的滋味,你会描述它"],
                              ["去哪个地方游玩时应该品尝当地的特色美食图片中的食物?"],
                              ["食用图片中的食物时,一般它上菜或摆盘时的特点是?"]],
                    inputs=[text_input]
                )

        with gr.Row():
            gr.Markdown("### 食物快捷栏")
        with gr.Row():
            example_xinjiang_food = gr.Examples(examples=food_examples["新疆菜"], inputs=image, label="新疆菜")
            example_sichuan_food = gr.Examples(examples=food_examples["川菜(四川,重庆)"], inputs=image, label="川菜(四川,重庆)")
            example_xibei_food = gr.Examples(examples=food_examples["西北菜 (陕西,甘肃等地)"], inputs=image, label="西北菜 (陕西,甘肃等地)")
        with gr.Row():
            example_guizhou_food = gr.Examples(examples=food_examples["黔菜 (贵州)"], inputs=image, label="黔菜 (贵州)")
            example_jiangsu_food = gr.Examples(examples=food_examples["苏菜(江苏)"], inputs=image, label="苏菜(江苏)")
            example_guangdong_food = gr.Examples(examples=food_examples["粤菜(广东等地)"], inputs=image, label="粤菜(广东等地)")
        with gr.Row():
            example_hunan_food = gr.Examples(examples=food_examples["湘菜(湖南)"], inputs=image, label="湘菜(湖南)")
            example_fujian_food = gr.Examples(examples=food_examples["闽菜(福建)"], inputs=image, label="闽菜(福建)")
            example_zhejiang_food = gr.Examples(examples=food_examples["浙菜(浙江)"], inputs=image, label="浙菜(浙江)")
        with gr.Row():
            example_dongbei_food = gr.Examples(examples=food_examples["东北菜 (黑龙江等地)"], inputs=image, label="东北菜 (黑龙江等地)")

        start_btn.click(agent.start_chat, [chat_state], [text_input, start_btn, clear_btn, image, upload_btn, chat_state])
        clear_btn.click(agent.restart_chat, [chat_state], [chatbot, text_input, start_btn, clear_btn, image, upload_btn, chat_state], queue=False)
        upload_btn.click(agent.upload_image, [image, chatbot, chat_state], [image, chatbot, chat_state])
        text_input.submit(
            agent.respond,
            inputs=[text_input, image, chatbot, top_p, temperature, chat_state],
            outputs=[text_input, image, chatbot, chat_state]
        )

    demo_chatbot.launch(share=True, server_name="127.0.0.1", server_port=1096, allowed_paths=['./'])
    demo_chatbot.queue()

if __name__ == "__main__":
    main()
Next, forward the port with the following command:
ssh -CNg -L 1096:127.0.0.1:1096 root@ssh.intern-ai.org.cn -p <你的 SSH 端口号>
Open http://localhost:1096/ to access the web UI:
Bug fix: edit /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py, comment out lines 126-127, and add:
self._create_event_loop_task()
Cantonese changfen (rice noodle rolls):
Northeastern guobaorou (crispy sweet-and-sour pork):
4 Fine-tuning with XTuner
4.1 Environment Setup
conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env
pip install xtuner==0.1.23 timm==1.0.9
pip install 'xtuner[deepspeed]'
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.39.0 tokenizers==0.15.2 peft==0.13.2 datasets==3.1.0 accelerate==1.2.0 huggingface-hub==0.26.5
4.2 Dataset
Dataset: huggingface.co/datasets/ly…
Path on InternStudio: /root/share/datasets/FoodieQA
The dataset format is shown below; the label file is sivqa_llava.json:
Images:
Dataset processing script:
import json

input_path = "/root/share/datasets/FoodieQA"          # directory containing sivqa_tidy.json
output_path = "/root/finetune/data/sivqa_llava.json"  # output file location

with open(input_path + "/sivqa_tidy.json", 'r', encoding='utf-8') as f:
    foodqa = json.load(f)

llava_format = []
for data in foodqa:
    llava_format.append({
        "image": data['food_meta']['food_file'],
        "conversations": [
            {
                "from": "human",
                "value": data['question'] + "\n<image>"
            },
            {
                "from": "gpt",
                "value": data['choices'][int(data['answer'])] + ",图中的菜是" + data['food_meta']['food_name']
            }
        ]
    })

with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(llava_format, f, indent=4, ensure_ascii=False)
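For reference, a single converted entry looks roughly like the Python dict below; the file name, question, and answer are made-up placeholders, not real FoodieQA records:

example_entry = {
    "image": "changfen.jpg",
    "conversations": [
        {"from": "human", "value": "图片中的食物通常属于哪个菜系?\n<image>"},
        {"from": "gpt", "value": "粤菜,图中的菜是肠粉"},
    ],
}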
4.3 Fine-tuning
cd /root/xtuner
conda activate xtuner-env  # or whatever you named your training environment
cp /root/InternVL2-Tutorial/xtuner_config/internvl_v2_internlm2_2b_lora_finetune_food.py /root/xtuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py
xtuner train xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py --work-dir workspace/internvl2/
After fine-tuning, convert the model weights:
python xtuner/configs/internvl/v1_5/convert_to_official.py xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py workspace/internvl2/internvl_v2_internlm2_2b_lora_finetune_food/iter_640.pth workspace/internvl2/internvl_v2_internlm2_2b_lora_finetune_food/lr35_ep10/
If you changed the hyperparameters, replace iter_xxx.pth with the checkpoint you actually want to convert.
4.4 gradio
Update the model weight path in demo.py and restart:
MODEL_PATH = "/root/finetune/workspace/internvl4/lr35_step448/"
conda activate lmdeploy
python demo.py
Next, re-run the two examples from the Gradio demo in Section 3.4: both answers are now correct, so the model has indeed learned to recognize the dishes in the images. The replies are also fairly short, probably because the answers in the training set are short.
Cantonese changfen (rice noodle rolls):
Northeastern guobaorou (crispy sweet-and-sour pork):
Trying the cat picture again, it turns out the model has overfitted, haha~