InternLM Large Model Practical Camp, Season 4 — Advanced Part 4: Hands-on InternVL Multimodal Model Deployment and Fine-tuning


1 Multimodal Large Models

Multimodal large language models can process and fuse multiple data types such as text, images, audio, and video. Built on deep learning, they understand and generate data across modalities and show strong capability in complex application scenarios. A central research problem is aligning the feature spaces of the different modalities.

  • BLIP2: uses a Q-Former (Transformer-based) block to align visual tokens to the text space.
  • MiniGPT4: uses BLIP2's ViT as the image encoder, aligns the visual space to the text space via the Q-Former plus a learnable linear layer, and uses Vicuna as the language model.
  • LLaVA: aligns visual features to the text space with a simple linear layer; few parameters, yet strong results. It is tuned on visual instruction tasks: the input image is processed by the encoder, concatenated with the instruction embeddings, and fed to the model to predict the output.
  • LLaVA-1.5-HD: to handle varying resolutions, splits a high-resolution image into small patches and also resizes the whole image to capture global information.
  • LLaVA-NeXT: adopts a dynamic-resolution strategy, scaling and slicing the input according to its similarity to predefined aspect ratios.

The Q-Former has many parameters, which slows convergence, and its performance gain is not obvious; it compares unfavorably with the simple MLP approach.
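To see why the MLP route is so light, the sketch below shows a LLaVA-style projector: a small MLP mapping ViT patch features into the LLM embedding space (a minimal sketch with illustrative dimensions; LLaVA-1.5 uses a two-layer MLP, the original LLaVA a single linear layer):

import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """LLaVA-style projector: maps ViT patch features into the LLM embedding space."""
    def __init__(self, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):    # (B, N_patches, vit_dim)
        return self.proj(patch_feats)  # (B, N_patches, llm_dim)

# The projected visual tokens are concatenated with the instruction embeddings
# and fed to the LLM as a single sequence.
vis_tokens = VisualProjector()(torch.randn(1, 256, 1024))
print(vis_tokens.shape)  # torch.Size([1, 256, 4096])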

2 InternVL2


InternVL2 adopts a LLaVA-style architecture (ViT-MLP-LLM), pairing InternViT-6B with InternLM2-20B. The ViT is scaled up to 6B parameters and, drawing on CLIP-style contrastive learning, aligned directly with the LLM. InternVL2 also relies on dynamic resolution and high-quality data.

  • Dynamic high resolution: to strengthen visual feature expressiveness, the input image is resized to a multiple of 448 and then cropped into regions according to predefined aspect-ratio grids (see the tiling sketch after this list).
  • PixelShuffle: originally an upsampling method that rearranges pixels of a low-resolution feature map into a higher-resolution one. InternVL uses r = 0.5, which is effectively downsampling: H and W are each halved and the channel count grows 4x (see the sketch after this list).
  • Multi-task output: task-specific embeddings (image generation, segmentation, detection) are initialized and, following VisionLLMv2, trained for downstream tasks via an added task-routing token. When a routing token is generated, the task embeddings are appended after the routing embedding and fed to the LLM to obtain hidden states, which are then passed to the corresponding decoder to produce images / bounding boxes / masks.
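As referenced in the list above, here is a minimal sketch of the dynamic high-resolution tiling (assumptions: tile size 448, grids limited by a small tile budget; InternVL's actual implementation also appends a resized global thumbnail tile):

from PIL import Image

def tile_image(img: Image.Image, tile=448, max_tiles=6):
    # Enumerate the allowed (cols, rows) grids and pick the one whose
    # aspect ratio is closest to that of the input image.
    w, h = img.size
    grids = [(c, r) for c in range(1, max_tiles + 1)
             for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - w / h))
    # Resize to an exact multiple of the tile size, then crop the tiles.
    img = img.resize((cols * tile, rows * tile))
    return [img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]

And the pixel-shuffle step with r = 0.5 behaves like PyTorch's pixel_unshuffle with factor 2, which halves H and W and quadruples the channels, so the number of visual tokens handed to the LLM drops to 1/4:

import torch
import torch.nn.functional as F

vit_feat = torch.randn(1, 1024, 32, 32)      # (B, C, H, W), illustrative sizes
reduced = F.pixel_unshuffle(vit_feat, 2)     # (1, 4096, 16, 16): channels x4, H and W halved
tokens = reduced.flatten(2).transpose(1, 2)  # (1, 256, 4096): 1/4 as many visual tokens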

Training procedure

  1. Pre-training: train only the MLP, across a range of visual tasks, to achieve an initial vision-text alignment (see the sketch after this list).
  2. Visual instruction tuning: train all modules to improve the model's ability to follow visual instructions.
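Conceptually, stage 1 freezes everything except the MLP. Below is a toy sketch with hypothetical module names (the actual XTuner configs express this declaratively):

import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in for the ViT-MLP-LLM structure (hypothetical attribute names)."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(1024, 1024)  # placeholder for InternViT
        self.projector = nn.Linear(1024, 4096)     # the MLP trained in stage 1
        self.llm = nn.Linear(4096, 4096)           # placeholder for InternLM2

model = ToyVLM()
for p in model.vision_tower.parameters():
    p.requires_grad = False  # freeze the ViT in stage 1
for p in model.llm.parameters():
    p.requires_grad = False  # freeze the LLM in stage 1
# Only the projector remains trainable; stage 2 unfreezes all modules.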

3 Deployment with LMDeploy

3.1 Environment Setup

First, set up the LMDeploy environment:


conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy==0.6.1 gradio==4.44.1 timm==1.0.9
# pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu11torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

3.2 Basic Usage

First, test lmdeploy's pipeline interface:


    ## 1. Import the required packages
    from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
    from lmdeploy.vl import load_image

    ## 2. Initialize the inference pipeline with your model
    model_path = "/root/share/new_models/OpenGVLab/InternVL2-2B/"
    pipe = pipeline(model_path,
                    backend_config=TurbomindEngineConfig(session_len=8192))

    ## 3. Load the image (reading it with PIL also works)
    image = load_image('1.jpg')

    ## 4. Configure the generation parameters
    gen_config = GenerationConfig(top_p=0.8, temperature=0.8)
    ## 5. Chat through the pipeline.chat interface, passing in the generation config
    sess = pipe.chat(('describe this image', image), gen_config=gen_config)
    print(sess.response.text)
    ## 6. Later turns must pass in the previous session so the model knows the history
    sess = pipe.chat('What is the cat doing?', session=sess, gen_config=gen_config)
    print(sess.response.text)

The inference result:

The image features a cat standing in front of a vertical striped background that resembles a ruler. The cat is wearing a white apron over its chest and is holding a piece of paper with text written in Thai script. The text on the paper appears to be a message or a statement. The cat's expression is neutral, and it is looking directly at the camera. The background and the cat's attire give the image a humorous and playful tone



3.3 API

lmdeploy serve api_server /root/share/new_models/OpenGVLab/InternVL2-2B/ --server-port 23333 --cache-max-entry-count 0.4


Call the API with the following code:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the image please',
        }, {
            'type': 'image_url',
            'image_url': {'url': '1.jpg'},
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
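If the server cannot resolve a local path such as '1.jpg', a common alternative is to inline the image as a base64 data URI, which OpenAI-compatible endpoints accept (a sketch continuing from the client above):

import base64

with open('1.jpg', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe the image please'},
            {'type': 'image_url', 'image_url': {'url': f'data:image/jpeg;base64,{b64}'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)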


3.4 Gradio

First, clone the InternVL2 WebUI code and launch it:

cd /root
git clone https://github.com/Control-derek/InternVL2-Tutorial.git
cd InternVL2-Tutorial
conda activate lmdeploy
python demo.py

The demo code is as follows:

import os
import random
import numpy as np
import torch
import torch.backends.cudnn as cudnn
import gradio as gr

from utils import load_json, init_logger
from demo import ConversationalAgent, CustomTheme

FOOD_EXAMPLES = "demo/food_for_demo.json"
OUTPUT_PATH = "./outputs"
MODEL_PATH = "/root/share/new_models/OpenGVLab/InternVL2-2B"

def setup_seeds():
    seed = 42

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    cudnn.benchmark = False
    cudnn.deterministic = True


def main():
    setup_seeds()
    # logging
    init_logger(OUTPUT_PATH)
    # food examples
    food_examples = load_json(FOOD_EXAMPLES)
    
    agent = ConversationalAgent(model_path=MODEL_PATH,
                                outputs_dir=OUTPUT_PATH)
    
    theme = CustomTheme()
    
    titles = [
        """<center><B><font face="Comic Sans MS" size=10>书生大模型实战营</font></B></center>"""  ## Kalam:wght@700
        """<center><B><font face="Courier" size=5>「进阶岛」InternVL 多模态模型部署微调实践</font></B></center>"""
    ]
    
    language = """Language: 中文 and English"""
    with gr.Blocks(theme) as demo_chatbot:
        for title in titles:
            gr.Markdown(title)
        # gr.Markdown(article)
        gr.Markdown(language)
        
        with gr.Row():
            with gr.Column(scale=3):
                start_btn = gr.Button("Start Chat", variant="primary", interactive=True)
                clear_btn = gr.Button("Clear Context", interactive=True)
                image = gr.Image(type="pil", interactive=False)
                upload_btn = gr.Button("🖼️ Upload Image", interactive=True)
                
                with gr.Accordion("Generation Settings"):                    
                    top_p = gr.Slider(minimum=0, maximum=1, step=0.1,
                                      value=0.8,
                                      interactive=True,
                                      label='top-p value',
                                      visible=True)
                    
                    temperature = gr.Slider(minimum=0, maximum=1.5, step=0.1,
                                            value=0.8,
                                            interactive=True,
                                            label='temperature',
                                            visible=True)
                    
            with gr.Column(scale=7):
                chat_state = gr.State()
                chatbot = gr.Chatbot(label='InternVL2', height=800, avatar_images=((os.path.join(os.path.dirname(__file__), 'demo/user.png')), (os.path.join(os.path.dirname(__file__), "demo/bot.png"))))
                text_input = gr.Textbox(label='User', placeholder="Please click the <Start Chat> button to start chat!", interactive=True)
                gr.Markdown("### 输入示例")
                def on_text_change(text):
                    return gr.update(interactive=True)
                text_input.change(fn=on_text_change, inputs=text_input, outputs=text_input)
                gr.Examples(
                    examples=[["图片中的食物通常属于哪个菜系?"],
                              ["如果让你简单形容一下品尝图片中的食物的滋味,你会描述它"],
                              ["去哪个地方游玩时应该品尝当地的特色美食图片中的食物?"],
                              ["食用图片中的食物时,一般它上菜或摆盘时的特点是?"]],
                    inputs=[text_input]
                )
        
        with gr.Row():
            gr.Markdown("### 食物快捷栏")
        with gr.Row():
            example_xinjiang_food = gr.Examples(examples=food_examples["新疆菜"], inputs=image, label="新疆菜")
            example_sichuan_food = gr.Examples(examples=food_examples["川菜(四川,重庆)"], inputs=image, label="川菜(四川,重庆)")
            example_xibei_food = gr.Examples(examples=food_examples["西北菜 (陕西,甘肃等地)"], inputs=image, label="西北菜 (陕西,甘肃等地)")
        with gr.Row():
            example_guizhou_food = gr.Examples(examples=food_examples["黔菜 (贵州)"], inputs=image, label="黔菜 (贵州)")
            example_jiangsu_food = gr.Examples(examples=food_examples["苏菜(江苏)"], inputs=image, label="苏菜(江苏)")
            example_guangdong_food = gr.Examples(examples=food_examples["粤菜(广东等地)"], inputs=image, label="粤菜(广东等地)")
        with gr.Row():
            example_hunan_food = gr.Examples(examples=food_examples["湘菜(湖南)"], inputs=image, label="湘菜(湖南)")
            example_fujian_food = gr.Examples(examples=food_examples["闽菜(福建)"], inputs=image, label="闽菜(福建)")
            example_zhejiang_food = gr.Examples(examples=food_examples["浙菜(浙江)"], inputs=image, label="浙菜(浙江)")
        with gr.Row():
            example_dongbei_food = gr.Examples(examples=food_examples["东北菜 (黑龙江等地)"], inputs=image, label="东北菜 (黑龙江等地)")
            
                
        start_btn.click(agent.start_chat, [chat_state], [text_input, start_btn, clear_btn, image, upload_btn, chat_state])
        clear_btn.click(agent.restart_chat, [chat_state], [chatbot, text_input, start_btn, clear_btn, image, upload_btn, chat_state], queue=False)
        upload_btn.click(agent.upload_image, [image, chatbot, chat_state], [image, chatbot, chat_state])
        text_input.submit(
            agent.respond,
            inputs=[text_input, image, chatbot, top_p, temperature, chat_state], 
            outputs=[text_input, image, chatbot, chat_state]
        )

    demo_chatbot.queue()  # queue() must be called before launch() to take effect
    demo_chatbot.launch(share=True, server_name="127.0.0.1", server_port=1096, allowed_paths=['./'])
    

if __name__ == "__main__":
    main()


Next, forward the port with the following command:

ssh -CNg -L 1096:127.0.0.1:1096 root@ssh.intern-ai.org.cn -p <your SSH port>


Open http://localhost:1096/ to enter the WebUI:


Bug fix: if demo.py fails with an error inside lmdeploy's vl engine, patch it as follows.

Edit the file /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py: comment out lines 126-127 and add:

self._create_event_loop_task()


Guangdong cuisine, rice noodle rolls (肠粉): (screenshot)

Northeastern cuisine, guobaorou (锅包肉): (screenshot)

4 Fine-tuning with XTuner

4.1 Environment Setup

conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env
pip install xtuner==0.1.23 timm==1.0.9
pip install 'xtuner[deepspeed]'
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.39.0 tokenizers==0.15.2 peft==0.13.2 datasets==3.1.0 accelerate==1.2.0 huggingface-hub==0.26.5 

4.2 Dataset

Dataset: huggingface.co/datasets/ly…

Path on InternStudio: /root/share/datasets/FoodieQA

The label file sivqa_llava.json uses the format shown below.

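An illustrative record (the field values here are hypothetical, but the conversion script below produces exactly this structure):

[
    {
        "image": "changfen.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "图片中的食物通常属于哪个菜系?\n<image>"
            },
            {
                "from": "gpt",
                "value": "粤菜,图中的菜是肠粉"
            }
        ]
    }
]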


Dataset conversion script:

import json
import os

input_dir = "/root/share/datasets/FoodieQA"           # directory containing sivqa_tidy.json
output_path = "/root/finetune/data/sivqa_llava.json"  # output file
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(os.path.join(input_dir, "sivqa_tidy.json"), 'r', encoding='utf-8') as f:
    foodqa = json.load(f)

llava_format = []
for data in foodqa:
    llava_format.append({
        "image": data['food_meta']['food_file'],
        "conversations": [
            {
                "from": "human",
                "value": data['question']+"\n<image>"
            },
            {
                "from": "gpt",
                "value": data['choices'][int(data['answer'])] + ",图中的菜是"+ data['food_meta']['food_name']
            }
        ]
    })

with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(llava_format, f, indent=4, ensure_ascii=False)
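A quick sanity check that the converted file loads and has the expected shape (path as configured above):

import json

with open("/root/finetune/data/sivqa_llava.json", encoding="utf-8") as f:
    records = json.load(f)
print(len(records))  # number of samples
print(records[0])    # the first LLaVA-format record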

4.3 Fine-tuning

cd /root/xtuner
conda activate xtuner-env  # or whatever you named your training environment
cp /root/InternVL2-Tutorial/xtuner_config/internvl_v2_internlm2_2b_lora_finetune_food.py /root/xtuner/xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py
xtuner train xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py --work-dir workspace/internvl2/


After fine-tuning, convert the model weights:

python xtuner/configs/internvl/v1_5/convert_to_official.py xtuner/configs/internvl/v2/internvl_v2_internlm2_2b_lora_finetune_food.py  workspace/internvl2/internvl_v2_internlm2_2b_lora_finetune_food/iter_640.pth workspace/internvl2/internvl_v2_internlm2_2b_lora_finetune_food/lr35_ep10/

If you changed the hyperparameters, replace iter_640.pth (iter_xxx.pth) with the checkpoint you actually want to convert.

4.4 Gradio

Update the model weight path in demo.py, then restart it:

MODEL_PATH = "/root/finetune/workspace/internvl4/lr35_step448/"  # updated weight path in demo.py

conda activate lmdeploy
python demo.py

Re-running the two examples from section 3.4, both answers are now correct: the model has indeed learned to recognize dishes from images. Its replies are also fairly brief, probably because the answers in the training set are short.

Guangdong cuisine, rice noodle rolls (肠粉): (screenshot)

Northeastern cuisine, guobaorou (锅包肉): (screenshot)

Trying a cat picture again shows the model has overfitted, haha~
