Installing MiniGPT-4 on Windows


Set Up the Python Environment

Refer to "Installing Stable Diffusion WebUI on Windows" (juejin.cn) for setting up the Python environment.

MiniGPT-4

Introduction

MiniGPT-4 aligns a pretrained vision encoder from BLIP-2 with the Vicuna-7B large language model through a single linear projection layer, which gives it GPT-4-like multimodal image-understanding capabilities.

  • MiniGPT-4 is trained in two stages. The first, traditional pretraining stage uses roughly 5 million aligned image-text pairs and takes about 10 hours on 4 A100s. After this stage, Vicuna can understand images, but its generation ability is severely degraded.
  • To address this and improve usability, we propose a novel way to create high-quality image-text pairs using the model itself together with ChatGPT. Based on this, we build a small (3,500 pairs in total) but high-quality dataset.
  • The second finetuning stage trains on this dataset with a conversation template, significantly improving generation reliability and overall usability. Surprisingly, this stage is computationally efficient, taking only around 7 minutes on a single A100.
  • MiniGPT-4 exhibits many emerging vision-language capabilities similar to those demonstrated in GPT-4.
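
The core idea, a frozen vision encoder and a frozen LLM bridged by one trainable linear layer, can be sketched roughly as follows. This is an illustrative PyTorch snippet, not the actual MiniGPT-4 code; the class name and dimensions (768 visual features, 4096 LLM hidden size) are assumptions for the example.

import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    # maps frozen visual features (e.g. BLIP-2 Q-Former outputs) into the LLM embedding space
    def __init__(self, vision_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable component

    def forward(self, visual_features):
        # visual_features: (batch, num_tokens, vision_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(visual_features)

# toy usage: project 32 visual tokens to the LLM hidden size
visual_tokens = torch.randn(1, 32, 768)
llm_inputs = VisionToLLMProjector()(visual_tokens)
print(llm_inputs.shape)  # torch.Size([1, 32, 4096])

The projected tokens are then fed to Vicuna alongside the text prompt.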


Source Code and Environment

Downloading the Source Code

GitHub - Vision-CAIR/MiniGPT-4: MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models

git clone --depth 1 https://github.com/Vision-CAIR/MiniGPT-4.git

Dependencies

cd MiniGPT-4
# create the python virtual environment and install the dependencies
conda env create -f environment.yml
# update the environment, or retry with this if conda env create reported errors
conda env update -f environment.yml
  1. Dependency installation error: Failed to build pycocotools — error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/. As the error message says, you need to install the Microsoft C++ Build Tools (via Visual Studio) and then rerun the command above. Full output:
Failed to build pycocotools

Pip subprocess error:
  error: subprocess-exited-with-error

  × Building wheel for pycocotools (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-39
      creating build\lib.win-amd64-cpython-39\pycocotools
      copying pycocotools\coco.py -> build\lib.win-amd64-cpython-39\pycocotools
      copying pycocotools\cocoeval.py -> build\lib.win-amd64-cpython-39\pycocotools
      copying pycocotools\mask.py -> build\lib.win-amd64-cpython-39\pycocotools
      copying pycocotools\__init__.py -> build\lib.win-amd64-cpython-39\pycocotools
      running build_ext
      cythoning pycocotools/_mask.pyx to pycocotools\_mask.c
      C:\Users\xxx\AppData\Local\Temp\pip-build-env-wizy_vdo\overlay\Lib\site-packages\Cython\Compiler\Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: C:\Users\leoli\AppData\Local\Temp\pip-install-jusbsrop\pycocotools_4ae91584b5bc48c2b4532a174f86344f\pycocotools\_mask.pyx
        tree = Parsing.p_module(s, pxd, full_module_name)
      building 'pycocotools._mask' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pycocotools
ERROR: Could not build wheels for pycocotools, which is required to install pyproject.toml-based projects

failed

CondaEnvException: Pip failed
Once the Build Tools are installed and conda env create / conda env update completes successfully, activate the environment:

conda activate minigpt4
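
A quick, optional sanity check (my own addition, not part of the official setup) that the new environment has a CUDA-enabled PyTorch and that pycocotools actually built:

# run inside the activated minigpt4 environment, e.g. as check_env.py
import torch
import pycocotools  # this import fails if the C++ build step above did not succeed

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())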

Model Configuration

Memory requirements: about 23 GB of GPU memory for Vicuna 13B and 11.5 GB for Vicuna 7B. Vicuna comes in 13B and 7B variants; my GPU has 12 GB of VRAM, so I went with 7B.

For more powerful GPUs, you can run the model in 16 bit by setting low_resource to False in the config file minigpt4_eval.yaml and use a larger beam search width.
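
If you are unsure which variant fits your card, a small script like the following (my own helper, not part of the repository) prints the total VRAM and a rough suggestion based on the numbers above:

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected")
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Total VRAM: {total_gb:.1f} GB")
print("Suggestion:", "Vicuna 13B" if total_gb >= 23 else "Vicuna 7B")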

  1. Vicuna model. Because of the LLaMA open-source license, models finetuned from LLaMA cannot be redistributed as full weight files. Several ways of publishing such models have emerged:
  • add side parameters with LoRA
  • for full-parameter finetunes, release only a delta file
    • for example a bitwise XOR delta, which is convenient to construct: for each bit, x xor y = delta and x xor delta = y
    • or, even simpler, a plain element-wise difference that is added back at merge time, which is what Vicuna uses (see the sketch after this list)
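
As a toy illustration of the plain-difference approach (conceptual only; the real merge is performed later by fastchat.model.apply_delta):

import torch

base = torch.randn(4, 4)        # stands in for a LLaMA base weight tensor
finetuned = torch.randn(4, 4)   # stands in for the Vicuna weight that cannot be released
delta = finetuned - base        # what the Vicuna team publishes instead
recovered = base + delta        # what is reconstructed locally from base + delta
assert torch.allclose(recovered, finetuned)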

The following describes how to set up the Vicuna weights.

git lfs install
git clone https://huggingface.co/decapoda-research/llama-7b-hf
git lfs install
git clone https://huggingface.co/lmsys/vicuna-7b-delta-v0

Once both sets of weights are downloaded, we can use the tool provided by the Vicuna team to create the real working weights. First, install the library version compatible with Vicuna v0:

pip install git+https://github.com/lm-sys/FastChat.git@v0.1.10

Create a Model folder under D:/path/minigpt4/ and move the two downloaded models into it, so that the paths are as follows:

  • LLaMA-7B model path: path/minigpt4/Model/llama-7b-hf

  • vicuna-7b-delta-v0 model path: path/minigpt4/Model/vicuna-7b-delta-v0

  • final output path for the working Vicuna weights: path/minigpt4/Model/working-vicuna

Finally, run the following command to create the final working weights:

python -m fastchat.model.apply_delta --base path/minigpt4/Model/llama-7b-hf/ --target path/minigpt4/Model/working-vicuna/ --delta path/minigpt4/Model/vicuna-7b-delta-v0/

The merged weights are written to path/minigpt4/Model/working-vicuna; the directory structure looks like this:

vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
... 
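
As an optional sanity check (my own addition, with the output path assumed from above), the merged weights should load cleanly with transformers without touching the GPU:

from transformers import AutoConfig, LlamaTokenizer

path = "path/minigpt4/Model/working-vicuna"
config = AutoConfig.from_pretrained(path)         # reads config.json
tokenizer = LlamaTokenizer.from_pretrained(path)  # reads the tokenizer files
print(config.model_type, config.num_hidden_layers, "layers, vocab size", len(tokenizer))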

Edit path/minigpt4/configs/models/minigpt4.yaml and replace line 16 with your model path:

model:
  arch: mini_gpt4
...

  # Vicuna
  llama_model: "path/minigpt4/Model/working-vicuna"
 ...
  2. MiniGPT-4 checkpoint. Download the pretrained MiniGPT-4 checkpoint that matches your Vicuna version (7B or 13B) from the official repository, then set its path in eval_configs/minigpt4_eval.yaml as described in the official README.

Running the Demo

Run the following command to start MiniGPT-4:

python demo.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0

To save GPU memory, Vicuna loads as 8 bit by default, with a beam search width of 1. This configuration requires about 23G GPU memory for Vicuna 13B and 11.5G GPU memory for Vicuna 7B. For more powerful GPUs, you can run the model in 16 bit by setting low_resource to False in the config file minigpt4_eval.yaml and use a larger beam search width.
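
Conceptually, low_resource switches between 8-bit and 16-bit loading of the Vicuna weights. A simplified sketch of the two modes (illustrative only, not the actual MiniGPT-4 loading code; 8-bit loading additionally requires bitsandbytes and accelerate):

import torch
from transformers import AutoModelForCausalLM

path = "path/minigpt4/Model/working-vicuna"
low_resource = True

if low_resource:
    # 8-bit weights: roughly 11.5 GB of VRAM for Vicuna 7B
    model = AutoModelForCausalLM.from_pretrained(path, load_in_8bit=True, device_map="auto")
else:
    # full 16-bit weights: more VRAM, but allows a larger beam search width
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16).to("cuda")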

Training

The training of MiniGPT-4 contains two alignment stages.

1. First pretraining stage

In the first pretrained stage, the model is trained using image-text pairs from Laion and CC datasets to align the vision and language model. To download and prepare the datasets, please check our first stage dataset preparation instruction. After the first stage, the visual features are mapped and can be understood by the language model. To launch the first stage training, run the following command. In our experiments, we use 4 A100. You can change the save path in the config file train_configs/minigpt4_stage1_pretrain.yaml

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml

A MiniGPT-4 checkpoint with only stage one training can be downloaded here (13B) or here (7B). Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.

2. Second finetuning stage

In the second stage, we use a small high quality image-text pair dataset created by ourselves and convert it to a conversation format to further align MiniGPT-4. To download and prepare our second stage dataset, please check our second stage dataset preparation instruction. To launch the second stage alignment, first specify the path to the checkpoint file trained in stage 1 in train_configs/minigpt4_stage1_pretrain.yaml. You can also specify the output path there. Then, run the following command. In our experiments, we use 1 A100.

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml

After the second stage alignment, MiniGPT-4 is able to talk about the image in a coherent and user-friendly way.

Acknowledgement

  • BLIP2 The model architecture of MiniGPT-4 follows BLIP-2. Don't forget to check out this great open-source work if you haven't seen it before!
  • Lavis This repository is built upon Lavis!
  • Vicuna The fantastic language ability of Vicuna with only 13B parameters is just amazing. And it is open-source!

If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:

@misc{zhu2022minigpt4,
      title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models}, 
      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
      journal={arXiv preprint arXiv:2304.10592},
      year={2023},
}