FastChat + Vicuna v1.1: Deployment and Streaming Calls in Practice

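FastChat paired with the Vicuna v1.1 model, which is fine-tuned from LLaMA, gives quite good inference results even with the smaller 7B model. It runs on an inference card with 16 GB of VRAM and, notably, handles Chinese well.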

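FastChat is evolving quickly and, combined with unreliable network access, installation runs into a fair number of problems. Once installation and testing are done, some engineering work remains before it can be used in a product. This article walks through installing and deploying FastChat in detail, and provides a streaming-call solution and an application example for reference.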

You can see it in action at gitclone.com/aiit/chat/. The site supports three models: vicuna v1.1, ChatGLM-6b, and Salesforce codegen.

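1. Prerequisites

NVIDIA GPU with at least 16 GB of VRAM, such as a P100, T4, or 3090 (the 7B model needs 12 GB)
Linux (not tested on Windows, though it should work there too)
conda virtual environment

For GPU driver and conda setup, see zhuanlan.zhihu.com/p/597063490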
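The 12 GB figure can be sanity-checked with simple arithmetic. The sketch below estimates only the memory needed to hold the weights, assuming the 7B model has roughly 6.7 billion parameters stored in half precision (2 bytes each). The parameter count is an approximation supplied for illustration, and activations and the KV cache add more on top.

```python
# Rough estimate of the GPU memory needed just to hold model weights.
# Assumption: the 7B model has about 6.7 billion parameters (approximate figure).

def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the size of the weights in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30

fp16_gib = weight_memory_gib(6.7e9)      # half precision, 2 bytes per parameter
fp32_gib = weight_memory_gib(6.7e9, 4)   # full precision, for comparison
print(f"fp16: {fp16_gib:.1f} GiB, fp32: {fp32_gib:.1f} GiB")
```

Under these assumptions the half-precision weights land between 12 and 13 GiB, which is consistent with recommending a 16 GB card.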

2. Create the virtual environment

conda create -n fastchat python=3.10
conda activate fastchat

3. Base installation

git clone https://github.com/lm-sys/FastChat --depth=1
cd FastChat
python -m pip install --upgrade pip -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn
pip install -e . -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn
pip install --upgrade tokenizers -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn
pip install protobuf==3.19.0 -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn

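A few notes here. First, access to github.com is unstable; gitclone.com can be used to speed it up. Second, pip install should go through a domestic PyPI mirror (the commands above use the USTC mirror). Third, pip install -e . fails often and may need several retries. Fourth, tokenizers and protobuf have additional version requirements.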
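Since the editable install may need several attempts, the retrying can be automated. The following is a minimal sketch rather than part of FastChat: `run_with_retries` and `PIP_INSTALL_CMD` are hypothetical names, and the attempt count and delay are arbitrary defaults.

```python
import subprocess
import sys
import time

def run_with_retries(cmd, attempts=5, delay_seconds=10.0):
    """Run cmd until it exits with code 0 or the attempts are used up.

    Returns the number of attempts used on success; raises RuntimeError otherwise.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)  # give the network a moment before retrying
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")

# The editable install from step 3, using the same mirror as the commands above.
PIP_INSTALL_CMD = [
    sys.executable, "-m", "pip", "install", "-e", ".",
    "-i", "https://pypi.mirrors.ustc.edu.cn/simple",
    "--trusted-host=pypi.mirrors.ustc.edu.cn",
]
```

Calling `run_with_retries(PIP_INSTALL_CMD)` from inside the FastChat directory repeats the install until it succeeds or five attempts have failed.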

4. Download the LLaMA model

export GIT_TRACE=1
export GIT_CURL_VERBOSE=1
pip install git+https://github.com/juncongmoo/pyllama
python -m llama.download --model_size 7B

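This uses the pyllama downloader, which can resume interrupted downloads. A few notes: the pyllama version installed with a plain pip install is too old and leaves the model download half manual, so it is best to install the latest version from GitHub with the command above. The first two environment variables are set so you can see where access to github.com stalls. For ways to speed things up, see 小五哥: "github常见加速方法" (common GitHub acceleration methods).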

5. Convert the model to half precision

(1) First locate the convert_llama_weights_to_hf.py script (usually under the directory of the corresponding conda virtual environment)

sudo find  / -name convert_llama_weights_to_hf.py

(2) Convert

cd pyllama_data
python convert_llama_weights_to_hf.py --input_dir ./ --model_size 7B --output_dir ./output/7B

Note: call convert_llama_weights_to_hf.py by its full path, or copy it into the pyllama_data directory.
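As an alternative to searching the whole filesystem, the script can be looked up inside the active environment. This sketch assumes the script ships inside the installed transformers package at `models/llama/convert_llama_weights_to_hf.py`; `find_in_package` is a hypothetical helper, and if it finds nothing the find command above remains the fallback.

```python
import importlib.util
from pathlib import Path
from typing import Optional

def find_in_package(package: str, relative_path: str) -> Optional[Path]:
    """Return the path of a file inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed, or not a regular package
    for location in spec.submodule_search_locations:
        candidate = Path(location) / relative_path
        if candidate.is_file():
            return candidate
    return None

# Assumption: this is where the conversion script lives inside transformers.
script = find_in_package("transformers", "models/llama/convert_llama_weights_to_hf.py")
print(script if script else "not found; fall back to the find command")
```

The printed path can be passed directly to python in the conversion command.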

6. Download the Vicuna model

cd ..
git clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1 --depth=1
cd vicuna-7b-delta-v1.1

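Note that git has no LFS support by default, so the clone does not contain the binary model files. Download the two .bin files from huggingface.co/lmsys/vicun… and place them in the vicuna-7b-delta-v1.1 directory.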
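Before merging, it is worth confirming that the .bin files are real weights. The sketch below assumes that a clone without LFS leaves small text pointer files in place of the weights, and relies on the fact that Git LFS pointer files begin with a fixed version line. The helper names are hypothetical.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path) -> bool:
    """True if the file is a Git LFS pointer rather than real data."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

def find_lfs_pointers(directory, pattern="*.bin"):
    """List files in directory matching pattern that are still LFS pointers."""
    return sorted(p for p in Path(directory).glob(pattern) if is_lfs_pointer(p))
```

If `find_lfs_pointers("vicuna-7b-delta-v1.1")` returns an empty list, the downloaded .bin files have replaced any pointers.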

7. Merge the two models

cd ..
python -m fastchat.model.apply_delta --base ./pyllama_data/output/7B --target vicuna_data/vicuna-7b-v1.1 --delta ./vicuna-7b-delta-v1.1
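Conceptually, the merge adds the released delta to the base LLaMA weights, parameter by parameter. The toy sketch below illustrates that idea with plain Python lists. It is not FastChat's implementation: `add_delta` and the parameter names are hypothetical, and real checkpoints hold tensors rather than lists.

```python
def add_delta(base, delta):
    """Return target weights where target[name] = base[name] + delta[name]."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target = {}
    for name in base:
        if len(base[name]) != len(delta[name]):
            raise ValueError(f"shape mismatch for {name}")
        target[name] = [b + d for b, d in zip(base[name], delta[name])]
    return target

# Toy "checkpoints": each parameter is a flat list of floats.
base = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.5]}
delta = {"layer0.weight": [0.25, -0.5], "layer0.bias": [0.0]}
target = add_delta(base, delta)
print(target)
```

This is why both the converted base model and the delta must be complete before this step: a missing or mismatched parameter on either side makes the sum undefined.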

8. Command-line test

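Assuming the second GPU is used, run the command below. You can then type questions at the prompt to test it, including questions in Chinese.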

CUDA_VISIBLE_DEVICES=1 python -m fastchat.serve.cli --model-path vicuna_data/vicuna-7b-v1.1

9. Streaming API test

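The fastchat/serve/api.py in github.com/lm-sys/Fast… does not currently stream its response: it pulls inference results from the model incrementally over streaming httpx, but returns to the client only after the full result is ready, which makes it unusable in a product. I forked the project to github.com/little51/Fa… and added a simple streaming endpoint in fastchat/serve/api_stream.py. Download that file and place it under FastChat/fastchat/serve in your local clone. To avoid opening several shell windows, I provide the following script, which runs everything from a single shell window.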

pkill -9 -f fastchat
nohup python -u -m fastchat.serve.controller >> fastchat.log  2>&1 &
CUDA_VISIBLE_DEVICES=1 nohup python -u -m fastchat.serve.model_worker --model-name 'vicuna-7b-v1.1' --model-path vicuna_data/vicuna-7b-v1.1 >> fastchat.log  2>&1 &
FASTCHAT_CONTROLLER_URL=http://localhost:21001 nohup python -u -m fastchat.serve.api_stream --host 0.0.0.0 --port 8000 >> fastchat.log  2>&1 &
tail -f  fastchat.log
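The script launches all three processes in the background without waiting for one another, so a request sent too early may fail. A small readiness check can help. This is a sketch: `wait_for_port` is a hypothetical helper, the port numbers come from the script above (21001 for the controller, 8000 for the API), and an open port only shows that the process is listening, not that the model has finished loading.

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout_seconds: float = 60.0) -> bool:
    """Poll until a TCP connection to host:port succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not listening yet, try again shortly
    return False
```

For example, `wait_for_port("localhost", 21001)` followed by `wait_for_port("localhost", 8000)` confirms that the controller and the streaming API are accepting connections before testing.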

10. Test

Send the POST request with curl repeatedly until the [stop] stop word appears, which means the full inference result has been retrieved.

curl http://localhost:8000/v1/chat/completions/stream   \
-H "Content-Type: application/json"  \
-d '{"model": "vicuna-7b-v1.1","messages": [{"role": "user", "content": "请写一篇100字的日记"}]}'
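The same polling loop can be written in Python for use inside an application. This is a minimal sketch under stated assumptions: the response format of the streaming endpoint is not described here, so each response is treated as opaque text that eventually contains [stop]. `post_chat` and `poll_until_stop` are hypothetical helper names, and `max_polls` is an arbitrary safety limit.

```python
import json
import urllib.request

STREAM_URL = "http://localhost:8000/v1/chat/completions/stream"
STOP_TOKEN = "[stop]"

def post_chat(prompt, model="vicuna-7b-v1.1", url=STREAM_URL):
    """Send one POST with the same JSON body as the curl command; return the raw text."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")

def poll_until_stop(fetch, max_polls=200):
    """Call fetch() repeatedly, yielding each response until one contains STOP_TOKEN."""
    for _ in range(max_polls):
        text = fetch()
        yield text
        if STOP_TOKEN in text:
            return
```

A caller would loop over `poll_until_stop(lambda: post_chat("请写一篇100字的日记"))` and display each response as it arrives. Passing the fetch function in as an argument keeps the loop testable without a running server.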