Deploying and calling the ChatGLM3-6B model with Transformers, the ChatGLM3 project, and a custom FastAPI application

Deploying and calling with Transformers

Environment setup

Upgrade pip

python -m pip install --upgrade pip

Switch to a PyPI mirror

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

Create and activate a virtual environment

conda create -n chatglm3-6b python=3.10
conda activate chatglm3-6b

Install PyTorch

Note: choose the PyTorch build that matches your GPU driver version and CUDA version.

PyTorch version list

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
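After installation it can help to confirm that the installed build actually sees the GPU. The following is a minimal sketch; the helper name describe_torch is hypothetical, and it returns None when torch is not installed rather than failing.

```python
import importlib.util


def describe_torch():
    """Return basic facts about the installed PyTorch build, or None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch())
```

If cuda_available is False on a machine with a GPU, the driver and the CUDA version of the wheel are the first things to compare.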

Install dependencies

pip install modelscope
pip install transformers

Download the model

Download the model with the snapshot_download function from the modelscope library (ModelScope)

from modelscope import snapshot_download

# Specify the model name, the download path, and the revision
model_dir = snapshot_download('ZhipuAI/chatglm3-6b', cache_dir='/root/models', revision='master')
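snapshot_download returns the local directory that holds the model files. As a quick sanity check before loading, the sketch below counts the files under a directory and sums their size. The helper summarize_model_dir is a hypothetical name introduced here, not part of modelscope.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Return (file_count, total_size_gb) for every regular file under model_dir."""
    files = [p for p in Path(model_dir).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / (1024 ** 3)


# Example usage with the value returned by snapshot_download:
# count, size_gb = summarize_model_dir(model_dir)
# print(f"{count} files, {size_gb:.2f}GB in {model_dir}")
```

Printing model_dir also shows the exact path to pass to from_pretrained in the next step.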

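Load the model

# Import AutoTokenizer and AutoModelForCausalLM from the transformers library to load the tokenizer and the chat model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Local path of the downloaded model; adjust it to the directory returned by snapshot_download
model_dir = './chatglm3-6b'
# Load the tokenizer from the local model; trust_remote_code=True allows the custom Python code shipped with the model to run
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Load the model from the local path with AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)

# Move the model to the GPU if one is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
model.to(device)
# Switch the model to evaluation mode for inference
model.eval()

# First round of the conversation
response, history = model.chat(tokenizer, "你好", history=[])
print(response + "\n")
# Second round, passing in the history returned by the first
response, history = model.chat(tokenizer, "你是谁", history=history)
print(response + "\n")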
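The two rounds above follow one pattern: each call returns a (response, history) pair, and the returned history is passed into the next call. A minimal sketch of that loop is shown below. run_dialogue and fake_chat are hypothetical names, and fake_chat is only a stand-in so the example runs without a GPU; its history entries do not reproduce the model's real history format.

```python
def run_dialogue(chat_fn, prompts):
    """Send prompts in order, feeding each returned history into the next turn."""
    history = []
    responses = []
    for prompt in prompts:
        response, history = chat_fn(prompt, history)
        responses.append(response)
    return responses, history


def fake_chat(prompt, history):
    """Stand-in for model.chat: echoes the prompt and appends the turn to the history."""
    reply = f"echo: {prompt}"
    return reply, history + [(prompt, reply)]


responses, history = run_dialogue(fake_chat, ["你好", "你是谁"])
print(responses)
print(len(history))
```

With the real model, the callable would be something like lambda p, h: model.chat(tokenizer, p, history=h).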


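Model GPU memory usage

By default the model is loaded in half precision (float16), and the weights need just under 13GB of GPU memory.

Run nvidia-smi to check GPU memory usage.

Get the GPU memory actually occupied by the current model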

memory_bytes = model.get_memory_footprint()
# Convert bytes to GB
memory_gb = memory_bytes / (1024 ** 3)
print(f"{memory_gb:.2f}GB")  # 11.63GB

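Note: this figure is lower than what the process actually occupies in nvidia-smi; the difference is GPU memory that PyTorch reserves on top of the weights (for example its CUDA context and allocator cache).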
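The footprint can be cross-checked with simple arithmetic: the number of parameters times the bytes per parameter. The sketch below assumes roughly 6.24 billion parameters, an approximate figure that is not stated in this article, and ignores buffers and activations.

```python
def estimate_weight_gb(num_params, bytes_per_param):
    """Estimate weight memory in GB (1 GB = 1024**3 bytes)."""
    return num_params * bytes_per_param / (1024 ** 3)


APPROX_PARAMS = 6.24e9  # assumption: approximate parameter count of ChatGLM3-6B

for dtype_name, size in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{dtype_name}: {estimate_weight_gb(APPROX_PARAMS, size):.2f}GB")
```

Under that assumption the float16 estimate lands close to the 11.63GB reported by get_memory_footprint().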

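Deploying and calling via the ChatGLM3 project

The ChatGLM3 project

Deploy ChatGLM3-6B using the open-source ChatGLM3 project.

The ChatGLM3 project provides API deployment code for open-source models in both OpenAI and ZhipuAI formats; the OpenAI-format API is used here.

Clone the ChatGLM3 repository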

git clone https://github.com/THUDM/ChatGLM3

cd ChatGLM3

Install dependencies with pip

Note: if torch is already installed in the environment, edit requirements.txt and remove the torch entry first

pip install -r requirements.txt
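Removing the torch entry can also be scripted. The sketch below filters out one package by exact name, so torchvision survives when torch is dropped. drop_requirement is a hypothetical helper, and the sample lines are illustrative, not the actual contents of the project's requirements.txt.

```python
import re


def drop_requirement(lines, package):
    """Return the requirement lines with the exactly named package removed."""
    # Match the name alone, or the name followed by a version specifier, marker, extra, URL, or comment
    pattern = re.compile(
        rf"^{re.escape(package)}\s*(?:[<>=!~;\[#@].*)?$", re.IGNORECASE
    )
    return [line for line in lines if not pattern.match(line.strip())]


sample = ["transformers>=4.38.1", "torch>=2.1.0", "torchvision", "fastapi"]
print(drop_requirement(sample, "torch"))
```

To apply it to a real file, read the lines, filter them, and write them back before running pip install -r requirements.txt.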

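Configure the model and embedding model paths

Edit ChatGLM3/openai_api_demo/api_server.py and set the paths of the large language model and the embedding model

# Path of the large language model
MODEL_PATH = os.environ.get('MODEL_PATH', '/root/models/chatglm3-6b')

# Path of the embedding model
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', '/root/models/bge-large-zh')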
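Because both settings are read with os.environ.get, the second argument is only a fallback: an environment variable with the same name takes precedence. The sketch below shows that behavior in isolation. resolve_path is a hypothetical helper, and /data/models/chatglm3-6b is a made-up path used only for illustration.

```python
import os


def resolve_path(env_name, default):
    """Return the environment variable's value if it is set, otherwise the default."""
    return os.environ.get(env_name, default)


os.environ.pop("MODEL_PATH", None)  # start from an unset variable
print(resolve_path("MODEL_PATH", "/root/models/chatglm3-6b"))  # falls back to the default

os.environ["MODEL_PATH"] = "/data/models/chatglm3-6b"  # hypothetical override
print(resolve_path("MODEL_PATH", "/root/models/chatglm3-6b"))  # uses the override
```

In practice this means the paths can be supplied by exporting MODEL_PATH and EMBEDDING_PATH before starting the server, without editing the file.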


Deploy the API

Start the API server

python openai_api_demo/api_server.py

The startup log looks like this:

(ChatGLM3) root@master:~/work/ChatGLM3/openai_api_demo# python api_server.py
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:13<00:00,  2.69s/it]
INFO:     Started server process [437763]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Testing with curl

curl -X POST "http://192.168.5.210:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\": \"chatglm3-6b\", \"messages\": [{\"role\": \"system\", \"content\": \"You are ChatGLM3, a large language model trained by Zhipu.AI\"}, {\"role\": \"user\", \"content\": \"你好，给我讲一个故事，大概100字\"}], \"stream\": false, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
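The escaped quotes in the request body are easy to get wrong by hand. As an alternative, the same body can be generated with json.dumps, as in the sketch below. build_chat_payload is a hypothetical helper; the field names and default values are copied from the curl command above.

```python
import json


def build_chat_payload(
    user_text,
    system_text="You are ChatGLM3, a large language model trained by Zhipu.AI",
    max_tokens=100,
    temperature=0.8,
    top_p=0.8,
    stream=False,
):
    """Serialize a chat completion request body with the same fields as the curl example."""
    body = {
        "model": "chatglm3-6b",
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "stream": stream,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    # ensure_ascii=False keeps Chinese text readable instead of \u escapes
    return json.dumps(body, ensure_ascii=False)


print(build_chat_payload("你好，给我讲一个故事，大概100字"))
```

The printed string can be saved to a file and passed to curl with -d @payload.json.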


Testing with Python code

Install the openai library

pip install openai

Write Python code to call the API

from openai import OpenAI

base_url = "http://127.0.0.1:8000/v1/"
# The client requires a non-empty api_key, so a placeholder value is passed
client = OpenAI(api_key="EMPTY", base_url=base_url)

def simple_chat(use_stream=True):
    messages = [
        {"role": "system", "content": "你是一个乐于助人的助手"},
        {"role": "user", "content": "你是谁？"},
    ]

    # Failed requests surface as exceptions raised by the client
    response = client.chat.completions.create(
        model="chatglm3-6b",
        messages=messages,
        stream=use_stream,
        max_tokens=256,
        temperature=0.8,
        presence_penalty=1.1,
        top_p=0.8)

    if use_stream:
        # Print each streamed fragment as it arrives; content can be None on some chunks
        for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print()
    else:
        # Non-streaming: the full reply is available at once
        print(response.choices[0].message.content)

if __name__ == '__main__':
    simple_chat(use_stream=False)
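In streaming mode the reply arrives as a sequence of chunks, each carrying a small piece of text in choices[0].delta.content, and that field can be None. A minimal sketch of assembling the full reply follows. collect_stream_text and fake_chunk are hypothetical names, and fake_chunk only imitates the attribute layout of a chunk so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream_text(chunks):
    """Join the text fragments of a streamed reply, skipping empty or None pieces."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Build an object with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


print(collect_stream_text([fake_chunk("你"), fake_chunk(None), fake_chunk("好")]))
```

With the real client, the iterable returned by client.chat.completions.create(..., stream=True) would be passed in place of the fake chunks.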

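Deploying and calling via a custom FastAPI application

The API deployment code provided by the ChatGLM3 project makes it easy to deploy the ChatGLM3-6B model as an API.

However, that code may not suit other AI models, so you can create a more general FastAPI application to deploy various models as APIs.

Install dependencies

pip install fastapi uvicorn requests

Create the FastAPI application

Create a file named api.py with the following code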

from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModelForCausalLM
import uvicorn
import json
import datetime
import torch

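def torch_gc():
    """
    Free cached GPU memory
    """
    if torch.cuda.is_available():  # Check whether CUDA is available
        with torch.cuda.device("cuda:0"):  # Select the CUDA device
            torch.cuda.empty_cache()  # Release cached memory held by PyTorch's allocator
            torch.cuda.ipc_collect()  # Clean up memory held by CUDA IPC handles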

# Create the FastAPI application
app = FastAPI()

@app.post("/")
async def create_item(request: Request):
    """
    Handle POST requests
    """
    global model, tokenizer  # Use the model and tokenizer loaded at startup
    json_post_list = await request.json()  # Parse the JSON request body into a Python dict
    prompt = json_post_list.get('prompt')  # Get the prompt parameter
    history = json_post_list.get('history')  # Get the history parameter
    max_length = json_post_list.get('max_length')  # Get the max_length parameter
    top_p = json_post_list.get('top_p')  # Get the top_p parameter
    temperature = json_post_list.get('temperature')  # Get the temperature parameter
    
    # Call the model to generate the reply
    response, history = model.chat(
        tokenizer,
        prompt,
        history=history,
        max_length=max_length if max_length else 2048,  # Default to 2048 if max_length is not provided
        top_p=top_p if top_p else 0.7,  # Default to 0.7 if top_p is not provided
        temperature=temperature if temperature else 0.6  # Default to 0.6 if temperature is not provided
    )

    now = datetime.datetime.now()  # Get the current time
    time = now.strftime("%Y-%m-%d %H:%M:%S")  # Format the time as a string

    # Build the JSON response
    answer = {
        "response": response,
        "history": history,
        "status": 200,
        "time": time
    }

    # Build the log message
    log = f"[{time}] prompt: {prompt}, response: {repr(response)}"
    print(log)  # Print the log
    torch_gc()  # Free cached GPU memory
    return answer  # Return the response


if __name__ == '__main__':
    model_path = "./chatglm3-6b"
    # Load the pretrained tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()
    model.eval()  # Switch the model to evaluation mode
    # Start the FastAPI application
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
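One detail of the handler is worth noting: the pattern top_p if top_p else 0.7 falls back to the default for any falsy value, so an explicit 0 is treated the same as a missing field. If that distinction matters, a check against None is one possible alternative, sketched below. param_or_default is a hypothetical helper, and whether a value such as 0 is meaningful depends on what the model accepts.

```python
def param_or_default(payload, key, default):
    """Return payload[key] unless it is missing or null, in which case return default."""
    value = payload.get(key)
    return default if value is None else value


request_body = {"prompt": "你好", "top_p": 0, "temperature": None}
print(param_or_default(request_body, "top_p", 0.7))
print(param_or_default(request_body, "temperature", 0.6))
print(param_or_default(request_body, "max_length", 2048))
```

Here the explicit top_p of 0 is kept, while the null temperature and the missing max_length fall back to their defaults.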

Start the API service

(ChatGLM3) root@master:~/work/# python api.py 
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:02<00:00,  3.01it/s]
INFO:     Started server process [439374]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[2024-05-11 08:22:10] prompt: 你好, response: '你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。'
INFO:     192.168.5.210:55094 - "POST / HTTP/1.1" 200 OK

Testing with Python code

Call the API with the requests library as follows

import requests
import json

def get_completion(prompt):
    headers = {'Content-Type': 'application/json'}
    data = {"prompt": prompt, "history": []}
    response = requests.post(url='http://127.0.0.1:8000', headers=headers, data=json.dumps(data))
    return response.json()['response']

if __name__ == '__main__':
    print(get_completion('你好'))
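The service also returns the updated history, so a multi-turn client can send it back with the next prompt. The sketch below assumes the request and response fields used above (prompt, history, response). chat_turn, FakeResponse, and fake_post are hypothetical names; fake_post stands in for requests.post so the example runs without a server, and its history entries do not reproduce the model's real history format.

```python
import json


def chat_turn(post, prompt, history, url="http://127.0.0.1:8000"):
    """Send one turn and return (response_text, updated_history)."""
    headers = {"Content-Type": "application/json"}
    data = {"prompt": prompt, "history": history}
    reply = post(url=url, headers=headers, data=json.dumps(data)).json()
    return reply["response"], reply["history"]


class FakeResponse:
    """Minimal object exposing the .json() method used by chat_turn."""

    def __init__(self, body):
        self._body = body

    def json(self):
        return self._body


def fake_post(url, headers, data):
    """Stand-in for requests.post: echoes the prompt and extends the history."""
    sent = json.loads(data)
    reply = "echo: " + sent["prompt"]
    return FakeResponse(
        {"response": reply, "history": sent["history"] + [[sent["prompt"], reply]]}
    )


answer, history = chat_turn(fake_post, "你好", [])
answer, history = chat_turn(fake_post, "你是谁", history)
print(answer)
print(len(history))
```

Against the running service, requests.post would be passed as the first argument.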