LLM Deployment in Practice

Model download

First install modelscope:

pip install modelscope

Download script:

# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen2.5-0.5B-Instruct', cache_dir='/root/autodl-tmp/llm')

Run the download:

python ~/download.py
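
Before pointing a server at the downloaded directory, it can help to confirm that the snapshot contains the files a Hugging Face style checkpoint normally ships. The following is a minimal sketch, not part of modelscope: the helper names missing_model_files and has_weights, and the list of expected files, are assumptions made for illustration.

```python
from pathlib import Path

# Files a typical Hugging Face style checkpoint contains (an assumption;
# adjust the list for the model you downloaded).
EXPECTED_FILES = ("config.json", "tokenizer_config.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


def has_weights(model_dir):
    """Return True if at least one .safetensors or .bin file is present."""
    root = Path(model_dir)
    return any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
```

In the download script, calling missing_model_files(model_dir) on the path returned by snapshot_download should give an empty list when the download completed.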

Ollama

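Official site: ollama.com/

Ollama is aimed at individual users.

It only runs models in GGUF format (typically quantized models, which trade some quality for a smaller footprint).

Use models from the Ollama library, or GGUF-format models from the ModelScope community.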

Installation

Set up a conda environment:

# Create a virtual environment
conda create -n ollama1
# Activate it; if "conda activate ollama1" does not work, use the command below
source activate ollama1

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

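Startup

Current versions start Ollama automatically after installation and again after a server reboot. If the service is not already running, start it manually: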

(ollama1) root@xxxxx > ollama serve

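Usage

If you started Ollama manually, the window above must stay open; closing it stops Ollama. Open a second window (B), activate the virtual environment, and download a model.

On the Ollama site, search the Models page for a model you are interested in, copy the run command shown on its page, and use it to pull and run the model: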

(ollama1) root@xxxx > ollama run qwen2.5:0.5b

Window B can now be closed; from here on the model is called from code.
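
To check from code which models the local server has already pulled, Ollama's REST API can be queried. This is a sketch under two assumptions: the server listens on the default port 11434, and the /api/tags endpoint returns JSON shaped like {"models": [{"name": ...}]}. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def model_names(payload):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url="http://localhost:11434"):
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

With the server running, list_local_models() should include qwen2.5:0.5b once the pull above has finished.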

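Single-turn chat:

# Call Ollama through the OpenAI-style API
# If the package is missing: pip install openai
from openai import OpenAI

# Ollama does not check the API key, so any non-empty string works
client = OpenAI(base_url="http://localhost:11434/v1/",api_key="suibianxie")

chat_completion = client.chat.completions.create(
    messages=[{"role":"user","content":"Hello, please introduce yourself."}],model="qwen2.5:0.5b"
)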
)
print(chat_completion.choices[0])
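
Printing choices[0] dumps the whole choice object, including metadata. If only the reply text is wanted, it lives under message.content. The helper below is a small sketch (reply_text is a hypothetical name); it is exercised with a stand-in object that mimics the attribute layout of a chat completion so it can be tried without a running server.

```python
from types import SimpleNamespace


def reply_text(completion):
    """Return the assistant text of the first choice, or "" if there is none."""
    if not completion.choices:
        return ""
    return completion.choices[0].message.content or ""


# Stand-in with the same attribute layout as a chat completion response.
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="hi"))]
)
print(reply_text(fake))
```

Against a real server the call would be reply_text(chat_completion).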

Multi-turn chat:

# Multi-turn chat
from openai import OpenAI

# Define the multi-turn chat loop
def run_chat_session():
    # Initialize the client
    client = OpenAI(base_url="http://localhost:11434/v1/",api_key="suibianxie")
    # Initialize the chat history
    chat_history = []
    # Start the chat loop
    while True:
        # Read user input
        user_input = input("User: ")
        if user_input.lower() == "exit":
            print("Exiting chat.")
            break
        # Update the chat history (add the user input)
        chat_history.append({"role":"user","content":user_input})
        # Ask the model for a reply
        try:
            chat_completion = client.chat.completions.create(messages=chat_history,model="qwen2.5:0.5b")
            # Get the latest reply
            model_response = chat_completion.choices[0]
            print("AI:",model_response.message.content)
            # Update the chat history (add the model's reply)
            chat_history.append({"role":"assistant","content":model_response.message.content})
        except Exception as e:
            print("An error occurred:",e)
            break
if __name__ == '__main__':
    run_chat_session()
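
The loop above appends every turn to chat_history, so the prompt keeps growing and will eventually exceed the model's context window. One common mitigation is to send only the most recent messages. The sketch below is an illustration, not part of the original script: trim_history and max_messages are hypothetical names, and counting messages instead of tokens is a deliberate simplification.

```python
def trim_history(history, max_messages=20):
    """Keep a leading system message (if any) plus the newest max_messages others."""
    system = [m for m in history[:1] if m["role"] == "system"]
    rest = history[len(system):]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]
```

In the loop, messages=trim_history(chat_history) could replace messages=chat_history. Note that an odd max_messages may cut a user/assistant pair in half, so an even value is the safer choice.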

vLLM

Official sites:

vllm.hyper.ai/docs/gettin…

docs.vllm.com.cn/en/latest/

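Installation

At the time of writing, only two CUDA versions are supported.

Before installing, check the CUDA and GPU compute capability requirements listed on the official site.

Install vLLM with pip:

# Create the environment
conda create -n vllm python=3.10 -y
# Reload environment variables if needed
# source ~/.bashrc
# Activate the environment
source activate vllm
# Install vLLM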
pip install vllm

Startup

vllm serve /root/autodl-tmp/llm/Qwen/Qwen2.5-0.5B-Instruct

Start on a specific GPU:

CUDA_VISIBLE_DEVICES=1 vllm serve /home/cw/llms/Qwen/Qwen1.5-1.8B-Chat --dtype=half --enforce-eager
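
CUDA_VISIBLE_DEVICES is an ordinary environment variable read when the process starts, so the same launch can be scripted. The sketch below only builds the command and environment; vllm_launch is a hypothetical helper, it assumes the vllm executable is on PATH, and the flags are the ones from the command above.

```python
import os
import subprocess


def vllm_launch(model_path, gpu_ids, extra_args=()):
    """Build (argv, env) for `vllm serve` restricted to the given GPU indices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    argv = ["vllm", "serve", model_path, *extra_args]
    return argv, env


argv, env = vllm_launch(
    "/home/cw/llms/Qwen/Qwen1.5-1.8B-Chat",
    [1],
    extra_args=["--dtype=half", "--enforce-eager"],
)
# To actually start the server (this call blocks until the server exits):
# subprocess.run(argv, env=env, check=True)
```

Passing several indices, for example [0, 2], yields CUDA_VISIBLE_DEVICES=0,2.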

Usage

# Multi-turn chat
from openai import OpenAI

# Define the multi-turn chat loop
def run_chat_session():
    # Initialize the client
    client = OpenAI(base_url="http://localhost:8000/v1/",api_key="suibianxie")
    # Initialize the chat history
    chat_history = []
    # Start the chat loop
    while True:
        # Read user input
        user_input = input("User: ")
        if user_input.lower() == "exit":
            print("Exiting chat.")
            break
        # Update the chat history (add the user input)
        chat_history.append({"role":"user","content":user_input})
        # Ask the model for a reply; model must match the path passed to vllm serve
        try:
            chat_completion = client.chat.completions.create(messages=chat_history,model="/root/autodl-tmp/llm/Qwen/Qwen2.5-0.5B-Instruct")
            # Get the latest reply
            model_response = chat_completion.choices[0]
            print("AI:",model_response.message.content)
            # Update the chat history (add the model's reply)
            chat_history.append({"role":"assistant","content":model_response.message.content})
        except Exception as e:
            print("An error occurred:",e)
            break
if __name__ == '__main__':
    run_chat_session()
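
The model argument has to match the name the server registered, which by default is the path given to vllm serve. Rather than hardcoding that path in the client, the served name can be read from the OpenAI-compatible /v1/models endpoint. This is a sketch with hypothetical helper names; it assumes the usual response shape {"data": [{"id": ...}]} and the default port 8000.

```python
import json
from urllib.request import urlopen


def first_model_id(payload):
    """Return the id of the first model in a decoded /v1/models response."""
    data = payload.get("data", [])
    if not data:
        raise ValueError("server reports no models")
    return data[0]["id"]


def served_model(base_url="http://localhost:8000/v1"):
    """Ask a running server which model it serves; raises URLError if unreachable."""
    with urlopen(f"{base_url}/models", timeout=5) as resp:
        return first_model_id(json.load(resp))
```

The result of served_model() can then be passed as model= in client.chat.completions.create.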

LMDeploy

LMDeploy is an end-to-end solution for deploying LLMs on NVIDIA devices, covering model compression, inference, and serving.

Project: github.com/InternLM/Im…

Official site: internlm.intern-ai.org.cn/

Docs: lmdeploy.readthedocs.io/zh-cn/lates…

Installation

# Use the environment specified in the official docs
conda create -n lmdeploy python=3.8 -y
source activate lmdeploy
pip install lmdeploy

Startup

lmdeploy serve api_server /root/autodl-tmp/llm/Qwen/Qwen2.5-0.5B-Instruct

If startup fails because a package is missing, install it:

pip install partial_json_parser

Usage

# Multi-turn chat
from openai import OpenAI

# Define the multi-turn chat loop
def run_chat_session():
    # Initialize the client
    client = OpenAI(base_url="http://localhost:23333/v1/",api_key="suibianxie")
    # Initialize the chat history
    chat_history = []
    # Start the chat loop
    while True:
        # Read user input
        user_input = input("User: ")
        if user_input.lower() == "exit":
            print("Exiting chat.")
            break
        # Update the chat history (add the user input)
        chat_history.append({"role":"user","content":user_input})
        # Ask the model for a reply
        try:
            chat_completion = client.chat.completions.create(messages=chat_history,model="/root/autodl-tmp/llm/Qwen/Qwen2.5-0.5B-Instruct")
            # Get the latest reply
            model_response = chat_completion.choices[0]
            print("AI:",model_response.message.content)
            # Update the chat history (add the model's reply)
            chat_history.append({"role":"assistant","content":model_response.message.content})
        except Exception as e:
            print("An error occurred:",e)
            break
if __name__ == '__main__':
    run_chat_session()
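
All three servers are called through the same OpenAI-style client; across the examples in this post only the port and the model name change. The sketch below collects the ports used above in one place. BACKEND_PORTS and client_kwargs are hypothetical names introduced for illustration.

```python
# Ports as used in the examples in this post.
BACKEND_PORTS = {"ollama": 11434, "vllm": 8000, "lmdeploy": 23333}


def client_kwargs(backend, host="localhost", api_key="suibianxie"):
    """Return keyword arguments for openai.OpenAI(...) for the given backend."""
    try:
        port = BACKEND_PORTS[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return {"base_url": f"http://{host}:{port}/v1/", "api_key": api_key}
```

A client for any of the backends can then be created with OpenAI(**client_kwargs("lmdeploy")), leaving the chat loop itself unchanged.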

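Open WebUI

Open WebUI is an extensible, self-hosted AI interface that adapts to your workflow and can run entirely offline.

Repository: github.com/open-webui/…

Docs: docs.openwebui.com/

Open WebUI does not currently support custom chat templates, so it is a poor fit for working with fine-tuned models.

Installation

Python must be version 3.11: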

conda create -n openwebui python=3.11 -y

source activate openwebui

pip install open-webui

pip install -U open-webui torch transformers

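Set the environment variables:

export HF_ENDPOINT=https://hf-mirror.com

The backend here is vLLM, so the Ollama integration must be turned off. If you are using Ollama instead, set this to True.

export ENABLE_OLLAMA_API=False

Point Open WebUI at the vLLM endpoint. Note that vLLM must already be running.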

export OPENAI_API_BASE_URL=http://127.0.0.1:8000/v1
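
Before running open-webui serve, it can help to confirm that something is actually listening at the backend address. A minimal sketch using a plain TCP connect; port_is_open is a hypothetical helper, and the host and port in the usage note are taken from the export line above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the configuration above the check would be port_is_open("127.0.0.1", 8000). A successful connect only shows that the port is in use, not that the process behind it is vLLM.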

Startup

open-webui serve