Running Large Language Models on Your GPU: Local Inference with ExLlamaV2

Introduction

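Inference with large language models (LLMs) usually demands substantial compute. ExLlamaV2 is a fast inference library designed for modern consumer-grade GPUs that runs GPTQ- and EXL2-quantized models locally. This article shows how to use ExLlamaV2 for model inference from LangChain, with working example code and fixes for common problems.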

Main Content

Installation Guide

Before using ExLlamaV2, make sure your environment meets the following requirements:

Python 3.11
LangChain 0.1.7
CUDA 12.1.0
PyTorch 2.1.1+cu121
ExLlamaV2 0.0.12+cu121

Install ExLlamaV2:

pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl

If you use Conda, you can install the following dependencies:

- conda-forge::ninja
- nvidia/label/cuda-12.1.0::cuda
- conda-forge::ffmpeg
- conda-forge::gxx=11.4
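
After installing, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch that uses only the standard library. The `EXPECTED` table and the helper names `installed_version` and `check_environment` are illustrative and not part of ExLlamaV2 or LangChain, and the sketch assumes the distribution names on your system are `langchain`, `torch`, and `exllamav2`.

```python
from importlib import metadata

# Versions taken from the requirements list in this article.
# Adjust them if you deliberately install different releases.
EXPECTED = {
    "langchain": "0.1.7",
    "torch": "2.1.1+cu121",
    "exllamav2": "0.0.12+cu121",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each package name to (installed, expected, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (found, wanted, found == wanted)
    return report


if __name__ == "__main__":
    for package, (found, wanted, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: installed={found} expected={wanted} [{status}]")
```

A mismatch is not necessarily an error, but the prebuilt wheel above targets one specific Python, CUDA, and PyTorch combination, so differences are worth checking first when an import fails.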

Using ExLlamaV2

Running an LLM locally requires no API token. The code example below downloads a model from Hugging Face and runs inference on it through LangChain.

import os
from huggingface_hub import snapshot_download
from langchain_community.llms.exllamav2 import ExLlamaV2
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain

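# Helper that downloads a GPTQ model
def download_GPTQ_model(model_name: str, models_dir: str = "./models/") -> str:
    """Download a model from its Hugging Face repository (skipped if already present)."""
    # Create the models directory if it does not exist yet
    os.makedirs(models_dir, exist_ok=True)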

    _model_name = model_name.split("/")
    _model_name = "_".join(_model_name)
    model_path = os.path.join(models_dir, _model_name)
    if _model_name not in os.listdir(models_dir):
        snapshot_download(repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False)
    else:
        print(f"{model_name} already exists in the models directory")

    return model_path

from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

model_path = download_GPTQ_model("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")

callbacks = [StreamingStdOutCallbackHandler()]

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Initialize the model
llm = ExLlamaV2(
    model_path=model_path,
    callbacks=callbacks,
    verbose=True,
    settings=settings,
    streaming=True,
    max_new_tokens=150,
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What Football team won the UEFA Champions League in the year the iphone 6s was released?"

output = llm_chain.invoke({"question": question})
print(output)
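
To make the role of the prompt template concrete: with its default settings, `PromptTemplate` fills the `{question}` placeholder using ordinary Python string formatting before the text reaches the model. The sketch below reproduces that substitution with plain `str.format`, so you can inspect the exact prompt without loading a model. `render_prompt` is a hypothetical helper for illustration, not a LangChain function.

```python
# Same template text as in the example above.
TEMPLATE = """Question: {question}

Answer: Let's think step by step."""


def render_prompt(question: str) -> str:
    """Fill the {question} placeholder, mirroring what the chain sends to the LLM."""
    return TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(render_prompt("Which team won the UEFA Champions League in 2015?"))
```

Printing the rendered prompt is a quick way to catch template mistakes, such as a misspelled placeholder name, before spending GPU time on generation.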

Common Problems and Solutions

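Problem: The model download fails.
- Solution: Check your network connection and confirm that you can reach Hugging Face; if necessary, route requests through a proxy service for more stable access.
Problem: The model fails to load because the GPU runs out of memory.
- Solution: Try a smaller model, or a variant quantized to fewer bits per weight.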
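
For the out-of-memory case, a back-of-the-envelope estimate can tell you whether a model has any chance of fitting before you download it. The sketch below computes only the memory taken by the weights, as parameter count times bits per weight. It deliberately ignores the KV cache, activations, and quantization metadata, so real usage will be higher. The figure of 7 billion parameters is a rounded assumption for a "7B" model, and `estimate_weight_gib` is an illustrative helper, not an ExLlamaV2 API.

```python
def estimate_weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Rough size of the model weights in GiB (excludes KV cache and overhead)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumption: a "7B" model has roughly 7e9 parameters.
    for bits in (16, 8, 4):
        size = estimate_weight_gib(7e9, bits)
        print(f"{bits:>2} bits/weight: about {size:.2f} GiB for weights alone")
```

If the estimate is already close to your GPU's total memory, a lower-bitrate quantization or a smaller model is the more realistic option.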

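Summary and Further Learning Resources

ExLlamaV2 offers an efficient way to run large language models on local hardware. Combined with LangChain, it lets you build LLM-powered features into your own projects. To learn more, see the following resources: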

References

ExLlamaV2 GitHub project: github.com/turboderp/e…
Hugging Face documentation: huggingface.co/docs
