Running LLMs Locally with ExLlamaV2: A Quick Guide

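# Introduction

Running large language models (LLMs) locally on modern consumer-grade GPUs is becoming increasingly common. ExLlamaV2 is a fast inference library that can run GPTQ- and EXL2-quantized models locally. This article shows how to use ExLlamaV2 within the LangChain framework, covering installation, usage, and common pitfalls.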

# Main Content

## Installation

To run ExLlamaV2 locally, you need an environment that meets the following requirements:

- Python 3.11
- LangChain 0.1.7
- CUDA 12.1.0
- Torch 2.1.1+cu121
- ExLlamaV2 0.0.12+cu121
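Because the versions listed above are pinned fairly tightly, it can help to check what is actually installed before going further. The following is a minimal sketch using only the standard library; the `EXPECTED` table and the `check_versions` helper are illustrative names, not part of ExLlamaV2 or LangChain, and the exact string comparison is an assumption that may be stricter than you need.

```python
from __future__ import annotations

from importlib import metadata

# Versions taken from the requirements list above; keys are assumed to be
# the PyPI distribution names of the packages.
EXPECTED = {
    "langchain": "0.1.7",
    "torch": "2.1.1+cu121",
    "exllamav2": "0.0.12+cu121",
}


def check_versions(expected: dict[str, str]) -> dict[str, tuple[str | None, bool]]:
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment
            installed = None
        report[name] = (installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed}, expected={EXPECTED[name]} -> {status}")
```

A mismatch is not necessarily fatal, but a Torch build that targets a different CUDA version than the ExLlamaV2 wheel is a likely source of import errors.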

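### Installation Steps

If you use `pip`, you can install ExLlamaV2 with the following command:

```bash
pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl
```

If you use conda, make sure the following dependencies are installed:

```bash
conda install -c conda-forge ninja ffmpeg gxx=11.4
conda install -c nvidia/label/cuda-12.1.0 cuda
```

## Usage

No API_TOKEN is needed, because the LLM runs locally. Note that you should choose a model that matches the capabilities of your local machine.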
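To make "a model that matches your machine" more concrete, a rough back-of-the-envelope estimate of weight memory can be sketched as follows. This is an approximation under stated assumptions: the helper names are hypothetical, the 7e9 parameter count is a round illustrative figure, and `overhead_gib` is a placeholder guess for the KV cache, activations, and CUDA context rather than a measured value. Real quantized files also store scales and other metadata, so effective bits per weight are somewhat higher than the nominal figure.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the quantized weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


def fits_in_vram(
    n_params: float,
    bits_per_weight: float,
    vram_gib: float,
    overhead_gib: float = 1.5,  # assumed allowance for cache and activations
) -> bool:
    """Rough check: do the weights plus an assumed overhead fit in VRAM?"""
    needed = estimate_weight_memory_gib(n_params, bits_per_weight) + overhead_gib
    return needed <= vram_gib


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gib = estimate_weight_memory_gib(7e9, bits)
        print(f"7e9 params at {bits} bits/weight: about {gib:.1f} GiB of weights")
```

Under these assumptions, a model with roughly 7 billion parameters at 4 bits per weight needs a little over 3 GiB for its weights, which is why 4-bit GPTQ models of this size are a common choice for 8 GB consumer GPUs.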

```python
import os

from exllamav2.generator import ExLlamaV2Sampler
from huggingface_hub import snapshot_download
from langchain.chains import LLMChain
from langchain_community.llms.exllamav2 import ExLlamaV2
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate


# Download the model from the Hugging Face Hub. If access is unstable,
# an API proxy service can improve reliability.
def download_GPTQ_model(model_name: str, models_dir: str = "./models/") -> str:
    """Download a model into models_dir unless it is already there; return its local path."""
    # Create the models directory if it does not exist yet
    if not os.path.exists(models_dir):
        os.makedirs(models_dir)

    # Turn the repo id "org/name" into the folder name "org_name"
    _model_name = model_name.split("/")
    _model_name = "_".join(_model_name)
    model_path = os.path.join(models_dir, _model_name)

    # Download only if no folder with that name exists
    if _model_name not in os.listdir(models_dir):
        snapshot_download(repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False)
    else:
        print(f"{model_name} already exists in the models directory")

    return model_path


# Sampling parameters passed through to ExLlamaV2
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

model_path = download_GPTQ_model("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")

# Stream generated tokens to stdout as they are produced
callbacks = [StreamingStdOutCallbackHandler()]

template = """Question: {question}\n\nAnswer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = ExLlamaV2(
    model_path=model_path,
    callbacks=callbacks,
    verbose=True,
    settings=settings,
    streaming=True,
    max_new_tokens=150,
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What Football team won the UEFA Champions League in the year the iphone 6s was released?"
output = llm_chain.invoke({"question": question})
print(output)
```
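The sampler settings used above (`temperature`, `top_k`, `top_p`) are easier to reason about with a small standalone illustration. The sketch below is a plain-Python teaching example of the general technique, not ExLlamaV2's actual sampler: the function name `filter_probs` is hypothetical, the order of the steps varies between implementations, and `token_repetition_penalty` is not modeled.

```python
import math


def filter_probs(logits, temperature=0.85, top_k=50, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p filtering.

    Requires temperature > 0.
    """
    # Temperature scaling, then a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]
    print(filter_probs(toy_logits, temperature=0.85, top_k=3, top_p=0.8))
```

In this picture, a lower temperature sharpens the distribution before filtering, while `top_k` and `top_p` cut off the unlikely tail. A token would then be drawn from the returned distribution, for example with `random.choices`.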

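# Common Problems and Solutions

- Model download fails: downloads can fail because of network problems or access restrictions. Using an API proxy service is recommended.
- Resource consumption is too high: make sure the GPU has enough VRAM. You can adjust the quantization parameters to fit your hardware, for example by choosing a model quantized to fewer bits per weight.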
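For the download failures mentioned above, retrying with exponential backoff is a complementary measure for transient network errors. It is a different technique from the proxy the article recommends, and it will not help with hard access restrictions. The sketch below is generic: `retry_with_backoff` is a hypothetical helper, and the injectable `sleep` parameter exists only to make it easy to test.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on failure wait base_delay * 2**n seconds and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            # Out of retries: re-raise the last error unchanged
            if attempt == attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)
    raise RuntimeError("unreachable")
```

If you apply this to the example in this article, it is probably better to wrap the `snapshot_download` call itself, for instance `retry_with_backoff(lambda: snapshot_download(repo_id=..., local_dir=...))`. The `download_GPTQ_model` helper skips the download whenever the target folder already exists, which may also be the case after a partially failed attempt.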

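# Summary and Further Learning Resources

ExLlamaV2 makes it practical to run LLMs efficiently on consumer-grade GPUs. The initial setup takes some effort, but once it is in place you get the benefit of fast inference. For a deeper understanding, the official ExLlamaV2 and LangChain documentation are good next steps.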
