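# Running Large Language Models Locally with ExLlamaV2: A Quick Start Guide

## Introduction

Running large language models (LLMs) locally on modern consumer-grade GPUs is becoming increasingly practical, and the ExLlamaV2 library provides a fast inference path for doing so. It supports GPTQ and EXL2 quantized models, which makes it a good fit for developers who want to run models offline. This article shows how to use ExLlamaV2 together with LangChain for local inference.

## Main Content

### Installation

Before you begin, make sure your environment meets the following requirements: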

- Python 3.11
- LangChain 0.1.7
- CUDA 12.1.0
- torch==2.1.1+cu121
- exllamav2 (0.0.12+cu121)
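
To confirm that an existing environment matches the versions listed above, a small check along these lines may help. This is a minimal sketch using only the standard library; the `EXPECTED` mapping and the `check_versions` helper are illustrative names, not part of LangChain or ExLlamaV2, and local build tags such as `+cu121` are ignored when comparing.

```python
from importlib import metadata

# Versions listed in this guide (distribution name -> expected version).
EXPECTED = {
    "langchain": "0.1.7",
    "torch": "2.1.1",
    "exllamav2": "0.0.12",
}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Drop local build tags such as "+cu121" before comparing.
        base = installed.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"found {installed}"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```

A `found ...` status does not necessarily mean the setup is broken; it only flags a difference from the versions this guide was written against.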

You can install ExLlamaV2 with the following command:

```bash
pip install https://github.com/turboderp/exllamav2/releases/download/v0.0.12/exllamav2-0.0.12+cu121-cp311-cp311-linux_x86_64.whl
```

If you manage your environment with conda, these are the required dependencies:

```bash
conda install conda-forge::ninja
conda install nvidia/label/cuda-12.1.0::cuda
conda install conda-forge::ffmpeg
conda install conda-forge::gxx=11.4
```

### Usage

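No API_TOKEN is needed, because the LLM runs locally. On Hugging Face you can check the resource usage of models at different quantization sizes and methods, which helps you choose a model that fits your machine.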
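
As a rough guide when comparing quantization sizes, the memory needed for the weights alone is approximately the parameter count times the bits per weight. The sketch below does that arithmetic; the `estimate_weight_gib` helper and the example parameter count are illustrative assumptions, and the real footprint is higher because the KV cache, activations, and the CUDA context also use VRAM.

```python
def estimate_weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at several bit widths.
    for bits in (16.0, 8.0, 4.0):
        size = estimate_weight_gib(7e9, bits)
        print(f"{bits:>4} bits/weight: ~{size:.1f} GiB")
```

Treat the result as a lower bound and prefer the figures published on the model page when they are available.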

## Code Example

The following example uses ExLlamaV2 to answer a simple question:

```python
import os

from exllamav2.generator import ExLlamaV2Sampler
from huggingface_hub import snapshot_download
from langchain.chains import LLMChain
from langchain_community.llms.exllamav2 import ExLlamaV2
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate


def download_GPTQ_model(model_name: str, models_dir: str = "./models/") -> str:
    """Download a model from Hugging Face unless it is already cached locally."""
    os.makedirs(models_dir, exist_ok=True)

    # Turn "owner/repo" into a flat directory name such as "owner_repo".
    _model_name = "_".join(model_name.split("/"))
    model_path = os.path.join(models_dir, _model_name)
    if _model_name not in os.listdir(models_dir):
        snapshot_download(
            repo_id=model_name, local_dir=model_path, local_dir_use_symlinks=False
        )
    return model_path


# Sampling parameters passed to the ExLlamaV2 generator.
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

model_path = download_GPTQ_model("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ")

# Stream generated tokens to stdout as they are produced.
callbacks = [StreamingStdOutCallbackHandler()]

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm = ExLlamaV2(
    model_path=model_path,
    callbacks=callbacks,
    verbose=True,
    settings=settings,
    streaming=True,
    max_new_tokens=150,
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What Football team won the UEFA Champions League in the year the iphone 6s was released?"

output = llm_chain.invoke({"question": question})
print(output)
```

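## Common Issues and Solutions

- Out of memory: make sure your GPU has enough VRAM to run the model you selected.
- CUDA version mismatch: make sure the installed CUDA version is compatible with your PyTorch build (this guide uses CUDA 12.1.0 with torch 2.1.1+cu121).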
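
Both issues can be narrowed down by inspecting what PyTorch itself reports. The following is a minimal sketch, assuming PyTorch is the CUDA consumer in your environment; `cuda_report` is an illustrative helper name, and it returns a short result when `torch` is not installed.

```python
def cuda_report() -> dict:
    """Collect the CUDA details most relevant to the two issues above."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        # Free and total memory on the current device, in bytes.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        report["gpu_name"] = torch.cuda.get_device_name(0)
        report["free_gib"] = round(free_bytes / 2**30, 2)
        report["total_gib"] = round(total_bytes / 2**30, 2)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` differs from the CUDA toolkit you installed, or `free_gib` is below the size of the model files, that points to the second or first issue respectively.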

## Summary and Further Learning Resources

With the ExLlamaV2 library, you can run fast local LLM inference on a consumer-grade GPU.
