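Local LLM Inference with llama-cpp-python

Introduction

As large language models (LLMs) have become widespread, more developers want to run them locally instead of depending on cloud services. llama-cpp-python, the Python binding for llama.cpp, makes it straightforward to run a range of LLMs on your own machine. This article covers installation, basic usage, and common problems with their fixes.

Main Content

1. Installing llama-cpp-python

Choose an installation method based on your hardware and operating system. The common options are listed below.

CPU-only installation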

%pip install --upgrade --quiet llama-cpp-python
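
After installing, it can be useful to confirm that the package is visible to the interpreter you are actually running. The sketch below uses only the standard library; the helper name package_status is made up for this example, and it assumes the import name is llama_cpp and the distribution name is llama-cpp-python.

```python
import importlib.util
from importlib import metadata


def package_status(module_name: str, dist_name: str) -> str:
    """Return the installed version of a distribution, or a short status string."""
    # find_spec returns None when the top-level module cannot be imported.
    if importlib.util.find_spec(module_name) is None:
        return "not installed"
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata was found under that name.
        return "installed (version unknown)"


print("llama-cpp-python:", package_status("llama_cpp", "llama-cpp-python"))
```

If this prints "not installed" inside a notebook, the install most likely went to a different Python environment than the running kernel.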

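OpenBLAS / cuBLAS / CLBlast backend installation

llama.cpp supports several BLAS backends that speed up processing. Set the FORCE_CMAKE=1 environment variable to force a CMake build, and pass the backend flag through CMAKE_ARGS when installing the pip package. The command below enables cuBLAS. Flag names have changed across llama.cpp releases (newer versions use -DGGML_CUDA=on in place of -DLLAMA_CUBLAS=on), so check the project README for the version you install.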

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Metal (macOS only)

!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
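
The commands above use notebook syntax (%pip and !). Outside a notebook, the same install can be driven from a plain Python script. This is a rough sketch, not an official installer: the names BACKEND_CMAKE_ARGS, build_env and install are invented for this example, and the flags are the ones quoted above, so they may need updating for newer releases.

```python
import os
import subprocess
import sys

# Flags copied from the install commands above. Other backends would need
# their own CMAKE_ARGS value, which is not guessed at here.
BACKEND_CMAKE_ARGS = {
    "cpu": None,
    "cublas": "-DLLAMA_CUBLAS=on",
    "metal": "-DLLAMA_METAL=on",
}


def build_env(backend: str) -> dict:
    """Return a copy of the current environment with the build variables set."""
    if backend not in BACKEND_CMAKE_ARGS:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ)
    cmake_args = BACKEND_CMAKE_ARGS[backend]
    if cmake_args is not None:
        env["CMAKE_ARGS"] = cmake_args
        env["FORCE_CMAKE"] = "1"
    return env


def install(backend: str) -> None:
    """Run pip in the current interpreter with the backend's build variables."""
    cmd = [sys.executable, "-m", "pip", "install", "--upgrade", "llama-cpp-python"]
    subprocess.run(cmd, env=build_env(backend), check=True)


# Usage: install("metal")  # triggers a source build, which can take a while
```

Calling pip through sys.executable keeps the install in the same environment as the script, which avoids a common source of "module not found" errors.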

2. Installation and Usage

Installing on Windows

On Windows, llama-cpp-python can be installed by compiling it from source. The following tools are required:

git
python
cmake
Visual Studio Community

Clone the repository and build it from a Command Prompt (the set commands below use cmd.exe syntax):

git clone --recursive -j8 https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python
set FORCE_CMAKE=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=OFF
python -m pip install -e .
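
If the build fails early, a tool missing from PATH is a likely cause. The following sketch checks the command-line tools from the list above with the standard library. The helper name find_build_tools is made up for this example. Visual Studio is not covered, because its compiler is usually not on PATH outside a developer prompt.

```python
import shutil


def find_build_tools(tools=("git", "python", "cmake")) -> dict:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_build_tools().items():
    print(f"{tool:8} {path or 'NOT FOUND'}")
```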

3. Running inference with llama-cpp-python

The example below runs an LLM locally through llama-cpp-python and interacts with it via LangChain.

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate.from_template(template)

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Load a local model file; model_path must point to a file on your machine
llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/models/openorca-platypus2-13b.gguf.q4_0.bin",
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

question = "A rap battle between Stephen Colbert and John Oliver"
# Fill the template defined above, then send the formatted prompt to the model
llm.invoke(prompt.format(question=question))
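
LangChain is optional here. As a minimal sketch, the same step-by-step template can be sent directly through llama-cpp-python's own Llama class. The helper names build_prompt and generate are invented for this example, and n_ctx=2048 and max_tokens=256 are assumed values, not recommendations.

```python
def build_prompt(question: str) -> str:
    """Format a question with the same step-by-step template used above."""
    return (
        f"Question: {question}\n\n"
        "Answer: Let's work this out in a step by step way "
        "to be sure we have the right answer."
    )


def generate(model_path: str, question: str, max_tokens: int = 256) -> str:
    """Run one completion with llama-cpp-python and return the generated text."""
    # Imported lazily so build_prompt stays usable without the package installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    result = llm(build_prompt(question), max_tokens=max_tokens, temperature=0.75)
    # Completions come back as an OpenAI-style dict.
    return result["choices"][0]["text"]


# Usage (hypothetical path; point it at a GGUF file on your machine):
# print(generate("./models/model.gguf", "What is the capital of France?"))
```

Calling the library directly removes one dependency and makes it easier to tell whether a problem comes from the model setup or from the LangChain layer.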

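Common Issues and Solutions

Performance tuning
- Setting n_gpu_layers (the number of layers offloaded to the GPU) and n_batch (the number of tokens processed per batch) appropriately can speed up inference significantly.
- When using Metal, make sure f16_kv is set to True.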
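
The two performance tips above can be collected into one place. This is a sketch under stated assumptions: the helper name acceleration_kwargs is made up, and the defaults (n_gpu_layers=1, n_batch=512) are illustrative starting points to tune for your hardware, not measured optima.

```python
def acceleration_kwargs(use_metal: bool, n_gpu_layers: int = 1, n_batch: int = 512) -> dict:
    """Collect the tuning parameters discussed above into keyword arguments."""
    kwargs = {
        "n_gpu_layers": n_gpu_layers,  # layers offloaded to the GPU
        "n_batch": n_batch,            # tokens processed per batch
    }
    if use_metal:
        # The tip above: keep f16_kv enabled when running on Metal.
        kwargs["f16_kv"] = True
    return kwargs


# Usage with the LangChain wrapper from the inference example
# (model_path is whatever file you are already loading):
# llm = LlamaCpp(model_path=model_path, **acceleration_kwargs(use_metal=True))
```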
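Model file mismatch
- Make sure the model file format matches your installed version. Current releases expect GGUF files, and older GGML files must be converted before they will load.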
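
A quick way to rule out a format mismatch is to look at the first bytes of the file. GGUF files begin with the four ASCII bytes "GGUF". The sketch below checks only that signature, so it cannot tell whether the file is complete or whether its GGUF version is supported. The helper name looks_like_gguf is made up for this example.

```python
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(GGUF_MAGIC)) == GGUF_MAGIC


# Usage (hypothetical path):
# print(looks_like_gguf("./models/model.gguf"))
```

Because the check reads the header, it also works for files whose name ends in .bin, where the extension gives no hint about the format.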
Network access issues
- Where network restrictions apply, consider using an API proxy service when needed to improve access stability.

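Summary and Further Resources

llama-cpp-python provides solid local LLM inference and suits a wide range of developers. To go deeper, the llama-cpp-python and llama.cpp project documentation and the LangChain LlamaCpp integration guide are natural next steps.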
