The llama.cpp project: "Inference of Meta's LLaMA model (and others) in pure C/C++." Developer Georgi Gerganov hand-wrote this pure C/C++ implementation for model inference, based on the LLaMA model released by Meta (which shipped with only a simple Python reference example).
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, both locally and in the cloud. Its main advantages:
- Plain C/C++ implementation with no dependencies; it compiles directly into executables
- Apple silicon support
- AVX, AVX2, and AVX512 support on x86 architectures
- Built-in 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization
- Custom CUDA kernels for NVIDIA GPUs
- Vulkan, SYCL, and (partial) OpenCL support
- CPU+GPU hybrid inference to accelerate models larger than the total VRAM capacity (see the sketch after this list)
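To illustrate the last point: how many layers go to the GPU is controlled at run time with the -ngl/--n-gpu-layers flag of the main binary. A minimal sketch, assuming a GPU-enabled build and the quantized model file produced in Step 2 below:

```bash
# Offload up to 32 transformer layers to the GPU; anything beyond that
# (or anything that does not fit in VRAM) stays on the CPU.
./build/bin/main -m ./llama2-models/7B/ggml-7B-q4_0.bin -ngl 32 -p "Hello" -n 32
```

For a 7B model, 32 layers covers essentially the whole network; smaller values split the work between CPU and GPU.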
Step 1: Download and build llama.cpp
```bash
git clone https://github.com/ggerganov/llama.cpp
```
### Build with CMake (CPU-only, simplest setup)
```bash
cd llama.cpp
cmake -B build
cmake --build build --config Release
```
For other build methods, see the official documentation.
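For example, a GPU-accelerated build only changes the configure step. The exact option name depends on the llama.cpp version (newer trees use GGML_*-prefixed options, older ones LLAMA_*), so treat the flags below as an assumption to verify against the build docs:

```bash
# NVIDIA CUDA build (older checkouts spell this -DLLAMA_CUBLAS=ON)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

On Apple silicon, the Metal backend is typically enabled by default, so the plain CPU commands above already produce a GPU-capable binary.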
Step 2: Quantize the model
Following the LLaMA 2 model download instructions, copy the weights and tokenizer.model into a llama2-models directory:
```text
├── 7B
│   ├── checklist.chk
│   ├── consolidated.00.pth
│   └── params.json
└── tokenizer.model
```
```bash
pip install -r requirements.txt
python convert.py llama2-models/7B/
./build/bin/quantize ./llama2-models/7B/7B-7B-F32.gguf ./llama2-models/7B/ggml-7B-q4_0.bin q4_0
```
The quantization log shows the model shrinking by roughly 7x when going from fp32 to 4-bit:
```text
llama_model_quantize_internal: model size = 25705.02 MB
llama_model_quantize_internal: quant size = 3647.87 MB
```
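As a sanity check (a rough estimate assuming the standard GGML q4_0 layout, where each block of 32 weights stores one fp16 scale plus 32 four-bit values), the expected ratio versus fp32 is:

$$
\frac{32 \times 4\ \text{bytes}}{2\ \text{bytes} + 16\ \text{bytes}} = \frac{128}{18} \approx 7.1
$$

which matches the observed 25705 / 3648 ≈ 7.0; the full theoretical 8x is not reached because of the per-block scales.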
Note: the converted llama2-7B model is 26 GB. The reason is convert.py's --outtype option (default: f16 or f32 based on input); since the original llama2-7B weights here are fp32, the converted GGML dtype stays fp32 as well.
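If you want to avoid the 26 GB fp32 intermediate, --outtype can be set explicitly. A minimal sketch, assuming the convert.py shipped with this checkout accepts f16 output (recent versions do):

```bash
# Convert straight to fp16 (~13 GB for 7B) and quantize from that instead;
# quantizing from f16 gives essentially the same q4_0 result as from f32.
python convert.py llama2-models/7B/ --outtype f16
```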
Step 3: Run inference
```bash
./build/bin/main -m ./llama2-models/7B/ggml-7B-q4_0.bin --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
```
An interactive session looks like this:
```text
Below is an instruction that describes a task. Write a response that appropriately completes the request.
> How old is the universe?
The universe was created on October 23, 4004 BC.
> Where is Shanxi?
Shanxi is a province in the North China Plain, located in the central part of northern China. It borders Henan to the south, Hebei and Inner Mongolia to the north, and Shaanxi to the west. The capital city is Taiyuan.
### Instruction:
> What is NWPU?
Northwestern Polytechnical University (NWPU) is a public research university located in Xi'an, Shaanxi Province, China. It was founded in 1958 and has since become one of the most prestigious universities in China. The university offers undergraduate, graduate, and doctoral programs across six academic disciplines: engineering, science, management, humanities and social sciences, law, and medicine. NWPU also houses several research institutes that focus on areas such as artificial intelligence (AI), robotics, energy conservation and environmental protection technologies.
```
For other prompting methods, see: github.com/ggerganov/l…
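Besides the interactive -ins mode, main can also run a single prompt non-interactively via -p (the prompt text below is just a placeholder):

```bash
# One-shot completion: prompt passed inline, generate up to 128 tokens.
./build/bin/main -m ./llama2-models/7B/ggml-7B-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 128
```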
Loading the unquantized fp32 model and the 4-bit quantized model for comparison, memory usage is about 26 GB versus 4.5 GB, and inference is noticeably faster with the 4-bit model.
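To put numbers on the speed difference instead of eyeballing it, the llama-bench tool built alongside main reports per-model throughput. A minimal sketch, assuming your build produced build/bin/llama-bench and reusing the file names from Step 2:

```bash
# Benchmark the fp32 and the q4_0 model; llama-bench prints tokens/second
# for prompt processing (pp) and text generation (tg).
./build/bin/llama-bench -m ./llama2-models/7B/7B-7B-F32.gguf
./build/bin/llama-bench -m ./llama2-models/7B/ggml-7B-q4_0.bin
```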