Deploying the Meta LLaMA 7B model on a Mac M2

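Environment setup:

llama.cpp has to be compiled from source, so a C/C++ build toolchain is required. On Ubuntu that means the build-essential package; on macOS the Xcode Command Line Tools provide the equivalent compiler and make.

For Python, the default Python 3.10.6 that ships with Ubuntu 22.04 is sufficient.

First install the system dependencies needed to build and run llama.cpp. python3 is usually installed already, but listing it here does no harm. On Ubuntu: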

apt install build-essential python3
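Before building, it can help to confirm that the command-line tools used in this walkthrough are actually on the PATH. The following is a minimal sketch and not part of llama.cpp; the tool list and the name missing_tools are assumptions chosen for illustration.

```python
import shutil

# Tools invoked in this walkthrough; the exact list is an assumption.
REQUIRED_TOOLS = ["git", "make", "python3", "wget"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Anything reported as missing can then be installed with the system package manager (apt on Ubuntu, brew on macOS).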

Next, clone the llama.cpp repository from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Building llama.cpp:

Enter the llama.cpp directory and compile the C++ program.

make

Download the llama.sh script, which fetches Facebook's LLaMA model weights.

wget https://raw.githubusercontent.com/shawwn/llama-dl/56f50b96072f42fb2520b1ad5a1d6ef30351f23c/llama.sh

ls ./models
13B
30B
65B
7B
llama.sh
tokenizer.model
tokenizer_checklist.chk

Edit line 11 of llama.sh, changing

MODEL_SIZE="7B,13B,30B,65B"

to

MODEL_SIZE="7B"

This way only the 7B model, about 13 GB, is downloaded. The 13B model is about 25 GB and the 30B model about 64 GB.

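Running llama.sh failed because the bash on my machine was too old to understand the declare -A syntax used in the script (associative arrays require bash 4 or later, and macOS ships bash 3.2). Since I did not need the 13B or larger models anyway, I edited the script again and replaced ${i} with a literal "0", so that the loop downloads only the first shard, which is all the 7B model has.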

for s in $(seq -f "0%g" 0 "0")
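To make the effect of that loop header concrete, here is a small Python sketch of what seq -f "0%g" 0 N produces. The function names are hypothetical, and the consolidated.NN.pth naming is taken from the file name mentioned later in this post; the last index of 3 in the usage lines is an arbitrary illustration, not a claim about any particular model.

```python
def shard_suffixes(last_shard):
    """Mimic `seq -f "0%g" 0 last_shard`: "00", "01", ... up to last_shard."""
    return ["0%g" % i for i in range(last_shard + 1)]


def shard_filenames(last_shard):
    """Build checkpoint file names of the form consolidated.NN.pth."""
    return ["consolidated.%s.pth" % s for s in shard_suffixes(last_shard)]


if __name__ == "__main__":
    # With the upper bound fixed at "0", the loop runs exactly once.
    print(shard_filenames(0))
    # A hypothetical larger bound yields one file per index.
    print(shard_filenames(3))
```

With the bound hard-coded to "0", the loop body runs once and fetches a single shard.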

Copy llama.sh into the llama.cpp/models directory, then run it:

sh llama.sh

Partway through, the script failed because the md5sum command could not be found, so I installed the package that provides it:

brew install md5sha1sum
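If installing md5sum is not an option, the same integrity check can be approximated with Python's standard library. This is a minimal sketch, assuming the checklist uses the usual md5sum layout of one "<md5>  <filename>" entry per line with file names relative to the checklist's directory; verify_checklist and md5_of_file are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte weights never sit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checklist(checklist_path):
    """Return {filename: bool} for every entry in an md5sum-style checklist."""
    checklist = Path(checklist_path)
    results = {}
    for line in checklist.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' before the file name.
        name = name.strip().lstrip("*")
        target = checklist.parent / name
        results[name] = target.is_file() and md5_of_file(target) == expected.lower()
    return results
```

Calling verify_checklist on a checklist file returns True for each file whose hash matches and False for files that are missing or corrupted.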

Model preparation

The model conversion is done by a Python script, so the Python environment has to be set up and its dependencies installed first.

pipenv shell --python 3.10
pip install torch numpy sentencepiece

Next, convert the LLaMA model to ggml, the Georgi Gerganov machine learning format.

python convert-pth-to-ggml.py models/7B/ 1

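This conversion script turns the original model, models/7B/consolidated.00.pth, into models/7B/ggml-model-f16.bin, a ggml model that is also about 13 GB. The last argument on the command line selects the precision: 0 uses float32, which doubles the size of the output file; 1 uses float16, which is the default.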
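The file sizes above follow from simple arithmetic: parameter count times bytes per weight. The sketch below illustrates this, assuming roughly 6.7 billion parameters for the 7B model (an approximation introduced here, not a figure from this post) and ignoring file headers and the vocabulary.

```python
# Assumption: approximate parameter count of the 7B model.
N_PARAMS_7B = 6.7e9

BYTES_PER_WEIGHT = {
    "float32": 4,  # conversion argument 0
    "float16": 2,  # conversion argument 1
}


def estimated_size_gb(n_params, bytes_per_weight):
    """Rough model file size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bytes_per_weight / 1e9


if __name__ == "__main__":
    for name, width in BYTES_PER_WEIGHT.items():
        print(name, round(estimated_size_gb(N_PARAMS_7B, width), 1), "GB")
```

Under that assumption float16 lands near 13 GB and float32 at about twice that, which matches the sizes described above.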

Next comes the "quantization", or discretization, of the model. This step "quantizes the model to 4-bits":

./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

The result, models/7B/ggml-model-q4_0.bin, is a 3.9 GB file; this is the model file that is actually used for inference.
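To give a feel for what 4-bit quantization does, here is a simplified sketch of block-wise symmetric quantization in plain Python. It is an illustration of the general idea only and is not llama.cpp's actual q4_0 implementation; the block size of 32, the integer range of -7 to 7, and the function names are all assumptions made for this example.

```python
BLOCK_SIZE = 32  # assumption: weights are quantized in fixed-size blocks


def quantize_block(values):
    """Map floats to small integers in [-7, 7] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-7, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the scale and the small integers."""
    return [scale * q for q in quants]


if __name__ == "__main__":
    weights = [(i - 16) / 10 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(weights)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("scale:", scale, "worst error:", worst)
```

Each weight then needs only 4 bits plus a share of the per-block scale, which is why the quantized file is so much smaller than the float16 one, at the cost of a small rounding error per weight.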

Running the model and prompting

./main -m ./models/7B/ggml-model-q4_0.bin \
  -t 8 \
  -n 128 \
  -p 'The first man on the moon was '

`./main --help` shows the options. `-m` is the model. `-t` is the number of threads to use. `-n` is the number of tokens to generate. `-p` is the prompt.
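When calling the binary from a script, it is safer to assemble the arguments as a list than to paste a prompt into a shell string. The sketch below uses only the four flags described above; build_main_command is a hypothetical helper, and the default values simply mirror the example command.

```python
import shlex


def build_main_command(model_path, prompt, threads=8, n_tokens=128, binary="./main"):
    """Assemble the argument list for the ./main binary built earlier."""
    return [
        binary,
        "-m", model_path,     # model file
        "-t", str(threads),   # number of threads
        "-n", str(n_tokens),  # number of tokens to generate
        "-p", prompt,         # prompt text
    ]


if __name__ == "__main__":
    cmd = build_main_command(
        "./models/7B/ggml-model-q4_0.bin",
        "The first man on the moon was ",
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from inside the llama.cpp directory, which avoids quoting problems when the prompt contains apostrophes or other shell metacharacters.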

Now run your first prompt:

./main -m ./models/7B/ggml-model-q4_0.bin -p "The first man on the moon was "

For more details, see here.