Tutorial: starting three processes: chat (GPU), embedding (CPU), rerank (CPU)

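Example hardware: a discrete GPU with about 12 GB of VRAM (this guide uses an RTX-series card as the example; substitute your own).

Models and binary (set these variables first)

In every terminal that will start a process, set the following first. This is the only group of values you need to adapt to your machine: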

# llama.cpp has already been built with CMake, so build/bin/llama-server exists
export LLAMA_BIN="/path/to/llama.cpp/build/bin/llama-server"

# Full file paths of the three GGUF files
export CHAT_GGUF="/path/to/models/gguf/your-4b-chat-Q4_K_M.gguf"
export EMBED_GGUF="/path/to/models/gguf/bge-base-zh-v1.5/bge-base-zh-v1.5-f16.gguf"
export RERANK_GGUF="/path/to/models/gguf/bge-reranker-base/bge-reranker-base-f16.gguf"

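Purpose	Notes
Chat, 4B (GPU)	For example a Qwen3 4B-family model as a Q4_K_M GGUF
Embedding (CPU)	BGE Chinese model, F16 or a GGUF you converted yourself; served with `--embedding`
Rerank (CPU)	BGE reranker; served with `--pooling rank --reranking`

The binary is the LLAMA_BIN set above. The -m option needs the full path of a .gguf file; a directory name will not work.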
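
A mistyped path or a directory passed to -m is easier to catch before any server starts. The following is a minimal preflight sketch; check_gguf and check_paths are hypothetical helper names, not part of llama.cpp, and the sketch only tests that the files exist and carry the .gguf suffix.

```shell
# check_gguf NAME PATH: succeed only if PATH is a regular file ending in .gguf.
check_gguf() {
  case "$2" in
    *.gguf) [ -f "$2" ] && return 0 ;;
  esac
  echo "$1 is not a .gguf file: ${2:-<unset>}" >&2
  return 1
}

# check_paths: verify LLAMA_BIN and the three model variables in one go.
check_paths() {
  local rc=0
  # The binary must be a regular file with the executable bit set
  # (a bare -x test would also accept a directory).
  if [ ! -f "${LLAMA_BIN:-}" ] || [ ! -x "${LLAMA_BIN:-}" ]; then
    echo "LLAMA_BIN is not an executable file: ${LLAMA_BIN:-<unset>}" >&2
    rc=1
  fi
  check_gguf CHAT_GGUF   "${CHAT_GGUF:-}"   || rc=1
  check_gguf EMBED_GGUF  "${EMBED_GGUF:-}"  || rc=1
  check_gguf RERANK_GGUF "${RERANK_GGUF:-}" || rc=1
  return "$rc"
}
```

Running check_paths after the exports prints one line per problem to stderr and returns non-zero if anything is wrong.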

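0. Preparation: one terminal per process

The three processes do not depend on each other and can be started and stopped independently. Giving each its own terminal makes the logs easiest to follow.

1. Process A: chat (GPU)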

export CUDA_VISIBLE_DEVICES=0

"$LLAMA_BIN" \
  -m "$CHAT_GGUF" \
  --host 127.0.0.1 --port 28080 \
  -ngl 99 \
  -c 8192

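-ngl 99: offload up to 99 layers to the GPU; a 4B model has fewer layers than that, so every layer is offloaded. On OOM, lower -c or -ngl.
-c 8192: a context window of 8192 tokens; adjust as needed.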
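
One way to apply the "lower -c" advice systematically is to step down through halved context sizes. The sketch below only prints the candidate values; ctx_candidates and its 2048 floor are illustrative assumptions, not llama.cpp settings.

```shell
# ctx_candidates START [FLOOR]: print START, START/2, START/4, ...
# stopping before the value drops below FLOOR (default 2048).
ctx_candidates() {
  local c="$1" floor="${2:-2048}"
  while [ "$c" -ge "$floor" ]; do
    echo "$c"
    c=$((c / 2))
  done
}
```

For example, ctx_candidates 8192 lists 8192, 4096 and 2048, one per line. Try each value as the -c argument in turn and keep the first one that loads without OOM.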

Self-check:

curl -s http://127.0.0.1:28080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "你好，说一句话。"}],
    "max_tokens": 512
  }' | jq '.choices[0].message.content'
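
Right after launch the model may still be loading, so a self-check run too early can fail even though the process is healthy. A small retry helper is one way around that. retry_until is a hypothetical name, and the example probe assumes the /health endpoint that llama-server exposes.

```shell
# retry_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts. Returns 1 if it never succeeds.
retry_until() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt is still to come.
    if [ "$i" -le "$tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (left commented out so that sourcing this file has no side effects):
# wait up to about a minute for the chat server on port 28080.
# retry_until 30 2 curl -sf -o /dev/null http://127.0.0.1:28080/health
```

The same call with port 28081 or 28082 waits for the embedding or rerank process.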

2. Process B: embedding (CPU)

export CUDA_VISIBLE_DEVICES=

"$LLAMA_BIN" \
  -m "$EMBED_GGUF" \
  --host 127.0.0.1 --port 28081 \
  --embedding \
  --pooling cls \
  -ngl 0 \
  -ub 512

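--embedding switches the server to embedding mode, --pooling cls selects CLS pooling (the pooling BGE models use), and -ngl 0 keeps every layer on the CPU.

Self-check: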

curl -s http://127.0.0.1:28081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"bge","input":"测试向量化一句话"}' | jq '.data[0].embedding | length'

3. Process C: rerank (CPU)

export CUDA_VISIBLE_DEVICES=

"$LLAMA_BIN" \
  -m "$RERANK_GGUF" \
  --host 127.0.0.1 --port 28082 \
  --embedding \
  --pooling rank \
  --reranking \
  -ngl 0

Self-check (the port must match the one the process listens on; 28082 here):

curl -s http://127.0.0.1:28082/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker",
    "query": "今天天气怎么样",
    "top_n": 2,
    "documents": ["今天阳光明媚", "明天会下雨", "无关内容"]
  }' | jq .

4. Quick reference: the three service URLs

Service	URL
Chat	`http://127.0.0.1:28080/v1/chat/completions`
Embedding	`http://127.0.0.1:28081/v1/embeddings`
Rerank	`http://127.0.0.1:28082/v1/rerank`
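
If client scripts need these URLs, keeping them in one lookup function avoids scattering port numbers around. This is a sketch under the port assignments used in this guide; service_url and the short names chat, embed and rerank are hypothetical.

```shell
# service_url NAME: print the endpoint URL for chat, embed, or rerank.
# Unknown names print an error to stderr and return 1.
service_url() {
  case "$1" in
    chat)   echo "http://127.0.0.1:28080/v1/chat/completions" ;;
    embed)  echo "http://127.0.0.1:28081/v1/embeddings" ;;
    rerank) echo "http://127.0.0.1:28082/v1/rerank" ;;
    *)      echo "unknown service: $1" >&2; return 1 ;;
  esac
}
```

A self-check can then be written as curl -s "$(service_url embed)" followed by the same -H and -d arguments as before.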

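5. Troubleshooting

Symptom	Fix
Chat process runs out of memory (OOM)	Lower `-c` or `-ngl`
Embedding request returns 400 or an empty result	Check that `--pooling` matches the model card
Rerank request returns 404	Check that the process was started with all of `--embedding --pooling rank --reranking`
Embedding or rerank process still uses the GPU	Start that process with an empty `CUDA_VISIBLE_DEVICES` and `-ngl 0`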
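
An easy slip with the last row is leaving CUDA_VISIBLE_DEVICES unset instead of empty: an unset variable leaves every GPU visible, while an empty value hides them all. The sketch below tells the two apart in the current shell; gpu_hidden is a hypothetical helper name.

```shell
# gpu_hidden: succeed only if CUDA_VISIBLE_DEVICES is set AND empty.
# ${VAR+set} expands to "set" when VAR is defined, even if it is empty.
gpu_hidden() {
  [ -n "${CUDA_VISIBLE_DEVICES+set}" ] && [ -z "${CUDA_VISIBLE_DEVICES-}" ]
}
```

Calling gpu_hidden || echo "GPU still visible" just before launching the embedding or rerank server catches the mistake early. The variable must also be exported, as in the launch commands in this guide, so that the server process inherits it.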