BitNet.cpp: An Efficient 1.58-bit LLM Inference Framework
BitNet.cpp is Microsoft's official inference framework for 1-bit large language models such as BitNet b1.58. It provides a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming soon).
Features
- Efficient inference kernels: dedicated compute kernels optimized for 1.58-bit models, supporting W2A8 (2-bit weights × 8-bit activations) computation; see the quantization sketch after this list
- Multi-platform support: full support for CPU (ARM/x86) and GPU inference, with NPU support on the way
- Significant speedups: 1.37x to 5.07x on ARM CPUs and 2.37x to 6.17x on x86 CPUs
- Substantial energy savings: 55.4% to 82.2% reduction in energy consumption, improving overall efficiency
- Large-model support: runs a 100B-parameter BitNet b1.58 model on a single CPU at human reading speed (5-7 tokens per second)
- Complete toolchain: a one-stop solution for model conversion, quantization, inference, and server deployment
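As background, the sketch below illustrates the W2A8 idea in plain PyTorch: ternary (1.58-bit) weights obtained by absmean quantization and 8-bit activations obtained by per-token absmax quantization, following the scheme described in the BitNet b1.58 paper. This is a minimal illustration, not the optimized kernel code in this repo.
import torch

def quantize_weights_ternary(w: torch.Tensor):
    # Absmean quantization: scale by the mean absolute value,
    # then round and clip to the ternary set {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1), scale

def quantize_activations_int8(x: torch.Tensor):
    # Per-token absmax quantization of activations to int8 ([-128, 127]).
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127), scale

# W2A8 matmul: integer-style accumulation, then rescale back to floating point.
w_q, w_scale = quantize_weights_ternary(torch.randn(2560, 2560))
x_q, x_scale = quantize_activations_int8(torch.randn(1, 2560))
y = (x_q @ w_q.t()) * w_scale / x_scale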
Installation
System Requirements
- Python 3.x
- CMake
- C++ compiler (with C++17 support)
- CUDA (for the GPU build)
Basic Installation
- Clone the repository:
git clone --recursive https://github.com/microsoft/BitNet
cd BitNet
- Set up the environment and build:
# Run the setup script
python setup_env.py
GPU Installation
# Create a conda environment
conda create --name bitnet-gpu "python<3.13"
conda activate bitnet-gpu
# Install dependencies
pip install -r requirements.txt
# Build the CUDA kernels
cd bitnet_kernels
bash compile.sh
cd ..
Model Download and Conversion
# Download the BitNet-b1.58-2B model
mkdir checkpoints
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-bf16 --local-dir ./checkpoints/bitnet-b1.58-2B-4T-bf16
# Convert the model format
python ./convert_safetensors.py --safetensors_file ./checkpoints/bitnet-b1.58-2B-4T-bf16/model.safetensors --output checkpoints/model_state.pt --model_name 2B
python ./convert_checkpoint.py --input ./checkpoints/model_state.pt
Usage
Command-Line Inference
# Run inference
python run_inference.py -p "Microsoft Corporation is an American multinational corporation and technology company headquartered in Redmond, Washington." -n 128
Interactive Generation
python3 ./generate.py ./checkpoints/ --interactive --chat_format
Starting the Inference Server
# Start the server
python run_server.py --model models/bitnet_b1_58-3B/ggml-model-i2_s.gguf --port 8080
# The server listens on 127.0.0.1:8080 by default and supports continuous batching
API Parameters
- -m, --model: path to the model file
- -p, --prompt: prompt used for text generation
- -n, --n-predict: number of tokens to predict
- -t, --threads: number of threads to use
- -c, --ctx-size: context size
- --temperature: hyperparameter controlling the randomness of the generated text
- --host: address the server listens on
- --port: port the server listens on
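For reference, a minimal Python client might look like the following. It assumes the server exposes a llama.cpp-style /completion endpoint; check the actual API of run_server.py before relying on the exact route and field names.
import requests

# Hypothetical client call; the endpoint and field names are assumptions
# based on llama.cpp-style servers, not verified against run_server.py.
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Microsoft Corporation is",
        "n_predict": 64,       # number of tokens to generate
        "temperature": 0.7,    # sampling randomness
    },
)
print(resp.json())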
Core Code
1. Model Parameter Configuration
from dataclasses import dataclass

@dataclass
class ModelArgs:
    dim: int = 2560
    n_layers: int = 30
    n_heads: int = 20
    n_kv_heads: int = 5
    vocab_size: int = 128256
    ffn_dim: int = 6912
    norm_eps: float = 1e-5
    rope_theta: float = 500000.0
    use_kernel: bool = False
Defines the basic parameters of the BitNet model, including the hidden dimension, number of layers, number of attention heads, and so on.
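For intuition, the defaults above imply the following derived shapes (a quick sanity check, not code from the repo):
args = ModelArgs()
head_dim = args.dim // args.n_heads            # 2560 // 20 = 128
gqa_groups = args.n_heads // args.n_kv_heads   # 4 query heads share each KV head
kv_dim = args.n_kv_heads * head_dim            # 5 * 128 = 640
print(head_dim, gqa_groups, kv_dim)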
2. 1.58-bit Linear Layer Implementation
import torch
import torch.nn as nn

class BitLinearKernel(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = False):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Packed 2-bit weights: four int2 values per int8 byte, hence in_features // 4
        self.weight = torch.nn.Parameter(torch.zeros(out_features, in_features // 4, dtype=torch.int8))
        self.weight_scale = torch.nn.Parameter(torch.ones(4, dtype=torch.bfloat16))
        self.use_kernel = False  # enabled when the custom CUDA kernel is available

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_kernel:
            return bitnet_int8xint2_linear(x, self.weight, self.weight_scale)
        else:
            # Fallback path (reference only; assumes the weight has been unpacked to full precision)
            return x @ self.weight.float()
Implements the 1.58-bit linear layer, which computes with 2-bit weights and 8-bit activations and supports kernel acceleration.
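As a simplified reference for the math this layer accelerates (not the repo's exact code; the real kernel also handles the packed weight layout and per-group scales):
import torch

def bitlinear_reference(x_int8: torch.Tensor, w_ternary: torch.Tensor,
                        w_scale: float, x_scale: torch.Tensor) -> torch.Tensor:
    # int8 activations times ternary {-1, 0, +1} weights, accumulated in int32,
    # then rescaled back to floating point.
    acc = x_int8.to(torch.int32) @ w_ternary.to(torch.int32).t()
    return acc.to(torch.float32) * w_scale / x_scale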
3. Weight Conversion and Quantization
def convert_weight_int8_to_int2(weight):
    """Convert int8 weights into the int2 format for efficient storage and compute."""
    N = weight.shape[0]
    K = weight.shape[1]
    weight = weight + 2  # shift the value range
    weight = weight.cpu().numpy()
    # Permute the weights to optimize memory access
    permutated_weight = permutate_weight_fastest(weight)
    # Compress int2 values into int8 bytes
    compressed_weight = compress_int2_to_int8(permutated_weight)
    # Interleave the weights to speed up decoding
    interleaved_weight = interleave_weight_int8(compressed_weight, 2)
    return torch.from_numpy(interleaved_weight).reshape((N, K // 4))
Implements efficient weight conversion from int8 to int2, including permutation, compression, and interleaving optimizations.
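To illustrate just the compression step (ignoring permutation and interleaving), four 2-bit codes fit into one int8 byte; a minimal sketch, not the repo's exact bit layout:
import numpy as np

def pack_int2_to_int8(codes: np.ndarray) -> np.ndarray:
    # codes: values in {0, 1, 2, 3} (e.g., ternary weights {-1, 0, +1} shifted by +2),
    # shape (N, K) with K divisible by 4; returns shape (N, K // 4)
    codes = codes.astype(np.uint8).reshape(codes.shape[0], -1, 4)
    packed = (codes[..., 0]
              | (codes[..., 1] << 2)
              | (codes[..., 2] << 4)
              | (codes[..., 3] << 6))
    return packed.astype(np.int8)

print(pack_int2_to_int8(np.array([[1, 2, 3, 1]])))  # [[121]] -> 1 | 2<<2 | 3<<4 | 1<<6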
4. CUDA Kernel Function
template <int M, int N, int K, int ws_num, int K_block_size, int N_block_size>
__global__ void ladder_int8xint2_kernel(int8_t* A, int8_t* B, __nv_bfloat16* dtype_transform,
                                        __nv_bfloat16* s, __nv_bfloat16* ws) {
    // Kernel implementing int8 x int2 matrix multiplication
    constexpr int K_per_loop = 16;
    constexpr int wmma_K = 32;
    constexpr int wmma_N = 16;
    int in_thread_C_local[1] = {0};
    #pragma unroll
    for (int k_0 = 0; k_0 < K / (K_per_loop * K_block_size); ++k_0) {
        // Load the A matrix
        *(int4*)(A_local) = *(int4*)(A + ((k_0 * K_per_loop * K_block_size) + (((int)threadIdx.x) * K_per_loop)));
        // Load and decode the B matrix (int2 format)
        B_reshape_local[0] = *(int*)(B + ...);
        decode_i2s_to_i8s(B_reshape_local, B_decode_local, 16);
        // Dot products via the dp4a instruction
        #pragma unroll
        for (int k_2_0 = 0; k_2_0 < 4; ++k_2_0) {
            in_thread_C_local[0] = __dp4a(*(int *)&A_local[((k_2_0 * 4))],
                                          *(int *)&B_decode_local[((k_2_0 * 4))],
                                          in_thread_C_local[0]);
        }
    }
    // Warp-level reduction
    red_buf0[0] = in_thread_C_local[0];
    #pragma unroll
    for (int offset = K_block_size / 2; offset > 0; offset /= 2) {
        red_buf0[0] += __shfl_down_sync(__activemask(), red_buf0[0], offset, K_block_size);
    }
    // Apply the scales and convert the data type
    dtype_transform[out_idx] = (__nv_bfloat16)(((float)red_buf0[0]) / (float)s[0] * (float)ws[ws_idx]);
}
The CUDA kernel implements efficient int8 × int2 matrix multiplication using the dp4a instruction and memory-access optimizations.
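To clarify the inner loop, __dp4a takes two registers each holding four packed int8 values, multiplies them pairwise, and adds the sum to an int32 accumulator; a rough Python equivalent, for intuition only:
import numpy as np

def dp4a(a_bytes, b_bytes, acc):
    # Emulates CUDA's __dp4a: dot product of four int8 pairs plus an int32 accumulator
    a = np.asarray(a_bytes, dtype=np.int32)
    b = np.asarray(b_bytes, dtype=np.int32)
    return int(acc + np.dot(a, b))

print(dp4a([1, -2, 3, 4], [5, 6, -7, 8], 0))  # 5 - 12 - 21 + 32 = 4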
5. Top-p Sampling Implementation
import torch

@torch.compile
def top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    """
    Perform top-p (nucleus) sampling over a probability distribution.

    Args:
        probs: probability distribution tensor
        p: probability threshold for top-p sampling

    Returns:
        sampled token index
    """
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0
    next_token = torch.multinomial(probs_sort, num_samples=1)
    next_token = torch.gather(probs_idx, -1, next_token)
    return next_token
Implements the top-p (nucleus) sampling algorithm, used to balance the diversity and quality of generated text.
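A typical call site (illustrative only) first applies a temperature-scaled softmax to the logits and then samples:
logits = torch.randn(1, 128256)              # logits over the vocabulary
probs = torch.softmax(logits / 0.8, dim=-1)  # temperature = 0.8
next_token = top_p(probs, p=0.9)             # shape (1, 1): index of the sampled token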
6. Model Conversion Tool
import torch
from typing import Optional
from safetensors.torch import load_file

def convert_back(safetensors_path: str, output_file: str, model_name: Optional[str] = None):
    """Convert a Safetensors file back into a Torch checkpoint."""
    st_dict = load_file(safetensors_path)
    cfg = ModelArgs.from_name(model_name)
    recovered = {}
    for layer in range(cfg.n_layer):
        base = f"model.layers.{layer}."
        # Merge the QKV weights
        wq = st_dict[f"{base}self_attn.q_proj.weight"]
        wk = st_dict[f"{base}self_attn.k_proj.weight"]
        wv = st_dict[f"{base}self_attn.v_proj.weight"]
        wq = invert_convert_q(wq, cfg)
        wk = invert_convert_k(wk, cfg)
        wqkv = torch.cat([wq, wk, wv], dim=0)
        recovered[f"layers.{layer}.attention.wqkv.weight"] = wqkv
        # Copy the remaining weights
        recovered[f"layers.{layer}.attention.wo.weight"] = st_dict[f"{base}self_attn.o_proj.weight"]
        recovered[f"layers.{layer}.feed_forward.w2.weight"] = st_dict[f"{base}mlp.down_proj.weight"]
    torch.save(recovered, output_file)
Implements the model format conversion tool, supporting conversion from the Hugging Face format to BitNet.cpp's internal format.
These core code excerpts show the key techniques behind BitNet.cpp, including 1.58-bit quantization, efficient kernel design, model conversion, and inference optimization, providing a complete inference solution for 1-bit large language models.