LLaMA (Large Language Model Meta AI) is a family of open large language models developed by Meta (formerly Facebook). Llama 2 was released in 2023 in 7B, 13B, and 70B sizes. For a detailed introduction, see the paper "Llama 2: Open Foundation and Fine-Tuned Chat Models" and other references; this post does not repeat that material and instead focuses on the core techniques and their code implementation. Building on the techniques from the hands-on GPT-2 series of posts, we will modify the GPT-2 source code into a Llama 2 implementation and load its publicly released weights.
Code for this post: Llama2; original reference code: rasbt.
RMSNorm Layer
GPT-2 normalizes with standard LayerNorm, i.e. "subtract the mean, then divide by the standard deviation." Llama 2 instead uses RMSNorm: "skip the mean subtraction and divide directly by the root mean square." In other words, it only rescales and does not re-center. The main motivation is to save computation: the mean and variance no longer need to be computed, only the mean of squares. In practice, for sufficiently large models, scaling alone preserves the directional information even without centering, and training remains stable with comparable quality.
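For reference, the two normalizations can be written as follows (standard formulas, stated here for completeness):
$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\cdot\gamma + \beta, \qquad \mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}\cdot\gamma$$
where $\mu$ and $\sigma^2$ are the mean and variance over the embedding dimension, $d$ is the embedding size, and $\gamma$, $\beta$ are the learnable scale and shift.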
The two implementations are compared below:
import torch
from torch import nn
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
# Learnable scale (gamma) and shift (beta) parameters
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
# Compute mean and variance along the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
# Normalize input
norm_x = (x - mean) / torch.sqrt(var + self.eps)
# Apply scale and shift
return self.scale * norm_x + self.shift
class RMSNorm(nn.Module):
def __init__(self, emb_dim, eps=1e-5):
super().__init__()
self.eps = eps
self.emb_dim = emb_dim
# Learnable scaling parameter (gamma)
self.weight = nn.Parameter(torch.ones(emb_dim)).float()
def forward(self, x):
# Compute root mean square (RMS)
means = x.pow(2).mean(dim=-1, keepdim=True)
# Normalize input by RMS
x_normed = x * torch.rsqrt(means + self.eps)
# Apply scaling and restore original dtype
return (x_normed * self.weight).to(dtype=x.dtype)
We construct a sample input and run both normalizations on it:
# Set random seed for reproducibility
torch.manual_seed(123)
# Create input tensor with uniform distribution shifted away from zero
example = torch.rand(2, 3, 10) * 4 + 3 # values roughly in [3,7]
print("Input tensor (example):")
print("Raw mean:", example.mean().item())
print("Raw std :", example.std().item())
print("Raw RMS :", torch.sqrt(example.pow(2).mean(dim=-1).mean()).item())
# Instantiate normalization layers
layer_norm = LayerNorm(emb_dim=example.shape[-1])
rms_norm = RMSNorm(emb_dim=example.shape[-1])
rms_norm_pytorch = torch.nn.RMSNorm(example.shape[-1], eps=1e-5) # PyTorch built-in
# Apply normalization
out_layer = layer_norm(example)
out_rms = rms_norm(example)
out_rms_pt = rms_norm_pytorch(example)
# Print normalized outputs statistics
print("After LayerNorm:")
print("Mean:", out_layer.mean().item())
print("Std :", out_layer.std().item())
print("RMS :", torch.sqrt(out_layer.pow(2).mean(dim=-1).mean()).item())
print("After RMSNorm (custom):")
print("Mean:", out_rms.mean().item())
print("Std :", out_rms.std().item())
print("RMS :", torch.sqrt(out_rms.pow(2).mean(dim=-1).mean()).item())
print("After RMSNorm (PyTorch built-in):")
print("Mean:", out_rms_pt.mean().item())
print("Std :", out_rms_pt.std().item())
print("RMS :", torch.sqrt(out_rms_pt.pow(2).mean(dim=-1).mean()).item())
The results:
Input tensor (example):
Raw mean: 5.003686428070068
Raw std : 1.1390745639801025
Raw RMS : 5.129594802856445
After LayerNorm:
Mean: -1.033147185580674e-07
Std : 1.0084344148635864
RMS : 0.9999955296516418
After RMSNorm (custom):
Mean: 0.9775436520576477
Std : 0.2125103920698166
RMS : 0.9999997615814209
After RMSNorm (PyTorch built-in):
Mean: 0.9775436520576477
Std : 0.2125103920698166
RMS : 0.9999997615814209
As expected, standard LayerNorm brings the mean close to 0 and both the std and RMS close to 1, whereas RMSNorm only brings the RMS close to 1.
Also, our hand-written RMSNorm matches PyTorch's built-in implementation, so from here on we can simply use the built-in torch.nn.RMSNorm module.
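As a quick sanity check (a small addition, not in the original example), we can confirm the two implementations agree elementwise:
print(torch.allclose(out_rms, out_rms_pt, atol=1e-6))  # expected: True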
SiLU activation
Whereas GPT-2 uses the GELU activation, Llama 2 switches to SiLU (Sigmoid Linear Unit), also known as the Swish function.
Its formula is:
$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
The implementation is very simple:
class SiLU(nn.Module):
def __init__(self):
super(SiLU, self).__init__()
def forward(self, x):
return x * torch.sigmoid(x)
Even more simply, you can use PyTorch's built-in torch.nn.SiLU() module directly.
We can plot GELU and SiLU side by side for comparison:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
gelu = nn.GELU()
silu = nn.SiLU()
x = torch.linspace(-5, 5, 200)
y_gelu = gelu(x)
y_silu = silu(x)
plt.figure(figsize=(12, 3))
for i, (y, label) in enumerate(zip(
[y_gelu, y_silu], ["GELU", "SiLU"]), 1):
plt.subplot(1, 2, i)
plt.plot(x.numpy(), y.detach().numpy(), label=label)
plt.title(f"{label} activation")
plt.xlabel("x")
plt.ylabel(f"{label}(x)")
plt.grid(True)
plt.tight_layout()
plt.show()
As you can see, the two are very similar. SiLU, however, is simpler and faster to compute, since it only needs a sigmoid and a multiplication.
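To put a number on "very similar" (a small check added here, reusing the tensors from the plot above):
# Largest gap between the two activation curves on [-5, 5]
max_gap = (y_gelu - y_silu).abs().max().item()
print(f"max |GELU(x) - SiLU(x)| on [-5, 5]: {max_gap:.3f}")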
SwiGLU in FeedForward
GPT-2 applies the GELU activation directly inside its FeedForward module, whereas Llama 2 uses SwiGLU, a SiLU-based variant of the gated linear unit (GLU). Its formula is:
$$\mathrm{SwiGLU}(x) = \mathrm{SiLU}(x W_1) \odot (x W_2)$$
and the gated result is then projected back with a third matrix $W_3$. In other words, SwiGLU needs two input linear layers to form the gating structure.
The full FeedForward modules are compared below:
class FeedForwardInGPT2(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
nn.GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)
self.silu = nn.SiLU()
def forward(self, x):
x_fc1 = self.fc1(x)
x_fc2 = self.fc2(x)
x = self.silu(x_fc1) * x_fc2
return self.fc3(x)
As you can see, in Llama 2 we can no longer stack the layers with nn.Sequential; instead we need two input linear layers, fc1 and fc2, apply the SwiGLU gating, and then feed the result into fc3. In addition, every linear layer in Llama 2 drops its bias (bias=False). Removing the bias saves computation, and in practice it has been found not to hurt training quality.
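As a quick sanity check (a minimal sketch with made-up sizes, not the real Llama 2 dimensions), the module maps (batch, seq_len, emb_dim) back to the same shape:
# Toy config just to exercise FeedForward; Llama 2 7B uses emb_dim=4096, hidden_dim=11008
toy_cfg = {"emb_dim": 64, "hidden_dim": 172, "dtype": torch.float32}
ff = FeedForward(toy_cfg)
x = torch.randn(2, 8, 64)   # (batch, seq_len, emb_dim)
print(ff(x).shape)          # expected: torch.Size([2, 8, 64])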
RoPE
Instead of traditional absolute positional embeddings, Llama 2 uses rotary positional embeddings (RoPE), which capture absolute and relative position information at the same time. RoPE turns position into an angle of rotation applied to the Query and Key vectors, which makes attention sensitive to relative positions. The design is elegant and is inspired by rotations of complex numbers; see the author's blog post for the full derivation.
Here we briefly go through the math behind it; see also the wiki.
- Rotating the original vector by an angle θ:
Represent the original 2-D vector as the complex number $z = a + b\,i$.
Multiply it by a complex number of unit modulus, $e^{i\theta} = \cos\theta + i\sin\theta$:
$$z\,e^{i\theta} = (a\cos\theta - b\sin\theta) + (a\sin\theta + b\cos\theta)\,i$$
This can also be written as a matrix product:
$$\begin{pmatrix} a' \\ b' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}$$
That is, multiplying the original vector by a unit-modulus complex number rotates the 2-D vector [a, b] counterclockwise by the angle θ.
- The distance between rotated vectors depends only on the relative position m − n:
Let the vectors at positions m and n, rotated by the angles mθ and nθ respectively, be $q_m = R(m\theta)\,q$ and $k_n = R(n\theta)\,k$, where $R(\cdot)$ is the rotation matrix above.
Their Euclidean distance (2-norm) is
$$\lVert q_m - k_n \rVert = \lVert R(m\theta)\,q - R(n\theta)\,k \rVert = \lVert R((m-n)\theta)\,q - k \rVert,$$
because rotations preserve norms. In other words, after rotation the distance between the vectors depends only on their relative position (see the quick numeric check below).
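This property is easy to verify numerically. The following quick check (an illustration added here, not part of the original derivation) rotates two 2-D vectors by position-dependent angles and shows that both their distance and their dot product depend only on m − n:
import torch
def rot2d(v, angle):
    # Rotate a 2-D vector counterclockwise by `angle`
    c, s = torch.cos(angle), torch.sin(angle)
    return torch.stack([v[0] * c - v[1] * s, v[0] * s + v[1] * c])
theta = torch.tensor(0.3)
q = torch.tensor([1.0, 2.0])
k = torch.tensor([-0.5, 1.5])
for m, n in [(3, 1), (7, 5), (10, 8)]:   # every pair has m - n = 2
    q_m, k_n = rot2d(q, m * theta), rot2d(k, n * theta)
    # Both quantities come out identical for every pair with the same m - n
    print((q_m - k_n).norm().item(), torch.dot(q_m, k_n).item())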
The key idea of RoPE is therefore to rotate each position's vector by a position-dependent angle rather than adding a positional embedding. Instead of computing x + pos_embedding, it rotates the Query and Key vectors so that they implicitly carry position information. Concretely, RoPE treats each pair of dimensions in a token vector as a complex number and applies a position-dependent rotation to it, encoding the position directly inside the vector.
The full RoPE code is as follows:
import torch
def precompute_rope_params(seq_len, head_dim):
"""
Precompute sin and cos tensors for RoPE.
Args:
seq_len: sequence length
head_dim: embedding dimension (must be even)
Returns:
sin, cos: tensors of shape (seq_len, head_dim//2)
"""
half_dim = head_dim // 2
inv_freq = 1.0 / (10000 ** (torch.arange(half_dim).float() / half_dim))
positions = torch.arange(seq_len, dtype=torch.float32)
angles = torch.einsum("i,j->ij", positions, inv_freq) # (seq_len, half_dim)
return torch.sin(angles), torch.cos(angles)
def rotary_pos_emb(x, sin, cos):
"""
Apply Rotary Positional Embedding on input tensor x using precomputed sin and cos.
Args:
x: tensor of shape (batch, num_heads, seq_len, head_dim)
sin: precomputed sin tensor of shape (seq_len, head_dim//2)
cos: precomputed cos tensor of shape (seq_len, head_dim//2)
Returns:
tensor same shape as x with RoPE applied.
"""
print("Rotary Positional Embedding",x.shape)
# x: (batch_size, num_heads, seq_len, head_dim)
batch, num_heads, seq_len, head_dim = x.shape
# x: (batch, num_heads, seq_len, head_dim) -> (batch, num_heads, seq_len, head_dim//2, 2)
x_ = x.view(batch, num_heads, seq_len, head_dim // 2, 2)
print("shape of x_", x_.shape)
print("shape of cos", cos.shape)
# Crop sin/cos to match the actual seq_len
sin = sin[:seq_len, :]
cos = cos[:seq_len, :]
x_rotated = torch.zeros_like(x_)
x_rotated[..., 0] = x_[..., 0] * cos - x_[..., 1] * sin
x_rotated[..., 1] = x_[..., 0] * sin + x_[..., 1] * cos
return x_rotated.view_as(x)
The key steps of the RoPE code are explained below:
- Compute the rotation frequencies (inv_freq)
half_dim = head_dim // 2
inv_freq = 1.0 / (10000 ** (torch.arange(half_dim).float() / half_dim))
half_dim: the embedding dimension is split in half, because the dimensions are grouped in pairs that act as the real and imaginary parts of a complex number.
inv_freq: the inverse rotation frequencies, following the classic Transformer positional-encoding scheme:
$$\text{inv\_freq}_j = \frac{1}{10000^{\,j/\text{half\_dim}}}, \quad j = 0, 1, \dots, \text{half\_dim}-1$$
- Generate the position indices
positions = torch.arange(seq_len, dtype=torch.float32)
This produces the sequence [0, 1, 2, ..., seq_len-1], the position index of each token.
- Compute the matrix of rotation angles
angles = torch.einsum("i,j->ij", positions, inv_freq)
This uses Einstein summation notation to compute an outer product, equivalent to the following broadcasted multiplication:
angles = positions[:, None] * inv_freq[None, :]
Each position is multiplied by each frequency, giving a matrix of shape (seq_len, half_dim): the rotation angle of every position in every dimension pair.
- Compute sin and cos
sin = torch.sin(angles)
cos = torch.cos(angles)
Taking the sine and cosine of the angle matrix gives the entries of the rotation matrices.
- Reshape the input tensor
x_ = x.view(*x.shape[:-1], half_dim, 2)
This reshapes the input from (batch, num_heads, seq_len, head_dim) to (batch, num_heads, seq_len, half_dim, 2), i.e. every two dimensions are grouped into a pair, like the real and imaginary parts of a complex number.
- Apply the rotation
x_rotated = torch.zeros_like(x_)
x_rotated[..., 0] = x_[..., 0] * cos - x_[..., 1] * sin
x_rotated[..., 1] = x_[..., 0] * sin + x_[..., 1] * cos
Here:
x_[..., 0] is the first component of each dimension pair, analogous to the x component (real part) of a 2-D vector.
x_[..., 1] is the second component of each pair, analogous to the y component (imaginary part).
These two lines apply the 2-D rotation formula to each pair of embedding dimensions:
$$\begin{pmatrix} x'_0 \\ x'_1 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \end{pmatrix}$$
cos and sin have shape (seq_len, half_dim), but broadcasting automatically aligns them with the (batch, num_heads, seq_len, half_dim) shape of x_[..., 0] and x_[..., 1].
- Restore the original shape
return x_rotated.view_as(x)
The rotated tensor is reshaped back to the original (batch, num_heads, seq_len, head_dim) shape.
A usage example:
# Example usage
batch_size, num_heads, seq_len, head_dim = 2, 4, 16, 64
x = torch.randn(batch_size, num_heads, seq_len, head_dim)
# Step 1: Precompute sin and cos for RoPE
sin, cos = precompute_rope_params(seq_len, head_dim)
# Step 2: Apply rotary positional embedding with precomputed sin and cos
x_rope = rotary_pos_emb(x, sin, cos)
print(x_rope.shape)  # Should be (2, 4, 16, 64)
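Since RoPE only rotates pairs of dimensions, it should leave vector lengths unchanged; a quick check on the example above (added here as a sanity check):
# Rotations preserve norms: each per-position head vector keeps its length
print(torch.allclose(x.norm(dim=-1), x_rope.norm(dim=-1), atol=1e-5))   # expected: True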
Update MHA with RoPE
GPT-2 applies its absolute positional embedding to the inputs, whereas Llama 2 applies the rotary positional embedding to the Query and Key inside the attention mechanism. We therefore update the MHA code as follows:
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, num_heads, dtype=None):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by n_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads
# Use nn.Linear with shared kwargs to reduce repetition
linear_kwargs = dict(bias=False, dtype=dtype)
self.W_query = nn.Linear(d_in, d_out, **linear_kwargs)
self.W_key = nn.Linear(d_in, d_out, **linear_kwargs)
self.W_value = nn.Linear(d_in, d_out, **linear_kwargs)
self.out_proj = nn.Linear(d_out, d_out, **linear_kwargs)
self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1).bool())
# precompute_rope_params returns (sin, cos); keep the same order here
sin, cos = precompute_rope_params(seq_len=context_length, head_dim=self.head_dim)
self.register_buffer("cos", cos)
self.register_buffer("sin", sin)
def forward(self, x):
b, num_tokens, d_in = x.shape
# Project inputs
keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
# Apply RoPE
keys = rotary_pos_emb(keys, self.sin, self.cos)
queries = rotary_pos_emb(queries, self.sin, self.cos)
# Attention scores with causal mask
attn_scores = queries @ keys.transpose(-2, -1)
attn_scores.masked_fill_(self.mask[:num_tokens, :num_tokens], float('-inf'))
attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
context_vec = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec)
return context_vec
We construct an example batch and run it through the MHA:
batch_size = 2
context_len = 128
max_context_len = 1280
embed_dim = 64
num_heads = 4
example_batch = torch.randn(batch_size, context_len, embed_dim)
mha = MultiHeadAttention(
d_in=embed_dim,
d_out=embed_dim,
context_length=max_context_len,
num_heads=num_heads
)
output = mha(example_batch)
print(output.shape) # Expected: (batch_size, context_len, embed_dim)
Rotary Positional Embedding torch.Size([2, 4, 128, 16])
shape of x_ torch.Size([2, 4, 128, 8, 2])
shape of cos torch.Size([1280, 8])
Rotary Positional Embedding torch.Size([2, 4, 128, 16])
shape of x_ torch.Size([2, 4, 128, 8, 2])
shape of cos torch.Size([1280, 8])
torch.Size([2, 128, 64])
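As an additional check of the causal mask (added here), modifying the last token must not change the outputs at earlier positions:
# Causality check: perturb the last token and compare the earlier positions
perturbed = example_batch.clone()
perturbed[:, -1, :] += 1.0
output_perturbed = mha(perturbed)
print(torch.allclose(output[:, :-1], output_perturbed[:, :-1], atol=1e-6))   # expected: True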
Update TransformerBlock
At this point we have implemented all the core pieces of Llama 2. Next, we combine them and update the TransformerBlock. The main changes are:
1) Replace LayerNorm with RMSNorm to simplify the computation.
2) Remove Dropout; once the model is large enough, dropout is no longer essential.
3) Remove all bias terms, which reduces computation; in practice this does not hurt quality.
4) Add a dtype setting to support more efficient low-precision training and inference, e.g. bfloat16 to save memory and speed things up.
The full code:
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dtype=cfg["dtype"]
)
self.ff = FeedForward(cfg)
self.norm1 = RMSNorm(cfg["emb_dim"])
self.norm2 = RMSNorm(cfg["emb_dim"])
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
x = x + shortcut # Add the original input back
# Shortcut connection for feed-forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x)
x = x + shortcut # Add the original input back
return x
Update the Model
Recall that in GPT-2 the model is simply a stack of repeated TransformerBlocks. For Llama 2 we drop pos_emb in favor of RoPE, switch to RMSNorm, and add the dtype setting. The code:
class Llama2Model(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = RMSNorm(cfg["emb_dim"])
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])
def forward(self, in_idx):
x = self.tok_emb(in_idx)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
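Before spending memory on the full 7B model, it can help to sanity-check the wiring with a tiny made-up config (illustrative values only, not the real Llama 2 sizes):
tiny_cfg = {
    "vocab_size": 100, "context_length": 32, "emb_dim": 32,
    "n_heads": 2, "n_layers": 2, "hidden_dim": 86, "dtype": torch.float32,
}
tiny_model = Llama2Model(tiny_cfg)
idx = torch.randint(0, 100, (2, 10))   # (batch, seq_len) of token ids
print(tiny_model(idx).shape)           # expected: torch.Size([2, 10, 100])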
Initialize Model
The Llama 2 (7B) model configuration is as follows:
LLAMA2_CONFIG_7B = {
"vocab_size": 32000, # Vocabulary size
"context_length": 4096, # Context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 11008, # Size of the intermediate dimension in FeedForward
"dtype": torch.bfloat16
}
Instantiate the model:
model = Llama2Model(LLAMA2_CONFIG_7B)
Count the total parameters:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
Total number of parameters: 6,738,415,616
That comes to about 6.7B parameters, commonly rounded to "7B".
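A rough estimate of the parameter memory (added here; bfloat16 stores 2 bytes per parameter, so the 7B model needs roughly 12–13 GiB just for the weights):
# Parameter memory = numel * bytes per element (2 for bfloat16, 4 for the float32 norm weights)
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Approx. parameter memory: {param_bytes / 1024**3:.1f} GiB")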
Load Tokenizer
GPT-2 uses the tiktoken tokenizer, while Llama 2 uses Google's SentencePiece. Meta published the trained weights and the tokenizer vocabulary on Hugging Face. We can download and load them from HF, but since Llama 2 is not fully open source (it is licensed for personal and non-commercial use), you first need to request access and log in to HF.
Download the tokenizer:
from huggingface_hub import login
login(token="your_hf_token")
from huggingface_hub import hf_hub_download
tokenizer_file = hf_hub_download(
repo_id="meta-llama/Llama-2-7b",
filename="tokenizer.model",
local_dir="Llama-2-7b"
)
Define the tokenizer:
import sentencepiece as spm
class LlamaTokenizer:
def __init__(self, tokenizer_file):
sp = spm.SentencePieceProcessor()
sp.load(tokenizer_file)
self.tokenizer = sp
def encode(self, text):
return self.tokenizer.encode_as_ids(text)
def decode(self, ids):
return self.tokenizer.decode_pieces(ids)
tokenizer = LlamaTokenizer(tokenizer_file)
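A quick round-trip check (added as a sketch; the exact token ids depend on the SentencePiece vocabulary):
text = "Hello, Llama!"
ids = tokenizer.encode(text)
print(ids)                    # token ids from the Llama 2 vocabulary
print(tokenizer.decode(ids))  # should reproduce the original text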
Let's try generating text:
from gpt2_v2 import generate_text_simple, text_to_tensor, tensor_to_text
torch.manual_seed(123)
token_ids = generate_text_simple(
model=model,
idx=text_to_tensor("At the start of", tokenizer).to("cpu"),
max_new_tokens=30,
context_size=LLAMA2_CONFIG_7B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", tensor_to_text(token_ids, tokenizer))
Output text:
At the start ofзей Warjarewnę обще Opera з went eeuwể Other collaborationlauf’Powerремħ’Powerремħ’ep kur extremely____dataset Multi vida curv
The tokenizer loads fine, but the generated text is close to gibberish, because the model has not been trained and only holds random initial weights.
Load Pretrained Weights
As before, we can download and load the pretrained weights published by Meta AI:
weights_file = hf_hub_download(
repo_id="meta-llama/Llama-2-7b",
filename="consolidated.00.pth",
local_dir="Llama-2-7b"
)
weights = torch.load(weights_file, weights_only=True)
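Before mapping them into our model, it helps to peek at the checkpoint's contents (a small addition; the key names below are the ones used in the mapping code that follows):
# The checkpoint is a flat dict of tensors with names such as
# "tok_embeddings.weight", "layers.0.attention.wq.weight", "norm.weight", "output.weight"
print(len(weights))
print(sorted(weights.keys())[:5])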
Loading the weights is essentially just copying parameters:
def assign(left, right):
if left.shape != right.shape:
raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
return torch.nn.Parameter(right.clone().detach()) if isinstance(right, torch.Tensor) else torch.nn.Parameter(torch.tensor(right))
def load_weights_into_llama(model, param_config, params):
model.tok_emb.weight = assign(model.tok_emb.weight, params["tok_embeddings.weight"])
for l in range(param_config["n_layers"]):
block = model.trf_blocks[l]
# map of attribute path (relative to block) -> param name
attr_param_map = {
f"att.W_query.weight": f"layers.{l}.attention.wq.weight",
f"att.W_key.weight": f"layers.{l}.attention.wk.weight",
f"att.W_value.weight": f"layers.{l}.attention.wv.weight",
f"att.out_proj.weight": f"layers.{l}.attention.wo.weight",
f"norm1.weight": f"layers.{l}.attention_norm.weight",
f"ff.fc1.weight": f"layers.{l}.feed_forward.w1.weight",
f"ff.fc2.weight": f"layers.{l}.feed_forward.w3.weight", # swapped order
f"ff.fc3.weight": f"layers.{l}.feed_forward.w2.weight",
f"norm2.weight": f"layers.{l}.ffn_norm.weight",
}
for attr_path, param_name in attr_param_map.items():
obj = block
*parents, attr = attr_path.split('.')
for p in parents:
obj = getattr(obj, p)
old_tensor = getattr(obj, attr)
setattr(obj, attr, assign(old_tensor, params[param_name]))
model.final_norm.weight = assign(model.final_norm.weight, params["norm.weight"])
model.out_head.weight = assign(model.out_head.weight, params["output.weight"])
device = torch.device("cpu")
load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device);
Running the generation example above again now gives:
Output text:
At the start of the 20th century, the city was a major industrial center, with a large number of factories and mills. The city was also
This time the output is semantically coherent, which confirms that our model code is correct and that the weights were loaded properly.
Try the Instruction-Finetuned Model
Above we loaded the 7B base model, which has only been pretrained and not fine-tuned, so it can merely continue text rather than follow instructions. Next, in the same way, we download and load the instruction-fine-tuned weights.
Download and load the chat model's weights:
weights_file = hf_hub_download(
repo_id="meta-llama/Llama-2-7b-chat",
filename="consolidated.00.pth",
local_dir="Llama-2-7b-chat"
)
weights = torch.load(weights_file, weights_only=True)
model = Llama2Model(LLAMA2_CONFIG_7B)
load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device)
This time we ask it a question:
from gpt2_v2 import generate_text_simple, text_to_tensor, tensor_to_text
torch.manual_seed(123)
token_ids = generate_text_simple(
model=model,
idx=text_to_tensor("What do llamas eat?", tokenizer).to(device),
max_new_tokens=30,
context_size=LLAMA2_CONFIG_7B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", tensor_to_text(token_ids, tokenizer))
Output text:
What do llamas eat?
Llamas are herbivores, which means they eat plants. They eat grass, hay, and other plants.
What do llam
With that, we have completed the Llama 2 implementation and downloaded and loaded the pretrained weights.