[Building LLMs by Hand] Writing Llama2 from Scratch


LLaMA (Large Language Model Meta AI) is a family of open large language models developed by Meta (formerly Facebook). Llama 2, released in 2023, comes in 7B, 13B, and 70B sizes. For a detailed introduction, see the paper Llama 2: Open Foundation and Fine-Tuned Chat Models and other references; rather than repeating that material, this post focuses on the core techniques and their code implementation. Building on the [Building LLMs by Hand] GPT-2 series, we start from the GPT-2 source code, modify it into Llama2, and load the publicly released weights.

Code for this post: Llama2. Original reference implementation: rasbt.

RMSNorm Layer

GPT-2 normalizes with standard LayerNorm, i.e. "subtract the mean, then divide by the standard deviation". Llama2 uses RMSNorm instead: "skip the mean subtraction and divide directly by the root mean square". In other words, it only rescales and does not re-center. The main motivation is to reduce computation: there is no mean or variance to compute, only a sum of squares. In practice, for sufficiently large models, scaling alone still preserves the directional information even without centering, and training remains stable and effective.

The two implementations are compared below:

import torch
from torch import nn

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        # Learnable scale (gamma) and shift (beta) parameters
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        # Compute mean and variance along the last dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        # Normalize input
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        # Apply scale and shift
        return self.scale * norm_x + self.shift

class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.emb_dim = emb_dim
        # Learnable scaling parameter (gamma)
        self.weight = nn.Parameter(torch.ones(emb_dim).float())

    def forward(self, x):
        # Mean of squares along the last dimension
        means = x.pow(2).mean(dim=-1, keepdim=True)
        # Normalize input by RMS
        x_normed = x * torch.rsqrt(means + self.eps)
        # Apply scaling and restore original dtype
        return (x_normed * self.weight).to(dtype=x.dtype)

We construct an example input and run each normalization:

# Set random seed for reproducibility
torch.manual_seed(123)

# Create input tensor with uniform distribution shifted away from zero
example = torch.rand(2, 3, 10) * 4 + 3  # values roughly in [3,7]

print("Input tensor (example):")
print("Raw mean:", example.mean().item())
print("Raw std :", example.std().item())
print("Raw RMS :", torch.sqrt(example.pow(2).mean(dim=-1).mean()).item())

# Instantiate normalization layers
layer_norm = LayerNorm(emb_dim=example.shape[-1])
rms_norm = RMSNorm(emb_dim=example.shape[-1])
rms_norm_pytorch = torch.nn.RMSNorm(example.shape[-1], eps=1e-5)  # PyTorch built-in

# Apply normalization
out_layer = layer_norm(example)
out_rms = rms_norm(example)
out_rms_pt = rms_norm_pytorch(example)

# Print normalized outputs statistics
print("After LayerNorm:")
print("Mean:", out_layer.mean().item())
print("Std :", out_layer.std().item())
print("RMS :", torch.sqrt(out_layer.pow(2).mean(dim=-1).mean()).item())

print("After RMSNorm (custom):")
print("Mean:", out_rms.mean().item())
print("Std :", out_rms.std().item())
print("RMS :", torch.sqrt(out_rms.pow(2).mean(dim=-1).mean()).item())

print("After RMSNorm (PyTorch built-in):")
print("Mean:", out_rms_pt.mean().item())
print("Std :", out_rms_pt.std().item())
print("RMS :", torch.sqrt(out_rms_pt.pow(2).mean(dim=-1).mean()).item())

The results:

Input tensor (example):
Raw mean: 5.003686428070068
Raw std : 1.1390745639801025
Raw RMS : 5.129594802856445
After LayerNorm:
Mean: -1.033147185580674e-07
Std : 1.0084344148635864
RMS : 0.9999955296516418
After RMSNorm (custom):
Mean: 0.9775436520576477
Std : 0.2125103920698166
RMS : 0.9999997615814209
After RMSNorm (PyTorch built-in):
Mean: 0.9775436520576477
Std : 0.2125103920698166
RMS : 0.9999997615814209

As expected, after standard LayerNorm the mean is close to 0 and both std and RMS are close to 1; after RMSNorm only the RMS is close to 1.

Also, our hand-written RMSNorm matches PyTorch's built-in implementation, so from here on we could simply use the torch.nn.RMSNorm module.

SiLU activation

Whereas GPT-2 uses the GELU activation, Llama2 switches to SiLU (Sigmoid Linear Unit), also known as the Swish function.

Its formula is:

$$\text{silu}(x) = x \cdot \sigma(x), \quad \text{where } \sigma(x) \text{ is the logistic sigmoid}$$

The implementation is trivially simple:

class SiLU(nn.Module):
    def __init__(self):
        super(SiLU, self).__init__()

    def forward(self, x):
        return x * torch.sigmoid(x)

Or, even simpler, use PyTorch's built-in torch.nn.SiLU() module directly.

We can plot GELU and SiLU side by side for comparison:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

gelu = nn.GELU()
silu = nn.SiLU()

x = torch.linspace(-5, 5, 200)

y_gelu = gelu(x)
y_silu = silu(x)

plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip(
    [y_gelu, y_silu], ["GELU", "SiLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x.numpy(), y.detach().numpy(), label=label)
    plt.title(f"{label} activation")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)

plt.tight_layout()
plt.show()

As the plot shows, the two curves are very close. SiLU, however, is simpler and faster to compute, since it only needs a sigmoid and a multiplication.
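As a rough micro-benchmark sketch (illustrative only; absolute numbers depend on hardware, dtype, and backend, so treat it as a sketch rather than a definitive measurement):

import time

x = torch.randn(1000, 1024)
for name, act in [("GELU", nn.GELU()), ("SiLU", nn.SiLU())]:
    start = time.perf_counter()
    for _ in range(1000):
        act(x)  # apply the activation repeatedly and time it
    print(f"{name}: {time.perf_counter() - start:.3f} s")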

SwiGLU in FeedForward

GPT-2 applies the GELU activation directly inside its feed-forward module, whereas Llama2 uses SwiGLU, a gated-activation variant (Gated Linear Unit, GLU) built on SiLU, defined as:

$$\text{SwiGLU}(x) = \text{SiLU}(\text{Linear}_1(x)) \cdot \text{Linear}_2(x)$$

In other words, SwiGLU needs two input linear layers in order to form the gating structure.

The full feed-forward modules are compared below:

class FeedForwardInGPT2(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            nn.GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)
        self.silu = nn.SiLU()

    def forward(self, x):
        x_fc1 = self.fc1(x)
        x_fc2 = self.fc2(x)
        x = self.silu(x_fc1) * x_fc2
        return self.fc3(x)

Note that in Llama2 we can no longer stack the layers with nn.Sequential; instead, the two input linear layers fc1 and fc2 feed the SwiGLU gate, and the result is passed into fc3. Also, every linear layer in Llama2 drops the bias term (bias=False). Removing the bias reduces computation, and in practice it was found not to hurt training quality. A quick shape check of this module follows.
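As a sanity check of the gated feed-forward block, the snippet below runs it on random input; the config values here are made up for illustration and are not Llama2's real sizes:

demo_cfg = {"emb_dim": 64, "hidden_dim": 172, "dtype": torch.float32}
ff = FeedForward(demo_cfg)
x = torch.randn(2, 8, 64)   # (batch, seq_len, emb_dim)
print(ff(x).shape)          # torch.Size([2, 8, 64]) -- same shape as the input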

RoPE

Instead of the traditional absolute positional encoding, Llama2 uses rotary position embedding (RoPE), which captures both absolute and relative position information. RoPE turns position into an angular rotation applied to the Query and Key vectors, making attention sensitive to relative position. The design is very elegant and is inspired by rotation in the complex plane; for the detailed line of thought, see the author's blog.

Here is a brief outline of the math behind it (see also the wiki):

  1. Rotating the original vector by an angle θ:

Write the original vector as a complex number: $z = a + bi$

Multiply it by a complex number of unit modulus: $e^{i\theta} = \cos\theta + i\sin\theta$

This gives: $z' = z \cdot e^{i\theta} = (a\cos\theta - b\sin\theta) + i(a\sin\theta + b\cos\theta)$

This can also be written as a matrix product:

$$z' = \begin{bmatrix} a' \\ b' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$

That is, multiplying the original vector by a unit-modulus complex number is the same as rotating the 2D vector [a, b] counterclockwise by the angle θ.

  2. The distance between rotated vectors depends only on the relative position m - n:

For tokens at positions m and n, the vectors after rotation are:

$$z_m = z \cdot e^{i\theta_m}, \quad z_n = z \cdot e^{i\theta_n}$$

The Euclidean distance (2-norm) between them is:

$$\text{dist}(z_m, z_n) = |z \cdot e^{i\theta_m} - z \cdot e^{i\theta_n}| = |z|\,|e^{i\theta_m} - e^{i\theta_n}| = |z|\,|e^{i(\theta_m - \theta_n)} - 1| = |z|\,|e^{i(m-n)\omega} - 1|$$

(using $\theta_m = m\omega$)

In other words, the distance between the rotated vectors depends only on the relative position, as the short numeric check below illustrates.
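A minimal numeric sketch of this property (the vector and the frequency ω below are arbitrary illustration values):

import math
import torch

def rotate(v, angle):
    # Rotate a 2D vector counterclockwise by the given angle
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]]) @ v

z = torch.tensor([1.0, 2.0])
omega = 0.3

# Two position pairs with the same relative offset m - n = 3
for m, n in [(5, 2), (105, 102)]:
    dist = torch.norm(rotate(z, m * omega) - rotate(z, n * omega))
    print(m, n, dist.item())
# Both distances come out (numerically) identical: only m - n matters.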

The key idea of RoPE is therefore to rotate each position's vector by a specific angle instead of adding a positional encoding. Rather than computing x + pos_embedding, it rotates the Query and Key vectors so that they implicitly carry position information: each pair of dimensions in a token vector is treated as a complex number and rotated according to the token's position.

The full RoPE code is as follows:

import torch

def precompute_rope_params(seq_len, head_dim):
    """
    Precompute the sin and cos tensors for RoPE.

    Args:
        seq_len: sequence length
        head_dim: per-head embedding dimension (must be even)

    Returns:
        sin, cos: tensors of shape (seq_len, head_dim // 2)
    """
    half_dim = head_dim // 2
    inv_freq = 1.0 / (10000 ** (torch.arange(half_dim).float() / half_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("i,j->ij", positions, inv_freq)  # (seq_len, half_dim)
    return torch.sin(angles), torch.cos(angles)


def rotary_pos_emb(x, sin, cos):
    """
    Apply Rotary Positional Embedding to x using precomputed sin and cos.

    Args:
        x: tensor of shape (batch, num_heads, seq_len, head_dim)
        sin: precomputed sin tensor of shape (max_seq_len, head_dim // 2)
        cos: precomputed cos tensor of shape (max_seq_len, head_dim // 2)

    Returns:
        tensor of the same shape as x with RoPE applied.
    """
    print("Rotary Positional Embedding", x.shape)
    batch, num_heads, seq_len, head_dim = x.shape

    # Group each pair of adjacent dimensions into (real, imaginary) components:
    # (batch, num_heads, seq_len, head_dim) -> (batch, num_heads, seq_len, head_dim // 2, 2)
    x_ = x.view(batch, num_heads, seq_len, head_dim // 2, 2)
    print("shape of x_", x_.shape)
    print("shape of cos", cos.shape)

    # Crop sin/cos to match the actual seq_len
    sin = sin[:seq_len, :]
    cos = cos[:seq_len, :]

    # Rotate each 2D pair by its position-dependent angle
    x_rotated = torch.zeros_like(x_)
    x_rotated[..., 0] = x_[..., 0] * cos - x_[..., 1] * sin
    x_rotated[..., 1] = x_[..., 0] * sin + x_[..., 1] * cos

    return x_rotated.view_as(x)

The key steps of the RoPE code are explained below:

  1. Compute the rotation frequencies (inv_freq)
    half_dim = head_dim // 2
    inv_freq = 1.0 / (10000 ** (torch.arange(half_dim).float() / half_dim))

half_dim: the embedding dimension is split in half, because the dimensions are grouped into pairs acting as the real and imaginary parts of a complex number.

inv_freq: the inverse rotation frequencies, following the classic Transformer positional-encoding formula: $\omega_j = \frac{1}{10000^{2j/d}}$

  2. Generate the position indices
positions = torch.arange(seq_len, dtype=torch.float32)

This creates the sequence [0, 1, 2, ..., seq_len-1], i.e. the position index of each token.

  3. Compute the matrix of rotation angles
angles = torch.einsum("i,j->ij", positions, inv_freq)

Einstein summation is used here to compute an outer product, equivalent to the following matrix expression:

angles = positions[:, None] * inv_freq[None, :]

Each position is multiplied by each frequency, giving a matrix of shape (seq_len, half_dim) that holds the rotation angle of every position in every dimension pair.

  4. Compute sin and cos
sin = torch.sin(angles)
cos = torch.cos(angles)

Taking the sine and cosine of the angle matrix gives the corresponding rotation-matrix entries.

  5. Reshape the input tensor
x_ = x.view(batch, num_heads, seq_len, head_dim // 2, 2)

This reshapes the input from (batch, num_heads, seq_len, head_dim) to (batch, num_heads, seq_len, head_dim // 2, 2), i.e. every two dimensions are grouped into a pair, like the real and imaginary parts of a complex number.

  6. Apply the rotation
x_rotated = torch.zeros_like(x_)
x_rotated[..., 0] = x_[..., 0] * cos - x_[..., 1] * sin
x_rotated[..., 1] = x_[..., 0] * sin + x_[..., 1] * cos

where:

x_[..., 0] is the first component of each pair, like the x component (real part) of a 2D vector.

x_[..., 1] is the second component of each pair, like the y component (imaginary part) of a 2D vector.

These two lines apply the 2D rotation-matrix formula to each 2D vector (formed from a pair of embedding dimensions):

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

cos and sin have shape (seq_len, half_dim), but broadcasting makes them automatically match the (batch, num_heads, seq_len, half_dim) shape of x_[..., 0] and x_[..., 1].

  7. Reshape back and return the result
return x_rotated.view_as(x)

The rotated tensor is reshaped back to the original (batch, num_heads, seq_len, head_dim) shape.

A usage example:

# Example usage
batch_size, num_heads, seq_len, head_dim = 2, 4, 16, 16
x = torch.randn(batch_size, num_heads, seq_len, head_dim)
# Step 1: Precompute sin and cos for RoPE
sin, cos = precompute_rope_params(seq_len, head_dim)
# Step 2: Apply rotary positional embedding with precomputed sin and cos
x_rope = rotary_pos_emb(x, sin, cos)

print(x_rope.shape)  # Should be (2, 4, 16, 16)

Update MHA with RoPE

GPT-2's absolute positional embedding is added to the inputs, while Llama2's rotary positional embedding acts on the Query and Key vectors inside the attention mechanism, so we update the MHA code as follows:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, num_heads, dtype=None):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        # Use nn.Linear with shared kwargs to reduce repetition
        linear_kwargs = dict(bias=False, dtype=dtype)
        self.W_query = nn.Linear(d_in, d_out, **linear_kwargs)
        self.W_key = nn.Linear(d_in, d_out, **linear_kwargs)
        self.W_value = nn.Linear(d_in, d_out, **linear_kwargs)
        self.out_proj = nn.Linear(d_out, d_out, **linear_kwargs)

        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1).bool())

        sin, cos = precompute_rope_params(seq_len=context_length, head_dim=self.head_dim)
        self.register_buffer("sin", sin)
        self.register_buffer("cos", cos)

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        # Project inputs
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply RoPE
        keys = rotary_pos_emb(keys, self.sin, self.cos)
        queries = rotary_pos_emb(queries, self.sin, self.cos)

        # Attention scores with causal mask
        attn_scores = queries @ keys.transpose(-2, -1)
        attn_scores.masked_fill_(self.mask[:num_tokens, :num_tokens], float('-inf'))

        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)

        context_vec = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)

        return context_vec

We construct an example batch and run the MHA:

batch_size = 2
context_len = 128
max_context_len = 1280
embed_dim = 64
num_heads = 4

example_batch = torch.randn(batch_size, context_len, embed_dim)

mha = MultiHeadAttention(
    d_in=embed_dim,
    d_out=embed_dim,
    context_length=max_context_len,
    num_heads=num_heads
)

output = mha(example_batch)
print(output.shape)  # Expected: (batch_size, context_len, embed_dim)

The output (including the debug prints from rotary_pos_emb) is:

Rotary Positional Embedding torch.Size([2, 4, 128, 16])
shape of x_ torch.Size([2, 4, 128, 8, 2])
shape of cos torch.Size([1280, 8])
Rotary Positional Embedding torch.Size([2, 4, 128, 16])
shape of x_ torch.Size([2, 4, 128, 8, 2])
shape of cos torch.Size([1280, 8])
torch.Size([2, 128, 64])

Update TransformerBlock

We have now implemented the core components of Llama2. Next we combine them and update the TransformerBlock. The key changes are:

1) Replace LayerNorm with RMSNorm to simplify the computation.

2) Remove dropout; once the model is large enough, dropout is no longer essential.

3) Remove the bias terms, reducing computation and improving numerical stability.

4) Add a dtype setting to support more efficient low-precision training and inference, e.g. bfloat16 to save memory and speed up training.

The full code:

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dtype=cfg["dtype"]
        )
        self.ff = FeedForward(cfg)

        self.norm1 = RMSNorm(cfg["emb_dim"])
        self.norm2 = RMSNorm(cfg["emb_dim"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)   # Shape [batch_size, num_tokens, emb_size]
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed-forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = x + shortcut  # Add the original input back

        return x

Update the Model

Recall that in GPT-2 the model is just the TransformerBlock stacked repeatedly. For Llama2 we remove pos_emb (RoPE takes its place), switch to RMSNorm, and set the dtype. The code:

class Llama2Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = RMSNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

    def forward(self, in_idx):
        x = self.tok_emb(in_idx)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
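Before building the full 7B configuration, a tiny smoke test with a made-up mini config (illustrative values only) confirms that the forward pass produces logits of the expected shape:

tiny_cfg = {
    "vocab_size": 100, "context_length": 64, "emb_dim": 32,
    "n_heads": 4, "n_layers": 2, "hidden_dim": 86, "dtype": torch.float32,
}
tiny_model = Llama2Model(tiny_cfg)
tiny_logits = tiny_model(torch.randint(0, 100, (2, 10)))  # input: (batch, seq_len)
print(tiny_logits.shape)  # torch.Size([2, 10, 100]) -- (batch, seq_len, vocab_size)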

Initialize Model

The Llama2-7B model configuration is:

LLAMA2_CONFIG_7B = {
    "vocab_size": 32000,     # Vocabulary size
    "context_length": 4096,  # Context length
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 11008,     # Size of the intermediate dimension in FeedForward
    "dtype": torch.bfloat16  
}

Instantiate the model:

model = Llama2Model(LLAMA2_CONFIG_7B)

Count the total number of parameters:

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 6,738,415,616

So the model has about 6.7B parameters, conventionally referred to as 7B. A rough estimate of the corresponding weight memory follows.
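As a back-of-the-envelope sketch, we can also estimate the memory needed just to hold the weights; in bfloat16 each parameter takes 2 bytes:

bytes_per_param = 2  # bfloat16
print(f"Approx. weight memory: {total_params * bytes_per_param / 1024**3:.1f} GB")
# Approx. weight memory: 12.6 GB (weights only, excluding activations and KV cache)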

Load Tokenizer

GPT-2 uses the tiktoken tokenizer, while Llama2 uses Google's SentencePiece. Meta has published the trained weights and the tokenizer vocabulary on Hugging Face, so we can download and load them from HF. However, since Llama2 is not fully open source (it is intended for personal and non-commercial use), you first need to request access and log in to HF.

Download the tokenizer:

from huggingface_hub import login

login(token="your hf token")
from huggingface_hub import hf_hub_download

tokenizer_file = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b",
    filename="tokenizer.model",
    local_dir="Llama-2-7b"
)

Define the tokenizer:

import sentencepiece as spm

class LlamaTokenizer:
    def __init__(self, tokenizer_file):
        sp = spm.SentencePieceProcessor()
        sp.load(tokenizer_file)
        self.tokenizer = sp

    def encode(self, text):
        return self.tokenizer.encode_as_ids(text)

    def decode(self, ids):
        return self.tokenizer.decode_pieces(ids)


tokenizer = LlamaTokenizer(tokenizer_file)
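A quick round-trip sanity check of the tokenizer (illustrative; the exact token ids depend on the SentencePiece vocabulary):

ids = tokenizer.encode("Hello, llama!")
print(ids)                    # a list of SentencePiece token ids
print(tokenizer.decode(ids))  # should reproduce the original text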

Let's try running text generation:

from gpt2_v2 import generate_text_simple, text_to_tensor, tensor_to_text

torch.manual_seed(123)

token_ids = generate_text_simple(
    model=model,
    idx=text_to_tensor("At the start of", tokenizer).to("cpu"),
    max_new_tokens=30,
    context_size=LLAMA2_CONFIG_7B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", tensor_to_text(token_ids, tokenizer))

This prints:

Output text:
 At the start ofзей Warjarewnę обще Opera з went eeuwể Other collaborationlauf’Powerремħ’Powerремħ’ep kur extremely____dataset Multi vida curv

The tokenizer has clearly loaded correctly; the output is close to gibberish only because the model is untrained and contains nothing but random initial weights.

Load Pretrained Weights

As before, we can download and load the pretrained weights that Meta AI has published:

weights_file = hf_hub_download(
   repo_id="meta-llama/Llama-2-7b",
   filename="consolidated.00.pth",
   local_dir="Llama-2-7b"
)
weights = torch.load(weights_file, weights_only=True)

Loading the weights is, at its core, just copying parameters:

def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(right.clone().detach()) if isinstance(right, torch.Tensor) else torch.nn.Parameter(torch.tensor(right))


def load_weights_into_llama(model, param_config, params):
    model.tok_emb.weight = assign(model.tok_emb.weight, params["tok_embeddings.weight"])

    for l in range(param_config["n_layers"]):
        block = model.trf_blocks[l]

        # map of attribute path (relative to block) -> param name
        attr_param_map = {
            f"att.W_query.weight": f"layers.{l}.attention.wq.weight",
            f"att.W_key.weight": f"layers.{l}.attention.wk.weight",
            f"att.W_value.weight": f"layers.{l}.attention.wv.weight",
            f"att.out_proj.weight": f"layers.{l}.attention.wo.weight",
            f"norm1.weight": f"layers.{l}.attention_norm.weight",
            f"ff.fc1.weight": f"layers.{l}.feed_forward.w1.weight",
            f"ff.fc2.weight": f"layers.{l}.feed_forward.w3.weight",  # swapped order
            f"ff.fc3.weight": f"layers.{l}.feed_forward.w2.weight",
            f"norm2.weight": f"layers.{l}.ffn_norm.weight",
        }

        for attr_path, param_name in attr_param_map.items():
            obj = block
            *parents, attr = attr_path.split('.')
            for p in parents:
                obj = getattr(obj, p)
            old_tensor = getattr(obj, attr)
            setattr(obj, attr, assign(old_tensor, params[param_name]))

    model.final_norm.weight = assign(model.final_norm.weight, params["norm.weight"])
    model.out_head.weight = assign(model.out_head.weight, params["output.weight"])

device = torch.device("cpu")
load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device);

Running the same generation example again now gives:

Output text:
 At the start of the 20th century, the city was a major industrial center, with a large number of factories and mills. The city was also

The generated text is now semantically coherent, which confirms that our model code is correct and that the weights were loaded successfully.

Try the Instruction-Finetuned Model

In the example above we loaded the 7B base model, which has only been pretrained, not fine-tuned, so it can only continue text and cannot follow instructions. Next, in the same way, we download and load the instruction-finetuned weights.

Download and load the chat-model weights:

weights_file = hf_hub_download(
   repo_id="meta-llama/Llama-2-7b-chat",
   filename="consolidated.00.pth",
   local_dir="Llama-2-7b-chat"
)
weights = torch.load(weights_file, weights_only=True)

model = Llama2Model(LLAMA2_CONFIG_7B)
load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device)

This time we ask it a question:

from gpt2_v2 import generate_text_simple, text_to_tensor, tensor_to_text

torch.manual_seed(123)

token_ids = generate_text_simple(
    model=model,
    idx=text_to_tensor("What do llamas eat?", tokenizer).to(device),
    max_new_tokens=30,
    context_size=LLAMA2_CONFIG_7B["context_length"],
    top_k=1,
    temperature=0.
)

print("Output text:\n", tensor_to_text(token_ids, tokenizer))

This prints:

Output text:
 What do llamas eat?
Llamas are herbivores, which means they eat plants. They eat grass, hay, and other plants.
What do llam

With that, we have completed the Llama2 implementation, and downloaded and loaded the pretrained weights.