LLaMA (Large Language Model Meta AI) is a family of open large language models developed by Meta (formerly Facebook). Llama 2 was released in 2023 in 7B, 13B, and 70B sizes. For a detailed introduction, see the paper "Llama 2: Open Foundation and Fine-Tuned Chat Models" and other references; this post does not repeat that material and instead focuses on the core techniques and their code implementation. Building on the techniques from the hands-on GPT-2 series of posts, we will modify the GPT-2 source code into a Llama 2 implementation and load its publicly released weights.
Code for this post: Llama2; original reference code: rasbt.
RMSNorm Layer
GPT-2 normalizes with standard LayerNorm, i.e. "subtract the mean, then divide by the standard deviation." Llama 2 instead uses RMSNorm: "skip the mean subtraction and divide directly by the root mean square." In other words, it only rescales and does not re-center. The main motivation is to save computation: the mean and variance no longer need to be computed, only the mean of squares. In practice, for sufficiently large models, scaling alone preserves the directional information even without centering, and training remains stable with comparable quality.
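For reference, the two normalizations can be written as follows (standard formulas, stated here for completeness):
$$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\cdot\gamma + \beta, \qquad \mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}\cdot\gamma$$
where $\mu$ and $\sigma^2$ are the mean and variance over the embedding dimension, $d$ is the embedding size, and $\gamma$, $\beta$ are the learnable scale and shift.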
The two implementations are compared below:
import torch
from torch import nn
class LayerNorm(nn.Module):
def __init__(self, emb_dim):
super().__init__()
self.eps = 1e-5
# Learnable scale (gamma) and shift (beta) parameters
self.scale = nn.Parameter(torch.ones(emb_dim))
self.shift = nn.Parameter(torch.zeros(emb_dim))
def forward(self, x):
# Compute mean and variance along the last dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
# Normalize input
norm_x = (x - mean) / torch.sqrt(var + self.eps)
# Apply scale and shift
return self.scale * norm_x + self.shift
class RMSNorm(nn.Module):
def __init__(self, emb_dim, eps=1e-5):
super().__init__()
self.eps = eps
self.emb_dim = emb_dim
# Learnable scaling parameter (gamma)
self.weight = nn.Parameter(torch.ones(emb_dim)).float()
def forward(self, x):
# Compute root mean square (RMS)
means = x.pow(2).mean(dim=-1, keepdim=True)
# Normalize input by RMS
x_normed = x * torch.rsqrt(means + self.eps)
# Apply scaling and restore original dtype
return (x_normed * self.weight).to(dtype=x.dtype)
We construct a sample input and run both normalizations on it:
# Set random seed for reproducibility
torch.manual_seed(123)
# Create input tensor with uniform distribution shifted away from zero
example = torch.rand(2, 3, 10) * 4 + 3 # values roughly in [3,7]
print("Input tensor (example):")
print("Raw mean:", example.mean().item())
print("Raw std :", example.std().item())
print("Raw RMS :", torch.sqrt(example.pow(2).mean(dim=-1).mean()).item())
# Instantiate normalization layers
layer_norm = LayerNorm(emb_dim=example.shape[-1])
rms_norm = RMSNorm(emb_dim=example.shape[-1])
rms_norm_pytorch = torch.nn.RMSNorm(example.shape[-1], eps=1e-5) # PyTorch built-in
# Apply normalization
out_layer = layer_norm(example)
out_rms = rms_norm(example)
out_rms_pt = rms_norm_pytorch(example)
# Print normalized outputs statistics
print("After LayerNorm:")
print("Mean:", out_layer.mean().item())
print("Std :", out_layer.std().item())
print("RMS :", torch.sqrt(out_layer.pow(2).mean(dim=-1).mean()).item())
print("After RMSNorm (custom):")
print("Mean:", out_rms.mean().item())
print("Std :", out_rms.std().item())
print("RMS :", torch.sqrt(out_rms.pow(2).mean(dim=-1).mean()).item())
print("After RMSNorm (PyTorch built-in):")
print("Mean:", out_rms_pt.mean().item())
print("Std :", out_rms_pt.std().item())
print("RMS :", torch.sqrt(out_rms_pt.pow(2).mean(dim=-1).mean()).item())
The results:
Input tensor (example):
Raw mean: 5.003686428070068
Raw std : 1.1390745639801025
Raw RMS : 5.129594802856445
After LayerNorm:
Mean: -1.033147185580674e-07
Std : 1.0084344148635864
RMS : 0.9999955296516418
After RMSNorm (custom):
Mean: 0.9775436520576477
Std : 0.2125103920698166
RMS : 0.9999997615814209
After RMSNorm (PyTorch built-in):
Mean: 0.9775436520576477
Std : 0.2125103920698166
RMS : 0.9999997615814209
As expected, standard LayerNorm brings the mean close to 0 and both the std and RMS close to 1, whereas RMSNorm only brings the RMS close to 1.
Also, our hand-written RMSNorm matches PyTorch's built-in implementation, so from here on we can simply use the built-in torch.nn.RMSNorm module.
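As a quick sanity check (a small addition, not in the original example), we can confirm the two implementations agree elementwise:
print(torch.allclose(out_rms, out_rms_pt, atol=1e-6))  # expected: True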
SiLU activation
Whereas GPT-2 uses the GELU activation, Llama 2 switches to SiLU (Sigmoid Linear Unit), also known as the Swish function.
Its formula is:
$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
The implementation is very simple:
class SiLU(nn.Module):
def __init__(self):
super(SiLU, self).__init__()
def forward(self, x):
return x * torch.sigmoid(x)
Even more simply, you can use PyTorch's built-in torch.nn.SiLU() module directly.
We can plot GELU and SiLU side by side for comparison:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
gelu = nn.GELU()
silu = nn.SiLU()
x = torch.linspace(-5, 5, 200)
y_gelu = gelu(x)
y_silu = silu(x)
plt.figure(figsize=(12, 3))
for i, (y, label) in enumerate(zip(
[y_gelu, y_silu], ["GELU", "SiLU"]), 1):
plt.subplot(1, 2, i)
plt.plot(x.numpy(), y.detach().numpy(), label=label)
plt.title(f"{label} activation")
plt.xlabel("x")
plt.ylabel(f"{label}(x)")
plt.grid(True)
plt.tight_layout()
plt.show()
As you can see, the two are very similar. SiLU, however, is simpler and faster to compute, since it only needs a sigmoid and a multiplication.
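To put a number on "very similar" (a small check added here, reusing the tensors from the plot above):
# Largest gap between the two activation curves on [-5, 5]
max_gap = (y_gelu - y_silu).abs().max().item()
print(f"max |GELU(x) - SiLU(x)| on [-5, 5]: {max_gap:.3f}")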
SwiGLU in FeedForward
GPT-2 applies the GELU activation directly inside its FeedForward module, whereas Llama 2 uses SwiGLU, a SiLU-based variant of the gated linear unit (GLU). Its formula is:
$$\mathrm{SwiGLU}(x) = \mathrm{SiLU}(x W_1) \odot (x W_2)$$
and the gated result is then projected back with a third matrix $W_3$. In other words, SwiGLU needs two input linear layers to form the gating structure.
The full FeedForward modules are compared below:
class FeedForwardInGPT2(nn.Module):
def __init__(self, cfg):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
nn.GELU(),
nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
)
def forward(self, x):
return self.layers(x)
class FeedForward(nn.Module):
def __init__(self, cfg):
super().__init__()
self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)
self.silu = nn.SiLU()
def forward(self, x):
x_fc1 = self.fc1(x)
x_fc2 = self.fc2(x)
x = self.silu(x_fc1) * x_fc2
return self.fc3(x)
As you can see, in Llama 2 we can no longer stack the layers with nn.Sequential; instead we need two input linear layers, fc1 and fc2, apply the SwiGLU gating, and then feed the result into fc3. In addition, every linear layer in Llama 2 drops its bias (bias=False). Removing the bias saves computation, and in practice it has been found not to hurt training quality.
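As a quick sanity check (a minimal sketch with made-up sizes, not the real Llama 2 dimensions), the module maps (batch, seq_len, emb_dim) back to the same shape:
# Toy config just to exercise FeedForward; Llama 2 7B uses emb_dim=4096, hidden_dim=11008
toy_cfg = {"emb_dim": 64, "hidden_dim": 172, "dtype": torch.float32}
ff = FeedForward(toy_cfg)
x = torch.randn(2, 8, 64)   # (batch, seq_len, emb_dim)
print(ff(x).shape)          # expected: torch.Size([2, 8, 64])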
RoPE
Instead of traditional absolute positional embeddings, Llama 2 uses rotary positional embeddings (RoPE), which capture absolute and relative position information at the same time. RoPE turns position into an angle of rotation applied to the Query and Key vectors, which makes attention sensitive to relative positions. The design is elegant and is inspired by rotations of complex numbers; see the author's blog post for the full derivation.
Here we briefly go through the math behind it; see also the wiki.
- Rotating the original vector by an angle θ:
Represent the original 2-D vector as the complex number $z = a + b\,i$.
Multiply it by a complex number of unit modulus, $e^{i\theta} = \cos\theta + i\sin\theta$:
$$z\,e^{i\theta} = (a\cos\theta - b\sin\theta) + (a\sin\theta + b\cos\theta)\,i$$
This can also be written as a matrix product:
$$\begin{pmatrix} a' \\ b' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}$$
That is, multiplying the original vector by a unit-modulus complex number rotates the 2-D vector [a, b] counterclockwise by the angle θ.
- The distance between rotated vectors depends only on the relative position m − n:
Let the vectors at positions m and n, rotated by the angles mθ and nθ respectively, be $q_m = R(m\theta)\,q$ and $k_n = R(n\theta)\,k$, where $R(\cdot)$ is the rotation matrix above.
Their Euclidean distance (2-norm) is
$$\lVert q_m - k_n \rVert = \lVert R(m\theta)\,q - R(n\theta)\,k \rVert = \lVert R((m-n)\theta)\,q - k \rVert,$$
because rotations preserve norms. In other words, after rotation the distance between the vectors depends only on their relative position (see the quick numeric check below).
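This property is easy to verify numerically. The following quick check (an illustration added here, not part of the original derivation) rotates two 2-D vectors by position-dependent angles and shows that both their distance and their dot product depend only on m − n:
import torch
def rot2d(v, angle):
    # Rotate a 2-D vector counterclockwise by `angle`
    c, s = torch.cos(angle), torch.sin(angle)
    return torch.stack([v[0] * c - v[1] * s, v[0] * s + v[1] * c])
theta = torch.tensor(0.3)
q = torch.tensor([1.0, 2.0])
k = torch.tensor([-0.5, 1.5])
for m, n in [(3, 1), (7, 5), (10, 8)]:   # every pair has m - n = 2
    q_m, k_n = rot2d(q, m * theta), rot2d(k, n * theta)
    # Both quantities come out identical for every pair with the same m - n
    print((q_m - k_n).norm().item(), torch.dot(q_m, k_n).item())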
The key idea of RoPE is therefore to rotate each position's vector by a position-dependent angle rather than adding a positional embedding. Instead of computing x + pos_embedding, it rotates the Query and Key vectors so that they implicitly carry position information. Concretely, RoPE treats each pair of dimensions in a token vector as a complex number and applies a position-dependent rotation to it, encoding the position directly inside the vector.
The full RoPE code is as follows:
import torch
def precompute_rope_params(seq_len, head_dim):
"""
Precompute sin and cos tensors for RoPE.
Args:
seq_len: sequence length
head_dim: embedding dimension (must be even)
Returns:
sin, cos: tensors of shape (seq_len, head_dim//2)
"""
half_dim = head_dim // 2
inv_freq = 1.0 / (10000 ** (torch.arange(half_dim).float() / half_dim))
positions = torch.arange(seq_len, dtype=torch.float32)
angles = torch.einsum("i,j->ij", positions, inv_freq) # (seq_len, half_dim)
return torch.sin(angles), torch.cos(angles)
def rotary_pos_emb(x, sin, cos):
"""
Apply Rotary Positional Embedding on input tensor x using precomputed sin and cos.
Args:
x: tensor of shape (batch, num_heads, seq_len, head_dim)
sin: precomputed sin tensor of shape (seq_len, head_dim//2)
cos: precomputed cos tensor of shape (seq_len, head_dim//2)
Returns:
tensor same shape as x with RoPE applied.
"""
print("Rotary Positional Embedding",x.shape)
# x: (batch_size, num_heads, seq_len, head_dim)
batch, num_heads, seq_len, head_dim = x.shape
# x: (batch, num_heads, seq_len, head_dim) -> (batch, num_heads, seq_len, head_dim//2, 2)
x_ = x.view(batch, num_heads, seq_len, head_dim // 2, 2)
print("shape of x_", x_.shape)
print("shape of cos", cos.shape)
# Crop sin/cos to match the actual seq_len
sin = sin[:seq_len, :]
cos = cos[:seq_len, :]
x_rotated = torch.zeros_like(x_)
x_rotated[..., 0] = x_[..., 0] * cos - x_[..., 1] * sin
x_rotated[..., 1] = x_[..., 0] * sin + x_[..., 1] * cos
return x_rotated.view_as(x)
The key steps of the RoPE code are explained below:
- Compute the rotation frequencies (inv_freq)
half_dim = head_dim // 2
inv_freq = 1.0 / (10000 ** (torch.arange(half_dim).float() / half_dim))
half_dim: the embedding dimension is split in half, because the dimensions are grouped in pairs that act as the real and imaginary parts of a complex number.
inv_freq: the inverse rotation frequencies, following the classic Transformer positional-encoding scheme:
$$\text{inv\_freq}_j = \frac{1}{10000^{\,j/\text{half\_dim}}}, \quad j = 0, 1, \dots, \text{half\_dim}-1$$
- Generate the position indices
positions = torch.arange(seq_len, dtype=torch.float32)
This produces the sequence [0, 1, 2, ..., seq_len-1], the position index of each token.
- Compute the matrix of rotation angles
angles = torch.einsum("i,j->ij", positions, inv_freq)
This uses Einstein summation notation to compute an outer product, equivalent to the following broadcasted multiplication:
angles = positions[:, None] * inv_freq[None, :]
Each position is multiplied by each frequency, giving a matrix of shape (seq_len, half_dim): the rotation angle of every position in every dimension pair.
- Compute sin and cos
sin = torch.sin(angles)
cos = torch.cos(angles)
Taking the sine and cosine of the angle matrix gives the entries of the rotation matrices.
- Reshape the input tensor
x_ = x.view(*x.shape[:-1], half_dim, 2)
This reshapes the input from (batch, num_heads, seq_len, head_dim) to (batch, num_heads, seq_len, half_dim, 2), i.e. every two dimensions are grouped into a pair, like the real and imaginary parts of a complex number.
- Apply the rotation
x_rotated = torch.zeros_like(x_)
x_rotated[..., 0] = x_[..., 0] * cos - x_[..., 1] * sin
x_rotated[..., 1] = x_[..., 0] * sin + x_[..., 1] * cos
Here:
x_[..., 0] is the first component of each dimension pair, analogous to the x component (real part) of a 2-D vector.
x_[..., 1] is the second component of each pair, analogous to the y component (imaginary part).
These two lines apply the 2-D rotation formula to each pair of embedding dimensions:
$$\begin{pmatrix} x'_0 \\ x'_1 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \end{pmatrix}$$
cos and sin have shape (seq_len, half_dim), but broadcasting automatically aligns them with the (batch, num_heads, seq_len, half_dim) shape of x_[..., 0] and x_[..., 1].
- Restore the original shape
return x_rotated.view_as(x)
The rotated tensor is reshaped back to the original (batch, num_heads, seq_len, head_dim) shape.
A usage example:
# Example usage
batch_size, num_heads, seq_len, head_dim = 2, 4, 16, 64
x = torch.randn(batch_size, num_heads, seq_len, head_dim)
# Step 1: Precompute sin and cos for RoPE
sin, cos = precompute_rope_params(seq_len, head_dim)
# Step 2: Apply rotary positional embedding with precomputed sin and cos
x_rope = rotary_pos_emb(x, sin, cos)
print(x_rope.shape)  # Should be (2, 4, 16, 64)
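Since RoPE only rotates pairs of dimensions, it should leave vector lengths unchanged; a quick check on the example above (added here as a sanity check):
# Rotations preserve norms: each per-position head vector keeps its length
print(torch.allclose(x.norm(dim=-1), x_rope.norm(dim=-1), atol=1e-5))   # expected: True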
Update MHA with RoPE
GPT-2 applies its absolute positional embedding to the inputs, whereas Llama 2 applies the rotary positional embedding to the Query and Key inside the attention mechanism. We therefore update the MHA code as follows:
class MultiHeadAttention(nn.Module):
def __init__(self, d_in, d_out, context_length, num_heads, dtype=None):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by n_heads"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads
# Use nn.Linear with shared kwargs to reduce repetition
linear_kwargs = dict(bias=False, dtype=dtype)
self.W_query = nn.Linear(d_in, d_out, **linear_kwargs)
self.W_key = nn.Linear(d_in, d_out, **linear_kwargs)
self.W_value = nn.Linear(d_in, d_out, **linear_kwargs)
self.out_proj = nn.Linear(d_out, d_out, **linear_kwargs)
self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1).bool())
# precompute_rope_params returns (sin, cos); keep the same order here
sin, cos = precompute_rope_params(seq_len=context_length, head_dim=self.head_dim)
self.register_buffer("cos", cos)
self.register_buffer("sin", sin)
def forward(self, x):
b, num_tokens, d_in = x.shape
# Project inputs
keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
# Apply RoPE
keys = rotary_pos_emb(keys, self.sin, self.cos)
queries = rotary_pos_emb(queries, self.sin, self.cos)
# Attention scores with causal mask
attn_scores = queries @ keys.transpose(-2, -1)
attn_scores.masked_fill_(self.mask[:num_tokens, :num_tokens], float('-inf'))
attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
context_vec = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec)
return context_vec
We construct an example batch and run it through the MHA:
batch_size = 2
context_len = 128
max_context_len = 1280
embed_dim = 64
num_heads = 4
example_batch = torch.randn(batch_size, context_len, embed_dim)
mha = MultiHeadAttention(
d_in=embed_dim,
d_out=embed_dim,
context_length=max_context_len,
num_heads=num_heads
)
output = mha(example_batch)
print(output.shape) # Expected: (batch_size, context_len, embed_dim)
Rotary Positional Embedding torch.Size([2, 4, 128, 16])
shape of x_ torch.Size([2, 4, 128, 8, 2])
shape of cos torch.Size([1280, 8])
Rotary Positional Embedding torch.Size([2, 4, 128, 16])
shape of x_ torch.Size([2, 4, 128, 8, 2])
shape of cos torch.Size([1280, 8])
torch.Size([2, 128, 64])
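As an additional check of the causal mask (added here), modifying the last token must not change the outputs at earlier positions:
# Causality check: perturb the last token and compare the earlier positions
perturbed = example_batch.clone()
perturbed[:, -1, :] += 1.0
output_perturbed = mha(perturbed)
print(torch.allclose(output[:, :-1], output_perturbed[:, :-1], atol=1e-6))   # expected: True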
Update TransformerBlock
At this point we have implemented all the core pieces of Llama 2. Next, we combine them and update the TransformerBlock. The main changes are:
1) Replace LayerNorm with RMSNorm to simplify the computation.
2) Remove Dropout; once the model is large enough, dropout is no longer essential.
3) Remove all bias terms, which reduces computation; in practice this does not hurt quality.
4) Add a dtype setting to support more efficient low-precision training and inference, e.g. bfloat16 to save memory and speed things up.
The full code:
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
dtype=cfg["dtype"]
)
self.ff = FeedForward(cfg)
self.norm1 = RMSNorm(cfg["emb_dim"])
self.norm2 = RMSNorm(cfg["emb_dim"])
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x) # Shape [batch_size, num_tokens, emb_size]
x = x + shortcut # Add the original input back
# Shortcut connection for feed-forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x)
x = x + shortcut # Add the original input back
return x
Update the Model
Recall that in GPT-2 the model is simply a stack of repeated TransformerBlocks. For Llama 2 we drop pos_emb in favor of RoPE, switch to RMSNorm, and add the dtype setting. The code:
class Llama2Model(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = RMSNorm(cfg["emb_dim"])
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])
def forward(self, in_idx):
x = self.tok_emb(in_idx)
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x)
return logits
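Before spending memory on the full 7B model, it can help to sanity-check the wiring with a tiny made-up config (illustrative values only, not the real Llama 2 sizes):
tiny_cfg = {
    "vocab_size": 100, "context_length": 32, "emb_dim": 32,
    "n_heads": 2, "n_layers": 2, "hidden_dim": 86, "dtype": torch.float32,
}
tiny_model = Llama2Model(tiny_cfg)
idx = torch.randint(0, 100, (2, 10))   # (batch, seq_len) of token ids
print(tiny_model(idx).shape)           # expected: torch.Size([2, 10, 100])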
Initialize Model
The Llama 2 (7B) model configuration is as follows:
LLAMA2_CONFIG_7B = {
"vocab_size": 32000, # Vocabulary size
"context_length": 4096, # Context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 11008, # Size of the intermediate dimension in FeedForward
"dtype": torch.bfloat16
}
Instantiate the model:
model = Llama2Model(LLAMA2_CONFIG_7B)
Count the total parameters:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
Total number of parameters: 6,738,415,616
That comes to about 6.7B parameters, commonly rounded to "7B".
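A rough estimate of the parameter memory (added here; bfloat16 stores 2 bytes per parameter, so the 7B model needs roughly 12–13 GiB just for the weights):
# Parameter memory = numel * bytes per element (2 for bfloat16, 4 for the float32 norm weights)
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Approx. parameter memory: {param_bytes / 1024**3:.1f} GiB")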
Load Tokenizer
GPT-2 uses the tiktoken tokenizer, while Llama 2 uses Google's SentencePiece. Meta published the trained weights and the tokenizer vocabulary on Hugging Face. We can download and load them from HF, but since Llama 2 is not fully open source (it is licensed for personal and non-commercial use), you first need to request access and log in to HF.
Download the tokenizer:
from huggingface_hub import login
login(token="your_hf_token")
from huggingface_hub import hf_hub_download
tokenizer_file = hf_hub_download(
repo_id="meta-llama/Llama-2-7b",
filename="tokenizer.model",
local_dir="Llama-2-7b"
)
Define the tokenizer:
import sentencepiece as spm
class LlamaTokenizer:
def __init__(self, tokenizer_file):
sp = spm.SentencePieceProcessor()
sp.load(tokenizer_file)
self.tokenizer = sp
def encode(self, text):
return self.tokenizer.encode_as_ids(text)
def decode(self, ids):
return self.tokenizer.decode_pieces(ids)
tokenizer = LlamaTokenizer(tokenizer_file)
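A quick round-trip check (added as a sketch; the exact token ids depend on the SentencePiece vocabulary):
text = "Hello, Llama!"
ids = tokenizer.encode(text)
print(ids)                    # token ids from the Llama 2 vocabulary
print(tokenizer.decode(ids))  # should reproduce the original text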
Let's try generating text:
from gpt2_v2 import generate_text_simple, text_to_tensor, tensor_to_text
torch.manual_seed(123)
token_ids = generate_text_simple(
model=model,
idx=text_to_tensor("At the start of", tokenizer).to("cpu"),
max_new_tokens=30,
context_size=LLAMA2_CONFIG_7B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", tensor_to_text(token_ids, tokenizer))
Output text:
At the start ofзей Warjarewnę обще Opera з went eeuwể Other collaborationlauf’Powerремħ’Powerремħ’ep kur extremely____dataset Multi vida curv
The tokenizer loads fine, but the generated text is close to gibberish, because the model has not been trained and only holds random initial weights.
Load Pretrained Weights
As before, we can download and load the pretrained weights published by Meta AI:
weights_file = hf_hub_download(
repo_id="meta-llama/Llama-2-7b",
filename="consolidated.00.pth",
local_dir="Llama-2-7b"
)
weights = torch.load(weights_file, weights_only=True)
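Before mapping them into our model, it helps to peek at the checkpoint's contents (a small addition; the key names below are the ones used in the mapping code that follows):
# The checkpoint is a flat dict of tensors with names such as
# "tok_embeddings.weight", "layers.0.attention.wq.weight", "norm.weight", "output.weight"
print(len(weights))
print(sorted(weights.keys())[:5])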
Loading the weights is essentially just copying parameters:
def assign(left, right):
if left.shape != right.shape:
raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
return torch.nn.Parameter(right.clone().detach()) if isinstance(right, torch.Tensor) else torch.nn.Parameter(torch.tensor(right))
def load_weights_into_llama(model, param_config, params):
model.tok_emb.weight = assign(model.tok_emb.weight, params["tok_embeddings.weight"])
for l in range(param_config["n_layers"]):
block = model.trf_blocks[l]
# map of attribute path (relative to block) -> param name
attr_param_map = {
f"att.W_query.weight": f"layers.{l}.attention.wq.weight",
f"att.W_key.weight": f"layers.{l}.attention.wk.weight",
f"att.W_value.weight": f"layers.{l}.attention.wv.weight",
f"att.out_proj.weight": f"layers.{l}.attention.wo.weight",
f"norm1.weight": f"layers.{l}.attention_norm.weight",
f"ff.fc1.weight": f"layers.{l}.feed_forward.w1.weight",
f"ff.fc2.weight": f"layers.{l}.feed_forward.w3.weight", # swapped order
f"ff.fc3.weight": f"layers.{l}.feed_forward.w2.weight",
f"norm2.weight": f"layers.{l}.ffn_norm.weight",
}
for attr_path, param_name in attr_param_map.items():
obj = block
*parents, attr = attr_path.split('.')
for p in parents:
obj = getattr(obj, p)
old_tensor = getattr(obj, attr)
setattr(obj, attr, assign(old_tensor, params[param_name]))
model.final_norm.weight = assign(model.final_norm.weight, params["norm.weight"])
model.out_head.weight = assign(model.out_head.weight, params["output.weight"])
device = torch.device("cpu")
load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device);
Running the generation example above again now gives:
Output text:
At the start of the 20th century, the city was a major industrial center, with a large number of factories and mills. The city was also
This time the output is semantically coherent, which confirms that our model code is correct and that the weights were loaded properly.
Try the Instruction-Finetuned Model
Above we loaded the 7B base model, which has only been pretrained and not fine-tuned, so it can merely continue text rather than follow instructions. Next, in the same way, we download and load the instruction-fine-tuned weights.
Download and load the chat model's weights:
weights_file = hf_hub_download(
repo_id="meta-llama/Llama-2-7b-chat",
filename="consolidated.00.pth",
local_dir="Llama-2-7b-chat"
)
weights = torch.load(weights_file, weights_only=True)
model = Llama2Model(LLAMA2_CONFIG_7B)
load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device)
This time we ask it a question:
from gpt2_v2 import generate_text_simple, text_to_tensor, tensor_to_text
torch.manual_seed(123)
token_ids = generate_text_simple(
model=model,
idx=text_to_tensor("What do llamas eat?", tokenizer).to(device),
max_new_tokens=30,
context_size=LLAMA2_CONFIG_7B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", tensor_to_text(token_ids, tokenizer))
Output text:
What do llamas eat?
Llamas are herbivores, which means they eat plants. They eat grass, hay, and other plants.
What do llam
With that, we have completed the Llama 2 implementation and downloaded and loaded the pretrained weights.