Train Your Own Large Language Model the Easy Way

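Making your own large language model (LLM) is a cool thing to do, and many big companies such as Google, Twitter, and Facebook are doing it. They release these models in different sizes, such as 7 billion, 13 billion, or 70 billion parameters. Even smaller communities are doing it too. You may have read blogs or watched videos about creating your own LLM, but they usually cover the theory rather than the actual steps and code.

In this blog post, I will try to build an LLM with only 2.3 million parameters, and the interesting part is that we won't need a fancy GPU. We will keep things simple and use a basic dataset, so you can see how easy it is to create your own million-parameter LLM.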

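Prerequisites

Make sure you have a basic understanding of object-oriented programming (OOP) and neural networks (NN). Familiarity with PyTorch will also help with the coding.

Understanding the Transformer Architecture

Before diving into building our own LLM with the LLaMA approach, it is essential to understand LLaMA's architecture. Below is a comparison diagram of the vanilla Transformer and LLaMA.

If you are not familiar with the vanilla Transformer architecture, you can read this blog for a basic guide.

Let's look at the core concepts of LLaMA in a bit more detail:

Pre-normalization with RMSNorm:

LLaMA normalizes the input of each Transformer sub-layer using a technique called RMSNorm. This pre-normalization approach is inspired by GPT-3, and RMSNorm is designed to reduce the computational cost of layer normalization. RMSNorm delivers performance comparable to LayerNorm while cutting running time noticeably (by 7%∼64%).

It achieves this by emphasizing re-scaling invariance and regulating the summed inputs according to the root mean square (RMS) statistic. The main motivation is to simplify LayerNorm by removing the mean statistic. Interested readers can explore a detailed implementation of RMSNorm here.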
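
To make the re-scaling invariance concrete, here is a minimal sketch of RMS normalization over the last dimension, without the learnable gain. The helper name rms_normalize and the eps value are assumptions made for illustration; they are not taken from the LLaMA or RMSNorm papers.

```python
import torch

def rms_normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Divide each vector by its root mean square; no mean is subtracted
    rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
    return x / (rms + eps)

torch.manual_seed(0)
x = torch.randn(4, 8)

# Scaling the input by a constant leaves the normalized output (almost) unchanged
same = torch.allclose(rms_normalize(x), rms_normalize(10.0 * x), atol=1e-5)
print(same)
```

LayerNorm would additionally subtract the mean before dividing, which is the step RMSNorm drops.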

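SwiGLU activation function:

LLaMA adopts the SwiGLU activation function, inspired by PaLM. To understand SwiGLU, you first need to understand the Swish activation function. SwiGLU extends Swish with a custom layer that uses dense networks to split the input activations and multiply them together.

The goal is to increase the model's expressive power by introducing a more sophisticated activation function. For more details on SwiGLU, see the related paper.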
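
As a rough illustration of the idea, the sketch below writes Swish and a SwiGLU-style gate as plain functions. The weight matrices w_gate and w_value are random placeholders and the function names are my own, so treat this as a sketch of the formula rather than LLaMA's actual layer.

```python
import torch
from torch.nn import functional as F

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Swish: the input scaled by a sigmoid of itself
    return x * torch.sigmoid(beta * x)

def swiglu(x, w_gate, w_value, beta=1.0):
    # One projection goes through Swish and gates the other projection
    return swish(x @ w_gate, beta) * (x @ w_value)

torch.manual_seed(0)
x = torch.randn(2, 4)
w_gate = torch.randn(4, 4)    # hypothetical gate weights
w_value = torch.randn(4, 4)   # hypothetical value weights
out = swiglu(x, w_gate, w_value)
print(out.shape)
```

With beta set to 1, Swish is the same function PyTorch exposes as F.silu.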

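RoPE:

Rotary Embeddings (RoPE) are the type of positional embedding used in LLaMA. RoPE encodes absolute position information with a rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation. It offers several advantages, such as scaling to various sequence lengths and an inter-token dependency that decays as the relative distance grows.

This is achieved by multiplying with a rotation matrix to encode relative positions, which leads to decay with relative distance, a desirable property for encoding natural language. Those interested in the mathematical details can refer to the RoPE paper.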
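
The relative-position property can be checked in two dimensions with nothing but the standard library. In this sketch the vectors, positions, and the single frequency theta are arbitrary values chosen for illustration; real RoPE applies many such rotations with different frequencies across the embedding.

```python
import math

def rotate(vec, position, theta=1.0):
    # Rotate a 2D vector by the angle position * theta
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)

# Same relative offset (3) at two very different absolute positions
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

The attention score between the rotated query and key depends only on how far apart the two positions are, not on where they sit in the sequence.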

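Beyond these concepts, the LLaMA paper describes other important techniques, including the AdamW optimizer with specific hyperparameters, efficient implementations such as the causal multi-head attention operator available in the xformers library, and a manually implemented backward function for the Transformer layers to optimize computation during the backward pass.

Special thanks to Anush Kumar for the in-depth explanation of every important aspect of LLaMA.

Setting Up the Environment

We will use a range of Python libraries throughout this project, so let's import them: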

# PyTorch for implementing LLM (No GPU)
import torch

# Neural network modules and functions from PyTorch
from torch import nn
from torch.nn import functional as F

# NumPy for numerical operations
import numpy as np

# Matplotlib for plotting Loss etc.
from matplotlib import pyplot as plt

# Time module for tracking execution time
import time

# Pandas for data manipulation and analysis
import pandas as pd

# urllib for handling URL requests (Downloading Dataset)
import urllib.request

In addition, I am creating a configuration object that stores the model parameters.

# Configuration object for model parameters
MASTER_CONFIG = {
    # Adding parameters later
}

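This approach keeps things flexible and allows more parameters to be added later as needed.

Data Preprocessing

In the original LLaMA paper, a variety of open-source datasets were used to train and evaluate the model.

Unfortunately, working with large datasets can be impractical for smaller projects. For our implementation, we will therefore take a more modest approach and build a drastically scaled-down version of LLaMA.

Given that we do not have access to vast amounts of data, we will focus on training a simplified version of LLaMA on the TinyShakespeare dataset. This open-source dataset is available here and contains about 40,000 lines of text from various Shakespeare works. The choice is influenced by Karpathy's Makemore series, which offers valuable insights into training language models.

While LLaMA was trained on a massive dataset of 1.4 trillion tokens, our dataset, TinyShakespeare, contains about 1 million characters.

First, let's obtain the dataset by downloading it: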

# The URL of the raw text file on GitHub
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

# The file name for local storage
file_name = "tinyshakespeare.txt"

# Execute the download
urllib.request.urlretrieve(url, file_name)

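This Python script fetches the tinyshakespeare dataset from the specified URL and saves it locally under the file name "tinyshakespeare.txt".

Next, let's determine the vocabulary size, which is the number of unique characters in the dataset. Here is the code snippet: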

# Read the content of the dataset
lines = open("tinyshakespeare.txt", 'r').read()

# Create a sorted list of unique characters in the dataset
vocab = sorted(list(set(lines)))

# Display the first 10 characters in the vocabulary list
print('Printing the first 10 characters of the vocab list:', vocab[:10])

# Output the number of unique characters in our dataset (Vocabulary Size)
print('Total number of unique characters in our dataset (Vocabulary Size):', len(vocab))

Now we create the mappings from integers to characters (itos) and from characters to integers (stoi). The code is as follows:

# Mapping integers to characters (itos)
itos = {i: ch for i, ch in enumerate(vocab)}

# Mapping characters to integers (stoi)
stoi = {ch: i for i, ch in enumerate(vocab)}

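The original LLaMA paper uses the byte-pair encoding tokenizer from Google's SentencePiece. For simplicity, however, we will go with a basic character-level tokenizer. Let's create the encode and decode functions that we will apply to the dataset later: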

# Encode function: Converts a string to a list of integers using the mapping stoi
def encode(s):
    return [stoi[ch] for ch in s]

# Decode function: Converts a list of integers back to a string using the mapping itos
def decode(l):
    return ''.join([itos[i] for i in l])

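# Example: Encode the string "morning" and then decode the result
decode(encode("morning"))

The last line outputs morning, confirming that the encode and decode functions work correctly.

We now convert the dataset into a torch tensor and specify its data type for further operations with PyTorch: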

# Convert the dataset into a torch tensor with specified data type (dtype)
dataset = torch.tensor(encode(lines), dtype=torch.int8)

# Display the shape of the resulting tensor
print(dataset.shape)

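The output is torch.Size([1115394]), which means our dataset contains about one million tokens. It is worth noting that this is far smaller than the LLaMA dataset, which contains 1.4 trillion tokens.

We will create a function responsible for splitting the dataset into training, validation, and test sets. In machine learning or deep learning projects, such splits are essential for developing and evaluating a model, and the same principle applies when replicating a large language model (LLM) approach: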

# Function to get batches for training, validation, or testing
def get_batches(data, split, batch_size, context_window, config=MASTER_CONFIG):
    # Split the dataset into training, validation, and test sets
    train = data[:int(.8 * len(data))]
    val = data[int(.8 * len(data)): int(.9 * len(data))]
    test = data[int(.9 * len(data)):]

    # Determine which split to use
    batch_data = train
    if split == 'val':
        batch_data = val
    if split == 'test':
        batch_data = test

    # Pick random starting points within the data
    ix = torch.randint(0, batch_data.size(0) - context_window - 1, (batch_size,))

    # Create input sequences (x) and corresponding target sequences (y)
    x = torch.stack([batch_data[i:i+context_window] for i in ix]).long()
    y = torch.stack([batch_data[i+1:i+context_window+1] for i in ix]).long()

    return x, y

Now that our splitting function is defined, let's set the two parameters that are essential to this process:

# Update the MASTER_CONFIG with batch_size and context_window parameters
MASTER_CONFIG.update({
    'batch_size': 8,          # Number of sequences sampled into each batch
    'context_window': 16      # Number of characters in each input (x) and target (y) sequence
})

batch_size determines how many sequences are sampled into each batch, while context_window specifies the number of characters in each input (x) and target (y) sequence of the batch.
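
To see how inputs and targets line up, here is a small sketch on a toy tensor. It uses fixed start offsets instead of random ones and a context window of 4, both chosen only to keep the printout readable.

```python
import torch

data = torch.arange(20)      # stand-in for the encoded dataset
context_window = 4
starts = [0, 7]              # fixed start offsets instead of random ones

# Each target sequence is the input sequence shifted one position to the right
x = torch.stack([data[i:i + context_window] for i in starts])
y = torch.stack([data[i + 1:i + context_window + 1] for i in starts])
print(x)
print(y)
```

Every position in x is paired with the character that follows it, so a single sequence of length context_window provides that many next-character training examples.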

Let's print a random sample from the training split with a batch size of 8 and a context window of 16:

# Obtain batches for training using the specified batch size and context window
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Decode the sequences to obtain the corresponding text representations
decoded_samples = [(decode(xs[i].tolist()), decode(ys[i].tolist())) for i in range(len(xs))]

# Print the random sample
print(decoded_samples)

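Evaluation Strategy

Now we will create a function dedicated to evaluating the LLaMA architecture we build ourselves. We define it before the actual model so that we can evaluate continuously during training.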

@torch.no_grad()  # Don't compute gradients for this function
def evaluate_loss(model, config=MASTER_CONFIG):
    # Placeholder for the evaluation results
    out = {}
    
    # Set the model to evaluation mode
    model.eval()

    # Iterate through training and validation splits
    for split in ["train", "val"]:
        # Placeholder for individual losses
        losses = []

        # Generate 10 batches for evaluation
        for _ in range(10):
            # Get input sequences (xb) and target sequences (yb)
            xb, yb = get_batches(dataset, split, config['batch_size'], config['context_window'])
            
            # Perform model inference and calculate the loss
            _, loss = model(xb, yb)
            
            # Append the loss to the list
            losses.append(loss.item())

        # Calculate the mean loss for the split and store it in the output dictionary
        out[split] = np.mean(losses)
    
    # Set the model back to training mode
    model.train()
    
    return out

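We use the loss as the metric for how well the model performs across training iterations. Our function iterates over the training and validation splits, computes the mean loss over 10 batches for each split, and returns the results. The model is then set back to training mode with model.train().

Building a Base Neural Network Model

We are building a basic neural network that we will later improve with the LLaMA techniques.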

# Definition of a basic neural network class
class SimpleBrokenModel(nn.Module):
    def __init__(self, config=MASTER_CONFIG):
        super().__init__()
        self.config = config

        # Embedding layer to convert character indices to vectors (vocab size: 65)
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])

        # Linear layers for modeling relationships between features
        # (to be updated with SwiGLU activation function as in LLaMA)
        self.linear = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            nn.ReLU(),  # Currently using ReLU, will be replaced with SwiGLU as in LLaMA
            nn.Linear(config['d_model'], config['vocab_size']),
        )

        # Print the total number of model parameters
        print("Model parameters:", sum([m.numel() for m in self.parameters()]))

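In the current architecture, the embedding layer has a vocabulary size of 65, representing the characters in our dataset. Since this is our base model, we use **ReLU** as the activation function in the linear layers; later it will be replaced with SwiGLU, as used in LLaMA.

To create the forward pass for our base model, we must define a forward function in the NN model.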

# Definition of a basic neural network class
class SimpleBrokenModel(nn.Module):
    def __init__(self, config=MASTER_CONFIG):

        # Rest of the code
        ...

    # Forward pass function for the base model (a method of the class, not nested in __init__)
    def forward(self, idx, targets=None):
        # Embedding layer converts character indices to vectors
        x = self.embedding(idx)

        # Linear layers for modeling relationships between features
        a = self.linear(x)

        # Apply softmax activation to obtain probability distribution
        logits = F.softmax(a, dim=-1)

        # If targets are provided, calculate and return the cross-entropy loss
        if targets is not None:
            # Reshape logits and targets for cross-entropy calculation
            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
            return logits, loss

        # If targets are not provided, return the logits
        else:
            return logits

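This forward function takes character indices (idx) as input, applies the embedding layer, passes the result through the linear layers, and applies a softmax activation to obtain a probability distribution (stored in logits). If targets are provided, it computes the cross-entropy loss and returns both the logits and the loss. If no targets are provided, it returns only the logits.

To instantiate this model, we can call the class directly and print the total number of parameters in our simple neural network model. We set the dimension of the linear layers to 128 and specify this value in the configuration object, together with the vocabulary size that the embedding layer reads from the config:

# Update MASTER_CONFIG with the vocabulary size and the dimension of linear layers (128)
MASTER_CONFIG.update({
    'vocab_size': len(vocab),  # 65 unique characters; read by nn.Embedding and the output layer
    'd_model': 128,
})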

# Instantiate the SimpleBrokenModel using the updated MASTER_CONFIG
model = SimpleBrokenModel(MASTER_CONFIG)

# Print the total number of parameters in the model
print("Total number of parameters in the Simple Neural Network Model:", sum([m.numel() for m in model.parameters()]))

Our simple neural network model contains about 33,000 parameters.
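
The figure can be reproduced by hand. The sketch below adds up the weights and biases of the three layers, assuming a vocabulary size of 65 and a model dimension of 128 as used in this post.

```python
vocab_size, d_model = 65, 128

embedding = vocab_size * d_model               # one vector per character
hidden = d_model * d_model + d_model           # weights + bias of the first linear layer
output = d_model * vocab_size + vocab_size     # weights + bias of the output layer

total = embedding + hidden + output
print(total)  # 33217
```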

Similarly, to compute the logits and the loss, we only need to feed a batch from the split dataset into our model:

# Obtain batches for training using the specified batch size and context window
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Calculate logits and loss using the model
logits, loss = model(xs, ys)

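To train our base model and record its performance, we need to specify a few parameters. We train for a total of 1000 epochs (in this code, each epoch is a single optimization step on one random batch), increase the batch size from 8 to 32, and set log_interval to 10, meaning the code prints or logs training progress every 10 batches. For optimization, we will use the Adam optimizer.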

# Update MASTER_CONFIG with training parameters
MASTER_CONFIG.update({
    'epochs': 1000,          # Number of training epochs
    'log_interval': 10,      # Log information every 10 batches during training
    'batch_size': 32,        # Increase batch size to 32
})

# Instantiate the SimpleBrokenModel with updated configuration
model = SimpleBrokenModel(MASTER_CONFIG)

# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(
    model.parameters(),      # Pass the model parameters to the optimizer
)

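Let's run the training process and capture the base model's loss, along with the total number of parameters. For clarity, every line is commented: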

# Function to perform training
def train(model, optimizer, scheduler=None, config=MASTER_CONFIG, print_logs=False):
    # Placeholder for storing losses
    losses = []
    
    # Start tracking time
    start_time = time.time()

    # Iterate through epochs
    for epoch in range(config['epochs']):
        # Zero out gradients
        optimizer.zero_grad()

        # Obtain batches for training
        xs, ys = get_batches(dataset, 'train', config['batch_size'], config['context_window'])

        # Forward pass through the model to calculate logits and loss
        logits, loss = model(xs, targets=ys)

        # Backward pass and optimization step
        loss.backward()
        optimizer.step()

        # If a learning rate scheduler is provided, adjust the learning rate
        if scheduler:
            scheduler.step()

        # Log progress every specified interval
        if epoch % config['log_interval'] == 0:
            # Calculate batch time
            batch_time = time.time() - start_time
            
            # Evaluate loss on validation set
            x = evaluate_loss(model)
            
            # Store the validation loss
            losses += [x]
            
            # Print progress logs if specified
            if print_logs:
                print(f"Epoch {epoch} | val loss {x['val']:.3f} | Time {batch_time:.3f} | ETA in seconds {batch_time * (config['epochs'] - epoch)/config['log_interval'] :.3f}")
                
            # Reset the timer
            start_time = time.time()

            # Print learning rate if a scheduler is provided
            if scheduler:
                print("lr: ", scheduler.get_last_lr())

    # Print the final validation loss
    print("Validation loss: ", losses[-1]['val'])
    
    # Plot the training and validation loss curves
    return pd.DataFrame(losses).plot()

# Execute the training process
train(model, optimizer)

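The initial cross-entropy loss before training is 4.17, and after 1000 epochs it drops to 3.93. In this context, cross-entropy reflects how likely the model is to pick the wrong character.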
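
The starting value of 4.17 is what a model that guesses uniformly over 65 characters would score. A quick check, assuming the vocabulary size of 65 from earlier:

```python
import math

vocab_size = 65
uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess over 65 characters
print(round(uniform_loss, 2))  # 4.17
```

An untrained model starts near this value, so a drop to only 3.93 means it has learned very little.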

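Our model applies a softmax layer to the logits, which turns a vector of numbers into a probability distribution. However, the built-in F.cross_entropy function expects unnormalized logits to be passed in directly. We will therefore modify our model accordingly.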
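
The effect of the extra softmax can be seen on a single hand-made prediction. In this sketch the logit values are invented for illustration: the model is confident and correct, yet the loss stays high when softmax is applied before F.cross_entropy.

```python
import torch
from torch.nn import functional as F

# A confident, correct prediction for class 0 out of 65 classes
raw_logits = torch.full((1, 65), -10.0)
raw_logits[0, 0] = 10.0
target = torch.tensor([0])

# Correct usage: pass unnormalized logits
loss_correct = F.cross_entropy(raw_logits, target)

# Broken usage: softmax is applied twice (once here, once inside cross_entropy)
loss_double = F.cross_entropy(F.softmax(raw_logits, dim=-1), target)

print(loss_correct.item(), loss_double.item())
```

Because probabilities live between 0 and 1, feeding them in as logits leaves the second softmax almost flat, which caps how far the loss can fall.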

# Modified SimpleModel class without softmax layer
class SimpleModel(nn.Module):
    def __init__(self, config):
       
       # Rest of the code
       ...

    def forward(self, idx, targets=None):
        # Embedding layer converts character indices to vectors
        x = self.embedding(idx)
        
        # Linear layers for modeling relationships between features
        logits = self.linear(x)

        # If targets are provided, calculate and return the cross-entropy loss
        if targets is not None:

            # Rest of the code
            ...

Let's recreate the updated SimpleModel and train it for 1000 epochs to observe any changes:

# Create the updated SimpleModel
model = SimpleModel(MASTER_CONFIG)

# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Calculate logits and loss using the model
logits, loss = model(xs, ys)

# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())

# Train the model for 1000 epochs
train(model, optimizer)

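With the loss reduced to 2.51, let's explore how our language model (about **33,000 parameters**) generates text during inference. We will create a "generate" function that we will reuse later when replicating LLaMA: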

# Generate function for text generation using the trained model
def generate(model, config=MASTER_CONFIG, max_new_tokens=30):
    idx = torch.zeros(5, 1).long()
    for _ in range(max_new_tokens):
        # Call the model
        logits = model(idx[:, -config['context_window']:])
        last_time_step_logits = logits[
            :, -1, :
        ]  # all 5 sequences in the batch, last time step, all the logits
        p = F.softmax(last_time_step_logits, dim=-1)  # softmax to get probabilities
        idx_next = torch.multinomial(
            p, num_samples=1
        )  # sample from the distribution to get the next token
        idx = torch.cat([idx, idx_next], dim=-1)  # append to the sequence
    return [decode(x) for x in idx.tolist()]

# Generate text using the trained model
generate(model)

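With our basic model of about 33K parameters, the generated text does not look good. Still, now that we have laid the groundwork with this simple model, we will move on to building the LLaMA architecture in the next section.

Replicating the LLaMA Architecture

Earlier in this blog we covered the basic concepts; now we will integrate them into our base model. LLaMA makes three architectural modifications to the original Transformer:

RMSNorm for pre-normalization
Rotary embeddings
SwiGLU activation function

We will incorporate these modifications into our base model one at a time, iterating and building on it as we go.

RMSNorm for pre-normalization:

We define an RMSNorm module with the following functionality: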

class RMSNorm(nn.Module):
    def __init__(self, layer_shape, eps=1e-8, bias=False):
        super(RMSNorm, self).__init__()

        # Registering a learnable parameter 'scale' as a parameter of the module
        self.register_parameter("scale", nn.Parameter(torch.ones(layer_shape)))

    def forward(self, x):
        """
        Assumes shape is (batch, seq_len, d_model)
        """
        # Calculating the Frobenius norm, RMS = 1/sqrt(N) * Frobenius norm
        ff_rms = torch.linalg.norm(x, dim=(1,2)) * x[0].numel() ** -.5

        # Normalizing the input tensor 'x' with respect to RMS
        raw = x / ff_rms.unsqueeze(-1).unsqueeze(-1)

        # Scaling the normalized tensor using the learnable parameter 'scale'
        return self.scale[:x.shape[1], :].unsqueeze(0) * raw

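We define the RMSNorm class. During initialization, it registers a scale parameter. In the forward pass, it computes the Frobenius norm of the input tensor and then normalizes the tensor. Finally, the tensor is scaled by the registered scale parameter. This module is meant to replace the LayerNorm operation in LLaMA.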
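
Note that the class above computes one RMS per sequence, over both the sequence and feature dimensions. As I read the RMSNorm paper, the statistic is taken over the feature dimension of each token instead. The sketch below shows that per-token variant; the class name PerTokenRMSNorm and its eps default are my own choices, and it is not the implementation used in the rest of this post.

```python
import torch
from torch import nn

class PerTokenRMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        # One learnable gain per feature, shared across positions
        self.scale = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the last (feature) dimension, computed separately for every token
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * (x / rms)

torch.manual_seed(0)
norm = PerTokenRMSNorm(128)
out = norm(torch.randn(2, 16, 128))

# With the initial scale of ones, every token vector ends up with an RMS close to 1
print(out.pow(2).mean(dim=-1).sqrt()[0, :3])
```

A practical difference is that this variant does not tie its parameters to context_window, so it accepts sequences of any length.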

Now it is time to incorporate RMSNorm, the first LLaMA concept we implement, into our simple NN model. Here is the updated code:

# Define the SimpleModel_RMS with RMSNorm
class SimpleModel_RMS(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        # Embedding layer to convert character indices to vectors
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])

        # RMSNorm layer for pre-normalization
        self.rms = RMSNorm((config['context_window'], config['d_model']))

        # Linear layers for modeling relationships between features
        self.linear = nn.Sequential(
            # Rest of the code
            ...
        )

        # Print the total number of model parameters
        print("Model parameters:", sum([m.numel() for m in self.parameters()]))

    def forward(self, idx, targets=None):
        # Embedding layer converts character indices to vectors
        x = self.embedding(idx)

        # RMSNorm pre-normalization
        x = self.rms(x)

        # Linear layers for modeling relationships between features
        logits = self.linear(x)

        if targets is not None:

            # Rest of the code
            ...

Let's run the modified NN model with RMSNorm and look at the updated parameter count and the loss:

# Create an instance of SimpleModel_RMS
model = SimpleModel_RMS(MASTER_CONFIG)

# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Calculate logits and loss using the model
logits, loss = model(xs, ys)

# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())

# Train the model
train(model, optimizer)

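The validation loss decreases slightly, and the parameters of our updated LLM now total about 35,000 (the RMSNorm scale adds 16 × 128 = 2,048 parameters to the base model's roughly 33,000).

Rotary embeddings:

Next, we will implement rotary positional embeddings. In RoPE, the authors propose embedding the position of a token in the sequence by rotating the embedding, applying a different rotation at each position. Let's create a function that mimics the implementation in the RoPE paper: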

def get_rotary_matrix(context_window, embedding_dim):
    # Initialize a tensor for the rotary matrix with zeros
    R = torch.zeros((context_window, embedding_dim, embedding_dim), requires_grad=False)
    
    # Loop through each position in the context window
    for position in range(context_window):
        # Loop through each dimension in the embedding
        for i in range(embedding_dim // 2):
            # Calculate the rotation frequency (theta) for this pair of dimensions.
            # i is 0-indexed here, so it already plays the role of (i - 1) in the paper's 1-indexed formula
            theta = 10000. ** (-2. * i / embedding_dim)
            # Calculate the rotated matrix elements using sine and cosine functions
            m_theta = position * theta
            R[position, 2 * i, 2 * i] = np.cos(m_theta)
            R[position, 2 * i, 2 * i + 1] = -np.sin(m_theta)
            R[position, 2 * i + 1, 2 * i] = np.sin(m_theta)
            R[position, 2 * i + 1, 2 * i + 1] = np.cos(m_theta)
    return R

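We generate a rotary matrix for the specified context window and embedding dimension, following the proposed RoPE implementation.

You are probably familiar with the Transformer architecture and its attention heads, so when replicating LLaMA we likewise need to create attention heads. First, let's create a **masked attention head** using the get_rotary_matrix function we developed earlier for rotary embeddings. For clarity, every line is commented: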

class RoPEMaskedAttentionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Linear transformation for query
        self.w_q = nn.Linear(config['d_model'], config['d_model'], bias=False)
        # Linear transformation for key
        self.w_k = nn.Linear(config['d_model'], config['d_model'], bias=False)
        # Linear transformation for value
        self.w_v = nn.Linear(config['d_model'], config['d_model'], bias=False)
        # Obtain rotary matrix for positional embeddings (module-level get_rotary_matrix defined above)
        self.R = get_rotary_matrix(config['context_window'], config['d_model'])

    def forward(self, x, return_attn_weights=False):
        # x: input tensor of shape (batch, sequence length, dimension)

        b, m, d = x.shape  # batch size, sequence length, dimension

        # Linear transformations for Q, K, and V
        q = self.w_q(x)
        k = self.w_k(x)
        v = self.w_v(x)

        # Rotate Q and K using the RoPE matrix
        q_rotated = (torch.bmm(q.transpose(0, 1), self.R[:m])).transpose(0, 1)
        k_rotated = (torch.bmm(k.transpose(0, 1), self.R[:m])).transpose(0, 1)

        # Perform scaled dot-product attention; apply dropout only while training
        activations = F.scaled_dot_product_attention(
            q_rotated, k_rotated, v,
            dropout_p=0.1 if self.training else 0.0,
            is_causal=True,
        )

        if return_attn_weights:
            # Boolean causal mask: True where a query may attend (key position <= query position)
            attn_mask = torch.tril(torch.ones((m, m), dtype=torch.bool))
            # Calculate attention scores and block future positions with -inf before the softmax
            attn_weights = torch.bmm(q_rotated, k_rotated.transpose(1, 2)) / np.sqrt(d)
            attn_weights = attn_weights.masked_fill(~attn_mask, float('-inf'))
            attn_weights = F.softmax(attn_weights, dim=-1)
            return activations, attn_weights

        return activations

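Now that we have a single masked attention head that can return attention weights, the next step is to create a multi-head attention mechanism.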
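
The "masked" part means each position may only attend to itself and to earlier positions. The sketch below shows that mechanism in isolation, on dummy scores that are all zero, so the resulting weights are easy to read; the sequence length of 4 is arbitrary.

```python
import torch
from torch.nn import functional as F

m = 4                                    # sequence length
scores = torch.zeros(m, m)               # dummy attention scores
allowed = torch.tril(torch.ones(m, m, dtype=torch.bool))

# Future positions get -inf, so the softmax assigns them zero weight
masked = scores.masked_fill(~allowed, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights)
```

Row i spreads its weight evenly over positions 0 to i, and everything to the right of the diagonal is exactly zero.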

class RoPEMaskedMultiheadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Create a list of RoPEMaskedAttentionHead instances as attention heads
        self.heads = nn.ModuleList([
            RoPEMaskedAttentionHead(config) for _ in range(config['n_heads'])
        ])
        self.linear = nn.Linear(config['n_heads'] * config['d_model'], config['d_model'])  # Linear layer after concatenating heads
        self.dropout = nn.Dropout(.1)  # Dropout layer

    def forward(self, x):
        # x: input tensor of shape (batch, sequence length, dimension)

        # Process each attention head and concatenate the results
        heads = [h(x) for h in self.heads]
        x = torch.cat(heads, dim=-1)
        
        # Apply linear transformation to the concatenated output
        x = self.linear(x)
        
        # Apply dropout
        x = self.dropout(x)
        return x

The original paper uses 32 heads for its smaller 7B LLM variant, but because of our constraints we will use 8 heads in our approach.

# Update the master configuration with the number of attention heads
MASTER_CONFIG.update({
    'n_heads': 8,
})

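Now that we have implemented rotary embeddings and multi-head attention, let's rewrite our RMSNorm neural network model with the updated code. We will test its performance, compute the loss, and check the number of parameters. We will call this updated model **"RopeModel"**: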

class RopeModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        # Embedding layer for input tokens
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])
        
        # RMSNorm layer for pre-normalization
        self.rms = RMSNorm((config['context_window'], config['d_model']))
        
        # RoPEMaskedMultiheadAttention layer
        self.rope_attention = RoPEMaskedMultiheadAttention(config)

        # Linear layer followed by ReLU activation
        self.linear = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            nn.ReLU(),
        )

        # Final linear layer for prediction
        self.last_linear = nn.Linear(config['d_model'], config['vocab_size'])

        print("model params:", sum([m.numel() for m in self.parameters()]))

    def forward(self, idx, targets=None):
        # idx: input indices
        x = self.embedding(idx)

        # One block of attention
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.rope_attention(x)

        x = self.rms(x)  # RMS pre-normalization
        x = x + self.linear(x)

        logits = self.last_linear(x)

        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
            return logits, loss

        else:
            return logits

Let's run the modified NN model with RMSNorm, rotary embeddings, and masked multi-head attention to look at the updated parameter count and the loss:

# Create an instance of RopeModel (RMSNorm, RoPE, Multi-Head)
model = RopeModel(MASTER_CONFIG)

# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Calculate logits and loss using the model
logits, loss = model(xs, ys)

# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())

# Train the model
train(model, optimizer)

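The validation loss decreases slightly again, and the parameters of our updated LLM now total about 560,000, most of which sit in the eight attention heads.

Let's train the model for more epochs to see whether the loss of our recreated LLaMA LLM keeps decreasing.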

# Updating training configuration with more epochs and a logging interval
MASTER_CONFIG.update({
    "epochs": 5000,
    "log_interval": 10,
})

# Training the model with the updated configuration
train(model, optimizer)

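The validation loss keeps decreasing, which suggests that training for more epochs can reduce the loss further, although the improvement is not dramatic.

SwiGLU activation function:

As mentioned earlier, the creators of LLaMA use SwiGLU instead of ReLU, so we will implement the SwiGLU equation in our code.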

class SwiGLU(nn.Module):
    """ Paper Link -> https://arxiv.org/pdf/2002.05202v1.pdf """
    def __init__(self, size):
        super().__init__()
        self.linear_gate = nn.Linear(size, size)  # Linear transformation for the gating mechanism
        self.linear = nn.Linear(size, size)  # Linear transformation for the main branch

        # nn.Parameter registers beta as a learnable parameter of the module
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # Swish-Gated Linear Unit computation
        gate = self.linear_gate(x)
        swish_gate = gate * torch.sigmoid(self.beta * gate)
        out = swish_gate * self.linear(x)  # Element-wise multiplication of the gate and main branch
        return out

With the SwiGLU equation implemented in Python, we need to integrate it into our modified LLaMA language model (RopeModel).

class RopeModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        # Embedding layer for input tokens
        self.embedding = nn.Embedding(config['vocab_size'], config['d_model'])
        
        # RMSNorm layer for pre-normalization
        self.rms = RMSNorm((config['context_window'], config['d_model']))
        
        # Multi-head attention layer with RoPE (Rotary Positional Embeddings)
        self.rope_attention = RoPEMaskedMultiheadAttention(config)

        # Linear layer followed by SwiGLU activation
        self.linear = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            SwiGLU(config['d_model']),  # Adding SwiGLU activation
        )

        # Output linear layer
        self.last_linear = nn.Linear(config['d_model'], config['vocab_size'])

        # Printing total model parameters
        print("model params:", sum([m.numel() for m in self.parameters()]))

    def forward(self, idx, targets=None):
        x = self.embedding(idx)

        # One block of attention
        x = self.rms(x)  # RMS pre-normalization
        x = x + self.rope_attention(x)

        x = self.rms(x)  # RMS pre-normalization
        x = x + self.linear(x)  # Applying SwiGLU activation

        logits = self.last_linear(x)

        if targets is not None:
            # Calculate cross-entropy loss if targets are provided
            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
            return logits, loss

        else:
            return logits

Let's run the modified NN model with RMSNorm, rotary embeddings, masked multi-head attention, and SwiGLU to look at the updated parameter count and the loss:

# Create an instance of RopeModel (RMSNorm, RoPE, Multi-Head, SwiGLU)
model = RopeModel(MASTER_CONFIG)

# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Calculate logits and loss using the model
logits, loss = model(xs, ys)

# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(model.parameters())

# Train the model
train(model, optimizer)

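The validation loss decreases slightly again, and the parameters of our updated LLM now total about 590,000.

So far, we have successfully implemented the key components of the paper: RMSNorm, RoPE, and SwiGLU. We observed that these implementations led to a slight decrease in the loss.

Now we will add layers to our LLaMA to examine the effect on the loss. The original paper uses 32 layers for the 7B version, but we will use only 4. Let's adjust the model settings accordingly.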

# Update model configurations for the number of layers
MASTER_CONFIG.update({
    'n_layers': 4,  # Set the number of layers to 4
})

Let's first create a single layer to understand its effect.

# add RMSNorm and residual connection
class LlamaBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        # RMSNorm layer
        self.rms = RMSNorm((config['context_window'], config['d_model']))

        # RoPE Masked Multihead Attention layer
        self.attention = RoPEMaskedMultiheadAttention(config)

        # Feedforward layer with SwiGLU activation
        self.feedforward = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            SwiGLU(config['d_model']),
        )

    def forward(self, x):
        # one block of attention
        x = self.rms(x) # RMS pre-normalization
        x = x + self.attention(x)  # residual connection

        x = self.rms(x) # RMS pre-normalization
        x = x + self.feedforward(x)  # residual connection
        return x

Create an instance of the LlamaBlock class and apply it to a random tensor.

# Create an instance of the LlamaBlock class with the provided configuration
block = LlamaBlock(MASTER_CONFIG)

# Generate a random tensor with the specified batch size, context window, and model dimension
random_input = torch.randn(MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'], MASTER_CONFIG['d_model'])

# Apply the LlamaBlock to the random input tensor
output = block(random_input)

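Having successfully created a single layer, we can now use it to build multiple layers. We also rename the model class from **"RopeModel"** to **"Llama"**, since we have replicated every component of the LLaMA language model.

# OrderedDict is needed to name the stacked blocks inside nn.Sequential
from collections import OrderedDict

class Llama(nn.Module):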
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Embedding layer for token representations
        self.embeddings = nn.Embedding(config['vocab_size'], config['d_model'])
        # Sequential block of LlamaBlocks based on the specified number of layers
        self.llama_blocks = nn.Sequential(
            OrderedDict([(f"llama_{i}", LlamaBlock(config)) for i in range(config['n_layers'])])
        )
        # Feedforward network (FFN) for final output
        self.ffn = nn.Sequential(
            nn.Linear(config['d_model'], config['d_model']),
            SwiGLU(config['d_model']),
            nn.Linear(config['d_model'], config['vocab_size']),
        )

        # Print total number of parameters in the model
        print("model params:", sum([m.numel() for m in self.parameters()]))

    def forward(self, idx, targets=None):
        # Input token indices are passed through the embedding layer
        x = self.embeddings(idx)
        # Process the input through the LlamaBlocks
        x = self.llama_blocks(x)
        # Pass the processed input through the final FFN for output logits
        logits = self.ffn(x)

        # If targets are not provided, return only the logits
        if targets is None:
            return logits
        # If targets are provided, compute and return the cross-entropy loss
        else:
            loss = F.cross_entropy(logits.view(-1, self.config['vocab_size']), targets.view(-1))
            return logits, loss

Let's run the modified LLaMA model with RMSNorm, rotary embeddings, masked multi-head attention, SwiGLU, and N_layers to look at the updated parameter count and the loss:

# Create an instance of Llama (RMSNorm, RoPE, Multi-Head, SwiGLU, N_layers)
llama = Llama(MASTER_CONFIG)

# Obtain batches for training
xs, ys = get_batches(dataset, 'train', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Calculate logits and loss using the model
logits, loss = llama(xs, ys)

# Define the Adam optimizer for model parameters
optimizer = torch.optim.Adam(llama.parameters())

# Train the model
train(llama, optimizer)

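Although overfitting is possible, it is important to explore whether training for more epochs reduces the loss further. Also note that our current LLM has more than 2 million parameters.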
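
For readers who want to see where those parameters come from, here is a back-of-the-envelope count. It assumes the configuration used in this post (vocabulary of 65, d_model of 128, context window of 16, 8 heads, 4 layers) and the layer shapes defined in the classes above; the helper linear_params is only for this illustration.

```python
vocab_size, d_model, context_window = 65, 128, 16
n_heads, n_layers = 8, 4

def linear_params(n_in, n_out, bias=True):
    # Weights plus optional bias of a fully connected layer
    return n_in * n_out + (n_out if bias else 0)

rms = context_window * d_model                                   # RMSNorm scale
head = 3 * linear_params(d_model, d_model, bias=False)           # w_q, w_k, w_v
attention = n_heads * head + linear_params(n_heads * d_model, d_model)
swiglu = 2 * linear_params(d_model, d_model) + 1                 # gate, main branch, beta
block = rms + attention + linear_params(d_model, d_model) + swiglu

embedding = vocab_size * d_model
ffn = linear_params(d_model, d_model) + swiglu + linear_params(d_model, vocab_size)

total = embedding + n_layers * block + ffn
print(total)  # 2370246
```

Roughly nine tenths of the total sits in the attention layers, because every head here projects to the full d_model rather than to d_model divided by the number of heads.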

Let's train it for more epochs.

# Update the number of epochs in the configuration
MASTER_CONFIG.update({
    'epochs': 10000,
})
# Train the LLaMA model for the specified number of epochs
train(llama, optimizer, scheduler=None, config=MASTER_CONFIG)

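The loss here is 1.08, so we can reach a lower loss without running into noticeable overfitting. This suggests the model is performing well.

Let's train the model once more. The call below passes no scheduler, so the learning rate stays constant; a learning rate schedule is tried in the hyperparameter section later on.

# Training the model again with the same optimizer (no scheduler is passed here)
train(llama, optimizer, config=MASTER_CONFIG)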

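So far, we have successfully implemented a scaled-down version of the LLaMA architecture on our custom dataset. Now let's examine the output generated by our 2-million-parameter language model.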

# Generate text using the trained LLM (llama) with a maximum of 500 tokens
generated_text = generate(llama, MASTER_CONFIG, 500)[0]
print(generated_text)

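Although some of the generated words may not be perfect English, our LLM with only 2 million parameters already shows a basic grasp of the English language.

Now let's see how our model performs on the test set.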

# Get batches from the test set
xs, ys = get_batches(dataset, 'test', MASTER_CONFIG['batch_size'], MASTER_CONFIG['context_window'])

# Pass the test data through the LLaMA model
logits, loss = llama(xs, ys)

# Print the loss on the test set
print(loss)

The loss computed on the test set is about 1.236.
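
A loss value is easier to interpret as a perplexity, which is simply its exponential. The short calculation below uses the test loss reported above.

```python
import math

test_loss = 1.236                     # test loss reported above
perplexity = math.exp(test_loss)      # effective number of equally likely next characters
print(round(perplexity, 2))  # 3.44
```

In other words, the model is on average about as uncertain as if it were choosing among three or four characters, compared with 65 for a uniform guess.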

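A simple way to check how the generated output changes is to train for a large number of epochs and observe the results.

Experimenting with Hyperparameters

Hyperparameter tuning is a key step in training neural networks. In the original Llama paper, the authors use a cosine annealing learning rate schedule. In our experiments, however, it did not perform well. Here is an example of experimenting with hyperparameters using a different learning schedule: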

# Update configuration
MASTER_CONFIG.update({
    "epochs": 1000
})

# Create Llama model with Cosine Annealing learning schedule
llama_with_cosine = Llama(MASTER_CONFIG)

# Define Adam optimizer with specific hyperparameters
llama_optimizer = torch.optim.Adam(
    llama_with_cosine.parameters(),  # optimize the model that is actually being trained
    betas=(.9, .95),
    weight_decay=.1,
    eps=1e-9,
    lr=1e-3
)

# Define Cosine Annealing learning rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(llama_optimizer, 300, eta_min=1e-5)

# Train the Llama model with the specified optimizer and scheduler
train(llama_with_cosine, llama_optimizer, scheduler=scheduler)
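
To get a feel for what this schedule does, the sketch below records the learning rate over 300 steps using a dummy parameter in place of a real model. The T_max, eta_min, and lr values mirror the ones above; everything else is a stand-in.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))            # dummy parameter, no real model
optimizer = torch.optim.Adam([param], lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-5)

lrs = []
for step in range(301):
    lrs.append(scheduler.get_last_lr()[0])             # learning rate used at this step
    optimizer.step()                                   # no gradients here, so nothing is updated
    scheduler.step()

print(lrs[0], lrs[150], lrs[300])
```

The rate starts at 1e-3, passes the midpoint halfway through, and reaches eta_min at step 300. As far as I understand this scheduler, the cosine continues past T_max, so over 1000 epochs the rate climbs back up again rather than staying at the minimum.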

Saving Your Language Model (LLM)

You can save the entire LLM, or only its parameters, with the following commands:

# Save the entire model
torch.save(llama, 'llama_model.pth')

# If you want to save only the model parameters
torch.save(llama.state_dict(), 'llama_model_params.pth')
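
Loading the parameters back is the mirror image of saving them. The sketch below uses a small nn.Linear layer as a stand-in for the trained model and a temporary directory for the file, so both are assumptions for illustration.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 2)                                  # stand-in for the trained model
path = os.path.join(tempfile.mkdtemp(), "params.pth")
torch.save(model.state_dict(), path)

restored = nn.Linear(4, 2)                               # must be built with the same architecture
restored.load_state_dict(torch.load(path))
restored.eval()                                          # switch to inference mode

print(torch.equal(model.weight, restored.weight))
```

Saving only the state_dict keeps the file independent of the class definition's location, at the cost of having to rebuild the model object before loading.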

To save a PyTorch model in the format of Hugging Face's Transformers library, you can use the save_pretrained method. Here is an example:

from transformers import GPT2LMHeadModel, GPT2Config

# Assuming Llama is your PyTorch model
llama_config = GPT2Config.from_dict(MASTER_CONFIG)
llama_transformers = GPT2LMHeadModel(config=llama_config)
llama_transformers.load_state_dict(llama.state_dict())

# Specify the directory where you want to save the model
output_dir = "llama_model_transformers"

# Save the model and configuration
llama_transformers.save_pretrained(output_dir)

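GPT2Config is used to create a GPT-2-compatible configuration object. A GPT2LMHeadModel is then created and the weights from the Llama model are loaded into it. Finally, save_pretrained is called to save the model and configuration in the specified directory. Keep in mind that load_state_dict only succeeds when the parameter names and shapes of both models match, which is not the case for the custom Llama class built in this post, so treat this snippet as a template for a model whose architecture matches GPT-2.

You can then load the model with the Transformers library: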

from transformers import GPT2LMHeadModel, GPT2Config

# Specify the directory where the model was saved
output_dir = "llama_model_transformers"

# Load the model and configuration
llama_transformers = GPT2LMHeadModel.from_pretrained(output_dir)

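Conclusion

In this blog post, we walked step by step through implementing the LLaMA approach to build your own small language model (LLM). As a suggestion, consider scaling your model to around 15 million parameters, since small models in the 10 to 20 million range tend to understand English better. Once your LLM is proficient in the language, you can fine-tune it for specific use cases.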