从零开始：用Python编码你的十亿参数LLM创作不易，方便的话点点关注，谢谢文章结尾有最新热度的文章，感兴趣的可以去

创作不易，方便的话点点关注，谢谢

文章结尾有最新热度的文章，感兴趣的可以去看看。

本文是经过严格查阅相关权威文献和资料，形成的专业的可靠的内容。全文数据都有据可依，可回溯。特别申明：数据和资料已获得授权。本文内容，不涉及任何偏颇观点，用中立态度客观事实描述事情本身

文章有点长(14000字阅读时长：25分)，期望您能坚持看完，并有所收获。

编码自己的十亿参数 LLM

LLaMA 3 是继 Mistral 之后最有前途的开源模型之一，可以解决各种任务。下面介绍如何利用 LLaMA 架构从零开始创建一个拥有 230 多万个参数的 LLM。现在 LLaMA-3 发布了，我们将以更简单的方式重新创建它。

在本文章中，我们不会使用 GPU，但您需要至少 17 GB 的内存，因为我们将加载一些超过 15 GB 的文件。如果这对你来说是个问题，你可以使用 Kaggle 作为解决方案。由于我们不需要 GPU，Kaggle 可提供 30 GB 内存，同时只使用 CPU 内核作为加速器。

预备知识

我们不会使用面向对象编程（OOP）编码，只是普通的 Python 编程。不过，您应该对神经网络和 Transformer 架构有基本的了解。这是学习本文章仅需的两个先决条件。

LLaMA 2 与 LLaMA 3 的区别

在了解技术细节之前，您必须知道的第一件事是 LLaMA 3 的整个架构与 LLaMA 2 相同。因此，如果您还没有了解过 LLaMA 3 的技术细节。即使您不了解 LLaMA 2 的架构，也不用担心，我们也会对其技术细节进行高层次的概述。下面是有关 LLaMA 2 和 LLaMA 3 的一些要点：

| FEATURE

Llama 3

Llama 2

| | --- | --- | --- | |

Tokenizer

Tiktoken (developed by OpenAI)

SentencePiece

| |

Number of Parameters

8B, 70B

70B, 13B, 7B

| |

Training Data

15T tokens

2.2T tokens

| |

Context Length

8192 tokens

4096 tokens

| |

Attention Mechanism

Grouped-query attention

| |

Fine-Tuned Models

Yes

| |

Performance

Better than Llama 2 on all benchmarks

Better than Llama 1 on most benchmarks

| |

Computational Requirements

Very high (70B model)

| |

Availability

Open source

| |

Reinforcement learning from human feedback

Yes

| |

Number of languages supported

30 languages

20 languages

| |

Suitable for

Best for more demanding tasks, such as reasoning, coding, and proficiency tests

Good for more demanding tasks, such as reasoning, coding, and proficiency tests

了解 LLaMA 3 的Transformer架构

在开始编码之前，了解 LLaMA 3 的架构非常重要。为了更直观地理解，下面是 Vanilla Transformer、LLaMA 2/3 和 Mistral 之间的对比图。

1.使用 RMSNorm 进行预归一化：

在与 LLaMA 2 相同的 LLaMA 3 方法中，一种名为 RMSNorm 的技术被用于对每个变压器子层的输入进行归一化处理。

想象一下，你正在为一场大型考试而复习，而你有一本篇幅很长的教科书，里面有很多章节。每一章都代表不同的主题，但有些章节对理解主题比其他章节更重要。

现在，在深入学习整本教科书之前，您决定评估每一章的重要性。你不想在每一章上都花同样多的时间；你想把更多的精力放在关键的章节上。

这就是使用 RMSNorm 对大型语言模型（LLM）（如 ChatGPT）进行预规范化的作用所在。这就好比根据每个章节的重要性为其分配权重。对主题至关重要的章节权重较高，而不太重要的章节权重较低。

因此，在深入学习之前，您要根据各章节的加权重要性调整学习计划。你要在权重较高的章节上分配更多的时间和精力，确保彻底掌握核心概念。

同样，使用 RMSNorm 进行预规范化可以帮助 LLM 优先考虑文本中哪些部分对于理解上下文和含义更为重要。它为重要的元素分配较高的权重，为不太重要的元素分配较低的权重，确保模型将注意力集中在准确理解最需要的地方。感兴趣的读者可以点击这里查看 RMSNorm 的详细实现过程。

2.SwiGLU 激活功能：

LLaMA 从 PaLM 中汲取灵感，引入了 SwiGLU 激活函数。

想象一下，你是一名教师，正试图向学生解释一个复杂的话题。你有一块大白板，你在上面写下要点并画出图表，让事情变得更清楚。但有时，你的字迹可能不太工整，或者你的图表可能画得不够完美。这会增加学生理解教材的难度。

现在，想象一下，如果你有一支神奇的笔，它能根据每个要点的重要性自动调整笔迹的大小和风格。如果某一点非常重要，这支笔就会把它写得更大更清晰，让它显得更加突出。如果不那么重要，这支笔就会把它写得小一些，但仍然清晰可辨。

SwiGLU 就像是 ChatGPT 等大型语言模型 (LLM) 的魔法笔。在生成文本之前，SwiGLU 会根据每个单词或短语与上下文的相关性来调整其重要性。就像魔术笔会调整书写的大小和风格一样，SwiGLU 也会调整每个单词或短语的重点。

这样，当法律硕士生成文本时，就可以更加突出重要部分，使其更加引人注目，并确保它们更有助于对文本的整体理解。这样，SwiGLU 就能帮助 LLM 生成更清晰、更易懂的文本，就像魔术笔能帮助你在白板上为学生创建更清晰的解释一样。有关 SwiGLU 的更多详情，请参阅相关论文。

3. Rotary Embeddings (RoPE):

旋转嵌入或绳索是LLaMA 3中使用的一种位置嵌入。

想象一下，你在一间教室里，想给学生分配座位进行小组讨论。通常情况下，你可能会按行和列来安排座位，每个学生都有一个固定的位置。但是，在某些情况下，你希望创建一个更动态的座位安排，让学生可以更自由地走动和互动。

ROPE 就像一种特殊的座位安排，允许学生旋转和变换位置，同时仍能保持彼此之间的相对位置。学生不再被固定在一个地方，而是可以做圆周运动，从而实现更流畅的互动。

在这种情况下，每个学生代表文本序列中的一个单词或标记，他们的位置与他们在序列中的位置相对应。正如 ROPE 允许学生旋转和改变位置一样，ROPE 允许文本序列中单词的位置嵌入根据它们之间的相对位置动态改变。

因此，在处理文本时，ROPE 不再将位置嵌入视为固定和静态的，而是引入了旋转方面，从而允许更灵活的表示，捕捉序列中单词之间的动态关系。这种灵活性有助于像 ChatGPT 这样的模型更好地理解和生成自然流畅并保持连贯性的文本，就像动态座位安排在课堂上促进互动讨论一样。对数学细节感兴趣的人可以参阅 RoPE 论文。

4.字节对编码（BPE）算法

LLaMA 3 使用 OpenAI 推出的 tiktoken 库中的字节对编码（BPE），而 LLaMA 2 标记符号生成器 BPE 则基于 sentencepiece 库。它们之间存在细微差别，但首先让我们了解一下 BPE 究竟是什么。

让我们从一个简单的例子开始。假设我们有一个包含单词的文本语料库："ab"、"bc"、"bcd "和 "cde"。我们首先用文本语料库中的所有单个字符初始化我们的词汇表，因此我们的初始词汇表是 {"a"、"b"、"c"、"d"、"e"}。

接下来，我们计算每个字符在文本语料库中的出现频率。在我们的例子中，频率为{"a":1, "b":3, "c":3, "d":2, "e":1}.

现在，我们开始合并过程。我们重复以下步骤，直到我们的词汇量达到所需的大小：首先，我们找出频率最高的一对连续字符。在本例中，频率最高的字符对是 "bc"，频率为 2。然后，我们合并这对字符，创建一个新的子词单位 "bc"。合并后，我们更新频率计数，以反映新的子词单位。更新后的频率为{"a"：1, "b":2, "c": 2, "d"：2, "e":1, "bc": 2}。我们将新的子词单位 "bc "添加到我们的词汇表中，现在词汇表变为 {"a"、"b"、"c"、"d"、"e"、"bc"}。

我们重复这一过程。下一个出现频率最高的词对是 "cd"。我们将 "cd "合并成一个新的子词单元 "cd"，并更新频率计数。更新后的频率为 {"a"：1, "b":2, "c":1, "d":1, "e":1, "bc": 2, "cd"：2}.我们将 "cd "添加到词汇表中，结果是{"a"、"b"、"c"、"d"、"e"、"bc"、"cd"}。

接着，下一个词频对是 "de"。我们将 "de "合并为子词单元 "de"，并将频率计数更新为 {"a"：1, "b":2, "c":1, "d":1, "e":0, "bc": 2, "cd"：1, "de"：1}.我们将 "de "添加到词汇表中，这样词汇表就变成了 {"a"、"b"、"c"、"d"、"e"、"bc"、"cd"、"de"}。

接下来，我们发现 "ab "是频率最高的词对。我们将 "ab "合并为子词单元 "ab"，并将频率计数更新为{"a"：0, "b":1, "c":1, "d":1, "e":0, "bc": 2, "cd"：1、"de"：1, "ab"：1}.我们将 "ab "添加到词汇表中，词汇表就变成了{"a"、"b"、"c"、"d"、"e"、"bc"、"cd"、"de"、"ab"}。

然后，下一个频繁词对是 "bcd"。我们将 "bcd "合并为子词单元 "bcd"，并将频率计数更新为 {"a"：0, "b":0, "c":0, "d":0, "e":0, "bc"：1, "cd"：0、"de"：1, "ab"：1, "bcd"：1}.我们将 "bcd "添加到词汇表中，结果是{"a"、"b"、"c"、"d"、"e"、"bc"、"cd"、"de"、"ab"、"bcd"}。

最后，出现频率最高的词对是 "cde"。我们将 "cde "合并为子词单元 "cde"，并将频率计数更新为{"a"：0, "b":0, "c":0, "d":0, "e":0, "bc"：1, "cd"：0、"de"：0、"ab"：1、"bcd"：1, "cde"：1}.我们将 "cde "添加到词汇表中，这样词汇表就变成了{"a"、"b"、"c"、"d"、"e"、"bc"、"cd"、"de"、"ab"、"bcd"、"cde"}。

这种技术可以提高 LLM 的性能，并处理罕见词和词汇量不足的词。TikToken BPE 与句子片段 BPE 的最大区别在于，如果整个单词已经为人所知，TikToken BPE 并不总是将单词分割成更小的部分。例如，如果词汇中有 "hugging"（拥抱），它就会保持为一个标记，而不会拆分成["hug", "ging"]。

做准备

我们将使用少量 Python 库，但最好还是先安装这些库，以免遇到 "找不到模块 "的错误。

pip install sentencepiece tiktoken torch blobfile matplotlib huggingface_hub

安装完所需库后，我们需要下载一些文件。由于我们要复制 llama-3-8B 的架构，因此您必须拥有 HuggingFace 的账户。此外，由于 llama-3 是一个封闭模型，您必须接受他们的条款和条件才能访问模型内容。步骤如下：

通过此链接创建 HuggingFace 账户（https://huggingface.co/welcome）

从该链接接受 llama-3-8B 的条款和条件（https://huggingface.co/meta-llama/Meta-Llama-3-8B）

完成这两个步骤后，现在我们要下载一些文件。有两种方法可供选择： (选项 1：手动）从此链接进入 llama-3-8B HF 目录，手动下载这三个文件。

(选项 2：编码）我们可以使用之前安装的 hugging_face 库下载所有这些文件。不过，首先我们需要在工作笔记本中使用 HF 令牌登录 HuggingFace Hub。您可以创建一个新令牌，或通过此链接访问。

# Import the `notebook_login` function from the `huggingface_hub` module.
from huggingface_hub import notebook_login

# Execute the `notebook_login` function to log in to the Hugging Face Hub.
notebook_login()

运行该单元后，系统会要求您输入令牌。如果登录过程中出现错误，请重试，但一定要取消选中添加令牌作为 git 凭证。之后，我们只需运行一段简单的 Python 代码，下载 llama-3-8B 架构的三个主要文件。

# Import the necessary function from the huggingface_hub library
from huggingface_hub import hf_hub_download

# Define the repository information
repo_id ="meta-llama/Meta-Llama-3-8B"
subfolder ="original"# Specify the subfolder within the repository

# List of filenames to download
filenames =["params.json","tokenizer.model","consolidated.00.pth"]

# Specify the directory where you want to save the downloaded files
save_directory ="llama-3-8B/"# Replace with your desired path

# Download each file
for filename in filenames:
    hf_hub_download(
        repo_id=repo_id,# Repository ID
        filename=filename,# Name of the file to download
        subfolder=subfolder,# Subfolder within the repository
        local_dir=save_directory  # Directory to save the downloaded file
    )

下载完所有文件后，我们需要导入本文章中将要使用的库。

# Tokenization library
import tiktoken

# BPE loading function
from tiktoken.load import load_tiktoken_bpe

# PyTorch library
import torch

# JSON handling
import json

接下来，我们需要了解每个文件的用途。

了解文件结构

由于我们的目标是精确复制 llama-3，这意味着我们的输入文本必须产生有意义的输出。例如，如果我们的输入是 "太阳的颜色是？"，输出就必须是 "白色"。要做到这一点，需要在大型数据集上训练我们的 LLM，这对计算能力的要求很高，对我们来说是不可行的。

不过，Meta 已经公开发布了他们的 llama-3 架构文件，或者用更复杂的术语来说，他们的预训练权重，以供使用。我们刚刚下载了这些文件，这样我们就可以复制他们的架构，而不需要训练或大型数据集。一切都已准备就绪，我们只需在正确的地方使用正确的组件。

来看看这些文件及其重要性： tokenizer.model - 正如我们之前讨论过的，LLaMA-3 使用的是 tiktoken 的字节对编码（BPE）标记符，它是在一个包含 15 万亿个标记的数据集上训练出来的，比 LLaMA-2 使用的数据集大 7 倍。让我们加载这个文件，看看它有什么。

# Loading the tokenizer from llama-3-8B
tokenizer_model = load_tiktoken_bpe("tokenizer.model")

# Get the length of the tokenizer model 
len(tokenizer_model)
# OUTPUT: 128000

# Get the type of the `tokenizer_model` object.
type(tokenizer_model)
# OUTPUT: dictionary

length 属性显示总词汇量，即训练数据中唯一的字符数。tokenizer_model 的类型是字典。

# Printing the first 10 items of tokenizer model
dict(list(tokenizer_model.items())[5600:5610])

#### OUTPUT ####
{

b'mitted':5600,
b" $('#":5601,
b' saw':5602,
b' approach':5603,
b'ICE':5604,
b' saying':5605,
b' anyone':5606,
b'meta':5607,
b'SD':5608,
b' song':5609

}
#### OUTPUT ####

当我们从中打印 10 个随机项时，你会看到使用 BPE 算法形成的字符串，与我们之前讨论的示例类似。键代表来自 BPE 训练的字节序列，而值则代表基于频率的合并等级。

consolidated.00.pth - 包含 Llama-3-8B 的学习参数（权重）。这些参数包括有关模型如何理解和处理语言的信息，例如如何表示词块、计算注意力、执行前馈转换以及对输出进行归一化。

# Loading a PyTorch model of LLaMA-3-8B
model = torch.load("consolidated.00.pth")

# printing first 11 layers of the architecture
list(model.keys())[:11]

#### OUTPUT ####
[

'tok_embeddings.weight',
'layers.0.attention.wq.weight',
'layers.0.attention.wk.weight',
'layers.0.attention.wv.weight',
'layers.0.attention.wo.weight',
'layers.0.feed_forward.w1.weight',
'layers.0.feed_forward.w3.weight',
'layers.0.feed_forward.w2.weight',
'layers.0.attention_norm.weight',
'layers.0.ffn_norm.weight',
'layers.1.attention.wq.weight',

]
#### OUTPUT ####

如果你熟悉变换器架构，就会知道查询、关键矩阵等。稍后，我们将在 Llama-3 的架构中使用这些层/权重来创建此类矩阵。 params.json- 包含各种参数值，例如

# Opening the parameters JSON file
withopen("params.json","r")as f:
    config = json.load(f)

# Printing the content
print(config)

#### OUTPUT ####
{

'dim':4096,
'n_layers':32,
'n_heads':32,
'n_kv_heads':8,
'vocab_size':128256,
'multiple_of':1024,
'ffn_dim_multiplier':1.3,
'norm_eps':1e-05,
'rope_theta':500000.0

}
#### OUTPUT ####

这些值将帮助我们复制 Llama-3 架构，指定头的数量、嵌入向量的维度等细节。让我们存储这些值，以便以后使用。

# Dimension
dim = config["dim"]

# Layers
n_layers = config["n_layers"]

# Heads
n_heads = config["n_heads"]

# KV_heads
n_kv_heads = config["n_kv_heads"]

# Vocabulary
vocab_size = config["vocab_size"]

# Multiple
multiple_of = config["multiple_of"]

# Multiplier
ffn_dim_multiplier = config["ffn_dim_multiplier"]

# Epsilon
norm_eps = config["norm_eps"]

# RoPE
rope_theta = torch.tensor(config["rope_theta"])

对输入数据进行标记我们需要做的第一件事就是将输入文本转换为标记符，为此，我们首先需要创建一些特殊标记符，以便在标记化文本中提供结构化标记，使标记化器能够识别和处理特定条件或指令。

special_tokens = [
"<|begin_of_text|>",# Marks the beginning of a text sequence.
"<|end_of_text|>",# Marks the end of a text sequence.
"<|reserved_special_token_0|>",# Reserved for future use.
"<|reserved_special_token_1|>",# Reserved for future use.
"<|reserved_special_token_2|>",# Reserved for future use.
"<|reserved_special_token_3|>",# Reserved for future use.
"<|start_header_id|>",# Indicates the start of a header ID.
"<|end_header_id|>",# Indicates the end of a header ID.
"<|reserved_special_token_4|>",# Reserved for future use.
"<|eot_id|>",# Marks the end of a turn (in a conversational context).
]+[f"<|reserved_special_token_{i}|>"for i inrange(5,256-5)]  # A large set of tokens reserved for future use.

接下来，我们通过指定不同的模式来匹配输入文本中的各类子串，从而定义将文本分割成标记的规则。我们可以这样做

# patterns based on which text will be break into tokens
tokenize_breaker = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

它可以从输入文本中提取单词、缩略语、数字（最多三位）和非空格字符序列，你还可以根据自己的要求进行定制。

我们需要使用 TikToken BPE 编写一个简单的标记化函数，它需要三个输入：tokenizer_model、tokenize_breaker 和 special_tokens。该函数将对输入文本进行相应的编码/解码。

# Initialize tokenizer with specified parameters
tokenizer = tiktoken.Encoding(

# make sure to set path to tokenizer.model file
    name ="tokenizer.model",

# Define tokenization pattern string
    pat_str = tokenize_breaker,

# Assign BPE mergeable ranks from tokenizer_model of LLaMA-3
    mergeable_ranks = tokenizer_model,

# Set special tokens with indices
    special_tokens={token:len(tokenizer_model)+ i for i, token inenumerate(special_tokens)},
)

# Encode "hello world!" and decode tokens to string
tokenizer.decode(tokenizer.encode("hello world!"))

#### OUTPUT ####
hello world!
#### OUTPUT ####

为了验证我们的编码器函数方法是否正常工作，我们将 "Hello World "传递给它。首先，它对文本进行编码，将其转换为数值。然后，解码回文本，得到 "hello world!"。这证明函数工作正常。让我们对输入进行标记化。

# input prompt
prompt ="the answer to the ultimate question of life, the universe, and everything is "

# Encode the prompt using the tokenizer and prepend a special token (128000)
tokens =[128000]+ tokenizer.encode(prompt)

print(tokens)# Print the encoded tokens

# Convert the list of tokens into a PyTorch tensor
tokens = torch.tensor(tokens)

# Decode each token back into its corresponding string
prompt_split_as_tokens =[tokenizer.decode([token.item()])for token in tokens]

print(prompt_split_as_tokens)# Print the decoded tokens


#### OUTPUT ####
[128000,1820,4320,311,...]
['<|begin_of_text|>','the',' answer',' to',...]
#### OUTPUT ####

我们对输入文本 "生命、宇宙和万物终极问题的答案是 "进行了编码，以一个特殊标记开始。

为每个令牌创建嵌入

如果我们检查输入矢量的长度，结果会是

# checking dimension of input vector
len(tokens)

#### OUTPUT ####
17
#### OUTPUT ####

checking dimension of embedding vector from llama-3 architecture

print(dim)

OUTPUT

4096

OUTPUT


我们的输入向量目前的维数为 (17x1)，需要转换为每个标记词的嵌入。这意味着我们的 (17x1) 标记将变为 (17x4096)，其中每个标记都有一个长度为 4096 的相应嵌入。

Define embedding layer with vocab size and embedding dimension

embedding_layer = torch.nn.Embedding(vocab_size, dim)

Copy pre-trained token embeddings to the embedding layer

embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])

Get token embeddings for given tokens, converting to torch.bfloat16 format

token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)

Print shape of resulting token embeddings

token_embeddings_unnormalized.shape

OUTPUT

torch.Size([17,4096])

OUTPUT


这些嵌入没有进行归一化处理，如果我们不对其进行归一化处理，将会产生严重影响。下一节，我们将对输入向量进行归一化处理。

使用 RMSNorm 进行归一化
================

我们将使用前面提到的 RMSNorm 公式对输入向量进行归一化处理，以确保输入已归一化。![图片](https://mmbiz.qpic.cn/mmbiz_png/CibM5VmPwqwDIc51D5ia4tUPna6farbUEjknwSZt8icPJ1M08iavupbAmowA2vIlgo67Kj6uwuVxSpyVJY6bicbC7bg/640?wx_fmt=png&from=appmsg)

Calculating RMSNorm

defrms_norm(tensor, norm_weights):

Calculate the mean of the square of tensor values along the last dimension

squared_mean = tensor.pow(2).mean(-1, keepdim=True)

Add a small value to avoid division by zero

normalized = torch.rsqrt(squared_mean + norm_eps)

Multiply normalized tensor by the provided normalization weights

return(tensor * normalized)* norm_weights


我们将使用第 0 层的注意力权重对未归一化的嵌入进行归一化处理。使用第 0 层的原因是，我们正在创建 LLaMA-3 转换器架构的第一层。

using RMS normalization and provided normalization weights

token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])

Print the shape of the resulting token embeddings

token_embeddings.shape

OUTPUT

torch.Size([17, 4096])

OUTPUT


您可能已经知道，维数不会改变，因为我们只是对向量进行归一化处理，而不是其他。

注意头（查询、关键字、值）
=============

首先，让我们从模型中加载查询、键、值和输出向量。

Print the shapes of different weights

print(

Query weight shape

model["layers.0.attention.wq.weight"].shape,

Key weight shape

model["layers.0.attention.wk.weight"].shape,

Value weight shape

model["layers.0.attention.wv.weight"].shape,

Output weight shape

model["layers.0.attention.wo.weight"].shape )

OUTPUT

torch.Size([4096,4096])# Query weight dimension torch.Size([1024,4096])# Key weight dimension torch.Size([1024,4096])# Value weight dimension torch.Size([4096,4096])# Output weight dimension

OUTPUT


这些维度表明，我们下载的模型权重并不是针对每个头部的，而是针对多个注意力头部的，这是因为我们采用了并行方法/训练。不过，我们可以解开这些矩阵，使其仅适用于单个注意头。

Retrieve query weight for the first layer of attention

q_layer0 = model["layers.0.attention.wq.weight"]

Calculate dimension per head

head_dim = q_layer0.shape[0]// n_heads

Reshape query weight to separate heads

q_layer0 = q_layer0.view(n_heads, head_dim, dim)

Print the shape of the reshaped query weight tensor

q_layer0.shape

OUTPUT

torch.Size([32,128,4096])

OUTPUT


这里，32 是 Llama-3 中的注意力头数，128 是查询向量的大小，4096 是标记嵌入的大小。我们可以通过以下方法访问第一层第一个头部的查询权重矩阵：

Extract the query weight for the first head of the first layer of attention

q_layer0_head0 = q_layer0[0]

Print the shape of the extracted query weight tensor for the first head

q_layer0_head0.shape

OUTPUT

torch.Size([128, 4096])

OUTPUT


为了找到每个标记的查询向量，我们将查询权重与标记嵌入相乘。

Matrix multiplication: token embeddings with transpose of query weight for first head

q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)

Shape of resulting tensor: queries per token

q_per_token.shape

OUTPUT

torch.Size([17, 128])

OUTPUT


查询矢量本来就不知道自己在提示符中的位置，因此我们将使用 RoPE 让它们意识到这一点。

实施 RoPE
=======

我们将查询矢量分成两对，然后对每对矢量进行旋转角度移动。

Convert queries per token to float and split into pairs

q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)

Print the shape of the resulting tensor after splitting into pairs

q_per_token_split_into_pairs.shape

OUTPUT

torch.Size([17, 64, 2])

OUTPUT


我们有一个大小为 \[17x64x2\] 的向量，它表示 128 个长度的查询，每个提示符分成 64 对。每对查询将按 m\*theta 旋转，其中 m 是我们要旋转查询的标记的位置。 我们将使用复数的点积来旋转矢量。

Generate values from 0 to 1 split into 64 parts

zero_to_one_split_into_64_parts = torch.tensor(range(64))/64

Print the resulting tensor

zero_to_one_split_into_64_parts

OUTPUT

tensor([0.0000,0.0156,0.0312,0.0469,0.0625,0.0781,0.0938,0.1094,0.1250, 0.1406,0.1562,0.1719,0.1875,0.2031,0.2188,0.2344,0.2500,0.2656, 0.2812,0.2969,0.3125,0.3281,0.3438,0.3594,0.3750,0.3906,0.4062, 0.4219,0.4375,0.4531,0.4688,0.4844,0.5000,0.5156,0.5312,0.5469, 0.5625,0.5781,0.5938,0.6094,0.6250,0.6406,0.6562,0.6719,0.6875, 0.7031,0.7188,0.7344,0.7500,0.7656,0.7812,0.7969,0.8125,0.8281, 0.8438,0.8594,0.8750,0.8906,0.9062,0.9219,0.9375,0.9531,0.9688, 0.9844])

OUTPUT


分割步骤完成后，我们将计算其频率。

Calculate frequencies using a power operation

freqs =1.0/(rope_theta ** zero_to_one_split_into_64_parts)

Display the resulting frequencies

freqs

OUTPUT

tensor([1.0000e+00,8.1462e-01,6.6360e-01,5.4058e-01,4.4037e-01,3.5873e-01, 2.9223e-01,2.3805e-01,1.9392e-01,1.5797e-01,1.2869e-01,1.0483e-01, 8.5397e-02,6.9566e-02,5.6670e-02,4.6164e-02,3.7606e-02,3.0635e-02, 2.4955e-02,2.0329e-02,1.6560e-02,1.3490e-02,1.0990e-02,8.9523e-03, 7.2927e-03,5.9407e-03,4.8394e-03,3.9423e-03,3.2114e-03,2.6161e-03, 2.1311e-03,1.7360e-03,1.4142e-03,1.1520e-03,9.3847e-04,7.6450e-04, 6.2277e-04,5.0732e-04,4.1327e-04,3.3666e-04,2.7425e-04,2.2341e-04, 1.8199e-04,1.4825e-04,1.2077e-04,9.8381e-05,8.0143e-05,6.5286e-05, 5.3183e-05,4.3324e-05,3.5292e-05,2.8750e-05,2.3420e-05,1.9078e-05, 1.5542e-05,1.2660e-05,1.0313e-05,8.4015e-06,6.8440e-06,5.5752e-06, 4.5417e-06,3.6997e-06,3.0139e-06,2.4551e-06])

OUTPUT


现在，每个标记的查询元素都有一个复数，我们可以将查询转换成复数，然后使用点乘法根据它们的位置进行旋转。

Convert queries per token to complex numbers

q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)

q_per_token_as_complex_numbers.shape

Output: torch.Size([17, 64])

Calculate frequencies for each token using outer product of arange(17) and freqs

freqs_for_each_token = torch.outer(torch.arange(17), freqs)

Calculate complex numbers from frequencies_for_each_token using polar coordinates

freqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)

Rotate complex numbers by frequencies

q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis

q_per_token_as_complex_numbers_rotated.shape

Output: torch.Size([17, 64])


在得到旋转向量后，我们可以将复数再次视为实数，从而回到原来的成对查询。

Convert rotated complex numbers back to real numbers

q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)

Print the shape of the resulting tensor

q_per_token_split_into_pairs_rotated.shape

OUTPUT

torch.Size([17, 64, 2])

OUTPUT


现在，旋转后的数据对将被合并，从而产生一个新的查询向量（旋转查询向量），其形状为 \[17x128\]，其中 17 是标记数，128 是查询向量的维度。

Reshape rotated token queries to match the original shape

q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)

Print the shape of the resulting tensor

q_per_token_rotated.shape

OUTPUT

torch.Size([17, 128])

OUTPUT


对于密钥，过程与此类似，但要记住密钥向量也是 128 维的。密钥的权重只有查询的 1/4，因为它们一次由 4 个磁头共享，以尽量减少计算量。密钥也会旋转，以包含位置信息，这与查询类似。

Extract the weight tensor for the attention mechanism's key in the first layer of the model

k_layer0 = model["layers.0.attention.wk.weight"]

Reshape key weight for the first layer of attention to separate heads

k_layer0 = k_layer0.view(n_kv_heads, k_layer0.shape[0]// n_kv_heads, dim)

Print the shape of the reshaped key weight tensor

k_layer0.shape # Output: torch.Size([8, 128, 4096])

Extract the key weight for the first head of the first layer of attention

k_layer0_head0 = k_layer0[0]

Print the shape of the extracted key weight tensor for the first head

k_layer0_head0.shape # Output: torch.Size([128, 4096])

Calculate key per token by matrix multiplication

k_per_token = torch.matmul(token_embeddings, k_layer0_head0.T)

Print the shape of the resulting tensor representing keys per token

k_per_token.shape # Output: torch.Size([17, 128])

Split key per token into pairs and convert to float

k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0],-1,2)

Print the shape of the resulting tensor after splitting into pairs

k_per_token_split_into_pairs.shape # Output: torch.Size([17, 64, 2])

Convert key per token to complex numbers

k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)

Print the shape of the resulting tensor representing key per token as complex numbers

k_per_token_as_complex_numbers.shape # Output: torch.Size([17, 64])

Rotate complex key per token by frequencies

k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)

Print the shape of the rotated complex key per token

k_per_token_split_into_pairs_rotated.shape # Output: torch.Size([17, 64, 2])

Reshape rotated key per token to match the original shape

k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)

Print the shape of the rotated key per token

k_per_token_rotated.shape # Output: torch.Size([17, 128])


现在我们有了每个令牌的旋转查询和密钥，每个令牌的大小为 \[17x128\]。

实施自我关注
======

将查询矩阵和密钥矩阵相乘，就能得到将每个标记映射到另一个标记的得分。这个分数表示每个标记的查询和关键字之间的关系。

Calculate query-key dot products per token

qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T) / (head_dim) ** 0.5

Print the shape of the resulting tensor representing query-key dot products per token

qk_per_token.shape

OUTPUT

torch.Size([17, 17])

OUTPUT


\[17x17\]形状代表注意力分数（qk\_per\_token），其中 17 是提示符的数量。我们需要屏蔽查询键得分。在训练过程中，未来标记的查询密钥得分是被屏蔽的，因为我们只学会使用过去的标记来预测标记。因此，在推理过程中，我们会将未来标记设置为零。

Create a mask tensor filled with negative infinity values

mask = torch.full((len(tokens),len(tokens)),float("-inf"), device=tokens.device)

Set upper triangular part of the mask tensor to negative infinity

mask = torch.triu(mask, diagonal=1)

Print the resulting mask tensor

mask

OUTPUT

tensor([[0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,-inf], [0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]])

OUTPUT


现在，我们必须对每个标记向量的查询密钥应用掩码。此外，我们还要在其上应用 softmax，将输出分数转换为概率。这有助于从模型的词汇表中选择最有可能的标记或标记序列，使模型的预测结果更具可解释性，更适合语言生成和分类等任务。

Add the mask to the query-key dot products per token

qk_per_token_after_masking = qk_per_token + mask

Apply softmax along the second dimension after masking

qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)


值矩阵标志着自我注意力部分的结束，与键类似，值权重也是每 4 个注意力头共享的，以节省计算量。因此，值权重矩阵的形状为 \[8x128x4096\]。

Retrieve the value weight for the first layer of attention

v_layer0 = model["layers.0.attention.wv.weight"]

Reshape value weight for the first layer of attention to separate heads

v_layer0 = v_layer0.view(n_kv_heads, v_layer0.shape[0] // n_kv_heads, dim)

Print the shape of the reshaped value weight tensor

v_layer0.shape

OUTPUT

torch.Size([8, 128, 4096])

OUTPUT


与查询矩阵和关键矩阵类似，第一层和首部的值矩阵也可以通过以下方法获得：

Extract the value weight for the first head of the first layer of attention

v_layer0_head0 = v_layer0[0]

Print the shape of the extracted value weight tensor for the first head

v_layer0_head0.shape

OUTPUT

torch.Size([128, 4096])

OUTPUT


通过值权重，我们计算出每个标记的注意力值，从而得到一个大小为 \[17x128\] 的矩阵。其中，17 表示提示符的数量，128 表示每个提示符的值向量的维度。

Calculate value per token by matrix multiplication

v_per_token = torch.matmul(token_embeddings, v_layer0_head0.T)

Print the shape of the resulting tensor representing values per token

v_per_token.shape

OUTPUT

torch.Size([17, 128])

OUTPUT


为了得到注意力矩阵，我们可以进行以下乘法运算：

Calculate QKV attention by matrix multiplication

qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)

Print the shape of the resulting tensor

qkv_attention.shape

OUTPUT

torch.Size([17, 128])

OUTPUT


现在，我们得到了第一层和第一头的注意力值，换句话说，就是自我注意力值。

实施多头关注
======

将执行一个循环，进行与上述相同的计算，但针对的是第一层中的每个头部。

Store QKV attention for each head in a list

qkv_attention_store =[]

Iterate through each head

for head inrange(n_heads):

Extract query, key, and value weights for the current head

q_layer0_head = q_layer0[head] k_layer0_head = k_layer0[head//4]# Key weights are shared across 4 heads v_layer0_head = v_layer0[head//4]# Value weights are shared across 4 heads

Calculate query per token by matrix multiplication

q_per_token = torch.matmul(token_embeddings, q_layer0_head.T)

Calculate key per token by matrix multiplication

k_per_token = torch.matmul(token_embeddings, k_layer0_head.T)

Calculate value per token by matrix multiplication

v_per_token = torch.matmul(token_embeddings, v_layer0_head.T)

Split query per token into pairs and rotate them

q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0],-1,2) q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs) q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis[:len(tokens)]) q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)

Split key per token into pairs and rotate them

k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0],-1,2) k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs) k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis[:len(tokens)]) k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)

Calculate query-key dot products per token

qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)/(128)**0.5

Create a mask tensor filled with negative infinity values

mask = torch.full((len(tokens),len(tokens)),float("-inf"), device=tokens.device)

Set upper triangular part of the mask tensor to negative infinity

mask = torch.triu(mask, diagonal=1)

Add the mask to the query-key dot products per token

qk_per_token_after_masking = qk_per_token + mask

Apply softmax along the second dimension after masking

qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)

Calculate QKV attention by matrix multiplication

qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)

Store QKV attention for the current head

qkv_attention_store.append(qkv_attention)

Print the number of QKV attentions stored

len(qkv_attention_store)

OUTPUT


现在，第一层所有 32 个头的 QKV 注意力矩阵都已得到，所有的注意力分数将合并成一个大小为 \[17x4096\] 的大矩阵。

Concatenate QKV attentions from all heads along the last dimension

stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)

Print the shape of the resulting tensor

stacked_qkv_attention.shape

OUTPUT

torch.Size([17, 4096])

OUTPUT


关注第 0 层的最后一个步骤是将权重矩阵与堆叠 QKV 矩阵相乘。

Calculate the embedding delta by matrix multiplication with the output weight

embedding_delta = torch.matmul(stacked_qkv_attention, model["layers.0.attention.wo.weight"].T)

Print the shape of the resulting tensor

embedding_delta.shape

OUTPUT

torch.Size([17, 4096])

OUTPUT


现在，我们有了关注后嵌入值的变化，这些变化应添加到原始标记嵌入值中。

Add the embedding delta to the unnormalized token embeddings to get the final embeddings

embedding_after_edit = token_embeddings_unnormalized + embedding_delta

Print the shape of the resulting tensor

embedding_after_edit.shape

OUTPUT

torch.Size([17, 4096])

OUTPUT


对嵌入变化进行归一化处理，然后通过前馈神经网络运行。

Normalize edited embeddings using root mean square normalization and provided weights

embedding_after_edit_normalized = rms_norm(embedding_after_edit, model["layers.0.ffn_norm.weight"])

Print the shape of resulting normalized embeddings

embedding_after_edit_normalized.shape

OUTPUT

torch.Size([17, 4096])

OUTPUT


执行 SwiGLU 激活功能
==============

鉴于我们已经熟悉了前一节中的 SwiGLU 激活函数，我们将在这里应用我们之前研究过的方程。![图片](https://mmbiz.qpic.cn/mmbiz_png/CibM5VmPwqwDIc51D5ia4tUPna6farbUEj0RJ1bdzS9Agl6Azkzj25AbOjvH698ib6F1eSHc8pugX9sD6ZILgicMVA/640?wx_fmt=png&from=appmsg)

Retrieve weights for feedforward layer

w1 = model["layers.0.feed_forward.w1.weight"] w2 = model["layers.0.feed_forward.w2.weight"] w3 = model["layers.0.feed_forward.w3.weight"]

Perform operations for feedforward layer

output_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T))* torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)

Print the shape of the resulting tensor after feedforward

output_after_feedforward.shape

OUTPUT

torch.Size([17,4096])

OUTPUT


合并一切
====

现在一切准备就绪，我们需要合并代码，生成另外 31 个图层。

Initialize final embedding with unnormalized token embeddings

final_embedding = token_embeddings_unnormalized

Iterate through each layer

for layer inrange(n_layers):

Initialize list to store QKV attentions for each head

qkv_attention_store =[]

Normalize the final embedding using root mean square normalization and weights from the current layer

layer_embedding_norm = rms_norm(final_embedding, model[f"layers.{layer}.attention_norm.weight"])

Retrieve query, key, value, and output weights for the attention mechanism of the current layer

q_layer = model[f"layers.{layer}.attention.wq.weight"] q_layer = q_layer.view(n_heads, q_layer.shape[0]// n_heads, dim) k_layer = model[f"layers.{layer}.attention.wk.weight"] k_layer = k_layer.view(n_kv_heads, k_layer.shape[0]// n_kv_heads, dim) v_layer = model[f"layers.{layer}.attention.wv.weight"] v_layer = v_layer.view(n_kv_heads, v_layer.shape[0]// n_kv_heads, dim) w_layer = model[f"layers.{layer}.attention.wo.weight"]

Iterate through each head

for head inrange(n_heads):

Extract query, key, and value weights for the current head

q_layer_head = q_layer[head] k_layer_head = k_layer[head//4]# Key weights are shared across 4 heads v_layer_head = v_layer[head//4]# Value weights are shared across 4 heads

Calculate query per token by matrix multiplication

q_per_token = torch.matmul(layer_embedding_norm, q_layer_head.T)

Calculate key per token by matrix multiplication

k_per_token = torch.matmul(layer_embedding_norm, k_layer_head.T)

Calculate value per token by matrix multiplication

v_per_token = torch.matmul(layer_embedding_norm, v_layer_head.T)

Split query per token into pairs and rotate them

q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0],-1,2) q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs) q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis) q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)

Split key per token into pairs and rotate them

k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0],-1,2) k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs) k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis) k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)

Calculate query-key dot products per token

qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)/(128)**0.5

Create a mask tensor filled with negative infinity values

mask = torch.full((len(token_embeddings_unnormalized),len(token_embeddings_unnormalized)),float("-inf"))

Set upper triangular part of the mask tensor to negative infinity

mask = torch.triu(mask, diagonal=1)

Add the mask to the query-key dot products per token

qk_per_token_after_masking = qk_per_token + mask

Apply softmax along the second dimension after masking

qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)

Calculate QKV attention by matrix multiplication

qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)

Store QKV attention for the current head

qkv_attention_store.append(qkv_attention)

Concatenate QKV attentions from all heads along the last dimension

stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)

Calculate embedding delta by matrix multiplication with the output weight

embedding_delta = torch.matmul(stacked_qkv_attention, w_layer.T)

Add the embedding delta to the current embedding to get the edited embedding

embedding_after_edit = final_embedding + embedding_delta

Normalize the edited embedding using root mean square normalization and weights from the current layer

embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f"layers.{layer}.ffn_norm.weight"])

Retrieve weights for the feedforward layer

w1 = model[f"layers.{layer}.feed_forward.w1.weight"] w2 = model[f"layers.{layer}.feed_forward.w2.weight"] w3 = model[f"layers.{layer}.feed_forward.w3.weight"]

Perform operations for the feedforward layer

output_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T))* torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)

Update the final embedding with the edited embedding plus the output from the feedforward layer

final_embedding = embedding_after_edit + output_after_feedforward


生成输出
====

现在我们有了最终的嵌入，它代表了模型对下一个标记的猜测。它的形状与普通的标记嵌入相同，即 \[17x4096\]，有 17 个标记，嵌入维度为 4096。

Normalize the final embedding using root mean square normalization and provided weights

final_embedding = rms_norm(final_embedding, model["norm.weight"])

Print the shape of the resulting normalized final embedding

final_embedding.shape

OUTPUT

torch.Size([17, 4096])

OUTPUT


现在，我们可以将嵌入解码为令牌值。

Print the shape of the output weight tensor

model["output.weight"].shape

OUTPUT

torch.Size([128256, 4096])

OUTPUT


为了预测下一个值，我们利用了上一个标记的嵌入。

Calculate logits by matrix multiplication between the final embedding and the transpose of the output weight tensor

logits = torch.matmul(final_embedding[-1], model["output.weight"].T)

Print the shape of the resulting logits tensor

logits.shape

OUTPUT

torch.Size([128256])

OUTPUT

# Find the index of the maximum value along the last dimension to determine the next token
next_token = torch.argmax(logits, dim=-1)

# Output the index of the next token
next_token

#### OUTPUT ####
tensor(2983)
#### OUTPUT ####
```

从令牌 ID 获取生成的文本

```
# Decode the index of the next token using the tokenizer
tokenizer.decode([next_token.item()])


#### OUTPUT ####
42
#### OUTPUT ####
```

因此，我们的输入是 "生命、宇宙和万物的终极问题的答案是"，输出是 "42"，这是正确答案。 您只需在整个代码中修改这两行，就可以尝试使用不同的输入文本，其他代码保持不变！

```
# input prompt
prompt = "Your Input"

# Replacing 17 number with total number of tokens in your input
# You can check total number of tokens using len(tokens)
freqs_for_each_token = torch.outer(torch.arange(17), freqs)
```

**希望你们喜欢这篇文章，并能从中学到新东西！**

![图片](https://mmbiz.qpic.cn/mmbiz_gif/CibM5VmPwqwDIc51D5ia4tUPna6farbUEj1LoylEloyn3syElGfr0MCWfh7eneDwr4yxlic33Uo0L0ia7Hia7vjAaQQ/640?wx_fmt=gif&from=appmsg)

**点个“在看”不失联**

最新热门文章推荐：

[国外Rust程序员分享：Rust与C++的完美结合](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247486074&idx=1&sn=2770aefeb2110c3353590daa43184bf6&scene=21#wechat_redirect)

[国外C++程序员分享：2024/2025年C++是否还值得学习？](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247486076&idx=1&sn=d646c8a6c06a3a95a2ced4094c021aa2&scene=21#wechat_redirect)

[外国人眼中的贾扬清：从清华到阿里，再创业LeptonAI](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247486075&idx=1&sn=9f8ce330fcc87e908b523df3afaa104e&scene=21#wechat_redirect)

[白宫关注下，C++的内存安全未来走向何方？](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247486035&idx=1&sn=f6ce5327b66d91a36cd4fa1d9b8a9f80&scene=21#wechat_redirect)

[国外Python程序员分享：如何用Python构建一个多代理AI应用](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485989&idx=1&sn=439789fd8492ee8bda40fef51d3e6145&scene=21#wechat_redirect)

[本地部署六款大模型：保护隐私、节省成本，特定环境首选](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485969&idx=1&sn=54c20e46c70172307c72026abbf2755b&scene=21#wechat_redirect)

[国外CUDA程序员分享：2024年GPU编程CUDA C++（从环境安装到进阶技巧）](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485788&idx=1&sn=77e8feae61345b275e8e5a8ee400b2bb&scene=21#wechat_redirect)

[我卸载了VSCode，我的生产力大幅提升](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485827&idx=1&sn=5e9f02fdd3959cb427f15c70d9203122&scene=21#wechat_redirect)

[国外Python程序员分享：2024年NumPy高性能计算库（高级技巧）](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485735&idx=1&sn=99586afdf165f53caeb4cf7a9062cfde&scene=21#wechat_redirect)

[外国人眼中的程明明：从“电脑小白”到CV领域领军者](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485725&idx=1&sn=1a74cb4c9ea6f868ff5a88d91338718d&scene=21#wechat_redirect)

[外国人眼中的周志华：人工智能奖获得者、人工智能学院院长](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485669&idx=1&sn=85842c2b4802ecc9dc988bab8394772f&scene=21#wechat_redirect)

[国外C++程序员分享：C++多线程实战掌握图像处理高级技巧](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485687&idx=1&sn=6e1634cea46c7aacf3759d77a2d64338&scene=21#wechat_redirect)

[外国人眼中的卢湖川：从大连理工到全球舞台，他的科研成果震撼世界！](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485650&idx=1&sn=396801f5408a194aee956a3acd30129b&scene=21#wechat_redirect)

[外国人眼中的张祥雨：交大90后男神博士，3年看1800篇论文，还入选福布斯精英榜](https://mp.weixin.qq.com/s?__biz=MzkzNjI3ODkyNQ==&mid=2247485593&idx=1&sn=cba8bf9d52b4a63321cd07afd88efa43&scene=21#wechat_redirect)

参考文献： 《图片来源网络》 《Building LLaMA 3 From Scratch with Python》

> 本文使用 [文章同步助手](https://juejin.cn/post/6940875049587097631) 同步