The previous article, The Illustrated Transformer (Part 1), covered the Transformer's basic structure and self-attention.
Representing The Order of The Sequence Using Positional Encoding
The attention mechanism itself pays no attention to word position, but in NLP word order matters a great deal, so positional information must be combined with the word embeddings before they enter self-attention.
The transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.
Adding a positional vector to each input embedding both marks each word's position and lets the model compute distances between words in the sentence. The intuition: once these embeddings are projected into Q/K/V vectors and used in dot-product attention, the added values provide meaningful distance information between the embedding vectors.
The positional encoding is likewise a 512-dimensional vector in the base model, with values in the range [-1, 1].
The paper "Attention Is All You Need" (Section 3.5) gives the formula:
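The formulas from Section 3.5, reproduced here (shown as an image in the original post):

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```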
where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
In other words, learned positional embeddings and the sinusoidal version perform almost identically. The sinusoidal version was chosen because it may extrapolate better to sentences longer than any seen during training.
In the formula, pos is the token's position in the sequence (0 for the first token, 1 for the second, and so on, in most implementations), and i indexes the dimension. d_model is the model dimension, typically 512, so i ranges from 0 to 255. Dimension 2i (even) holds a sine value and dimension 2i+1 (odd) holds a cosine value, so sine and cosine entries alternate through the vector, as shown below:
Each row represents one token position; each column is one dimension (512 in the base model).
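The two formulas can be sketched in a few lines of NumPy (a minimal illustration; the variable names are my own, and real implementations typically precompute this table once):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Build the sinusoidal table: even dims get sin, odd dims get cos."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512) -- one 512-dim vector per token position
```

Note that the values stay in [-1, 1], matching the range mentioned above.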
The Residuals
Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
That is the "Add & Normalize" step in the figure below.
A residual connection (also called a skip connection) passes the input straight to the sub-layer's output, where the two are added together. Its main purpose is to let gradients back-propagate directly through the skip path, which reduces the vanishing-gradient problem in deep networks and makes it feasible to train deeper models, since gradients flow more effectively back to the shallow layers. Residual connections also help preserve the original input: information that might otherwise be lost during processing is carried through the skip path and combined with the sub-layer's output, strengthening the model's expressive power.
The self-attention output is added to the original input and normalized; the feed-forward output is then added to that normalized result and normalized again. A more detailed diagram:
Residual connections followed by normalization sit between the sub-layers inside every encoder layer, and between the sub-layers of every decoder layer as well. The figure above also shows a detail: the output of the encoder stack is passed to the encoder-decoder attention layer of every decoder.
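A minimal NumPy sketch of the Add & Normalize step (illustrative only: a real layer norm has learned gain and bias parameters, and the ffn stand-in below is not a real sub-layer):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each vector to zero mean and (roughly) unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer):
    # residual connection: the input is added to the sub-layer's output,
    # then the sum is layer-normalized
    return layer_norm(x + sublayer(x))

x = np.random.randn(3, 512)          # 3 tokens, d_model = 512
ffn = lambda h: np.maximum(0.0, h)   # stand-in for a self-attention / FFN sub-layer
out = add_and_norm(x, ffn)
print(out.shape)   # (3, 512) -- same shape in and out, so layers can stack
```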
The Decoder Side
The encoders start by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence:
The output of the top encoder in the stack is transformed into K and V vectors, which are passed in turn to the encoder-decoder attention layer of every decoder, helping each decoder focus on the appropriate positions in the input sequence. The process looks like this:
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
The output of each step is fed to the bottom decoder at the next time step, and the decoders "bubble up" their decoding results just as the encoders did. And just as with the encoder inputs, the decoder inputs are embedded and combined with positional encodings. The process is shown below:
Note in the figure above: each new output is computed on the basis of the previous outputs.
Looking again at the overall structure:
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation. The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
In the decoder, the self-attention layer may only attend to earlier positions in the output sequence (the later positions have not been computed yet). This is achieved simply by setting the future positions to -inf before the softmax.
The encoder-decoder attention layer works just like multi-headed self-attention, except that it builds its Queries matrix from the layer below it (the masked self-attention) and takes its Keys and Values matrices from the output of the encoder stack.
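The future-position masking can be illustrated with a toy score matrix (a sketch; the score values are random, and real models apply this per attention head):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_weights(scores):
    """Set future positions to -inf before softmax, so each position
    can only attend to itself and earlier positions."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    masked = np.where(mask, -np.inf, scores)          # future scores -> -inf
    return softmax(masked)                            # exp(-inf) = 0 weight

scores = np.random.randn(4, 4)    # toy attention scores for 4 positions
w = causal_attention_weights(scores)
print(np.allclose(np.triu(w, k=1), 0.0))   # True: future positions get zero weight
```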
The Final Linear and Softmax Layer
The decoder stack outputs a vector of floats. How is that turned into a word? That is the job of the final Linear layer and the Softmax layer.
The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
The Linear layer is a simple fully connected network that projects the output vector of the decoder stack into a much larger vector called the logits vector.
Suppose the model knows 10,000 distinct English words (the model's output vocabulary, learned from the training data). The Linear layer then makes each logits vector 10,000 cells wide, each cell holding the score of one word.
The Softmax layer turns the scores in the logits vector into probabilities. The cell with the highest probability is chosen, and the word associated with it is the output for this time step.
The process is shown below:
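The two layers can be sketched with NumPy (the weights here are random stand-ins for a trained Linear layer):

```python
import numpy as np

np.random.seed(0)
vocab_size, d_model = 10000, 512

decoder_out = np.random.randn(d_model)       # hypothetical decoder-stack output

# Linear layer: project d_model floats to vocab_size scores (the logits)
W = np.random.randn(d_model, vocab_size) * 0.01
logits = decoder_out @ W

# Softmax: turn scores into probabilities (all positive, summing to 1)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

word_id = int(np.argmax(probs))              # index of the output word
print(round(float(probs.sum()), 6))          # 1.0
print(0 <= word_id < vocab_size)             # True
```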
Recap Of Training
During training, the forward pass is exactly the one described above (from token embeddings through the encoders and decoders to the final Linear and Softmax layers). The difference is that during training we have labeled target data, so we can compare the model's output against the expected result.
Suppose the output vocabulary contains six words (including an <eos>, end-of-sentence, token), represented with one-hot encoding.
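One-hot encoding over that toy vocabulary looks like this (the six words follow the original post's example; the helper name is my own):

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]  # toy 6-word vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0   # 1 at the word's index, 0 everywhere else
    return v

print(one_hot("am"))   # [0. 1. 0. 0. 0. 0.]
```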
The Loss Function
loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.
The loss function (also called the cost function) is the metric that guides training toward an accurate model. Put plainly: how do we know whether the model's output is any good? We compare it with the labeled training result, and the gap between the two is what the loss function quantifies.
Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.
What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.
We train on translating “merci” into “thanks”. We want the output to be a probability distribution that points at “thanks”, but since the model is not yet trained, that is unlikely to happen just yet.
The model's weights are randomly initialized, so the untrained model produces an essentially arbitrary distribution. We compare the model's output with the labeled output and use back propagation to tweak all of the model's weights, so that the output moves closer to the desired one.
How do we compare the two vectors (the untrained model's output and the expected output)? One simple way is a least-squares style measure: subtract element-wise, square, and sum; the smaller the sum, the closer the match. In practice, cross-entropy (or KL divergence) is the usual choice for comparing probability distributions.
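Both comparisons in a few lines (the numbers are made up purely for illustration):

```python
import numpy as np

target = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # one-hot target distribution
output = np.array([0.1, 0.1, 0.4, 0.2, 0.1, 0.1])  # untrained model's guess

# squared-error comparison: subtract element-wise, square, sum
sq_err = np.sum((output - target) ** 2)

# cross-entropy: the usual loss for comparing probability distributions
cross_entropy = -np.sum(target * np.log(output + 1e-12))

print(round(float(sq_err), 3))          # 0.44
print(round(float(cross_entropy), 3))   # 0.916  (-ln 0.4)
```

Both numbers shrink as the model's output distribution moves toward the target.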
Now let's look at a more realistic example.
More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:
- Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)
- The first probability distribution has the highest probability at the cell associated with the word “i”
- The second probability distribution has the highest probability at the cell associated with the word “am”
- And so on, until the fifth output distribution indicates the ‘<end of sentence>’ symbol, which also has a cell associated with it from the 10,000 element vocabulary.
We train on translating a French sentence (“je suis étudiant”) into English (“i am a student”). That means we want the model to output a sequence of probability distributions in which:
- each probability distribution is a vector of vocabulary size (6 in this toy example);
- the first probability distribution has its highest probability at the cell for the word “i”;
- the second has its highest probability at the cell for “am”;
- and so on, until the fifth distribution peaks at the <eos> symbol.
After training for long enough on a large enough dataset, we hope the produced probability distributions look like this:
Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.
Every position receives a little probability mass even when it cannot be the output of that time step; this is a very useful property of softmax that helps the training process.
Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That’s one way to do it (called greedy decoding).
Because the model produces one output at a time, we can have it pick the highest-probability word from each distribution and throw away the rest. This approach is called greedy decoding.
Another way to do it would be to hold on to, say, the top two words (say, ‘I’ and ‘a’ for example), then in the next step, run the model twice: once assuming the first output position was the word ‘I’, and another time assuming the first output position was the word ‘a’, and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called “beam search”, where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we’ll return two translations). These are both hyperparameters that you can experiment with.
Another approach: instead of keeping only the single highest-probability word at each step, keep the top two (say, “I” and “a”), then run the model twice at the next step: once assuming the first output was “I”, once assuming it was “a”. Whichever version produces less error considering both positions #1 and #2 is kept, and the process repeats until the sentence is complete. This is called beam search. In this example:
- beam_size = 2: at all times, two partial hypotheses (unfinished translations) are kept in memory;
- top_beams = 2: two final translations are returned.
Both are hyperparameters you can experiment with. Greedy decoding can get stuck in a local optimum that is not the global optimum, while beam search costs more computation.
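A toy beam search over hand-made log-probabilities (the probability table is invented purely to show the bookkeeping; in reality these scores would come from the decoder):

```python
import math

# invented next-word log-probabilities: prefix -> {word: logprob}
TABLE = {
    (): {"I": math.log(0.6), "a": math.log(0.4)},
    ("I",): {"am": math.log(0.9), "a": math.log(0.1)},
    ("a",): {"am": math.log(0.2), "I": math.log(0.8)},
}

def beam_search(steps=2, beam_size=2):
    beams = [((), 0.0)]                    # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for word, lp in TABLE.get(prefix, {"<eos>": 0.0}).items():
                candidates.append((prefix + (word,), score + lp))
        # keep only the beam_size best partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for prefix, score in beam_search():
    print(prefix, round(score, 3))
```

With beam_size = 1 this reduces to greedy decoding.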
The comparison table below comes from DeepSeek: