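Machine Learning Mastery: Attention Mechanisms (Part 2)

Original: Machine Learning Mastery

License: CC BY-NC-SA 4.0

How to Implement Multi-Head Attention from Scratch in TensorFlow and Keras

Original article: machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/

We have already familiarized ourselves with the theory behind the Transformer model and its attention mechanism. We have also started our journey of implementing a complete model by seeing how to implement scaled dot-product attention. We will now go one step further and encapsulate scaled dot-product attention into a multi-head attention mechanism, which is a core component of the model. Our end goal remains to apply the complete model to natural language processing (NLP).

In this tutorial, you will discover how to implement multi-head attention from scratch in TensorFlow and Keras.

After completing this tutorial, you will know:

The layers that form part of the multi-head attention mechanism.
How to implement the multi-head attention mechanism from scratch.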

Let's get started.

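Tutorial Overview

This tutorial is divided into three parts; they are:

Recap of the Transformer Architecture
- The Transformer Multi-Head Attention
Implementing Multi-Head Attention from Scratch
Testing Out the Code

Prerequisites

For this tutorial, we assume that you are already familiar with the Transformer model, its attention mechanism, and scaled dot-product attention.

Recap of the Transformer Architecture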

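Recall that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture

Taken from "Attention Is All You Need"

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core mechanisms that both the encoder and decoder share is multi-head attention.

The Transformer Multi-Head Attention

Each multi-head attention block is made up of four consecutive levels:

On the first level, three linear (dense) layers each receive the queries, keys, or values.
On the second level, a scaled dot-product attention function. The operations performed on the first and second levels are repeated h times, in parallel, according to the number of heads composing the multi-head attention block.
On the third level, a concatenation operation joins the outputs of the different heads.
On the fourth level, a final linear (dense) layer produces the output.

Multi-head attention

Taken from "Attention Is All You Need"

Recall as well the important components that will serve as building blocks for your implementation of multi-head attention:

The queries, keys, and values: These are the inputs to each multi-head attention block. In the encoder stage, they each carry the same input sequence after it has been embedded and augmented by positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence after it has also been embedded and augmented by positional information. The second attention block of the decoder receives the encoder output in the form of keys and values, and the normalized output of the first decoder attention block as the queries. The dimensionality of the queries and keys is denoted by $d_k$, whereas the dimensionality of the values is denoted by $d_v$.
The projection matrices: When applied to the queries, keys, and values, these projection matrices generate different subspace representations of each. Each attention head then works on one of these projected versions of the queries, keys, and values. An additional projection matrix is also applied to the output of the multi-head attention block after the outputs of the individual heads have been concatenated together. The projection matrices are learned during training.

Let's now see how to implement multi-head attention from scratch in TensorFlow and Keras.

Implementing Multi-Head Attention from Scratch

Let's start by creating the class MultiHeadAttention, which inherits from the Layer base class in Keras, and initialize several instance attributes that you will be working with (attribute descriptions can be found in the comments):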

Python

class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot product attention 
        self.heads = h  # Number of attention heads to use
        self.d_k = d_k  # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v  # Dimensionality of the linearly projected values
        self.W_q = Dense(d_k)  # Learned projection matrix for the queries
        self.W_k = Dense(d_k)  # Learned projection matrix for the keys
        self.W_v = Dense(d_v)  # Learned projection matrix for the values
        self.W_o = Dense(d_model)  # Learned projection matrix for the multi-head output
        ...

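Note that an instance of the DotProductAttention class implemented earlier has been created and assigned to the attribute attention. Recall that you implemented the DotProductAttention class as follows: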

Python

from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from keras.backend import softmax

# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)

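Next, you will reshape the linearly projected queries, keys, and values in such a manner as to allow the attention heads to be computed in parallel.

The queries, keys, and values are fed into the multi-head attention block with a shape of (batch size, sequence length, model dimensionality), where the batch size is a hyperparameter of the training process, the sequence length defines the maximum length of the input/output phrases, and the model dimensionality is the dimensionality of the outputs produced by all sub-layers of the model. They are then passed through their respective dense layers to be linearly projected to a shape of (batch size, sequence length, queries/keys/values dimensionality).

The linearly projected queries, keys, and values are rearranged into (batch size, number of heads, sequence length, depth) by first reshaping them into (batch size, sequence length, number of heads, depth) and then transposing the second and third dimensions. For this purpose, you will create the class method reshape_tensor as follows: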

Python

def reshape_tensor(self, x, heads, flag):
    if flag:
        # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
        x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
        x = transpose(x, perm=(0, 2, 1, 3))
    else:
        # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_v)
        x = transpose(x, perm=(0, 2, 1, 3))
        x = reshape(x, shape=(shape(x)[0], shape(x)[1], self.d_v))
    return x

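The reshape_tensor method receives the linearly projected queries, keys, or values as input (with the flag set to True) and rearranges them as described above. Once the multi-head attention output has been generated, it is also fed into the same function (this time with the flag set to False) to perform the reverse operation, effectively concatenating the results of all heads together.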
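The split-and-merge behaviour of reshape_tensor can be checked outside TensorFlow. The following is a minimal NumPy sketch, not the tutorial's code: the variable names and the small batch size are assumptions chosen for illustration, and NumPy's reshape and transpose stand in for their TensorFlow counterparts.

```python
import numpy as np

batch_size, seq_length, heads, d_k = 2, 5, 8, 64

# Stand-in for the linearly projected queries: (batch_size, seq_length, d_k)
x = np.arange(batch_size * seq_length * d_k, dtype=np.float32)
x = x.reshape(batch_size, seq_length, d_k)

# Split the last axis into (heads, depth), then move the heads axis forward
split = x.reshape(batch_size, seq_length, heads, -1).transpose(0, 2, 1, 3)

# Reverse operation: move the heads axis back, then merge (heads, depth)
merged = split.transpose(0, 2, 1, 3).reshape(batch_size, seq_length, -1)

print(split.shape)                 # (2, 8, 5, 8)
print(merged.shape)                # (2, 5, 64)
print(np.array_equal(x, merged))   # True
```

With a projection size of 64 and 8 heads, each head in this sketch works on a depth of 8 features, and the reverse operation recovers the original tensor exactly.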

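Hence, the next step is to feed the linearly projected queries, keys, and values into the reshape_tensor method to be rearranged, and then to feed them into the scaled dot-product attention function. To do so, let's create another class method, call, as follows: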

Python

def call(self, queries, keys, values, mask=None):
    # Rearrange the queries to be able to compute all heads in parallel
    q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Rearrange the keys to be able to compute all heads in parallel
    k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Rearrange the values to be able to compute all heads in parallel
    v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

    # Compute the multi-head attention output using the reshaped queries, keys and values
    o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
    # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)
    ...

Note that the call method can also receive a mask (whose value defaults to None) as input, in addition to the queries, keys, and values.

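Recall that the Transformer model introduces a look-ahead mask to prevent the decoder from attending to succeeding words, so that the prediction for a particular word can only depend on known outputs for the words that come before it. Furthermore, since the word embeddings are zero-padded to a specific sequence length, a padding mask also needs to be introduced to prevent the zero values from being processed along with the input. These look-ahead and padding masks can be passed on to the scaled dot-product attention through the mask argument.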
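To make the two masks concrete, here is a minimal NumPy sketch of how they could be built. The helper names padding_mask and lookahead_mask are hypothetical, the padding token id is assumed to be 0, and the convention that 1 marks a position to hide follows the scores += -1e9 * mask line in DotProductAttention; the full model in this series may construct its masks differently.

```python
import numpy as np

def padding_mask(token_ids):
    # 1.0 where the token id is 0 (assumed padding id), 0.0 elsewhere
    return (token_ids == 0).astype(np.float32)

def lookahead_mask(seq_length):
    # Ones strictly above the diagonal: position i may not attend to j > i
    return np.triu(np.ones((seq_length, seq_length), dtype=np.float32), k=1)

tokens = np.array([[7, 3, 9, 0, 0]])  # one sentence, zero-padded to length 5

print(padding_mask(tokens))   # [[0. 0. 0. 1. 1.]]
print(lookahead_mask(3))
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
```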

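Once you have generated the multi-head attention output from all the attention heads, the final steps are to concatenate all outputs back together into a tensor of shape (batch size, sequence length, values dimensionality) and to pass the result through one final dense layer. For this purpose, you will add the following two lines of code to the call method: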

Python

...
# Rearrange back the output into concatenated form
output = self.reshape_tensor(o_reshaped, self.heads, False)
# Resulting tensor shape: (batch_size, input_seq_length, d_v)

# Apply one final linear projection to the output to generate the multi-head attention
# Resulting tensor shape: (batch_size, input_seq_length, d_model)
return self.W_o(output)

Putting everything together, you have the following implementation of multi-head attention:

Python

from tensorflow import math, matmul, reshape, shape, transpose, cast, float32
from tensorflow.keras.layers import Dense, Layer
from keras.backend import softmax

# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)

# Implementing the Multi-Head Attention
class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot product attention
        self.heads = h  # Number of attention heads to use
        self.d_k = d_k  # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v  # Dimensionality of the linearly projected values
        self.d_model = d_model  # Dimensionality of the model
        self.W_q = Dense(d_k)  # Learned projection matrix for the queries
        self.W_k = Dense(d_k)  # Learned projection matrix for the keys
        self.W_v = Dense(d_v)  # Learned projection matrix for the values
        self.W_o = Dense(d_model)  # Learned projection matrix for the multi-head output

    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
            x = transpose(x, perm=(0, 2, 1, 3))
        else:
            # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_v)
            x = transpose(x, perm=(0, 2, 1, 3))
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], self.d_v))
        return x

    def call(self, queries, keys, values, mask=None):
        # Rearrange the queries to be able to compute all heads in parallel
        q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the keys to be able to compute all heads in parallel
        k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the values to be able to compute all heads in parallel
        v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Compute the multi-head attention output using the reshaped queries, keys and values
        o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange back the output into concatenated form
        output = self.reshape_tensor(o_reshaped, self.heads, False)
        # Resulting tensor shape: (batch_size, input_seq_length, d_v)

        # Apply one final linear projection to the output to generate the multi-head attention
        # Resulting tensor shape: (batch_size, input_seq_length, d_model)
        return self.W_o(output)

Testing Out the Code

You will be working with the parameter values specified in the paper Attention Is All You Need, by Vaswani et al. (2017):

Python

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model sub-layers' outputs
batch_size = 64  # Batch size from the training process
...

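As for the sequence length and the queries, keys, and values, you will be working with dummy data for the time being, until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences: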

Python

...
input_seq_length = 5  # Maximum length of the input sequence

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))
...

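In the complete Transformer model, values for the sequence length and the queries, keys, and values will be obtained through a process of word tokenization and embedding. We will cover this in a separate tutorial.

Returning to the testing procedure, the next step is to create a new instance of the MultiHeadAttention class and assign it to the multihead_attention variable: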

Python

...
multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
...

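Since the MultiHeadAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to pass in the input arguments and print the result: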

Python

...
print(multihead_attention(queries, keys, values))

Tying everything together produces the following code listing:

Python

from numpy import random

input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of the model sub-layers' outputs
batch_size = 64  # Batch size from the training process

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))

multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
print(multihead_attention(queries, keys, values))

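Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the queries, keys, and values, and of the parameter values of the dense layers.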

Python

tf.Tensor(
[[[-0.02185373  0.32784638  0.15958631 ... -0.0353895   0.6645204
   -0.2588266 ]
  [-0.02272229  0.32292002  0.16208754 ... -0.03644213  0.66478664
   -0.26139447]
  [-0.01876744  0.32900316  0.16190802 ... -0.03548665  0.6645842
   -0.26155376]
  [-0.02193783  0.32687354  0.15801215 ... -0.03232524  0.6642926
   -0.25795174]
  [-0.02224652  0.32437912  0.1596448  ... -0.0340827   0.6617497
   -0.26065096]]
 ...

 [[ 0.05414441  0.27019292  0.1845745  ...  0.0809482   0.63738805
   -0.34231138]
  [ 0.05546578  0.27191412  0.18483458 ...  0.08379208  0.6366671
   -0.34372014]
  [ 0.05190979  0.27185103  0.18378328 ...  0.08341806  0.63851804
   -0.3422392 ]
  [ 0.05437043  0.27318984  0.18792395 ...  0.08043509  0.6391771
   -0.34357914]
  [ 0.05406848  0.27073097  0.18579456 ...  0.08388947  0.6376929
   -0.34230167]]], shape=(64, 5, 512), dtype=float32)

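Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Advanced Deep Learning with Python, 2019
Transformers for Natural Language Processing, 2021

Papers

Attention Is All You Need, 2017

Summary

In this tutorial, you discovered how to implement multi-head attention from scratch in TensorFlow and Keras.

Specifically, you learned:

The layers that form part of the multi-head attention mechanism
How to implement the multi-head attention mechanism from scratch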

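How to Implement Scaled Dot-Product Attention from Scratch in TensorFlow and Keras

Original article: machinelearningmastery.com/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras/

Having familiarized ourselves with the theory behind the Transformer model and its attention mechanism, we will start our journey of implementing a complete Transformer model by first seeing how to implement scaled dot-product attention. Scaled dot-product attention is an integral part of multi-head attention, which, in turn, is an important component of both the Transformer encoder and decoder. Our end goal is to apply the complete Transformer model to natural language processing (NLP).

In this tutorial, you will discover how to implement scaled dot-product attention from scratch in TensorFlow and Keras.

After completing this tutorial, you will know:

The operations that form part of the scaled dot-product attention mechanism
How to implement the scaled dot-product attention mechanism from scratch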

Let's get started.

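Tutorial Overview

This tutorial is divided into three parts; they are:

Recap of the Transformer Architecture
- The Transformer Scaled Dot-Product Attention
Implementing Scaled Dot-Product Attention from Scratch
Testing Out the Code

Prerequisites

For this tutorial, we assume that you are already familiar with the concept of attention, the Transformer attention mechanism, and the Transformer model.

Recap of the Transformer Architecture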

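Recall that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture

Taken from "Attention Is All You Need"

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core components that both the encoder and decoder share within their multi-head attention blocks is scaled dot-product attention.

The Transformer Scaled Dot-Product Attention

First, recall the queries, keys, and values as the important components you will work with.

In the encoder stage, they each carry the same input sequence after it has been embedded and augmented by positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence after it has also been embedded and augmented by positional information. The second attention block of the decoder receives the encoder output in the form of keys and values, and the normalized output of the first attention block as the queries. The dimensionality of the queries and keys is denoted by $d_k$, whereas the dimensionality of the values is denoted by $d_v$.

Scaled dot-product attention receives these queries, keys, and values as inputs and first computes the dot product of the queries with the keys. The result is then divided by the square root of $d_k$, producing the attention scores. These are fed into a softmax function to obtain a set of attention weights. Finally, the attention weights are used to scale the values through a weighted multiplication operation. The entire process can be expressed mathematically as follows, where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ denote the queries, keys, and values, respectively: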

$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left( \frac{\mathbf{Q} \mathbf{K}^\mathsf{T}}{\sqrt{d_k}} \right) \mathbf{V}$
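The formula maps almost line by line onto array operations. The following NumPy sketch is only an illustration of the equation, not the TensorFlow implementation developed in this tutorial; the function names and the random test data are assumptions.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability before exponentiating
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    # Q K^T / sqrt(d_k): one score per (query position, key position) pair
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores)
    # Weighted sum of the value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.random((2, 5, 64))
k = rng.random((2, 5, 64))
v = rng.random((2, 5, 64))

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape)                           # (2, 5, 64)
print(np.allclose(weights.sum(axis=-1), 1))   # True
```

Each row of the weights sums to 1, so every output vector is a convex combination of the value vectors.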

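Each multi-head attention block in the Transformer model implements a scaled dot-product attention operation as shown below:

Scaled dot-product attention and multi-head attention

Taken from "Attention Is All You Need"

You may note that the scaled dot-product attention can also apply a mask to the attention scores before feeding them into the softmax function.

Since the word embeddings are zero-padded to a specific sequence length, a padding mask needs to be introduced to prevent the zero tokens from being processed along with the input in both the encoder and decoder stages. Furthermore, a look-ahead mask is also required to prevent the decoder from attending to succeeding words, so that the prediction for a particular word can only depend on known outputs for the words that come before it.

These look-ahead and padding masks are applied inside the scaled dot-product attention to set to $-\infty$ all the values in the input to the softmax function that should not be considered. For each of these large negative inputs, the softmax function will, in turn, produce an output value that is close to zero, effectively masking them out. The use of these masks will become clearer when you progress to implementing the encoder and decoder blocks in separate tutorials.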
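The effect of a large negative score on the softmax can be seen on a single row of attention scores. This is a small NumPy sketch with made-up numbers; it uses -1e9 as a finite stand-in for $-\infty$ and assumes that a mask value of 1 marks a position to hide.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability before exponentiating
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.5, 0.1]])  # illustrative attention scores
mask = np.array([[0.0, 0.0, 1.0, 1.0]])    # 1 marks positions to hide

# Push the hidden positions to a very large negative value
masked_scores = scores + -1e9 * mask
weights = softmax(masked_scores)

# The two hidden positions receive zero weight; the remaining
# weight is shared between the first two positions only
print(weights)
```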

For the time being, let's see how to implement scaled dot-product attention from scratch in TensorFlow and Keras.

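Implementing Scaled Dot-Product Attention from Scratch

For this purpose, you will create a class called DotProductAttention that inherits from the Layer base class in Keras.

In it, you will create the class method call(), which takes as input arguments the queries, keys, and values, as well as the dimensionality $d_k$ and a mask (which defaults to None):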

Python

class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        ...

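The first step is to perform a dot-product operation between the queries and the keys, transposing the latter. The result is scaled by dividing it by the square root of $d_k$. You will add the following line of code to the call() class method: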

Python

...
scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))
...

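Next, you will check whether the mask argument has been set to a value other than the default None.

The mask will contain 0 values to indicate that the corresponding token in the input sequence should be considered in the computations, or 1 to indicate otherwise. The mask is multiplied by -1e9 to set the 1 values to large negative numbers (recall that this was mentioned in the previous section), and is then added to the attention scores: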

Python

...
if mask is not None:
    scores += -1e9 * mask
...

The attention scores are then passed through a softmax function to generate the attention weights:

Python

...
weights = softmax(scores)
...

The final step weights the values with the computed attention weights through another dot-product operation:

Python

...
return matmul(weights, values)

The complete code listing is as follows:

Python

from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from keras.backend import softmax

# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)

Testing Out the Code

You will be working with the parameter values specified in the paper Attention Is All You Need, by Vaswani et al. (2017):

Python

d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
batch_size = 64  # Batch size from the training process
...

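As for the sequence length and the queries, keys, and values, you will be working with dummy data for the time being, until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences. Similarly, for the mask, leave it set to its default value for the time being: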

Python

...
input_seq_length = 5  # Maximum length of the input sequence

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))
...

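In the complete Transformer model, values for the sequence length and the queries, keys, and values will be obtained through a process of word tokenization and embedding. You will cover this in a separate tutorial.

Returning to the testing procedure, the next step is to create a new instance of the DotProductAttention class and assign it to the attention variable: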

Python

...
attention = DotProductAttention()
...

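Since the DotProductAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to feed in the input arguments and print the result: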

Python

...
print(attention(queries, keys, values, d_k))

Tying everything together produces the following code listing:

Python

from numpy import random

input_seq_length = 5  # Maximum length of the input sequence
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
batch_size = 64  # Batch size from the training process

queries = random.random((batch_size, input_seq_length, d_k))
keys = random.random((batch_size, input_seq_length, d_k))
values = random.random((batch_size, input_seq_length, d_v))

attention = DotProductAttention()
print(attention(queries, keys, values, d_k))

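Running this code produces an output of shape (batch size, sequence length, values dimensionality). Note that you will likely see a different output due to the random initialization of the queries, keys, and values.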

Python

tf.Tensor(
[[[0.60413814 0.52436507 0.46551135 ... 0.5260341  0.33879933 0.43999898]
  [0.60433316 0.52383804 0.465411   ... 0.5262608  0.33915892 0.43782598]
  [0.62321603 0.5349194  0.46824688 ... 0.531323   0.34432083 0.43554053]
  [0.60013235 0.54162943 0.47391182 ... 0.53600514 0.33722004 0.4192218 ]
  [0.6295709  0.53511244 0.46552944 ... 0.5317217  0.3462567  0.43129003]]
 ...

[[0.20291057 0.18463902 0.641182   ... 0.4706118  0.4194418  0.39908117]
  [0.19932748 0.18717204 0.64831126 ... 0.48373622 0.3995132  0.37968236]
  [0.20611541 0.18079443 0.6374859  ... 0.48258874 0.41704425 0.4016996 ]
  [0.19703123 0.18210654 0.6400498  ... 0.47037745 0.4257752  0.3962079 ]
  [0.19237372 0.18474475 0.64944196 ... 0.49497223 0.38804317 0.36352912]]], 
shape=(64, 5, 64), dtype=float32)

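Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Advanced Deep Learning with Python, 2019
Transformers for Natural Language Processing, 2021

Papers

Attention Is All You Need, 2017

Summary

In this tutorial, you discovered how to implement scaled dot-product attention from scratch in TensorFlow and Keras.

Specifically, you learned:

The operations that form part of the scaled dot-product attention mechanism
How to implement the scaled dot-product attention mechanism from scratch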

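Implementing the Transformer Decoder from Scratch in TensorFlow and Keras

Original article: machinelearningmastery.com/implementing-the-transformer-decoder-from-scratch-in-tensorflow-and-keras/

There are many similarities between the Transformer encoder and decoder, such as their implementation of multi-head attention, layer normalization, and a fully connected feed-forward network as their final sub-layer. Having implemented the Transformer encoder, we will now apply our knowledge to implementing the Transformer decoder as a further step toward implementing the complete Transformer model. Your end goal remains to apply the complete model to natural language processing (NLP).

In this tutorial, you will discover how to implement the Transformer decoder from scratch in TensorFlow and Keras.

After completing this tutorial, you will know:

The layers that form part of the Transformer decoder
How to implement the Transformer decoder from scratch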

Let's get started.

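Tutorial Overview

This tutorial is divided into three parts; they are:

Recap of the Transformer Architecture
- The Transformer Decoder
Implementing the Transformer Decoder from Scratch in TensorFlow and Keras
- The Decoder Layer
- The Transformer Decoder
Testing Out the Code

Prerequisites

For this tutorial, we assume that you are already familiar with the Transformer model, scaled dot-product attention, multi-head attention, positional encoding, and the Transformer encoder.

Recap of the Transformer Architecture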

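Recall that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture

Taken from "Attention Is All You Need"

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. This tutorial will explore these similarities.

The Transformer Decoder

Similar to the Transformer encoder, the Transformer decoder also consists of a stack of $N$ identical layers. The Transformer decoder, however, implements an additional multi-head attention block, for a total of three main sub-layers:

The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
The second sub-layer comprises a second multi-head attention mechanism.
The third sub-layer comprises a fully connected feed-forward network.

The decoder block of the Transformer architecture

Taken from "Attention Is All You Need"

Each of these three sub-layers is followed by layer normalization, where the input to the layer normalization step is the corresponding sub-layer input (through a residual connection) added to the sub-layer output.

On the decoder side, the queries, keys, and values fed into the first multi-head attention block also represent the same input sequence. This time, however, it is the target sequence that is embedded and augmented with positional information before being supplied to the decoder. The second multi-head attention block, on the other hand, receives the encoder output in the form of keys and values, and the normalized output of the first decoder attention block as the queries. In both cases, the dimensionality of the queries and keys remains equal to $d_k$, whereas the dimensionality of the values remains equal to $d_v$.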
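One consequence of taking the queries from the decoder and the keys and values from the encoder is that the two sequences may differ in length: the output always has one vector per query position. The NumPy sketch below illustrates this with a single attention head; the sequence lengths, the helper name, and the random data are assumptions made for the example.

```python
import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention without a mask
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
decoder_states = rng.random((1, 4, 64))   # queries: 4 target positions
encoder_output = rng.random((1, 6, 64))   # keys and values: 6 source positions

# Cross-attention: queries from the decoder, keys/values from the encoder
out = attention(decoder_states, encoder_output, encoder_output)
print(out.shape)   # (1, 4, 64)
```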

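Vaswani et al. also introduce regularization on the decoder side by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before these are fed into the decoder.

Let's now see how to implement the Transformer decoder from scratch in TensorFlow and Keras.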

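Implementing the Transformer Decoder from Scratch

The Decoder Layer

Since you have already implemented the required sub-layers when you covered the Transformer encoder, you will create a class for the decoder layer that makes use of these sub-layers straight away: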

Python

from multihead_attention import MultiHeadAttention
from encoder import AddNormalization, FeedForward

class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()
        ...

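Note that since the code for the different sub-layers has been saved into several Python scripts (namely, multihead_attention.py and encoder.py), it is necessary to import them to be able to use the required classes.

As you did for the Transformer encoder, you will now create the class method call(), which implements all the decoder sub-layers: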

Python

...
def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
    # Multi-head attention layer
    multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in a dropout layer
    multihead_output1 = self.dropout1(multihead_output1, training=training)

    # Followed by an Add & Norm layer
    addnorm_output1 = self.add_norm1(x, multihead_output1)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Followed by another multi-head attention layer
    multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

    # Add in another dropout layer
    multihead_output2 = self.dropout2(multihead_output2, training=training)

    # Followed by another Add & Norm layer
    addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

    # Followed by a fully connected layer
    feedforward_output = self.feed_forward(addnorm_output2)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in another dropout layer
    feedforward_output = self.dropout3(feedforward_output, training=training)

    # Followed by another Add & Norm layer
    return self.add_norm3(addnorm_output2, feedforward_output)

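The multi-head attention sub-layers can also receive a padding mask or a look-ahead mask. As a brief reminder of what was said in a previous tutorial, the padding mask is necessary to prevent the zero padding in the input sequence from being processed along with the actual input values. The look-ahead mask prevents the decoder from attending to succeeding words, so that the prediction for a particular word can only depend on known outputs for the words that come before it.

The same call() class method can also receive a training flag, so that the Dropout layers are applied only during training, when the flag's value is set to True.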
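What the training flag switches on and off can be sketched in a few lines of NumPy. This is an illustrative stand-in for a dropout layer, not Keras code: the function name and signature are hypothetical. It uses the common "inverted dropout" convention, in which the kept units are scaled by 1 / (1 - rate) during training so that no rescaling is needed at inference time.

```python
import numpy as np

def dropout(x, rate, training, rng):
    # At inference time the input passes through unchanged
    if not training:
        return x
    # During training, zero each unit with probability `rate`
    # and scale the survivors to keep the expected value unchanged
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 8))

out_inference = dropout(x, rate=0.1, training=False, rng=rng)
out_training = dropout(x, rate=0.1, training=True, rng=rng)

print(np.array_equal(out_inference, x))   # True
print(np.unique(out_training))            # only 0 and/or 1 / 0.9 appear
```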

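The Transformer Decoder

The Transformer decoder takes the decoder layer you have just implemented and replicates it $N$ times.

You will create the following Decoder() class to implement the Transformer decoder: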

Python

from positional_encoding import PositionEmbeddingFixedWeights

class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...

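As in the Transformer encoder, the input to the first multi-head attention block on the decoder side receives the input sequence after it has undergone word embedding and positional encoding. For this purpose, an instance of the PositionEmbeddingFixedWeights class (covered in a separate tutorial) is initialized and assigned to the pos_encoding attribute.

The final step is to create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result, together with the encoder output, to the $N$ decoder layers: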

Python

...
def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
    # Generate the positional encoding
    pos_encoding_output = self.pos_encoding(output_target)
    # Expected output shape = (number of sentences, sequence_length, d_model)

    # Add in a dropout layer
    x = self.dropout(pos_encoding_output, training=training)

    # Pass on the positional encoded values to each decoder layer
    for i, layer in enumerate(self.decoder_layer):
        x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

    return x

The code listing for the complete Transformer decoder is as follows:

Python

from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights
from encoder import AddNormalization, FeedForward

# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)

# Implementing the Decoder
class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x

Testing Out the Code

You will be working with the parameter values specified in the paper Attention Is All You Need, by Vaswani et al. (2017):

Python

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...

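As for the input sequence, you will be working with dummy data for the time being, until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences: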

Python

...
dec_vocab_size = 20 # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))
...

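Next, you will create a new instance of the Decoder class, assign it to the decoder variable, and subsequently pass in the input arguments and print the result. You will set the padding and look-ahead masks to None for the time being, but you will return to these when you implement the complete Transformer model: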

Python

...
decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))

Tying everything together produces the following code listing:

Python

from numpy import random

dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))

decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))

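Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and of the parameter values of the dense layers.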

Python

tf.Tensor(
[[[-0.04132953 -1.7236308   0.5391184  ... -0.76394725  1.4969798
    0.37682498]
  [ 0.05501875 -1.7523409   0.58404493 ... -0.70776534  1.4498456
    0.32555297]
  [ 0.04983566 -1.8431275   0.55850077 ... -0.68202156  1.4222856
    0.32104644]
  [-0.05684051 -1.8862512   0.4771412  ... -0.7101341   1.431343
    0.39346313]
  [-0.15625843 -1.7992781   0.40803364 ... -0.75190556  1.4602519
    0.53546077]]
...

 [[-0.58847624 -1.646842    0.5973466  ... -0.47778523  1.2060764
    0.34091905]
  [-0.48688865 -1.6809179   0.6493542  ... -0.41274604  1.188649
    0.27100053]
  [-0.49568555 -1.8002801   0.61536175 ... -0.38540334  1.2023914
    0.24383534]
  [-0.59913146 -1.8598882   0.5098136  ... -0.3984461   1.2115746
    0.3186561 ]
  [-0.71045107 -1.7778647   0.43008155 ... -0.42037937  1.2255307
    0.47380894]]], shape=(64, 5, 512), dtype=float32)

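Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Advanced Deep Learning with Python, 2019
Transformers for Natural Language Processing, 2021

Papers

Attention Is All You Need, 2017

Summary

In this tutorial, you discovered how to implement the Transformer decoder from scratch in TensorFlow and Keras.

Specifically, you learned:

The layers that form part of the Transformer decoder
How to implement the Transformer decoder from scratch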

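Implementing the Transformer Encoder from Scratch in TensorFlow and Keras

Original article: machinelearningmastery.com/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras/

Having seen how to implement scaled dot-product attention and integrate it within the multi-head attention of the Transformer model, let's progress one step further toward implementing a complete Transformer model by applying its encoder. Our end goal remains to apply the complete model to natural language processing (NLP).

In this tutorial, you will discover how to implement the Transformer encoder from scratch in TensorFlow and Keras.

After completing this tutorial, you will know:

The layers that form part of the Transformer encoder.
How to implement the Transformer encoder from scratch.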

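Let's get started.

Implementing the Transformer Encoder from Scratch in TensorFlow and Keras

Photo by ian dooley, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Recap of the Transformer Architecture
- The Transformer Encoder
Implementing the Transformer Encoder from Scratch
- The Fully Connected Feed-Forward Neural Network and Layer Normalization
- The Encoder Layer
- The Transformer Encoder
Testing Out the Code

Prerequisites

For this tutorial, we assume that you are already familiar with:

Recap of the Transformer Architecture

Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture

Taken from "Attention Is All You Need"

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. In this tutorial, you will focus on the components that form part of the Transformer encoder.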

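The Transformer Encoder

The Transformer encoder consists of a stack of $N$ identical layers, where each layer further consists of two main sublayers:

The first sublayer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
The second sublayer comprises a fully connected feed-forward network.

The encoder block of the Transformer architecture

Taken from "Attention Is All You Need"

Each of these two sublayers is followed by layer normalization, into which the sublayer input (through a residual connection) and the sublayer output are fed. The output of each layer normalization step is the following:

LayerNorm(Sublayer Input + Sublayer Output)

In order to facilitate such an operation, which involves an addition between the sublayer input and output, Vaswani et al. designed all sublayers and embedding layers in the model to produce outputs of dimension $d_{\text{model}}$ = 512.

Also, recall the queries, keys, and values as the inputs to the Transformer encoder.

Here, the queries, keys, and values carry the same input sequence after this has been embedded and augmented by positional information, where the queries and keys are of dimensionality $d_k$, and the values are of dimensionality $d_v$.

Furthermore, Vaswani et al. also introduce regularization into the model by applying dropout to the output of each sublayer (before the layer normalization step), as well as to the positional encodings before these are fed into the encoder.

Let's now see how to implement the Transformer encoder from scratch in TensorFlow and Keras.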

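Implementing the Transformer Encoder from Scratch

The Fully Connected Feed-Forward Neural Network and Layer Normalization

Let's begin by creating classes for the Feed Forward and Add & Norm layers that are shown in the diagram above.

Vaswani et al. tell us that the fully connected feed-forward network consists of two linear transformations with a ReLU activation in between. The first linear transformation produces an output of dimensionality $d_{ff}$ = 2048, while the second linear transformation produces an output of dimensionality $d_{\text{model}}$ = 512.

For this purpose, let's first create the class FeedForward that inherits from the Layer base class in Keras and initialize the dense layers and the ReLU activation: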

Python

class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)  # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()  # ReLU activation layer
        ...

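We will add to it the class method call(), which receives an input and passes it through the two fully connected layers with a ReLU activation in between, returning an output of dimensionality equal to 512: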

Python

...
def call(self, x):
    # The input is passed into the two fully-connected layers, with a ReLU in between
    x_fc1 = self.fully_connected1(x)

    return self.fully_connected2(self.activation(x_fc1))
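
To make the shapes concrete, the following is a minimal NumPy sketch of the same two-layer computation. It is not the Keras layer itself: the weights `w1`, `b1`, `w2`, and `b2` are randomly generated stand-ins (an assumption for illustration), and the small batch size is chosen arbitrarily.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # First linear transformation followed by a ReLU activation
    hidden = np.maximum(0.0, x @ w1 + b1)
    # Second linear transformation projects back to d_model
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
batch_size, seq_length, d_model, d_ff = 2, 5, 512, 2048

# Hypothetical stand-ins for the learned Dense parameters
x = rng.standard_normal((batch_size, seq_length, d_model))
w1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

ffn_output = feed_forward(x, w1, b1, w2, b2)
print(ffn_output.shape)  # (2, 5, 512)
```

Because the same weights are applied independently at every sequence position, feeding a single position through `feed_forward` gives the same result as slicing that position out of the full output.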

The next step is to create another class, AddNormalization, that also inherits from the Layer base class in Keras and initialize a layer normalization layer:

Python

class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer
        ...

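In it, include the following class method that sums its sublayer's input and output, which it receives as inputs, and applies layer normalization to the result: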

Python

...
def call(self, x, sublayer_x):
    # The sublayer input and output need to be of the same shape to be summed
    add = x + sublayer_x

    # Apply layer normalization to the sum
    return self.layer_norm(add)
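
As a rough illustration of what LayerNorm(Sublayer Input + Sublayer Output) computes, here is a NumPy sketch. It assumes a gain of one and a bias of zero (the initial values of the learned parameters in Keras's LayerNormalization) and an epsilon of 1e-3; the function names are hypothetical and not part of the tutorial's code.

```python
import numpy as np

def layer_norm(x, epsilon=1e-3):
    # Normalize each position over its feature (last) axis.
    # Learned gain and bias are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

def add_and_norm(x, sublayer_x):
    # Residual connection followed by layer normalization
    return layer_norm(x + sublayer_x)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 512))           # sublayer input
sublayer_x = rng.standard_normal((2, 5, 512))  # sublayer output

normalized = add_and_norm(x, sublayer_x)
print(normalized.shape)  # (2, 5, 512)
```

Each position of `normalized` has approximately zero mean and unit standard deviation across its 512 features, regardless of the scale of the summed inputs.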

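The Encoder Layer

Next, you will implement the encoder layer, which the Transformer encoder will replicate identically $N$ times.

For this purpose, let's create the class EncoderLayer and initialize all the sublayers that it consists of: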

Python

class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        ...

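Here, you may notice that you have initialized instances of the FeedForward and AddNormalization classes, which you created in the previous section, and assigned them to the respective variables feed_forward and add_norm (1 and 2). The Dropout layer is self-explanatory, where rate defines the frequency at which the input units are set to 0. You created the MultiHeadAttention class in a previous tutorial, and if you saved the code into a separate Python script, then do not forget to import it. I saved mine in a Python script named multihead_attention.py, and for this reason, I need to include the line of code from multihead_attention import MultiHeadAttention.

Let's now proceed to create the class method call() that implements all the encoder sublayers: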

Python

...
def call(self, x, padding_mask, training):
    # Multi-head attention layer
    multihead_output = self.multihead_attention(x, x, x, padding_mask)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in a dropout layer
    multihead_output = self.dropout1(multihead_output, training=training)

    # Followed by an Add & Norm layer
    addnorm_output = self.add_norm1(x, multihead_output)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Followed by a fully connected layer
    feedforward_output = self.feed_forward(addnorm_output)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in another dropout layer
    feedforward_output = self.dropout2(feedforward_output, training=training)

    # Followed by another Add & Norm layer
    return self.add_norm2(addnorm_output, feedforward_output)

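In addition to the input data, the call() method can also receive a padding mask. As a brief reminder of what was said in a previous tutorial, the padding mask is necessary to suppress the zero padding in the input sequence from being processed along with the actual input values.

The same class method also receives a training flag. The Dropout layers drop values only when this flag is set to True, that is, during training.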
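
The effect of the training flag can be sketched in NumPy as follows. This is a simplified stand-in for a dropout layer rather than the Keras implementation: the function name and the random generator are assumptions, while the rescaling of the kept units by 1 / (1 - rate) mirrors how Keras's Dropout behaves during training.

```python
import numpy as np

def dropout(x, rate, training, rng):
    # During inference (or with a rate of 0) the input passes through unchanged
    if not training or rate == 0.0:
        return x
    # During training, zero out roughly `rate` of the units and
    # rescale the remaining ones so the expected sum is preserved
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 8))

train_out = dropout(x, 0.1, True, rng)   # some zeros, the rest scaled up
infer_out = dropout(x, 0.1, False, rng)  # identical to x
print(np.array_equal(infer_out, x))  # True
```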

The Transformer Encoder

The last step is to create a class for the Transformer encoder, which should be named Encoder:

Python

class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...

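The Transformer encoder receives an input sequence only after it has undergone word embedding and positional encoding. In order to compute the positional encoding, let's make use of the PositionEmbeddingFixedWeights class described by Mehreen Saeed in a separate tutorial.

As you did in the previous sections, here you will also create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result to $N$ encoder layers: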

Python

...
def call(self, input_sentence, padding_mask, training):
    # Generate the positional encoding
    pos_encoding_output = self.pos_encoding(input_sentence)
    # Expected output shape = (batch_size, sequence_length, d_model)

    # Add in a dropout layer
    x = self.dropout(pos_encoding_output, training=training)

    # Pass on the positional encoded values to each encoder layer
    for i, layer in enumerate(self.encoder_layer):
        x = layer(x, padding_mask, training)

    return x

The code listing for the full Transformer encoder is the following:

Python

from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights

# Implementing the Add & Norm Layer
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer

    def call(self, x, sublayer_x):
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x

        # Apply layer normalization to the sum
        return self.layer_norm(add)

# Implementing the Feed-Forward Layer
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)  # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()  # ReLU activation layer

    def call(self, x):
        # The input is passed into the two fully-connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)

        return self.fully_connected2(self.activation(x_fc1))

# Implementing the Encoder Layer
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)

        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)

# Implementing the Encoder
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, input_sentence, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)

        return x

Testing Out the Code

You will work with the parameter values specified in the paper Attention Is All You Need by Vaswani et al. (2017):

Python

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
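
As a quick sanity check on these values, the following sketch verifies two pieces of arithmetic: the relationship $d_k = d_v = d_{\text{model}} / h$ used in the paper, and the number of parameters in one feed-forward sublayer. The parameter count assumes that each Dense layer carries a bias vector, which is the Keras default.

```python
h = 8
d_model = 512
d_ff = 2048

# Each head works in a subspace of size d_model / h
d_k = d_v = d_model // h
print(d_k)  # 64

# Weights plus biases of the two Dense layers in one feed-forward sublayer
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
print(ffn_params)  # 2099712
```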

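As for the input sequence, you will work with dummy data for the time being, until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences: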

Python

...
enc_vocab_size = 20 # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
...

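Next, you will create a new instance of the Encoder class, assign it to the encoder variable, feed in the input arguments, and print the result. You will set the padding mask argument to None for the time being, but you will return to this when you implement the complete Transformer model: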

Python

...
encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))

Tying everything together produces the following code listing:

Python

from numpy import random

enc_vocab_size = 20 # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))

encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))

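Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.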

Python

tf.Tensor(
[[[-0.4214715  -1.1246173  -0.8444572  ...  1.6388322  -0.1890367
    1.0173352 ]
  [ 0.21662089 -0.61147404 -1.0946581  ...  1.4627445  -0.6000164
   -0.64127874]
  [ 0.46674493 -1.4155326  -0.5686513  ...  1.1790234  -0.94788337
    0.1331717 ]
  [-0.30638126 -1.9047263  -1.8556844  ...  0.9130118  -0.47863355
    0.00976158]
  [-0.22600567 -0.9702025  -0.91090447 ...  1.7457147  -0.139926
   -0.07021569]]
...

 [[-0.48047638 -1.1034104  -0.16164204 ...  1.5588069   0.08743562
   -0.08847156]
  [-0.61683714 -0.8403657  -1.0450369  ...  2.3587787  -0.76091915
   -0.02891812]
  [-0.34268388 -0.65042275 -0.6715749  ...  2.8530657  -0.33631966
    0.5215888 ]
  [-0.6288677  -1.0030932  -0.9749813  ...  2.1386387   0.0640307
   -0.69504136]
  [-1.33254    -1.2524267  -0.230098   ...  2.515467   -0.04207756
   -0.3395423 ]]], shape=(64, 5, 512), dtype=float32)

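Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Advanced Deep Learning with Python, 2019
Transformers for Natural Language Processing, 2021

Papers

Attention Is All You Need, 2017

Summary

In this tutorial, you discovered how to implement the Transformer encoder from scratch in TensorFlow and Keras.

Specifically, you learned:

The layers that form part of the Transformer encoder
How to implement the Transformer encoder from scratch

Do you have any questions?

Ask your questions in the comments below, and I will do my best to answer.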

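Inferencing the Transformer Model

Original: machinelearningmastery.com/inferencing-the-transformer-model/

We have seen how to train the Transformer model on a dataset of English and German sentence pairs, and how to plot the training and validation loss curves to diagnose the model's learning performance and decide at which epoch to run inference on the trained model. We are now ready to run inference on the trained Transformer model to translate an input sentence.

In this tutorial, you will discover how to run inference on the trained Transformer model for neural machine translation.

After completing this tutorial, you will know:

How to run inference on the trained Transformer model
How to generate text translations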

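Let's get started.

Inferencing the Transformer Model

Tutorial Overview

This tutorial is divided into three parts; they are:

Recap of the Transformer Architecture
Inferencing the Transformer Model
Testing Out the Code

Prerequisites

For this tutorial, we assume that you are already familiar with:

Recap of the Transformer Architecture

Recall that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture

Taken from "Attention Is All You Need"

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen how to implement the complete Transformer model and subsequently train it on a dataset of English and German sentence pairs. Let's now proceed to run inference on the trained model for neural machine translation.

Inferencing the Transformer Model

Let's start by creating a new instance of the TransformerModel class, which was implemented in an earlier tutorial.

You will feed into it the relevant input arguments as specified in the paper of Vaswani et al. (2017), together with the relevant information about the dataset in use: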

Python

# Define the model parameters
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of model layers' outputs
d_ff = 2048  # Dimensionality of the inner fully connected layer
n = 6  # Number of layers in the encoder stack

# Define the dataset parameters
enc_seq_length = 7  # Encoder sequence length
dec_seq_length = 12  # Decoder sequence length
enc_vocab_size = 2405  # Encoder vocabulary size
dec_vocab_size = 3858  # Decoder vocabulary size

# Create model
inferencing_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, 0)

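Here, note that the last argument passed to TransformerModel corresponds to the dropout rate for each of the Dropout layers in the Transformer model. These Dropout layers will not drop any values during inference (you will eventually set the training argument to False), so you may safely set the dropout rate to 0.

Furthermore, the TransformerModel class was saved into a separate script named model.py. Hence, to be able to use the TransformerModel class, you need to include from model import TransformerModel.

Next, let's create a class, Translate, that inherits from the Module base class (imported from TensorFlow in the complete listing below), and assign the initialized inferencing model to the variable transformer: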

Python

class Translate(Module):
    def __init__(self, inferencing_model, **kwargs):
        super(Translate, self).__init__(**kwargs)
        self.transformer = inferencing_model
        ...

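When you trained the Transformer model, you saw that you first needed to tokenize the sequences of text that were to be fed into both the encoder and the decoder. You achieved this by creating a vocabulary of words and replacing each word with its corresponding vocabulary index.

You will need to implement a similar process during the inferencing stage before feeding the sequence of text to be translated into the Transformer model.

For this purpose, you will include within the class the following load_tokenizer method, which serves to load the encoder and decoder tokenizers that were generated and saved during the training stage: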

Python

def load_tokenizer(self, name):
    with open(name, 'rb') as handle:
        return load(handle)

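It is important that you tokenize the input text at the inferencing stage using the same tokenizers generated at the training stage of the Transformer model, since these tokenizers would have already been fitted on text sequences similar to your testing data.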
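
To see why the tokenizer must be the same one, consider the following pure-Python sketch of a word-to-index vocabulary. It is a deliberate simplification and not the Keras tokenizer: the helper names are hypothetical, and indices are assigned in order of first appearance (an assumption made for illustration), with index 0 left free for padding.

```python
def build_vocabulary(sentences):
    # Assign indices in order of first appearance; index 0 is reserved for padding
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.split():
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary

def texts_to_indices(sentences, vocabulary):
    # Words missing from the vocabulary are dropped in this sketch
    return [[vocabulary[w] for w in s.split() if w in vocabulary] for s in sentences]

training_sentences = ["<START> im thirsty <EOS>", "<START> are we done <EOS>"]
vocabulary = build_vocabulary(training_sentences)

# A vocabulary built from the same sentences in a different order
other_vocabulary = build_vocabulary(list(reversed(training_sentences)))

test_sentence = ["<START> im thirsty <EOS>"]
print(texts_to_indices(test_sentence, vocabulary))        # [[1, 2, 3, 4]]
print(texts_to_indices(test_sentence, other_vocabulary))  # [[1, 6, 7, 5]]
```

The same sentence maps to different index sequences under the two vocabularies. A model trained with one mapping would therefore receive meaningless inputs if the other were used at inference time.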

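The next step is to create the class method __call__(), which will take care of the following:

Appending the start (<START>) and end-of-string (<EOS>) tokens to the input sentence: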

Python

def __call__(self, sentence):
    sentence[0] = "<START> " + sentence[0] + " <EOS>"

Loading the encoder and decoder tokenizers (in this case, saved in the enc_tokenizer.pkl and dec_tokenizer.pkl pickle files, respectively):

Python

enc_tokenizer = self.load_tokenizer('enc_tokenizer.pkl')
dec_tokenizer = self.load_tokenizer('dec_tokenizer.pkl')

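Preparing the input sentence by first tokenizing it, then padding it to the maximum phrase length, and subsequently converting it to a tensor: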

Python

encoder_input = enc_tokenizer.texts_to_sequences(sentence)
encoder_input = pad_sequences(encoder_input, maxlen=enc_seq_length, padding='post')
encoder_input = convert_to_tensor(encoder_input, dtype=int64)

Repeating a similar tokenization and tensor conversion procedure for the <START> and <EOS> tokens at the output:

Python

output_start = dec_tokenizer.texts_to_sequences(["<START>"])
output_start = convert_to_tensor(output_start[0], dtype=int64)

output_end = dec_tokenizer.texts_to_sequences(["<EOS>"])
output_end = convert_to_tensor(output_end[0], dtype=int64)

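Preparing an output array that will contain the translated text. Since you do not know the length of the translated sentence in advance, you will initialize the size of the output array to 0 but set its dynamic_size parameter to True so that it may grow past its initial size. You will then set the first value in this output array to the <START> token: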

Python

decoder_output = TensorArray(dtype=int64, size=0, dynamic_size=True)
decoder_output = decoder_output.write(0, output_start)

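Iterating up to the decoder sequence length, each time calling the Transformer model to predict an output token. Here, the training input is set to False and is passed on to each of the Transformer's Dropout layers so that no values are dropped during inference. The prediction with the highest score is then selected and written at the next available index of the output array. The for loop is terminated with a break statement as soon as an <EOS> token is predicted: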

Python

for i in range(dec_seq_length):

    prediction = self.transformer(encoder_input, transpose(decoder_output.stack()), training=False)

    prediction = prediction[:, -1, :]

    predicted_id = argmax(prediction, axis=-1)
    predicted_id = predicted_id[0][newaxis]

    decoder_output = decoder_output.write(i + 1, predicted_id)

    if predicted_id == output_end:
        break

Decoding the predicted tokens into an output list and returning it:

Python

output = transpose(decoder_output.stack())[0]
output = output.numpy()

output_str = []

# Decode the predicted tokens into an output list
for i in range(output.shape[0]):

   key = output[i]
   translation = dec_tokenizer.index_word[key]
   output_str.append(translation)

return output_str
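
The steps above can be sketched without TensorFlow as a plain greedy decoding loop. Everything here is illustrative: `toy_model` is a hypothetical stand-in for the trained Transformer that simply predicts the next vocabulary index, `index_word` is a made-up vocabulary, and `pad_post` truncates at the end of the sequence, which is a simplifying assumption rather than a description of how `pad_sequences` is configured.

```python
def pad_post(sequence, max_length, pad_id=0):
    # Right-pad with zeros (or truncate at the end) to a fixed length
    return (sequence + [pad_id] * max_length)[:max_length]

def greedy_decode(predict_next_scores, encoder_input, start_id, end_id, max_length):
    output = [start_id]
    for _ in range(max_length):
        # Score every vocabulary entry given what has been decoded so far
        scores = predict_next_scores(encoder_input, output)
        # Select the prediction with the highest score (argmax)
        predicted_id = max(range(len(scores)), key=scores.__getitem__)
        output.append(predicted_id)
        # Stop as soon as the end-of-string token is predicted
        if predicted_id == end_id:
            break
    return output

def toy_model(encoder_input, decoded_so_far):
    # Hypothetical model: always favors the index after the last decoded one
    vocab_size = 6
    next_id = min(decoded_so_far[-1] + 1, vocab_size - 1)
    return [1.0 if i == next_id else 0.0 for i in range(vocab_size)]

index_word = {1: "start", 2: "ich", 3: "bin", 4: "durstig", 5: "eos"}

encoder_input = pad_post([1, 7, 8, 2], max_length=7)
decoded = greedy_decode(toy_model, encoder_input, start_id=1, end_id=5, max_length=12)

print(encoder_input)                     # [1, 7, 8, 2, 0, 0, 0]
print([index_word[i] for i in decoded])  # ['start', 'ich', 'bin', 'durstig', 'eos']
```

Greedy selection keeps only the single best token at each step. It is simple and fast, but it cannot recover from an early mistake, which is one reason why the quality of the trained model matters so much in the tests that follow.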

The complete code listing, so far, is as follows:

Python

from pickle import load
from tensorflow import Module
from keras.preprocessing.sequence import pad_sequences
from tensorflow import convert_to_tensor, int64, TensorArray, argmax, newaxis, transpose
from model import TransformerModel

# Define the model parameters
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_model = 512  # Dimensionality of model layers' outputs
d_ff = 2048  # Dimensionality of the inner fully connected layer
n = 6  # Number of layers in the encoder stack

# Define the dataset parameters
enc_seq_length = 7  # Encoder sequence length
dec_seq_length = 12  # Decoder sequence length
enc_vocab_size = 2405  # Encoder vocabulary size
dec_vocab_size = 3858  # Decoder vocabulary size

# Create model
inferencing_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, 0)

class Translate(Module):
    def __init__(self, inferencing_model, **kwargs):
        super(Translate, self).__init__(**kwargs)
        self.transformer = inferencing_model

    def load_tokenizer(self, name):
        with open(name, 'rb') as handle:
            return load(handle)

    def __call__(self, sentence):
        # Append start and end of string tokens to the input sentence
        sentence[0] = "<START> " + sentence[0] + " <EOS>"

        # Load encoder and decoder tokenizers
        enc_tokenizer = self.load_tokenizer('enc_tokenizer.pkl')
        dec_tokenizer = self.load_tokenizer('dec_tokenizer.pkl')

        # Prepare the input sentence by tokenizing, padding and converting to tensor
        encoder_input = enc_tokenizer.texts_to_sequences(sentence)
        encoder_input = pad_sequences(encoder_input, maxlen=enc_seq_length, padding='post')
        encoder_input = convert_to_tensor(encoder_input, dtype=int64)

        # Prepare the output <START> token by tokenizing, and converting to tensor
        output_start = dec_tokenizer.texts_to_sequences(["<START>"])
        output_start = convert_to_tensor(output_start[0], dtype=int64)

        # Prepare the output <EOS> token by tokenizing, and converting to tensor
        output_end = dec_tokenizer.texts_to_sequences(["<EOS>"])
        output_end = convert_to_tensor(output_end[0], dtype=int64)

        # Prepare the output array of dynamic size
        decoder_output = TensorArray(dtype=int64, size=0, dynamic_size=True)
        decoder_output = decoder_output.write(0, output_start)

        for i in range(dec_seq_length):

            # Predict an output token
            prediction = self.transformer(encoder_input, transpose(decoder_output.stack()), training=False)

            prediction = prediction[:, -1, :]

            # Select the prediction with the highest score
            predicted_id = argmax(prediction, axis=-1)
            predicted_id = predicted_id[0][newaxis]

            # Write the selected prediction to the output array at the next available index
            decoder_output = decoder_output.write(i + 1, predicted_id)

            # Break if an <EOS> token is predicted
            if predicted_id == output_end:
                break

        output = transpose(decoder_output.stack())[0]
        output = output.numpy()

        output_str = []

        # Decode the predicted tokens into an output string
        for i in range(output.shape[0]):

            key = output[i]
            output_str.append(dec_tokenizer.index_word[key])

        return output_str

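Testing Out the Code

In order to test out the code, let's have a look at the test_dataset.txt file that you saved when preparing the dataset for training. This text file contains a set of English-German sentence pairs that were reserved for testing, from which you can select a couple of sentences to test.

Let's start with the first sentence: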

Python

# Sentence to translate
sentence = ['im thirsty']

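The corresponding ground truth translation in German for this sentence, including the <START> and <EOS> decoder tokens, should be: <START> ich bin durstig <EOS>.

If you have a look at the plotted training and validation loss curves for this model (here, you are training for 20 epochs), you may notice that the validation loss curve slows down considerably and starts plateauing at around epoch 16.

So let's proceed to load the saved model's weights at the 16th epoch and check out the prediction that is generated by the model: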

Python

# Load the trained model's weights at the specified epoch
inferencing_model.load_weights('weights/wghts16.ckpt')

# Create a new instance of the 'Translate' class
translator = Translate(inferencing_model)

# Translate the input sentence
print(translator(sentence))

Running the lines of code above produces the following translated list of words:

Python

['start', 'ich', 'bin', 'durstig', 'eos']

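This is equivalent to the ground truth German sentence that was expected (always keep in mind that since you are training the Transformer model from scratch, you may arrive at different results depending on the random initialization of the model weights).

Let's check out what would have happened if you had, instead, loaded a set of weights corresponding to a much earlier epoch, such as the 4th epoch. In this case, the generated translation is the following: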

Python

['start', 'ich', 'bin', 'nicht', 'nicht', 'eos']

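In English, this translates to: I am not not, which is clearly far off from the input English sentence, but which is expected since, at this epoch, the learning process of the Transformer model is still at a very early stage.

Let's try again with a second sentence from the test dataset: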

Python

# Sentence to translate
sentence = ['are we done']

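The corresponding ground truth translation in German for this sentence, including the <START> and <EOS> decoder tokens, should be: <START> sind wir dann durch <EOS>.

The model's translation for this sentence, using the weights saved at epoch 16, is: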

Python

['start', 'ich', 'war', 'fertig', 'eos']

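This, instead, translates to: I was ready. While this is also not equal to the ground truth, it is close to its meaning.

What the last test suggests, however, is that the Transformer model might have required many more data samples to train effectively. This is also corroborated by the validation loss, which remains relatively high at the point where the validation loss curve plateaus.

Indeed, Transformer models are notorious for being very data hungry. Vaswani et al. (2017), for example, trained their English-to-German translation model using a dataset containing around 4.5 million sentence pairs.

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs… For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences…

– Attention Is All You Need, 2017.

They reported that it took them 3.5 days on 8 P100 GPUs to train the English-to-German translation model.

In comparison, you have only trained on a dataset comprising 10,000 data samples here, split between training, validation, and test sets.

So the next task is actually for you. If you have the computational resources available, try to train the Transformer model on a much larger set of sentence pairs and see if you can obtain better results than the translations obtained here with a limited amount of data.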

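Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Advanced Deep Learning with Python, 2019
Transformers for Natural Language Processing, 2021

Papers

Attention Is All You Need, 2017

Summary

In this tutorial, you discovered how to run inference on the trained Transformer model for neural machine translation.

Specifically, you learned:

How to run inference on the trained Transformer model
How to generate text translations

Do you have any questions?

Ask your questions in the comments below, and I will do my best to answer.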

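Joining the Transformer Encoder and Decoder Plus Masking

Original: machinelearningmastery.com/joining-the-transformer-encoder-and-decoder-and-masking/

We have arrived at a point where we have implemented and tested the Transformer encoder and decoder separately, and we may now join the two together into a complete model. We will also see how to create padding and look-ahead masks by which we will suppress the input values that will not be considered in the encoder or decoder computations. Our end goal remains to apply the complete model to natural language processing (NLP).

In this tutorial, you will discover how to implement the complete Transformer model and create padding and look-ahead masks.

After completing this tutorial, you will know:

How to create a padding mask for the encoder and decoder
How to create a look-ahead mask for the decoder
How to join the Transformer encoder and decoder into a single model
How to print out a summary of the encoder and decoder layers

Let's get started.

Joining the Transformer Encoder and Decoder Plus Masking

Photo by John O'Nolan, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

Recap of the Transformer Architecture
Masking
- Creating a Padding Mask
- Creating a Look-Ahead Mask
Joining the Transformer Encoder and Decoder
Creating an Instance of the Transformer Model
- Printing Out a Summary of the Encoder and Decoder Layers

Prerequisites

For this tutorial, we assume that you are already familiar with:

Recap of the Transformer Architecture

Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture

Taken from "Attention Is All You Need"

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen how to implement the Transformer encoder and decoder separately. In this tutorial, you will join the two into a complete Transformer model and apply padding and look-ahead masking to the input values.

Let's start first by discovering how to apply masking.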

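Masking

Creating a Padding Mask

You should already be familiar with the importance of masking the input values before feeding them into the encoder and decoder.

As you will see when you proceed to train the Transformer model, the input sequences fed into the encoder and decoder will first be zero-padded up to a specific sequence length. The importance of having a padding mask is to make sure that these zero values are not processed along with the actual input values by both the encoder and decoder.

Let's create the following function to generate a padding mask for both the encoder and decoder: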

Python

from tensorflow import math, cast, float32

def padding_mask(input):
    # Create mask which marks the zero padding values in the input by a 1
    mask = math.equal(input, 0)
    mask = cast(mask, float32)

    return mask

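Upon receiving an input, this function will generate a tensor that marks with a value of one wherever the input contains a value of zero.

Hence, if you input the following array: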

Python

from numpy import array

input = array([1, 2, 3, 4, 0, 0, 0])
print(padding_mask(input))

Then the output of the padding_mask function would be the following:

Python

tf.Tensor([0. 0. 0. 0. 1. 1. 1.], shape=(7,), dtype=float32)

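Creating a Look-Ahead Mask

A look-ahead mask is required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.

For this purpose, let's create the following function to generate a look-ahead mask for the decoder: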

Python

from tensorflow import linalg, ones

def lookahead_mask(shape):
    # Mask out future entries by marking them with a 1.0
    mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

    return mask

You will pass to it the length of the decoder input. Let's make this length equal to 5, as an example:

Python

print(lookahead_mask(5))

Then the output that the lookahead_mask function returns is the following:

Python

tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)

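Again, the one values mask out the entries that should not be used. In this manner, the prediction of every word only depends on those that come before it.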
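
To illustrate how such a mask takes effect, here is a NumPy sketch that builds the same upper-triangular pattern and applies it to a matrix of attention scores before the softmax. Adding a large negative number (-1e9) at the masked positions is an assumption about how the attention layer consumes the mask; the uniform scores are dummy values.

```python
import numpy as np

def lookahead_mask(size):
    # Ones strictly above the diagonal mark the future positions
    return np.triu(np.ones((size, size)), k=1)

def masked_softmax(scores, mask):
    # Push masked scores toward minus infinity so that they vanish after the softmax
    scores = scores + mask * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = lookahead_mask(5)
weights = masked_softmax(np.zeros((5, 5)), mask)

# The third position attends equally to positions 0, 1, and 2, and not at all to 3 and 4
print(np.round(weights[2], 3))
```

Every row of `weights` still sums to one, but all the weight above the diagonal is zero, so no position can draw on information from the positions that follow it.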

Joining the Transformer Encoder and Decoder

Let's start by creating the class TransformerModel, which inherits from the Model base class in Keras:

Python

class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)
        ...

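Our first step in creating the TransformerModel class is to initialize instances of the Encoder and Decoder classes implemented earlier and assign them to the variables encoder and decoder, respectively. If you saved these classes in separate Python scripts, do not forget to import them. I saved my code in the Python scripts encoder.py and decoder.py, so I need to import them accordingly.

You will also include one final dense layer that produces the final output, as in the Transformer architecture of Vaswani et al. (2017).

Next, you will create the class method call() to feed the relevant inputs into the encoder and decoder.

A padding mask is first generated to mask the encoder input, as well as the encoder output when this is fed into the second attention block of the decoder (the one that attends over the encoder output):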

Python

...
def call(self, encoder_input, decoder_input, training):

    # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
    enc_padding_mask = self.padding_mask(encoder_input)
...

A padding mask and a look-ahead mask are then generated to mask the decoder input. These are combined through an element-wise maximum operation:

Python

...
# Create and combine padding and look-ahead masks to be fed into the decoder
dec_in_padding_mask = self.padding_mask(decoder_input)
dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)
...
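
The element-wise maximum relies on broadcasting, which the following NumPy sketch makes explicit. It assumes that the padding mask has already been reshaped to (batch size, 1, 1, sequence length), as done in the complete listing of this tutorial; the token indices in `decoder_input` are made up.

```python
import numpy as np

# Two zero-padded target sequences of length 5 (dummy token indices)
decoder_input = np.array([[1, 5, 6, 0, 0],
                          [1, 7, 0, 0, 0]])

# Padding mask of shape (batch_size, 1, 1, seq_length)
padding = (decoder_input == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

# Look-ahead mask of shape (seq_length, seq_length)
lookahead = np.triu(np.ones((5, 5), dtype=np.float32), k=1)

# Broadcasting yields one combined mask per sequence: (batch_size, 1, seq_length, seq_length)
combined = np.maximum(padding, lookahead)

print(combined.shape)  # (2, 1, 5, 5)
print(combined[0, 0])
```

A position ends up masked (value one) if either mask flags it. For the first sequence, the last two columns are masked in every row because they are padding, while the upper triangle is masked because it lies in the future.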

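The relevant inputs are then fed into the encoder and decoder, and the Transformer model output is generated by feeding the decoder output into one final dense layer: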

Python

...
# Feed the input into the encoder
encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

# Feed the encoder output into the decoder
decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

# Pass the decoder output through a final dense layer
model_output = self.model_last_layer(decoder_output)

return model_output

Combining all the steps gives the following complete code listing:

Python

from encoder import Encoder
from decoder import Decoder
from tensorflow import math, cast, float32, linalg, ones, maximum, newaxis
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense

class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super(TransformerModel, self).__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v, d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)

    def padding_mask(self, input):
        # Create mask which marks the zero padding values in the input by a 1.0
        mask = math.equal(input, 0)
        mask = cast(mask, float32)

        # The shape of the mask should be broadcastable to the shape
        # of the attention weights that it will be masking later on
        return mask[:, newaxis, newaxis, :]

    def lookahead_mask(self, shape):
        # Mask out future entries by marking them with a 1.0
        mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)

        return mask

    def call(self, encoder_input, decoder_input, training):

        # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
        enc_padding_mask = self.padding_mask(encoder_input)

        # Create and combine padding and look-ahead masks to be fed into the decoder
        dec_in_padding_mask = self.padding_mask(decoder_input)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
        dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask)

        # Feed the input into the encoder
        encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

        # Feed the encoder output into the decoder
        decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training)

        # Pass the decoder output through a final dense layer
        model_output = self.model_last_layer(decoder_output)

        return model_output

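Note that you have made a small change to the output returned by the padding_mask function. Its shape is now made broadcastable to the shape of the attention weight tensor that it will mask when you train the Transformer model.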
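
The following NumPy sketch shows why the two extra axes matter. It assumes attention scores of shape (batch size, heads, query length, key length) and assumes the mask is applied by adding -1e9 to the masked scores before the softmax; the token indices and the uniform scores are dummy values.

```python
import numpy as np

batch_size, heads, seq_length = 2, 8, 5

# Two zero-padded input sequences (dummy token indices)
encoder_input = np.array([[3, 9, 4, 0, 0],
                          [6, 2, 0, 0, 0]])

# Shape (batch_size, 1, 1, seq_length): broadcasts over heads and query positions
mask = (encoder_input == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

# Dummy attention scores, one matrix per head
scores = np.zeros((batch_size, heads, seq_length, seq_length), dtype=np.float32)

masked = scores + mask * -1e9
masked -= masked.max(axis=-1, keepdims=True)  # numerical stability
weights = np.exp(masked)
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.shape)     # (2, 8, 5, 5)
print(weights[0, 0, 0])  # weight only on the three non-padded keys
```

The single mask row per sequence is reused for every head and every query position, so the padded keys receive zero attention weight everywhere without the mask having to be tiled explicitly.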

Creating an Instance of the Transformer Model

You will work with the parameter values specified in the paper of Vaswani et al. (2017):

Python

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...

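As for the input-related parameters, you will work with dummy values for now until you arrive at the stage of training the complete Transformer model. At that point, you will use actual sentences: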

Python

...
enc_vocab_size = 20 # Vocabulary size for the encoder
dec_vocab_size = 20 # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence
...

You can now create an instance of the TransformerModel class as follows:

Python

from model import TransformerModel

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)

The complete code listing is as follows:

Python

enc_vocab_size = 20 # Vocabulary size for the encoder
dec_vocab_size = 20 # Vocabulary size for the decoder

enc_seq_length = 5  # Maximum length of the input sequence
dec_seq_length = 5  # Maximum length of the target sequence

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

# Create model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)

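Printing Out a Summary of the Encoder and Decoder Layers

You may also print out a summary of the encoder and decoder blocks of the Transformer model. Printing them out separately allows you to see the details of their individual sublayers. In order to do so, add the following line of code to the __init__() method of both the EncoderLayer and DecoderLayer classes: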

Python

self.build(input_shape=[None, sequence_length, d_model])

Then you need to add the following method to the EncoderLayer class:

Python

def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

And the following method to the DecoderLayer class:

Python

def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, input_layer, None, None, True))

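This results in the EncoderLayer class being modified as follows (the three dots under the call() method mean that its body remains the same as previously implemented):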

Python

from tensorflow.keras.layers import Input
from tensorflow.keras import Model

class EncoderLayer(Layer):
    def __init__(self, sequence_length, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.build(input_shape=[None, sequence_length, d_model])
        self.d_model = d_model
        self.sequence_length = sequence_length
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

    def call(self, x, padding_mask, training):
        ...

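Similar changes can be made to the DecoderLayer class as well.

Once you have made the necessary changes, you can proceed to create instances of the EncoderLayer and DecoderLayer classes and print out their summaries as follows: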
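
The build_graph() pattern is not specific to the Transformer classes. The following minimal sketch applies it to a toy layer, here called TinyBlock (a hypothetical name, with a made-up Dense, Dropout, Add, and LayerNormalization stack standing in for the real sub-layers), to show how wrapping call() in a functional Model exposes the sub-layers to summary(). It assumes a TensorFlow 2.x installation and is not the tutorial's EncoderLayer.

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import Add, Dense, Dropout, Input, Layer, LayerNormalization


class TinyBlock(Layer):
    """Toy stand-in for an encoder layer (hypothetical, for illustration only)."""

    def __init__(self, sequence_length, d_model, rate, **kwargs):
        super().__init__(**kwargs)
        self.sequence_length = sequence_length
        self.d_model = d_model
        self.dense = Dense(d_model)
        self.dropout = Dropout(rate)
        self.add = Add()
        self.norm = LayerNormalization()

    def build_graph(self):
        # Trace call() on a symbolic input so that each sub-layer
        # becomes a node of a functional Model that summary() can list
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, True))

    def call(self, x, training):
        out = self.dense(x)
        out = self.dropout(out, training=training)
        # Residual connection followed by layer normalization
        return self.norm(self.add([x, out]))


graph_model = TinyBlock(sequence_length=5, d_model=4, rate=0.1).build_graph()
graph_model.summary()
```

Calling summary() on the layer itself would report it as a single unit; the wrapper model is what lists the Dense, Dropout, Add, and LayerNormalization rows individually.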

Python

from encoder import EncoderLayer
from decoder import DecoderLayer

encoder = EncoderLayer(enc_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
encoder.build_graph().summary()

decoder = DecoderLayer(dec_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
decoder.build_graph().summary()

The resulting summary for the encoder is the following:

Python

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_1 (InputLayer)           [(None, 5, 512)]     0           []                               

 multi_head_attention_18 (Multi  (None, 5, 512)      131776      ['input_1[0][0]',                
 HeadAttention)                                                   'input_1[0][0]',                
                                                                  'input_1[0][0]']                

 dropout_32 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_18[0][0]']

 add_normalization_30 (AddNorma  (None, 5, 512)      1024        ['input_1[0][0]',                
 lization)                                                        'dropout_32[0][0]']             

 feed_forward_12 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_30[0][0]']   

 dropout_33 (Dropout)           (None, 5, 512)       0           ['feed_forward_12[0][0]']        

 add_normalization_31 (AddNorma  (None, 5, 512)      1024        ['add_normalization_30[0][0]',   
 lization)                                                        'dropout_33[0][0]']             

==================================================================================================
Total params: 2,233,536
Trainable params: 2,233,536
Non-trainable params: 0
__________________________________________________________________________________________________
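
The parameter counts in this summary can be reproduced by hand. The sketch below does so under three assumptions: that the multi-head attention layer holds Dense projections of sizes d_k, d_k, and d_v plus one output projection from d_v back to d_model; that the feed-forward layer holds two Dense layers (d_model to d_ff and back); and that each AddNormalization wraps a layer normalization with a learned scale and offset. These layouts are inferred from the reported numbers rather than read from the class definitions, and dense_params is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs

# Query and key projections, value projection, output projection (assumed layout)
attention = 2 * dense_params(d_model, d_k) + dense_params(d_model, d_v) + dense_params(d_v, d_model)

# Two fully connected layers: d_model -> d_ff -> d_model
feed_forward = dense_params(d_model, d_ff) + dense_params(d_ff, d_model)

# Layer normalization: one scale and one offset per feature
add_norm = 2 * d_model

encoder_total = attention + feed_forward + 2 * add_norm

print(attention, feed_forward, add_norm, encoder_total)
# 131776 2099712 1024 2233536
```

The four figures match the Param # column and the total reported above.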

Whereas the resulting summary for the decoder is the following:

Python

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_2 (InputLayer)           [(None, 5, 512)]     0           []                               

 multi_head_attention_19 (Multi  (None, 5, 512)      131776      ['input_2[0][0]',                
 HeadAttention)                                                   'input_2[0][0]',                
                                                                  'input_2[0][0]']                

 dropout_34 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_19[0][0]']

 add_normalization_32 (AddNorma  (None, 5, 512)      1024        ['input_2[0][0]',                
 lization)                                                        'dropout_34[0][0]',             
                                                                  'add_normalization_32[0][0]',   
                                                                  'dropout_35[0][0]']             

 multi_head_attention_20 (Multi  (None, 5, 512)      131776      ['add_normalization_32[0][0]',   
 HeadAttention)                                                   'input_2[0][0]',                
                                                                  'input_2[0][0]']                

 dropout_35 (Dropout)           (None, 5, 512)       0           ['multi_head_attention_20[0][0]']

 feed_forward_13 (FeedForward)  (None, 5, 512)       2099712     ['add_normalization_32[1][0]']   

 dropout_36 (Dropout)           (None, 5, 512)       0           ['feed_forward_13[0][0]']        

 add_normalization_34 (AddNorma  (None, 5, 512)      1024        ['add_normalization_32[1][0]',   
 lization)                                                        'dropout_36[0][0]']             

==================================================================================================
Total params: 2,365,312
Trainable params: 2,365,312
Non-trainable params: 0
__________________________________________________________________________________________________
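
Summing the Param # column of the decoder summary confirms the reported total. The figures in this sketch are copied from the rows above; only two add_normalization rows appear in that output, so only two are counted here. The encoder total used for the comparison is the one reported in the encoder summary.

```python
# Per-layer parameter counts, copied from the decoder summary above
decoder_param_counts = {
    "multi_head_attention_19": 131776,
    "add_normalization_32": 1024,
    "multi_head_attention_20": 131776,
    "feed_forward_13": 2099712,
    "add_normalization_34": 1024,
}

decoder_total = sum(decoder_param_counts.values())
encoder_total = 2233536  # Total reported in the encoder summary

print(decoder_total)  # 2365312
print(decoder_total - encoder_total)  # 131776
```

The difference equals the size of one multi-head attention block, which is consistent with the decoder layer containing a second attention sub-layer.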

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Advanced Deep Learning with Python, 2019
Transformers for Natural Language Processing, 2021

Papers

Attention Is All You Need, 2017

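Summary

In this tutorial, you discovered how to implement the complete Transformer model and create padding and look-ahead masks.

Specifically, you learned:

How to create a padding mask for the encoder and decoder
How to create a look-ahead mask for the decoder
How to join the Transformer encoder and decoder into a single model
How to print out a summary of the encoder and decoder layers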

Do you have any questions?

Ask your questions in the comments below, and I will do my best to answer.