TensorFlow: Machine Translation Step by Step (2): The Transformer Model (Part 1)



In the previous article, we implemented machine translation with a seq2seq + attention model.

Today we will implement machine translation with the Transformer model.


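Recap: the model idea from the previous article

  • Model idea: attention

    • Removes the fixed-length encoding bottleneck, so information passes from the encoder to the decoder without loss
  • However

    • It uses a GRU, so computation is still a bottleneck and parallelism is low. An RNN processes tokens strictly from front to back: a later word cannot be processed until the earlier words are done. Adding attention does not change this, so the model still parallelizes poorly.
    • Attention exists only between the encoder and the decoder, not within either one. Attention passes information without loss, whereas the encoder and the decoder internally rely on the hidden state of a GRU, LSTM, or plain RNN, and that hidden state loses information over long distances.
      • For example, when the decoder translates a long sentence, it may already have translated some part of the source meaning. A few hundred words later, because of that information loss, it no longer remembers doing so and translates the same meaning again, which lowers translation quality.
  • Can we remove the RNN?

  • Can we add self-attention to the input and to the output separately?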

  • 1.1 The Transformer model

    • Encoder-decoder structure
    • Multi-layer encoder-decoder
    • Positional encoding
    • Multi-head attention
      • Scaled dot-product attention
    • Add & norm
  • 1.1.1 Model structure: encoder-decoder architecture
    • A multi-layer encoder-decoder architecture
      • "Multi-layer" means two things: first, the encoder and the decoder each consist of multiple layers; second, the output of the encoder is passed to every block of the decoder
  • Model structure: encoder. Each block has two sub-layers, self-attention and a feed-forward neural network, and each sub-layer is followed by Add & Normalize
  • Model structure: attention

    • Scaled dot-product attention
      • Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    • Why divide by sqrt(d_k)?
      • To keep the dot products from growing too large: their magnitude increases with d_k, which would push the softmax into saturation
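To make the sqrt(d_k) argument concrete, here is a small NumPy sketch (not part of the original tutorial code). It assumes random, unit-variance query and key vectors, and the sizes and seed are arbitrary choices for illustration. It compares the softmax over raw dot products with the softmax over scaled dot products.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
d_k = 512                        # illustrative key dimension
q = rng.standard_normal(d_k)             # one hypothetical query
keys = rng.standard_normal((10, d_k))    # ten hypothetical keys

raw_logits = keys @ q                        # spread grows with d_k
scaled_logits = raw_logits / np.sqrt(d_k)    # spread brought back to about 1

raw_weights = softmax(raw_logits)
scaled_weights = softmax(scaled_logits)

print("std of raw logits:   ", raw_logits.std())
print("std of scaled logits:", scaled_logits.std())
print("largest weight without scaling:", raw_weights.max())
print("largest weight with scaling:   ", scaled_weights.max())
```

Without scaling, the softmax tends to put almost all of its weight on a single key, which leaves very small gradients for the other positions. Scaling keeps the distribution softer.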
  • 1.1.2 Multi-head attention
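As a rough illustration of multi-head attention, the NumPy sketch below projects the input, splits d_model into several heads, runs scaled dot-product attention per head, and concatenates the heads again. It is a simplified stand-in, not the implementation from this series: the random weight matrices `wq`, `wk`, `wv`, `wo`, the toy sizes, and the helper names are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def split_heads(x, num_heads):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
    batch, seq_len, d_model = x.shape
    depth = d_model // num_heads
    return x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    # Project the input, then split each projection into heads.
    q, k, v = (split_heads(x @ w, num_heads) for w in (wq, wk, wv))
    depth = q.shape[-1]
    # Scaled dot-product attention, computed independently per head.
    logits = q @ k.transpose(0, 1, 3, 2) / np.sqrt(depth)
    weights = softmax(logits)                 # (batch, heads, seq, seq)
    heads = weights @ v                       # (batch, heads, seq, depth)
    # Concatenate the heads back to (batch, seq_len, d_model).
    batch, _, seq_len, _ = heads.shape
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, -1)
    return concat @ wo, weights

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2                     # toy sizes
x = rng.standard_normal((1, 5, d_model))      # one sentence of 5 tokens
wq, wk, wv, wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))

mh_out, mh_weights = multi_head_self_attention(x, wq, wk, wv, wo, num_heads)
print(mh_out.shape, mh_weights.shape)         # (1, 5, 8) (1, 2, 5, 5)
```

Each head attends over the full sequence but only sees its own slice of the projected features, so different heads can learn different attention patterns.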

  • 1.1.3 Positional encoding

    • PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    • PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  • 1.1.4 Add & Normalize

    • The input of each sub-layer is added to its output (a residual connection), and the sum is then normalized
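A minimal NumPy sketch of Add & Normalize, assuming the usual formulation LayerNorm(x + Sublayer(x)). The learnable gain and bias of layer normalization are left out for brevity, and the input values are made up for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector along the last axis to zero mean and unit variance.
    # (A real layer would also apply a learnable gain and bias; omitted here.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_out)

x = np.array([[1.0, 2.0, 3.0, 4.0]])                # hypothetical sub-layer input
sublayer_out = np.array([[0.5, -0.5, 0.5, -0.5]])   # hypothetical sub-layer output
y = add_and_norm(x, sublayer_out)
print(y)
print(y.mean(axis=-1), y.std(axis=-1))
```

The residual path lets the input flow around the sub-layer unchanged, and the normalization keeps activations on a similar scale from block to block.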
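  • 1.1.5 Model structure: decoder

    • Training can be parallelized
    • Inference still has to run sequentially
    • In self-attention, an earlier word must not see later words
      • Implemented with a mask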
  • 1.1.6 Model structure: output

    • A fully connected layer projects to the vocabulary size
    • softmax
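The output stage can be sketched in a few lines of NumPy. The sizes, the random decoder output, and the weights `w_out` and `b_out` below are placeholders invented for the example, not values from the tutorial.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 3, 8, 20                   # toy sizes
decoder_out = rng.standard_normal((seq_len, d_model))     # stand-in for the decoder output
w_out = rng.standard_normal((d_model, vocab_size))        # hypothetical dense-layer weights
b_out = np.zeros(vocab_size)                              # hypothetical dense-layer bias

logits = decoder_out @ w_out + b_out      # fully connected layer: d_model -> vocab_size
probs = softmax(logits)                   # one distribution over the vocabulary per position
predicted_ids = probs.argmax(axis=-1)     # greedy choice of the most likely token id
print(probs.shape, predicted_ids.shape)   # (3, 20) (3,)
```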
  • 2.1 Transformer in practice

Steps:

# 1. loads data
# 2. preprocesses data -> dataset
# 3. tools
# 3.1 generates position embedding
# 3.2 create mask. (a. padding, b. decoder)
# 3.3 scaled_dot_product_attention
# 4. builds model
# 4.1 MultiheadAttention
# 4.2 EncoderLayer
# 4.3 DecoderLayer
# 4.4 EncoderModel
# 4.5 DecoderModel
# 4.6 Transformer
# 5. optimizer & loss
# 6. train step -> train
# 7. Evaluate and Visualize
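  • 2.1.1 Loading the data: we use the ted_hrlr_translate/pt_to_en dataset from tfds (Portuguese to English). The Transformer here works on subwords, which we build from this corpus in the next step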
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

examples, info = tfds.load('ted_hrlr_translate/pt_to_en',
                           with_info = True,
                           as_supervised = True)

train_examples, val_examples = examples['train'], examples['validation']
print(info)

Let's print a few examples to see what the dataset looks like:

for pt, en in train_examples.take(5):
    print(pt.numpy())
    print(en.numpy())
    print()

Output:

b'e quando melhoramos a procura , tiramos a \xc3\xbanica vantagem da impress\xc3\xa3o , que \xc3\xa9 a serendipidade .'
b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .'

b'mas e se estes fatores fossem ativos ?'
b'but what if it were active ?'

b'mas eles n\xc3\xa3o tinham a curiosidade de me testar .'
b"but they did n't test for curiosity ."

b'e esta rebeldia consciente \xc3\xa9 a raz\xc3\xa3o pela qual eu , como agn\xc3\xb3stica , posso ainda ter f\xc3\xa9 .'
b'and this conscious defiance is why i , as an agnostic , can still have faith .'

b"`` `` '' podem usar tudo sobre a mesa no meu corpo . ''"
b'you can use everything on the table on me .'
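
The Portuguese sentences contain non-ASCII characters, so the printed byte strings show them as UTF-8 escape sequences such as \xc3\xa3.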
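The escape sequences are just UTF-8 bytes. As a quick check in plain Python, independent of the tutorial code, decoding one of the printed byte strings recovers the accented characters:

```python
# One of the Portuguese byte strings printed above.
raw = b'mas eles n\xc3\xa3o tinham a curiosidade de me testar .'

# Decode the UTF-8 bytes into a regular Python string.
text = raw.decode('utf-8')
print(text)  # mas eles não tinham a curiosidade de me testar .
```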

2.1.2 Building the subword tokenizers from the corpus

en_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples),
    target_vocab_size = 2 ** 13)
pt_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples),
    target_vocab_size = 2 ** 13)

sample_string = "Transformer is awesome."


tokenized_string = en_tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

# Convert the subword ids back into the original string
origin_string = en_tokenizer.decode(tokenized_string)
print('The original string is {}'.format(origin_string))

assert origin_string == sample_string

for token in tokenized_string:
    print('{} --> "{}"'.format(token, en_tokenizer.decode([token])))

Output:

Tokenized string is [7915, 1248, 7946, 7194, 13, 2799, 7877]
The original string is Transformer is awesome.
7915 --> "T"
1248 --> "ran"
7946 --> "s"
7194 --> "former "
13 --> "is "
2799 --> "awesome"
7877 --> "."

2.2 Creating the dataset:

buffer_size = 20000
batch_size = 64
max_length = 40

# Convert a sentence pair into subword ids, with a start id and an end id added
def encode_to_subword(pt_sentence, en_sentence):
    pt_sequence = [pt_tokenizer.vocab_size] \
    + pt_tokenizer.encode(pt_sentence.numpy()) \
    + [pt_tokenizer.vocab_size + 1]
    en_sequence = [en_tokenizer.vocab_size] \
    + en_tokenizer.encode(en_sentence.numpy()) \
    + [en_tokenizer.vocab_size + 1]
    return pt_sequence, en_sequence


def filter_by_max_length(pt, en):
    return tf.logical_and(tf.size(pt) <= max_length,
                          tf.size(en) <= max_length)

# Wrap the Python function with tf.py_function so it can be used in Dataset.map
def tf_encode_to_subword(pt_sentence, en_sentence):
    return tf.py_function(encode_to_subword,
                          [pt_sentence, en_sentence],
                          [tf.int64, tf.int64])

# Map: convert every Portuguese and English sentence in train_examples to subword ids
train_dataset = train_examples.map(tf_encode_to_subword)
# Filter the new dataset, dropping pairs longer than max_length
train_dataset = train_dataset.filter(filter_by_max_length)
train_dataset = train_dataset.shuffle(
    buffer_size).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))
# padded_shapes=([-1], [-1]): pad every sequence to the longest length in its batch

valid_dataset = val_examples.map(tf_encode_to_subword)
valid_dataset = valid_dataset.filter(
    filter_by_max_length).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))
    

After building the dataset, let's check that the data looks right:

for pt_batch, en_batch in valid_dataset.take(5):
    print(pt_batch.shape, en_batch.shape)

Output:

(64, 38) (64, 40)
(64, 39) (64, 35)
(64, 39) (64, 39)
(64, 39) (64, 39)
(64, 39) (64, 36)
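The second dimension differs from batch to batch because each batch is padded only up to its own longest sequence. The pure-Python sketch below mimics that behavior on made-up ids. The helper names and the vocabulary size of 100 are hypothetical. The start and end ids follow the same convention as encode_to_subword (vocab_size and vocab_size + 1), and padding uses 0, which is what create_padding_mask later treats as padding.

```python
def add_start_end(ids, vocab_size):
    # Same convention as encode_to_subword:
    # start id = vocab_size, end id = vocab_size + 1.
    return [vocab_size] + ids + [vocab_size + 1]

def pad_batch(sequences, pad_id=0):
    # Pad every sequence with pad_id up to the longest one in this batch.
    longest = max(len(s) for s in sequences)
    return [s + [pad_id] * (longest - len(s)) for s in sequences]

vocab_size = 100                           # hypothetical vocabulary size
sentences = [[5, 6, 7], [8], [9, 10]]      # made-up subword ids
batch = [add_start_end(ids, vocab_size) for ids in sentences]
padded = pad_batch(batch)

for row in padded:
    print(row)
# [100, 5, 6, 7, 101]
# [100, 8, 101, 0, 0]
# [100, 9, 10, 101, 0]
```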

2.3 Some utility functions

# PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

# pos.shape: [sentence_length, 1]
# i.shape  : [1, d_model]
# result.shape: [sentence_length, d_model]

# Compute the angle for every (position, embedding dimension) pair
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000,
                               (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates


# Apply sin to the even dimensions and cos to the odd dimensions, then concatenate the two halves
def get_position_embedding(sentence_length, d_model):
    angle_rads = get_angles(np.arange(sentence_length)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # sines.shape: [sentence_length, d_model / 2]
    # cosines.shape: [sentence_length, d_model / 2]
    sines = np.sin(angle_rads[:, 0::2])
    cosines = np.cos(angle_rads[:, 1::2])
    
    # position_embedding.shape: [sentence_length, d_model]
    position_embedding = np.concatenate([sines, cosines], axis = -1)
    # position_embedding.shape: [1, sentence_length, d_model]
    position_embedding = position_embedding[np.newaxis, ...]
    
    return tf.cast(position_embedding, dtype=tf.float32)

position_embedding = get_position_embedding(50, 512)
print(position_embedding.shape)

Output:

(1, 50, 512)

The position embedding can be visualized as a heat map:

def plot_position_embedding(position_embedding):
    plt.pcolormesh(position_embedding[0], cmap = 'RdBu')
    plt.xlabel('Depth')
    plt.xlim((0, 512))
    plt.ylabel('Position')
    plt.colorbar()
    plt.show()
    
plot_position_embedding(position_embedding)

Output: [figure output_9_0.png: heat map of the position embedding, x-axis "Depth", y-axis "Position"]
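Note that get_position_embedding concatenates all sine columns followed by all cosine columns instead of interleaving them as in the formula. This only permutes the embedding dimensions. The NumPy-only sketch below re-creates that layout and checks a couple of entries against the formula. The function name `position_embedding_np` and the chosen position and frequency index are arbitrary, and it uses float64 throughout rather than the float32 cast of the TensorFlow version.

```python
import numpy as np

def position_embedding_np(sentence_length, d_model):
    # Same layout as get_position_embedding: sines first, then cosines.
    pos = np.arange(sentence_length)[:, np.newaxis]     # (sentence_length, 1)
    i = np.arange(d_model)[np.newaxis, :]               # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / float(d_model))
    sines = np.sin(angles[:, 0::2])
    cosines = np.cos(angles[:, 1::2])
    return np.concatenate([sines, cosines], axis=-1)    # (sentence_length, d_model)

d_model = 512
pe = position_embedding_np(50, d_model)
half = d_model // 2

pos, j = 5, 10                                  # arbitrary position and frequency index
freq = 1.0 / 10000 ** (2 * j / d_model)
print(np.isclose(pe[pos, j], np.sin(pos * freq)))          # sine half, column j
print(np.isclose(pe[pos, half + j], np.cos(pos * freq)))   # cosine half, column j
print(pe[0, :3], pe[0, half:half + 3])          # position 0: sines are 0, cosines are 1
```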

  • 2.4 Building the masks
# 1. padding mask, 2. look ahead

# batch_data.shape: [batch_size, seq_len]
def create_padding_mask(batch_data):
    padding_mask = tf.cast(tf.math.equal(batch_data, 0), tf.float32)
    # [batch_size, 1, 1, seq_len]
    return padding_mask[:, tf.newaxis, tf.newaxis, :]

x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)

Output:

<tf.Tensor: shape=(3, 1, 1, 5), dtype=float32, numpy=
array([[[[0., 0., 1., 1., 0.]]],


       [[[0., 0., 0., 1., 1.]]],


       [[[1., 1., 1., 0., 0.]]]], dtype=float32)>

The look-ahead mask hides future positions from each query position:

# attention_weights.shape: [3,3]
# [[1, 0, 0],
#  [4, 5, 0],
#  [7, 8, 9]]
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask # (seq_len, seq_len)

create_look_ahead_mask(3)

Output:

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]], dtype=float32)>
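For decoder self-attention, both masks are needed at once: a position should be hidden if it is padding or if it lies in the future. One common way to combine them, shown here as a NumPy sketch under that assumption, is an element-wise maximum. The helper names and the target ids are made up for the example.

```python
import numpy as np

def padding_mask_np(batch_data):
    # 1.0 where the token id is 0 (padding); shape (batch, 1, 1, seq_len).
    mask = (batch_data == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def look_ahead_mask_np(size):
    # 1.0 strictly above the diagonal (future positions); shape (size, size).
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

targets = np.array([[7, 6, 3, 0, 0]])     # hypothetical target ids, two padding slots
combined = np.maximum(padding_mask_np(targets),
                      look_ahead_mask_np(targets.shape[1]))

print(combined.shape)    # (1, 1, 5, 5)
print(combined[0, 0])
# [[0. 1. 1. 1. 1.]
#  [0. 0. 1. 1. 1.]
#  [0. 0. 0. 1. 1.]
#  [0. 0. 0. 1. 1.]
#  [0. 0. 0. 1. 1.]]
```

Broadcasting expands the (batch, 1, 1, seq_len) padding mask against the (seq_len, seq_len) look-ahead mask, so every row of the result hides both the future positions and the padding columns.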
  • 2.5 Implementing scaled dot-product attention
def scaled_dot_product_attention(q, k, v, mask):
    """
    Args:
    - q: shape == (..., seq_len_q, depth)
    - k: shape == (..., seq_len_k, depth)
    - v: shape == (..., seq_len_v, depth_v)
    - seq_len_k == seq_len_v
    - mask: shape == (..., seq_len_q, seq_len_k)
    Returns:
    - output: weighted sum
    - attention_weights: weights of attention
    """
    
    # matmul_qk.shape: (..., seq_len_q, seq_len_k)
    # transpose_b: transpose the second matrix before multiplying
    matmul_qk = tf.matmul(q, k, transpose_b = True)
    
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    
    if mask is not None:
        # Add a large negative value so masked positions get weights close to 0 after softmax
        scaled_attention_logits += (mask * -1e9)
    
    # attention_weights.shape: (..., seq_len_q, seq_len_k)
    attention_weights = tf.nn.softmax(
        scaled_attention_logits, axis = -1)
    
    # output.shape: (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)
    
    return output, attention_weights

def print_scaled_dot_product_attention(q, k, v):
    temp_out, temp_att = scaled_dot_product_attention(q, k, v, None)
    print("Attention weights are:")
    print(temp_att)
    print("Output is:")
    print(temp_out)

Let's write a few temporary matrices to test whether the code is correct:

temp_k = tf.constant([[10, 0, 0],
                      [0, 10, 0],
                      [0, 0, 10],
                      [0, 0, 10]], dtype=tf.float32) # (4, 3)

temp_v = tf.constant([[1, 0],
                      [10, 0],
                      [100, 5],
                      [1000, 6]], dtype=tf.float32) # (4, 2)

temp_q1 = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)
np.set_printoptions(suppress=True)
print_scaled_dot_product_attention(temp_q1, temp_k, temp_v)

Output:

Attention weights are:
tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[10.  0.]], shape=(1, 2), dtype=float32)
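The test above passes None as the mask. As an extra check, the following NumPy re-implementation (a sketch for 2-D inputs only, not the TensorFlow function itself) uses the same keys and values with a query that matches the third and fourth keys equally, first without a mask and then with a hypothetical mask that hides the fourth key.

```python
import numpy as np

def attention_np(q, k, v, mask=None):
    # Scaled dot-product attention for 2-D inputs.
    logits = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        # Masked positions get a very negative logit, hence weight ~0.
        logits = logits + mask * -1e9
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ v, weights

k = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10], [0, 0, 10]], dtype=np.float32)
v = np.array([[1, 0], [10, 0], [100, 5], [1000, 6]], dtype=np.float32)
q = np.array([[0, 0, 10]], dtype=np.float32)     # aligns with keys 3 and 4

out, w = attention_np(q, k, v)
print(w)      # weight is split evenly between the two matching keys
print(out)    # average of the two matching values: about [550, 5.5]

mask = np.array([[0, 0, 0, 1]], dtype=np.float32)   # hypothetical mask hiding key 4
out_m, w_m = attention_np(q, k, v, mask)
print(w_m)    # all weight moves to key 3
print(out_m)  # about [100, 5]
```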

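In the next article, we will implement multi-head attention, the feed-forward layer, EncoderLayer, DecoderLayer, EncoderModel, DecoderModel, and the Transformer itself, then train the model with a custom learning rate schedule and loss function, create and apply the masks, run prediction, visualize attention, and show machine translation examples.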
