In the previous article, we implemented machine translation with a seq2seq + attention model.
Today we will implement machine translation with the Transformer model.
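The idea behind the previous article's model
Model idea: Attention
- It removes the fixed-length encoding bottleneck, so information passes from the Encoder to the Decoder without loss.
However:
- It uses a GRU, so computation remains a bottleneck and parallelism is low. A GRU processes the sequence from front to back: a later word cannot be processed until the earlier words have been. So for an RNN, even with attention added, the degree of parallelism is still insufficient.
- Attention exists only between the Encoder and the Decoder; neither the encoder nor the decoder has attention within itself. Attention is a lossless way to pass information, whereas the encoder and the decoder each rely on the hidden state of their own GRU, LSTM, or plain RNN to carry information, and that loses information over long distances.
- For example, when a decoder translates a fairly long sentence, it may already have translated some meaning from the source sentence. A few hundred words later, because of the information loss, it no longer remembers having done so and translates that meaning again, which lowers translation quality.
Can we remove the RNN?
Can we add self-attention to the input and to the output separately?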
1.1 The Transformer model
- Encoder-Decoder structure
- Multi-layer Encoder-Decoder
- Positional encoding
- Multi-head attention
- Scaled dot-product attention
- Add & Norm
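1.1.1 Model structure: Encoder-Decoder architecture
- It is a multi-layer encoder-decoder architecture.
- "Multi-layer" has two meanings: first, the encoder and the decoder are each a stack of layers; second, the encoder's output is passed to every block of the decoder.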
Model structure: Encoder. Each block consists of two sub-layers, self-attention and a feed-forward neural network, and each of the two sub-layers is followed by Add & Normalize.
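Model structure: Attention
- Scaled dot-product attention
- Why divide by the square root of d_k?
- To keep the dot products from growing too large as d_k increases.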
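For reference, the scaled dot-product attention named above can be written in the standard notation, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. The second line sketches the usual argument for the scaling, under the assumption that the components of a query q and a key k are independent with mean 0 and variance 1.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{Var}(q \cdot k) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = d_k
\quad\Longrightarrow\quad
\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Under that assumption, unscaled logits spread out more as d_k grows, which pushes the softmax toward a near one-hot output with very small gradients. Dividing by the square root of d_k keeps the logits at a comparable scale for any d_k.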
1.1.2 Multi-head attention
1.1.3 Positional encoding
1.1.4 Add & Normalize
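1.1.5 Model structure: Decoder
- Training can be parallelized.
- Inference still has to run sequentially.
- In self-attention, a word must not see the words that come after it.
- This is implemented with a mask.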
1.1.6 Model structure: Output
- A fully connected layer projects to the vocabulary size.
- softmax
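The two output steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the article's TensorFlow model: the toy sizes, the random `decoder_output`, and the `weights` and `bias` names are all hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 5, 8, 20  # hypothetical toy sizes

# Stand-in for the decoder's final hidden states: one vector per position.
decoder_output = rng.normal(size=(seq_len, d_model))

# Fully connected layer: project every position to vocabulary size.
weights = rng.normal(size=(d_model, vocab_size))
bias = np.zeros(vocab_size)
logits = decoder_output @ weights + bias

# Softmax turns the logits into a distribution over the vocabulary.
probs = softmax(logits)
predicted_ids = probs.argmax(axis=-1)
print(probs.shape, predicted_ids.shape)
```

Each row of `probs` sums to 1, and `predicted_ids` holds the most likely token id for each position.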
2.1 Transformer in practice
Steps:
# 1. loads data
# 2. preprocesses data -> dataset
# 3. tools
# 3.1 generates position embedding
# 3.2 create mask. (a. padding, b. decoder)
# 3.3 scaled_dot_product_attention
# 4. builds model
# 4.1 MultiheadAttention
# 4.2 EncoderLayer
# 4.3 DecoderLayer
# 4.4 EncoderModel
# 4.5 DecoderModel
# 4.6 Transformer
# 5. optimizer & loss
# 6. train step -> train
# 7. Evaluate and Visualize
2.1.1 Loading the data: we use a dataset from tfds. The data is processed at the subword level, and the Transformer model here is built on subwords.
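import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the Portuguese-to-English TED talk translation dataset.
examples, info = tfds.load('ted_hrlr_translate/pt_to_en',
                           with_info=True,
                           as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']
print(info)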
Print a few examples to see what the dataset looks like:
for pt, en in train_examples.take(5):
    print(pt.numpy())
    print(en.numpy())
    print()
Output:
b'e quando melhoramos a procura , tiramos a \xc3\xbanica vantagem da impress\xc3\xa3o , que \xc3\xa9 a serendipidade .'
b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .'
b'mas e se estes fatores fossem ativos ?'
b'but what if it were active ?'
b'mas eles n\xc3\xa3o tinham a curiosidade de me testar .'
b"but they did n't test for curiosity ."
b'e esta rebeldia consciente \xc3\xa9 a raz\xc3\xa3o pela qual eu , como agn\xc3\xb3stica , posso ainda ter f\xc3\xa9 .'
b'and this conscious defiance is why i , as an agnostic , can still have faith .'
b"`` `` '' podem usar tudo sobre a mesa no meu corpo . ''"
b'you can use everything on the table on me .'
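The Portuguese text contains escape sequences: the sentences are printed as byte strings, so accented characters show up as UTF-8 byte escapes such as \xc3\xa3.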
2.1.2 Building subword tokenizers from the corpus
en_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples),
    target_vocab_size=2 ** 13)
pt_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples),
    target_vocab_size=2 ** 13)

sample_string = "Transformer is awesome."
tokenized_string = en_tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

# Decode the subword ids back into the original string.
origin_string = en_tokenizer.decode(tokenized_string)
print('The original string is {}'.format(origin_string))
assert origin_string == sample_string

# Show which piece of text each subword id stands for.
for token in tokenized_string:
    print('{} --> "{}"'.format(token, en_tokenizer.decode([token])))
Output:
Tokenized string is [7915, 1248, 7946, 7194, 13, 2799, 7877]
The original string is Transformer is awesome.
7915 --> "T"
1248 --> "ran"
7946 --> "s"
7194 --> "former "
13 --> "is "
2799 --> "awesome"
7877 --> "."
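To make the idea of subword pieces concrete, here is a toy sketch that splits a string by greedy longest match against a small vocabulary. It is an assumption-laden illustration only: `greedy_subword_split` and `toy_vocab` are hypothetical names, the vocabulary is hand-picked to mirror the pieces printed above, and this is not claimed to be the algorithm `SubwordTextEncoder` uses internally.

```python
def greedy_subword_split(text, vocab):
    """Split text left to right, always taking the longest piece in vocab."""
    max_len = max(len(piece) for piece in vocab)
    pieces = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches: fall back to a single character.
            pieces.append(text[start])
            start += 1
    return pieces

# Hypothetical vocabulary, chosen to match the pieces shown in the output above.
toy_vocab = {"T", "ran", "s", "former ", "is ", "awesome", "."}

pieces = greedy_subword_split("Transformer is awesome.", toy_vocab)
print(pieces)  # ['T', 'ran', 's', 'former ', 'is ', 'awesome', '.']

# Joining the pieces restores the original string, as decode() does above.
assert "".join(pieces) == "Transformer is awesome."
```

Because the pieces keep their trailing spaces, concatenating them is enough to recover the text, which is why encoding followed by decoding round-trips.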
2.2 Creating the dataset:
buffer_size = 20000
batch_size = 64
max_length = 40

# Convert a sentence pair into subword id sequences.
# The id vocab_size marks the start of a sequence and vocab_size + 1 its end.
def encode_to_subword(pt_sentence, en_sentence):
    pt_sequence = [pt_tokenizer.vocab_size] \
        + pt_tokenizer.encode(pt_sentence.numpy()) \
        + [pt_tokenizer.vocab_size + 1]
    en_sequence = [en_tokenizer.vocab_size] \
        + en_tokenizer.encode(en_sentence.numpy()) \
        + [en_tokenizer.vocab_size + 1]
    return pt_sequence, en_sequence

# Keep only pairs in which both sequences have at most max_length tokens.
def filter_by_max_length(pt, en):
    return tf.logical_and(tf.size(pt) <= max_length,
                          tf.size(en) <= max_length)

# Wrap the Python function with py_function so it can be used in Dataset.map.
def tf_encode_to_subword(pt_sentence, en_sentence):
    return tf.py_function(encode_to_subword,
                          [pt_sentence, en_sentence],
                          [tf.int64, tf.int64])

# Map: convert every Portuguese and English sentence in train_examples to subword ids.
train_dataset = train_examples.map(tf_encode_to_subword)
# Filter the new dataset by length.
train_dataset = train_dataset.filter(filter_by_max_length)
train_dataset = train_dataset.shuffle(
    buffer_size).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))
# padded_shapes=([-1], [-1]): pad each dimension to the longest length in the batch.

valid_dataset = val_examples.map(tf_encode_to_subword)
valid_dataset = valid_dataset.filter(
    filter_by_max_length).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))
After building the dataset, check that the data looks right:
for pt_batch, en_batch in valid_dataset.take(5):
    print(pt_batch.shape, en_batch.shape)
Output:
(64, 38) (64, 40)
(64, 39) (64, 35)
(64, 39) (64, 39)
(64, 39) (64, 39)
(64, 39) (64, 36)
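The batch shapes above come from three steps: wrapping each sequence in start and end ids, dropping sequences longer than `max_length`, and padding each batch to its longest sequence. The dependency-free sketch below mirrors those steps on toy data. The helper names `add_start_end` and `pad_batch`, the tiny `vocab_size`, and the assumption that padding uses id 0 (the `padded_batch` default for integer tensors) are illustrative, not taken from the article's code.

```python
def add_start_end(token_ids, vocab_size):
    """Wrap a sequence with a start id (vocab_size) and an end id (vocab_size + 1)."""
    return [vocab_size] + list(token_ids) + [vocab_size + 1]

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

vocab_size = 10  # hypothetical toy vocabulary size
max_length = 6   # hypothetical; the article uses 40

raw = [[3, 4], [5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7]]
wrapped = [add_start_end(seq, vocab_size) for seq in raw]

# The third sequence has 9 ids after wrapping, so the length filter drops it.
kept = [seq for seq in wrapped if len(seq) <= max_length]

batch = pad_batch(kept)
print(batch)  # [[10, 3, 4, 11, 0, 0], [10, 5, 6, 7, 8, 11]]
```

Because padding only goes up to the longest sequence in each batch, the second dimension varies from batch to batch, as in the shapes printed above.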
2.3 Some utility functions
# PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

# pos.shape: [sentence_length, 1]
# i.shape  : [1, d_model]
# result.shape: [sentence_length, d_model]
# Compute the angle for every (position, embedding dimension) pair.
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000,
                               (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

# Apply sine to the even dimensions and cosine to the odd dimensions,
# then concatenate the two halves.
def get_position_embedding(sentence_length, d_model):
    angle_rads = get_angles(np.arange(sentence_length)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # sines.shape: [sentence_length, d_model / 2]
    # cosines.shape: [sentence_length, d_model / 2]
    sines = np.sin(angle_rads[:, 0::2])
    cosines = np.cos(angle_rads[:, 1::2])

    # position_embedding.shape: [sentence_length, d_model]
    position_embedding = np.concatenate([sines, cosines], axis=-1)
    # position_embedding.shape: [1, sentence_length, d_model]
    position_embedding = position_embedding[np.newaxis, ...]

    return tf.cast(position_embedding, dtype=tf.float32)

position_embedding = get_position_embedding(50, 512)
print(position_embedding.shape)
Output:
(1, 50, 512)
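As a cross-check on the PE formula in the comments, the sketch below evaluates it element by element with only the standard library. It is a reference sketch, not the article's implementation: `positional_encoding` is a hypothetical name, and it interleaves sine and cosine columns exactly as the formula is written, whereas `get_position_embedding` concatenates all sine columns followed by all cosine columns (the same values in a different column order).

```python
import math

def positional_encoding(sentence_length, d_model):
    """Return a sentence_length x d_model table following the PE formula."""
    table = []
    for pos in range(sentence_length):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(4, 6)

# Position 0 has angle 0 everywhere, so it alternates sin(0) = 0 and cos(0) = 1.
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Every value lies in [-1, 1], and the low dimensions oscillate fastest with position.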
def plot_position_embedding(position_embedding):
    plt.pcolormesh(position_embedding[0], cmap='RdBu')
    plt.xlabel('Depth')
    plt.xlim((0, 512))
    plt.ylabel('Position')
    plt.colorbar()
    plt.show()

plot_position_embedding(position_embedding)
Output: (figure: heat map of the position embedding, with Depth on the x-axis and Position on the y-axis)
2.4 Building the masks
# 1. padding mask, 2. look-ahead mask
# batch_data.shape: [batch_size, seq_len]
def create_padding_mask(batch_data):
    # 1 where the token id is 0 (padding), 0 everywhere else.
    padding_mask = tf.cast(tf.math.equal(batch_data, 0), tf.float32)
    # [batch_size, 1, 1, seq_len]
    return padding_mask[:, tf.newaxis, tf.newaxis, :]

x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)
Output:
<tf.Tensor: shape=(3, 1, 1, 5), dtype=float32, numpy=
array([[[[0., 0., 1., 1., 0.]]],
[[[0., 0., 0., 1., 1.]]],
[[[1., 1., 1., 0., 0.]]]], dtype=float32)>
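# Example for attention_weights with shape [3, 3]: each position may only
# attend to itself and to earlier positions, so only the lower triangle is kept:
# [[1, 0, 0],
#  [4, 5, 0],
#  [7, 8, 9]]
def create_look_ahead_mask(size):
    # 1 marks a position that must be hidden, 0 one that stays visible.
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

create_look_ahead_mask(3)
Output: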
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.]], dtype=float32)>
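In decoder self-attention a position typically has to be hidden if either mask hides it. One common way to express that, shown here as a NumPy sketch rather than as the article's code, is an element-wise maximum of the two masks. The functions `padding_mask` and `look_ahead_mask` are hypothetical NumPy stand-ins for `create_padding_mask` and `create_look_ahead_mask`.

```python
import numpy as np

def padding_mask(batch, pad_id=0):
    """1.0 where the token is padding; shape [batch_size, 1, 1, seq_len]."""
    return (batch == pad_id).astype(np.float32)[:, np.newaxis, np.newaxis, :]

def look_ahead_mask(size):
    """1.0 strictly above the diagonal, i.e. for positions in the future."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

batch = np.array([[7, 6, 0],   # last token is padding
                  [1, 2, 3]])  # no padding

# Broadcasting [2, 1, 1, 3] against [3, 3] gives [2, 1, 3, 3]:
# a position is masked if it is padding OR lies in the future.
combined = np.maximum(padding_mask(batch), look_ahead_mask(3))
print(combined.shape)  # (2, 1, 3, 3)
print(combined[0, 0])
```

For the first sentence the last column is masked in every row because that token is padding. For the second sentence the result equals the plain look-ahead mask.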
2.5 Implementing scaled dot-product attention
def scaled_dot_product_attention(q, k, v, mask):
    """
    Args:
    - q: shape == (..., seq_len_q, depth)
    - k: shape == (..., seq_len_k, depth)
    - v: shape == (..., seq_len_v, depth_v)
    - seq_len_k == seq_len_v
    - mask: shape == (..., seq_len_q, seq_len_k)
    Returns:
    - output: weighted sum
    - attention_weights: weights of attention
    """

    # matmul_qk.shape: (..., seq_len_q, seq_len_k)
    # transpose_b: whether to transpose the second matrix
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # Masked positions get a large negative logit,
        # so their weight after softmax is close to 0.
        scaled_attention_logits += (mask * -1e9)

    # attention_weights.shape: (..., seq_len_q, seq_len_k)
    attention_weights = tf.nn.softmax(
        scaled_attention_logits, axis=-1)

    # output.shape: (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)

    return output, attention_weights

def print_scaled_dot_product_attention(q, k, v):
    temp_out, temp_att = scaled_dot_product_attention(q, k, v, None)
    print("Attention weights are:")
    print(temp_att)
    print("Output is:")
    print(temp_out)
Build a few temporary matrices to test that the code is correct:
temp_k = tf.constant([[10, 0, 0],
                      [0, 10, 0],
                      [0, 0, 10],
                      [0, 0, 10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[1, 0],
                      [10, 0],
                      [100, 5],
                      [1000, 6]], dtype=tf.float32)  # (4, 2)

# This query lines up with the second key, so the output should be the second value.
temp_q1 = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)

np.set_printoptions(suppress=True)
print_scaled_dot_product_attention(temp_q1, temp_k, temp_v)
Output:
Attention weights are:
tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[10. 0.]], shape=(1, 2), dtype=float32)
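The test above passes `None` as the mask. To see what the mask branch does, here is a NumPy sketch of the same computation with the same keys and values, a query that matches the last two keys equally, and a mask that hides the last key. The `attention` and `softmax` helpers and the `hide_last` mask are illustrative re-implementations, not the article's TensorFlow function.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention for 2-D q, k, v; mask uses 1 for hidden keys."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9  # hidden keys get almost zero weight
    weights = softmax(logits)
    return weights @ v, weights

k = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10], [0, 0, 10]], dtype=np.float64)
v = np.array([[1, 0], [10, 0], [100, 5], [1000, 6]], dtype=np.float64)
q = np.array([[0, 0, 10]], dtype=np.float64)  # matches the third and fourth keys

out_free, w_free = attention(q, k, v)
hide_last = np.array([[0, 0, 0, 1]], dtype=np.float64)
out_masked, w_masked = attention(q, k, v, hide_last)

np.set_printoptions(suppress=True)
print(w_free, out_free)      # weights about [0, 0, 0.5, 0.5], output about [550, 5.5]
print(w_masked, out_masked)  # weights about [0, 0, 1, 0], output about [100, 5]
```

Without the mask the weight is split evenly between the two matching keys and the output is the average of their values. With the last key hidden, all of the weight moves to the third key.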
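In the next article, we will implement multi-head attention, the feed-forward layer, EncoderLayer, DecoderLayer, EncoderModel, DecoderModel, and the Transformer itself. We will then train the model with a custom learning rate and loss function, create and apply the masks, run prediction, visualize attention, and show machine translation examples.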