TensorFlow: A Hands-On Guide to Machine Translation (Part 1), Seq2seq + Attention


This post is day 13 of my participation in the Juejin Daily Update Plan, August writing challenge.


Today we look at how to do machine translation with TensorFlow, building the model with Seq2seq plus attention.


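  • 1.1 Model idea: Attention

    • 1.1.1 The original seq2seq
      • Encoder-Decoder structure
      • Drawbacks:
        • The fixed-length encoding is an information bottleneck
        • The longer the input, the more the information fed into the RNN early on is diluted, so it is not preserved in the final state s, which hurts translation quality
    • 1.1.2 Attention-based seq2seq
      • Encoder-Decoder structure
      • The output of every encoder step takes part in the computation of every decoder step
      • EO: the encoder outputs at each position
      • H: the decoder hidden state at a given step
      • FC: a fully connected layer
      • X: one input to the decoder
      • score = FC(tanh(FC(EO) + FC(H))) [Bahdanau attention]
      • Alternative: score = EO W H [Luong attention]
      • attention_weights = softmax(score, axis = 1)
      • context = sum(attention_weights * EO, axis = 1)
      • final_input = concat(context, embed(X))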
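The Bahdanau formulas in the list above can be traced by hand on a tiny example. The following is a minimal sketch in plain Python for a single sentence, assuming toy values for EO, H and the three weight sets (all hypothetical) and leaving out the bias terms that a real fully connected layer would add.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def bahdanau_attention(encoder_outputs, decoder_hidden, w1, w2, v):
    """score_i = v . tanh(W1 @ EO_i + W2 @ H), then softmax and weighted sum."""
    projected_hidden = matvec(w2, decoder_hidden)
    scores = []
    for eo in encoder_outputs:
        projected_eo = matvec(w1, eo)
        hidden_layer = [math.tanh(a + b) for a, b in zip(projected_eo, projected_hidden)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden_layer)))
    weights = softmax(scores)  # one weight per encoder position
    units = len(encoder_outputs[0])
    context = [sum(w * eo[k] for w, eo in zip(weights, encoder_outputs))
               for k in range(units)]
    return context, weights

# Hypothetical toy values: 3 encoder positions with 2 units each.
EO = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
V = [1.0, 1.0]

context, weights = bahdanau_attention(EO, H, W1, W2, V)
print(weights)  # sums to 1 across the 3 positions
print(context)  # weighted average of the encoder outputs
```

The context vector has the same size as one encoder output, which is why it can be concatenated with the embedded decoder input in the last formula.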
  • 2.1 Data preprocessing and loading

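import unicodedata

def unicode_to_ascii(s):
    # NFD: decompose each accented character into a base character plus combining marks
    # Mn: the Unicode category for nonspacing marks (accents), which are dropped
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')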

en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"

print(unicode_to_ascii(en_sentence))
print(unicode_to_ascii(sp_sentence))

Output:

May I borrow this book?
¿Puedo tomar prestado este libro?
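Neither sample sentence contains an accent, so the output above is identical to the input. A small sketch with an accented word makes the effect of NFD visible; the helper name strip_accents is made up for this illustration and does the same thing as unicode_to_ascii.

```python
import unicodedata

def strip_accents(text):
    """Decompose with NFD, then drop combining marks (category 'Mn')."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

word = 'fr\u00edo'  # "frio" written with a precomposed i-acute
decomposed = unicodedata.normalize('NFD', word)

print(len(word), len(decomposed))                       # 4 5
print([unicodedata.category(ch) for ch in decomposed])  # the accent shows up as 'Mn'
print(strip_accents(word))                              # frio
```

The inverted question mark is punctuation, not a combining mark, so it passes through unchanged.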

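2.1.2 Separate punctuation from words and remove extra spaces

import re

# Separate punctuation from words and remove extra spaces
def preprocess_sentence(w):
    # Lowercase, trim, and strip accents
    w = unicode_to_ascii(w.lower().strip())

    # n-with-white-spaces-keeping-punctuation
    # Put a space on both sides of each punctuation mark
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # Collapse runs of spaces (and double quotes) into a single space
    w = re.sub(r'[" "]+', " ", w)

    # Replace everything except letters and the listed punctuation with a space
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

    # Strip leading and trailing whitespace
    w = w.strip()

    # Add a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w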

print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

Output:

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'
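To see what each regular expression contributes, the cleaning steps can be traced one at a time. This is a sketch that reuses the same patterns as preprocess_sentence but skips accent stripping and the start/end tokens; trace_preprocess is a hypothetical helper, not part of the original code.

```python
import re

def trace_preprocess(text):
    """Return the intermediate string after each cleaning step."""
    steps = []
    w = text.lower().strip()
    steps.append(w)
    w = re.sub(r"([?.!,¿])", r" \1 ", w)    # pad punctuation with spaces
    steps.append(w)
    w = re.sub(r'[" "]+', " ", w)           # collapse runs of spaces
    steps.append(w)
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)  # anything else becomes a space
    steps.append(w)
    w = w.strip()                           # trim both ends
    steps.append(w)
    return steps

for step in trace_preprocess("Wait,  what?!"):
    print(repr(step))
```

The last line printed is 'wait , what ? !': every punctuation mark ends up as its own space-separated token, which is what lets the tokenizer later split on spaces alone.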

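2.1.3 Read the data from the file

# Read the data from the file
def create_dataset(path):
    # Read every line as UTF-8, dropping leading and trailing whitespace
    with open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')

    # Split each line into two parts, English first and Spanish second
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines]

    # Transpose the list of pairs into (english_sentences, spanish_sentences)
    return zip(*word_pairs)

# en_spa_file_path must point to the tab-separated English-Spanish file
en, sp = create_dataset(en_spa_file_path)
print(en[-1])
print(sp[-1])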

Output:

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>

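Unpacking example:

The * operator unpacks the list so that each pair becomes a separate argument, and zip then groups the elements at matching positions together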

a = [(1, 2), (3, 4), (5, 6)]
c, d = zip(*a)
print(c, d)

Output:

(1, 3, 5) (2, 4, 6)
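The same idiom applied to sentence pairs looks like this. It is a small sketch with made-up pairs, showing that zip(*pairs) turns rows into columns and that zipping the columns again restores the rows.

```python
pairs = [("go .", "ve ."), ("hi .", "hola ."), ("run !", "corre !")]

# zip(*pairs) is the same as zip(pairs[0], pairs[1], pairs[2]).
english, spanish = zip(*pairs)
print(english)
print(spanish)

# Zipping the two columns again restores the original pairs.
restored = list(zip(english, spanish))
print(restored == pairs)
```

This only unpacks into exactly two names when every line of the file has exactly two tab-separated fields.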

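2.1.4 The data we have read is text. For the model to consume it, the text must be converted to ids, so we convert it into id sequences

  • tf.keras.preprocessing.text.Tokenizer:

    • num_words: limits the vocabulary size; defaults to None, meaning no limit
    • filters: a blacklist of characters to strip out
    • split: the separator used to split text into words
  • fit_on_texts: counts word frequencies and builds the vocabulary

  • lang_tokenizer.texts_to_sequences: converts text to ids

  • tf.keras.preprocessing.sequence.pad_sequences: pads the sequences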

import tensorflow as tf

def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    # Count word frequencies and build the vocabulary
    lang_tokenizer.fit_on_texts(lang)
    # Convert text to ids
    tensor = lang_tokenizer.texts_to_sequences(lang)
    # Pad every sequence with zeros at the end, up to the longest length
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, lang_tokenizer

def max_length(tensor):
    return max(len(t) for t in tensor)

# Use the first 30,000 pairs; Spanish is the input and English the output
input_tensor, input_tokenizer = tokenize(sp[0:30000])
output_tensor, output_tokenizer = tokenize(en[0:30000])

max_length_input = max_length(input_tensor)
max_length_output = max_length(output_tensor)
print(max_length_input, max_length_output)

Output:

16 11
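The three steps performed by tokenize (build a vocabulary, map words to ids, pad) can be imitated in a few lines of plain Python. This is only a sketch of the idea, not the Keras implementation; fit_vocab, texts_to_ids and pad_post are hypothetical names, and ids are assumed to start at 1 with the most frequent word first so that 0 stays free for padding.

```python
from collections import Counter

def fit_vocab(texts):
    """Assign ids starting at 1, most frequent word first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.split(' '))
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

def texts_to_ids(texts, word_index):
    """Replace every word with its id."""
    return [[word_index[w] for w in text.split(' ')] for text in texts]

def pad_post(sequences, pad_id=0):
    """Append pad_id to each sequence until all match the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

texts = ["<start> hola . <end>", "<start> que tal ? <end>"]
word_index = fit_vocab(texts)
padded = pad_post(texts_to_ids(texts, word_index))
print(word_index)
print(padded)
```

Because the start and end tokens occur in every sentence, they receive the smallest ids, and the shorter sentence is padded with a trailing 0.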

2.1.5 Splitting into training and evaluation sets:

from sklearn.model_selection import train_test_split

# Creating training and validation sets using an 80-20 split
input_train, input_eval, output_train, output_eval = train_test_split(input_tensor, output_tensor, test_size=0.2)

# Show length
len(input_train), len(input_eval), len(output_train), len(output_eval)

Output:

(24000, 6000, 24000, 6000)

In total there are 24,000 training samples and 6,000 evaluation samples
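What matters in this split is that inputs and outputs are shuffled together, so each Spanish sentence stays paired with its English translation. The sketch below shows that idea in plain Python; paired_split is a hypothetical helper and is not how scikit-learn implements train_test_split.

```python
import random

def paired_split(inputs, outputs, test_size=0.2, seed=0):
    """Shuffle indices once, then cut both lists at the same position."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    n_eval = int(len(indices) * test_size)
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, eval_idx),
            pick(outputs, train_idx), pick(outputs, eval_idx))

inputs = list(range(10))
outputs = [x * 10 for x in inputs]  # outputs[i] belongs to inputs[i]
in_train, in_eval, out_train, out_eval = paired_split(inputs, outputs)
print(len(in_train), len(in_eval))
print(all(o == i * 10 for i, o in zip(in_train, out_train)))
```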

2.1.6 Next, let's check that the tokenizer works correctly:

def convert(example, tokenizer):
    for t in example:
        if t != 0:
            print ("%d ----> %s" % (t, tokenizer.index_word[t]))
            
print("Input Language; index to word mapping")
convert(input_train[0], input_tokenizer)
print()
print("Target Language; index to word mapping")
convert(output_train[0], output_tokenizer)

Output:

Input Language; index to word mapping
1 ----> <start>
6 ----> ¿
11 ----> que
4356 ----> escondiste
5 ----> ?
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
32 ----> what
42 ----> did
6 ----> you
444 ----> hide
7 ----> ?
2 ----> <end>

2.1.7 This completes the conversion from a text dataset to an id dataset, so we can now build the tf.data dataset

def make_dataset(input_tensor, output_tensor,
                 batch_size, epochs, shuffle):
    dataset = tf.data.Dataset.from_tensor_slices(
        (input_tensor, output_tensor))
    if shuffle:
        dataset = dataset.shuffle(30000)
    # drop_remainder: what to do when the leftover samples cannot fill a batch; True discards them
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder = True)
    return dataset

batch_size = 64
epochs = 20

train_dataset = make_dataset(
    input_train, output_train, batch_size, epochs, True)
eval_dataset = make_dataset(
    input_eval, output_eval, batch_size, 1, False)

We can print one batch to see what the data looks like:

for x, y in train_dataset.take(1):
    print(x.shape)
    print(y.shape)
    print(x)
    print(y)

As you can see, every input batch has shape 64x16 and every output batch has shape 64x11, with 0 used for padding
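The effect of drop_remainder on the number of batches can be checked with simple arithmetic. The sketch below imitates batching over a single pass of the data in plain Python (the batches helper is hypothetical, not a tf.data API), using the split sizes and batch size from above.

```python
def batches(items, batch_size, drop_remainder=True):
    """Yield consecutive batches; optionally drop a short final batch."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            return
        yield batch

batch_size = 64
for name, n_samples in [("train", 24000), ("eval", 6000)]:
    n_batches = sum(1 for _ in batches(list(range(n_samples)), batch_size))
    dropped = n_samples - n_batches * batch_size
    print(name, n_batches, dropped)
```

With 64 samples per batch, one pass over the 24,000 training samples gives 375 full batches with nothing dropped, while the 6,000 evaluation samples give 93 full batches and leave 48 samples out.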

  • 3.1 Building the Encoder

Define the hyperparameters

embedding_units = 256
units = 1024
input_vocal_size = len(input_tokenizer.word_index) + 1
output_vocal_size = len(output_tokenizer.word_index) + 1

Define the Encoder, implemented here with the subclassing API

from tensorflow import keras

class Encoder(keras.Model):
    def __init__(self, vocab_size, embedding_units, encoding_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.encoding_units = encoding_units
        self.embedding = keras.layers.Embedding(vocab_size, embedding_units)
        # GRU is an LSTM variant that merges the forget gate and the input gate into one update gate: the two are assumed to sum to 1, so the forget gate equals 1 minus the input gate
        self.gru = keras.layers.GRU(self.encoding_units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.encoding_units))
    
encoder = Encoder(input_vocal_size, embedding_units, units, batch_size)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(x, sample_hidden)

print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

Output:

Encoder output shape: (batch size, sequence length, units) (64, 16, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)

The output is a 64x16x1024 three-dimensional tensor: 16 is the sequence length and 1024 is the size of the state
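The two values returned by the encoder correspond to return_sequences (one output per position) and return_state (the final state). A one-unit GRU in plain Python illustrates both, along with the coupled gate described in the code comment. This is a sketch under simplifying assumptions: scalar weights with hypothetical values, no biases, and one common convention for how the update gate mixes old and new state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_scalar(inputs, h0, p):
    """Run a 1-unit GRU over a sequence; return every output and the last state."""
    h = h0
    outputs = []
    for x in inputs:
        z = sigmoid(p["wz"] * x + p["uz"] * h)  # update gate
        r = sigmoid(p["wr"] * x + p["ur"] * h)  # reset gate
        candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h))
        h = z * h + (1.0 - z) * candidate       # keep z of the old state, take 1 - z of the new
        outputs.append(h)
    return outputs, h

# Hypothetical weights chosen only for illustration.
params = {"wz": 0.5, "uz": 0.1, "wr": 0.3, "ur": 0.2, "wh": 0.8, "uh": 0.4}
outputs, state = gru_scalar([1.0, -0.5, 0.25], h0=0.0, p=params)
print(len(outputs), outputs[-1] == state)
```

The sequence of outputs has one entry per input position, and its last entry is the returned state, mirroring the (64, 16, 1024) and (64, 1024) shapes above.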

  • 4.1 Building the attention layer

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden.shape: (batch_size, units)
        # encoder_outputs.shape: (batch_size, length, units)
        # we are doing this to perform addition to calculate the score
        decoder_hidden_with_time_axis = tf.expand_dims(decoder_hidden, 1)

        # before V: (batch_size, length, units)
        # after V:  (batch_size, length, 1)
        score = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(decoder_hidden_with_time_axis)))

        # attention_weights shape : (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector.shape: (batch_size, length, units)
        context_vector = attention_weights * encoder_outputs
        # context_vector.shape: (batch_size, units)
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
    
attention_model = BahdanauAttention(units = 10)
attention_result, attention_weights = attention_model(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Output:

Attention result shape: (batch size, units) (64, 1024)
Attention weights shape: (batch_size, sequence_length, 1) (64, 16, 1)
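Because the score has a trailing axis of size 1, the axis passed to softmax matters. The sketch below uses nested Python lists with made-up scores to contrast normalizing over the sequence positions (axis=1, as in the code above) with normalizing over the last axis.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# score[b][t][0]: batch of 2 sentences, 3 encoder positions, trailing axis of size 1.
score = [[[2.0], [1.0], [0.0]],
         [[0.0], [0.0], [3.0]]]

# axis=1: normalize across the 3 positions of each sentence.
over_positions = []
for sentence in score:
    weights = softmax([position[0] for position in sentence])
    over_positions.append([[w] for w in weights])

# axis=-1: normalize across the single trailing element, which always gives 1.0.
over_last_axis = [[softmax(position) for position in sentence] for sentence in score]

print(over_positions[0])
print(over_last_axis[0])
```

With the last axis, every weight becomes 1.0 and the context vector degenerates into a plain sum of the encoder outputs, so no position is favored over another.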
  • 5.1 Building the Decoder

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_units, decoding_units, batch_size):
        super(Decoder, self).__init__()
        self.batch_size = batch_size
        self.decoding_units = decoding_units
        self.embedding = keras.layers.Embedding(vocab_size, embedding_units)
        self.gru = keras.layers.GRU(self.decoding_units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')
        self.fc = keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.decoding_units)

    def call(self, x, hidden, encoding_output):
        # context_vector.shape: (batch_size, units)
        # attention_weights.shape: (batch_size, length, 1)
        context_vector, attention_weights = self.attention(hidden, encoding_output)

        # before embedding : (batch_size, 1)
        # after embedding : (batch_size, 1, embedding_units)
        x = self.embedding(x)

        combined_x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # output.shape: [batch_size, 1, decoding_units]
        # state.shape: [batch_size, decoding_units]
        output, state = self.gru(combined_x)

        # output shape : (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape : (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights

decoder = Decoder(output_vocal_size, embedding_units, units, batch_size)
outputs = decoder(tf.random.uniform((batch_size, 1)),
                                      sample_hidden, sample_output)
decoder_output, decoder_hidden, decoder_aw = outputs
print ('decoder_output_shape: ', decoder_output.shape)
print('decoder_hidden_shape: ', decoder_hidden.shape)
print('decoder_attention_weights.shape: ', decoder_aw.shape)

Output:

decoder_output_shape:  (64, 4935)
decoder_hidden_shape:  (64, 1024)
decoder_attention_weights.shape:  (64, 16, 1)

The output has shape 64x4935, where 4935 is the vocabulary size; the hidden state has shape 64x1024; and attention_weights has shape 64x16x1

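  • 6.1 Defining the loss function

Objective loss: what we predict here is a word id, that is, we must pick the correct word id among many candidates, so this is a classification problem. Classification problems usually use cross-entropy as the loss. Since each target is a single word id rather than a one-hot vector, we use SparseCategoricalCrossentropy

optimizer = keras.optimizers.Adam()
# from_logits=True: the final fc layer outputs raw scores with no activation function; if an activation were added, this would be set to False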
loss_object = keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
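The masking in loss_function can be followed on a tiny batch. This is a plain-Python sketch with made-up logits, assuming id 0 is padding as in the tokenizer above; sparse_ce_from_logits and masked_loss are hypothetical helper names.

```python
import math

def sparse_ce_from_logits(target_id, logits):
    """Cross-entropy of one example: -log(softmax(logits)[target_id])."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_id]

def masked_loss(targets, batch_logits, pad_id=0):
    """Zero out the loss at padded targets, then average over the whole batch."""
    losses = []
    for target_id, logits in zip(targets, batch_logits):
        loss = sparse_ce_from_logits(target_id, logits)
        losses.append(0.0 if target_id == pad_id else loss)
    return sum(losses) / len(losses)

# Hypothetical batch of 2 at one time step, vocabulary of 3 ids (0 is padding).
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
print(masked_loss([1, 0], logits))  # second target is padding and contributes 0
print(masked_loss([1, 2], logits))  # both targets count
```

As with tf.reduce_mean in the code above, the average divides by the full batch size, so padded positions lower the mean rather than being excluded from it.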


Define a training step that accumulates the loss over every decoding step

@tf.function
def train_step(inp, targ, encoding_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        encoding_output, encoding_hidden = encoder(inp, encoding_hidden)

        decoding_hidden = encoding_hidden
        
        # e.g. <start> I am here <end>
        # 1. <start> -> I
        # 2. I -> am
        # 3. am -> here
        # 4. here -> <end>

        for t in range(0, targ.shape[1] - 1):
            # teacher forcing: feed the ground-truth token at step t
            decoding_input = tf.expand_dims(targ[:, t], 1)
            # passing enc_output to the decoder
            predictions, decoding_hidden, _ = decoder(
                decoding_input, decoding_hidden, encoding_output)

            # the prediction is scored against the next token, t + 1
            loss += loss_function(targ[:, t + 1], predictions)

    # average the summed loss over the number of decoding steps
    batch_loss = loss / int(targ.shape[1] - 1)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
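The input-to-target pairing listed in the comments of train_step is what teacher forcing means in practice: at every step the decoder receives the true previous token and is scored on the next one. A short sketch, using a hypothetical helper name, makes the index shift explicit.

```python
def teacher_forcing_pairs(target_ids):
    """Pair each decoder input with the token it should predict next."""
    return [(target_ids[t], target_ids[t + 1]) for t in range(len(target_ids) - 1)]

tokens = ["<start>", "i", "am", "here", "<end>"]
for decoder_input, expected in teacher_forcing_pairs(tokens):
    print(decoder_input, "->", expected)
```

A target of length n therefore yields n - 1 decoding steps, and the token fed in at a step is never the one being predicted at that step.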
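  • 7.1 Training the model

import time

epochs = 10
# One epoch is one pass over the training split, so count the batches in input_train
steps_per_epoch = len(input_train) // batch_size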


for epoch in range(epochs):
    start = time.time()

    encoding_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(train_dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, encoding_hidden)
        total_loss += batch_loss

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs


    print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
  • 8.1 Implementing prediction

The evaluate function takes a text string and translates it. During translation we call the encoder and the decoder to obtain the result. Decoding runs in a for loop in which each step's output is used as the next step's input, and each step's attention_weights are saved into attention_matrix. This attention_matrix represents the attention relationship between the input and the output

import numpy as np

def evaluate(input_sentence):
    attention_matrix = np.zeros((max_length_output, max_length_input))
    input_sentence = preprocess_sentence(input_sentence)

    # text -> id
    inputs = [input_tokenizer.word_index[token] for token in input_sentence.split(' ')]
    # padding
    inputs = keras.preprocessing.sequence.pad_sequences(
        [inputs], maxlen=max_length_input, padding='post')
    # inputs -> tensor
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    # a single sentence is a batch of size 1, so the initial state is (1, units)
    encoding_hidden = tf.zeros((1, units))
    encoding_outputs, encoding_hidden = encoder(inputs, encoding_hidden)
    decoding_hidden = encoding_hidden
    # e.g. <start> -> A
    # A -> B -> C -> D
    
    # decoding_input.shape: (1, 1)
    decoding_input = tf.expand_dims(
        [output_tokenizer.word_index['<start>']], 0)

    for t in range(max_length_output):
        predictions, decoding_hidden, attention_weights = decoder(
            decoding_input, decoding_hidden, encoding_outputs)

        # attention weights: (batch_size, input_length, 1) (1, 16, 1)
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_matrix[t] = attention_weights.numpy()

        # predictions.shape: (batch_size, vocal_size) (1, 4935)
        predicted_id = tf.argmax(predictions[0]).numpy()

        result += output_tokenizer.index_word[predicted_id] + ' '

        if output_tokenizer.index_word[predicted_id] == '<end>':
            return result, input_sentence, attention_matrix

        # the predicted ID is fed back into the model
        decoding_input = tf.expand_dims([predicted_id], 0)

    return result, input_sentence, attention_matrix
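The loop in evaluate is greedy decoding: at each step it takes the highest-scoring token and feeds it back in as the next input. The sketch below isolates that control flow in plain Python, with a fixed lookup table (entirely hypothetical) standing in for the trained decoder.

```python
def greedy_decode(next_scores, start_token, end_token, max_steps):
    """Repeatedly feed the best-scoring token back in until <end> or max_steps."""
    result = []
    current = start_token
    for _ in range(max_steps):
        scores = next_scores(current)
        current = max(scores, key=scores.get)  # argmax over the vocabulary
        result.append(current)
        if current == end_token:
            break
    return result

# Hypothetical stand-in for the decoder: a fixed table of next-token scores.
TABLE = {
    "<start>": {"it": 2.0, "is": 0.1, "cold": 0.1, "<end>": 0.0},
    "it": {"it": 0.0, "is": 3.0, "cold": 0.5, "<end>": 0.1},
    "is": {"it": 0.0, "is": 0.0, "cold": 2.5, "<end>": 0.3},
    "cold": {"it": 0.0, "is": 0.0, "cold": 0.0, "<end>": 4.0},
}

print(greedy_decode(TABLE.get, "<start>", "<end>", max_steps=11))
```

The max_steps bound plays the role of max_length_output: it guarantees termination even if the model never emits the end token.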

  • 8.2 Visualizing the attention mechanism

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

def plot_attention(attention_matrix, input_sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention_matrix, cmap='viridis')

    fontdict = {'fontsize': 14}

    # Put one tick on every row and column so each label lines up with its token
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.set_xticklabels([''] + input_sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()
    
def translate(input_sentence):
    result, input_sentence, attention_matrix = evaluate(input_sentence)

    print('Input: %s' % (input_sentence))
    print('Predicted translation: {}'.format(result))

    # Keep only the rows and columns that correspond to real tokens
    attention_matrix = attention_matrix[:len(result.split(' ')), 
                                        :len(input_sentence.split(' '))]
    plot_attention(attention_matrix, input_sentence.split(' '), 
                   result.split(' '))
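The cropping in translate depends on how the result string is split. The following sketch uses made-up strings and a zero matrix with the maximum lengths reported earlier (11 and 16) to show the shapes involved and one detail of split(' ') worth knowing.

```python
result = "it is cold . <end> "  # evaluate() appends a space after every token
input_sentence = "<start> hace mucho frio aqui . <end>"

print(result.split(' '))  # the trailing space yields a final empty string
print(result.split())     # split() with no argument drops it

# Crop a zero-padded (max_length_output x max_length_input) matrix to the used part.
max_length_output, max_length_input = 11, 16
attention = [[0.0] * max_length_input for _ in range(max_length_output)]
n_rows = len(result.split())
n_cols = len(input_sentence.split(' '))
cropped = [row[:n_cols] for row in attention[:n_rows]]
print(len(cropped), len(cropped[0]))
```

With split(' ') the result has one extra empty token, so the cropped matrix in translate keeps one additional row; that row is either all zeros or, when decoding ran to the maximum length, simply absent.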
translate(u'Hace mucho frío aquí.')

[Figure: attention plot for the translation above]

translate(u'Esta es mi vida.')

[Figures: output for the translation above]
