Deep Learning by Example (Part 3)

Original: annas-archive.org/md5/81c037237f3318d7e4e398047d4d8413

Translator: 飞龙

License: CC BY-NC-SA 4.0

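Chapter 8: Object Detection – CIFAR-10 Example

Having covered the basics and the intuition/motivation behind convolutional neural networks (CNNs), we will now demonstrate them on one of the most popular datasets available for object detection. We will also see how the initial layers of the CNN pick up very basic features of the objects, while the final convolutional layers build more semantic-level features from those basic features of the first layers.

The following topics will be covered in this chapter:

Object detection
Object detection in CIFAR-10 images: model building and training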

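Object detection

Wikipedia states:

"Object detection is a technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, despite the fact that the image of the objects may vary somewhat in different viewpoints, in many different sizes and scales, or even when they are translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades."

Image analysis is one of the most prominent fields in deep learning. Images are easy to generate and handle, and they are exactly the right type of data for machine learning: easy for humans to understand, but difficult for computers. Not surprisingly, image analysis has played a key role in the history of deep neural networks.

Figure 11.1: Examples of detecting objects. Source: B. C. Russell, A. Torralba, C. Liu, R. Fergus, W. T. Freeman, Object Recognition by Scene Alignment, Advances in Neural Information Processing Systems, 2007, at: bryanrussell.org/papers/nips…

With the rise of autonomous cars, facial detection, smart video surveillance, and people-counting solutions, fast and accurate object detection systems are in great demand. These systems not only recognize and classify the objects in an image, but can also localize each one by drawing an appropriate box around it. This makes object detection a harder task than its traditional computer vision predecessor, image classification.

In this chapter, we'll look at object detection: finding out which objects are in an image. For example, imagine a self-driving car that needs to detect other cars on the road, as in Figure 11.1. There are many sophisticated algorithms for object detection. They often require huge datasets, very deep convolutional networks, and long training times.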

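CIFAR-10 – modeling, building, and training

This example shows how to make a CNN for classifying images in the CIFAR-10 dataset. We will use a simple convolutional neural network with a few convolutional and fully connected layers.

Even though the network architecture is very simple, you will see how well it performs when trying to detect objects in the CIFAR-10 images.

So, let's start off this implementation.

Used packages

We import all the packages required for this implementation: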

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import tarfile
import numpy as np
import random
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import OneHotEncoder

import pickle
import tensorflow as tf

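Loading the CIFAR-10 dataset

In this implementation, we'll use CIFAR-10, one of the most widely used datasets for object detection. So, let's start by defining a helper class that reports download progress, and then download and extract the CIFAR-10 dataset if it has not already been downloaded: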

cifar10_batches_dir_path = 'cifar-10-batches-py'

tar_gz_filename = 'cifar-10-python.tar.gz'

class DLProgress(tqdm):
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

if not isfile(tar_gz_filename):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc='CIFAR-10 Python Images Batches') as pbar:
        urlretrieve(
            'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz',
            tar_gz_filename,
            pbar.hook)

if not isdir(cifar10_batches_dir_path):
    with tarfile.open(tar_gz_filename) as tar:
        tar.extractall()
        tar.close()

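After downloading and extracting the CIFAR-10 dataset, you will find that the training data is already split into five batches. CIFAR-10 contains images for 10 categories/classes: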

airplane
automobile
bird
cat
deer
dog
frog
horse
ship
truck

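Before we dive into building the core of the network, let's do some data analysis and preprocessing.

Data analysis and preprocessing

We need to analyze the dataset and do some basic preprocessing. So, let's start by defining some helper functions that will let us load a specific batch out of the five we have and print some statistics about that batch and its samples: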

# Defining a helper function for loading a batch of images
def load_batch(cifar10_dataset_dir_path, batch_num):

    with open(cifar10_dataset_dir_path + '/data_batch_' + str(batch_num), mode='rb') as file:
        batch = pickle.load(file, encoding='latin1')

    input_features = batch['data'].reshape((len(batch['data']), 3, 32, 32)).transpose(0, 2, 3, 1)
    target_labels = batch['labels']

    return input_features, target_labels
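
The reshape and transpose in load_batch can be checked on synthetic data. The sketch below assumes the layout described on the CIFAR-10 dataset page (each row holds 1,024 red values, then 1,024 green, then 1,024 blue, in row-major order); fake_data is a made-up stand-in for batch['data'], not real CIFAR-10 content.

```python
import numpy as np

# Hypothetical stand-in for batch['data']: 4 images, 3072 bytes each,
# stored as 1024 red, then 1024 green, then 1024 blue values (row-major).
fake_data = (np.arange(4 * 3072).reshape(4, 3072) % 256).astype(np.uint8)

# Same reshape/transpose as load_batch: (N, 3, 32, 32) -> (N, 32, 32, 3)
images = fake_data.reshape((len(fake_data), 3, 32, 32)).transpose(0, 2, 3, 1)

print(images.shape)  # (4, 32, 32, 3)

# The red value of pixel (row 0, col 5) in image 0 is element 5 of the flat row;
# the green value of the same pixel sits 1024 positions later.
assert images[0, 0, 5, 0] == fake_data[0, 5]
assert images[0, 0, 5, 1] == fake_data[0, 1024 + 5]
```

The transpose moves the channel axis last, which is the (height, width, channels) order that plt.imshow and the network's input placeholder expect.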

Then, we define a function that shows the statistics of a specific sample from a specific batch:

# Defining a function to show the stats for a batch and a specific sample
def batch_image_stats(cifar10_dataset_dir_path, batch_num, sample_num):

    batch_nums = list(range(1, 6))

    #checking if the batch_num is a valid batch number
    if batch_num not in batch_nums:
        print('Batch Num is out of Range. You can choose from these Batch nums: {}'.format(batch_nums))
        return None

    input_features, target_labels = load_batch(cifar10_dataset_dir_path, batch_num)

    #checking if the sample_num is a valid sample number
    if not (0 <= sample_num < len(input_features)):
        print('{} samples in batch {}. {} is not a valid sample number.'.format(len(input_features), batch_num, sample_num))
        return None

    print('\nStatistics of batch number {}:'.format(batch_num))
    print('Number of samples in this batch: {}'.format(len(input_features)))
    print('Per class counts of each Label: {}'.format(dict(zip(*np.unique(target_labels, return_counts=True)))))

    image = input_features[sample_num]
    label = target_labels[sample_num]
    cifar10_class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

    print('\nSample Image Number {}:'.format(sample_num))
    print('Sample image - Minimum pixel value: {} Maximum pixel value: {}'.format(image.min(), image.max()))
    print('Sample image - Shape: {}'.format(image.shape))
    print('Sample Label - Label Id: {} Name: {}'.format(label, cifar10_class_names[label]))
    plt.axis('off')
    plt.imshow(image)

Now, we can use this function to explore our dataset and visualize specific images:

# Explore a specific batch and sample from the dataset
batch_num = 3
sample_num = 6
batch_image_stats(cifar10_batches_dir_path, batch_num, sample_num)

The output is as follows:


Statistics of batch number 3:
Number of samples in this batch: 10000
Per class counts of each Label: {0: 994, 1: 1042, 2: 965, 3: 997, 4: 990, 5: 1029, 6: 978, 7: 1015, 8: 961, 9: 1029}

Sample Image Number 6:
Sample image - Minimum pixel value: 30 Maximum pixel value: 242
Sample image - Shape: (32, 32, 3)
Sample Label - Label Id: 8 Name: ship

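Figure 11.2: Sample image 6 from batch 3

Before feeding the dataset to the model, we need to normalize it to the range of zero to one.

Batch normalization optimizes network training. It has been shown to have several benefits:

Faster training: Each training step will be slower because of the extra calculations during the forward pass of the network and the additional parameters to train during the backward pass. However, the network should converge much more quickly, so training should be faster overall.
Higher learning rates: Gradient descent usually requires small learning rates for the network to converge to a minimum of the loss function. As neural networks get deeper, their gradients get smaller during backpropagation, so they usually require even more iterations. Batch normalization allows us to use much higher learning rates, which further increases the speed at which networks train.
Easy to initialize weights: Weight initialization can be difficult, especially with deep neural networks. Batch normalization seems to let us be much less careful about choosing our initial weights.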
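
To make the idea concrete, here is a minimal NumPy sketch of the batch normalization forward pass at training time. It is an illustration only, not part of the chapter's network: the function name batch_norm_forward and the parameters gamma, beta, and eps are assumptions introduced here.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(128, 4))  # batch of 128, 4 features
out = batch_norm_forward(activations, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0).round(3))  # approximately 0 for every feature
print(out.std(axis=0).round(3))   # approximately 1 for every feature
```

gamma and beta are the extra trainable parameters mentioned in the first benefit above; a full implementation would also track running statistics for use at inference time.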

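So, let's proceed by defining a function that normalizes a list of input images so that all their pixel values lie between zero and one. Note that this is plain min-max scaling of the inputs, not batch normalization itself: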

#Normalize CIFAR-10 images to be in the range of [0,1]

def normalize_images(images):

    # initial zero ndarray
    normalized_images = np.zeros_like(images.astype(float))

    # The first index of images is the number of images; the remaining indices are
    # the height, width, and depth of each image
    num_images = images.shape[0]

    # Computing the minimum and maximum value of the input image to do the normalization based on them
    maximum_value, minimum_value = images.max(), images.min()

    # Normalize all the pixel values of the images to be from 0 to 1
    for img in range(num_images):
        normalized_images[img,...] = (images[img, ...] - float(minimum_value)) / float(maximum_value - minimum_value)

    return normalized_images
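
The per-image loop above is not strictly needed, because the same minimum and maximum are applied to every image. A vectorized sketch under that assumption follows; normalize_images_vectorized is a hypothetical name, and the guard for a constant input is an addition that the original function does not have.

```python
import numpy as np

def normalize_images_vectorized(images):
    """Min-max scale a whole image array to [0, 1] in one step."""
    images = np.asarray(images, dtype=np.float64)
    minimum_value, maximum_value = images.min(), images.max()
    value_range = maximum_value - minimum_value
    if value_range == 0:
        # Constant input: avoid dividing by zero
        return np.zeros_like(images)
    return (images - minimum_value) / value_range

sample = np.array([[[[0, 128, 255]]]], dtype=np.uint8)  # one 1x1 RGB image
normalized = normalize_images_vectorized(sample)
print(normalized.min(), normalized.max())  # 0.0 1.0
```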

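Next, we need to implement another helper function that encodes the labels of the input images. In this function, we will use sklearn's one-hot encoding, in which each image label is represented by a vector of zeros except for a one at the class index of the image that the vector represents.

The size of the output vector depends on the number of classes in the dataset, which is 10 for CIFAR-10: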

# Encoding the input labels. Each label will be represented by a vector of zeros except for the class index of the image
# that this vector represents. The length of this vector depends on the number of classes that we have in
# the dataset, which is 10 in CIFAR-10

def one_hot_encode(labels):

    num_classes = 10

    # Use the sklearn helper OneHotEncoder() with the 10 categories fixed up front, so the
    # output always has 10 columns even if a class is missing from the input
    encoder = OneHotEncoder(categories=[list(range(num_classes))])

    # Reshape the input labels to be 2D (a single column)
    labels_resized_to_2d = np.array(labels).reshape(-1, 1)
    one_hot_encoded_targets = encoder.fit_transform(labels_resized_to_2d)

    return one_hot_encoded_targets.toarray()
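
For integer labels in the range 0 to 9, the same encoding can be sketched without sklearn by indexing into an identity matrix. This is an alternative illustration, not the chapter's method; one_hot_encode_numpy is a hypothetical name.

```python
import numpy as np

def one_hot_encode_numpy(labels, num_classes=10):
    """Row i of the identity matrix is the one-hot vector for class i."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes)[labels]

encoded = one_hot_encode_numpy([8, 0, 3])
print(encoded.shape)           # (3, 10)
print(encoded.argmax(axis=1))  # [8 0 3]
```

Taking argmax along axis 1 recovers the original integer labels, which is also how the accuracy computation later compares predictions with targets.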

Now, it's time to call the preceding helper functions to preprocess and persist the dataset so that we can use it later:

def preprocess_persist_data(cifar10_batches_dir_path, normalize_images, one_hot_encode):

    num_batches = 5
    valid_input_features = []
    valid_target_labels = []

    for batch_ind in range(1, num_batches + 1):

        # Loading batch
        raw_features, raw_labels = load_batch(cifar10_batches_dir_path, batch_ind)
        num_validation_images = int(len(raw_features) * 0.1)

        # Preprocess the first 90% of the current batch and persist it for future use
        input_features = normalize_images(raw_features[:-num_validation_images])
        target_labels = one_hot_encode(raw_labels[:-num_validation_images])

        # Persisting the preprocessed batch
        with open('preprocess_train_batch_' + str(batch_ind) + '.p', 'wb') as file:
            pickle.dump((input_features, target_labels), file)

        # Hold out the last 10% of the raw batch (not yet preprocessed) for validating our model
        valid_input_features.extend(raw_features[-num_validation_images:])
        valid_target_labels.extend(raw_labels[-num_validation_images:])

    # Preprocessing and persisting the validation subset
    input_features = normalize_images(np.array(valid_input_features))
    target_labels = one_hot_encode(np.array(valid_target_labels))

    with open('preprocess_valid.p', 'wb') as file:
        pickle.dump((input_features, target_labels), file)

    # Now it's time to preprocess and persist the test batch
    with open(cifar10_batches_dir_path + '/test_batch', mode='rb') as file:
        test_batch = pickle.load(file, encoding='latin1')

    test_input_features = test_batch['data'].reshape((len(test_batch['data']), 3, 32, 32)).transpose(0, 2, 3, 1)
    test_input_labels = test_batch['labels']

    # Normalizing and encoding the test batch
    input_features = normalize_images(np.array(test_input_features))
    target_labels = one_hot_encode(np.array(test_input_labels))

    with open('preprocess_test.p', 'wb') as file:
        pickle.dump((input_features, target_labels), file)

# Calling the helper function above to preprocess and persist the training, validation, and testing set
preprocess_persist_data(cifar10_batches_dir_path, normalize_images, one_hot_encode)

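Now, we have the preprocessed data saved to disk.

We also need to load the validation set so that we can evaluate the model on it at different epochs of the training process: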

# Load the Preprocessed Validation data
valid_input_features, valid_input_labels = pickle.load(open('preprocess_valid.p', mode='rb'))

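Building the network

Now it's time to build the core of our classification application, the computational graph of the CNN architecture. To get the most out of this implementation, we won't use the TensorFlow layers API for the convolutional layers; instead, we'll use the TensorFlow neural network version of it (tf.nn).

So, let's start by defining the model input placeholders, which will take the input images, the target classes, and the keep probability parameter of the dropout layers (dropout helps us reduce the complexity of the architecture by dropping some connections, and therefore reduces the chance of overfitting):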


# Defining the model inputs
def images_input(img_shape):
 return tf.placeholder(tf.float32, (None, ) + img_shape, name="input_images")

def target_input(num_classes):

 target_input = tf.placeholder(tf.int32, (None, num_classes), name="input_images_target")
 return target_input

#define a function for the dropout layer keep probability
def keep_prob_input():
 return tf.placeholder(tf.float32, name="keep_prob")
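
The role of the keep probability can be illustrated with a small NumPy sketch of inverted dropout, which is the scheme tf.nn.dropout follows: each unit is kept with probability keep_prob and the survivors are scaled by 1 / keep_prob so that the expected activation is unchanged. The function name dropout and the toy array are assumptions made for illustration.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 64))

dropped = dropout(activations, keep_prob=0.5, rng=rng)
print(np.unique(dropped))        # only 0.0 (dropped) and 2.0 (kept, rescaled)
print(round(dropped.mean(), 2))  # close to the original mean of 1.0

# With keep_prob = 1.0 nothing is dropped, which is why evaluation feeds 1.0
unchanged = dropout(activations, keep_prob=1.0, rng=rng)
```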

Next, we need to use the TensorFlow neural network implementation to build our convolutional layers with max pooling:

# Applying a convolution operation to the input tensor followed by max pooling
def conv2d_layer(input_tensor, conv_layer_num_outputs, conv_kernel_size, conv_layer_strides, pool_kernel_size, pool_layer_strides):

    input_depth = input_tensor.get_shape()[3].value
    weight_shape = conv_kernel_size + (input_depth, conv_layer_num_outputs,)

    # Defining layer weights and biases
    weights = tf.Variable(tf.random_normal(weight_shape))
    biases = tf.Variable(tf.random_normal((conv_layer_num_outputs,)))

    # Strides are given in NHWC order: (batch, height, width, channels)
    conv_strides = (1,) + conv_layer_strides + (1,)

    conv_layer = tf.nn.conv2d(input_tensor, weights, strides=conv_strides, padding='SAME')
    conv_layer = tf.nn.bias_add(conv_layer, biases)

    # Use the pooling kernel size (not the convolution kernel size) for the pooling window
    pool_ksize = (1,) + pool_kernel_size + (1,)
    pool_strides = (1,) + pool_layer_strides + (1,)
    pool_layer = tf.nn.max_pool(conv_layer, ksize=pool_ksize, strides=pool_strides, padding='SAME')
    return pool_layer

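As you have probably seen in the previous chapter, the output of the max pooling operation is a 4D tensor, which is not compatible with the input format required by the fully connected layers. So, we need to implement a flatten layer that converts the output of the max pooling layer from a 4D to a 2D tensor:

# Flatten the output of the max pooling layer so it can be fed to the fully connected layer, which only accepts
# 2D input
def flatten_layer(input_tensor):
    return tf.contrib.layers.flatten(input_tensor)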

Next, we need to define a helper function that lets us add a fully connected layer to our architecture:

# Define the fully connected layer that will use the flattened output of the stacked convolution layers
# to do the actual classification
def fully_connected_layer(input_tensor, num_outputs):
 return tf.layers.dense(input_tensor, num_outputs)

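Finally, before using these helper functions to create the whole architecture, we need one more function that takes the output of the fully connected layer and produces 10 real values, corresponding to the number of classes in our dataset: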

#Defining the output function
def output_layer(input_tensor, num_outputs):
    return  tf.layers.dense(input_tensor, num_outputs)

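So, let's define a function that puts all these pieces together and creates a CNN with three convolutional layers, each followed by a max pooling operation. We'll also have two fully connected layers, each followed by a dropout layer to reduce the model complexity and prevent overfitting. Finally, we'll have the output layer, which produces a vector of 10 real values, where each value is a score for one class indicating how likely it is to be the correct one: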

def build_convolution_net(image_data, keep_prob):

 # Applying 3 convolution layers followed by max pooling layers
 conv_layer_1 = conv2d_layer(image_data, 32, (3,3), (1,1), (3,3), (3,3)) 
 conv_layer_2 = conv2d_layer(conv_layer_1, 64, (3,3), (1,1), (3,3), (3,3))
 conv_layer_3 = conv2d_layer(conv_layer_2, 128, (3,3), (1,1), (3,3), (3,3))

# Flatten the output from 4D to 2D to be fed to the fully connected layer
 flatten_output = flatten_layer(conv_layer_3)

# Applying 2 fully connected layers with drop out
 fully_connected_layer_1 = fully_connected_layer(flatten_output, 64)
 fully_connected_layer_1 = tf.nn.dropout(fully_connected_layer_1, keep_prob)
 fully_connected_layer_2 = fully_connected_layer(fully_connected_layer_1, 32)
 fully_connected_layer_2 = tf.nn.dropout(fully_connected_layer_2, keep_prob)

 #Applying the output layer while the output size will be the number of categories that we have
 #in CIFAR-10 dataset
 output_logits = output_layer(fully_connected_layer_2, 10)

 #returning output
 return output_logits
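
It can help to trace the spatial size of the feature maps through this network. With 'SAME' padding, the output size along each spatial dimension is ceil(input_size / stride), so the arithmetic can be sketched in plain Python. The helper name same_padding_output_size is hypothetical; the strides are the ones passed to conv2d_layer above.

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size of a conv or pooling op with 'SAME' padding."""
    return math.ceil(input_size / stride)

size = 32  # CIFAR-10 images are 32x32
sizes = []
# Each block: convolution with stride 1, then max pooling with stride 3
for conv_stride, pool_stride in [(1, 3), (1, 3), (1, 3)]:
    size = same_padding_output_size(size, conv_stride)
    size = same_padding_output_size(size, pool_stride)
    sizes.append(size)

flattened = sizes[-1] * sizes[-1] * 128  # 128 filters in the last conv layer
print(sizes)      # [11, 4, 2]
print(flattened)  # 512
```

Under these assumptions the flatten layer hands a 512-element vector per image to the first fully connected layer.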

Let's call the preceding helper functions to build the network and define its loss and optimization criteria:

#Using the helper function above to build the network

#First off, let's remove all the previous inputs, weights, and biases from the previous runs
tf.reset_default_graph()

# Defining the input placeholders to the convolution neural network
input_images = images_input((32, 32, 3))
input_images_target = target_input(10)
keep_prob = keep_prob_input()

# Building the models
logits_values = build_convolution_net(input_images, keep_prob)

# Name the logits Tensor so that it can be loaded from disk after training
logits_values = tf.identity(logits_values, name='logits')

# defining the model loss
model_cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits_values, labels=input_images_target))

# Defining the model optimizer
model_optimizer = tf.train.AdamOptimizer().minimize(model_cost)

# Calculating and averaging the model accuracy
correct_prediction = tf.equal(tf.argmax(logits_values, 1), tf.argmax(input_images_target, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='model_accuracy')
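
What the loss and accuracy nodes compute can be sketched in NumPy on a toy batch. This is an illustration of the math only, not TensorFlow's implementation; the function name and the example logits are made up.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    """Per-example cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1)

logits = np.array([[2.0, 1.0, 0.1],    # confident and correct
                   [0.0, 0.0, 0.0]])   # uniform prediction
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)
mean_loss = per_example.mean()  # counterpart of tf.reduce_mean over the batch

# Accuracy: fraction of rows where the arg-max of the logits matches the label
accuracy = (logits.argmax(axis=1) == labels.argmax(axis=1)).mean()
print(per_example, mean_loss, accuracy)
```

A uniform prediction over three classes costs ln(3), while a confident correct prediction costs much less, which is what pushes the optimizer toward higher scores for the true class.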

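Now that we have built the computational architecture of this network, it's time to kick off the training process and see some results.

Model training

So, let's define a helper function that lets us kick off the training process. This function takes the input images, the one-hot encoding of the target classes, and the keep probability value as input. Then, it feeds these values to the computational graph and calls the model optimizer:

#Define a helper function for kicking off the training process
def train(session, model_optimizer, keep_probability, in_feature_batch, target_batch):
    session.run(model_optimizer, feed_dict={input_images: in_feature_batch, input_images_target: target_batch, keep_prob: keep_probability})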

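We'll need to validate our model at different points during training, so we'll define a helper function that prints the loss and accuracy of the model on the data it is given (intended for the validation set):

# Defining a helper function for printing the model's loss and accuracy on the given data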
def print_model_stats(session, input_feature_batch, target_label_batch, model_cost, model_accuracy):

    validation_loss = session.run(model_cost, feed_dict={input_images: input_feature_batch, input_images_target: target_label_batch, keep_prob: 1.0})
    validation_accuracy = session.run(model_accuracy, feed_dict={input_images: input_feature_batch, input_images_target: target_label_batch, keep_prob: 1.0})

    print("Valid Loss: %f" %(validation_loss))
    print("Valid accuracy: %f" % (validation_accuracy))

Let's also define some model hyperparameters that we can use to tune the model for better performance:

# Model Hyperparameters
num_epochs = 100
batch_size = 128
keep_probability = 0.5

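Now, let's kick off the training process, but only on a single batch of the CIFAR-10 dataset, and see what the model accuracy is based on this batch.

Before that, however, we'll define helper functions that load a preprocessed training batch and split its input images and target classes into mini-batches: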

# Splitting the dataset features and labels to batches
def batch_split_features_labels(input_features, target_labels, train_batch_size):
    for start in range(0, len(input_features), train_batch_size):
        end = min(start + train_batch_size, len(input_features))
        yield input_features[start:end], target_labels[start:end]

#Loading the persisted preprocessed training batches
def load_preprocess_training_batch(batch_id, batch_size):
    filename = 'preprocess_train_batch_' + str(batch_id) + '.p'
    with open(filename, mode='rb') as file:
        input_features, target_labels = pickle.load(file)

    # Returning the training images in mini-batches of the given batch size
    return batch_split_features_labels(input_features, target_labels, batch_size)
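
The behavior of the mini-batch generator can be checked on dummy data. The sketch below copies the splitting logic under the hypothetical name batch_split and assumes that each persisted training file holds 9,000 samples (the 10,000 in a CIFAR-10 batch minus the 10% held out for validation).

```python
def batch_split(features, labels, batch_size):
    """Yield consecutive (features, labels) slices of at most batch_size items."""
    for start in range(0, len(features), batch_size):
        end = min(start + batch_size, len(features))
        yield features[start:end], labels[start:end]

features = list(range(9000))  # dummy stand-ins for images
labels = list(range(9000))    # dummy stand-ins for one-hot labels

batch_sizes = [len(f) for f, _ in batch_split(features, labels, 128)]
print(len(batch_sizes))  # 71
print(batch_sizes[-1])   # 40
```

Because 9,000 is not a multiple of 128, the last mini-batch has only 40 samples. Any statistic computed on that final mini-batch can only take values in steps of 1/40 = 0.025.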

Now, let's start the training process on a single batch:

print('Training on only a Single Batch from the CIFAR-10 Dataset...')
with tf.Session() as sess:

    # Initializing the variables
    sess.run(tf.global_variables_initializer())

    # Training cycle
    for epoch in range(num_epochs):
        batch_ind = 1

        for batch_features, batch_labels in load_preprocess_training_batch(batch_ind, batch_size):
            train(sess, model_optimizer, keep_probability, batch_features, batch_labels)

        # Note: the stats below are computed on the last mini-batch of the epoch;
        # pass valid_input_features and valid_input_labels instead to report true validation metrics
        print('Epoch number {:>2}, CIFAR-10 Batch Number {}: '.format(epoch + 1, batch_ind), end='')
        print_model_stats(sess, batch_features, batch_labels, model_cost, accuracy)

Output:
.
.
.
Epoch number 85, CIFAR-10 Batch Number 1: Valid Loss: 1.490792
Valid accuracy: 0.550000
Epoch number 86, CIFAR-10 Batch Number 1: Valid Loss: 1.487118
Valid accuracy: 0.525000
Epoch number 87, CIFAR-10 Batch Number 1: Valid Loss: 1.309082
Valid accuracy: 0.575000
Epoch number 88, CIFAR-10 Batch Number 1: Valid Loss: 1.446488
Valid accuracy: 0.475000
Epoch number 89, CIFAR-10 Batch Number 1: Valid Loss: 1.430939
Valid accuracy: 0.550000
Epoch number 90, CIFAR-10 Batch Number 1: Valid Loss: 1.484480
Valid accuracy: 0.525000
Epoch number 91, CIFAR-10 Batch Number 1: Valid Loss: 1.345774
Valid accuracy: 0.575000
Epoch number 92, CIFAR-10 Batch Number 1: Valid Loss: 1.425942
Valid accuracy: 0.575000

Epoch number 93, CIFAR-10 Batch Number 1: Valid Loss: 1.451115
Valid accuracy: 0.550000
Epoch number 94, CIFAR-10 Batch Number 1: Valid Loss: 1.368719
Valid accuracy: 0.600000
Epoch number 95, CIFAR-10 Batch Number 1: Valid Loss: 1.336483
Valid accuracy: 0.600000
Epoch number 96, CIFAR-10 Batch Number 1: Valid Loss: 1.383425
Valid accuracy: 0.575000
Epoch number 97, CIFAR-10 Batch Number 1: Valid Loss: 1.378877
Valid accuracy: 0.625000
Epoch number 98, CIFAR-10 Batch Number 1: Valid Loss: 1.343391
Valid accuracy: 0.600000
Epoch number 99, CIFAR-10 Batch Number 1: Valid Loss: 1.319342
Valid accuracy: 0.625000
Epoch number 100, CIFAR-10 Batch Number 1: Valid Loss: 1.340849
Valid accuracy: 0.525000

As you can see, the validation accuracy is not that good when training on only a single batch. Let's see how the validation accuracy changes with a full training run of the model:

model_save_path = './cifar-10_classification'

with tf.Session() as sess:
    # Initializing the variables
    sess.run(tf.global_variables_initializer())

    # Training cycle
    for epoch in range(num_epochs):

        # iterate through the batches
        num_batches = 5

        for batch_ind in range(1, num_batches + 1):
            for batch_features, batch_labels in load_preprocess_training_batch(batch_ind, batch_size):
                train(sess, model_optimizer, keep_probability, batch_features, batch_labels)

            print('Epoch number{:>2}, CIFAR-10 Batch Number {}: '.format(epoch + 1, batch_ind), end='')
            print_model_stats(sess, batch_features, batch_labels, model_cost, accuracy)

    # Save the trained Model
    saver = tf.train.Saver()
    save_path = saver.save(sess, model_save_path)

Output:
.
.
.
Epoch number94, CIFAR-10 Batch Number 5: Valid Loss: 0.316593
Valid accuracy: 0.925000
Epoch number95, CIFAR-10 Batch Number 1: Valid Loss: 0.285429
Valid accuracy: 0.925000
Epoch number95, CIFAR-10 Batch Number 2: Valid Loss: 0.347411
Valid accuracy: 0.825000
Epoch number95, CIFAR-10 Batch Number 3: Valid Loss: 0.232483
Valid accuracy: 0.950000
Epoch number95, CIFAR-10 Batch Number 4: Valid Loss: 0.294707
Valid accuracy: 0.900000
Epoch number95, CIFAR-10 Batch Number 5: Valid Loss: 0.299490
Valid accuracy: 0.975000
Epoch number96, CIFAR-10 Batch Number 1: Valid Loss: 0.302191
Valid accuracy: 0.950000
Epoch number96, CIFAR-10 Batch Number 2: Valid Loss: 0.347043
Valid accuracy: 0.750000
Epoch number96, CIFAR-10 Batch Number 3: Valid Loss: 0.252851
Valid accuracy: 0.875000
Epoch number96, CIFAR-10 Batch Number 4: Valid Loss: 0.291433
Valid accuracy: 0.950000
Epoch number96, CIFAR-10 Batch Number 5: Valid Loss: 0.286192
Valid accuracy: 0.950000
Epoch number97, CIFAR-10 Batch Number 1: Valid Loss: 0.277105
Valid accuracy: 0.950000
Epoch number97, CIFAR-10 Batch Number 2: Valid Loss: 0.305842
Valid accuracy: 0.850000
Epoch number97, CIFAR-10 Batch Number 3: Valid Loss: 0.215272
Valid accuracy: 0.950000
Epoch number97, CIFAR-10 Batch Number 4: Valid Loss: 0.313761
Valid accuracy: 0.925000
Epoch number97, CIFAR-10 Batch Number 5: Valid Loss: 0.313503
Valid accuracy: 0.925000
Epoch number98, CIFAR-10 Batch Number 1: Valid Loss: 0.265828
Valid accuracy: 0.925000
Epoch number98, CIFAR-10 Batch Number 2: Valid Loss: 0.308948
Valid accuracy: 0.800000
Epoch number98, CIFAR-10 Batch Number 3: Valid Loss: 0.232083
Valid accuracy: 0.950000
Epoch number98, CIFAR-10 Batch Number 4: Valid Loss: 0.298826
Valid accuracy: 0.925000
Epoch number98, CIFAR-10 Batch Number 5: Valid Loss: 0.297230
Valid accuracy: 0.950000
Epoch number99, CIFAR-10 Batch Number 1: Valid Loss: 0.304203
Valid accuracy: 0.900000
Epoch number99, CIFAR-10 Batch Number 2: Valid Loss: 0.308775
Valid accuracy: 0.825000
Epoch number99, CIFAR-10 Batch Number 3: Valid Loss: 0.225072
Valid accuracy: 0.925000
Epoch number99, CIFAR-10 Batch Number 4: Valid Loss: 0.263737
Valid accuracy: 0.925000
Epoch number99, CIFAR-10 Batch Number 5: Valid Loss: 0.278601
Valid accuracy: 0.950000
Epoch number100, CIFAR-10 Batch Number 1: Valid Loss: 0.293509
Valid accuracy: 0.950000
Epoch number100, CIFAR-10 Batch Number 2: Valid Loss: 0.303817
Valid accuracy: 0.875000
Epoch number100, CIFAR-10 Batch Number 3: Valid Loss: 0.244428
Valid accuracy: 0.900000
Epoch number100, CIFAR-10 Batch Number 4: Valid Loss: 0.280712
Valid accuracy: 0.925000
Epoch number100, CIFAR-10 Batch Number 5: Valid Loss: 0.278625
Valid accuracy: 0.950000

Testing the model

Let's test the trained model against the test set part of the CIFAR-10 dataset. First, we'll define a helper function that helps us visualize the predictions of some sample images and their corresponding true labels:

#A helper function to visualize some samples and their corresponding predictions
def display_samples_predictions(input_features, target_labels, samples_predictions):

    num_classes = 10

    cifar10_class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

    # Convert the one-hot encoded labels back to integer class indices
    label_binarizer = LabelBinarizer()
    label_binarizer.fit(range(num_classes))
    label_inds = label_binarizer.inverse_transform(np.array(target_labels))

    fig, axies = plt.subplots(nrows=4, ncols=2)
    fig.tight_layout()
    fig.suptitle('Softmax Predictions', fontsize=20, y=1.1)

    num_predictions = 4
    margin = 0.05
    ind = np.arange(num_predictions)
    width = (1. - 2. * margin) / num_predictions

    for image_ind, (feature, label_ind, prediction_indicies, prediction_values) in enumerate(zip(input_features, label_inds, samples_predictions.indices, samples_predictions.values)):
        prediction_names = [cifar10_class_names[pred_i] for pred_i in prediction_indicies]
        correct_name = cifar10_class_names[label_ind]

        # Left column: the image with its true class as the title
        axies[image_ind][0].imshow(feature)
        axies[image_ind][0].set_title(correct_name)
        axies[image_ind][0].set_axis_off()

        # Right column: horizontal bars for the top predictions
        axies[image_ind][1].barh(ind + margin, prediction_values[::-1], width)
        axies[image_ind][1].set_yticks(ind + margin)
        axies[image_ind][1].set_yticklabels(prediction_names[::-1])
        axies[image_ind][1].set_xticks([0, 0.5, 1.0])

Now, let's restore the trained model and test it against the test set:

test_batch_size = 64
save_model_path = './cifar-10_classification'
#Number of images to visualize
num_samples = 4

#Number of top predictions
top_n_predictions = 4

#Defining a helper function for testing the trained model
def test_classification_model():

    input_test_features, target_test_labels = pickle.load(open('preprocess_test.p', mode='rb'))
    loaded_graph = tf.Graph()

    with tf.Session(graph=loaded_graph) as sess:

        # loading the trained model
        model = tf.train.import_meta_graph(save_model_path + '.meta')
        model.restore(sess, save_model_path)

        # Getting some input and output Tensors from loaded model
        model_input_values = loaded_graph.get_tensor_by_name('input_images:0')
        model_target = loaded_graph.get_tensor_by_name('input_images_target:0')
        model_keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
        model_logits = loaded_graph.get_tensor_by_name('logits:0')
        model_accuracy = loaded_graph.get_tensor_by_name('model_accuracy:0')

        # Testing the trained model on the test set batches
        test_batch_accuracy_total = 0
        test_batch_count = 0

        for input_test_feature_batch, input_test_label_batch in batch_split_features_labels(input_test_features, target_test_labels, test_batch_size):
            test_batch_accuracy_total += sess.run(
                model_accuracy,
                feed_dict={model_input_values: input_test_feature_batch, model_target: input_test_label_batch, model_keep_prob: 1.0})
            test_batch_count += 1

        print('Test set accuracy: {}\n'.format(test_batch_accuracy_total/test_batch_count))

        # print some random images and their corresponding predictions from the test set results
        random_input_test_features, random_test_target_labels = tuple(zip(*random.sample(list(zip(input_test_features, target_test_labels)), num_samples)))

        random_test_predictions = sess.run(
            tf.nn.top_k(tf.nn.softmax(model_logits), top_n_predictions),
            feed_dict={model_input_values: random_input_test_features, model_target: random_test_target_labels, model_keep_prob: 1.0})

        display_samples_predictions(random_input_test_features, random_test_target_labels, random_test_predictions)

#Calling the function
test_classification_model()

Output:
INFO:tensorflow:Restoring parameters from ./cifar-10_classification
Test set accuracy: 0.7540007961783439
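
The top predictions fed to the visualization come from applying softmax to the logits and keeping the k largest probabilities. A NumPy sketch of that step follows; it mirrors what tf.nn.top_k(tf.nn.softmax(...)) returns but is not TensorFlow's implementation, and top_k_softmax and the example logits are made up.

```python
import numpy as np

def top_k_softmax(logits, k):
    """Return the k largest softmax probabilities per row and their class indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    indices = np.argsort(-probs, axis=1)[:, :k]            # descending order
    values = np.take_along_axis(probs, indices, axis=1)
    return values, indices

logits = np.array([[0.5, 2.0, -1.0, 1.0]])
values, indices = top_k_softmax(logits, k=2)
print(indices)  # [[1 3]]
print(values)   # the two largest probabilities, largest first
```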

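Let's visualize another example to see some errors:

Now, we have a test accuracy of about 75%, which is not bad for a simple CNN like the one we have used.

Summary

This chapter showed us how to make a CNN for classifying images in the CIFAR-10 dataset. The classification accuracy on the test set was about 75%. Even when the outputs of the convolutional layers are plotted, it is difficult to see how the neural network recognizes and classifies the input images; better visualization techniques are needed.

Next up, we'll use one of the modern and exciting practices of deep learning, which is transfer learning. Transfer learning allows you to use data-hungry deep learning architectures with small datasets.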

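Chapter 9: Object Detection – Transfer Learning with CNNs

"How individuals would transfer in one context to another context that share similar characteristics"

E. L. Thorndike, R. S. Woodworth (1901)

Transfer learning (TL) is a research problem in data science that is mainly concerned with preserving the knowledge acquired while solving a specific task and using that knowledge to solve another different but similar task. In this chapter, we will demonstrate one of the modern practices and common themes in data science that uses TL. The idea here is how to get help from domains with very large datasets and transfer it to domains with smaller datasets. Finally, we will revisit our object detection example on CIFAR-10 and try to reduce both the training time and the performance error via TL.

The following topics will be covered in this chapter:

Transfer learning
CIFAR-10 object detection: revisited

Transfer learning

Deep learning architectures are data hungry, and with only a few samples in the training set they will not reach their full potential. TL addresses this by transferring the knowledge/representations learned while solving a task on a large dataset to another different but similar task that has a small dataset.

TL is not only useful for small training sets; we can also use it to speed up training. Training large deep learning architectures from scratch can be very slow because these architectures have millions of weights to learn. With TL, you can instead just fine-tune weights that were learned on a similar problem, rather than training the model from scratch.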

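The intuition behind TL

Let's build up the intuition behind TL using the following teacher-student analogy. A teacher has many years of experience in the modules he/she teaches. The students, on the other hand, get a compact overview of the topic from the lectures this teacher gives. So you can say that the teacher is transferring their knowledge to the students in a concise and compact way.

The same teacher-student analogy applies to transferring knowledge in deep learning, or in neural networks in general. Our model learns some representations from the data, which are encoded in the weights of the network. These learned representations/features (weights) can be transferred to another different but similar task. Transferring the learned weights to another task reduces the need for the huge datasets that deep learning architectures require to converge, and it also reduces the time needed to adapt the model to the new dataset compared to training it from scratch.

Deep learning is widely used nowadays, but most people use TL while training deep learning architectures; few train them from scratch, because it is rare to have a dataset large enough for a deep network to converge. So, it is very common to take a model pre-trained on a large dataset such as ImageNet (about 1.2 million images) and apply it to a new task. We can use the weights of that pre-trained model as a feature extractor, or we can use them to initialize our architecture and then fine-tune it for the new task. There are three major scenarios for using TL:

Using a convolutional network as a fixed feature extractor: In this scenario, you take a convolutional model pre-trained on a large dataset such as ImageNet and adapt it to your problem. For example, a convolutional model pre-trained on ImageNet has a fully connected layer that outputs scores for the 1,000 ImageNet categories. So you need to remove this layer, because you are no longer interested in the ImageNet classes. Then, you treat all the other layers as a feature extractor. Once you have extracted the features using the pre-trained model, you can feed them to any linear classifier, such as the softmax classifier or even a linear support vector machine (SVM).
Fine-tuning the convolutional neural network: The second scenario builds on the first, with the extra effort of using backpropagation on your new task to fine-tune the pre-trained weights. Usually, people keep most of the layers fixed and only fine-tune the top end of the network. Trying to fine-tune the whole network, or even most of its layers, may result in overfitting. So, you might be interested in fine-tuning only those layers that are concerned with the semantic-level features of the images. The intuition behind leaving the earlier layers fixed is that they contain generic or low-level features that are common across most imaging tasks, such as corners and edges. Fine-tuning the higher or top-end layers of the network is useful if you are introducing new classes that did not exist in the original dataset the model was pre-trained on.

Figure 10.1: Fine-tuning a pre-trained CNN for a new task

Pre-trained models: The third widely used scenario is to download checkpoints that people have made available on the internet. You may go for this scenario if you do not have the computational power to train the model from scratch; you just initialize the model with the released checkpoints and then do a little fine-tuning.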
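
The first scenario can be sketched in a few lines of NumPy. Everything here is a toy assumption made for illustration: frozen_weights stands in for the body of a pre-trained network (it is just a fixed random projection followed by ReLU), the data is synthetic, and only a small softmax head is trained on top of the extracted features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network body (an assumption for illustration):
# a fixed random projection followed by ReLU. It is never updated below.
frozen_weights = rng.normal(size=(20, 16))
frozen_copy = frozen_weights.copy()

def extract_features(inputs):
    return np.maximum(inputs @ frozen_weights, 0.0)

# Toy "new task": 3 classes, 150 samples with 20 input dimensions
num_classes = 3
targets = rng.integers(0, num_classes, size=150)
inputs = rng.normal(size=(150, 20)) + 2.0 * np.eye(num_classes, 20)[targets]

features = extract_features(inputs)
features = features / features.max()  # scale the features to [0, 1]
one_hot = np.eye(num_classes)[targets]

# Only this softmax head is trained
head_weights = np.zeros((16, num_classes))
head_bias = np.zeros(num_classes)

def forward_loss():
    logits = features @ head_weights + head_bias
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log((probs * one_hot).sum(axis=1)).mean()
    return probs, loss

_, initial_loss = forward_loss()
for _ in range(200):
    probs, _ = forward_loss()
    grad_logits = (probs - one_hot) / len(features)  # gradient of the mean loss
    head_weights -= 0.1 * features.T @ grad_logits
    head_bias -= 0.1 * grad_logits.sum(axis=0)
_, final_loss = forward_loss()

print(initial_loss, final_loss)  # the loss of the head goes down
```

In the fine-tuning scenario, the gradient would also flow into some of the upper layers of the pre-trained body instead of stopping at the extracted features.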

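Differences between traditional machine learning and TL

As you will have noticed from the previous section, there is a clear difference between the traditional way we apply machine learning and machine learning that involves TL (as shown in the following diagram). In traditional machine learning, you do not transfer any knowledge or representations to any other task, which is not the case in TL. Sometimes, people use TL in the wrong way, so we will list a few conditions under which you can use TL to maximize the gains.

The following are the conditions for applying TL:

Unlike traditional machine learning, the source and target tasks or domains do not have to come from the same distribution, but they have to be similar.
You can also use TL when you have few training samples or you don't have the necessary computational power.

Figure 10.2: Traditional machine learning versus machine learning with TL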

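CIFAR-10 object detection: revisited

In the previous chapter, we trained a simple CNN model on the CIFAR-10 dataset. Here, we'll demonstrate using a pre-trained model as a feature extractor: we remove the fully connected layer of the pre-trained model and then feed the extracted features, or transfer values, to a softmax layer.

The pre-trained model in this implementation will be the Inception model, pre-trained on ImageNet. Keep in mind, though, that this implementation builds on the CNNs introduced in the previous two chapters.

Solution outline

Once again, we will replace the final fully connected layer of the pre-trained Inception model and use the rest of it as a feature extractor. So, we first feed our raw images into the Inception model, which extracts features from them and outputs what we call transfer values.

After getting the transfer values of the extracted features from the Inception model, you might want to save them to disk, because computing them on the fly takes some time; persisting them saves you that time. In the TensorFlow tutorials, the term bottleneck values is used instead of transfer values, but it is just a different name for the same thing.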
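
The compute-once-then-load pattern described here can be sketched with pickle. The helper cache_or_compute, the file name, and the toy expensive_features function are hypothetical names introduced for illustration and are not part of any library used in this chapter.

```python
import os
import pickle
import tempfile

def cache_or_compute(cache_path, compute_fn):
    """Load a cached result if the file exists; otherwise compute and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as file:
            return pickle.load(file)
    result = compute_fn()
    with open(cache_path, 'wb') as file:
        pickle.dump(result, file)
    return result

calls = []

def expensive_features():
    # Stand-in for running all the images through the pre-trained model
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

cache_path = os.path.join(tempfile.mkdtemp(), 'transfer_values.pkl')
first = cache_or_compute(cache_path, expensive_features)   # computes and saves
second = cache_or_compute(cache_path, expensive_features)  # loads from disk
print(len(calls))  # 1
```

The expensive function runs only on the first call; every later call reads the pickled values from disk.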

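After getting the transfer values or loading them from disk, we can feed them to any linear classifier customized for our new task. Here, we will feed the extracted transfer values to another neural network and train it for the new classes of CIFAR-10.

The following diagram shows the general solution outline that we will follow:

Figure 10.3: The solution outline for an object detection task on the CIFAR-10 dataset using TL

Loading and exploring CIFAR-10

Let's start by importing the packages required for this implementation: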

%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import time
from datetime import timedelta
import os

# Importing a helper module for the functions of the Inception model.
import inception

Next, we need to load another helper script that downloads and processes the CIFAR-10 dataset:

import cifar10
#importing number of classes of CIFAR-10
from cifar10 import num_classes

If you haven't already done so, you need to set the path for CIFAR-10. This path will be used by the cifar10.py script to persist the dataset:

cifar10.data_path = "data/CIFAR-10/"

The CIFAR-10 dataset is about 170 MB. The next line checks whether the dataset has already been downloaded; if not, it downloads the dataset and stores it in the data_path set above:

cifar10.maybe_download_and_extract()

Output:

- Download progress: 100.0%
Download finished. Extracting files.
Done.

Let's look at the categories in the CIFAR-10 dataset:

#Loading the class names of CIFAR-10 dataset
class_names = cifar10.load_class_names()
class_names

Output:

Loading data: data/CIFAR-10/cifar-10-batches-py/batches.meta
['airplane',
 'automobile',
 'bird',
 'cat',
 'deer',
 'dog',
 'frog',
 'horse', 
 'ship',
 'truck']

Next, we load the training set. This returns the images, the class numbers as integers, and the class numbers as one-hot encoded arrays called labels:

training_images, training_cls_integers, trainig_one_hot_labels = cifar10.load_training_data()

Output:

Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_1
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_2
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_3
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_4
Loading data: data/CIFAR-10/cifar-10-batches-py/data_batch_5

Now, let's do the same for the test set by loading the images together with the integer representation and the one-hot encoding of their target classes:

#Loading the test images, their class integer, and their corresponding one-hot encoding
testing_images, testing_cls_integers, testing_one_hot_labels = cifar10.load_test_data()

Output:

Loading data: data/CIFAR-10/cifar-10-batches-py/test_batch

Let's look at the sizes of the training and test sets in CIFAR-10:

print("-Number of images in the training set:\t\t{}".format(len(training_images)))
print("-Number of images in the testing set:\t\t{}".format(len(testing_images)))

Output:

-Number of images in the training set:          50000
-Number of images in the testing set:           10000

Let's define a helper function that lets us explore the dataset. The following helper function plots up to nine images in a grid:

def plot_imgs(imgs, true_class, predicted_class=None):

    assert len(imgs) == len(true_class)

    # Creating a placeholders for 9 subplots
    fig, axes = plt.subplots(3, 3)

    # Adjustting spacing.
    if predicted_class is None:
        hspace = 0.3
    else:
        hspace = 0.6
    fig.subplots_adjust(hspace=hspace, wspace=0.3)

    for i, ax in enumerate(axes.flat):
        # There may be less than 9 images, ensure it doesn't crash.
        if i < len(imgs):
            # Plot image.
            ax.imshow(imgs[i],
                      interpolation='nearest')

            # Get the actual name of the true class from the class_names array
            true_class_name = class_names[true_class[i]]

            # Showing labels for the predicted and true classes
            if predicted_class is None:
                xlabel = "True: {0}".format(true_class_name)
            else:
                # Name of the predicted class.
                predicted_class_name = class_names[predicted_class[i]]

                xlabel = "True: {0}\nPred: {1}".format(true_class_name, predicted_class_name)

            ax.set_xlabel(xlabel)

        # Remove ticks from the plot.
        ax.set_xticks([])
        ax.set_yticks([])

    plt.show()

Let's visualize some images from the test set along with their actual classes:

# get the first 9 images in the test set
imgs = testing_images[0:9]

# Get the integer representation of the true class.
true_class = testing_cls_integers[0:9]

# Plotting the images
plot_imgs(imgs=imgs, true_class=true_class)

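Output:

Figure 10.4: The first nine images of the test set

Transfer values of the Inception model

As mentioned earlier, we will use the Inception model pretrained on the ImageNet dataset, so we first need to download it from the internet.

Let's start by defining data_dir, the path where the Inception model is stored: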

inception.data_dir = 'inception/'

The weights of the pretrained Inception model are about 85 MB. If the model is not already in the data_dir defined above, the following line downloads it:

inception.maybe_download()

Downloading Inception v3 Model ...
- Download progress: 100%

We load the Inception model so that we can use it as a feature extractor for our CIFAR-10 images:

# Load the Inception model so that we can initialize it with the pretrained weights and customize it for our model
inception_model = inception.Inception()

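As mentioned earlier, computing the transfer values for the CIFAR-10 dataset takes a while, so we cache them for later use. Fortunately, the inception module provides a helper function for this: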

from inception import transfer_values_cache

Next, we set the file paths for the cached training and testing transfer values:

import os  # needed for os.path.join below; harmless if already imported

file_path_train = os.path.join(cifar10.data_path, 'inception_cifar10_train.pkl')
file_path_test = os.path.join(cifar10.data_path, 'inception_cifar10_test.pkl')
print("Processing Inception transfer-values for the training images of Cifar-10 ...")
# First we scale the images to the range the Inception model expects: it requires pixel values from 0 to 255,
# while the CIFAR-10 pixels of our training images are between 0.0 and 1.0
imgs_scaled = training_images * 255.0

# Checking if the transfer-values for our training images are already calculated and loading them, if not calculate and save them.
transfer_values_training = transfer_values_cache(cache_path=file_path_train,
                                              images=imgs_scaled,
                                              model=inception_model)
print("Processing Inception transfer-values for the testing images of Cifar-10 ...")
# Scale the testing images in the same way: the Inception model requires pixel values from 0 to 255,
# while the CIFAR-10 pixels of our testing images are between 0.0 and 1.0
imgs_scaled = testing_images * 255.0
# Check whether the transfer values for our testing images are already cached and load them; if not, calculate and save them.
transfer_values_testing = transfer_values_cache(cache_path=file_path_test,
                                     images=imgs_scaled,
                                     model=inception_model)

As mentioned earlier, the CIFAR-10 training set contains 50,000 images. Let's check the shape of their transfer values; there should be 2,048 transfer values per image:

transfer_values_training.shape

Output:

(50000, 2048)

We do the same for the test set:

transfer_values_testing.shape

Output:

(10000, 2048)

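To get an intuitive feel for what the transfer values look like, we define a helper function that plots the transfer values of a given image from the test set: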

def plot_transferValues(ind):
    print("Original input image:")

    # Plot the image at index ind of the test set.
    plt.imshow(testing_images[ind], interpolation='nearest')
    plt.show()

    print("Transfer values using Inception model:")

    # Visualize the transfer values as an image.
    transferValues_img = transfer_values_testing[ind]
    transferValues_img = transferValues_img.reshape((32, 64))

    # Plotting the transfer values image.
    plt.imshow(transferValues_img, interpolation='nearest', cmap='Reds')
    plt.show()
plot_transferValues(ind=16)

Original input image:

Figure 10.5: Input image

Transfer values using Inception model:

Figure 10.6: Transfer values of the input image in Figure 10.5

plot_transferValues(ind=17)

Figure 10.7: Input image

Transfer values using Inception model:

Figure 10.8: Transfer values of the input image in Figure 10.7

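Analysis of transfer values

In this section, we analyze the transfer values we just obtained for the training images. The goal is to see whether these transfer values are sufficient for classifying the images in CIFAR-10.

Each input image has 2,048 transfer values. To plot and analyze them further, we can use a dimensionality reduction technique such as principal component analysis (PCA) from scikit-learn. We reduce the transfer values from 2,048 dimensions to 2 so that we can visualize them and see whether they are good features for separating the CIFAR-10 classes: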

from sklearn.decomposition import PCA

Next, we create a PCA object with two components:

pca_obj = PCA(n_components=2)

Reducing the transfer values from 2,048 to 2 dimensions takes a long time, so we use only a subset of 3,000 of the 50,000 training images:

subset_transferValues = transfer_values_training[0:3000]

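We also need the class numbers of these images:

# The subset comes from the training set, so take the matching training labels
# (assuming they were loaded into training_cls_integers alongside training_images).
cls_integers = training_cls_integers[0:3000]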

We can double-check the subset by printing the shape of its transfer values:

subset_transferValues.shape

Output:

(3000, 2048)

Next, we use our PCA object to reduce the transfer values from 2,048 to just 2:

reduced_transferValues = pca_obj.fit_transform(subset_transferValues)

Now, let's look at the output of the PCA reduction:

reduced_transferValues.shape

Output:

(3000, 2)
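To make the reduction step concrete, here is a minimal NumPy sketch of what PCA does to a matrix of transfer values: it centers the data and projects it onto the top two principal directions. The random array fake_transfer_values and the helper name pca_project are illustrative assumptions and not part of the original code; scikit-learn's PCA may also differ in the sign of each component.

```python
import numpy as np

def pca_project(values, n_components=2):
    """Project the rows of `values` onto their top principal components."""
    # Center each feature so the components describe variance, not the mean.
    centered = values - values.mean(axis=0)
    # The rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical stand-in for a small batch of 2,048-dimensional transfer values.
rng = np.random.default_rng(0)
fake_transfer_values = rng.normal(size=(300, 2048))

reduced = pca_project(fake_transfer_values, n_components=2)
print(reduced.shape)
```

Each row of reduced is the two-dimensional summary of one image, which is exactly the kind of array the scatter plot in this section expects.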

Having reduced the transfer values to just two dimensions, let's plot them:

#Importing the color map for plotting each class with different color.
import matplotlib.cm as color_map

def plot_reduced_transferValues(transferValues, cls_integers):

    # Create a color-map with a different color for each class.
    c_map = color_map.rainbow(np.linspace(0.0, 1.0, num_classes))

    # Getting the color for each sample.
    colors = c_map[cls_integers]

    # Getting the x and y values.
    x_val = transferValues[:, 0]
    y_val = transferValues[:, 1]

    # Plot the transfer values in a scatter plot
    plt.scatter(x_val, y_val, color=colors)
    plt.show()

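Here we plot the reduced transfer values of the training subset. CIFAR-10 has 10 classes, so we plot the transfer values of each class in a different color. As the following figure shows, the transfer values are grouped by class. The groups overlap because the PCA reduction cannot separate the transfer values properly: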

plot_reduced_transferValues(reduced_transferValues, cls_integers)

Figure 10.9: Transfer values reduced using PCA

We can analyze the transfer values further with another dimensionality reduction method, t-SNE:

from sklearn.manifold import TSNE

Again we reduce the dimensionality of the transfer values, but this time from 2,048 to 50 values instead of 2:

pca_obj = PCA(n_components=50)
transferValues_50d = pca_obj.fit_transform(subset_transferValues)

Next, we stack the second dimensionality reduction technique on top and feed it the output of the PCA step:

tsne_obj = TSNE(n_components=2)

Finally, we apply t-SNE to the values reduced by PCA:

reduced_transferValues = tsne_obj.fit_transform(transferValues_50d)

And we double-check that the result has the right shape:

reduced_transferValues.shape

Output:

(3000, 2)

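Let's plot the transfer values reduced by t-SNE. As the following figure shows, t-SNE separates the grouped transfer values better than PCA does.

From this analysis, we conclude that the transfer values obtained by feeding the input images to the pretrained Inception model can be used to separate the training images into the 10 classes. Because of the slight overlap in the following figure, the separation will not be 100% accurate, but we can reduce this overlap by fine-tuning the pretrained model: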

plot_reduced_transferValues(reduced_transferValues, cls_integers)

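Figure 10.10: Transfer values reduced using t-SNE

Now that we have extracted the transfer values of the training images and know that they can distinguish between the CIFAR-10 classes to some extent, we need to build a small fully connected classifier and feed these transfer values to it to do the actual classification.

Model building and training

Let's start by specifying the input placeholder variables that will be fed to our neural network model. The first placeholder, which holds the extracted transfer values, has shape [None, transfer_len]. The second placeholder holds the actual class labels of the training set in one-hot vector format: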

transferValues_arrLength = inception_model.transfer_len
input_values = tf.placeholder(tf.float32, shape=[None, transferValues_arrLength], name='input_values')
y_actual = tf.placeholder(tf.float32, shape=[None, num_classes], name='y_actual')

We can also get the integer class of each example, from 0 to 9, by applying an argmax operation to the one-hot labels:

y_actual_cls = tf.argmax(y_actual, axis=1)

Next, we build the actual classification network, which takes these input placeholders and produces the predicted classes:

def new_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

def new_biases(length):
    return tf.Variable(tf.constant(0.05, shape=[length]))

def new_fc_layer(input,          # The previous layer.
                 num_inputs,     # Num. inputs from prev. layer.
                 num_outputs,    # Num. outputs.
                 use_relu=True): # Use Rectified Linear Unit (ReLU)?

    # Create new weights and biases.
    weights = new_weights(shape=[num_inputs, num_outputs])
    biases = new_biases(length=num_outputs)

    # Calculate the layer as the matrix multiplication of
    # the input and weights, and then add the bias-values.
    layer = tf.matmul(input, weights) + biases

    # Use ReLU?
    if use_relu:
        layer = tf.nn.relu(layer)

    return layer

# First fully-connected layer.
layer_fc1 = new_fc_layer(input=input_values,
                             num_inputs=2048,
                             num_outputs=1024,
                             use_relu=True)

# Second fully-connected layer.
layer_fc2 = new_fc_layer(input=layer_fc1,
                             num_inputs=1024,
                             num_outputs=num_classes,
                             use_relu=False)

# Predicted class-label.
y_predicted = tf.nn.softmax(layer_fc2)

# Cross-entropy for the classification of each image.
cross_entropy = \
    tf.nn.softmax_cross_entropy_with_logits(logits=layer_fc2,
                                                labels=y_actual)

# Loss aka. cost-measure.
# This is the scalar value that must be minimized.
loss = tf.reduce_mean(cross_entropy)
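The loss above can be checked by hand on a tiny example. The following is a NumPy sketch of the same quantity, softmax followed by cross-entropy against one-hot labels, and not the TensorFlow implementation itself; the helper name softmax_cross_entropy and the two example rows are assumptions made for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_labels):
    """Per-example cross-entropy between softmax(logits) and one-hot labels."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1)

# Two hypothetical examples with three classes each.
example_logits = np.array([[2.0, 1.0, 0.1],
                           [0.0, 0.0, 0.0]])
example_labels = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(example_logits, example_labels)
mean_loss = per_example.mean()
print(per_example, mean_loss)
```

The second row has identical logits, so its loss is the natural logarithm of the number of classes; an untrained 10-class classifier typically starts near ln(10) for the same reason.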

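Then we need to define the optimization criterion used while training the classifier. In this implementation, we use AdamOptimizer. The output of the classifier is an array of 10 probability scores, one for each class in the CIFAR-10 dataset. We then apply the argmax operation to this array to assign the class with the highest score to the input sample: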

step = tf.Variable(initial_value=0,
                          name='step', trainable=False)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss, step)
y_predicted_cls = tf.argmax(y_predicted, axis=1)
# Compare the predicted and true classes
correct_prediction = tf.equal(y_predicted_cls, y_actual_cls)
# Cast the boolean values to float and average them to get the accuracy
model_accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

Next, we define a TensorFlow session that actually executes the graph, and we initialize the variables defined earlier in this implementation:

session = tf.Session()
session.run(tf.global_variables_initializer())

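In this implementation, we train on random mini-batches, in the style of stochastic gradient descent (SGD), so we need a function that randomly draws a batch of a given size from our training set of 50,000 images.

We therefore define a helper function that generates a random batch from the transfer values of the training set: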

#defining the size of the train batch
train_batch_size = 64

#defining a function for randomly selecting a batch of images from the dataset
def select_random_batch():
    # Number of images (transfer-values) in the training-set.
    num_imgs = len(transfer_values_training)

    # Draw random indices without replacement.
    ind = np.random.choice(num_imgs,
                           size=train_batch_size,
                           replace=False)

    # Use the random index to select random x and y-values.
    # We use the transfer-values instead of images as x-values.
    x_batch = transfer_values_training[ind]
    y_batch = trainig_one_hot_labels[ind]

    return x_batch, y_batch

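Next, we define a helper function that performs the actual optimization of the network weights. On each iteration it draws a batch and optimizes the network on that batch: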

def optimize(num_iterations):

    for i in range(num_iterations):
        # Select a random batch of images for training,
        # where the transfer values of the images are stored in input_batch
        # and the actual labels of that batch are stored in y_actual_batch
        input_batch, y_actual_batch = select_random_batch()

        # storing the batch in a dict with the proper names
        # such as the input placeholder variables that we define above.
        feed_dict = {input_values: input_batch,
                           y_actual: y_actual_batch}

        # Now we call the optimizer of this batch of images
        # TensorFlow will automatically feed the values of the dict we created above
        # to the model input placeholder variables that we defined above.
        i_global, _ = session.run([step, optimizer],
                                  feed_dict=feed_dict)

        # print the accuracy every 100 steps.
        if (i_global % 100 == 0) or (i == num_iterations - 1):
            # Calculate the accuracy on the training-batch.
            batch_accuracy = session.run(model_accuracy,
                                    feed_dict=feed_dict)

            msg = "Step: {0:>6}, Training Accuracy: {1:>6.1%}"
            print(msg.format(i_global, batch_accuracy))

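We define some helper functions to show the results of the network above: one plots misclassified images and another prints the confusion matrix of the predictions: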

def plot_errors(cls_predicted, cls_correct):

    # cls_predicted is an array of the predicted class-number for
    # all images in the test-set.

    # cls_correct is a boolean array (or list) indicating
    # whether the model predicted the correct class for each image.

    # Negate the booleans; np.asarray makes this work for a plain list too.
    incorrect = np.logical_not(np.asarray(cls_correct))

    # Get the images from the test-set that have been
    # incorrectly classified. 
    incorrectly_classified_images = testing_images[incorrect]

    # Get the predicted classes for those images.
    cls_predicted = cls_predicted[incorrect]

    # Get the true classes for those images.
    true_class = testing_cls_integers[incorrect]

    n = min(9, len(incorrectly_classified_images))

    # Plot the first n images.
    plot_imgs(imgs=incorrectly_classified_images[0:n],
                true_class=true_class[0:n],
                predicted_class=cls_predicted[0:n])

Next, we define a helper function that prints the confusion matrix:

from sklearn.metrics import confusion_matrix

def plot_confusionMatrix(cls_predicted):

    # cls_predicted is an array of the predicted class numbers
    # for all images in the test set.

    # Compute the confusion matrix with scikit-learn
    cm = confusion_matrix(y_true=testing_cls_integers,
                          y_pred=cls_predicted)

    # Printing the confusion matrix
    for i in range(num_classes):
        # Append the class-name to each line.
        class_name = "({}) {}".format(i, class_names[i])
        print(cm[i, :], class_name)

    # labeling each column of the confusion matrix with the class number
    cls_numbers = [" ({0})".format(i) for i in range(num_classes)]
    print("".join(cls_numbers))

In addition, we define another helper function that runs the trained classifier on the test set and measures its accuracy there:

# Split the data-set in batches of this size to limit RAM usage.
batch_size = 128

def predict_class(transferValues, labels, cls_true):

    # Number of images.
    num_imgs = len(transferValues)

    # Allocate an array for the predicted classes which
    # will be calculated in batches and filled into this array.
    cls_predicted = np.zeros(shape=num_imgs, dtype=int)

    # Now calculate the predicted classes for the batches.
    # We will just iterate through all the batches.
    # There might be a more clever and Pythonic way of doing this.

    # The starting index for the next batch is denoted i.
    i = 0

    while i < num_imgs:
        # The ending index for the next batch is denoted j.
        j = min(i + batch_size, num_imgs)

        # Create a feed-dict with the images and labels
        # between index i and j.
        feed_dict = {input_values: transferValues[i:j],
                     y_actual: labels[i:j]}

        # Calculate the predicted class using TensorFlow.
        cls_predicted[i:j] = session.run(y_predicted_cls, feed_dict=feed_dict)

        # Set the start-index for the next batch to the
        # end-index of the current batch.
        i = j

    # Create a boolean NumPy array indicating whether each image is correctly classified.
    correct = (np.asarray(cls_true) == cls_predicted)

    return correct, cls_predicted

# Call the function above to make predictions for the test set

def predict_class_test():
    return predict_class(transferValues = transfer_values_testing,
                       labels = testing_one_hot_labels,
                       cls_true = testing_cls_integers)

def classification_accuracy(correct):
    # When averaging a boolean array, False means 0 and True means 1.
    # So we are calculating: number of True / len(correct) which is
    # the same as the classification accuracy.

    # Return the classification accuracy
    # and the number of correct classifications.
    return np.mean(correct), np.sum(correct)

def test_accuracy(show_example_errors=False,
                        show_confusion_matrix=False):

    # For all the images in the test-set,
    # calculate the predicted classes and whether they are correct.
    correct, cls_pred = predict_class_test()

    # Classification accuracy and the number of correct classifications.
    accuracy, num_correct = classification_accuracy(correct)

    # Number of images being classified.
    num_images = len(correct)

    # Print the accuracy.
    msg = "Accuracy on Test-Set: {0:.1%} ({1} / {2})"
    print(msg.format(accuracy, num_correct, num_images))

    # Plot some examples of mis-classifications, if desired.
    if show_example_errors:
        print("Example errors:")
        plot_errors(cls_predicted=cls_pred, cls_correct=correct)

    # Plot the confusion matrix, if desired.
    if show_confusion_matrix:
        print("Confusion Matrix:")
        plot_confusionMatrix(cls_predicted=cls_pred)

Before doing any optimization, let's see how the network above performs:

test_accuracy(show_example_errors=True,
                    show_confusion_matrix=True)

Accuracy on Test-Set: 9.4% (939 / 10000)

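As you can see, the network performs very poorly (about 10% is what random guessing among 10 classes would give), but its performance improves once we optimize it with the criterion defined above. So we run the optimizer for 10,000 iterations and then test the model's accuracy again: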

optimize(num_iterations=10000)
test_accuracy(show_example_errors=True,
                           show_confusion_matrix=True)
Accuracy on Test-Set: 90.7% (9069 / 10000)
Example errors:

Figure 10.11: Some misclassified images from the test set

Confusion Matrix:
[926   6  13   2   3   0   1   1  29  19] (0) airplane
[  9 921   2   5   0   1   1   1   2  58] (1) automobile
[ 18   1 883  31  32   4  22   5   1   3] (2) bird
[  7   2  19 855  23  57  24   9   2   2] (3) cat
[  5   0  21  25 896   4  24  22   2   1] (4) deer
[  2   0  12  97  18 843  10  15   1   2] (5) dog
[  2   1  16  17  17   4 940   1   2   0] (6) frog
[  8   0  10  19  28  14   1 914   2   4] (7) horse
[ 42   6   1   4   1   0   2   0 932  12] (8) ship
[  6  19   2   2   1   0   1   1   9 959] (9) truck
 (0) (1) (2) (3) (4) (5) (6) (7) (8) (9)
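The confusion matrix above can be summarized with a few lines of plain Python. This sketch copies the printed matrix (rows are true classes, columns are predicted classes), recomputes the overall accuracy, and finds the most frequent confusion; the variable names are chosen for this illustration only.

```python
class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [926, 6, 13, 2, 3, 0, 1, 1, 29, 19],
    [9, 921, 2, 5, 0, 1, 1, 1, 2, 58],
    [18, 1, 883, 31, 32, 4, 22, 5, 1, 3],
    [7, 2, 19, 855, 23, 57, 24, 9, 2, 2],
    [5, 0, 21, 25, 896, 4, 24, 22, 2, 1],
    [2, 0, 12, 97, 18, 843, 10, 15, 1, 2],
    [2, 1, 16, 17, 17, 4, 940, 1, 2, 0],
    [8, 0, 10, 19, 28, 14, 1, 914, 2, 4],
    [42, 6, 1, 4, 1, 0, 2, 0, 932, 12],
    [6, 19, 2, 2, 1, 0, 1, 1, 9, 959],
]

num_classes = len(cm)
total = sum(sum(row) for row in cm)
num_correct = sum(cm[i][i] for i in range(num_classes))
accuracy = num_correct / total

# Recall per class: correct predictions divided by the number of true examples.
recall = {class_names[i]: cm[i][i] / sum(cm[i]) for i in range(num_classes)}

# The largest off-diagonal entry is the most frequent confusion.
count, true_idx, pred_idx = max(
    (cm[i][j], i, j)
    for i in range(num_classes) for j in range(num_classes) if i != j)

print("accuracy: {:.4f}".format(accuracy))
print("most confused: {} predicted as {} ({} images)".format(
    class_names[true_idx], class_names[pred_idx], count))
```

The diagonal sums to the 9,069 correct predictions reported above, and the most common mistake is a dog predicted as a cat, which matches the intuition that visually similar classes are the hardest to separate.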

Finally, we close the Inception model and the TensorFlow session that we opened earlier:

inception_model.close()
session.close()

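Summary

In this chapter, we introduced transfer learning (TL), one of the most widely used practices in deep learning. TL is a very exciting tool that lets your deep learning architectures learn from small datasets, but make sure you use it in the right way.

Next, we will introduce a deep learning architecture that is widely used in natural language processing. These recurrent-type architectures have achieved breakthroughs in most NLP domains: machine translation, speech recognition, language modeling, and sentiment analysis.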

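Chapter 10: Recurrent-Type Neural Networks - Language Modeling

Recurrent neural networks (RNNs) are a class of deep learning architectures widely used in natural language processing. These architectures let us provide contextual information to the current prediction, and some variants are designed specifically to handle long-term dependencies in an input sequence. In this chapter, we show how to build a sequence-to-sequence model, which is useful in many NLP applications. We demonstrate these concepts by building a character-level language model and seeing how the model generates sentences similar to the original input sequences.

This chapter covers the following topics:

The intuition behind RNNs
LSTM networks
Implementation of the language model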

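The intuition behind RNNs

None of the deep learning architectures we have dealt with so far has a mechanism for remembering the input it received previously. For example, if you feed a feedforward neural network (FNN) a sequence of characters such as HELLO, then by the time the network gets to E it has retained no information about the H it just read. This is a serious problem for sequence-based learning: because the network remembers none of the characters it has read, it is very hard to train it to predict the next character. This makes such networks unsuitable for many applications, such as language modeling, machine translation, and speech recognition.

For this specific reason, we introduce RNNs, a set of deep learning architectures that do preserve information and remember what they have just encountered.

Let's see how an RNN works on the same input sequence, HELLO. When the RNN unit receives E as input, it also receives the character H that came before it. Feeding the current character together with the past ones to the RNN unit gives these architectures a great advantage, namely short-term memory; it also lets them predict the most likely character to follow HE in this particular sequence, which is L.

We have seen that the previous architectures assign weights to their inputs; RNNs follow the same optimization process, assigning weights to their multiple inputs, both present and past. So in this case the network uses two different weight matrices, one for the current input and one for the past input. To learn them, we use gradient descent and a heavier version of backpropagation called backpropagation through time (BPTT).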
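The idea of two weight matrices, one for the current input and one for the past, can be sketched as a single recurrent step in NumPy. This is an illustrative toy with randomly initialized, untrained weights; the names W_xh, W_hh, and rnn_step, the hidden size, and the tanh nonlinearity are assumptions of this sketch rather than something specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['H', 'E', 'L', 'O']
hidden_size = 3

# Two weight matrices: one for the current input, one for the previous hidden state.
W_xh = rng.normal(scale=0.5, size=(hidden_size, len(vocab)))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def one_hot(char):
    """Encode a character as a vector of zeros with a single one."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(char)] = 1.0
    return vec

def rnn_step(x_t, h_prev):
    """Combine the current input with the memory of past inputs."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)
states = []
for char in "HELLO":
    h = rnn_step(one_hot(char), h)
    states.append(h)

# The two L characters produce different hidden states because their histories differ.
print(states[2])
print(states[3])
```

A feedforward network would map both occurrences of L to the same output; here the hidden state carries the context that distinguishes them.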

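Recurrent neural network architectures

Given what we know about the previous deep learning architectures, you will find that RNNs are special. The architectures we learned about earlier are not flexible in terms of their input or training: they accept a fixed-size sequence/vector/image as input and produce another fixed-size output. RNN architectures are different because they let you feed in a sequence and get another sequence as output, or have a sequence only in the input or only in the output, as shown in Figure 1. This flexibility is very useful for applications such as language modeling and sentiment analysis:

Figure 1: Flexibility of RNNs in terms of input or output shape (karpathy.github.io/2015/05/21/rnn-effectiveness/)

The intuition behind these architectures is to mimic the way humans process information. In any typical conversation, your understanding of someone's words depends heavily on what they said before, and you may even be able to predict what they will say next based on what they have just said.

RNNs should follow exactly the same process. For example, suppose you want to translate a specific word in a sentence. You can't use a traditional FNN for this, because it cannot take the translation of the previous words as input alongside the word we currently want to translate, and the missing context around the word may lead to an incorrect translation.

RNNs preserve information about the past, and they have a kind of loop that allows previously learned information to be used for the current prediction at any given point:

Figure 2: RNN architecture with a loop that persists information from past steps (source: colah.github.io/posts/2015-08-Understanding-LSTMs/)

In Figure 2, a neural network called A receives an input X[t] and produces an output h[t]. It also receives information from past steps with the help of the loop.

The loop may seem unclear, but if we look at the unrolled version of Figure 2, you will find it simple and intuitive: an RNN is just repeated copies of the same network (which could be an ordinary FNN), as shown in Figure 3:

Figure 3: An unrolled version of the recurrent neural network architecture (source: colah.github.io/posts/2015-08-Understanding-LSTMs/)

This intuitive architecture of RNNs and their flexibility in terms of input/output shape make them a good fit for interesting sequence-based learning tasks such as machine translation, language modeling, sentiment analysis, and image captioning.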

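Examples of RNNs

Now that we have an intuitive understanding of how RNNs work and how useful they are in different sequence-based examples, let's take a closer look at some of these examples.

Character-level language models

Language modeling is an essential task for many applications, such as speech recognition and machine translation. In this section, we walk through the training process of an RNN to get a deeper understanding of how these networks work. We build a character-level language model: we feed the network a chunk of text and try to build a probability distribution over the next character given the previous characters, which lets us generate text similar to the text we fed in during training.

For example, suppose we have a language whose vocabulary has only four letters, helo.

The task is to train a recurrent neural network on a specific input sequence of characters such as hello. In this example, we have four training samples:

Given the context of the first input character h, the probability of the character e should be calculated,
Given the context of he, the probability of the character l should be calculated,
Given the context of hel, the probability of the character l should be calculated, and
Finally, given the context of hell, the probability of the character o should be calculated.

As we learned in previous chapters, machine learning techniques, which deep learning is part of, generally accept only real-valued numbers as input. So we need to convert or encode the input characters into numerical form. To do this, we use one-hot vector encoding, which encodes a character as a vector of zeros with a single one, whose position is the index of the character in the vocabulary of the language we are modeling (in this case, helo). After encoding the training samples, we feed them to the RNN-type model one at a time. For each character it is given, the model outputs a four-dimensional vector (its size matches the size of the vocabulary) that represents how likely each character in the vocabulary is to come next. Figure 4 illustrates this process:

Figure 4: Example of an RNN-type network whose input is characters encoded as one-hot vectors and whose output is a distribution over the vocabulary representing the most likely character after the current one (source: karpathy.github.io/2015/05/21/…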

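As shown in Figure 4, we feed the first character of our input sequence, h, to the model, and the output is a four-dimensional vector of confidences for the next character. The model assigns a confidence of 1.0 to h being the next character, 2.2 to e, -3.0 to l, and 4.1 to o. In this example, we know from our training sequence hello that the correct next character is e. So our main goal in training this RNN-type network is to increase the confidence of e as the next character and decrease the confidence of the other characters. To do this, we use gradient descent and backpropagation to update the weights so that the network produces a higher confidence for the correct next character, e, and likewise for the other three training examples.

As you can see, the output of the RNN-type network is a confidence distribution over all the characters of the vocabulary for the next character. We can turn these confidences into a probability distribution, so that increasing the probability of one character as the next character decreases the probabilities of the others, because the probabilities must sum to 1. For this, we can apply a standard softmax layer to each output vector.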
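The encoding and the softmax step described above can be worked through in plain Python. This sketch builds the four training pairs from hello, one-hot encodes a character over the vocabulary helo, and turns the confidence scores from the example into probabilities; the helper names one_hot and softmax are chosen for this illustration.

```python
import math

vocab = ['h', 'e', 'l', 'o']

def one_hot(char):
    """Return a vector of zeros with a single 1 at the character's index."""
    return [1 if c == char else 0 for c in vocab]

def softmax(scores):
    """Turn raw confidence scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The four (context, next character) training pairs from "hello".
text = "hello"
training_pairs = [(text[:i], text[i]) for i in range(1, len(text))]

# Confidence scores from the example above for the characters h, e, l, o.
scores = [1.0, 2.2, -3.0, 4.1]
probabilities = softmax(scores)

print(training_pairs)
print(one_hot('e'))
print([round(p, 3) for p in probabilities])
```

With these untrained scores, o receives roughly 0.84 of the probability mass and the correct character e only about 0.13, which is exactly the gap that training is meant to close.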

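To generate text from these kinds of networks, we feed an initial character to the model and get a probability distribution over the next character; we then sample from this distribution and feed the sampled character back to the model as input. Repeating this process many times yields a sequence of characters of the desired length.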

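Language model using Shakespeare data

From the previous example, we get a model that generates text. But the network will surprise us: it not only generates text but also learns the style and structure of the training data. We can demonstrate this by training an RNN-type model on text with a distinctive structure and style, such as the works of Shakespeare.

Let's look at output generated by the trained network: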

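Second Senator:

They are away this miseries, produced upon my soul,

Breaking and strongly should be buried, when I perish

The earth and thoughts of many states.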

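Even though the network only knows how to generate one character at a time, it is able to generate reasonably coherent text and names that have the structure and style of Shakespeare's work.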

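The vanishing gradient problem

When training these RNN-type architectures, we use gradient descent and backpropagation through time, which have brought success to many sequence-based learning tasks. But because of the nature of the gradient and the use of fast training strategies, it has been shown that gradient values tend to become too small and vanish. This is the vanishing gradient problem that many practitioners run into. Later in this chapter, we discuss how researchers approached this problem and the variations of vanilla RNNs they proposed to overcome it:

Figure 5: The vanishing gradient problem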
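A simplified scalar picture shows why gradients shrink. Backpropagation through time multiplies the gradient by some factor at every time step; the sketch below assumes that factor is a constant, which is a deliberate simplification of the real matrix products, and the function name gradient_after is hypothetical.

```python
def gradient_after(num_steps, per_step_factor):
    """Magnitude of a gradient after being scaled once per time step."""
    gradient = 1.0
    for _ in range(num_steps):
        gradient *= per_step_factor
    return gradient

for steps in (1, 10, 50, 100):
    shrinking = gradient_after(steps, 0.9)
    growing = gradient_after(steps, 1.1)
    print(steps, shrinking, growing)
```

A factor slightly below 1 drives the gradient toward zero after many steps (vanishing), while a factor slightly above 1 makes it grow without bound (exploding), so signals from early time steps are either lost or destabilize training.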

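The problem of long-term dependencies

Another challenging problem researchers face is long-term dependencies in text. For example, if someone feeds in a sequence like I used to live in France and I learned how to speak..., the obvious next word is French.

In situations like this, vanilla RNNs can handle the dependency because it is short-term, as shown in Figure 6:

Figure 6: Short-term dependencies in text (source: colah.github.io/posts/2015-…

As another example, suppose someone starts the sequence with I used to live in France..., goes on to describe some of the good experiences of living there, and finally ends the sequence with I learned to speak French. For the model to predict the language he or she learned at the end of the sequence, it needs to retain information about the early words live and France. A model that cannot keep track of long-term dependencies in the text will not handle such cases:

Figure 7: The challenge of long-term dependencies in text (source: colah.github.io/posts/2015-…

To handle vanishing gradients and long-term dependencies in text, researchers introduced a variant of the vanilla RNN called the long short-term memory (LSTM) network.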

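LSTM networks

LSTM is a variant of the RNN that helps with learning long-term dependencies in text. LSTMs were originally introduced by Hochreiter and Schmidhuber (1997) (link: www.bioinf.jku.at/publications/older/2604.pdf), and many researchers have built on this work and produced interesting results in many domains.

These architectures can handle long-term dependencies in text because of the way their inner architecture is designed.

LSTMs are similar to vanilla RNNs in that they have a module that repeats over time, but the inner structure of this repeating module differs from that of a vanilla RNN. It includes more layers for forgetting and updating information:

Figure 8: The repeating module in a standard RNN containing a single layer (source: colah.github.io/posts/2015-…

As mentioned, a vanilla RNN has a single neural network layer, whereas an LSTM has four layers that interact in a special way. This interaction is what makes LSTMs work so well in many domains, as we will see when we build our language model example:

Figure 9: The repeating module in an LSTM containing four interacting layers (source: colah.github.io/posts/2015-…

For the mathematical details and how the four layers interact, see this tutorial: colah.github.io/posts/2015-08-Understanding-LSTMs/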

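Why do LSTMs work?

The first step in our basic LSTM architecture is to decide which information is unnecessary and throw it away, leaving room for more important information. For this, a layer called the forget gate layer looks at the previous output h[t-1] and the current input x[t] and decides which information to discard.

The next step is to decide which new information is worth keeping and storing in the cell state. This is done in two steps:

A layer called the input gate layer decides which values of the previous cell state should be updated
A second layer generates a set of new candidate values that will be added to the cell state

Finally, we need to decide what the LSTM cell will output. This output is based on our cell state, but it is a filtered version of it.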
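The steps above can be sketched as one LSTM time step in NumPy. This follows the standard LSTM formulation (forget gate, input gate, candidate values, output gate); the parameter names, the sizes, and the random untrained weights are assumptions of this sketch, not code from the implementation that follows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    concat = np.concatenate([h_prev, x_t])
    forget_gate = sigmoid(params["W_f"] @ concat + params["b_f"])  # what to discard
    input_gate = sigmoid(params["W_i"] @ concat + params["b_i"])   # what to update
    candidates = np.tanh(params["W_c"] @ concat + params["b_c"])   # new candidate values
    output_gate = sigmoid(params["W_o"] @ concat + params["b_o"])  # what to expose

    # Forget part of the old cell state, then add the gated candidates.
    c_t = forget_gate * c_prev + input_gate * candidates
    # The output is a filtered version of the cell state.
    h_t = output_gate * np.tanh(c_t)
    return h_t, c_t

# Hypothetical sizes and random parameters, for illustration only.
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
params = {}
for gate in "fico":
    params["W_" + gate] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["b_" + gate] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
x = np.array([1.0, 0.0, 0.0, 0.0])  # a one-hot input
h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated additively (old state times the forget gate, plus gated candidates), information can persist across many steps instead of being overwritten at each one.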

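Implementation of the language model

In this section, we build a character-level language model. For this implementation, we use the novel Anna Karenina and watch how the network learns the structure and style of the text:

Figure 10: The general architecture of the character-level RNN (source: karpathy.github.io/2015/05/21/…

The network is based on Andrej Karpathy's post on RNNs (link: karpathy.github.io/2015/05/21/rnn-effectiveness/) and his implementation in Torch (link: github.com/karpathy/char-rnn). There is also some material from r2rt (link: r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and from Sherjil Ozair (link: github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Figure 10 shows the general architecture of the character-level RNN.

We build a character-level RNN trained on the novel Anna Karenina (link: en.wikipedia.org/wiki/Anna_Karenina). It will be able to generate new text based on the text of the book. You will find the .txt file in the resource bundle of this implementation.

Let's start by importing the libraries needed for this character-level implementation: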

import numpy as np
import tensorflow as tf

from collections import namedtuple

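First, we prepare the dataset by loading it and converting its characters to integers. Encoding the characters as integers makes them convenient to use as input variables for the model: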

#reading the Anna Karenina novel text file
with open('Anna_Karenina.txt', 'r') as f:
    textlines=f.read()

#Building the vocab and encoding the characters as integers
language_vocab = set(textlines)
vocab_to_integer = {char: j for j, char in enumerate(language_vocab)}
integer_to_vocab = dict(enumerate(language_vocab))
encoded_vocab = np.array([vocab_to_integer[char] for char in textlines], dtype=np.int32)

Let's look at the first 200 characters of the Anna Karenina text:

textlines[:200]
Output:
"Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverything was in confusion in the Oblonskys' house. The wife had\ndiscovered that the husband was carrying on"

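We have also converted the characters to a form convenient for the network, integers. So let's look at the encoded version of the same characters: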

encoded_vocab[:200]
Output:
array([70, 34, 54, 29, 24, 19, 76, 45, 2, 79, 79, 79, 69, 54, 29, 29, 49,
       45, 66, 54, 39, 15, 44, 15, 19, 12, 45, 54, 76, 19, 45, 54, 44, 44,
      45, 54, 44, 15, 27, 19, 58, 45, 19, 30, 19, 76, 49, 45, 59, 56, 34,
       54, 29, 29, 49, 45, 66, 54, 39, 15, 44, 49, 45, 15, 12, 45, 59, 56,
       34, 54, 29, 29, 49, 45, 15, 56, 45, 15, 24, 12, 45, 11, 35, 56, 79,
       35, 54, 49, 53, 79, 79, 36, 30, 19, 76, 49, 24, 34, 15, 56, 16, 45,
       35, 54, 12, 45, 15, 56, 45, 31, 11, 56, 66, 59, 12, 15, 11, 56, 45,
       15, 56, 45, 24, 34, 19, 45, 1, 82, 44, 11, 56, 12, 27, 49, 12, 37,
       45, 34, 11, 59, 12, 19, 53, 45, 21, 34, 19, 45, 35, 15, 66, 19, 45,
       34, 54, 64, 79, 64, 15, 12, 31, 11, 30, 19, 76, 19, 64, 45, 24, 34,
       54, 24, 45, 24, 34, 19, 45, 34, 59, 12, 82, 54, 56, 64, 45, 35, 54,
       12, 45, 31, 54, 76, 76, 49, 15, 56, 16, 45, 11, 56], dtype=int32)
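The mapping between characters and integers is easy to verify on a short string. The sketch below is a small standalone example; unlike the code above, it sorts the vocabulary so that the integer codes are the same on every run, which is a choice made for this illustration only.

```python
text = "hello world"

# Sorting the vocabulary makes the integer codes reproducible across runs;
# iterating over a plain set of characters does not guarantee a stable order.
vocab = sorted(set(text))
char_to_int = {char: index for index, char in enumerate(vocab)}
int_to_char = dict(enumerate(vocab))

# Encode the text as integers, then decode it back to characters.
encoded = [char_to_int[char] for char in text]
decoded = "".join(int_to_char[code] for code in encoded)

print(vocab)
print(encoded)
print(decoded)
```

The round trip through int_to_char recovers the original text, which is how the generated integer predictions are later turned back into readable characters.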

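Because the network works with individual characters, this is similar to a classification problem in which we try to predict the next character from the previous text. The size of the vocabulary is the number of classes the network has to choose from.

So we feed the model one character at a time, and the model predicts the next character by producing a probability distribution over the characters that could come next (the vocabulary):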

len(language_vocab)
Output:
83

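Since we will train our model with stochastic gradient descent, we need to convert the data into training batches.

Mini-batch generation for training

In this section, we divide our data into small batches for training. Each batch contains a number of sequences, each with the desired number of sequence steps. Let's look at a visual example in Figure 11:

Figure 11: A visual example of batches and sequences (source: oscarmore2.github.io/Anna_KaRNNa_files/charseq.jpeg)

Now we need to define a function that iterates through the encoded text and generates the batches. In this function, we use a very nice Python mechanism called yield (link: jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/).

A typical batch has N × M characters, where N is the number of sequences and M is the number of sequence steps. To get the number of possible batches in our dataset, we simply divide the length of the data by the number of characters per batch, N × M; the number of characters we keep is then the number of batches times N × M.

After that, we need to split the dataset into the desired number of sequences (N). We can use arr.reshape(size). We know we want N sequences (num_seq in the code that follows), so we make that the size of the first dimension. For the second dimension, you can use -1 as a placeholder and it will be filled in with the appropriate size for you. After this, you should have an array of shape N × (M * K), where K is the number of batches.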
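The arithmetic above is easier to follow with concrete numbers. This NumPy sketch uses a made-up dataset of 1,234 integers with N = 10 and M = 5; the sizes are assumptions chosen only to keep the numbers small.

```python
import numpy as np

num_seq, num_steps = 10, 5           # N and M
data = np.arange(1234)               # hypothetical encoded text of 1,234 characters

chars_per_batch = num_seq * num_steps            # N * M = 50
num_batches = len(data) // chars_per_batch       # K = 24 full batches
trimmed = data[:num_batches * chars_per_batch]   # drop the 34 leftover characters
reshaped = trimmed.reshape((num_seq, -1))        # N rows, M * K columns

print(chars_per_batch, num_batches, reshaped.shape)
```

The reshaped array has 10 rows of 120 columns, so sliding a window of width M = 5 across the columns yields the 24 batches, each of shape N × M.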

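Now that we have this array, we can iterate through it to get the training batches, each with N × M characters. For each subsequent batch, the window moves over by num_steps. Finally, we also need to create both the input and output arrays to use them as the model input. Creating the output values is simple; remember that the targets are the inputs shifted over by one character. You will usually see the first input character used as the last target character, like this:

y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]

where x is the input batch and y is the target batch.

I like to do this windowing with the range function, taking steps of size num_steps from 0 to arr.shape[1], the total number of steps in each sequence. That way, the integers you get from range always point to the start of a batch, and each window is num_steps wide: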

def generate_character_batches(data, num_seq, num_steps):
    '''Create a function that returns batches of size
       num_seq x num_steps from data.
    '''
    # Get the number of characters per batch and number of batches
    num_char_per_batch = num_seq * num_steps
    num_batches = len(data)//num_char_per_batch

    # Keep only enough characters to make full batches
    data = data[:num_batches * num_char_per_batch]

    # Reshape the array into num_seq rows
    data = data.reshape((num_seq, -1))

    for i in range(0, data.shape[1], num_steps):
        # The input variables
        input_x = data[:, i:i+num_steps]

        # The output variables which are shifted by one
        output_y = np.zeros_like(input_x)

        output_y[:, :-1], output_y[:, -1] = input_x[:, 1:], input_x[:, 0]
        yield input_x, output_y

So, let's demonstrate this function by generating batches of 15 sequences and 50 sequence steps each:

generated_batches = generate_character_batches(encoded_vocab, 15, 50)
input_x, output_y = next(generated_batches)
print('input\n', input_x[:10, :10])
print('\ntarget\n', output_y[:10, :10])
Output:

input
 [[70 34 54 29 24 19 76 45 2 79]
 [45 19 44 15 16 15 82 44 19 45]
 [11 45 44 15 16 34 24 38 34 19]
 [45 34 54 64 45 82 19 19 56 45]
 [45 11 56 19 45 27 56 19 35 79]
 [49 19 54 76 12 45 44 54 12 24]
 [45 41 19 45 16 11 45 15 56 24]
 [11 35 45 24 11 45 39 54 27 19]
 [82 19 66 11 76 19 45 81 19 56]
 [12 54 16 19 45 44 15 27 19 45]]

target
 [[34 54 29 24 19 76 45 2 79 79]
 [19 44 15 16 15 82 44 19 45 16]
 [45 44 15 16 34 24 38 34 19 54]
 [34 54 64 45 82 19 19 56 45 82]
 [11 56 19 45 27 56 19 35 79 35]
 [19 54 76 12 45 44 54 12 24 45]
 [41 19 45 16 11 45 15 56 24 11]
 [35 45 24 11 45 39 54 27 19 33]
 [19 66 11 76 19 45 81 19 56 24]
 [54 16 19 45 44 15 27 19 45 24]]

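Next, we build the core of this example, the LSTM model.

Building the model

Before diving into building the character-level model with LSTMs, it is worth mentioning something called stacked LSTM.

Stacked LSTMs are useful for looking at your information at different time scales.

Stacked LSTMs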

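"Building a deep RNN by stacking multiple recurrent hidden states on top of each other. This approach potentially allows the hidden state at each level to operate at different timescale." From How to Construct Deep Recurrent Neural Networks (link: arxiv.org/abs/1312.60…

"RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this paper was whether RNNs could also benefit from depth in space; that is from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks." From Speech Recognition with Deep RNNs (link: arxiv.org/abs/1303.5778), 2013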

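Most researchers use stacked LSTMs for challenging sequence prediction problems. A stacked LSTM architecture can be defined as an LSTM model made up of multiple LSTM layers. Each LSTM layer passes a sequence output, rather than a single value, to the LSTM layer above it, as shown below.

Specifically, there is one output per input time step, rather than one output time step for all input time steps:

Figure 12: Stacked LSTMs

So in this example, we use this stacked LSTM architecture, which gives better performance.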

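Model architecture

This is where you build the network. We break it into parts so that each bit is easier to reason about, and then connect them into a whole network:

Figure 13: Character-level model architecture

Inputs

Now, let's start by defining the model inputs as placeholders. The inputs of the model are the training data and the targets. We also use a parameter called keep_probability for the dropout layer, which helps the model avoid overfitting: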

def build_model_inputs(batch_size, num_steps):

    # Declare placeholders for the input and output variables
    inputs_x = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
    targets_y = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')

    # define the keep_probability for the dropout layer
    keep_probability = tf.placeholder(tf.float32, name='keep_prob')

    return inputs_x, targets_y, keep_probability

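Building an LSTM cell

In this section, we write a function that creates the LSTM cell used in the hidden layers. This cell is the building block of our model, and we create it with TensorFlow. Let's see how to build a basic LSTM cell with TensorFlow.

The following line creates an LSTM cell, where the parameter num_units is the number of units in the hidden layer: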

lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units)

To prevent overfitting, we can use a technique called dropout, which regularizes the model by randomly dropping units during training:

tf.contrib.rnn.DropoutWrapper(lstm_cell, output_keep_prob=keep_probability)

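As mentioned before, we use the stacked LSTM architecture; it helps us look at the data from different angles and has been found to perform better in practice. To define a stacked LSTM in TensorFlow, we can use the tf.contrib.rnn.MultiRNNCell function (link: www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/rnn/MultiRNNCell):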

tf.contrib.rnn.MultiRNNCell([cell]*num_layers)

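Initially, there is no previous information for the first cell, so we need to initialize the cell state to zeros. We can do that with the following function: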

initial_state = cell.zero_state(batch_size, tf.float32)

So, let's put it all together and create our LSTM cell:

def build_lstm_cell(size, num_layers, batch_size, keep_probability):

    def build_single_layer():
        # Build one LSTM cell using the TensorFlow function
        lstm_cell = tf.contrib.rnn.BasicLSTMCell(size)

        # Add dropout to the layer to prevent overfitting
        return tf.contrib.rnn.DropoutWrapper(lstm_cell, output_keep_prob=keep_probability)

    # Stack num_layers separate cells; each layer needs its own cell object
    # because the layers have different input sizes and must not share weights
    stacked_cell = tf.contrib.rnn.MultiRNNCell([build_single_layer() for _ in range(num_layers)])
    initial_cell_state = stacked_cell.zero_state(batch_size, tf.float32)

    return stacked_cell, initial_cell_state

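RNN output

Next, we create the output layer, which reads the outputs of the individual LSTM cells and passes them through a fully connected layer. This layer has a softmax output that produces a probability distribution over the characters that could come next.

As you know, we generate input batches for the network of size N × M characters, where N is the number of sequences in the batch and M is the number of sequence steps. We also use L hidden units in the hidden layer when creating the model. Given the batch size and the number of hidden units, the output of the network is a 3D tensor of size N × M × L: the LSTM cell is called M times, once per sequence step, each call produces an output of size L, and this is done for each of the N sequences.

We then pass this N × M × L output to a fully connected layer (with the same weights for all outputs), but before doing so we reshape the output into a 2D tensor of shape (M * N) × L. This makes the output easier to work with, because each row holds the L outputs of the LSTM cell for one sequence and step.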
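The reshape can be checked on a tiny array. In this NumPy sketch the sizes N = 2, M = 3, and L = 4 are hypothetical, and the array is filled with consecutive integers so that each row of the flattened result can be traced back to its sequence and step.

```python
import numpy as np

N, M, L = 2, 3, 4   # sequences, steps, hidden units (small hypothetical sizes)
lstm_output = np.arange(N * M * L).reshape((N, M, L))

# Collapse the sequence and step dimensions into one.
flat = lstm_output.reshape((-1, L))   # shape (N * M, L)

# Row n * M + m of the flat array holds the L outputs for sequence n at step m.
n, m = 1, 2
row = flat[n * M + m]
print(flat.shape, row)
```

Because every row is one (sequence, step) pair, a single matrix multiplication with the softmax weights scores all positions of the batch at once.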

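After reshaping, we connect the output to the fully connected softmax layer by multiplying it by the weights. The weights created in the LSTM cells and the ones we are about to create here have the same name by default, and TensorFlow raises an error in that case. To avoid this error, we wrap the weight and bias variables created here in a variable scope using the TensorFlow function tf.variable_scope().

Having explained the shape of the output and how we reshape it, let's go ahead and code the build_model_output function: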

def build_model_output(output, input_size, output_size):

    # Reshaping output of the model to become a bunch of rows, where each row correspond for each step in the seq
    sequence_output = tf.concat(output, axis=1)
    reshaped_output = tf.reshape(sequence_output, [-1, input_size])

    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        softmax_w = tf.Variable(tf.truncated_normal((input_size, output_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(output_size))

    # the output is a set of rows of LSTM cell outputs, so the logits will be a set
    # of rows of logit outputs, one for each step and sequence
    logits = tf.matmul(reshaped_output, softmax_w) + softmax_b

    # Use softmax to get the probabilities for predicted characters
    model_out = tf.nn.softmax(logits, name='predictions')

    return model_out, logits

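Training loss

Next up is the training loss. We take the logits and targets and compute the softmax cross-entropy loss. First, we need to one-hot encode the targets, which we receive as integer-encoded characters. Then we reshape the one-hot targets into a 2D tensor of size (M * N) × C, where C is the number of classes/characters we have. Remember that we reshaped the LSTM outputs and ran them through a fully connected layer with C units, so our logits also have size (M * N) × C.

Then we run the logits and targets through tf.nn.softmax_cross_entropy_with_logits and take the mean to get the loss: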

def model_loss(logits, targets, lstm_size, num_classes):

    # convert the targets to one-hot encoded and reshape them to match the logits, one row per batch_size per step
    output_y_one_hot = tf.one_hot(targets, num_classes)
    output_y_reshaped = tf.reshape(output_y_one_hot, logits.get_shape())

    #Use the cross entropy loss
    model_loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=output_y_reshaped)
    model_loss = tf.reduce_mean(model_loss)
    return model_loss

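Optimizer

Finally, we need an optimization method that helps us learn something from the dataset. As we know, vanilla RNNs suffer from exploding and vanishing gradients. LSTMs fix only one of these, vanishing gradients; even with LSTMs, some gradient values can still explode and grow without bound. To fix this, we can use a technique called gradient clipping, which clips exploding gradients to a specific threshold.

So, let's define our optimizer using the Adam optimizer for the learning process: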

def build_model_optimizer(model_loss, learning_rate, grad_clip):

    # define optimizer for training, using gradient clipping to avoid the exploding of the gradients
    trainable_variables = tf.trainable_variables()
    gradients, _ = tf.clip_by_global_norm(tf.gradients(model_loss, trainable_variables), grad_clip)

    #Use Adam Optimizer
    train_operation = tf.train.AdamOptimizer(learning_rate)
    model_optimizer = train_operation.apply_gradients(zip(gradients, trainable_variables))

    return model_optimizer
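
The clipping step can be illustrated without TensorFlow. The following is a small sketch, assuming the documented behavior of clipping by global norm (every gradient is scaled by clip_norm / max(global_norm, clip_norm)); the function below is a stand-in written for this example, not the library code, and the gradients are made-up numbers:

```python
import math

def clip_by_global_norm(gradients, clip_norm):
    # gradients: a list of gradient vectors, one per trainable variable.
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # The scale is 1 when the norm is already within the threshold.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm

# Hypothetical gradients whose global norm is sqrt(9 + 16 + 144) = 13.
toy_gradients = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(toy_gradients, clip_norm=5.0)
print(norm, clipped)
```

Because all gradients share one scale factor, the direction of the overall update is preserved; only its length is capped.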

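Building the network

Now we can put all the pieces together and build a class for the network. To actually run data through the LSTM cells, we will use tf.nn.dynamic_rnn (www.tensorflow.org/versions/r1.0/api_docs/python/tf/nn/dynamic_rnn). This function passes the hidden and cell states across the LSTM cells appropriately. It returns the outputs of each LSTM cell at each step for each sequence in the mini-batch. It also gives us the final LSTM state. We want to save this state as final_state so that we can pass it to the first LSTM cell in the next mini-batch run. For tf.nn.dynamic_rnn, we pass in the cell and initial state that we get from build_lstm_cell, as well as our input sequences. Also, we need to one-hot encode the inputs before they go into the RNN: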

class CharLSTM:

    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):

        # When we're using this network for generating text by sampling, we'll be providing the network with
        # one character at a time, so providing an option for it.
        if sampling:
            batch_size, num_steps = 1, 1

        tf.reset_default_graph()

        # Build the model inputs placeholders of the input and target variables
        self.inputs, self.targets, self.keep_prob = build_model_inputs(batch_size, num_steps)

        # Building the LSTM cell
        lstm_cell, self.initial_state = build_lstm_cell(lstm_size, num_layers, batch_size, self.keep_prob)

        ### Run the data through the LSTM layers
        # one_hot encode the input
        input_x_one_hot = tf.one_hot(self.inputs, num_classes)

        # Running each sequence step through the LSTM architecture and finally collecting the outputs
        outputs, state = tf.nn.dynamic_rnn(lstm_cell, input_x_one_hot, initial_state=self.initial_state)
        self.final_state = state

        # Get softmax predictions and logits
        self.prediction, self.logits = build_model_output(outputs, lstm_size, num_classes)

        # Loss and optimizer (with gradient clipping)
        self.loss = model_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_model_optimizer(self.loss, learning_rate, grad_clip)

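Model hyperparameters

As with any deep learning architecture, there are a few hyperparameters that can be used to control the model and fine-tune it. The following is the set of hyperparameters that we are using for this architecture:

Batch size is the number of sequences running through the network in one pass.
The number of steps is the number of characters in the sequence the network is trained on. Larger is typically better; the network will learn more long-range dependencies, but it will take longer to train. 100 is typically a good number here.
The LSTM size is the number of units in the hidden layers.
The number of architecture layers is the number of hidden LSTM layers to use.
Learning rate is the typical learning rate for training.
Finally, there is something new called the keep probability, which is used by the dropout layer; it helps the network avoid overfitting. If your network is overfitting, try decreasing this value.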
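
To show what the keep probability does, here is a minimal sketch of inverted dropout in pure Python. It assumes the common convention, which tf.nn.dropout also follows, of rescaling the surviving units by 1 / keep_prob; the function name and the toy activations are illustrative assumptions:

```python
import random

def dropout(activations, keep_prob, rng):
    # Keep each unit with probability keep_prob and rescale the survivors
    # by 1 / keep_prob so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000          # hypothetical layer output
dropped = dropout(activations, keep_prob=0.5, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(kept_fraction, mean_activation)
```

A lower keep probability zeroes more units on each pass, which is a stronger regularizer; at sampling time the keep probability is set to 1 so that nothing is dropped.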

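Training the model

Now, let's kick off the training process by providing the inputs and outputs to the built model and then using the optimizer to train the network. Don't forget that we need to use the previous state when making predictions for the current state. So, we need to pass the output state back to the network so that it can be used when predicting the next input.

Let's provide initial values for our hyperparameters (you can tune them afterwards depending on the dataset you are using to train this architecture):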


batch_size = 100        # Sequences per batch
num_steps = 100         # Number of sequence steps per batch
lstm_size = 512         # Size of hidden layers in LSTMs
num_layers = 2          # Number of LSTM layers
learning_rate = 0.001   # Learning rate
keep_probability = 0.5  # Dropout keep probability

epochs = 5

# Save a checkpoint every N iterations
save_every_n = 100

LSTM_model = CharLSTM(len(language_vocab), batch_size=batch_size, num_steps=num_steps,
                lstm_size=lstm_size, num_layers=num_layers, 
                learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/______.ckpt')
    counter = 0
    for e in range(epochs):
        # Train network
        new_state = sess.run(LSTM_model.initial_state)
        loss = 0
        for x, y in generate_character_batches(encoded_vocab, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {LSTM_model.inputs: x,
                    LSTM_model.targets: y,
                    LSTM_model.keep_prob: keep_probability,
                    LSTM_model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([LSTM_model.loss, 
                                                 LSTM_model.final_state, 
                                                 LSTM_model.optimizer], 
                                                 feed_dict=feed)

            end = time.time()
            print('Epoch number: {}/{}... '.format(e+1, epochs),
                  'Step: {}... '.format(counter),
                  'loss: {:.4f}... '.format(batch_loss),
                  '{:.3f} sec/batch'.format((end-start)))

            if (counter % save_every_n == 0):
                saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

    saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

At the end of the training process, you should get a loss close to the following:

.
.
.
Epoch number: 5/5...  Step: 978...  loss: 1.7151...  0.050 sec/batch
Epoch number: 5/5...  Step: 979...  loss: 1.7428...  0.051 sec/batch
Epoch number: 5/5...  Step: 980...  loss: 1.7151...  0.050 sec/batch
Epoch number: 5/5...  Step: 981...  loss: 1.7236...  0.050 sec/batch
Epoch number: 5/5...  Step: 982...  loss: 1.7314...  0.051 sec/batch
Epoch number: 5/5...  Step: 983...  loss: 1.7369...  0.051 sec/batch
Epoch number: 5/5...  Step: 984...  loss: 1.7075...  0.065 sec/batch
Epoch number: 5/5...  Step: 985...  loss: 1.7304...  0.051 sec/batch
Epoch number: 5/5...  Step: 986...  loss: 1.7128...  0.049 sec/batch
Epoch number: 5/5...  Step: 987...  loss: 1.7107...  0.051 sec/batch
Epoch number: 5/5...  Step: 988...  loss: 1.7351...  0.051 sec/batch
Epoch number: 5/5...  Step: 989...  loss: 1.7260...  0.049 sec/batch
Epoch number: 5/5...  Step: 990...  loss: 1.7144...  0.051 sec/batch

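Saving checkpoints

Now, let's load the checkpoints. For more information about saving and loading checkpoints, you can check out the TensorFlow documentation (www.tensorflow.org/programmers_guide/variables):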

tf.train.get_checkpoint_state('checkpoints')

Output:
model_checkpoint_path: "checkpoints/i990_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i100_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i200_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i300_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i400_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i500_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i600_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i700_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i800_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i900_l512.ckpt"
all_model_checkpoint_paths: "checkpoints/i990_l512.ckpt"

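Generating text

We have a trained model based on our input dataset. The next step is to use this trained model to generate text and see how well it learned the style and structure of the input data. To do this, we can start with some initial characters and then feed each newly predicted character as the input for the next step. We repeat this process until we have text of a specific length.

In the following code, we have also added extra statements to the function to prime the network with some initial text and start from there.

The network gives us predictions, or probabilities, for each character in the vocabulary. To reduce noise and only use the characters that the network is more confident about, we will only sample from the top N most probable characters in the output: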

def choose_top_n_characters(preds, vocab_size, top_n_chars=4):
    p = np.squeeze(preds)
    p[np.argsort(p)[:-top_n_chars]] = 0
    p = p / np.sum(p)
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c

def sample_from_LSTM_output(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    samples = [char for char in prime]
    LSTM_model = CharLSTM(len(language_vocab), lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(LSTM_model.initial_state)
        for char in prime:
            x = np.zeros((1, 1))
            x[0,0] = vocab_to_integer[char]
            feed = {LSTM_model.inputs: x,
                    LSTM_model.keep_prob: 1.,
                    LSTM_model.initial_state: new_state}
            preds, new_state = sess.run([LSTM_model.prediction, LSTM_model.final_state], 
                                         feed_dict=feed)

        c = choose_top_n_characters(preds, len(language_vocab))
        samples.append(integer_to_vocab[c])

        for i in range(n_samples):
            x[0,0] = c
            feed = {LSTM_model.inputs: x,
                    LSTM_model.keep_prob: 1.,
                    LSTM_model.initial_state: new_state}
            preds, new_state = sess.run([LSTM_model.prediction, LSTM_model.final_state], 
                                         feed_dict=feed)

            c = choose_top_n_characters(preds, len(language_vocab))
            samples.append(integer_to_vocab[c])

    return ''.join(samples)
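
The top-N selection can be tried on its own, without a trained model. The sketch below mirrors the idea in pure Python: keep the N largest probabilities, renormalize them, and draw one index. The function name and the probability values are made up for illustration:

```python
import random

def top_n_sample(probs, top_n, rng):
    # Indices of the top_n largest probabilities.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_n]
    # Renormalize the kept entries so that they sum to 1, then draw one.
    total = sum(probs[i] for i in keep)
    weights = [probs[i] / total for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

rng = random.Random(42)
toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]   # hypothetical character probabilities
draws = [top_n_sample(toy_probs, top_n=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With top_n=2, only the two most likely characters can ever be chosen; raising top_n makes the generated text more varied but also noisier.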

Let's start the sampling process using the latest saved checkpoint:

tf.train.latest_checkpoint('checkpoints')

Output:
'checkpoints/i990_l512.ckpt'

Now, it's time to sample using this latest checkpoint:

checkpoint = tf.train.latest_checkpoint('checkpoints')
sampled_text = sample_from_LSTM_output(checkpoint, 1000, lstm_size, len(language_vocab), prime="Far")
print(sampled_text)

Output:
INFO:tensorflow:Restoring parameters from checkpoints/i990_l512.ckpt

Farcial the
confiring to the mone of the correm and thinds. She
she saw the
streads of herself hand only astended of the carres to her his some of the princess of which he came him of
all that his white the dreasing of
thisking the princess and with she was she had
bettee a still and he was happined, with the pood on the mush to the peaters and seet it.

"The possess a streatich, the may were notine at his mate a misted
and the
man of the mother at the same of the seem her
felt. He had not here.

"I conest only be alw you thinking that the partion
of their said."

"A much then you make all her
somether. Hower their centing
about
this, and I won't give it in
himself.
I had not come at any see it will that there she chile no one that him.

"The distiction with you all.... It was
a mone of the mind were starding to the simple to a mone. It to be to ser in the place," said Vronsky.
"And a plais in
his face, has alled in the consess on at they to gan in the sint
at as that
he would not be and t

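You can see that we were able to generate some meaningful words and some meaningless words. To get better results, you can train the model for more epochs and try to play with the hyperparameters.

Summary

We learned about RNNs, how they work, and why they have become a big deal. We trained an RNN character-level language model on a fun novel dataset and saw where RNNs are going. You can confidently expect a large amount of innovation in the space of RNNs, and I believe they will become a pervasive and critical component of intelligent systems.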

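Chapter 11: Representation Learning - Implementing Word Embeddings

Machine learning is a science that is mainly based on statistics and linear algebra. Matrix operations are very common in most machine learning and deep learning architectures because of backpropagation. This is the main reason why deep learning, and machine learning in general, accepts only real-valued quantities as input. This requirement is at odds with many applications, such as machine translation and sentiment analysis, whose input is text. So, in order to use deep learning for these applications, we need to convert the text into a form that deep learning accepts!

In this chapter, we are going to introduce the field of representation learning, which is a way to learn a real-valued representation from text while preserving the semantics of the actual text. For example, the representation of love should be very close to the representation of adore because they are used in very similar contexts.

So, the following topics will be covered in this chapter:

Introduction to representation learning
Word2Vec
A practical example of the skip-gram architecture
Skip-gram Word2Vec implementation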

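Introduction to representation learning

All the machine learning algorithms and architectures that we have used so far require the input to be real-valued or a matrix of real-valued quantities, and that's a common theme in machine learning. For example, in the convolutional neural network, we had to feed the raw pixel values of the images as model inputs. In this part, we are dealing with text, so we need to encode our text somehow and produce real-valued quantities that can be fed to a machine learning algorithm. In order to encode input text as real-valued quantities, we need to use an intermediary field called natural language processing (NLP).

We mentioned that, in this kind of pipeline, feeding text directly into a machine learning model for sentiment analysis is problematic and won't work, because we can't apply backpropagation or other operations such as the dot product to the input, which is a string. So, we need to use an NLP mechanism that builds an intermediate representation of the text, one that carries the same information as the text and can also be fed to the machine learning model.

We need to convert each word or token in the input text into a real-valued vector. These vectors will be useless if they don't carry the patterns, information, meaning, and semantics of the original input. For example, in real text, the two words love and adore are very similar and carry the same meaning. We need the real-valued vectors that represent them to be close to each other and in the same vector space. So, the vector representations of these two words, together with another word that isn't similar to them, will look like the following diagram:

Figure 15.1: Vector representation of words

There are many techniques that can be used for this task. This family of techniques is called embeddings, where you are embedding text into another real-valued vector space.

As we'll see later, this vector space is actually very interesting, because you will find that you can derive a word's vector from other words that are similar to it, or even do some geography in this space.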

Word2Vec

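Word2Vec is one of the most widely used embedding techniques in the area of natural language processing (NLP). This model creates real-valued vectors from input text by looking at the contextual information in which the input word appears. You will find that similar words are usually mentioned in very similar contexts, and hence the model will learn that those words should be placed close to each other in the embedding space.

From the statements in the following diagram, the model will learn that the words love and adore share very similar contexts and should be placed very close to each other in the resulting vector space. The context of the word like could be somewhat similar to that of love as well, but it won't be as close to love as adore is:

Figure 15.2: Sample of sentiment sentences

The Word2Vec model also relies on semantic features of the input sentences; for example, the words adore and love are mainly used in a positive context and usually precede noun phrases or nouns. Again, the model will learn that these two words have something in common, and it will be more likely to place their vector representations close together in the embedding space. So, the structure of the sentence tells the Word2Vec model a lot about similar words.

In practice, people feed a large corpus of text to the Word2Vec model. The model will learn to produce similar vectors for similar words, and it will do so for each unique word in the input text.

All of these word vectors will be combined, and the final output will be an embedding matrix in which each row represents the real-valued vector representation of a specific unique word.

Figure 15.3: Example of the Word2Vec model pipeline

So, the final output of the model will be an embedding matrix for all the unique words in the training corpus. Usually, good embedding matrices can contain millions of real-valued vectors.

Word2Vec modeling uses a window to scan over the sentence and then tries to predict the vector of the middle word of that window based on its contextual information; the Word2Vec model scans one sentence at a time. As with any machine learning technique, we need to define a cost function for the Word2Vec model and its corresponding optimization criterion, so that the model can generate a real-valued vector for each unique word and relate these vectors to each other based on their contextual information.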

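Building the Word2Vec model

In this section, we will go through some of the deeper details of how to build a Word2Vec model. As mentioned previously, our final goal is to have a trained model that is able to generate real-valued vector representations for the input textual data, which are also called word embeddings.

During the training of the model, we will use the maximum likelihood method (en.wikipedia.org/wiki/Maximum_likelihood), which can be used to maximize the probability of the next word $w_t$ in the input sentence given the previous words that the model has seen, which we can call $h$.

This maximum likelihood method will be expressed in terms of the softmax function:

$$P(w_t \mid h) = \text{softmax}(\text{score}(w_t, h)) = \frac{\exp\{\text{score}(w_t, h)\}}{\sum_{w' \in V} \exp\{\text{score}(w', h)\}}$$

Here, the score function computes a value that represents the compatibility of the target word $w_t$ with the context $h$. The model is trained on the input sequences to maximize the likelihood of the training data (the log-likelihood is used for mathematical simplicity):

$$J_{ML} = \log P(w_t \mid h) = \text{score}(w_t, h) - \log\left(\sum_{w' \in V} \exp\{\text{score}(w', h)\}\right)$$

So, the ML method will try to maximize the above equation, which results in a probabilistic language model. However, this calculation is very computationally expensive, because we need to compute the probability, using the score function, for every word $w'$ in the vocabulary $V$ in the current context $h$ of the model. This happens at every training step.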
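
The cost of the full softmax is easy to see in a small sketch. The snippet below is only an illustration in pure Python: the scores are hypothetical numbers standing in for score(w', h), and the four-word vocabulary is an assumption. Note that the denominator has to visit every word in the vocabulary:

```python
import math

def softmax_probability(scores, target):
    # scores: dict mapping every vocabulary word w' to score(w', h).
    # The denominator touches all |V| entries, which is what makes the
    # full softmax expensive at every training step.
    m = max(scores.values())
    denominator = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[target] - m) / denominator

# Hypothetical scores for a four-word vocabulary in some context h.
toy_scores = {"love": 2.0, "adore": 1.5, "like": 0.5, "sheep": -1.0}
p_love = softmax_probability(toy_scores, "love")
log_likelihood = math.log(p_love)
print(round(p_love, 4), round(log_likelihood, 4))
```

With a realistic vocabulary of tens of thousands of words, this sum is repeated for every training example, which motivates the cheaper sampling-based objectives.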

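Figure 15.4: General architecture of the probabilistic language model

Because of the computational expense of building the probabilistic language model, people tend to use techniques that are less computationally expensive, such as the Continuous Bag-of-Words (CBOW) and skip-gram models.

These models are trained to build a binary classification model with logistic regression that separates the real target words $w_t$ from $k$ noise, or imaginary, words $\tilde{w}$ that appear in the same context. The following diagram simplifies this idea using the CBOW technique:

Figure 15.5: General architecture of the skip-gram model

The next diagram shows the two architectures that you can use for building the Word2Vec model:

Figure 15.6: Different architectures for the Word2Vec model

More formally, the objective function of these techniques maximizes the following:

$$J_{NEG} = \log Q_\theta(D=1 \mid w_t, h) + k \, \mathbb{E}_{\tilde{w} \sim P_{noise}}\left[\log Q_\theta(D=0 \mid \tilde{w}, h)\right]$$

Where:

$Q_\theta(D=1 \mid w, h)$ is the binary logistic regression probability, under the model, of seeing the word $w$ in the context $h$ in the dataset $D$, calculated in terms of the $\theta$ vector. This vector represents the learned embeddings.
$\tilde{w}$ are the imaginary or noise words that we can generate from a noisy probability distribution, such as the unigram of the training input examples.

To sum up, the objective of these models is to discriminate between real and imaginary inputs, and hence to assign high probabilities to real words and lower probabilities to imaginary or noise words.

This objective is maximized when the model assigns high probabilities to real words and low probabilities to noise words.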

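Technically, this approach of assigning high probabilities to real words and low probabilities to noise words is called negative sampling (papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf), and there is good mathematical motivation for using this loss function: the updates that it proposes approximate the updates of the softmax function in the limit. Computationally, it is especially appealing because computing the loss function now scales only with the number of noise words that we select ($k$), and not with all the words in the vocabulary ($V$). This makes training much faster. We will actually make use of the very similar noise-contrastive estimation (NCE) (papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf) loss, for which TensorFlow has a handy helper function, tf.nn.nce_loss().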
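
As a rough illustration of why this objective is cheap, the following sketch evaluates it for one real word and k sampled noise words. It assumes that the logistic probability is a sigmoid of a score and that the expectation over noise words is replaced by a sum over the k samples; the scores are made-up numbers and the function names are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_objective(true_score, noise_scores):
    # log Q(D=1 | w_t, h) for the real word plus log Q(D=0 | noise word, h)
    # for each of the k sampled noise words; only k + 1 scores are needed,
    # not one per vocabulary word.
    objective = math.log(sigmoid(true_score))
    objective += sum(math.log(1.0 - sigmoid(s)) for s in noise_scores)
    return objective

# A model that scores the real word high and the noise words low ...
good = negative_sampling_objective(true_score=4.0, noise_scores=[-4.0, -3.0])
# ... does better than one that gets it the wrong way round.
bad = negative_sampling_objective(true_score=-4.0, noise_scores=[4.0, 3.0])
print(round(good, 4), round(bad, 4))
```

The objective is always negative and approaches zero as the model separates real words from noise words more confidently.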

A practical example of the skip-gram architecture

Let's go through a practical example and see how skip-gram models work in this situation:

the quick brown fox jumped over the lazy dog

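First, we need to build a dataset of words and their corresponding contexts. Defining the context is up to us, but it has to make sense. So, we'll take a window around the target word, taking one word from the right and one from the left.

By following this contextual technique, we will end up with the following set of words and their corresponding contexts: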

([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...

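The generated words and their corresponding contexts are represented as pairs of (context, target). The idea of skip-gram models is the inverse of that of CBOW models. In the skip-gram model, we try to predict the context of a word from its target word. For example, considering the first pair, the skip-gram model will try to predict the and brown from the target word quick, and so on. So, we can rewrite our dataset as follows: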

(quick, the), (quick, brown), (brown, quick), (brown, fox), ...

Now, we have a set of input and output pairs.
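
The construction of both datasets can be reproduced with a few lines of Python. This is a minimal sketch, assuming a window of one word on each side and skipping words that lack a neighbor on either side; the function names are illustrative:

```python
def context_target_pairs(words, window=1):
    # (context, target) pairs for words that have a full window on both sides.
    pairs = []
    for i in range(window, len(words) - window):
        context = words[i - window:i] + words[i + 1:i + window + 1]
        pairs.append((context, words[i]))
    return pairs

def skip_gram_pairs(words, window=1):
    # Flip each (context, target) pair into (input, output) pairs:
    # the target becomes the input and each context word an output.
    return [(target, ctx)
            for context, target in context_target_pairs(words, window)
            for ctx in context]

sentence = "the quick brown fox jumped over the lazy dog".split()
cbow_style = context_target_pairs(sentence)
skip_gram = skip_gram_pairs(sentence)
print(cbow_style[:3])
print(skip_gram[:4])
```

The first printed list matches the (context, target) pairs shown earlier, and the second matches the rewritten (input, output) pairs.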

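Let's try to mimic the training process at a specific step t. The skip-gram model will take the first training sample, where the input is the word quick and the target output is the word the. Next, we need to construct the noisy input, so we randomly select from the words of the input data. For simplicity, the size of the noise vector will be only one. For example, we can select the word sheep as a noisy example.

Now, we can go ahead and compute the loss between the real pair and the noisy one as follows:

$$J^{(t)}_{NEG} = \log Q_\theta(D=1 \mid \text{the}, \text{quick}) + \log Q_\theta(D=0 \mid \text{sheep}, \text{quick})$$

The goal in this case is to update the $\theta$ parameters to improve the previous objective function. Typically, we can use the gradient for this. So, we will try to calculate the gradient of the loss with respect to the objective function's parameters $\theta$, which is denoted by $\frac{\partial}{\partial \theta} J_{NEG}$.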

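After the training process, we can visualize some results based on reduced-dimension versions of the real-valued vector representations. You will find that this vector space is very interesting because you can do lots of interesting things with it. For example, you can learn analogies in this space, such as king is to queen as man is to woman. We can even derive the woman vector by subtracting the king vector from the queen vector and adding the man vector; the result will be very close to the actual learned vector for woman. You can also learn geography in this space.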
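
The analogy arithmetic can be tried on hand-made vectors. The numbers below are not learned embeddings; they are a toy assumption in which the first axis loosely stands for royalty and the second for gender, chosen only so that the arithmetic is easy to follow:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made 2-D vectors for illustration only; these are not learned.
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "sheep": [-0.8, 0.1],
}

# queen - king + man should land near woman.
query = [q - k + m for q, k, m in
         zip(toy_vectors["queen"], toy_vectors["king"], toy_vectors["man"])]
candidates = {w: v for w, v in toy_vectors.items() if w not in ("queen", "king", "man")}
nearest = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(nearest)
```

With real embeddings the match is only approximate, so the usual practice is to return the nearest vocabulary word by cosine similarity, as this sketch does.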

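Figure 15.7: Projection of the learned vectors to two dimensions using the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction technique

The preceding example gives very good intuition about these vectors and how useful they are for most NLP applications, such as machine translation or part-of-speech (POS) tagging.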

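Skip-gram Word2Vec implementation

After understanding the mathematical details of how skip-gram models work, we are going to implement skip-gram, which encodes words into real-valued vectors that have certain properties (hence the name Word2Vec). By implementing this architecture, you will get a sense of how the process of learning another representation works.

Text is the main input for many natural language processing applications, such as machine translation, sentiment analysis, and text-to-speech systems. So, learning a real-valued representation for text will help us use different deep learning techniques for these tasks.

In the early chapters of this book, we introduced something called one-hot encoding, which produces a vector of zeros except at the index of the word that the vector represents. So, you may wonder why we are not using it here. This method is very inefficient because you usually have a large set of distinct words, maybe something like 50,000 words, and using one-hot encoding for this will produce a vector with 49,999 entries set to zero and only one entry set to one.

Having a very sparse input like this results in a huge waste of computation because of the matrix multiplications performed in the hidden layers of the neural network.

Figure 15.8: One-hot encoding, which results in a huge waste of computation

As we mentioned previously, the outcome of using one-hot encoding will be a very sparse vector, especially when you have a huge number of distinct words to encode.

As the following figure shows, when we multiply this sparse vector, whose entries are all zero except for one, by a matrix of weights, the output will be only the row of the matrix that corresponds to the single non-zero value of the sparse vector:

Figure 15.9: The effect of multiplying a one-hot vector, which is almost all zeros, by the hidden layer weight matrix

To avoid this huge waste of computation, we will use embeddings, which are just a fully connected layer with some embedding weights. In this layer, we skip the inefficient multiplication and instead look up the embedding weights of the embedding layer in what is called a weight matrix.

So, instead of wasting computation, we are going to use this weight lookup matrix to find the embedding weights. First, we need to build this lookup table. To do this, we encode all the input words as integers, as shown in the following figure, and then, to get the corresponding values for a word, we use its integer representation as the row number in the weight matrix. The process of finding the embedding values of a specific word is called embedding lookup. As mentioned previously, the embedding layer is just a fully connected layer, where the number of units represents the embedding dimension.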

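Figure 15.10: Tokenized lookup table

You can see that this process is very intuitive and straightforward; we just need to follow these steps:

Define the lookup table that will be treated as a weight matrix
Define the embedding layer as a fully connected hidden layer with a specific number of units (the embedding dimension)
Use the weight matrix lookup as an alternative to the computationally unnecessary matrix multiplication
Finally, train the lookup table as any weight matrix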
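
The equivalence between the one-hot multiplication and the row lookup can be checked directly. The following sketch uses plain Python lists; the four-word vocabulary, the three-dimensional embeddings, and the function names are assumptions made for this example:

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vector, matrix):
    # Dense product: touches every row of the matrix, even the ones
    # that are multiplied by zero.
    columns = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(columns)]

def embedding_lookup(matrix, index):
    # Lookup: read the single row selected by the word's integer id.
    return matrix[index]

# Hypothetical 4-word vocabulary with 3-dimensional embeddings.
embedding_weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
word_id = 2
via_matmul = vector_matrix_product(one_hot(word_id, 4), embedding_weights)
via_lookup = embedding_lookup(embedding_weights, word_id)
print(via_matmul, via_lookup)
```

Both routes return the same row, but the lookup reads one row, while the product performs a multiply-add for every entry of the matrix.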

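As we mentioned earlier, in this section we are going to build a skip-gram Word2Vec model, which is an efficient way of learning a representation for words while preserving the semantic information that the words carry.

So, let's go ahead and build a Word2Vec model using the skip-gram architecture, which has been shown to perform better than the alternatives.

Data analysis and preprocessing

In this section, we are going to define some helper functions that will enable us to build a good Word2Vec model. For this implementation, we are going to use a cleaned version of Wikipedia (mattmahoney.net/dc/textdata.html).

So, let's start off by importing the required packages for this implementation: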

#importing the required packages for this implementation
import numpy as np
import tensorflow as tf

#Packages for downloading the dataset
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile

#packages for data preprocessing
import re
from collections import Counter
import random

Next up, we are going to define a class that will be used to download the dataset if it has not been downloaded before:

# In this implementation we will use a cleaned up version of Wikipedia from Matt Mahoney.
# So we will define a helper class that helps to download the dataset
wiki_dataset_folder_path = 'wikipedia_data'
wiki_dataset_filename = 'text8.zip'
wiki_dataset_name = 'Text8 Dataset'

class DLProgress(tqdm):

    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

# Checking if the file is not already downloaded
if not isfile(wiki_dataset_filename):
    with DLProgress(unit='B', unit_scale=True, miniters=1, desc=wiki_dataset_name) as pbar:
        urlretrieve(
            'http://mattmahoney.net/dc/text8.zip',
            wiki_dataset_filename,
            pbar.hook)

# Checking if the data is already extracted if not extract it
if not isdir(wiki_dataset_folder_path):
    with zipfile.ZipFile(wiki_dataset_filename) as zip_ref:
        zip_ref.extractall(wiki_dataset_folder_path)

with open('wikipedia_data/text8') as f:
    cleaned_wikipedia_text = f.read()

Output:

Text8 Dataset: 31.4MB [00:39, 794kB/s]

We can have a look at the first 100 characters of this dataset:

cleaned_wikipedia_text[0:100]

' anarchism originated as a term of abuse first used against early working class radicals including t'

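Next up, we are going to preprocess the text, so we'll define a helper function that replaces special characters such as punctuation with known tokens. Also, to reduce the amount of noise in the input text, you might want to remove words that don't appear frequently in the text: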

def preprocess_text(input_text):

    # Replace punctuation with some special tokens so we can use them in our model
    input_text = input_text.lower()
    input_text = input_text.replace('.', ' <PERIOD> ')
    input_text = input_text.replace(',', ' <COMMA> ')
    input_text = input_text.replace('"', ' <QUOTATION_MARK> ')
    input_text = input_text.replace(';', ' <SEMICOLON> ')
    input_text = input_text.replace('!', ' <EXCLAMATION_MARK> ')
    input_text = input_text.replace('?', ' <QUESTION_MARK> ')
    input_text = input_text.replace('(', ' <LEFT_PAREN> ')
    input_text = input_text.replace(')', ' <RIGHT_PAREN> ')
    input_text = input_text.replace('--', ' <HYPHENS> ')
    input_text = input_text.replace(':', ' <COLON> ')
    text_words = input_text.split()

    # neglecting all the words that have five occurrences or fewer
    text_word_counts = Counter(text_words)
    trimmed_words = [word for word in text_words if text_word_counts[word] > 5]

    return trimmed_words

Now, let's call this function on the input text and have a look at the output:

preprocessed_words = preprocess_text(cleaned_wikipedia_text)
print(preprocessed_words[:30])

Output:
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']

Let's see how many words and distinct words we have in the preprocessed version of the text:

print("Total number of words in the text: {}".format(len(preprocessed_words)))
print("Total number of unique words in the text: {}".format(len(set(preprocessed_words))))

Output:

Total number of words in the text: 16680599
Total number of unique words in the text: 63641

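As we mentioned earlier in this section, we need the integer index of each word to look up its values in the weight matrix, so we are going to create dictionaries that convert words to integers and, in the other direction, integers to words. This lets us look up a word's index and also recover the actual word for a given index. The integers are assigned in descending order of frequency, so the most frequent word (the) is given the integer 0, the next most frequent is given 1, and so on. The words are then converted to integers and stored in the list integer_words.

So, let's define a function to create these lookup tables: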

def create_lookuptables(input_words):
    """
    Creating lookup tables for the vocab

    Function arguments:
    param input_words: Input list of words
    """
    input_word_counts = Counter(input_words)
    sorted_vocab = sorted(input_word_counts, key=input_word_counts.get, reverse=True)
    integer_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_integer = {word: ii for ii, word in integer_to_vocab.items()}

    # returning a tuple of dicts
    return vocab_to_integer, integer_to_vocab

Now, let's call the defined function to create the lookup tables:

vocab_to_integer, integer_to_vocab = create_lookuptables(preprocessed_words)
integer_words = [vocab_to_integer[word] for word in preprocessed_words]

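To build a more accurate model, we can remove words that don't change the context much, such as of, for, the, and so on. It has been shown in practice that we can build more accurate models when these words are discarded. The process of removing context-irrelevant words from the text is called subsampling. To define a general mechanism for discarding words, Mikolov introduced a function for calculating the discard probability of a given word, which is given by:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

Where:

$t$ is a threshold parameter for word discarding
$f(w_i)$ is the frequency of the target word $w_i$ in the input dataset

So, we are going to implement a helper function that will calculate the discard probability of each word in the dataset: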

# removing context-irrelevant words threshold
word_threshold = 1e-5

word_counts = Counter(integer_words)
total_number_words = len(integer_words)

#Calculating the freqs for the words
frequencies = {word: count/total_number_words for word, count in word_counts.items()}

#Calculating the discard probability
prob_drop = {word: 1 - np.sqrt(word_threshold/frequencies[word]) for word in word_counts}
training_words = [word for word in integer_words if random.random() < (1 - prob_drop[word])]
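
To get a feel for the discard probability, here is a small worked example in pure Python. The relative frequencies are hypothetical values chosen for illustration, and the clamp at zero is an addition of this sketch; in the code above, a negative value simply means that the word is always kept:

```python
import math

def discard_probability(frequency, threshold=1e-5):
    # 1 - sqrt(t / f(w_i)), clamped at 0 for words rarer than the threshold,
    # which are therefore always kept.
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))

# Hypothetical relative frequencies, chosen only for illustration.
toy_frequencies = {"the": 0.05, "revolution": 1e-4, "culottes": 1e-6}
for word, freq in toy_frequencies.items():
    print(word, round(discard_probability(freq), 4))
```

A word that makes up 5% of the corpus is dropped almost every time it occurs, while a word rarer than the threshold is never dropped.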

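Now, we have a more refined and cleaner version of the input text.

We mentioned that the skip-gram architecture considers the context of the target word while producing its real-valued representation, so it defines a window of size C around the target word.

Instead of treating all contextual words equally, we are going to give less weight to words that are farther from the target word. For example, if we choose a window size of C = 4, then we select a random number L from the range 1 to C and sample L words from the history and the future of the current word. For more details, refer to the paper by Mikolov et al. at arxiv.org/pdf/1301.3781.pdf.

So, let's go ahead and define this function: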

# Defining a function that returns the words around specific index in a specific window
def get_target(input_words, ind, context_window_size=5):

    # selecting a random number to be used for generating words from the history and future of the current word
    rnd_num = np.random.randint(1, context_window_size+1)
    start_ind = ind - rnd_num if (ind - rnd_num) > 0 else 0
    stop_ind = ind + rnd_num

    target_words = set(input_words[start_ind:ind] + input_words[ind+1:stop_ind+1])

    return list(target_words)
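
The effect of the random window can be checked empirically. In the sketch below, which uses only the standard library, a word at distance d from the target is included only when the sampled radius L is at least d, so it should appear with probability (C - d + 1) / C. The stand-in word list and the function name are assumptions made for this example:

```python
import random

def sample_context(words, index, max_window, rng):
    # Draw L uniformly from 1..max_window, then take L words on each side.
    radius = rng.randint(1, max_window)
    start = max(0, index - radius)
    return words[start:index] + words[index + 1:index + radius + 1]

rng = random.Random(0)
positions = list(range(21))        # stand-in "words": their own positions
center, max_window, trials = 10, 4, 20000
hits = {d: 0 for d in range(1, max_window + 1)}
for _ in range(trials):
    context = sample_context(positions, center, max_window, rng)
    for d in hits:
        if center + d in context:
            hits[d] += 1
empirical = {d: hits[d] / trials for d in hits}
expected = {d: (max_window - d + 1) / max_window for d in hits}
print(empirical)
print(expected)
```

Immediate neighbors are always sampled, while the farthest word in the window is sampled only a quarter of the time when C = 4, which is how distant words end up with less weight.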

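Also, let's define a generator function that generates a random batch from the training samples and gets the contextual words for each word in that batch: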

#Defining a function for generating word batches as a tuple (inputs, targets)
def generate_random_batches(input_words, train_batch_size, context_window_size=5):

    num_batches = len(input_words)//train_batch_size

    # working on only full batches
    input_words = input_words[:num_batches*train_batch_size]

    for ind in range(0, len(input_words), train_batch_size):
        input_vals, target = [], []
        input_batch = input_words[ind:ind+train_batch_size]

        #Getting the context for each word
        for ii in range(len(input_batch)):
            batch_input_vals = input_batch[ii]
            batch_target = get_target(input_batch, ii, context_window_size)

            target.extend(batch_target)
            input_vals.extend([batch_input_vals]*len(batch_target))
        yield input_vals, target

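Building the model

Next up, we are going to use the following structure to build the computational graph:

Figure 15.11: Model architecture

As mentioned previously, we are going to use an embedding layer that will try to learn a special real-valued representation for these words. Conceptually, the words are fed as one-hot vectors; in practice, we feed their integer indices and look up the matching rows. The idea is to train this network to build up the weight matrix.

So, let's start off by creating the input to our model: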

train_graph = tf.Graph()

#defining the inputs placeholders of the model
with train_graph.as_default():
    inputs_values = tf.placeholder(tf.int32, [None], name='inputs_values')
    labels_values = tf.placeholder(tf.int32, [None, None], name='labels_values')

The weight or embedding matrix that we are trying to build will have the following shape:

num_words X num_hidden_neurons

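Also, we don't have to implement the lookup function ourselves because it's already available in TensorFlow: tf.nn.embedding_lookup(). It takes the integer encoding of the words and finds their corresponding rows in the weight matrix.

The weight matrix will be randomly initialized from a uniform distribution: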

num_vocab = len(integer_to_vocab)

num_embedding =  300
with train_graph.as_default():
    embedding_layer = tf.Variable(tf.random_uniform((num_vocab, num_embedding), -1, 1))

    # Next, we are going to use tf.nn.embedding_lookup function to get the output of the hidden layer
    embed_tensors = tf.nn.embedding_lookup(embedding_layer, inputs_values)

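Updating all the embedding weights of the embedding layer at once is very inefficient. Instead, we will use the negative sampling technique, which updates the weights for the correct word together with only a small subset of incorrect words.

Also, we don't have to implement this function ourselves, as it's already there in TensorFlow as tf.nn.sampled_softmax_loss: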

# Number of negative labels to sample
num_sampled = 100

with train_graph.as_default():
    # create softmax weights and biases
    softmax_weights = tf.Variable(tf.truncated_normal((num_vocab, num_embedding))) 
    softmax_biases = tf.Variable(tf.zeros(num_vocab), name="softmax_bias") 

    # Calculating the model loss using negative sampling
    model_loss = tf.nn.sampled_softmax_loss(
        weights=softmax_weights,
        biases=softmax_biases,
        labels=labels_values,
        inputs=embed_tensors,
        num_sampled=num_sampled,
        num_classes=num_vocab)

    model_cost = tf.reduce_mean(model_loss)
    model_optimizer = tf.train.AdamOptimizer().minimize(model_cost)

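To validate our trained model, we are going to sample some frequent or common words and some uncommon words, and print their closest words according to the representation learned by the skip-gram architecture: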

with train_graph.as_default():

    # set of random words for evaluating similarity on
    valid_num_words = 16 
    valid_window = 100

    # pick 8 samples from (0,100) and (1000,1100) each ranges. lower id implies more frequent 
    valid_samples = np.array(random.sample(range(valid_window), valid_num_words//2))
    valid_samples = np.append(valid_samples, 
                               random.sample(range(1000,1000+valid_window), valid_num_words//2))

    valid_dataset_samples = tf.constant(valid_samples, dtype=tf.int32)

    # Calculating the cosine distance
    norm = tf.sqrt(tf.reduce_sum(tf.square(embedding_layer), 1, keep_dims=True))
    normalized_embed = embedding_layer / norm
    valid_embedding = tf.nn.embedding_lookup(normalized_embed, valid_dataset_samples)
    cosine_similarity = tf.matmul(valid_embedding, tf.transpose(normalized_embed))
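
The nearest-word computation relies on the fact that, for unit-length vectors, the dot product equals the cosine similarity. The sketch below shows the same idea in pure Python; the vectors are hand-made for illustration and are not learned embeddings, and the function names are hypothetical:

```python
import math

def normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def nearest_words(word, embeddings, top_k=2):
    # With unit-length rows, the dot product equals the cosine similarity.
    unit = {w: normalize(v) for w, v in embeddings.items()}
    scores = {w: sum(a * b for a, b in zip(unit[word], u))
              for w, u in unit.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hand-made vectors for illustration only; real embeddings are learned.
toy_embeddings = {
    "seven": [0.9, 0.1, 0.0],
    "eight": [0.8, 0.2, 0.1],
    "nine":  [0.85, 0.05, 0.05],
    "radio": [0.0, 0.1, 0.9],
    "tv":    [0.1, 0.0, 0.95],
}
neighbours = nearest_words("seven", toy_embeddings)
print(neighbours)
```

Normalizing the whole embedding matrix once, as the TensorFlow code above does, turns the similarity of every validation word against the full vocabulary into a single matrix multiplication.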

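Now, we have all the bits and pieces for our model and we are ready to kick off the training process.

Training

Let's go ahead and kick off the training process: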

num_epochs = 10
train_batch_size = 1000
contextual_window_size = 10

with train_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:

    iteration_num = 1
    average_loss = 0

    #Initializing all the variables
    sess.run(tf.global_variables_initializer())

    for e in range(1, num_epochs+1):

        #Generating random batch for training
        batches = generate_random_batches(training_words, train_batch_size, contextual_window_size)

        #Iterating through the batch samples
        for input_vals, target in batches:

            #Creating the feed dict
            feed_dict = {inputs_values: input_vals,
                    labels_values: np.array(target)[:, None]}

            train_loss, _ = sess.run([model_cost, model_optimizer], feed_dict=feed_dict)

            #accumulating the loss
            average_loss += train_loss

            #Printing out the results after 100 iteration
            if iteration_num % 100 == 0: 
                print("Epoch Number {}/{}".format(e, num_epochs),
                      "Iteration Number: {}".format(iteration_num),
                      "Avg. Training loss: {:.4f}".format(average_loss/100))
                average_loss = 0

            if iteration_num % 1000 == 0:

                ## Using cosine similarity to get the nearest words to a word
                similarity = cosine_similarity.eval()
                for i in range(valid_num_words):
                    valid_word = integer_to_vocab[valid_samples[i]]

                    # number of nearest neighbors
                    top_k = 8 
                    nearest_words = (-similarity[i, :]).argsort()[1:top_k+1]
                    msg = 'The nearest to %s:' % valid_word
                    for k in range(top_k):
                        similar_word = integer_to_vocab[nearest_words[k]]
                        msg = '%s %s,' % (msg, similar_word)
                    print(msg)

            iteration_num += 1
    save_path = saver.save(sess, "checkpoints/cleaned_wikipedia_version.ckpt")
    embed_mat = sess.run(normalized_embed)

After running the preceding code snippet for 10 epochs, you will get the following output:

Epoch Number 10/10 Iteration Number: 43100 Avg. Training loss: 5.0380
Epoch Number 10/10 Iteration Number: 43200 Avg. Training loss: 4.9619
Epoch Number 10/10 Iteration Number: 43300 Avg. Training loss: 4.9463
Epoch Number 10/10 Iteration Number: 43400 Avg. Training loss: 4.9728
Epoch Number 10/10 Iteration Number: 43500 Avg. Training loss: 4.9872
Epoch Number 10/10 Iteration Number: 43600 Avg. Training loss: 5.0534
Epoch Number 10/10 Iteration Number: 43700 Avg. Training loss: 4.8261
Epoch Number 10/10 Iteration Number: 43800 Avg. Training loss: 4.8752
Epoch Number 10/10 Iteration Number: 43900 Avg. Training loss: 4.9818
Epoch Number 10/10 Iteration Number: 44000 Avg. Training loss: 4.9251
The nearest to nine: one, seven, zero, two, three, four, eight, five,
The nearest to such: is, as, or, some, have, be, that, physical,
The nearest to who: his, him, he, did, to, had, was, whom,
The nearest to two: zero, one, three, seven, four, five, six, nine,
The nearest to which: as, a, the, in, to, also, for, is,
The nearest to seven: eight, one, three, five, four, six, zero, two,
The nearest to american: actor, nine, singer, actress, musician, comedian, athlete, songwriter,
The nearest to many: as, other, some, have, also, these, are, or,
The nearest to powers: constitution, constitutional, formally, assembly, state, legislative, general, government,
The nearest to question: questions, existence, whether, answer, truth, reality, notion, does,
The nearest to channel: tv, television, broadcasts, broadcasting, radio, channels, broadcast, stations,
The nearest to recorded: band, rock, studio, songs, album, song, recording, pop,
The nearest to arts: art, school, alumni, schools, students, university, renowned, education,
The nearest to orthodox: churches, orthodoxy, church, catholic, catholics, oriental, christianity, christians,
The nearest to scale: scales, parts, important, note, between, its, see, measured,
The nearest to mean: is, exactly, defined, denote, hence, are, meaning, example,

Epoch Number 10/10 Iteration Number: 45100 Avg. Training loss: 4.8466
Epoch Number 10/10 Iteration Number: 45200 Avg. Training loss: 4.8836
Epoch Number 10/10 Iteration Number: 45300 Avg. Training loss: 4.9016
Epoch Number 10/10 Iteration Number: 45400 Avg. Training loss: 5.0218
Epoch Number 10/10 Iteration Number: 45500 Avg. Training loss: 5.1409
Epoch Number 10/10 Iteration Number: 45600 Avg. Training loss: 4.7864
Epoch Number 10/10 Iteration Number: 45700 Avg. Training loss: 4.9312
Epoch Number 10/10 Iteration Number: 45800 Avg. Training loss: 4.9097
Epoch Number 10/10 Iteration Number: 45900 Avg. Training loss: 4.6924
Epoch Number 10/10 Iteration Number: 46000 Avg. Training loss: 4.8999
The nearest to nine: one, eight, seven, six, four, five, american, two,
The nearest to such: can, example, examples, some, be, which, this, or,
The nearest to who: him, his, himself, he, was, whom, men, said,
The nearest to two: zero, five, three, four, six, one, seven, nine
The nearest to which: to, is, a, the, that, it, and, with,
The nearest to seven: one, six, eight, five, nine, four, three, two,
The nearest to american: musician, actor, actress, nine, singer, politician, d, one,
The nearest to many: often, as, most, modern, such, and, widely, traditional,
The nearest to powers: constitutional, formally, power, rule, exercised, parliamentary, constitution, control,
The nearest to question: questions, what, answer, existence, prove, merely, true, statements,
The nearest to channel: network, channels, broadcasts, stations, cable, broadcast, broadcasting, radio,
The nearest to recorded: songs, band, song, rock, album, bands, music, studio,
The nearest to arts: art, school, martial, schools, students, styles, education, student,
The nearest to orthodox: orthodoxy, churches, church, christianity, christians, catholics, christian, oriental,
The nearest to scale: scales, can, amounts, depends, tend, are, structural, for,
The nearest to mean: we, defined, is, exactly, equivalent, denote, number, above,
Epoch Number 10/10 Iteration Number: 46100 Avg. Training loss: 4.8583
Epoch Number 10/10 Iteration Number: 46200 Avg. Training loss: 4.8887

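As you can see, the network has, to some extent, learned semantically useful representations of the input words. To help us get a clearer picture of the embedding matrix, we are going to use a dimensionality reduction technique such as t-SNE to reduce the real-valued vectors to two dimensions, and then we'll visualize them and label each point with its corresponding word: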

# Imports needed for the visualization
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

num_visualize_words = 500
tsne_obj = TSNE()
# embed_mat holds the normalized embeddings fetched at the end of training
embedding_tsne = tsne_obj.fit_transform(embed_mat[:num_visualize_words, :])

fig, ax = plt.subplots(figsize=(14, 14))
for ind in range(num_visualize_words):
    plt.scatter(*embedding_tsne[ind, :], color='steelblue')
    plt.annotate(integer_to_vocab[ind], (embedding_tsne[ind, 0], embedding_tsne[ind, 1]), alpha=0.7)


Output:

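Figure 15.12: A visualization of the word vectors

Summary

In this chapter, we went through the idea of representation learning and why it's useful for doing deep learning or machine learning in general on input that isn't in a real-valued form. We also covered one of the most widely adopted techniques for converting words into real-valued vectors, Word2Vec, which has very interesting properties. Finally, we implemented the Word2Vec model using the skip-gram architecture.

Next up, you will see the practical usage of these learned representations in a sentiment analysis example, where we need to convert the input text to real-valued vectors.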