
TensorFlow 1.x Deep Learning Cookbook (Part 2)

Original: TensorFlow 1.x Deep Learning Cookbook

License: CC BY-NC-SA 4.0

Chapter 4: Convolutional Neural Networks

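Convolutional neural networks (CNNs, sometimes called ConvNets) are fascinating. In a short time they became a disruptive technology, breaking state-of-the-art results in domains ranging from text and video to speech, well beyond the image processing for which they were originally conceived. In this chapter, we present the following recipes:

Creating a ConvNet to classify handwritten MNIST digits
Creating a ConvNet to classify CIFAR-10
Transferring style with VGG19 for image repainting
Using a pretrained VGG16 net for transfer learning
Creating a DeepDream network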

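Introduction

A CNN is composed of many neural network layers. Two different types of layers, convolutional and pooling, typically alternate. The depth of each filter increases from left to right in the network. The last stage is typically made of one or more fully connected layers:

An example of a convolutional neural network.

There are three key intuitions behind ConvNets: local receptive fields, shared weights, and pooling. Let us review them together.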

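Local receptive fields

If we want to preserve the spatial information typically found in images, it is convenient to represent each image as a matrix of pixels. A simple way to encode the local structure is then to connect a submatrix of adjacent input neurons to a single hidden neuron in the next layer. That single hidden neuron represents one local receptive field. Note that this operation is called convolution, and it gives its name to this type of network.

Of course, we can encode more information by using overlapping submatrices. For instance, suppose that each submatrix is 5 x 5 and that those submatrices are used with MNIST images of 28 x 28 pixels. With a stride of 1 and no padding, we can then generate 24 x 24 local receptive field neurons in the next hidden layer: the submatrix can slide only 23 positions before touching the border of the image, which gives 28 - 5 + 1 = 24 placements along each dimension.

Let us define the feature map from one layer to the next. Of course, we can have multiple feature maps that learn independently from each hidden layer. For instance, we can start with 28 x 28 input neurons for processing MNIST images and then derive k feature maps in the next hidden layer, each of size 24 x 24 neurons (again with 5 x 5 submatrices and a stride of 1).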

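Shared weights and biases

Suppose that we want to move away from the raw pixel representation by gaining the ability to detect the same feature independently of where it is placed in the input image. A simple intuition is to use the same set of weights and biases for all the neurons in a hidden layer. In this way, each layer learns a set of position-independent latent features derived from the image.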
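
To make the saving from weight sharing concrete, the following sketch counts trainable parameters for a convolutional layer and for a fully connected layer producing the same number of outputs. The helper names conv_params and dense_params are hypothetical, and the layer sizes (5 x 5 kernels, 1 input channel, 32 feature maps on a 28 x 28 image) are assumptions chosen to match the MNIST recipe later in this chapter.

```python
# Hypothetical helpers (not part of the recipe): count trainable parameters.

def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Shared kernel weights plus one bias per output feature map."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output neuron."""
    return n_inputs * n_outputs + n_outputs


# 5 x 5 kernels, 1 input channel, 32 feature maps.
shared = conv_params(5, 5, 1, 32)
# A dense layer from 28 * 28 inputs to 32 maps of 28 * 28 neurons each.
unshared = dense_params(28 * 28, 28 * 28 * 32)

print(shared)    # 832
print(unshared)  # 19694080
```

Under these assumptions, the shared-weight layer needs several orders of magnitude fewer parameters, which is one reason ConvNets are practical on images.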

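A mathematical example

One simple way to understand convolution is to think of a sliding window function applied to a matrix. In the following example, given the input matrix I and the kernel K, we get the convolved output. The 3 x 3 kernel K (sometimes called a filter or feature detector) is multiplied element-wise with a 3 x 3 window of the input matrix, and the products are summed to give one cell of the output convolution matrix. All the other cells are obtained by sliding the window over I:

An example of the convolution operation: the cells involved in the computation are shown in bold

In this example, we decided to stop the sliding window as soon as it touches the border of I (so the output is 3 x 3). Alternatively, we could have chosen to pad the input with zeros (so that the output would be 5 x 5). This decision relates to the padding choice adopted.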
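
The sliding-window description can be sketched in a few lines of NumPy. This is only an illustration: the function name conv2d_single and the values of I and K are made up (the matrices from the original figure are not reproduced here), and, like TensorFlow, the sketch does not flip the kernel.

```python
import numpy as np


def conv2d_single(image, kernel, pad=0):
    """Slide `kernel` over `image` with stride 1, summing element-wise products."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)
    return out


I = np.arange(25).reshape(5, 5)      # illustrative 5 x 5 input
K = np.ones((3, 3), dtype=I.dtype)   # illustrative 3 x 3 kernel

no_padding = conv2d_single(I, K)          # stop at the border
zero_padded = conv2d_single(I, K, pad=1)  # pad the input with zeros
print(no_padding.shape)   # (3, 3)
print(zero_padded.shape)  # (5, 5)
```

The interior of the zero-padded result coincides with the unpadded result; padding only adds the cells where the window overlaps the border.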

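Another choice concerns the stride, which is the size of the shift adopted by our sliding window. This can be one or more. A larger stride generates fewer applications of the kernel and a smaller output size, while a smaller stride generates more output and retains more information.

The size of the filter, the stride, and the type of padding are hyperparameters: they are chosen when the network is defined and can be tuned across training runs.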
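
How these three hyperparameters determine the output size can be summarized with a small helper. The function conv_output_size is a hypothetical name introduced here; the two formulas are the ones TensorFlow documents for its VALID and SAME padding modes, applied to one spatial dimension (Python 3 division is assumed).

```python
import math


def conv_output_size(n, k, stride=1, padding="VALID"):
    """Output length along one dimension for input length n and window k."""
    if padding == "VALID":
        return math.ceil((n - k + 1) / stride)
    if padding == "SAME":
        return math.ceil(n / stride)
    raise ValueError("padding must be 'VALID' or 'SAME'")


print(conv_output_size(28, 5))                            # 24
print(conv_output_size(28, 5, padding="SAME"))            # 28
print(conv_output_size(28, 2, stride=2, padding="SAME"))  # 14
print(conv_output_size(5, 3))                             # 3
```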

ConvNets in TensorFlow

In TensorFlow, if we want to add a convolutional layer, we write the following:

tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, data_format=None, name=None)

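The arguments are as follows:

input: A tensor; must be one of the following types: float32, float64.
filter: A tensor; must have the same type as input.
strides: A list of ints; 1-D, of length 4. The stride of the sliding window for each dimension of input. Must be in the same order as the dimensions specified by the data format.
padding: A string, either SAME or VALID. The type of padding algorithm to use.
use_cudnn_on_gpu: An optional bool. Defaults to True.
data_format: An optional string, either NHWC or NCHW. Defaults to NHWC. Specifies the data format of the input and output data. With the default format NHWC, the data is stored in the order [batch, in_height, in_width, in_channels]. Alternatively, the format can be NCHW, with the storage order [batch, in_channels, in_height, in_width].
name: A name for the operation (optional).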
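
The difference between the two data formats is only the position of the channel axis. As a minimal illustration (the batch below is made-up data, and NumPy stands in for TensorFlow tensors), converting between them is a transpose:

```python
import numpy as np

# Illustrative batch: 2 images, 4 x 5 pixels, 3 channels, in NHWC order.
nhwc = np.zeros((2, 4, 5, 3), dtype=np.float32)

# NHWC -> NCHW: move the channel axis in front of height and width.
nchw = np.transpose(nhwc, (0, 3, 1, 2))
# NCHW -> NHWC: move the channel axis back to the end.
back = np.transpose(nchw, (0, 2, 3, 1))

print(nchw.shape)  # (2, 3, 4, 5)
print(back.shape)  # (2, 4, 5, 3)
```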

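An example of convolution is provided in the following figure:

An example of the convolution operation

Pooling layers

Suppose that we want to summarize the output of a feature map. Again, we can use the spatial contiguity of the output produced from a single feature map and aggregate the values of a submatrix into one single output value that synthetically describes the meaning associated with that physical region.

Max pooling

One easy and common choice is the so-called max pooling operator, which simply outputs the maximum activation observed in the region. In TensorFlow, if we want to define a max pooling layer of size 2 x 2, we write the following: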

tf.nn.max_pool(value, ksize, strides, padding, data_format='NHWC', name=None)

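These are the arguments:

value: A 4-D tensor of shape [batch, height, width, channels] and type tf.float32.
ksize: A list of ints of length >= 4. The size of the window for each dimension of the input tensor.
strides: A list of ints of length >= 4. The stride of the sliding window for each dimension of the input tensor.
padding: A string, either VALID or SAME.
data_format: A string. NHWC and NCHW are supported.
name: An optional name for the operation.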

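An example of the max pooling operation is given in the following figure:

An example of the pooling operation

Average pooling

Another choice is average pooling, which simply aggregates a region into the average of the activations observed in that region.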
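
Both pooling variants can be sketched with the same NumPy helper, changing only the reduction applied to each region. The name pool2x2 and the 4 x 4 feature map are illustrative assumptions; the sketch handles a single 2-D feature map with even height and width and non-overlapping 2 x 2 windows.

```python
import numpy as np


def pool2x2(feature_map, reduce_fn):
    """Apply reduce_fn over non-overlapping 2 x 2 blocks of a 2-D feature map."""
    h, w = feature_map.shape
    # After the reshape, index [i, a, j, b] addresses feature_map[2*i + a, 2*j + b].
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))


fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
max_pooled = pool2x2(fmap, np.max)    # values: [[5, 7], [13, 15]]
avg_pooled = pool2x2(fmap, np.mean)   # values: [[2.5, 4.5], [10.5, 12.5]]
print(max_pooled)
print(avg_pooled)
```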

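TensorFlow implements a large number of pooling layers, and a complete list is available online. In short, all pooling operations are nothing more than a summary operation on a given region.

ConvNets summary

CNNs are basically several layers of convolutions with nonlinear activation functions and pooling layers applied to the results. Each layer applies different filters (hundreds or thousands). The key observation to understand is that the filters are not preassigned; instead, they are learned during the training phase in such a way that a suitable loss function is minimized. It has been observed that lower layers learn to detect basic features, while higher layers progressively detect more sophisticated features such as shapes or faces. Note that, thanks to pooling, individual neurons in later layers see more of the original image, so they can compose the basic features learned in the earlier layers.

So far we have described the basic concepts of ConvNets. CNNs apply convolution and pooling operations in one dimension for audio and text data along the time dimension, in two dimensions for images along the (height x width) dimensions, and in three dimensions for videos along the (height x width x time) dimensions. For images, sliding the filter over the input volume produces a map that gives the response of the filter at each spatial position.

In other words, a ConvNet has multiple filters stacked together that learn to recognize specific visual features independently of their location in the image. Those visual features are simple in the initial layers of the network and become more and more sophisticated deeper in the network.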

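Creating a ConvNet to classify handwritten MNIST digits

In this recipe, you will learn how to create a simple convolutional network to predict MNIST digits. The deep network consists of two convolutional layers with ReLU and max pooling, followed by two fully connected final layers.

Getting ready

MNIST is a set of 60,000 training images representing handwritten digits, with a further 10,000 images for testing. The goal of this recipe is to recognize those digits with high accuracy.

How to do it...

Let us start with the recipe:

Import tensorflow, matplotlib, and numpy. Then import the MNIST data and apply one-hot encoding. Note that TensorFlow has some built-in libraries to deal with MNIST, and we are going to use them: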

from __future__ import  division, print_function 
import tensorflow as tf 
import matplotlib.pyplot as plt 
import numpy as np 
# Import MNIST data 
from tensorflow.examples.tutorials.mnist import input_data 
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
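
The one_hot=True flag asks the loader to return each label as a vector with a single 1 at the index of the digit. A minimal NumPy sketch of that encoding follows; the helper one_hot is a hypothetical name and is not how the TensorFlow loader is implemented.

```python
import numpy as np


def one_hot(labels, n_classes=10):
    """Turn integer class labels into one-hot rows."""
    return np.eye(n_classes, dtype=np.float32)[labels]


encoded = one_hot(np.array([3, 0, 9]))
print(encoded.shape)           # (3, 10)
print(encoded.argmax(axis=1))  # [3 0 9]
```

Taking argmax along axis 1 recovers the original digit, which is what the accuracy computation later in this recipe relies on.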

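Introspect some of the data to understand what MNIST is. This tells us how many images are in the training dataset and how many are in the test dataset. We also visualize a few digits, just to understand how they are represented. The flattened multi-image output gives a visual sense of how difficult it can be to recognize handwritten digits, even for humans.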

def train_size(num): 
    print ('Total Training Images in Dataset = ' + str(mnist.train.images.shape)) 
    print ('--------------------------------------------------') 
    x_train = mnist.train.images[:num,:] 
    print ('x_train Examples Loaded = ' + str(x_train.shape)) 
    y_train = mnist.train.labels[:num,:] 
    print ('y_train Examples Loaded = ' + str(y_train.shape)) 
    print('') 
    return x_train, y_train 
def test_size(num): 
    print ('Total Test Examples in Dataset = ' + str(mnist.test.images.shape)) 
    print ('--------------------------------------------------') 
    x_test = mnist.test.images[:num,:] 
    print ('x_test Examples Loaded = ' + str(x_test.shape)) 
    y_test = mnist.test.labels[:num,:] 
    print ('y_test Examples Loaded = ' + str(y_test.shape)) 
    return x_test, y_test 
def display_digit(num): 
    print(y_train[num]) 
    label = y_train[num].argmax(axis=0) 
    image = x_train[num].reshape([28,28]) 
    plt.title('Example: %d  Label: %d' % (num, label)) 
    plt.imshow(image, cmap=plt.get_cmap('gray_r')) 
    plt.show() 
def display_mult_flat(start, stop): 
    images = x_train[start].reshape([1,784]) 
    for i in range(start+1,stop): 
        images = np.concatenate((images, x_train[i].reshape([1,784]))) 
    plt.imshow(images, cmap=plt.get_cmap('gray_r')) 
    plt.show() 
x_train, y_train = train_size(55000) 
display_digit(np.random.randint(0, x_train.shape[0])) 
display_mult_flat(0,400)

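Let us look at the output of the preceding code:

An example of MNIST handwritten digits

Set up the learning parameters batch_size and display_step. In addition, set n_input = 784, given that MNIST images are 28 x 28 pixels, set n_classes = 10 for the output digits [0-9], and set the dropout keep probability to 0.85: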

# Parameters
learning_rate = 0.001
training_iters = 500
batch_size = 128
display_step = 10
# Network Parameters
n_input = 784    # MNIST data input (img shape: 28*28)
n_classes = 10   # MNIST total classes (0-9 digits)
dropout = 0.85   # Dropout, probability to keep units

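Set up the TensorFlow computation graph input. Let us define placeholders for the input images, the true labels, and the dropout keep probability: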

x = tf.placeholder(tf.float32, [None, n_input]) 
y = tf.placeholder(tf.float32, [None, n_classes]) 
keep_prob = tf.placeholder(tf.float32)

Define a convolutional layer with input x, weights W, bias b, and a given stride. The activation function is ReLU and the padding is SAME:

def conv2d(x, W, b, strides=1): 
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') 
    x = tf.nn.bias_add(x, b) 
    return tf.nn.relu(x)

Define a max pooling layer with input x, window size k, and SAME padding:

def maxpool2d(x, k=2): 
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

Define a ConvNet with two convolutional layers, followed by a fully connected layer, a dropout layer, and a final output layer:

def conv_net(x, weights, biases, dropout): 
    # reshape the input picture 
    x = tf.reshape(x, shape=[-1, 28, 28, 1]) 
    # First convolution layer 
    conv1 = conv2d(x, weights['wc1'], biases['bc1']) 
    # Max Pooling used for downsampling 
    conv1 = maxpool2d(conv1, k=2) 
    # Second convolution layer 
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2']) 
    # Max Pooling used for downsampling 
    conv2 = maxpool2d(conv2, k=2) 
    # Reshape conv2 output to match the input of fully connected layer  
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]]) 
    # Fully connected layer 
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1']) 
    fc1 = tf.nn.relu(fc1) 
    # Dropout 
    fc1 = tf.nn.dropout(fc1, dropout) 
    # Output the class prediction 
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out']) 
    return out
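
The dropout step keeps each unit with the given probability and, as documented for tf.nn.dropout in TensorFlow 1.x, scales the surviving units by 1 / keep_prob. The following NumPy sketch mimics that behavior; dropout_sketch is a hypothetical helper and the activations are made-up data.

```python
import numpy as np


def dropout_sketch(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale the survivors."""
    mask = rng.uniform(size=activations.shape) < keep_prob
    return np.where(mask, activations / keep_prob, 0.0)


rng = np.random.RandomState(0)
fc1 = np.ones((4, 8), dtype=np.float32)

# Training: every entry becomes either 0 (dropped) or 1 / 0.85 (kept, rescaled).
dropped = dropout_sketch(fc1, keep_prob=0.85, rng=rng)
print(np.unique(dropped))

# Evaluation: with keep_prob = 1.0 the layer leaves its input unchanged.
identity = dropout_sketch(fc1, keep_prob=1.0, rng=rng)
```

This is why the recipe feeds keep_prob: dropout during training and keep_prob: 1. when measuring accuracy.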

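Define the layer weights and biases. The first convolutional layer has a 5 x 5 convolution, 1 input, and 32 outputs. The second convolutional layer has a 5 x 5 convolution, 32 inputs, and 64 outputs. The fully connected layer has 7 x 7 x 64 inputs and 1,024 outputs, while the output layer has 1,024 inputs and 10 outputs corresponding to the final digit classes. All the weights and biases are initialized with the random_normal distribution: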

weights = { 
    # 5x5 conv, 1 input, and 32 outputs 
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])), 
    # 5x5 conv, 32 inputs, and 64 outputs 
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])), 
    # fully connected, 7*7*64 inputs, and 1024 outputs 
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])), 
    # 1024 inputs, 10 outputs for class digits 
    'out': tf.Variable(tf.random_normal([1024, n_classes])) 
} 
biases = { 
    'bc1': tf.Variable(tf.random_normal([32])), 
    'bc2': tf.Variable(tf.random_normal([64])), 
    'bd1': tf.Variable(tf.random_normal([1024])), 
    'out': tf.Variable(tf.random_normal([n_classes])) 
}
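
The 7*7*64 input size of the fully connected layer follows from tracing the spatial size through the network. The sketch below does that arithmetic, assuming SAME padding throughout (as in the conv2d and maxpool2d helpers of this recipe) and Python 3 division; same_output is a hypothetical helper name.

```python
import math


def same_output(n, stride):
    """Spatial size after a SAME-padded convolution or pooling step."""
    return math.ceil(n / stride)


size = 28                      # MNIST images are 28 x 28
size = same_output(size, 1)    # conv1, stride 1 -> 28
size = same_output(size, 2)    # maxpool, k=2    -> 14
size = same_output(size, 1)    # conv2, stride 1 -> 14
size = same_output(size, 2)    # maxpool, k=2    -> 7
flattened = size * size * 64   # 64 feature maps after conv2
print(size, flattened)         # 7 3136
```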

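Build the ConvNet with the given weights and biases. Define the loss function based on the softmax cross-entropy with logits and use the Adam optimizer to minimize the cost. After optimization, compute the accuracy: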

pred = conv_net(x, weights, biases, keep_prob) 
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y)) 
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost) 
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1)) 
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 
init = tf.global_variables_initializer()
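
To see what the cost and accuracy tensors compute, here is a NumPy sketch of the same quantities on two made-up examples with three classes. The function softmax_cross_entropy is a hypothetical stand-in, not TensorFlow's implementation.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)
cost = per_example.mean()   # analogous to tf.reduce_mean over the batch
accuracy = np.mean(logits.argmax(axis=1) == labels.argmax(axis=1))
print(cost, accuracy)
```

The second example has uniform logits, so its loss equals log(3), and its argmax (class 0) does not match the label (class 1), giving an accuracy of 0.5 on this tiny batch.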

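Launch the graph and iterate training_iters times, each time feeding a batch of batch_size examples to the optimizer. Note that we train on the mnist.train data, which is kept separate from the test data. Every display_step iterations, the current accuracy on the batch is computed together with the accuracy on the mnist.test images, with no dropout applied.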

train_loss = [] 
train_acc = [] 
test_acc = [] 
with tf.Session() as sess: 
    sess.run(init) 
    step = 1 
    while step <= training_iters: 
        batch_x, batch_y = mnist.train.next_batch(batch_size) 
        sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, 
                                       keep_prob: dropout}) 
        if step % display_step == 0: 
            loss_train, acc_train = sess.run([cost, accuracy],  
                                             feed_dict={x: batch_x, 
                                                        y: batch_y, 
                                                        keep_prob: 1.}) 
            print("Iter " + str(step) + ", Minibatch Loss= " +
                  "{:.2f}".format(loss_train) + ", Training Accuracy= " +
                  "{:.2f}".format(acc_train))
            # Calculate accuracy on the MNIST test images.
            # Note that in this case no dropout is applied.
            acc_test = sess.run(accuracy,
                                feed_dict={x: mnist.test.images,
                                           y: mnist.test.labels,
                                           keep_prob: 1.})
            print("Testing Accuracy: " +
                  "{:.2f}".format(acc_test))
            train_loss.append(loss_train) 
            train_acc.append(acc_train) 
            test_acc.append(acc_test)             
        step += 1

Plot the softmax loss per iteration, together with the train and test accuracy:

eval_indices = range(0, training_iters, display_step) 
# Plot loss over time 
plt.plot(eval_indices, train_loss, 'k-') 
plt.title('Softmax Loss per iteration') 
plt.xlabel('Iteration') 
plt.ylabel('Softmax Loss') 
plt.show() 
# Plot train and test accuracy 
plt.plot(eval_indices, train_acc, 'k-', label='Train Set Accuracy') 
plt.plot(eval_indices, test_acc, 'r--', label='Test Set Accuracy') 
plt.title('Train and Test Accuracy') 
plt.xlabel('Generation') 
plt.ylabel('Accuracy') 
plt.legend(loc='lower right') 
plt.show()

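Following is the output of the preceding code. We first look at the softmax loss per iteration:

An example of loss decrease

Next, we look at the train and test accuracy:

An example of train and test accuracy increase

How it works...

Using ConvNets, we increased our performance on the MNIST dataset to nearly 95 percent. Our ConvNet consists of two layers that each combine convolution, ReLU, and max pooling, followed by two fully connected layers with dropout. Training happens in batches of size 128 with Adam as the optimizer, a learning rate of 0.001, and a maximum of 500 iterations.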

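Creating a ConvNet to classify CIFAR-10

In this recipe, you will learn how to classify images taken from CIFAR-10. The CIFAR-10 dataset consists of 60,000 32 x 32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The following figure is taken from an online source:

An example of CIFAR images

Getting ready

In this recipe, we use tflearn, a higher-level framework that abstracts some of the TensorFlow internals and allows us to focus on the definition of deep networks. TFLearn is available online, and this code is part of its standard distribution.

How to do it...

We proceed with the recipe as follows:

Import a few utils and core layers for ConvNets, dropout, fully_connected, and max_pool. In addition, import a few modules useful for image processing and image augmentation. Note that TFLearn provides some already defined higher-level layers for ConvNets, which allows us to focus on the definition of our code: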

from __future__ import division, print_function, absolute_import 
import tflearn 
from tflearn.data_utils import shuffle, to_categorical 
from tflearn.layers.core import input_data, dropout, fully_connected 
from tflearn.layers.conv import conv_2d, max_pool_2d 
from tflearn.layers.estimator import regression 
from tflearn.data_preprocessing import ImagePreprocessing 
from tflearn.data_augmentation import ImageAugmentation

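Load the CIFAR-10 data and separate it into the X train data, the Y train labels, X_test for the test data, and Y_test for the test labels. It can be useful to shuffle X and Y to avoid depending on a particular data configuration. The last step is to one-hot encode Y and Y_test: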

# Data loading and preprocessing 
from tflearn.datasets import cifar10 
(X, Y), (X_test, Y_test) = cifar10.load_data() 
X, Y = shuffle(X, Y) 
Y = to_categorical(Y, 10) 
Y_test = to_categorical(Y_test, 10)

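Use ImagePreprocessing() to zero center (with the mean computed over the whole dataset) and STD normalize (with the std computed over the whole dataset). The TFLearn data stream is designed to speed up training by preprocessing data on the CPU while the GPU performs model training.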

# Real-time data preprocessing 
img_prep = ImagePreprocessing() 
img_prep.add_featurewise_zero_center() 
img_prep.add_featurewise_stdnorm()
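
The effect of the two preprocessing steps can be illustrated with NumPy. This is a sketch under the assumption, taken from the description above, that a single mean and a single std are computed over the whole dataset; X_demo is random stand-in data, not CIFAR-10, and the code does not use TFLearn.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-in for a small image dataset: 100 images of 32 x 32 x 3.
X_demo = rng.uniform(0, 255, size=(100, 32, 32, 3)).astype(np.float32)

mean = X_demo.mean()            # one mean over the whole dataset
std = X_demo.std()              # one std over the whole dataset
X_norm = (X_demo - mean) / std  # zero center, then STD normalize

print(X_norm.mean(), X_norm.std())  # close to 0 and 1
```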

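Augment the dataset by applying random left-right flips and random rotations. This step is a simple trick used to increase the data available for training: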

# Real-time data augmentation 
img_aug = ImageAugmentation() 
img_aug.add_random_flip_leftright() 
img_aug.add_random_rotation(max_angle=25.)

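Create the convolutional network with the image preprocessing and augmentation defined earlier. The network consists of three convolutional layers. The first one uses 32 convolutional filters with a filter size of 3 and a ReLU activation function. After that, there is a max_pool layer for downsizing. Then there are two cascaded convolutional layers, each with 64 convolutional filters, a filter size of 3, and a ReLU activation function. After that, there is a max_pool layer for downsizing, a fully connected network with 512 neurons and a ReLU activation function, followed by dropout with a probability of 50 percent. The last layer is a fully connected network with 10 neurons and a softmax activation function, used to determine the category of the image. Note that this particular type of ConvNet is known to be very effective with CIFAR-10. In this particular case, we use the Adam optimizer with categorical_crossentropy and a learning rate of 0.001: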

# Convolutional network building 
network = input_data(shape=[None, 32, 32, 3], 
                     data_preprocessing=img_prep, 
                     data_augmentation=img_aug) 
network = conv_2d(network, 32, 3, activation='relu') 
network = max_pool_2d(network, 2) 
network = conv_2d(network, 64, 3, activation='relu') 
network = conv_2d(network, 64, 3, activation='relu') 
network = max_pool_2d(network, 2) 
network = fully_connected(network, 512, activation='relu') 
network = dropout(network, 0.5) 
network = fully_connected(network, 10, activation='softmax') 
network = regression(network, optimizer='adam', 
                     loss='categorical_crossentropy', 
                     learning_rate=0.001)

Instantiate the ConvNet and run the training for 50 epochs with batch_size=96:

# Train using classifier 
model = tflearn.DNN(network, tensorboard_verbose=0) 
model.fit(X, Y, n_epoch=50, shuffle=True, validation_set=(X_test, Y_test), 
          show_metric=True, batch_size=96, run_id='cifar10_cnn')

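How it works...

TFLearn hides many of the implementation details exposed by TensorFlow and, in many cases, allows us to focus on the definition of the ConvNet at a higher level of abstraction. Our pipeline reaches an accuracy of 88 percent in 50 epochs. The following figure is a snapshot of the execution in a Jupyter notebook:

An example of Jupyter execution for CIFAR-10 classification

To install TFLearn, see the Installation Guide. If you want to see more examples, a long list of already cooked solutions is available online.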

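Transferring style with VGG19 for image repainting

In this recipe, you will teach a computer how to paint. The key idea is to have a painting model image from which the neural network infers the painting style. This style is then transferred to another picture, which is repainted accordingly. The recipe is a modification of the code developed by log0, which is available online.

Getting ready

We are going to implement the algorithm described in the paper A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, so it is a good idea to read the paper first. This recipe reuses a pretrained VGG19 model available online, which should be downloaded locally. Our style image is a famous painting by Van Gogh available online, and our content image is a picture of Marilyn Monroe downloaded from Wikipedia. The content image is repainted according to the Van Gogh style.

How to do it...

Let us start with the recipe:

Import a few modules such as numpy, scipy, tensorflow, and matplotlib. Then import PIL to manipulate images. Note that this code runs in a Jupyter notebook, which you can download online, so the fragment %matplotlib inline has been added: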

from __future__ import division
import os
import sys
import numpy as np
import scipy.io
import scipy.misc
import tensorflow as tf
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
from PIL import Image
%matplotlib inline

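Then set the input path for the image used to learn the style, and the input path for the content image to be repainted according to that style: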

OUTPUT_DIR = 'output/' 
# Style image 
STYLE_IMAGE = 'data/StarryNight.jpg' 
# Content image to be repainted 
CONTENT_IMAGE = 'data/Marilyn_Monroe_in_1952.jpg'

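Then we set the noise ratio used during image generation, and the emphasis we would like to put on content loss and on style loss when repainting the content image. In addition, we store the path to the pretrained VGG model and the mean computed during the VGG pretraining. This mean is already known, and it is subtracted from the input to the VGG model: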

# how much noise is in the image 
NOISE_RATIO = 0.6 
# How much emphasis on content loss. 
BETA = 5 
# How much emphasis on style loss. 
ALPHA = 100 
# the VGG 19-layer pre-trained model 
VGG_MODEL = 'data/imagenet-vgg-verydeep-19.mat' 
# The mean used when the VGG was trained 
# It is subtracted from the input to the VGG model.
MEAN_VALUES = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))

Show the content image, just to understand what it looks like:

content_image = scipy.misc.imread(CONTENT_IMAGE)
imshow(content_image)

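Here is the output of the preceding code (note that this image is available online):

Resize the style image and show it, just to understand what it looks like. Note that the content image and the style image now have the same size and the same number of color channels: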

style_image = scipy.misc.imread(STYLE_IMAGE) 
# Get shape of target and make the style image the same 
target_shape = content_image.shape 
print("target_shape= %s" % (target_shape,))
print("style_shape= %s" % (style_image.shape,))
#ratio = target_shape[1] / style_image.shape[1]
#print("resize ratio= %s" % ratio)
style_image = scipy.misc.imresize(style_image, target_shape) 
scipy.misc.imsave(STYLE_IMAGE, style_image) 
imshow(style_image)

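Here is the output of the preceding code:

An example of a Vincent van Gogh painting

The next step is to define the VGG model as described in the original paper. Note that the deep learning network is rather complex, as it combines multiple ConvNet layers with ReLU activation functions and max pooling. An additional note is that in the original style transfer paper (A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge), many experiments show that average pooling actually outperforms max pooling. So, we will use average pooling instead: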

def load_vgg_model(path, image_height, image_width, color_channels):
   """
   Returns the VGG model as defined in the paper
       0 is conv1_1 (3, 3, 3, 64)
       1 is relu
       2 is conv1_2 (3, 3, 64, 64)
       3 is relu    
       4 is maxpool
       5 is conv2_1 (3, 3, 64, 128)
       6 is relu
       7 is conv2_2 (3, 3, 128, 128)
       8 is relu
       9 is maxpool
       10 is conv3_1 (3, 3, 128, 256)
       11 is relu
       12 is conv3_2 (3, 3, 256, 256)
       13 is relu
       14 is conv3_3 (3, 3, 256, 256)
       15 is relu
       16 is conv3_4 (3, 3, 256, 256)
       17 is relu
       18 is maxpool
       19 is conv4_1 (3, 3, 256, 512)
       20 is relu
       21 is conv4_2 (3, 3, 512, 512)
       22 is relu
       23 is conv4_3 (3, 3, 512, 512)
       24 is relu
       25 is conv4_4 (3, 3, 512, 512)
       26 is relu
       27 is maxpool
       28 is conv5_1 (3, 3, 512, 512)
       29 is relu
       30 is conv5_2 (3, 3, 512, 512)
       31 is relu
       32 is conv5_3 (3, 3, 512, 512)
       33 is relu
       34 is conv5_4 (3, 3, 512, 512)
       35 is relu
       36 is maxpool
       37 is fullyconnected (7, 7, 512, 4096)
       38 is relu
       39 is fullyconnected (1, 1, 4096, 4096)
       40 is relu
       41 is fullyconnected (1, 1, 4096, 1000)
       42 is softmax
   """
   vgg = scipy.io.loadmat(path)
   vgg_layers = vgg['layers']   

   def _weights(layer, expected_layer_name):
       """
       Return the weights and bias from the VGG model for a given layer.
       """
       W = vgg_layers[0][layer][0][0][0][0][0]
       b = vgg_layers[0][layer][0][0][0][0][1]
       layer_name = vgg_layers[0][layer][0][0][-2]
       assert layer_name == expected_layer_name
       return W, b

   def _relu(conv2d_layer):
       """
       Return the RELU function wrapped over a TensorFlow layer. Expects a
       Conv2d layer input.
       """
       return tf.nn.relu(conv2d_layer)

   def _conv2d(prev_layer, layer, layer_name):
       """
       Return the Conv2D layer using the weights, biases from the VGG
       model at 'layer'.
       """
       W, b = _weights(layer, layer_name)
       W = tf.constant(W)
       b = tf.constant(np.reshape(b, (b.size)))
       return tf.nn.conv2d(
           prev_layer, filter=W, strides=[1, 1, 1, 1], padding='SAME') + b

   def _conv2d_relu(prev_layer, layer, layer_name):
       """
       Return the Conv2D + RELU layer using the weights, biases from the VGG
       model at 'layer'.
       """
       return _relu(_conv2d(prev_layer, layer, layer_name))

   def _avgpool(prev_layer):
       """
       Return the AveragePooling layer.
       """
       return tf.nn.avg_pool(prev_layer, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

   # Constructs the graph model.
   graph = {}
   graph['input']   = tf.Variable(np.zeros((1,
                                            image_height, image_width, color_channels)),
                                  dtype = 'float32')
   graph['conv1_1']  = _conv2d_relu(graph['input'], 0, 'conv1_1')
   graph['conv1_2']  = _conv2d_relu(graph['conv1_1'], 2, 'conv1_2')
   graph['avgpool1'] = _avgpool(graph['conv1_2'])
   graph['conv2_1']  = _conv2d_relu(graph['avgpool1'], 5, 'conv2_1')
   graph['conv2_2']  = _conv2d_relu(graph['conv2_1'], 7, 'conv2_2')
   graph['avgpool2'] = _avgpool(graph['conv2_2'])
   graph['conv3_1']  = _conv2d_relu(graph['avgpool2'], 10, 'conv3_1')
   graph['conv3_2']  = _conv2d_relu(graph['conv3_1'], 12, 'conv3_2')
   graph['conv3_3']  = _conv2d_relu(graph['conv3_2'], 14, 'conv3_3')
   graph['conv3_4']  = _conv2d_relu(graph['conv3_3'], 16, 'conv3_4')
   graph['avgpool3'] = _avgpool(graph['conv3_4'])
   graph['conv4_1']  = _conv2d_relu(graph['avgpool3'], 19, 'conv4_1')
   graph['conv4_2']  = _conv2d_relu(graph['conv4_1'], 21, 'conv4_2')
   graph['conv4_3']  = _conv2d_relu(graph['conv4_2'], 23, 'conv4_3')
   graph['conv4_4']  = _conv2d_relu(graph['conv4_3'], 25, 'conv4_4')
   graph['avgpool4'] = _avgpool(graph['conv4_4'])
   graph['conv5_1']  = _conv2d_relu(graph['avgpool4'], 28, 'conv5_1')
   graph['conv5_2']  = _conv2d_relu(graph['conv5_1'], 30, 'conv5_2')
   graph['conv5_3']  = _conv2d_relu(graph['conv5_2'], 32, 'conv5_3')
   graph['conv5_4']  = _conv2d_relu(graph['conv5_3'], 34, 'conv5_4')
   graph['avgpool5'] = _avgpool(graph['conv5_4'])
   return graph

Define the content loss function as described in the original paper:

def content_loss_func(sess, model):
    """Content loss function as defined in the paper."""

    def _content_loss(p, x):
        # N is the number of filters (at layer l).
        N = p.shape[3]
        # M is the height times the width of the feature map (at layer l).
        M = p.shape[1] * p.shape[2]
        return (1 / (4 * N * M)) * tf.reduce_sum(tf.pow(x - p, 2))

    return _content_loss(sess.run(model['conv4_2']), model['conv4_2'])
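
The same formula can be checked on small arrays with NumPy. The function content_loss_np below is a hypothetical re-statement of _content_loss for illustration, and the two feature maps are made-up data in the (1, height, width, filters) layout used by the model.

```python
import numpy as np


def content_loss_np(p, x):
    """1 / (4 * N * M) times the squared difference between two feature maps."""
    N = p.shape[3]                 # number of filters
    M = p.shape[1] * p.shape[2]    # height times width
    return (1.0 / (4 * N * M)) * np.sum((x - p) ** 2)


p = np.zeros((1, 2, 2, 3))   # illustrative content features
x = np.ones((1, 2, 2, 3))    # illustrative generated features
print(content_loss_np(p, x))  # 0.25
```

Here N = 3 and M = 4, the squared differences sum to 12, and 12 / (4 * 3 * 4) = 0.25. The loss is zero when the generated features equal the content features.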

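Define which VGG layers we are going to reuse. If we would like to have softer features, we need to increase the weights of the higher layers (conv5_1) and decrease the weights of the lower layers (conv1_1). If we would like to have harder features, we need to do the opposite: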

STYLE_LAYERS = [
    ('conv1_1', 0.5),
    ('conv2_1', 1.0),
    ('conv3_1', 1.5),
    ('conv4_1', 3.0),
    ('conv5_1', 4.0),
]

Define the style loss function as described in the original paper:

def style_loss_func(sess, model):
   """
   Style loss function as defined in the paper.
   """

   def _gram_matrix(F, N, M):
       """
       The gram matrix G.
       """
       Ft = tf.reshape(F, (M, N))
       return tf.matmul(tf.transpose(Ft), Ft)

   def _style_loss(a, x):
       """
       The style loss calculation.
       """
       # N is the number of filters (at layer l).
       N = a.shape[3]
       # M is the height times the width of the feature map (at layer l).
       M = a.shape[1] * a.shape[2]
       # A is the style representation of the original image (at layer l).
       A = _gram_matrix(a, N, M)
       # G is the style representation of the generated image (at layer l).
       G = _gram_matrix(x, N, M)
       result = (1 / (4 * N**2 * M**2)) * tf.reduce_sum(tf.pow(G - A, 2))
       return result
   E = [_style_loss(sess.run(model[layer_name]), model[layer_name])
        for layer_name, _ in STYLE_LAYERS]
   W = [w for _, w in STYLE_LAYERS]
   loss = sum([W[l] * E[l] for l in range(len(STYLE_LAYERS))])
   return loss
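
The Gram matrix at the heart of the style loss measures how strongly each pair of filters responds together, regardless of where in the image the response occurs. A NumPy sketch of the per-layer computation follows; gram_matrix_np and style_loss_np are hypothetical re-statements of the helpers above, applied to random stand-in features.

```python
import numpy as np


def gram_matrix_np(F):
    """Correlations between the N filter responses of a (1, h, w, N) feature map."""
    N = F.shape[3]
    M = F.shape[1] * F.shape[2]
    Ft = F.reshape(M, N)       # one row per spatial position
    return Ft.T.dot(Ft)        # shape (N, N)


def style_loss_np(a, x):
    """Scaled squared difference between the Gram matrices of a and x."""
    N = a.shape[3]
    M = a.shape[1] * a.shape[2]
    A = gram_matrix_np(a)
    G = gram_matrix_np(x)
    return (1.0 / (4 * N ** 2 * M ** 2)) * np.sum((G - A) ** 2)


rng = np.random.RandomState(0)
a = rng.rand(1, 4, 4, 3)       # illustrative style features
G = gram_matrix_np(a)
print(G.shape)                 # (3, 3)
print(style_loss_np(a, a))     # 0.0
```

Because the spatial positions are summed out, the Gram matrix has one entry per pair of filters and is symmetric.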

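Define a function to generate a noise image and intermix it with the content image at a given ratio. Define two auxiliary methods to preprocess and save images: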

def generate_noise_image(content_image, noise_ratio = NOISE_RATIO):
   """
   Returns a noise image intermixed with the content image at a certain ratio.
   """
   noise_image = np.random.uniform(
           -20, 20,
           (1,
            content_image[0].shape[0],
            content_image[0].shape[1],
            content_image[0].shape[2])).astype('float32')
   # White noise image from the content representation. Take a weighted average
   # of the values
   input_image = noise_image * noise_ratio + content_image * (1 - noise_ratio)
   return input_image

def process_image(image):
   # Resize the image for convnet input, there is no change but just
   # add an extra dimension.
   image = np.reshape(image, ((1,) + image.shape))
   # Input to the VGG model expects the mean to be subtracted.
   image = image - MEAN_VALUES
   return image

def save_image(path, image):
   # Output should add back the mean.
   image = image + MEAN_VALUES
   # Get rid of the first useless dimension, what remains is the image.
   image = image[0]
   image = np.clip(image, 0, 255).astype('uint8')
   scipy.misc.imsave(path, image)

Start a TensorFlow interactive session:

sess = tf.InteractiveSession()

Load the processed content image and show it:

content_image = process_image(scipy.misc.imread(CONTENT_IMAGE))
imshow(content_image[0])

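We get the following output from the code (note that we used an image available online):

Load the processed style image and show it:

style_image = process_image(scipy.misc.imread(STYLE_IMAGE))
imshow(style_image[0])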

The output is as follows:

Load the model and print it:

model = load_vgg_model(VGG_MODEL, style_image[0].shape[0], style_image[0].shape[1], style_image[0].shape[2])
print(model)

Generate a random noise image that is used to bootstrap the repainting:

input_image = generate_noise_image(content_image)
imshow(input_image[0])

Run the TensorFlow session:

sess.run(tf.global_variables_initializer())

Construct content_loss and style_loss with the respective images:

# Construct content_loss using content_image.
sess.run(model['input'].assign(content_image))
content_loss = content_loss_func(sess, model)
# Construct style_loss using style_image.
sess.run(model['input'].assign(style_image))
style_loss = style_loss_func(sess, model)

Construct total_loss as a weighted combination of content_loss and style_loss:

# Construct total_loss as weighted combination of content_loss and style_loss
total_loss = BETA * content_loss + ALPHA * style_loss

Build an optimizer to minimize the total loss. In this case, we adopt the Adam optimizer:

# The content is built from one layer, while the style is from five 
# layers. Then we minimize the total_loss 
optimizer = tf.train.AdamOptimizer(2.0) 
train_step = optimizer.minimize(total_loss)

Bootstrap the network with the input image:

sess.run(tf.global_variables_initializer())
sess.run(model['input'].assign(input_image))

Run the model for a fixed number of iterations and produce intermediate repainted images:

# Number of optimization steps; an assumed value, since the text does not specify it
ITERATIONS = 1000
sess.run(tf.global_variables_initializer())
sess.run(model['input'].assign(input_image))
print("started iteration")
for it in range(ITERATIONS):
   sess.run(train_step)
   print(it)
   if it%100 == 0:
       # Print every 100 iteration.
       mixed_image = sess.run(model['input'])
       print('Iteration %d' % (it))
       print('sum : %s' % sess.run(tf.reduce_sum(mixed_image)))
       print('cost: %s' % sess.run(total_loss))
       if not os.path.exists(OUTPUT_DIR):
           os.mkdir(OUTPUT_DIR)
       filename = 'output/%d.png' % (it)
       save_image(filename, mixed_image)

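The following image shows how the content image has been repainted after 200, 400, and 600 iterations:

An example of style transfer

How it works...

In this recipe, we have seen how to use style transfer to repaint a content image. A style image has been provided as input to a neural network, which learned the key aspects defining the style adopted by the painter. These aspects have then been used to transfer the style to the content image.

Style transfer has been an active area of research since the original proposal in 2015. Many new ideas have been proposed to accelerate the computation and to extend style transfer to video analysis. Among these, two results are worth mentioning:

The article Fast Style Transfer by Logan Engstrom introduces a very fast implementation that also works out of the box with videos.

The deepart website lets you play with your own images and repaint your picture in the style of your favorite artist. An Android app, an iPhone app, and a web app are also available.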

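Using a pretrained VGG16 net for transfer learning

In this recipe, we discuss transfer learning, a very powerful deep learning technique with many applications in different domains. The intuition is very simple and can be explained with an analogy. Suppose you want to learn a new language, say Spanish; then it can be useful to start from what you already know in a different language, say English.

Following this line of thinking, computer vision researchers now commonly use pretrained CNNs to generate representations for novel tasks, where the dataset may not be large enough to train an entire CNN from scratch. Another common tactic is to take a pretrained ImageNet network and then fine-tune the entire network to the novel task. The example proposed here is inspired by a famous blog post for Keras by Francois Chollet.

Getting ready

The idea is to use the VGG16 network pretrained on a large dataset such as ImageNet. Note that training can be rather computationally expensive, so it makes sense to reuse an already pretrained network:

A VGG16 Network

So, how do we use VGG16? Keras makes it easy, as the library has a standard VGG16 application available and the precomputed weights are downloaded automatically. Note that we explicitly omit the last layer and replace it with our custom layer, which is fine-tuned on top of the prebuilt VGG16. In this example, you will learn how to classify the dog and cat images made available by Kaggle.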

How to do it...

We proceed with the recipe as follows:

Download the dogs and cats data from Kaggle (www.kaggle.com/c/dogs-vs-c…).
Import the Keras modules that are used in later computations and save a few useful constants:

from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Dropout, Flatten, Dense
img_width, img_height = 256, 256
batch_size = 16
epochs = 50
train_data_dir = 'data/dogs_and_cats/train'
validation_data_dir = 'data/dogs_and_cats/validation'
#OUT CATEGORIES
OUT_CATEGORIES=1
#number of train, validation samples
nb_train_samples = 2000
nb_validation_samples =

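Load the VGG16 network pretrained on ImageNet and omit the last layer, because we will add a custom classification network on top of the prebuilt VGG16 and replace the last classification layer: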

# load the VGG16 model pretrained on imagenet 
base_model = applications.VGG16(weights = "imagenet", include_top=False, input_shape = (img_width, img_height, 3)) 
base_model.summary()

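Here is the output of the preceding code:

Freeze a certain number of the lower layers of the pretrained VGG16 network. In this case, we decide to freeze the initial 15 layers:

# Freeze the 15 lower layers
for layer in base_model.layers[:15]:
    layer.trainable = False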

Add a custom set of top layers for classification:

# Add custom top layers
# build a classifier model to put on top of the convolutional model
top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dropout(0.5))
top_model.add(Dense(OUT_CATEGORIES, activation='sigmoid'))

The custom network should be pretrained separately; here we omit this part for simplicity and leave the task to the reader:

#top_model.load_weights(top_model_weights_path)

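Create a new network that is the juxtaposition of the pretrained VGG16 network and our pretrained custom network:

# creating the final model, a composition of the
# pre-trained VGG16 network and the custom top model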
model = Model(inputs=base_model.input, outputs=top_model(base_model.output)) 
# compile the model 
model.compile(loss = "binary_crossentropy", optimizer = optimizers.SGD(lr=0.0001, momentum=0.9), metrics=["accuracy"])

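Retrain the juxtaposed new model, still keeping the 15 lowest layers of VGG16 frozen. In this particular example, we also use an image augmenter to increase our training set: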

# Initiate the train and test generators with data augmentation
train_datagen = ImageDataGenerator(
   rescale = 1./255,
   horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = train_datagen.flow_from_directory(
   train_data_dir,
   target_size=(img_height, img_width),
   batch_size=batch_size,
   class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
   validation_data_dir,
   target_size=(img_height, img_width),
   batch_size=batch_size,
   class_mode='binary', shuffle=False)
model.fit_generator(
   train_generator,
   steps_per_epoch=nb_train_samples // batch_size,
   epochs=epochs,
   validation_data=validation_generator,
   validation_steps=nb_validation_samples // batch_size,
   verbose=2, workers=12)

Evaluate the results on the juxtaposed network:

score = model.evaluate_generator(validation_generator, nb_validation_samples // batch_size)
scores = model.predict_generator(validation_generator, nb_validation_samples // batch_size)

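How it works...

A standard VGG16 network has been pretrained on the whole of ImageNet, with precomputed weights downloaded from the internet. This network has then been juxtaposed with a custom network that has also been trained separately. Then, the juxtaposed network has been retrained as a whole, keeping the 15 lower layers of VGG16 frozen.

This combination is very effective. It saves significant computational power and reuses the work already performed for VGG16 by transferring what the network learned on ImageNet, applying that learning to our new specific domain, and thereby performing a fine-tuned classification task.

There are a few rules of thumb to consider according to the specific classification task:

If the new dataset is small and similar to the ImageNet dataset, we can freeze the whole VGG16 network and retrain only the custom network. In this way, we also minimize the risk of overfitting for the juxtaposed network:

# Freeze all lower layers
for layer in base_model.layers:
    layer.trainable = False

If the new dataset is large and similar to the ImageNet dataset, we can retrain the whole juxtaposed network. We still keep the precomputed weights as a starting point and perform a few iterations of fine-tuning:

# Unfreeze all lower layers
for layer in model.layers:
    layer.trainable = True

If the new dataset is very different from the ImageNet dataset, in practice it might still be good to initialize with the weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune the whole network. More information can be found online.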

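Creating a DeepDream network

In 2014 Google trained a neural network for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and made it open source in July 2015. The original algorithm is presented in Going Deeper with Convolutions. The network learned a representation of each image. The lower layers learned low-level features such as lines and edges, while the higher layers learned more sophisticated patterns such as eyes, noses, mouths, and so on. Therefore, if we try to represent a higher level in the network, we see a mix of different features extracted from the original ImageNet, such as the eye of a bird and the mouth of a dog. With this in mind, if we take a new image and try to maximize its similarity with an upper layer of the network, the result is a new visionary image. In this visionary image, some of the patterns learned by the higher layers are dreamt (that is, imagined) into the original image. Here is an example of such a visionary image:

An example of Google Deep Dreams

Getting ready

Download the pretrained Inception model from online.

How to do it...

We proceed with the recipe as follows:

Import numpy for numerical computation, functools to define partial functions with one or more arguments already filled in, Pillow for image manipulation, and matplotlib to render images: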

import numpy as np
from functools import partial
import PIL.Image
import tensorflow as tf
import matplotlib.pyplot as plt

Set up the path for the content image and the pretrained model. Start with a seed image that is just random noise:

content_image = 'data/gulli.jpg' 
# start with a gray image with a little noise 
img_noise = np.random.uniform(size=(224,224,3)) + 100.0 
model_fn = 'data/tensorflow_inception_graph.pb'

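Load the Inception network downloaded from the internet into a graph. Initialize a TensorFlow session, load the graph with FastGFile(..), and parse the graph with ParseFromString(..). After that, create an input as a placeholder with the placeholder(..) method. imagenet_mean is a precomputed constant that is removed from our content image to normalize the data. In fact, this is the mean value observed during training, and the normalization allows faster convergence. The value is subtracted from the input and stored in a t_preprocessed variable, which is then used when importing the graph definition: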

# load the graph
graph = tf.Graph()
sess = tf.InteractiveSession(graph=graph)
with tf.gfile.FastGFile(model_fn, 'rb') as f:
       graph_def = tf.GraphDef()
       graph_def.ParseFromString(f.read())
t_input = tf.placeholder(np.float32, name='input') # define the input tensor
imagenet_mean = 117.0
t_preprocessed = tf.expand_dims(t_input-imagenet_mean, 0)
tf.import_graph_def(graph_def, {'input':t_preprocessed})

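Define some util functions to visualize the image and to transform a TF-graph generating function into a regular Python function (see the resize example that follows):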

# helper
#pylint: disable=unused-variable
def showarray(a):
   a = np.uint8(np.clip(a, 0, 1)*255)
   plt.imshow(a)
   plt.show()   
def visstd(a, s=0.1):
   '''Normalize the image range for visualization'''
   return (a-a.mean())/max(a.std(), 1e-4)*s + 0.5   

def T(layer):
   '''Helper for getting layer output tensor'''
   return graph.get_tensor_by_name("import/%s:0"%layer)   

def tffunc(*argtypes):
   '''Helper that transforms TF-graph generating function into a regular one.
   See "resize" function below.
   '''
   placeholders = list(map(tf.placeholder, argtypes))
   def wrap(f):
       out = f(*placeholders)
       def wrapper(*args, **kw):
           return out.eval(dict(zip(placeholders, args)), session=kw.get('session'))
       return wrapper
   return wrap   

def resize(img, size):
   img = tf.expand_dims(img, 0)
   return tf.image.resize_bilinear(img, size)[0,:,:,:]
resize = tffunc(np.float32, np.int32)(resize)

计算图像上的梯度上升。为了提高效率，请应用分块计算，其中在不同分块上计算单独的梯度上升。将随机移位应用于图像，以在多次迭代中模糊图块边界：

def calc_grad_tiled(img, t_grad, tile_size=512):
   '''Compute the value of tensor t_grad over the image in a tiled way.
   Random shifts are applied to the image to blur tile boundaries over
   multiple iterations.'''
   sz = tile_size
   h, w = img.shape[:2]
   sx, sy = np.random.randint(sz, size=2)
   img_shift = np.roll(np.roll(img, sx, 1), sy, 0)
   grad = np.zeros_like(img)
   for y in range(0, max(h-sz//2, sz),sz):
       for x in range(0, max(w-sz//2, sz),sz):
           sub = img_shift[y:y+sz,x:x+sz]
           g = sess.run(t_grad, {t_input:sub})
           grad[y:y+sz,x:x+sz] = g

   return np.roll(np.roll(grad, -sx, 1), -sy, 0)
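To see why the random shift matters, it helps to look at which pixels a single pass of calc_grad_tiled actually visits. The following is a minimal sketch in plain Python, not part of the recipe: tile_origins copies the range(...) expression used above, and covered_pixels is a hypothetical helper that counts how many positions along one axis fall inside some tile.

```python
def tile_origins(length, tile_size=512):
    """Start offsets of the tiles along one image axis.

    Mirrors the range(...) expression used in calc_grad_tiled.
    """
    return list(range(0, max(length - tile_size // 2, tile_size), tile_size))


def covered_pixels(length, tile_size=512):
    """Number of positions along one axis that fall inside some tile."""
    covered = set()
    for start in tile_origins(length, tile_size):
        covered.update(range(start, min(start + tile_size, length)))
    return len(covered)


if __name__ == "__main__":
    for side in (224, 1200):
        print(side, tile_origins(side), covered_pixels(side))
```

For an axis of 1,200 pixels, only two tiles are used, so a strip at the end receives no gradient in that pass. Because the image is rolled by a different random offset on every iteration, that strip lands on different pixels each time.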

Define the optimization objective as the mean of the chosen layer (computed with tf.reduce_mean), which gradient ascent will maximize. The tf.gradients function gives the symbolic gradient of that objective with respect to the input tensor. For efficiency, the image is split into a number of octaves, which are resized and stored in an array of octaves. Then, for each octave, we use the calc_grad_tiled function:

def render_deepdream(t_obj, img0=img_noise,
                     iter_n=10, step=1.5, octave_n=4, octave_scale=1.4):
   t_score = tf.reduce_mean(t_obj) # defining the optimization objective
   t_grad = tf.gradients(t_score, t_input)[0] # behold the power of automatic differentiation!

   # split the image into a number of octaves
   img = img0
   octaves = []
   for _ in range(octave_n-1):
       hw = img.shape[:2]
       lo = resize(img, np.int32(np.float32(hw)/octave_scale))
       hi = img-resize(lo, hw)
       img = lo
       octaves.append(hi)

   # generate details octave by octave
   for octave in range(octave_n):
       if octave>0:
           hi = octaves[-octave]
           img = resize(img, hi.shape[:2])+hi
       for _ in range(iter_n):
           g = calc_grad_tiled(img, t_grad)
           img += g*(step / (np.abs(g).mean()+1e-7))
       # show the result of each octave (usually 3 or 4) via matplotlib
       showarray(img/255.0)
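The octave loop is easier to follow with concrete numbers. As a rough sketch, the hypothetical helper octave_sizes below reproduces only the size arithmetic of the first loop for one image side; it assumes that np.int32(...) truncates toward zero and ignores any float32 rounding differences.

```python
def octave_sizes(side, octave_n=4, octave_scale=1.4):
    """Side lengths from the full-resolution image down to the coarsest octave."""
    sizes = [side]
    for _ in range(octave_n - 1):
        side = int(side / octave_scale)  # truncates, like np.int32(...) in the recipe
        sizes.append(side)
    return sizes


if __name__ == "__main__":
    print(octave_sizes(100))
```

Rendering then runs in the opposite direction: gradient ascent starts on the smallest image, and before each later octave the image is upscaled and the stored detail (hi) is added back.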

加载特定的内容图像并开始做梦。在此示例中，作者的面孔已转变为类似于狼的事物：

DeepDream 转换的示例。其中一位作家变成了狼

工作原理

神经网络存储训练图像的抽象：较低的层存储诸如线条和边缘之类的特征，而较高的层则存储诸如眼睛，面部和鼻子之类的更复杂的图像特征。通过应用梯度上升过程，我们最大化了loss函数，并有助于发现类似于高层存储的图案的内容图像。这导致了网络看到虚幻图像的梦想。

许多网站都允许您直接玩 DeepDream。我特别喜欢DeepArt.io，它允许您上传内容图像和风格图像并在云上进行学习。

另见

在 2015 年发布初步结果之后，还发布了许多有关 DeepDream 的新论文和博客文章：

DeepDream: A code example to visualize Neural Networks--https://research.googleblog.com/2015/07/deepdream-code-example-for-visualizing.html

When Robots Hallucinate, LaFrance, Adrienne--https://www.theatlantic.com/technology/archive/2015/09/robots-hallucinate-dream/403498/

此外，了解如何可视化预训练网络的每一层并更好地了解网络如何记忆较低层的基本特征以及较高层的较复杂特征可能会很有趣。在线提供有关此主题的有趣博客文章：

卷积神经网络如何看待世界

五、高级卷积神经网络

在本章中，我们将讨论如何将卷积神经网络（CNN）用于除图像以外的领域中的深度学习。我们的注意力将首先集中在文本分析和自然语言处理（NLP）上。在本章中，我们将介绍一些用于以下方面的方法：

创建卷积网络进行情感分析
检查 VGG 预建网络学习了哪些过滤器
使用 VGGNet，ResNet，Inception 和 Xception 对图像进行分类
复用预先构建的深度学习模型来提取特征
用于迁移学习的非常深的 Inception-v3 网络
使用膨胀的 ConvNets，WaveNet 和 NSynth 生成音乐
回答有关图像的问题（可视化问答）
使用预训练网络通过六种不同方式来分类视频

介绍

在上一章中，我们了解了如何将 ConvNets 应用于图像。在本章中，我们将类似的思想应用于文本。

文本和图像有什么共同点？乍一看，很少。但是，如果我们将句子或文档表示为矩阵，则此矩阵与每个单元都是像素的图像矩阵没有区别。因此，下一个问题是，我们如何将文本表示为矩阵？好吧，这很简单：矩阵的每一行都是一个向量，代表文本的基本单位。当然，现在我们需要定义什么是基本单位。一个简单的选择就是说基本单位是一个字符。另一个选择是说基本单位是一个单词，另一个选择是将相似的单词聚合在一起，然后用代表符号表示每个聚合（有时称为簇或嵌入）。

请注意，无论我们的基本单位采用哪种具体选择，我们都需要从基本单位到整数 ID 的 1：1 映射，以便可以将文本视为矩阵。例如，如果我们有一个包含 10 行文本的文档，并且每行都是 100 维嵌入，那么我们将用10 x 100的矩阵表示文本。在这个非常特殊的图像中，如果该句子x包含位置y表示的嵌入，则打开像素。您可能还会注意到，文本实际上不是矩阵，而是向量，因为位于文本相邻行中的两个单词几乎没有共同点。确实，与图像的主要区别在于，相邻列中的两个像素最有可能具有某种相关性。

现在您可能会想：我知道您将文本表示为向量，但是这样做会使我们失去单词的位置，而这个位置应该很重要，不是吗？

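Well, it turns out that in many real applications, knowing whether a sentence contains a particular basic unit (a character, a word, or an aggregate) is very accurate information, even if we do not remember exactly where in the sentence that basic unit is located.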

创建用于情感分析的卷积网络

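In this recipe, we will use TFLearn to create a CNN-based deep learning network for sentiment analysis. As discussed in the previous section, our CNN will be one-dimensional. We will use the IMDb dataset of movie reviews; with the split used below, 22,500 reviews are used for training and 2,500 for testing (the sizes of 45,000 and 5,000 printed later are those of the one-hot label arrays, which hold two values per review).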

准备

TFLearn 具有用于自动从网络下载数据集并促进卷积网络创建的库，因此让我们直接看一下代码。

操作步骤

我们按以下步骤进行：

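Import TensorFlow, tflearn, and the modules needed to build the network. Then import the IMDb dataset loader and the utilities for one-hot encoding and padding: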

import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb

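Load the dataset, pad each sentence with 0 up to a maximum length of 100, and one-hot encode the labels into two values, corresponding to the true and false classes. Note that the parameter n_words is the number of words to keep in the vocabulary; all other words are set to unknown. Also note that trainX and trainY are sparse vectors, because each review most likely contains only a subset of the whole vocabulary: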

# IMDb Dataset loading
train, test, _ = imdb.load_data(path='imdb.pkl', n_words=10000,
valid_portion=0.1)
trainX, trainY = train
testX, testY = test
#pad the sequence
trainX = pad_sequences(trainX, maxlen=100, value=0.)
testX = pad_sequences(testX, maxlen=100, value=0.)
#one-hot encoding
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
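For readers who want to see what padding and one-hot encoding do to the data, here is a minimal plain-Python sketch. The helpers pad_post and one_hot are hypothetical stand-ins, not the TFLearn functions themselves, and the sketch assumes that padding and truncation happen at the end of the sequence; the library's behaviour depends on its own arguments.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate or right-pad a list of word IDs to exactly maxlen items."""
    clipped = list(sequence[:maxlen])
    return clipped + [value] * (maxlen - len(clipped))


def one_hot(label, nb_classes):
    """Return a list with a single 1.0 at index `label`."""
    return [1.0 if i == label else 0.0 for i in range(nb_classes)]


if __name__ == "__main__":
    print(pad_post([5, 8, 2], 5))
    print(one_hot(1, 2))
```

After this step every review has the same length, so a batch of reviews can be stored as a rectangular array, and each label becomes a two-element vector that matches the two-unit softmax output used later.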

Print a few dimensions to inspect the data just processed and to understand the size of the problem:

print ("size trainX", trainX.size)
print ("size testX", testX.size)
print ("size testY:", testY.size)
print ("size trainY", trainY.size)

The output is as follows:

size trainX 2250000
size testX 250000
size testY: 5000
size trainY 45000

为数据集中包含的文本构建嵌入。就目前而言，将此步骤视为一个黑盒子，该黑盒子接受这些单词并将它们映射到聚合（群集）中，以便相似的单词可能出现在同一群集中。请注意，先前步骤的词汇是离散且稀疏的。通过嵌入，我们将创建一个映射，该映射会将每个单词嵌入到连续的密集向量空间中。使用此向量空间表示将为我们提供词汇表的连续，分布式表示。当我们谈论 RNN 时，将详细讨论如何构建嵌入：

# Build an embedding
network = input_data(shape=[None, 100], name='input')
network = tflearn.embedding(network, input_dim=10000, output_dim=128)

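Build a suitable convnet. We have three convolutional layers. Because we are dealing with text, we use one-dimensional convolutions, and the three layers act in parallel. Each layer takes the output of the embedding (vectors of size 128) and applies 128 filters, with a filter width of 3, 4, and 5 respectively, valid padding, ReLU activation, and an L2 regularizer. The outputs of the three layers are then concatenated with a merge operation. After this comes a max pooling layer, followed by dropout with a probability of 50%. The final layer is a fully connected layer with softmax activation: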

#Build the convnet
branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
network = merge([branch1, branch2, branch3], mode='concat', axis=1)
network = tf.expand_dims(network, 2)
network = global_max_pool(network)
network = dropout(network, 0.5)
network = fully_connected(network, 2, activation='softmax')
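The shapes in this block can be checked by hand. As a small sketch (valid_conv_length is a hypothetical helper, not a TFLearn function), the usual formula for the output length of a convolution with valid padding gives the length of each branch for the 100-step padded input, and the concatenation along axis=1 adds them up:

```python
def valid_conv_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1


# One branch per filter width used in the recipe.
branch_lengths = [valid_conv_length(100, k) for k in (3, 4, 5)]
# merge(..., mode='concat', axis=1) stacks the branches along the time axis.
merged_length = sum(branch_lengths)

if __name__ == "__main__":
    print(branch_lengths, merged_length)
```

Each position in the merged tensor still carries 128 filter responses, and the global max pooling step keeps the largest response per filter, so the dropout and softmax layers see 128 features per review.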

The learning phase uses the Adam optimizer with categorical_crossentropy as the loss function:

network = regression(network, optimizer='adam', learning_rate=0.001,
loss='categorical_crossentropy', name='target')

We then run the training with batch_size = 32 and observe the accuracy reached on the training and validation sets. As you can see, we obtain an accuracy of about 79% on the validation set when predicting the sentiment expressed in movie reviews:

# Training
model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit(trainX, trainY, n_epoch = 5, shuffle=True, validation_set=(testX, testY), show_metric=True, batch_size=32)

The last lines of the training log are as follows:

Training Step: 3519 | total loss: 0.09738 | time: 85.043s
| Adam | epoch: 005 | loss: 0.09738 - acc: 0.9747 -- iter: 22496/22500
Training Step: 3520 | total loss: 0.09733 | time: 86.652s
| Adam | epoch: 005 | loss: 0.09733 - acc: 0.9741 | val_loss: 0.58740 - val_acc: 0.7944 -- iter: 22500/22500
--

工作原理

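This recipe follows Convolutional Neural Networks for Sentence Classification, Yoon Kim, EMNLP 2014. Note that the model proposed in the paper retains some information about position, because the filter windows operate over consecutive words. The following image, extracted from the paper, graphically represents the key intuition behind the network. At the beginning, the text is represented as vectors based on standard embeddings, which gives a compact representation in a dense space. The resulting matrix is then processed with multiple standard one-dimensional convolutional layers.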

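Note that the model uses multiple filters (with varying window sizes) to obtain multiple features. A max-pooling operation follows, where the idea is to capture the most important feature, the one with the highest value, for each feature map. For regularization, the paper proposes dropout on the penultimate layer with a constraint on the L2 norms of the weight vectors. The final layer outputs the sentiment as positive or negative.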
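The combination of several window sizes with max pooling over time can be illustrated on a toy sequence. This is only a sketch under simplifying assumptions: the input is a list of scalars rather than embedding vectors, the kernels are all-ones windows chosen for readability, and conv1d_valid and max_over_time are hypothetical helper names.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` over `sequence` and return one response per window."""
    width = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]


def max_over_time(feature_map):
    """Keep only the strongest response of a feature map."""
    return max(feature_map)


signal = [0, 1, 3, 1, 0, 0, 2]
kernels = [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1]]  # window sizes 3, 4, 5
features = [max_over_time(conv1d_valid(signal, k)) for k in kernels]

if __name__ == "__main__":
    print(features)
```

Whatever the length of the input, every filter contributes exactly one number after pooling, which is why sentences of different lengths can feed the same final classifier.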

为了更好地理解该模型，有以下几点观察：

过滤器通常在连续空间上卷积。对于图像，此空间是像素矩阵表示形式，在高度和宽度上在空间上是连续的。对于文本而言，连续空间无非是连续单词自然产生的连续尺寸。如果仅使用单次编码表示的单词，则空间稀疏；如果使用嵌入，则由于聚集了相似的单词，因此生成的空间密集。
图像通常具有三个通道（RGB），而文本自然只有一个通道，因为我们无需表示颜色。

论文《用于句子分类的卷积神经网络》（Yoon Kim，EMNLP 2014）进行了广泛的实验。尽管对超参数的调整很少，但具有一层卷积的简单 CNN 在句子分类方面的表现却非常出色。该论文表明，采用一组静态嵌入（将在我们谈论 RNN 时进行讨论），并在其之上构建一个非常简单的卷积网络，实际上可以显着提高情感分析的表现：

如图所示的模型架构示例

使用 CNN 进行文本分析是一个活跃的研究领域。我建议看看以下文章：

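Text Understanding from Scratch (Xiang Zhang, Yann LeCun). This article demonstrates that deep learning with CNNs can be applied to text understanding from character-level inputs all the way up to abstract text concepts. The authors apply CNNs to various large-scale datasets, including ontology classification, sentiment analysis, and text categorization, and show that they can achieve astonishing performance without knowledge of words, phrases, sentences, or any other syntactic or semantic structure of a human language. The models work for both English and Chinese.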

检查 VGG 预建网络了解了哪些过滤器

在本秘籍中，我们将使用 keras-vis，这是一个外部 Keras 包，用于直观检查预建的 VGG16 网络从中学到了什么不同的过滤器。这个想法是选择一个特定的 ImageNet 类别，并了解 VGG16 网络如何学会代表它。

准备

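The first step is to select one of the categories used to train VGG16 on ImageNet. Let's say we take category 20, which corresponds to the American Dipper bird shown in the following picture:

An example of an American Dipper

The ImageNet mapping can be found online as a Python pickle dictionary, in which the ImageNet 1,000 class IDs are mapped to human-readable labels.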

操作步骤

我们按以下步骤进行：

导入 matplotlib 和 keras-vis 使用的模块。此外，还导入预构建的 VGG16 模块。 Keras 使处理此预建网络变得容易：

from matplotlib import pyplot as plt
from vis.utils import utils
from vis.utils.vggnet import VGG16
from vis.visualization import visualize_class_activation

通过使用 Keras 中包含的并经过 ImageNet 权重训练的预构建层来访问 VGG16 网络：

# Build the VGG16 network with ImageNet weights
model = VGG16(weights='imagenet', include_top=True)
model.summary()
print('Model loaded.')

这就是 VGG16 网络在内部的外观。我们有许多卷积网络，与 2D 最大池化交替使用。然后，我们有一个展开层，然后是三个密集层。最后一个称为预测，并且这一层应该能够检测到高级特征，例如人脸或我们的鸟类形状。请注意，顶层已明确包含在我们的网络中，因为我们想可视化它学到的知识：

_________________________________________________________________
 Layer (type) Output Shape Param #
 =================================================================
 input_2 (InputLayer) (None, 224, 224, 3) 0
 _________________________________________________________________
 block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
 _________________________________________________________________
 block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
 _________________________________________________________________
 block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
 _________________________________________________________________
 block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
 _________________________________________________________________
 block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
 _________________________________________________________________
 block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
 _________________________________________________________________
 block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
 _________________________________________________________________
 block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
 _________________________________________________________________
 block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
 _________________________________________________________________
 block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
 _________________________________________________________________
 block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
 _________________________________________________________________
 block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
 _________________________________________________________________
 block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
 _________________________________________________________________
 block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
 _________________________________________________________________
 block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
 _________________________________________________________________
 flatten (Flatten) (None, 25088) 0
 _________________________________________________________________
 fc1 (Dense) (None, 4096) 102764544
 _________________________________________________________________
 fc2 (Dense) (None, 4096) 16781312
 _________________________________________________________________
 predictions (Dense) (None, 1000) 4097000
 =================================================================
 Total params: 138,357,544
 Trainable params: 138,357,544
 Non-trainable params: 0
 _________________________________________________________________
 Model loaded.
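The Param # column of this summary can be reproduced by hand, which is a useful check on how the layers are wired. A minimal sketch, assuming 3 x 3 kernels and one bias per output unit (conv2d_params and dense_params are hypothetical helper names):

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a k x k convolution."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per unit for a fully connected layer."""
    return (in_features + 1) * out_features


if __name__ == "__main__":
    print(conv2d_params(3, 3, 64))     # block1_conv1
    print(conv2d_params(3, 64, 64))    # block1_conv2
    print(dense_params(25088, 4096))   # fc1
    print(dense_params(4096, 1000))    # predictions
```

The count for fc1 shows where most of the 138 million parameters sit: the first fully connected layer alone accounts for about 103 million of them.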

从外观上看，网络可以如下图所示：

VGG16 网络

现在，让我们着重于通过关注 American Dipper（ID 20）来检查最后一个预测层的内部外观：

layer_name = 'predictions'
layer_idx = [idx for idx, layer in enumerate(model.layers) if layer.name == layer_name][0]
# Generate three different images of the same output index.
vis_images = []
for idx in [20, 20, 20]:
    img = visualize_class_activation(model, layer_idx, filter_indices=idx, max_iter=500)
    img = utils.draw_text(img, str(idx))
    vis_images.append(img)

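Let's display the images generated for this layer and feature, and observe how the network internally sees the concept of the American Dipper bird: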

因此，这就是神经网络在内部代表鸟类的方式。这是一种令人毛骨悚然的形象，但我发誓没有为网络本身提供任何特定种类的人造药物！这正是这种特殊的人工网络自然学到的东西。

您是否仍然想了解更多？好吧，让我们选择一个较早的层，并代表网络如何在内部看到相同的American Dipper训练类别：

layer_name = 'block3_conv1'
layer_idx = [idx for idx, layer in enumerate(model.layers) if layer.name == layer_name][0]
vis_images = []
for idx in [20, 20, 20]:
    img = visualize_class_activation(model, layer_idx, filter_indices=idx, max_iter=500)
    img = utils.draw_text(img, str(idx))
    vis_images.append(img)
stitched = utils.stitch_images(vis_images)
plt.axis('off')
plt.imshow(stitched)
plt.title(layer_name)
plt.show()

以下是上述代码的输出：

不出所料，该特定层正在学习非常基本的特征，例如曲线。但是，卷积网络的真正力量在于，随着我们对模型的深入研究，网络会推断出越来越复杂的特征。

工作原理

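The key idea of keras-vis visualization for dense layers is to generate an input image that maximizes the final dense layer output corresponding to the bird class. So what the module actually does is invert the problem: given a specific trained dense layer with its weights, a new synthetic image is generated that best fits the layer itself.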
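The idea of optimizing the input instead of the weights can be shown with a toy unit. The sketch below is not how keras-vis is implemented; it assumes a single linear unit with fixed weights and a hypothetical L2 penalty (l2) that keeps the input bounded, and runs plain gradient ascent on the input x.

```python
def maximize_activation(weights, steps=200, learning_rate=0.1, l2=0.5):
    """Gradient ascent on the input x of a toy unit.

    The objective is score(x) = w . x - l2 * |x|^2; the weights stay fixed.
    """
    x = [0.0] * len(weights)
    for _ in range(steps):
        # d(score)/dx_i = w_i - 2 * l2 * x_i
        grad = [w - 2.0 * l2 * xi for w, xi in zip(weights, x)]
        x = [xi + learning_rate * g for xi, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    print(maximize_activation([1.0, -2.0, 0.5]))
```

With these settings the input converges to w / (2 * l2), that is, to a pattern aligned with the weights. In the recipe, the input is an image and the score is a class output or filter response of VGG16, but the loop has the same shape.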

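A similar idea is used for each conv filter. In this case, note that the first ConvNet layer can be interpreted by simply visualizing its weights, because it operates on raw pixels. Subsequent conv filters operate on the outputs of earlier conv filters, so visualizing them directly is not necessarily very interpretable. However, if we consider each layer independently, we can focus on generating only a synthetic input image that maximizes the filter output.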

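The keras-vis repository on GitHub has a good set of visualization examples showing how to inspect a network internally, including the recent idea of saliency maps, whose goal is to detect which part of an image contributed the most to a specific class (for example, tiger) when the image also contains other elements (for example, grass). The seminal article is Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (Karen Simonyan, Andrea Vedaldi, Andrew Zisserman). An example extracted from the Git repository is reported below, in which the network works out by itself which part of an image labeled as tiger is the most salient: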

显着性映射的示例

将 VGGNet，ResNet，Inception 和 Xception 用于图像分类

图像分类是典型的深度学习应用。由于 ImageNet 图像数据库，该任务的兴趣有了最初的增长。它按照 WordNet 层次结构（目前仅是名词）来组织，其中每个节点都由成百上千的图像描绘。更准确地说，ImageNet 旨在将图像标记和分类为将近 22,000 个单独的对象类别。在深度学习的背景下，ImageNet 通常指的是 ImageNet 大规模视觉识别挑战，或简称 ILSVRC 中包含的工作。在这种情况下，目标是训练一个模型，该模型可以将输入图像分类为 1,000 个单独的对象类别。在此秘籍中，我们将使用超过 120 万个训练图像，50,000 个验证图像和 100,000 个测试图像的预训练模型。

VGG16 和 VGG19

在《用于大型图像识别的超深度卷积网络》（Karen Simonyan，Andrew Zisserman，2014 年）中，引入了 VGG16 和 VGG19。该网络使用3×3卷积层堆叠并与最大池交替，两个 4096 个全连接层，然后是 softmax 分类器。 16 和 19 代表网络中权重层的数量（列 D 和 E）：

一个非常深的网络配置示例

在 2015 年，拥有 16 或 19 层就足以考虑网络的深度，而今天（2017 年）我们达到了数百层。请注意，VGG 网络的训练速度非常慢，并且由于末端的深度和完全连接的层数，它们需要较大的权重空间。

ResNet

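ResNet was introduced in Deep Residual Learning for Image Recognition (Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2015). This network is very deep and can be trained with standard stochastic gradient descent by using a standard network component called the residual module, which is then used to compose more complex networks (a composition known as network in network).

Compared to VGG, ResNet is deeper, but the size of the model is smaller because a global average pooling operation is used instead of fully connected dense layers.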

Inception

在《重新思考计算机视觉的初始架构》（Christian Szegedy，Vincent Vanhoucke，Sergey Ioffe，Jonathon Shlens，Zbigniew Wojna，2015 年）中引入了 Inception 。关键思想是在同一模块中具有多种大小的卷积作为特征提取并计算1×1、3×3和5×5卷积。这些滤波器的输出然后沿着通道尺寸堆叠，并发送到网络的下一层。下图对此进行了描述：

在“重新思考计算机视觉的 Inception 架构”中描述了 Inception-v3，而在《Inception-v4，Inception-ResNet 和残余连接对学习的影响》（Szegedy，Sergey Ioffe，Vincent Vanhoucke，Alex Alemi，2016 年）中描述了 Inception-v4。

Xception

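Xception is an extension of Inception, introduced in Xception: Deep Learning with Depthwise Separable Convolutions (François Chollet, 2016). Xception uses a new concept called the depthwise separable convolution operation, which allows it to outperform Inception-v3 on a large image classification dataset comprising 350 million images and 17,000 classes. Since the Xception architecture has the same number of parameters as Inception-v3, the performance gains are not due to increased capacity but rather to a more efficient use of the model parameters.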
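To get a feeling for why a depthwise separable convolution uses its parameters more efficiently, one can count weights. The following sketch compares a standard convolution with a depthwise step (one k x k filter per input channel) followed by a pointwise 1 x 1 step; biases are ignored, and the layer sizes are illustrative values, not taken from Xception.

```python
def standard_conv_weights(kernel, in_channels, out_channels):
    """Weights of an ordinary k x k convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def separable_conv_weights(kernel, in_channels, out_channels):
    """Weights of a depthwise k x k step plus a pointwise 1 x 1 step."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


if __name__ == "__main__":
    print(standard_conv_weights(3, 128, 256))
    print(separable_conv_weights(3, 128, 256))
```

For this illustrative layer, the separable version needs roughly a ninth of the weights, which leaves room for more layers or wider layers within the same parameter budget.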

准备

此秘籍使用 Keras，因为该框架已预先完成了上述模块的实现。 Keras 首次使用时会自动下载每个网络的权重，并将这些权重存储在本地磁盘上。换句话说，您不需要重新训练网络，而是可以利用互联网上已经可用的训练。在您希望将网络分类为 1000 个预定义类别的假设下，这是正确的。在下一个秘籍中，我们将了解如何从这 1,000 个类别开始，并通过称为迁移学习的过程将它们扩展到自定义集合。

操作步骤

我们按以下步骤进行：

导入处理和显示图像所需的预建模型和其他模块：

from keras.applications import ResNet50
from keras.applications import InceptionV3
from keras.applications import Xception # TensorFlow ONLY
from keras.applications import VGG16
from keras.applications import VGG19
from keras.applications import imagenet_utils
from keras.applications.inception_v3 import preprocess_input
from keras.preprocessing.image import img_to_array
from keras.preprocessing.image import load_img
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
from PIL import Image
%matplotlib inline

定义用于记忆用于训练网络的图像大小的映射。这些是每个模型的众所周知的常数：

MODELS = {
"vgg16": (VGG16, (224, 224)),
"vgg19": (VGG19, (224, 224)),
"inception": (InceptionV3, (299, 299)),
"xception": (Xception, (299, 299)), # TensorFlow ONLY
"resnet": (ResNet50, (224, 224))
}

定义用于加载和转换每个图像的辅助函数。注意，预训练网络已在张量上训练，该张量的形状还包括batch_size的附加维度。因此，我们需要将此尺寸添加到图像中以实现兼容性：

def image_load_and_convert(image_path, model):
    pil_im = Image.open(image_path, 'r')
    imshow(np.asarray(pil_im))
    # initialize the input image shape
    # and the pre-processing function (this might need to be changed)
    inputShape = MODELS[model][1]
    preprocess = imagenet_utils.preprocess_input
    image = load_img(image_path, target_size=inputShape)
    image = img_to_array(image)
    # the original networks have been trained on an additional
    # dimension taking into account the batch size
    # we need to add this dimension for consistency
    # even if we have one image only
    image = np.expand_dims(image, axis=0)
    image = preprocess(image)
    return image

定义用于对图像进行分类的辅助函数，并在预测上循环，并显示 5 级预测以及概率：

def classify_image(image_path, model):
    img = image_load_and_convert(image_path, model)
    Network = MODELS[model][0]
    model = Network(weights="imagenet")
    preds = model.predict(img)
    P = imagenet_utils.decode_predictions(preds)
    # loop over the predictions and display the rank-5 predictions
    # along with probabilities
    for (i, (imagenetID, label, prob)) in enumerate(P[0]):
        print("{}. {}: {:.2f}%".format(i + 1, label, prob * 100))
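The ranking step can be sketched without Keras. The helpers below are hypothetical and only imitate the idea of turning a probability vector into a ranked list; the labels and probabilities are made-up illustration values, not model output.

```python
def top_k(probabilities, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def format_ranking(ranked):
    """Format pairs the same way as the print call in classify_image."""
    return [
        "{}. {}: {:.2f}%".format(i + 1, label, prob * 100)
        for i, (label, prob) in enumerate(ranked)
    ]


if __name__ == "__main__":
    demo_labels = ["macaw", "toucan", "lorikeet"]   # illustrative labels
    demo_probs = [0.7, 0.1, 0.2]                    # illustrative values
    for line in format_ranking(top_k(demo_probs, demo_labels, k=2)):
        print(line)
```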

Then start testing different types of pretrained networks:

classify_image("images/parrot.jpg", "vgg16")

接下来，您将看到具有相应概率的预测列表： 1.金刚鹦鹉：99.92% 2.美洲豹：0.03% 3.澳洲鹦鹉：0.02% 4.蜂食者：0.02% 5.巨嘴鸟：0.00%

金刚鹦鹉的一个例子

classify_image("images/parrot.jpg", "vgg19")

1.金刚鹦鹉：99.77% 2.鹦鹉：0.07% 3.巨嘴鸟：0.06% 4.犀鸟：0.05% 5.贾卡马尔：0.01%

classify_image("images/parrot.jpg", "resnet")

1.金刚鹦鹉：97.93% 2.孔雀：0.86% 3.鹦鹉：0.23% 4. j：0.12% 5.杰伊：0.12%

classify_image("images/parrot_cropped1.jpg", "resnet")

1.金刚鹦鹉：99.98% 2.鹦鹉：0.00% 3.孔雀：0.00% 4.硫凤头鹦鹉：0.00% 5.巨嘴鸟：0.00%

classify_image("images/incredible-hulk-180.jpg", "resnet")

1. comic_book：99.76% 2. book_jacket：0.19% 3.拼图游戏：0.05% 4.菜单：0.00% 5.数据包：0.00%

如中所示的漫画分类示例

classify_image("images/cropped_panda.jpg", "resnet")

1. giant panda: 99.04% 2. indri: 0.59% 3. lesser panda: 0.17% 4. gibbon: 0.07% 5. titi: 0.05%

classify_image("images/space-shuttle1.jpg", "resnet")

1.航天飞机：92.38% 2.三角恐龙：7.15% 3.战机：0.11% 4.牛仔帽：0.10% 5.草帽：0.04%

classify_image("images/space-shuttle2.jpg", "resnet")

1.航天飞机：99.96% 2.导弹：0.03% 3.弹丸：0.00% 4.蒸汽机车：0.00% 5.战机：0.00%

classify_image("images/space-shuttle3.jpg", "resnet")

1.航天飞机：93.21% 2.导弹：5.53% 3.弹丸：1.26% 4.清真寺：0.00% 5.信标：0.00%

classify_image("images/space-shuttle4.jpg", "resnet")

1.航天飞机：49.61% 2.城堡：8.17% 3.起重机：6.46% 4.导弹：4.62% 5.航空母舰：4.24%

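Note that some errors are possible. A likely cause of the two misclassifications below is that image_load_and_convert applies imagenet_utils.preprocess_input to every model, whereas InceptionV3 and Xception expect inputs scaled differently (see keras.applications.inception_v3.preprocess_input, which is imported above but never used). For instance: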

classify_image("images/parrot.jpg", "inception")

1.秒表：100.00% 2.貂皮：0.00% 3.锤子：0.00% 4.黑松鸡：0.00% 5.网站：0.00%

classify_image("images/parrot.jpg", "xception")

1.背包：56.69% 2.军装：29.79% 3.围兜：8.02% 4.钱包：2.14% 5.乒乓球：1.52%
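The two input conventions differ a lot, which may explain results like the ones above. The sketch below shows both on a single RGB pixel in plain Python. It reflects my understanding of the Keras conventions (BGR mean subtraction as the default of imagenet_utils.preprocess_input, and scaling to [-1, 1] for the Inception family); treat the function names and the exact constants as assumptions to verify against your Keras version.

```python
# Per-channel ImageNet means in BGR order (assumed values, see lead-in).
IMAGENET_MEAN_BGR = (103.939, 116.779, 123.68)


def caffe_style(rgb_pixel):
    """RGB -> BGR, then subtract the per-channel mean (no rescaling)."""
    r, g, b = rgb_pixel
    return (
        b - IMAGENET_MEAN_BGR[0],
        g - IMAGENET_MEAN_BGR[1],
        r - IMAGENET_MEAN_BGR[2],
    )


def inception_style(rgb_pixel):
    """Scale every channel from [0, 255] to [-1, 1]."""
    return tuple(c / 127.5 - 1.0 for c in rgb_pixel)


if __name__ == "__main__":
    pixel = (255.0, 127.5, 0.0)
    print(caffe_style(pixel))
    print(inception_style(pixel))
```

A network trained on values in [-1, 1] that receives values of the order of 100 is far outside its training range, so a confident but wrong prediction is not surprising. One way to test this explanation is to select the preprocessing function per model inside image_load_and_convert.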

定义一个辅助函数，用于显示每个预构建和预训练网络的内部架构：

def print_model(model):
    print ("Model:",model)
    Network = MODELS[model][0]
    model = Network(weights="imagenet")
    model.summary()

print_model('vgg19')

('Model:', 'vgg19')
 _________________________________________________________________
 Layer (type) Output Shape Param #
 =================================================================
 input_14 (InputLayer) (None, 224, 224, 3) 0
 _________________________________________________________________
 block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
 _________________________________________________________________
 block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
 _________________________________________________________________
 block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
 _________________________________________________________________
 block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
 _________________________________________________________________
 block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
 _________________________________________________________________
 block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
 _________________________________________________________________
 block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
 _________________________________________________________________
 block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
 _________________________________________________________________
 block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
 _________________________________________________________________
 block3_conv4 (Conv2D) (None, 56, 56, 256) 590080
 _________________________________________________________________
 block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
 _________________________________________________________________
 block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
 _________________________________________________________________
 block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
 _________________________________________________________________
 block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
 _________________________________________________________________
 block4_conv4 (Conv2D) (None, 28, 28, 512) 2359808
 _________________________________________________________________
 block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
 _________________________________________________________________
 block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_conv4 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
 _________________________________________________________________
 flatten (Flatten) (None, 25088) 0
 _________________________________________________________________
 fc1 (Dense) (None, 4096) 102764544
 _________________________________________________________________
 fc2 (Dense) (None, 4096) 16781312
 _________________________________________________________________
 predictions (Dense) (None, 1000) 4097000
 =================================================================
 Total params: 143,667,240
 Trainable params: 143,667,240
 Non-trainable params: 0

工作原理

我们使用了 Keras 应用，预训练的 Keras 学习模型，该模型随预训练的权重一起提供。这些模型可用于预测，特征提取和微调。在这种情况下，我们将模型用于预测。我们将在下一个秘籍中看到如何使用模型进行微调，以及如何在最初训练模型时最初不可用的数据集上构建自定义分类器。

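As of July 2017, Inception-v4 is not directly available in Keras, but it can be downloaded as a separate module online. Once installed, the module automatically downloads the weights the first time it is used.

AlexNet was one of the first stacked deep networks. It contains only eight layers: the first five are convolutional, followed by fully connected layers. The network was proposed in 2012 and significantly outperformed the runner-up (a top-5 error rate of 16%, compared with 26% for the second-place entry).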

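Recent research on deep neural networks has focused primarily on improving accuracy. For an equivalent level of accuracy, smaller DNN architectures offer at least three advantages:

Smaller CNNs require less communication across servers during distributed training.
Smaller CNNs require less bandwidth to export a new model from the cloud to the place where the model is served.
Smaller CNNs are more feasible to deploy on FPGAs and other hardware with limited memory.

To provide all of these advantages, SqueezeNet was proposed in the paper SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques, SqueezeNet can be compressed to less than 0.5 MB (510x smaller than AlexNet). A Keras implementation of SqueezeNet is available online as a separate module.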

复用预建的深度学习模型来提取特征

In this recipe, we will see how to use deep learning to extract relevant features.

准备

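One very simple idea is to use VGG16, and DCNNs in general, for feature extraction. This code implements the idea by extracting features from a specific layer.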

操作步骤

我们按以下步骤进行：

导入处理和显示图像所需的预建模型和其他模块：

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np

从网络中选择一个特定的层，并获得作为输出生成的特征：

# pre-built and pre-trained deep learning VGG16 model
base_model = VGG16(weights='imagenet', include_top=True)
for i, layer in enumerate(base_model.layers):
    print (i, layer.name, layer.output_shape)
# extract features from block4_pool block
# (Keras 2 names these arguments inputs/outputs)
model = Model(inputs=base_model.input, outputs=base_model.get_layer('block4_pool').output)

提取给定图像的特征，如以下代码片段所示：

img_path = 'cat.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
# get the features from this block
features = model.predict(x)

工作原理

现在，您可能想知道为什么我们要从 CNN 的中间层提取特征。关键的直觉是：随着网络学会将图像分类，各层学会识别进行最终分类所必需的特征。

较低的层标识较低阶的特征（例如颜色和边缘），较高的层将这些较低阶的特征组合为较高阶的特征（例如形状或对象）。因此，中间层具有从图像中提取重要特征的能力，并且这些特征更有可能有助于不同种类的分类。

这具有多个优点。首先，我们可以依靠公开提供的大规模训练，并将这种学习迁移到新颖的领域。其次，我们可以节省昂贵的大型训练时间。第三，即使我们没有针对该领域的大量训练示例，我们也可以提供合理的解决方案。对于手头的任务，我们也有一个很好的起始网络形状，而不是猜测它。

用于迁移学习的非常深的 InceptionV3 网络

迁移学习是一种非常强大的深度学习技术，在不同领域中有更多应用。直觉非常简单，可以用类推来解释。假设您想学习一种新的语言，例如西班牙语，那么从另一种语言（例如英语）已经知道的内容开始可能会很有用。

按照这种思路，计算机视觉研究人员现在通常使用经过预训练的 CNN 来生成新任务的表示形式，其中数据集可能不足以从头训练整个 CNN。另一个常见的策略是采用经过预先训练的 ImageNet 网络，然后将整个网络微调到新颖的任务。

InceptionV3 Net 是 Google 开发的非常深入的卷积网络。 Keras 实现了整个网络，如下图所示，并且已在 ImageNet 上进行了预训练。该模型的默认输入大小在三个通道上为299x299：

ImageNet v3 的示例

准备

此框架示例受到 Keras 网站上在线提供的方案的启发。我们假设在与 ImageNet 不同的域中具有训练数据集 D。 D 在输入中具有 1,024 个特征，在输出中具有 200 个类别。

操作步骤

我们可以按照以下步骤进行操作：

导入处理所需的预建模型和其他模块：

from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing import image
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K
# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)

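We use a trained Inception-v3, but we do not include the top of the model because we want to fine-tune on D. The new top consists of a dense layer with 1,024 units followed by a final softmax dense layer with 200 output classes. x = GlobalAveragePooling2D()(x) converts the input to the correct shape for the dense layer to handle. In fact, the base_model.output tensor has the shape (samples, channels, rows, cols) for dim_ordering="th" or (samples, rows, cols, channels) for dim_ordering="tf", but the dense layer needs (samples, channels); GlobalAveragePooling2D averages across (rows, cols) to produce that shape. So, if you look at the last four layers (with include_top=True), you see these shapes: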

# layer.name, layer.input_shape, layer.output_shape
('mixed10', [(None, 8, 8, 320), (None, 8, 8, 768), (None, 8, 8, 768), (None, 8, 8, 192)], (None, 8, 8, 2048))
('avg_pool', (None, 8, 8, 2048), (None, 1, 1, 2048))
('flatten', (None, 1, 1, 2048), (None, 2048))
('predictions', (None, 2048), (None, 1000))

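When include_top=False, the last three layers are removed and the mixed10 layer is exposed, so the GlobalAveragePooling2D layer converts (None, 8, 8, 2048) to (None, 2048), where each element of the (None, 2048) tensor is the average value of the corresponding (8, 8) subtensor in the (None, 8, 8, 2048) tensor: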

# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer as first layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer with 200 classes as last layer
predictions = Dense(200, activation='softmax')(x)
# model to train (Keras 2 names these arguments inputs/outputs)
model = Model(inputs=base_model.input, outputs=predictions)
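What global average pooling computes can be shown on a tiny feature map. This plain-Python sketch uses nested lists in (rows, cols, channels) order for one sample; global_average_pool is a hypothetical helper, not the Keras layer.

```python
def global_average_pool(feature_map):
    """feature_map[row][col][channel] -> one mean value per channel."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[r][c][ch] for r in range(rows) for c in range(cols))
        / float(rows * cols)
        for ch in range(channels)
    ]


if __name__ == "__main__":
    # A 2 x 2 map with 2 channels.
    fm = [[[1, 10], [2, 20]],
          [[3, 30], [4, 40]]]
    print(global_average_pool(fm))
```

In the recipe the same reduction turns each 8 x 8 map into a single number, so the 2,048 channels become a 2,048-element vector for the dense layers, and the pooling itself adds no trainable parameters.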

All the convolutional levels are pretrained, so we freeze them while the new top layers are being trained:

# i.e. freeze all convolutional Inception-v3 layers
for layer in base_model.layers:
    layer.trainable = False

然后，对模型进行编译和训练几个周期，以便对顶层进行训练：

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# train the model on the new data for a few epochs
model.fit_generator(...)

Then we fine-tune the top Inception blocks while keeping the lower layers frozen. In this example, we freeze the first 172 layers (a hyperparameter to tune) and unfreeze the rest:

# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 172 layers and unfreeze the rest:
for layer in model.layers[:172]:
    layer.trainable = False
for layer in model.layers[172:]:
    layer.trainable = True

Then recompile the model for fine-tuning. The model must be recompiled for these modifications to take effect:

# we use SGD with a low learning rate
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy')

# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model.fit_generator(...)

工作原理

现在，我们有了一个新的深度网络，该网络可以重用标准的 Inception-v3 网络，但可以通过迁移学习在新的域 D 上进行训练。当然，有许多参数需要微调以获得良好的精度。但是，我们现在正在通过迁移学习重新使用非常庞大的预训练网络作为起点。这样，我们可以通过重复使用 Keras 中已经可用的内容来节省对机器进行训练的需求。

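As of 2017, the computer vision problem of finding patterns in an image can be considered largely solved, and this has an impact on our lives. For instance:

The paper Dermatologist-level classification of skin cancer with deep neural networks (Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau & Sebastian Thrun, 2017) trains a CNN using a dataset of 129,450 clinical images covering 2,032 different diseases. The authors test the results against 21 board-certified dermatologists on biopsy-proven clinical images, using two critical binary classification use cases: keratinocyte carcinomas versus benign seborrheic keratoses, and malignant melanomas versus benign nevi. The CNN achieves performance on par with all tested experts across both tasks, demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists.
The paper High-Resolution Breast Cancer Screening with Multi-View Deep Convolutional Neural Networks (Krzysztof J. Geras, Stacey Wolfson, S. Gene Kim, Linda Moy, Kyunghyun Cho) promises to improve the breast cancer screening process through its innovative architecture, which can handle the four standard views, or angles, without sacrificing high resolution. In contrast to the DCN architectures commonly used for natural images, which work with images of 224 x 224 pixels, the MV-DCN is also capable of using a resolution of 2600 x 2000 pixels.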

使用膨胀的卷积网络，WaveNet 和 NSynth 生成音乐

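WaveNet is a deep generative model for producing raw audio waveforms. This breakthrough technology was introduced by Google DeepMind to teach computers how to speak. The results are truly impressive, and online you can find examples of synthetic voices where the computer learns how to talk with the voice of celebrities such as Matt Damon.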

因此，您可能想知道为什么学习合成音频如此困难。嗯，我们听到的每个数字声音都是基于每秒 16,000 个样本（有时是 48,000 个或更多），并且要建立一个预测模型，在此模型中，我们学会根据以前的所有样本来复制样本是一个非常困难的挑战。尽管如此，仍有实验表明，WaveNet 改进了当前最先进的文本转语音（TTS）系统，使美国英语和中文普通话与人声的差异降低了 50%。

更酷的是，DeepMind 证明 WaveNet 也可以用于向计算机教授如何产生乐器声音（例如钢琴音乐）。

现在为一些定义。 TTS 系统通常分为两个不同的类别：

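Concatenative TTS, in which single speech voice fragments are first memorized and then recombined when the voice has to be reproduced. However, this approach does not scale, because it is only possible to reproduce the memorized voice fragments; it is not possible to reproduce new speakers or different types of audio without memorizing the fragments from the beginning.
Parametric TTS, in which a model is created to store all the characteristic features of the audio to be synthesized. Before WaveNet, the audio generated with parametric TTS was less natural than that of concatenative TTS. WaveNet improved the state of the art by modeling the production of audio sounds directly, instead of using the intermediate signal-processing algorithms used in the past.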

原则上，WaveNet 可以看作是一维卷积层的栈（我们已经在第 4 章中看到了图像的 2D 卷积），步幅恒定为 1，没有池化层。请注意，输入和输出在构造上具有相同的尺寸，因此卷积网络非常适合对顺序数据（例如音频）建模。但是，已经显示出，为了在输出神经元中的接收场达到较大的大小，有必要使用大量的大型过滤器或过分增加网络的深度。请记住，一层中神经元的接受场是前一层的横截面，神经元从该层提供输入。因此，纯卷积网络在学习如何合成音频方面不是那么有效。

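The key intuition behind WaveNet is the so-called dilated causal convolution (sometimes known as atrous convolution), which simply means that some input values are skipped when the filter of a convolutional layer is applied. Atrous is a corruption of the French expression à trous, meaning with holes, so an atrous convolution is a convolution with holes. For example, in one dimension, a filter w of size 3 with dilation 1 computes a weighted sum of three input values taken with one skipped value between consecutive ones, rather than a sum over three adjacent values.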
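Written out, and assuming that a dilation of 1 means exactly one skipped input between consecutive filter taps (the hole-size convention used in this section), the sum for the first output position would be the following. The indexing from 0 and the symbols x for the input and y for the output are notation chosen here for illustration.

```latex
y_0 = w_0\,x_0 + w_1\,x_2 + w_2\,x_4,
\qquad\text{and in general}\qquad
y_t = \sum_{i=0}^{2} w_i\,x_{t + i\,(d+1)} \quad \text{for a hole size } d .
```

With d = 0 this reduces to an ordinary convolution over three adjacent inputs. Many libraries use a dilation rate equal to d + 1 instead, so a rate of 1 there corresponds to no holes.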

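In short, in D-dilated convolution the stride is usually 1, but nothing prevents you from using other strides. The following figure gives an example in which the dilation (hole) size increases through 0, 1, and 2: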

扩张网络的一个例子

由于采用了引入“空洞”的简单思想，可以在不具有过多深度网络的情况下，使用指数级增长的滤波器堆叠多个膨胀的卷积层，并学习远程输入依赖项。

因此，WaveNet 是一个卷积网络，其中卷积层具有各种膨胀因子，从而使接收场随深度呈指数增长，因此有效覆盖了数千个音频时间步长。
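The exponential growth of the receptive field is easy to verify numerically. Below is a minimal plain-Python sketch of a causal dilated convolution and of the receptive field of a stack of such layers. It uses the dilation-rate convention (a rate of 1 is an ordinary convolution, a rate of 2 skips one input), and the filter width of 2 with rates doubling from 1 to 512 is an illustrative configuration rather than something stated in this recipe.

```python
def dilated_causal_conv1d(x, w, dilation=1):
    """y[t] = sum_i w[i] * x[t - i * dilation]; inputs before t = 0 count as 0."""
    y = []
    for t in range(len(x)):
        total = 0
        for i, weight in enumerate(w):
            j = t - i * dilation
            if j >= 0:  # causal: only the current and earlier samples are used
                total += weight * x[j]
        y.append(total)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input steps that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    print(dilated_causal_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))
    print(receptive_field(2, [2 ** i for i in range(10)]))
```

Ten layers with doubling rates already see 1,024 past samples, whereas ten undilated layers of the same width would see only 11.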

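When we train, the inputs are sounds recorded from human speakers. The waveforms are quantized to a fixed integer range. A WaveNet defines an initial convolutional layer that accesses only the current and previous inputs. Then there is a stack of dilated ConvNet layers, still accessing only current and previous inputs. At the end, there is a series of dense layers that combine the previous results, followed by a softmax activation function for categorical outputs.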

在每个步骤中，都会从网络预测一个值，并将其反馈到输入中。同时，为下一步计算新的预测。损失函数是当前步骤的输出与下一步的输入之间的交叉熵。
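The quantization step is what turns audio generation into a classification problem with a softmax output. One common choice, and as far as I know the one described in the WaveNet paper, is mu-law companding to 256 levels; the recipe itself does not say which scheme is used, so the sketch below is an assumption, and mu_law_encode is a hypothetical helper name.

```python
import math


def mu_law_encode(sample, mu=255):
    """Map an amplitude in [-1, 1] to an integer class in [0, mu]."""
    sample = max(-1.0, min(1.0, sample))
    # Compress: small amplitudes get a larger share of the output range.
    compressed = math.copysign(
        math.log1p(mu * abs(sample)) / math.log1p(mu), sample
    )
    # Shift from [-1, 1] to [0, mu] and round to the nearest integer.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


if __name__ == "__main__":
    for amplitude in (-1.0, -0.01, 0.0, 0.01, 1.0):
        print(amplitude, mu_law_encode(amplitude))
```

Each integer is then treated as a class label, and the cross-entropy mentioned above compares the predicted distribution over these classes with the class of the next sample.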

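NSynth is an evolution of WaveNet recently released by the Google Brain group which, instead of being causal, aims at seeing the entire context of the input chunk. The neural network is truly complex, as depicted in the following image, but for the sake of this introductory discussion it is sufficient to know that the network learns how to reproduce its input by using an approach based on reducing the error during the encoding/decoding phases: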

如下所示的 NSynth 架构示例

准备

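For this recipe, we are not going to write code; instead, I will show you how to use some code that is available online and some cool demos you can find from Google Brain. The interested reader can also read the article Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders (Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi, 5 Apr 2017).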

操作步骤

我们按以下步骤进行：

通过创建单独的 conda 环境来安装 NSynth。使用支持 Jupyter Notebook 的 Python 2.7 创建并激活 Magenta conda 环境：

conda create -n magenta python=2.7 jupyter
source activate magenta

安装magenta PIP 包和librosa（用于读取音频格式）：

pip install magenta
pip install librosa

从互联网安装预构建的模型，然后下载示例声音。然后运行演示目录中包含的笔记本。第一部分是关于包含稍后将在我们的计算中使用的模块的：

import os
import numpy as np
import matplotlib.pyplot as plt
from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen
from IPython.display import Audio
%matplotlib inline
%config InlineBackend.figure_format = 'jpg'

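Then we load the demo sound downloaded from the internet and placed in the same directory as the notebook. This loads 40,000 samples, about 2.5 seconds of audio at a sample rate of 16,000 Hz, into memory: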

# from https://www.freesound.org/people/MustardPlug/sounds/395058/
fname = '395058__mustardplug__breakbeat-hiphop-a4-4bar-96bpm.wav'
sr = 16000
audio = utils.load_audio(fname, sample_length=40000, sr=sr)
sample_length = audio.shape[0]
print('{} samples, {} seconds'.format(sample_length, sample_length / float(sr)))

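The next step is to encode the audio samples into a very compact representation using a pretrained NSynth model downloaded from the internet. For the 40,000-sample clip loaded above, this gives an encoding of dimension 78 x 16, which we can then decode or resynthesize. Our encoding is a tensor (#files=1 x 78 x 16):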

%time encoding = fastgen.encode(audio, 'model.ckpt-200000', sample_length)
INFO:tensorflow:Restoring parameters from model.ckpt-200000
 CPU times: user 1min 4s, sys: 2.96 s, total: 1min 7s
 Wall time: 25.7 s
print(encoding.shape)
(1, 78, 16)

让我们保存以后将用于重新合成的编码。另外，让我们用图形表示来快速查看编码形状是什么，并将其与原始音频信号进行比较。如您所见，编码遵循原始音频信号中呈现的节拍：

np.save(fname + '.npy', encoding)
fig, axs = plt.subplots(2, 1, figsize=(10, 5))
axs[0].plot(audio);
axs[0].set_title('Audio Signal')
axs[1].plot(encoding[0]);
axs[1].set_title('NSynth Encoding')

我们观察到以下音频信号和 Nsynth 编码：

现在，让我们对刚刚产生的编码进行解码。换句话说，我们试图从紧凑的表示中再现原始音频，目的是理解重新合成的声音是否类似于原始声音。确实，如果您运行实验并聆听原始音频和重新合成的音频，它们听起来非常相似：

%time fastgen.synthesize(encoding, save_paths=['gen_' + fname], samples_per_save=sample_length)

工作原理

WaveNet 是一种卷积网络，其中卷积层具有各种扩张因子，从而使接收场随深度呈指数增长，因此有效覆盖了数千个音频时间步长。 NSynth 是 WaveNet 的演进，其中原始音频使用类似 WaveNet 的处理进行编码，以学习紧凑的表示形式。然后，使用这种紧凑的表示来再现原始音频。

一旦我们学习了如何通过膨胀卷积创建音频的紧凑表示形式，我们就可以玩这些学习并从中获得乐趣。您会在互联网上找到非常酷的演示：

例如，您可以看到模型如何学习不同乐器的声音：

然后，您将看到如何将在一个上下文中学习的一个模型在另一个上下文中重新组合。例如，通过更改说话者身份，我们可以使用 WaveNet 以不同的声音说同一件事。
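Another very interesting experiment is to learn models for musical instruments and then remix them in such a way that we can create new musical instruments that have never been heard before. This is really cool, and it opens the path to a new range of possibilities that the ex-radio DJ in me cannot resist being super excited about. For instance, in this example, we combine a sitar with an electric guitar, and the result is a kind of cool new musical instrument. Not excited enough? Then what about combining a bowed bass with a dog's bark? Have fun!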

回答有关图像的问题（可视化问答）

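In this recipe, we will learn how to answer questions about the content of a specific image. This is a powerful form of Visual Q&A based on a combination of visual features extracted from a pretrained VGG16 model and word clustering (embeddings). These two sets of heterogeneous features are then combined into a single network whose last layers are an alternating sequence of Dense and Dropout layers. This recipe works with Keras 2.0+.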

因此，本秘籍将教您如何：

从预先训练的 VGG16 网络中提取特征。
使用预构建的单词嵌入将单词映射到相邻相似单词的空间中。
使用 LSTM 层构建语言模型。 LSTM 将在第 6 章中讨论，现在我们将它们用作黑盒。
组合不同的异构输入特征以创建组合的特征空间。对于此任务，我们将使用新的 Keras 2.0 函数式 API。
附加一些其他的密集和丢弃层，以创建多层感知机并增强我们的深度学习网络的功能。

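For the sake of simplicity, we will not retrain the combined network in step 5; instead, we will use a pretrained set of weights already available online. Interested readers can retrain the network on their own training dataset made up of N images, N questions, and N answers. This is left as an optional exercise. The network is inspired by the paper VQA: Visual Question Answering (Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh, 2015):

An example of Visual Q&A as seen in the Visual Question Answering paper

The only difference in our case is that we concatenate the features produced by the image layers with the features produced by the language layers.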

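How to do it

We can proceed with the following steps:

Load all the modules needed by the recipe. These include spaCy for word embedding, VGG16 for image feature extraction, and LSTM for language modeling. The remaining few additional modules are pretty standard: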

%matplotlib inline
import os, argparse
import numpy as np
import cv2 as cv2
import spacy as spacy
import matplotlib.pyplot as plt
from keras.models import Model, Input
from keras.layers.core import Dense, Dropout, Reshape
from keras.layers.recurrent import LSTM
from keras.layers.merge import concatenate
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from sklearn.externals import joblib
import PIL.Image

Define a few constants. Note that we assume that our corpus of questions has max_length_questions = 30, and we know that we will use VGG16 to extract 4,096 features describing our input image. In addition, we know that the word embeddings live in a space of length_feature_space = 300. Note that we will use a set of pretrained weights downloaded from the internet:

# mapping id -> labels for categories
label_encoder_file_name = '/Users/gulli/Books/TF/code/git/tensorflowBook/Chapter5/FULL_labelencoder_trainval.pkl'
# max length across corpus
max_length_questions = 30
# VGG output
length_vgg_features = 4096
# Embedding output
length_feature_space = 300
# pre-trained weights
VQA_weights_file = '/Users/gulli/Books/TF/code/git/tensorflowBook/Chapter5/VQA_MODEL_WEIGHTS.hdf5'

Use VGG16 to extract features. Note that we explicitly extract them from the fc2 layer. Given an input image, this function returns 4,096 features:

'''image features'''
def get_image_features(img_path, VGG16modelFull):
    '''given an image returns a tensor with (1, 4096) VGG16 features'''
    # Since VGG was trained on images of 224x224, every new image
    # is required to go through the same transformation
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    # this is required because VGG was originally trained on batches:
    # even if we have only one image we need to be consistent
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    # build a sub-model that stops at the fc2 layer and use its output
    model_extractfeatures = Model(inputs=VGG16modelFull.input,
                                  outputs=VGG16modelFull.get_layer('fc2').output)
    fc2_features = model_extractfeatures.predict(x)
    fc2_features = fc2_features.reshape((1, length_vgg_features))
    return fc2_features

Note that VGG16 is defined as follows:

Layer (type) Output Shape Param #
 =================================================================
 input_5 (InputLayer) (None, 224, 224, 3) 0
 _________________________________________________________________
 block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
 _________________________________________________________________
 block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
 _________________________________________________________________
 block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
 _________________________________________________________________
 block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
 _________________________________________________________________
 block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
 _________________________________________________________________
 block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
 _________________________________________________________________
 block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
 _________________________________________________________________
 block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
 _________________________________________________________________
 block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
 _________________________________________________________________
 block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
 _________________________________________________________________
 block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
 _________________________________________________________________
 block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
 _________________________________________________________________
 block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
 _________________________________________________________________
 block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
 _________________________________________________________________
 block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
 _________________________________________________________________
 block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
 _________________________________________________________________
 flatten (Flatten) (None, 25088) 0
 _________________________________________________________________
 fc1 (Dense) (None, 4096) 102764544
 _________________________________________________________________
 fc2 (Dense) (None, 4096) 16781312
 _________________________________________________________________
 predictions (Dense) (None, 1000) 4097000
 =================================================================
 Total params: 138,357,544
 Trainable params: 138,357,544
 Non-trainable params: 0
 _________________________________________

Use spaCy to get the word embeddings and map the input question into a space of shape (max_length_questions, 300), where max_length_questions is the maximum length of the questions in our corpus and 300 is the dimension of the embeddings produced by spaCy. Internally, spaCy uses an algorithm called GloVe, which reduces a given token to a 300-dimensional representation. Note that the question is padded on the right with zeros up to max_length_questions:

'''embedding'''
def get_question_features(question):
    ''' given a question, a unicode string, returns the time series vector
    with each word (token) transformed into a 300 dimension representation
    calculated using Glove Vector '''
    word_embeddings = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
    tokens = word_embeddings(question)
    # truncate questions longer than max_length_questions
    ntokens = min(len(tokens), max_length_questions)
    question_tensor = np.zeros((1, max_length_questions, 300))
    for j in range(ntokens):
        question_tensor[0,j,:] = tokens[j].vector
    return question_tensor

Load an image and get its salient features using the previously defined image feature extractor:

image_file_name = 'girl.jpg'
img0 = PIL.Image.open(image_file_name)
img0.show()
#get the salient features
model = VGG16(weights='imagenet', include_top=True)
image_features = get_image_features(image_file_name, model)
print(image_features.shape)

Write a question and get its salient features using the previously defined sentence feature extractor:

question = u"Who is in this picture?"
language_features = get_question_features(question)
print(language_features.shape)

Combine the two sets of heterogeneous features into one. In this network, we have three LSTM layers, which take care of creating our language model. Note that LSTMs are discussed in detail in Chapter 6; for now we only use them as black boxes. The last LSTM returns 512 features, which are then used as input to a sequence of Dense and Dropout layers. The last layer is a Dense layer with a softmax activation function over a probability space of 1,000 potential answers:

'''combine'''
def build_combined_model(
        number_of_LSTM = 3,
        number_of_hidden_units_LSTM = 512,
        number_of_dense_layers = 3,
        number_of_hidden_units = 1024,
        activation_function = 'tanh',
        dropout_pct = 0.5
        ):
    #input image
    input_image = Input(shape=(length_vgg_features,),
                        name="input_image")
    model_image = Reshape((length_vgg_features,),
                          input_shape=(length_vgg_features,))(input_image)
    #input language
    input_language = Input(shape=(max_length_questions,length_feature_space,),
                           name="input_language")
    #build a sequence of LSTM
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=True,
                          name = "lstm_1")(input_language)
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=True,
                          name = "lstm_2")(model_language)
    model_language = LSTM(number_of_hidden_units_LSTM,
                          return_sequences=False,
                          name = "lstm_3")(model_language)
    #concatenate 4096+512
    model = concatenate([model_image, model_language])
    #Dense, Dropout
    for _ in range(number_of_dense_layers):
        model = Dense(number_of_hidden_units,
                      kernel_initializer='uniform')(model)
        model = Dropout(dropout_pct)(model)
    model = Dense(1000,
                  activation='softmax')(model)
    #create model from tensors
    model = Model(inputs=[input_image, input_language], outputs = model)
    return model

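Build the combined network and show its summary to understand how it looks internally. Load the pretrained weights and compile the model using the categorical_crossentropy loss function with the rmsprop optimizer: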

combined_model = build_combined_model()
combined_model.summary()
combined_model.load_weights(VQA_weights_file)
combined_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
____________________________
 Layer (type) Output Shape Param # Connected to
 ====================================================================================================
 input_language (InputLayer) (None, 30, 300) 0
 ____________________________________________________________________________________________________
 lstm_1 (LSTM) (None, 30, 512) 1665024 input_language[0][0]
 ____________________________________________________________________________________________________
 input_image (InputLayer) (None, 4096) 0
 ____________________________________________________________________________________________________
 lstm_2 (LSTM) (None, 30, 512) 2099200 lstm_1[0][0]
 ____________________________________________________________________________________________________
 reshape_3 (Reshape) (None, 4096) 0 input_image[0][0]
 ____________________________________________________________________________________________________
 lstm_3 (LSTM) (None, 512) 2099200 lstm_2[0][0]
 ____________________________________________________________________________________________________
 concatenate_3 (Concatenate) (None, 4608) 0 reshape_3[0][0]
 lstm_3[0][0]
 ____________________________________________________________________________________________________
 dense_8 (Dense) (None, 1024) 4719616 concatenate_3[0][0]
 ____________________________________________________________________________________________________
 dropout_7 (Dropout) (None, 1024) 0 dense_8[0][0]
 ____________________________________________________________________________________________________
 dense_9 (Dense) (None, 1024) 1049600 dropout_7[0][0]
 ____________________________________________________________________________________________________
 dropout_8 (Dropout) (None, 1024) 0 dense_9[0][0]
 ____________________________________________________________________________________________________
 dense_10 (Dense) (None, 1024) 1049600 dropout_8[0][0]
 ____________________________________________________________________________________________________
 dropout_9 (Dropout) (None, 1024) 0 dense_10[0][0]
 ____________________________________________________________________________________________________
 dense_11 (Dense) (None, 1000) 1025000 dropout_9[0][0]
 ====================================================================================================
 Total params: 13,707,240
 Trainable params: 13,707,240
 Non-trainable params: 0
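The parameter counts in this summary can be checked by hand. The sketch below assumes the usual formulas for an LSTM layer with biases, 4 * (units * (input_dim + units) + units), and for a Dense layer, inputs * units + units; lstm_params and dense_params are hypothetical helper names:

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

layers = [
    lstm_params(300, 512),           # lstm_1
    lstm_params(512, 512),           # lstm_2
    lstm_params(512, 512),           # lstm_3
    dense_params(4096 + 512, 1024),  # dense_8, fed by the concatenation
    dense_params(1024, 1024),        # dense_9
    dense_params(1024, 1024),        # dense_10
    dense_params(1024, 1000),        # dense_11, the softmax layer
]
print(layers)
print(sum(layers))  # 13707240
```

The individual values and the total agree with the Param # column and the Total params line of the summary.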

Use the pretrained combined network to make a prediction. Note that in this case we use weights already available online for this network, but interested readers can retrain the combined network on their own training set:

y_output = combined_model.predict([image_features, language_features])
# This task here is represented as a classification into a 1000 top answers
# this means some of the answers were not part of training and thus would
# not show up in the result.
# These 1000 answers are stored in the sklearn Encoder class
labelencoder = joblib.load(label_encoder_file_name)
for label in reversed(np.argsort(y_output)[0,-5:]):
    print("{} % {}".format(str(round(y_output[0,label]*100,2)).zfill(5),
                           labelencoder.inverse_transform(label)))
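The loop above relies on np.argsort to pick the five most probable answers. The same selection can be sketched without NumPy; the probabilities and answer labels below are made up for illustration, and top_k is a hypothetical helper:

```python
def top_k(probabilities, k):
    """Return (index, probability) pairs of the k largest entries,
    most probable first."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return [(i, probabilities[i]) for i in order[:k]]

# toy output of a softmax over 6 candidate answers (illustrative values)
labels = ['yes', 'no', 'girl', 'man', 'woman', 'dog']
y_output = [0.05, 0.02, 0.55, 0.08, 0.25, 0.05]

for index, p in top_k(y_output, 3):
    print('{:05.2f} % {}'.format(p * 100, labels[index]))
```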

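How it works

The task of Visual Question Answering is tackled by combining different deep neural networks. A pretrained VGG16 is used to extract features from the image, and a sequence of LSTMs is used to extract features from the question, previously mapped into an embedding space. VGG16 is a CNN used for image feature extraction, while LSTM is an RNN used to extract temporal features representing sequences. The combination of these two is currently the state of the art for this type of task. A multilayer perceptron with dropout is then added on top of the combined model to form our deep network.

On the internet, you can find more experiments performed by Avi Singh in which different models are compared, including a simple bag-of-words model for language combined with a CNN for images, an LSTM-only model, and an LSTM + CNN model similar to the one discussed in this recipe. The blog post also discusses different training strategies for each model.

In addition, interested readers can find on the internet a nice GUI built on top of Avi Singh's demo, which lets you interactively load images and ask related questions. A YouTube video is also available.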

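Classifying videos with pretrained networks in six different ways

Classifying videos is an active area of research because of the large amount of data needed to process this type of media. Memory requirements frequently reach the limits of modern GPUs, and distributed training across multiple machines may be required. Research is currently exploring different directions of increasing complexity; let's review them.

The first approach consists of classifying one video frame at a time, treating each frame as a separate image processed with a 2D CNN. This approach simply reduces the video classification problem to an image classification problem. Each video frame produces a classification output, and the video is classified by taking the category most frequently chosen across frames.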
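The voting step of this first approach can be sketched in a few lines. The per-frame labels below are invented for illustration, and classify_video is a hypothetical helper; in practice the labels would come from the 2D CNN applied to each frame:

```python
from collections import Counter

def classify_video(frame_labels):
    """Label a video with the class predicted most often across its frames."""
    if not frame_labels:
        raise ValueError('need at least one frame prediction')
    label, _count = Counter(frame_labels).most_common(1)[0]
    return label

# per-frame predictions that a 2D CNN might produce (illustrative values)
frame_labels = ['surfing', 'surfing', 'swimming', 'surfing', 'sailing']
print(classify_video(frame_labels))  # surfing
```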

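The second approach consists of creating one single network in which a 2D CNN is combined with an RNN. The idea is that the CNN takes into account the image components while the RNN takes into account the sequence information of each video. This type of network can be very difficult to train because of the very high number of parameters to optimize.

The third approach is to use a 3D ConvNet, where a 3D ConvNet is an extension of a 2D ConvNet operating on a 3D tensor (time, image_width, image_height). This approach is another natural extension of image classification, but again, 3D ConvNets can be hard to train.

The fourth approach is based on a smart intuition. Instead of using CNNs directly for classification, they can be used to store offline features for each frame of the video. The idea is that feature extraction can be made very efficient with transfer learning, as shown in a previous recipe. After all the features are extracted, they can be passed as a set of inputs to an RNN, which learns sequences across multiple frames and emits the final classification.

The fifth approach is a simple variant of the fourth, in which the final layer is an MLP instead of an RNN. In certain situations, this approach can be simpler and less expensive in terms of computational requirements.

The sixth approach is a variant of the fourth in which the feature extraction phase is performed by a 3D CNN that extracts spatial and visual features. These features are then passed to either an RNN or an MLP.

Choosing the best approach depends strictly on your specific application, and there is no definitive answer. The first three approaches are generally more computationally expensive, while the last three are less expensive and frequently achieve better performance.

In this recipe, we show how to use the sixth approach by describing the results presented in the paper Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks. This work aims at solving the ActivityNet Challenge. The challenge focuses on recognizing high-level and goal-oriented activities from user-generated videos, similar to those found in internet portals. It is tailored to 200 activity categories in two different tasks:

Untrimmed classification challenge: given a long video, predict the labels of the activities present in the video
Detection challenge: given a long video, predict the labels and temporal extents of the activities present in the video

The presented architecture consists of two stages, as depicted in the following figure. The first stage encodes the video information into a single vector representation for small video clips. To achieve that, the C3D network is used. The C3D network uses 3D convolutions to extract spatiotemporal features from the videos, which have previously been split into 16-frame clips.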
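As a rough illustration of the clip-splitting step, the following sketch groups a list of frames into 16-frame clips. It assumes non-overlapping clips and drops any trailing frames that do not fill a whole clip; the actual pipeline may handle these details differently, and split_into_clips is a hypothetical helper:

```python
def split_into_clips(frames, clip_length=16):
    """Group consecutive frames into non-overlapping clips.
    Trailing frames that do not fill a whole clip are dropped."""
    n_clips = len(frames) // clip_length
    return [frames[i * clip_length:(i + 1) * clip_length]
            for i in range(n_clips)]

frames = list(range(100))          # stand-in for 100 decoded video frames
clips = split_into_clips(frames)
print(len(clips), len(clips[0]))   # 6 16
```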

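Once the video features are extracted, the second stage classifies the activity of each clip. To perform this classification, an RNN is used, more specifically an LSTM network, which tries to exploit long-term correlations and perform a prediction of the video sequence. This stage is the one that has been trained:

An example of C3D + RNN

How to do it

For this recipe, we simply summarize the results presented online:

The first step is to clone the git repository: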

git clone https://github.com/imatge-upc/activitynet-2016-cvprw.git

Then we need to download the ActivityNet v1.3 dataset, which has a size of 600 GB:

cd dataset
 # This will download the videos on the default directory
 sh download_videos.sh username password
 # This will download the videos on the directory you specify
 sh download_videos.sh username password /path/you/want

The next step is to download the pretrained weights for both the 3D CNN and the RNN:

cd data/models
 sh get_c3d_sports_weights.sh
 sh get_temporal_location_weights.sh

The last step consists of classifying a video:

python scripts/run_all_pipeline.py -i path/to/test/video.mp4

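How it works

If you are interested in training the 3D CNN and the RNN on your own machines, you can find the specific commands used by this pipeline on the internet.

The intent here is to present a high-level view of the different approaches available for video classification. Again, there is not a single recipe but multiple options, which should be chosen carefully according to your specific needs.

The CNN-LSTM architecture is a new RNN layer in which both the input transformations and the recurrent transformations are convolutional. Despite the very similar name, a CNN-LSTM layer is therefore different from the combination of a CNN and an LSTM described above. The model is described in the paper Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting (Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, Wang-chun Woo, 2015). In 2017 some people started experimenting with this module for video, but this is still an active area of research.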

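6. Recurrent Neural Networks

In this chapter, we will present a number of recipes covering the following topics:

Neural machine translation - training a seq2seq RNN
Neural machine translation - inference on a seq2seq RNN
All you need is attention - another example of a seq2seq RNN
Learning to write as Shakespeare with RNNs
Learning to predict future Bitcoin value with RNNs
Many-to-one and many-to-many RNN examples

Introduction

In this chapter, we will discuss how recurrent neural networks (RNNs) are used for deep learning in domains where maintaining sequential order is important. Our attention will be mainly devoted to text analysis and natural language processing (NLP), but we will also see an example of sequences used to predict the value of Bitcoin.

Many real-life situations can be described by adopting a model based on temporal sequences. For instance, if you think about writing a document, the order of words is important and the current word certainly depends on the previous ones. If we still focus on text writing, it is clear that the next character in a word depends on the previous characters (for example, after The quick brown f, there is a very high probability that the next letter will be o), as illustrated in the following figure. The key idea is to produce a distribution of next characters given the current context, and then to sample from that distribution to produce the next candidate character:

An example of prediction with the sentence The quick brown fox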
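The idea of sampling the next character from a distribution can be sketched as follows. The probabilities are invented for illustration (a trained model would produce them from the context), and sample_next_char is a hypothetical helper:

```python
import random

def sample_next_char(distribution, rng):
    """Draw one character from a {char: probability} mapping."""
    threshold = rng.random()
    cumulative = 0.0
    for char, probability in distribution.items():
        cumulative += probability
        if threshold < cumulative:
            return char
    return char  # guard against rounding when probabilities sum to ~1

# illustrative distribution for the context "The quick brown f"
next_char = {'o': 0.90, 'a': 0.05, 'i': 0.03, 'u': 0.02}

rng = random.Random(0)
samples = [sample_next_char(next_char, rng) for _ in range(1000)]
print(samples.count('o') / 1000.0)
```

Most draws return o, but the other letters are occasionally chosen, which is what lets a generative model produce varied text.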

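A simple variant is to store more than one prediction and therefore create a tree of possible expansions, as illustrated in the following figure:

An example of a prediction tree for the sentence The quick brown fox

However, sequence-based models can be used in a very large number of other domains. In music, the next note in a composition certainly depends on the previous notes, and in a video the next frame in a movie is certainly related to the previous frames. Moreover, in certain situations the current video frame, word, character, or musical note depends not only on the previous ones but also on the following ones.

A model based on a temporal sequence can be described with an RNN where, for a given input $X_i$ at time i producing the output $Y_i$, a memory of the previous states at times [0, i-1] is fed back into the network. The idea of feeding back previous states is depicted by a recurrent loop, as shown in the following figure: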

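An example of feeding back

The recurrent relation can be conveniently expressed by unfolding the network, as illustrated in the following figure:

An example of unfolding a recurrent cell

The simplest RNN cell consists of a simple tanh function, the hyperbolic tangent function, as represented in the following figure:

An example of a simple tanh cell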

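Vanishing and exploding gradients

Training an RNN is hard because of two stability problems. Because of the feedback loop, the gradient can quickly diverge to infinity, or it can quickly shrink to 0. In both cases, as illustrated in the following figure, the network stops learning anything useful. The problem of exploding gradients can be tackled with a relatively simple solution based on gradient clipping. The problem of vanishing gradients is more difficult to solve, and it involves the definition of more complex basic RNN cells, such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRUs). Let's first discuss exploding gradients and gradient clipping:

Examples of gradients

Gradient clipping consists of imposing a maximum value on the gradient so that it cannot grow boundlessly. This simple approach, illustrated in the following figure, addresses the problem of exploding gradients:

An example of gradient clipping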
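One common way to impose such a maximum is to rescale all gradients whenever their joint L2 norm exceeds a threshold, which is the idea behind the tf.clip_by_global_norm call used later in this chapter. The following is a minimal pure-Python sketch of that idea, not TensorFlow's implementation; the gradient values are illustrative:

```python
import math

def clip_by_global_norm(gradients, max_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm
    does not exceed max_norm. Returns (clipped_gradients, global_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_norm:
        return [list(grad) for grad in gradients], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm

grads = [[3.0, 4.0], [12.0]]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm)     # 13.0
print(clipped)
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length is capped.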

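Solving the problem of vanishing gradients requires a more complex model of memory, one that can selectively forget previous states and remember only the ones that really matter. Considering the following figure, the input is written into a memory M with a probability p in [0,1], which is multiplied to weight the input.

In a similar way, the output is read with a probability p in [0,1], which is multiplied to weight the output. One more probability is used to decide what to remember or forget:

An example of a memory cell

Long Short-Term Memory (LSTM)

An LSTM network can control when to let the input enter the neuron, when to remember what was learned in the previous time step, and when to let the output pass on to the next time step. All these decisions are self-tuned and based only on the input. At first glance an LSTM looks difficult to understand, but it is not. Let's use the following figure to explain how it works:

An example of an LSTM cell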

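First, we need a logistic function σ (see Chapter 2, Regression) to compute a value between 0 and 1 and control which piece of information flows through the LSTM gates. Remember that the logistic function is differentiable, and therefore it allows backpropagation. Then we need an operator ⊗ that takes two matrices of the same dimensions and produces another matrix in which each element ij is the product of the elements ij of the original two matrices. Similarly, we need an operator ⊕ that takes two matrices of the same dimensions and produces another matrix in which each element ij is the sum of the elements ij of the original two matrices. With these basic blocks, we consider the input $X_i$ at time i and simply juxtapose it with the output $Y_{i-1}$ from the previous step.

The equation $f_i = \sigma(W_f \cdot [Y_{i-1}, X_i] + b_f)$ implements a logistic regression that controls the gate ⊗ and is used to decide how much information from the previous candidate value $C_{i-1}$ should be passed to the next candidate value $C_i$ (here $W_f$ and $b_f$ are the weight matrix and the bias used for the logistic regression). If the sigmoid outputs 1, this means do not forget the previous cell state $C_{i-1}$; if it outputs 0, this means forget the previous cell state $C_{i-1}$. Any number in (0, 1) represents the amount of information to be passed.

Then we have two equations: $s_i = \sigma(W_s \cdot [Y_{i-1}, X_i] + b_s)$, used to control via ⊗ how much of the information produced by the current cell, $\hat{C}_i = \tanh(W_C \cdot [Y_{i-1}, X_i] + b_C)$, should be added to the next candidate value $C_i$ via the operator ⊕, according to the scheme represented in the preceding figure.

To implement what has been discussed with the operators ⊕ and ⊗, we need another equation where the actual sum + and multiplication * take place: $C_i = f_i * C_{i-1} + s_i * \hat{C}_i$

Finally, we need to decide which part of the current cell should be sent to the output $Y_i$. This is simple: we take a logistic regression equation one more time and use it to control, via an ⊗ operation, which part of the candidate value should go to the output. Here there is one detail worth noting: the tanh function is used to squash the output into [-1, 1]. This last step is described by the equations $o_i = \sigma(W_o \cdot [Y_{i-1}, X_i] + b_o)$ and $Y_i = o_i * \tanh(C_i)$.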

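Now, I understand that this looks like a lot of math, but there are two pieces of good news. First, the math is not so difficult once you understand the goal we want to achieve. Second, you can use LSTM cells as a black-box drop-in replacement for standard RNN cells and immediately get the benefit of solving the vanishing gradient problem. For this reason, you really do not need to know all the math. You just take the TensorFlow LSTM implementation from the library and use it.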
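For readers who do want to see the math in action, here is a minimal pure-Python sketch of a single LSTM time step following the gate equations described above. The output gate is written here as o (a name chosen for this sketch), the parameter dictionary layout is an assumption, and the all-zero weights are picked only so that the arithmetic is easy to verify by hand:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v, b):
    """Compute W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def lstm_step(x, y_prev, c_prev, params):
    """One LSTM time step following the equations in the text."""
    z = y_prev + x                                   # juxtapose [Y_{i-1}, X_i]
    f = [sigmoid(a) for a in affine(params['W_f'], z, params['b_f'])]
    s = [sigmoid(a) for a in affine(params['W_s'], z, params['b_s'])]
    c_hat = [math.tanh(a) for a in affine(params['W_C'], z, params['b_C'])]
    o = [sigmoid(a) for a in affine(params['W_o'], z, params['b_o'])]
    # C_i = f_i * C_{i-1} + s_i * C_hat_i  (element-wise)
    c = [fi * ci + si * chi for fi, ci, si, chi in zip(f, c_prev, s, c_hat)]
    # Y_i = o_i * tanh(C_i)
    y = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return y, c

# toy sizes: 2 inputs, 1 hidden unit; all-zero weights keep the arithmetic easy
params = {}
for gate in ('f', 's', 'C', 'o'):
    params['W_' + gate] = [[0.0, 0.0, 0.0]]
    params['b_' + gate] = [0.0]

y, c = lstm_step(x=[1.0, -1.0], y_prev=[0.0], c_prev=[2.0], params=params)
print(c)  # [1.0]
print(y)
```

With zero weights every gate outputs 0.5 and the candidate is 0, so the new cell state is half of the previous one, and the output is 0.5 * tanh(1.0).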

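Gated Recurrent Units (GRUs) and peephole LSTM

Many variants of the LSTM cell have been proposed in recent years. Two of them are really popular. Peephole LSTM allows the gate layers to look at the cell state, as depicted by the dotted lines in the following figure, while Gated Recurrent Units (GRUs) merge the hidden state and the cell state into one single channel of information.

Again, both GRUs and peephole LSTM can be used as black-box drop-in replacements for standard RNN cells with no need to know the underlying math. Both cells can be used to solve the vanishing gradient problem and to build deep neural networks:

Examples of standard LSTM, peephole LSTM, and GRU

Operating on sequences of vectors

What makes RNNs really powerful is the ability to operate on sequences of vectors, where the input to an RNN and/or the output of an RNN can be a sequence. This is well represented by the following figure, where the leftmost example is a traditional (non-recurrent) network, followed by an RNN with a sequence in output, then an RNN with a sequence in input, then an RNN with sequences in both input and output where the sequences are not synced, and finally an RNN with sequences in both input and output where the sequences are synced:

An example of RNN sequences

Machine translation is an example of non-synced sequences in both input and output: the network reads the input text as a sequence, and only after reading the full text does it output the text in the target language.

Video classification is an example of synced sequences in both input and output: the video input is a sequence of frames, and for each frame a classification label is provided in output.

If you want to know more about interesting applications of RNNs, Andrej Karpathy's blog post is a must-read. He trained networks to write essays in Shakespeare's style (in Karpathy's words: can barely recognize these samples from actual Shakespeare), to write realistic Wikipedia articles on imaginary subjects, to write realistic theorem proofs for silly and unrealistic problems (in Karpathy's words: more hallucinated algebraic geometry), and to write realistic fragments of Linux code (in Karpathy's words: the model first recites the GNU license character by character, samples a few includes, generates some macros, and then dives into the code).

The following example is taken from that blog post:

An example of text generated with RNNs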

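Neural machine translation - training a seq2seq RNN

Sequence to sequence (seq2seq) is a particular kind of RNN that has been successfully applied to neural machine translation, text summarization, and speech recognition. In this recipe, we discuss how to implement neural machine translation with results similar to those achieved by the Google Neural Machine Translation system. The key idea is to input a whole sequence of text, understand its entire meaning, and then output the translation as another sequence. The idea of reading an entire sequence is very different from previous architectures, in which a fixed set of words was translated from a source language into a target language.

This section is inspired by Minh-Thang Luong's 2016 PhD thesis, Neural Machine Translation. The first key concept is the presence of an encoder-decoder architecture, in which an encoder transforms a source sentence into a vector representing its meaning. This vector is then passed through a decoder to produce a translation. Both the encoder and the decoder are RNNs that can capture long-range dependencies in languages, such as gender agreement and syntax structure, without knowing them a priori and with no need for a 1:1 mapping across languages. This is a powerful capability that enables very fluent translations:

An example of an encoder-decoder

Let's see an example of an RNN translating the sentence She loves cute cats into Elle aime les chats mignons.

There are two RNNs: one acts as the encoder and one acts as the decoder. The source sentence She loves cute cats is followed by a separator sign, and the target sentence is Elle aime les chats mignons. These two concatenated sentences are given as input to the encoder for training, and the decoder produces the target sentence. Of course, we need multiple examples like this one to achieve good training:

An example of a sequence model for NMT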

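Now, there are a number of RNN variants we can have. Let's look at some of them:

RNNs can be unidirectional or bidirectional. The latter capture long-term relations in both directions.
RNNs can have multiple hidden layers. The choice is a matter of optimization: on one hand, a deeper network can learn more; on the other hand, it might require a long time to train and might overfit.
RNNs can have an embedding layer that maps words into an embedding space where similar words happen to be mapped very close to each other.
RNNs can use simple recurrent cells, LSTM, peephole LSTM, or GRU.

Still referring to the PhD thesis Neural Machine Translation, we can use an embedding layer to map the input sentences into an embedding space. Then there are two RNNs stuck together: the encoder for the source language and the decoder for the target language. As you can see, there are multiple hidden layers and two flows: the feed-forward vertical direction connects the hidden layers, and the horizontal direction is the recurrent part that transfers knowledge from the previous step to the next one:

An example of neural machine translation

In this recipe, we use NMT (Neural Machine Translation), a demo package for translation available online on top of TensorFlow.

Getting ready

NMT is available online, and the code is on GitHub.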

How to do it

We proceed with the recipe as follows:

Clone NMT from GitHub:

git clone https://github.com/tensorflow/nmt/

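Download a training dataset. In this case, we will use the training set for translating from Vietnamese to English. Other datasets are available online for additional languages, such as German and Czech: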

nmt/scripts/download_iwslt15.sh /tmp/nmt_data

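Following the NMT repository, we define the first embedding layer. The embedding layer takes an input, the vocabulary size V, and the desired size of the output embedding space. The vocabulary size is such that only the V most frequent words are considered for embedding, while all the others are mapped to a common unknown term. In our case, the input is time-major, which means that the max time is the first input dimension:

# Embedding
embedding_encoder = variable_scope.get_variable(
    "embedding_encoder", [src_vocab_size, embedding_size], ...)
# Look up embedding:
#   encoder_inputs: [max_time, batch_size]
#   encoder_emb_inp: [max_time, batch_size, embedding_size]
encoder_emb_inp = embedding_ops.embedding_lookup(
    embedding_encoder, encoder_inputs)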

Still following the NMT repository, we define a simple encoder that uses tf.nn.rnn_cell.BasicLSTMCell(num_units) as the basic RNN cell. This is pretty simple, but it is important to notice that, given the basic RNN cell, we create the RNN using tf.nn.dynamic_rnn:

# Build RNN cell
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# Run Dynamic RNN
#   encoder_outputs: [max_time, batch_size, num_units]
#   encoder_state: [batch_size, num_units]
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True)

After that, we need to define the decoder. The first thing is to have a basic RNN cell with tf.nn.rnn_cell.BasicLSTMCell, which is then used to create a basic sampling decoder, tf.contrib.seq2seq.BasicDecoder, which in turn is used to perform dynamic decoding with tf.contrib.seq2seq.dynamic_decode:

# Build RNN cell
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# Helper
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, decoder_lengths, time_major=True)
# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
logits = outputs.rnn_output

The last stage in the network is a dense layer feeding a softmax, used to transform the top hidden states into a logit vector:

projection_layer = layers_core.Dense(
    tgt_vocab_size, use_bias=False)

Of course, we need to define the cross-entropy function and the loss used during the training phase:

crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=decoder_outputs, logits=logits)
train_loss = (tf.reduce_sum(crossent * target_weights) /
              batch_size)
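To see what this loss computes, the following pure-Python sketch reproduces the same arithmetic on a toy example: a per-token softmax cross-entropy, multiplied by a weight that is 0 for padding positions, summed, and divided by the batch size. The logits, labels, and weights are invented for illustration, and masked_cross_entropy is a hypothetical helper rather than part of the NMT code:

```python
import math

def masked_cross_entropy(logits, labels, weights, batch_size):
    """Sum of per-token cross-entropies, with padding masked out by
    zero weights, divided by the batch size.
    logits: [time][batch][vocab], labels and weights: [time][batch]."""
    total = 0.0
    for step_logits, step_labels, step_weights in zip(logits, labels, weights):
        for token_logits, label, weight in zip(step_logits, step_labels,
                                               step_weights):
            # numerically stable log-sum-exp
            top = max(token_logits)
            log_z = top + math.log(sum(math.exp(l - top) for l in token_logits))
            total += weight * (log_z - token_logits[label])
    return total / batch_size

# one sentence (batch_size = 1), two time steps, vocabulary of 2 words
logits = [[[0.0, 0.0]], [[5.0, 1.0]]]
labels = [[1], [0]]
weights = [[1.0], [0.0]]   # the second step is padding and is ignored
loss = masked_cross_entropy(logits, labels, weights, batch_size=1)
print(loss)
```

Only the first time step contributes here, and since its two logits are equal, the loss is log(2).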

The next step is to define the steps needed for backpropagation and use an appropriate optimizer (in this case, Adam). Note that the gradients are clipped and that Adam uses a predefined learning rate:

# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(
    gradients, max_gradient_norm)
# Optimization
optimizer = tf.train.AdamOptimizer(learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params))

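So now we can run the code and understand what the different executed steps are. First, the training graph is created. Then the training iterations start. The metric used for evaluation is the bilingual evaluation understudy (BLEU). This metric is the standard for evaluating the quality of text that has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and a human's. As you can see, the value grows over time: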

python -m nmt.nmt --src=vi --tgt=en --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/train --dev_prefix=/tmp/nmt_data/tst2012 --test_prefix=/tmp/nmt_data/tst2013 --out_dir=/tmp/nmt_model --num_train_steps=12000 --steps_per_stats=100 --num_layers=2 --num_units=128 --dropout=0.2 --metrics=bleu
# Job id 0
[...]
# creating train graph ...
num_layers = 2, num_residual_layers=0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
start_decay_step=0, learning_rate=1, decay_steps 10000,decay_factor 0.98
[...]
# Start step 0, lr 1, Thu Sep 21 12:57:18 2017
# Init train iterator, skipping 0 elements
global step 100 lr 1 step-time 1.65s wps 3.42K ppl 1931.59 bleu 0.00
global step 200 lr 1 step-time 1.56s wps 3.59K ppl 690.66 bleu 0.00
[...]
global step 9100 lr 1 step-time 1.52s wps 3.69K ppl 39.73 bleu 4.89
global step 9200 lr 1 step-time 1.52s wps 3.72K ppl 40.47 bleu 4.89
global step 9300 lr 1 step-time 1.55s wps 3.62K ppl 40.59 bleu 4.89
[...]
# External evaluation, global step 9000
decoding to output /tmp/nmt_model/output_dev.
done, num sentences 1553, time 17s, Thu Sep 21 17:32:49 2017.
bleu dev: 4.9
saving hparams to /tmp/nmt_model/hparams
# External evaluation, global step 9000
decoding to output /tmp/nmt_model/output_test.
done, num sentences 1268, time 15s, Thu Sep 21 17:33:06 2017.
bleu test: 3.9
saving hparams to /tmp/nmt_model/hparams
[...]
global step 9700 lr 1 step-time 1.52s wps 3.71K ppl 38.01 bleu 4.89

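How it works

All the preceding code is defined in the NMT repository. The key idea is to have two RNNs packed together. The first one is the encoder, which works in an embedding space that maps similar words very close to each other. The encoder understands the meaning of the training examples and produces a tensor as output. This tensor is then passed to the decoder simply by connecting the last hidden layer of the encoder to the initial layer of the decoder. Note that learning happens because of our loss function, which is based on cross-entropy with labels=decoder_outputs.

The code learns how to translate, and progress is tracked across iterations with the BLEU metric, as shown in the following figure:

An example of the BLEU metric in TensorBoard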

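Neural machine translation - inference on a seq2seq RNN

In this recipe, we use the results of the previous recipe to translate from a source language into a target language. The idea is very simple: a source sentence is given as input to the two combined RNNs (encoder + decoder). As soon as the sentence ends, the decoder emits logit values, and we greedily emit the word associated with the maximum value. For example, the word moi is emitted as the first token from the decoder because it has the maximum logit value. After that, the word suis is emitted, and so on:

An example of a sequence model for NMT with probabilities

There are multiple strategies for using the output of a decoder:

Greedy: the word corresponding to the maximum logit is emitted
Sampling: a word is emitted by sampling from the logits produced
Beam search: more than one prediction is kept, and therefore a tree of possible expansions is created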
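The first two strategies can be contrasted with a small sketch. The vocabulary and logit values are made up for illustration, and the helper names greedy, softmax, and sample are hypothetical; beam search is omitted to keep the example short:

```python
import math
import random

def greedy(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, rng):
    """Draw an index with probability softmax(logits)."""
    threshold, cumulative = rng.random(), 0.0
    probabilities = softmax(logits)
    for index, p in enumerate(probabilities):
        cumulative += p
        if threshold < cumulative:
            return index
    return len(logits) - 1

vocabulary = ['moi', 'suis', 'je', 'etudiant']   # toy target vocabulary
logits = [2.5, 0.3, 1.9, -1.0]                   # illustrative decoder output

print(vocabulary[greedy(logits)])                # moi
rng = random.Random(0)
print([vocabulary[sample(logits, rng)] for _ in range(5)])
```

Greedy decoding always returns the same word for the same logits, while sampling occasionally picks less likely words.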

How to do it

We proceed with the recipe as follows:

Define the greedy strategy for sampling the decoder. This is easy because we can use the helper defined in tf.contrib.seq2seq.GreedyEmbeddingHelper. Since we do not know the exact length of the target sentence, we use a heuristic that limits it to a maximum of twice the length of the source sentence:

# Heuristic: decode at most twice the length of the longest source sentence
maximum_iterations = tf.round(tf.reduce_max(source_sequence_length) * 2)

# Helper
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding_decoder,
    tf.fill([batch_size], tgt_sos_id), tgt_eos_id)

# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, maximum_iterations=maximum_iterations)
translations = outputs.sample_id

We can now run the net, giving as input a sentence never seen before (inference_input_file=/tmp/my_infer_file) and letting the network translate it (inference_output_file=/tmp/nmt_model/output_infer):

python -m nmt.nmt \
 --out_dir=/tmp/nmt_model \
 --inference_input_file=/tmp/my_infer_file.vi \
 --inference_output_file=/tmp/nmt_model/output_infer

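How it works

Two RNNs are packed together to form an encoder-decoder RNN network. The decoder emits logits, which are greedily transformed into words of the target language. As an example, an automatic translation from Vietnamese into English is shown here:

Input sentence (Vietnamese source, shown in translation): When I was little, I thought North Korea was the best country in the world, and I often sang the song "We have nothing to envy."
Output sentence translated into English: When I am very good, I will go to learn about the most important thing, and I am not sure what to say.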

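All you need is attention - another example of a seq2seq RNN

In this recipe, we present the attention methodology (Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, ICLR 2015), a state-of-the-art solution for neural machine translation. It consists of adding additional connections between the encoder and the decoder RNNs. Indeed, connecting the decoder only with the last layer of the encoder imposes an information bottleneck and does not necessarily allow the information acquired by the earlier encoder steps to pass through. The solution adopted is illustrated in the following figure:

An example of the attention model for NMT

There are three aspects to consider:

First, the current target hidden state is used together with all the previous source states to derive attention weights, which are used to pay more or less attention to tokens previously seen in the sequence
Second, a context vector is created to summarize the result of the attention weights
Third, the context vector is combined with the current target hidden state to obtain the attention vector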
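The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: a dot-product score between the target state and each source state, a bias-free combination layer W_c, and toy vectors chosen by hand; attention_step is a hypothetical helper and not the tf.contrib.seq2seq implementation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(target_state, source_states, W_c):
    # 1. attention weights: compare the target state with every source state
    weights = softmax([dot(target_state, h_s) for h_s in source_states])
    # 2. context vector: weighted average of the source states
    size = len(target_state)
    context = [sum(w * h_s[k] for w, h_s in zip(weights, source_states))
               for k in range(size)]
    # 3. attention vector: tanh(W_c . [context; target_state])
    combined = context + target_state
    attention_vector = [math.tanh(dot(row, combined)) for row in W_c]
    return weights, context, attention_vector

source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder states
target_state = [0.0, 2.0]                             # current decoder state
W_c = [[1.0, 0.0, 0.0, 0.0],                          # illustrative 2 x 4 weights
       [0.0, 1.0, 0.0, 0.0]]

weights, context, attention_vector = attention_step(
    target_state, source_states, W_c)
print(weights)
print(context)
print(attention_vector)
```

The second and third source states align better with the target state, so they receive equal and larger weights than the first.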

How to do it

We proceed with the recipe as follows:

Define an attention mechanism using the library class tf.contrib.seq2seq.LuongAttention, which implements the attention model defined in Effective Approaches to Attention-based Neural Machine Translation by Minh-Thang Luong, Hieu Pham, and Christopher D. Manning (2015):

# attention_states: [batch_size, max_time, num_units]
attention_states = tf.transpose(encoder_outputs, [1, 0, 2])

# Create an attention mechanism
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    num_units, attention_states,
    memory_sequence_length=source_sequence_length)

Use the defined attention mechanism as a wrapper around the decoder cell by means of an attention wrapper:

decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism,
    attention_layer_size=num_units)

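Run the code to see the results. We immediately notice that the attention mechanism produces a significant improvement in terms of the BLEU score: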

python -m nmt.nmt \
> --attention=scaled_luong \
> --src=vi --tgt=en \
> --vocab_prefix=/tmp/nmt_data/vocab \
> --train_prefix=/tmp/nmt_data/train \
> --dev_prefix=/tmp/nmt_data/tst2012 \
> --test_prefix=/tmp/nmt_data/tst2013 \
> --out_dir=/tmp/nmt_attention_model \
> --num_train_steps=12000 \
> --steps_per_stats=100 \
> --num_layers=2 \
> --num_units=128 \
> --dropout=0.2 \
> --metrics=bleu
[...]
# Start step 0, lr 1, Fri Sep 22 22:49:12 2017
# Init train iterator, skipping 0 elements
global step 100 lr 1 step-time 1.71s wps 3.23K ppl 15193.44 bleu 0.00
[...]
# Final, step 12000 lr 0.98 step-time 1.67 wps 3.37K ppl 14.64, dev ppl 14.01, dev bleu 15.9, test ppl 12.58, test bleu 17.5, Sat Sep 23 04:35:42 2017
# Done training!, time 20790s, Sat Sep 23 04:35:42 2017.
# Start evaluating saved best models.
[..]
loaded infer model parameters from /tmp/nmt_attention_model/best_bleu/translate.ckpt-12000, time 0.06s
# 608
src: nhưng bạn biết điều gì không ?
ref: But you know what ?
nmt: But what do you know ?
[...]
# Best bleu, step 12000 step-time 1.67 wps 3.37K, dev ppl 14.01, dev bleu 15.9, test ppl 12.58, test bleu 17.5, Sat Sep 23 04:36:35 2017

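How it works

Attention is a mechanism that uses the information acquired by the inner states of the encoder RNN and combines it with the final state of the decoder. The key idea is that, in this way, it is possible to pay more or less attention to particular tokens in the source sequence. The gain obtained with attention is shown in the following figure, which plots the BLEU score.

We notice a clear gain with respect to the graph in our first recipe, where no attention was used:

An example of the BLEU metric with attention in TensorBoard

It is worth remembering that seq2seq can be used for more than machine translation. Let's see some examples:

Lukasz Kaiser, in Grammar as a Foreign Language, uses the seq2seq model to build a constituency parser. A constituency parse tree breaks a text into subphrases. Non-terminals in the tree are types of phrases, the terminals are the words in the sentence, and the edges are unlabeled.
Another application of seq2seq is SyntaxNet, aka Parsey McParseface (a syntactic parser), which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in a dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question.

The following figure gives us a good idea of the concept:

An example of SyntaxNet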

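Learning to write as Shakespeare with RNNs

In this recipe, we will learn how to generate text similar to that written by William Shakespeare. The key idea is very simple: we take a real text written by Shakespeare and give it as input to an RNN, which learns the sequences. This learning is then used to generate new text that looks like that written by the greatest writer in the English language.

For the sake of simplicity, we will use the framework TFLearn, which runs on top of TensorFlow. This example is part of the standard distribution and is available online. The model developed is an RNN character-level language model, in which the sequences considered are sequences of characters rather than words.

How to do it

We proceed with the recipe as follows:

Install TFLearn with pip:

pip install -I tflearn

Import a number of useful modules and download an example of text written by Shakespeare. In this case, we use the one available online at the URL shown in the code: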

import os
import pickle
from six.moves import urllib
import tflearn
from tflearn.data_utils import *
path = "shakespeare_input.txt"
char_idx_file = 'char_idx.pickle'
if not os.path.isfile(path): urllib.request.urlretrieve("https://raw.githubusercontent.com/tflearn/tflearn.github.io/master/resources/shakespeare_input.txt", path)

Transform the input text into vectors and return the parsed sequences and targets, along with the associated dictionary, by using textfile_to_semi_redundant_sequences(), which returns a tuple (inputs, targets, dictionary):

maxlen = 25
char_idx = None
if os.path.isfile(char_idx_file):
    print('Loading previous char_idx')
    char_idx = pickle.load(open(char_idx_file, 'rb'))
X, Y, char_idx = \
    textfile_to_semi_redundant_sequences(path, seq_maxlen=maxlen, redun_step=3,
                                         pre_defined_char_idx=char_idx)
pickle.dump(char_idx, open(char_idx_file,'wb'))
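To illustrate what semi-redundant sequences are, the following sketch cuts a text into overlapping windows, each paired with the character that follows it. It is a simplified illustration of the windowing idea only (the one-hot encoding performed by TFLearn is left out), and semi_redundant_sequences is a hypothetical helper, not the TFLearn function:

```python
def semi_redundant_sequences(text, seq_maxlen=25, redun_step=3):
    """Cut text into overlapping windows of seq_maxlen characters,
    each paired with the character that follows it."""
    sequences, next_chars = [], []
    for start in range(0, len(text) - seq_maxlen, redun_step):
        sequences.append(text[start:start + seq_maxlen])
        next_chars.append(text[start + seq_maxlen])
    return sequences, next_chars

text = 'the quick brown fox jumps over the lazy dog'
sequences, next_chars = semi_redundant_sequences(text, seq_maxlen=10,
                                                 redun_step=3)
for sequence, target in list(zip(sequences, next_chars))[:3]:
    print(repr(sequence), '->', repr(target))

# number of windows for a 4,573,338-character corpus, maxlen 25, step 3
n_windows = len(range(0, 4573338 - 25, 3))
print(n_windows)  # 1524438
```

Under this windowing assumption, the last figure matches the Total sequences value reported in the training log for this recipe.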

Define an RNN made up of three LSTMs, each of which has 512 nodes; the first two return the full sequence rather than only the last sequence output. Note that the LSTM modules are connected through dropout modules with a probability of 50%. The last layer is a dense layer applying a softmax with a length equal to the dictionary size. The loss function is categorical_crossentropy and the optimizer is Adam:

g = tflearn.input_data([None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)

Given the network defined in step 4, we can now generate sequences with tflearn.models.generator.SequenceGenerator (network, dictionary=char_idx, seq_maxlen=maxlen, clip_gradients=5.0, and checkpoint_path='model_shakespeare'):

m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_shakespeare')

For 50 iterations, we take a random sequence from the input text and generate a new text. The temperature controls the novelty of the created sequence: a temperature close to 0 produces text that looks like the samples used for training, while the higher the temperature, the greater the novelty:

for i in range(50):
    seed = random_sequence_from_textfile(path, maxlen)
    m.fit(X, Y, validation_set=0.1, batch_size=128,
          n_epoch=1, run_id='shakespeare')
    print("-- TESTING...")
    print("-- Test with temperature of 1.0 --")
    print(m.generate(600, temperature=1.0, seq_seed=seed))
    print("-- Test with temperature of 0.5 --")
    print(m.generate(600, temperature=0.5, seq_seed=seed))
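The effect of temperature can be sketched on a toy distribution. The formulation below, which raises each probability to the power 1/T and renormalizes, is a common one and is an assumption here; TFLearn's internal implementation may differ in detail. The probabilities are illustrative and must all be greater than 0:

```python
import math

def apply_temperature(probabilities, temperature):
    """Re-weight a probability distribution: p_i ** (1 / T), renormalized."""
    logits = [math.log(p) / temperature for p in probabilities]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

predicted = [0.6, 0.3, 0.1]   # illustrative next-character probabilities
for temperature in (1.0, 0.5, 0.1):
    reweighted = apply_temperature(predicted, temperature)
    print(temperature, [round(p, 3) for p in reweighted])
```

At a temperature of 1.0 the distribution is unchanged; as the temperature approaches 0, almost all the probability mass moves to the most likely character, so the generated text becomes more conservative.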

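How it works

When a new unknown or forgotten work of art is to be attributed to an author, there are eminent academics who compare this work with other works by the author. What the academics do is find common patterns in the text sequences of the well-known works, hoping to find similar patterns in the unknown work.

This recipe works in a similar way: an RNN learns what the most peculiar patterns in Shakespeare's writing are, and then these patterns are used to generate new, never-before-seen texts that well represent the style of the greatest English author.

Let's see a few examples of execution: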

python shakespeare.py
Loading previous char_idx
Vectorizing text...
Text total length: 4,573,338
Distinct chars : 67
Total sequences : 1,524,438
---------------------------------
Run id: shakespeare
Log directory: /tmp/tflearn_logs/

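First iteration

Here, the network is learning some basic structure, including the need to have dialogues among fictional characters (DIA, SURYONT, HRNTLGIPRMAR, and ARILEN). However, the English is still very poor, and lots of words are not really English: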

---------------------------------
Training samples: 1371994
Validation samples: 152444
--
Training Step: 10719 | total loss: 2.22092 | time: 22082.057s
| Adam | epoch: 001 | loss: 2.22092 | val_loss: 2.12443 -- iter: 1371994/1371994
-- TESTING...
-- Test with temperature of 1.0 --
'st thou, malice?
If thou caseghough memet oud mame meard'ke. Afs weke wteak, Dy ny wold' as to of my tho gtroy ard has seve, hor then that wordith gole hie, succ, caight fom?
DIA:
A gruos ceen, I peey
by my
Wiouse rat Sebine would.
waw-this afeean.
SURYONT:
Teeve nourterong a oultoncime bucice'is furtutun
Ame my sorivass; a mut my peant?
Am:
Fe, that lercom ther the nome, me, paatuy corns wrazen meas ghomn'ge const pheale,
As yered math thy vans:
I im foat worepoug and thit mije woml!
HRNTLGIPRMAR:
I'd derfomquesf thiy of doed ilasghele hanckol, my corire-hougangle!
Kiguw troll! you eelerd tham my fom Inow lith a
-- Test with temperature of 0.5 --
'st thou, malice?
If thou prall sit I har, with and the sortafe the nothint of the fore the fir with with the ceme at the ind the couther hit yet of the sonsee in solles and that not of hear fore the hath bur.
ARILEN:
More you a to the mare me peod sore,
And fore string the reouck and and fer to the so has the theat end the dore; of mall the sist he the bot courd wite be the thoule the to nenge ape and this not the the ball bool me the some that dears,
The be to the thes the let the with the thear tould fame boors and not to not the deane fere the womour hit muth so thand the e meentt my to the treers and woth and wi

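After a few iterations

Here, the network is starting to learn the proper structure for dialogues, and the written English looks more correct, with sentences such as Well, there shall the things to need the offer to our heart and There is not that be so then to the death To make the body and all the mind: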

---------------------------------
Training samples: 1371994
Validation samples: 152444
--
Training Step: 64314 | total loss: 1.44823 | time: 21842.362s
| Adam | epoch: 006 | loss: 1.44823 | val_loss: 1.40140 -- iter: 1371994/1371994
--
-- Test with temperature of 0.5 --
in this kind.
THESEUS:
There is not that be so then to the death
To make the body and all the mind.
BENEDICK:
Well, there shall the things to need the offer to our heart,
To not are he with him: I have see the hands are to true of him that I am not,
The whom in some the fortunes,
Which she were better not to do him?
KING HENRY VI:
I have some a starter, and and seen the more to be the boy, and be such a plock and love so say, and I will be his entire,
And when my masters are a good virtues,
That see the crown of our worse,
This made a called grace to hear him and an ass,
And the provest and stand,
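
The logs above report samples drawn "with temperature of 0.5". As a minimal sketch of what such a temperature usually means (an assumption about the sampling scheme, not code taken from the recipe), the predicted character probabilities can be rescaled before sampling, so that a lower temperature concentrates probability on the most likely characters. The helper names apply_temperature and sample_char are hypothetical:

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability distribution; a lower temperature sharpens it."""
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_char(probs, temperature, rng):
    """Draw one character index from the temperature-scaled distribution."""
    return rng.choice(len(probs), p=apply_temperature(probs, temperature))

# Toy distribution over three characters (illustrative values only).
probs = [0.5, 0.3, 0.2]
cold = apply_temperature(probs, 0.5)  # sharper than the original
hot = apply_temperature(probs, 2.0)   # flatter than the original
rng = np.random.RandomState(0)
idx = sample_char(probs, 0.5, rng)
```

With a temperature below 1 the generated text tends to be more conservative and repetitive, while a higher temperature gives more varied but more error-prone text.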

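The blog post The Unreasonable Effectiveness of Recurrent Neural Networks describes a fascinating set of examples of RNN character-level language models, including the following:

Shakespeare text generation, similar to this example
Wikipedia text generation, similar to this example but based on a different training text
Algebraic geometry (LaTeX) text generation, similar to this example but based on a different training text
Linux source code text generation, similar to this example but based on a different training text
Baby names text generation, similar to this example but based on a different training text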

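Learning to predict future Bitcoin value with RNNs

In this recipe, we will learn how to predict future Bitcoin value with an RNN. The key idea is that the temporal sequence of values observed in the past is a good predictor of future values. For this recipe, we will use code that is available under the MIT license. The Bitcoin values for a given time interval are downloaded through the CoinDesk API. Here is a piece of the API documentation: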

We offer historical data from our Bitcoin Price Index through the following endpoint: https://api.coindesk.com/v1/bpi/historical/close.json

By default, this will return the previous 31 days' worth of data. This endpoint accepts the following optional parameters:

?index=[USD/CNY] The index to return data for. Defaults to USD.
?currency=<VALUE> The currency to return the data in, specified in ISO 4217 format. Defaults to USD.
?start=<VALUE>&end=<VALUE> Allows data to be returned for a specific date range. Must be listed as a pair of start and end parameters, with dates supplied in the YYYY-MM-DD format, e.g. 2013-09-01 for September 1st, 2013.
?for=yesterday Specifying this will return a single value for the previous day. Overrides the start/end parameter.

Sample Request: https://api.coindesk.com/v1/bpi/historical/close.json?start=2013-09-01&end=2013-09-05

Sample JSON Response: {"bpi":{"2013-09-01":128.2597,"2013-09-02":127.3648,"2013-09-03":127.5915,"2013-09-04":120.5738,"2013-09-05":120.5333},"disclaimer":"This data was produced from the CoinDesk Bitcoin Price Index. BPI value data returned as USD.","time":{"updated":"Sep 6, 2013 00:03:00 UTC","updatedISO":"2013-09-06T00:03:00+00:00"}}
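
To make the documentation above concrete, here is a minimal sketch that builds a request URL from the documented start, end, and currency parameters and extracts the closing prices from the documented sample response. No network request is made, and the helper names build_bpi_url and closing_prices are hypothetical:

```python
import json
from urllib.parse import urlencode

# Endpoint quoted in the API documentation above.
BASE_URL = "https://api.coindesk.com/v1/bpi/historical/close.json"

def build_bpi_url(start, end, currency="USD"):
    """Build a historical-close URL for a YYYY-MM-DD date range."""
    query = urlencode({"start": start, "end": end, "currency": currency})
    return BASE_URL + "?" + query

def closing_prices(response_text):
    """Return the closing prices sorted by date from a JSON response body."""
    data = json.loads(response_text)
    return [price for _, price in sorted(data["bpi"].items())]

# The "bpi" part of the sample response shown in the documentation.
sample = ('{"bpi":{"2013-09-01":128.2597,"2013-09-02":127.3648,'
          '"2013-09-03":127.5915,"2013-09-04":120.5738,"2013-09-05":120.5333}}')

url = build_bpi_url("2013-09-01", "2013-09-05")
prices = closing_prices(sample)
```

Sorting the date keys works because the YYYY-MM-DD format orders lexicographically in the same way as chronologically.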

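How to do it...

Here is how we proceed with the recipe:

Clone the following GitHub repository. It is a project that encourages users to try out the seq2seq neural network architecture: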

git clone https://github.com/guillaume-chevalier/seq2seq-signal-prediction.git

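Given the preceding repository, consider the following functions, which load and normalize the historical Bitcoin data for the USD or EUR Bitcoin value. The functions are defined in datasets.py. The training and test data are split according to the 80/20 rule, so the 20 percent of test data are the most recent historical Bitcoin values. Every example contains 40 data points of USD and then EUR data in the feature axis/dimension. The data is normalized according to the mean and standard deviation. The function generate_x_y_data_v4 generates a random sample of the training data (respectively, the test data) of size batch_size: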

import numpy as np
import requests


def loadCurrency(curr, window_size):
   """
   Return the historical data for the USD or EUR bitcoin value, fetched with a web API call.
   curr = "USD" | "EUR"
   """
   # For more info on the URL call, it is inspired by :
   # https://github.com/Levino/coindesk-api-node
   r = requests.get(
       "http://api.coindesk.com/v1/bpi/historical/close.json?start=2010-07-17&end=2017-03-03&currency={}".format(
           curr
       )
   )
   data = r.json()
   time_to_values = sorted(data["bpi"].items())
   values = [val for key, val in time_to_values]
   kept_values = values[1000:]
   X = []
   Y = []
   for i in range(len(kept_values) - window_size * 2):
       X.append(kept_values[i:i + window_size])
       Y.append(kept_values[i + window_size:i + window_size * 2])
   # To be able to concat on inner dimension later on:
   X = np.expand_dims(X, axis=2)
   Y = np.expand_dims(Y, axis=2)
   return X, Y

def normalize(X, Y=None):
   """
   Normalise X and Y according to the mean and standard
   deviation of the X values only.
   """
   # # It would be possible to normalize with last rather than mean, such as:
   # lasts = np.expand_dims(X[:, -1, :], axis=1)
   # assert (lasts[:, :] == X[:, -1, :]).all(), "{}, {}, {}. {}".format(lasts[:, :].shape, X[:, -1, :].shape, lasts[:, :], X[:, -1, :])
   mean = np.expand_dims(np.average(X, axis=1) + 0.00001, axis=1)
   stddev = np.expand_dims(np.std(X, axis=1) + 0.00001, axis=1)
   # print (mean.shape, stddev.shape)
   # print (X.shape, Y.shape)
   X = X - mean
   X = X / (2.5 * stddev)
   if Y is not None:
       assert Y.shape == X.shape, (Y.shape, X.shape)
       Y = Y - mean
       Y = Y / (2.5 * stddev)
       return X, Y
   return X

def fetch_batch_size_random(X, Y, batch_size):
   """
   Returns randomly an aligned batch_size of X and Y among all examples.
   The external dimension of X and Y must be the batch size
   (eg: 1 column = 1 example).
   X and Y can be N-dimensional.
   """
   assert X.shape == Y.shape, (X.shape, Y.shape)
   idxes = np.random.randint(X.shape[0], size=batch_size)
   X_out = np.array(X[idxes]).transpose((1, 0, 2))
   Y_out = np.array(Y[idxes]).transpose((1, 0, 2))
   return X_out, Y_out
X_train = []
Y_train = []
X_test = []
Y_test = []

def generate_x_y_data_v4(isTrain, batch_size):
   """
   Return financial data for the bitcoin.
   Features are USD and EUR, in the internal dimension.
   We normalize X and Y data according to the X only to not
   spoil the predictions we ask for.
   For every window (window or seq_length), Y is the prediction following X.
   Train and test data are separated according to the 80/20 rule.
   Therefore, the 20 percent of the test data are the most
   recent historical bitcoin values. Every example in X contains
   40 points of USD and then EUR data in the feature axis/dimension.
   It is to be noted that the returned X and Y have the same shape
   and are in a tuple.
   """
   # 40 past values for encoder, 40 after for decoder's predictions.
   seq_length = 40
   global Y_train
   global X_train
   global X_test
   global Y_test
   # First load, with memoization:
   if len(Y_test) == 0:
       # API call:
       X_usd, Y_usd = loadCurrency("USD", window_size=seq_length)
       X_eur, Y_eur = loadCurrency("EUR", window_size=seq_length)
       # All data, aligned:
       X = np.concatenate((X_usd, X_eur), axis=2)
       Y = np.concatenate((Y_usd, Y_eur), axis=2)
       X, Y = normalize(X, Y)
       # Split 80-20:
       X_train = X[:int(len(X) * 0.8)]
       Y_train = Y[:int(len(Y) * 0.8)]
       X_test = X[int(len(X) * 0.8):]
       Y_test = Y[int(len(Y) * 0.8):]
   if isTrain:
       return fetch_batch_size_random(X_train, Y_train, batch_size)
   else:
       return fetch_batch_size_random(X_test,  Y_test,  batch_size)
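
The windowing, normalization, and 80/20 split used above can be sketched on a toy series without any API call. This is a simplified illustration under assumed data (a ramp of 20 values and a window of 4 instead of 40), and the helper names make_windows and normalize_by_x are hypothetical:

```python
import numpy as np

def make_windows(values, window_size):
    """Pair each window of past values (X) with the window that follows it (Y)."""
    X, Y = [], []
    for i in range(len(values) - window_size * 2):
        X.append(values[i:i + window_size])
        Y.append(values[i + window_size:i + window_size * 2])
    # Add a trailing feature axis: (examples, window_size, 1).
    return np.expand_dims(X, axis=2), np.expand_dims(Y, axis=2)

def normalize_by_x(X, Y):
    """Normalize X and Y with the mean and standard deviation of X only."""
    mean = np.expand_dims(np.average(X, axis=1) + 0.00001, axis=1)
    stddev = np.expand_dims(np.std(X, axis=1) + 0.00001, axis=1)
    return (X - mean) / (2.5 * stddev), (Y - mean) / (2.5 * stddev)

series = np.arange(20, dtype=np.float64)  # toy stand-in for the price history
X, Y = make_windows(series, window_size=4)
X_norm, Y_norm = normalize_by_x(X, Y)

# Chronological 80/20 split: the most recent windows go to the test set.
split = int(len(X) * 0.8)
X_train, X_test = X_norm[:split], X_norm[split:]
```

Normalizing Y with the statistics of X keeps information about the future values out of the model inputs, which is the reason given in the docstring of generate_x_y_data_v4.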

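Generate the training, validation, and test data, and define a number of hyperparameters such as batch_size, hidden_dim (the number of hidden neurons in the RNN), and layers_stacked_count (the number of stacked recurrent cells). In addition, define a number of parameters for fine-tuning the optimizer, such as the optimizer's learning rate, the number of iterations, lr_decay for the optimizer's simulated annealing, the optimizer's momentum, and L2 regularization to avoid overfitting. Note that the GitHub repository has the default batch_size = 5 and nb_iters = 150, but better results were obtained with batch_size = 1000 and nb_iters = 100000: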

from datasets import generate_x_y_data_v4
generate_x_y_data = generate_x_y_data_v4
import tensorflow as tf  
import numpy as np
import matplotlib.pyplot as plt
# Jupyter notebook magic; drop the next line when running as a plain script.
%matplotlib inline
sample_x, sample_y = generate_x_y_data(isTrain=True, batch_size=3)
print("Dimensions of the dataset for 3 X and 3 Y training examples : ")
print(sample_x.shape)
print(sample_y.shape)
print("(seq_length, batch_size, output_dim)")
print(sample_x, sample_y)
# Internal neural network parameters
seq_length = sample_x.shape[0]  # Time series will have the same past and future (to be predicted) length.
batch_size = 5  # Low value used for live demo purposes - 100 and 1000 would be possible too, crank that up!
output_dim = input_dim = sample_x.shape[-1]  # Output dimension (e.g.: multiple signals at once, tied in time)
hidden_dim = 12  # Count of hidden neurons in the recurrent units.
layers_stacked_count = 2  # Number of stacked recurrent cells, on the neural depth axis.
# Optimizer:
learning_rate = 0.007  # Small lr helps not to diverge during training.
nb_iters = 150  # How many times we perform a training step (therefore how many times we show a batch).
lr_decay = 0.92  # default: 0.9 . Simulated annealing.
momentum = 0.5  # default: 0.0 . Momentum technique in weights update
lambda_l2_reg = 0.003  # L2 regularization of weights - avoids overfitting

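Define the network as an encoder/decoder made up of basic GRU cells. The network is made up of layers_stacked_count=2 RNNs, and we will visualize it with TensorBoard. Note that hidden_dim = 12 is the number of hidden neurons in the recurrent units: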

tf.nn.seq2seq = tf.contrib.legacy_seq2seq
tf.nn.rnn_cell = tf.contrib.rnn
tf.nn.rnn_cell.GRUCell = tf.contrib.rnn.GRUCell
tf.reset_default_graph()
# sess.close()
sess = tf.InteractiveSession()
with tf.variable_scope('Seq2seq'):
   # Encoder: inputs
   enc_inp = [
       tf.placeholder(tf.float32, shape=(None, input_dim), name="inp_{}".format(t))
          for t in range(seq_length)
   ]
   # Decoder: expected outputs
   expected_sparse_output = [
       tf.placeholder(tf.float32, shape=(None, output_dim), name="expected_sparse_output_{}".format(t))
         for t in range(seq_length)
   ]

   # Give a "GO" token to the decoder.
   # You might want to revise what is the appended value "+ enc_inp[:-1]".
   dec_inp = [ tf.zeros_like(enc_inp[0], dtype=np.float32, name="GO") ] + enc_inp[:-1]
   # Create a `layers_stacked_count` of stacked RNNs (GRU cells here).
   cells = []
   for i in range(layers_stacked_count):
       with tf.variable_scope('RNN_{}'.format(i)):
           cells.append(tf.nn.rnn_cell.GRUCell(hidden_dim))
           # cells.append(tf.nn.rnn_cell.BasicLSTMCell(...))
   cell = tf.nn.rnn_cell.MultiRNNCell(cells)   
   # For reshaping the input and output dimensions of the seq2seq RNN:
   w_in = tf.Variable(tf.random_normal([input_dim, hidden_dim]))
   b_in = tf.Variable(tf.random_normal([hidden_dim], mean=1.0))
   w_out = tf.Variable(tf.random_normal([hidden_dim, output_dim]))
   b_out = tf.Variable(tf.random_normal([output_dim]))   
   # Note: reshaped_inputs is computed here, but enc_inp is what is passed to the seq2seq call below.
   reshaped_inputs = [tf.nn.relu(tf.matmul(i, w_in) + b_in) for i in enc_inp]
   # Here, the encoder and the decoder use the same cell, HOWEVER,
   # the weights aren't shared among the encoder and decoder, we have two
   # sets of weights created under the hood according to that function's def.

   dec_outputs, dec_memory = tf.nn.seq2seq.basic_rnn_seq2seq(
       enc_inp,
       dec_inp,
       cell
   )

   output_scale_factor = tf.Variable(1.0, name="Output_ScaleFactor")
   # Final outputs: with linear rescaling similar to batch norm,
   # but without the "norm" part of batch normalization hehe.
   reshaped_outputs = [output_scale_factor*(tf.matmul(i, w_out) + b_out) for i in dec_outputs]
   # Merge all the summaries and write them out to /tmp/bitcoin_logs (by default)
   merged = tf.summary.merge_all()
   train_writer = tf.summary.FileWriter('/tmp/bitcoin_logs', sess.graph)
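
To show what the encoder/decoder above computes, here is a minimal NumPy sketch of the same data flow. It is an illustration under simplifying assumptions, not the TensorFlow implementation: it uses a single GRU layer per side instead of two stacked layers, random untrained weights, and hypothetical helper names (init_gru, gru_step, encode_decode). As in the recipe, the encoder and decoder have separate weights, and the decoder is fed a zero "GO" input followed by the encoder inputs shifted by one step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Random parameters for the update (z), reset (r), and candidate (c) parts."""
    params = {}
    for gate in ("z", "r", "c"):
        params["W" + gate] = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
        params["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params["b" + gate] = np.zeros(hidden_dim)
    return params

def gru_step(p, x, h):
    """One GRU update: gates decide how much of the old state to keep."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])  # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])  # reset gate
    c = np.tanh(x @ p["Wc"] + (r * h) @ p["Uc"] + p["bc"])  # candidate state
    return z * h + (1.0 - z) * c

def encode_decode(enc_params, dec_params, enc_inputs):
    """Run the encoder over the inputs, then unroll the decoder from its state."""
    batch = enc_inputs[0].shape[0]
    hidden_dim = enc_params["bz"].shape[0]
    h = np.zeros((batch, hidden_dim))
    for x in enc_inputs:
        h = gru_step(enc_params, x, h)
    # "GO" token (zeros) followed by the encoder inputs shifted by one step.
    dec_inputs = [np.zeros_like(enc_inputs[0])] + enc_inputs[:-1]
    outputs = []
    for x in dec_inputs:
        h = gru_step(dec_params, x, h)
        outputs.append(h)
    return outputs

rng = np.random.RandomState(0)
seq_length, batch, input_dim, hidden_dim = 40, 3, 2, 12
enc_inputs = [rng.normal(size=(batch, input_dim)) for _ in range(seq_length)]
enc_params = init_gru(input_dim, hidden_dim, rng)
dec_params = init_gru(input_dim, hidden_dim, rng)
outputs = encode_decode(enc_params, dec_params, enc_inputs)
```

The sketch returns the decoder hidden states, one per time step. In the recipe these are then projected to output_dim with w_out and b_out and multiplied by output_scale_factor.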

Now let's run TensorBoard and visualize the network, which consists of an RNN encoder and an RNN decoder:

tensorboard --logdir=/tmp/bitcoin_logs

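The following is the flow of the code:

An example of the code for Bitcoin value prediction as seen in TensorBoard

Now let's define the loss function as an L2 loss with regularization, to avoid overfitting and to generalize better. The chosen optimizer is RMSprop, with the values of learning_rate, decay, and momentum as defined in step 3: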

# Training loss and optimizer
with tf.variable_scope('Loss'):
   # L2 loss
   output_loss = 0
   for _y, _Y in zip(reshaped_outputs, expected_sparse_output):
       output_loss += tf.reduce_mean(tf.nn.l2_loss(_y - _Y))       
   # L2 regularization (to avoid overfitting and to have a better generalization capacity)
   reg_loss = 0
   for tf_var in tf.trainable_variables():
       if not ("Bias" in tf_var.name or "Output_" in tf_var.name):
           reg_loss += tf.reduce_mean(tf.nn.l2_loss(tf_var))

   loss = output_loss + lambda_l2_reg * reg_loss

with tf.variable_scope('Optimizer'):
   optimizer = tf.train.RMSPropOptimizer(learning_rate, decay=lr_decay, momentum=momentum)
   train_op = optimizer.minimize(loss)
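
The loss and the optimizer step above can be sketched in NumPy on a toy problem. The update rule follows the one documented for RMSProp with momentum in TensorFlow 1.x, where decay is the discount factor of the moving average of squared gradients. Starting the accumulators at zero, the toy target, and the function names are assumptions made for this illustration:

```python
import numpy as np

def l2_loss(t):
    """Same definition as tf.nn.l2_loss: sum(t ** 2) / 2."""
    return np.sum(np.square(t)) / 2.0

def rmsprop_step(var, grad, mean_square, mom,
                 learning_rate=0.007, decay=0.92, momentum=0.5, epsilon=1e-10):
    """One RMSProp update with momentum; returns the new variable and slots."""
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    mom = momentum * mom + learning_rate * grad / np.sqrt(mean_square + epsilon)
    return var - mom, mean_square, mom

lambda_l2_reg = 0.003
target = np.array([1.0, -2.0])  # toy "expected output"
w = np.array([5.0, 5.0])        # toy weights, also used as the prediction
mean_square = np.zeros_like(w)  # assumption: accumulators start at zero
mom = np.zeros_like(w)

def total_loss(w):
    # Output loss plus the weighted regularization term, as in the recipe.
    return l2_loss(w - target) + lambda_l2_reg * l2_loss(w)

initial = total_loss(w)
for _ in range(200):
    # Analytic gradient of total_loss with respect to w.
    grad = (w - target) + lambda_l2_reg * w
    w, mean_square, mom = rmsprop_step(w, grad, mean_square, mom)
final = total_loss(w)
```

Because each gradient is divided by the running root mean square of recent gradients, the step size stays close to learning_rate regardless of the gradient scale, which is why a small value such as 0.007 moves the weights slowly but steadily.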

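Prepare for batch training by generating the training data and running the optimizer on batch_size examples from the dataset. Similarly, prepare for testing by generating test data from batch_size examples in the dataset. Training runs for nb_iters+1 iterations, and one iteration in every ten is used to test the results: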

def train_batch(batch_size):
   """
   Training step that optimizes the weights
   provided some batch_size X and Y examples from the dataset.
   """
   X, Y = generate_x_y_data(isTrain=True, batch_size=batch_size)
   feed_dict = {enc_inp[t]: X[t] for t in range(len(enc_inp))}
   feed_dict.update({expected_sparse_output[t]: Y[t] for t in range(len(expected_sparse_output))})
   _, loss_t = sess.run([train_op, loss], feed_dict)
   return loss_t

def test_batch(batch_size):
   """
   Test step, does NOT optimize. Weights are frozen by not
   doing sess.run on the train_op.
   """
   X, Y = generate_x_y_data(isTrain=False, batch_size=batch_size)
   feed_dict = {enc_inp[t]: X[t] for t in range(len(enc_inp))}
   feed_dict.update({expected_sparse_output[t]: Y[t] for t in range(len(expected_sparse_output))})
   loss_t = sess.run([loss], feed_dict)
   return loss_t[0]

# Training
train_losses = []
test_losses = []
sess.run(tf.global_variables_initializer())

for t in range(nb_iters+1):
   train_loss = train_batch(batch_size)
   train_losses.append(train_loss)   
   if t % 10 == 0:
       # Tester
       test_loss = test_batch(batch_size)
       test_losses.append(test_loss)
       print("Step {}/{}, train loss: {}, \tTEST loss: {}".format(t, nb_iters, train_loss, test_loss))
print("Fin. train loss: {}, \tTEST loss: {}".format(train_loss, test_loss))
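
Since the loop above records a training loss at every iteration but a test loss only at every tenth one, the two lists have different lengths. A small sketch of the step indices involved (the name test_every is hypothetical) shows how they line up, for example when plotting both curves on the same axis:

```python
nb_iters = 150
test_every = 10  # hypothetical name for the evaluation interval used in the loop

# One training loss per iteration, including iteration 0 and iteration nb_iters.
train_steps = list(range(nb_iters + 1))
# One test loss whenever t % 10 == 0.
test_steps = [t for t in train_steps if t % test_every == 0]
```

The values in test_steps can serve as x coordinates for test_losses, while train_losses can be plotted against train_steps.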

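Visualize the nb_predictions results. We will visualize nb_predictions = 5 predictions in yellow and the actual values in blue, marked with an x. Note that the prediction starts at the last blue dot in the plot and that, visually, you can observe that even this simple model is fairly accurate: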

# Test
nb_predictions = 5
print("Let's visualize {} predictions with our signals:".format(nb_predictions))
X, Y = generate_x_y_data(isTrain=False, batch_size=nb_predictions)
feed_dict = {enc_inp[t]: X[t] for t in range(seq_length)}
outputs = np.array(sess.run([reshaped_outputs], feed_dict)[0])
for j in range(nb_predictions):
   plt.figure(figsize=(12, 3))   
   for k in range(output_dim):
       past = X[:,j,k]
       expected = Y[:,j,k]
       pred = outputs[:,j,k]       
       label1 = "Seen (past) values" if k==0 else "_nolegend_"
       label2 = "True future values" if k==0 else "_nolegend_"
       label3 = "Predictions" if k==0 else "_nolegend_"
       plt.plot(range(len(past)), past, "o--b", label=label1)
       plt.plot(range(len(past), len(expected)+len(past)), expected, "x--b", label=label2)
       plt.plot(range(len(past), len(pred)+len(past)), pred, "o--y", label=label3)   
   plt.legend(loc='best')
   plt.title("Predictions v.s. true values")
   plt.show()
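
The plots give a visual impression only. As a complement, a per-example error can be computed from arrays shaped like outputs and Y above, that is (seq_length, batch_size, output_dim). This is a sketch with synthetic data, not a result from the recipe, and the function name per_example_rmse is hypothetical:

```python
import numpy as np

def per_example_rmse(pred, expected):
    """Root mean squared error per example, averaged over time and features."""
    pred = np.asarray(pred, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    return np.sqrt(np.mean((pred - expected) ** 2, axis=(0, 2)))

# Synthetic stand-ins for Y and outputs: (seq_length, batch_size, output_dim).
rng = np.random.RandomState(0)
expected = rng.normal(size=(40, 5, 2))
pred = expected + 0.1  # a constant offset, so every example has the same error
rmse = per_example_rmse(pred, expected)
```

Note that the error is measured in normalized units, because both the predictions and the expected values have been rescaled by the statistics of each input window.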

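We get the following results:

An example of Bitcoin value prediction

How it works...

An encoder-decoder stacked RNN with basic GRU cells is used to predict the Bitcoin value. RNNs are very good at learning sequences, and indeed the Bitcoin prediction is reasonably accurate even with a simple model based on 2 layers and 12 GRU units. Of course, this prediction code is not an encouragement to invest in Bitcoin but only a way to discuss a deep learning methodology. Moreover, more experiments are needed to verify whether or not we are overfitting the data.

Predicting stock market values is a nice RNN application, and there are a number of handy packages available, for instance:

Drnns-prediction implements deep RNNs using the Keras neural networks library on the Daily News for Stock Market dataset from Kaggle. The task for this dataset is to predict the future movement of the DJIA using the current and previous day's news headlines as features. The code is open source.
Michael Luk wrote an interesting blog post on how to predict Coca-Cola stock volume based on RNNs.
Jakob Aungiers wrote another interesting blog post, LSTM Neural Network for Time Series Prediction.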

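Many-to-one and many-to-many RNN examples

In this recipe, we summarize what has been discussed about RNNs by providing various examples of RNN mappings. To keep things simple, we will adopt Keras and demonstrate how to write one-to-one, one-to-many, many-to-one, and many-to-many mappings, as represented in the following figure:

An example of RNN sequences

How to do it...

We proceed with the recipe as follows:

If you want to create a one-to-one mapping, this is not an RNN but a dense layer. Assume that a model is already defined and that you want to add a dense network. This is easily implemented in Keras: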

from keras.models import Sequential
from keras.layers import Dense, LSTM, RepeatVector
model = Sequential()
model.add(Dense(output_size, input_shape=input_shape))

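If you want to create a one-to-many mapping, this can be achieved with RepeatVector(...). Note that return_sequences is a Boolean that decides whether to return only the last output of the output sequence or the full sequence: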

model = Sequential()
model.add(RepeatVector(number_of_times,input_shape=input_shape)) 
model.add(LSTM(output_size, return_sequences=True))

If you want to create a many-to-one mapping, this can be achieved with the following LSTM code snippet:

model = Sequential()
model.add(LSTM(1, input_shape=(timesteps, data_dim)))

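If you want to create a many-to-many mapping, this can be achieved with the following LSTM code snippet when the lengths of the input and the output match the number of recurrent steps: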

model = Sequential() 
model.add(LSTM(1, input_shape=(timesteps, data_dim), return_sequences=True))
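
The difference between these mappings is mostly a matter of tensor shapes. The following NumPy sketch mimics the shape behaviour of RepeatVector and of return_sequences, using a plain tanh RNN instead of an LSTM to keep it short. It does not use Keras, and the function names repeat_vector and simple_rnn are hypothetical:

```python
import numpy as np

def repeat_vector(x, n):
    """(batch, features) -> (batch, n, features), like Keras RepeatVector."""
    return np.repeat(x[:, np.newaxis, :], n, axis=1)

def simple_rnn(x, W, U, return_sequences=False):
    """Plain tanh RNN over an input of shape (batch, timesteps, features)."""
    batch, timesteps, _ = x.shape
    h = np.zeros((batch, U.shape[0]))
    outputs = []
    for t in range(timesteps):
        h = np.tanh(x[:, t, :] @ W + h @ U)
        outputs.append(h)
    # Either every step's output or only the last one.
    return np.stack(outputs, axis=1) if return_sequences else h

rng = np.random.RandomState(0)
batch, timesteps, data_dim, units = 4, 6, 3, 1
W = rng.normal(size=(data_dim, units))
U = rng.normal(size=(units, units))

# One-to-many: a single vector is repeated, then the full sequence is returned.
single = rng.normal(size=(batch, data_dim))
one_to_many = simple_rnn(repeat_vector(single, timesteps), W, U, return_sequences=True)

# Many-to-one and many-to-many differ only in return_sequences.
seq = rng.normal(size=(batch, timesteps, data_dim))
many_to_one = simple_rnn(seq, W, U)
many_to_many = simple_rnn(seq, W, U, return_sequences=True)
```

Here one_to_many and many_to_many have shape (batch, timesteps, units), while many_to_one keeps only the last step and has shape (batch, units).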

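How it works...

Keras lets you easily write RNNs of various shapes, including one-to-one, one-to-many, many-to-one, and many-to-many mappings. The preceding examples show how easy it is to implement them with Keras.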