TensorFlow 2.0 Computer Vision Cookbook (Part 3)

Original: annas-archive.org/md5/cf3ce16c27a13f4ce55f8e29a1bf85e1

Translator: 飞龙

License: CC BY-NC-SA 4.0

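Chapter 6: Generative Models and Adversarial Attacks

Being able to tell two or more classes apart is certainly impressive, and a healthy sign that deep neural networks do, in fact, learn.

But if traditional classification is impressive, producing new content is staggering! It demands a much deeper understanding of the domain. So, are there neural networks capable of such a feat? There sure are!

In this chapter, we'll study one of the most captivating and promising kinds of neural networks: Generative Adversarial Networks (GANs). As the name implies, these are actually a system of two sub-networks: the generator and the discriminator. The generator's job is to produce images good enough to pass for samples from the original distribution (although they are generated from scratch), thereby fooling the discriminator, whose task is to tell real images from fake ones.

GANs are at the cutting edge of areas such as semi-supervised learning and image-to-image translation, both of which we'll cover in this chapter. As a complement, the last recipe shows how to perform an adversarial attack on a network using the Fast Gradient Signed Method (FGSM).

The recipes that we'll cover in this chapter are as follows:

Implementing a deep convolutional GAN
Using a DCGAN for semi-supervised learning
Translating images with Pix2Pix
Translating unpaired images with CycleGAN
Implementing an adversarial attack using the Fast Gradient Signed Method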

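Technical requirements

GANs are great, but they are also very demanding in terms of computing power. A GPU is therefore a must to work through these recipes (and even then, most of them take several hours to run). In the Getting ready section of each recipe, you'll find the preparatory steps it needs, if any. The code for this chapter is available here: github.com/PacktPublishing/Tensorflow-2.0-Computer-Vision-Cookbook/tree/master/ch6.

Check out the following link to see the Code in Action video: bit.ly/35Z8IYn.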

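Implementing a deep convolutional GAN

In a GAN, the generator has to produce an image from a seed, which is just a vector of Gaussian noise.

In this recipe, we'll implement a Deep Convolutional GAN (DCGAN) to produce images in the style of EMNIST, a dataset that extends the original MNIST digits (0 to 9) with uppercase and lowercase handwritten letters.

Let's begin!

Getting ready

We need to install tensorflow-datasets to access EMNIST more easily. Also, to display a nice progress bar while training our GAN, we'll use tqdm.

Both dependencies can be installed as follows:

$> pip install tensorflow-datasets tqdm

We're good to go!

How to do it…

Perform the following steps to implement a DCGAN on EMNIST:

Import the necessary dependencies: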

import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import *
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm

Define an alias for the AUTOTUNE setting, which we'll use later to let TensorFlow decide the number of parallel calls when processing the dataset:
```
AUTOTUNE = tf.data.experimental.AUTOTUNE
```

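Define a DCGAN() class to encapsulate our implementation. The constructor creates the discriminator, the generator, the loss function, and one optimizer for each sub-network: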

class DCGAN(object):
    def __init__(self):
        self.loss = BinaryCrossentropy(from_logits=True)
        self.generator = self.create_generator()
        self.discriminator = self.create_discriminator()
        self.generator_opt = Adam(learning_rate=1e-4)
        self.discriminator_opt = Adam(learning_rate=1e-4)

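Define a static method that creates the generator network. It reconstructs a 28x28x1 image from a 100-element input tensor. Notice the use of transposed convolutions (Conv2DTranspose) to expand the output volume as we go deeper into the network, while the number of filters decreases. Also notice that the activation is 'tanh', which means the outputs will be in the range [-1, 1]:

    @staticmethod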
    def create_generator(alpha=0.2):
        input = Input(shape=(100,))
        x = Dense(units=7 * 7 * 256, 
                 use_bias=False)(input)
        x = LeakyReLU(alpha=alpha)(x)
        x = BatchNormalization()(x)
        x = Reshape((7, 7, 256))(x)

Add the first transposed convolution block, with 128 filters:

        x = Conv2DTranspose(filters=128,
                            strides=(1, 1),
                            kernel_size=(5, 5),
                            padding='same',
                            use_bias=False)(x)
        x = LeakyReLU(alpha=alpha)(x)
        x = BatchNormalization()(x)

Create the second transposed convolution block, with 64 filters:

        x = Conv2DTranspose(filters=64,
                            strides=(2, 2),
                            kernel_size=(5, 5),
                            padding='same',
                            use_bias=False)(x)
        x = LeakyReLU(alpha=alpha)(x)
        x = BatchNormalization()(x)

Add the last transposed convolution block, with a single filter, which corresponds to the output of the network:

        x = Conv2DTranspose(filters=1,
                            strides=(2, 2),
                            kernel_size=(5, 5),
                            padding='same',
                            use_bias=False)(x)
        output = Activation('tanh')(x)
        return Model(input, output)
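
To see why the generator ends up with a 28x28x1 output, it helps to trace the spatial size through each block. The following is a minimal sketch in plain Python (no TensorFlow required). The helper name transposed_conv_same_size is hypothetical, and the sketch assumes the rule that a transposed convolution with padding='same' multiplies the height and width by the stride:

```python
def transposed_conv_same_size(size, stride):
    """Spatial size after a transposed convolution with padding='same'.

    Assumption: with 'same' padding the output size is input size * stride,
    regardless of the kernel size.
    """
    return size * stride


# Strides of the three Conv2DTranspose blocks in create_generator().
strides = [1, 2, 2]

size = 7  # the Dense output is reshaped to 7x7x256
sizes = [size]
for stride in strides:
    size = transposed_conv_same_size(size, stride)
    sizes.append(size)

print(sizes)  # [7, 7, 14, 28]
```

The first block keeps the 7x7 grid and only reduces the number of channels, while the two blocks with a stride of 2 double the resolution twice.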

Define a static method to create the discriminator. This architecture is a regular CNN:

    @staticmethod
    def create_discriminator(alpha=0.2, dropout=0.3):
        input = Input(shape=(28, 28, 1))
        x = Conv2D(filters=64,
                   kernel_size=(5, 5),
                   strides=(2, 2),
                   padding='same')(input)
        x = LeakyReLU(alpha=alpha)(x)
        x = Dropout(rate=dropout)(x)
        x = Conv2D(filters=128,
                   kernel_size=(5, 5),
                   strides=(2, 2),
                   padding='same')(x)
        x = LeakyReLU(alpha=alpha)(x)
        x = Dropout(rate=dropout)(x)
        x = Flatten()(x)
        output = Dense(units=1)(x)
        return Model(input, output)

Define a method to calculate the discriminator's loss, which is the sum of the loss on real images and the loss on fake images:

    def discriminator_loss(self, real, fake):
        real_loss = self.loss(tf.ones_like(real), real)
        fake_loss = self.loss(tf.zeros_like(fake), fake)
        return real_loss + fake_loss

Define a method to calculate the generator's loss:

    def generator_loss(self, fake):
        return self.loss(tf.ones_like(fake), fake)
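
The two loss methods can be illustrated without TensorFlow. The sketch below re-implements binary cross-entropy on raw logits with the standard library only; the logit values are made up for illustration, and the helpers bce_with_logits and mean_bce are hypothetical names, not part of the recipe:

```python
import math


def bce_with_logits(label, logit):
    """Binary cross-entropy for one example, given a raw (pre-sigmoid) logit."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def mean_bce(label, logits):
    """Average the per-example loss over a batch that shares one label."""
    return sum(bce_with_logits(label, z) for z in logits) / len(logits)


# Hypothetical discriminator logits: positive means "looks real".
real_logits = [2.0, 1.5]   # the discriminator is fairly sure these are real
fake_logits = [-1.0, 0.5]  # it is fooled by the second fake

# Discriminator: real images should score 1, fake images should score 0.
disc_loss = mean_bce(1.0, real_logits) + mean_bce(0.0, fake_logits)
# Generator: it wants the discriminator to score its fakes as 1.
gen_loss = mean_bce(1.0, fake_logits)

print(f'discriminator loss: {disc_loss:.4f}')
print(f'generator loss:     {gen_loss:.4f}')
```

Note that the same fake logits appear in both losses with opposite labels, which is what makes the two networks adversaries.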

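Define a method to perform a single training step. We'll start by generating a batch of random Gaussian noise vectors (note that noise_dimension is a module-level variable defined in a later step, so it must exist before train_step() is called for the first time):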

    @tf.function
    def train_step(self, images, batch_size):
        noise = tf.random.normal((batch_size,noise_dimension))

Next, pass the random noise to the generator to produce fake images:

        with tf.GradientTape() as gen_tape, \
                tf.GradientTape() as dis_tape:
            generated_images = self.generator(noise,
                                        training=True)

Pass the real and fake images to the discriminator, and compute the losses of both sub-networks:

            real = self.discriminator(images, 
                                      training=True)
            fake = self.discriminator(generated_images,
                                      training=True)
            gen_loss = self.generator_loss(fake)
            disc_loss = self.discriminator_loss(real, 
                                               fake)

Compute the gradients:

        generator_grad = gen_tape \
            .gradient(gen_loss,
                      self.generator.trainable_variables)
        discriminator_grad = dis_tape \
            .gradient(disc_loss,
                      self.discriminator.trainable_variables)

Next, apply the gradients using the respective optimizers:

        opt_args = zip(generator_grad,
                       self.generator.trainable_variables)
        self.generator_opt.apply_gradients(opt_args)
        opt_args = zip(discriminator_grad,
                       self.discriminator.trainable_variables)
        self.discriminator_opt.apply_gradients(opt_args)

Finally, define a method to train the whole architecture. Every 10 epochs, we'll plot the images produced by the generator in order to visually assess their quality:

    def train(self, dataset, test_seed, epochs,
              batch_size):
        for epoch in tqdm(range(epochs)):
            for image_batch in dataset:
                self.train_step(image_batch,
                                batch_size)
            if epoch == 0 or epoch % 10 == 0:
                generate_and_save_images(self.generator,
                                         epoch,
                                         test_seed)

Define a function that produces new images and saves a 4x4 mosaic of them to disk:

def generate_and_save_images(model, epoch, test_input):
    predictions = model(test_input, training=False)
    plt.figure(figsize=(4, 4))
    for i in range(predictions.shape[0]):
        plt.subplot(4, 4, i + 1)
        image = predictions[i, :, :, 0] * 127.5 + 127.5
        image = tf.cast(image, tf.uint8)
        plt.imshow(image, cmap='gray')
        plt.axis('off')
    plt.savefig(f'{epoch}.png')
    plt.show()

Define a function to scale the images coming from the EMNIST dataset to the [-1, 1] interval:

def process_image(input):
    image = tf.cast(input['image'], tf.float32)
    image = (image - 127.5) / 127.5
    return image
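
The scaling in process_image() and the rescaling in generate_and_save_images() are inverses of each other. A minimal sketch in plain Python makes the round trip explicit (the helper names to_tanh_range and to_pixel_range are hypothetical and only used here):

```python
def to_tanh_range(pixel):
    """Map a pixel from [0, 255] to [-1, 1], as process_image() does."""
    return (pixel - 127.5) / 127.5


def to_pixel_range(value):
    """Map a value from [-1, 1] back to [0, 255], as the plotting code does."""
    return value * 127.5 + 127.5


for pixel in (0, 64, 255):
    scaled = to_tanh_range(pixel)
    restored = to_pixel_range(scaled)
    print(pixel, round(scaled, 4), restored)
```

Keeping the training data in the same range as the generator's 'tanh' output means real and generated images reach the discriminator on the same scale.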

Load the EMNIST dataset using tfds. We'll only use the 'train' split, which contains more than 600,000 images. We'll also make sure to scale each image to the range of 'tanh':

BUFFER_SIZE = 1000
BATCH_SIZE = 512
train_dataset = (tfds
                 .load('emnist', split='train')
                 .map(process_image,
                      num_parallel_calls=AUTOTUNE)
                 .shuffle(BUFFER_SIZE)
                 .batch(BATCH_SIZE))

Create a test seed that will be used to generate images throughout the training of the DCGAN:

noise_dimension = 100
num_examples_to_generate = 16
seed_shape = (num_examples_to_generate, 
              noise_dimension)
test_seed = tf.random.normal(seed_shape)

Finally, instantiate a DCGAN() and train it for 200 epochs:
```
EPOCHS = 200
dcgan = DCGAN()
dcgan.train(train_dataset, test_seed, EPOCHS, BATCH_SIZE)
```
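The first images generated by the GAN will look similar to these, just a collection of shapeless blobs:

Figure 6.1: Images generated at epoch 0

At the end of the training process, the results are much better:

Figure 6.2: Images generated at epoch 200

In Figure 6.2, we can make out familiar letters and digits, including A, d, 9, X, and B. However, in the first row, we notice a couple of ambiguous shapes, which is a sign that the generator still has room for improvement.

Let's see how it all works in the next section.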

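How it works…

In this recipe, we learned that the two networks of a GAN are trained in tandem but, unlike the two halves of an autoencoder, which cooperate, they work against each other (hence the adversarial in the name). When our focus is on the generator, the discriminator is just a tool to train it, as is the case here. This means that, after training, the discriminator is discarded.

Our generator is actually a decoder that takes a random Gaussian vector of 100 elements and produces 28x28x1 images, which are then passed to the discriminator, a regular CNN that has to decide whether they are real or fake.

Because our goal is to build the best possible generator, the classification problem the discriminator tries to solve has nothing to do with the actual classes in EMNIST. For this reason, we never label the images as real or fake beforehand. Instead, in the discriminator_loss() method, we know that every output in real comes from EMNIST images, so we compute the loss against a tensor of ones (tf.ones_like(real)). Likewise, every output in fake comes from synthetic images, so we compute the loss against a tensor of zeros (tf.zeros_like(fake)).

The generator, on the other hand, takes the discriminator's feedback into account when computing its loss in order to improve its outputs.

It must be noted that the goal here is to reach an equilibrium between the two networks, not to minimize the loss. Visual inspection is therefore crucial, which is why we save the images produced by the generator every 10 epochs.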
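
To make the idea of equilibrium concrete, consider an idealized case (an assumption for illustration, not something this recipe measures) in which the discriminator can no longer tell real from fake and outputs a probability of 0.5 for every image. With binary cross-entropy, the discriminator loss $L_D$ and the generator loss $L_G$ would then settle at:

```latex
\begin{aligned}
L_D &= -\log(0.5) - \log(1 - 0.5) = 2\ln 2 \approx 1.386 \\
L_G &= -\log(0.5) = \ln 2 \approx 0.693
\end{aligned}
```

In other words, losses that hover around these values instead of dropping toward zero are not necessarily a sign of failure, which is one more reason to judge a GAN by its samples rather than by its loss curves.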

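In the end, we went from random, shapeless blobs at epoch 0 to recognizable digits and letters at epoch 200, although the network can still be improved further.

See also

You can read more about EMNIST here: arxiv.org/abs/1702.05373v1.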

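Using a DCGAN for semi-supervised learning

Data is the most important part of developing any deep learning model. However, good data is usually scarce and expensive to acquire. The good news is that GANs can lend us a hand in these situations by artificially producing novel training examples, in a process known as semi-supervised learning.

In this recipe, we'll develop a special DCGAN architecture to train a classifier on a very small subset of Fashion-MNIST and still achieve decent performance.

Let's begin, shall we?

Getting ready

We won't need anything extra to access Fashion-MNIST, because it comes bundled with TensorFlow. To display a nice progress bar, let's install tqdm:

$> pip install tqdm

Let's move on to the next section to start implementing the recipe.

How to do it…

Perform the following steps to complete the recipe:

Let's start by importing the required packages: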

import numpy as np
from numpy.random import *
from tensorflow.keras import backend as K
from tensorflow.keras.datasets import fashion_mnist as fmnist
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm

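Define the pick_supervised_subset() function to pick a labeled subset of the data. This helps us simulate a situation where labeled data is scarce, which is a perfect fit for semi-supervised learning: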

def pick_supervised_subset(feats,
                           labels,
                           n_samples=1000,
                           n_classes=10):
    samples_per_class = int(n_samples / n_classes)
    X = []
    y = []
    for i in range(n_classes):
        class_feats = feats[labels == i]
        class_sample_idx = randint(low=0,
                                   high=len(class_feats),
                                   size=samples_per_class)
        X.extend([class_feats[j] for j in 
                  class_sample_idx])
        y.extend([i] * samples_per_class)
    return np.array(X), np.array(y)

Now, define a function to pick a random sample of data for classification. This means we'll use the labels from the original dataset:

def pick_samples_for_classification(feats, labels, 
                                     n_samples):
    sample_idx = randint(low=0,
                         high=feats.shape[0],
                         size=n_samples)
    X = np.array([feats[i] for i in sample_idx])
    y = np.array([labels[i] for i in sample_idx])
    return X, y

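Define the pick_samples_for_discrimination() function to pick a random sample for discrimination. The main difference from the previous function is that the labels here are all 1, indicating that every image is real, which makes it clear that this sample is meant for the discriminator: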

def pick_samples_for_discrimination(feats, n_samples):
    sample_idx = randint(low=0,
                         high=feats.shape[0],
                         size=n_samples)
    X = np.array([feats[i] for i in sample_idx])
    y = np.ones((n_samples, 1))
    return X, y

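Implement the generate_latent_points() function to produce a batch of latent points, in other words, a set of random noise vectors that the generator will use to produce fake images:

def generate_latent_points(latent_size, n_samples):
    # Reconstructed from the call sites in this recipe (the original
    # listing is missing here): one row of Gaussian noise per sample.
    z_input = randn(latent_size * n_samples)
    z_input = z_input.reshape(n_samples, latent_size)
    return z_input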

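Create the generate_fake_samples() function to produce fake data with the generator. The labels are all 0, indicating that every image is fake: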

def generate_fake_samples(model, latent_size, 
                          n_samples):
    z_input = generate_latent_points(latent_size, 
                                      n_samples)
    images = model.predict(z_input)
    y = np.zeros((n_samples, 1))
    return images, y

We are ready to define our semi-supervised DCGAN, which we'll encapsulate in the SSGAN() class defined here. We'll start with the constructor:

class SSGAN(object):
    def __init__(self,
                 latent_size=100,
                 input_shape=(28, 28, 1),
                 alpha=0.2):
        self.latent_size = latent_size
        self.input_shape = input_shape
        self.alpha = alpha

After storing the arguments as members, let's instantiate the discriminators:

        (self.classifier,
         self.discriminator) = self._create_discriminators()

Now, compile both the classifier and the discriminator models:

        clf_opt = Adam(learning_rate=2e-4, beta_1=0.5)
        self.classifier.compile(
            loss='sparse_categorical_crossentropy',
            optimizer=clf_opt,
            metrics=['accuracy'])
        dis_opt = Adam(learning_rate=2e-4, beta_1=0.5)
        self.discriminator.compile(loss='binary_crossentropy',
                                   optimizer=dis_opt)

Create the generator:

        self.generator = self._create_generator()

Create the GAN and compile it:

        self.gan = self._create_gan()
        gan_opt = Adam(learning_rate=2e-4, beta_1=0.5)
        self.gan.compile(loss='binary_crossentropy',
                         optimizer=gan_opt)

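Define the private _create_discriminators() method to create the discriminators. The inner custom_activation() function is applied to the class logits shared with the classifier and produces the discriminator's output, a value between 0 and 1 that indicates whether an image is real or fake: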

    def _create_discriminators(self, num_classes=10):
        def custom_activation(x):
            log_exp_sum = K.sum(K.exp(x), axis=-1,
                                keepdims=True)
            return log_exp_sum / (log_exp_sum + 1.0)
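
The expression Z / (Z + 1), where Z is the sum of the exponentiated logits, is the same as applying a sigmoid to the log-sum-exp of the logits. The following sketch checks this with the standard library only; the logit values are made up, and sigmoid_of_logsumexp is a hypothetical helper used purely for comparison:

```python
import math


def custom_activation(logits):
    """Z / (Z + 1), where Z is the sum of the exponentiated class logits."""
    z = sum(math.exp(v) for v in logits)
    return z / (z + 1.0)


def sigmoid_of_logsumexp(logits):
    """The same quantity written as sigmoid(log(sum(exp(logits))))."""
    lse = math.log(sum(math.exp(v) for v in logits))
    return 1.0 / (1.0 + math.exp(-lse))


confident = [6.0, -2.0, -1.0]   # one class logit clearly dominates
uncertain = [-3.0, -3.5, -4.0]  # no class logit is large

for logits in (confident, uncertain):
    print(custom_activation(logits), sigmoid_of_logsumexp(logits))
```

Intuitively, an image that strongly activates at least one class pushes the output toward 1 (real), whereas an image that activates no class pushes it toward 0 (fake).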

Define the classifier architecture, which is just a regular softmax-activated CNN:

        input = Input(shape=self.input_shape)
        x = input
        for _ in range(3):
            x = Conv2D(filters=128,
                       kernel_size=(3, 3),
                       strides=2,
                       padding='same')(x)
            x = LeakyReLU(alpha=self.alpha)(x)
        x = Flatten()(x)
        x = Dropout(rate=0.4)(x)
        x = Dense(units=num_classes)(x)
        clf_output = Softmax()(x)
        clf_model = Model(input, clf_output)

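The discriminator shares its weights with the classifier but, instead of activating the outputs with softmax, it uses the custom_activation() function defined previously: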
```
        dis_output = Lambda(custom_activation)(x)
        discriminator_model = Model(input, dis_output)
```

Return both the classifier and the discriminator:

        return clf_model, discriminator_model

Create the private _create_generator() method to implement the generator architecture, which is just a decoder, as explained in the first recipe of this chapter:

    def _create_generator(self):
        input = Input(shape=(self.latent_size,))
        x = Dense(units=128 * 7 * 7)(input)
        x = LeakyReLU(alpha=self.alpha)(x)
        x = Reshape((7, 7, 128))(x)
        for _ in range(2):
            x = Conv2DTranspose(filters=128,
                                kernel_size=(4, 4),
                                strides=2,
                                padding='same')(x)
            x = LeakyReLU(alpha=self.alpha)(x)
        x = Conv2D(filters=1,
                   kernel_size=(7, 7),
                   padding='same')(x)
        output = Activation('tanh')(x)
        return Model(input, output)

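Define the private _create_gan() method to create the GAN itself, which is just the generator connected to the discriminator. The discriminator is frozen here so that training the combined model only updates the generator:

    def _create_gan(self):
        self.discriminator.trainable = False
        output = self.discriminator(self.generator.output)
        return Model(self.generator.input, output)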

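Finally, define the train() function to train the whole system. We'll start by picking the subset of Fashion-MNIST we'll train on, and then we'll work out the number of batches and training steps needed to fit the architecture (note that, despite its name, num_batches is used as the batch size):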

    def train(self, X, y, epochs=20, num_batches=100):
        X_sup, y_sup = pick_supervised_subset(X, y)
        batches_per_epoch = int(X.shape[0] / num_batches)
        num_steps = batches_per_epoch * epochs
        num_samples = int(num_batches / 2)
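
The bookkeeping above is easier to follow with concrete numbers. This sketch repeats the same arithmetic in a standalone function (training_schedule is a hypothetical name), assuming the 60,000 training images of Fashion-MNIST and the 30 epochs used later in this recipe:

```python
def training_schedule(n_images, epochs, num_batches=100):
    """Mirror the bookkeeping at the top of SSGAN.train()."""
    batches_per_epoch = int(n_images / num_batches)
    num_steps = batches_per_epoch * epochs
    # Size of each half-batch: one half real images, one half fake images.
    num_samples = int(num_batches / 2)
    return batches_per_epoch, num_steps, num_samples


# Fashion-MNIST ships 60,000 training images; the recipe trains for 30 epochs.
print(training_schedule(n_images=60_000, epochs=30))  # (600, 18000, 50)
```

So each of the 18,000 steps trains the classifier on 50 labeled images, the discriminator on 50 real and 50 fake images, and the generator (through the GAN) on 100 latent points.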

Pick samples for classification, and use them to fit the classifier:

        for _ in tqdm(range(num_steps)):
            X_sup_real, y_sup_real = \
                pick_samples_for_classification(X_sup,
                                                y_sup,
                                          num_samples)
            self.classifier.train_on_batch(X_sup_real,
                                           y_sup_real)

Pick real samples for discrimination, and use them to fit the discriminator:

            X_real, y_real = \
                pick_samples_for_discrimination(X,
                                                num_samples)
            self.discriminator.train_on_batch(X_real, y_real)

Use the generator to produce fake data, and use this data to fit the discriminator:

            X_fake, y_fake = \
                generate_fake_samples(self.generator,
                                    self.latent_size,
                                      num_samples)
            self.discriminator.train_on_batch(X_fake, 
                                             y_fake)

Generate latent points, and use them to train the GAN:

            X_gan = generate_latent_points(self.latent_size,
                      num_batches)
            y_gan = np.ones((num_batches, 1))
            self.gan.train_on_batch(X_gan, y_gan)

Load Fashion-MNIST and normalize both the training and test sets:

(X_train, y_train), (X_test, y_test) = fmnist.load_data()
X_train = np.expand_dims(X_train, axis=-1)
X_train = (X_train.astype(np.float32) - 127.5) / 127.5
X_test = np.expand_dims(X_test, axis=-1)
X_test = (X_test.astype(np.float32) - 127.5) / 127.5

Instantiate an SSGAN() and train it for 30 epochs:

ssgan = SSGAN()
ssgan.train(X_train, y_train, epochs=30)

Report the accuracy of the classifier on both the training and test sets:

train_acc = ssgan.classifier.evaluate(X_train, 
                                      y_train)[1]
train_acc *= 100
print(f'Train accuracy: {train_acc:.2f}%')
test_acc = ssgan.classifier.evaluate(X_test, y_test)[1]
test_acc *= 100
print(f'Test accuracy: {test_acc:.2f}%')

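After training, both the training and test accuracy should be around 83%, which is quite satisfying considering that we used the labels of only 1,000 of the 60,000 training examples!

How it works…

In this recipe, we implemented an architecture very similar to the one from the Implementing a deep convolutional GAN recipe at the beginning of this chapter. The main difference is that we have two discriminators: the first one is actually a classifier, trained on the small labeled subset of the data at our disposal; the other is a regular discriminator, whose sole job is not to be fooled by the generator.

How does the classifier achieve such respectable performance with so little data? The answer is shared weights. The classifier and the discriminator share the same feature extraction layers and differ only in the final output layer, which is activated with a plain softmax function in the classifier, and with a Lambda() layer that wraps our custom_activation() function in the discriminator.

This means the shared weights are updated every time the classifier trains on a batch of labeled data, and also every time the discriminator trains on real and fake images. In the end, we work around the data scarcity problem with the aid of the generator.

Pretty impressive, right?

See also

You can cement your understanding of the semi-supervised training approach used in this recipe by reading the paper where it was first proposed: arxiv.org/abs/1606.03498.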

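Translating images with Pix2Pix

One of the most interesting applications of GANs is image-to-image translation which, as the name suggests, consists of translating the content of one image domain into another (for instance, sketches to photos, black and white images to RGB, Google Maps to satellite views, and so on).

In this recipe, we'll implement a fairly complex conditional adversarial network known as Pix2Pix. We'll focus on the practical side of the solution; if you want to dig into the literature, check the See also section at the end of the recipe.

Getting ready

We'll use the cityscapes dataset, which is available here: people.eecs.berkeley.edu/~tinghuiz/p… Place it in the ~/.keras/datasets directory, in a folder named cityscapes. To display a progress bar during training, install tqdm:

$> pip install tqdm

By the end of this recipe, we'll have learned how to generate the image on the left from the one on the right using Pix2Pix:

Figure 6.3: We'll use the segmented images on the right to produce real-world images like the one on the left

Let's get started!

How to do it…

After completing these steps, you'll have implemented Pix2Pix from scratch!

Import the dependencies: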

import pathlib
import cv2
import numpy as np
import tensorflow as tf
import tqdm
from tensorflow.keras.layers import *
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import Adam

Define constants for TensorFlow's autotuning and resizing options, as well as for the image dimensions. We'll resize all the images in the dataset:

AUTOTUNE = tf.data.experimental.AUTOTUNE
NEAREST_NEIGHBOR = tf.image.ResizeMethod.NEAREST_NEIGHBOR
IMAGE_WIDTH = 256
IMAGE_HEIGHT = 256

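Each image in the dataset contains both the input and the target side by side, so after reading it we need to split it into two separate images. The load_image() function does this: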

def load_image(image_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image)
    width = tf.shape(image)[1]
    width = width // 2
    real_image = image[:, :width, :]
    input_image = image[:, width:, :]
    input_image = tf.cast(input_image, tf.float32)
    real_image = tf.cast(real_image, tf.float32)
    return input_image, real_image
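
The slicing in load_image() can be pictured with a toy example. In this sketch a nested list stands in for the decoded image, and split_side_by_side is a hypothetical helper that mirrors the two slices taken at half the width:

```python
def split_side_by_side(image):
    """Split an image (a list of rows) into its left and right halves."""
    width = len(image[0]) // 2
    left = [row[:width] for row in image]
    right = [row[width:] for row in image]
    return left, right


# A toy 2x4 "image": 'R' marks real-photo pixels, 'S' segmentation pixels.
image = [['R', 'R', 'S', 'S'],
         ['R', 'R', 'S', 'S']]

# As in load_image(), the left half is the target and the right half the input.
real_image, input_image = split_side_by_side(image)
print(real_image)
print(input_image)
```

The left half becomes the real (target) photo and the right half becomes the segmented input, which matches the layout of the cityscapes files used in this recipe.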

Let's create the resize() function to resize both the input and target images:

def resize(input_image, real_image, height, width):
    input_image = tf.image.resize(input_image,
                              size=(height,width),
                             method=NEAREST_NEIGHBOR)
    real_image = tf.image.resize(real_image,
                                 size=(height, width),
                              method=NEAREST_NEIGHBOR)
    return input_image, real_image

Now, implement the random_crop() function to take the same random crop from both images:

def random_crop(input_image, real_image):
    stacked_image = tf.stack([input_image, 
                             real_image],axis=0)
    size = (2, IMAGE_HEIGHT, IMAGE_WIDTH, 3)
    cropped_image = tf.image.random_crop(stacked_image,
                                         size=size)
    input_image = cropped_image[0]
    real_image = cropped_image[1]
    return input_image, real_image

Next, code up the normalize() function to normalize the images to the range [-1, 1]:

def normalize(input_image, real_image):
    input_image = (input_image / 127.5) - 1
    real_image = (real_image / 127.5) - 1
    return input_image, real_image

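Define the random_jitter() function, which applies random jittering to the input images (notice that it uses the functions defined in Step 4 and Step 5). The coin flip uses tf.random.uniform() rather than NumPy, so that it is re-evaluated for every image instead of only once, when the function is traced:

@tf.function
def random_jitter(input_image, real_image):
    input_image, real_image = resize(input_image,
                                     real_image,
                                     width=286,
                                     height=286)
    input_image, real_image = random_crop(input_image,
                                          real_image)
    # Draw the coin flip inside the graph so every image gets its own value.
    if tf.random.uniform(()) > 0.5:
        input_image = \
            tf.image.flip_left_right(input_image)
        real_image = \
            tf.image.flip_left_right(real_image)
    return input_image, real_image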

Create the load_training_image() function to load and augment the training images:

def load_training_image(image_path):
    input_image, real_image = load_image(image_path)
    input_image, real_image = \
        random_jitter(input_image, real_image)
    input_image, real_image = \
        normalize(input_image, real_image)
    return input_image, real_image

Now, let's implement the load_test_image() function which, as its name indicates, will be used to load the test images:

def load_test_image(image_path):
    input_image, real_image = load_image(image_path)
    input_image, real_image = resize(input_image, 
                                     real_image,
                                   width=IMAGE_WIDTH,
                                 height=IMAGE_HEIGHT)
    input_image, real_image = \
        normalize(input_image, real_image)
    return input_image, real_image

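Now, let's proceed to create the generate_and_save_images() function to store the synthetic images produced by the generator model. The resulting image is the concatenation of the input, the target, and the prediction: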

def generate_and_save_images(model, input, target,epoch):
    prediction = model(input, training=True)
    display_list = [input[0], target[0], prediction[0]]
    image = np.hstack(display_list)
    image *= 0.5
    image += 0.5
    image *= 255.0
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    cv2.imwrite(f'{epoch + 1}.jpg', image)

Next, define the Pix2Pix() class, which encapsulates the implementation of this architecture. We'll start with the constructor:

class Pix2Pix(object):
    def __init__(self, output_channels=3, 
                 lambda_value=100):
        self.loss = BinaryCrossentropy(from_logits=True)
        self.output_channels = output_channels
        self._lambda = lambda_value
        self.generator = self.create_generator()
        self.discriminator = self.create_discriminator()
        self.gen_opt = Adam(learning_rate=2e-4, 
                             beta_1=0.5)
        self.dis_opt = Adam(learning_rate=2e-4, 
                             beta_1=0.5)

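The constructor implemented in Step 11 defines the loss function to use (binary cross-entropy) and the lambda value (used in Step 18), and instantiates the generator and the discriminator along with their respective optimizers. Our generator is a modified U-Net, a U-shaped network made of downsampling and upsampling blocks. Now, let's create a static method to produce downsampling blocks:

    @staticmethod
    def downsample(filters, size, batch_norm=True):
        initializer = tf.random_normal_initializer(0.0, 0.02)
        layers = Sequential()
        layers.add(Conv2D(filters=filters,
                          kernel_size=size,
                          strides=2,
                          padding='same',
                          kernel_initializer=initializer,
                          use_bias=False))
        if batch_norm:
            layers.add(BatchNormalization())
        layers.add(LeakyReLU())
        return layers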

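A downsampling block is a convolution, optionally followed by batch normalization, and activated with LeakyReLU(). Now, let's implement a static method to create upsampling blocks: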

    @staticmethod
    def upsample(filters, size, dropout=False):
        init = tf.random_normal_initializer(0.0, 0.02)
        layers = Sequential()
        layers.add(Conv2DTranspose(filters=filters,
                                   kernel_size=size,
                                   strides=2,
                                   padding='same',
                             kernel_initializer=init,
                                   use_bias=False))
        layers.add(BatchNormalization())
        if dropout:
            layers.add(Dropout(rate=0.5))
        layers.add(ReLU())
        return layers

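An upsampling block is a transposed convolution followed by batch normalization, optionally followed by dropout, and activated with ReLU(). Now, let's use these two convenience methods to implement the U-Net generator: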

    def create_generator(self, input_shape=(256, 256,3)):
        down_stack = [self.downsample(64,4,batch_norm=False)]
        for filters in (128, 256, 512, 512, 512, 512, 
                         512):
            down_block = self.downsample(filters, 4)
            down_stack.append(down_block)

Having defined the downsampling stack, let's do the same for the upsampling layers:

        up_stack = []
        for _ in range(3):
            up_block = self.upsample(512, 4,dropout=True)
            up_stack.append(up_block)
        for filters in (512, 256, 128, 64):
            up_block = self.upsample(filters, 4)
            up_stack.append(up_block)

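Thread the inputs through the downsampling and upsampling stacks, adding skip connections along the way so that the depth of the network doesn't hinder its learning: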

        inputs = Input(shape=input_shape)
        x = inputs
        skip_layers = []
        for down in down_stack:
            x = down(x)
            skip_layers.append(x)
        skip_layers = reversed(skip_layers[:-1])
        for up, skip_connection in zip(up_stack, 
                                       skip_layers):
            x = up(x)
            x = Concatenate()([x, skip_connection])

The output layer is a transposed convolution activated with 'tanh':

        init = tf.random_normal_initializer(0.0, 0.02)
        output = Conv2DTranspose(
            filters=self.output_channels,
            kernel_size=4,
            strides=2,
            padding='same',
            kernel_initializer=init,
            activation='tanh')(x)
        return Model(inputs, outputs=output)
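
To check that the skip connections line up, we can trace the height (or width) and the number of channels through the generator with plain Python. This is only a sketch of the shape arithmetic, under the assumption that every block with strides=2 and padding='same' halves or doubles the spatial size; it does not build any Keras layers:

```python
down_filters = [64, 128, 256, 512, 512, 512, 512, 512]
up_filters = [512, 512, 512, 512, 256, 128, 64]

size = 256
down_shapes = []
for filters in down_filters:
    size //= 2  # every downsample block uses strides=2
    down_shapes.append((size, filters))

# Skip connections: every encoder output except the bottleneck, deepest first.
skips = list(reversed(down_shapes[:-1]))

up_shapes = []
for filters, (skip_size, skip_filters) in zip(up_filters, skips):
    size *= 2  # every upsample block uses strides=2
    assert size == skip_size  # Concatenate() needs matching height and width
    up_shapes.append((size, filters + skip_filters))

final_size = size * 2  # the last Conv2DTranspose also uses strides=2

print(down_shapes)
print(up_shapes)
print(final_size)
```

The encoder shrinks the 256x256 input down to a 1x1 bottleneck, each decoder block is concatenated with the encoder output of the same resolution, and the final transposed convolution brings the result back to 256x256.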

Define a method to calculate the generator's loss, as recommended by the authors of Pix2Pix. Notice the use of the self._lambda constant:

    def generator_loss(self,
                       discriminator_generated_output,
                       generator_output,
                       target):
        gan_loss = self.loss(
            tf.ones_like(discriminator_generated_output),
            discriminator_generated_output)
        # MAE
        error = target - generator_output
        l1_loss = tf.reduce_mean(tf.abs(error))
        total_gen_loss = gan_loss + (self._lambda * 
                                      l1_loss)
        return total_gen_loss, gan_loss, l1_loss
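
A small numeric example shows how strongly the L1 term weighs in the total. The sketch below uses plain Python lists in place of tensors; the pixel values and the adversarial loss of 0.7 are made up for illustration, and l1_loss and total_generator_loss are hypothetical helpers:

```python
def l1_loss(target, generated):
    """Mean absolute error between two equally sized lists of pixel values."""
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)


def total_generator_loss(gan_loss, l1, lambda_value=100):
    """Adversarial term plus the L1 term scaled by lambda."""
    return gan_loss + lambda_value * l1


# Made-up values in the [-1, 1] range used by the recipe.
target = [0.5, -0.25, 1.0, 0.0]
generated = [0.25, -0.25, 0.5, 0.25]

l1 = l1_loss(target, generated)        # (0.25 + 0 + 0.5 + 0.25) / 4 = 0.25
total = total_generator_loss(0.7, l1)  # 0.7 + 100 * 0.25 = 25.7
print(l1, total)
```

With a lambda of 100, the reconstruction error dominates the total, so the generator is pushed to stay close to the target while the adversarial term encourages realistic detail.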

The discriminator, defined in this step, receives two images: the input and the target:

    def create_discriminator(self):
        input = Input(shape=(256, 256, 3))
        target = Input(shape=(256, 256, 3))
        x = Concatenate()([input, target])
        x = self.downsample(64, 4, False)(x)
        x = self.downsample(128, 4)(x)
        x = self.downsample(256, 4)(x)
        x = ZeroPadding2D()(x)

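Notice that the last couple of layers are convolutional instead of Dense() layers. This is because the discriminator works on one small patch of the image at a time, and tells whether each patch is real or fake: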

        init = tf.random_normal_initializer(0.0, 0.02)
        x = Conv2D(filters=512,
                   kernel_size=4,
                   strides=1,
                   kernel_initializer=init,
                   use_bias=False)(x)
        x = BatchNormalization()(x)
        x = LeakyReLU()(x)
        x = ZeroPadding2D()(x)
        output = Conv2D(filters=1,
                        kernel_size=4,
                        strides=1,
                        kernel_initializer=init)(x)
        return Model(inputs=[input, target], 
                    outputs=output)
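
The size of the patch grid the discriminator outputs can be worked out by hand. The sketch below does so with three hypothetical helpers, assuming the Keras defaults of padding='valid' for Conv2D and one pixel of padding per side for ZeroPadding2D:

```python
def conv_same(size, stride):
    """Output size of a convolution with padding='same' (ceil(size / stride))."""
    return -(-size // stride)


def conv_valid(size, kernel, stride=1):
    """Output size of a convolution with padding='valid'."""
    return (size - kernel) // stride + 1


def zero_pad(size, pad=1):
    """ZeroPadding2D() adds `pad` pixels on each side (1 by default)."""
    return size + 2 * pad


size = 256
for _ in range(3):          # three downsample blocks with strides=2
    size = conv_same(size, stride=2)
size = zero_pad(size)        # 32 -> 34
size = conv_valid(size, 4)   # 34 -> 31
size = zero_pad(size)        # 31 -> 33
size = conv_valid(size, 4)   # 33 -> 30
print(size)  # 30
```

Under these assumptions, the discriminator returns a 30x30 grid of logits, each one judging a different patch of the input and target pair.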

Define the discriminator's loss:

    def discriminator_loss(self,
                           discriminator_real_output,
                         discriminator_generated_output):
        real_loss = self.loss(
            tf.ones_like(discriminator_real_output),
            discriminator_real_output)
        fake_loss = self.loss(
            tf.zeros_like(discriminator_generated_output),
            discriminator_generated_output)
        return real_loss + fake_loss

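Define a function that performs a single training step, named train_step(). It passes the input image through the generator, and then runs the discriminator twice: once on the input image paired with the original target image, and once on the input image paired with the fake image output by the generator: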

    @tf.function
    def train_step(self, input_image, target):
        with tf.GradientTape() as gen_tape, \
                tf.GradientTape() as dis_tape:
            gen_output = self.generator(input_image,
                                        training=True)
            dis_real_output = self.discriminator(
                [input_image, target], training=True)
            dis_gen_output = self.discriminator(
                [input_image, gen_output], 
                        training=True)

Next, compute the loss values, as well as the gradients:

            (gen_total_loss, gen_gan_loss,
             gen_l1_loss) = \
                self.generator_loss(dis_gen_output,
                                    gen_output,
                                    target)
            dis_loss = \
                self.discriminator_loss(dis_real_output,
                                        dis_gen_output)
        gen_grads = gen_tape. \
            gradient(gen_total_loss,
                     self.generator.trainable_variables)
        dis_grads = dis_tape. \
            gradient(dis_loss,
                     self.discriminator.trainable_variables)

Use the gradients to update the models through their respective optimizers:

        opt_args = zip(gen_grads,
                       self.generator.trainable_variables)
        self.gen_opt.apply_gradients(opt_args)
        opt_args = zip(dis_grads,
                       self.discriminator.trainable_variables)
        self.dis_opt.apply_gradients(opt_args)

Implement the fit() method to train the whole architecture. At each epoch, we'll save the generated images to disk as a way to visually assess the performance of the model:

    def fit(self, train, epochs, test):
        for epoch in tqdm.tqdm(range(epochs)):
            for example_input, example_target in test.take(1):
                generate_and_save_images(self.generator,
                                         example_input,
                                         example_target,
                                         epoch)
            for input_image, target in train:
                self.train_step(input_image, target)

Assemble the paths to the training and test data:

dataset_path = (pathlib.Path.home() / '.keras' / 
                'datasets' /'cityscapes')
train_dataset_pattern = str(dataset_path / 'train' / 
                             '*.jpg')
test_dataset_pattern = str(dataset_path / 'val' / 
                           '*.jpg')

Define the training and test datasets:

BUFFER_SIZE = 400
BATCH_SIZE = 1
train_ds = (tf.data.Dataset
            .list_files(train_dataset_pattern)
            .map(load_training_image,
                 num_parallel_calls=AUTOTUNE)
            .shuffle(BUFFER_SIZE)
            .batch(BATCH_SIZE))
test_ds = (tf.data.Dataset
           .list_files(test_dataset_pattern)
           .map(load_test_image)
           .batch(BATCH_SIZE))

Instantiate Pix2Pix() and fit it for 150 epochs:
```
pix2pix = Pix2Pix()
pix2pix.fit(train_ds, epochs=150, test=test_ds)
```
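Here's a generated image at epoch 1:

Figure 6.4: At first, the generator only produces noise

And here's the result at epoch 150:

Figure 6.5: At the end of its training run, the generator is capable of producing reasonable results

When training ends, our Pix2Pix architecture can translate segmented images into real-world scenes, as shown in Figure 6.5, where the first image is the input, the second is the target, and the rightmost one is the generated image.

Let's connect the dots in the next section.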

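How it works…

In this recipe, we implemented a somewhat more complex architecture, but one based on the same ideas as every other GAN. The main difference is that, this time, the discriminator works on patches of the image instead of on the whole image. More specifically, the discriminator looks at patches of the original and fake images, one at a time, and decides whether each patch belongs to a real or a synthetic image.

Because image-to-image translation is closely related to image segmentation, our generator is a modified U-Net, a groundbreaking CNN architecture that was first used for biomedical image segmentation.

Because Pix2Pix is such a complex and deep network, training takes several hours to complete. In the end, though, we obtained very good results translating segmented cityscapes into realistic-looking predictions. Impressive!

If you want to take a look at other generated images, as well as a graphical representation of both the generator and the discriminator, consult the official repository: github.com/PacktPublishing/Tensorflow-2.0-Computer-Vision-Cookbook/tree/master/ch6/recipe3.

See also

I recommend reading the original Pix2Pix paper by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, available here: arxiv.org/abs/1611.07004. We used a U-Net as the generator, which you can read more about here: arxiv.org/abs/1505.04597.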

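Translating unpaired images with CycleGAN

In the Translating images with Pix2Pix recipe, we explored how to transfer images from one domain to another. However, in the end, that is still supervised learning: it requires pairs of input and target images for Pix2Pix to learn the correct mapping. Wouldn't it be great if we could bypass this pairing condition and let the network figure out on its own how to translate the characteristics of one domain into another, while preserving the consistency of the image?

Well, that's exactly what CycleGAN does, and in this recipe we'll implement one from scratch to convert pictures of Yosemite National Park taken in summer into their winter counterparts!

Let's get started.

Getting ready

In this recipe, we'll use OpenCV, tqdm, and tensorflow-datasets.

Install them all at once with pip:

$> pip install opencv-contrib-python tqdm tensorflow-datasets

Through TensorFlow Datasets, we'll access the cyclegan/summer2winter_yosemite dataset.

Here are some sample images from this dataset:

Figure 6.6: Left: Yosemite in summer; right: Yosemite in winter

Tip

The implementation of CycleGAN is very similar to that of Pix2Pix, so we won't explain most of it in detail. Instead, I encourage you to complete the Translating images with Pix2Pix recipe first, and then tackle this one.

How to do it…

Perform the following steps to complete the recipe:

Import the necessary dependencies: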

import cv2
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import *
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm

Define an alias for tf.data.experimental.AUTOTUNE:
```
AUTOTUNE = tf.data.experimental.AUTOTUNE
```

Define a function that takes a random crop of an image:

def random_crop(image):
    return tf.image.random_crop(image, size=(256, 256, 
                                               3))

Define a function that normalizes an image to the range [-1, 1]:

def normalize(image):
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1
    return image

Define a function that applies random jittering to an image:

def random_jitter(image):
    method = tf.image.ResizeMethod.NEAREST_NEIGHBOR
    image = tf.image.resize(image, (286, 286), 
                            method=method)
    image = random_crop(image)
    image = tf.image.random_flip_left_right(image)
    return image

Define a function to preprocess and augment the training images:

def preprocess_training_image(image, _):
    image = random_jitter(image)
    image = normalize(image)
    return image

Define a function that preprocesses the test images:

def preprocess_test_image(image, _):
    image = normalize(image)
    return image

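Define a function that generates and saves images using a generator model. The resulting image is the input image and the prediction placed side by side: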

def generate_images(model, test_input, epoch):
    prediction = model(test_input)
    image = np.hstack([test_input[0], prediction[0]])
    image *= 0.5
    image += 0.5
    image *= 255.0
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    cv2.imwrite(f'{epoch + 1}.jpg', image)

Define a custom instance normalization layer, starting with its constructor:

class InstanceNormalization(Layer):
    def __init__(self, epsilon=1e-5):
        super(InstanceNormalization, self).__init__()
        self.epsilon = epsilon

Now, define the build() method, which creates the internal components of the InstanceNormalization() class:

    def build(self, input_shape):
        init = tf.random_normal_initializer(1.0, 0.02)
        self.scale = self.add_weight(name='scale',
                                     shape=input_shape[-1:],
                                     initializer=init,
                                     trainable=True)
        self.offset = self.add_weight(name='offset',
                                      shape=input_shape[-1:],
                                      initializer='zeros',
                                      trainable=True)

Create the call() method, which implements the logic to instance-normalize the input tensor, x:

    def call(self, x):
        mean, variance = tf.nn.moments(x,
                                       axes=(1, 2),
                                       keepdims=True)
        inv = tf.math.rsqrt(variance + self.epsilon)
        normalized = (x - mean) * inv
        return self.scale * normalized + self.offset
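
To see what this normalization does numerically, the sketch below reproduces the core computation in NumPy, leaving out the learnable scale and offset. The function name instance_norm_np is hypothetical, and the input is random data chosen only for illustration:

```python
import numpy as np


def instance_norm_np(x, epsilon=1e-5):
    # x has shape (batch, height, width, channels). Statistics are computed
    # over the spatial axes only, separately per sample and per channel.
    mean = x.mean(axis=(1, 2), keepdims=True)
    variance = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)


rng = np.random.default_rng(42)
batch = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 8, 4))
normalized = instance_norm_np(batch)
per_instance_mean = normalized.mean(axis=(1, 2))  # Shape: (2, 4)
per_instance_std = normalized.std(axis=(1, 2))    # Shape: (2, 4)
```

Unlike batch normalization, no statistics are shared across the samples in a batch, which is convenient here because the recipe trains with a batch size of 1.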

Define a class to encapsulate the CycleGAN implementation. Start with the constructor:

class CycleGAN(object):
    def __init__(self, output_channels=3, 
                 lambda_value=10):
        self.output_channels = output_channels
        self._lambda = lambda_value
        self.loss = BinaryCrossentropy(from_logits=True)
        self.gen_g = self.create_generator()
        self.gen_f = self.create_generator()
        self.dis_x = self.create_discriminator()
        self.dis_y = self.create_discriminator()
        self.gen_g_opt = Adam(learning_rate=2e-4, 
                               beta_1=0.5)
        self.gen_f_opt = Adam(learning_rate=2e-4, 
                              beta_1=0.5)
        self.dis_x_opt = Adam(learning_rate=2e-4, 
                              beta_1=0.5)
        self.dis_y_opt = Adam(learning_rate=2e-4, 
                              beta_1=0.5)

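The main difference from Pix2Pix is that we have two generators (gen_g and gen_f) and two discriminators (dis_x and dis_y). gen_g learns to transform image X into image Y, while gen_f learns to transform image Y into image X. Similarly, dis_x learns to tell real images X apart from those produced by gen_f, and dis_y learns to tell real images Y apart from those produced by gen_g.

Now, let's create a static method that produces downsampling blocks (this is the same as in the previous recipe, except that this time we use instance normalization instead of batch normalization):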

    @staticmethod
    def downsample(filters, size, norm=True):
        initializer = tf.random_normal_initializer(0.0, 0.02)
        layers = Sequential()
        layers.add(Conv2D(filters=filters,
                          kernel_size=size,
                          strides=2,
                          padding='same',
                          kernel_initializer=initializer,
                          use_bias=False))
        if norm:
            layers.add(InstanceNormalization())
        layers.add(LeakyReLU())
        return layers

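Now, define a static method that produces upsampling blocks (this is the same as in the previous recipe, except that this time we use instance normalization instead of batch normalization):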

    @staticmethod
    def upsample(filters, size, dropout=False):
        init = tf.random_normal_initializer(0.0, 0.02)
        layers = Sequential()
        layers.add(Conv2DTranspose(filters=filters,
                                   kernel_size=size,
                                   strides=2,
                                   padding='same',
                                   kernel_initializer=init,
                                   use_bias=False))
        layers.add(InstanceNormalization())
        if dropout:
            layers.add(Dropout(rate=0.5))
        layers.add(ReLU())
        return layers

Define a method that builds the generator. Start by creating the downsampling layers:

    def create_generator(self):
        down_stack = [
            self.downsample(64, 4, norm=False),
            self.downsample(128, 4),
            self.downsample(256, 4)]
        for _ in range(5):
            down_block = self.downsample(512, 4)
            down_stack.append(down_block)

Now, create the upsampling layers:

        up_stack = []
        for _ in range(3):
            up_block = self.upsample(512, 4, 
                                   dropout=True)
            up_stack.append(up_block)
        for filters in (512, 256, 128, 64):
            up_block = self.upsample(filters, 4)
            up_stack.append(up_block)

Pass the input through the downsampling and upsampling layers. Add skip connections to mitigate the vanishing gradient problem:

        inputs = Input(shape=(None, None, 3))
        x = inputs
        skips = []
        for down in down_stack:
            x = down(x)
            skips.append(x)
        skips = reversed(skips[:-1])
        for up, skip in zip(up_stack, skips):
            x = up(x)
            x = Concatenate()([x, skip])

The output layer is a transposed convolution with 'tanh' activation:

        init = tf.random_normal_initializer(0.0, 0.02)
        output = Conv2DTranspose(
            filters=self.output_channels,
            kernel_size=4,
            strides=2,
            padding='same',
            kernel_initializer=init,
            activation='tanh')(x)
        return Model(inputs, outputs=output)

Define a method that calculates the generator loss:

    def generator_loss(self, generated):
        return self.loss(tf.ones_like(generated), 
                         generated)

Define a method that creates a discriminator:

    def create_discriminator(self):
        input = Input(shape=(None, None, 3))
        x = input
        x = self.downsample(64, 4, False)(x)
        x = self.downsample(128, 4)(x)
        x = self.downsample(256, 4)(x)
        x = ZeroPadding2D()(x)

Add the last few convolutional layers:

        init = tf.random_normal_initializer(0.0, 0.02)
        x = Conv2D(filters=512,
                   kernel_size=4,
                   strides=1,
                   kernel_initializer=init,
                   use_bias=False)(x)
        x = InstanceNormalization()(x)
        x = LeakyReLU()(x)
        x = ZeroPadding2D()(x)
        output = Conv2D(filters=1,
                        kernel_size=4,
                        strides=1,
                        kernel_initializer=init)(x)
        return Model(inputs=input, outputs=output)

Define a method that calculates the discriminator loss:

    def discriminator_loss(self, real, generated):
        real_loss = self.loss(tf.ones_like(real),
                              real)
        generated_loss = self.loss(tf.zeros_like(generated),
                                   generated)
        total_discriminator_loss = real_loss + generated_loss
        return total_discriminator_loss * 0.5
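
As a standalone aside (not part of the class), the following NumPy sketch shows how this loss behaves on raw logits. The bce_from_logits helper is a hypothetical stand-in for BinaryCrossentropy(from_logits=True), and the 30x30 patch shape is chosen only for illustration:

```python
import numpy as np


def bce_from_logits(labels, logits):
    # Numerically stable binary cross-entropy on raw logits,
    # averaged over every element.
    per_element = (np.maximum(logits, 0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_element.mean()


def discriminator_loss_np(real_logits, fake_logits):
    # Real patches should be scored as 1, generated patches as 0.
    real_loss = bce_from_logits(np.ones_like(real_logits), real_logits)
    fake_loss = bce_from_logits(np.zeros_like(fake_logits), fake_logits)
    return 0.5 * (real_loss + fake_loss)


patch_shape = (1, 30, 30, 1)
confident = discriminator_loss_np(np.full(patch_shape, 5.0),
                                  np.full(patch_shape, -5.0))
undecided = discriminator_loss_np(np.zeros(patch_shape),
                                  np.zeros(patch_shape))
```

A discriminator that cannot tell real from fake (all logits at 0) scores ln 2, while one that separates them confidently drives the loss toward 0.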

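Define a method that calculates the loss between the real image and the cycled image. This loss quantifies cycle consistency: if you translate an image X into Y, and then translate Y back into X, the result should be X, or something close to X: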

    def calculate_cycle_loss(self, real_image, 
                             cycled_image):
        error = real_image - cycled_image
        loss1 = tf.reduce_mean(tf.abs(error))
        return self._lambda * loss1

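Define a method that calculates the identity loss. This loss ensures that if you pass an image Y through gen_g, you get the real image Y back, or something close to it (the same applies to gen_f and images X):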

    def identity_loss(self, real_image, same_image):
        error = real_image - same_image
        loss = tf.reduce_mean(tf.abs(error))
        return self._lambda * 0.5 * loss

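Define a method that performs a single training step. It receives images X and Y from different domains. It then uses gen_g to translate X into Y, and gen_f to translate Y into X: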

    @tf.function
    def train_step(self, real_x, real_y):
        with tf.GradientTape(persistent=True) as tape:
            fake_y = self.gen_g(real_x, training=True)
            cycled_x = self.gen_f(fake_y, 
                                 training=True)
            fake_x = self.gen_f(real_y, training=True)
            cycled_y = self.gen_g(fake_x, 
                                   training=True)

Now, pass X through gen_f and Y through gen_g so that we can calculate the identity loss later:

            same_x = self.gen_f(real_x, training=True)
            same_y = self.gen_g(real_y, training=True)

Pass the real X and the fake X to dis_x, and the real Y and the generated Y to dis_y:

            dis_real_x = self.dis_x(real_x, 
                                    training=True)
            dis_real_y = self.dis_y(real_y, 
                                    training=True)
            dis_fake_x = self.dis_x(fake_x,training=True)
            dis_fake_y = self.dis_y(fake_y, 
                                   training=True)

Calculate the generators' losses:

            gen_g_loss = self.generator_loss(dis_fake_y)
            gen_f_loss = self.generator_loss(dis_fake_x)

Calculate the cycle loss:

            cycle_x_loss = \
                self.calculate_cycle_loss(real_x, 
                                          cycled_x)
            cycle_y_loss = \
                self.calculate_cycle_loss(real_y, 
                                         cycled_y)
            total_cycle_loss = (cycle_x_loss +
                                cycle_y_loss)

Calculate the identity loss and the total loss for generator G:

            identity_y_loss = \
                self.identity_loss(real_y, same_y)
            total_generator_g_loss = (gen_g_loss +
                                      total_cycle_loss +
                                      identity_y_loss)

Repeat the process for generator F:

            identity_x_loss = \
                self.identity_loss(real_x, same_x)
            total_generator_f_loss = (gen_f_loss +
                                      total_cycle_loss +
                                      identity_x_loss)

Calculate the discriminators' losses:

            dis_x_loss = \
                self.discriminator_loss(dis_real_x, dis_fake_x)
            dis_y_loss = \
                self.discriminator_loss(dis_real_y, dis_fake_y)

Calculate the gradients for the generators:

        gen_g_grads = tape.gradient(
            total_generator_g_loss,
            self.gen_g.trainable_variables)
        gen_f_grads = tape.gradient(
            total_generator_f_loss,
            self.gen_f.trainable_variables)

Calculate the gradients for the discriminators:

        dis_x_grads = tape.gradient(
            dis_x_loss,
            self.dis_x.trainable_variables)
        dis_y_grads = tape.gradient(
            dis_y_loss,
            self.dis_y.trainable_variables)

Apply the gradients to each generator using its respective optimizer:

        gen_g_opt_params = zip(gen_g_grads,
                         self.gen_g.trainable_variables)
        self.gen_g_opt.apply_gradients(gen_g_opt_params)
        gen_f_opt_params = zip(gen_f_grads,
                               self.gen_f.trainable_variables)
        self.gen_f_opt.apply_gradients(gen_f_opt_params)

Apply the gradients to each discriminator using its respective optimizer:

        dis_x_opt_params = zip(dis_x_grads,
                          self.dis_x.trainable_variables)
        self.dis_x_opt.apply_gradients(dis_x_opt_params)
        dis_y_opt_params = zip(dis_y_grads,
                          self.dis_y.trainable_variables)
        self.dis_y_opt.apply_gradients(dis_y_opt_params)

Define a method that fits the whole architecture. After each epoch, it saves to disk an image produced by generator G:

    def fit(self, train, epochs, test):
        for epoch in tqdm(range(epochs)):
            for image_x, image_y in train:
                self.train_step(image_x, image_y)
            test_image = next(iter(test))
            generate_images(self.gen_g, test_image, 
                               epoch)

Load the dataset:

dataset, _ = tfds.load('cycle_gan/summer2winter_yosemite',
                       with_info=True,
                       as_supervised=True)

Unpack the training and test splits:

train_summer = dataset['trainA']
train_winter = dataset['trainB']
test_summer = dataset['testA']
test_winter = dataset['testB']

Define the data processing pipelines for the training splits:

BUFFER_SIZE = 400
BATCH_SIZE = 1
train_summer = (train_summer
                .map(preprocess_training_image,
                     num_parallel_calls=AUTOTUNE)
                .cache()
                .shuffle(BUFFER_SIZE)
                .batch(BATCH_SIZE))
train_winter = (train_winter
                .map(preprocess_training_image,
                     num_parallel_calls=AUTOTUNE)
                .cache()
                .shuffle(BUFFER_SIZE)
                .batch(BATCH_SIZE))

Define the data processing pipelines for the test splits:

test_summer = (test_summer
               .map(preprocess_test_image,
                    num_parallel_calls=AUTOTUNE)
               .cache()
               .shuffle(BUFFER_SIZE)
               .batch(BATCH_SIZE))
test_winter = (test_winter
               .map(preprocess_test_image,
                    num_parallel_calls=AUTOTUNE)
               .cache()
               .shuffle(BUFFER_SIZE)
               .batch(BATCH_SIZE))

Create a CycleGAN() instance and train it for 40 epochs:

cycle_gan = CycleGAN()
train_ds = tf.data.Dataset.zip((train_summer, 
                                train_winter))
cycle_gan.fit(train=train_ds,
              epochs=40,
              test=test_summer)

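At epoch 1, we'll notice that the network hasn't learned much yet:

Figure 6.7 – Left: original image (summer); right: translated image (winter)

At epoch 40, however, the results are more encouraging:

Figure 6.8 – Left: original image (summer); right: translated image (winter)

As the preceding figure shows, our CycleGAN() added a bit more white to certain areas, such as the trail and the trees, which makes the translated image look as if it was taken in winter. Of course, training for more epochs may lead to better results, and I encourage you to try it to deepen your understanding of CycleGANs!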

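How it works…

In this recipe, we learned that CycleGANs work in a very similar fashion to Pix2Pix. The biggest advantage, however, is that a CycleGAN doesn't need a dataset of paired images to achieve its goal. Instead, it relies on two sets of generators and discriminators which, in effect, form a learning cycle, hence the name.

In particular, CycleGANs work as follows:

Generator G must learn a mapping from image X to image Y.
Generator F must learn a mapping from image Y to image X.
Discriminator D(X) must distinguish real images X from the fake ones generated by F.
Discriminator D(Y) must distinguish real images Y from the fake ones generated by G.

Two conditions ensure that the translation preserves meaning in both domains (much like when we translate from English to Spanish, and vice versa, and want to keep the meaning of the words):

Cycle consistency: Going from X to Y and then from Y back to X should produce the original X, or something very similar to it. The same applies to Y.
Identity consistency: Feeding Y to G should produce the same Y, or something very similar to it. The same applies to X and F.

Using these four components, CycleGAN tries to preserve cycle and identity consistency in the translation, which produces very satisfying results without the need for supervised, paired data.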
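
The two consistency terms can be illustrated with a small NumPy sketch. The gen_g and gen_f functions below are hypothetical stand-ins for trained generators (G brightens, F darkens), and the lambda value of 10 mirrors the default used in the recipe:

```python
import numpy as np

LAMBDA = 10.0


def cycle_loss(real, cycled):
    # Mean absolute error between an image and its round-trip translation.
    return LAMBDA * np.mean(np.abs(real - cycled))


def identity_loss(real, same):
    # Penalizes a generator for altering an image already in its target domain.
    return LAMBDA * 0.5 * np.mean(np.abs(real - same))


def gen_g(x):
    # Toy generator X -> Y: brightens the image.
    return x + 0.2


def gen_f(y):
    # Toy generator Y -> X: darkens the image.
    return y - 0.2


real_x = np.array([0.1, 0.3, 0.5])
cycled_x = gen_f(gen_g(real_x))
perfect_cycle = cycle_loss(real_x, cycled_x)

real_y = np.array([0.4, 0.6, 0.8])
same_y = gen_g(real_y)
id_penalty = identity_loss(real_y, same_y)
```

In this toy setup, the round trip recovers X, so the cycle loss is essentially zero. The identity loss is not: G still brightens an image that already belongs to domain Y, and that is the behavior the identity term discourages.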

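See also

You can read the original CycleGAN paper here: arxiv.org/abs/1703.10593. Also, here's a very interesting discussion that will help you understand the difference between instance and batch normalization: intellipaat.com/community/1869/instance-normalisation-vs-batch-normalisation.

Implementing an adversarial attack using the Fast Gradient Sign Method

We often think of highly accurate deep neural networks as robust models, but the Fast Gradient Sign Method (FGSM), proposed by none other than the father of GANs, Ian Goodfellow, shows otherwise. In this recipe, we'll perform an FGSM attack on a pre-trained model to see how introducing seemingly imperceptible changes can completely fool a network.

Getting ready

Let's install OpenCV with pip.

We'll use it to save the images perturbed with the FGSM method:

$> pip install opencv-contrib-python

Let's begin.

How to do it…

After completing the following steps, you'll have successfully performed an adversarial attack:

Import the dependencies: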

import cv2
import tensorflow as tf
from tensorflow.keras.applications.nasnet import *
from tensorflow.keras.losses import CategoricalCrossentropy

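Define a function that preprocesses an image, which entails resizing it and applying the same treatment as the pre-trained network we'll use (in this case, NASNetMobile):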

def preprocess(image, target_shape):
    image = tf.cast(image, tf.float32)
    image = tf.image.resize(image, target_shape)
    image = preprocess_input(image)
    image = image[None, :, :, :]
    return image

Define a function that returns the human-readable label for a set of probabilities:

def get_imagenet_label(probabilities):
    return decode_predictions(probabilities, top=1)[0][0]

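Define a function that saves an image. It uses the pre-trained model to obtain the predicted label, which becomes part of the image's filename along with the prediction confidence as a percentage. Before storing the image on disk, it makes sure the image is in the expected [0, 255] range and in BGR space, which is the color space OpenCV uses: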

def save_image(image, model, description):
    prediction = model.predict(image)
    _, label, conf = get_imagenet_label(prediction)
    image = image.numpy()[0] * 0.5 + 0.5
    image = (image * 255).astype('uint8')
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    conf *= 100
    img_name = f'{description}, {label} ({conf:.2f}%).jpg'
    cv2.imwrite(img_name, image)

Define a function that creates the adversarial pattern, which will be used later to perform the actual FGSM attack:

def generate_adv_pattern(model,
                         input_image,
                         input_label,
                         loss_function):
    with tf.GradientTape() as tape:
        tape.watch(input_image)
        prediction = model(input_image)
        loss = loss_function(input_label, prediction)
    gradient = tape.gradient(loss, input_image)
    signed_gradient = tf.sign(gradient)
    return signed_gradient

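The pattern is quite simple: it's a tensor in which each element is the sign of the gradient. More specifically, signed_gradient contains -1 where the gradient is below 0, 1 where it is above 0, and 0 where the gradient is exactly 0.

Instantiate the pre-trained NASNetMobile() model and freeze its weights: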

pretrained_model = NASNetMobile(include_top=True,
                                weights='imagenet')
pretrained_model.trainable = False

Load the test image and pass it through the network:

image = tf.io.read_file('dog.jpg')
image = tf.image.decode_jpeg(image)
image = preprocess(image, pretrained_model.input.shape[1:-1])
image_probabilities = pretrained_model.predict(image)

One-hot encode the ground truth label of the original image and use it to generate the adversarial pattern:

cce_loss = CategoricalCrossentropy()
pug_index = 254
label = tf.one_hot(pug_index, image_probabilities.shape[-1])
label = tf.reshape(label, (1, image_probabilities.shape[-1]))
disturbances = generate_adv_pattern(pretrained_model,
                                    image,
                                    label,
                                    cce_loss)

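Perform a series of adversarial attacks using increasingly large, yet still small, values of epsilon, which are applied in the direction of the gradient using the pattern in disturbances: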

for epsilon in [0, 0.005, 0.01, 0.1, 0.15, 0.2]:
    corrupted_image = image + epsilon * disturbances
    corrupted_image = tf.clip_by_value(corrupted_image, -1, 1)
    save_image(corrupted_image,
               pretrained_model,
               f'Epsilon = {epsilon:.3f}')

For epsilon = 0 (no attack), the image looks as follows, labeled pug with about 80% confidence:

Figure 6.9 – Original image. Label: pug (80.23% confidence)

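With epsilon = 0.005 (a very small perturbation), the label changes to Brabancon_griffon, with 43.03% confidence:

Figure 6.10 – Applying epsilon = 0.005 in the direction of the gradient. Label: Brabancon_griffon (43.03% confidence)

As the preceding figure shows, a tiny variation in the pixel values produces a drastically different response from the network. The situation only gets worse as the value of ε (epsilon) increases. For the full list of results, refer to github.com/PacktPublishing/Tensorflow-2.0-Computer-Vision-Cookbook/tree/master/ch6/recipe5.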

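How it works…

In this recipe, we implemented a fairly simple attack based on the Fast Gradient Sign Method (FGSM) proposed by Ian Goodfellow. It consists of determining the direction (sign) of the gradient at each location and using that information to create an adversarial pattern. The rationale is that this technique nudges every pixel value in the direction that increases the loss.

Next, we use this pattern to add or subtract a tiny perturbation to each pixel in the image before passing it to the network.

Although these changes are often imperceptible to the human eye, they can completely confuse the network, resulting in nonsensical predictions, as demonstrated in the last step of this recipe.

See also

Fortunately, many defenses against this type of attack (and more sophisticated ones) have emerged. You can read a pretty interesting survey of adversarial attacks and defenses here: arxiv.org/abs/1810.00069.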
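
The same mechanism can be reproduced on a model small enough to inspect by hand. The sketch below applies FGSM to a toy logistic regression in NumPy. The weights, the input, and the epsilon value are made-up numbers for illustration, and the gradient is derived analytically instead of with tf.GradientTape:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fgsm_perturb(x, y, weights, bias, epsilon):
    # For a logistic model with binary cross-entropy loss, the gradient of
    # the loss with respect to the input is (p - y) * weights.
    p = sigmoid(x @ weights + bias)
    gradient = (p - y) * weights
    # Move every input feature by epsilon in the direction that raises the loss.
    return x + epsilon * np.sign(gradient)


weights = np.array([2.0, -3.0, 1.0])
bias = 0.0
x = np.array([0.2, -0.1, 0.1])  # An input whose true class is 1.
y = 1.0

clean_prob = sigmoid(x @ weights + bias)
adv_x = fgsm_perturb(x, y, weights, bias, epsilon=0.2)
adv_prob = sigmoid(adv_x @ weights + bias)
```

No feature moves by more than epsilon, yet the predicted probability for the true class drops from roughly 0.69 to roughly 0.40, flipping the decision. In a network with many thousands of input pixels, far smaller per-pixel changes can add up to the same effect.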

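Chapter 7: Captioning Images with CNNs and RNNs

Equipping a neural network with the ability to describe visual scenes in a human-readable fashion has to be one of the most interesting, yet challenging, applications of deep learning. The main difficulty is that the problem combines two major subfields of artificial intelligence: computer vision (CV) and natural language processing (NLP).

The architecture of most image captioning networks uses a convolutional neural network (CNN) to encode images in a numeric format suitable for consumption by a decoder, which is typically a recurrent neural network (RNN). This is a kind of network specialized in learning from sequential data, such as time series, video, and text.

As we'll see in this chapter, the challenges of building a system with these capabilities start with preparing the data, which we'll cover in the first recipe. Then, we'll implement an image captioning solution from scratch. In the third recipe, we'll use this model to generate captions for our own pictures. Finally, in the fourth recipe, we'll learn how to include an attention mechanism in our architecture so that we can understand which parts of the image the network is looking at when generating each word of the output caption.

Pretty interesting, don't you agree?

Specifically, we'll cover the following recipes in this chapter:

Implementing a reusable image caption feature extractor
Implementing an image captioning network
Generating captions for your own photos
Implementing an image captioning network on COCO with attention

Let's get started!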

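Technical requirements

Image captioning is a problem that demands large amounts of memory, storage, and computing power. I recommend using a cloud solution such as AWS or FloydHub to run the recipes in this chapter, unless you have sufficiently capable hardware. As expected, a GPU is of paramount importance for completing the recipes in this chapter. In the Getting ready section of each recipe, you'll find what you need to prepare. The code for this chapter is available here: github.com/PacktPublishing/Tensorflow-2.0-Computer-Vision-Cookbook/tree/master/ch7.

Check out the following link to see the Code in Action video:

bit.ly/3qmpVme.

Implementing a reusable image caption feature extractor

The first step in creating a deep learning-based image captioning solution is to transform the data into a format that certain networks can use. This means we must encode images as vectors or tensors, and text as embeddings, which are vector representations of sentences.

In this recipe, we'll implement a customizable and reusable component that lets us preprocess, ahead of time, the data we'll need to implement an image captioner, saving us a great deal of time later in the process.

Let's begin!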

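Getting ready

The dependencies we need are tqdm (to display a nice progress bar) and Pillow (to load and manipulate images using TensorFlow's built-in functions):

$> pip install Pillow tqdm

We'll use the Flickr8k dataset, which is expected to be located in the ~/.keras/datasets/flickr8k folder.

Here are some sample images:

Figure 7.1 – Sample images from Flickr8k

With that, we're good to go!

How to do it…

Follow these steps to create a reusable feature extractor for image captioning problems:

Import all the necessary dependencies: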

import glob
import os
import pathlib
import pickle
from string import punctuation
import numpy as np
import tqdm
from tensorflow.keras.applications.vgg16 import *
from tensorflow.keras.layers import *
from tensorflow.keras.preprocessing.image import *
from tensorflow.keras.preprocessing.sequence import \
    pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tqdm import tqdm

Define the ImageCaptionFeatureExtractor class and its constructor:

class ImageCaptionFeatureExtractor(object):
    def __init__(self,
                 output_path,
                 start_token='beginsequence',
                 end_token='endsequence',
                 feature_extractor=None,
                 input_shape=(224, 224, 3)):

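The constructor receives the path where the outputs will be stored, along with the tokens we'll use to delimit the start and end of a text sequence. It also takes the input shape of the feature extractor as a parameter. Next, let's store these values as members:

        self.input_shape = input_shape
        if feature_extractor is None:
            input = Input(shape=input_shape)
            self.feature_extractor = VGG16(input_tensor=input,
                                           weights='imagenet',
                                           include_top=False)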
        else:
            self.feature_extractor = feature_extractor
        self.output_path = output_path
        self.start_token = start_token
        self.end_token = end_token
        self.tokenizer = Tokenizer()
        self.max_seq_length = None

If we don't receive any feature_extractor, we use VGG16 by default. Next, define a public method that extracts the features from an image, given its path:

    def extract_image_features(self, image_path):
        image = load_img(image_path,
                       target_size=self.input_shape[:2])
        image = img_to_array(image)
        image = np.expand_dims(image, axis=0)
        image = preprocess_input(image)
        return self.feature_extractor.predict(image)[0]

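To clean the captions, we must remove all punctuation characters and single-letter words (such as a). The _clean_captions() method performs this task and also adds the special tokens, that is, self.start_token and self.end_token: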

    def _clean_captions(self, captions):
        def remove_punctuation(word):
            translation = str.maketrans('', '',
                                        punctuation)
            return word.translate(translation)
        def is_valid_word(word):
            return len(word) > 1 and word.isalpha()
        cleaned_captions = []
        for caption in captions:
            caption = caption.lower().split(' ')
            caption = map(remove_punctuation, caption)
            caption = filter(is_valid_word, caption)
            cleaned_caption = f'{self.start_token} ' \
                              f'{" ".join(caption)} ' \
                              f'{self.end_token}'
            cleaned_captions.append(cleaned_caption)
        return cleaned_captions

We also need to compute the length of the longest caption, which we can do with the _get_max_seq_length() method. It's defined as follows:

    def _get_max_seq_length(self, captions):
        max_sequence_length = -1
        for caption in captions:
            caption_length = len(caption.split(' '))
            max_sequence_length = max(max_sequence_length,
                                      caption_length)
        return max_sequence_length

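Define a public method, extract_features(), which receives a list of image paths and a list of captions, and uses them to extract features from both the images and the text sequences: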
```
    def extract_features(self, images_path, captions):
        assert len(images_path) == len(captions)
```

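Note that both lists must be the same size. The next steps are to clean the captions, compute the maximum sequence length, and fit a tokenizer to all the captions:

        captions = self._clean_captions(captions)
        self.max_seq_length = self._get_max_seq_length(captions)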
        self.tokenizer.fit_on_texts(captions)

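We'll iterate over each pair of image path and caption, extracting the features from the image. Then, we'll save an entry in the data_mapping dict that relates the image ID (present in image_path) to the corresponding visual features and cleaned caption:

        data_mapping = {}
        print('\nExtracting features...')
        for i in tqdm(range(len(images_path))):
            image_path = images_path[i]
            caption = captions[i]
            feats = self.extract_image_features(image_path)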
            image_id = image_path.split(os.path.sep)[-1]
            image_id = image_id.split('.')[0]
            data_mapping[image_id] = {
                'features': feats,
                'caption': caption
            }

We'll save this data_mapping to disk in pickle format:

        out_path = f'{self.output_path}/data_mapping.pickle'
        with open(out_path, 'wb') as f:
            pickle.dump(data_mapping, f, protocol=4)

We'll finish this method by creating and storing the sequences that will later be fed to the image captioning network:
```
        self._create_sequences(data_mapping)
```
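The following method creates the input and output sequences used to train the image captioning model (see the How it works… section for a deeper explanation). We start by determining the number of output classes, which is the vocabulary size plus one (Keras' Tokenizer starts indexing words at 1, so index 0 is left over for padding). We must also define the lists where we'll store the sequences: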
```
    def _create_sequences(self, mapping):
        num_classes = len(self.tokenizer.word_index) + 1
        in_feats = []
        in_seqs = []
        out_seqs = []
```

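Next, we'll iterate over each feature-caption pair. We'll convert the caption from a string into a sequence of numbers that represent the words in the sentence: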

        print('\nCreating sequences...')
        for _, data in tqdm(mapping.items()):
            feature = data['features']
            caption = data['caption']
            seq = self.tokenizer.texts_to_sequences([caption])
            seq = seq[0]

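Next, we'll generate as many input sequences as there are words in the caption. Each input sequence is used to generate the next word in the sequence. Therefore, for a given index, i, the input sequence is made of all the elements up to i-1, while the corresponding output sequence, or label, is the one-hot encoded element at i (that is, the next word). To ensure all input sequences are the same length, we must pad them: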

            for i in range(1, len(seq)):
                input_seq = seq[:i]
                input_seq, = pad_sequences([input_seq],
                                           self.max_seq_length)
                out_seq = seq[i]
                out_seq = to_categorical([out_seq],
                                         num_classes)[0]

Then, we add the visual feature vector, the input sequence, and the output sequence to their corresponding lists:

                in_feats.append(feature)
                in_seqs.append(input_seq)
                out_seqs.append(out_seq)

Finally, we must write the sequences to disk in pickle format:

        file_paths = [
            f'{self.output_path}/input_features.pickle',
            f'{self.output_path}/input_sequences.pickle',
            f'{self.output_path}/output_sequences.pickle']
        sequences = [in_feats,
                     in_seqs,
                     out_seqs]
        for path, seq in zip(file_paths, sequences):
            with open(path, 'wb') as f:
                pickle.dump(np.array(seq), f, 
                            protocol=4)

Let's define the paths to the Flickr8k images and captions:

BASE_PATH = (pathlib.Path.home() / '.keras' / 'datasets'
             / 'flickr8k')
IMAGES_PATH = str(BASE_PATH / 'Images')
CAPTIONS_PATH = str(BASE_PATH / 'captions.txt')

Create an instance of the feature extractor class we just implemented:

extractor = ImageCaptionFeatureExtractor(output_path='.')

List all the image files in the Flickr8k dataset:

image_paths = list(glob.glob(f'{IMAGES_PATH}/*.jpg'))

Read the contents of the captions file:

with open(CAPTIONS_PATH, 'r') as f:
    text = f.read()
    lines = text.split('\n')

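Now, we must create a map that associates each image with its captions. The key is the image ID, while the value is a list of all the captions related to that image: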

mapping = {}
for line in lines:
    if '.jpg' not in line:
        continue
    tokens = line.split(',', maxsplit=1)
    if len(tokens) < 2:
        continue
    image_id, image_caption = tokens
    image_id = image_id.split('.')[0]
    captions_per_image = mapping.get(image_id, [])
    captions_per_image.append(image_caption)
    mapping[image_id] = captions_per_image

We'll keep only one caption per image:

captions = []
for image_path in image_paths:
    image_id = image_path.split('/')[-1].split('.')[0]
    captions.append(mapping[image_id][0])

Finally, we must use our extractor to produce the data mapping and the corresponding input sequences:
```
extractor.extract_features(image_paths, captions)
```
This process may take a while. After several minutes, we should see the following files in the output path:
```
data_mapping.pickle     input_features.pickle   input_sequences.pickle  output_sequences.pickle
```

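The next section explains in detail how all of this works.

How it works…

In this recipe, we learned that one of the keys to building a good image captioning system is putting the data into a suitable format. This allows the network to learn how to describe, with text, what's happening in a visual scene.

There are many ways to frame an image captioning problem, but the most popular and effective is to use each word to generate the next word in the caption. This way, we construct the sentence word by word, passing each intermediate output as the input to the next cycle. (This is how RNNs work. To learn more, refer to the See also section.)

You might be wondering how we pass the visual information to the network. This is where the feature extraction step is crucial: we convert each image in the dataset into a numeric vector that summarizes the spatial information in the picture. Then, when training the network, we pass the same feature vector along with each input sequence. This way, the network learns to associate all the words in a caption with the same image.

If we're not careful, we could get trapped in an endless loop of word generation. How do we prevent this? By using a special token to signal the end of a sequence (meaning the network should stop producing words when it encounters such a token). In our case, this token is, by default, endsequence.

A similar problem is how to start a sequence. Which word should we use? In this case, we must also resort to a special token (our default is beginsequence). This token acts as a seed that the network uses to start producing the caption.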
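
To make the word-by-word framing concrete, here is a minimal pure-Python sketch of how a single caption becomes several training pairs. The four-word vocabulary and the build_training_pairs helper are hypothetical. The sketch keeps the target as an integer index instead of a one-hot vector, and it pads on the left, which is what pad_sequences does by default:

```python
def build_training_pairs(token_ids, max_length):
    # For every position i, the left-padded prefix token_ids[:i] is the
    # input and token_ids[i] (the next word) is the target.
    pairs = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[:i]
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, token_ids[i]))
    return pairs


# Hypothetical vocabulary; index 0 is reserved for padding.
vocabulary = {'beginsequence': 1, 'dog': 2, 'runs': 3, 'endsequence': 4}
caption = 'beginsequence dog runs endsequence'
token_ids = [vocabulary[word] for word in caption.split(' ')]
pairs = build_training_pairs(token_ids, max_length=4)
```

A caption with four tokens yields three pairs: the start token alone predicts dog, the first two tokens predict runs, and the first three predict endsequence. During training, each of those pairs is accompanied by the same image feature vector.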

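All of this might sound a bit confusing now, and that's because we've focused only on the data preprocessing stage. In the remaining recipes of this chapter, we'll leverage the work done here to train several different image captioners, and everything will fall into place!

See also

Here's a great explanation of how RNNs work: www.youtube.com/watch?v=UNmqTiOnRfg.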

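Implementing an image captioning network

An image captioning architecture consists of an encoder and a decoder. The encoder is a CNN (typically a pre-trained one) that converts input images into numeric vectors. These vectors are then passed, along with text sequences, to the decoder, which is an RNN that learns, based on these values, how to iteratively generate each word of the corresponding caption.

In this recipe, we'll implement an image captioner trained on the Flickr8k dataset. We'll leverage the feature extractor we built in the Implementing a reusable image caption feature extractor recipe.

Let's begin, shall we?

Getting ready

The external dependencies we'll use in this recipe are Pillow, nltk, and tqdm. You can install them all at once with the following command:

$> pip install Pillow nltk tqdm

We'll use the Flickr8k dataset, which you can find in the ~/.keras/datasets/flickr8k directory.

Here are some sample images from the Flickr8k dataset:

Figure 7.2 – Sample images from Flickr8k

Let's head over to the next section to start implementing this recipe.

How to do it…

Follow these steps to implement a deep learning-based image captioning system:

First, we must import all the required packages: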

import glob
import pathlib
import pickle
import numpy as np
from nltk.translate.bleu_score import corpus_bleu
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications.vgg16 import *
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.preprocessing.sequence import \
    pad_sequences
from ch7.recipe1.extractor import ImageCaptionFeatureExtractor

Define the paths to the images and captions, as well as the output path, which is where we'll store the artifacts created in this recipe:

BASE_PATH = (pathlib.Path.home() / '.keras' / 'datasets'
             / 'flickr8k')
IMAGES_PATH = str(BASE_PATH / 'Images')
CAPTIONS_PATH = str(BASE_PATH / 'captions.txt')
OUTPUT_PATH = '.'

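Define a function that loads the image paths along with their corresponding captions. This implementation is similar to Steps 20 through 22 of the Implementing a reusable image caption feature extractor recipe: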

def load_paths_and_captions():
    image_paths = list(glob.glob(f'{IMAGES_PATH}/*.jpg'))
    with open(f'{CAPTIONS_PATH}', 'r') as f:
        text = f.read()
        lines = text.split('\n')
    mapping = {}
    for line in lines:
        if '.jpg' not in line:
            continue
        tokens = line.split(',', maxsplit=1)
        if len(tokens) < 2:
            continue
        image_id, image_caption = tokens
        image_id = image_id.split('.')[0]
        captions_per_image = mapping.get(image_id, [])
        captions_per_image.append(image_caption)
        mapping[image_id] = captions_per_image

Compile all the captions:

    all_captions = []
    for image_path in image_paths:
        image_id = image_path.split('/')[-1].split('.')[0]
        all_captions.append(mapping[image_id][0])
    return image_paths, all_captions

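Define a function that builds the architecture of the network. It receives the vocabulary size, the maximum sequence length, and the encoder's input shape: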

def build_network(vocabulary_size,
                  max_sequence_length,
                  input_shape=(4096,)):

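The first part of the network receives the feature vectors and passes them through a fully connected layer with ReLU activation:

    feature_inputs = Input(shape=input_shape)
    x = Dropout(rate=0.5)(feature_inputs)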
    x = Dense(units=256)(x)
    feature_output = ReLU()(x)

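The second part of the network receives the text sequences, already converted into numeric vectors, and trains an embedding layer of 256 elements. It then passes that embedding to an LSTM layer:

    sequence_inputs = Input(shape=(max_sequence_length,))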
    y = Embedding(input_dim=vocabulary_size,
                  output_dim=256,
                  mask_zero=True)(sequence_inputs)
    y = Dropout(rate=0.5)(y)
    sequence_output = LSTM(units=256)(y)

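We combine the outputs of these two parts by adding them element-wise, and pass the result through a fully connected network whose output layer has as many units as there are words in the vocabulary. Applying a Softmax activation to this output gives us a probability distribution over the vocabulary, which is trained to match the one-hot encoded vector of the next word: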
```
    z = Add()([feature_output, sequence_output])
    z = Dense(units=256)(z)
    z = ReLU()(z)
    z = Dense(units=vocabulary_size)(z)
    outputs = Softmax()(z)
```

Finally, we build the model, which takes the image features and text sequences as inputs and outputs the probabilities for the next word:

    return Model(inputs=[feature_inputs, 
                  sequence_inputs],
                 outputs=outputs)

Define a function that converts an integer index into a word using the tokenizer's internal mapping:

def get_word_from_index(tokenizer, index):
    return tokenizer.index_word.get(index, None)

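Define a function that produces a caption. It starts by feeding the beginsequence token to the network, which then builds the sentence iteratively until it reaches the maximum sequence length or encounters the endsequence token: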

def produce_caption(model,
                    tokenizer,
                    image,
                    max_sequence_length):
    text = 'beginsequence'
    for _ in range(max_sequence_length):
        sequence = tokenizer.texts_to_sequences([text])[0]
        sequence = pad_sequences([sequence],
               maxlen=max_sequence_length)
        prediction = model.predict([[image], sequence])
        index = np.argmax(prediction)
        word = get_word_from_index(tokenizer, index)
        if word is None:
            break
        text += f' {word}'
        if word == 'endsequence':
            break
    return text
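
The control flow of this greedy decoding loop can be isolated from the network itself. In the following pure-Python sketch, toy_model and its transitions table are hypothetical stand-ins for the trained model: given the words so far, they simply look up the next one:

```python
def greedy_decode(predict_next, start_token, end_token, max_length):
    # `predict_next` maps the words generated so far to the most likely
    # next word, or to None if no word can be produced.
    words = [start_token]
    for _ in range(max_length):
        word = predict_next(words)
        if word is None:
            break
        words.append(word)
        if word == end_token:
            break
    return ' '.join(words)


# Hypothetical stand-in for the trained network: a lookup on the last word.
transitions = {'beginsequence': 'dog', 'dog': 'runs', 'runs': 'endsequence'}


def toy_model(words):
    return transitions.get(words[-1])


caption = greedy_decode(toy_model, 'beginsequence', 'endsequence',
                        max_length=10)
truncated = greedy_decode(toy_model, 'beginsequence', 'endsequence',
                          max_length=1)
```

The two stopping conditions are visible here: the first call ends because the end token is produced, while the second ends because the length budget runs out.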

Define a function that evaluates the model's performance. First, we'll produce a caption for the features of each image in the test dataset:

def evaluate_model(model, features, captions,
                   tokenizer,
                   max_seq_length):
    actual = []
    predicted = []
    for feature, caption in zip(features, captions):
        generated_caption = produce_caption(model,
                                            tokenizer,
                                            feature,
                                     max_seq_length)
        actual.append([caption.split(' ')])
        predicted.append(generated_caption.split(' '))

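Next, we'll compute the BLEU score using different weights. Although the BLEU score is outside the scope of this recipe, you'll find an excellent article that explains it in depth in the See also section. All you need to know is that it measures how similar a generated caption is to a set of reference captions: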

    for index, weights in enumerate([(1, 0, 0, 0),
                                     (.5, .5, 0, 0),
                                     (.3, .3, .3, 0),
                                     (.25, .25, .25, 
                                        .25)],
                                    start=1):
        b_score = corpus_bleu(actual, predicted, weights)
        print(f'BLEU-{index}: {b_score}')
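
For intuition about what those weights apply to, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on, in pure Python. It is only one ingredient: the full score also combines the precisions geometrically and applies a brevity penalty, both of which are omitted here. The function names and example sentences are made up for illustration:

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    # Clip each candidate n-gram count by the maximum number of times
    # that n-gram appears in any single reference.
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


references = [['the', 'dog', 'runs', 'on', 'grass']]
candidate = ['the', 'dog', 'runs', 'on', 'the', 'beach']
p1 = modified_precision(references, candidate, 1)
p2 = modified_precision(references, candidate, 2)
```

Here, four of the six candidate unigrams and three of the five bigrams are credited. The weight tuples passed to corpus_bleu decide how much the 1-gram to 4-gram precisions contribute to the final score.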

Load the image paths and captions:

image_paths, all_captions = load_paths_and_captions()

Create the image extractor model:

extractor_model = VGG16(weights='imagenet')
inputs = extractor_model.inputs
outputs = extractor_model.layers[-2].output
extractor_model = Model(inputs=inputs, outputs=outputs)

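Create the image caption feature extractor (passing in the regular image extractor we created in Step 15) and use it to extract the sequences from the data: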

extractor = ImageCaptionFeatureExtractor(
    feature_extractor=extractor_model,
    output_path=OUTPUT_PATH)
extractor.extract_features(image_paths, all_captions)

Load the pickled input and output sequences we created in Step 16:

pickled_data = []
for p in [f'{OUTPUT_PATH}/input_features.pickle',
          f'{OUTPUT_PATH}/input_sequences.pickle',
          f'{OUTPUT_PATH}/output_sequences.pickle']:
    with open(p, 'rb') as f:
        pickled_data.append(pickle.load(f))
input_feats, input_seqs, output_seqs = pickled_data

Use 80% of the data for training and 20% for testing:

(train_input_feats, test_input_feats,
 train_input_seqs, test_input_seqs,
 train_output_seqs,
 test_output_seqs) = train_test_split(input_feats,
                                      input_seqs,
                                      output_seqs,
                                      train_size=0.8,
                                      random_state=9)

Instantiate and compile the model. Because this is ultimately a multi-class classification problem, we use categorical_crossentropy as the loss function:

vocabulary_size = len(extractor.tokenizer.word_index) + 1
model = build_network(vocabulary_size,
                      extractor.max_seq_length)
model.compile(loss='categorical_crossentropy',
              optimizer='adam')

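Because training is very resource-intensive and the network tends to produce its best results early on, we create a ModelCheckpoint callback that stores the model with the lowest validation loss: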

checkpoint_path = ('model-ep{epoch:03d}-'
                   'loss{loss:.3f}-'
                   'val_loss{val_loss:.3f}.h5')
checkpoint = ModelCheckpoint(checkpoint_path,
                             monitor='val_loss',
                             verbose=1,
                             save_best_only=True,
                             mode='min')
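The effect of save_best_only=True with mode='min' can be sketched in plain Python as follows. This is only an illustration of the rule, not Keras code; the function name `epochs_saved` and the validation losses are made up for the example, while the filename template is the one used above.

```python
def epochs_saved(val_losses):
    """Return the 1-based epochs at which a 'save best only' rule
    (lower is better) would write a new checkpoint."""
    best = float('inf')
    saved = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:  # strictly better than anything seen so far
            best = val_loss
            saved.append(epoch)
    return saved


template = ('model-ep{epoch:03d}-'
            'loss{loss:.3f}-'
            'val_loss{val_loss:.3f}.h5')
history = [4.900, 4.500, 4.328, 4.410, 4.650]  # hypothetical val_loss values
saved = epochs_saved(history)
best_epoch = saved[-1]
best_name = template.format(epoch=best_epoch,
                            loss=3.847,  # hypothetical training loss
                            val_loss=history[best_epoch - 1])
print(saved, best_name)
```

Once the validation loss stops improving, no further files are written, so the last checkpoint on disk is the best one.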

Fit the model for 30 epochs. Notice that we must pass two sets of inputs or features, but only one set of labels:

EPOCHS = 30
model.fit(x=[train_input_feats, train_input_seqs],
          y=train_output_seqs,
          epochs=EPOCHS,
          callbacks=[checkpoint],
          validation_data=([test_input_feats,
                            test_input_seqs],
                           test_output_seqs))

Load the best model. It may vary from run to run, but in this recipe it is stored in the model-ep003-loss3.847-val_loss4.328.h5 file:
```
model = load_model('model-ep003-loss3.847-val_loss4.328.h5')
```

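Load the data mapping, which pairs every feature vector with its ground-truth caption. Extract the features and the captions into separate collections: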

with open(f'{OUTPUT_PATH}/data_mapping.pickle', 'rb') as f:
    data_mapping = pickle.load(f)
feats = [v['features'] for v in data_mapping.values()]
captions = [v['caption'] for v in data_mapping.values()]

Evaluate the model:

evaluate_model(model,
               features=feats,
               captions=captions,
               tokenizer=extractor.tokenizer,
               max_seq_length=extractor.max_seq_length)

This step may take a while. In the end, you'll see output similar to this:

BLEU-1: 0.35674398077995173
BLEU-2: 0.17030332240763874
BLEU-3: 0.12170338107914261
BLEU-4: 0.05493477725774873

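Training an image captioner is not an easy task. However, by performing the right steps in the right order, we created a model that performs reasonably well on the test set, judging by the BLEU scores shown in the preceding block. Head over to the next section to see how it all works!

How it works…

In this recipe, we implemented an image captioning network from scratch. Although it may seem complicated at first, we must remember that it is just a variant of the encoder-decoder architecture, similar to those we studied in Chapter 5, Reducing Noise with Autoencoders, and Chapter 6, Generative Models and Adversarial Attacks.

In this case, the encoder is just a shallow, fully connected network that maps the features we extracted from a model pre-trained on ImageNet to a 256-element vector.

The decoder, on the other hand, uses an RNN instead of transposed convolutions. It receives the text sequence (mapped to a numeric vector) and the image features, and concatenates them into one long sequence of 512 elements.

The network is trained to predict the next word in a sentence, given all the words generated in the previous time steps. Note that in each iteration we pass the same feature vector that corresponds to the image, so the network learns to map certain words, in a particular order, to describe the visual data encoded in that vector.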
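The training objective described above can be pictured as turning every caption into several (prefix, next word) examples. The following is a minimal sketch under that assumption; the helper name `next_word_pairs` is hypothetical and the real pipeline additionally converts words to integer indices and pads the prefixes.

```python
def next_word_pairs(caption):
    """Split a caption into (words seen so far, next word) training examples."""
    words = caption.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs('beginsequence a dog runs endsequence')
for prefix, target in pairs:
    print(prefix, '->', target)
```

Every one of these examples is paired with the same image feature vector, which is how the network ties the word order to the visual content.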

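The targets of the network are one-hot encoded, which means that only the position corresponding to the word that should come next in the sentence contains a 1, while the remaining positions contain 0. At prediction time, the network produces a score for every word in the vocabulary, and we pick the most likely one with np.argmax().

To generate captions, we follow a similar process. Of course, we need some way to tell the model to start producing words. To do so, we pass the beginsequence token to the network and keep iterating until we reach the maximum sequence length or the model outputs the endsequence token. Remember that we use the output of each iteration as the input of the next one.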
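The generation loop just described can be sketched without any model at all. In the following illustration, `greedy_decode` and the toy transition table are assumptions made for the example; in the recipe, the role of `next_word` is played by model.predict() followed by np.argmax().

```python
def greedy_decode(next_word, max_length,
                  start='beginsequence', end='endsequence'):
    """Grow a caption one word at a time until the end token
    appears or max_length words have been generated."""
    words = [start]
    for _ in range(max_length):
        word = next_word(words)  # the output so far is the next input
        if word is None:
            break
        words.append(word)
        if word == end:
            break
    return ' '.join(words)


# Toy stand-in for the trained network: the next word depends
# only on the last word generated.
TOY_TRANSITIONS = {'beginsequence': 'dog',
                   'dog': 'runs',
                   'runs': 'endsequence'}


def toy_next_word(words):
    return TOY_TRANSITIONS.get(words[-1])


caption = greedy_decode(toy_next_word, max_length=10)
truncated = greedy_decode(toy_next_word, max_length=1)
print(caption)
print(truncated)
```

The second call shows the other stopping criterion: with max_length=1 the loop ends before the end token is ever produced.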

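This may seem confusing and cumbersome at first, but you now have the building blocks needed to tackle any image captioning problem!

See also

If you want a thorough understanding of the BLEU score, refer to this excellent article: machinelearningmastery.com/calculate-bleu-score-for-text-python/.

Generating captions for your own photos

Training a good image captioning system is only one part of the equation. To actually use it, we must perform a series of steps similar to the ones we executed during the training phase.

In this recipe, we'll use a trained image captioning network to produce textual descriptions of new images.

Let's get started!

Getting ready

Although we don't need external dependencies in this particular recipe, we do need access to a trained image captioning network, as well as the cleaned captions that were used to train it. It's highly recommended that you complete the Implementing a reusable image caption feature extractor and Implementing an image captioning network recipes before tackling this one.

Are you ready? Let's start captioning!

How to do it…

Follow these steps to produce captions for your own images:

As usual, let's begin by importing the necessary dependencies: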

import glob
import pickle
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.applications.vgg16 import *
from tensorflow.keras.models import *
from tensorflow.keras.preprocessing.sequence import \
    pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from ch7.recipe1.extractor import ImageCaptionFeatureExtractor

Define a function that translates an integer index into the corresponding word in the tokenizer's mapping:

def get_word_from_index(tokenizer, index):
    return tokenizer.index_word.get(index, None)

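Define the produce_caption() function, which takes the captioning model, the tokenizer, the features of the image to describe, and the maximum sequence length needed to generate the textual description: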

def produce_caption(model,
                    tokenizer,
                    image,
                    max_sequence_length):
    text = 'beginsequence'
    for _ in range(max_sequence_length):
        sequence = tokenizer.texts_to_sequences([text])[0]
        sequence = pad_sequences([sequence],
                                 maxlen=max_sequence_length)
        prediction = model.predict([[image], sequence])
        index = np.argmax(prediction)
        word = get_word_from_index(tokenizer, index)
        if word is None:
            break
        text += f' {word}'
        if word == 'endsequence':
            break
    return text

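Notice that we must keep generating words until we either encounter the endsequence token or reach the maximum sequence length.

Define a pre-trained VGG16 network, which we'll use as our image feature extractor: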

extractor_model = VGG16(weights='imagenet')
inputs = extractor_model.inputs
outputs = extractor_model.layers[-2].output
extractor_model = Model(inputs=inputs, outputs=outputs)

Pass the image extractor to an instance of ImageCaptionFeatureExtractor():

extractor = ImageCaptionFeatureExtractor(
    feature_extractor=extractor_model)

Load the cleaned captions we used to train the model. We need them to fit the tokenizer in Step 7:

with open('data_mapping.pickle', 'rb') as f:
    data_mapping = pickle.load(f)
captions = [v['caption'] for v in 
            data_mapping.values()]

Instantiate a Tokenizer() and fit it to all the captions. Also, compute the maximum sequence length:

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
max_seq_length = extractor._get_max_seq_length(captions)

Load the trained network (in this case, the file is named model-ep003-loss3.847-val_loss4.328.h5):

model = load_model('model-ep003-loss3.847-val_loss4.328.h5')

Iterate over all the test images in the current directory and extract the corresponding numeric features:

for idx, image_path in enumerate(glob.glob('*.jpg'), 
                                   start=1):
    img_feats = (extractor
                 .extract_image_features(image_path))

Produce the caption and remove the beginsequence and endsequence special tokens:

    description = produce_caption(model,
                                  tokenizer,
                                  img_feats,
                                  max_seq_length)
    description = (description
                   .replace('beginsequence', '')
                   .replace('endsequence', ''))
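Note that removing the tokens with .replace() leaves the surrounding whitespace in place. If that matters for your plot titles, one possible alternative is to filter the words instead. The helper below is a hypothetical sketch, not part of the recipe's code:

```python
def clean_caption(text, special=('beginsequence', 'endsequence')):
    """Drop the special tokens and normalize the whitespace between words."""
    words = [word for word in text.split() if word not in special]
    return ' '.join(words)


raw = 'beginsequence a dog runs on grass endsequence'
replaced = raw.replace('beginsequence', '').replace('endsequence', '')
cleaned = clean_caption(raw)
print(repr(replaced))
print(repr(cleaned))
```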

Open the image, use the generated caption as its title, and save it:

    image = plt.imread(image_path)
    plt.imshow(image)
    plt.title(description)
    plt.savefig(f'{idx}.jpg')

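Here's an image where the network does a very good job of producing a proper caption:

Figure 7.3 – We can see that the caption is very close to what's actually happening

Here's another example where the network is technically correct, although it could be more precise:

Figure 7.4 – A football player in a red uniform is indeed in the air, but there's more going on

Finally, here's an instance where the network is completely clueless:

Figure 7.5 – The network couldn't describe this scene

With that, we've seen that the model does well on some images but still has room for improvement. We'll dive a bit deeper in the next section.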

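How it works…

In this recipe, we learned that image captioning is a difficult problem that depends heavily on many factors. Some of these factors are as follows:

A well-trained CNN to extract high-quality visual features
A rich set of descriptive captions for each image
Embeddings with enough capacity to encode the expressiveness of the vocabulary with minimal loss
A powerful RNN that learns how to put all of this together

Despite these clear challenges, in this recipe we used a network trained on the Flickr8k dataset to produce captions for new images. The process we followed is similar to the one we implemented to train the system: first, we must convert the image into a feature vector. Then, we need to fit a tokenizer to the vocabulary so that we have a proper mechanism for converting sequences into human-readable words. Finally, we assemble the caption one word at a time, passing in the image features along with the sequence we have built so far. How do we know when to stop? We have two stopping criteria:

The caption reached the maximum sequence length.
The network produced the endsequence token.

Lastly, we tested our solution on several images, with varied results. In some cases the network produced very precise descriptions, while in others the generated captions were rather vague. In the last example the network missed the mark completely, which clearly shows that there is still plenty of room for improvement.

If you want to take a look at other captioned images, consult the official repository: github.com/PacktPublishing/Tensorflow-2.0-Computer-Vision-Cookbook/tree/master/ch7/recipe3.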

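Implementing an image captioning network with attention on COCO

A great way to understand how an image captioning network produces its descriptions is to add an attention component to the architecture. This lets us see which parts of the image the network looks at while it generates each word.

In this recipe, we'll train an end-to-end image captioning system on the more challenging Common Objects in Context (COCO) dataset. We'll also equip the network with an attention mechanism to improve its performance and to help us understand its inner reasoning.

This is a long and advanced recipe, but don't worry! We'll go one step at a time. If you want to dive deeper into the theory that supports this implementation, take a look at the See also section.

Getting ready

Although we'll be using the COCO dataset, you don't need to prepare anything beforehand, because we'll download it as part of the recipe (you can, however, learn more about this seminal dataset here: cocodataset.org/#home).

The following is a sample from the COCO dataset:

Figure 7.6 – Sample image from COCO

Let's get to work!

How to do it…

Follow these steps to complete this recipe:

Import all the necessary dependencies: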

import json
import os
import time
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from tensorflow.keras.applications.inception_v3 import *
from tensorflow.keras.layers import *
from tensorflow.keras.losses import \
    SparseCategoricalCrossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import \
    pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import get_file

Define an alias for tf.data.experimental.AUTOTUNE:
```
AUTOTUNE = tf.data.experimental.AUTOTUNE
```

Define a function that loads an image. It must return both the image and its path:

def load_image(image_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (299, 299))
    image = preprocess_input(image)
    return image, image_path

Define a function that computes the maximum sequence length. We'll use it later on:

def get_max_length(tensor):
    return max(len(t) for t in tensor)

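Define a function for the image captioning network that loads the features of an image from disk (stored in NumPy format) along with its caption: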

def load_image_and_caption(image_name, caption):
    image_name = image_name.decode('utf-8').split('/')[-1]
    image_tensor = np.load(f'./{image_name}.npy')
    return image_tensor, caption

Implement Bahdanau's attention mechanism using model subclassing:

class BahdanauAttention(Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)

The previous block defined the network layers. Now, we define the forward pass inside the call() method:

    def call(self, features, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 
                                                 1)
        score = tf.nn.tanh(self.W1(features) +
                        self.W2(hidden_with_time_axis))
        attention_w = tf.nn.softmax(self.V(score), 
                                     axis=1)
        ctx_vector = attention_w * features
        ctx_vector = tf.reduce_sum(ctx_vector, axis=1)
        return ctx_vector, attention_w
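The last three lines of call() are the heart of the mechanism: a softmax turns the per-location scores into weights, and the context vector is the weighted sum of the feature vectors. Here is a minimal plain-Python sketch of just that part. It is not the TensorFlow code above: the scores are supplied directly instead of being produced by W1, W2, and V, and the function names and toy values are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def context_vector(features, scores):
    """Weighted sum of the feature vectors, one weight per image location."""
    weights = softmax(scores)
    size = len(features[0])
    context = [sum(weight * feature[d]
                   for weight, feature in zip(weights, features))
               for d in range(size)]
    return context, weights


# Three image locations, each described by a 2-element feature vector
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores: every location contributes equally
uniform_ctx, uniform_w = context_vector(features, [0.0, 0.0, 0.0])

# One dominant score: the context is essentially that location's features
peaked_ctx, peaked_w = context_vector(features, [100.0, 0.0, 0.0])
print(uniform_ctx, peaked_ctx)
```

The weights always sum to 1, which is what lets us plot them later as a heat map over the image.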

Define the image encoder. It is just a fully connected layer followed by a ReLU activation:

class CNNEncoder(Model):
    def __init__(self, embedding_dim):
        super(CNNEncoder, self).__init__()
        self.fc = Dense(embedding_dim)
    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

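Define the decoder. It combines a GRU with the attention mechanism to learn how to produce captions from the visual feature vectors and the text input sequences: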

class RNNDecoder(Model):
    def __init__(self, embedding_size, units,
                 vocab_size):
        super(RNNDecoder, self).__init__()
        self.units = units
        self.embedding = Embedding(vocab_size,
                                   embedding_size)
        self.gru = GRU(self.units,
                       return_sequences=True,
                       return_state=True,
                       recurrent_initializer='glorot_uniform')
        self.fc1 = Dense(self.units)
        self.fc2 = Dense(vocab_size)
        self.attention = BahdanauAttention(self.units)

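Now that we've defined the layers of the RNN architecture, let's implement the forward pass. First, we pass the inputs through the attention sub-network: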

    def call(self, x, features, hidden):
        context_vector, attention_weights = \
            self.attention(features, hidden)

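Then, we pass the input sequence (x) through the embedding layer and concatenate it with the context vector we obtained from the attention mechanism: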

        x = self.embedding(x)
        expanded_context = tf.expand_dims(context_vector, 
                                           1)
        x = Concatenate(axis=-1)([expanded_context, x])

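Next, we pass the merged tensor to the GRU layer and then through the fully connected layers. The method returns the output sequence, the state, and the attention weights: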

        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.fc2(x)
        return x, state, attention_weights

Finally, we define a method to reset the hidden state:

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))

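Define the ImageCaptioner class. The constructor instantiates the basic components: the encoder, the decoder, the tokenizer, and the optimizer and loss function needed to train the whole system: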

class ImageCaptioner(object):
    def __init__(self, embedding_size, units, 
                 vocab_size,
                 tokenizer):
        self.tokenizer = tokenizer
        self.encoder = CNNEncoder(embedding_size)
        self.decoder = RNNDecoder(embedding_size, 
                                    units,
                                  vocab_size)
        self.optimizer = Adam()
        self.loss = SparseCategoricalCrossentropy(
            from_logits=True,
            reduction='none')

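Create a method that computes the loss. It masks out the padded positions (index 0) so that they don't contribute to the loss: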

    def loss_function(self, real, predicted):
        mask = tf.math.logical_not(tf.math.equal(real, 
                                               0))
        _loss = self.loss(real, predicted)
        mask = tf.cast(mask, dtype=_loss.dtype)
        _loss *= mask
        return tf.reduce_mean(_loss)
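The masking logic can be checked by hand with a few numbers. The following plain-Python sketch mirrors what loss_function() does for one time step; the function name and the loss values are made up for the illustration.

```python
def masked_mean_loss(targets, losses, pad_id=0):
    """Zero out the loss wherever the target is padding, then average
    over all positions (padded ones included, as tf.reduce_mean does)."""
    masked = [loss if target != pad_id else 0.0
              for target, loss in zip(targets, losses)]
    return sum(masked) / len(masked)


targets = [5, 0, 7, 0]          # word indices; 0 marks padding
losses = [2.0, 9.0, 4.0, 9.0]   # hypothetical per-example losses
value = masked_mean_loss(targets, losses)
print(value)
```

Without the mask, the two padded positions would dominate the average even though there is nothing meaningful to predict there.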

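Next, define a method that performs a single training step. We start by creating the hidden state and the input, which is just a batch of single-element sequences containing the index of the <start> token, a special element that signals the beginning of a sentence: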

    @tf.function
    def train_step(self, image_tensor, target):
        loss = 0
        hidden = self.decoder.reset_state(target.shape[0])
        start_token_idx = self.tokenizer.word_index['<start>']
        init_batch = [start_token_idx] * target.shape[0]
        decoder_input = tf.expand_dims(init_batch, 1)

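Now, we encode the image tensor. Then, we iteratively pass the resulting features to the decoder, along with the output sequence so far and the hidden state. For a deeper explanation of how RNNs work, head to the See also section: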

        with tf.GradientTape() as tape:
            features = self.encoder(image_tensor)
            for i in range(1, target.shape[1]):
                preds, hidden, _ = self.decoder(decoder_input,
                                                features,
                                                hidden)
                loss += self.loss_function(target[:, i],
                                           preds)
                # The ground-truth word becomes the next input
                decoder_input = tf.expand_dims(target[:, i], 1)
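Feeding the ground-truth word back in as the next decoder input, rather than the word the decoder just predicted, is commonly known as teacher forcing. The loop above can be pictured as the following sequence of (decoder input, expected output) pairs; the helper name is hypothetical and words are used instead of indices for readability.

```python
def teacher_forcing_pairs(target):
    """At step i the decoder receives target[i - 1] and must predict target[i]."""
    return [(target[i - 1], target[i]) for i in range(1, len(target))]


pairs = teacher_forcing_pairs(['<start>', 'a', 'dog', '<end>'])
for decoder_input, expected in pairs:
    print(f'{decoder_input!r} -> {expected!r}')
```

At inference time there is no ground truth, so the predicted word takes the place of the target word as the next input.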

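Notice that in the previous block we computed the loss at each time step. To obtain the total loss, we average it over the time steps. For the network to actually learn, we must compute the gradients via backpropagation and apply them through the optimizer: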

        total_loss = loss / int(target.shape[1])
        trainable_vars = (self.encoder.trainable_variables +
                          self.decoder.trainable_variables)
        gradients = tape.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients,
                                           trainable_vars))
        return loss, total_loss

The last method in this class is in charge of training the system:

    def train(self, dataset, epochs, num_steps):
        for epoch in range(epochs):
            start = time.time()
            total_loss = 0
            for batch, (image_tensor, target) \
                    in enumerate(dataset):
                batch_loss, step_loss = \
                    self.train_step(image_tensor, target)
                total_loss += step_loss

Every 100 batches, we print the loss. At the end of each epoch, we also print the epoch loss and the elapsed time:

                if batch % 100 == 0:
                    loss = batch_loss.numpy()
                    loss = loss / int(target.shape[1])
                    print(f'Epoch {epoch + 1}, batch {batch},'
                          f' loss {loss:.4f}')
            print(f'Epoch {epoch + 1},'
                  f' loss {total_loss / num_steps:.6f}')
            epoch_time = time.time() - start
            print(f'Time taken: {epoch_time} seconds.\n')

Download and unzip the annotation files of the COCO dataset. If they're already on the system, just store the file path:

INPUT_DIR = os.path.abspath('.')
annots_folder = '/annotations/'
if not os.path.exists(INPUT_DIR + annots_folder):
    origin_url = ('http://images.cocodataset.org/annotations'
                  '/annotations_trainval2014.zip')
    cache_subdir = os.path.abspath('.')
    annots_zip = get_file('all_captions.zip',
                          cache_subdir=cache_subdir,
                          origin=origin_url,
                          extract=True)
    annots_file = (os.path.dirname(annots_zip) +
                  '/annotations/captions_train2014.json')
    os.remove(annots_zip)
else:
    annots_file = (INPUT_DIR +
                  '/annotations/captions_train2014.json')

Download and unzip the image files of the COCO dataset. If they're already on the system, just store the file path:

image_folder = '/train2014/'
if not os.path.exists(INPUT_DIR + image_folder):
    origin_url = ('http://images.cocodataset.org/zips/'
                  'train2014.zip')
    cache_subdir = os.path.abspath('.')
    image_zip = get_file('train2014.zip',
                         cache_subdir=cache_subdir,
                         origin=origin_url,
                         extract=True)
    PATH = os.path.dirname(image_zip) + image_folder
    os.remove(image_zip)
else:
    PATH = INPUT_DIR + image_folder

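Load the image paths and captions. We must add the special <start> and <end> tokens to each caption so that they become part of our vocabulary. These special tokens let us mark where a sequence begins and ends, respectively: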

with open(annots_file, 'r') as f:
    annotations = json.load(f)
captions = []
image_paths = []
for annotation in annotations['annotations']:
    # Keep a space after <start> so it is tokenized as its own word
    caption = '<start> ' + annotation['caption'] + ' <end>'
    image_id = annotation['image_id']
    image_path = f'{PATH}COCO_train2014_{image_id:012d}.jpg'
    image_paths.append(image_path)
    captions.append(caption)
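The {image_id:012d} format specifier above pads the numeric ID with zeros to 12 digits, which is how the image files are named in the block above. A quick sketch, using a hypothetical folder and an arbitrary ID chosen for illustration:

```python
def coco_train_path(folder, image_id):
    """Build the file path of a training image from its numeric ID."""
    return f'{folder}COCO_train2014_{image_id:012d}.jpg'


example = coco_train_path('/data/train2014/', 318556)
print(example)
```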

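Because COCO is massive and training a model on all of it would take a very long time, we select a random sample of 30,000 captions along with their corresponding images: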

train_captions, train_image_paths = shuffle(captions,
                                            image_paths,
                                            random_state=42)
SAMPLE_SIZE = 30000
train_captions = train_captions[:SAMPLE_SIZE]
train_image_paths = train_image_paths[:SAMPLE_SIZE]
train_images = sorted(set(train_image_paths))

We use a pre-trained instance of InceptionV3 as our image feature extractor:

feature_extractor = InceptionV3(include_top=False,
                                weights='imagenet')
feature_extractor = Model(feature_extractor.input,
                          feature_extractor.layers[-1].output)

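Create a tf.data.Dataset that maps image paths to tensors. Use it to go over all the images in our sample, convert them into feature vectors, and save them as NumPy arrays. This lets us avoid recomputing the features and saves memory later on: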

BATCH_SIZE = 8
image_dataset = (tf.data.Dataset
                 .from_tensor_slices(train_images)
                 .map(load_image, 
                     num_parallel_calls=AUTOTUNE)
                 .batch(BATCH_SIZE))
for image, path in image_dataset:
    batch_features = feature_extractor.predict(image)
    batch_features = tf.reshape(batch_features,
                             (batch_features.shape[0],
                                 -1,
                            batch_features.shape[3]))
    for batch_feature, p in zip(batch_features, path):
        feature_path = p.numpy().decode('UTF-8')
        image_name = feature_path.split('/')[-1]
        np.save(f'./{image_name}', batch_feature.numpy())
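The tf.reshape() call above collapses the two spatial dimensions of each feature map into one, so every image becomes a list of location vectors that the attention mechanism can weigh. For 299x299 inputs, InceptionV3 without its top produces an 8x8x2048 map, so each image ends up with 64 vectors of 2,048 elements. The idea can be sketched on a tiny nested list; the function name and the 2x2x2 toy map are assumptions made for the example.

```python
def flatten_spatial(feature_map):
    """Turn an (H, W, C) nested list into an (H * W, C) nested list."""
    return [cell for row in feature_map for cell in row]


tiny = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]  # H=2, W=2, C=2
flat = flatten_spatial(tiny)
print(flat)

locations = 8 * 8  # number of attention locations for an 8x8 map
print(locations)
```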

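Fit a tokenizer on the top 5,000 words in our captions. Then, convert each text into a numeric sequence and pad the sequences so that they all have the same size. Also, compute the maximum sequence length: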

top_k = 5000
filters = '!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ '
tokenizer = Tokenizer(num_words=top_k,
                      oov_token='<unk>',
                      filters=filters)
tokenizer.fit_on_texts(train_captions)
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
train_seqs = tokenizer.texts_to_sequences(train_captions)
captions_seqs = pad_sequences(train_seqs, 
                              padding='post')
max_length = get_max_length(train_seqs)
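To make the roles of num_words, oov_token, and padding='post' more concrete, here is a rough plain-Python imitation of what happens in this step. It is a simplified sketch, not the Keras implementation; the function names and toy captions are assumptions made for the example.

```python
from collections import Counter


def fit_vocab(texts, oov='<unk>'):
    """Index words by frequency. 0 is reserved for padding, 1 for OOV."""
    counts = Counter(word for text in texts
                     for word in text.lower().split())
    vocab = {oov: 1}
    for word, _ in counts.most_common():
        vocab[word] = len(vocab) + 1
    return vocab


def to_sequence(text, vocab, num_words, oov_index=1):
    """Map words to indices; rare or unknown words become the OOV index."""
    sequence = []
    for word in text.lower().split():
        index = vocab.get(word)
        if index is None or index >= num_words:
            index = oov_index
        sequence.append(index)
    return sequence


def pad_post(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(sequence) for sequence in sequences)
    return [sequence + [pad_id] * (longest - len(sequence))
            for sequence in sequences]


texts = ['<start> a dog <end>', '<start> a cat sleeps <end>']
vocab = fit_vocab(texts)
sequences = [to_sequence(text, vocab, num_words=6) for text in texts]
padded = pad_post(sequences)
print(padded)
```

Because index 0 never corresponds to a real word, the loss function can later use it to tell padding apart from content.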

We'll use 20% of the data to test the model and the remaining 80% to train it:

(images_train, images_val, caption_train, caption_val) = \
    train_test_split(train_image_paths,
                     captions_seqs,
                     test_size=0.2,
                     random_state=42)

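We'll load batches of 64 images (along with their captions) at a time. Notice that we use the load_image_and_caption() function defined in Step 5, which reads the feature vector corresponding to each image, stored in NumPy format. Moreover, because this function works at the NumPy level, we must wrap it with tf.numpy_function so that it can be used as a valid TensorFlow function inside the map() method: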

BATCH_SIZE = 64
BUFFER_SIZE = 1000
dataset = (tf.data.Dataset
           .from_tensor_slices((images_train, 
                                caption_train))
           .map(lambda i1, i2:
                tf.numpy_function(
                    load_image_and_caption,
                    [i1, i2],
                    [tf.float32, tf.int32]),
                num_parallel_calls=AUTOTUNE)
           .shuffle(BUFFER_SIZE)
           .batch(BATCH_SIZE)
           .prefetch(buffer_size=AUTOTUNE))

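Let's instantiate an ImageCaptioner. The embeddings will have 256 elements, and the decoder and attention model will have 512 units. The vocabulary size is 5,001. Finally, we must pass the tokenizer we fitted in Step 27: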

image_captioner = ImageCaptioner(embedding_size=256,
                                 units=512,
                                 vocab_size=top_k + 1,
                                 tokenizer=tokenizer)
EPOCHS = 30
num_steps = len(images_train) // BATCH_SIZE
image_captioner.train(dataset, EPOCHS, num_steps)

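Define a function that evaluates the image captioner on an image. It must receive the encoder, the decoder, the tokenizer, the image to caption, the maximum sequence length, and the shape of the attention vector. We start by creating a placeholder array, where we'll store the subplots that make up the attention plot: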
```
def evaluate(encoder, decoder, tokenizer, image, 
              max_length,
             attention_shape):
    attention_plot = np.zeros((max_length,
                               attention_shape))
```

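Next, we initialize the hidden state, extract the features of the input image, and pass them to the encoder. We also initialize the decoder input by creating a single sequence that contains the index of the <start> token: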

    hidden = decoder.reset_state(batch_size=1)
    temp_input = tf.expand_dims(load_image(image)[0], 
                                    0)
    image_tensor_val = feature_extractor(temp_input)
    image_tensor_val = tf.reshape(image_tensor_val,
                           (image_tensor_val.shape[0],
                                   -1,
                          image_tensor_val.shape[3]))
    feats = encoder(image_tensor_val)
    start_token_idx = tokenizer.word_index['<start>']
    dec_input = tf.expand_dims([start_token_idx], 0)
    result = []

Now, let's build the caption until we reach the maximum sequence length or encounter the <end> token:

    for i in range(max_length):
        (preds, hidden, attention_w) = \
            decoder(dec_input, feats, hidden)
        attention_plot[i] = tf.reshape(attention_w,
                                       (-1,)).numpy()
        pred_id = tf.random.categorical(preds,
                                     1)[0][0].numpy()
        result.append(tokenizer.index_word[pred_id])
        if tokenizer.index_word[pred_id] == '<end>':
            return result, attention_plot
        dec_input = tf.expand_dims([pred_id], 0)
    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot
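Note that this loop picks the next word with tf.random.categorical(), which samples from the distribution defined by the logits instead of always taking the highest-scoring word. As a result, the same image can produce different captions on different runs. The difference between the two strategies can be sketched in plain Python; the function names and logits below are made up for the illustration.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into probabilities."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def pick_greedy(logits):
    """Always return the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits, rng):
    """Draw an index at random, proportionally to its probability."""
    return rng.choices(range(len(logits)),
                       weights=softmax(logits), k=1)[0]


logits = [0.1, 2.5, 0.3]  # hypothetical scores for a 3-word vocabulary
greedy_id = pick_greedy(logits)
rng = random.Random(0)    # seeded only to make the example repeatable
sampled_ids = [pick_sampled(logits, rng) for _ in range(200)]
print(greedy_id, sampled_ids.count(1))
```

Sampling mostly agrees with the greedy choice when one word clearly dominates, but it occasionally picks a less likely word, which adds variety at the cost of repeatability.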

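Notice that, for each word, we update attention_plot with the attention weights returned by the decoder.

Let's define a function that plots the attention the network pays for each word. It takes the image, the list of individual words that make up the caption (result), the attention_plot returned by evaluate(), and the output path where we'll store the figure: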
```
def plot_attention(image, result,
                   attention_plot, output_path):
    tmp_image = np.array(load_image(image)[0])
    fig = plt.figure(figsize=(10, 10))
```

We iterate over each word, creating a subplot of the corresponding attention map, titled with the specific word it is linked to:

    # Use a square grid large enough to hold one subplot per word
    grid_size = max(int(np.ceil(len(result) / 2)), 2)
    for l in range(len(result)):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(grid_size,
                             grid_size,
                             l + 1)
        ax.set_title(result[l])
        image = ax.imshow(tmp_image)
        ax.imshow(temp_att,
                  cmap='gray',
                  alpha=0.6,
                  extent=image.get_extent())

Finally, we save the complete plot:

    plt.tight_layout()
    # Save before showing; otherwise the saved figure may be blank
    plt.savefig(output_path)
    plt.show()

Evaluate the network on a random image from the validation set:

attention_features_shape = 64
random_id = np.random.randint(0, len(images_val))
image = images_val[random_id]

Build and clean the actual (ground-truth) caption:

actual_caption = ' '.join([tokenizer.index_word[i]
                         for i in caption_val[random_id]
                           if i != 0])
actual_caption = (actual_caption
                  .replace('<start>', '')
                  .replace('<end>', ''))

Generate a caption for the validation image:

result, attention_plot = evaluate(image_captioner.encoder,
                                  image_captioner.decoder,
                                  tokenizer,
                                  image,
                                  max_length,
                                  attention_features_shape)

Build and clean the predicted caption:

predicted_caption = (' '.join(result)
                     .replace('<start>', '')
                     .replace('<end>', ''))

Print the ground-truth and generated captions, and then save the attention plot to disk:

print(f'Actual caption: {actual_caption}')
print(f'Predicted caption: {predicted_caption}')
output_path = './attention_plot.png'
plot_attention(image, result, attention_plot, output_path)

In the following block, we can appreciate the similarity between the actual caption and the one produced by our model:
```
Actual caption: a lone giraffe stands in the midst of a grassy area
Predicted caption: giraffe standing in a dry grass near trees
```
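Now, let's take a look at the attention plot:

Figure 7.7 – Attention plot

Take note of the areas the network looked at when generating each word. Lighter squares mean that more attention was paid to those pixels. For instance, to produce the word giraffe, the network looked at the surroundings of the giraffe in the photo. We can also see that when the network generated the word grass, it looked at the patch of grass near the giraffe's legs. Isn't that amazing?

We'll discuss this in more detail in the How it works… section.

How it works…

In this recipe, we implemented a more complete image captioning system, this time on the considerably more challenging COCO dataset, which is not only several orders of magnitude larger than Flickr8k but also much more varied and, therefore, harder for the network to understand.

However, we gave the network an advantage by equipping it with an attention mechanism, inspired by the impressive breakthrough proposed by Dzmitry Bahdanau (see the See also section for more details). This feature gives the model the ability to perform a soft search for the parts of the source that are relevant to predicting a target word or, in short, to producing the best next word in the output sentence. Such an attention mechanism has an advantage over the traditional approach, in which the decoder generates the output sentence from a fixed-length vector (as we did in the Implementing an image captioning network recipe). The problem with that kind of representation is that it tends to act as a bottleneck when we try to improve performance.

In addition, the attention mechanism lets us understand, in a more intuitive way, how the network reasons when it produces a caption.

Because neural networks are complex pieces of software (often akin to a black box), visual techniques for inspecting their inner workings are a great tool that helps us during training, fine-tuning, and optimization.

See also

In this recipe, we implemented our architecture using the model subclassing pattern, which you can read more about here: www.tensorflow.org/guide/keras/custom_layers_and_models.

Take a look at the following link for a refresher on RNNs: www.youtube.com/watch?v=UNmqTiOnRfg.

Finally, I highly encourage you to read Dzmitry Bahdanau's paper about the attention mechanism we just implemented and used: arxiv.org/abs/1409.0473.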