CIFAR-10 Estimator 之 Vgg 模型在上一篇学习笔记中(一、二、三)，我使用教程中用来识别MNIST数

在上一篇学习笔记中(一、二、三)，我使用教程中用来识别 MNIST 数据集的 CNN 模型，写了 CIFAR-10 的分类器。正好最近需要使用 Vgg 网络做一些图像识别的东西，就还是先使用 CIFAR-10 数据集为基础，来实现一个 Vgg Estimator 好了。

虽然作为经典的卷积网络模型，网上已经有很多 Vgg 模型的实现了，比如使用 Caffe 的，使用 Keras 的。不过，我还是感觉从论文出发，比较有趣。这里有一篇 Vgg 模型论文的解读。时间有限，我没有去读原文，就用这个解读来入手吧。

因为同样是实现 CIFAR-10 数据集的识别，在准备数据这个部分，我就省去了，可以参考上一个系列的第一篇文章。我要从第二步（可能也只需要改第二步）直接开搞了。

前文中，我使用了一个派生类，而不是简单创建一个 model_fn 的方法来构建一个 Estimator。这样，可以在很大程序上重用结构代码。这个派生类是这样的：

class VggEstimator(tf.estimator.Estimator):
    """CIFAR-10 Vgg Estimator."""

    def __init__(self, params=None, config=None):
        """Init the estimator.
        """
        super().__init__(
            model_fn=self.the_model_fn,
            model_dir=FLAGS.model_dir,
            config=config,
            params=params
        )

也就是说，model_fn 被放在一个 method 里了：

def the_model_fn(self, features, labels, mode, params):
    # Use `input_layer` to apply the feature columns.
    input_images = tf.reshape(features, [-1, 32, 32, 3])
    input_layer = tf.image.resize_images(input_images, [224, 224])

    logits = self.construct_model_vgg16(input_layer, mode == tf.estimator.ModeKeys.TRAIN)

    predictions = {
        # Generate predictions (for PREDICT and EVAL mode)
        "classes": tf.argmax(input=logits, axis=1),
        # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
        # `logging_hook`.
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=FLAGS.learning_rate)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics (for EVAL mode)
    accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])    
    eval_metric_ops = { "accuracy": accuracy }
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

上面这个 model_fn，其实只和输入数据的 shape 有关系，和模型的具体结构是不相关的，如果我们要在这里同时尝试 Vgg16 和 Vgg19 两种模型，只需要定义两个模型的构建函数就可以了，其它关于 loss 和 metrics 都是共同的。

简单解释一下代码（凑字数）：因为输入的图像是被展开成一维的数组的，所以首先要还数据的 shape 还原，也就是 reshape 成原来的图像形状，这里的 shape 应该是 [batch_size, height, width, channels]，而在 reshape 的时候，第一个维度使用了 -1，意思是这一维是自动计算的：总数据量 ÷ 后面三个维度的积。这一点，我是在看了官方教程之后才知道……所以也写下来，以备后来的人学习吧。

因为 CIFAR-10 的图片都是很小的，只有 32 x 32 那么大，所以需要先 resize 到 Vgg 网线正常的 224 x 224 的大小。当然，这个 resize 一定会破坏原来图像的信息，因为无论使用什么插值算法，大这么大的变换尺度下，都会引入很多人为的噪声。这里就先假装不知道吧。只是为了实现 Vgg 模型而已。

后面的代码就是构建在不同运行条件下的 EstimatorSpec，这个在前文中已经有一些描述了，里就行不管了。

下面开始构造模型，为了简单，就先搞一个 Vgg16 来试试吧：

    def construct_model_vgg16(self, input_layer, is_training):
        """Construct the model."""
        # Convolutional Layer group 1
        conv1_1 = self.conv_layer(input_layer, 64)
        conv1_2 = self.conv_layer(conv1_1, 64)
        pool1 = self.pool_layer(conv1_2)

        # Convolutional Layer group 2
        conv2_1 = self.conv_layer(pool1, 128)
        conv2_2 = self.conv_layer(conv2_1, 128)
        pool2 = self.pool_layer(conv2_2)

        # Convolutional Layer group 3
        conv3_1 = self.conv_layer(pool2, 256)
        conv3_2 = self.conv_layer(conv3_1, 256)
        conv3_3 = self.conv_layer(conv3_2, 256)
        pool3 = self.pool_layer(conv3_3)

        # Convolutional Layer group 4
        conv4_1 = self.conv_layer(pool3, 512)
        conv4_2 = self.conv_layer(conv4_1, 512)
        conv4_3 = self.conv_layer(conv4_2, 512)
        pool4 = self.pool_layer(conv4_3)

        # Convolutional Layer group 5
        conv5_1 = self.conv_layer(pool4, 512)
        conv5_2 = self.conv_layer(conv5_1, 512)
        conv5_3 = self.conv_layer(conv5_2, 512)
        pool5 = self.pool_layer(conv5_3)

        # Dense Layer
        pool5_flat = tf.reshape(pool5, [-1, 7 * 7 * 512])
        fc1 = tf.layers.dense(inputs=pool5_flat, units=4096, activation=tf.nn.relu)
        fc2 = tf.layers.dense(inputs=fc1, units=4096, activation=tf.nn.relu)
        dropout = tf.layers.dropout(
            inputs=fc2, rate=0.4, training=is_training)

        # Logits Layer
        logits = tf.layers.dense(inputs=dropout, units=10)

        return logits

漂亮！模型被组织成一组一组的，每一组是由若干卷积层和一个池化层组成。Vgg 模型比较有意思，就是卷积层都是使用一样的卷积核大小，这样我就创建了两个辅助函数：

    def conv_layer(self, inputs, filters):
        return tf.layers.conv2d(
            inputs=inputs,
            filters=filters,
            kernel_size=3,
            padding='same',
            activation=tf.nn.relu)

    def pool_layer(self, inputs):
        return tf.layers.max_pooling2d(inputs=inputs, pool_size=2, strides=2)

当然，也可以直接创建一个函数，一次创建一个 Group 的层。因为每一组里的卷积层都是一样的 filters 数。不过列出来也很清晰。

配合之前的 main.py 和 TFRecord 转换脚本，就又完成了一个 CIFAR-10 的 Estimator 了。不过，这个模型有更大的参数空间，我在 MBP 上跑了一下，CPU 根本跑不动，风扇狂转，吓得我赶快关了。准备去阿里云或者京东云买个 GPU 的服务器来跑一下。完整的代码点这里。

当然，对于 CIFAR-10 这样的数据集来说，其实并不需要这么复杂的模型。只是通过这篇文章和之前的一个系列，就算是完成了基本的 Estimator 模型架构。然后就可以在这个架构下，去实现一些经典的卷积网络模型了。只是 CIFAR-10 这个数据集比较弱：图像尺寸小，数据规模也小。所以对于很大的网络应该是杀鸡用牛刀了。也只能用来测试一下网络结构的正确性而已。