VGG

arxiv.org/abs/1409.15…

Configuration D is the one most commonly used: the 16-layer variant (13 convolutional layers + 3 fully connected layers).

Network structure

  • 3×3 convolutions with stride=1 and padding=1 (the spatial size is unchanged after each convolution)
  • max-pooling with a 2×2 window and stride=2 (halves the height and width of the feature map)
  • the first two FC layers are followed by a ReLU activation; the last one is not, because the soft-max follows it (a runnable sketch of configuration D appears right after this list)
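These bullet points map directly onto code. Below is a minimal PyTorch sketch of configuration D, written as a reconstruction for illustration rather than the authors' original implementation; the names `CFG_D`, `make_features`, and `VGG16` are mine.

```python
import torch
import torch.nn as nn

# Configuration D (VGG-16): integers are conv output channels, 'M' is a 2x2 max-pool.
CFG_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg):
    layers, in_ch = [], 3  # RGB input
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves H and W
        else:
            # 3x3 conv, stride 1, padding 1: spatial resolution is preserved
            layers.append(nn.Conv2d(in_ch, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_ch = v
    return nn.Sequential(*layers)

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = make_features(CFG_D)  # 13 conv layers
        # After five 2x poolings, 224 -> 7, so the flattened size is 512*7*7.
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),  # no ReLU here: the soft-max follows
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)  # raw logits; soft-max / CrossEntropyLoss applied outside

print(VGG16()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```

The dropout after the first two FC layers matches the training setup described in the paper (dropout ratio 0.5).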

The paper uses 3×3 convolutions, the smallest size that still captures left/right, up/down and center; one configuration also uses 1×1 convolutions, which can be seen as a linear transformation of the input channels (followed by a non-linearity). Except for one network, none of the configurations use LRN (Local Response Normalisation): this normalisation does not improve performance on the ILSVRC dataset, but increases memory consumption and computation time. Where applicable, the LRN parameters are those of Krizhevsky et al. (2012).
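The "1×1 convolution as a linear transformation of the channels" point is easy to verify: a 1×1 conv computes the same thing as one shared fully connected map applied at every spatial position. A small check (the channel and spatial sizes here are arbitrary, chosen only for the demo):

```python
import torch
import torch.nn as nn

conv1x1 = nn.Conv2d(256, 128, kernel_size=1)
linear = nn.Linear(256, 128)
# A 1x1 conv kernel of shape (128, 256, 1, 1) is just a 128x256 linear weight.
linear.weight.data = conv1x1.weight.data.view(128, 256)
linear.bias.data = conv1x1.bias.data

x = torch.randn(1, 256, 14, 14)
y_conv = conv1x1(x)
# Apply the same linear map independently at each of the 14x14 positions.
y_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(y_conv, y_lin, atol=1e-5))  # True
```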

During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.
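The padding and pooling arithmetic follows the usual formula output = (input − kernel + 2·padding)/stride + 1; a quick shape check (PyTorch used purely for illustration, the paper itself is framework-agnostic):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)        # (224 - 3 + 2)/1 + 1 = 224 -> torch.Size([1, 64, 224, 224])
print(pool(conv(x)).shape)  # 224 / 2 = 112             -> torch.Size([1, 64, 112, 112])
```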

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
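These three FC layers hold most of the network's weights. A back-of-the-envelope count (not stated in the passage above, but it follows from the layer sizes; 512·7·7 = 25088 is the flattened conv output for a 224×224 input):

```python
# Weights + biases of the three fully connected layers:
fc1 = 512 * 7 * 7 * 4096 + 4096   # 102,764,544  (~102.8M)
fc2 = 4096 * 4096 + 4096          #  16,781,312  (~16.8M)
fc3 = 4096 * 1000 + 1000          #   4,097,000  (~4.1M)
print(fc1 + fc2 + fc3)            # 123,642,856: the bulk of VGG-16's ~138M parameters
```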

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

Network highlights:

Stacking several 3×3 kernels replaces a larger kernel while reducing the number of parameters: the paper stacks two 3×3 conv layers in place of one 5×5 layer, and three in place of one 7×7 layer.

It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
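The receptive-field claim holds because each extra 3 × 3 layer with stride 1 widens the receptive field by 2 pixels (3 → 5 → 7), and the parameter comparison can be reproduced directly from the layer definitions (biases disabled to match the paper's weight-only count; C = 64 is arbitrary, since the ratio does not depend on C):

```python
import torch.nn as nn

C = 64
stack = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1, bias=False) for _ in range(3)])
single = nn.Conv2d(C, C, 7, padding=3, bias=False)

n_stack = sum(p.numel() for p in stack.parameters())    # 3 * 3*3 * C*C = 27*C*C
n_single = sum(p.numel() for p in single.parameters())  # 7*7 * C*C     = 49*C*C
print(n_stack, n_single)   # 110592 200704
print(n_single / n_stack)  # ~1.81, i.e. 81% more for the single 7x7 layer
```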

