Networks with Parallel Concatenations (GoogLeNet) | Modern Convolutional Neural Networks | Dive into Deep Learning


1. GoogLeNet has several later versions. Try implementing and running them, then observe the experimental results. These later versions include adding batch normalization, adjusting the Inception block, using label smoothing for regularization, and adding residual connections. The implementation below takes the first route: a batch normalization layer is inserted after each convolution of the original GoogLeNet.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, in_channels, c1, c2, c3, c4):
        super(Inception, self).__init__()
        # Branch 1: a single 1x1 convolution
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        self.p1_bn = nn.BatchNorm2d(c1)  # batch normalization on the branch output

        # Branch 2: 1x1 convolution (channel reduction) followed by a 3x3 convolution
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        self.p2_bn = nn.BatchNorm2d(c2[1])

        # Branch 3: 1x1 convolution (channel reduction) followed by a 5x5 convolution
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        self.p3_bn = nn.BatchNorm2d(c3[1])

        # Branch 4: 3x3 max pooling followed by a 1x1 convolution
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)
        self.p4_bn = nn.BatchNorm2d(c4)

    def forward(self, x):
        p1 = F.relu(self.p1_bn(self.p1_1(x)))
        p2 = F.relu(self.p2_bn(self.p2_2(F.relu(self.p2_1(x)))))
        p3 = F.relu(self.p3_bn(self.p3_2(F.relu(self.p3_1(x)))))
        p4 = F.relu(self.p4_bn(self.p4_2(self.p4_1(x))))
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat((p1, p2, p3, p4), dim=1)


class GoogLeNet(nn.Module):
    def __init__(self):
        super(GoogLeNet, self).__init__()
        self.b1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        self.b2 = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 192, kernel_size=3, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        self.b3 = nn.Sequential(
            Inception(192, 64, (96, 128), (16, 32), 32),
            Inception(256, 128, (128, 192), (32, 96), 64),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        self.b4 = nn.Sequential(
            Inception(480, 192, (96, 208), (16, 48), 64),
            Inception(512, 160, (112, 224), (24, 64), 64),
            Inception(512, 128, (128, 256), (24, 64), 64),
            Inception(512, 112, (144, 288), (32, 64), 64),
            Inception(528, 256, (160, 320), (32, 128), 128),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )
        self.b5 = nn.Sequential(
            Inception(832, 256, (160, 320), (32, 128), 128),
            Inception(832, 384, (192, 384), (48, 128), 128),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten()
        )
        self.fc = nn.Linear(1024, 10)

    def forward(self, x):
        x = self.b1(x)
        x = self.b2(x)
        x = self.b3(x)
        x = self.b4(x)
        x = self.b5(x)
        x = self.fc(x)
        return x
        
net = GoogLeNet()
print(net)
GoogLeNet(
  (b1): Sequential(
    (0): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  )
  (b2): Sequential(
    (0): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Conv2d(64, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  )
  (b3): Sequential(
    (0): Inception(
      (p1_1): Conv2d(192, 64, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(192, 96, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(96, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(192, 16, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(192, 32, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): Inception(
      (p1_1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(128, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(256, 32, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(32, 96, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  )
  (b4): Sequential(
    (0): Inception(
      (p1_1): Conv2d(480, 192, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(480, 96, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(96, 208, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(208, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(480, 16, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(16, 48, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(480, 64, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): Inception(
      (p1_1): Conv2d(512, 160, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(160, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(512, 112, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(112, 224, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(224, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(512, 24, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(24, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): Inception(
      (p1_1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(512, 24, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(24, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (3): Inception(
      (p1_1): Conv2d(512, 112, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(112, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(512, 144, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(144, 288, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(288, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(512, 32, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (4): Inception(
      (p1_1): Conv2d(528, 256, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(528, 160, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(160, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(528, 32, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(32, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(528, 128, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (5): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  )
  (b5): Sequential(
    (0): Inception(
      (p1_1): Conv2d(832, 256, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(832, 160, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(160, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(832, 32, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(32, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(832, 128, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): Inception(
      (p1_1): Conv2d(832, 384, kernel_size=(1, 1), stride=(1, 1))
      (p1_bn): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p2_1): Conv2d(832, 192, kernel_size=(1, 1), stride=(1, 1))
      (p2_2): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (p2_bn): BatchNorm2d(384, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p3_1): Conv2d(832, 48, kernel_size=(1, 1), stride=(1, 1))
      (p3_2): Conv2d(48, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
      (p3_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (p4_1): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
      (p4_2): Conv2d(832, 128, kernel_size=(1, 1), stride=(1, 1))
      (p4_bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): AdaptiveAvgPool2d(output_size=(1, 1))
    (3): Flatten(start_dim=1, end_dim=-1)
  )
  (fc): Linear(in_features=1024, out_features=10, bias=True)
)
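
As a quick sanity check after construction, we can trace the output shape of each block. A minimal sketch: the 96x96 single-channel input assumes the Fashion-MNIST setup of the original experiment.

X = torch.rand(1, 1, 96, 96)
for name, blk in [('b1', net.b1), ('b2', net.b2), ('b3', net.b3),
                  ('b4', net.b4), ('b5', net.b5), ('fc', net.fc)]:
    X = blk(X)
    print(name, 'output shape:', tuple(X.shape))
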
Results with batch normalization:

loss 0.091, train acc 0.965, test acc 0.904
5100.7 examples/sec on cuda:0


Original results (without batch normalization):

loss 0.272, train acc 0.896, test acc 0.888
5732.2 examples/sec on cuda:0


2. What is the minimum image size that GoogLeNet can use?

The architecture of GoogLeNet (Inception v1) consists of multiple convolutional, pooling, and fully connected layers, which together determine the minimum input image size. To find it, we need to analyze the architecture, in particular the downsampling performed by the strided layers.

A brief review of the GoogLeNet architecture

  1. Stage 1:

    • Convolutional layer: 7x7 convolution, stride 2
    • Max pooling layer: 3x3 pooling, stride 2
  2. Stage 2:

    • Convolutional layer: 1x1 convolution, stride 1
    • Convolutional layer: 3x3 convolution, stride 1
    • Max pooling layer: 3x3 pooling, stride 2
  3. Stage 3:

    • Two Inception blocks, followed by a max pooling layer: 3x3 pooling, stride 2
  4. Stage 4:

    • Five Inception blocks, followed by a max pooling layer: 3x3 pooling, stride 2
  5. Stage 5:

    • Two Inception blocks, followed by a global average pooling layer

To compute the minimum input size, we can track how the spatial dimensions change at each stage. Suppose the input image has size $H \times W$. The dimensions evolve as follows:

  1. Stage 1:

    • After the convolution: $\left\lfloor \frac{H + 2 \times 3 - 7}{2} + 1 \right\rfloor \times \left\lfloor \frac{W + 2 \times 3 - 7}{2} + 1 \right\rfloor$
    • After max pooling: roughly $\left\lfloor \frac{H}{4} \right\rfloor \times \left\lfloor \frac{W}{4} \right\rfloor$
  2. Stage 2:

    • The stride-1 convolutions leave the spatial size unchanged
    • After max pooling: roughly $\left\lfloor \frac{H}{8} \right\rfloor \times \left\lfloor \frac{W}{8} \right\rfloor$
  3. Stage 3:

    • The Inception blocks leave the spatial size unchanged
    • After max pooling: roughly $\left\lfloor \frac{H}{16} \right\rfloor \times \left\lfloor \frac{W}{16} \right\rfloor$
  4. Stage 4:

    • The Inception blocks leave the spatial size unchanged
    • After max pooling: roughly $\left\lfloor \frac{H}{32} \right\rfloor \times \left\lfloor \frac{W}{32} \right\rfloor$
  5. Stage 5:

    • The Inception blocks leave the spatial size unchanged
    • Global average pooling reduces the spatial dimensions to $1 \times 1$

To keep the feature map non-degenerate through all of these pooling layers, the input must be large enough. The network halves the spatial resolution at five points (the stride-2 convolution in stage 1 and the four stride-2 pooling layers), for a total downsampling factor of $2^5 = 32$. If we require the final feature map before global average pooling to be at least $1 \times 1$, working backwards gives

$H_{\text{min}} = W_{\text{min}} = 32$

Therefore, the minimum input image size for GoogLeNet (Inception v1) is $32 \times 32$: with a $32 \times 32$ input, the feature map is exactly $1 \times 1$ after all the downsampling stages.
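
We can verify this claim empirically. A minimal sketch reusing the GoogLeNet class defined above; only the spatial dimensions matter here:

# Feed a 32x32 single-channel dummy image through the downsampling stages
# and watch the spatial size shrink to 1x1 before the global average
# pooling in b5.
net = GoogLeNet()
X = torch.rand(1, 1, 32, 32)
for name, blk in [('b1', net.b1), ('b2', net.b2),
                  ('b3', net.b3), ('b4', net.b4)]:
    X = blk(X)
    print(name, 'spatial size:', tuple(X.shape[-2:]))
# Expected: b1 (8, 8), b2 (4, 4), b3 (2, 2), b4 (1, 1)
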

3. Compare the model parameter sizes of AlexNet, VGG, and NiN with GoogLeNet. How do the latter two architectures significantly reduce the number of model parameters?

Before comparing the parameter sizes of AlexNet, VGG, NiN (Network in Network), and GoogLeNet, let us look at the parameter scale of each and the strategies they use to reduce it.

Model parameter sizes

1. AlexNet

AlexNet is a classic convolutional neural network consisting of 5 convolutional layers and 3 fully connected layers. Its parameters are concentrated mainly in the fully connected layers.

  • Parameter count: about 60 million

2. VGG

VGG has several variants, the most common being VGG-16 and VGG-19, with 16 and 19 weight layers respectively. All of their convolution kernels are $3 \times 3$.

  • Parameter count: about 138 million (VGG-16)

3. NiN

NiN introduces $1 \times 1$ convolutional layers in place of some fully connected layers, which significantly reduces the parameter count.

  • Parameter count: about 7.6 million

4. GoogLeNet (Inception v1)

GoogLeNet extracts features with Inception blocks, which run convolutions and pooling in parallel branches, greatly reducing the parameter count.

  • Parameter count: about 6.8 million

Parameter count comparison

| Model     | Parameters |
| --------- | ---------- |
| AlexNet   | ~60M       |
| VGG-16    | ~138M      |
| NiN       | ~7.6M      |
| GoogLeNet | ~6.8M      |
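
The figures above are the commonly cited totals for the ImageNet versions of these networks. As a rough check, a sketch that assumes torchvision is installed; note that the GoogLeNet defined above takes single-channel input and has 10 output classes, so its total differs from the 1000-class original:

import torchvision.models as tvm

def count(m):
    return sum(p.numel() for p in m.parameters())

print(f'AlexNet:          {count(tvm.alexnet()) / 1e6:.1f}M')
print(f'VGG-16:           {count(tvm.vgg16()) / 1e6:.1f}M')
print(f'GoogLeNet (ours): {count(GoogLeNet()) / 1e6:.1f}M')
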

Strategies for reducing model parameter size

NiN (Network in Network)

  1. Use of $1 \times 1$ convolutional layers:

    • A $1 \times 1$ convolution changes the number of channels while preserving the spatial resolution. This reduces the parameter count and, together with the nonlinearity that follows it, increases the network's expressive power.
    • $1 \times 1$ convolutions cut computation and parameters through dimensionality reduction. For example, reducing 256 channels to 64 with a $1 \times 1$ convolution costs only $256 \times 64 = 16384$ weights rather than the $256 \times 256 = 65536$ needed to keep all 256 channels, and every subsequent layer then sees 64 instead of 256 input channels (see the sketch after this list).
  2. Replacing fully connected layers:

    • By using $1 \times 1$ convolutions in the final layers, NiN reduces its reliance on fully connected layers. Traditional fully connected layers carry enormous numbers of parameters, especially on high-dimensional inputs, and $1 \times 1$ convolutions largely remove that cost.
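
The saving is easy to quantify. A sketch with hypothetical layer sizes (256 input channels, a 64-channel bottleneck, bias terms omitted):

import torch.nn as nn

# A 3x3 convolution applied directly to 256 channels...
direct = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
# ...versus a 1x1 bottleneck down to 64 channels followed by the 3x3.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),
    nn.Conv2d(64, 256, kernel_size=3, padding=1, bias=False),
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(direct))      # 256 * 256 * 9 = 589,824
print(count(bottleneck))  # 256 * 64 + 64 * 256 * 9 = 163,840
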

GoogLeNet (Inception v1)

  1. Inception blocks:

    • An Inception block runs $1 \times 1$, $3 \times 3$, and $5 \times 5$ convolutions together with max pooling in parallel branches and concatenates the results, capturing features at multiple scales effectively.
    • $1 \times 1$ convolutions reduce the number of input channels seen by the subsequent $3 \times 3$ and $5 \times 5$ convolutions, cutting the parameter count substantially.
  2. Fewer fully connected layers:

    • At the end of the network, GoogLeNet replaces the traditional fully connected layers with a global average pooling layer, which reduces each channel's feature map to a single value. A fully connected layer in this position typically holds millions of parameters, while global average pooling has none (a rough comparison follows this list).
  3. Modular design:

    • The modular design (the Inception blocks) keeps the parameter budget under control; each block is tuned to balance accuracy against computational cost.
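
The comparison mentioned above, as a sketch: it assumes a hypothetical 1024x7x7 final feature map (which happens to match the original ImageNet GoogLeNet) and a 1000-way classifier.

import torch.nn as nn

# Dense classifier on the flattened 1024 x 7 x 7 feature map:
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(1024 * 7 * 7, 1000))
# Global average pooling followed by a small linear layer:
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
                         nn.Linear(1024, 1000))

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(fc_head))   # 1024 * 7 * 7 * 1000 + 1000 = 50,177,000
print(count(gap_head))  # 1024 * 1000 + 1000 = 1,025,000
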

Summary

Comparing the parameter counts of AlexNet, VGG, NiN, and GoogLeNet shows that the latter two architectures rely on innovative designs to significantly reduce parameter size:

  • NiN lowers the parameter count by using $1 \times 1$ convolutional layers extensively while preserving the network's expressive power.
  • GoogLeNet reduces the parameter count through the multi-branch structure of its Inception blocks and a global average pooling layer, while improving feature extraction and overall performance.

These design innovations not only shrink the models but also improve computational efficiency, making deep networks more practical in resource-constrained environments.