AlexNet (ImageNet Classification with Deep Convolutional Neural Networks)

Title: ImageNet Classification with Deep Convolutional Neural Networks

Paper: proceedings.neurips.cc/paper_files…

【NIPS 2012】

1. Introduction

60 million parameters
8 layers (five convolutional layers + three fully-connected layers), 1000-way softmax
ReLU (cheaper to compute; makes the model easier to train)
Reducing Overfitting
- Dropout (regularization)
- Data Augmentation (enlarges the effective training set)
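
The "60 million parameters" figure can be checked with a short count. The sketch below assumes the layer sizes described in the paper (3-channel input, 6*6*256 = 9216 features entering the first fully-connected layer, and the two-GPU split that halves the input channels seen by conv2, conv4, and conv5). None of these details are spelled out in the list above, and the helper names `conv_params` and `fc_params` are made up for illustration.

```python
# Rough parameter count for AlexNet, assuming the paper's layer sizes.
# conv2, conv4 and conv5 are split across two GPUs, so each filter only
# sees half of the previous layer's channels (modelled here as groups=2).

def conv_params(in_ch, out_ch, k, groups=1):
    """Weights plus biases of a k*k convolution."""
    return out_ch * (in_ch // groups) * k * k + out_ch

def fc_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features

layers = {
    "conv1": conv_params(3, 96, 11),
    "conv2": conv_params(96, 256, 5, groups=2),
    "conv3": conv_params(256, 384, 3),
    "conv4": conv_params(384, 384, 3, groups=2),
    "conv5": conv_params(384, 256, 3, groups=2),
    "fc6": fc_params(256 * 6 * 6, 4096),
    "fc7": fc_params(4096, 4096),
    "fc8": fc_params(4096, 1000),
}

total = sum(layers.values())
for name, n in layers.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")  # 60,965,224, i.e. roughly 60 million
```

Under these assumptions the three fully-connected layers hold the vast majority of the parameters, which is one reason dropout is applied there.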

1.1 ImageNet

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

AlexNet's win in the 2012 ImageNet competition marked the start of a new wave of interest in neural networks.

1.2 CNNs

1.3 AlexNet brings the third wave of AI

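Hand-crafted features -> features extracted automatically by a neural network. As the figure above shows, on the left, hand-crafted feature extraction and the SVM are separate stages; on the right, the neural network's feature extraction and the softmax classifier are trained together.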

Automatic feature extraction with neural networks

End-to-end (no elaborate feature engineering): the network works directly on raw pixels (raw RGB values of the pixels), which simplifies data preprocessing.
Learn from data (data-driven)

Notes:

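Apart from subtracting the mean activity over the training set from each pixel, no other preprocessing was applied.
There are two common ways to compute the mean. One, the image mean, computes a mean for each of the three RGB channels and subtracts it from the corresponding channel; the other, the pixel mean, subtracts the mean of all pixels from the whole image. The paper presumably uses the first.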
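Cropping images with a center crop is very common.
- e.g., a "16:9" image returned by the Nano banana (Gemini Flash Image Model) API does not have a strict 16:9 aspect ratio (w*h = 1344x768). If it is passed directly as the first frame to the VEO 3.1 API to generate a 16:9 video, the resulting video has black bars on the left and right. A simple fix is to center crop the image to a strict 16:9 aspect ratio (1344x768 -> 1344x756).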
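
The arithmetic behind the 1344x768 -> 1344x756 crop can be sketched as follows. This is a minimal illustration in plain Python; the function name `center_crop_box` is hypothetical and is not part of any of the APIs mentioned above.

```python
def center_crop_box(width, height, ratio_w, ratio_h):
    """Return (left, top, right, bottom) of the largest centered box
    with aspect ratio ratio_w:ratio_h that fits inside width x height."""
    if width * ratio_h > height * ratio_w:
        # Image is too wide for the target ratio: keep the height, trim the sides.
        new_w, new_h = height * ratio_w // ratio_h, height
    else:
        # Image is too tall (or already exact): keep the width, trim top and bottom.
        new_w, new_h = width, width * ratio_h // ratio_w
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return left, top, left + new_w, top + new_h

box = center_crop_box(1344, 768, 16, 9)
print(box)  # (0, 6, 1344, 762): 6 rows trimmed from top and bottom, leaving 1344x756
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed to that method if Pillow is the image library in use.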

2. AlexNet架构

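AlexNet visualization: dgschwend.github.io/netscope/#/…

Netscope is an online visualization tool that renders Caffe .prototxt files (which define the model structure), giving an intuitive view of the model structure and the data flow.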

Note

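(224-11)/4 is not an integer, so the result is rounded down: as the kernel slides horizontally, any trailing pixels that cannot fill a full kernel_size window are dropped. Netscope changes this: its input is resized to 227*227*3, for which (227-11)/4 = 54 is exact.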

With kernel_size 11 and stride 4 on a 224-pixel row, the windows are [0, 10], [4, 14], ..., [212, 222]
The last pixel, [223], is dropped
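
To see exactly which pixels each kernel position covers, the windows can be enumerated directly. This is a small sketch assuming no padding; `conv_windows` is a hypothetical helper, not a library function.

```python
def conv_windows(size, kernel_size, stride):
    """Inclusive (start, end) pixel ranges covered by each kernel position (no padding)."""
    n_out = (size - kernel_size) // stride + 1
    return [(i * stride, i * stride + kernel_size - 1) for i in range(n_out)]

for size in (224, 227):
    windows = conv_windows(size, kernel_size=11, stride=4)
    last_covered = windows[-1][1]
    dropped = list(range(last_covered + 1, size))
    print(size, len(windows), windows[:2], windows[-1], dropped)

# 224 54 [(0, 10), (4, 14)] (212, 222) [223]
# 227 55 [(0, 10), (4, 14)] (216, 226) []
```

With a 227-pixel input every pixel is covered and the output width is 55, which is why the input size is adjusted in the Netscope model.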

To work out how the shape changes, see:

docs.pytorch.org/docs/stable…

A simplified form of the formula is:

floor((w + 2*padding - kernel_size) / stride) + 1
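
As a quick check, the simplified formula can be written as a one-line function. This is a sketch that assumes dilation of 1 (the full PyTorch formula also has a dilation term); `conv_out_size` is a name chosen here for illustration.

```python
def conv_out_size(w, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling layer, dilation = 1."""
    # Integer division implements the floor in the formula.
    return (w + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(224, kernel_size=11, stride=4, padding=1))  # 54
print(conv_out_size(227, kernel_size=11, stride=4))             # 55
print(conv_out_size(54, kernel_size=3, stride=2))               # 26
```

The same function applies to max pooling, since pooling windows slide over the input in the same way.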

Take the per-layer shape changes of the network at zh-v2.d2l.ai/chapter_con… as an example:

import torch
from torch import nn

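net = nn.Sequential(
    # Use a larger 11*11 window here to capture objects.
    # A stride of 4 reduces the output height and width.
    # The number of output channels is also much larger than in LeNet.
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Shrink the convolution window, use padding of 2 so that the input and
    # output height and width match, and increase the number of output channels.
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    # Use three consecutive convolutional layers with a smaller window.
    # Except for the last one, the number of output channels increases further.
    # After the first two of these convolutional layers, no pooling layer is
    # used to reduce the height and width of the input.
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # The fully-connected layers have several times more outputs than in LeNet.
    # Dropout layers are used to mitigate overfitting.
    nn.Linear(6400, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Dropout(p=0.5),
    # Output layer. Fashion-MNIST is used here, so there are 10 classes
    # instead of the 1000 in the paper.
    nn.Linear(4096, 10))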
    
X = torch.randn(1, 1, 224, 224)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)

"""
Conv2d output shape:     torch.Size([1, 96, 54, 54])    # floor((224+2*1-11)/4) + 1 = 53 + 1 = 54
ReLU output shape:       torch.Size([1, 96, 54, 54])
MaxPool2d output shape:  torch.Size([1, 96, 26, 26])    # floor((54-3)/2) + 1 = 25 + 1 = 26
Conv2d output shape:     torch.Size([1, 256, 26, 26])   # (26+2*2-5)/1 + 1 = 26
ReLU output shape:       torch.Size([1, 256, 26, 26])
MaxPool2d output shape:  torch.Size([1, 256, 12, 12])   # floor((26-3)/2) + 1 = 11 + 1 = 12
Conv2d output shape:     torch.Size([1, 384, 12, 12])   # (12+2*1-3)/1 + 1 = 12
ReLU output shape:       torch.Size([1, 384, 12, 12])
Conv2d output shape:     torch.Size([1, 384, 12, 12])   # (12+2*1-3)/1 + 1 = 12
ReLU output shape:       torch.Size([1, 384, 12, 12])
Conv2d output shape:     torch.Size([1, 256, 12, 12])   # (12+2*1-3)/1 + 1 = 12
ReLU output shape:       torch.Size([1, 256, 12, 12])
MaxPool2d output shape:  torch.Size([1, 256, 5, 5])     # floor((12-3)/2) + 1 = 4 + 1 = 5
Flatten output shape:    torch.Size([1, 6400])          # 256*5*5 = 6400
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:       torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:       torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 10])
"""
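
The spatial sizes in the printout above, and the 6400 inputs of the first Linear layer, can also be reproduced without PyTorch by applying the output-size formula layer by layer. This is a sketch: the layer table below is transcribed by hand from the `nn.Sequential` definition, and the names `pool1` to `pool3` are labels chosen here.

```python
# (name, out_channels, kernel_size, stride, padding); None keeps the channel count (pooling)
layers = [
    ("conv1", 96, 11, 4, 1),
    ("pool1", None, 3, 2, 0),
    ("conv2", 256, 5, 1, 2),
    ("pool2", None, 3, 2, 0),
    ("conv3", 384, 3, 1, 1),
    ("conv4", 384, 3, 1, 1),
    ("conv5", 256, 3, 1, 1),
    ("pool3", None, 3, 2, 0),
]

size, channels = 224, 1  # matches X = torch.randn(1, 1, 224, 224)
trace = []
for name, out_ch, k, s, p in layers:
    size = (size + 2 * p - k) // s + 1          # floor((w + 2p - k) / s) + 1
    channels = out_ch if out_ch is not None else channels
    trace.append((name, channels, size))
    print(f"{name}: {channels} x {size} x {size}")

flat = channels * size * size
print("flatten:", flat)  # 6400, the in_features of the first nn.Linear
```

Changing the input size or a padding value in this table shows immediately whether `nn.Linear(6400, 4096)` would still match the flattened feature map.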