MobileNet


MobileNetV1:arxiv.org/abs/1704.04…

MobileNetV2:arxiv.org/abs/1801.04…

MobileNetV3:arxiv.org/abs/1905.02…

MobileNetV1

MobileNetV2

t: expansion factor

c: output feature-map depth (number of channels), i.e. the k′ in the block diagram

n: number of times the bottleneck is repeated

s: stride; it applies only to the first bottleneck of each block, all the others use stride 1

In the first bottleneck t = 1, so the depth is not expanded; accordingly, the PyTorch and TensorFlow implementations omit the first 1×1 conv layer there.
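
As a hedged sketch (the function name is made up, and the sample rows only mirror the first entries of the V2 paper's table), the (t, c, n, s) scheme can be expanded into per-layer settings like this:

```python
# Illustrative sketch: expand MobileNetV2's (t, c, n, s) table rows into
# per-bottleneck (expansion t, out_channels c, stride) settings.
# Only the first bottleneck in each row uses stride s; the rest use 1.
def expand_rows(rows):
    layers = []
    for t, c, n, s in rows:
        for i in range(n):
            layers.append((t, c, s if i == 0 else 1))
    return layers

# First three rows of the V2 table: the t = 1 row, then two expanded rows.
cfg = [(1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2)]
layers = expand_rows(cfg)
```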

MobileNetV3

Note:

The exp size of the first bneck equals its input channel count, so the first 1×1 conv is unnecessary; the official source code omits it.

Network Highlights

MobileNetV1:

  • Depthwise convolution (greatly reduces computation and parameter count)
  • Two extra hyperparameters α and β (α: width multiplier controlling the number of conv kernels per layer; β: resolution multiplier controlling the input image size)
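
To make the role of the two multipliers concrete, here is a minimal sketch (the function name is made up; note the V1 paper actually calls the resolution multiplier ρ rather than β) of how they scale the cost of one depthwise-separable layer:

```python
# Hypothetical helper: multiply-adds of one depthwise-separable layer at
# resolution (h, w) with kernel size k, after applying the width
# multiplier alpha and the resolution multiplier beta.
def dw_separable_madds(h, w, k, c_in, c_out, alpha=1.0, beta=1.0):
    h, w = int(h * beta), int(w * beta)
    c_in, c_out = int(c_in * alpha), int(c_out * alpha)
    dw = k * k * c_in * h * w      # depthwise conv cost
    pw = c_in * c_out * h * w      # pointwise 1x1 conv cost
    return dw + pw
```

Halving β cuts the cost by exactly 4× (both spatial dimensions halve), while halving α cuts it by between 2× and 4× (the depthwise term is linear in α, the pointwise term quadratic).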

MobileNetV2:

  • Inverted Residuals
  • Linear Bottlenecks

MobileNetV3:

  • An updated block (bneck)
  • NAS (Neural Architecture Search) used to search parameters
  • A redesign of the time-consuming layers

Network Structure

MobileNetV1

DW Convolution

Depthwise Separable Convolution

Depthwise Separable Conv = DW conv + PW conv (Pointwise Conv)

A PW convolution is simply a convolution with 1×1 kernels.

Parameter comparison:

Note: the DW kernels tend to "die" during training, i.e. most of their parameters end up as zeros; MobileNetV2 alleviates this problem to some extent.
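
A quick back-of-the-envelope sketch of the parameter comparison (function names are illustrative; biases and BN are ignored):

```python
# Parameters of a standard k x k conv vs. a depthwise-separable conv
# (depthwise k x k + pointwise 1 x 1), biases ignored.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out

std = conv_params(3, 32, 64)           # 18432
sep = dw_separable_params(3, 32, 64)   # 2336
```

The ratio works out to 1/c_out + 1/k², so with 3×3 kernels a depthwise-separable layer costs roughly 1/9 to 1/8 of a standard convolution.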

MobileNetV2

Inverted Residual

The activation function is changed from ReLU to ReLU6.

Not every inverted residual block has a shortcut branch, only those with stride = 1. The paper's statement here is imprecise; the correct condition is: a shortcut exists only when stride = 1 and the input and output feature maps have the same shape.

The last 1×1 conv layer of the inverted residual block uses a linear activation, because ReLU causes large information loss on low-dimensional features but only small loss on high-dimensional ones. The inverted residual is thin at both ends and thick in the middle, so its output is a low-dimensional feature; using a linear activation instead of ReLU there avoids that loss.
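
Putting the two rules above together (the shortcut condition and the linear final projection), here is a small planning sketch; it is purely illustrative, not the official implementation:

```python
# Illustrative plan of one inverted residual block:
# 1x1 expand (skipped when t == 1) -> 3x3 depthwise -> 1x1 linear project.
def inverted_residual_plan(c_in, c_out, stride, t):
    hidden = c_in * t
    layers = []
    if t != 1:  # with t == 1 the expand conv would be a no-op, so it is omitted
        layers.append(("1x1 conv", c_in, hidden, "ReLU6"))
    layers.append(("3x3 dwconv", hidden, hidden, "ReLU6"))
    layers.append(("1x1 conv", hidden, c_out, "linear"))  # linear bottleneck
    shortcut = stride == 1 and c_in == c_out  # corrected shortcut condition
    return layers, shortcut
```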

Linear Bottleneck

Importance (quoted from the paper):

Importance of linear bottlenecks. The linear bottleneck models are strictly less powerful than models with non-linearities, because the activations can always operate in linear regime with appropriate changes to biases and scaling. However our experiments shown in Figure 6a indicate that linear bottlenecks improve performance, providing support that non-linearity destroys information in low-dimensional space.

MobileNetV3

block

  • An SE module is added
  • The activation function is updated

SE Module

Each channel is global-average-pooled to a single value, giving C values; these pass through FC → ReLU → FC → hard-sigmoid to produce a weight for each channel (the first FC has C/4 units, the second restores the original C).
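
A minimal pure-Python sketch of that squeeze-and-excitation path (the tiny weights below are made-up placeholders, not learned values):

```python
def hard_sigmoid(x):
    return min(max(x + 3.0, 0.0), 6.0) / 6.0

def se_reweight(channels, w1, w2):
    """channels: list of C feature maps (H x W nested lists);
    w1: (C/4) x C matrix, w2: C x (C/4) matrix."""
    # 1. global average pool -> one value per channel
    pooled = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in channels]
    # 2. FC (C -> C/4) + ReLU
    squeezed = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    # 3. FC (C/4 -> C) + hard-sigmoid -> per-channel scale in [0, 1]
    scales = [hard_sigmoid(sum(w * s for w, s in zip(row, squeezed))) for row in w2]
    # 4. reweight each channel by its scale
    return [[[v * sc for v in r] for r in ch] for ch, sc in zip(channels, scales)]

# toy example: C = 4, reduced to C/4 = 1, all-ones 2x2 feature maps
fmaps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
out = se_reweight(fmaps, [[0.25, 0.25, 0.25, 0.25]], [[1.0]] * 4)
```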

Redesigning the Time-Consuming Layers

  • Fewer kernels in the first conv layer (32 → 16)
  • A simplified Last Stage

The original and optimized last stages can be seen in figure 5. The efficient last stage reduces the latency by 7 milliseconds which is 11% of the running time and reduces the number of operations by 30 millions MAdds with almost no loss of accuracy. Section 6 contains detailed results.

We settled on using the hard swish nonlinearity for this layer as it performed as well as other nonlinearities tested. We were able to reduce the number of filters to 16 while maintaining the same accuracy as 32 filters using either ReLU or swish. This saves an additional 2 milliseconds and 10 million MAdds.

Redesigning the Activation Function

In [36, 13, 16] a nonlinearity called swish was introduced that when used as a drop-in replacement for ReLU, that significantly improves the accuracy of neural networks. The nonlinearity is defined as

While this nonlinearity improves accuracy, it comes with non-zero cost in embedded environments as the sigmoid function is much more expensive to compute on mobile devices. We deal with this problem in two ways.

Previously ReLU6 was used; nowadays the common choice is swish(x) = x · sigmoid(x). But swish is expensive to compute and differentiate, and it is unfriendly to quantization (mobile devices almost always quantize for speed).

So the paper proposes the h-swish activation.

First, look at h-sigmoid:

h-sigmoid closely approximates sigmoid, so it is often used as a drop-in replacement for it.

Using these at inference time therefore speeds things up and stays quantization-friendly.
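
The functions involved, sketched in plain Python (definitions follow the V3 paper: h-sigmoid(x) = ReLU6(x + 3)/6 and h-swish(x) = x · h-sigmoid(x)):

```python
def relu6(x):
    return min(max(x, 0.0), 6.0)

def h_sigmoid(x):
    # piecewise-linear approximation of sigmoid: cheap and quantization-friendly
    return relu6(x + 3.0) / 6.0

def h_swish(x):
    # x * h_sigmoid(x); tracks swish closely but avoids computing exp()
    return x * h_sigmoid(x)
```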

The cost of applying nonlinearity decreases as we go deeper into the network, since each layer activation memory typically halves every time the resolution drops. Incidentally, we find that most of the benefits swish are realized by using them only in the deeper layers. Thus in our architectures we only use h-swish at the second half of the model. We refer to the tables 1 and 2 for the precise layout.

Even with these optimizations, h-swish still introduces some latency cost. However as we demonstrate in section 6 the net effect on accuracy and latency is positive with no optimizations and substantial when using an optimized implementation based on a piece-wise function.

Some layers use ReLU, others use h-sigmoid.

Experimental Comparison

MobileNetV1

MobileNetV2