ShuffleNetV2



ShuffleNet V2: arxiv.org/abs/1807.11… (well worth reading: it is hardcore and dense, and both the ideas it presents and the experiments it runs are very thorough)

ShuffleNetV2

The overall architecture is basically the same as V1, except that a Conv5 (1×1 convolution) layer is added.

For the first block of each stage, the channel count needs to increase, so that block uses stride=2. For the first block of stage2 specifically, the output channels of its two branches are not equal to the input channels; instead, each branch is set directly to half of the specified output channel count. In the 1× version, for example, each branch gets 58 channels, not 24.

Network architecture

(Figure: ShuffleNetV2 overall architecture)

FLOPS vs. FLOPs

  • FLOPS (Floating Point Operations Per Second): the number of floating-point operations executed per second; a measure of hardware speed.
  • FLOPs (Floating Point Operations): the total number of floating-point operations, i.e. the amount of computation; it measures a model's computational complexity and is commonly used as an indirect proxy for network speed.

The paper argues that computational complexity cannot be judged by FLOPs alone (FLOPs are an indirect metric; the direct one is speed) and that other factors must be considered as well. The authors propose four practical guidelines for designing efficient networks and, based on them, a new block design.

FLOPs here mainly count the computation of the convolution layers.
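
As a rough illustration (my own sketch, not from the paper; the helper name is made up), FLOPs of a convolution layer are usually counted as one multiply-add per kernel weight per output position:

```python
def conv_flops(h_out, w_out, k, c_in, c_out, groups=1):
    """Multiply-adds of a k x k convolution producing an h_out x w_out x c_out map.

    One multiply-add per kernel weight per output position, which is the
    convention most FLOPs reports follow; bias, BN and ReLU are ignored.
    """
    return h_out * w_out * (k * k * c_in // groups) * c_out

# Example: a 1x1 convolution on a 56x56 map, 24 -> 116 channels
print(conv_flops(56, 56, 1, 24, 116))  # 8730624 multiply-adds
```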

several practical guidelines for efficient network architecture design

G1) Equal channel width minimizes memory access cost (MAC).

With FLOPs held fixed, MAC is minimized when the input feature map and the output feature map of a convolution layer have the same number of channels. (MAC: memory access cost.)

B is the FLOPs consumed by the 1×1 convolution layer.
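
Writing this out (the accounting is: read the input, write the output, read the weights), for a 1×1 convolution on an h×w feature map with c1 input and c2 output channels:

$$
B = hwc_1c_2,\qquad \mathrm{MAC} = hw(c_1 + c_2) + c_1c_2 \;\ge\; 2\sqrt{hwB} + \frac{B}{hw},
$$

by the mean value (AM-GM) inequality, with equality exactly when c1 = c2. So for a fixed FLOPs budget B, equal input and output widths minimize MAC.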

As the figure shows, with FLOPs held constant, the larger the gap between c1 and c2, the slower the inference speed.

G2) Excessive group convolution increases MAC.

When the number of groups in a group convolution (GConv) is increased (with FLOPs held fixed), MAC also increases.
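
In the same notation, with g groups and B = hwc₁c₂/g the FLOPs of the grouped 1×1 convolution:

$$
\mathrm{MAC} = hw(c_1 + c_2) + \frac{c_1c_2}{g} = hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw},
$$

so with B and the input shape hwc₁ held fixed, MAC grows as g grows.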

G3) Network fragmentation reduces degree of parallelism.

The more fragmented the network design, the slower it runs. Fragmentation can be understood as the degree to which the computation is broken into many small operators or branches, whether they are arranged in series or in parallel.

Though such fragmented structure has been shown beneficial for accuracy, it could decrease efficiency because it is unfriendly for devices with strong parallel computing powers like GPU. It also introduces extra overheads such as kernel launching and synchronization.
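
As a toy illustration of what a "fragment" is (my own sketch; the paper's appendix benchmarks comparable 1-, 2- and 4-fragment variants), the same FLOPs can be spent in one convolution or scattered over several thin parallel branches. The latter needs more kernel launches, and the concat adds a synchronization point:

```python
import torch
import torch.nn as nn

class OneFragment(nn.Module):
    """A single 1x1 convolution: one kernel launch per forward pass."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, x):
        return self.conv(x)

class FourFragmentsParallel(nn.Module):
    """Roughly the same FLOPs, split into four thin parallel branches:
    more kernel launches, plus a synchronization at the concat."""
    def __init__(self, c):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, c // 4, kernel_size=1) for _ in range(4)]
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```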

G4) Element-wise operations are non-negligible.

The impact of element-wise operations is also non-negligible. (Question: why is the depthwise (DW) convolution treated as an element-wise operation here?)

As shown in Figure 2, in light-weight models like [15,14], element-wise operations occupy considerable amount of time, especially on GPU. Here, the element-wise operators include ReLU, AddTensor, AddBias, etc. They have small FLOPs but relatively heavy MAC. Specially, we also consider depthwise convolution [12,13,14,15] as an element-wise operator as it also has a high MAC/FLOPs ratio.
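
A back-of-the-envelope comparison (my own sketch, using the same accounting as in G1: read the input, write the output, read the weights) shows why depthwise convolution is lumped in with element-wise operators: per FLOP, it touches far more memory than a dense 1×1 convolution.

```python
def mac_flops_ratio_pointwise(h, w, c):
    """Dense 1x1 convolution, c -> c channels."""
    flops = h * w * c * c                # multiply-adds
    mac = h * w * 2 * c + c * c          # read input + write output + read weights
    return mac / flops

def mac_flops_ratio_depthwise(h, w, c, k=3):
    """k x k depthwise convolution over c channels."""
    flops = h * w * c * k * k
    mac = h * w * 2 * c + c * k * k
    return mac / flops

print(mac_flops_ratio_pointwise(56, 56, 116))  # ~0.018 memory accesses per FLOP
print(mac_flops_ratio_depthwise(56, 56, 116))  # ~0.22 memory accesses per FLOP
```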

For validation, we experimented with the “bottleneck” unit (1 × 1 conv followed by 3 × 3 conv followed by 1 × 1 conv, with ReLU and shortcut connection) in ResNet [4]. The ReLU and shortcut operations are removed, separately. Runtime of different variants is reported in Table 4. We observe around 20% speedup is obtained on both GPU and ARM, after ReLU and shortcut are removed.

Removing ReLU and the shortcut yields roughly a 20% speedup. If we compared models by FLOPs alone, we would conclude that these operations take no time at all, yet in practice they are quite costly.

Summary
  1. Use "balanced" convolutions: keep the ratio of input to output channels as close to 1 as possible.
  2. Be aware of the cost of group convolution; do not blindly increase the number of groups. Increasing it lowers the parameter count and FLOPs, and may even improve accuracy, but it raises MAC and therefore the real compute cost.
  3. Reduce the degree of network fragmentation.
  4. Reduce element-wise operations as much as possible.

Conclusion and Discussions. Based on the above guidelines and empirical studies, we conclude that an efficient network architecture should 1) use "balanced" convolutions (equal channel width); 2) be aware of the cost of using group convolution; 3) reduce the degree of fragmentation; and 4) reduce element-wise operations. These desirable properties depend on platform characteristics (such as memory manipulation and code optimization) that are beyond theoretical FLOPs. They should be taken into account for practical network design.

block

In Figure 3, (a) and (b) are ShuffleNetV1 units, while (c) and (d) are the V2 units.

The input channels are first split into two parts (the "channel split" in the paper), which follows G3. The left branch is left untouched; in the right branch, the three convolutions all have equal input and output channels, satisfying G1. The two 1×1 convolutions are no longer group convolutions, which serves G2. Afterwards the two branches are merged by concat, so the total channel count stays the same, again satisfying G1. ReLU is moved into a single branch, i.e. only the right branch goes through ReLU. In addition, the concat, the channel shuffle, and the next unit's channel split can be merged into one element-wise operation, which reduces the number of element-wise operations and satisfies G4.

For the stride=2 (down-sampling) unit, there is no channel split, so the number of output channels is twice the number of input channels.

Towards above purpose, we introduce a simple operator called channel split. It is illustrated in Figure 3(c). At the beginning of each unit, the input of c feature channels are split into two branches with c − c′ and c′ channels, respectively. Following G3, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy G1. The two 1 × 1 convolutions are no longer group-wise, unlike [15]. This is partially to follow G2, and partially because the split operation already produces two groups.

After convolution, the two branches are concatenated. So, the number of channels keeps the same (G1). The same “channel shuffle” operation as in [15] is then used to enable information communication between the two branches.

After the shuffling, the next unit begins. Note that the “Add” operation in ShuffleNet v1 [15] no longer exists. Element-wise operations like ReLU and depthwise convolutions exist only in one branch. Also, the three successive elementwise operations, “Concat”, “Channel Shuffle” and “Channel Split”, are merged into a single element-wise operation. These changes are beneficial according to G4.
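
A minimal PyTorch sketch of the two units described above, written from Figure 3(c)/(d) and the text rather than from the authors' reference implementation; class names, layer-ordering details, and the shape-check comments are my own:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Reshape (N, C, H, W) -> (N, g, C/g, H, W), swap the two channel axes,
    then flatten back, so channels from the two branches are interleaved."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleV2Block(nn.Module):
    """Basic unit (stride=1, Fig. 3c): channel split, identity left branch,
    1x1 -> 3x3 depthwise -> 1x1 right branch, concat, channel shuffle."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)  # channel split
        return channel_shuffle(torch.cat([x1, self.branch(x2)], dim=1))

class ShuffleV2DownBlock(nn.Module):
    """Down-sampling unit (stride=2, Fig. 3d): no channel split, both branches
    process the full input, so the output channel count can grow."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 2
        self.left = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.Conv2d(c_in, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )
        self.right = nn.Sequential(
            nn.Conv2d(c_in, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=2, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return channel_shuffle(torch.cat([self.left(x), self.right(x)], dim=1))

# Shape check for the stage2 example above (1x model, 116 output channels):
# ShuffleV2Block(116)         maps (1, 116, 28, 28) -> (1, 116, 28, 28)
# ShuffleV2DownBlock(24, 116) maps (1, 24, 56, 56)  -> (1, 116, 28, 28)
```

Note how the down-sampling unit matches the stage2 case mentioned earlier: with 24 input and 116 output channels, each branch produces 58 channels.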