SegFormer: an effective yet simple semantic segmentation network

SegFormer is an efficient, accurate, and simple Transformer-based semantic segmentation network.

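Main contributions of the paper:

A hierarchical Transformer encoder that needs no positional encoding.
A lightweight decoder built entirely from MLP layers, which fuses multi-level features to produce the output.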

A novel positional-encoding-free and hierarchical Transformer encoder.

A lightweight All-MLP decoder design that yields a powerful representation without complex and computationally demanding modules.

Hierarchical Transformer encoder

The authors design a series of Mix Transformer encoders (MiT), from MiT-B0 to MiT-B5.

Overlapped Patch Merging

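Each patch merging step shrinks the feature map and increases the number of channels. With the paper's settings, the first stage reduces height and width by a factor of 4, and each later stage halves them.

Its role is similar to pooling: it performs the downsampling that gives the encoder multi-scale feature maps.

A note on non-overlapping patch merging:

It cannot preserve local continuity around the patches, so the paper uses overlapping patch merging with K=7, S=4, P=3 and K=3, S=2, P=1, where K is the patch size, S the stride between adjacent patches, and P the padding.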

This process was initially designed to combine non-overlapping image or feature patches. Therefore, it fails to preserve the local continuity around those patches.

In our experiments, we set K = 7, S = 4, P = 3 ,and K = 3, S = 2, P = 1 to perform overlapping patch merging to produces features with the same size as the non-overlapping process.
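The claim that the overlapping settings give the same feature sizes as non-overlapping merging can be checked with the standard convolution output-size formula. The sketch below is only an illustration: the helper names `conv_out_size` and `stage_sizes`, the 512-pixel input, and the non-overlapping settings (K=4, S=4, P=0 followed by K=2, S=2, P=0) are assumptions made for the comparison, not values taken from the paper's code.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_sizes(size, settings):
    """Apply successive (K, S, P) patch-merging steps to one spatial dimension."""
    sizes = []
    for kernel, stride, padding in settings:
        size = conv_out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Overlapping settings quoted above, one entry per encoder stage.
overlapping = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
# Assumed non-overlapping counterpart: kernel equals stride, no padding.
non_overlapping = [(4, 4, 0), (2, 2, 0), (2, 2, 0), (2, 2, 0)]

print(stage_sizes(512, overlapping))
print(stage_sizes(512, non_overlapping))
```

Both settings walk a 512-pixel side down through the same four sizes, so the overlap changes which pixels each patch sees, not the shape of the feature pyramid.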

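Efficient self-attention

In the original multi-head self-attention layer, the Q, K, and V of each head all have the same dimensions $N \times C$, where $N = H \times W$ is the sequence length, and attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_{head}}}\right)V$$

The computational complexity is $O(N^2)$, which is prohibitive for high-resolution images. The paper uses a reduction ratio $R$ to shorten the K (and V) sequence:

$$\hat{K} = \mathrm{Reshape}\left(\frac{N}{R},\ C \cdot R\right)(K), \qquad K = \mathrm{Linear}(C \cdot R,\ C)(\hat{K})$$

Here K is the sequence being reduced. It is first reshaped to dimensions $\frac{N}{R} \times (C \cdot R)$, and a linear layer then maps the channel dimension from $C \cdot R$ back to $C$, so the new K has dimensions $\frac{N}{R} \times C$. The complexity drops from $O(N^2)$ to $O(N^2 / R)$. In the paper, R is set to [64, 16, 4, 1] for stages 1 to 4.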

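This is SegFormer's self-attention implementation:
EfficientMultiheadAttention(MultiheadAttention)
Note that the actual Q/K/V computation is inherited from PyTorch's multi-head attention (called through self.attn below).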

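    def forward(self, x, hw_shape, identity=None):
        # Queries always use the full-resolution token sequence.
        x_q = x
        if self.sr_ratio > 1:
            # Spatial reduction: tokens -> feature map -> strided conv -> tokens.
            x_kv = nlc_to_nchw(x, hw_shape)
            x_kv = self.sr(x_kv)
            x_kv = nchw_to_nlc(x_kv)
            x_kv = self.norm(x_kv)
        else:
            x_kv = x

        if identity is None:
            identity = x_q
        # Keys and values come from the (possibly shortened) sequence.
        out = self.attn(query=x_q, key=x_kv, value=x_kv, need_weights=False)[0]

        # Residual connection around the attention block.
        return identity + self.dropout_layer(self.proj_drop(out))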

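As the code shows, the Efficient Self-Attention proposed by the authors essentially adds one hyperparameter, sr_ratio, which controls the size (sequence length) of the K and V matrices. It is implemented like this: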

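        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # kernel_size == stride == sr_ratio shrinks H and W by sr_ratio each,
            # so the K/V sequence becomes sr_ratio ** 2 times shorter.
            self.sr = Conv2d(
                in_channels=embed_dims,
                out_channels=embed_dims,
                kernel_size=sr_ratio,
                stride=sr_ratio)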
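To see what the strided convolution buys, the following sketch counts tokens and attention-score entries per stage. It is a back-of-the-envelope calculation, not part of the implementation: the function name `kv_reduction`, the 512x512 input, and the per-stage sr_ratio values 8, 4, 2, 1 (the square roots of the paper's R = [64, 16, 4, 1], since the convolution shrinks both height and width) are assumptions for illustration.

```python
def kv_reduction(height, width, sr_ratio):
    """Return (query_len, kv_len, score_entries) for one attention layer.

    Assumes height and width are divisible by sr_ratio, as with a
    kernel_size == stride == sr_ratio convolution and no padding.
    """
    query_len = height * width
    kv_len = (height // sr_ratio) * (width // sr_ratio)
    # The attention score matrix has one entry per (query, key) pair.
    return query_len, kv_len, query_len * kv_len


# Assumed 512x512 input with stage strides 4, 8, 16, 32.
stages = [(128, 128, 8), (64, 64, 4), (32, 32, 2), (16, 16, 1)]
for h, w, sr in stages:
    n, n_kv, entries = kv_reduction(h, w, sr)
    print(f"{h}x{w} sr_ratio={sr} R={sr * sr}: N={n}, N/R={n_kv}, scores={entries}")
```

Under these assumptions the key/value sequence has the same length at every stage, and the first stage's score matrix is 64 times smaller than it would be with full attention.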

Mix-FFN

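ViT uses positional encoding (PE) to introduce location information. However, the resolution of the PE is fixed, so when the test resolution differs from the training resolution the positional encoding has to be interpolated, which often degrades accuracy. The authors argue that positional encoding is not actually necessary for semantic segmentation. SegFormer instead introduces Mix-FFN, which passes location information by using a 3×3 convolution directly inside the feed-forward network (FFN).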

class MixFFN(BaseModule):
    def __init__(self,
                 embed_dims,
                 feedforward_channels,
                 act_cfg=dict(type='GELU'),
                 ffn_drop=0.,
                 dropout_layer=None,
                 init_cfg=None):
        super(MixFFN, self).__init__(init_cfg)
        self.embed_dims = embed_dims
        self.feedforward_channels = feedforward_channels
        self.act_cfg = act_cfg
        self.activate = build_activation_layer(act_cfg)
        in_channels = embed_dims
        fc1 = Conv2d(
            in_channels=in_channels,
            out_channels=feedforward_channels,
            kernel_size=1,
            stride=1,
            bias=True)
        # 3x3 depth wise conv to provide positional encode information
        pe_conv = Conv2d(
            in_channels=feedforward_channels,
            out_channels=feedforward_channels,
            kernel_size=3,
            stride=1,
            padding=(3 - 1) // 2,
            bias=True,
            groups=feedforward_channels)
        fc2 = Conv2d(
            in_channels=feedforward_channels,
            out_channels=in_channels,
            kernel_size=1,
            stride=1,
            bias=True)
        drop = nn.Dropout(ffn_drop)
        layers = [fc1, pe_conv, self.activate, drop, fc2, drop]
        self.layers = Sequential(*layers)
        self.dropout_layer = build_dropout(
            dropout_layer) if dropout_layer else torch.nn.Identity()

    def forward(self, x, hw_shape, identity=None):
        out = nlc_to_nchw(x, hw_shape)
        out = self.layers(out)
        out = nchw_to_nlc(out)
        if identity is None:
            identity = x
        return identity + self.dropout_layer(out)
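Why can a 3×3 convolution carry location information at all? The SegFormer paper attributes this to zero padding. The toy sketch below, in plain Python, illustrates the effect: a zero-padded 3×3 box filter applied to a constant image gives different responses at corners, edges, and the interior, so the output reveals where each pixel sits. The function name `conv3x3_zero_pad` and the 4x4 all-ones input are illustrative assumptions, unrelated to the MixFFN code above.

```python
def conv3x3_zero_pad(img, kernel):
    """3x3 convolution (cross-correlation) with zero padding, same-size output."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    # Positions outside the image contribute zero (the padding).
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += img[ii][jj] * kernel[di + 1][dj + 1]
            out[i][j] = acc
    return out


ones = [[1.0] * 4 for _ in range(4)]   # constant image: no content cues
box = [[1.0] * 3 for _ in range(3)]    # 3x3 box filter
for row in conv3x3_zero_pad(ones, box):
    print(row)
```

Although the input is identical everywhere, the output distinguishes border positions from interior ones, which is the kind of positional cue a learned depth-wise 3×3 convolution can exploit.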

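Note that in Transformer models the activation function is no longer the familiar ReLU but GELU, and the swap is not arbitrary.

Nonlinearity is an essential property of a neural network, and stochastic regularization such as dropout is added for generalization (dropout randomly sets some outputs to 0, which can itself be seen as a random nonlinear activation). Stochastic regularization and the nonlinear activation are usually treated as two separate things, yet a neuron's output is determined by both together.

GELU brings the idea of stochastic regularization into the activation itself. It is a probabilistic description of the neuron's input, which is intuitively more natural, and in the GELU paper's experiments it outperforms both ReLU and ELU.

GELU also multiplies the input by 0 or 1. But unlike ReLU, where the mask is fixed by the sign of the input, or dropout, where the mask is random and independent of the input, GELU's 0-1 mask is random and depends on the input: the mask is drawn as $m \sim \mathrm{Bernoulli}(\Phi(x))$, with $\Phi$ the standard normal CDF, and the activation is the expected value $\mathrm{GELU}(x) = x\,\Phi(x)$. In other words, the weight GELU gives an input depends on how likely that input is to be larger than the other inputs.

Because neuron inputs tend to follow a normal distribution, this choice means that as the input x gets smaller it is more likely to be dropped, so the activation transformation depends stochastically on the input.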
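As a rough check of this stochastic view, the sketch below samples a 0-1 mask that keeps the input with probability Φ(x), the standard normal CDF, and compares the average of x times the mask with the closed form x·Φ(x). The helper names `phi` and `stochastic_gate_mean`, the trial count, and the seed are illustrative choices, not something taken from the GELU paper.

```python
import math
import random


def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def stochastic_gate_mean(x, trials=100_000, seed=0):
    """Average of x * m over random masks m ~ Bernoulli(phi(x))."""
    rng = random.Random(seed)
    keep_prob = phi(x)
    kept = sum(1 for _ in range(trials) if rng.random() < keep_prob)
    return x * kept / trials


for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f} sampled={stochastic_gate_mean(x):+.4f} x*Phi(x)={x * phi(x):+.4f}")
```

The sampled averages should land close to x·Φ(x), and small or negative inputs are kept less often than large ones.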

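The GELU paper also gives an approximation:

$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$$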

BERT source code:

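import tensorflow as tf  # BERT targets TensorFlow 1.x, where tf.erf is available

def gelu(input_tensor):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
    return input_tensor * cdf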
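For readers without TensorFlow at hand, the same exact form can be compared against the tanh approximation from the GELU paper, 0.5·x·(1 + tanh[√(2/π)·(x + 0.044715·x³)]), using only the standard library. This is a minimal scalar sketch; the function names `gelu_exact` and `gelu_tanh` are made up for illustration.

```python
import math


def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f} exact={gelu_exact(x):+.5f} tanh={gelu_tanh(x):+.5f}")
```

The two columns agree closely. Unlike ReLU, GELU is slightly negative for moderately negative inputs, and it approaches the identity for large positive ones.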

In Transformer models, GELU does tend to give better results.

The paper's ablation supports replacing positional encoding with Mix-FFN: at a given resolution Mix-FFN is clearly better, and lowering the test resolution costs only 0.7% with Mix-FFN versus 3.3% with positional encoding. In the paper's words:

Table 1c shows the results for this experiment. As shown, for a given resolution, our approach using Mix-FFN clearly outperforms using a positional encoding. Moreover, our approach is less sensitive to differences in the test resolution: the accuracy drops 3.3% when using a positional encoding with a lower resolution. In contrast, when we use the proposed Mix-FFN the performance drop is reduced to only 0.7%. From these results, we can conclude using the proposed Mix-FFN produces better and more robust encoders than those using positional encoding.

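Decoder

Over the past three years, semantic segmentation work, from the DeepLab series to PSPNet to DANet and others, has focused on designing better decoders (the encoder is usually just a backbone feature extractor), and decoders have become heavier and more complex.

For semantic segmentation, the most important question is how to enlarge the receptive field. For a CNN encoder, the effective receptive field is relatively small and local, so the decoder needs design work to enlarge it. ASPP, for example, does this with atrous (dilated) convolutions of different rates.
For a Transformer encoder, however, self-attention already makes the effective receptive field very large, so the decoder needs no extra operations to enlarge it (the authors tried a number of segmentation heads and saw essentially no gain). The paper visualizes the effective receptive fields of DeepLab and SegFormer side by side (for the definition of effective receptive field, see: Understanding the Effective Receptive Field in Deep Convolutional Neural Networks).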

The ERF of DeepLabv3+ is relatively small even at Stage-4, the deepest stage.

SegFormer’s encoder naturally produces local attentions which resemble convolutions at lower stages, while able to output highly non-local attentions that effectively capture contexts at Stage-4.

As shown with the zoom-in patches in Figure 3, the ERF of the MLP head (blue box) differs from Stage-4 (red box) with a significant stronger local attention besides the non-local attention.

Because of self-attention, the receptive field is already large enough at the SegFormer encoder stage, so the decoder does not need a heavy head.

class SegformerHead(BaseDecodeHead):
    def __init__(self, interpolate_mode='bilinear', **kwargs):
        super().__init__(input_transform='multiple_select', **kwargs)
        self.interpolate_mode = interpolate_mode
        num_inputs = len(self.in_channels)
        assert num_inputs == len(self.in_index)
        self.convs = nn.ModuleList()
        for i in range(num_inputs):
            self.convs.append(
                ConvModule(
                    in_channels=self.in_channels[i],
                    out_channels=self.channels,
                    kernel_size=1,
                    stride=1,
                    norm_cfg=self.norm_cfg,
                    act_cfg=self.act_cfg))
        self.fusion_conv = ConvModule(
            in_channels=self.channels * num_inputs,
            out_channels=self.channels,
            kernel_size=1,
            norm_cfg=self.norm_cfg)

    def forward(self, inputs):
        # Receive 4 stage backbone feature map: 1/4, 1/8, 1/16, 1/32
        inputs = self._transform_inputs(inputs)
        outs = []
        for idx in range(len(inputs)):
            x = inputs[idx]
            conv = self.convs[idx]
            outs.append(
                resize(
                    input=conv(x),
                    size=inputs[0].shape[2:],
                    mode=self.interpolate_mode,
                    align_corners=self.align_corners))
        out = self.fusion_conv(torch.cat(outs, dim=1))
        out = self.cls_seg(out)
        return out
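The data flow of this head is easier to follow as a shape trace. The sketch below tracks only (C, H, W) shapes, with no real tensors, through the four steps of the forward pass: per-stage 1x1 projection, resize to the finest stage, channel concatenation, and fusion plus classification. The function name `segformer_head_shapes` and the example numbers (stage channels 32/64/160/256, decoder width 256, 150 classes, 512x512 input) are assumptions chosen to resemble a small MiT-B0-style setup.

```python
def segformer_head_shapes(in_shapes, channels, num_classes):
    """Trace (C, H, W) shapes through an all-MLP head.

    in_shapes: per-stage encoder outputs as (C, H, W), finest stage first.
    """
    target_hw = in_shapes[0][1:]  # every stage is resized to the finest stage
    # Per-stage 1x1 conv unifies channels, then bilinear resize unifies H and W.
    unified = [(channels, *target_hw) for _ in in_shapes]
    # Channel-wise concatenation of all unified maps.
    concat = (sum(c for c, _, _ in unified), *target_hw)
    fused = (channels, *target_hw)      # 1x1 fusion conv
    logits = (num_classes, *target_hw)  # per-pixel classifier
    return {"unified": unified, "concat": concat, "fused": fused, "logits": logits}


# Assumed setup: 512x512 image, encoder strides 4, 8, 16, 32.
stages = [(32, 128, 128), (64, 64, 64), (160, 32, 32), (256, 16, 16)]
shapes = segformer_head_shapes(stages, channels=256, num_classes=150)
print(shapes["concat"], shapes["logits"])
```

The head itself predicts at 1/4 of the input resolution, so upsampling the logits back to full image size has to happen outside this class.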

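Experimental results

Thoughts from SegFormer's first author:

For semantic segmentation, feature extraction matters a great deal, and Transformers have already shown stronger feature extraction than CNNs on classification. But there is still a gap between classification and segmentation, so how to design Transformer architectures that are better suited to segmentation remains open for further study.

Given very good features, how should the decoder be designed to improve performance further? Here a very simple MLP decoder already works well, while traditional decoders such as ASPP bring essentially no benefit on top of a Transformer. How to design better decoders specifically for Transformers is also worth exploring.

Thoughts on Transformers: downsampling and position encoding are both starting to lean on convolutions, drifting back toward CNN-style design (Swin Transformer borrows the local and hierarchical ideas of CNNs). In the end, compared with CNNs, local self-attention may be the only part of Vision Transformers that is a real step forward.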