Contributions
- We propose a novel transformer-based decoder for semantic context mining, which is applicable to different encoders with varied structures.
- We design simple but effective internal and external context mining modules in the decoder for different levels of feature augmentation.
- We propose optimization techniques for further expansion and analyze the effect of SegDeformer on segmentation benchmarks.
Decoder
First, following SegFormer [30], basic MLP layers and optional up-sampling operations are used to unify the channel dimension and feature scale of the multi-stage features, producing the fused feature F (Eq. 1 in the paper).
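A minimal PyTorch sketch of this unification step (the channel sizes, module names, and the concatenate-then-fuse layout are assumptions for illustration, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureUnifier(nn.Module):
    """Project multi-stage features to a common channel dim C and a common scale (sketch)."""

    def __init__(self, in_channels=(64, 128, 320, 512), embed_dim=256):
        super().__init__()
        # One 1x1 projection per encoder stage to unify the channel dimension.
        self.projs = nn.ModuleList([nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels])
        self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, kernel_size=1)

    def forward(self, feats):
        # feats: list of stage outputs [B, C_i, H_i, W_i], finest (1/4 scale) first.
        target_size = feats[0].shape[2:]  # unify everything to the 1/4-scale resolution
        unified = [
            F.interpolate(proj(f), size=target_size, mode="bilinear", align_corners=False)
            for proj, f in zip(self.projs, feats)
        ]
        return self.fuse(torch.cat(unified, dim=1))  # fused feature F: [B, C, H1, W1]
```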
Then, F is flattened back to the internal token sequence Xinter with size N × C, where N = H1 × W1 denotes the length of Xinter, and internal and external context mining are conducted.
Here, Minter and Mexter denote the internal and external context mining modules, and Xexter is a set of learnable external tokens (similar to the object queries in DETR).
The augmented feature is then obtained through a feature fusion operation (element-wise addition) and a reshape operation, and finally the segmentation result M is produced by a classification head followed by up-sampling.
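Putting the steps above together, the decoder forward pass might look roughly like the following sketch (module and variable names are mine; whether the original tokens take part in the element-wise fusion is an assumption):

```python
import torch.nn.functional as F


def decoder_forward(feat, m_inter, m_exter, x_exter, cls_head, out_size):
    """Sketch of the decoder flow: flatten -> internal/external mining -> fuse -> predict.

    feat:     fused feature F, shape [B, C, H1, W1]
    m_inter:  internal context mining module (token sequence -> token sequence)
    m_exter:  external context mining module (internal tokens + external tokens -> tokens)
    x_exter:  learnable external tokens, shape [L, C]
    cls_head: per-pixel classification head (e.g. a 1x1 conv from C to num_classes)
    """
    b, c, h, w = feat.shape
    x_inter = feat.flatten(2).transpose(1, 2)                 # [B, N, C], N = H1 * W1
    x_int_aug = m_inter(x_inter)                              # internal context mining
    x_ext_aug = m_exter(x_inter, x_exter.unsqueeze(0).expand(b, -1, -1))  # external mining
    fused = x_inter + x_int_aug + x_ext_aug                   # element-wise fusion (assumed form)
    f_aug = fused.transpose(1, 2).reshape(b, c, h, w)         # reshape back to a feature map
    logits = cls_head(f_aug)                                  # [B, num_classes, H1, W1]
    return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
```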
Internal context mining
Although the highest layers of the transformer architecture are proved to have a global receptive field [30], we observe that the multi-stage feature unification operation (Eq. 1), i.e., the original SegFormer decoder, still brings feature confusion and leads to discontinuous and incorrect mask predictions. We think this is because attention at different stages usually focuses on different contents, so some aggregated information may not be suitable for segmentation and requires re-integration. Besides, for transformer encoders with deep-narrow designs, features are organized in a fine-to-coarse way, and the regions represented by tokens at different stages have scale differences.
The internal context mining module Minter, with its global aggregation capability, is therefore designed to re-aggregate information from other related pixels, so that confused pixels can aggregate relevant high-quality information from other pixels and improve their representations.
The internal context mining module uses only a simple single-head self-attention.
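For concreteness, a single-head self-attention over the internal tokens could look like this (a sketch; the projection layout is assumed rather than taken from the paper's code):

```python
import math
import torch.nn as nn


class InternalContextMining(nn.Module):
    """Single-head self-attention over the internal token sequence (sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: [B, N, C]
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(x.shape[-1])
        attn = attn.softmax(dim=-1)                    # each token re-aggregates from all others
        return self.proj(attn @ v)
```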
External context mining
Besides internal information, contextual information from other images can enrich an image's features and benefit a globally consistent representation.
The external context mining module Mexter uses cross-image information to augment the features and complements Minter. The sequential input used in transformers is flexible and can easily be extended to carry cross-image information.
Adding external tokens to the decoder
The external token sequence Xexter consists of a number of tokens, each of which aggregates information with a different meaning.
Mexter is composed of two single-head cross-attentions performed between Xinter and Xexter. In the first, one of the two token sequences is projected into queries and the other into keys and values, and cross-attention is computed; in the second, the roles of the two sequences are swapped and cross-attention is computed again, yielding the externally augmented features.
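A sketch of the two cross-attentions under one possible ordering (I assume the external tokens first gather information from the internal tokens and the internal tokens then attend to the updated external tokens; the paper's exact projections and order may differ):

```python
import torch.nn as nn


class ExternalContextMining(nn.Module):
    """Two single-head cross-attentions between internal and external tokens (sketch)."""

    def __init__(self, dim):
        super().__init__()
        # First cross-attention: external tokens attend to internal tokens (assumed order).
        self.q1, self.k1, self.v1 = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        # Second cross-attention: internal tokens attend to the updated external tokens.
        self.q2, self.k2, self.v2 = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def cross_attn(self, q, k, v):
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        return attn @ v

    def forward(self, x_inter, x_exter):
        # x_inter: [B, N, C] internal tokens, x_exter: [B, L, C] learnable external tokens.
        x_exter = self.cross_attn(self.q1(x_exter), self.k1(x_inter), self.v1(x_inter))
        x_aug = self.cross_attn(self.q2(x_inter), self.k2(x_exter), self.v2(x_exter))
        return x_aug                                   # externally augmented internal tokens
```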
Adding constraints to the external tokens
As part of the network parameters, Xexter keeps updating itself during training. However, we find that adding an extra constraint to Xexter so that each token becomes class-specific brings better performance. This is achieved by applying a cross-entropy loss to the intermediate output produced with Xexter, with the ground-truth class labels converted to one-hot form; the final loss adds this auxiliary cross-entropy term to the segmentation loss.
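One way such a constraint could be implemented (a sketch under assumptions: `mid_scores` are per-pixel scores over the class-specific external tokens, and the 0.4 auxiliary weight is illustrative rather than from the paper; integer-label cross-entropy is equivalent to the one-hot formulation):

```python
import torch.nn.functional as F


def total_loss(final_logits, mid_scores, gt_labels, aux_weight=0.4):
    """Sketch: main segmentation CE + auxiliary CE on the intermediate prediction
    that pushes each external token to be class-specific.

    final_logits: [B, num_classes, H, W] final mask logits
    mid_scores:   [B, num_classes, H1, W1] intermediate scores from the external tokens
    gt_labels:    [B, H, W] ground-truth class indices
    """
    loss_seg = F.cross_entropy(final_logits, gt_labels, ignore_index=255)
    # Down-scale the labels to the intermediate resolution for the auxiliary term.
    gt_small = F.interpolate(gt_labels[:, None].float(), size=mid_scores.shape[2:],
                             mode="nearest").squeeze(1).long()
    loss_mid = F.cross_entropy(mid_scores, gt_small, ignore_index=255)
    return loss_seg + aux_weight * loss_mid
```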
Decoupling the external tokens
In the basic structure above, Xexter is responsible for both information interaction and class prediction, which makes it harder for it to compress information. We find that separating Xexter into two parts, each taking one of the two responsibilities, can further improve the representation ability. In practice, each external token is given dimension 2C and decoupled into two parts: one part is responsible for information interaction and the other for the mid-level prediction. The decoupling is conveniently implemented by replacing Xexter in the cross-attention equations with projections of the corresponding parts.
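A sketch of the decoupling (the 2C token width is split into an interaction half and a prediction half; the projection layers are my assumption about how the "projections" mentioned above could be realized):

```python
import torch
import torch.nn as nn


class DecoupledExternalTokens(nn.Module):
    """Learnable external tokens of width 2C, split into an interaction part and a
    prediction part (sketch of the decoupling idea, not the authors' exact code)."""

    def __init__(self, num_tokens, dim):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, 2 * dim))  # [L, 2C]
        self.proj_interact = nn.Linear(dim, dim)   # projection used in the cross-attentions
        self.proj_predict = nn.Linear(dim, dim)    # projection used for the mid-level prediction

    def forward(self, batch_size):
        interact, predict = self.tokens.chunk(2, dim=-1)   # two C-dimensional halves
        interact = self.proj_interact(interact).unsqueeze(0).expand(batch_size, -1, -1)
        predict = self.proj_predict(predict).unsqueeze(0).expand(batch_size, -1, -1)
        return interact, predict
```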
Optimization modules
SegDeformer can be seamlessly integrated with other modules for further adaptation and extension.
Some of the optimization techniques used are multi-stage feature fusion and efficient self-attention.
Multi-stage feature fusion
SegDeformer mainly utilizes the attention mechanism to augment features while only using basic MLP (Eq. 1) for multi-stage feature fusion. Some other multi-level feature aggregation techniques, e.g., Semantic-FPN [18], UperNet [29], and FAPN [14], can be used in conjunction with our method and bring further performance gains. Following Swin [21], we additionally introduce UperNet in some cases to pursue better feature representation.
Efficient self-attention
Although SegDeformer can make good use of the characteristics of the transformer, the computational cost of the attention operations, mainly coming from Eq. 5 and 7, makes it unable to adapt to some real-time structures. In this case, we can introduce efficient self-attention to reduce the amount of computation. Denoting the reduction ratio as R, we can use an R × R convolution with stride R to reduce the scale of K and V in Eq. 5 and 7. As a result, the complexity of the self-attention is reduced from O(N²) to O(N²/R²).
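A sketch of this reduction trick applied to a single-head self-attention (module and parameter names are illustrative; the same idea would apply to the K/V side of the cross-attention, and H1 × W1 is assumed divisible by R):

```python
import torch.nn as nn


class EfficientSelfAttention(nn.Module):
    """Single-head self-attention where K and V are spatially reduced by an R x R
    strided convolution, cutting the attention cost from O(N^2) to O(N^2 / R^2)."""

    def __init__(self, dim, reduction=4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.scale = dim ** -0.5

    def forward(self, x, h, w):                        # x: [B, N, C] with N = h * w
        b, n, c = x.shape
        q = self.q(x)
        # Reduce the spatial size of the tokens used as keys and values.
        x_sp = x.transpose(1, 2).reshape(b, c, h, w)
        x_red = self.reduce(x_sp).flatten(2).transpose(1, 2)   # [B, N / R^2, C]
        k, v = self.kv(x_red).chunk(2, dim=-1)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        return attn @ v                                 # [B, N, C]
```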