Path Aggregation Network for Instance Segmentation


Original paper: arXiv:1803.01534


Shu Liu†

†The Chinese University of Hong Kong

♯SenseTime Research  ♭YouTu Lab, Tencent

{sliu, luqi, leojia}@cse.cuhk.edu.hk  qhfpku@pku.edu.cn  shijianping@sensetime.com

Abstract

The way that information propagates in neural networks is of great importance. In this paper, we propose the Path Aggregation Network (PANet), aiming at boosting information flow in the proposal-based instance segmentation framework. Specifically, we enhance the entire feature hierarchy with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and the topmost features. We present adaptive feature pooling, which links the feature grid and all feature levels to make useful information at each feature level propagate directly to the following proposal subnetworks. A complementary branch capturing different views of each proposal is created to further improve mask prediction.

These improvements are simple to implement, with subtle extra computational overhead. Our PANet reaches the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in the Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes. Code is available at github.com/ShuLiu1993/PANet.

1. Introduction

Instance segmentation is one of the most important and challenging tasks. It aims to predict class labels and pixel-wise instance masks to localize the varying numbers of instances present in images. This task widely benefits autonomous vehicles, robotics and video surveillance, to name a few.

With the help of deep convolutional neural networks, several frameworks for instance segmentation, e.g., [21, 33, 3, 38], were proposed, and performance has grown rapidly [12]. Mask R-CNN [21] is a simple and effective system for instance segmentation. Based on Fast/Faster R-CNN [16, 51], a fully convolutional network (FCN) is used for mask prediction, along with box regression and classification. To achieve high performance, a feature pyramid network (FPN) [35] is utilized to extract the in-network feature hierarchy, where a top-down path with lateral connections is augmented to propagate semantically strong features.

Several newly released datasets [37, 7, 45] leave large room for algorithm improvement. COCO [37] consists of 200k images, each capturing many instances with complex spatial layout. Differently, Cityscapes [7] and MVD [45] provide street scenes with a large number of traffic participants in each image. Blur, heavy occlusion and extremely small instances appear in these datasets.

Several principles have been proposed for designing networks in image classification that are also effective for object recognition. For example, shortening the information path and easing information propagation with clean residual connections [23, 24] and dense connections [26] are useful. Increasing the flexibility and diversity of information paths by creating parallel paths following the split-transform-merge strategy [61, 6] is also beneficial.

Our Findings Our research indicates that information propagation in the state-of-the-art Mask R-CNN can be further improved. Specifically, low-level features are helpful for identifying large instances, but there is a long path from low-level structure to the topmost features, making it difficult to access accurate localization information. Further, each proposal is predicted from feature grids pooled from a single, heuristically assigned feature level. This process can be updated, since information discarded at other levels may be helpful for the final prediction. Finally, mask prediction is made from a single view, losing the chance to gather more diverse information.

Our Contributions Inspired by these principles and observations, we propose PANet, illustrated in Figure 1, for instance segmentation.

First, to shorten the information path and enhance the feature pyramid with the accurate localization signals that exist in low levels, bottom-up path augmentation is created. In fact, low-layer features were utilized in the systems of [44, 42, 13, 46, 35, 5, 31, 14]. But propagating low-level features to enhance the entire feature hierarchy for instance recognition was not explored.

Figure 1. Illustration of our framework. (a) FPN backbone. (b) Bottom-up path augmentation. (c) Adaptive feature pooling. (d) Box branch. (e) Fully-connected fusion. Note that we omit channel dimension of feature maps in (a) and (b) for brevity.

Second, to recover the broken information path between each proposal and all feature levels, we develop adaptive feature pooling. It is a simple component that aggregates features from all feature levels for each proposal, avoiding arbitrarily assigned results. With this operation, cleaner paths are created compared with those of [4, 62].

Finally, to capture different views of each proposal, we augment mask prediction with tiny fully-connected (fc) layers, which possess properties complementary to the FCN originally used by Mask R-CNN. By fusing predictions from these two views, information diversity increases and masks of better quality are produced.

The first two components are shared by both object detection and instance segmentation, leading to much enhanced performance of both tasks.

Experimental Results With PANet, we achieve state-of-the-art performance on several datasets. With ResNet-50 [23] as the initial network, our PANet, tested at a single scale, already outperforms the champions of the COCO 2016 Challenge in both the object detection [27] and instance segmentation [33] tasks. Note that those previous results were achieved with larger models [23, 58] together with multi-scale and horizontal-flip testing.

We achieve the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in the Object Detection task without large-batch training. We also benchmark our system on Cityscapes and MVD, which similarly yields top-ranking results, manifesting that PANet is a very practical and top-performing framework. Our code and models are available at github.com/ShuLiu1993/PANet.

2. Related Work

Instance Segmentation There are mainly two streams of methods in instance segmentation. The most popular one is proposal-based. Methods in this stream have a strong connection to object detection. In R-CNN [17], object proposals from [60, 68] were fed into the network to extract features for classification, while Fast/Faster R-CNN [16, 51] and SPPNet [22] sped up the process by pooling features from global feature maps. Earlier work [18, 19] took mask proposals from MCG [1] as input to extract features, while CFM [9], MNC [10] and Hayder et al. [20] merged feature pooling into the network for faster speed. Newer designs generate instance masks in networks as proposals [48, 49, 8] or final results [10, 34, 41]. Mask R-CNN [21] is an effective framework falling into this stream. Our work is built on Mask R-CNN and improves it from different aspects.

Methods in the other stream are mainly segmentation-based. They learn specially designed transformations [3, 33, 38, 59] or instance boundaries [30]; instance masks are then decoded from the predicted transformations. Instance segmentation by other pipelines also exists. DIN [2] fused predictions from object detection and semantic segmentation systems. A graphical model was used in [66, 65] to infer the order of instances. An RNN was utilized in [53, 50] to propose one instance at each time step.

Multi-level Features Features from different layers have been used in image recognition. SharpMask [49], Peng et al. [47] and LRR [14] fused feature maps for segmentation with finer details. FCN [44], U-Net [54] and Noh et al. [46] fused information from lower layers through skip-connections. Both TDM [56] and FPN [35] augmented a top-down path with lateral connections for object detection. Different from TDM, which took the fused feature map with the highest resolution to pool features, SSD [42], DSSD [13], MS-CNN [5] and FPN [35] assigned proposals to appropriate feature levels for inference. We take FPN as a baseline and substantially enhance it.

ION [4], Zagoruyko et al. [62], HyperNet [31] and Hypercolumn [19] concatenated feature grids from different layers for better prediction. But a sequence of operations, i.e., normalization, concatenation and dimension reduction, is needed to obtain feasible new features. In comparison, our design is much simpler.

Figure 2. Illustration of our building block of bottom-up path augmentation.

Fusing feature grids from different sources for each proposal was also utilized in [52]. But that method extracted feature maps from inputs at different scales and then conducted feature fusion (with a max operation) to improve feature selection from the input image pyramid. In contrast, our method aims at utilizing information from all feature levels in the in-network feature hierarchy with single-scale input, and end-to-end training is enabled.

Larger Context Region Methods of [15, 64, 62] pooled features for each proposal with a foveal structure to exploit context information from regions with different resolutions. Features pooled from a larger region provide surrounding context. Global pooling was used in PSPNet [67] and ParseNet [43] to greatly improve the quality of semantic segmentation. A similar trend was observed by Peng et al. [47], where global convolutions were utilized. Our mask prediction branch also supports accessing global information, but the technique is completely different.

3. Framework

Our framework is illustrated in Figure 1. Path augmentation and aggregation are conducted to improve performance. A bottom-up path is augmented to make low-layer information easier to propagate. We design adaptive feature pooling to allow each proposal to access information from all levels for prediction. A complementary path is added to the mask-prediction branch. This new structure leads to decent performance. Similar to FPN, the improvement is independent of the CNN structure, e.g., [57, 32, 23].

3.1. Bottom-up Path Augmentation

Motivation FPN augments a top-down path to propagate semantically strong features and enhance all features with reasonable classification capability. The necessity of this path is manifested by the insightful observation [63] that neurons in high layers respond strongly to entire objects while other neurons are more likely to be activated by local texture and patterns.

Our framework further enhances the localization capability of the entire feature hierarchy by propagating the strong responses of low-level patterns, based on the fact that a high response to edges or instance parts is a strong indicator for accurately localizing instances. To this end, we build a path with clean lateral connections from the low level to the top ones. Therefore, there is a "shortcut" (dashed green line in Figure 1), which consists of fewer than 10 layers, across these levels. In comparison, the CNN trunk in FPN gives a long path (dashed red line in Figure 1) passing through even 100+ layers from the low layers to the topmost one.

Augmented Bottom-up Structure Our framework first accomplishes bottom-up path augmentation. We follow FPN to define that layers producing feature maps with the same spatial size are in the same network stage. Each feature level corresponds to one stage. We also take ResNet [23] as the basic structure and use {P2, P3, P4, P5} to denote the feature levels generated by FPN. Our augmented path starts from the lowest level P2 and gradually approaches P5, as shown in Figure 1(b). From P2 to P5, the spatial size is gradually down-sampled by a factor of 2. We use {N2, N3, N4, N5} to denote the newly generated feature maps corresponding to {P2, P3, P4, P5}. Note that N2 is simply P2, without any processing.

As shown in Figure 2, each building block takes a higher-resolution feature map Ni and a coarser map Pi+1 through a lateral connection and generates the new feature map Ni+1. Each feature map Ni first goes through a 3×3 convolutional layer with stride 2 to reduce the spatial size. Each element of feature map Pi+1 is then added to the down-sampled map through the lateral connection. The fused feature map is processed by another 3×3 convolutional layer to generate Ni+1 for the following sub-networks. This is an iterative process that terminates after approaching P5. In these building blocks, we consistently use 256-channel feature maps. All convolutional layers are followed by a ReLU [32]. The feature grid for each proposal is then pooled from the new feature maps, i.e., {N2, N3, N4, N5}.
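This building block can be sketched in a few lines of PyTorch. This is a minimal sketch under the description above; the class and variable names (and the toy spatial sizes in the usage example) are illustrative, not taken from the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottomUpBlock(nn.Module):
    """One building block of bottom-up path augmentation (Figure 2).

    Takes the finer map N_i and the lateral FPN map P_{i+1};
    returns N_{i+1} at the coarser resolution.
    """
    def __init__(self, channels=256):
        super().__init__()
        # 3x3 stride-2 conv halves the spatial size of N_i.
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        # 3x3 conv applied after the element-wise addition.
        self.post = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, n_i, p_next):
        x = F.relu(self.down(n_i))   # down-sample N_i by factor 2
        x = x + p_next               # lateral element-wise addition of P_{i+1}
        return F.relu(self.post(x))  # fuse -> N_{i+1}

# Toy usage: N_2 at 64x64 and P_3 at 32x32, both with 256 channels.
block = BottomUpBlock(256)
n2 = torch.randn(1, 256, 64, 64)
p3 = torch.randn(1, 256, 32, 32)
n3 = block(n2, p3)
assert n3.shape == (1, 256, 32, 32)
```

Stacking three such blocks (N2→N3→N4→N5, with N2 = P2) realizes the whole augmented path.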

3.2. Adaptive Feature Pooling

Motivation In FPN [35], proposals are assigned to different feature levels according to their size: small proposals are assigned to low feature levels and large proposals to higher ones. Albeit simple and effective, this could still generate non-optimal results. For example, two proposals with a 10-pixel size difference can be assigned to different levels, even though they are rather similar.
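For reference, the heuristic in question is Eq. (1) of the FPN paper, k = ⌊k0 + log2(√(wh)/224)⌋ with k0 = 4, clamped to the available levels. A small sketch shows how two proposals close in size can straddle a level boundary:

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """FPN's heuristic: assign a proposal of size w x h to pyramid level
    k = floor(k0 + log2(sqrt(w*h) / canonical)), clamped to [k_min, k_max]."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))

# Two square proposals differing by only 10 pixels in side length
# land on different pyramid levels:
print(fpn_level(220, 220))  # -> 3
print(fpn_level(230, 230))  # -> 4
```

The boundary here sits at √(wh) = 224, so proposals on either side of it are pooled from different levels despite being nearly identical.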

Further, the importance of features may not be strongly correlated with the levels they belong to. High-level features are generated with large receptive fields and capture richer context information; allowing small proposals to access these features better exploits useful context for prediction. Similarly, low-level features carry many fine details and high localization accuracy, so letting large proposals access them is obviously beneficial. With these thoughts, we propose pooling features from all levels for each proposal and fusing them for the following prediction. We call this process adaptive feature pooling.

Figure 3. Ratio of features pooled from different feature levels with adaptive feature pooling. Each line represents a set of proposals that should be assigned to the same feature level in FPN, i.e., proposals with similar scales. The horizontal axis denotes the source of pooled features. It shows that proposals with different sizes all exploit features from several different levels.

We now analyze the ratio of features pooled from different levels with adaptive feature pooling. We use the max operation to fuse features from different levels, which lets the network select element-wise useful information. We cluster proposals into four classes based on the levels they were originally assigned to in FPN. For each set of proposals, we calculate the ratio of features selected from different levels. In notation, levels 1-4 represent low-to-high levels. As shown in Figure 3, the blue line represents small proposals that were originally assigned to level 1 in FPN. Surprisingly, nearly 70% of the features come from other, higher levels. We also use the yellow line to represent large proposals that were assigned to level 4 in FPN; again, 50%+ of the features are pooled from other, lower levels. This observation clearly indicates that features from multiple levels together are helpful for accurate prediction. It is also strong support for designing bottom-up path augmentation.
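Measuring this selection ratio is straightforward because max fusion picks each output element from exactly one level. A NumPy sketch with random stand-in features (purely illustrative; the paper's statistics come from trained features, not random ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four pooled feature grids for one proposal, one per level (C x H x W).
levels = rng.standard_normal((4, 256, 7, 7))

# Element-wise max fusion: each fused element comes from exactly one
# level; argmax records which level "won" each element.
winner = levels.argmax(axis=0)

# Fraction of fused elements selected from each of the 4 levels.
ratios = np.bincount(winner.ravel(), minlength=4) / winner.size
print(ratios)  # with i.i.d. random features, roughly 0.25 per level
assert abs(ratios.sum() - 1.0) < 1e-9
```

On trained features, these ratios are exactly what Figure 3 plots per proposal-size group.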

Adaptive Feature Pooling Structure Adaptive feature pooling is actually simple to implement and is demonstrated in Figure 1(c). First, each proposal is mapped to different feature levels, as denoted by the dark grey regions in Figure 1(b). Following Mask R-CNN [21], ROIAlign is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse the feature grids from different levels.

In the following sub-networks, the pooled feature grids go through one parameter layer independently, followed by the fusion operation, to enable the network to adapt features. For example, there are two fc layers in the box branch of FPN; we apply the fusion operation after the first layer. Since four consecutive convolutional layers are used in the mask prediction branch of Mask R-CNN, we place the fusion operation between the first and second convolutional layers. An ablation study is given in Section 4.2. The fused feature grid is used as the feature grid of each proposal for further prediction, i.e., classification, box regression and mask prediction. A detailed illustration of adaptive feature pooling on the box branch is shown in Figure 6 in the Appendix.
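Assuming the grids pooled from N2–N5 are already available (e.g. via ROIAlign), the box-branch version of this scheme might look like the following sketch. The layer widths and the choice to share fc1 across levels are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AdaptiveFeaturePoolingBox(nn.Module):
    """Sketch of adaptive feature pooling for the box branch: the pooled
    grid from every level passes the first fc layer independently, the
    results are fused element-wise, and the fused feature continues
    through the second fc layer."""
    def __init__(self, channels=256, grid=7, hidden=1024, fusion="max"):
        super().__init__()
        self.fc1 = nn.Linear(channels * grid * grid, hidden)  # per level
        self.fc2 = nn.Linear(hidden, hidden)                  # after fusion
        self.fusion = fusion

    def forward(self, grids_per_level):
        # grids_per_level: list of (num_proposals, C, grid, grid) tensors,
        # one per feature level.
        feats = [torch.relu(self.fc1(g.flatten(1))) for g in grids_per_level]
        stacked = torch.stack(feats)  # (levels, num_proposals, hidden)
        if self.fusion == "max":
            fused = stacked.max(dim=0).values  # element-wise max fusion
        else:
            fused = stacked.sum(dim=0)         # element-wise sum fusion
        return torch.relu(self.fc2(fused))

head = AdaptiveFeaturePoolingBox()
grids = [torch.randn(3, 256, 7, 7) for _ in range(4)]  # 3 proposals, 4 levels
out = head(grids)
assert out.shape == (3, 1024)
```

The mask branch follows the same pattern, with the fusion placed between its first and second convolutional layers instead.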

Figure 4. Mask prediction branch with fully-connected fusion.

Our design focuses on fusing information from the in-network feature hierarchy instead of from different feature maps of an input image pyramid [52]. It is simpler than the procedures of [4, 62, 31], where L-2 normalization, concatenation and dimension reduction are needed.

3.3. Fully-connected Fusion

Motivation Fully-connected layers, or MLPs, were widely used for mask prediction in instance segmentation [10, 41, 34] and mask proposal generation [48, 49]. Results of [8, 33] show that an FCN is also competent at predicting pixel-wise masks for instances. Recently, Mask R-CNN [21] applied a tiny FCN on the pooled feature grid to predict corresponding masks, avoiding competition between classes.

We note that fc layers yield different properties from an FCN, where the latter gives a prediction at each pixel based on a local receptive field and parameters are shared across spatial locations. Contrarily, fc layers are location sensitive, since predictions at different spatial locations are produced by varying sets of parameters, so they have the ability to adapt to different spatial locations. Also, the prediction at each spatial location is made with global information of the entire proposal, which is helpful for differentiating instances [48] and recognizing separate parts belonging to the same object. Given that the properties of fc and convolutional layers differ from each other, we fuse predictions from these two types of layers for better mask prediction.

Mask Prediction Structure Our mask prediction component is light-weight and easy to implement. The mask branch operates on the pooled feature grid of each proposal. As shown in Figure 4, the main path is a small FCN, which consists of 4 consecutive convolutional layers and 1 deconvolutional layer. Each convolutional layer consists of 256 3×3 filters and the deconvolutional layer up-samples the feature with factor 2. It predicts a binary pixel-wise mask for each class independently to decouple segmentation and classification, similar to Mask R-CNN. We further
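A rough sketch of such a fused mask head, per the FCN main path described above plus a tiny fc view. The width and branch point of the fc path are assumptions here, since this excerpt cuts off before their full description:

```python
import torch
import torch.nn as nn

class FusedMaskHead(nn.Module):
    """Sketch of the Figure 4 mask branch: a small FCN main path
    (4 convs + one x2 deconv) predicting per-class masks, fused with a
    class-agnostic mask from a tiny fc path that sees the whole proposal."""
    def __init__(self, channels=256, num_classes=81, grid=14):
        super().__init__()
        convs = []
        for _ in range(4):  # 4 consecutive 3x3 conv layers, 256 filters each
            convs += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        self.fcn = nn.Sequential(*convs)
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)
        self.mask_fcn = nn.Conv2d(channels, num_classes, 1)  # per-class masks
        out = 2 * grid  # 28 when the RoI grid is 14x14
        # Tiny fc path: one class-agnostic mask built from global information.
        self.fc = nn.Linear(channels * grid * grid, out * out)
        self.out = out

    def forward(self, x):  # x: (N, C, grid, grid) pooled feature grid
        h = self.fcn(x)
        fcn_masks = self.mask_fcn(torch.relu(self.deconv(h)))  # local view
        fc_mask = self.fc(h.flatten(1)).view(-1, 1, self.out, self.out)
        return fcn_masks + fc_mask  # fuse the two views by addition

head = FusedMaskHead()
masks = head(torch.randn(2, 256, 14, 14))
assert masks.shape == (2, 81, 28, 28)
```

Adding the single class-agnostic fc mask to every class channel keeps the fc path light while still injecting global, location-sensitive information.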
