Original paper: arXiv:1611.07709
Fully Convolutional Instance-aware Semantic Segmentation
Yi Li1,2* Haozhi Qi2* Jifeng Dai2 Xiangyang Ji1 Yichen Wei2
1Tsinghua University 2Microsoft Research Asia
{liyi14,xyji}@tsinghua.edu.cn, {v-haoq,jifdai,yichenw}@microsoft.com
Abstract
We present the first fully convolutional end-to-end solution for the instance-aware semantic segmentation task. It inherits all the merits of FCNs for semantic segmentation [29] and instance mask proposal [5]. It detects and segments the object instances jointly and simultaneously. By introducing position-sensitive inside/outside score maps, the underlying convolutional representation is fully shared between the two sub-tasks, as well as between all regions of interest. The proposed network is highly integrated and achieves state-of-the-art performance in both accuracy and efficiency. It wins the COCO 2016 segmentation competition by a large margin. Code will be released at https://github.com/daijifeng001/TA-FCN.
1. Introduction
Fully convolutional networks (FCNs) [29] have recently dominated the field of semantic image segmentation. An FCN takes an input image of arbitrary size, applies a series of convolutional layers, and produces per-pixel likelihood score maps for all semantic categories, as illustrated in Figure 1(a). Thanks to the simplicity, efficiency, and the local weight sharing property of convolution, FCNs provide an accurate, fast, and end-to-end solution for semantic segmentation.
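To make this concrete, the following is a minimal sketch of such a scoring head in PyTorch; the class name `FCNHead`, the channel counts, and the 21-way output are illustrative assumptions, not details from [29]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNHead(nn.Module):
    """Minimal sketch of an FCN scoring head: a 1x1 convolution turns the
    shared feature maps into per-pixel, per-category likelihood scores,
    which are upsampled back to the input resolution (channel counts and
    the 21-way output are illustrative assumptions)."""
    def __init__(self, in_channels=2048, num_classes=21):
        super().__init__()
        self.score = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features, out_size):
        scores = self.score(features)                # (N, C, h, w)
        return F.interpolate(scores, size=out_size,  # (N, C, H, W)
                             mode='bilinear', align_corners=False)
```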
However, conventional FCNs do not work for the instance-aware semantic segmentation task, which requires the detection and segmentation of individual object instances. The limitation is inherent. Because convolution is translation invariant, the same image pixel receives the same responses (and thus classification scores) irrespective of its relative position in the context. However, instance-aware semantic segmentation needs to operate at the region level, and the same pixel can have different semantics in different regions. This behavior cannot be modeled by a single FCN on the whole image. The problem is exemplified in Figure 2.
Certain translation-variant property is required to solve the problem. In a prevalent family of instance-aware semantic segmentation approaches, this is achieved by adopting different types of sub-networks in three stages: 1) an FCN is applied on the whole image to generate intermediate and shared feature maps; 2) from the shared feature maps, a pooling layer warps each region of interest (ROI) into fixed-size per-ROI feature maps [17, 12]; 3) one or more fully-connected (fc) layer(s) in the last sub-network convert the per-ROI feature maps to per-ROI masks. Note that the translation-variant property is introduced in the fc layer(s) in the last step.
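The schematic below illustrates this prevalent three-stage design (not our method), written against PyTorch/torchvision; `PerROIMaskHead` and all layer sizes are hypothetical choices for illustration:

```python
import torch.nn as nn
from torchvision.ops import roi_pool

class PerROIMaskHead(nn.Module):
    """Schematic of the three-stage pipeline: stage 1 (not shown) is a
    shared FCN backbone; stage 2 warps each ROI to a fixed size; stage 3
    applies per-ROI fc layers. All sizes are illustrative assumptions."""
    def __init__(self, in_channels=1024, roi_size=14, mask_size=28):
        super().__init__()
        self.roi_size = roi_size
        # Stage 3: the fc layers introduce translation variance, but are
        # re-executed for every ROI and are heavily parametrized.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, 1024), nn.ReLU(),
            nn.Linear(1024, mask_size * mask_size),  # 784-way mask output
        )

    def forward(self, shared_features, rois):
        # Stage 2: ROI pooling warps/resizes, losing spatial detail.
        pooled = roi_pool(shared_features, rois,
                          output_size=self.roi_size, spatial_scale=1.0 / 16)
        return self.fc(pooled)  # one mask estimate per ROI
```

Note how stage 3 is both heavily parametrized and unshared across ROIs, which is precisely what the drawbacks below refer to.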
Such methods have several drawbacks. First, the ROI pooling step loses spatial details due to feature warping and resizing, which, however, is necessary to obtain a fixed-size representation for the fc layers (as in [8]). Such distortion and fixed-size representation degrade the segmentation accuracy, especially for large objects. Second, the fc layers over-parametrize the task, without using the regularization of local weight sharing. For example, the last fc layer has a high-dimensional 784-way output to estimate a 28×28 mask. Last, the per-ROI network computation in the last step is not shared among ROIs. As observed empirically, a considerably complex sub-network in the last step is necessary to obtain good accuracy. It is therefore slow for a large number of ROIs (typically hundreds or thousands of region proposals). For example, in the MNC method [8], which won the 1st place in the COCO segmentation challenge 2015 [25], 10 layers in the ResNet-101 model [18] are kept in the per-ROI sub-network. The approach takes 1.4 seconds per image, where more than 80% of the time is spent on the last per-ROI step. These drawbacks motivate us to ask: can we exploit the merits of FCNs for end-to-end instance-aware semantic segmentation?
Recently, a fully convolutional approach has been proposed for instance mask proposal generation [5]. It extends the translation-invariant score maps in conventional FCNs to position-sensitive score maps, which are somewhat translation-variant. This is illustrated in Figure 1(b). The approach is only used for mask proposal generation and presents several drawbacks.
*Equal contribution. This work was done when Yi Li and Haozhi Qi were interns at Microsoft Research.
Figure 1. Illustration of our idea. (a) Conventional fully convolutional network (FCN) [29] for semantic segmentation. A single score map is used for each category, which is unaware of individual object instances. (b) InstanceFCN [5] for instance segment proposal, where k² position-sensitive score maps are used to encode relative position information. A downstream network is used for segment proposal classification. (c) Our fully convolutional instance-aware semantic segmentation method (FCIS), where position-sensitive inside/outside score maps are used to perform object segmentation and detection jointly and simultaneously.
It is blind to semantic categories and requires a downstream network for detection. The object segmentation and detection sub-tasks are separated and the solution is not end-to-end. It operates on square, fixed-size sliding windows and adopts a time-consuming image pyramid scanning to find instances at different scales.
In this work, we propose the first end-to-end fully convolutional approach for instance-aware semantic segmentation. Dubbed FCIS, it extends the approach in [5]. The underlying convolutional representation and the score maps are fully shared for the object segmentation and detection sub-tasks, via a novel joint formulation with no extra parameters. The network structure is highly integrated and efficient. The per-ROI computation is simple, fast, and does not involve any warping or resizing operations. The approach is briefly illustrated in Figure 1(c). It operates on box proposals instead of sliding windows, enjoying the recent advances in object detection [34].
Extensive experiments verify that the proposed approach is state-of-the-art in both accuracy and efficiency. It achieves significantly higher accuracy than the previous challenge-winning method MNC [8] on the large-scale COCO dataset [25]. It wins the 1st place in the COCO 2016 segmentation competition, outperforming the 2nd-place entry in accuracy by a large relative margin. It is fast. The inference in the COCO competition takes 0.24 seconds per image using the ResNet-101 model [18] (Nvidia K40), roughly 6× faster than MNC [8] at 1.4 seconds per image. Code will be released at https://github.com/daijifeng001/TA-FCN.
2. Our Approach
2.1. Position-sensitive Score Map Parameterization
In FCNs [29], a classifier is trained to predict each pixel's likelihood score of "the pixel belongs to some object category". It is translation invariant and unaware of individual object instances. For example, the same pixel can be foreground on one object but background on another (adjacent) object. A single score map per-category is insufficient to distinguish these two cases.
To introduce the translation-variant property, a fully convolutional solution was first proposed in [5] for instance mask proposal. It uses k² position-sensitive score maps that correspond to k×k evenly partitioned cells of objects. This is illustrated in Figure 1(b). Each score map has the same spatial extent as the original image (at a lower resolution, e.g., 16× smaller). Each score represents the likelihood of "the pixel belongs to some object instance at a relative position". For example, the first map is for "at top-left position" in Figure 1(b).
During training and inference, for a fixed-size square sliding window, its pixel-wise foreground likelihood map is produced by assembling (copy-pasting) its k×k cells from the corresponding score maps. In this way, a pixel can have different scores in different instances, as long as the pixel is at different relative positions in those instances.
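As a rough illustration of the assembling operation, consider the following NumPy sketch for one window, assuming k = 3 and a window side divisible by k (both illustrative simplifications):

```python
import numpy as np

def assemble_window_scores(score_maps, x0, y0, win, k=3):
    """Sketch of the assembling (copy-paste) operation. `score_maps` has
    shape (k*k, H, W): one map per relative-position cell. Each of the
    window's k x k cells is copied from the score map of the matching
    relative position, at the cell's own spatial location."""
    likelihood = np.empty((win, win), dtype=score_maps.dtype)
    cell = win // k  # assumes win is divisible by k, for simplicity
    for i in range(k):      # cell row (relative y position)
        for j in range(k):  # cell column (relative x position)
            ys = slice(y0 + i * cell, y0 + (i + 1) * cell)
            xs = slice(x0 + j * cell, x0 + (j + 1) * cell)
            likelihood[i * cell:(i + 1) * cell,
                       j * cell:(j + 1) * cell] = score_maps[i * k + j][ys, xs]
    return likelihood
```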
Figure 2. Instance segmentation and classification results (of "person" category) of different ROIs. The score maps are shared by different ROIs and both sub-tasks. The red dot indicates one pixel having different semantics in different ROIs.
As shown in [5], the approach is state-of-the-art for the object mask proposal task. However, it is also limited by the task. Only a fixed-size square sliding window is used. The network is applied on multi-scale images to find object instances of different sizes. The approach is blind to the object categories. Only a separate "objectness" classification sub-network is used to categorize the window as object or background. For the instance-aware semantic segmentation task, a separate downstream network is used to further classify the mask proposals into object categories [5].
2.2. Joint Mask Prediction and Classification
For the instance-aware semantic segmentation task, not only [5], but also many other state-of-the-art approaches, such as SDS [15], Hypercolumn [16], CFM [7], MNC [8], and MultiPathNet [42], share a similar structure: two subnetworks are used for object segmentation and detection sub-tasks, separately and sequentially.
Apparently, the design choices in such a setting, e.g., the two networks' structure, parameters, and execution order, are somewhat arbitrary. They may be made for convenience rather than for fundamental considerations. We conjecture that the separated sub-network design may not fully exploit the tight correlation between the two tasks.
We enhance the "position-sensitive score map" idea to perform the object segmentation and detection sub-tasks jointly and simultaneously. The same set of score maps is shared for the two sub-tasks, as well as the underlying convolutional representation. Our approach brings no extra parameters and eliminates non-essential design choices. We believe it can better exploit the strong correlation between the two sub-tasks.
Our approach is illustrated in Figure 1(c) and Figure 2. Given a region-of-interest (ROI), its pixel-wise score maps are produced by the assembling operation within the ROI. For each pixel in an ROI, there are two tasks: 1) detection: whether it belongs to an object bounding box at a relative position (detection+) or not (detection-); 2) segmentation: whether it is inside an object instance's boundary (segmentation+) or not (segmentation-). A simple solution is to train two classifiers, separately. That is exactly our baseline FCIS (separate score maps) in Table 1. In this case, the two classifiers are two 1×1 conv layers, each using just one task's supervision.
Our joint formulation fuses the two answers into two scores: inside and outside. There are three cases: 1) high inside score and low outside score: detection+, segmentation+; 2) low inside score and high outside score: detection+, segmentation-; 3) both scores are low: detection-, segmentation-. The two scores answer the two questions jointly via softmax and max operations. For detection, we use max to differentiate cases 1)-2) (detection+) from case 3) (detection-). The detection score of the whole ROI is then obtained via average pooling over all pixels' likelihoods (followed by a softmax operator across all the categories). For segmentation, we use softmax to differentiate case 1) (segmentation+) from case 2) (segmentation-), at each pixel. The foreground mask (in probabilities) of the ROI is the union of the per-pixel segmentation scores (for each category). Similarly, the two sets of scores are from two 1×1 convolutional layers. The inside/outside classifiers are trained jointly, as they receive back-propagated gradients from both the segmentation and detection losses.
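A minimal NumPy sketch of this fusion for a single ROI and a single category, assuming the inside/outside score maps have already been assembled:

```python
import numpy as np

def fuse_inside_outside(inside, outside):
    """Joint fusion sketch. `inside`/`outside` are assembled per-pixel
    score maps of equal shape. Softmax over the pair gives the per-pixel
    segmentation probability (case 1 vs. case 2); max followed by average
    pooling gives the ROI's detection likelihood (cases 1-2 vs. case 3)."""
    # Per-pixel softmax over the two scores -> foreground probability.
    m = np.maximum(inside, outside)  # subtracted for numerical stability
    e_in, e_out = np.exp(inside - m), np.exp(outside - m)
    mask_prob = e_in / (e_in + e_out)

    # Per-pixel max, then average pooling over the ROI -> detection score
    # (a softmax across all categories would follow).
    detection_score = np.maximum(inside, outside).mean()
    return mask_prob, detection_score
```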
The approach has many desirable properties. All the per-ROI components (as in Figure 1(c)) have no free parameters. The score maps are produced by a single FCN, without involving any feature warping, resizing, or fc layers. All the features and score maps respect the aspect ratio of the original image. The local weight sharing property of FCNs is preserved and serves as a regularization mechanism. All per-ROI computation is simple (k² cell division, score map copying, softmax, max, average pooling) and fast, giving rise to a negligible per-ROI computation cost.
2.3. An End-to-End Solution
Figure 3 shows the architecture of our end-to-end solution. While any convolutional network architecture can be used, in this work we adopt the ResNet model [18]. The last fully-connected layer for 1000-way classification is discarded. Only the previous convolutional layers are retained. The resulting feature maps have 2048 channels. On top of them, a 1×1 convolutional layer is added to reduce the dimension to 1024.
In the original ResNet, the effective feature stride (the decrease in feature map resolution) at the top of the network is 32. This is too coarse for instance-aware semantic segmentation. To reduce the feature stride and maintain the field of view, the "hole algorithm" [3, 29] (algorithme à trous [30]) is applied. The stride in the first block of the conv5 convolutional layers is decreased from 2 to 1. The effective feature stride is thus reduced to 16. To maintain the field of view, the "hole algorithm" is applied on all the convolutional layers of conv5 by setting the dilation to 2.
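For illustration, the same modification can be sketched on a torchvision ResNet, where conv5 corresponds to `layer4`; this is an assumed reimplementation, not the paper's code:

```python
import torch
import torchvision

def reduce_stride_with_dilation(resnet):
    """Sketch of the 'hole algorithm' on a torchvision ResNet: set the
    stride-2 convolutions of layer4 (conv5) to stride 1, and dilate its
    3x3 convolutions by 2 (padding adjusted accordingly), reducing the
    effective feature stride from 32 to 16 with the field of view kept."""
    for module in resnet.layer4.modules():
        if isinstance(module, torch.nn.Conv2d):
            if module.stride == (2, 2):
                module.stride = (1, 1)    # stride 2 -> 1
            if module.kernel_size == (3, 3):
                module.dilation = (2, 2)  # dilation 1 -> 2
                module.padding = (2, 2)   # keep spatial size unchanged
    return resnet

backbone = reduce_stride_with_dilation(torchvision.models.resnet101(weights=None))
```

torchvision also exposes the same trick directly via `resnet101(replace_stride_with_dilation=[False, False, True])`.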
We use a region proposal network (RPN) [34] to generate ROIs. For fair comparison with the MNC method [8], it is added on top of the conv4 layers in the same way. Note that the RPN is also fully convolutional.
From the conv5 feature maps, 2k²(C+1) score maps are produced (C object categories, one background category, two sets of k² score maps per category, k = 7 by default in experiments) using a 1×1 convolutional layer. Over the score maps, each ROI is projected into a 16× smaller region. Its segmentation probability maps and classification scores over all the categories are computed as described in Section 2.2.
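In code, the score-map layer and the ROI projection could be sketched as follows, with the 1024-channel input coming from the dimension-reduction layer above; `project_roi` is a hypothetical helper:

```python
import torch.nn as nn

C, k = 80, 7  # e.g., the 80 COCO categories; k = 7 cells per side (default)

# 1x1 conv over the conv5 features: 2 * k^2 * (C+1) score maps, i.e. two
# sets (inside/outside) of k^2 position-sensitive maps per category,
# plus the background category.
score_conv = nn.Conv2d(1024, 2 * k * k * (C + 1), kernel_size=1)

def project_roi(box, feature_stride=16):
    """Project an image-coordinate box (x1, y1, x2, y2) onto the score
    maps, which are feature_stride x smaller than the input image."""
    return [round(c / feature_stride) for c in box]
```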
Following modern object detection systems, bounding box (bbox) regression is used to refine the initial input ROIs. A sibling 1×1 convolutional layer with 4k² channels is added on the conv5 feature maps to estimate the bounding box shift in location and size.
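A corresponding sketch of the regression head, assuming the class-agnostic, position-sensitive form:

```python
import torch.nn as nn

k = 7
# Sibling 1x1 conv on the conv5 features: 4 * k^2 channels encode
# position-sensitive (dx, dy, dw, dh) box shifts, assembled and averaged
# per ROI in the same way as the score maps.
bbox_conv = nn.Conv2d(1024, 4 * k * k, kernel_size=1)
```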
Below, we discuss more details of inference and training.
Inference For an input image, 300 ROIs with the highest scores are generated from the RPN. They pass through the bbox regression branch and give rise to another 300 ROIs. For each ROI, we get its classification scores and foreground mask (in probability) for all categories. Figure 2 shows an example. Non-maximum suppression (NMS) with an intersection-over-union (IoU) threshold of 0.3 is used to filter out highly overlapping ROIs. The remaining ROIs are classified as the categories with the highest classification scores. Their foreground masks are obtained by mask voting [8] as follows. For an ROI under consideration, we find all the ROIs (from the 600) with IoU scores higher than 0.5. Their foreground masks of the category are averaged on a per-pixel basis, weighted by their classification scores. The averaged mask is binarized as the output.
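A simplified NumPy sketch of the mask voting step; `mask_voting` is a hypothetical helper, and aligning each candidate's mask onto a common pixel grid is omitted for brevity:

```python
import numpy as np

def mask_voting(masks, scores, iou, keep_idx, iou_thresh=0.5):
    """For each ROI kept after NMS, average the foreground masks of all
    candidate ROIs overlapping it with IoU > 0.5, weighted by their
    classification scores, then binarize. `iou` is a precomputed
    (num_rois, num_rois) IoU matrix; `masks` holds per-pixel foreground
    probabilities already aligned to a common grid."""
    voted = []
    for i in keep_idx:
        support = np.where(iou[i] > iou_thresh)[0]   # includes i itself
        w = scores[support] / scores[support].sum()  # score-based weights
        avg = np.tensordot(w, masks[support], axes=1)
        voted.append(avg >= 0.5)                     # binarized output mask
    return voted
```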
Training An ROI is positive if its box IoU with respect to the nearest ground-truth object is larger than 0.5; otherwise it is negative. Each ROI has three loss terms of equal weight: a softmax detection loss over C+1 categories, a softmax segmentation loss over the foreground mask of the ground-truth category only*, and a bbox regression loss as in [12]. The latter two loss terms are effective only on the positive ROIs.
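The three loss terms for a single ROI might be sketched as follows in PyTorch; the tensor shapes and the background-label convention (label 0) are assumptions:

```python
import torch
import torch.nn.functional as F

def fcis_losses(cls_logits, gt_label, mask_logits, gt_mask, bbox_pred, bbox_gt):
    """Per-ROI loss sketch. `cls_logits`: (C+1,) detection logits;
    `mask_logits`: (2, H, W) inside/outside logits for the ground-truth
    category; `gt_mask`: (H, W) {0, 1} labels. The segmentation term is a
    per-pixel softmax loss averaged over the ROI's size; the mask and
    bbox terms apply to positive ROIs only (background label assumed 0)."""
    det_loss = F.cross_entropy(cls_logits[None], gt_label[None])
    if gt_label.item() > 0:  # positive ROI
        seg_loss = F.cross_entropy(mask_logits[None], gt_mask[None])
        bbox_loss = F.smooth_l1_loss(bbox_pred, bbox_gt)
    else:
        seg_loss = bbox_loss = cls_logits.new_zeros(())
    return det_loss + seg_loss + bbox_loss  # equal weights
```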
During training, the model is initialized from the model pre-trained on ImageNet classification [18]. Layers absent in the pre-trained model are randomly initialized. The training images are resized to have a shorter side of 600 pixels. We use SGD optimization. We train the model using 8 GPUs, each holding one image mini-batch, giving rise to an effective batch size of 8.
*This term sums the per-pixel losses over the ROI and normalizes the sum by the ROI's size.