Mask Scoring R-CNN [Translation]


Original paper: arXiv:1903.00241

Mask Scoring R-CNN


Zhaojin Huang†*  Lichao Huang‡  Yongchao Gong‡  Chang Huang‡  Xinggang Wang†

† Institute of AI, School of EIC, Huazhong University of Science and Technology

‡ Horizon Robotics Inc.

{zhaojinhuang,xgwang}@hust.edu.cn {lichao.huang,yongchao.gong,chang}@horizon.ai


Abstract


Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as the mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with the classification score. In this paper, we study this problem and propose Mask Scoring R-CNN, which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. By extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gains with different models and outperforms the state-of-the-art Mask R-CNN. We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at https://github.com/zjhuang22/maskscoring_rcnn.


1. Introduction


Deep networks are dramatically driving the development of computer vision, leading to a series of state-of-the-art results in tasks including classification [22, 16, 35], object detection [12, 17, 32, 27, 33, 34], semantic segmentation [28, 4, 37, 18], etc. From the development of deep learning in computer vision, we can observe that the ability of deep networks is gradually growing from making image-level predictions [22] to region/box-level predictions [12], pixel-level predictions [28], and instance/mask-level predictions [15]. The ability to make fine-grained predictions requires not only more detailed labels but also more delicate network design.


In this paper, we focus on the problem of instance segmentation, which is a natural next step of object detection to move from coarse box-level instance recognition to precise pixel-level classification. Specifically, this work presents a novel method to score instance segmentation hypotheses, which is quite important for instance segmentation evaluation. The reason is that most evaluation metrics are defined according to the hypothesis scores, and more precise scores help to better characterize the model performance. For example, precision-recall curves and average precision (AP) are often used for the challenging instance segmentation dataset COCO [26]. If one instance segmentation hypothesis is not properly scored, it might be wrongly regarded as a false positive or false negative, resulting in a decrease of AP.


However, in most instance segmentation pipelines, such as Mask R-CNN [15] and MaskLab [3], the score of the instance mask is shared with the box-level classification confidence, which is predicted by a classifier applied on the proposal feature. It is inappropriate to use classification confidence to measure the mask quality, since it only serves to distinguish the semantic categories of proposals and is not aware of the actual quality and completeness of the instance mask. The misalignment between classification confidence and mask quality is illustrated in Fig. 1, where instance segmentation hypotheses get accurate box-level localization results and high classification scores, but the corresponding masks are inaccurate. Obviously, scoring the masks using such classification scores tends to degrade the evaluation results.


Unlike the previous methods that aim to obtain more accurate instance localization or segmentation mask, our method focuses on scoring the masks. To achieve this goal, our model learns a score for each mask instead of using its classification score. For clarity, we call the learned score mask score.


Inspired by the AP metric of instance segmentation that



  • The work was done when Zhaojin Huang was an intern at Horizon Robotics Inc.



Figure 1. Demonstrative cases of instance segmentation in which the bounding box has a high overlap with ground truth and a high classification score while the mask is not good enough. The scores predicted by both Mask R-CNN and our proposed MS R-CNN are attached above their corresponding bounding boxes. The left four images show good detection results with high classification scores but low mask quality. Our method aims at solving this problem. The rightmost image shows the case of a good mask with a high classification score; our method will retain the high score. As can be seen, the scores predicted by our model better reflect the actual mask quality.


uses pixel-level Intersection-over-Union (IoU) between the predicted mask and its ground truth mask to describe instance segmentation quality, we propose a network to learn the IoU directly. In this paper, this IoU is denoted as MaskIoU. Once we obtain the predicted MaskIoU in testing phase, mask score is reevaluated by multiplying the predicted MaskIoU and classification score. Thus, mask score is aware of both semantic categories and the instance mask completeness.


Learning MaskIoU is quite different from proposal classification or mask prediction, as it needs to "compare" the predicted mask with the object feature. Within the Mask R-CNN framework, we implement a MaskIoU prediction network named the MaskIoU head. It takes both the output of the mask head and the RoI feature as input, and is trained using a simple regression loss. We name the proposed model, namely Mask R-CNN with the MaskIoU head, Mask Scoring R-CNN (MS R-CNN). Extensive experiments with our MS R-CNN have been conducted, and the results demonstrate that our method provides consistent and noticeable performance improvement owing to the alignment between mask quality and mask score.


In summary, the main contributions of this work are highlighted as follows:


  1. We present Mask Scoring R-CNN, the first framework that addresses the problem of scoring instance segmentation hypotheses. It explores a new direction for improving the performance of instance segmentation models. By considering the completeness of the instance mask, its score can be penalized if the classification score is high while the mask is not good enough.


  2. Our MaskIoU head is very simple and effective. Experimental results on the challenging COCO benchmark show that when using the mask score from our MS R-CNN rather than only the classification confidence, the AP improves consistently by about 1.5% with various backbone networks.


2. Related Work


2.1. Instance Segmentation


Current instance segmentation methods can be roughly categorized into two classes. One is detection based methods and the other is segmentation based methods. Detection based methods exploit the state-of-the-art detectors, such as Faster R-CNN [33], R-FCN [8], to get the region of each instance, and then predict the mask for each region. Pinheiro et al. [31] proposed DeepMask to segment and classify the center object in a sliding window fashion. Dai et al. [6] proposed instance-sensitive FCNs to generate the position-sensitive maps and assembled them to obtain the final masks. FCIS [23] takes position-sensitive maps with inside/outside scores to generate the instance segmentation results. He et al. [15] proposed Mask R-CNN that is built on the top of Faster R-CNN by adding an instance-level semantic segmentation branch. Based on Mask R-CNN, Chen et al. [3] proposed MaskLab that used position-sensitive scores to obtain better results. However, an underlying drawback in these methods is that mask quality is only measured by the classification scores, thus resulting in the issues discussed above.


Segmentation based methods predict the category label of each pixel first and then group the pixels together to form instance segmentation results. Liang et al. [24] used spectral clustering to cluster the pixels. Other works, such as [20, 21], add boundary detection information during the clustering procedure. Bai et al. [1] predicted pixel-level energy values and used watershed algorithms for grouping. Recently, some works [30, 11, 14, 10] use metric learning to learn an embedding. Specifically, these methods learn an embedding for each pixel to ensure that pixels from the same instance have similar embeddings. Afterwards, clustering is performed on the learned embedding to obtain the final instance labels. As these methods do not have explicit scores to measure instance mask quality, they have to use the averaged pixel-level classification scores as an alternative.


Figure 2. Comparisons of Mask R-CNN and our proposed MS R-CNN. (a) shows the results of Mask R-CNN, where the mask score has little correlation with MaskIoU. (b) shows the results of MS R-CNN: we penalize detections with a high score and low MaskIoU, so the mask score correlates with MaskIoU better. (c) shows the quantitative results, where we average the score within each MaskIoU interval; we can see that our method yields a better correspondence between score and MaskIoU.


Neither class of methods above takes into consideration the alignment between mask score and mask quality. Due to the unreliability of the mask score, a mask hypothesis with a higher IoU against ground truth is liable to be ranked with low priority if it has a low mask score. In this case, the final AP is consequently degraded.


2.2. Detection Score Correction


Several methods focus on correcting the classification score of the detection box, sharing a similar goal with our method. Tychsen-Smith et al. [36] proposed Fitness NMS, which corrects the detection score using the IoU between the detected bounding boxes and their ground truth, formulating box IoU prediction as a classification task. Our method differs in that we formulate mask IoU estimation as a regression task. Jiang et al. [19] proposed IoU-Net, which regresses box IoU directly and uses the predicted IoU for both NMS and bounding box refinement. In [5], Cheng et al. discussed false positive samples and used a separate network to correct the scores of such samples. SoftNMS [2] uses the overlap between two boxes to correct the low-score box. Neumann et al. [29] proposed Relaxed Softmax, which predicts a temperature scaling factor for the standard softmax for safety-critical pedestrian detection.


Unlike these methods that focus on bounding-box-level detection, our method is designed for instance segmentation. The instance mask is further processed in our MaskIoU head so that the network can be aware of the completeness of the instance mask, and the final mask score can reflect the actual quality of the instance segmentation hypothesis. It is a new direction for improving the performance of instance segmentation.


3. Method


3.1. Motivation


In the current Mask R-CNN framework, the score of a detection (i.e., instance segmentation) hypothesis is determined by the largest element in its classification scores. Due to problems such as background clutter and occlusion, it is possible that the classification score is high but the mask quality is low, as in the examples shown in Fig. 1. To quantitatively analyze this problem, we compare the vanilla mask score from Mask R-CNN with the actual IoU between the predicted mask and its ground truth mask (MaskIoU). Specifically, we conduct experiments using Mask R-CNN with ResNet-18 FPN on the COCO 2017 validation dataset. Then we select the detection hypotheses after Soft-NMS with both MaskIoU and classification score larger than 0.5. The distribution of MaskIoU over classification score is shown in Fig. 2 (a) and the average classification score in each MaskIoU interval is shown in blue in Fig. 2 (c). These figures show that classification score and MaskIoU are not well correlated in Mask R-CNN.

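The per-interval averaging behind Fig. 2 (c) can be sketched in a few lines of NumPy (the detection arrays below are toy values and the variable names are our own, not from the paper's code):

```python
import numpy as np

# Hypothetical per-detection arrays: classification scores and the actual
# IoU of each predicted mask against its matched ground truth (MaskIoU).
cls_score = np.array([0.95, 0.92, 0.90, 0.88, 0.97, 0.91])
mask_iou = np.array([0.45, 0.80, 0.55, 0.92, 0.60, 0.75])

# Keep only hypotheses with both MaskIoU and classification score above 0.5,
# as in the paper's post-Soft-NMS selection.
keep = (cls_score > 0.5) & (mask_iou > 0.5)

# Average the classification score within each MaskIoU interval
# (0.5-0.6, ..., 0.9-1.0), reproducing the histogram style of Fig. 2 (c).
bins = np.arange(0.5, 1.01, 0.1)
idx = np.digitize(mask_iou[keep], bins) - 1
avg_score = [cls_score[keep][idx == i].mean()
             for i in range(len(bins) - 1) if np.any(idx == i)]
```

A flat (or noisy) `avg_score` across MaskIoU bins is exactly the misalignment the paper points out.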

In most instance segmentation evaluation protocols, such as COCO, a detection hypothesis with a low MaskIoU and a high score is harmful. In many practical applications, it is important to determine when the detection results can be trusted and when they cannot [29]. This motivates us to learn a calibrated mask score according to MaskIoU for every detection hypothesis. Without loss of generality, we work on the Mask R-CNN framework, and propose Mask Scoring R-CNN (MS R-CNN), a Mask R-CNN with an additional MaskIoU head module that learns the MaskIoU-aligned mask score. The predicted mask scores of our framework are shown in Fig. 2 (b) and the orange histogram in Fig. 2 (c).


3.2. Mask scoring in Mask R-CNN


Mask Scoring R-CNN is conceptually simple: Mask R-CNN with MaskIoU Head, which takes the instance feature and the predicted mask together as input, and predicts the IoU between input mask and ground truth mask, as shown in Fig. 3. We will present the details of our framework in the following sections.


Mask R-CNN: We begin by briefly reviewing Mask R-CNN [15]. Following Faster R-CNN [33], Mask R-CNN consists of two stages. The first stage is the Region Proposal Network (RPN). It proposes candidate object bounding boxes regardless of object categories. The second stage is termed the R-CNN stage, which extracts features using RoIAlign for each proposal and performs proposal classification, bounding box regression, and mask prediction.


Mask scoring: We define $s_{\text{mask}}$ as the score of the predicted mask. The ideal $s_{\text{mask}}$ equals the pixel-level IoU between the predicted mask and its matched ground truth mask, which was termed MaskIoU above. The ideal $s_{\text{mask}}$ should also be positive only for the ground truth category and zero for the other classes, since a mask belongs to only one class. This requires the mask score to work well on two tasks: classifying the mask into the right category and regressing the proposal's MaskIoU for the foreground object category.


It is hard to train these two tasks with a single objective function. For simplicity, we decompose the mask score learning task into mask classification and IoU regression, denoted as $s_{\text{mask}} = s_{\text{cls}} \cdot s_{\text{iou}}$ for all object categories. $s_{\text{cls}}$ focuses on classifying which class the proposal belongs to, and $s_{\text{iou}}$ focuses on regressing the MaskIoU.

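Numerically, the decomposition above is just a product of the two scores (toy values; the function name is ours):

```python
def mask_score(s_cls, s_iou):
    """Calibrated mask score: classification confidence times predicted MaskIoU."""
    return s_cls * s_iou

# A confidently classified but poorly segmented instance is penalized:
mask_score(0.95, 0.40)   # ~0.38
# while a well-segmented one keeps a high score:
mask_score(0.95, 0.90)   # ~0.855
```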

As for $s_{\text{cls}}$, its goal is to classify which class the proposal belongs to, which is already done by the classification task in the R-CNN stage, so we can directly take the corresponding classification score. Regressing $s_{\text{iou}}$ is the target of this paper and is discussed in the following paragraphs.


MaskIoU head: The MaskIoU head aims to regress the IoU between the predicted mask and its ground truth mask. We use the concatenation of the feature from the RoIAlign layer and the predicted mask as the input of the MaskIoU head. When concatenating, we use a max pooling layer with a kernel size of 2 and a stride of 2 to make the predicted mask have the same spatial size as the RoI feature. We only regress the MaskIoU for the ground truth class (for testing, we choose the predicted class) instead of all classes. Our MaskIoU head consists of 4 convolution layers and 3 fully connected layers. For the 4 convolution layers, we follow the Mask head and set the kernel size and filter number to 3 and 256 respectively for all convolution layers. For the 3 fully connected (FC) layers, we follow the R-CNN head and set the outputs of the first two FC layers to 1024 and the output of the final FC layer to the number of classes.

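A minimal PyTorch sketch of such a head, following the layer sizes above. The stride-2 placement in the last convolution (to bring the 14×14 map to 7×7 before the FC layers) is our assumption, not stated in the excerpt; class count and all names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskIoUHead(nn.Module):
    """Sketch of a MaskIoU head: takes a 256-channel 14x14 RoI feature and a
    28x28 predicted mask, and regresses one MaskIoU value per class."""

    def __init__(self, num_classes=80):
        super().__init__()
        layers, in_ch = [], 257  # 256 RoI channels + 1 mask channel
        for i in range(4):
            # 4 convs, kernel 3, 256 filters; stride 2 in the last one (assumed)
            stride = 2 if i == 3 else 1
            layers += [nn.Conv2d(in_ch, 256, 3, stride=stride, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = 256
        self.convs = nn.Sequential(*layers)
        self.fc1 = nn.Linear(256 * 7 * 7, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.fc3 = nn.Linear(1024, num_classes)

    def forward(self, roi_feat, pred_mask):
        # roi_feat: (N, 256, 14, 14); pred_mask: (N, 1, 28, 28)
        mask_pooled = F.max_pool2d(pred_mask, kernel_size=2, stride=2)  # -> 14x14
        x = torch.cat([roi_feat, mask_pooled], dim=1)                   # 257 channels
        x = self.convs(x).flatten(1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

head = MaskIoUHead(num_classes=80)
out = head(torch.randn(2, 256, 14, 14), torch.randn(2, 1, 28, 28))
```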

Training: For training the MaskIoU head, we use the RPN proposals as training samples. The training samples are required to have an IoU between the proposal box and the matched ground truth box larger than 0.5, the same as the training samples of the Mask head of Mask R-CNN. To generate the regression target for each training sample, we first get the predicted mask of the target class and binarize it using a threshold of 0.5. Then we use the MaskIoU between the binary mask and its matched ground truth as the MaskIoU target. We use the $\ell_2$ loss for regressing MaskIoU, and the loss weight is set to 1. The proposed MaskIoU head is integrated into Mask R-CNN, and the whole network is trained end to end.
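The target generation above amounts to a binarize-then-IoU computation plus a squared-error loss; a small sketch (function names are ours):

```python
import numpy as np

def mask_iou_target(pred_prob, gt_mask, thresh=0.5):
    """MaskIoU regression target: binarize the soft predicted mask at 0.5,
    then compute the pixel-level IoU against the matched ground-truth mask."""
    pred = pred_prob > thresh
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def l2_loss(pred_iou, target_iou):
    # the paper's l2 regression loss, loss weight 1
    return (pred_iou - target_iou) ** 2

# 2x2 toy masks: binarized prediction [[1,0],[1,1]] vs ground truth [[1,0],[0,1]]
pred = np.array([[0.9, 0.2], [0.8, 0.6]])
gt = np.array([[1, 0], [0, 1]])
target = mask_iou_target(pred, gt)  # intersection 2, union 3 -> 2/3
```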

Inference: During inference, we just use the MaskIoU head to calibrate the classification score generated by the R-CNN stage. Specifically, suppose the R-CNN stage of Mask R-CNN outputs N bounding boxes, among which the top-k (i.e., k = 100) scoring boxes after SoftNMS [2] are selected. The top-k boxes are then fed into the Mask head to generate multi-class masks. This is the standard Mask R-CNN inference procedure. We follow this procedure as well and feed the top-k target masks to predict the MaskIoU. The predicted MaskIoU is multiplied with the classification score to get the new calibrated mask score as the final mask confidence.

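A rough sketch of this rescoring step, using a plain score sort as a stand-in for the SoftNMS selection (names and toy values are ours):

```python
import numpy as np

def calibrate(cls_scores, pred_maskious, k=100):
    """Take the top-k detections by classification score (stand-in for the
    post-SoftNMS selection) and multiply each score by its predicted MaskIoU
    to get the final calibrated mask confidence."""
    order = np.argsort(cls_scores)[::-1][:k]
    return order, cls_scores[order] * pred_maskious[order]

cls = np.array([0.9, 0.6, 0.8])
miou = np.array([0.5, 0.9, 1.0])
order, final = calibrate(cls, miou, k=2)
# The 0.9-score detection with a poor mask (MaskIoU 0.5) drops below the
# 0.8-score detection with a perfect mask.
```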

4. Experiments


All experiments are conducted on the COCO dataset [26] with 80 object categories. We follow the COCO 2017 settings, using the 115k-image train split for training, the 5k validation split for validation, and the 20k test-dev split for testing. We use the COCO evaluation metric AP (averaged over IoU thresholds) to report results, including AP@0.5, AP@0.75, and $\text{AP}_S$, $\text{AP}_M$, $\text{AP}_L$ (AP at different scales). AP@0.5 (or AP@0.75) means using an IoU threshold of 0.5 (or 0.75) to identify whether a predicted bounding box or mask is positive in the evaluation. Unless noted, AP is evaluated using mask IoU.


4.1. Implementation Details


We use our reproduced Mask R-CNN for all experiments. We use a ResNet-18 based FPN network for the ablation study, and ResNet-18/50/101 based Faster R-CNN/FPN/DCN+FPN [9] for comparing our method with other baseline results. For ResNet-18 FPN, input images

