Original paper: arXiv:1703.06870
Mask R-CNN
Kaiming He Georgia Gkioxari Piotr Dollár Ross Girshick
Facebook AI Research (FAIR)
Abstract
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: github.com/facebookresearch/Detectron.
1. Introduction
The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN [12, 36] and Fully Convolutional Network (FCN) [30] frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.
Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.
Figure 1. The Mask R-CNN framework for instance segmentation.
Our method, called Mask R-CNN, extends Faster R-CNN [36] by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.
In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection.
Figure 2. Mask R-CNN results on the COCO test set. These results are based on ResNet-101 [19], achieving a mask AP of 35.7 and running at 5 fps. Masks are shown in color, and bounding box, category, and confidences are also shown.
Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.
Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.
Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.
Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO key-point dataset [28]. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.
We have released code to facilitate future research.
2. Related Work
R-CNN: The Region-based CNN (R-CNN) approach [13] to bounding-box object detection is to attend to a manageable number of candidate object regions [42, 20] and evaluate convolutional networks independently on each RoI. R-CNN was extended [18, 12] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN [36] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [38, 27, 21]), and is the current leading framework in several benchmarks.
Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] resorted to bottom-up segments [42, 2]. DeepMask [33] and following works learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.
Most recently, Li et al. [26] combined the segment proposal system in [8] and object detection system in [11] for "fully convolutional instance segmentation" (FCIS). The common idea in these works is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances.
Another family of solutions to instance segmentation is driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future.
3. Mask R-CNN
Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.
Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [36]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to [21] for latest, comprehensive comparisons between Faster R-CNN and other frameworks.
Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [33, 10, 26]). Our approach follows the spirit of Fast R-CNN [12] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN [13]).
Formally, during training, we define a multi-task loss on each sampled RoI as $L = L_{cls} + L_{box} + L_{mask}$. The classification loss $L_{cls}$ and bounding-box loss $L_{box}$ are identical to those defined in [12]. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. To this we apply a per-pixel sigmoid, and define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask (other mask outputs do not contribute to the loss).
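For illustration, a minimal NumPy sketch of this loss, assuming mask logits of shape (N, K, m, m); the function name and shapes are our own, not taken from the paper or its code release:

```python
import numpy as np

def l_mask(mask_logits, gt_masks, gt_classes):
    """Average binary cross-entropy on the ground-truth class's mask only.

    mask_logits: (N, K, m, m) raw scores, one m x m mask per class per RoI.
    gt_masks:    (N, m, m)    binary ground-truth masks for each positive RoI.
    gt_classes:  (N,)         ground-truth class index k for each RoI.
    """
    n = mask_logits.shape[0]
    # Select only the k-th mask per RoI; the other K-1 masks receive no
    # gradient, which is what decouples mask and class prediction.
    logits = mask_logits[np.arange(n), gt_classes]        # (N, m, m)
    probs = 1.0 / (1.0 + np.exp(-logits))                 # per-pixel sigmoid
    eps = 1e-7                                            # numerical safety
    bce = -(gt_masks * np.log(probs + eps) +
            (1.0 - gt_masks) * np.log(1.0 - probs + eps))
    return bce.mean()

rng = np.random.default_rng(0)
loss = l_mask(rng.normal(size=(3, 80, 28, 28)),
              rng.integers(0, 2, size=(3, 28, 28)).astype(float),
              np.array([5, 0, 17]))
print(loss)
```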
Our definition of $L_{mask}$ allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [30] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
Figure 3. RoIAlign: The dashed grid represents a feature map, the solid lines an RoI (with $2 \times 2$ bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map. No quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points.
Mask Representation: A mask encodes an input object's spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.
Specifically, we predict an $m \times m$ mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit $m \times m$ object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction, our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.
This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
RoIAlign: RoIPool [12] is a standard operation for extracting a small feature map (e.g., $7 \times 7$) from each RoI. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate $x$ by computing $[x/16]$, where 16 is a feature map stride and $[\cdot]$ is rounding; likewise, quantization is performed when dividing into bins (e.g., $7 \times 7$). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.
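As a toy illustration of these two rounding steps; the exact rounding of bin boundaries is our assumption, and implementations vary:

```python
STRIDE = 16  # feature-map stride assumed in the text

def roipool_quantize(x0, y0, x1, y1, out_size=7):
    """Illustrates RoIPool's two rounding steps on a made-up RoI."""
    # Step 1: snap the continuous RoI corners to the feature-map grid,
    # i.e. compute [x / 16] with rounding.
    fx0, fy0 = round(x0 / STRIDE), round(y0 / STRIDE)
    fx1, fy1 = round(x1 / STRIDE), round(y1 / STRIDE)
    # Step 2: divide the snapped RoI into out_size bins per axis whose
    # boundaries are themselves rounded to integer coordinates.
    w = max(fx1 - fx0, 1)
    bin_edges = [fx0 + round(i * w / out_size) for i in range(out_size + 1)]
    return (fx0, fy0, fx1, fy1), bin_edges

corners, edges = roipool_quantize(21.4, 38.7, 181.9, 250.2)
print(corners, edges)  # every coordinate has drifted from its true position
```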
To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use $x/16$ instead of $[x/16]$). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.
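A hedged sketch of the computation for a single RoIAlign bin, with four samples at regular sub-cell centers as in Figure 3; the helper names are ours:

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate feat (H, W) at a continuous point (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def roi_align_bin(feat, y0, x0, y1, x1):
    """Average 4 regularly spaced samples inside one continuous RoI bin.

    No coordinate is ever rounded: the bin corners stay floating-point,
    matching Figure 3.
    """
    h, w = y1 - y0, x1 - x0
    # Sample at the centers of the bin's 2 x 2 sub-cells.
    pts = [(y0 + (i + 0.5) * h / 2, x0 + (j + 0.5) * w / 2)
           for i in range(2) for j in range(2)]
    return np.mean([bilinear(feat, py, px) for py, px in pts])

feat = np.arange(64, dtype=float).reshape(8, 8)  # toy feature map
print(roi_align_bin(feat, 1.3, 2.6, 3.9, 5.2))   # continuous bin, no rounding
```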
RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] as quantizing RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.
Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI.
We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet [19] and ResNeXt [45] networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets [19] extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in prior work.
We also explore another more effective backbone recently proposed by Lin et al. [27], called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to [27].
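For context, [27] assigns an RoI of width $w$ and height $h$ to pyramid level $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$; a small sketch, with the clamping range assumed to be the P2-P5 levels of [27]:

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Assign an RoI to an FPN level by its scale, per Eqn. (1) of [27]."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k, k_max))  # clamp to the available levels

print(fpn_level(224, 224))  # canonical 224x224 RoI -> level 4 (P4)
print(fpn_level(112, 112))  # half the scale -> level 3 (P3)
```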
For the network head we closely follow architectures presented in previous work to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet [19] and FPN [27] papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer 'res5' [19]), which is compute-intensive. For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters.
We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.
Figure 4. Head Architecture: We extend two existing Faster R-CNN heads [19, 27]. Left/Right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27], respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial dimension while deconv increases it). All convs are $3 \times 3$, except the output conv which is $1 \times 1$; deconvs are $2 \times 2$ with stride 2, and we use ReLU [31] in hidden layers. Left: 'res5' denotes ResNet's fifth stage, which for simplicity we altered so that the first conv operates on a $7 \times 7$ RoI with stride 1 (instead of $14 \times 14$ / stride 2 as in [19]). Right: '$\times 4$' denotes a stack of four consecutive convs.
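As a rough PyTorch sketch of the FPN-style mask branch in the right panel of Figure 4 (a '$\times 4$' stack of $3 \times 3$ convs, a $2 \times 2$ stride-2 deconv, and a $1 \times 1$ output conv); this module is our paraphrase of the figure, not the released implementation:

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """Sketch of the FPN mask branch: 'x4' convs, deconv upsample, K outputs."""
    def __init__(self, in_channels=256, num_classes=80):  # 80 = COCO classes
        super().__init__()
        convs, c = [], in_channels
        for _ in range(4):                  # the 'x4' stack of 3x3 convs
            convs += [nn.Conv2d(c, 256, 3, padding=1), nn.ReLU()]
            c = 256
        self.convs = nn.Sequential(*convs)
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 14 -> 28
        self.relu = nn.ReLU()
        self.mask_logits = nn.Conv2d(256, num_classes, 1)  # one mask per class

    def forward(self, roi_feats):           # (N, 256, 14, 14) RoI features
        x = self.convs(roi_feats)
        x = self.relu(self.deconv(x))
        return self.mask_logits(x)          # (N, K, 28, 28) raw logits
```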
3.1. Implementation Details
We set hyper-parameters following existing Fast/Faster R-CNN work [12, 36, 27]. Although these decisions were made for object detection in original papers [12, 36, 27], we found our instance segmentation system is robust to them.
Training: As in Fast R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. The mask loss $L_{mask}$ is defined only on positive RoIs. The mask target is the intersection between an RoI and its associated ground-truth mask.
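A small self-contained sketch of this labeling rule; the box format and function names are ours:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x0, y0, x1, y1) format."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def is_positive(roi, gt_boxes, thresh=0.5):
    """Positive iff the best IoU with any ground-truth box is >= 0.5."""
    return max(iou(roi, gt) for gt in gt_boxes) >= thresh

print(is_positive((0, 0, 10, 10), [(1, 1, 11, 11)]))  # True: IoU ~= 0.68
```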
We adopt image-centric training [12]. Images are resized such that their scale (shorter edge) is 800 pixels [27]. Each mini-batch has 2 images per GPU and each image has $N$ sampled RoIs, with a ratio of 1:3 of positive to negatives [12]. $N$ is 64 for the C4 backbone (as in [12, 36]) and 512 for FPN (as in [27]). We train on 8 GPUs (so effective mini-batch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt [45], we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01.
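Collected as a reference sketch with the values quoted above; the key names are our own, not from any released configuration file:

```python
# Training hyper-parameters gathered from the text above.
TRAIN_CFG = {
    "image_short_side": 800,      # resize shorter edge to 800 px [27]
    "images_per_gpu": 2,          # 8 GPUs -> effective mini-batch of 16
    "rois_per_image": {"C4": 64, "FPN": 512},
    "positive_fraction": 0.25,    # 1:3 positives to negatives [12]
    "iterations": 160_000,
    "base_lr": 0.02,              # divided by 10 at iteration 120k
    "lr_decay_iter": 120_000,
    "weight_decay": 1e-4,
    "momentum": 0.9,
    "resnext_images_per_gpu": 1,  # with ResNeXt [45], lr starts at 0.01
    "resnext_base_lr": 0.01,
}
```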
The RPN anchors span 5 scales and 3 aspect ratios, following [27]. For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified. For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable.
Inference: At test time, the proposal number is 300 for the C4 backbone (as in [36]) and 1000 for FPN (as in [27]). We run the box prediction branch on these proposals, followed by non-maximum suppression [14]. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). The mask branch can predict $K$ masks per RoI, but we only use the $k$-th mask, where $k$ is the predicted class by the classification branch. The $m \times m$ floating-number mask output is then resized to the RoI size, and binarized at a threshold of 0.5.
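The final resize-and-binarize step can be sketched as follows; the nearest-neighbor resize is our simplification, as the text above does not specify the interpolation used:

```python
import numpy as np

def paste_mask(soft_mask, box, image_h, image_w, thresh=0.5):
    """Resize an m x m soft mask to its detection box and binarize at 0.5.

    soft_mask: (m, m) sigmoid probabilities for the predicted class k.
    box:       (x0, y0, x1, y1) detection box in integer image pixels.
    """
    x0, y0, x1, y1 = box
    bh, bw = y1 - y0, x1 - x0
    m = soft_mask.shape[0]
    # Nearest-neighbor resize of the m x m grid to the box size.
    ys = np.clip(np.arange(bh) * m // bh, 0, m - 1)
    xs = np.clip(np.arange(bw) * m // bw, 0, m - 1)
    resized = soft_mask[np.ix_(ys, xs)]
    full = np.zeros((image_h, image_w), dtype=bool)
    full[y0:y1, x0:x1] = resized >= thresh  # binarize at a threshold of 0.5
    return full

mask = paste_mask(np.random.rand(28, 28), (40, 30, 140, 110), 200, 200)
print(mask.sum())
```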