YOLACT Real-time Instance Segmentation【翻译】



原文链接:1904.02689

YOLACT Real-time Instance Segmentation

YOLACT 实时实例分割

Daniel Bolya

丹尼尔·博利亚

University of California, Davis

加州大学戴维斯分校

{dbolya, cczhou, fyxiao, yongjaelee}@ucdavis.edu


Abstract

摘要

We present a simple, fully-convolutional model for real-time instance segmentation that achieves 29.8 mAP on MS COCO at 33.5 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional. Finally, we also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty.

我们提出了一种简单的全卷积模型,用于实时实例分割,在单个 Titan Xp 上以 33.5 fps 的速度在 MS COCO 上达到 29.8 mAP,这比此前任何具有竞争力的方法都快得多。此外,我们仅使用一块 GPU 训练便获得了这一结果。我们通过将实例分割分解为两个并行子任务来实现这一点:(1) 生成一组原型掩膜;(2) 预测每个实例的掩膜系数。然后,我们将原型与掩膜系数进行线性组合来生成实例掩膜。我们发现,由于这一过程不依赖于重新池化,该方法能产生非常高质量的掩膜,并且无需额外代价即可获得时间稳定性。此外,我们分析了原型的涌现行为,并展示它们尽管是全卷积的,却能以平移可变(translation variant)的方式自主定位实例。最后,我们还提出了 Fast NMS,它可以直接替换标准 NMS,速度快 12 毫秒,而性能损失微乎其微。

1. Introduction

1. 引言

"Boxes are stupid anyway though, I'm probably a true believer in masks except I can't get YOLO to learn them."

“反正框是愚蠢的,我可能是真正相信掩膜的人,只是我无法让 YOLO 学习它们。”

– Joseph Redmon, YOLOv3 [36]

– 约瑟夫·雷德蒙,YOLOv3 [36]

What would it take to create a real-time instance segmentation algorithm? Over the past few years, the vision community has made great strides in instance segmentation, in part by drawing on powerful parallels from the well-established domain of object detection. State-of-the-art approaches to instance segmentation like Mask R-CNN [18] and FCIS [24] directly build off of advances in object detection like Faster R-CNN [37] and R-FCN [8]. Yet, these methods focus primarily on performance over speed, leaving the scene devoid of instance segmentation parallels to real-time object detectors like SSD [30] and YOLO [35, 36]. In this work, our goal is to fill that gap with a fast, one-stage instance segmentation model in the same way that SSD and YOLO fill that gap for object detection.

创建一个实时实例分割算法需要什么?在过去几年里,视觉领域在实例分割方面取得了巨大进展,部分原因是借鉴了成熟的物体检测领域中的思路。像 Mask R-CNN [18] 和 FCIS [24] 这样最先进的实例分割方法,直接建立在 Faster R-CNN [37] 和 R-FCN [8] 等物体检测进展之上。然而,这些方法主要关注精度而非速度,导致实例分割领域一直缺少与实时物体检测器(如 SSD [30] 和 YOLO [35, 36])相对应的方法。在这项工作中,我们的目标是用一个快速的一阶段实例分割模型来填补这一空白,正如 SSD 和 YOLO 为物体检测所做的那样。

Figure 1: Speed-performance trade-off for various instance segmentation methods on COCO. To our knowledge, ours is the first real-time (above 30 FPS) approach with around 30 mask mAP on COCO test-dev.

图 1:在 COCO 上各种实例分割方法的速度-性能权衡。据我们所知,我们的方法是第一个实时(超过 30 FPS)且在 COCO test-dev 上掩码 mAP 约为 30 的方法。

However, instance segmentation is hard, much harder than object detection. One-stage object detectors like SSD and YOLO are able to speed up existing two-stage detectors like Faster R-CNN by simply removing the second stage and making up for the lost performance in other ways. The same approach is not easily extendable, however, to instance segmentation. State-of-the-art two-stage instance segmentation methods depend heavily on feature localization to produce masks. That is, these methods "re-pool" features in some bounding box region (e.g., via RoI-pool/align), and then feed these now localized features to their mask predictor. This approach is inherently sequential and is therefore difficult to accelerate. One-stage methods that perform these steps in parallel like FCIS do exist, but they require significant amounts of post-processing after localization, and thus are still far from real-time.

然而,实例分割是困难的——比物体检测要困难得多。一阶段物体检测器如 SSD 和 YOLO 能够通过简单地去掉第二阶段并以其他方式弥补性能损失,从而加速现有的两阶段检测器如 Faster R-CNN。然而,这种方法并不容易扩展到实例分割。最先进的两阶段实例分割方法在很大程度上依赖于特征定位来生成掩码。也就是说,这些方法在某个边界框区域“重新池化”特征(例如,通过 RoI-pool/align),然后将这些现在已定位的特征输入到它们的掩码预测器中。这种方法本质上是顺序的,因此难以加速。确实存在像 FCIS 这样的并行执行这些步骤的一阶段方法,但它们在定位后需要大量的后处理,因此仍然远未达到实时水平。

To address these issues, we propose YOLACT$^{1}$, a real-time instance segmentation framework that forgoes an explicit localization step. Instead, YOLACT breaks up instance segmentation into two parallel tasks: (1) generating a dictionary of non-local prototype masks over the entire image, and (2) predicting a set of linear combination coefficients per instance. Then producing a full-image instance segmentation from these two components is simple: for each instance, linearly combine the prototypes using the corresponding predicted coefficients and then crop with a predicted bounding box. We show that by segmenting in this manner, the network learns how to localize instance masks on its own, where visually, spatially, and semantically similar instances appear different in the prototypes.

为了解决这些问题,我们提出了 YOLACT$^{1}$,这是一个实时实例分割框架,省略了显式定位步骤。相反,YOLACT 将实例分割分解为两个并行任务:(1)在整个图像上生成一个非局部原型掩膜字典,以及(2)为每个实例预测一组线性组合系数。然后,从这两个组件生成全图实例分割是简单的:对于每个实例,使用相应的预测系数线性组合原型,然后用预测的边界框裁剪。我们展示了通过这种方式进行分割,网络能够自主学习如何定位实例掩膜,在视觉上、空间上和语义上相似的实例在原型中看起来是不同的。


$^{1}$ You Only Look At CoefficienTs

$^{1}$ 你只看系数(You Only Look At CoefficienTs)


Moreover, since the number of prototype masks is independent of the number of categories (e.g., there can be more categories than prototypes), YOLACT learns a distributed representation in which each instance is segmented with a combination of prototypes that are shared across categories. This distributed representation leads to interesting emergent behavior in the prototype space: some prototypes spatially partition the image, some localize instances, some detect instance contours, some encode position-sensitive directional maps (similar to those obtained by hard-coding a position-sensitive module in FCIS [24]), and most do a combination of these tasks (see Figure 5).

此外,由于原型掩膜的数量与类别的数量无关(例如,类别的数量可以多于原型),YOLACT 学习了一种分布式表示,其中每个实例是通过跨类别共享的原型组合进行分割的。这种分布式表示在原型空间中导致了有趣的涌现行为:一些原型在空间上划分图像,一些定位实例,一些检测实例轮廓,一些编码位置敏感的方向图(类似于通过在 FCIS [24] 中硬编码位置敏感模块获得的图),大多数则执行这些任务的组合(见图 5)。

This approach also has several practical advantages. First and foremost, it's fast: because of its parallel structure and extremely lightweight assembly process, YOLACT adds only a marginal amount of computational overhead to a one-stage backbone detector, making it easy to reach 30 fps even when using ResNet-101 [19]; in fact, the entire mask branch takes only $\sim 5$ ms to evaluate. Second, masks are high-quality: since the masks use the full extent of the image space without any loss of quality from repooling, our masks for large objects are significantly higher quality than those of other methods (see Figure 7). Finally, it's general: the idea of generating prototypes and mask coefficients could be added to almost any modern object detector.

这种方法还有几个实际优势。首先,它的速度很快:由于其并行结构和极其轻量的组装过程,YOLACT 仅对单阶段主干检测器增加了微不足道的计算开销,使得即使使用 ResNet-101 [19] 也能轻松达到 30 fps;实际上,整个掩码分支的评估仅需约 5 ms。其次,掩码质量高:由于掩码利用了图像空间的全部范围,而没有因重新池化而导致质量损失,我们对大物体的掩码质量显著高于其他方法(见图 7)。最后,它具有通用性:生成原型和掩码系数的想法可以添加到几乎任何现代物体检测器中。

Our main contribution is the first real-time (> 30 fps) instance segmentation algorithm with competitive results on the challenging MS COCO dataset [28] (see Figure 1). In addition, we analyze the emergent behavior of YOLACT's prototypes and provide experiments to study the speed vs. performance trade-offs obtained with different backbone architectures, numbers of prototypes, and image resolutions. We also provide a novel Fast NMS approach that is 12 ms faster than traditional NMS with a negligible performance penalty. The code for YOLACT is available at

我们的主要贡献是第一个实时(> 30 fps)实例分割算法,在具有挑战性的 MS COCO 数据集 [28] 上取得了竞争性结果(见图 1)。此外,我们分析了 YOLACT 原型的涌现行为,并提供实验以研究不同主干架构、原型数量和图像分辨率下获得的速度与性能权衡。我们还提供了一种新颖的快速 NMS 方法,其速度比传统 NMS 快 12 ms,且性能损失可以忽略不计。YOLACT 的代码可在以下地址获取:

github.com/dbolya/yola….

2. Related Work

2. 相关工作

Instance Segmentation Given its importance, a lot of research effort has been made to push instance segmentation accuracy. Mask-RCNN [18] is a representative two-stage instance segmentation approach that first generates candidate region-of-interests (ROIs) and then classifies and segments those ROIs in the second stage. Follow-up works try to improve its accuracy by e.g., enriching the FPN features [29] or addressing the incompatibility between a mask's confidence score and its localization accuracy [20]. These two-stage methods require re-pooling features for each ROI and processing them with subsequent computations, which make them unable to obtain real-time speeds (30 fps) even when decreasing image size (see Table 2c).

实例分割 由于其重要性,许多研究努力致力于提高实例分割的准确性。Mask-RCNN [18] 是一种具有代表性的两阶段实例分割方法,它首先生成候选兴趣区域(ROIs),然后在第二阶段对这些 ROIs 进行分类和分割。后续工作试图通过丰富 FPN 特征 [29] 或解决掩模的置信度分数与其定位准确性之间的不兼容性 [20] 来提高其准确性。这些两阶段方法需要对每个 ROI 重新池化特征,并通过后续计算进行处理,这使得它们即使在减小图像大小时也无法达到实时速度(30 fps)(见表 2c)。

One-stage instance segmentation methods generate position sensitive maps that are assembled into final masks with position-sensitive pooling [6, 24] or combine semantic segmentation logits and direction prediction logits [4]. Though conceptually faster than two-stage methods, they still require repooling or other non-trivial computations (e.g., mask voting). This severely limits their speed, placing them far from real-time. In contrast, our assembly step is much more lightweight (only a linear combination) and can be implemented as one GPU-accelerated matrix-matrix multiplication, making our approach very fast.

一阶段实例分割方法生成位置敏感图,这些图通过位置敏感池化 [6, 24] 组装成最终掩模,或结合语义分割 logits 和方向预测 logits [4]。尽管在概念上比两阶段方法更快,但它们仍然需要重新池化或其他非平凡的计算(例如,掩模投票)。这严重限制了它们的速度,使它们远离实时处理。相比之下,我们的组装步骤轻量得多(仅为线性组合),可以实现为一次 GPU 加速的矩阵-矩阵乘法,使我们的方法非常快速。

Finally, some methods first perform semantic segmentation followed by boundary detection [22], pixel clustering [3, 25], or learn an embedding to form instance masks [32, 17, 9, 13]. Again, these methods have multiple stages and/or involve expensive clustering procedures, which limits their viability for real-time applications.

最后,一些方法首先执行语义分割,然后进行边界检测 [22]、像素聚类 [3, 25],或学习嵌入以形成实例掩模 [32, 17, 9, 13]。同样,这些方法具有多个阶段和/或涉及昂贵的聚类过程,这限制了它们在实时应用中的可行性。

Real-time Instance Segmentation While real-time object detection [30, 34, 35, 36] and semantic segmentation [2, 41, 33, 11, 47] methods exist, few works have focused on real-time instance segmentation. Straight to Shapes [21] and Box2Pix [42] can perform instance segmentation in real-time (30 fps on Pascal SBD 2012 [12, 16] for Straight to Shapes, and 10.9 fps on Cityscapes [5] and 35 fps on KITTI [15] for Box2Pix), but their accuracies are far from that of modern baselines. In fact, Mask R-CNN [18] remains one of the fastest instance segmentation methods on semantically challenging datasets like COCO [28] (13.5 fps on $550^2$ px images; see Table 2c).

实时实例分割 尽管存在实时物体检测 [30, 34, 35, 36] 和语义分割 [2, 41, 33, 11, 47] 方法,但很少有研究专注于实时实例分割。Straight to Shapes [21] 和 Box2Pix [42] 可以实时执行实例分割(Straight to Shapes 在 Pascal SBD 2012 [12, 16] 上的帧率为 30 fps,而 Box2Pix 在 Cityscapes [5] 上为 10.9 fps,在 KITTI [15] 上为 35 fps),但它们的准确性远低于现代基线。实际上,Mask R-CNN [18] 仍然是处理像 COCO [28] 这样具有语义挑战性的数据集时最快的实例分割方法之一(在 $550^2$ px 的图像上帧率为 13.5 fps;见表 2c)。

Prototypes Learning prototypes (aka vocabulary or codebook) has been extensively explored in computer vision. Classical representations include textons [23] and visual words [40], with advances made via sparsity and locality priors [44, 43, 46]. Others have designed prototypes for object detection [1, 45, 38]. Though related, these works use prototypes to represent features, whereas we use them to assemble masks for instance segmentation. Moreover, we learn prototypes that are specific to each image, rather than global prototypes shared across the entire dataset.

原型 学习原型(也称为词汇或码本)在计算机视觉中得到了广泛的探索。经典表示包括纹理基元(textons)[23] 和视觉词 [40],并通过稀疏性和局部性先验 [44, 43, 46] 取得了进展。还有一些工作为物体检测设计了原型 [1, 45, 38]。尽管相关,这些工作使用原型来表示特征,而我们则使用原型来组装实例分割的掩码。此外,我们学习的原型是特定于每幅图像的,而不是在整个数据集上共享的全局原型。

3. YOLACT

3. YOLACT

Our goal is to add a mask branch to an existing one-stage object detection model in the same vein as Mask R-CNN [18] does to Faster R-CNN [37], but without an explicit feature localization step (e.g., feature repooling). To do this, we break up the complex task of instance segmentation into two simpler, parallel tasks that can be assembled to form the final masks. The first branch uses an FCN [31] to produce a set of image-sized "prototype masks" that do not depend on any one instance. The second adds an extra head to the object detection branch to predict a vector of "mask coefficients" for each anchor that encode an instance's representation in the prototype space. Finally, for each instance that survives NMS, we construct a mask for that instance by linearly combining the work of these two branches.

我们的目标是在现有的一阶段目标检测模型中添加一个掩码分支,类似于 Mask R-CNN [18] 对 Faster R-CNN [37] 的做法,但不采用显式的特征定位步骤(例如,特征重池化)。为此,我们将实例分割的复杂任务分解为两个更简单的并行任务,这些任务可以组合形成最终的掩码。第一个分支使用 FCN [31] 生成一组图像大小的“原型掩码”,这些掩码不依赖于任何单一实例。第二个分支在目标检测分支上添加一个额外的头,以预测每个锚点的“掩码系数”向量,这些系数编码了实例在原型空间中的表示。最后,对于每个在 NMS 中存活的实例,我们通过线性组合这两个分支的工作来构建该实例的掩码。

Figure 2: YOLACT Architecture Blue/yellow indicates low/high values in the prototypes, gray nodes indicate functions that are not trained, and $k = 4$ in this example. We base this architecture off of RetinaNet [27] using ResNet-101 + FPN.

图 2:YOLACT 架构 蓝色/黄色表示原型中的低/高值,灰色节点表示不参与训练的函数;在此示例中 $k = 4$。我们基于 RetinaNet [27] 构建此架构,使用 ResNet-101 + FPN。

Rationale We perform instance segmentation in this way primarily because masks are spatially coherent; i.e., pixels close to each other are likely to be part of the same instance. While a convolutional (conv) layer naturally takes advantage of this coherence, a fully-connected (fc) layer does not. That poses a problem, since one-stage object detectors produce class and box coefficients for each anchor as an output of an $fc$ layer.$^{2}$ Two-stage approaches like Mask R-CNN get around this problem by using a localization step (e.g., RoI-Align), which preserves the spatial coherence of the features while also allowing the mask to be a conv layer output. However, doing so requires a significant portion of the model to wait for a first-stage RPN to propose localization candidates, inducing a significant speed penalty.

理由 我们以这种方式进行实例分割,主要是因为掩码在空间上是连贯的;即,彼此靠近的像素很可能属于同一个实例。虽然卷积(conv)层自然利用了这种连贯性,但全连接(fc)层则没有。这就造成了一个问题,因为一阶段目标检测器为每个锚点生成类别和框系数,作为 $fc$ 层的输出。$^{2}$ 像 Mask R-CNN 这样的两阶段方法通过使用定位步骤(例如 RoI-Align)来解决这个问题,这样既保持了特征的空间连贯性,又允许掩码作为卷积层的输出。然而,这样做需要模型的相当一部分等待第一阶段 RPN 提出定位候选,从而导致显著的速度损失。

Thus, we break the problem into two parallel parts, making use of $fc$ layers, which are good at producing semantic vectors, and conv layers, which are good at producing spatially coherent masks, to produce the "mask coefficients" and "prototype masks", respectively. Then, because prototypes and mask coefficients can be computed independently, the computational overhead over that of the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication. In this way, we can maintain spatial coherence in the feature space while still being one-stage and fast.

因此,我们将问题分解为两个并行部分:利用擅长生成语义向量的 $fc$ 层来生成“掩码系数”,利用擅长生成空间连贯掩码的卷积层来生成“原型掩码”。然后,由于原型和掩码系数可以独立计算,相对于主干检测器的额外计算开销主要来自组装步骤,而这一步可以通过一次矩阵乘法来实现。通过这种方式,我们既能在特征空间中保持空间连贯性,又能保持单阶段和快速。

3.1. Prototype Generation

3.1. 原型生成

The prototype generation branch (protonet) predicts a set of $k$ prototype masks for the entire image. We implement protonet as an FCN whose last layer has $k$ channels (one for each prototype) and attach it to a backbone feature layer (see Figure 3 for an illustration). While this formulation is similar to standard semantic segmentation, it differs in that we exhibit no explicit loss on the prototypes. Instead, all supervision for these prototypes comes from the final mask loss after assembly.

原型生成分支(protonet)为整个图像预测一组 $k$ 个原型掩码。我们将 protonet 实现为一个全卷积网络(FCN),其最后一层具有 $k$ 个通道(每个原型一个),并将其附加到主干的一个特征层上(示意见图 3)。虽然这种形式与标准语义分割相似,但不同之处在于我们对原型没有施加显式的损失。相反,这些原型的所有监督都来自组装后的最终掩码损失。

We note two important design choices: taking protonet from deeper backbone features produces more robust masks, and higher resolution prototypes result in both higher quality masks and better performance on smaller objects. Thus, we use FPN [26] because its largest feature layers ($P_3$ in our case; see Figure 2) are the deepest. Then, we upsample it to one fourth the dimensions of the input image to increase performance on small objects.

我们注意到两个重要的设计选择:从更深的主干特征中提取 protonet 会产生更稳健的掩码;更高分辨率的原型会同时带来更高质量的掩码以及在小物体上更好的性能。因此,我们使用 FPN [26],因为其最大的特征层(在我们的案例中为 $P_3$;见图 2)是最深的。然后,我们将其上采样至输入图像尺寸的四分之一,以提高对小物体的性能。

Finally, we find it important for the protonet's output to be unbounded, as this allows the network to produce large, overpowering activations for prototypes it is very confident about (e.g., obvious background). Thus, we have the option of following protonet with either a ReLU or no nonlinearity. We choose ReLU for more interpretable prototypes.

最后,我们发现 protonet 的输出不受限制是重要的,因为这使得网络能够为其非常有信心的原型(例如,明显的背景)产生大的、强烈的激活。因此,我们可以选择在 protonet 后跟随一个 ReLU 或不使用非线性激活。我们选择 ReLU 以获得更可解释的原型。
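To make the description above concrete, here is a minimal PyTorch-style sketch of such a protonet branch. The class name, layer widths, and the default $k = 32$ are placeholders chosen for illustration rather than the authors' exact configuration; the structure follows the text: a small FCN attached to the FPN's $P_3$ map, an upsample to raise prototype resolution, a final $1 \times 1$ conv with $k$ output channels, and a ReLU so prototypes stay non-negative but unbounded above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoNet(nn.Module):
    """Minimal prototype-generation branch (illustrative sketch, not the reference code)."""

    def __init__(self, in_channels=256, num_prototypes=32):
        super().__init__()
        # A few 3x3 convs applied to the P3 feature map, as in Figure 3.
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Final 1x1 conv: one output channel per prototype.
        self.proto_out = nn.Conv2d(256, num_prototypes, 1)

    def forward(self, p3):
        x = self.convs(p3)
        # P3 sits at 1/8 of the input resolution in a standard FPN (an assumption here),
        # so a 2x upsample brings the prototypes to 1/4 of the input size.
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = self.proto_out(x)
        # ReLU: prototypes are non-negative but unbounded above, allowing large
        # "confident" activations (e.g., for obvious background).
        return F.relu(x)
```

The output has shape (batch, k, H/4, W/4); note that no loss is applied to it directly, since all supervision flows back from the assembled masks.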

3.2. Mask Coefficients

3.2. 掩码系数

Typical anchor-based object detectors have two branches in their prediction heads: one branch to predict $c$ class confidences, and the other to predict 4 bounding box regressors.

典型的基于锚点的目标检测器在其预测头中有两个分支:一个分支用于预测 $c$ 个类别置信度,另一个分支用于预测 4 个边界框回归量。


$^{2}$ To show that this is an issue, we develop an "fc-mask" model that produces masks for each anchor as the reshaped output of an $fc$ layer. As our experiments in Table 2c show, simply adding masks to a one-stage model as $fc$ outputs only obtains 20.7 mAP and is thus very much insufficient.

$^{2}$ 为了表明这是一个问题,我们开发了一个“fc-mask”模型,该模型为每个锚点生成掩码,作为 $fc$ 层的重塑输出。正如我们在表 2c 中的实验所示,仅仅将掩码作为 $fc$ 输出添加到单阶段模型中,只能获得 20.7 mAP,因此非常不足。


Figure 3: Protonet Architecture The labels denote feature size and channels for an image size of $550 \times 550$. Arrows indicate $3 \times 3$ conv layers, except for the final conv which is $1 \times 1$. The increase in size is an upsample followed by a conv. Inspired by the mask branch in [18].

图 3:Protonet 架构 标签表示在图像大小为 $550 \times 550$ 时的特征尺寸和通道数。箭头表示 $3 \times 3$ 卷积层,最后一个卷积层为 $1 \times 1$。尺寸的增大是先上采样再接一个卷积。灵感来自 [18] 中的掩码分支。

For mask coefficient prediction, we simply add a third branch in parallel that predicts $k$ mask coefficients, one corresponding to each prototype. Thus, instead of producing $4 + c$ coefficients per anchor, we produce $4 + c + k$.

对于掩码系数预测,我们只需并行地添加第三个分支,用于预测 $k$ 个掩码系数,每个原型对应一个。因此,每个锚点不再生成 $4 + c$ 个系数,而是生成 $4 + c + k$ 个。

Then for nonlinearity, we find it important to be able to subtract out prototypes from the final mask. Thus, we apply tanh to the $k$ mask coefficients, which produces more stable outputs over no nonlinearity. The relevance of this design choice is apparent in Figure 2, as neither mask would be constructable without allowing for subtraction.

然后对于非线性,我们发现能够从最终掩码中减去原型是很重要的。因此,我们对 $k$ 个掩码系数应用 tanh,这比不使用非线性能产生更稳定的输出。这个设计选择的重要性在图 2 中显而易见:如果不允许减法,图中的两个掩码都无法构建出来。
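A minimal sketch of such a prediction head in the same PyTorch style (the class name, channel widths, and default parameter values are assumptions for illustration): for each anchor it emits $c$ class confidences, 4 box regressors, and $k$ mask coefficients passed through tanh so that prototypes can be subtracted as well as added.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Anchor prediction head with an extra mask-coefficient branch (illustrative sketch)."""

    def __init__(self, in_channels=256, num_classes=81, num_anchors=3, num_prototypes=32):
        super().__init__()
        self.c, self.k, self.a = num_classes, num_prototypes, num_anchors
        # One conv per output type; per anchor the head predicts 4 + c + k numbers in total.
        self.box_layer  = nn.Conv2d(in_channels, self.a * 4,      3, padding=1)
        self.cls_layer  = nn.Conv2d(in_channels, self.a * self.c, 3, padding=1)
        self.coef_layer = nn.Conv2d(in_channels, self.a * self.k, 3, padding=1)

    def forward(self, feat):
        n = feat.size(0)
        boxes   = self.box_layer(feat).permute(0, 2, 3, 1).reshape(n, -1, 4)
        classes = self.cls_layer(feat).permute(0, 2, 3, 1).reshape(n, -1, self.c)
        # tanh allows negative coefficients, so a prototype can be subtracted
        # from the final mask rather than only added.
        coeffs  = torch.tanh(self.coef_layer(feat).permute(0, 2, 3, 1).reshape(n, -1, self.k))
        return classes, boxes, coeffs
```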

3.3. Mask Assembly

3.3. 掩码组装

To produce instance masks, we combine the work of the prototype branch and mask coefficient branch, using a linear combination of the former with the latter as coefficients. We then follow this by a sigmoid nonlinearity to produce the final masks. These operations can be implemented efficiently using a single matrix multiplication and sigmoid:

为了生成实例掩码,我们结合原型分支和掩码系数分支的工作,以后者为系数对前者进行线性组合。然后我们接一个 sigmoid 非线性来生成最终掩码。这些操作可以通过单次矩阵乘法和 sigmoid 高效实现:

$$M = \sigma\left(P C^{T}\right)$$

where $P$ is an $h \times w \times k$ matrix of prototype masks and $C$ is an $n \times k$ matrix of mask coefficients for the $n$ instances surviving NMS and score thresholding. Other, more complicated combination steps are possible; however, we keep it simple (and fast) with a basic linear combination.

其中 $P$ 是一个 $h \times w \times k$ 的原型掩码矩阵,而 $C$ 是一个 $n \times k$ 的掩码系数矩阵,对应经过 NMS 和得分阈值筛选后保留的 $n$ 个实例。其他更复杂的组合方式也是可行的;然而,我们用基本的线性组合保持简单(且快速)。
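In code, the assembly step really is just the matrix product above followed by a sigmoid; a sketch with assumed tensor shapes:

```python
import torch

def assemble_masks(prototypes, coefficients):
    """Combine prototypes with per-instance coefficients (illustrative sketch).

    prototypes:   P, an (h, w, k) tensor from protonet.
    coefficients: C, an (n, k) tensor, one row per instance surviving NMS.
    Returns an (h, w, n) tensor of instance masks with values in [0, 1].
    """
    h, w, k = prototypes.shape
    # Single matrix multiplication: (h*w, k) @ (k, n) -> (h*w, n), i.e. P C^T.
    masks = prototypes.reshape(h * w, k) @ coefficients.t()
    # Sigmoid turns each linear combination into per-pixel mask probabilities.
    return torch.sigmoid(masks).reshape(h, w, -1)
```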

Losses We use three losses to train our model: classification loss $L_{cls}$, box regression loss $L_{box}$, and mask loss $L_{mask}$, with weights 1, 1.5, and 6.125, respectively. Both $L_{cls}$ and $L_{box}$ are defined in the same way as in [30]. Then to compute mask loss, we simply take the pixel-wise binary cross entropy between assembled masks $M$ and the ground truth masks $M_{gt}$: $L_{mask} = \mathrm{BCE}(M, M_{gt})$.

损失 我们使用三种损失来训练模型:分类损失 $L_{cls}$、框回归损失 $L_{box}$ 和掩码损失 $L_{mask}$,权重分别为 1、1.5 和 6.125。$L_{cls}$ 和 $L_{box}$ 的定义与 [30] 中相同。然后,为了计算掩码损失,我们简单地对组装后的掩码 $M$ 和真实掩码 $M_{gt}$ 计算逐像素二元交叉熵:$L_{mask} = \mathrm{BCE}(M, M_{gt})$。
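A hedged sketch of the total training loss as described. Only the loss weights and the BCE mask term come from the text; the classification and box terms below (plain cross-entropy and smooth L1) merely stand in for the SSD-style losses of [30], and all function and argument names are placeholders.

```python
import torch.nn.functional as F

def yolact_loss(cls_logits, cls_targets, box_preds, box_targets,
                pred_masks, gt_masks, w_cls=1.0, w_box=1.5, w_mask=6.125):
    """Weighted sum of classification, box, and mask losses (illustrative sketch)."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)   # stand-in for the class loss of [30]
    loss_box = F.smooth_l1_loss(box_preds, box_targets)   # stand-in for the box loss of [30]
    # L_mask = BCE(M, M_gt): pixel-wise binary cross entropy on the assembled masks.
    loss_mask = F.binary_cross_entropy(pred_masks, gt_masks)
    return w_cls * loss_cls + w_box * loss_box + w_mask * loss_mask
```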

Cropping Masks We crop the final masks with the predicted bounding box during evaluation. During training, we instead crop with the ground truth bounding box, and divide $L_{mask}$ by the ground truth bounding box area to preserve small objects in the prototypes.

裁剪掩码 在评估期间,我们使用预测的边界框裁剪最终掩码。在训练期间,我们改用真实边界框进行裁剪,并将 $L_{mask}$ 除以真实边界框的面积,以便在原型中保留小物体。
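The training-time crop-and-normalize step might look like the sketch below (an assumed helper; boxes are taken to be integer pixel coordinates and masks to be post-sigmoid probabilities):

```python
import torch
import torch.nn.functional as F

def cropped_mask_loss(pred_mask, gt_mask, gt_box):
    """BCE inside the ground-truth box, normalized by the box area (illustrative sketch).

    pred_mask, gt_mask: (h, w) float tensors; gt_box: (x1, y1, x2, y2) in pixels.
    """
    x1, y1, x2, y2 = [int(v) for v in gt_box]
    area = max((x2 - x1) * (y2 - y1), 1)
    # Crop both masks to the ground-truth box so only that region is supervised.
    crop_pred = pred_mask[y1:y2, x1:x2]
    crop_gt = gt_mask[y1:y2, x1:x2]
    # Sum the per-pixel BCE, then divide by the box area so small objects
    # are not drowned out when learning the prototypes.
    bce = F.binary_cross_entropy(crop_pred, crop_gt, reduction='sum')
    return bce / area
```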

3.4. Emergent Behavior

3.4. 涌现行为

Our approach might seem surprising, as the general consensus around instance segmentation is that because FCNs are translation invariant, the task needs translation variance added back in [24]. Thus methods like FCIS [24] and Mask R-CNN [18] try to explicitly add translation variance, whether it be by directional maps and position-sensitive re-pooling, or by putting the mask branch in the second stage so it does not have to deal with localizing instances. In our method, the only translation variance we add is to crop the final mask with the predicted bounding box. However, we find that our method also works without cropping for medium and large objects, so this is not a result of cropping. Instead, YOLACT learns how to localize instances on its own via different activations in its prototypes.

我们的方法可能看起来令人惊讶,因为关于实例分割的普遍共识是:由于 FCN 是平移不变的,这一任务需要重新引入平移可变性 [24]。因此,像 FCIS [24] 和 Mask R-CNN [18] 这样的方法试图显式地加入平移可变性,无论是通过方向图和位置敏感的重新池化,还是通过将掩码分支放在第二阶段,使其不必处理实例定位。在我们的方法中,我们加入的唯一平移可变性就是用预测的边界框裁剪最终掩码。然而,我们发现即使不裁剪,我们的方法对中等和大型物体也同样有效,因此这并不是裁剪带来的结果。相反,YOLACT 通过其原型中的不同激活自行学会了如何定位实例。

Figure 4: Head Architecture We use a shallower prediction head than RetinaNet [27] and add a mask coefficient branch. This is for $c$ classes, $a$ anchors for feature layer $P_i$, and $k$ prototypes. See Figure 3 for a key.

图 4:头部架构 我们使用比 RetinaNet [27] 更浅的预测头,并添加了一个掩码系数分支。图中对应 $c$ 个类别、特征层 $P_i$ 上的 $a$ 个锚点以及 $k$ 个原型。图例说明参见图 3。

To see how this is possible, first note that the prototype activations for the solid red image (image a) in Figure 5 are actually not possible in an FCN without padding. Because a convolution outputs to a single pixel, if its input everywhere in the image is the same, the result everywhere in the conv output will be the same. On the other hand, the consistent rim of padding in modern FCNs like ResNet gives the network the ability to tell how far away from the image's edge a pixel is. Conceptually, one way it could accomplish this is to have multiple layers in sequence spread the padded 0's out from the edge toward the center (e.g., with a kernel like $[1, 0]$). This means ResNet, for instance, is inherently translation variant, and our method makes heavy use of that property (images b and c exhibit clear translation variance).

要理解这为何可能,首先注意图 5 中纯红色图像(图像 a)对应的原型激活,在没有填充的 FCN 中实际上是不可能出现的。因为卷积输出到单个像素,如果其输入在图像中处处相同,那么卷积输出在各处的结果也都会相同。另一方面,像 ResNet 这样的现代 FCN 中一致的边缘填充,使网络能够判断一个像素距离图像边缘有多远。从概念上讲,实现这一点的一种方式是让多个层依次将填充的 0 从边缘向中心扩散(例如,使用类似 $[1, 0]$ 的卷积核)。这意味着例如 ResNet 本质上是平移可变的,而我们的方法大量利用了这一特性(图像 b 和 c 表现出明显的平移可变性)。
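This effect is easy to check numerically. The toy snippet below is only an illustration (it is not part of YOLACT): it applies a $[1, 0]$-style kernel with zero padding to a constant input, and the border output differs from the interior, so the padded convolution is not translation invariant and effectively encodes distance to the image edge.

```python
import torch
import torch.nn.functional as F

# A constant input: without padding, any convolution would give a constant output.
x = torch.ones(1, 1, 1, 8)

# A 1x2 kernel like [1, 0]. With zero padding, the leftmost output position
# reads a padded zero, so it differs from every interior position.
kernel = torch.tensor([[[[1.0, 0.0]]]])          # shape (out_ch, in_ch, 1, 2)
y = F.conv2d(x, kernel, padding=(0, 1))

print(y.squeeze())
# tensor([0., 1., 1., 1., 1., 1., 1., 1., 1.])  <- the 0 marks the image border
```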

We observe many prototypes to activate on certain "partitions" of the image. That is, they only activate on objects on one side of an implicitly learned boundary. In Figure 5, prototypes 1-3 are such examples. By combining these partition maps, the network can distinguish between different (even overlapping) instances of the same semantic class; e.g., in image d, the green umbrella can be separated from the red one by subtracting prototype 3 from prototype 2.

我们观察到许多原型在图像的某些“分区”上激活。也就是说,它们仅在隐式学习的边界一侧的对象上激活。在图5中,原型1-3就是这样的例子。通过组合这些分区图,网络可以区分同一语义类别的不同(甚至重叠)实例;例如,在图像d中,可以通过从原型2中减去原型3来将绿色雨伞与红色雨伞分开。
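A toy 1-D illustration of this kind of combination (the numbers are made up, not taken from the paper's figures): with coefficients of roughly $+1$ and $-1$, subtracting one "partition" prototype from another isolates the instance covered by the first but not the second.

```python
import torch

# Two toy "partition" prototypes over an 8-pixel strip.
# proto2 activates on the right half; proto3 only on the rightmost quarter.
proto2 = torch.tensor([0., 0., 0., 0., 1., 1., 1., 1.])
proto3 = torch.tensor([0., 0., 0., 0., 0., 0., 1., 1.])

# Coefficients (+1, -1): the combination keeps only the region in proto2 but not proto3
# (the "green umbrella"), once passed through the sigmoid and thresholded.
combined = torch.sigmoid(4.0 * (1.0 * proto2 - 1.0 * proto3))  # the scale just sharpens the sigmoid
print((combined > 0.5).int())
# tensor([0, 0, 0, 0, 1, 1, 0, 0], dtype=torch.int32)
```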

Furthermore, being learned objects, prototypes are compressible. That is, if protonet combines the functionality of

此外,作为学习到的对象,原型是可压缩的。也就是说,如果protonet结合了
