FCOS: Fully Convolutional One-Stage Object Detection [Translation]



Original paper link: 1904.01355

FCOS: Fully Convolutional One-Stage Object Detection


Zhi Tian


The University of Adelaide, Australia


Abstract


We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogous to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor-box free, as well as proposal free. By eliminating the pre-defined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes, such as calculating overlap during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With non-maximum suppression (NMS) as the only post-processing, FCOS with ResNeXt-64x4d-101 achieves 44.7% in AP with single-model and single-scale testing, surpassing previous one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and more flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at:


tinyurl.com/FCOSv1


1. Introduction


Object detection is a fundamental yet challenging task in computer vision, which requires the algorithm to predict a bounding box with a category label for each instance of interest in an image. All current mainstream detectors such as Faster R-CNN [24], SSD [18] and YOLOv2, v3 [23] rely on a set of pre-defined anchor boxes, and it has long been believed that the use of anchor boxes is the key to detectors' success. Despite their great success, it is important to note that anchor-based detectors suffer from some drawbacks: 1) As shown in [15, 24], detection performance is sensitive to the sizes, aspect ratios and number of anchor boxes. For example, in RetinaNet [15], varying these hyper-parameters affects the performance by up to 4% in AP on the COCO benchmark [16]. As a result, these hyper-parameters need to be carefully tuned in anchor-based detectors. 2) Even with careful design, because the scales and aspect ratios of anchor boxes are kept fixed, detectors have difficulty dealing with object candidates with large shape variations, particularly for small objects. The pre-defined anchor boxes also hamper the generalization ability of detectors, as they need to be re-designed for new detection tasks with different object sizes or aspect ratios. 3) In order to achieve a high recall rate, an anchor-based detector is required to densely place anchor boxes on the input image (e.g., more than 180K anchor boxes in feature pyramid networks (FPN) [14] for an image with its shorter side being 800). Most of these anchor boxes are labelled as negative samples during training. The excessive number of negative samples aggravates the imbalance between positive and negative samples in training. 4) Anchor boxes also involve complicated computation such as calculating the intersection-over-union (IOU) scores with ground-truth bounding boxes.


Figure 1 - As shown in the left image, FCOS works by predicting a 4D vector (l, t, r, b) encoding the location of a bounding box at each foreground pixel (supervised by ground-truth bounding box information during training). The right plot shows that when a location resides in multiple bounding boxes, it can be ambiguous as to which bounding box this location should regress.


Recently, fully convolutional networks (FCNs) [20] have achieved tremendous success in dense prediction tasks such as semantic segmentation [20, 28, 9, 19], depth estimation [17, 31], keypoint detection [3] and counting [2]. As one of the high-level vision tasks, object detection might be the only one deviating from the neat fully convolutional per-pixel prediction framework, mainly due to the use of anchor boxes. It is natural to ask a question: Can we solve object detection in the neat per-pixel prediction fashion, analogous to FCN for semantic segmentation, for example? Thus those fundamental vision tasks could be unified in (almost) one single framework. We show that the answer is affirmative. Moreover, we demonstrate that, for the first time, the much simpler FCN-based detector achieves even better performance than its anchor-based counterparts.



*Corresponding author, email: chunhua.shen@adelaide.edu.au



In the literature, some works attempted to leverage the FCN-based framework for object detection, such as DenseBox [12]. Specifically, these FCN-based frameworks directly predict a 4D vector plus a class category at each spatial location on a level of feature maps. As shown in Fig. 1 (left), the 4D vector depicts the relative offsets from the four sides of a bounding box to the location. These frameworks are similar to the FCNs for semantic segmentation, except that each location is required to regress a 4D continuous vector. However, to handle bounding boxes with different sizes, DenseBox [12] crops and resizes training images to a fixed scale. Thus DenseBox has to perform detection on image pyramids, which is against FCN's philosophy of computing all convolutions once. Besides, and more significantly, these methods are mainly used in special-domain object detection such as scene text detection [33, 10] or face detection [32, 12], since it is believed that these methods do not work well when applied to generic object detection with highly overlapped bounding boxes. As shown in Fig. 1 (right), the highly overlapped bounding boxes result in an intractable ambiguity: it is not clear w.r.t. which bounding box to regress for the pixels in the overlapped regions.


In the sequel, we take a closer look at the issue and show that with FPN this ambiguity can be largely eliminated. As a result, our method can already obtain detection accuracy comparable to those traditional anchor-based detectors. Furthermore, we observe that our method may produce a number of low-quality predicted bounding boxes at locations that are far from the center of a target object. In order to suppress these low-quality detections, we introduce a novel "center-ness" branch (only one layer) to predict the deviation of a pixel from the center of its corresponding bounding box, as defined in Eq. (3). This score is then used to down-weight low-quality detected bounding boxes and merge the detection results in NMS. The simple yet effective center-ness branch allows the FCN-based detector to outperform its anchor-based counterparts under exactly the same training and testing settings.

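To make the mechanism concrete, the following is a minimal sketch of such a center-ness score. The formula (the square root of the product of the min/max ratios of the left/right and top/bottom regression targets) is the paper's Eq. (3), which falls outside this excerpt; the function name and tensor layout are illustrative assumptions.

```python
import torch

def centerness_target(reg_targets: torch.Tensor) -> torch.Tensor:
    """Center-ness from regression targets.

    reg_targets: (N, 4) tensor of (l*, t*, r*, b*) distances for positive
    locations (all strictly positive by definition). Returns an (N,) score
    in (0, 1] that is 1 at the box center and decays toward the borders.
    """
    l, t, r, b = reg_targets.unbind(dim=1)
    # Each ratio is 1 when the location is centered along that axis.
    left_right = torch.min(l, r) / torch.max(l, r)
    top_bottom = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(left_right * top_bottom)
```

Multiplying this score into the classification score before NMS demotes detections produced far from object centers, which is exactly the suppression described above.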

This new detection framework enjoys the following advantages.


  • Detection is now unified with many other FCN-solvable tasks such as semantic segmentation, making it easier to re-use ideas from those tasks.


  • Detection becomes proposal free and anchor free, which significantly reduces the number of design parameters. The design parameters typically need heuristic tuning and many tricks are involved in order to achieve good performance. Therefore, our new detection framework makes the detector, particularly its training, considerably simpler.


  • By eliminating the anchor boxes, our new detector completely avoids the complicated computation related to anchor boxes such as the IOU computation and matching between the anchor boxes and ground-truth boxes during training, resulting in faster training and testing as well as a smaller training memory footprint than its anchor-based counterpart.


  • Without bells and whistles, we achieve state-of-the-art results among one-stage detectors. We also show that the proposed FCOS can be used as the Region Proposal Network (RPN) in two-stage detectors and can achieve significantly better performance than its anchor-based RPN counterparts. Given the even better performance of the much simpler anchor-free detector, we encourage the community to rethink the necessity of anchor boxes in object detection, which are currently considered the de facto standard for detection.


  • The proposed detector can be immediately extended to solve other vision tasks with minimal modification, including instance segmentation and key-point detection. We believe that this new method can be the new baseline for many instance-wise prediction problems.


2. Related Work


Anchor-based Detectors. Anchor-based detectors inherit ideas from traditional sliding-window and proposal-based detectors such as Fast R-CNN [6]. In anchor-based detectors, the anchor boxes can be viewed as pre-defined sliding windows or proposals, which are classified as positive or negative patches, with an extra offset regression to refine the prediction of bounding box locations. Therefore, the anchor boxes in these detectors may be viewed as training samples. Unlike previous detectors such as Fast R-CNN, which compute image features for each sliding window/proposal repeatedly, anchor boxes make use of the feature maps of CNNs and avoid repeated feature computation, speeding up the detection process dramatically. The design of anchor boxes was popularized by Faster R-CNN in its RPNs [24], SSD [18] and YOLOv2 [22], and has become the convention in modern detectors.


However, as described above, anchor boxes result in excessively many hyper-parameters, which typically need to be carefully tuned in order to achieve good performance. Besides the above hyper-parameters describing anchor shapes, anchor-based detectors also need other hyper-parameters to label each anchor box as a positive, ignored or negative sample. In previous works, they often employ the intersection over union (IOU) between anchor boxes and ground-truth boxes to determine the label of an anchor box (e.g., a positive anchor if its IOU is in [0.5, 1]). These hyper-parameters have shown a great impact on the final accuracy, and require heuristic tuning. Meanwhile, these hyper-parameters are specific to detection tasks, making detection deviate from the neat fully convolutional network architecture used in other dense prediction tasks such as semantic segmentation.

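For contrast with FCOS's label assignment in Sec. 3.1, the sketch below illustrates the IOU-based anchor labeling rule described here. The two thresholds (negative below 0.4, positive at 0.5 and above, ignored in between) follow a common RetinaNet-style convention and are assumptions for illustration, not values prescribed by this paper.

```python
import numpy as np

def label_anchors(ious: np.ndarray, lo: float = 0.4, hi: float = 0.5) -> np.ndarray:
    """Label anchors from a (num_anchors, num_gt) IOU matrix.

    Returns 1 (positive), 0 (negative) or -1 (ignored) per anchor,
    using the usual max-IOU rule with two tunable thresholds.
    """
    best_iou = ious.max(axis=1) if ious.size else np.zeros(len(ious))
    labels = -np.ones(len(ious), dtype=np.int64)  # default: ignored
    labels[best_iou < lo] = 0                     # clear negatives
    labels[best_iou >= hi] = 1                    # confident positives
    return labels
```

The two thresholds `lo` and `hi` are precisely the kind of detection-specific hyper-parameters that FCOS removes.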

Figure 2 - The network architecture of FCOS, where C3, C4, and C5 denote the feature maps of the backbone network and P3 to P7 are the feature levels used for the final prediction. $H \times W$ is the height and width of the feature maps. '/s' ($s = 8, 16, \ldots, 128$) is the down-sampling ratio of the feature maps at that level with respect to the input image. As an example, all the numbers are computed with an $800 \times 1024$ input.


Anchor-free Detectors. The most popular anchor-free detector might be YOLOv1 [21]. Instead of using anchor boxes, YOLOv1 predicts bounding boxes at points near the center of objects. Only the points near the center are used since they are considered able to produce higher-quality detections. However, since only points near the center are used to predict bounding boxes, YOLOv1 suffers from low recall, as mentioned in YOLOv2 [22]. As a result, YOLOv2 [22] employs anchor boxes as well. Compared to YOLOv1, FCOS takes advantage of all points in a ground-truth bounding box to predict the bounding boxes, and the low-quality detected bounding boxes are suppressed by the proposed "center-ness" branch. As a result, FCOS is able to provide recall comparable to anchor-based detectors, as shown in our experiments.


CornerNet [13] is a recently proposed one-stage anchor-free detector, which detects a pair of corners of a bounding box and groups them to form the final detected bounding box. CornerNet requires much more complicated postprocessing to group the pairs of corners belonging to the same instance. An extra distance metric is learned for the purpose of grouping.


Another family of anchor-free detectors such as [32] are based on DenseBox [12]. This family of detectors has been considered unsuitable for generic object detection due to the difficulty in handling overlapping bounding boxes and the relatively low recall. In this work, we show that both problems can be largely alleviated with multi-level FPN prediction. Moreover, we also show that, together with our proposed center-ness branch, the much simpler detector can achieve even better detection performance than its anchor-based counterparts.


3. Our Approach


In this section, we first reformulate object detection in a per-pixel prediction fashion. Next, we show how we make use of multi-level prediction to improve the recall and resolve the ambiguity resulting from overlapped bounding boxes. Finally, we present our proposed "center-ness" branch, which helps suppress the low-quality detected bounding boxes and improves the overall performance by a large margin.


3.1. Fully Convolutional One-Stage Object Detector


Let $F_i \in \mathbb{R}^{H \times W \times C}$ be the feature maps at layer $i$ of a backbone CNN and $s$ be the total stride until the layer. The ground-truth bounding boxes for an input image are defined as $\{B_i\}$, where $B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \ldots, C\}$. Here $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$ denote the coordinates of the left-top and right-bottom corners of the bounding box. $c^{(i)}$ is the class that the object in the bounding box belongs to. $C$ is the number of classes, which is 80 for the MS-COCO dataset.

FiRH×W×C{F}_{i} \in {\mathbb{R}}^{H \times W \times C} 为主干 CNN 的第 ii 层的特征图, ss 为该层的总步幅。输入图像的真实边界框定义为 {Bi}\left\{ {B}_{i}\right\},其中 Bi=(x0(i),y0(i),x1(i)y1(i),c(i)){B}_{i} = \left( {{x}_{0}^{\left( i\right) },{y}_{0}^{\left( i\right) },{x}_{1}^{\left( i\right) }{y}_{1}^{\left( i\right) },{c}^{\left( i\right) }}\right) \in R4×{1,2C}{\mathbb{R}}^{4} \times \{ 1,2\ldots C\}。这里 (x0(i),y0(i))\left( {{x}_{0}^{\left( i\right) },{y}_{0}^{\left( i\right) }}\right)(x1(i)y1(i))\left( {{x}_{1}^{\left( i\right) }{y}_{1}^{\left( i\right) }}\right) 表示边界框左上角和右下角的坐标。c(i){c}^{\left( i\right) } 是边界框内物体所属的类别。CC 是类别的数量,对于 MS-COCO 数据集为 80。

For each location $(x, y)$ on the feature map $F_i$, we can map it back onto the input image as $(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys)$, which is near the center of the receptive field of the location $(x, y)$. Different from anchor-based detectors, which consider the location on the input image as the center of (multiple) anchor boxes and regress the target bounding box with these anchor boxes as references, we directly regress the target bounding box at the location. In other words, our detector directly views locations as training samples instead of anchor boxes as in anchor-based detectors, which is the same as in FCNs for semantic segmentation [20].

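A minimal sketch of this mapping, enumerating all locations of one feature level and projecting them onto image coordinates; the function name and the (x, y) output layout are assumptions.

```python
import torch

def locations_on_image(h: int, w: int, stride: int) -> torch.Tensor:
    """Map each feature-map location (x, y) onto the input image as
    (floor(s/2) + x*s, floor(s/2) + y*s), near the center of its
    receptive field. Returns an (H*W, 2) tensor of (x, y) pairs."""
    shift = stride // 2  # floor(s/2)
    xs = torch.arange(w) * stride + shift
    ys = torch.arange(h) * stride + shift
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack((x.reshape(-1), y.reshape(-1)), dim=1)
```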

Specifically, location $(x, y)$ is considered a positive sample if it falls into any ground-truth box, and the class label $c^*$ of the location is the class label of the ground-truth box. Otherwise it is a negative sample and $c^* = 0$ (background class). Besides the label for classification, we also have a 4D real vector $\mathbf{t}^* = (l^*, t^*, r^*, b^*)$ being the regression targets for the location. Here $l^*$, $t^*$, $r^*$ and $b^*$ are the distances from the location to the four sides of the bounding box, as shown in Fig. 1 (left). If a location falls into multiple bounding boxes, it is considered an ambiguous sample. We simply choose the bounding box with minimal area as its regression target. In the next section, we will show that with multi-level prediction, the number of ambiguous samples can be reduced significantly and thus they hardly affect the detection performance. Formally, if location $(x, y)$ is associated with a bounding box $B_i$, the training regression targets for the location can be formulated as

$$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y. \tag{1}$$

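The assignment rule above (a location inside any ground-truth box is positive; with multiple enclosing boxes, the smallest-area box wins) can be sketched as follows. This is the single-level version; the multi-level FPN restriction of Sec. 3.2 is omitted, and shapes and names are assumptions.

```python
import torch

def regression_targets(locations, gt_boxes, gt_classes):
    """locations: (N, 2) image-space (x, y); gt_boxes: (M, 4) as
    (x0, y0, x1, y1); gt_classes: (M,) in {1..C}. Returns per-location
    class labels (0 = background) and (l*, t*, r*, b*) targets, Eq. (1)."""
    xs, ys = locations[:, 0:1].float(), locations[:, 1:2].float()  # (N, 1)
    l = xs - gt_boxes[:, 0]                        # broadcast to (N, M)
    t = ys - gt_boxes[:, 1]
    r = gt_boxes[:, 2] - xs
    b = gt_boxes[:, 3] - ys
    targets = torch.stack((l, t, r, b), dim=2)     # (N, M, 4)
    inside = targets.min(dim=2).values > 0         # location inside box?
    areas = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    areas = areas[None].repeat(len(locations), 1)  # (N, M)
    areas[~inside] = float("inf")                  # exclude non-enclosing boxes
    min_area, idx = areas.min(dim=1)               # smallest enclosing box
    classes = gt_classes[idx].clone()
    classes[min_area == float("inf")] = 0          # background
    return classes, targets[torch.arange(len(locations)), idx]
```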

It is worth noting that FCOS can leverage as many foreground samples as possible to train the regressor. This is different from anchor-based detectors, which only consider the anchor boxes with a high enough IOU with ground-truth boxes as positive samples. We argue that it may be one of the reasons that FCOS outperforms its anchor-based counterparts.


Network Outputs. Corresponding to the training targets, the final layer of our networks predicts an 80D vector $\mathbf{p}$ of classification labels and a 4D vector $\mathbf{t} = (l, t, r, b)$ of bounding box coordinates. Following [15], instead of training a multi-class classifier, we train $C$ binary classifiers. Similar to [15], we add four convolutional layers after the feature maps of the backbone networks for the classification and regression branches, respectively. Moreover, since the regression targets are always positive, we employ $\exp(x)$ to map any real number to $(0, \infty)$ on top of the regression branch. It is worth noting that FCOS has $9\times$ fewer network output variables than the popular anchor-based detectors [15, 24] with 9 anchor boxes per location.

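A sketch of one per-level head consistent with this description: four convolutional layers per branch, $C$ binary classification logits, and $\exp$ applied to the 4D regression output. The 256-channel width and group normalization are common choices assumed here, not details given in this excerpt.

```python
import torch
from torch import nn

class FCOSHead(nn.Module):
    """Per-level head: C binary classification logits and a 4D
    (l, t, r, b) regression mapped through exp to (0, inf)."""

    def __init__(self, num_classes: int = 80, channels: int = 256):
        super().__init__()

        def tower():  # four conv layers per branch
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(channels, channels, 3, padding=1),
                           nn.GroupNorm(32, channels), nn.ReLU()]
            return nn.Sequential(*layers)

        self.cls_tower, self.reg_tower = tower(), tower()
        self.cls_logits = nn.Conv2d(channels, num_classes, 3, padding=1)
        self.bbox_pred = nn.Conv2d(channels, 4, 3, padding=1)

    def forward(self, feature: torch.Tensor):
        cls = self.cls_logits(self.cls_tower(feature))            # (B, C, H, W)
        reg = torch.exp(self.bbox_pred(self.reg_tower(feature)))  # always > 0
        return cls, reg
```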

Loss Function. We define our training loss function as follows:

$$L(\{\mathbf{p}_{x,y}\}, \{\mathbf{t}_{x,y}\}) = \frac{1}{N_{\text{pos}}} \sum_{x,y} L_{\text{cls}}(\mathbf{p}_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{\text{reg}}(\mathbf{t}_{x,y}, \mathbf{t}^*_{x,y}), \tag{2}$$


where $L_{\text{cls}}$ is the focal loss as in [15] and $L_{\text{reg}}$ is the IOU loss as in UnitBox [32]. $N_{\text{pos}}$ denotes the number of positive samples and $\lambda$, being 1 in this paper, is the balance weight for $L_{\text{reg}}$. The summation is calculated over all locations on the feature maps $F_i$. $\mathbb{1}_{\{c_i^* > 0\}}$ is the indicator function, being 1 if $c_i^* > 0$ and 0 otherwise.

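A sketch of Eq. (2) under these definitions, for one flattened level of predictions. torchvision's `sigmoid_focal_loss` stands in for the focal loss of [15], and the IOU loss of UnitBox [32] is written out directly from the (l, t, r, b) parameterization; the shapes and the small epsilon are assumptions.

```python
import torch
from torchvision.ops import sigmoid_focal_loss

def fcos_loss(cls_logits, reg_pred, classes, reg_targets, lam=1.0):
    """cls_logits: (N, C); reg_pred, reg_targets: (N, 4) as (l, t, r, b);
    classes: (N,) with 0 = background. Returns the Eq. (2) total."""
    pos = classes > 0
    n_pos = max(pos.sum().item(), 1)
    onehot = torch.zeros_like(cls_logits)
    onehot[pos, classes[pos] - 1] = 1.0  # class k -> channel k-1
    cls_loss = sigmoid_focal_loss(cls_logits, onehot, reduction="sum") / n_pos

    # IOU loss between boxes sharing the same anchor point (x, y).
    p, g = reg_pred[pos], reg_targets[pos]
    area_p = (p[:, 0] + p[:, 2]) * (p[:, 1] + p[:, 3])
    area_g = (g[:, 0] + g[:, 2]) * (g[:, 1] + g[:, 3])
    w_i = torch.min(p[:, 0], g[:, 0]) + torch.min(p[:, 2], g[:, 2])
    h_i = torch.min(p[:, 1], g[:, 1]) + torch.min(p[:, 3], g[:, 3])
    inter = w_i * h_i
    iou = inter / (area_p + area_g - inter + 1e-7)
    reg_loss = -torch.log(iou + 1e-7).sum() / n_pos  # UnitBox IOU loss
    return cls_loss + lam * reg_loss
```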

Inference. The inference of FCOS is straightforward. Given an input image, we forward it through the network and obtain the classification scores $\mathbf{p}_{x,y}$ and the regression prediction $\mathbf{t}_{x,y}$ for each location on the feature maps $F_i$. Following [15], we choose the locations with $p_{x,y} > 0.05$ as positive samples and invert Eq. (1) to obtain the predicted bounding boxes.

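A minimal sketch of this decoding step: inverting Eq. (1) gives corners $(x - l, y - t)$ and $(x + r, y + b)$, after which locations scoring above 0.05 are kept and passed to NMS. Class-agnostic NMS and the 0.6 IOU threshold are simplifying assumptions for illustration.

```python
import torch
from torchvision.ops import nms

def decode_boxes(locations, reg_pred):
    """Invert Eq. (1): locations (N, 2), reg_pred (N, 4) -> (N, 4) boxes."""
    x, y = locations[:, 0].float(), locations[:, 1].float()
    l, t, r, b = reg_pred.unbind(dim=1)
    return torch.stack((x - l, y - t, x + r, y + b), dim=1)

def postprocess(locations, cls_scores, reg_pred, iou_thresh=0.6):
    """cls_scores: (N, C) per-class probabilities (after sigmoid).
    Keep locations with score > 0.05, decode, then apply NMS."""
    scores, labels = cls_scores.max(dim=1)
    keep = scores > 0.05
    boxes = decode_boxes(locations[keep], reg_pred[keep])
    kept = nms(boxes, scores[keep], iou_thresh)
    return boxes[kept], scores[keep][kept], labels[keep][kept]
```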

3.2. Multi-level Prediction with FPN for FCOS


Here we show how two possible issues of the proposed FCOS can be resolved with multi-level prediction with FPN [14]. 1) The large stride (e.g., $16\times$) of the final feature maps in a CNN can result in a relatively low best possible recall (BPR)¹. For anchor-based detectors, low recall rates due to the large stride can be compensated to some extent by lowering the required IOU scores for positive anchor boxes. For FCOS, at first glance one may think that the BPR can be much lower than that of anchor-based detectors, because it is impossible to recall an object which no location on the final feature maps encodes due to a large stride. Here, we empirically show that even with a large stride, FCN-based FCOS is still able to produce a good BPR, and it can even be better than the BPR of the anchor-based detector RetinaNet [15] in the official implementation Detectron [7] (refer to Table 1). Therefore, the BPR is actually not a problem of FCOS. Moreover, with multi-level FPN prediction [14], the BPR can be improved further to match the best BPR the anchor-based RetinaNet can achieve.



¹ Upper bound of the recall rate that a detector can achieve.


