Original paper: arXiv:1708.02002
Focal Loss for Dense Object Detection
Tsung-Yi Lin Priya Goyal Ross Girshick Kaiming He Piotr Dollár
Facebook AI Research (FAIR)
Figure 1. We propose a novel loss we term the Focal Loss that adds a factor $(1 - p_t)^{\gamma}$ to the standard cross entropy criterion. Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p_t > .5$), putting more focus on hard, misclassified examples. As our experiments will demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.
Abstract
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron
Figure 2. Speed (ms) versus accuracy (AP) on COCO test-dev. Enabled by the focal loss, our simple one-stage RetinaNet detector outperforms all previous one-stage and two-stage detectors, including the best reported Faster R-CNN [28] system from [20]. We show variants of RetinaNet with ResNet-50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels). Ignoring the low-accuracy regime (AP<25), RetinaNet forms an upper envelope of all current detectors, and an improved variant (not shown) achieves 40.8 AP. Details are given in §5.
1. Introduction
Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [10, 28, 20, 14], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [21].
Despite the success of two-stage detectors, a natural question to ask is: could a simple one-stage detector achieve similar accuracy? One-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on one-stage detectors, such as YOLO [26, 27] and SSD [22, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods.
This paper pushes the envelope further: we present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [20] or Mask R-CNN [14] variants of Faster R-CNN [28]. To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.
Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [35], EdgeBoxes [39], DeepMask [24, 25], RPN [28]) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM) [31], are performed to maintain a manageable balance between foreground and background.
In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. In practice this often amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection that is typically addressed via techniques such as bootstrapping or hard example mining.
In this paper, we propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results.
To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector called RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and use of anchor boxes. It draws on a variety of recent ideas from [22, 6, 28, 20]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one- and two-stage detectors, see Figure 2.
2. Related Work
Classic Object Detectors: The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [19, 36]. Viola and Jones [37] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPMs [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [18], two-stage detectors, described next, quickly came to dominate object detection.
Two-stage Detectors: The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [35], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations, and the second stage classifies the proposals into foreground classes / background. R-CNN [11] upgraded the second-stage classifier to a convolutional network, yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed and by using learned object proposals [6, 24, 28]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolutional network, forming the Faster R-CNN framework [28]. Numerous extensions to this framework have been proposed, e.g. [20, 31, 32, 16, 14].
One-stage Detectors: OverFeat [30] was one of the first modern one-stage object detectors based on deep networks. More recently SSD [22, 9] and YOLO [26, 27] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of two-stage methods. SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Figure 2. Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [17]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.
The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of 'anchors' introduced by RPN [28] and use of feature pyramids as in SSD [22] and FPN [20]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.
Class Imbalance: Both classic one-stage object detection methods, like boosted detectors [37, 5] and DPMs [8], and more recent methods, like SSD [22], face a large class imbalance during training. These detectors evaluate $10^4$-$10^5$ candidate locations per image but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining that samples hard examples during training or more complex sampling/reweighing schemes [2]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.
Robust Estimation: There has been much interest in designing robust loss functions (e.g., Huber loss [13]) that reduce the contribution of outliers by down-weighting the loss of examples with large errors (hard examples). In contrast, rather than addressing outliers, our focal loss is designed to address class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. In other words, the focal loss performs the opposite role of a robust loss: it focuses training on a sparse set of hard examples.
3. Focal Loss
The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000). We introduce the focal loss starting from the cross entropy (CE) loss for binary classification:

$$
\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise.} \end{cases} \tag{1}
$$
In the above $y \in \{\pm 1\}$ specifies the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$. For notational convenience, we define $p_t$:

$$
p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases} \tag{2}
$$
and rewrite $\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$.
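For concreteness, here is a minimal NumPy sketch of Eqs. (1) and (2); the helper names are ours, not the paper's.

```python
import numpy as np

def p_t(p, y):
    # Eq. (2): p_t is p for y = 1 and 1 - p otherwise.
    return np.where(y == 1, p, 1.0 - p)

def cross_entropy(p, y):
    # Eq. (1) rewritten via Eq. (2): CE(p, y) = CE(p_t) = -log(p_t).
    return -np.log(p_t(p, y))

# Even well-classified examples keep a non-trivial loss:
print(cross_entropy(np.array([0.9, 0.99]), np.array([1, 1])))  # ~[0.105, 0.010]
```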
The CE loss can be seen as the blue (top) curve in Figure 1. One notable property of this loss, which can be easily seen in its plot, is that even examples that are easily classified ($p_t \gg .5$) incur a loss with non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the rare class.
3.1. Balanced Cross Entropy
A common method for addressing class imbalance is to introduce a weighting factor $\alpha \in [0, 1]$ for class 1 and $1 - \alpha$ for class $-1$. In practice $\alpha$ may be set by inverse class frequency or treated as a hyperparameter to set by cross validation. For notational convenience, we define $\alpha_t$ analogously to how we defined $p_t$. We write the $\alpha$-balanced CE loss as:

$$
\mathrm{CE}(p_t) = -\alpha_t \log(p_t). \tag{3}
$$
This loss is a simple extension to CE that we consider as an experimental baseline for our proposed focal loss.
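A correspondingly small sketch of Eq. (3), again with hypothetical helper names; in practice $\alpha$ would be set by inverse class frequency or by cross validation, as noted above.

```python
import numpy as np

def balanced_cross_entropy(p, y, alpha=0.75):
    # alpha_t is defined analogously to p_t: alpha for y = 1, 1 - alpha otherwise.
    # The default 0.75 is only a placeholder that up-weights the rare positive class.
    pt = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * np.log(pt)  # Eq. (3): CE(p_t) = -alpha_t * log(p_t)
```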
3.2. Focal Loss Definition
As our experiments will show, the large class imbalance encountered during training of dense detectors overwhelms the cross entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While $\alpha$ balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives.
More formally, we propose to add a modulating factor $(1 - p_t)^{\gamma}$ to the cross entropy loss, with tunable focusing parameter $\gamma \geq 0$. We define the focal loss as:

$$
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t). \tag{4}
$$
The focal loss is visualized for several values of $\gamma \in [0, 5]$ in Figure 1. We note two properties of the focal loss. (1) When an example is misclassified and $p_t$ is small, the modulating factor is near 1 and the loss is unaffected. As $p_t \to 1$, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. When $\gamma = 0$, FL is equivalent to CE, and as $\gamma$ is increased the effect of the modulating factor is likewise increased (we found $\gamma = 2$ to work best in our experiments).
Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For instance, with $\gamma = 2$, an example classified with $p_t = 0.9$ would have $100\times$ lower loss compared with CE and with $p_t \approx 0.968$ it would have $1000\times$ lower loss. This in turn increases the importance of correcting misclassified examples (whose loss is scaled down by at most $4\times$ for $p_t \leq .5$ and $\gamma = 2$).
In practice we use an $\alpha$-balanced variant of the focal loss:

$$
\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t). \tag{5}
$$
We adopt this form in our experiments as it yields slightly improved accuracy over the non-$\alpha$-balanced form. Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing $p$ with the loss computation, resulting in greater numerical stability.
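To make this concrete, below is a minimal NumPy sketch of Eq. (5) that folds the sigmoid into the loss, in the spirit of the numerical-stability note above. The function name and the identity $\log p_t = -\log(1 + e^{-yx})$ are our own presentation, not the paper's code; the defaults $\gamma = 2$ and $\alpha = 0.25$ are the values the paper's experiments favor.

```python
import numpy as np

def sigmoid_focal_loss(logits, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss, Eq. (5), computed from raw logits.

    For labels y in {+1, -1}, p_t = sigmoid(y * x), so
    log(p_t) = -log(1 + exp(-y * x)), evaluated stably with logaddexp.
    """
    log_pt = -np.logaddexp(0.0, -y * logits)        # stable log(p_t)
    pt = np.exp(log_pt)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - pt) ** gamma * log_pt  # Eq. (5)

# The down-weighting relative to CE is exactly (1 - p_t)^gamma, which
# reproduces the worked example above for gamma = 2:
for pt in (0.5, 0.9, 0.968):
    print(pt, (1.0 - pt) ** 2)   # 0.25 (4x), 0.01 (100x), ~0.001 (1000x)
```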
Extending the focal loss to the multi-class case is straightforward and works well; for simplicity we focus on the binary loss in this work.
While in our main experimental results we use the focal loss definition above, its precise form is not crucial. In the appendix we consider other instantiations of the focal loss and demonstrate that these can be equally effective.
3.3. Class Imbalance and Model Initialization
Binary classification models are by default initialized to have equal probability of outputting either $y = -1$ or $1$. Under such an initialization, in the presence of class imbalance, the loss due to the frequent class can dominate total loss and cause instability in early training. To counter this, we introduce the concept of a 'prior' for the value of $p$ estimated by the model for the rare class (foreground) at the start of training. We denote the prior by $\pi$ and set it so that the model's estimated $p$ for examples of the rare class is low, e.g. 0.01. We note that this is a change in model initialization (see §4.1) and not of the loss function. We found this to improve training stability for both the cross entropy and focal loss in the case of heavy class imbalance.
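Since the prior only changes initialization, one standard way to realize it for a sigmoid classifier is to set the final layer's bias $b$ so that $\sigma(b) = \pi$, giving $b = -\log((1 - \pi)/\pi)$. A minimal sketch of that calculation (helper name ours):

```python
import math

def prior_bias(pi=0.01):
    # Solving sigmoid(b) = pi for the final-layer bias b: every location then
    # starts with estimated foreground probability pi instead of 0.5.
    return -math.log((1.0 - pi) / pi)

b = prior_bias(0.01)
print(b)                           # ~ -4.595
print(1.0 / (1.0 + math.exp(-b)))  # ~ 0.01, recovering the prior
```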
3.4. Class Imbalance and Two-stage Detectors
Two-stage detectors are often trained with the cross entropy loss without use of $\alpha$-balancing or our proposed loss. Instead, they address class imbalance through two mechanisms: (1) a two-stage cascade and (2) biased minibatch sampling. The first cascade stage is an object proposal mechanism that reduces the nearly infinite set of possible object locations down to one or two thousand. Importantly, the selected proposals are not random, but are likely to correspond to true object locations, which removes the vast majority of easy negatives. When training the second stage, biased sampling is typically used to construct minibatches that contain, for instance, a 1:3 ratio of positive to negative examples. This ratio is like an implicit $\alpha$-balancing factor that is implemented via sampling. Our proposed focal loss is designed to address these mechanisms in a one-stage detection system directly via the loss function.
4. RetinaNet Detector
RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that we propose specifically for one-stage, dense detection, see Figure 3. While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to exact values as shown in the experiments. We describe each component of RetinaNet next.
Feature Pyramid Network Backbone: We adopt the Feature Pyramid Network (FPN) from [20] as the backbone network for RetinaNet. In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image, see Figure 3(a)-(b). Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN) [23], as shown by its gains for RPN [28] and DeepMask-style proposals [24], as well as at two-stage detectors such as Fast R-CNN [10] or Mask R-CNN [14].
Following [20], we build FPN on top of the ResNet architecture [16]. We construct a pyramid with levels $P_3$ through $P_7$, where $l$ indicates pyramid level ($P_l$ has resolution $2^l$ lower than the input). As in [20] all pyramid levels have $C = 256$ channels. Details of the pyramid generally follow [20] with a few modest differences. While many design choices are not crucial, we emphasize the use of the FPN backbone is; preliminary experiments using features from only the final ResNet layer yielded low AP.
Anchors: We use translation-invariant anchor boxes similar to those in the RPN variant in [20]. The anchors have areas of $32^2$ to $512^2$ on pyramid levels $P_3$ to $P_7$, respectively. As in [20], at each pyramid level we use anchors at three aspect ratios $\{1{:}2, 1{:}1, 2{:}1\}$. For denser scale coverage than in [20], at each level we add anchors of sizes $\{2^0, 2^{1/3}, 2^{2/3}\}$ of the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are $A = 9$ anchors per level and across levels they cover the scale range 32-813 pixels with respect to the network's input image. A short sketch of this configuration follows.
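The sketch below (helper name ours) makes the anchor configuration concrete; under the stated area and octave-scale settings it yields the $A = 9$ shapes per level and the quoted 32-813 pixel scale range.

```python
import math

def anchor_shapes(level):
    """(width, height) of the A = 9 anchors at pyramid level P3..P7."""
    base = 2 ** (level + 2)                # area 32^2 on P3 ... 512^2 on P7
    shapes = []
    for octave in (0.0, 1 / 3, 2 / 3):     # sizes {2^0, 2^(1/3), 2^(2/3)} x base
        for ratio in (0.5, 1.0, 2.0):      # aspect ratios {1:2, 1:1, 2:1}
            size = base * 2 ** octave
            w = size / math.sqrt(ratio)    # keep area = size^2 at each ratio
            h = size * math.sqrt(ratio)
            shapes.append((w, h))
    return shapes

assert len(anchor_shapes(3)) == 9          # A = 9 anchors per level
sizes = [2 ** (l + 2) * 2 ** o for l in range(3, 8) for o in (0.0, 1 / 3, 2 / 3)]
print(round(min(sizes)), round(max(sizes)))  # 32 813: the quoted scale range
```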
Each anchor is assigned a length $K$ one-hot vector of classification targets, where $K$ is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN [28] but modified for multi-class detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5; and to background if their IoU is in $[0, 0.4)$. As each anchor is assigned to at most one object box, we set the corresponding entry in its length $K$ label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with overlap in $[0.4, 0.5)$, it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
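This assignment rule can be sketched as follows, assuming an anchor-by-ground-truth IoU matrix has already been computed; the helper and its +1/0/-1 mask convention are ours, and box regression targets are omitted.

```python
import numpy as np

def assign_anchors(iou, num_classes, gt_labels):
    """Label assignment sketch (not the paper's code).

    iou: (num_anchors, num_gt) IoU matrix; gt_labels: (num_gt,) class ids.
    Returns a (num_anchors, num_classes) one-hot target matrix and a mask:
    1 = foreground, 0 = background, -1 = ignored during training.
    """
    targets = np.zeros((iou.shape[0], num_classes))
    best_gt = iou.argmax(axis=1)
    best_iou = iou.max(axis=1)

    fg = best_iou >= 0.5          # assigned to a ground-truth box
    bg = best_iou < 0.4           # assigned to background
    # IoU in [0.4, 0.5) is neither: the anchor is ignored.

    targets[fg, gt_labels[best_gt[fg]]] = 1.0   # one entry set to 1, rest stay 0
    mask = np.where(fg, 1, np.where(bg, 0, -1))
    return targets, mask
```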
RetinaNet uses feature pyramid levels $P_3$ to $P_7$, where $P_3$ to $P_5$ are computed from the output of the corresponding ResNet residual stage ($C_3$ through $C_5$) using top-down and lateral connections just as in [20], $P_6$ is obtained via a 3×3 stride-2 conv on $C_5$, and $P_7$ is computed by applying ReLU followed by a 3×3 stride-2 conv on $P_6$. This differs slightly from [20]: (1) we don't use the high-resolution pyramid level $P_2$ for computational reasons, (2) $P_6$ is computed by strided convolution instead of downsampling, and (3) we include $P_7$ to improve large object detection. These minor modifications improve speed while maintaining accuracy.
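A minimal PyTorch sketch of the two extra levels (layer names ours; the $C_5$ channel count of 2048 assumes a ResNet-50/101 backbone, and $C = 256$ output channels follow FPN [20]):

```python
import torch
from torch import nn

class ExtraPyramidLevels(nn.Module):
    def __init__(self, c5_channels=2048, channels=256):
        super().__init__()
        self.p6 = nn.Conv2d(c5_channels, channels, kernel_size=3, stride=2, padding=1)
        self.p7 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, c5):
        p6 = self.p6(c5)              # 3x3 stride-2 conv on C5
        p7 = self.p7(torch.relu(p6))  # ReLU, then 3x3 stride-2 conv on P6
        return p6, p7

p6, p7 = ExtraPyramidLevels()(torch.randn(1, 2048, 25, 25))
print(p6.shape, p7.shape)             # spatial size halves at each extra level
```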