Cascade R-CNN: Delving into High Quality Object Detection (Translation)

Original paper: arXiv:1712.00726

Cascade R-CNN: Delving into High Quality Object Detection

Zhaowei Cai

UC San Diego

zwcai@ucsd.edu

Nuno Vasconcelos

UC San Diego

nuno@ucsd.edu

Abstract

In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector trained with a low IoU threshold, e.g. 0.5, usually produces noisy detections. However, detection performance tends to degrade as the IoU threshold increases. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector. The resampling of progressively improved hypotheses guarantees that all detectors have a positive set of examples of equivalent size, reducing the overfitting problem. The same cascade procedure is applied at inference, enabling a closer match between the hypotheses and the detector quality of each stage. A simple implementation of the Cascade R-CNN is shown to surpass all single-model object detectors on the challenging COCO dataset. Experiments also show that the Cascade R-CNN is widely applicable across detector architectures, achieving consistent gains independently of the baseline detector strength. The code will be made available at github.com/zhaoweicai/….

1. Introduction

Object detection is a complex problem, requiring the solution of two main tasks. First, the detector must solve the recognition problem, to distinguish foreground objects from background and assign them the proper object class labels. Second, the detector must solve the localization problem, to assign accurate bounding boxes to different objects. Both of these are particularly difficult because the detector faces many "close" false positives, corresponding to "close but not correct" bounding boxes. The detector must find the true positives while suppressing these close false positives.

Figure 1. The detection outputs, localization and detection performance of object detectors of increasing IoU threshold $u$.

Many of the recently proposed object detectors are based on the two-stage R-CNN framework [12, 11, 27, 21], where detection is framed as a multi-task learning problem that combines classification and bounding box regression. Unlike object recognition, an intersection over union (IoU) threshold is required to define positives/negatives. However, the commonly used threshold values $u$, typically $u = 0.5$, establish quite a loose requirement for positives. The resulting detectors frequently produce noisy bounding boxes, as shown in Figure 1 (a). Hypotheses that most humans would consider close false positives frequently pass the $\mathrm{IoU} \geq 0.5$ test. While the examples assembled under the $u = 0.5$ criterion are rich and diversified, they make it difficult to train detectors that can effectively reject close false positives.

In this work, we define the quality of a hypothesis as its IoU with the ground truth, and the quality of the detector as the IoU threshold $u$ used to train it. The goal is to investigate the so far poorly researched problem of learning high quality object detectors, whose outputs contain few close false positives, as shown in Figure 1 (b). The basic idea is that a single detector can only be optimal for a single quality level. This is known in the cost-sensitive learning literature [7, 24], where the optimization of different points of the receiver operating characteristic (ROC) requires different loss functions. The main difference is that we consider the optimization for a given IoU threshold, rather than a false positive rate.

The idea is illustrated by Figure 1 (c) and (d), which present the localization and detection performance, respectively, of three detectors trained with IoU thresholds of $u = 0.5, 0.6, 0.7$. The localization performance is evaluated as a function of the IoU of the input proposals, and the detection performance as a function of the IoU threshold, as in COCO [20]. Note that, in Figure 1 (c), each bounding box regressor performs best for examples of IoU close to the threshold with which the detector was trained. This also holds for detection performance, up to overfitting. Figure 1 (d) shows that the detector of $u = 0.5$ outperforms the detector of $u = 0.6$ for low IoU examples, but underperforms it at higher IoU levels. In general, a detector optimized at a single IoU level is not necessarily optimal at other levels. These observations suggest that higher quality detection requires a closer quality match between the detector and the hypotheses that it processes. In general, a detector can only have high quality if presented with high quality proposals.

However, to produce a high quality detector, it does not suffice to simply increase $u$ during training. In fact, as seen for the detector of $u = 0.7$ in Figure 1 (d), this can degrade detection performance. The problem is that the distribution of hypotheses out of a proposal detector is usually heavily imbalanced towards low quality. In general, forcing larger IoU thresholds leads to exponentially fewer positive training samples. This is particularly problematic for neural networks, which are known to be very example intensive, and makes the "high $u$" training strategy quite prone to overfitting. Another difficulty is the mismatch between the quality of the detector and that of the testing hypotheses at inference. As shown in Figure 1, high quality detectors are only necessarily optimal for high quality hypotheses. Detection can be suboptimal when they are asked to work on hypotheses of other quality levels.

In this paper, we propose a new detector architecture, Cascade R-CNN, that addresses these problems. It is a multi-stage extension of the R-CNN, where detector stages deeper into the cascade are sequentially more selective against close false positives. The cascade of R-CNN stages is trained sequentially, using the output of one stage to train the next. This is motivated by the observation that the output IoU of a regressor is almost invariably better than the input IoU. This observation can be made in Figure 1 (c), where all plots are above the gray line. It suggests that the output of a detector trained with a certain IoU threshold is a good distribution to train the detector of the next higher IoU threshold. This is similar to bootstrapping methods commonly used to assemble datasets in the object detection literature [31, 8]. The main difference is that the resampling procedure of the Cascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. When operating in this manner, a sequence of detectors adapted to increasingly higher IoUs can beat the overfitting problem, and thus be effectively trained. At inference, the same cascade procedure is applied. The progressively improved hypotheses are better matched to the increasing detector quality at each stage. This enables higher detection accuracies, as suggested by Figure 1 (c) and (d).
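
The resampling idea can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' implementation: boxes are given as corner tuples `(x1, y1, x2, y2)`, `toy_stage` is a hypothetical stand-in for a learned regressor that simply moves each box halfway toward the ground truth (a real regressor never sees the ground truth at inference), and the thresholds 0.5, 0.6, 0.7 are borrowed from the detectors compared in Figure 1.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners (assumed format)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def toy_stage(box, gt, rate=0.5):
    """Hypothetical regressor: moves every corner a fraction `rate` toward gt."""
    return tuple(c + rate * (g - c) for c, g in zip(box, gt))


def positives_without_cascade(proposals, gt, thresholds):
    """Count positives when every threshold is applied to the raw proposals."""
    return [sum(iou(b, gt) >= u for b in proposals) for u in thresholds]


def positives_with_cascade(proposals, gt, thresholds):
    """Count positives when each stage is fed the previous stage's output."""
    boxes = list(proposals)
    counts = []
    for u in thresholds:
        counts.append(sum(iou(b, gt) >= u for b in boxes))
        # The refined boxes become the hypothesis distribution of the next stage.
        boxes = [toy_stage(b, gt) for b in boxes]
    return counts


gt = (0.0, 0.0, 10.0, 10.0)
proposals = [(2.0, 0.0, 12.0, 10.0), (3.0, 0.0, 13.0, 10.0), (4.0, 0.0, 14.0, 10.0)]
thresholds = [0.5, 0.6, 0.7]
print(positives_without_cascade(proposals, gt, thresholds))
print(positives_with_cascade(proposals, gt, thresholds))
```

On these three made-up proposals the positive count collapses as $u$ grows when the raw proposals are reused, while it stays roughly constant when each stage is trained on the refined output of the previous one. This is the effect the paragraph above attributes to resampling.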

The Cascade R-CNN is quite simple to implement and is trained end-to-end. Our results show that a vanilla implementation, without any bells and whistles, surpasses all previous state-of-the-art single-model detectors by a large margin on the challenging COCO detection task [20], especially under the higher quality evaluation metrics. In addition, the Cascade R-CNN can be built with any two-stage object detector based on the R-CNN framework. We have observed consistent gains (of 2 to 4 points), at a marginal increase in computation. This gain is independent of the strength of the baseline object detectors. We thus believe that this simple and effective detection architecture can be of interest for many object detection research efforts.

2. Related Work

Due to the success of the R-CNN [12] architecture, the two-stage formulation of the detection problem, which combines a proposal detector and a region-wise classifier, has become predominant in the recent past. To reduce redundant CNN computations in the R-CNN, the SPP-Net [15] and Fast-RCNN [11] introduced the idea of region-wise feature extraction, significantly speeding up the overall detector. Later, the Faster-RCNN [27] achieved further speed-ups by introducing a Region Proposal Network (RPN). This architecture has become a leading object detection framework. Some more recent works have extended it to address various problems of detail. For example, the R-FCN [4] proposed efficient region-wise fully convolutional computation without accuracy loss, to avoid the heavy region-wise CNN computations of the Faster-RCNN, while the MS-CNN [1] and FPN [21] detect proposals at multiple output layers, so as to alleviate the scale mismatch between the RPN receptive fields and actual object size, for high-recall proposal detection.

Alternatively, one-stage object detection architectures have also become popular, mostly due to their computational efficiency. These architectures are close to the classic sliding window strategy [31, 8]. YOLO [26] outputs very sparse detection results by forwarding the input image once. When implemented with an efficient backbone network, it enables real time object detection with fair performance. SSD [23] detects objects in a way similar to the RPN [27], but uses multiple feature maps at different resolutions to cover objects at various scales. The main limitation of these architectures is that their accuracies are typically below that of two-stage detectors. Recently, RetinaNet [22] was proposed to address the extreme foreground-background class imbalance in dense object detection, achieving better results than state-of-the-art two-stage object detectors.

Some explorations in multi-stage object detection have also been proposed. The multi-region detector [9] introduced iterative bounding box regression, where an R-CNN is applied several times to produce better bounding boxes. CRAFT [33] and AttractioNet [10] used a multi-stage procedure to generate accurate proposals, and forwarded them to a Fast-RCNN. The works of [19, 25] embedded the classic cascade architecture of [31] in object detection networks, and [3] alternated between a detection task and a segmentation task, for instance segmentation.

3. Object Detection

In this paper, we extend the two-stage architecture of the Faster-RCNN [27, 21], shown in Figure 3 (a). The first stage is a proposal sub-network ("H0"), applied to the entire image, to produce preliminary detection hypotheses, known as object proposals. In the second stage, these hypotheses are then processed by a region-of-interest detection subnetwork ("H1"), denoted as detection head. A final classification score ("C") and a bounding box ("B") are assigned to each hypothesis. We focus on modeling a multi-stage detection sub-network, and adopt, but are not limited to, the RPN [27] for proposal detection.

3.1. Bounding Box Regression

A bounding box $\mathbf{b} = (b_x, b_y, b_w, b_h)$ contains the four coordinates of an image patch $x$. The task of bounding box regression is to regress a candidate bounding box $\mathbf{b}$ into a target bounding box $\mathbf{g}$, using a regressor $f(x, \mathbf{b})$. This is learned from a training sample $\{\mathbf{g}_i, \mathbf{b}_i\}$, so as to minimize the bounding box risk

$$\mathcal{R}_{loc}[f] = \sum_{i=1}^{N} L_{loc}\left(f(x_i, \mathbf{b}_i), \mathbf{g}_i\right), \tag{1}$$

where $L_{loc}$ was an $L_2$ loss function in R-CNN [12], but updated to a smoothed $L_1$ loss function in Fast-RCNN [11]. To encourage a regression invariant to scale and location, $L_{loc}$ operates on the distance vector $\Delta = (\delta_x, \delta_y, \delta_w, \delta_h)$ defined by

$$\delta_x = (g_x - b_x)/b_w, \quad \delta_y = (g_y - b_y)/b_h, \quad \delta_w = \log(g_w/b_w), \quad \delta_h = \log(g_h/b_h). \tag{2}$$
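
As a minimal sketch of the distance vector $\Delta$ and of a smoothed $L_1$ loss, the snippet below encodes a candidate box against a target and inverts the encoding. It assumes the usual R-CNN convention in which $(b_x, b_y)$ is the box center and $(b_w, b_h)$ its width and height; the function names `encode`, `decode` and `smooth_l1` are illustrative, and the loss uses the standard Fast-RCNN form with the switch point at 1.

```python
import math


def encode(b, g):
    """Scale- and location-invariant offsets from candidate b to target g.

    Both boxes are (x, y, w, h) tuples, with (x, y) assumed to be the center.
    """
    bx, by, bw, bh = b
    gx, gy, gw, gh = g
    return ((gx - bx) / bw, (gy - by) / bh, math.log(gw / bw), math.log(gh / bh))


def decode(b, delta):
    """Inverse of encode: apply offsets delta to candidate b."""
    bx, by, bw, bh = b
    dx, dy, dw, dh = delta
    return (bx + dx * bw, by + dy * bh, bw * math.exp(dw), bh * math.exp(dh))


def smooth_l1(pred, target):
    """Smoothed L1 loss, summed over coordinates: quadratic below 1, linear above."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


b = (50.0, 50.0, 20.0, 40.0)
g = (54.0, 46.0, 30.0, 40.0)
delta = encode(b, g)
print(delta)             # offsets the regressor is trained to predict
print(decode(b, delta))  # recovers g up to floating-point error
```

Because the center offsets are divided by the box size and the size offsets are log-ratios, the same $\Delta$ describes the same relative correction for a small or a large box.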

Figure 2. Sequential $\Delta$ distribution (without normalization) at different cascade stages. Red dots are outliers when using increasing IoU thresholds, and the statistics are obtained after outlier removal.

Since bounding box regression usually performs minor adjustments on $\mathbf{b}$, the numerical values of (2) can be very small. Hence, the risk of (1) is usually much smaller than the classification risk. To improve the effectiveness of multi-task learning, $\Delta$ is usually normalized by its mean and variance, i.e. $\delta_x$ is replaced by $\delta_x' = (\delta_x - \mu_x)/\sigma_x$. This is widely used in the literature [27, 1, 4, 21, 14].
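
The normalization step can be sketched as follows. This is an illustrative helper, not code from the paper: `normalize_deltas` is a hypothetical name, the statistics are computed over whatever sample of offsets is passed in, and a coordinate with zero spread is left unscaled to avoid division by zero.

```python
def normalize_deltas(deltas):
    """Standardize each coordinate of a list of (dx, dy, dw, dh) offsets.

    Returns the normalized offsets together with the per-coordinate mean and
    standard deviation, which are needed to undo the scaling at inference.
    """
    n = len(deltas)
    columns = list(zip(*deltas))
    means = [sum(col) / n for col in columns]
    stds = [
        (sum((v - m) ** 2 for v in col) / n) ** 0.5
        for col, m in zip(columns, means)
    ]
    stds = [s if s > 0 else 1.0 for s in stds]  # guard against constant columns
    normalized = [
        tuple((v - m) / s for v, m, s in zip(d, means, stds)) for d in deltas
    ]
    return normalized, means, stds


sample = [(0.1, -0.2, 0.0, 0.3), (0.3, 0.0, 0.2, 0.1), (0.2, 0.2, 0.1, -0.1)]
normalized, means, stds = normalize_deltas(sample)
print(means, stds)
```

After this step every coordinate has zero mean and unit variance over the sample, so the regression targets are on a scale comparable to the classification loss.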

Some works [9, 10, 16] have argued that a single regression step of $f$ is insufficient for accurate localization. Instead, $f$ is applied iteratively, as a post-processing step

$$f'(x, \mathbf{b}) = f \circ f \circ \cdots \circ f(x, \mathbf{b}), \tag{3}$$

to refine a bounding box $\mathbf{b}$. This is called iterative bounding box regression, denoted as iterative BBox. It can be implemented with the inference architecture of Figure 3 (b), where all heads are the same. This idea, however, ignores two problems. First, as shown in Figure 1, a regressor $f$ trained at $u = 0.5$ is suboptimal for hypotheses of higher IoUs. It actually degrades bounding boxes of IoU larger than 0.85. Second, as shown in Figure 2, the distribution of bounding boxes changes significantly after each iteration. While the regressor is optimal for the initial distribution, it can be quite suboptimal after that. Due to these problems, iterative BBox requires a fair amount of human engineering, in the form of proposal accumulation, box voting, etc. [9, 10, 16], and has somewhat unreliable gains. Usually, there is no benefit beyond applying $f$ twice.
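
Iterative BBox amounts to composing one and the same regressor with itself. The sketch below shows only that control flow; `toy_f` is a hypothetical placeholder for a trained regressor (it pulls corner coordinates halfway toward a fixed target and ignores the image `x`), so it says nothing about the accuracy issues discussed above.

```python
def iterative_bbox(f, x, b, steps):
    """Apply the same regressor f to box b `steps` times."""
    for _ in range(steps):
        b = f(x, b)
    return b


TARGET = (0.0, 0.0, 10.0, 10.0)  # made-up target used only by the toy regressor


def toy_f(x, b):
    """Hypothetical regressor: halves the gap between b and TARGET."""
    return tuple(c + 0.5 * (t - c) for c, t in zip(b, TARGET))


print(iterative_bbox(toy_f, None, (4.0, 4.0, 14.0, 14.0), steps=2))
```

The key point is that `f` is identical at every step, which is exactly what distinguishes iterative BBox from a cascade of differently trained stages.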

3.2. Classification

The classifier is a function $h(x)$ that assigns an image patch $x$ to one of $M + 1$ classes, where class 0 contains background and the remaining classes the objects to detect. $h(x)$ is an $(M + 1)$-dimensional estimate of the posterior distribution over classes, i.e. $h_k(x) = p(y = k \mid x)$, where $y$ is the class label. Given a training set $(x_i, y_i)$, it is learned by minimizing a classification risk

$$\mathcal{R}_{cls}[h] = \sum_{i=1}^{N} L_{cls}\left(h(x_i), y_i\right), \tag{4}$$

where $L_{cls}$ is the classic cross-entropy loss.

Figure 3. The architectures of different frameworks. "I" is the input image, "conv" the backbone convolutions, "pool" region-wise feature extraction, "H" a network head, "B" a bounding box, and "C" classification. "B0" denotes the proposals in all architectures.

3.3. Detection Quality

Since a bounding box usually includes an object and some amount of background, it is difficult to determine if a detection is positive or negative. This is usually addressed by the IoU metric. If the IoU is above a threshold $u$, the patch is considered an example of the class. Thus, the class label of a hypothesis $x$ is a function of $u$,

$$y = \begin{cases} g_y, & \mathrm{IoU}(x, g) \geq u \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $g_y$ is the class label of the ground truth object $g$. This IoU threshold $u$ defines the quality of a detector.
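
The threshold-dependent labeling can be written in a few lines. The sketch assumes boxes in corner format `(x1, y1, x2, y2)`, which is a choice made here for convenience rather than the paper's $(b_x, b_y, b_w, b_h)$ parameterization, and `assign_label` is an illustrative name.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_label(hypothesis, gt_box, gt_class, u):
    """Class label of a hypothesis: gt_class if IoU >= u, else 0 (background)."""
    return gt_class if iou(hypothesis, gt_box) >= u else 0


gt_box, gt_class = (0.0, 0.0, 10.0, 10.0), 3
hypothesis = (2.0, 0.0, 12.0, 10.0)  # overlaps the ground truth with IoU = 2/3
for u in (0.5, 0.6, 0.7):
    print(u, assign_label(hypothesis, gt_box, gt_class, u))
```

The same hypothesis is a positive for the detectors of $u = 0.5$ and $u = 0.6$ but background for the detector of $u = 0.7$, which is why the label, and hence the training set, depends on the chosen quality level.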

Object detection is challenging because, no matter the threshold, the detection setting is highly adversarial. When $u$ is high, the positives contain less background, but it is difficult to assemble enough positive training examples. When $u$ is low, a richer and more diversified positive training set is available, but the trained detector has little incentive to reject close false positives. In general, it is very difficult to ask a single classifier to perform uniformly well over all IoU levels. At inference, since the majority of the hypotheses produced by a proposal detector, e.g. RPN [27] or selective search [30], have low quality, the detector must be more discriminant for lower quality hypotheses. A standard compromise between these conflicting requirements is to settle on $u = 0.5$. This, however, is a relatively low threshold, leading to low quality detections that most humans consider close false positives, as shown in Figure 1 (a).

A naïve solution is to develop an ensemble of classifiers, with the architecture of Figure 3 (c), optimized with a loss that targets various quality levels,

$$L_{cls}(h(x), y) = \sum_{u \in U} L_{cls}\left(h_u(x), y_u\right), \tag{6}$$

where $U$ is a set of IoU thresholds. This is closely related to the integral loss of [34], in which $U = \{0.5, 0.55, \cdots, 0.75\}$, designed to fit the evaluation metric of the COCO challenge. By definition, the classifiers need to be ensembled at inference. This solution fails to address the problem that the different losses of (6) operate on different numbers of positives. As shown in the first panel of Figure 4, the set of positive samples decreases quickly with $u$. This is particularly problematic because the high quality classifiers are prone to overfitting. In addition, those high quality classifiers are required to process proposals of overwhelmingly low quality at inference, for which they are not optimized. Due to all this, the ensemble of (6) fails to achieve higher accuracy at most quality levels, and the architecture has very little gain over that of Figure 3 (a).

Figure 4. The IoU histogram of training samples. The distribution at the 1st stage is the output of the RPN. The red numbers are the percentage of positives above the corresponding IoU threshold.
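
A minimal sketch of such a multi-threshold loss for one hypothesis is shown below. It assumes each classifier $h_u$ has already produced a probability vector over the $M + 1$ classes; the dictionary `probs_by_u`, the function names and the example probabilities are made up for illustration.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss of a probability vector against an integer label."""
    return -math.log(probs[label])


def integral_loss(probs_by_u, hypothesis_iou, gt_class):
    """Sum of per-threshold cross-entropy losses for a single hypothesis.

    probs_by_u maps each IoU threshold u to the output of the classifier h_u.
    The label y_u is gt_class when the hypothesis IoU reaches u, else 0.
    """
    total = 0.0
    for u, probs in probs_by_u.items():
        y_u = gt_class if hypothesis_iou >= u else 0
        total += cross_entropy(probs, y_u)
    return total


# Two hypothetical classifiers over {background, class 1}.
probs_by_u = {0.5: [0.2, 0.8], 0.75: [0.6, 0.4]}
print(integral_loss(probs_by_u, hypothesis_iou=0.6, gt_class=1))
```

A hypothesis with IoU 0.6 counts as a positive for $h_{0.5}$ and as background for $h_{0.75}$, so the higher-threshold terms of the sum see far fewer positives than the lower ones.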

4. Cascade R-CNN

In this section we introduce the proposed Cascade R-CNN object detection architecture of Figure 3 (d).

4.1. Cascaded Bounding Box Regression

As seen in Figure 1 (c), it is very difficult to ask a single regressor to perform perfectly uniformly at all quality levels. The difficult regression task can be decomposed into a sequence of simpler steps, inspired by the works of cascade pose regression [6] and face alignment [2, 32]. In the
