
Original paper: arXiv:1506.01497

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks


Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun


Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.


Index Terms: Object Detection, Region Proposal, Convolutional Neural Network.

1 Introduction

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.


Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.


One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to reimplement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.


In this paper, we show that an algorithmic change (computing proposals with a deep convolutional neural network) leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network's computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10 ms per image).

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel "anchor" boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.


  • S. Ren is with University of Science and Technology of China, Hefei, China. This work was done when S. Ren was an intern at Microsoft Research. Email: sqren@mail.ustc.edu.cn


  • K. He and J. Sun are with Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com


  • R. Girshick is with Facebook AI Research. The majority of this work was done when R. Girshick was with Microsoft Research. E-mail: rbg@fb.com


Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.



To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.$^{1}$
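A schematic sketch of this alternating optimization, following the four-step instantiation developed later in Section 3.2 (the helper callables here are hypothetical stand-ins, not the released training code):

```python
# Hypothetical outline of four-step alternating training; `train_rpn`,
# `train_detector`, and `init_imagenet` are illustrative stand-ins.
def alternating_training(init_imagenet, train_rpn, train_detector):
    # Step 1: train an RPN on top of an ImageNet-initialized backbone.
    rpn = train_rpn(backbone=init_imagenet())
    # Step 2: train a separate Fast R-CNN detector using the RPN's
    # proposals, which are held fixed during this step.
    detector = train_detector(backbone=init_imagenet(),
                              proposals=rpn.propose())
    # Step 3: re-initialize the RPN from the detector's backbone and
    # fine-tune only the RPN-specific layers (shared convs frozen).
    rpn = train_rpn(backbone=detector.backbone, freeze_shared=True)
    # Step 4: fine-tune only the detector-specific layers on the new
    # proposals; both modules now share one set of conv features.
    detector = train_detector(backbone=detector.backbone,
                              proposals=rpn.propose(),
                              freeze_shared=True)
    return rpn, detector
```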

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test time: the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5 fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at github.com/shaoqingren…_rcnn (in MATLAB) and github.com/rbgirshick/py-faster-rcnn (in Python).

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built into commercial systems such as at Pinterest [17], with user engagement improvements reported.

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions$^{2}$. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

2 Related Work


Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).


Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the "single-box" fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., $224 \times 224$), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.


  1. Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks leading to less training time.

  2. image-net.org/challenges/… and image-net.org/challenges/…


Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the 'attention' of this unified network.



Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

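As a small illustration of this shared-computation idea, the sketch below pools two image regions from one shared feature map using torchvision's `roi_pool` (a modern stand-in for the SPP/RoI pooling these papers describe; the shapes and the 1/16 scale are illustrative assumptions):

```python
import torch
from torchvision.ops import roi_pool

# One shared feature map for the whole image (e.g., 512 channels at 1/16
# of the input resolution, as with a VGG-16-style backbone).
features = torch.randn(1, 512, 40, 60)

# Two regions in image coordinates: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0, 0.0, 0.0, 320.0, 320.0],
                     [0, 160.0, 80.0, 480.0, 400.0]])

# Pool each region to a fixed 7x7 grid; spatial_scale maps image
# coordinates onto the feature map (1/16 for a total stride of 16).
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7])
```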

3 Faster R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with 'attention' [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

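A minimal sketch of how the two modules compose at inference time (the callables are hypothetical stand-ins, not the released implementation):

```python
# Hypothetical composition of the two Faster R-CNN modules.
def detect(image, backbone, rpn, detector):
    features = backbone(image)            # shared full-image conv features
    proposals = rpn(features)             # module 1: proposes regions ("where to look")
    return detector(features, proposals)  # module 2: Fast R-CNN classifies and refines
```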

3.1 Region Proposal Networks


A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.$^{3}$ We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an $n \times n$ spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers: a box-regression layer (*reg*) and a box-classification layer (*cls*). We use $n = 3$ in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an $n \times n$ convolutional layer followed by two sibling $1 \times 1$ convolutional layers (for *reg* and *cls*, respectively).

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as $k$. So the *reg* layer has $4k$ outputs encoding the coordinates of $k$ boxes, and the *cls* layer outputs $2k$ scores that estimate probability of object or not object for each proposal.$^{4}$ The $k$ proposals are parameterized relative to $k$ reference boxes, which we call *anchors*.
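To make this concrete, here is a minimal PyTorch sketch of the mini-network (our own illustration, not the released code): with the VGG-16 setting of a 512-d intermediate feature and $k = 9$, the *cls* branch emits $2k = 18$ channels and the *reg* branch $4k = 36$ channels at every feature-map position.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Mini-network slid over the shared conv feature map: an n x n conv
    (n = 3) followed by two sibling 1 x 1 convs for cls and reg."""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)  # object vs. not object
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)  # box coordinates

    def forward(self, feature_map):
        h = self.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)

# e.g., a 512-channel feature map of spatial size 40 x 60:
scores, deltas = RPNHead()(torch.randn(1, 512, 40, 60))
# scores: (1, 18, 40, 60); deltas: (1, 36, 40, 60)
```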


  3. "Region" is a generic term and in this paper we only consider rectangular regions, as is common for many methods (e.g., [27], [4], [6]). "Objectness" measures membership to a set of object classes vs. background.

  4. For simplicity we implement the *cls* layer as a two-class softmax layer. Alternatively, one may use logistic regression to produce $k$ scores.


Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.


An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding $k = 9$ anchors at each sliding position. For a convolutional feature map of size $W \times H$ (typically $\sim 2{,}400$), there are $WHk$ anchors in total.
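A NumPy sketch of this enumeration, using three scales ($128^2$, $256^2$, $512^2$ box areas) and three aspect ratios (1:2, 1:1, 2:1) as in the paper's default setting, and assuming a feature stride of 16; the exact centering convention is an illustrative choice:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) anchors per feature-map cell."""
    base = []
    for scale in scales:
        for ratio in ratios:
            w = scale * np.sqrt(1.0 / ratio)   # ratio = h / w; area stays ~scale^2
            h = scale * np.sqrt(ratio)
            base.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    base = np.array(base)                      # (k, 4) boxes centered at the origin

    # Shift the k base anchors to every sliding-window center on the grid.
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx, sy, sx, sy], axis=-1).reshape(-1, 1, 4)
    return (base[None, :, :] + shifts).reshape(-1, 4)  # (feat_h * feat_w * k, 4)

anchors = generate_anchors(40, 60)
print(anchors.shape)  # (21600, 4), i.e., WHk = 60 * 40 * 9
```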

Translation-Invariant Anchors


An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method.$^{5}$ As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.

The translation-invariant property also reduces the model size. MultiBox has a $(4 + 1) \times 800$-dimensional fully-connected output layer, whereas our method has a $(4 + 2) \times 9$-dimensional convolutional output layer in the case of $k = 9$ anchors. As a result, our output layer has $2.8 \times 10^{4}$ parameters ($512 \times (4 + 2) \times 9$ for VGG-16), two orders of magnitude fewer than MultiBox's output layer that has $6.1 \times 10^{6}$ parameters ($1536 \times (4 + 1) \times 800$ for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox.$^{6}$ We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.
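As a quick check of the arithmetic behind these counts:

$$
512 \times (4 + 2) \times 9 = 27{,}648 \approx 2.8 \times 10^{4}, \qquad
1536 \times (4 + 1) \times 800 = 6{,}144{,}000 \approx 6.1 \times 10^{6}.
$$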

Multi-Scale Anchors as Regression References


Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as $5 \times 7$ and $7 \times 5$). If this way is used to address multiple scales, it can be thought of as a "pyramid of filters" (Figure 1(b)). The second way is usually adopted jointly with the first way [8].

As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).


Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.


3.1.2 Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
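A sketch of this assignment (the 0.3 negative-label threshold follows the rule the paper gives immediately after this point; anchors that are neither positive nor negative are ignored during training):

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 for positive, 0 for negative, -1 for ignored anchors."""
    iou = iou_matrix(anchors, gt_boxes)    # (N, M)
    labels = -np.ones(len(anchors), dtype=np.int64)
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thresh] = 0       # background
    labels[max_iou > pos_thresh] = 1       # rule (ii): IoU > 0.7 with some gt box
    labels[iou.argmax(axis=0)] = 1         # rule (i): highest-IoU anchor per gt box
    return labels
```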


  5. As is the case of FCNs [7], our network is translation invariant up to the network's total stride.

  6. Considering the feature projection layers, our proposal layers' parameter count is $3 \times 3 \times 512 \times 512 + 512 \times 6 \times 9 = 2.4 \times 10^{6}$; MultiBox's proposal layers' parameter count is $7 \times 7 \times (64 + 96 + 64 + 64) \times 1536 + 1536 \times 5 \times 800 = 27 \times 10^{6}$.

