Source: arXiv:1504.08083
Fast R-CNN
Ross Girshick
Microsoft Research
Abstract
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
1. Introduction
Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11]) train models in multi-stage pipelines that are slow and inelegant.
Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called "proposals") must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.
In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3 s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).
1.1. R-CNN and SPPnet
The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
- Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
- Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
- Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47 s / image (on a GPU).
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.
All timings use one Nvidia K40 GPU overclocked to 875 MHz.
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.
1.2. Contributions
We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it's comparatively fast to train and test. The Fast R-CNN method has several advantages:
- Higher detection quality (mAP) than R-CNN, SPPnet
- Training is single-stage, using a multi-task loss
- Training can update all network layers
- No disk storage is required for feature caching
Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
2. Fast R-CNN architecture and training
Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all "background" class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.
RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
2.2. Initializing from pre-trained networks
We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net's first fully connected layer (e.g., H = W = 7 for VGG16).
Second, the network's last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
2.3. Fine-tuning for detection
Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let's elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.
Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), p = (p_0, ..., p_K), over K + 1 categories. As usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, t^k = (t^k_x, t^k_y, t^k_w, t^k_h), for each of the K object classes, indexed by k. We use the parameterization for t^k given in [9], in which t^k specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.
Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v),    (1)
in which L_cls(p, u) = −log p_u is log loss for true class u.
The second task loss, L_loc, is defined over a tuple of true bounding-box regression targets for class u, v = (v_x, v_y, v_w, v_h), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence L_loc is ignored. For bounding-box regression, we use the loss

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i),    (2)
in which

smooth_L1(x) = 0.5 x²  if |x| < 1,  and  |x| − 0.5  otherwise,    (3)
is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.
The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets v_i to have zero mean and unit variance. All experiments use λ = 1.
We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).
Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
Back-propagation through RoI pooling layers. Back-propagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.
Let x_i ∈ ℝ be the i-th activation input into the RoI pooling layer and let y_rj be the layer's j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r,j)}, in which i*(r,j) = argmax_{i′ ∈ R(r,j)} x_{i′}. R(r,j) is the index set of inputs in the sub-window over which the output unit y_rj max pools. A single x_i may be assigned to several different outputs y_rj.
The RoI pooling layer's backwards function computes the partial derivative of the loss function with respect to each input variable x_i by following the argmax switches:

∂L/∂x_i = Σ_r Σ_j [i = i*(r,j)] ∂L/∂y_rj.    (4)
In words, for each mini-batch RoI r and for each pooling output unit y_rj, the partial derivative ∂L/∂y_rj is accumulated if i is the argmax selected for y_rj by max pooling. In back-propagation, the partial derivatives ∂L/∂y_rj are already computed by the backwards function of the layer on top of the RoI pooling layer.
SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
2.4. Scale invariance
We explore two ways of achieving scale invariant object detection: (1) via "brute force" learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.
The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.
3. Fast R-CNN detection
Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈ 45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224² pixels in area [11].
For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k | r) = p_k. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].
3.1. Truncated SVD for faster detection
For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. In contrast, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD.
In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as

W ≈ U Σ_t V^T    (5)
using SVD. In this factorization, U is a u × t matrix comprising the first t left-singular vectors of W, Σ_t is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix Σ_t V^T (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.
4. Main results
Three main results support this paper's contributions:
- State-of-the-art mAP on VOC07, 2010, and 2012
- Fast training and testing compared to R-CNN, SPPnet
- Fine-tuning conv layers in VGG16 improves mAP
4.1. Experimental setup
Our experiments use three pre-trained ImageNet models that are available online. The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer to this CaffeNet as model S, for "small."