Original paper: arXiv 1512.07729
G-CNN: an Iterative Grid Based Object Detector
Mahyar Najibi
University of Maryland, College Park
{najibi,mrastega}@cs.umd.edu
Abstract
We introduce G-CNN, an object detection technique based on CNNs which works without proposal algorithms. G-CNN starts with a multi-scale grid of fixed bounding boxes. We train a regressor to move and scale elements of the grid towards objects iteratively. G-CNN models the problem of object detection as finding a path from a fixed grid to boxes tightly surrounding the objects. G-CNN with around 180 boxes in a multi-scale grid performs comparably to Fast R-CNN, which uses around 2K bounding boxes generated with a proposal technique. This strategy makes detection faster by removing the object proposal stage as well as reducing the number of boxes to be processed.
1. Introduction
Object detection, i.e. the problem of finding the locations of objects and determining their categories, is an intrinsically more challenging problem than classification since it includes the problem of object localization. The recent and popular trend in object detection uses a pre-processing step to find a candidate set of bounding-boxes that are likely to encompass the objects in the image. This step is referred to as the bounding-box proposal stage. The proposal techniques are a major computational bottleneck in state-of-the-art object detectors [6]. There have been attempts [16, 14] to take this pre-processing stage out of the loop but they lead to performance degradations.
We show that without object proposals, we can achieve detection rates similar to state-of-the-art performance in object detection. Inspired by the iterative optimization in [2], we introduce an iterative algorithm that starts with a regularly sampled multi-scale grid of boxes in an image and updates the boxes to cover and classify objects. One-step regression cannot handle the non-linearity of the mapping from a regular grid to boxes containing objects. Instead, we introduce a piecewise regression model that can learn this non-linear mapping through a few iterations. Each step in our algorithm deals with an easier regression problem than enforcing a direct mapping to actual target locations.
Figure 1: This figure shows a schematic illustration of our iterative algorithm "G-CNN". It starts with a multi-scale regular grid over the image and iteratively updates the boxes in the grid. Each iteration pushes the boxes toward the objects of interest in the image while classifying their category.
Figure 1 depicts an overview of our algorithm. Initially, a multi-scale regular grid is superimposed on the image. For visualization we show a grid of non-overlapping boxes, but in actuality the boxes do overlap. During training, each box is assigned to a ground-truth object by an assignment function based on intersection over union with respect to the ground-truth boxes. Subsequently, at each training step, we regress boxes in the grid to move themselves towards the objects in the image to which they are assigned. At test time, for each box at each iteration, we obtain confidence scores over all categories and update its location with the regressor trained for the currently most probable class.
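To make the initial grid and the IoU-based assignment concrete, here is a minimal Python sketch. The scales, overlap fraction, and function names are illustrative choices of ours, not the paper's exact settings; only the 0.2 background threshold comes from Section 3.2.1.

```python
def multiscale_grid(img_w, img_h, scales=(0.25, 0.5, 1.0), overlap=0.5):
    """Generate a regular multi-scale grid of (x1, y1, x2, y2) boxes.

    At each scale, boxes are spaced so that neighbours overlap by the
    given fraction (illustrative values, not the paper's exact grid).
    """
    boxes = []
    for s in scales:
        bw, bh = img_w * s, img_h * s
        step_x, step_y = bw * (1 - overlap), bh * (1 - overlap)
        x = 0.0
        while x + bw <= img_w + 1e-6:
            y = 0.0
            while y + bh <= img_h + 1e-6:
                boxes.append((x, y, x + bw, y + bh))
                y += step_y
            x += step_x
    return boxes

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign(box, gt_boxes, thresh=0.2):
    """Assign a grid box to the ground-truth box of highest IoU,
    or to background (None) if no IoU exceeds the threshold."""
    best = max(gt_boxes, key=lambda g: iou(box, g), default=None)
    if best is None or iou(box, best) <= thresh:
        return None  # background box
    return best
```

Note that the assignment is many-to-one: several grid boxes may map to the same ground-truth object, while boxes far from every object stay background.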
Our experimental results show that G-CNN achieves the state-of-the-art results obtained by Fast R-CNN on PASCAL VOC datasets without computing bounding-box proposals. Our method is about 5X faster than Fast R-CNN for detection.
2. Related Work
Prior to CNN: For many years the problem of object detection was approached by techniques involving sliding windows and classification [22, 20]. Lampert et al. [12] proposed an algorithm that goes beyond sliding windows and is guaranteed to reach the globally optimal bounding box for an SVM-based classifier. Implicit Shape Models eliminated sliding-window search by relying on key parts of an image to vote for a consistent bounding box that covers an object of interest. Deformable Part-based Models [4] employed an idea similar to Implicit Shape Models, but proposed a direct optimization via latent variable models and used dynamic programming for fast inference. Several extensions of DPMs emerged until the remarkable improvements due to convolutional neural networks were shown by [7].
CNN age: Deep convolutional neural networks (CNNs) are the state-of-the-art image classifiers and successful methods have been proposed based on these networks [11]. Driven by their success in image classification, Girshick et al. proposed a multi-stage object detection system, known as R-CNN [7], which has attracted great attention due to its success on standard object detection datasets.
To address the localization problem, R-CNN relies on advances in object proposal techniques. Recently, proposal algorithms have been developed which avoid exhaustive search of image locations [21, 24]. R-CNN uses these techniques to find bounding boxes which include an object with high probability. Next, a standard CNN is applied as feature extractor to each proposed bounding box and finally a classifier decides which object class is inside the box.
The main drawback of R-CNN is the redundancy in computing the features. Generally, around 2K proposals are generated; for each of them, the CNN is applied independently to extract features. To alleviate this problem, in SPP-Net [9] the convolutional layers of the network are applied only once for each image. Then, the features of each region of interest are constructed by pooling the global features which lie in the spatial support of the region. However, learning is limited to fine-tuning the weights of the fully connected layers. This drawback is addressed in Fast R-CNN [6], in which all parameters are learned by back-propagating the errors through the augmented pooling layer and packing all stages of the system, except the generation of object proposals, into one network.
The generation of object proposals in CNN-based detection systems has been regarded as crucial. However, after the introduction of Fast R-CNN, this stage became the computational bottleneck. To reduce the number of object proposals, MultiBox [3] introduced a proposal algorithm that outputs 800 bounding boxes using a CNN. This increases the size of the final layer of the CNN and introduces a large set of additional parameters. Recently, Faster R-CNN [17] was proposed, which decreased the number of parameters; however, it needs to start from thousands of anchor points to propose 300 boxes.
In addition to classification, using a regressor for object detection has also been studied previously. Before R-CNN was proposed, Szegedy et al. [19] modeled object detection as a regression problem and proposed a CNN-based regression system. More recently, AttentionNet [23] is a single-category detection method that detects a single object inside an image using iterative regression. For multiple objects, the model is applied as a proposal algorithm to generate thousands of proposals and then re-applied iteratively on each proposal for single-category detection, which makes detection inefficient.
Although R-CNN and its variants attack the problem using a classification approach, they employ regression as a post-processing stage to refine the localization of the proposed bounding boxes.
The importance of the regression stage has not received as much attention as improving the object proposal stage for more accurate localization. The necessity of an object proposal algorithm in CNN-based object detection systems has recently been challenged by Lenc et al. [14]. Here, the proposals are replaced by a fixed set of bounding boxes: a set whose distribution is derived from an object proposal method is selected using a clustering technique. However, to achieve comparable results, even more boxes need to be used compared to R-CNN. Another recent attempt to remove the proposal stage is Redmon et al. [16], which conducts object detection in a single shot. However, the considerable gap between the best detection accuracy of these systems and systems with an explicit proposal stage suggests that the identification of good object proposals is critical to the success of these CNN-based detection systems.
3. G-CNN Object Detector
3.1. Network structure
G-CNN trains a CNN to move and scale a fixed multi-scale grid of bounding boxes towards objects. The network architecture for this regressor is shown in Figure 2. The backbone of this architecture can be any CNN network (e.g. AlexNet [11], VGG [18], etc.). As in Fast R-CNN and SPP-Net, a spatial region of interest (ROI) pooling layer is included in the architecture after the convolutional layers. Given the location information of each box, this layer computes the feature for the box by pooling the global features that lie inside the ROI. After the fully connected layers, the network ends with a linear regressor which outputs the change in the location and scale of each current bounding box, conditioned on the assumption that the box is moving towards an object of a class.
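The ROI pooling layer is what lets every box, whatever its size, produce an equal-length feature vector from the shared convolutional features. A minimal NumPy sketch of the idea (an illustrative toy, not the exact Fast R-CNN kernel; the function name and conventions are ours):

```python
import numpy as np

def roi_pool(feat, box, out_size=(2, 2)):
    """Crop the shared feature map to the box (given in feature-map
    coordinates as x1, y1, x2, y2) and max-pool the crop into a fixed
    out_size grid, so every ROI yields a fixed-length feature."""
    x1, y1, x2, y2 = box
    crop = feat[y1:y2, x1:x2]
    oh, ow = out_size
    h, w = crop.shape
    pooled = np.empty((oh, ow), dtype=crop.dtype)
    for i in range(oh):
        for j in range(ow):
            # bin boundaries; the max() guard keeps bins non-empty
            ys = slice(i * h // oh, max((i + 1) * h // oh, i * h // oh + 1))
            xs = slice(j * w // ow, max((j + 1) * w // ow, j * w // ow + 1))
            pooled[i, j] = crop[ys, xs].max()
    return pooled
```

A real implementation pools each channel of a C x H x W feature map and maps image-space boxes into feature-map coordinates; the sketch shows only the spatial binning.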
3.2. Training the network
Despite the similarities between the Fast R-CNN and G-CNN architectures, the training goals and approaches are different. G-CNN defines the problem of object detection as an iterative search in the space of all possible bounding boxes. G-CNN starts from a fixed multi-scale spatial pyramid of boxes. The goal of learning is to train the network so that it can iteratively move this set of initial boxes towards the objects inside the image in $S_{train}$ steps. This iterative behaviour is essential for the success of the algorithm. The reason is the highly non-linear search space of the problem. In other words, although learning how to linearly regress boxes to far-away targets is unrealistic, learning small changes in the search space is tractable. Section 4.3 shows the importance of this step-wise training approach.
Figure 2: Structure of G-CNN regression network as well as an illustration of the idea behind the iterative training approach. The bounding box at each step is shown by the blue rectangle and its target is represented by a red rectangle. The network is trained to learn the path from the initial bounding box to its assigned target iteratively.
3.2.1 Loss function
G-CNN is an iterative method that moves bounding boxes towards object locations in $S_{train}$ steps. For this reason, the loss function is defined not only over the training samples but also over the iterative steps.
More formally, let $\mathcal{B}$ be the four-dimensional space of all possible bounding boxes, represented by the coordinates of their center, their width, and their height. $B_i \in \mathcal{B}$ is the $i$'th training bounding box. We use the superscript $s \in \{1, \dots, S_{train}\}$ to denote the variables in step $s$ of the G-CNN training, i.e. $B_i^s$ is the position of the $i$'th training bounding box in step $s$.
During training, each bounding box with an IoU higher than a small threshold (0.2) is assigned to one of the ground-truth bounding boxes inside its image. The following many-to-one function, $\mathcal{A}$, is used for this assignment:

$$\mathcal{A}(B_i) = \arg\max_{G \in \mathcal{G}_{B_i}} \mathrm{IoU}(B_i^1, G) \quad (1)$$
where $\mathcal{G}_{B_i}$ is the set of ground-truth bounding boxes which lie in the same image as $B_i$, and IoU is the intersection-over-union measure. Note that $B_i^1$ represents the position of the $i$'th bounding box in the initial grid. In other words, for each training bounding box, the assignment is done with respect to the initial grid and is not changed during training.
Since regressing the initial training bounding boxes to their assigned ground-truth bounding box can be highly non-linear, we tackle the problem with a piece-wise regression approach. At step $s$, we solve the problem of regressing $B_i^s$ to a target bounding box on the path from $B_i^s$ to its assigned ground-truth box. The target bounding box is moved step by step towards the assigned bounding box until it coincides with the assigned ground truth in step $S_{train}$. The following function is used for defining the target bounding boxes at each step:

$$\Phi(B_i^s, G_i) = B_i^s + \frac{G_i - B_i^s}{S_{train} - s + 1} \quad (2)$$
where $G_i = \mathcal{A}(B_i)$ represents the ground-truth bounding box assigned to $B_i$. That is, at each step, the path from the current representation of the bounding box to the assigned ground truth is divided by the number of remaining steps, and the target is defined to be one unit away from the current location.
The G-CNN regression network outputs four values for each class, representing the parameterized change for regressing the bounding boxes assigned to that class. Following [7], a log-scale shift in width and height and a scale-invariant translation is used to parametrize the relative change for mapping a bounding box to its assigned target. This parametrization is denoted by $\Delta(B_i^s, T_i^s)$, where $T_i^s = \Phi(B_i^s, G_i)$ is the target assigned to $B_i^s$, computed by Eq. 2.
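The step-wise target of Eq. 2 and the R-CNN-style parametrization [7] can be sketched in a few lines of Python. Boxes are (cx, cy, w, h) tuples; the function names are our illustrative choices:

```python
import math

def step_target(box, gt, s, s_train):
    """Eq. 2 sketch: move the current box one 'unit' along the path to
    its ground truth, a unit being the remaining distance divided by
    the number of remaining steps."""
    return tuple(b + (g - b) / (s_train - s + 1) for b, g in zip(box, gt))

def encode_delta(box, target):
    """R-CNN-style parametrization: scale-invariant translation of the
    center, plus log-scale change of width and height."""
    (x, y, w, h), (tx, ty, tw, th) = box, target
    return ((tx - x) / w, (ty - y) / h,
            math.log(tw / w), math.log(th / h))
```

Applying `step_target` for s = 1, ..., S_train walks the box exactly onto its assigned ground truth at the final step, which is the piecewise path the regressor is trained to follow.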
So the loss function for G-CNN is defined as follows:

$$L = \sum_{s=1}^{S_{train}} \sum_{i=1}^{N} \mathbb{1}\left[B_i^1 \notin \mathcal{B}_{BG}\right] \, L_{reg}\!\left(\delta_{i,l_i}^{s} - \Delta(B_i^s, T_i^s)\right) \quad (3)$$
where $\delta_{i,l_i}^{s}$ is the four-dimensional parameterized output for class $l_i$, representing the relative change in the representation of bounding box $B_i^s$; $l_i$ is the class label of the ground-truth bounding box assigned to $B_i$; and $L_{reg}$ is the regression loss function, for which the smooth $L_1$ loss is used as defined in [6]. $\mathbb{1}[\cdot]$ is one if its condition is satisfied and zero otherwise, and $\mathcal{B}_{BG}$ represents the set of all background bounding boxes.
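The smooth $L_1$ loss of Fast R-CNN [6] and one per-box term of the loss above can be sketched as follows (the `box_loss` helper and its argument layout are our illustrative choices):

```python
def smooth_l1(x):
    """Smooth L1 loss from Fast R-CNN [6]: quadratic for |x| < 1,
    linear beyond, so large residuals are not penalized quadratically."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def box_loss(delta_pred, delta_target, is_background):
    """One term of Eq. 3: zero for background boxes, otherwise the
    smooth L1 loss summed over the four box parameters."""
    if is_background:
        return 0.0
    return sum(smooth_l1(p - t) for p, t in zip(delta_pred, delta_target))
```

Note the two branches of `smooth_l1` meet at |x| = 1 (both give 0.5), so the loss is continuous and differentiable there.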
During training, the representation of bounding box $B_i$ at step $s+1$ can be determined from the actual output of the network by the following update formula:

$$B_i^{s+1} = B_i^s + \Delta^{-1}\!\left(B_i^s, \delta_{i,l_i}^{s}\right) \quad (4)$$
where $\Delta^{-1}$ projects the relative change back from the parametrized space into $\mathcal{B}$. However, calculating Eq. 4 requires evaluating the forward path of the network during training, making training inefficient. Instead, we use an approximate update that assumes the network in step $s$ learned the regressor for step $s$ perfectly. As a result, the update formula becomes $B_i^{s+1} = T_i^s$. This update is depicted in Figure 2.
3.2.2 Optimization
G-CNN optimizes the objective function in Eq. 3 with stochastic gradient descent. Since G-CNN tries to map the bounding boxes to their assigned ground-truth boxes in $S_{train}$ steps, we use a step-wise learning algorithm that optimizes Eq. 3 step by step.
To this end, we treat each of the bounding boxes in the initial grid together with its target in each of the steps as an independent training sample, i.e. for each of the bounding boxes we have $S_{train}$ different training pairs. The algorithm first optimizes the loss function for the first step alone for a number of iterations. Then the training pairs of the second step are added to the training set and training continues step by step. By keeping the samples of the previous steps in the training set, we make sure that the network does not forget what was learned in the previous steps.
The samples for the earlier steps are part of the training set for a longer period of time. This choice is made since the earlier steps determine the global search direction and have a greater impact on the chance that the network will find the objects. On the other hand, the later steps only refine the bounding boxes to decrease localization error. Given that the search direction was correct and a good part of the object is now visible in the bounding box, the later steps solve a relatively easier problem.
Algorithm 1 is the method for generating training samples from each bounding box during each G-CNN step.
Algorithm 1 G-CNN Training Algorithm

procedure TRAINGCNN
    for s = 1 to S_train do
        TrainTuples ← {}
        for i = 1 to N do
            if s = 1 then
                B_i^1 ← spatial pyramid grid of boxes
                G_i ← A(B_i^1)
            else
                B_i^s ← T_i^{s-1}
            end if
            T_i^s ← Φ(B_i^s, G_i)
            Add (B_i^s, Δ(B_i^s, T_i^s)) to TrainTuples
        end for
        Train G-CNN with TrainTuples
    end for
end procedure
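Under the approximate update $B_i^{s+1} = T_i^s$, the sample generation of Algorithm 1 can be sketched in Python. The helper names, the `(step, box, delta)` tuple layout, and the `None`-for-background convention are our illustrative choices:

```python
import math

def phi(b, g, s, s_train):
    """Eq. 2: one unit along the path from box b to ground truth g."""
    return tuple(bi + (gi - bi) / (s_train - s + 1) for bi, gi in zip(b, g))

def delta(b, t):
    """R-CNN-style parametrization of the change from box b to target t;
    boxes are (cx, cy, w, h) tuples."""
    (x, y, w, h), (tx, ty, tw, th) = b, t
    return ((tx - x) / w, (ty - y) / h,
            math.log(tw / w), math.log(th / h))

def make_train_tuples(grid_boxes, assigned_gt, s_train=3):
    """Sketch of Algorithm 1's sample generation: for each non-background
    grid box, emit one (step, box, target delta) tuple per step, advancing
    the box with the approximate update B^{s+1} = T^s."""
    tuples = []
    for b1, g in zip(grid_boxes, assigned_gt):
        if g is None:       # background box: no regression pairs
            continue
        b = b1
        for s in range(1, s_train + 1):
            t = phi(b, g, s, s_train)
            tuples.append((s, b, delta(b, t)))
            b = t           # assume step s was learned perfectly
    return tuples
```

In the actual step-wise schedule described above, the tuples for step s would only be added to the training set once steps 1 through s-1 have been trained; the sketch generates them all at once for simplicity.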
3.3. Fine-tuning
All models are fine-tuned from pre-trained models on ImageNet. Following [6], we fine-tune all layers except early convolutional layers (i.e. conv2 and up for AlexNet and conv3_1 and up for VGG16). During training, mini-batches of two images are used. At each step of G-CNN, 64 training samples are selected randomly from all possible samples of the image at the corresponding step.
3.4. G-CNN Test Network
The G-CNN regression network is trained to detect objects in an iterative fashion, starting from a set of fixed bounding boxes in a multi-scale spatial grid. Likewise, at test time, the set of bounding boxes is initialized to the boxes of a spatial pyramid grid. The regressor moves boxes towards objects, using the classifier score to determine which class's regressor to apply when updating each box. The detection algorithm is presented in Algorithm 2.
During the detection phase, G-CNN is run $S_{test}$ times. However, like SPP-Net and Fast R-CNN, there is no need to compute activations for all layers at every iteration. During test time, we decompose the network into global and regression parts, as depicted in Figure 3. The global net contains all convolutional layers of the network. On the other hand, the regression part consists of the fully connected layers and the regression weights. The input to the global net is the image, and the forward path is computed only once for each image, outside the detection loop of Algorithm 2. Inside the detection loop, we only operate the regression network, which takes the outputs of the last layer of the global net as input and produces the bounding box modifications.
This makes the computational cost of the algorithm manageable.
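The global/regression decomposition of the test loop can be sketched as follows. The `conv_net` and `reg_net` interfaces are hypothetical stand-ins: `reg_net(feat, box)` is assumed to return per-class scores and per-class box deltas, and all names are ours:

```python
import math

def apply_delta(b, d):
    """Inverse of the R-CNN parametrization: apply a predicted
    (dx, dy, dw, dh) change to a (cx, cy, w, h) box."""
    x, y, w, h = b
    dx, dy, dw, dh = d
    return (x + dx * w, y + dy * h, w * math.exp(dw), h * math.exp(dh))

def detect(image, conv_net, reg_net, grid_boxes, s_test=5):
    """Sketch of the G-CNN test loop: the convolutional 'global' features
    are computed once per image; only the lightweight regression head
    runs inside the S_test iterations."""
    feat = conv_net(image)          # forward conv layers exactly once
    boxes = list(grid_boxes)
    for _ in range(s_test):
        new_boxes = []
        for b in boxes:
            scores, deltas = reg_net(feat, b)
            # pick the regressor of the currently most probable class
            c = max(range(len(scores)), key=scores.__getitem__)
            new_boxes.append(apply_delta(b, deltas[c]))
        boxes = new_boxes
    return boxes
```

Because `feat` is reused across all boxes and all iterations, the per-iteration cost is only the fully connected regression head, which is what keeps the iterative scheme practical.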