R-FCN: Object Detection via Region-based Fully Convolutional Networks [Translation]


Original paper: 1605.06409

R-FCN: Object Detection via Region-based Fully Convolutional Networks

Jifeng Dai
Microsoft Research Asia

Yi Li*
Tsinghua University

Kaiming He
Microsoft Research

Jian Sun
Microsoft Research

Abstract

We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN [6, 18] that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets) [9], for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170 ms per image, 2.5-20x faster than the Faster R-CNN counterpart. Code is made publicly available at:

github.com/daijifeng00….

1 Introduction

A prevalent family [8, 6, 18] of deep networks for object detection can be divided into two subnetworks by the Region-of-Interest (RoI) pooling layer [6]: (i) a shared, "fully convolutional" subnetwork independent of RoIs, and (ii) an RoI-wise subnetwork that does not share computation. This decomposition [8] historically resulted from the pioneering classification architectures, such as AlexNet [10] and VGG Nets [23], that consist of two subnetworks by design: a convolutional subnetwork ending with a spatial pooling layer, followed by several fully-connected (fc) layers. Thus the (last) spatial pooling layer in image classification networks is naturally turned into the RoI pooling layer in object detection networks [8, 6, 18].

But recent state-of-the-art image classification networks such as Residual Nets (ResNets) [9] and GoogLeNets [24, 26] are by design fully convolutional². By analogy, it appears natural to use all convolutional layers to construct the shared, convolutional subnetwork in the object detection architecture, leaving the RoI-wise subnetwork no hidden layer. However, as empirically investigated in this work, this naïve solution turns out to have considerably inferior detection accuracy that does not match the network's superior classification accuracy. To remedy this issue, in the ResNet paper [9] the RoI pooling layer of the Faster R-CNN detector [18] is unnaturally inserted between two sets of convolutional layers; this creates a deeper RoI-wise subnetwork that improves accuracy, at the cost of lower speed due to the unshared per-RoI computation.

We argue that the aforementioned unnatural design is caused by a dilemma of increasing translation invariance for image classification vs. respecting translation variance for object detection. On one hand, the image-level classification task favors translation invariance: shift of an object inside an image should be indiscriminative. Thus, deep (fully) convolutional architectures that are as translation-invariant as possible are preferable, as evidenced by the leading results on ImageNet classification [9, 24, 26].


*This work was done when Yi Li was an intern at Microsoft Research.

² Only the last layer is fully-connected, which is removed and replaced when fine-tuning for object detection.


Figure 1: Key idea of R-FCN for object detection. In this illustration, there are $k \times k = 3 \times 3$ position-sensitive score maps generated by a fully convolutional network. For each of the $k \times k$ bins in an RoI, pooling is only performed on one of the $k^2$ maps (marked by different colors).

Table 1: Methodologies of region-based detectors using ResNet-101 [9].

On the other hand, the object detection task needs localization representations that are translation-variant to an extent. For example, translation of an object inside a candidate box should produce meaningful responses for describing how good the candidate box overlaps the object. We hypothesize that deeper convolutional layers in an image classification network are less sensitive to translation. To address this dilemma, the ResNet paper's detection pipeline [9] inserts the RoI pooling layer into convolutions: this region-specific operation breaks down translation invariance, and the post-RoI convolutional layers are no longer translation-invariant when evaluated across different regions. However, this design sacrifices training and testing efficiency since it introduces a considerable number of region-wise layers (Table 1).

In this paper, we develop a framework called Region-based Fully Convolutional Network (R-FCN) for object detection. Our network consists of shared, fully convolutional architectures, as is the case of FCN [15]. To incorporate translation variance into FCN, we construct a set of position-sensitive score maps by using a bank of specialized convolutional layers as the FCN output. Each of these score maps encodes the position information with respect to a relative spatial position (e.g., "to the left of an object"). On top of this FCN, we append a position-sensitive RoI pooling layer that shepherds information from these score maps, with no weight (convolutional/fc) layers following. The entire architecture is learned end-to-end. All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection. Figure 1 illustrates the key idea and Table 1 compares the methodologies among region-based detectors.

Using the 101-layer Residual Net (ResNet-101) [9] as the backbone, our R-FCN yields competitive results of 83.6% mAP on the PASCAL VOC 2007 set and 82.0% on the 2012 set. Meanwhile, our results are achieved at a test-time speed of 170 ms per image using ResNet-101, which is 2.5x to 20x faster than the Faster R-CNN + ResNet-101 counterpart in [9]. These experiments demonstrate that our method manages to address the dilemma between invariance/variance on translation, and that fully convolutional image-level classifiers such as ResNets can be effectively converted to fully convolutional object detectors. Code is made publicly available at: github.com/daijifeng00….

2 Our approach

Overview. Following R-CNN [7], we adopt the popular two-stage object detection strategy [7, 8, 6, 18, 1, 22] that consists of: (i) region proposal, and (ii) region classification. Although methods that do not rely on region proposal do exist (e.g., [17, 14]), region-based systems still possess leading accuracy on several benchmarks [5, 13, 20]. We extract candidate regions by the Region Proposal Network (RPN) [18], which is a fully convolutional architecture in itself. Following [18], we share the features between RPN and R-FCN. Figure 2 shows an overview of the system.

Figure 2: Overall architecture of R-FCN. A Region Proposal Network (RPN) [18] proposes candidate RoIs, which are then applied on the score maps. All learnable weight layers are convolutional and are computed on the entire image; the per-RoI computational cost is negligible.

Given the proposal regions (RoIs), the R-FCN architecture is designed to classify the RoIs into object categories and background. In R-FCN, all learnable weight layers are convolutional and are computed on the entire image. The last convolutional layer produces a bank of $k^2$ position-sensitive score maps for each category, and thus has a $k^2(C+1)$-channel output layer with $C$ object categories (+1 for background). The bank of $k^2$ score maps corresponds to a $k \times k$ spatial grid describing relative positions. For example, with $k \times k = 3 \times 3$, the 9 score maps encode the cases of {top-left, top-center, top-right, ..., bottom-right} of an object category.
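
To make the channel layout concrete (a minimal bookkeeping sketch in Python; the class-major ordering and the `decode_channel` helper are assumptions of this illustration, not taken from the released code), with $k = 3$ and the $C = 20$ PASCAL VOC categories the bank has $k^2(C+1) = 9 \times 21 = 189$ score maps:

```python
# Channel bookkeeping for the position-sensitive score-map bank.
# Hypothetical layout: channels ordered class-major, then grid cell (row-major).
k, C = 3, 20                      # 3x3 grid, 20 VOC object classes (+1 background)
num_channels = k * k * (C + 1)    # 9 * 21 = 189 score maps in total

def decode_channel(ch, k=3):
    """Map a channel index back to (class_id, grid_i, grid_j)."""
    cls, cell = divmod(ch, k * k)
    i, j = divmod(cell, k)
    return cls, i, j

assert num_channels == 189
print(decode_channel(0))     # (0, 0, 0): background score map for the top-left bin
print(decode_channel(188))   # (20, 2, 2): last class, bottom-right bin
```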

R-FCN ends with a position-sensitive RoI pooling layer. This layer aggregates the outputs of the last convolutional layer and generates scores for each RoI. Unlike [8, 6], our position-sensitive RoI layer conducts selective pooling, and each of the $k \times k$ bins aggregates responses from only one score map out of the bank of $k \times k$ score maps. With end-to-end training, this RoI layer shepherds the last convolutional layer to learn specialized position-sensitive score maps. Figure 1 illustrates this idea. Figures 3 and 4 visualize an example. The details are introduced as follows.

Backbone architecture. The incarnation of R-FCN in this paper is based on ResNet-101 [9], though other networks [10, 23] are applicable. ResNet-101 has 100 convolutional layers followed by global average pooling and a 1000-class fc layer. We remove the average pooling layer and the fc layer and only use the convolutional layers to compute feature maps. We use the ResNet-101 released by the authors of [9], pre-trained on ImageNet [20]. The last convolutional block in ResNet-101 is 2048-d, and we attach a randomly initialized 1024-d $1 \times 1$ convolutional layer for reducing dimension (to be precise, this increases the depth in Table 1 by 1). Then we apply the $k^2(C+1)$-channel convolutional layer to generate score maps, as introduced next.
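
A rough PyTorch sketch of this backbone surgery follows (the released R-FCN code is Caffe/MATLAB, so the layer names, the $1 \times 1$ kernel size of the score-map layer, and the use of torchvision's ResNet-101 are assumptions made purely for illustration):

```python
import torch
import torch.nn as nn
import torchvision

k, C = 3, 20  # 3x3 grid, 20 object classes (+1 background)

# Fully convolutional trunk: ResNet-101 without global average pooling and the fc layer.
resnet = torchvision.models.resnet101()
trunk = nn.Sequential(*list(resnet.children())[:-2])   # ends with the 2048-d conv5 features

# Randomly initialized 1x1 conv reducing the 2048-d features to 1024-d,
# followed by the k^2(C+1)-channel conv that produces position-sensitive score maps.
reduce_dim = nn.Conv2d(2048, 1024, kernel_size=1)
score_maps = nn.Conv2d(1024, k * k * (C + 1), kernel_size=1)

x = torch.randn(1, 3, 600, 800)        # a single-scale input image
feats = trunk(x)                        # [1, 2048, ~H/32, ~W/32]; the paper lowers the
                                        # effective stride to 16 (see "À trous and stride")
maps = score_maps(reduce_dim(feats))    # [1, 189, h, w] position-sensitive score maps
print(maps.shape)
```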

Position-sensitive score maps & Position-sensitive RoI pooling. To explicitly encode position information into each RoI, we divide each RoI rectangle into $k \times k$ bins by a regular grid. For an RoI rectangle of a size $w \times h$, a bin is of a size $\approx \frac{w}{k} \times \frac{h}{k}$ [8, 6]. In our method, the last convolutional layer is constructed to produce $k^2$ score maps for each category. Inside the $(i,j)$-th bin ($0 \leq i, j \leq k-1$), we define a position-sensitive RoI pooling operation that pools only over the $(i,j)$-th score map:

$$r_c(i, j \mid \Theta) = \sum_{(x, y) \in \operatorname{bin}(i, j)} z_{i, j, c}(x + x_0,\, y + y_0 \mid \Theta) / n. \tag{1}$$

Here $r_c(i, j)$ is the pooled response in the $(i,j)$-th bin for the $c$-th category, $z_{i,j,c}$ is one score map out of the $k^2(C+1)$ score maps, $(x_0, y_0)$ denotes the top-left corner of an RoI, $n$ is the number of pixels in the bin, and $\Theta$ denotes all learnable parameters of the network. The $(i,j)$-th bin spans $\lfloor i\frac{w}{k} \rfloor \leq x < \lceil (i+1)\frac{w}{k} \rceil$ and $\lfloor j\frac{h}{k} \rfloor \leq y < \lceil (j+1)\frac{h}{k} \rceil$. The operation of Eqn.(1) is illustrated in Figure 1, where a color represents a pair of $(i,j)$. Eqn.(1) performs average pooling (as we use throughout this paper), but max pooling can be conducted as well.
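
The NumPy sketch below implements Eqn.(1) literally for a single RoI and a single category (the function name, the channel ordering of `score_maps`, and the toy inputs are hypothetical; the official implementation uses an optimized CUDA kernel):

```python
import numpy as np
from math import floor, ceil

def ps_roi_pool_single(score_maps, roi, k=3):
    """Position-sensitive RoI average pooling (Eqn. 1) for one category.

    score_maps: array of shape [k*k, H, W], one map per (i, j) bin.
    roi: (x0, y0, w, h) in score-map coordinates, integer valued.
    Returns the k x k grid of pooled responses r_c(i, j).
    """
    x0, y0, w, h = roi
    r = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # The (i, j)-th bin spans [floor(i*w/k), ceil((i+1)*w/k)) in x and
            # [floor(j*h/k), ceil((j+1)*h/k)) in y, offset by the RoI top-left corner.
            xs = list(range(x0 + floor(i * w / k), x0 + ceil((i + 1) * w / k)))
            ys = list(range(y0 + floor(j * h / k), y0 + ceil((j + 1) * h / k)))
            patch = score_maps[i * k + j][np.ix_(ys, xs)]
            r[i, j] = patch.mean()   # average over the n pixels in the bin
    return r

# Toy example: 9 score maps of size 48x64 and one RoI.
maps = np.random.rand(9, 48, 64)
print(ps_roi_pool_single(maps, roi=(10, 8, 21, 18)))
```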

The $k^2$ position-sensitive scores then vote on the RoI. In this paper we simply vote by averaging the scores, producing a $(C+1)$-dimensional vector for each RoI: $r_c(\Theta) = \sum_{i,j} r_c(i, j \mid \Theta)$. Then we compute the softmax responses across categories: $s_c(\Theta) = e^{r_c(\Theta)} / \sum_{c'=0}^{C} e^{r_{c'}(\Theta)}$. They are used for evaluating the cross-entropy loss during training and for ranking the RoIs during inference.
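
Continuing in the same sketch style (the array names are hypothetical), voting and the category softmax amount to:

```python
import numpy as np

C, k = 20, 3
# r has shape [C+1, k, k]: pooled responses r_c(i, j) for every category of one RoI.
r = np.random.rand(C + 1, k, k)

r_c = r.mean(axis=(1, 2))              # average voting over the k*k bins -> (C+1,)
s_c = np.exp(r_c) / np.exp(r_c).sum()  # softmax responses across categories
print(s_c.shape, s_c.sum())            # (21,) 1.0
```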

We further address bounding box regression [7, 6] in a similar way. Aside from the above $k^2(C+1)$-d convolutional layer, we append a sibling $4k^2$-d convolutional layer for bounding box regression. The position-sensitive RoI pooling is performed on this bank of $4k^2$ maps, producing a $4k^2$-d vector for each RoI. Then it is aggregated into a 4-d vector by average voting. This 4-d vector parameterizes a bounding box as $t = (t_x, t_y, t_w, t_h)$ following the parameterization in [6]. We note that we perform class-agnostic bounding box regression for simplicity, but the class-specific counterpart (i.e., with a $4k^2C$-d output layer) is applicable.
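
A shapes-only sketch of this class-agnostic regression branch (names are hypothetical):

```python
import numpy as np

k = 3
# Position-sensitive pooling on the sibling 4*k^2-channel bank yields, per RoI,
# a [4, k, k] block: one k x k grid for each of (tx, ty, tw, th).
pooled = np.random.rand(4, k, k)

t = pooled.mean(axis=(1, 2))   # average voting -> the 4-d vector (tx, ty, tw, th)
tx, ty, tw, th = t
print(t.shape)                 # (4,)
```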

The concept of position-sensitive score maps is partially inspired by [3], which develops FCNs for instance-level semantic segmentation. We further introduce the position-sensitive RoI pooling layer that shepherds learning of the score maps for object detection. There is no learnable layer after the RoI layer, enabling nearly cost-free region-wise computation and speeding up both training and inference.

Training. With pre-computed region proposals, it is easy to end-to-end train the R-FCN architecture. Following [6], our loss function defined on each RoI is the summation of the cross-entropy loss and the box regression loss: $L(s, t_{x,y,w,h}) = L_{cls}(s_{c^*}) + \lambda [c^* > 0] L_{reg}(t, t^*)$. Here $c^*$ is the RoI's ground-truth label ($c^* = 0$ means background). $L_{cls}(s_{c^*}) = -\log(s_{c^*})$ is the cross-entropy loss for classification, $L_{reg}$ is the bounding box regression loss as defined in [6], and $t^*$ represents the ground truth box. $[c^* > 0]$ is an indicator which equals 1 if the argument is true and 0 otherwise. We set the balance weight $\lambda = 1$ as in [6]. We define positive examples as the RoIs that have intersection-over-union (IoU) overlap with a ground-truth box of at least 0.5, and negative otherwise.
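
A per-RoI loss sketch under these definitions, using the smooth-L1 regression loss of [6] (function and variable names are mine):

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 loss of Fast R-CNN [6], applied elementwise and summed."""
    ax = np.abs(x)
    return np.sum(np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5))

def rfcn_roi_loss(s, c_star, t, t_star, lam=1.0):
    """L = L_cls(s_{c*}) + lambda * [c* > 0] * L_reg(t, t*)."""
    l_cls = -np.log(s[c_star])                              # cross-entropy on the softmax score
    l_reg = smooth_l1(t - t_star) if c_star > 0 else 0.0    # regression only for foreground RoIs
    return l_cls + lam * l_reg

s = np.array([0.1, 0.7, 0.2])   # softmax scores for (background, class 1, class 2)
print(rfcn_roi_loss(s, c_star=1, t=np.zeros(4), t_star=np.array([0.1, -0.2, 0.05, 0.0])))
```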

It is easy for our method to adopt online hard example mining (OHEM) [22] during training. Our negligible per-RoI computation enables nearly cost-free example mining. Assuming $N$ proposals per image, in the forward pass, we evaluate the loss of all $N$ proposals. Then we sort all RoIs (positive and negative) by loss and select $B$ RoIs that have the highest loss. Backpropagation [11] is performed based on the selected examples. Because our per-RoI computation is negligible, the forward time is nearly not affected by $N$, in contrast to OHEM Fast R-CNN in [22], which may double training time. We provide comprehensive timing statistics in Table 3 in the next section.
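
The OHEM step then reduces to a sort-and-select over the per-RoI losses; a sketch, assuming a per-RoI loss such as the one above:

```python
import numpy as np

def ohem_select(roi_losses, B=128):
    """Keep the B RoIs (positives and negatives together) with the highest loss."""
    order = np.argsort(roi_losses)[::-1]   # sort all N proposals by loss, descending
    return order[:B]                        # indices of the RoIs used for backpropagation

N = 300
roi_losses = np.random.rand(N)              # loss of every proposal from the forward pass
selected = ohem_select(roi_losses, B=128)
print(selected.shape)                       # (128,) hardest examples
```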

We use a weight decay of 0.0005 and a momentum of 0.9. By default we use single-scale training: images are resized such that the scale (shorter side of image) is 600 pixels [6, 18]. Each GPU holds 1 image and selects $B = 128$ RoIs for backprop. We train the model with 8 GPUs (so the effective mini-batch size is 8x). We fine-tune R-FCN using a learning rate of 0.001 for 20k mini-batches and 0.0001 for 10k mini-batches on VOC. To have R-FCN share features with RPN (Figure 2), we adopt the 4-step alternating training³ in [18], alternating between training RPN and training R-FCN.
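
Expressed with PyTorch's optimizer API, these hyper-parameters would look roughly as follows (a hedged sketch; the original training used Caffe with 8-GPU synchronous SGD, and the parameter list here is only a stand-in for a real model):

```python
import torch

params = [torch.nn.Parameter(torch.randn(10))]   # stand-in for model.parameters()

optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)
# 0.001 for the first 20k mini-batches, then 0.0001 for the next 10k (VOC fine-tuning).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20000], gamma=0.1)
```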

Inference. As illustrated in Figure 2, the feature maps shared between RPN and R-FCN are computed (on an image with a single scale of 600). Then the RPN part proposes RoIs, on which the R-FCN part evaluates category-wise scores and regresses bounding boxes. During inference we evaluate 300 RoIs as in [18] for fair comparisons. The results are post-processed by non-maximum suppression (NMS) using a threshold of 0.3 IoU [7], as standard practice.
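
A post-processing sketch with the stated settings, using `torchvision.ops.nms` as a stand-in for the paper's NMS step (the boxes and scores below are dummy data for one category):

```python
import torch
from torchvision.ops import nms

# Hypothetical per-class outputs for the 300 evaluated RoIs.
boxes = torch.rand(300, 4) * 600           # (x1, y1, w', h') drawn at random
boxes[:, 2:] += boxes[:, :2]               # convert to (x1, y1, x2, y2) with x2 > x1, y2 > y1
scores = torch.rand(300)                    # softmax score of one category

keep = nms(boxes, scores, iou_threshold=0.3)   # standard NMS at 0.3 IoU, as in [7]
print(keep.shape)                              # indices of the surviving detections
```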

À trous and stride. Our fully convolutional architecture enjoys the benefits of the network modifications that are widely used by FCNs for semantic segmentation [15, 2]. Particularly, we reduce ResNet-101's effective stride from 32 pixels to 16 pixels, increasing the score map resolution. All layers before and on the conv4 stage [9] (stride=16) are unchanged; the stride=2 operations in the first conv5 block are modified to have stride = 1, and all convolutional filters on the conv5 stage are modified by the "hole algorithm" [15, 2] ("Algorithme à trous" [16]) to compensate for the reduced stride. For fair comparisons, the RPN is computed on top of the conv4 stage (that are shared with R-FCN).
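
With torchvision's ResNet this stride/dilation change can be approximated in one call (an approximation of the "hole algorithm" modification, not the paper's exact Caffe edit):

```python
import torchvision

# Keep conv4 at stride 16, set the conv5 (layer4) stride to 1 and dilate its filters,
# so the effective stride of the trunk drops from 32 to 16.
resnet = torchvision.models.resnet101(replace_stride_with_dilation=[False, False, True])
```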


³ Although joint training [18] is applicable, it is not straightforward to perform example mining jointly.

