Original paper: arXiv:1712.01802
R-FCN-3000 at 30fps: Decoupling Detection and Classification
Bharat Singh*¹ Hengduo Li*² Abhishek Sharma Larry S. Davis
University of Maryland, College Park · Fudan University · Gobasco AI Labs
{bharat,lsd}@cs.umd.edu · lihd14@fudan.edu.cn · abhisharayiya@gmail.com
Abstract
We present R-FCN-3000, a large-scale real-time object detector in which objectness detection and classification are decoupled. To obtain the detection score for an RoI, we multiply the objectness score with the fine-grained classification score. Our approach is a modification of the R-FCN architecture in which position-sensitive filters are shared across different object classes for performing localization. For fine-grained classification, these position-sensitive filters are not needed. R-FCN-3000 obtains an mAP of 34.9% on the ImageNet detection dataset and outperforms YOLO-9000 by 18% while processing 30 images per second. We also show that the objectness learned by R-FCN-3000 generalizes to novel classes and that the performance increases with the number of training object classes - supporting the hypothesis that it is possible to learn a universal objectness detector. Code will be made available.
1. Introduction
With the advent of deep CNNs [16, 20], object detection has witnessed a quantum leap in performance on benchmark datasets, owing to the powerful feature learning capabilities of deep CNN architectures. Within the last five years, the mAP scores on PASCAL [9] and COCO [24] have improved from 33% to 88% and from 37% to 73% (at 50% overlap), respectively. While there have been massive improvements on standard benchmarks with tens of classes, little progress has been made towards real-life object detection that requires real-time detection of thousands of classes. Some recent efforts [30, 17] in this direction have led to large-scale detection systems, but at the cost of accuracy. We propose R-FCN-3000, a solution to the large-scale object detection problem that outperforms YOLO-9000 [30] by 18% and can process 30 images per second while detecting 3000 classes.
R-FCN-3000 is the result of systematic modifications to some of the recent object-detection architectures to afford real-time large-scale object detection. The recently proposed fully convolutional class of detectors computes a per-class objectness score for a given image and has shown impressive accuracy within limited computational budgets. Although fully convolutional representations provide an efficient [19] solution for tasks like object detection [6], instance segmentation [22], tracking [10], relationship detection [41] etc., they require a class-specific set of filters for each class, which prohibits their application to a large number of classes. For example, R-FCN [5] / Deformable-R-FCN [6] requires 49/197 position-specific filters for each class. RetinaNet [23] requires 9 filters per class for each convolutional feature map. Therefore, such architectures would need hundreds of thousands of filters for detecting 3000 classes, which would make them extremely slow for practical purposes.
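A quick back-of-the-envelope calculation, using the per-class filter counts quoted above, makes the scaling problem concrete (the arithmetic is ours, not from the paper):

```python
# Class-specific position-sensitive filters scale linearly with the number
# of classes, which is what makes these detectors impractical at 3000 classes.
num_classes = 3000
print(num_classes * 49)    # R-FCN [5]:            147000 filters
print(num_classes * 197)   # Deformable-R-FCN [6]: 591000 filters
```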
Figure 1. We propose to decouple classification and localization by independently predicting objectness and classification scores. These scores are multiplied to obtain a detector.
The key insight behind the proposed R-FCN-3000 architecture is to decouple objectness detection and classification of the detected object so that the computational requirements for localization remain constant as the number of classes increases - see Fig. 1. We leverage the fact that many object categories are visually similar and share parts. For example - different breeds of dogs all have common body parts; therefore, learning a different set of filters for detecting each breed is overkill. So, R-FCN-3000 performs object detection (with position-sensitive filters) for a fixed number of super-classes followed by fine-grained classification (without position-sensitive filters) within each superclass. The super-classes are obtained by clustering the deep semantic features of images (2048 dimensional features of ResNet-101 in this case); therefore, we do not require a semantic hierarchy. The fine-grained class probability at a given location is obtained by multiplying the super-class probability with the classification probability of the fine-grained category within the super-class.
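As a toy illustration of this factorization (the numbers below are invented; only the multiplication mirrors the scoring rule described above):

```python
# Detection score for an RoI = super-class (objectness) probability
# multiplied by the fine-grained classification probability.
p_super = 0.9                                          # "dog-like" super-class score
p_fine = {"husky": 0.7, "beagle": 0.2, "poodle": 0.1}  # softmax within the super-class

scores = {cls: p_super * p for cls, p in p_fine.items()}
print(scores)  # approximately {'husky': 0.63, 'beagle': 0.18, 'poodle': 0.09}
```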
*Equal Contribution. Work done during H. Li's internship at UMD.
In order to study the effect of using super-classes instead of individual object categories, we varied the number of super-classes from 1 to 100 and evaluated the performance on the ImageNet detection dataset. Surprisingly, the detector performs well even with one super-class! This observation indicates that position-sensitive filters can potentially learn to detect universal objectness. It also reaffirms a well-researched concept from the past: that objectness is a generic concept and a universal objectness detector can be learned. Thus, for performing object detection, it suffices to multiply the objectness score of an RoI with the classification probability for a given class. This results in a fast detector for thousands of classes, as per-class position-sensitive filters are no longer needed. On the PASCAL-VOC dataset, with only our objectness-based detector, we observe a small drop in mAP compared to the deformable R-FCN [6] detector with class-specific filters for all 20 object classes. R-FCN-3000, trained for 3000 classes, obtains an 18% improvement in mAP over the current state-of-the-art large-scale object detector (YOLO-9000) on the ImageNet detection dataset. Finally, we also evaluate the generalizability of our objectness detector on unseen classes (a zero-shot setting for localization) and observe that the generalization error decreases as we train the objectness detector on larger numbers of classes.
2. Related Work
Large-scale localization using deep convolutional networks was first performed with networks that regress the locations of bounding boxes. Later, RPN [31] was used for localization in ImageNet classification [15]. However, no evaluations were performed to determine whether these networks generalize when applied to detection datasets without specifically training on them. Weakly supervised detection has been a major focus over the past few years for solving large-scale object detection. In [17], knowledge from detectors trained with bounding boxes was transferred to classes for which no bounding boxes are available. The assumption is that it is possible to train object detectors on a fixed number of classes; for a class for which supervision is not available, transformations are learned to adapt the classifier into a detector. Multiple-instance learning based approaches have also been proposed which can leverage weakly supervised data for adapting classifiers to detectors [18]. Recently, YOLO-9000 [30] jointly trains on classification and detection data. When it sees a classification image, the classification loss is back-propagated on the bounding box which has the highest probability. It assumes that the predicted box is the ground-truth box and uses the difference between other anchors and the predicted box as the objectness loss. YOLO-9000 is fast, as it uses a lightweight network and only 3 filters per class for performing localization; however, just 3 priors are not sufficient for good localization.
For classifying and localizing a large number of classes, some methods leverage the fact that parts can be shared across object categories [27, 32, 37, 28]. Sharing filters for object parts reduces model complexity and also reduces the amount of training data required for learning part-based filters. Even in traditional methods, it has been shown that shared filters are more generic [37]. However, current detectors like Deformable-R-FCN [6], R-FCN [5] and RetinaNet [23] do not share filters (in the final classification layer) across object categories; because of this, inference is slow when they are applied to thousands of categories. Taking motivation from prior work on sharing filters across object categories, we propose an architecture in which filters can be shared across some object categories for large-scale object detection.
The extreme version of sharing parts is objectness, where we assume that all objects have something in common. Early in this decade (if not before), it was proposed that objectness is a generic concept, and it was demonstrated that only a very few category-agnostic proposals are sufficient to obtain high recall. With a bag-of-words feature representation [21] for these proposals, better performance was shown compared to a sliding-window-based part-based model [11] for object detection. R-CNN [13] used the same proposals for object detection but also applied per-class bounding-box regression to refine the location of these proposals. Subsequently, it was observed that per-class regression is not necessary and a class-agnostic regression step is sufficient to refine the proposal position [5]. Therefore, if the regression step is class-agnostic and a reasonable objectness score can be obtained, a simple classification layer should be sufficient to perform detection: we can simply multiply the objectness probability with the classification probability to make a detector! Therefore, in the extreme case, we set the number of super-classes to one and show that we can train a detector which obtains an mAP very close to that of state-of-the-art object detection architectures [5].
3. Background
This section provides a brief introduction to Deformable R-FCN [6], which is used in R-FCN-3000. In R-FCN [5], atrous convolution [4] is used in the conv5 layer to increase the resolution of the feature map while still utilizing the pre-trained weights from the ImageNet classification network. In Deformable-R-FCN [6], the atrous convolution is replaced by a deformable convolution structure in which a separate branch predicts offsets for each pixel in the feature map, and the convolution kernel is applied after the offsets have been applied to the feature map. A region proposal network (RPN), a two-layer CNN on top of the conv4 features, is used for generating object proposals. Efficiently implemented local convolutions, referred to as position-sensitive filters, are used to classify these proposals.
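A minimal sketch of the deformable-convolution idea described above, using `torchvision.ops.deform_conv2d`; the channel sizes and shapes are illustrative, not the paper's:

```python
import torch
from torchvision.ops import deform_conv2d

# A separate conv branch predicts per-pixel (dy, dx) offsets for every tap of
# the 3x3 kernel, and the main kernel samples the feature map at those
# offset locations.
x = torch.randn(1, 256, 14, 14)                                # conv5-style feature map
offset_branch = torch.nn.Conv2d(256, 2 * 3 * 3, 3, padding=1)  # 2 offsets per kernel tap
weight = torch.randn(256, 256, 3, 3)                           # main 3x3 deformable kernel

offsets = offset_branch(x)                   # (1, 18, 14, 14)
y = deform_conv2d(x, offsets, weight, padding=1)
print(y.shape)                               # torch.Size([1, 256, 14, 14])
```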
4. Large Scale Fully-Convolutional Detector
This section describes the process of training a large-scale object detector. We first explain the training data requirements and then discuss some of the challenges involved in training such a system: design decisions for making training and inference efficient, appropriate loss functions for a large number of classes, and mitigating the domain shift that arises when training on classification data.
4.1. Weakly Supervised vs. Supervised?
Obtaining an annotated dataset of thousands of classes is a major challenge for large-scale detection. Ideally, a system that can learn to detect object instances using partial image-level tags (class labels) for the objects present in training images would be preferable, because large-scale training data is readily available on the internet in this format. Since the setting with partial annotations is very challenging, it is commonly assumed that labels are available for all the objects present in the image. This is referred to as the weakly supervised setting. Unfortunately, explicit boundaries of objects, or at least bounding boxes, are required as the supervision signal for training accurate object detectors. This is the supervised setting. The performance gap between supervised and weakly supervised detectors is large: even 2015 object detectors [15] outperformed recent weakly supervised detectors [8] by a wide margin on the PASCAL VOC 2007 dataset. This gap is a direct result of the insufficient learning signal coming from weak supervision and can be further explained with the help of an example. For classifying a dog among 1000 categories, only the body texture or facial features of a dog may be sufficient, and the network need not learn the visual properties of its tail or legs for correct classification. Therefore, it may never learn that legs or a tail are parts of the dog category, which is essential for obtaining accurate boundaries.
On one hand, the huge cost of annotating bounding boxes for thousands of classes under settings similar to popular detection datasets such as PASCAL or COCO makes it prohibitively expensive to collect and annotate a large-scale detection dataset. On the other hand, the poor performance of weakly supervised detectors impedes their deployment in real-life applications. Therefore, we ask: is there a middle ground that can alleviate the cost of annotation while yielding accurate detectors? Fortunately, the ImageNet database contains around 1-2 objects per image; therefore, annotating the bounding boxes for the objects takes only a few seconds, compared to several minutes in COCO [24]. It is for this reason that bounding boxes were also collected while annotating ImageNet! A potential downside of using ImageNet for training object detectors is the loss of the variation in scale and context around objects available in detection datasets, but we do have access to the bounding boxes of the objects. Therefore, a natural question to ask is: how would an object detector perform on "detection" datasets if it were trained on classification datasets with bounding-box supervision? We show that careful design choices with respect to the CNN architecture, loss function and training protocol can yield a large-scale detector trained on the ImageNet classification set with significantly better accuracy compared to weakly supervised detectors.
4.2. Super-class Discovery
Fully convolutional object detectors learn class-specific filters based on scale and aspect-ratio [23] or in the form of position-sensitive filters for each class. Therefore, when the number of classes becomes large, it becomes computationally infeasible to apply these detectors. Hence we ask: is it necessary to have sets of filters for each class, or can they be shared across visually similar classes? In the extreme case, can detection be performed using just a foreground/background detector and a classification network? To obtain sets of visually similar objects for which position-sensitive filters can be shared, objects should have similar visual appearances. We obtain the object-class representation $\mathbf{x}_j$ of the $j$-th object-class by taking the average of the 2048-dimensional feature-vectors from the final layer of ResNet-101 over all the samples belonging to that object-class in the ImageNet classification dataset (validation set). Super-classes are then obtained by applying K-means clustering on $\{\mathbf{x}_j\}_{j=1}^{C}$, where $C$ is the number of object-classes, to obtain $K$ super-class clusters.
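A sketch of this super-class discovery procedure, assuming the 2048-dimensional ResNet-101 features have already been extracted; the function and variable names are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each entry of `features_per_class` holds the (num_samples, 2048) final-layer
# ResNet-101 features of one object-class, as described above.
def discover_superclasses(features_per_class, num_superclasses):
    # One 2048-d representation per object-class: the mean over its samples.
    class_repr = np.stack([f.mean(axis=0) for f in features_per_class])
    km = KMeans(n_clusters=num_superclasses, n_init=10).fit(class_repr)
    return km.labels_  # super-class id for every object-class

# Toy usage (the paper clusters 3000 classes into up to 100 super-classes):
feats = [np.random.randn(10, 2048) for _ in range(20)]
print(discover_superclasses(feats, num_superclasses=4))
```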
4.3. Architecture
First, RPN is used for generating proposals, as in [6]. Let the set of individual object-classes on which the detector is being trained be $\mathcal{C}$, and the set of super-classes (SC) be $\mathcal{K}$, with $K$ super-classes. For each super-class $k$, suppose we have $P^2$ position-sensitive filters, as shown in Fig. 2. On the conv5 feature, we first apply two independent convolution layers as in R-FCN for obtaining detection scores and bounding-box regression offsets. On each of these branches, after a non-linearity, we apply position-sensitive filters for classification and bounding-box regression. Since we have $K$ super-classes and $P^2$ filters per super-class, there are $(K+1)P^2$ filters in the classification branch (1 more for background) and $4P^2$ filters in the bounding-box regression branch, as this branch is class-agnostic. After performing position-sensitive RoI pooling and averaging the predictions in each bin, we obtain the predictions of the network for classification and localization. To get the super-class probability, a softmax function over the $K+1$ super-class scores is used, and the predictions from the localization branch are directly added to obtain the final position of the detection. These two branches help detect the super-classes represented by each cluster. For obtaining fine-grained class information, we employ a two-layer CNN on the conv5 feature map, as shown in Fig. 2. A softmax function is used on the output of this layer for obtaining the final class probability. The detection and classification probabilities are multiplied to obtain the final detection score for each object-class. This architecture is shown in Fig. 2. Even though there are other challenges, such as entailment, cover, equivalence etc., which are not correctly modelled by the softmax function, the Top-1 accuracy even on the ImageNet-5000 classification dataset is high [40]. So, we believe these are not the bottlenecks for detecting a few thousand classes.
Figure 2. R-FCN-3000 first generates region proposals which are provided as input to a super-class detection branch (like R-FCN) which jointly predicts the detection scores for each super-class (sc). A class-agnostic bounding-box regression step refines the position of each RoI (not shown). To obtain the semantic class, we do not use position-sensitive filters but predict per class scores in a fully convolutional fashion. Finally, we average pool the per-class scores inside the RoI to get the classification probability. The classification probability is multiplied with the super-class detection probability for detecting 3000 classes. When K is 1, the super-class detector predicts objectness.
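A shape-level sketch of the two scoring branches in Fig. 2, written with torchvision's position-sensitive RoI pooling; the layer sizes, the class-to-super-class mapping, and all names are illustrative assumptions, and the bounding-box regression branch is omitted:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import ps_roi_pool, roi_align

K, P, C = 100, 7, 3000                     # super-classes, PS grid size, fine-grained classes
conv5 = torch.randn(1, 2048, 38, 50)       # backbone feature map (stride 16)
rois = torch.tensor([[0., 40., 40., 400., 300.]])  # (batch_idx, x1, y1, x2, y2)

# Super-class detection branch: (K + 1) * P^2 position-sensitive filters.
ps_cls = torch.nn.Conv2d(2048, (K + 1) * P * P, kernel_size=1)
sc_bins = ps_roi_pool(ps_cls(conv5), rois, output_size=P, spatial_scale=1 / 16)
sc_prob = F.softmax(sc_bins.mean(dim=(2, 3)), dim=1)           # (num_rois, K + 1)

# Fine-grained branch: plain per-class score maps, no position-sensitive
# filters; average the scores inside each RoI (roi_align with a 1x1 output
# behaves like average pooling here).
fg_conv = torch.nn.Conv2d(2048, C, kernel_size=1)
fg_scores = roi_align(fg_conv(conv5), rois, output_size=1, spatial_scale=1 / 16)
fg_prob = F.softmax(fg_scores.flatten(1), dim=1)               # (num_rois, C)

# Final score: super-class prob (objectness when K == 1) x fine-grained prob.
superclass_of = torch.randint(0, K, (C,))  # assumed class -> super-class mapping
detection_scores = sc_prob[:, superclass_of] * fg_prob         # (num_rois, C)
```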
4.4. Label Assignment
Labels are assigned exactly the same way as in Fast-RCNN [12] for the $K$ super-classes on which detection is performed. Let $C$ be the total number of object-classes, let $SC_k$ denote the $k$-th super-class, and let $c_i^k$ denote the $i$-th object-class (sub-class) in $SC_k$; then every object-class belongs to exactly one super-class and $C = \sum_{k=1}^{K} |SC_k|$. For detecting super-class $k$, an RoI is assigned as positive for super-class $k$ if it has an intersection over union (IoU) greater than 0.5 with any of the ground-truth boxes of the classes in $SC_k$; otherwise it is marked as background (the label for the background class is set to one). For the classification branch (to get the final 3000 classes), only positive RoIs are used for training, i.e., only those which have an IoU greater than 0.5 with a ground-truth bounding box. The number of labels for classification is $C$, instead of $K+1$ in detection.
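The assignment rule above could be sketched as follows (a simplified, single-RoI version with boxes in (x1, y1, x2, y2) form; the helper names are ours):

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union of one box against an (N, 4) array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def assign_labels(roi, gt_boxes, gt_super, gt_fine, background_label):
    overlaps = iou(roi, gt_boxes)
    best = overlaps.argmax()
    if overlaps[best] > 0.5:
        # Positive RoI: supervises both detection and classification branches.
        return gt_super[best], gt_fine[best]
    return background_label, None  # background RoI: no fine-grained label
```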
4.5. Loss Function
For training the detector, we use online hard example mining (OHEM) [34], as done in [6], and the smooth L1 loss for bounding-box localization [12]. For fine-grained classification, we only use a softmax loss function over the object-classes for classifying the positive bounding boxes. Since the number of positive RoIs is typically small compared to the number of proposals, the loss from this branch is weighted by a factor of 0.05, so that these gradients do not dominate network training. This is important because we train the RPN layers, the R-FCN classification and localization layers, and the fine-grained layers together in a multi-task fashion, so balancing the loss from each branch is important.
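A sketch of the combined objective (OHEM is omitted for brevity; only the 0.05 weight and the loss types come from the text, while the function signature and tensor names are our assumptions):

```python
import torch
import torch.nn.functional as F

def total_loss(sc_logits, sc_labels, bbox_pred, bbox_targets,
               fg_logits, fg_labels, positive_mask):
    det_cls = F.cross_entropy(sc_logits, sc_labels)          # super-class detection
    loc = F.smooth_l1_loss(bbox_pred[positive_mask],         # bbox regression,
                           bbox_targets[positive_mask])      # positives only
    # Fine-grained softmax loss on positive RoIs only, down-weighted by 0.05
    # so its gradients do not dominate the multi-task training.
    fg = F.cross_entropy(fg_logits[positive_mask], fg_labels[positive_mask])
    return det_cls + loc + 0.05 * fg
```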
5. Experiments
In this section, we describe the implementation details of the proposed large-scale object detector and compare against some of the weakly supervised large-scale object detectors in terms of speed and accuracy.