DSSD : Deconvolutional Single Shot Detector [Translation]



Original paper: arXiv:1701.06659

DSSD : Deconvolutional Single Shot Detector


Cheng-Yang Fu¹*, Wei Liu¹*, Ananth Ranga², Ambrish Tyagi², Alexander C. Berg¹ (¹UNC Chapel Hill, ²Amazon Inc.)

{cyfu, wliu}@cs.unc.edu, {ananthr, ambrisht}@amazon.com, aberg@cs.unc.edu


Abstract


The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with 513 × 513 input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method R-FCN [3] on each dataset.


1. Introduction


The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. The end result achieves the current highest accuracy for detection with a single network on PASCAL VOC [6] while also maintaining comparable speed with a previous state-of-the-art detector [3]. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research.


Putting this work in context, there has been a recent move in object detection back toward sliding-window techniques in the last two years. The idea is that instead of first proposing potential bounding boxes for objects in an image and then classifying them, as exemplified in selective search [27] and R-CNN [12] derived methods, a classifier is applied to a fixed set of possible bounding boxes in an image. While sliding-window approaches never completely disappeared, they had gone out of favor after the heydays of HOG [4] and DPM [7] due to the increasingly large number of box locations that had to be considered to keep up with the state of the art. They are coming back as more powerful machine learning frameworks integrating deep learning are developed. These allow fewer potential bounding boxes to be considered, but in addition to a classification score for each box, require predicting an offset to the actual location of the object, snapping to its spatial extent. Recently these approaches have been shown to be effective for bounding box proposals [5, 24] in place of bottom-up grouping of segmentation [27, 12]. Even more recently, these approaches were used to not only score bounding boxes as potential object locations, but to simultaneously predict scores for object categories, effectively combining the steps of region proposal and classification. This is the approach taken by You Only Look Once (YOLO) [23], which computes a global feature map and uses a fully-connected layer to predict detections in a fixed set of regions. Taking this single-shot approach further by adding layers of feature maps for each scale and using a convolutional filter for prediction, the Single Shot MultiBox Detector (SSD) [18] is significantly more accurate and is currently the best detector with respect to the speed-vs-accuracy trade-off.


When looking for ways to further improve the accuracy of detection, obvious targets are better feature networks and adding more context, especially for small objects, in addition to improving the spatial resolution of the bounding box



*Equal Contribution



Figure 1: Networks of SSD and DSSD on a residual network. The blue modules are the layers added in the SSD framework, and we call them SSD layers. In the bottom figure, the red layers are DSSD layers.


prediction process. Previous versions of SSD were based on the VGG [26] network, but many researchers have achieved better accuracy for tasks using Residual-101 [14]. Looking to concurrent research outside of detection, there has been work on integrating context using so-called "encoder-decoder" networks, where a bottleneck layer in the middle of a network is used to encode information about an input image and then progressively larger layers decode this into a map over the whole image. The resulting wide, narrow, wide structure of the network is often referred to as an hourglass. These approaches have been especially useful in recent work on semantic segmentation [21] and human pose estimation [20].


Unfortunately neither of these modifications, using the much deeper Residual-101 or adding deconvolution layers to the end of the SSD feature layers, works "out of the box". Instead it is necessary to carefully construct combination modules for integrating deconvolution, and output modules to insulate the Residual-101 layers during training and allow effective learning.


The code will be open sourced with models upon publication.


2. Related Work


The majority of object detection methods, including SPPnet [13], Fast R-CNN [11], Faster R-CNN [24], R-FCN [3] and YOLO [23], use the top-most layer of a ConvNet to learn to detect objects at different scales. Although powerful, this imposes a great burden on a single layer to model all possible object scales and shapes.


There are a variety of ways to improve detection accuracy by exploiting multiple layers within a ConvNet. The first set of approaches combines feature maps from different layers of a ConvNet and uses the combined feature map to do prediction. ION [1] uses L2 normalization [19] to combine multiple layers from VGGNet and pools features for object proposals from the combined layer. HyperNet [16] also follows a similar method and uses the combined layer to learn object proposals and to pool features. Because the combined feature map has features from different levels of abstraction of the input image, the pooled feature is more descriptive and better suited for localization and classification. However, the combined feature map not only increases the memory footprint of a model significantly but also decreases the speed of the model.

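The layer-combination idea behind ION can be sketched in a few lines of NumPy: normalize each feature map channel-wise to unit L2 norm (so layers with very different activation magnitudes can be mixed), then concatenate along the channel axis. The `scale` value and the shapes below are illustrative assumptions, not ION's actual configuration (ION learns the per-channel scale).

```python
import numpy as np

def l2_normalize(feature_map, scale=10.0, eps=1e-12):
    """Channel-wise L2 normalization over a (C, H, W) map: every spatial
    position's channel vector is scaled to unit norm, then multiplied by a
    scale that would normally be learned (fixed here for illustration)."""
    norm = np.sqrt((feature_map ** 2).sum(axis=0, keepdims=True)) + eps
    return scale * feature_map / norm

def combine_layers(maps):
    """Concatenate L2-normalized feature maps along the channel axis,
    assuming they were already resized to a common H x W."""
    return np.concatenate([l2_normalize(m) for m in maps], axis=0)
```

This keeps each source layer's relative channel structure while putting all layers on a comparable scale before pooling features from the combined map.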

Figure 2: Variants of the prediction module


Another set of methods uses different layers within a ConvNet to predict objects of different scales. Because the nodes in different layers have different receptive fields, it is natural to predict large objects from layers with large receptive fields (called higher or later layers within a ConvNet) and use layers with small receptive fields to predict small objects. SSD [18] spreads out default boxes of different scales to multiple layers within a ConvNet and enforces each layer to focus on predicting objects of a certain scale. MS-CNN [2] applies deconvolution on multiple layers of a ConvNet to increase feature map resolution before using the layers to learn region proposals and pool features. However, in order to detect small objects well, these methods need to use some information from shallow layers with small receptive fields and dense feature maps, which may cause low performance on small objects because shallow layers have less semantic information about objects. By using deconvolution layers and skip connections, we can inject more semantic information into dense (deconvolution) feature maps, which in turn helps predict small objects.

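The way SSD "spreads out default boxes of different scales to multiple layers" follows a simple linear rule from the SSD paper: scale s_k = s_min + (s_max - s_min)(k - 1)/(m - 1) for layer k of m. A minimal sketch (s_min = 0.2 and s_max = 0.9 are the paper's defaults; individual SSD models tune these, so treat the values as assumptions):

```python
def default_box_scales(num_layers, s_min=0.2, s_max=0.9):
    """Per-layer default-box scales, linearly spaced between s_min and s_max.
    Earlier (denser) feature maps get small scales for small objects;
    later layers get large scales for large objects."""
    m = num_layers
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
```

For a 6-layer SSD head this yields scales from 0.2 up to 0.9 of the input size, one per prediction layer.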

There is another line of work which tries to include context information for prediction. Multi-Region CNN [10] pools features not only from the region proposal but also pre-defined regions such as half parts, center, border and the context area. Following many existing works on semantic segmentation [21] and pose estimation [20], we propose to use an encoder-decoder hourglass structure to pass context information before doing prediction. The deconvolution layers not only addresses the problem of shrinking resolution of feature maps in convolution neural networks, but also brings in context information for prediction.


3. Deconvolutional Single Shot Detection (DSSD) model


We begin by reviewing the structure of SSD and then describe the new prediction module that produces significantly improved training effectiveness when using Residual-101 as the base network for SSD. Next we discuss how to add deconvolution layers to make an hourglass network, and how to integrate the new deconvolutional module to pass semantic context information for the final DSSD model.


SSD


The Single Shot MultiBox Detector (SSD [18]) is built on top of a "base" network that ends (or is truncated to end) with some convolutional layers. SSD adds a series of progressively smaller convolutional layers, as shown in blue on top of Figure 1 (the base network is shown in white). Each of the added layers, and some of the earlier base network layers, are used to predict scores and offsets for some predefined default bounding boxes. These predictions are performed by 3 × 3 × #channels dimensional filters, one filter for each category score and one for each dimension of the bounding box that is regressed. It uses non-maximum suppression (NMS) to post-process the predictions to get final detection results. More details can be found in [18], where the detector uses VGG [26] as the base network.

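The NMS post-processing step mentioned above can be sketched as a greedy loop over score-sorted boxes: keep the highest-scoring box, drop everything that overlaps it too much, and repeat. A minimal NumPy version (the 0.45 overlap threshold is SSD's default at test time and should be treated as an assumption here):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression over (N, 4) boxes in (x1, y1, x2, y2)
    form. Returns the indices of the boxes to keep, best score first."""
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # intersection-over-union between box i and all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # suppress heavy overlaps
    return keep
```

In practice SSD runs this per class on the boxes that survive a confidence threshold.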

3.1. Using Residual-101 in place of VGG


Our first modification is using Residual-101 in place of the VGG used in the original SSD paper; in particular we use the Residual-101 network from [14]. The goal is to improve accuracy. Figure 1 top shows SSD with Residual-101 as the base network. Here we are adding layers after the conv5_x block, and predicting scores and box offsets from conv3_x, conv5_x, and the additional layers. By itself this does not improve results. Considering the ablation study results in Table 4, the top row shows a mAP of 76.4 for SSD with Residual-101 on 321 × 321 inputs for the PASCAL VOC 2007 test. This is lower than the 77.5 for SSD with VGG on 300 × 300 inputs (see Table 3). However, adding an additional prediction module, described next, increases performance significantly.


Prediction module


In the original SSD [18], the objective functions are applied on the selected feature maps directly, and an L2 normalization layer is used for the conv4_3 layer because of the large magnitude of the gradient. MS-CNN [2] points out that improving the sub-network of each task can improve accuracy. Following this principle, we add one residual block for each prediction layer, as shown in Figure 2 variant (c). We also tried the original SSD approach (a), a version of the residual block with a skip connection (b), and two sequential residual blocks (d). Ablation studies with the different prediction modules are shown in Table 4 and discussed in Section 4. We note that Residual-101 and the prediction module seem to perform significantly better than VGG without the prediction module for higher resolution input images.

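A toy version of prediction-module variant (c) can illustrate its skip-sum structure. This sketch replaces the real 1×1/3×3 convolutions and batch normalization with plain 1×1 (per-pixel) linear maps, so it shows only the residual wiring in front of a prediction layer, not the actual DSSD sub-network; all shapes and weights below are illustrative assumptions.

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution on a (C_in, H, W) map is just a per-pixel linear
    map over channels: w has shape (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))  # -> (C_out, H, W)

def prediction_module_c(x, w1, w2, w_skip):
    """Residual block inserted before each prediction layer (variant (c)):
    a projected identity branch is summed element-wise with a small
    conv + ReLU + conv branch, then passed through a final ReLU."""
    identity = conv1x1(x, w_skip)           # project input to match channels
    out = np.maximum(conv1x1(x, w1), 0)     # conv + ReLU
    out = conv1x1(out, w2)                  # second conv
    return np.maximum(out + identity, 0)    # skip-sum, then ReLU
```

The prediction filters for class scores and box offsets would then read from the output of this block instead of from the raw feature map.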

Figure 3: Deconvolution module


Deconvolutional SSD


In order to include more high-level context in detection, we move prediction to a series of deconvolution layers placed after the original SSD setup, effectively making an asymmetric hourglass network structure, as shown in Figure 1 bottom. The DSSD model in our experiments is built on SSD with Residual-101. Extra deconvolution layers are added to successively increase the resolution of feature map layers. In order to strengthen features, we adopt the "skip connection" idea from the Hourglass model [20]. Although the hourglass model contains symmetric layers in both the Encoder and Decoder stage, we make the decoder stage extremely shallow for two reasons. First, detection is a fundamental task in vision and may need to provide information for the downstream tasks. Therefore, speed is an important factor. Building the symmetric network means the time for inference will double. This is not what we want in this fast detection framework. Second, there are no pre-trained models which include a decoder stage trained on the classification task of ILSVRC CLS-LOC dataset [25] because classification gives a single whole image label instead of a local label as in detection. State-of-the-art detectors rely on the power of transfer learning. The model pre-trained on the classification task of ILSVRC CLS-LOC dataset [25] makes the accuracy of our detector higher and converge faster compared to a randomly initialized model. Since there is no pre-trained model for our decoder, we cannot take the advantage of transfer learning for the decoder layers which must be trained starting from random initialization. An important aspect of the deconvolution layers is computational cost, especially when adding information from the previous layers in addition to the deconvolutional process.


Deconvolution Module


In order to help integrate information from earlier feature maps and the deconvolution layers, we introduce a deconvolution module as shown in Figure 3. This module fits into the overall DSSD architecture as indicated by the solid circles in the bottom of Figure 1. The deconvolution module is inspired by Pinheiro et al. [22], who suggested that a factored version of the deconvolution module for a refinement network has the same accuracy as a more complicated one while being more efficient. We make the following modifications, shown in Figure 3. First, a batch normalization layer is added after each convolution layer. Second, we use a learned deconvolution layer instead of bilinear upsampling. Last, we test different combination methods: element-wise sum and element-wise product. The experimental results show that the element-wise product provides the best accuracy (see Table 4, bottom sections).

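The combination step at the end of the deconvolution module can be sketched as follows. The toy `batch_norm` here computes statistics from the map itself instead of using trained BN parameters (no learned gamma/beta), so only the normalize-then-combine structure and the sum-vs-product choice mirror the paper; everything else is a simplifying assumption.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Per-channel normalization of a (C, H, W) map, standing in for a
    trained batch-normalization layer."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def deconv_module_combine(lateral, upsampled, mode="prod"):
    """Combine the lateral SSD feature map with the deconvolved (upsampled)
    map. Both branches are normalized first; DSSD found the element-wise
    product slightly more accurate than the sum (Table 4)."""
    a, b = batch_norm(lateral), batch_norm(upsampled)
    return a * b if mode == "prod" else a + b
```

The resulting map is what the prediction module of that scale reads from in the DSSD decoder.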

Training


We follow almost the same training policy as SSD. First, we have to match a set of default boxes to target ground truth boxes. For each ground truth box, we match it with the

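SSD's matching strategy, as described in [18], matches each ground-truth box to the default box with the best Jaccard (IoU) overlap, and then also matches any default box whose overlap with a ground truth exceeds 0.5. A minimal pure-Python sketch under that assumption:

```python
def iou(box_a, box_b):
    """Jaccard overlap of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_default_boxes(defaults, ground_truths, thresh=0.5):
    """Return {default_box_index: ground_truth_index} matches.
    Every default box above the overlap threshold is matched, and each
    ground truth additionally claims its single best-overlapping default."""
    matches = {}
    for g_idx, g in enumerate(ground_truths):
        overlaps = [iou(d, g) for d in defaults]
        for d_idx, o in enumerate(overlaps):
            if o > thresh:
                matches[d_idx] = g_idx
        best = max(range(len(defaults)), key=lambda i: overlaps[i])
        matches[best] = g_idx               # best overlap is always matched
    return matches
```

Matched default boxes become positives for the classification and localization losses; the rest are negatives, from which SSD hard-mines a subset.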
