Original paper: arXiv:1512.02325
SSD: Single Shot MultiBox Detector
Wei Liu1, Dragomir Anguelov2, Dumitru Erhan3, Christian Szegedy3, Scott Reed4, Cheng-Yang Fu1, Alexander C. Berg1
1UNC Chapel Hill  2Zoox Inc.  3Google Inc.  4University of Michigan, Ann-Arbor
1wliu@cs.unc.edu, 2drago@zoox.com, 3{dumitru,szegedy}@google.com, 4reedscot@umich.edu, 1{cyfu,aberg}@cs.unc.edu
Abstract. We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X, and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at github.com/weiliu89/ca….
Keywords: Real-time Object Detection; Convolutional Neural Network
1 Introduction
Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN[2] albeit with deeper features such as [3]. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline (see related work in Sec. 4), but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy.
We achieved even better results using an improved data augmentation scheme in follow-on experiments: 77.2% mAP for 300×300 input and 79.8% mAP for 512×512 input on VOC2007. Please see Sec. 3.6 for details.
This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP 74.3% on VOC2007 test, vs. Faster R-CNN 7 FPS with mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this (cf. [4,5]), but by adding a series of improvements, we manage to increase the accuracy significantly over previous attempts. Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications, especially using multiple layers for prediction at different scales, we can achieve high accuracy using relatively low resolution input, further increasing detection speed. While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from 63.4% mAP for YOLO to 74.3% mAP for our SSD. This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks [3]. Furthermore, significantly improving the speed of high-quality detection can broaden the range of settings where computer vision is useful.
We summarize our contributions as follows:
- We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
- The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
- To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.
- These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs accuracy trade-off.
- Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.
2 The Single Shot Detector (SSD)
This section describes our proposed SSD framework for detection (Sec. 2.1) and the associated training methodology (Sec. 2.2). Afterwards, Sec. 3 presents dataset-specific model details and experimental results.
Fig. 1: SSD framework. (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8 × 8 and 4 × 4 in (b) and (c)). For each default box, we predict both the shape offsets and the confidences for all object categories ((c1, c2, ⋯, cp)). At training time, we first match these default boxes to the ground truth boxes. For example, we have matched two default boxes with the cat and one with the dog, which are treated as positives and the rest as negatives. The model loss is a weighted sum between localization loss (e.g. Smooth L1 [6]) and confidence loss (e.g. Softmax).
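The matching step described in the caption is driven by the overlap (Jaccard index, or IoU) between default boxes and ground truth boxes. The following is a minimal illustrative sketch, not the paper's exact procedure (which additionally force-matches each ground truth box to its best-overlapping default box); boxes are hypothetical (xmin, ymin, xmax, ymax) tuples:

```python
def iou(box_a, box_b):
    """Jaccard overlap (IoU) of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match(defaults, truths, threshold=0.5):
    """For each default box, return the index of the best-overlapping
    ground truth box if the overlap exceeds the threshold (a positive),
    or -1 otherwise (a negative)."""
    matches = []
    for d in defaults:
        overlaps = [iou(d, t) for t in truths]
        best = max(range(len(truths)), key=lambda i: overlaps[i])
        matches.append(best if overlaps[best] > threshold else -1)
    return matches
```

With one ground truth box, a default box sitting almost on top of it becomes a positive, while a barely-overlapping one stays negative.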
2.1 Model
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network. We then add auxiliary structure to the network to produce detections with the following key features:
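The non-maximum suppression step mentioned above is a greedy loop: keep the highest-scoring box, discard remaining boxes that overlap it too much, repeat. A minimal sketch, assuming (xmin, ymin, xmax, ymax) boxes and a hypothetical overlap threshold:

```python
def iou(a, b):
    """Jaccard overlap (IoU) of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop remaining boxes overlapping it above the threshold, repeat.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

Two near-duplicate detections of one object collapse to the higher-scoring one, while a far-away detection survives.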
Multi-scale feature maps for detection We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf. OverFeat [4] and YOLO [5] that operate on a single scale feature map).
Convolutional predictors for detection Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value. The bounding box offset output values are measured relative to a default
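The bookkeeping implied here is simple: a predictor head over an m × n feature map with k default boxes per location and c classes needs (c + 4)·k output channels from its 3 × 3 convolution. A hypothetical shape calculation (the function names are illustrative, not from the paper's code):

```python
def predictor_output_shape(m, n, num_classes, num_default_boxes):
    """Output tensor shape of one SSD predictor head: each of the m*n
    locations carries, per default box, c class scores plus 4 offsets."""
    channels = (num_classes + 4) * num_default_boxes
    return (m, n, channels)

def total_predictions(m, n, num_classes, num_default_boxes):
    """Total scalar outputs the head emits: (c + 4) * k * m * n."""
    m_, n_, ch = predictor_output_shape(m, n, num_classes, num_default_boxes)
    return m_ * n_ * ch
```

For example, a 5 × 5 map with 6 default boxes per cell and 21 classes (20 PASCAL VOC categories plus background) needs 150 output channels.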
We use the VGG-16 network as a base, but other networks should also produce good results.
Fig. 2: A comparison between two single shot detection models: SSD and YOLO [5]. Our SSD model adds several feature layers to the end of a base network, which predict the offsets to default boxes of different scales and aspect ratios and their associated confidences. SSD with a 300 × 300 input size significantly outperforms its 448 × 448 YOLO counterpart in accuracy on VOC2007 test while also improving the speed.
box position relative to each feature map location (cf. the architecture of YOLO [5] that uses an intermediate fully connected layer instead of a convolutional filter for this step).
Default boxes and aspect ratios We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters that are applied around each location in the feature map, yielding (c + 4)kmn outputs for a m × n feature map. For an illustration of default boxes, please refer to Fig. 1. Our default boxes are similar to the anchor boxes used in Faster R-CNN [2], however we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.
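The tiling described above can be sketched directly: one box per aspect ratio per cell, centred on the cell, in normalized [0, 1] image coordinates. A minimal sketch assuming a single scale per feature map and the common width = s·sqrt(ar), height = s/sqrt(ar) parameterization (how the scales are chosen per map is a training detail not covered here):

```python
from math import sqrt

def default_boxes(map_size, scale, aspect_ratios):
    """Generate (cx, cy, w, h) default boxes tiling a square map_size x
    map_size feature map: each cell gets one box per aspect ratio,
    centred on the cell, with size derived from the scale."""
    boxes = []
    for i in range(map_size):
        for j in range(map_size):
            cx = (j + 0.5) / map_size
            cy = (i + 0.5) / map_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * sqrt(ar), scale / sqrt(ar)))
    return boxes
```

A 4 × 4 map with 3 aspect ratios yields 48 fixed boxes; because box positions depend only on the cell index, the tiling is convolutional in exactly the sense of the text.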
2.2 Training
The key difference between training SSD and training a typical detector that uses region proposals, is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO[5] and for the region proposal stage of Faster R-CNN[2] and MultiBox[7]. Once this assignment is determined, the loss function and back propagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies.
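The hard negative mining mentioned above is typically implemented by ranking the unmatched default boxes by their confidence loss and keeping only the hardest ones, at a fixed negative-to-positive ratio. A minimal sketch (the 3:1 default ratio is a common SSD-style choice, used here as an assumption):

```python
def hard_negative_mining(conf_losses, positive_mask, neg_pos_ratio=3):
    """Keep all positive default boxes plus only the highest-loss
    negatives, capped at neg_pos_ratio negatives per positive.
    Returns the sorted indices of the boxes that contribute to the loss."""
    num_pos = sum(positive_mask)
    num_neg = neg_pos_ratio * num_pos
    negatives = [i for i, pos in enumerate(positive_mask) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)  # hardest first
    keep = {i for i, pos in enumerate(positive_mask) if pos}
    keep.update(negatives[:num_neg])
    return sorted(keep)
```

Since most default boxes are negatives, this keeps the positive/negative ratio bounded and focuses training on the negatives the classifier currently finds hardest.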