Feature Pyramid Networks for Object Detection



Original paper: arXiv:1612.03144

Feature Pyramid Networks for Object Detection


Tsung-Yi Lin¹,², Piotr Dollár¹, Ross Girshick¹,

Kaiming He¹, Bharath Hariharan¹, and Serge Belongie²

¹ Facebook AI Research (FAIR)

² Cornell University and Cornell Tech

Abstract


Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.


1. Introduction


Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object's scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.


Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.


Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.


Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.


However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.


The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet's pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.


The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet's feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.


Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.


We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.


Figure 2. Top: a top-down architecture with skip connections, where predictions are made on the finest level (e.g., [28]). Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.


In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. Moreover, this improvement is achieved without increasing testing time over the single-scale baseline. We believe these advances will facilitate future research and applications. Our code will be made publicly available.


2. Related Work


Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. There has also been significant interest in computing featurized image pyramids quickly. Dollár et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.


Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet. SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale. Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. Multi-scale detection, however, still performs better, especially for small objects.


Methods using multiple layers. A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. Hypercolumns [13] uses a similar method for object instance segmentation. Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores.


There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].


3. Feature Pyramid Networks


Our goal is to leverage a ConvNet's pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general-purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance segmentation proposals in Sec. 6.


Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.


Figure 3. A building block illustrating the lateral connection and the top-down pathway, merged by addition.


Bottom-up pathway. The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size and we say these layers are in the same network stage. For our feature pyramid, we define one pyramid level for each stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create our pyramid. This choice is natural since the deepest layer of each stage should have the strongest features.


Specifically, for ResNets [16] we use the feature activations output by each stage's last residual block. We denote the output of these last residual blocks as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.


Top-down pathway and lateral connections. The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.


Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.

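As an illustration, one merge step of this building block can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the 1×1 lateral convolution is modeled as a channel-mixing matrix multiply, the trailing 3×3 anti-aliasing convolution is omitted, and all shapes and weights are made up for the example.

```python
import numpy as np

def upsample2x_nearest(x):
    """Nearest-neighbor 2x spatial upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    """1x1 convolution expressed as a channel-mixing matmul; w is (C_out, C_in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def merge_level(top_down, bottom_up, w_lateral):
    """One FPN merge: upsample the coarser map, add the 1x1-projected lateral map."""
    return upsample2x_nearest(top_down) + conv1x1(bottom_up, w_lateral)

rng = np.random.default_rng(0)
p5 = rng.standard_normal((256, 8, 8))    # coarsest map, from a 1x1 conv on C5
c4 = rng.standard_normal((512, 16, 16))  # bottom-up map with more channels
w_lat = rng.standard_normal((256, 512)) * 0.01
p4 = merge_level(p5, c4, w_lat)
print(p4.shape)  # (256, 16, 16)
```

Iterating this step from P5 down to P2 (followed by the 3×3 convolution omitted above) yields the full top-down pathway.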

Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (number of channels, denoted d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.


Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multilayer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.


4. Applications


Our method is a generic solution for building feature pyramids inside deep ConvNets. In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection. To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [29, 11] when adapting them to our feature pyramid.


4.1. Feature Pyramid Networks for RPN


RPN [29] is a sliding-window class-agnostic object detector. In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. This is realized by a 3×3 convolutional layer followed by two sibling 1×1 convolutions for classification and regression, which we refer to as a network head. The object/non-object criterion and bounding box regression target are defined with respect to a set of reference boxes called anchors [29]. The anchors are of multiple pre-defined scales and aspect ratios in order to cover objects of different shapes.

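A rough sketch of such a head in NumPy, with the convolutions implemented naively: the channel counts, weight scales, and the `conv2d_same` helper are illustrative assumptions for this example, not the paper's code.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same'-padded 2D convolution; x: (C_in, H, W), w: (C_out, C_in, kH, kW)."""
    co, ci, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    h, wd = x.shape[1:]
    out = np.zeros((co, h, wd))
    for i in range(kh):          # accumulate the contribution of each kernel tap
        for j in range(kw):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def rpn_head(feat, w3, w_cls, w_reg):
    """3x3 conv + ReLU, then two sibling 1x1 convs for scores and box offsets."""
    hidden = np.maximum(conv2d_same(feat, w3), 0)
    return conv2d_same(hidden, w_cls), conv2d_same(hidden, w_reg)

rng = np.random.default_rng(0)
num_anchors = 3                             # one scale x three aspect ratios per level
feat = rng.standard_normal((8, 6, 6))       # small stand-in for one pyramid level
w3 = rng.standard_normal((8, 8, 3, 3)) * 0.1
w_cls = rng.standard_normal((2 * num_anchors, 8, 1, 1)) * 0.1
w_reg = rng.standard_normal((4 * num_anchors, 8, 1, 1)) * 0.1
cls, reg = rpn_head(feat, w3, w_cls, w_reg)
print(cls.shape, reg.shape)  # (6, 6, 6) (12, 6, 6)
```

Per spatial location the head emits 2 object/non-object scores and 4 box-regression offsets for each of the `num_anchors` anchors.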

We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level on our feature pyramid. Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively.¹ As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.

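The per-level anchor shapes follow directly from the stated areas and aspect ratios. The sketch below (hypothetical helper names; the ratio is taken to mean height/width) recovers the 15 anchors:

```python
import numpy as np

AREAS = {2: 32**2, 3: 64**2, 4: 128**2, 5: 256**2, 6: 512**2}  # P2..P6
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # height/width for the 1:2, 1:1, 2:1 anchors

def level_anchors(level):
    """(width, height) pairs for the single anchor scale assigned to P_level."""
    area = AREAS[level]
    shapes = []
    for ratio in ASPECT_RATIOS:
        w = np.sqrt(area / ratio)  # solve w * h = area with h = ratio * w
        shapes.append((w, ratio * w))
    return shapes

anchors = {lvl: level_anchors(lvl) for lvl in AREAS}
print(sum(len(v) for v in anchors.values()))  # 15 anchors over the pyramid
```

Each level contributes one scale at three aspect ratios, so the anchor count grows with the number of pyramid levels, not per level.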

We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29]. Formally, an anchor is assigned a positive label if it has the highest IoU for a given ground-truth box or an IoU over 0.7 with any ground-truth box, and a negative label if it has IoU lower than 0.3 for all ground-truth boxes. Note that scales of ground-truth boxes are not explicitly used to assign them to the levels of the pyramid; instead, ground-truth boxes are associated with anchors, which have been assigned to pyramid levels. As such, we introduce no extra rules in addition to those in [29].

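These labeling rules can be sketched as follows. This is a simplified NumPy version: real RPN implementations also subsample positives and negatives per minibatch, which is omitted here.

```python
import numpy as np

def iou(boxes_a, boxes_b):
    """Pairwise IoU matrix for boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = boxes_a.T
    bx1, by1, bx2, by2 = boxes_b.T
    ix1 = np.maximum(ax1[:, None], bx1[None, :])
    iy1 = np.maximum(ay1[:, None], by1[None, :])
    ix2 = np.minimum(ax2[:, None], bx2[None, :])
    iy2 = np.minimum(ay2[:, None], by2[None, :])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def assign_labels(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """1 = positive, 0 = negative, -1 = ignored, following the rules above."""
    overlaps = iou(anchors, gt_boxes)      # (num_anchors, num_gt)
    labels = np.full(len(anchors), -1)
    max_iou = overlaps.max(axis=1)
    labels[max_iou < neg_thresh] = 0       # low IoU with every gt -> negative
    labels[max_iou >= pos_thresh] = 1      # high IoU with some gt -> positive
    labels[overlaps.argmax(axis=0)] = 1    # best anchor for each gt -> positive
    return labels

anchors = np.array([[0, 0, 10, 10], [0, 0, 100, 100], [90, 90, 200, 200]], float)
gt = np.array([[0, 0, 96, 96]], float)
print(assign_labels(anchors, gt).tolist())  # [0, 1, 0]
```

Anchors whose maximum IoU falls between the two thresholds keep the label -1 and do not contribute to the training loss.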

We note that the parameters of the heads are shared across all feature pyramid levels; we have also evaluated the alternative without sharing parameters and observed similar accuracy. The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.


With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. We elaborate on the implementation details in the experiments.


4.2. Feature Pyramid Networks for Fast R-CNN


Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.


We view our feature pyramid as if it were produced from an image pyramid. Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they are run on image pyramids. Formally, we assign an RoI of width w and height h (on the input image to the network) to the level Pk of our feature pyramid by:

k = ⌊k₀ + log₂(√(wh) / 224)⌋.    (1)


Here 224 is the canonical ImageNet pre-training size, and k₀ is the target level on which an RoI with w × h = 224² should be mapped into. Analogous to the ResNet-based Faster R-CNN system [16] that uses C4 as the single-scale feature map, we set k₀ to 4. Intuitively, Eqn. (1) means that if the RoI's scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, k = 3).

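Eqn. (1) is a one-liner. The sketch below also clamps k to the available levels P2..P5, a common implementation convention that the text does not spell out, so the clamp is an assumption here:

```python
import math

def roi_level(w, h, k0=4, k_min=2, k_max=5):
    """Eqn. (1): k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to valid levels."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))  # clamp: an implementation convention, not in Eqn. (1)

print(roi_level(224, 224))  # 4: the canonical pre-training size maps to k0
print(roi_level(112, 112))  # 3: half the scale maps one level finer
```

RoI pooling then extracts features for each RoI from the single pyramid level the formula selects.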


¹ Here we introduce P6 only for covering a larger anchor scale of 512². P6 is simply a stride-2 subsampling of P5. P6 is not used by the Fast R-CNN detector in the next section.


