Fully Convolutional Networks for Semantic Segmentation [Translation]



Original paper: arXiv:1411.4038

Fully Convolutional Networks for Semantic Segmentation


Jonathan Long* Evan Shelhamer* Trevor Darrell


UC Berkeley


{jonlong, shelhamer, trevor}@cs.berkeley.edu


Abstract


Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.


1. Introduction


Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and keypoint prediction [39, 24], and local correspondence [24, 9].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

Figure 1. Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation.


We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel "skip" architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).



*Authors contributed equally



In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.


2. Related work


Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.


Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.


Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include

  • small models restricting capacity and receptive fields;


  • patchwise training [27, 2, 8, 28, 11];

  • post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];

  • input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29];

  • multi-scale pyramid processing [8, 28, 11];

  • saturating tanh nonlinearities [8, 5, 28]; and

  • ensembles [2, 11],

whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and "shift-and-stitch" dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [6] is a special case.

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.


They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.


3. Fully convolutional networks


Each layer of data in a convnet is a three-dimensional array of size $h \times w \times d$, where $h$ and $w$ are spatial dimensions, and $d$ is the feature or channel dimension. The first layer is the image, with pixel size $h \times w$, and $d$ color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $\mathbf{x}_{ij}$ for the data vector at location $(i, j)$ in a particular layer, and $\mathbf{y}_{ij}$ for the following layer, these functions compute outputs $\mathbf{y}_{ij}$ by

$$\mathbf{y}_{ij} = f_{ks}\left(\{\mathbf{x}_{si + \delta i,\; sj + \delta j}\}_{0 \le \delta i, \delta j \le k}\right)$$

where $k$ is called the kernel size, $s$ is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.

This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}.$$
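To make the kernel/stride arithmetic concrete, here is a minimal Python sketch (our illustration, not from the paper) that folds a stack of layers into one effective (kernel size, stride) pair using the rule above; the layer list is an assumed AlexNet-like prefix, not an exact architecture:

```python
def compose(f, g):
    """Apply layer f = (k, s) after layer g = (k', s'):
    f_ks ∘ g_k's' = (f ∘ g)_{k' + (k-1)s', ss'}."""
    k, s = f
    k_prime, s_prime = g
    return (k_prime + (k - 1) * s_prime, s * s_prime)

# Two stacked 3x3, stride-2 layers act like one 7x7, stride-4 filter:
print(compose((3, 2), (3, 2)))  # -> (7, 4)

# Fold a whole stack (listed input-first) into its effective kernel
# (receptive field) and stride; an assumed AlexNet-like prefix:
layers = [(11, 4), (3, 2), (5, 1), (3, 2)]
k, s = layers[0]
for layer in layers[1:]:
    k, s = compose(layer, (k, s))
print((k, s))  # -> (67, 16)
```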

While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.


A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(\mathbf{x}; \theta) = \sum_{ij} \ell'(\mathbf{x}_{ij}; \theta)$, its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell'$, taking all of the final layer receptive fields as a minibatch.
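As a minimal sketch of this equivalence (our PyTorch illustration; the paper's implementation used Caffe), summing an unreduced per-pixel loss over the spatial grid yields the whole-image loss $\ell$, and a single backward pass accumulates the gradients of all spatial components:

```python
import torch
import torch.nn.functional as F

# Final-layer class scores for one image and its per-pixel labels.
scores = torch.randn(1, 21, 16, 16, requires_grad=True)   # (N, C, H, W)
labels = torch.randint(0, 21, (1, 16, 16))                # (N, H, W)

# l(x; theta) = sum_ij l'(x_ij; theta): per-pixel losses summed spatially.
per_pixel = F.cross_entropy(scores, labels, reduction="none")  # (N, H, W)
loss = per_pixel.sum()
loss.backward()  # the gradient sums each spatial component's gradient
```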

When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.


We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick that OverFeat [29] introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.


3.1. Adapting classifiers for dense prediction


Typical recognition nets, including LeNet [21], AlexNet [19], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce nonspatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2. (By contrast, nonconvolutional nets, such as the one by Le et al. [20], lack this capability.)

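A minimal PyTorch sketch of this conversion (ours; the shapes follow AlexNet's fc6 for concreteness): the fully connected weights are reshaped into a kernel that covers the entire input region, so on a larger input the layer slides and emits a spatial grid of scores.

```python
import torch
import torch.nn as nn

# A "classifier head" that flattens a 256x6x6 feature map to 4096 units.
fc = nn.Linear(256 * 6 * 6, 4096)

# The equivalent convolution: a 6x6 kernel covering the whole input region.
conv = nn.Conv2d(256, 4096, kernel_size=6)
conv.weight.data = fc.weight.data.view(4096, 256, 6, 6)
conv.bias.data = fc.bias.data

x = torch.randn(1, 256, 6, 6)
assert torch.allclose(fc(x.flatten(1)),
                      conv(x).flatten(1), atol=1e-5)

# On a larger input the convolution slides, outputting a grid of scores.
print(conv(torch.randn(1, 256, 10, 10)).shape)  # -> (1, 4096, 5, 5)
```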

Figure 2. Transforming fully connected layers into convolution layers enables a classification net to output a heatmap. Adding layers and a spatial loss (as in Figure 1) produces an efficient machine for end-to-end dense learning.


Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to produce the classification scores of a 227 × 227 image, the fully convolutional version takes 22 ms to produce a 10 × 10 grid of outputs from a 500 × 500 image, which is more than 5 times faster than the naïve approach.¹

The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution.


The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass. This dense backpropagation is illustrated in Figure 1.

While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.



¹ Assuming efficient batching of single image inputs. The classification scores for a single image by itself take 5.4 ms to produce, which is nearly 25 times slower than the fully convolutional version.


3.2. Shift-and-stitch is filter rarefaction


Input shifting and output interlacing is a trick that yields dense predictions from coarse outputs without interpolation, introduced by OverFeat [29]. If the outputs are downsampled by a factor of $f$, the input is shifted (by left and top padding) $x$ pixels to the right and $y$ pixels down, once for every value of $(x, y) \in \{0, \ldots, f-1\} \times \{0, \ldots, f-1\}$. These $f^2$ inputs are each run through the convnet, and the outputs are interlaced so that the predictions correspond to the pixels at the centers of their receptive fields.
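A NumPy sketch of the trick (ours; `net` is a stand-in for a convnet that downsamples by $f$, and the interlace offsets shown are one plausible convention, since the exact phase depends on the net's receptive-field alignment):

```python
import numpy as np

def shift_and_stitch(image, net, f):
    """Dense output from `net`, which downsamples by factor f.
    Assumes image height and width are divisible by f."""
    H, W = image.shape
    dense = np.zeros((H, W))
    for y in range(f):
        for x in range(f):
            # Shift the input x pixels right and y pixels down via
            # left/top padding, then crop back to (H, W).
            shifted = np.pad(image, ((y, 0), (x, 0)), mode="edge")[:H, :W]
            coarse = net(shifted)  # shape (H // f, W // f)
            # Interlace: each shifted pass fills a distinct phase of the
            # full-resolution grid (the offsets here are one convention).
            dense[(f - 1 - y) % f::f, (f - 1 - x) % f::f] = coarse
    return dense
```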

Changing only the filters and layer strides of a convnet can produce the same output as this shift-and-stitch trick. Consider a layer (convolution or pooling) with input stride $s$, and a following convolution layer with filter weights $f_{ij}$ (eliding the feature dimensions, irrelevant here). Setting the lower layer's input stride to 1 upsamples its output by a factor of $s$, just like shift-and-stitch. However, convolving the original filter with the upsampled output does not produce the same result as the trick, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as

$$f'_{ij} = \begin{cases} f_{i/s,\, j/s} & \text{if } s \text{ divides both } i \text{ and } j; \\ 0 & \text{otherwise} \end{cases}$$

(with $i$ and $j$ zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layer-by-layer until all subsampling is removed.
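A small NumPy sketch of this rarefaction (zero-interleaving), assuming a square 2-D filter:

```python
import numpy as np

def rarefy(filt, s):
    """Enlarge filter f to f' with f'_{ij} = f_{i/s, j/s} when s divides
    both i and j, and 0 otherwise (zero-interleaving)."""
    k = filt.shape[0]
    out = np.zeros(((k - 1) * s + 1, (k - 1) * s + 1), dtype=filt.dtype)
    out[::s, ::s] = filt
    return out

f = np.arange(9.0).reshape(3, 3)
print(rarefy(f, 2))  # a 3x3 filter becomes 5x5, zeros between the taps
```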

Simply decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. We have seen that the shift-and-stitch trick is another kind of tradeoff: the output is made denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design.


Although we have done preliminary experiments with shift-and-stitch, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on.


3.3. Upsampling is backwards strided convolution


Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output $y_{ij}$ from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.

In a sense, upsampling with factor $f$ is convolution with a fractional input stride of $1/f$. So long as $f$ is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of $f$. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.

Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.

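As an illustration (a minimal PyTorch sketch, not the paper's Caffe implementation), an in-network $\times f$ upsampling layer can be built as a transposed convolution whose weights are initialized to bilinear interpolation and then learned:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, f):
    """Transposed-convolution weights that perform bilinear upsampling
    by factor f, one independent filter per channel."""
    k = 2 * f - f % 2
    center = (k - 1) / 2 if k % 2 == 1 else f - 0.5
    og = torch.arange(k, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / f
    kernel = filt[:, None] * filt[None, :]       # separable 2-D filter
    weight = torch.zeros(channels, channels, k, k)
    for c in range(channels):                    # no cross-channel mixing
        weight[c, c] = kernel
    return weight

channels, f = 21, 2                              # e.g., 21 VOC classes, 2x up
up = nn.ConvTranspose2d(channels, channels, kernel_size=2 * f - f % 2,
                        stride=f, padding=f // 2, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(channels, f))  # init only; still learnable

print(up(torch.randn(1, channels, 16, 16)).shape)  # -> (1, 21, 32, 32)
```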

In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to up-sample for refined prediction in Section 4.2.


3.4. Patchwise training is loss sampling


In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently, applying a DropConnect mask [36] between the output and the loss) excludes patches from the gradient computation, as sketched below.
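A minimal PyTorch sketch of such loss sampling (ours; the sampling rate `p` is an illustrative parameter):

```python
import torch
import torch.nn.functional as F

def sampled_pixel_loss(scores, labels, p=0.25):
    """Cross-entropy restricted to a random subset of spatial terms:
    a DropConnect-style mask between the output and the loss, which
    drops the other locations' patches from the gradient computation."""
    per_pixel = F.cross_entropy(scores, labels, reduction="none")  # (N, H, W)
    mask = (torch.rand_like(per_pixel) < p).float()
    return (per_pixel * mask).sum()

scores = torch.randn(1, 21, 32, 32, requires_grad=True)
labels = torch.randint(0, 21, (1, 32, 32))
sampled_pixel_loss(scores, labels).backward()  # only sampled pixels contribute
```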

If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images.²

Sampling in patchwise training can correct class imbalance [27, 8, 2] and mitigate the spatial correlation of dense patches [28, 16]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation.

We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient.


4. Segmentation Architecture


We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. We train for segmentation by fine-tuning. Next, we build a novel skip architecture that combines coarse, semantic and local, appearance information to refine prediction.

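As a rough illustration of such a skip combination (our PyTorch sketch in the spirit of the FCN-16s fusion, not the exact architecture detailed later): predict scores from a shallower layer with a $1 \times 1$ convolution, upsample the deeper coarse scores $2\times$, sum, and upsample the fused scores to image resolution.

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Illustrative skip combination: fuse coarse, deep score maps with
    finer scores predicted from a shallower feature map."""
    def __init__(self, shallow_channels, num_classes=21):
        super().__init__()
        self.score_shallow = nn.Conv2d(shallow_channels, num_classes, 1)
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 4,
                                      stride=2, padding=1, bias=False)
        self.up16 = nn.ConvTranspose2d(num_classes, num_classes, 32,
                                       stride=16, padding=8, bias=False)

    def forward(self, deep_scores, shallow_feats):
        # deep_scores: (N, C, H/32, W/32); shallow_feats: (N, ch, H/16, W/16)
        fused = self.up2(deep_scores) + self.score_shallow(shallow_feats)
        return self.up16(fused)     # back to (N, C, H, W)

net = SkipFusion(shallow_channels=512)
out = net(torch.randn(1, 21, 7, 7), torch.randn(1, 512, 14, 14))
print(out.shape)  # -> (1, 21, 224, 224)
```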

For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [7]. We train with



² Note that not every possible patch is included this way, since the receptive fields of the final layer units lie on a fixed, strided grid. However, by shifting the image left and down by a random value up to the stride, random selection from all possible patches may be recovered.

