Source paper: arXiv:1606.00915

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Liang-Chieh Chen, George Papandreou, Senior Member, IEEE, Iasonas Kokkinos, Member, IEEE, Kevin Murphy, and Alan L. Yuille, Fellow, IEEE

Abstract—In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but takes a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state of the art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.

Index Terms—Convolutional Neural Networks, Semantic Segmentation, Atrous Convolution, Conditional Random Fields.

1 INTRODUCTION

Deep Convolutional Neural Networks (DCNNs) [1] have pushed the performance of computer vision systems to soaring heights on a broad array of high-level problems, including image classification [2], [3], [4], [5], [6] and object detection [7], [8], [9], [10], [11], [12], where DCNNs trained in an end-to-end manner have delivered strikingly better results than systems relying on hand-crafted features. Essential to this success is the built-in invariance of DCNNs to local image transformations, which allows them to learn increasingly abstract data representations [13]. This invariance is clearly desirable for classification tasks, but can hamper dense prediction tasks such as semantic segmentation, where abstraction of spatial information is undesired.

In particular we consider three challenges in the application of DCNNs to semantic image segmentation: (1) reduced feature resolution, (2) existence of objects at multiple scales, and (3) reduced localization accuracy due to DCNN invariance. Next, we discuss these challenges and our approach to overcome them in our proposed DeepLab system.

The first challenge is caused by the repeated combination of max-pooling and downsampling ('striding') performed at consecutive layers of DCNNs originally designed for image classification [2], [4], [5]. This results in feature maps with significantly reduced spatial resolution when the DCNN is employed in a fully convolutional fashion [14]. In order to overcome this hurdle and efficiently produce denser feature maps, we remove the downsampling operator from the last few max pooling layers of DCNNs and instead upsample the filters in subsequent convolutional layers, resulting in feature maps computed at a higher sampling rate. Filter upsampling amounts to inserting holes ('trous' in French) between nonzero filter taps. This technique has a long history in signal processing, originally developed for the efficient computation of the undecimated wavelet transform in a scheme also known as "algorithme à trous" [15]. We use the term atrous convolution as a shorthand for convolution with upsampled filters. Various flavors of this idea have been used before in the context of DCNNs by [3], [6], [16]. In practice, we recover full resolution feature maps by a combination of atrous convolution, which computes feature maps more densely, followed by simple bilinear interpolation of the feature responses to the original image size. This scheme offers a simple yet powerful alternative to using deconvolutional layers [13], [14] in dense prediction tasks. Compared to regular convolution with larger filters, atrous convolution allows us to effectively enlarge the field of view of filters without increasing the number of parameters or the amount of computation.
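
To make the "holes" picture concrete, the following hedged Python sketch (illustrative only; the helper names are ours, not the paper's) upsamples a small 1-D filter by inserting zeros between its taps and convolves with the result: the field of view grows, but only the original nonzero taps contribute parameters or arithmetic.

```python
# A minimal sketch (not the paper's code) of the "filter upsampling"
# view of atrous convolution: inserting zeros ("holes") between the
# taps of a filter enlarges its field of view, while the number of
# nonzero parameters -- and hence the work per output -- is unchanged.

def upsample_filter(w, rate):
    """Insert (rate - 1) zeros between consecutive filter taps."""
    up = []
    for k, tap in enumerate(w):
        up.append(tap)
        if k < len(w) - 1:
            up.extend([0.0] * (rate - 1))
    return up

def conv1d_valid(x, w):
    """Plain 'valid' 1-D convolution with a non-mirrored filter."""
    K = len(w)
    return [sum(x[i + k] * w[k] for k in range(K))
            for i in range(len(x) - K + 1)]

w = [1.0, -2.0, 1.0]                  # 3 nonzero taps (second difference)
w_up = upsample_filter(w, 2)          # field of view grows from 3 to 5
x = [float(i * i) for i in range(10)]
print(w_up)                           # -> [1.0, 0.0, -2.0, 0.0, 1.0]
print(conv1d_valid(x, w_up))          # all 8.0: spacing-2 second difference of i**2
```

Only the three nonzero taps ever multiply input samples, which is exactly why the enlarged field of view comes at no extra parameter or computation cost.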

The second challenge is caused by the existence of objects at multiple scales. A standard way to deal with this is to present to the DCNN rescaled versions of the same image and then aggregate the feature or score maps [6], [17], [18]. We show that this approach indeed increases the performance of our system, but comes at the cost of computing feature responses at all DCNN layers for multiple scaled versions of the input image. Instead, motivated by spatial pyramid pooling [19], [20], we propose a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution. This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, we efficiently implement this mapping using multiple parallel atrous convolutional layers with different sampling rates; we call the proposed technique "atrous spatial pyramid pooling" (ASPP).
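
As a rough, hedged illustration of the ASPP idea (toy rates and sizes; the actual model applies 3x3 filters with rates such as 6-24 to deep feature maps and fuses per-branch score maps), the sketch below runs parallel atrous convolutions at several rates over one feature map and sums the branch outputs over their common spatial extent:

```python
# Toy ASPP sketch: parallel atrous 3x3 convolutions with different
# rates probe the same feature map; branch outputs are fused by sum.
# Function names and the tiny rates/sizes here are illustrative only.

def atrous_conv2d(fmap, w, rate):
    """'Valid' 2-D atrous convolution: filter taps spaced `rate` apart."""
    H, W = len(fmap), len(fmap[0])
    K = len(w)                       # K x K filter
    span = (K - 1) * rate            # effective field of view minus 1
    return [[sum(fmap[i + rate * a][j + rate * b] * w[a][b]
                 for a in range(K) for b in range(K))
             for j in range(W - span)]
            for i in range(H - span)]

def aspp(fmap, filters_by_rate):
    """Sum branch outputs over the smallest common spatial extent."""
    branches = [atrous_conv2d(fmap, w, r) for r, w in filters_by_rate.items()]
    h = min(len(b) for b in branches)
    ww = min(len(b[0]) for b in branches)
    return [[sum(b[i][j] for b in branches) for j in range(ww)]
            for i in range(h)]

fmap = [[float(i + j) for j in range(12)] for i in range(12)]
avg = [[1.0 / 9] * 3 for _ in range(3)]      # same 3x3 mean filter per branch
out = aspp(fmap, {1: avg, 2: avg, 3: avg})   # toy rates
print(len(out), len(out[0]))                  # -> 6 6
```

Each branch sees the same input through a different effective field of view, which is the sense in which ASPP probes the image at multiple scales without rescaling it.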


  • L.-C. Chen, G. Papandreou, and K. Murphy are with Google Inc. I. Kokkinos is with University College London. A. Yuille is with the Departments of Cognitive Science and Computer Science, Johns Hopkins University. The first two authors contributed equally to this work.


The third challenge relates to the fact that an object-centric classifier requires invariance to spatial transformations, inherently limiting the spatial accuracy of a DCNN. One way to mitigate this problem is to use skip-layers to extract "hyper-column" features from multiple network layers when computing the final segmentation result [14], [21]. Our work explores an alternative approach which we show to be highly effective. In particular, we boost our model's ability to capture fine details by employing a fully-connected Conditional Random Field (CRF) [22]. CRFs have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the low-level information captured by the local interactions of pixels and edges [23], [24] or superpixels [25]. Even though works of increased sophistication have been proposed to model the hierarchical dependency [26], [27], [28] and/or high-order dependencies of segments [29], [30], [31], [32], [33], we use the fully connected pairwise CRF proposed by [22] for its efficient computation, and ability to capture fine edge details while also catering for long range dependencies. That model was shown in [22] to improve the performance of a boosting-based pixel-level classifier. In this work, we demonstrate that it leads to state-of-the-art results when coupled with a DCNN-based pixel-level classifier.
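
To give a feel for the mean-field inference used with the fully connected CRF, here is a deliberately tiny, hedged 1-D sketch: hand-picked unary energies play the role of classifier scores, every pixel interacts with every other pixel through a Gaussian kernel on position, and a Potts compatibility penalizes label disagreement. The model of [22] additionally uses an appearance (bilateral) kernel and fast high-dimensional filtering, both omitted here.

```python
# Toy dense-CRF mean-field sketch (illustrative parameters, not the
# paper's): a noisy pixel inside a confident run is flipped back by
# the fully connected pairwise term.
import math

def softmax_neg(energies):
    """Distribution proportional to exp(-energy)."""
    m = min(energies)
    exps = [math.exp(-(e - m)) for e in energies]
    s = sum(exps)
    return [e / s for e in exps]

def mean_field(unary, pos, sigma=1.5, w_pair=2.0, iters=10):
    """unary[i][l]: unary energy of label l at pixel i; pos[i]: coordinate."""
    N, L = len(unary), len(unary[0])
    k = [[math.exp(-((pos[i] - pos[j]) ** 2) / (2.0 * sigma ** 2))
          for j in range(N)] for i in range(N)]
    Q = [softmax_neg(row) for row in unary]
    for _ in range(iters):
        Q = [softmax_neg([unary[i][l] + w_pair *
                          # Potts compatibility: pay for every other pixel
                          # that currently disagrees, weighted by the kernel.
                          sum(k[i][j] * (1.0 - Q[j][l])
                              for j in range(N) if j != i)
                          for l in range(L)])
             for i in range(N)]
    return Q

unary = [[0.0, 2.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
Q = mean_field(unary, pos=[0, 1, 2, 3, 4])
labels = [max(range(2), key=lambda l: q[l]) for q in Q]
print(labels)                       # the middle pixel is corrected
```

With a small enough pairwise weight the middle pixel would keep its unary-preferred label; the kernel widths and weights are the knobs that trade smoothing against fidelity to the unaries.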

A high-level illustration of the proposed DeepLab model is shown in Fig. 1. A deep convolutional neural network (VGG-16 [4] or ResNet-101 [11] in this work) trained in the task of image classification is re-purposed to the task of semantic segmentation by (1) transforming all the fully connected layers to convolutional layers (i.e., fully convolutional network [14]) and (2) increasing feature resolution through atrous convolutional layers, allowing us to compute feature responses every 8 pixels instead of every 32 pixels in the original network. We then employ bi-linear interpolation to upsample by a factor of 8 the score map to reach the original image resolution, yielding the input to a fully-connected CRF [22] that refines the segmentation results.
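
The upsample-by-8 stage can be pictured with a minimal bilinear interpolation routine. This is a generic align-corners-style sketch of our own (the paper does not pin down the exact variant), applied to a toy 2x2 score map:

```python
# Hedged sketch of bilinear upsampling of a coarse score map back
# toward image resolution (align-corners convention; toy sizes).

def bilinear_upsample(grid, factor):
    H, W = len(grid), len(grid[0])
    out_h, out_w = (H - 1) * factor + 1, (W - 1) * factor + 1
    out = []
    for oi in range(out_h):
        y = oi / factor
        i0 = min(int(y), H - 2)      # top neighbor row
        dy = y - i0
        row = []
        for oj in range(out_w):
            x = oj / factor
            j0 = min(int(x), W - 2)  # left neighbor column
            dx = x - j0
            row.append(
                (1 - dy) * ((1 - dx) * grid[i0][j0] + dx * grid[i0][j0 + 1])
                + dy * ((1 - dx) * grid[i0 + 1][j0] + dx * grid[i0 + 1][j0 + 1]))
        out.append(row)
    return out

scores = [[0.0, 8.0], [8.0, 16.0]]      # 2x2 coarse score map
up = bilinear_upsample(scores, 8)       # -> 9x9, values vary linearly
print(up[0][4], up[4][4])               # -> 4.0 8.0
```

Because the score maps are smooth, this fixed interpolation is cheap and adds no parameters; the CRF, not the upsampler, is what recovers sharp boundaries.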

From a practical standpoint, the three main advantages of our DeepLab system are: (1) Speed: by virtue of atrous convolution, our dense DCNN operates at 8 FPS on an NVidia Titan X GPU, while Mean Field Inference for the fully-connected CRF requires 0.5 secs on a CPU. (2) Accuracy: we obtain state-of-the-art results on several challenging datasets, including the PASCAL VOC 2012 semantic segmentation benchmark [34], PASCAL-Context [35], PASCAL-Person-Part [36], and Cityscapes [37]. (3) Simplicity: our system is composed of a cascade of two very well-established modules, DCNNs and CRFs.

The updated DeepLab system we present in this paper features several improvements compared to its first version reported in our original conference publication [38]. Our new version can better segment objects at multiple scales, via either multi-scale input processing [17], [39], [40] or the proposed ASPP. We have built a residual net variant of DeepLab by adapting the state-of-the-art ResNet [11] image classification DCNN, achieving better semantic segmentation performance compared to our original model based on VGG-16 [4]. Finally, we present a more comprehensive experimental evaluation of multiple model variants and report state-of-the-art results not only on the PASCAL VOC 2012 benchmark but also on other challenging tasks. We have implemented the proposed methods by extending the Caffe framework [41]. We share our code and models at a companion web site: liangchiehchen.com/projects/DeepLab.html

2 Related Work

Most of the successful semantic segmentation systems developed in the previous decade relied on hand-crafted features combined with flat classifiers, such as Boosting [24], [42], Random Forests [43], or Support Vector Machines [44]. Substantial improvements have been achieved by incorporating richer information from context [45] and structured prediction techniques [22], [26], [27], [46], but the performance of these systems has always been compromised by the limited expressive power of the features. Over the past few years the breakthroughs of Deep Learning in image classification were quickly transferred to the semantic segmentation task. Since this task involves both segmentation and classification, a central question is how to combine the two tasks.

The first family of DCNN-based systems for semantic segmentation typically employs a cascade of bottom-up image segmentation, followed by DCNN-based region classification. For instance the bounding box proposals and masked regions delivered by [47], [48] are used in [7] and [49] as inputs to a DCNN to incorporate shape information into the classification process. Similarly, the authors of [50] rely on a superpixel representation. Even though these approaches can benefit from the sharp boundaries delivered by a good segmentation, they also cannot recover from any of its errors.

The second family of works relies on using convolutionally computed DCNN features for dense image labeling, and couples them with segmentations that are obtained independently. Among the first were [39], who apply DCNNs at multiple image resolutions and then employ a segmentation tree to smooth the prediction results. More recently, [21] propose to use skip layers and concatenate the computed intermediate feature maps within the DCNNs for pixel classification. Further, [51] propose to pool the intermediate feature maps by region proposals. These works still employ segmentation algorithms that are decoupled from the DCNN classifier's results, thus risking commitment to premature decisions.

The third family of works uses DCNNs to directly provide dense category-level pixel labels, which makes it possible to even discard segmentation altogether. The segmentation-free approaches of [14], [52] directly apply DCNNs to the whole image in a fully convolutional fashion, transforming the last fully connected layers of the DCNN into convolutional layers. In order to deal with the spatial localization issues outlined in the introduction, [14] upsample and concatenate the scores from intermediate feature maps, while [52] refine the prediction result from coarse to fine by propagating the coarse results to another DCNN. Our work builds on these works, and as described in the introduction extends them by exerting control on the feature resolution, introducing multi-scale pooling techniques and integrating the densely connected CRF of [22] on top of the DCNN. We show that this leads to significantly better segmentation results, especially along object boundaries. The combination of DCNN and CRF is of course not new but previous works only tried locally connected CRF models. Specifically, [53] use CRFs as a proposal mechanism for a DCNN-based reranking system, while [39] treat superpixels as nodes for a local pairwise CRF and use graph-cuts for discrete inference. As such their models were limited by errors in superpixel computations or ignored long-range dependencies. Our approach instead treats every pixel as a CRF node receiving unary potentials by the DCNN. Crucially, the Gaussian CRF potentials in the fully connected CRF model of [22] that we adopt can capture long-range dependencies and at the same time the model is amenable to fast mean field inference. We note that mean field inference had been extensively studied for traditional image segmentation tasks [54], [55], [56], but these older models were typically limited to short-range connections. 
In independent work, [57] use a very similar densely connected CRF model to refine the results of DCNN for the problem of material classification. However, the DCNN module of [57] was only trained by sparse point supervision instead of dense supervision at every pixel.

Fig. 1: Model Illustration. A Deep Convolutional Neural Network such as VGG-16 or ResNet-101 is employed in a fully convolutional fashion, using atrous convolution to reduce the degree of signal downsampling (from 32x down to 8x). A bilinear interpolation stage enlarges the feature maps to the original image resolution. A fully connected CRF is then applied to refine the segmentation result and better capture the object boundaries.

Since the first version of this work was made publicly available [38], the area of semantic segmentation has progressed drastically. Multiple groups have made important advances, significantly raising the bar on the PASCAL VOC 2012 semantic segmentation benchmark, as reflected by the high level of activity on the benchmark's leaderboard¹ [17], [40], [58], [59], [60], [61], [62], [63]. Interestingly, most top-performing methods have adopted one or both of the key ingredients of our DeepLab system: atrous convolution for efficient dense feature extraction and refinement of the raw DCNN scores by means of a fully connected CRF. We outline below some of the most important and interesting advances.

End-to-end training for structured prediction has more recently been explored in several related works. While we employ the CRF as a post-processing method, [40], [59], [62], [64], [65] have successfully pursued joint learning of the DCNN and CRF. In particular, [59], [65] unroll the CRF mean-field inference steps to convert the whole system into an end-to-end trainable feed-forward network, while [62] approximates one iteration of the dense CRF mean field inference [22] by convolutional layers with learnable filters. Another fruitful direction pursued by [40], [66] is to learn the pairwise terms of a CRF via a DCNN, significantly improving performance at the cost of heavier computation. In a different direction, [63] replace the bilateral filtering module used in mean field inference with a faster domain transform module [67], improving the speed and lowering the memory requirements of the overall system, while [18], [68] combine semantic segmentation with edge detection.

Weaker supervision has been pursued in a number of papers, relaxing the assumption that pixel-level semantic annotations are available for the whole training set [58], [69], [70], [71], achieving significantly better results than weakly-supervised pre-DCNN systems such as [72]. In another line of research, [49], [73] pursue instance segmentation, jointly tackling object detection and semantic segmentation.

What we call here atrous convolution was originally developed for the efficient computation of the undecimated wavelet transform in the "algorithme à trous" scheme of [15]. We refer the interested reader to [74] for early references from the wavelet literature. Atrous convolution is also intimately related to the "noble identities" in multi-rate signal processing, which builds on the same interplay of input signal and filter sampling rates [75]. Atrous convolution is a term we first used in [6]. The same operation was later called dilated convolution by [76], a term they coined motivated by the fact that the operation corresponds to regular convolution with upsampled (or dilated in the terminology of [15]) filters. Various authors have used the same operation before for denser feature extraction in DCNNs [3], [6], [16]. Beyond mere resolution enhancement, atrous convolution allows us to enlarge the field of view of filters to incorporate larger context, which we have shown in [38] to be beneficial. This approach has been pursued further by [76], who employ a series of atrous convolutional layers with increasing rates to aggregate multiscale context. The atrous spatial pyramid pooling scheme proposed here to capture multiscale objects and context also employs multiple atrous convolutional layers with different sampling rates, which we however lay out in parallel instead of in serial. Interestingly, the atrous convolution technique has also been adopted for a broader set of tasks, such as object detection [12], [77], instance-level segmentation [78], visual question answering [79], and optical flow [80].


  1. host.robots.ox.ac.uk:8080/leaderboard…?challengeid=11&compid=6

  2. host.robots.ox.ac.uk:8080/leaderboard…


We also show that, as expected, integrating into DeepLab more advanced image classification DCNNs such as the residual net of [11] leads to better results. This has also been observed independently by [81].

3 Methods

3.1 Atrous Convolution for Dense Feature Extraction and Field-of-View Enlargement

The use of DCNNs for semantic segmentation, or other dense prediction tasks, has been shown to be simply and successfully addressed by deploying DCNNs in a fully convolutional fashion [3], [14]. However, the repeated combination of max-pooling and striding at consecutive layers of these networks reduces significantly the spatial resolution of the resulting feature maps, typically by a factor of 32 across each direction in recent DCNNs. A partial remedy is to use 'deconvolutional' layers as in [14], which however requires additional memory and time.

We advocate instead the use of atrous convolution, originally developed for the efficient computation of the undecimated wavelet transform in the "algorithme à trous" scheme of [15] and used before in the DCNN context by [3], [6], [16]. This algorithm allows us to compute the responses of any layer at any desirable resolution. It can be applied post-hoc, once a network has been trained, but can also be seamlessly integrated with training.

Considering one-dimensional signals first, the output y[i] of atrous convolution of a 1-D input signal x[i] with a filter w[k] of length K is defined as:

y[i] = Σ_{k=1}^{K} x[i + r·k] · w[k]    (1)

The rate parameter r corresponds to the stride with which we sample the input signal. Standard convolution is a special case for rate r = 1. See Fig. 2 for illustration.
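
The definition above can be transcribed directly into Python as a short sketch; rate r = 1 recovers standard convolution, while larger rates space the sampled input positions r apart:

```python
# Sketch of 1-D atrous convolution: y[i] = sum_k x[i + r*k] * w[k],
# evaluated at every i where the (effectively enlarged) filter fits.

def atrous_conv1d(x, w, r):
    K = len(w)
    span = (K - 1) * r               # effective field of view minus 1
    return [sum(x[i + r * k] * w[k] for k in range(K))
            for i in range(len(x) - span)]

x = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0]   # x[i] = i**2
w = [1.0, -2.0, 1.0]                          # second-difference filter
print(atrous_conv1d(x, w, 1))    # -> [2.0, 2.0, 2.0, 2.0, 2.0]
print(atrous_conv1d(x, w, 2))    # -> [8.0, 8.0, 8.0]
```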

Fig. 2: Illustration of atrous convolution in 1-D. (a) Sparse feature extraction with standard convolution on a low resolution input feature map. (b) Dense feature extraction with atrous convolution with rate r = 2, applied on a high resolution input feature map.

Fig. 3: Illustration of atrous convolution in 2-D. Top row: sparse feature extraction with standard convolution on a low resolution input feature map. Bottom row: dense feature extraction with atrous convolution with rate r = 2, applied on a high resolution input feature map.

We illustrate the algorithm's operation in 2-D through a simple example in Fig. 3. Given an image, we assume that we first have a downsampling operation that reduces the resolution by a factor of 2, and then perform a convolution with a kernel - here, the vertical Gaussian derivative. If one implants the resulting feature map in the original image coordinates, we realize that we have obtained responses at only 1/4 of the image positions. Instead, we can compute responses at all image positions if we convolve the full resolution image with a filter 'with holes', in which we upsample the original filter by a factor of 2, and introduce zeros in between filter values. Although the effective filter size increases, we only need to take into account the non-zero filter values, hence both the number of filter parameters and the number of operations per position stay constant. The resulting scheme allows us to easily and explicitly control the spatial resolution of neural network feature responses.
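
The equivalence just described can be checked directly in 1-D (for brevity): standard convolution on a 2x-downsampled signal reproduces the rate-2 atrous responses at every other full-resolution position, while the atrous path also fills in the positions the downsampled path misses. A small sketch:

```python
# Downsample-then-convolve vs. atrous-convolve-then-subsample: the two
# agree exactly, but only the atrous path yields responses everywhere.

def atrous_conv1d(x, w, r):
    K = len(w)
    return [sum(x[i + r * k] * w[k] for k in range(K))
            for i in range(len(x) - (K - 1) * r)]

x = [float(i % 5) * 1.5 for i in range(16)]   # arbitrary test signal
w = [0.25, 0.5, 0.25]                         # small smoothing filter

low_res = x[::2]                        # downsample by 2
sparse = atrous_conv1d(low_res, w, 1)   # responses at half the positions
dense = atrous_conv1d(x, w, 2)          # responses at every position
print(dense[::2] == sparse)             # -> True
```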

In the context of DCNNs one can use atrous convolution in a chain of layers, effectively allowing us to compute the final DCNN network responses at an arbitrarily high resolution.


  1. We follow the standard practice in the DCNN literature and use non-mirrored filters in this definition.

