Source: arXiv:1706.05587
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen George Papandreou Florian Schroff Hartwig Adam
Google Inc.
{lcchen, gpapan, fschroff, hadam}@google.com
Abstract
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context, further boosting performance. We also elaborate on implementation details and share our experience in training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
1. Introduction
For the task of semantic segmentation, we consider two challenges in applying Deep Convolutional Neural Networks (DCNNs) [50]. The first one is the reduced feature resolution caused by consecutive pooling operations or convolution striding, which allows DCNNs to learn increasingly abstract feature representations. However, this invariance to local image transformation may impede dense prediction tasks, where detailed spatial information is desired. To overcome this problem, we advocate the use of atrous convolution, which has been shown to be effective for semantic image segmentation [10, 90, 11]. Atrous convolution, also known as dilated convolution, allows us to repurpose ImageNet [72] pretrained networks to extract denser feature maps by removing the downsampling operations from the last few layers and upsampling the corresponding filter kernels, equivalent to inserting holes ('trous' in French) between filter weights. With atrous convolution, one is able to control the resolution at which feature responses are computed within DCNNs without requiring learning extra parameters.
Figure 1. Atrous convolution with kernel size 3×3 and different rates. Standard convolution corresponds to atrous convolution with rate = 1. Employing a large value of atrous rate enlarges the model's field-of-view, enabling object encoding at multiple scales.
Another difficulty comes from the existence of objects at multiple scales. Several methods have been proposed to handle the problem, and in this work we mainly consider four categories, as illustrated in Fig. 2. First, the DCNN is applied to an image pyramid to extract features for each scale input, where objects at different scales become prominent at different feature maps. Second, the encoder-decoder structure exploits multi-scale features from the encoder part and recovers the spatial resolution from the decoder part. Third, extra modules are cascaded on top of the original network for capturing long range information. In particular, DenseCRF [45] is employed to encode pixel-level pairwise similarities, while other works develop several extra convolutional layers in cascade to gradually capture long range context. Fourth, spatial pyramid pooling [11, 95] probes an incoming feature map with filters or pooling operations at multiple rates and multiple effective field-of-views, thus capturing objects at multiple scales.
In this work, we revisit applying atrous convolution, which allows us to effectively enlarge the field of view of filters to incorporate multi-scale context, in the framework of both cascaded modules and spatial pyramid pooling. In particular, our proposed module consists of atrous convolution with various rates and batch normalization layers, which we found important to train as well. We experiment with laying out the modules in cascade or in parallel (specifically, the Atrous Spatial Pyramid Pooling (ASPP) method [11]). We discuss an important practical issue when applying a 3×3 atrous convolution with an extremely large rate, which fails to capture long range information due to image boundary effects, effectively degenerating to a 1×1 convolution, and propose to incorporate image-level features into the ASPP module. Furthermore, we elaborate on implementation details and share experience on training the proposed models, including a simple yet effective bootstrapping method for handling rare and finely annotated objects. In the end, our proposed model, 'DeepLabv3', improves over our previous works and attains a performance of 85.7% on the PASCAL VOC 2012 test set without DenseCRF post-processing.
Figure 2. Alternative architectures to capture multi-scale context.
2. Related Work
It has been shown that global features or contextual interactions are beneficial in correctly classifying pixels for semantic segmentation. In this work, we discuss four types of Fully Convolutional Networks (FCNs) that exploit context information for semantic segmentation (see Fig. 2 for illustration).
Image pyramid: The same model, typically with shared weights, is applied to multi-scale inputs. Feature responses from the small scale inputs encode the long-range context, while the large scale inputs preserve the small object details. Typical examples include Farabet et al. [22], who transform the input image through a Laplacian pyramid, feed each scale input to a DCNN, and merge the feature maps from all the scales. Other works apply multi-scale inputs sequentially from coarse to fine, or directly resize the input to several scales and fuse the features from all the scales. The main drawback of this type of model is that it does not scale well for larger/deeper DCNNs due to limited GPU memory, and thus it is usually applied during the inference stage [16].
Encoder-decoder: This model consists of two parts: (a) the encoder, where the spatial dimension of feature maps is gradually reduced and thus longer range information is more easily captured in the deeper encoder output, and (b) the decoder, where object details and spatial dimension are gradually recovered. For example, some works employ deconvolution [92] to learn the upsampling of low resolution feature responses. SegNet [3] reuses the pooling indices from the encoder and learns extra convolutional layers to densify the feature responses, while U-Net [71] adds skip connections from the encoder features to the corresponding decoder activations, and [25] employs a Laplacian pyramid reconstruction network. More recently, RefineNet [54] and [70, 68, 39] have demonstrated the effectiveness of models based on the encoder-decoder structure on several semantic segmentation benchmarks. This type of model is also explored in the context of object detection.
Context module: This model contains extra modules laid out in cascade to encode long-range context. One effective method is to incorporate DenseCRF [45] (with efficient high-dimensional filtering algorithms [2]) into DCNNs. Furthermore, some works propose to jointly train both the CRF and DCNN components, while [59, 90] employ several extra convolutional layers on top of the belief maps of DCNNs (belief maps are the final DCNN feature maps that contain output channels equal to the number of predicted classes) to capture context information. Recently, [41] proposes to learn a general and sparse high-dimensional convolution (bilateral convolution), and [82, 8] combine Gaussian Conditional Random Fields and DCNNs for semantic segmentation.
Spatial pyramid pooling: This model employs spatial pyramid pooling to capture context at several ranges. The image-level features are exploited in ParseNet [58] for global context information. DeepLabv2 [11] proposes atrous spatial pyramid pooling (ASPP), where parallel atrous convolution layers with different rates capture multi-scale information. Recently, the Pyramid Scene Parsing Net (PSP) [95] performs spatial pooling at several grid scales and demonstrates outstanding performance on several semantic segmentation benchmarks. There are other methods based on LSTM [35] to aggregate global context [53, 6, 88]. Spatial pyramid pooling has also been applied in object detection [31].
In this work, we mainly explore atrous convolution as a context module and a tool for spatial pyramid pooling. Our proposed framework is general in the sense that it could be applied to any network. To be concrete, we duplicate several copies of the original last block in ResNet [32] and arrange them in cascade, and also revisit the ASPP module [11] which contains several atrous convolutions in parallel. Note that our cascaded modules are applied directly on the feature maps instead of belief maps. For the proposed modules, we experimentally find it important to train with batch normalization [38]. To further capture global context, we propose to augment ASPP with image-level features, similar to [58, 95].
Atrous convolution: Models based on atrous convolution have been actively explored for semantic segmentation. For example, [85] experiments with the effect of modifying atrous rates for capturing long-range information, [84] adopts hybrid atrous rates within the last two blocks of ResNet, while [18] further proposes to learn the deformable convolution, which samples the input features with learned offsets, generalizing atrous convolution. To further improve the segmentation model accuracy, [83] exploits image captions, [40] utilizes video motion, and [44] incorporates depth information. Besides, atrous convolution has also been applied to object detection.
3. Methods
In this section, we review how atrous convolution is applied to extract dense features for semantic segmentation. We then discuss our proposed modules, which employ atrous convolution either in cascade or in parallel.
3.1. Atrous Convolution for Dense Feature Extraction
Deep Convolutional Neural Networks (DCNNs) [50] deployed in fully convolutional fashion have been shown to be effective for the task of semantic segmentation. However, the repeated combination of max-pooling and striding at consecutive layers of these networks significantly reduces the spatial resolution of the resulting feature maps, typically by a factor of 32 across each direction in recent DCNNs. Deconvolutional layers (or transposed convolution) have been employed to recover the spatial resolution. Instead, we advocate the use of 'atrous convolution', originally developed for the efficient computation of the undecimated wavelet transform in the "algorithme à trous" scheme of [36] and used before in the DCNN context by [26, 74, 66].
Consider two-dimensional signals; for each location $i$ on the output $y$ and a filter $w$, atrous convolution is applied over the input feature map $x$:

$$y[i] = \sum_{k} x[i + r \cdot k]\, w[k]$$

where the atrous rate $r$ corresponds to the stride with which we sample the input signal, which is equivalent to convolving the input $x$ with upsampled filters produced by inserting $r - 1$ zeros between two consecutive filter values along each spatial dimension (hence the name atrous convolution, where the French word 'trous' means holes in English). Standard convolution is a special case for rate $r = 1$, and atrous convolution allows us to adaptively modify a filter's field-of-view by changing the rate value. See Fig. 1 for illustration.
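To make the sampling pattern concrete, the following is a minimal NumPy sketch of the formula above in one dimension (the 2-D case applies the same offsets along each axis). The function names are ours, and the final assertion checks the zero-insertion equivalence just described:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous convolution: y[i] = sum_k x[i + rate*k] * w[k], valid positions only."""
    k = len(w)
    span = rate * (k - 1) + 1                     # receptive field of the dilated kernel
    return np.array([sum(x[i + rate * j] * w[j] for j in range(k))
                     for i in range(len(x) - span + 1)])

def insert_trous(w, rate):
    """Upsample a filter by inserting rate-1 zeros ('trous') between consecutive taps."""
    up = np.zeros(rate * (len(w) - 1) + 1)
    up[::rate] = w
    return up

x = np.random.randn(20)
w = np.array([1.0, 2.0, 3.0])

# Sampling the input with stride `rate` equals standard convolution (rate = 1)
# with the zero-upsampled filter.
assert np.allclose(atrous_conv1d(x, w, rate=2),
                   atrous_conv1d(x, insert_trous(w, 2), rate=1))
```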
Atrous convolution also allows us to explicitly control how densely feature responses are computed in fully convolutional networks. Here, we denote by output_stride the ratio of input image spatial resolution to final output resolution. For the DCNNs deployed for the task of image classification, the final feature responses (before fully connected layers or global pooling) are 32 times smaller than the input image dimension, and thus output_stride = 32. If one would like to double the spatial density of computed feature responses in the DCNNs (i.e., output_stride = 16), the stride of the last pooling or convolutional layer that decreases resolution is set to 1 to avoid signal decimation. Then, all subsequent convolutional layers are replaced with atrous convolutional layers having rate $r = 2$. This allows us to extract denser feature responses without requiring learning any extra parameters. Please refer to [11] for more details.
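As an illustration of this stride-to-dilation conversion (a sketch only; PyTorch and the layer shapes below are our choices, not the paper's), setting the last decimating stride to 1 and dilating the subsequent convolutions by rate 2 doubles the output density while reusing the same weights:

```python
import torch
import torch.nn as nn

def backbone_tail(dense: bool) -> nn.Sequential:
    # In the original classification network the first conv halves the
    # resolution; for denser features we set its stride to 1 and replace the
    # subsequent convolutions with atrous convolutions of rate 2.
    stride, rate = (1, 2) if dense else (2, 1)
    return nn.Sequential(
        nn.Conv2d(256, 256, 3, stride=stride, padding=1),
        nn.Conv2d(256, 256, 3, padding=rate, dilation=rate),
        nn.Conv2d(256, 256, 3, padding=rate, dilation=rate),
    )

x = torch.randn(1, 256, 32, 32)
print(backbone_tail(dense=False)(x).shape)  # torch.Size([1, 256, 16, 16])
print(backbone_tail(dense=True)(x).shape)   # torch.Size([1, 256, 32, 32]) - twice as dense
```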
3.2. Going Deeper with Atrous Convolution
We first explore designing modules with atrous convolution laid out in cascade. To be concrete, we duplicate several copies of the last ResNet block, denoted as block4 in Fig. 3, and arrange them in cascade. There are three 3×3 convolutions in those blocks, and the last convolution contains stride 2 except the one in the last block, similar to the original ResNet. The motivation behind this model is that the introduced striding makes it easy to capture long range information in the deeper blocks. For example, the whole image feature could be summarized in the last small resolution feature map, as illustrated in Fig. 3 (a). However, we discover that the consecutive striding is harmful for semantic segmentation (see Tab. 1 in Sec. 4) since detail information is decimated, and thus we apply atrous convolution with rates determined by the desired output_stride value, as shown in Fig. 3 (b) where output_stride = 16.
In this proposed model, we experiment with cascaded ResNet blocks up to block7 (i.e., extra block5, block6, and block7 as replicas of block4), which has output_stride = 256 if no atrous convolution is applied.
(b) Going deeper with atrous convolution. Atrous convolution with rate > 1 is applied after block3 when output_stride = 16. Figure 3. Cascaded modules without and with atrous convolution.
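The cascade can be sketched as follows (our simplification: plain 3×3 convolutions with an illustrative channel width stand in for the actual ResNet units); the rate doubles from block4 to block7 so that the resolution stays at output_stride = 16, as in Fig. 3 (b):

```python
import torch
import torch.nn as nn

def extra_block(rate, ch=256):
    """Stand-in for a duplicated block4: three 3x3 convolutions, all dilated
    by `rate` so the spatial resolution is preserved instead of halved."""
    return nn.Sequential(*[
        nn.Conv2d(ch, ch, 3, padding=rate, dilation=rate) for _ in range(3)])

# block4..block7 with rates 2, 4, 8, 16: each doubling replaces the stride-2
# downsampling the original copy of the block would have applied.
cascade = nn.Sequential(*[extra_block(r) for r in (2, 4, 8, 16)])
feat = torch.randn(1, 256, 32, 32)   # e.g. a 512x512 input at output_stride = 16
print(cascade(feat).shape)           # torch.Size([1, 256, 32, 32]) - resolution kept
```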
3.2.1 Multi-grid Method
Motivated by multi-grid methods which employ a hierarchy of grids of different sizes, we adopt different atrous rates within block4 to block7 in the proposed model. In particular, we define as Multi_Grid = $(r_1, r_2, r_3)$ the unit rates for the three convolutional layers within block4 to block7. The final atrous rate for a convolutional layer is equal to the multiplication of the unit rate and the corresponding block rate. For example, when output_stride = 16 and Multi_Grid = (1, 2, 4), the three convolutions will have rates = 2 · (1, 2, 4) = (2, 4, 8) in block4, respectively.
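A small helper makes the rate arithmetic explicit (a sketch; the function name is ours):

```python
def block_rates(base_rate, multi_grid=(1, 2, 4)):
    """Final atrous rates for the three convolutions of a block: the block's
    base rate (set by the desired output_stride) times the unit rates."""
    return tuple(base_rate * u for u in multi_grid)

# output_stride = 16 gives block4 a base rate of 2, so Multi_Grid = (1, 2, 4)
# yields rates 2 * (1, 2, 4) = (2, 4, 8).
print(block_rates(2))  # (2, 4, 8)
```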
3.3. Atrous Spatial Pyramid Pooling
We revisit the Atrous Spatial Pyramid Pooling proposed in [11], where four parallel atrous convolutions with different atrous rates are applied on top of the feature map. ASPP is inspired by the success of spatial pyramid pooling [28, 49, 31] which showed that it is effective to resample features at different scales for accurately and efficiently classifying regions of an arbitrary scale. Different from [11], we include batch normalization within ASPP.
ASPP with different atrous rates effectively captures multi-scale information. However, we discover that as the sampling rate becomes larger, the number of valid filter weights (i.e., the weights that are applied to the valid feature region, instead of padded zeros) becomes smaller. This effect is illustrated in Fig. 4 when applying a 3×3 filter to a 65×65 feature map with different atrous rates. In the extreme case where the rate value is close to the feature map size, the 3×3 filter, instead of capturing the whole image context, degenerates to a simple 1×1 filter since only the center filter weight is effective.
To overcome this problem and incorporate global context information into the model, we adopt image-level features, similar to [58, 95]. Specifically, we apply global average pooling on the last feature map of the model, feed the resulting image-level features to a 1×1 convolution with 256 filters (and batch normalization [38]), and then bilinearly upsample the feature to the desired spatial dimension. In the end, our improved ASPP consists of (a) one 1×1 convolution and three 3×3 convolutions with rates = (6, 12, 18) when output_stride = 16 (all with 256 filters and batch normalization), and (b) the image-level features, as shown in Fig. 5. Note that the rates are doubled when output_stride = 8. The resulting features from all the branches are then concatenated and passed through another 1×1 convolution (also with 256 filters and batch normalization) before the final 1×1 convolution which generates the final logits.
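The following is a condensed PyTorch sketch of this augmented ASPP head under the description above (the framework, variable names, and eval-mode shape check are our choices; the final logits convolution is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """1x1 conv + three 3x3 atrous convs + image-level pooling, all 256 filters."""
    def __init__(self, in_ch=2048, ch=256, rates=(6, 12, 18)):  # rates for output_stride=16
        super().__init__()
        def conv_bn(k, r=1):
            return nn.Sequential(
                nn.Conv2d(in_ch, ch, k, padding=(k // 2) * r, dilation=r, bias=False),
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([conv_bn(1)] + [conv_bn(3, r) for r in rates])
        self.image_pool = conv_bn(1)        # 1x1 conv applied after global average pooling
        self.project = nn.Sequential(       # fuse the five concatenated branches
            nn.Conv2d(5 * ch, ch, 1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        # Image-level features: global pool, 1x1 conv, bilinear upsample back.
        img = F.interpolate(self.image_pool(x.mean((2, 3), keepdim=True)),
                            size=x.shape[2:], mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [img], dim=1))

aspp = ASPP().eval()  # eval mode: batch norm on the pooled 1x1 map has no batch stats to train
print(aspp(torch.randn(1, 2048, 33, 33)).shape)  # torch.Size([1, 256, 33, 33])
```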
Figure 4. Normalized counts of valid weights with a 3×3 filter on a 65×65 feature map as the atrous rate varies. When the atrous rate is small, all 9 filter weights are applied to most of the valid region on the feature map, while as the atrous rate gets larger, the 3×3 filter degenerates to a 1×1 filter since only the center weight is effective.
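The degeneration effect can be reproduced numerically with a short sketch (our normalization: counts are divided by the 9 filter weights, as in the figure):

```python
import numpy as np

def mean_valid_weights(rate, size=65, k=3):
    """Average number of k*k filter taps that land inside a size*size feature
    map, over all anchor positions of a zero-padded atrous convolution."""
    offsets = (np.arange(k) - k // 2) * rate            # tap offsets at this rate
    pos = np.arange(size)[:, None] + offsets[None, :]   # tap positions per anchor, one axis
    per_axis = ((pos >= 0) & (pos < size)).sum(1)       # valid taps along each axis
    return float((per_axis[:, None] * per_axis[None, :]).mean())

for r in (1, 8, 16, 32, 63):
    print(r, round(mean_valid_weights(r) / 9, 3))
# The normalized count decays toward 1/9 as the rate approaches the feature map
# size: only the center weight still falls on valid input.
```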
4. Experimental Evaluation
We adapt the ImageNet-pretrained [72] ResNet [32] to semantic segmentation by applying atrous convolution to extract dense features. Recall that output_stride is defined as the ratio of input image spatial resolution to final output resolution.