Original link: arXiv:1412.7062
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
Liang-Chieh Chen
Univ. of California, Los Angeles

George Papandreou*
Google Inc.

Iasonas Kokkinos
CentraleSupélec and INRIA

Kevin Murphy
Google Inc.

Alan L. Yuille
Univ. of California, Los Angeles
ABSTRACT
Deep Convolutional Neural Networks (DCNNs) have recently shown state-of-the-art performance in high-level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high-level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy beyond previous methods. Quantitatively, our method sets the new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy on the test set. We show how these results can be obtained efficiently: careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.
1 INTRODUCTION
Deep Convolutional Neural Networks (DCNNs) had been the method of choice for document recognition since LeCun et al. (1998), but have only recently become the mainstream of high-level vision research. Over the past two years DCNNs have pushed the performance of computer vision systems to soaring heights on a broad array of high-level problems, including image classification (Krizhevsky et al., 2013; Sermanet et al., 2013; Simonyan & Zisserman, 2014; Szegedy et al., 2014; Papandreou et al., 2014), object detection (Girshick et al., 2014), and fine-grained categorization (Zhang et al., 2014), among others. A common theme in these works is that DCNNs trained in an end-to-end manner deliver strikingly better results than systems relying on carefully engineered representations, such as SIFT or HOG features. This success can be partially attributed to the built-in invariance of DCNNs to local image transformations, which underpins their ability to learn hierarchical abstractions of data (Zeiler & Fergus, 2014). While this invariance is clearly desirable for high-level vision tasks, it can hamper low-level tasks, such as pose estimation (Chen & Yuille, 2014; Tompson et al., 2014) and semantic segmentation, where we want precise localization rather than abstraction of spatial details.
*Work initiated when G.P. was with the Toyota Technological Institute at Chicago. The first two authors contributed equally to this work.
There are two technical hurdles in the application of DCNNs to image labeling tasks: signal downsampling, and spatial 'insensitivity' (invariance). The first problem relates to the reduction of signal resolution incurred by the repeated combination of max-pooling and downsampling ('striding') performed at every layer of standard DCNNs (Krizhevsky et al., 2013; Simonyan & Zisserman, 2014; Szegedy et al., 2014). Instead, as in Papandreou et al. (2014), we employ the 'atrous' (with holes) algorithm originally developed for efficiently computing the undecimated discrete wavelet transform (Mallat, 1999). This allows efficient dense computation of DCNN responses in a scheme substantially simpler than earlier solutions to this problem (Giusti et al., 2013; Sermanet et al., 2013).
The second problem relates to the fact that obtaining object-centric decisions from a classifier requires invariance to spatial transformations, inherently limiting the spatial accuracy of the DCNN model. We boost our model's ability to capture fine details by employing a fully-connected Conditional Random Field (CRF). Conditional Random Fields have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the low-level information captured by the local interactions of pixels and edges (Rother et al., 2004; Shotton et al., 2009) or superpixels (Lucchi et al., 2011). Even though works of increased sophistication have been proposed to model the hierarchical dependency (He et al., 2004; Ladicky et al., 2009; Lempitsky et al., 2011) and/or high-order dependencies of segments (Delong et al., 2012; Gonfaus et al., 2010; Kohli et al., 2009; Chen et al., 2013; Wang et al., 2015), we use the fully connected pairwise CRF proposed by Krähenbühl & Koltun (2011) for its efficient computation and its ability to capture fine edge details while also catering for long-range dependencies. That model was shown in Krähenbühl & Koltun (2011) to largely improve the performance of a boosting-based pixel-level classifier, and in our work we demonstrate that it leads to state-of-the-art results when coupled with a DCNN-based pixel-level classifier.
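For concreteness, the fully connected pairwise CRF of Krähenbühl & Koltun (2011) minimizes an energy of the following form (a sketch in their notation; the weights $w_1, w_2$ and kernel widths $\sigma_\alpha, \sigma_\beta, \sigma_\gamma$ are model hyper-parameters):

```latex
E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),
\qquad
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[
  w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                   -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right)
+ w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right)
\right]
```

Here the unary terms $\theta_i(x_i)$ come from the pixel-level classifier, $p_i$ and $I_i$ are the position and color of pixel $i$, and $\mu(x_i, x_j) = [x_i \neq x_j]$ is the Potts label-compatibility; the first (bilateral) kernel encourages nearby pixels with similar color to take the same label, while the second enforces smoothness.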
The three main advantages of our "DeepLab" system are (i) speed: by virtue of the 'atrous' algorithm, our dense DCNN operates at 8 fps, while Mean Field Inference for the fully-connected CRF requires 0.5 seconds, (ii) accuracy: we obtain state-of-the-art results on the PASCAL semantic segmentation challenge, outperforming the second-best approach of Mostajabi et al. (2014) by a margin of 7.2%, and (iii) simplicity: our system is composed of a cascade of two fairly well-established modules, DCNNs and CRFs.
2 RELATED WORK
Our system works directly on the pixel representation, similarly to Long et al. (2014). This is in contrast to the two-stage approaches that are now most common in semantic segmentation with DCNNs: such techniques typically use a cascade of bottom-up image segmentation and DCNN-based region classification, which makes the system commit to potential errors of the front-end segmentation system. For instance, the bounding box proposals and masked regions delivered by (Arbeláez et al., 2014; Uijlings et al., 2013) are used in Girshick et al. (2014) and Hariharan et al. (2014b) as inputs to a DCNN to introduce shape information into the classification process. Similarly, the authors of Mostajabi et al. (2014) rely on a superpixel representation. A celebrated non-DCNN precursor to these works is the second-order pooling method of (Carreira et al., 2012), which also assigns labels to the region proposals delivered by (Carreira & Sminchisescu, 2012). Understanding the perils of committing to a single segmentation, the authors of Cogswell et al. (2014) build on (Yadollahpour et al., 2013) to explore a diverse set of CRF-based segmentation proposals, also computed by (Carreira & Sminchisescu, 2012). These segmentation proposals are then re-ranked according to a DCNN trained in particular for this reranking task. Even though this approach explicitly tries to handle the temperamental nature of a front-end segmentation algorithm, there is still no explicit exploitation of the DCNN scores in the CRF-based segmentation algorithm: the DCNN is only applied post-hoc, while it would make sense to directly try to use its results during segmentation.
Moving towards works that lie closer to our approach, several other researchers have considered the use of convolutionally computed DCNN features for dense image labeling. Among the first have been Farabet et al. (2013), who apply DCNNs at multiple image resolutions and then employ a segmentation tree to smooth the prediction results; more recently, Hariharan et al. (2014a) propose to concatenate the computed intermediate feature maps within the DCNNs for pixel classification, and Dai et al. (2014) propose to pool the intermediate feature maps by region proposals. Even though these works still employ segmentation algorithms that are decoupled from the DCNN classifier's results, we believe it is advantageous that segmentation is only used at a later stage, avoiding the commitment to premature decisions.
More recently, the segmentation-free techniques of (Long et al., 2014; Eigen & Fergus, 2014) directly apply DCNNs to the whole image in a sliding window fashion, replacing the last fully connected layers of a DCNN by convolutional layers. In order to deal with the spatial localization issues outlined in the beginning of the introduction, Long et al. (2014) upsample and concatenate the scores from intermediate feature maps, while Eigen & Fergus (2014) refine the prediction result from coarse to fine by propagating the coarse results to another DCNN.
The main difference between our model and other state-of-the-art models is the combination of pixel-level CRFs and DCNN-based 'unary terms'. Focusing on the closest works in this direction, Cogswell et al. (2014) use CRFs as a proposal mechanism for a DCNN-based reranking system, while Farabet et al. (2013) treat superpixels as nodes for a local pairwise CRF and use graph-cuts for discrete inference; as such their results can be limited by errors in superpixel computations, while ignoring long-range superpixel dependencies. Our approach instead treats every pixel as a CRF node, exploits long-range dependencies, and uses CRF inference to directly optimize a DCNN-driven cost function. We note that mean field had been extensively studied for traditional image segmentation/edge detection tasks, e.g., (Geiger & Girosi, 1991; Geiger & Yuille, 1991; Kokkinos et al., 2008), but recently Krähenbühl & Koltun (2011) showed that the inference can be very efficient for a fully connected CRF and particularly effective in the context of semantic segmentation.
After the first version of our manuscript was made publicly available, it came to our attention that two other groups have independently and concurrently pursued a very similar direction, combining DCNNs and densely connected CRFs (Bell et al., 2014; Zheng et al., 2015). There are several differences in technical aspects of the respective models. Bell et al. (2014) focus on the problem of material classification, while Zheng et al. (2015) unroll the CRF mean-field inference steps to convert the whole system into an end-to-end trainable feed-forward network.
We have updated our proposed "DeepLab" system with much improved methods and results in our latest work (Chen et al. 2016). We refer the interested reader to the paper for details.
3 CONVOLUTIONAL NEURAL NETWORKS FOR DENSE IMAGE LABELING
Herein we describe how we have re-purposed and finetuned the publicly available Imagenet-pretrained state-of-the-art 16-layer classification network of Simonyan & Zisserman (2014) (VGG-16) into an efficient and effective dense feature extractor for our dense semantic image segmentation system.
3.1 EFFICIENT DENSE SLIDING WINDOW FEATURE EXTRACTION WITH THE HOLE ALGORITHM
Dense spatial score evaluation is instrumental in the success of our dense CNN feature extractor. As a first step to implement this, we convert the fully-connected layers of VGG-16 into convolutional ones and run the network in a convolutional fashion on the image at its original resolution. However this is not enough as it yields very sparsely computed detection scores (with a stride of 32 pixels). To compute scores more densely at our target stride of 8 pixels, we develop a variation of the method previously employed by Giusti et al. (2013); Sermanet et al. (2013). We skip subsampling after the last two max-pooling layers in the network of Simonyan & Zisserman (2014) and modify the convolutional filters in the layers that follow them by introducing zeros to increase their length (2× in the last three convolutional layers and 4× in the first fully connected layer). We can implement this more efficiently by keeping the filters intact and instead sparsely sampling the feature maps on which they are applied, using an input stride of 2 or 4 pixels, respectively. This approach, illustrated in Fig. 1, is known as the 'hole algorithm' ('atrous algorithm') and has been developed before for efficient computation of the undecimated wavelet transform (Mallat, 1999). We have implemented this within the Caffe framework (Jia et al., 2014) by adding to the im2col function (it converts multichannel feature maps to vectorized patches) the option to sparsely sample the underlying feature map. This approach is generally applicable and allows us to efficiently compute dense CNN feature maps at any target subsampling rate without introducing any approximations.
Figure 1: Illustration of the hole algorithm in 1-D, when kernel_size = 3, input_stride = 2, and output_stride = 1.
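Using the figure's parameters (kernel_size = 3, input_stride = 2), the 1-D hole algorithm can be sketched as follows; the function name and the toy signal are ours, not part of the paper:

```python
import numpy as np

def atrous_conv1d(signal, kernel, input_stride):
    """1-D 'hole' (atrous) convolution: keep the kernel intact but sample
    the input sparsely with the given input stride, which is equivalent to
    inserting input_stride - 1 zeros between the kernel taps. Scores are
    computed densely at every position where the dilated kernel fits."""
    k = len(kernel)
    span = (k - 1) * input_stride + 1  # effective (dilated) kernel span
    out = []
    for start in range(len(signal) - span + 1):
        taps = signal[start : start + span : input_stride]
        out.append(float(np.dot(taps, kernel)))
    return np.array(out)

signal = np.arange(10, dtype=float)  # toy 1-D input
kernel = np.array([1.0, 1.0, 1.0])   # kernel_size = 3

dense = atrous_conv1d(signal, kernel, input_stride=2)

# sanity check: identical to convolving with the zero-upsampled kernel
dilated = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
assert np.allclose(dense, np.convolve(signal, dilated, mode="valid"))
```

Sampling the input sparsely with the kernel intact is exactly equivalent to convolving densely with a kernel that has zeros ('holes') inserted between its taps, which is how the filters described above are conceptually enlarged.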
We finetune the model weights of the Imagenet-pretrained VGG-16 network to adapt it to the semantic segmentation task in a straightforward fashion, following the procedure of Long et al. (2014). We replace the 1000-way Imagenet classifier in the last layer of VGG-16 with a 21-way one. Our loss function is the sum of cross-entropy terms for each spatial position in the CNN output map (subsampled by 8 compared to the original image). All positions and labels are equally weighted in the overall loss function. Our targets are the ground truth labels (subsampled by 8). We optimize the objective function with respect to the weights at all network layers by the standard SGD procedure of Krizhevsky et al. (2013).
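The training objective described above can be sketched as follows (a NumPy illustration with invented names; the actual system computes this loss inside Caffe):

```python
import numpy as np

def dense_cross_entropy(scores, targets):
    """Sum of cross-entropy terms over all spatial positions.

    scores:  (H, W, C) raw class scores from the CNN output map
             (already subsampled by 8 relative to the input image).
    targets: (H, W) integer ground-truth labels, subsampled by 8.
    All positions and labels are weighted equally."""
    # numerically stable log-softmax over the class dimension
    shifted = scores - scores.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = targets.shape
    # pick the log-probability of the ground-truth label at each position
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], targets]
    return -picked.sum()

# toy 2x2 output map with 3 classes, uniform scores
scores = np.zeros((2, 2, 3))
targets = np.zeros((2, 2), dtype=int)
loss = dense_cross_entropy(scores, targets)  # 4 positions * log(3)
```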
During testing, we need class score maps at the original image resolution. As illustrated in Figure 2 and further elaborated in Section 4.1, the class score maps (corresponding to log-probabilities) are quite smooth, which allows us to use simple bilinear interpolation to increase their resolution by a factor of 8 at a negligible computational cost. Note that the method of Long et al. (2014) does not use the hole algorithm and produces very coarse scores (subsampled by a factor of 32) at the CNN output. This forced them to use learned upsampling layers, significantly increasing the complexity and training time of their system: Fine-tuning our network on PASCAL VOC 2012 takes about 10 hours, while they report a training time of several days (both timings on a modern GPU).
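The ×8 bilinear upsampling step can be sketched as a simple separable interpolation (the helper below is our illustration, not the paper's code; it treats pixel values as samples on an integer grid and assumes maps of at least 2×2):

```python
import numpy as np

def bilinear_upsample(score_map, factor):
    """Upsample a 2-D score map by `factor` with bilinear interpolation."""
    h, w = score_map.shape
    # target sample coordinates in the source grid (corners map to corners)
    ys = np.linspace(0, h - 1, h * factor)
    xs = np.linspace(0, w - 1, w * factor)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = (ys - y0)[:, None]   # fractional offsets, broadcast over columns
    wx = (xs - x0)[None, :]   # fractional offsets, broadcast over rows
    tl = score_map[np.ix_(y0, x0)]          # top-left neighbors
    tr = score_map[np.ix_(y0, x0 + 1)]      # top-right
    bl = score_map[np.ix_(y0 + 1, x0)]      # bottom-left
    br = score_map[np.ix_(y0 + 1, x0 + 1)]  # bottom-right
    top = tl * (1 - wx) + tr * wx
    bot = bl * (1 - wx) + br * wx
    return top * (1 - wy) + bot * wy

coarse = np.array([[0.0, 8.0], [8.0, 16.0]])  # toy coarse score map
fine = bilinear_upsample(coarse, factor=8)
```

Because the score maps are smooth, this fixed interpolation recovers full resolution without the learned upsampling layers that Long et al. (2014) require.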
3.2 CONTROLLING THE RECEPTIVE FIELD SIZE AND ACCELERATING DENSE COMPUTATION WITH CONVOLUTIONAL NETS
Another key ingredient in re-purposing our network for dense score computation is explicitly controlling the network's receptive field size. Most recent DCNN-based image recognition methods rely on networks pre-trained on the Imagenet large-scale classification task. These networks typically have large receptive field size: in the case of the VGG-16 net we consider, its receptive field is 224×224 (with zero-padding) and 404×404 pixels if the net is applied convolutionally. After converting the network to a fully convolutional one, the first fully connected layer has 4,096 filters of large 7×7 spatial size and becomes the computational bottleneck in our dense score map computation.
We have addressed this practical problem by spatially subsampling (by simple decimation) the first FC layer to a 4×4 (or 3×3) spatial size. This has reduced the receptive field of the network down to 128×128 (with zero-padding) or 308×308 (in convolutional mode) and has reduced computation time for the first FC layer by 2 to 3 times. Using our Caffe-based implementation and a Titan GPU, the resulting VGG-derived network is very efficient: given a 306×306 input image, it produces 39×39 dense raw feature scores at about 8 fps during testing.
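The decimation of the first FC layer's filters can be sketched as follows (shapes shrunk to toy sizes, and the choice of evenly spaced taps for 'simple decimation' is our assumption):

```python
import numpy as np

def decimate_filters(filters, keep=4):
    """Spatially subsample convolutionalized FC filters by simple decimation,
    keeping `keep` evenly spaced taps per spatial dimension.
    filters: (num_filters, in_channels, kh, kw)."""
    n, c, kh, kw = filters.shape
    idx_h = np.round(np.linspace(0, kh - 1, keep)).astype(int)
    idx_w = np.round(np.linspace(0, kw - 1, keep)).astype(int)
    return filters[:, :, idx_h][:, :, :, idx_w]

# toy stand-in for VGG-16's fc6 run as a 7x7 convolution
# (the real layer has 4096 filters over 512 input channels)
fc6 = np.random.randn(8, 3, 7, 7)
small = decimate_filters(fc6, keep=4)

# per-position multiply-accumulate reduction from 7x7 down to 4x4 taps,
# roughly consistent with the reported 2-3x speedup of the first FC layer
ratio = (7 * 7) / (4 * 4)
```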