This ICCV paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the version available on IEEE Xplore.
Visual Tracking with Fully Convolutional Networks
Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu
Dalian University of Technology, China
The Chinese University of Hong Kong, Hong Kong, China
Abstract
We propose a new approach for general object tracking with a fully convolutional neural network. Instead of treating the convolutional neural network (CNN) as a black-box feature extractor, we conduct an in-depth study on the properties of CNN features offline pre-trained on massive image data and the classification task on ImageNet. The discoveries motivate the design of our tracking system. It is found that convolutional layers at different levels characterize the target from different perspectives. A top layer encodes more semantic features and serves as a category detector, while a lower layer carries more discriminative information and can better separate the target from distracters with similar appearance. Both layers are jointly used with a switch mechanism during tracking. It is also found that, for a tracking target, only a subset of neurons are relevant. A feature map selection method is developed to remove noisy and irrelevant feature maps, which can reduce computational redundancy and improve tracking accuracy. Extensive evaluation on the widely used tracking benchmark [36] shows that the proposed tracker outperforms the state-of-the-art significantly.
1. Introduction
Visual tracking, as a fundamental problem in computer vision, has found wide applications. Although much progress has been made in the past decade, tremendous challenges still exist in designing a robust tracker that can well handle significant appearance changes, pose variations, severe occlusions, and background clutters.
Existing appearance-based tracking methods adopt either generative or discriminative models to separate the foreground from the background and distinguish co-occurring objects. One major drawback is that they rely on low-level hand-crafted features which are incapable of capturing semantic information of targets, are not robust to significant appearance changes, and only have limited discriminative power.
Driven by the emergence of large-scale visual data sets and the fast development of computational power, Deep Neural Networks (DNNs), especially convolutional neural networks (CNNs) [17], with their strong capabilities of learning feature representations, have demonstrated record-breaking performance in computer vision tasks, e.g., image classification [16, 27], object detection [8, 23, 22], and saliency detection. Different from hand-crafted features, those learned by CNNs from massive annotated visual data and a large number of object classes (such as ImageNet [4]) carry rich high-level semantic information and are strong at distinguishing objects of different categories. These features have good generalization capability across data sets. Recent studies have also shown that such features are robust to data corruption. Their neuron responses have strong selectiveness on object identities, i.e., for a particular object only a subset of neurons respond, and different objects have different responding neurons.
Figure 1. Feature maps for target localization. (a)(b) From left to right: input image, the ground truth target heat map, the predicted heat maps using feature maps of the conv5-3 and conv4-3 layers of the VGG network [27] (see Section 4.2 for the regression method). (c) From left to right: input image, ground truth foreground mask, average feature maps of the conv5-3 (top) and conv4-3 (bottom) layers, average selected feature maps of the conv5-3 (top) and conv4-3 (bottom) layers (see Section 4.1 for the feature map selection method).
All these motivate us to apply CNNs to address the above challenges faced by tracking. Given the limited number of training samples in online tracking and the complexity of deep models, directly applying CNNs to tracking is suboptimal, since the power of CNNs relies on large-scale training. Prior works attempted to transfer offline learned DNN features (e.g., from ImageNet) to online tracking and achieved state-of-the-art performance. However, the DNN was treated as a black-box classifier in these works. In contrast, we conduct an in-depth study on the properties of CNN features from the perspective of online visual tracking, in order to make better use of them in terms of both efficiency and accuracy. Two such properties are discovered, and they motivate the design of our tracking system.
First, CNN features at different levels/depths have different properties that fit the tracking problem. A top convolutional layer captures more abstract and high-level semantic features. They are strong at distinguishing objects of different classes and are very robust to deformation and occlusion as shown in Figure 1 (a). However, they are less discriminative to objects of the same category as shown by the examples in Figure 1 (b). A lower layer provides more detailed local features which help to separate the target from distracters (e.g. other objects in the same class) with similar appearance as shown in Figure 1 (b). But they are less robust to dramatic change of appearance, as shown in Figure 1 (a). Based on these observations, we propose to automatically switch the usage of these two layers during tracking depending on the occurrence of distracters.
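The switching idea can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's exact criterion (the proposed algorithm is detailed in Section 4): here a distracter is flagged when the top-layer heat map shows substantial response outside the expected target region, in which case localization falls back to the lower-layer heat map. All function names and the threshold are hypothetical.

```python
import numpy as np

def localize(heat_map):
    """Return the (row, col) of the maximum response in a heat map."""
    return np.unravel_index(np.argmax(heat_map), heat_map.shape)

def distracter_detected(heat_map, target_box, ratio_threshold=0.2):
    """Flag a distracter when the response outside the expected target
    region is comparable to the response inside it."""
    r0, r1, c0, c1 = target_box
    inside = heat_map[r0:r1, c0:c1].sum()
    outside = heat_map.sum() - inside
    return outside > ratio_threshold * inside

def switch_track(semantic_heat, discriminative_heat, target_box):
    """Use the top-layer (semantic) heat map by default; fall back to the
    lower-layer (discriminative) heat map when a distracter appears."""
    if distracter_detected(semantic_heat, target_box):
        return localize(discriminative_heat)
    return localize(semantic_heat)
```

In practice both heat maps would be regressed from the corresponding feature maps; the switch only decides which prediction to trust for the current frame.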
Second, the CNN features pre-trained on ImageNet are for distinguishing generic objects. However, for a particular target, not all the features are useful for robust tracking. Some feature responses may serve as noise. As shown in Figure 1 (c), it is hard to distinguish the target object from background if all the feature maps are used. In contrast, through proper feature selection, the noisy feature maps not related to the representation of the target are cleared out and the remaining ones can more accurately highlight the target and suppress responses from background. We propose a principled method to select discriminative feature maps and discard noisy or unrelated ones for the tracking target.
The contributions of this work are threefold:
i) We analyze CNN features learned from the large-scale image classification task and identify important properties for online tracking. This facilitates further understanding of CNN features and helps the design of effective CNN-based trackers.
ii) We propose a new tracking method which jointly considers two convolutional layers of different levels so that they complement each other in handling drastic appearance change and in distinguishing the target object from its similar distracters. This design significantly mitigates drift.
iii) We develop a principled method which automatically selects discriminative feature maps and discards noisy or unrelated ones, further improving tracking accuracy.
Evaluation on the widely used tracking benchmark [36] shows that the proposed method well handles a variety of challenging problems and outperforms state-of-the-art methods.
2. Related Work
A tracker contains two components: an appearance model updated online and a search strategy to find the most likely target locations. Most recent works focus on the design of appearance models. In generative models, candidates are searched to minimize reconstruction errors. For example, Ross et al. [25] learned a subspace online to model target appearance. Recently, sparse coding has been exploited for tracking [21, 3, 32, 31, 30], where the target is reconstructed by a sparse linear combination of target templates. In discriminative models, tracking is cast as a foreground-background separation problem. Online learning algorithms based on CRFs [24], boosting [9], multiple instance learning [2], and structured SVM [10] were applied to tracking and achieved good performance. In [40], generative and discriminative models were combined for more accurate online tracking. All these methods used hand-crafted features.
The application of DNNs in online tracking is not yet fully explored. In [35], a stacked denoising autoencoder (SDAE) was offline trained on an auxiliary tiny-image data set to learn generic features and then used for online tracking. In [19], tracking was performed as foreground-background classification with a CNN trained online without offline pretraining. Fan et al. [6] used a fully convolutional network for human tracking. It took the whole frame as input and predicted the foreground heat map by one-pass forward propagation, saving redundant computation. In contrast, [35] and [19] operated in a patch-by-patch scanning manner: given the patches cropped from the frame, the DNN had to be evaluated once per patch, and the overlap between patches leads to a lot of redundant computation. In [12], pre-trained CNN features were used to construct target-specific saliency maps for online tracking. Existing works treated DNNs as black-box feature extractors. Our contributions summarized in Section 1 were not explored in these works.
3. Deep Feature Analysis for Visual Tracking
Analysis of deep representations is important for understanding the mechanism of deep learning. However, such analysis is still rare for the purpose of visual tracking. In this section, we present some important properties of CNN features which can better facilitate visual tracking. Our feature analysis is conducted based on the 16-layer VGG network [27] pre-trained on the ImageNet image classification task [4], which consists of 13 convolutional layers followed by 3 fully connected layers. We mainly focus on the conv4-3 layer (the 10-th convolutional layer) and the conv5-3 layer (the 13-th convolutional layer), both of which generate 512 feature maps.
Figure 2. CNNs trained on image classification task carry spatial configuration information. (a) input image (top) and ground truth foreground mask. (b) feature maps (top row) of conv4-3 layer which are activated within the target region and are discriminative to the background distracter. Their associated saliency maps (bottom row) are mainly focused on the target region. (c) feature maps (top row) of conv5-3 layer which are activated within the target region and capture more semantic information of the category (both the target and background distracter). Their saliency maps (bottom row) present spatial information of the category.
Observation 1 Although the receptive field of CNN feature maps is large, the activated feature maps are sparse and localized. The activated regions are highly correlated to the regions of semantic objects.
Due to pooling and convolutional layers, the receptive fields of the conv4-3 and conv5-3 layers are very large. Figure 2 shows some feature maps with the maximum activation values in the object region. It can be seen that the feature maps have only small regions with nonzero values. These nonzero values are localized and mainly correspond to the image region of foreground objects. We also use the approach in [26] to obtain the saliency maps of CNN features. The saliency maps in Figure 2 (bottom row) show that the changes of the input that result in the largest increase of the selected feature maps are located within the object regions. Therefore, the feature maps capture the visual representation related to the objects. This evidence indicates that DNN features learned from image classification are localized and correlated to object visual cues. Thus, these CNN features can be used for target localization.
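Observation 1 can be checked numerically. The sketch below (helper names are ours; plain numpy arrays stand in for an upsampled feature map and the binary foreground mask) measures how sparse a feature map is and what fraction of its activation falls inside the object region:

```python
import numpy as np

def localization_score(feature_map, fg_mask):
    """Fraction of the map's total activation that falls inside the
    foreground mask (1.0 means perfectly localized on the object)."""
    total = feature_map.sum()
    if total == 0:
        return 0.0
    return float((feature_map * fg_mask).sum() / total)

def sparsity(feature_map):
    """Fraction of zero-valued activations in the feature map."""
    return float((feature_map == 0).mean())
```

A feature map matching Observation 1 has both a high `sparsity` (few active neurons) and a high `localization_score` (the active neurons lie on the object).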
Observation 2 Many CNN feature maps are noisy or unrelated for the task of discriminating a particular target from its background.
The CNN features pre-trained on ImageNet can describe a large variety of generic objects, and they are therefore expected to detect abundant visual patterns with a large number of neurons. However, when tracking a particular target object, the tracker should focus on a much smaller subset of visual patterns which well separate the target from the background. As illustrated in Figure 1 (c), the average of all the feature maps is cluttered with background noise. We should also discard feature maps that have high responses in both the target region and the background region, so that the tracker does not drift to the background. Figure 3 shows the histograms of the activation values of all the feature maps within the object region, where the activation value of a feature map is defined as the sum of its responses in the object region. As demonstrated in Figure 3, most of the feature maps have small or zero activation values within the object region. Therefore, many feature maps are not related to the target object. This property makes it possible to select only a small number of feature maps with little degradation in tracking performance.
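As a simple illustration of this idea, the sketch below ranks feature maps by the activation value defined above and keeps the top k. The scoring rule and names are ours for illustration; the principled selection method actually used by the tracker is the one referred to in Section 4.1.

```python
import numpy as np

def activation_values(feature_maps, fg_mask):
    """Activation value of each map: the sum of its responses inside
    the object region, as used for the histograms in Figure 3."""
    return np.array([(fm * fg_mask).sum() for fm in feature_maps])

def select_feature_maps(feature_maps, fg_mask, k):
    """Keep the k maps (of a (K, H, W) array) that respond most strongly
    inside the target region; the rest are treated as noisy or unrelated."""
    scores = activation_values(feature_maps, fg_mask)
    keep = np.argsort(scores)[::-1][:k]
    return keep, feature_maps[keep]
```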
Figure 3. Activation value histograms of feature maps in conv4-3 (left) and conv5-3 (right).
Observation 3 Different layers encode different types of features. Higher layers capture semantic concepts on object categories, whereas lower layers encode more discriminative features to capture intra class variations.
Because of the redundancy of feature maps, we employ a sparse representation scheme to facilitate better visualization. By feeding an image of an object forward through the VGG network, we obtain the feature maps F ∈ R^{d×K} of a convolutional layer (either the conv4-3 or conv5-3 layer),
We use the term receptive field to denote the input image region that is connected to a particular neuron in the feature maps.
Figure 4. The first and the third rows are input images. The second and the fourth rows are reconstructed foreground masks using conv5-3 feature maps. The sparse coefficients are computed using the images in the first column and directly applied to the other columns without change.
where each feature map is reshaped into a d-dimensional vector, and K denotes the number of feature maps. We further associate the image with a foreground mask M ∈ R^d, where the i-th element m_i = 1 if the i-th neuron of each feature map is located within the foreground object, and m_i = 0 otherwise. We reconstruct the foreground mask using a subset of the feature maps by solving

    min_c ‖M − F c‖₂² + λ‖c‖₁,    (1)

where c ∈ R^K is the sparse coefficient vector, and λ is the parameter to balance the reconstruction error and sparsity.
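Problem (1) is a standard ℓ1-regularized least-squares (lasso) problem, so any sparse solver applies. Below is a minimal numpy sketch using ISTA (iterative soft-thresholding); the choice of solver and the step-size rule are ours for illustration, as the text does not prescribe one.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_coefficients(F, M, lam=0.1, n_iters=2000):
    """Solve min_c ||M - F c||_2^2 + lam * ||c||_1 by ISTA.
    F is the d x K feature-map matrix, M the d-dim foreground mask."""
    d, K = F.shape
    c = np.zeros(K)
    # Step size 1/L, where L = 2 * sigma_max(F)^2 is the Lipschitz
    # constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(F, 2) ** 2
    for _ in range(n_iters):
        grad = 2.0 * F.T @ (F @ c - M)
        c = soft_threshold(c - grad / L, lam / L)
    return c
```

For a small λ the recovered c reconstructs M from only a few feature maps, which is exactly the sparse selection used for the visualizations in Figure 4.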
Figure 4 shows some of the reconstructed foreground masks using the feature maps of conv5-3. For the two examples (face and motorcycle) in Figure 4, we only compute the sparse coefficients for the images in the first column and use these coefficients to reconstruct foreground masks for the rest of the columns. The selected feature maps in Figure 4 (a) capture the semantic concept of human faces and are robust to appearance variation and even identity change. The selected feature maps in Figure 4 (b) accurately separate the target from the cluttered background and are invariant to pose variation and rotation. Although trained on the image classification task, the high-level semantic representation of object categories encoded by the conv5-3 layer enables object localization. However, these features are not discriminative enough for different objects of the same category, so they cannot be directly applied to visual tracking.
Compared with the conv5-3 feature maps, the features captured by conv4-3 are more sensitive to intra-class appearance variation. In Figure 2, the selected feature maps of conv4-3 can well separate the target person from the other non-target person. Besides, different feature maps focus on different object parts.
Table 1. Face classification accuracy using different feature maps. Experiment 1 is to classify face and non-face. Experiment 2 is to classify face identities.
To further verify this, we conduct two quantitative experiments. 1800 human face images belonging to six identities and 2000 images containing non-face objects are collected from the benchmark sequences [36]. Each image is associated with a foreground mask to indicate the region of the foreground object. In the first experiment, we evaluate the accuracy in classifying the images into face and non-face using the conv4-3 and conv5-3 layers separately. Three face images belonging to three identities are selected as positive training samples to compute a set of sparse coefficients c via (1). At the test stage, given the feature maps F and the foreground mask M of an input image, the reconstruction error of the foreground mask is computed by

    e = ‖M − F c‖₂².    (2)
The image is classified as a face image if its reconstruction error e is less than a predefined threshold; otherwise, it is classified as a non-face image.
In the second experiment, our task is to classify all the face images into different identities. For each identity j, 20 images are used as training samples to learn the sparse coefficients c_j using (1). At the test stage, the foreground mask reconstruction error for each identity is calculated, and the test image is assigned to the identity with the minimum error as follows:

    ĵ = arg min_j ‖M − F c_j‖₂².    (3)
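Both experimental protocols reduce to comparing foreground-mask reconstruction errors. A sketch, assuming the feature-map matrix F, mask M, and the learned coefficients are given (helper names are ours):

```python
import numpy as np

def reconstruction_error(F, M, c):
    """Foreground-mask reconstruction error ||M - F c||_2^2."""
    r = M - F @ c
    return float(r @ r)

def is_face(F, M, c, threshold):
    """Experiment 1: classify as face when the reconstruction error
    under the face coefficients falls below a predefined threshold."""
    return reconstruction_error(F, M, c) < threshold

def classify_identity(F, M, coeffs):
    """Experiment 2: assign the identity whose coefficients give the
    minimum foreground-mask reconstruction error."""
    errors = [reconstruction_error(F, M, c) for c in coeffs]
    return int(np.argmin(errors))
```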
The classification accuracies using the feature maps of conv4-3 and conv5-3 in the two experiments are reported in Table 1. The feature maps of conv5-3 encode high-level semantic information and can better separate face from non-face objects, but they achieve lower accuracy than the feature maps of conv4-3 in discriminating one identity from another. The feature maps of conv4-3 preserve more middle-level information and enable more accurate classification of different images belonging to the same category (human faces), but they are worse than the feature maps of conv5-3 in discriminating face from non-face. These results motivate us to consider the two layers jointly for more robust tracking.
4. Proposed Algorithm
An overview (Figure 5) of the proposed fully convolutional network based tracker (FCNT) is as follows: