Source: arXiv:1501.04587
Transferring Rich Feature Hierarchies for Robust Visual Tracking
Naiyan Wang    Siyi Li    Abhinav Gupta    Dit-Yan Yeung
Hong Kong University of Science and Technology Carnegie Mellon University
winsty@gmail.com sliay@cse.ust.hk abhinavg@cs.cmu.edu dyyeung@cse.ust.hk
Abstract
Convolutional neural network (CNN) models have demonstrated great success in various computer vision tasks including image classification and object detection. However, some equally important tasks such as visual tracking remain relatively unexplored. We believe that a major hurdle hindering the application of CNN to visual tracking is the lack of properly labeled training data. While existing applications that unleash the power of CNN often need an enormous amount of training data on the order of millions of examples, visual tracking applications typically have only one labeled example in the first frame of each video. We address this research issue here by pre-training a CNN offline and then transferring the rich feature hierarchies learned to online tracking. The CNN is also fine-tuned during online tracking to adapt to the appearance of the tracked target specified in the first video frame. To fit the characteristics of object tracking, we first pre-train the CNN to recognize what is an object, and then propose to generate a probability map instead of producing a simple class label. Using two challenging open benchmarks for performance evaluation, our proposed tracker has demonstrated substantial improvement over other state-of-the-art trackers.
1. Introduction
The past few years have been very exciting in the history of computer vision. Much of the excitement has come from applying biologically inspired convolutional neural network (CNN) models to challenging computer vision tasks, with breakthrough performance reported for image classification [20] and object detection [10]. However, some other computer vision tasks such as visual tracking remain relatively unexplored in this recent surge of research interest. We believe that a major reason is the lack of sufficient labeled training data, which usually plays a very important role in achieving breakthrough performance in other applications because CNN training is typically done in a fully supervised manner. In the case of visual tracking, however, labeled training data is usually very limited, often with only one labeled example: the object to track as specified in the first frame of each video. This makes direct application of the large-scale CNN approach infeasible. In this paper, we present an approach that addresses this challenge and hence brings the CNN framework to visual tracking. Using this approach to implement a tracker, we achieve very promising performance that outperforms the best state-of-the-art baseline tracker by a substantial margin (see Fig. 1 for some qualitative tracking results).
Figure 1. Tracking results for motocross 1 and skiing video sequences (SO-DLT is our proposed tracker).
Although visual tracking can be formulated in different settings according to different applications, the focus of this paper is the one-pass, model-free, single-object tracking setting. Specifically, it assumes that the bounding box of one single object in the first frame is given but no other appearance model is available. Given this single (labeled) instance, the goal is to track the movement of the object in an online manner. Consequently, this setting involves adapting the tracker to appearance changes of the object based on the possibly noisy output of the tracker. Another way to formulate this problem is as a self-taught one-shot learning problem in which the single example comes from the previous frame. Since learning a visual model from a single example is an ill-posed problem, a successful approach requires using some auxiliary data to learn an invariant representation of generic object features. Although some recent work also shares this spirit, the performance reported is inferior to the state of the art, due to the lack of sufficient training data on one hand and the limited representational power of the models used on the other. CNN has a role to play here by learning more robust features. To make this feasible with the limited training data available during online tracking, we pre-train a CNN offline and then transfer the generic features learned to the online tracking task.
The first deep learning tracker (DLT) [31] reported in the literature is based on a stacked denoising autoencoder network. While this approach is very promising, the exact realization reported in the paper has two limitations that hinder the tracking performance of DLT compared to other state-of-the-art trackers. First, the pre-training of DLT may not be very suitable for tracking applications. The data used for pre-training is from the Tiny Images dataset [29], with each image obtained by down-sampling directly from a full-sized image. Although some generic image features can be learned by learning to reconstruct the input images, the target to track in a typical tracking task is a single object rather than an entire image. Features that are effective for tracking should be able to distinguish objects from non-objects (i.e., background), not just reconstruct an entire image. Second, in each frame, DLT first generates candidates or proposals of the target based on the predictions of the previous frames, and then treats tracking as a classification problem. It ignores the structured nature of bounding boxes in that a bounding box or segmentation result corresponds to a region of an image, not just a simple label or real number as in a classification or regression problem. Some previous work showed that exploiting the structured nature explicitly in the model could improve the performance significantly. Moreover, the number of proposals is usually on the order of several hundred, making it hard to apply larger and more powerful deep learning models.
We propose a novel structured output CNN which transfers generic object features for online tracking. The contributions of our paper are summarized as follows:
- To alleviate the overfitting and drifting problems during online tracking, we pre-train the CNN to distinguish objects from non-objects instead of simply reconstructing the input or performing categorical classification on large-scale datasets with object-level annotations [7].
- The output of the CNN is a pixel-wise map to indicate the probability that each pixel in the input image belongs to the bounding box of an object. The key advantages of the pixel-wise output are its induced structured loss and its computational scalability.
- We evaluate our proposed method on an open benchmark [34] as well as a challenging non-rigid object tracking dataset and obtain very remarkable results. In particular, we improve the area under curve (AUC) metric of the overlap rate curve from 0.529 to 0.602 on the open benchmark.
2. Related Work
2.1. Deep Learning and CNNs
The root of deep learning can be dated back to research on multilayered neural networks in the late 1980s. The resurgence of research interest in neural networks owes much to a more recent work [17] which used pre-training to make the training of deeper networks feasible. Among different deep learning models, CNN seems to be a more suitable choice for many vision tasks, as the design of CNN was inspired by biological vision systems. Among its characteristics, the convolution operation captures local and repetitive similarity, while the pooling operation allows local translational invariance in images. The rapid development of powerful computing devices such as general-purpose graphics processing units (GPGPUs) and the availability of large-scale labeled datasets such as ImageNet [7] have made the training of large-scale CNNs possible. It has been demonstrated visually in [35] that a CNN can gradually learn low-level to high-level features through the transformation and enlargement of receptive fields in different layers. As opposed to using handcrafted features as in the conventional recognition pipeline, the features learned by a large-scale CNN have been demonstrated to achieve very superior performance in high-level vision tasks such as image classification [20] and object detection [10].
2.2. Visual Tracking
Many methods have been proposed for single object tracking. For a systematic review and comparison, we refer the readers to a recent survey and benchmark [27, 34]. Most of the existing tracking methods belong to the general framework of Bayesian tracking [1], which decomposes the problem into two parts involving a motion model and an appearance model. Although some trackers attempt to go beyond this framework, most still focus on improving the appearance model because this aspect is crucial to enhancing performance.
Generally speaking, most trackers belong to one of two categories: generative trackers and discriminative trackers. Generative trackers usually assume a generative process of the tracked target and search for the most probable candidate as the tracking result. Some representative methods are based on principal component analysis [26], sparse coding [25], and dictionary learning [30]. On the other hand, discriminative trackers learn to separate the foreground from the background using a classifier. Many advanced machine learning algorithms have been used, including boosting variants, multiple-instance learning [2], structured output SVM [14], and Gaussian process regression [9]. These two approaches are in general complementary. Discriminative trackers are usually more resistant to cluttered background since they explicitly sample image patches from the background as negative training examples, while generative trackers are usually more accurate under normal situations. Besides, some methods exploit correlation filters for the target or its context. Their primary advantage is that only fast Fourier transforms and a few matrix operations are needed, making them very suitable for real-time applications. Moreover, some methods take the ensemble learning approach, which is especially effective when the constituent trackers in the ensemble have high diversity. Furthermore, some methods focus on long-term tracking.
As for applying deep learning to visual tracking, besides the DLT [31] mentioned in the previous section, some recent methods include using an ensemble [39] and maintaining a pool of CNNs [24]. However, due to the lack of sufficient training data, these methods only show comparable or even inferior results compared to other state-of-the-art trackers. In summary, for visual tracking applications, we believe that the power of deep learning has not yet been fully liberated.
3. Our Tracker
In this section, we will present our structured output deep learning tracker (SO-DLT). We first present the CNN architecture in SO-DLT and the offline pre-training process of the CNN. We then present details of the online tracking process.
3.1. Overview
Training of the tracker can be divided into two stages, the offline pre-training stage and the online fine-tuning and tracking stage. In the pre-training stage, we train a CNN to learn generic object features for distinguishing objects from non-objects, i.e., to learn from examples the notion of objectness. Instead of fixing the learned parameters of CNN during online tracking, we fine-tune them so that the CNN can adapt to the target being tracked. For robustness, we run two CNNs concurrently during online tracking to account for possible mistakes caused by model update. The two CNNs work collaboratively in determining the tracking result of each video frame.
3.2. Objectness Pre-training
The architecture of the structured output CNN is shown in Fig. 2. It consists of seven convolutional layers and three fully connected layers. Between these two parts, a multi-scale pooling scheme [15] is introduced to retain more features related to locality, since the output needs them for localization. The parameter setting of the network is shown in Fig. 2. In contrast to the conventional CNN used for classification or regression, there is a crucial difference in our model: the output of the CNN is a probability map rather than a single number. Each output pixel corresponds to a region in the original input, with its value representing the probability that the corresponding input region belongs to an object. In our implementation, the output layer is a 2500-dimensional fully connected layer which is then reshaped to a 50×50 probability map. Since there exists strong correlation between neighboring pixels of the probability map, we use only 512 hidden units in the preceding layer to help prevent overfitting.
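The paper gives no code; the following PyTorch sketch shows one way to realize this architecture. The seven convolutional layers, three fully connected layers, multi-scale pooling, 512 hidden units, and 2500-dimensional output reshaped to 50×50 follow the text, while the kernel sizes, channel widths, and pooling grid sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredOutputCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Seven convolutional layers (channel widths are assumptions).
        chans = [3, 64, 64, 128, 128, 256, 256, 256]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                       nn.ReLU()]
        self.features = nn.Sequential(*layers)
        # Multi-scale pooling: pool the last feature map at several grid
        # sizes and concatenate, retaining locality for localization.
        self.pool_sizes = (1, 2, 4)
        pooled_dim = 256 * sum(s * s for s in self.pool_sizes)
        # Three fully connected layers; only 512 hidden units feed the
        # 2500-dimensional output layer, to limit overfitting.
        self.fc = nn.Sequential(
            nn.Linear(pooled_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 2500),
        )

    def forward(self, x):
        f = self.features(x)
        pooled = [F.adaptive_max_pool2d(f, s).flatten(1)
                  for s in self.pool_sizes]
        logits = self.fc(torch.cat(pooled, dim=1))
        # Element-wise logistic output reshaped into the 50x50 map.
        return torch.sigmoid(logits).view(-1, 50, 50)
```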
To train such a large CNN, it is essential to use a large dataset to prevent overfitting. Since we are interested in object-level features, we use the ImageNet 2014 detection dataset, which contains 478,807 bounding boxes in the training set. For each annotated bounding box, we add random padding and scaling around it. We also randomly sample some negative examples whose overlap rates with the positive examples are below a certain threshold. Note that the CNN does not learn to distinguish different object classes as in a typical classification or detection task, since we are only interested in learning to differentiate objects from non-objects at this stage. Consequently, we use an element-wise logistic regression model at each position of the output map and define the loss function accordingly. For the training target, a pixel inside the bounding box is set to 1 while a pixel outside is set to 0. For a negative example, the target is 0 over the entire probability map. This setting is equivalent to penalizing the number of mismatched pixels between the prediction and the ground truth, thus inducing a structured loss function that fits the problem better. Mathematically, let $\hat{p}_{ij}$ denote the prediction at position $(i,j)$ and $y_{ij}$ be a binary variable denoting the ground truth at position $(i,j)$; the loss function of our method is defined as:

$$\min \; -\sum_{i}\sum_{j} \left[\, y_{ij}\log \hat{p}_{ij} + (1-y_{ij})\log(1-\hat{p}_{ij}) \,\right]$$
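This loss amounts to per-pixel binary cross-entropy summed over the 50×50 map. A minimal sketch in PyTorch (the clamping epsilon is our own numerical-stability addition, not part of the paper):

```python
import torch

def structured_loss(pred_map, target_map, eps=1e-7):
    """pred_map, target_map: tensors of shape (batch, 50, 50).

    Targets are 1 inside the ground-truth bounding box and 0 elsewhere;
    for a negative example the target map is all zeros.
    """
    p = pred_map.clamp(eps, 1 - eps)  # avoid log(0)
    ll = target_map * torch.log(p) + (1 - target_map) * torch.log(1 - p)
    return -ll.sum(dim=(1, 2)).mean()  # sum over pixels, average over batch
```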
The detailed parameters for training are described in Sec. 4.1.
The overlap rate between two bounding boxes is defined as the area of intersection of the two bounding boxes over the area of their union.
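A minimal implementation of this definition, for axis-aligned boxes given as (x1, y1, x2, y2) tuples:

```python
def overlap_rate(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```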
Figure 2. Architecture of the proposed structured output CNN.
Fig. 3 shows some results when the pre-trained CNN is tested on the held-out validation set provided by the ImageNet 2014 detection task. In most cases, the CNN can successfully determine whether the input image contains an object, and if so, it can accurately locate the object of interest. Note that since the labels of our training data are only bounding boxes, the output probability map also takes the form of a square. Although there are methods [6] that utilize bounding box information to provide weak supervision and obtain pixel-wise segmentation as well, we believe that the probability map output of our model is sufficient for the purpose of tracking.
Figure 3. Testing of the pre-trained objectness CNN on the ImageNet 2014 detection validation set. The first row shows two positive examples, each of which contains an object. The objectness CNN can accurately detect the object position and scale. The second row shows two negative examples. The objectness CNN does not fire on them, showing the lack of evidence for the existence of any object of interest. The CNN plays an important role in making our SO-DLT robust to occlusion and cluttered background during online tracking.
3.3. Online Tracking
The CNN pre-trained to learn generic object features as described above cannot be used directly for online tracking because the distribution of the ImageNet data differs from that of the data observed during online tracking. Moreover, if we do not fine-tune the CNN, it will fire on all objects that appear in a video frame instead of just the object being tracked. Therefore, it is essential to fine-tune the pre-trained CNN using the annotation in the first frame of each video encountered during online tracking, to make sure that the CNN is specific to the target. Note that fine-tuning, or online model adaptation, is an indispensable part of our tracker rather than an optional feature introduced solely to further improve the tracking performance.
We now present the basic online tracking pipeline. We maintain two CNNs which use different model update strategies. After fine-tuning using the annotation in the first frame, we crop some image patches from each new frame based on the estimation of the previous frame. By making a simple forward pass through the CNN, we can obtain the probability map for each of the image patches. The final estimation is then determined by searching for a proper bounding box. The two CNNs are updated if necessary. We illustrate the pipeline of the tracking algorithm in Fig. 4. In what follows, we will elaborate the major steps of the pipeline separately.
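As a structural sketch of this loop (not the paper's exact procedure), the following Python abstracts the two CNNs, the cropping, the collaborative box estimation, and the per-CNN update rules as callables; all names here are illustrative, and the concrete steps are elaborated in the subsections below.

```python
# A structural sketch of the online tracking loop described above.
# The callables passed in (finetune, crop_regions, estimate_box,
# should_update) stand in for the concrete steps detailed later;
# each cnn is a callable mapping an image patch to a probability map.
def track(frames, first_box, cnns, finetune, crop_regions,
          estimate_box, should_update):
    for cnn in cnns:  # fine-tune both CNNs on the annotated first frame
        finetune(cnn, frames[0], first_box)
    box, results = first_box, [first_box]
    for frame in frames[1:]:
        # Crop candidate regions around the previous estimate and run
        # a forward pass through each CNN to get probability maps.
        patches = crop_regions(frame, box)
        maps = [[cnn(p) for p in patches] for cnn in cnns]
        box = estimate_box(maps)  # the two CNNs decide collaboratively
        for cnn in cnns:          # each CNN has its own update strategy
            if should_update(cnn, frame, box):
                finetune(cnn, frame, box)
        results.append(box)
    return results
```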
3.3.1 Bounding Box Determination
When a new frame comes, the first step of our tracker is to determine the best location and scale of the target. We first specify the possible regions that may contain the target and feed the regions into the CNN. Next, we decide the most probable location of the bounding box based on the probability map.
Search Mechanism: Selecting a proper search range for the target is a nontrivial problem. Too small a search region makes it easy to lose the target under fast motion, while too large a search region may include salient distractors from the background. For example, in Fig. 5, the output response gets weaker as the search region is enlarged, mainly due to the cluttered background and another person nearby. To address this issue, we propose a multi-scale search scheme for determining the proper bounding box. First, all the cropped regions are centered at the estimate from the previous frame. Then, we start searching with the smallest scale. If the sum over the output probability map is below a threshold (i.e., the target may not be at this scale), we proceed to the next larger scale. If we cannot find the object at any scale, we report that the target is missing.
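A minimal sketch of this multi-scale search, assuming the previous estimate is given as a center-size tuple (cx, cy, w, h) and the CNN forward pass is wrapped in a predict_map callable; the scale factors and the confidence threshold here are illustrative assumptions, not the paper's values.

```python
import numpy as np

def crop_centered(frame, cx, cy, w, h):
    """Crop a w-by-h window centered at (cx, cy), clipped to the frame."""
    x0, x1 = int(max(cx - w / 2, 0)), int(min(cx + w / 2, frame.shape[1]))
    y0, y1 = int(max(cy - h / 2, 0)), int(min(cy + h / 2, frame.shape[0]))
    return frame[y0:y1, x0:x1]

def multi_scale_search(frame, prev_box, predict_map,
                       scales=(1.0, 1.5, 2.0), threshold=100.0):
    """prev_box: (cx, cy, w, h); predict_map: patch -> 50x50 prob map."""
    cx, cy, w, h = prev_box
    for s in scales:  # all regions share the previous center; smallest first
        patch = crop_centered(frame, cx, cy, w * s, h * s)
        prob_map = predict_map(patch)
        if np.sum(prob_map) >= threshold:  # enough evidence at this scale
            return s, prob_map
    return None  # target reported missing at all scales
```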
Generating Bounding Box: After we have selected the best scale, we need to generate the final bounding box for the current frame. We first determine the center of the bounding box and then estimate its scale change with respect to the previous frame. To determine the center we use