Fully-Convolutional Siamese Networks for Object Tracking [Translation]



Original paper: 1606.09549

Fully-Convolutional Siamese Networks for Object Tracking


Luca Bertinetto * Jack Valmadre * João F. Henriques Andrea Vedaldi Philip H. S. Torr


Department of Engineering Science, University of Oxford {name.surname}@eng.ox.ac.uk


Abstract. The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.


Keywords: object-tracking, Siamese-network, similarity-learning, deep-learning

1 Introduction


We consider the problem of tracking an arbitrary object in video, where the object is identified solely by a rectangle in the first frame. Since the algorithm may be requested to track any arbitrary object, it is impossible to have already gathered data and trained a specific detector.


For several years, the most successful paradigm for this scenario has been to learn a model of the object's appearance in an online fashion using examples extracted from the video itself [1]. This owes in large part to the demonstrated ability of methods like TLD [2], Struck [3] and KCF [4]. However, a clear deficiency of using data derived exclusively from the current video is that only comparatively simple models can be learnt. While other problems in computer vision have seen an increasingly pervasive adoption of deep convolutional networks (conv-nets) trained from large supervised datasets, the scarcity of supervised data and the constraint of real-time operation prevent the naive application of deep learning within this paradigm of learning a detector per video.



  • The first two authors contributed equally, and are listed in alphabetical order.



Several recent works have aimed to overcome this limitation using a pre-trained deep conv-net that was learnt for a different but related task. These approaches either apply "shallow" methods (e.g. correlation filters) using the network's internal representation as features [5,6] or perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network [7,8,9]. While the use of shallow methods does not take full advantage of the benefits of end-to-end learning, methods that apply SGD during tracking to achieve state-of-the-art results have not been able to operate in real-time.


We advocate an alternative approach in which a deep conv-net is trained to address a more general similarity learning problem in an initial offline phase, and then this function is simply evaluated online during tracking. The key contribution of this paper is to demonstrate that this approach achieves very competitive performance in modern tracking benchmarks at speeds that far exceed the frame-rate requirement. Specifically, we train a Siamese network to locate an exemplar image within a larger search image. A further contribution is a novel Siamese architecture that is fully-convolutional with respect to the search image: dense and efficient sliding-window evaluation is achieved with a bilinear layer that computes the cross-correlation of its two inputs.


We posit that the similarity learning approach has gone relatively neglected because the tracking community did not have access to vast labelled datasets. In fact, until recently the available datasets comprised only a few hundred annotated videos. However, we believe that the emergence of the ILSVRC dataset for object detection in video [10] (henceforth ImageNet Video) makes it possible to train such a model. Furthermore, the fairness of training and testing deep models for tracking using videos from the same domain is a point of controversy, as it has been recently prohibited by the VOT committee. We show that our model generalizes from the ImageNet Video domain to the ALOV/OTB/VOT [1,11,12] domain, enabling the videos of tracking benchmarks to be reserved for testing purposes.


2 Deep similarity learning for tracking


Learning to track arbitrary objects can be addressed using similarity learning. We propose to learn a function $f(z, x)$ that compares an exemplar image $z$ to a candidate image $x$ of the same size and returns a high score if the two images depict the same object and a low score otherwise. To find the position of the object in a new image, we can then exhaustively test all possible locations and choose the candidate with the maximum similarity to the past appearance of the object. In experiments, we will simply use the initial appearance of the object as the exemplar. The function $f$ will be learnt from a dataset of videos with labelled object trajectories.


Given their widespread success in computer vision [13,14,15,16], we will use a deep conv-net as the function $f$. Similarity learning with deep conv-nets is typically addressed using Siamese architectures [17,18,19]. Siamese networks apply an identical transformation $\varphi$ to both inputs and then combine their representations using another function $g$ according to $f(z, x) = g(\varphi(z), \varphi(x))$. When the function $g$ is a simple distance or similarity metric, the function $\varphi$ can be considered an embedding. Deep Siamese conv-nets have previously been applied to tasks such as face verification [18,20,14], keypoint descriptor learning [19,21] and one-shot character recognition [22].

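To make the structure $f(z, x) = g(\varphi(z), \varphi(x))$ concrete, here is a minimal PyTorch sketch (not the paper's code): a toy convolutional embedding stands in for $\varphi$ and cosine similarity stands in for $g$. The class name and layer sizes are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseSimilarity(nn.Module):
    """Toy Siamese network: a shared embedding phi applied to both inputs,
    combined by a simple similarity function g (here, cosine similarity)."""

    def __init__(self):
        super().__init__()
        # Placeholder embedding; the paper uses a much larger conv-net.
        self.phi = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse to a plain vector for this sketch
        )

    def forward(self, z, x):
        # z, x: exemplar and candidate crops of the same size, shape (B, 3, H, W)
        fz = self.phi(z).flatten(1)
        fx = self.phi(x).flatten(1)
        return F.cosine_similarity(fz, fx, dim=1)  # one scalar score per pair

model = SiameseSimilarity()
score = model(torch.randn(2, 3, 127, 127), torch.randn(2, 3, 127, 127))
print(score.shape)  # torch.Size([2])
```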

Fig. 1: Fully-convolutional Siamese architecture. Our architecture is fully-convolutional with respect to the search image $x$. The output is a scalar-valued score map whose dimension depends on the size of the search image. This enables the similarity function to be computed for all translated sub-windows within the search image in one evaluation. In this example, the red and blue pixels in the score map contain the similarities for the corresponding sub-windows. Best viewed in colour.


2.1 Fully-convolutional Siamese architecture


We propose a Siamese architecture which is fully-convolutional with respect to the candidate image $x$. We say that a function is fully-convolutional if it commutes with translation. To give a more precise definition, introducing $L_\tau$ to denote the translation operator $(L_\tau x)[u] = x[u - \tau]$, a function $h$ that maps signals to signals is fully-convolutional with integer stride $k$ if

$$h(L_{k\tau} x) = L_\tau h(x)$$

for any translation $\tau$. (When $x$ is a finite signal, this only need hold for the valid region of the output.)


The advantage of a fully-convolutional network is that, instead of a candidate image of the same size, we can provide as input to the network a much larger search image and it will compute the similarity at all translated sub-windows on a dense grid in a single evaluation. To achieve this, we use a convolutional embedding function $\varphi$ and combine the resulting feature maps using a cross-correlation layer

$$f(z, x) = \varphi(z) \star \varphi(x) + b\mathbb{1},$$

where $b\mathbb{1}$ denotes a signal which takes value $b \in \mathbb{R}$ in every location. The output of this network is not a single score but rather a score map defined on a finite grid $\mathcal{D} \subset \mathbb{Z}^2$ as illustrated in Figure 1. Note that the output of the embedding function is a feature map with spatial support as opposed to a plain vector. The same technique has been applied in contemporary work on stereo matching [23].

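The cross-correlation layer admits a very compact implementation in standard conv-net libraries, because a 2-D "convolution" with the exemplar embedding used as the kernel is exactly a cross-correlation. The sketch below assumes PyTorch and illustrative feature-map shapes (6×6 exemplar features against 22×22 search features, 256 channels); it is not the paper's code.

```python
import torch
import torch.nn.functional as F

def cross_correlation(phi_z, phi_x, b=0.0):
    """Slide the exemplar embedding phi_z (1, C, h, w) over the search
    embedding phi_x (1, C, H, W), returning a (1, 1, H-h+1, W-w+1) score map
    with the scalar bias b added at every location."""
    # F.conv2d actually computes a correlation, so using phi_z as the kernel
    # yields the dense map of inner products with every sub-window of phi_x.
    return F.conv2d(phi_x, phi_z) + b

phi_z = torch.randn(1, 256, 6, 6)    # exemplar features (toy shape)
phi_x = torch.randn(1, 256, 22, 22)  # search features (toy shape)
print(cross_correlation(phi_z, phi_x).shape)  # torch.Size([1, 1, 17, 17])
```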

During tracking, we use a search image centred at the previous position of the target. The position of the maximum score relative to the centre of the score map, multiplied by the stride of the network, gives the displacement of the target from frame to frame. Multiple scales are searched in a single forward-pass by assembling a mini-batch of scaled images.

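As a sketch of how the score map is turned into a motion estimate, the snippet below (a hypothetical helper; the 17×17 map size and total stride of 8 are assumptions for illustration) locates the peak of the map and converts its offset from the centre into a pixel displacement.

```python
import torch

def displacement_from_score_map(score_map, total_stride=8):
    """Convert the argmax of an (H, W) score map into a (dx, dy) pixel
    displacement of the target, measured from the map centre."""
    h, w = score_map.shape
    idx = int(torch.argmax(score_map))  # index of the best-matching sub-window
    row, col = divmod(idx, w)
    dy = row - (h - 1) / 2              # offset from the centre, in map cells
    dx = col - (w - 1) / 2
    # One score-map cell corresponds to `total_stride` pixels in the search image.
    return dx * total_stride, dy * total_stride

print(displacement_from_score_map(torch.randn(17, 17)))
```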

Combining feature maps using cross-correlation and evaluating the network once on the larger search image is mathematically equivalent to combining feature maps using the inner product and evaluating the network on each translated sub-window independently. However, the cross-correlation layer provides an incredibly simple method to implement this operation efficiently within the framework of existing conv-net libraries. While this is clearly useful during testing, it can also be exploited during training.
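This equivalence is easy to verify numerically. The check below (a toy sketch with made-up shapes, assuming PyTorch) compares one entry of the dense score map against the plain inner product between the exemplar features and the corresponding translated sub-window of the search features.

```python
import torch
import torch.nn.functional as F

phi_z = torch.randn(1, 16, 4, 4)    # exemplar embedding (toy shape)
phi_x = torch.randn(1, 16, 10, 10)  # search embedding (toy shape)

dense = F.conv2d(phi_x, phi_z)[0, 0]      # (7, 7) score map in a single pass

u, v = 3, 5                               # an arbitrary sub-window position
window = phi_x[0, :, u:u + 4, v:v + 4]    # the sub-window that position indexes
single = (window * phi_z[0]).sum()        # inner product for that window alone

assert torch.allclose(dense[u, v], single, atol=1e-5)
```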

2.2 Training with large search images

We employ a discriminative approach, training the network on positive and negative pairs and adopting the logistic loss

$$\ell(y, v) = \log\left(1 + \exp(-yv)\right),$$


where $v$ is the real-valued score of a single exemplar-candidate pair and $y \in \{+1, -1\}$ is its ground-truth label. We exploit the fully-convolutional nature of our network during training by using pairs that comprise an exemplar image and a larger search image. This will produce a map of scores $v : \mathcal{D} \rightarrow \mathbb{R}$, effectively generating many examples per pair. We define the loss of a score map to be the mean of the individual losses

$$L(y, v) = \frac{1}{|\mathcal{D}|} \sum_{u \in \mathcal{D}} \ell\left(y[u], v[u]\right),$$


requiring a true label $y[u] \in \{+1, -1\}$ for each position $u \in \mathcal{D}$ in the score map. The parameters of the conv-net $\theta$ are obtained by applying Stochastic Gradient Descent (SGD) to the problem

$$\arg\min_\theta \; \mathbb{E}_{(z, x, y)} \, L\left(y, f(z, x; \theta)\right).$$

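Putting the two equations together, the loss of a whole score map is simply the mean element-wise logistic loss. A minimal PyTorch sketch (the helper name, shapes and label layout are illustrative, not the paper's training code):

```python
import torch
import torch.nn.functional as F

def score_map_loss(v, y):
    """Mean logistic loss over a score map.
    v -- real-valued scores, shape (B, 1, H, W)
    y -- ground-truth labels in {+1, -1}, same shape
    softplus(-y * v) equals log(1 + exp(-y * v)) and is numerically stable."""
    return F.softplus(-y * v).mean()

# Toy usage: a 17x17 score map whose central region is labelled positive.
v = torch.randn(1, 1, 17, 17, requires_grad=True)
y = -torch.ones(1, 1, 17, 17)
y[..., 6:11, 6:11] = 1.0
loss = score_map_loss(v, y)
loss.backward()  # gradients for an SGD step on the conv-net parameters
```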

Pairs are obtained from a dataset of annotated videos by extracting exemplar and search images that are centred on the target, as shown in Figure 2. The images are extracted from two frames of a video that both contain the object and are at most $T$ frames apart. The class of the object is ignored during training. The scale of the object within each image is normalized without corrupting the aspect ratio of the image.

