Learning to Track at 100 FPS with Deep Regression Networks (Translation)


Original paper: 1604.01802

Learning to Track at 100 FPS with Deep Regression Networks


David Held, Sebastian Thrun, Silvio Savarese


Department of Computer Science


Stanford University


{davheld,thrun,ssilvio}@cs.stanford.edu


Abstract. Machine learning techniques are often used in computer vision due to their ability to leverage large amounts of training data to improve performance. Unfortunately, most generic object trackers are still trained from scratch online and do not benefit from the large number of videos that are readily available for offline training. We propose a method for offline training of neural networks that can track novel objects at test-time at 100 fps. Our tracker is significantly faster than previous methods that use neural networks for tracking, which are typically very slow to run and not practical for real-time applications. Our tracker uses a simple feed-forward network with no online training required. The tracker learns a generic relationship between object motion and appearance and can be used to track novel objects that do not appear in the training set. We test our network on a standard tracking benchmark to demonstrate our tracker's state-of-the-art performance. Further, our performance improves as we add more videos to our offline training set. To the best of our knowledge, our tracker¹ is the first neural-network tracker that learns to track generic objects at 100 fps.


Keywords: Tracking, deep learning, neural networks, machine learning


1 Introduction


Given some object of interest marked in one frame of a video, the goal of "single-target tracking" is to locate this object in subsequent video frames, despite object motion, changes in viewpoint, lighting changes, or other variations. Single-target tracking is an important component of many systems. For person-following applications, a robot must track a person as they move through their environment. For autonomous driving, a robot must track dynamic obstacles in order to estimate where they are moving and predict how they will move in the future.


Generic object trackers (trackers that are not specialized for specific classes of objects) are traditionally trained entirely from scratch online (i.e. during test time) [15, 33, 6, 19], with no offline training being performed. Such trackers suffer in performance because they cannot take advantage of the large number of videos that are readily available for offline training. Offline training videos



1 Our tracker is available at davheld.github.io/GOTURN/



Fig. 1. Using a collection of videos and images with bounding box labels (but no class information), we train a neural network to track generic objects. At test time, the network is able to track novel objects without any fine-tuning. By avoiding fine-tuning, our network is able to track at 100 fps.


can be used to teach the tracker to handle rotations, changes in viewpoint, lighting changes, and other complex challenges.


In many other areas of computer vision, such as image classification, object detection, segmentation, or activity recognition, machine learning has allowed vision algorithms to train from offline data and learn about the world [5, 23, 13, 25, 9, 28]. In each of these cases, the performance of the algorithm improves as it iterates through the training set of images. Such models benefit from the ability of neural networks to learn complex functions from large amounts of data.


In this work, we show that it is possible to learn to track generic objects in real-time by watching videos offline of objects moving in the world. To achieve this goal, we introduce GOTURN, Generic Object Tracking Using Regression Networks. We train a neural network for tracking in an entirely offline manner. At test time, when tracking novel objects, the network weights are frozen, and no online fine-tuning is required (as shown in Figure 1). Through the offline training procedure, the tracker learns to track novel objects in a fast, robust, and accurate manner.


Although some initial work has been done in using neural networks for tracking, these efforts have produced neural-network trackers that are too slow for practical use. In contrast, our tracker is able to track objects at 100 fps, making it, to the best of our knowledge, the fastest neural-network tracker to-date. Our real-time speed is due to two factors. First, most previous neural network trackers are trained online [26, 27, 34, 37, 35, 30, 39, 7, 24]; however, training neural networks is a slow process, leading to slow tracking. In contrast, our tracker is trained offline to learn a generic relationship between appearance and motion, so no online training is required. Second, most trackers take a classification-based approach, classifying many image patches to find the target object [26, 27, 37, 30, 39, 24, 33, 3]. In contrast, our tracker uses a regression-based approach, requiring just a single feed-forward pass through the network to regress directly to the location of the target object. The combination of offline training and one-pass regression leads to a significant speed-up compared to previous approaches and allows us to track objects at real-time speeds.


GOTURN is the first generic object neural-network tracker that is able to run at 100 fps. We use a standard tracking benchmark to demonstrate that our tracker outperforms state-of-the-art trackers. Our tracker trains from a set of labeled training videos and images, but we do not require any class-level labeling or information about the types of objects being tracked. GOTURN establishes a new framework for tracking in which the relationship between appearance and motion is learned offline in a generic manner. Our code and additional experiments can be found at davheld.github.io/GOTURN/


2 Related Work


Online training for tracking. Trackers for generic object tracking are typically trained entirely online, starting from the first frame of a video [15, 33, 6, 19]. A typical tracker will sample patches near the target object, which are considered as "foreground" [3]. Some patches farther from the target object are also sampled, and these are considered as "background." These patches are then used to train a foreground-background classifier, and this classifier is used to score patches from the next frame to estimate the new location of the target object [3, 6, 19]. Unfortunately, since these trackers are trained entirely online, they cannot take advantage of the large amount of videos that are readily available for offline training that can potentially be used to improve their performance.


Some researchers have also attempted to use neural networks for tracking within the traditional online training framework [26, 27, 34, 37, 35, 30, 39, 7, 24, 16], showing state-of-the-art results [30, 7, 21]. Unfortunately, neural networks are very slow to train, and if online training is required, then the resulting tracker will be very slow at test time. Such trackers range from 0.8 fps [26] to 15 fps [37], with the top performing neural-network trackers running at 1 fps on a GPU [30, 7, 21]. Hence, these trackers are not usable for most practical applications. Because our tracker is trained offline in a generic manner, no online training of our tracker is required, enabling us to track at 100 fps.


Model-based trackers. A separate class of trackers are the model-based trackers, which are designed to track a specific class of objects [12, 1, 11]. For example, if one is only interested in tracking pedestrians, then one can train a pedestrian detector. During test-time, these detections can be linked together using temporal information. These trackers are trained offline, but they are limited because they can only track a specific class of objects. Our tracker is trained offline in a generic fashion and can be used to track novel objects at test time.


Other neural network tracking frameworks. A related area of research is patch matching [14, 38], which was recently used for tracking in [33], running at 4 fps. In such an approach, many candidate patches are passed through the network, and the patch with the highest matching score is selected as the tracking output. In contrast, our network only passes two images through the network, and the network regresses directly to the bounding box location of the target object. By avoiding the need to score many candidate patches, we are able to track objects at 100 fps.


Fig. 2. Our network architecture for tracking. We input to the network a search region from the current frame and a target from the previous frame. The network learns to compare these crops to find the target object in the current image.


Prior attempts have been made to use neural networks for tracking in various other ways [18], including visual attention models [4, 29]. However, these approaches are not competitive with other state-of-the-art trackers when evaluated on difficult tracking datasets.


3 Method


3.1 Method Overview


At a high level, we feed frames of a video into a neural network, and the network successively outputs the location of the tracked object in each frame. We train the tracker entirely offline with video sequences and images. Through our offline training procedure, our tracker learns a generic relationship between appearance and motion that can be used to track novel objects at test time with no online training required.

3.2 Input / output format

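The high-level loop can be sketched as below. This is a minimal illustration, not the authors' implementation: `net` is a hypothetical stand-in for the trained regression network (a pair of crops in, one bounding box out, with frozen weights), and `crop` is a simplified axis-aligned crop without the padding and scaling the paper describes:

```python
import numpy as np

def crop(frame, box):
    """Crop an (H, W, C) frame to an integer box (x1, y1, x2, y2),
    clipped to the frame bounds."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    h, w = frame.shape[:2]
    return frame[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)]

def track_video(frames, init_box, net):
    """One forward pass per frame, no online training.

    `net` stands in for the offline-trained regression network: it maps
    a (target crop from frame t-1, search crop from frame t) pair to the
    object's bounding box in frame t.
    """
    boxes = [init_box]
    for t in range(1, len(frames)):
        target = crop(frames[t - 1], boxes[-1])  # appearance of the object
        search = crop(frames[t], boxes[-1])      # region to search in frame t
        boxes.append(net(target, search))
    return boxes
```

Because the loop is a single feed-forward pass per frame, the per-frame cost is just one network evaluation, which is what enables the 100 fps speed.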

What to track. In case there are multiple objects in the video, the network must receive some information about which object in the video is being tracked. To achieve this, we input an image of the target object into the network. We crop and scale the previous frame to be centered on the target object, as shown in Figure 2. This input allows our network to track novel objects that it has not seen before; the network will track whatever object is being input in this crop. We pad this crop to allow the network to receive some contextual information about the surroundings of the target object.


In more detail, suppose that in frame t − 1, our tracker previously predicted that the target was located in a bounding box centered at c = (c_x, c_y) with a width of w and a height of h. At time t, we take a crop of frame t − 1 centered

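The padded crop window around the predicted box c = (c_x, c_y), w, h can be sketched as below; note that `context_factor=2.0` is an assumed value for illustration, not taken from the text above:

```python
def crop_window(center, width, height, context_factor=2.0):
    """Return the (x1, y1, x2, y2) crop window around a predicted box.

    The window is context_factor times larger than the bounding box so
    the network also sees some surroundings of the target;
    context_factor=2.0 is an assumption for this sketch.
    """
    cx, cy = center
    cw, ch = context_factor * width, context_factor * height
    return (cx - cw / 2.0, cy - ch / 2.0, cx + cw / 2.0, cy + ch / 2.0)
```

For example, a box centered at (50, 50) with width 20 and height 10 yields a 40 × 20 window, giving the network context on all sides of the target.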
