Source: cvpr13_benchmark.pdf

Online Object Tracking: A Benchmark

Yi Wu

University of California at Merced

ywu29@ucmerced.edu

Jongwoo Lim

Hanyang University

jlim@hanyang.ac.kr

Ming-Hsuan Yang

University of California at Merced

mhyang@ucmerced.edu

Abstract

Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.

1. Introduction

Object tracking is one of the most important components in a wide range of applications in computer vision, such as surveillance, human computer interaction, and medical imaging [60, 12]. Given the initialized state (e.g., position and size) of a target object in a frame of a video, the goal of tracking is to estimate the states of the target in the subsequent frames. Although object tracking has been studied for several decades, and much progress has been made in recent years [28, 16, 47, 5, 40, 26, 19], it remains a very challenging problem. Numerous factors affect the performance of a tracking algorithm, such as illumination variation, occlusion, and background clutter, and there exists no single tracking approach that can successfully handle all scenarios. Therefore, it is crucial to evaluate the performance of state-of-the-art trackers to demonstrate their strengths and weaknesses and to help identify future research directions for designing more robust algorithms.

For comprehensive performance evaluation, it is critical to collect a representative dataset. There exist several datasets for visual tracking in surveillance scenarios, such as the VIVID [14], CAVIAR [21], and PETS databases. However, the target objects in these surveillance sequences are usually humans or cars of small size, and the background is usually static. Although some tracking datasets [47, 5, 33] for generic scenes are annotated with bounding boxes, most are not. For sequences without labeled ground truth, it is difficult to evaluate tracking algorithms, as the reported results are based on inconsistently annotated object locations.

Recently, more tracking source codes have been made publicly available, e.g., the OAB [22], IVT [47], MIL [5], L1 [40], and TLD [31] algorithms, which have been commonly used for evaluation. However, the input and output formats of most trackers differ, which makes large scale performance evaluation inconvenient. In this work, we build a code library that includes most publicly available trackers and a test dataset with ground-truth annotations to facilitate the evaluation task. Additionally, each sequence in the dataset is annotated with attributes that often affect tracking performance, such as occlusion, fast motion, and illumination variation.

One common issue in assessing tracking algorithms is that results are reported based on just a few sequences with different initial conditions or parameters, and thus do not provide a holistic view of these algorithms. For fair and comprehensive performance evaluation, we propose to perturb the initial state spatially and temporally from the ground-truth target locations. While robustness to initialization is a well-known problem in the field, it is seldom addressed in the literature. To the best of our knowledge, this is the first comprehensive work to address and analyze the initialization problem of object tracking. We use precision plots based on the location error metric and success plots based on the overlap metric to analyze the performance of each algorithm.

The contribution of this work is three-fold:

Dataset. We build a tracking dataset with 50 fully annotated sequences to facilitate tracking evaluation.

Code library. We integrate most publicly available trackers in our code library with uniform input and output formats to facilitate large scale performance evaluation. At present, it includes 29 tracking algorithms.

Robustness evaluation. The initial bounding boxes for tracking are sampled spatially and temporally to evaluate the robustness and characteristics of trackers. Each tracker is extensively evaluated by analyzing more than 660,000 bounding box outputs.

This work mainly focuses on online¹ tracking of a single target. The code library, annotated dataset, and all the tracking results are available on the website visual-tracking.net.

2. Related Work

In this section, we review recent algorithms for object tracking in terms of several main modules: target representation scheme, search mechanism, and model update. In addition, some methods have been proposed that build on combining several trackers or mining context information.

Representation Scheme. Object representation is one of the major components in any visual tracker, and numerous schemes have been presented [35]. Since the pioneering work of Lucas and Kanade [37, 8], holistic templates (raw intensity values) have been widely used for tracking [25, 39, 2]. Subsequently, subspace-based tracking approaches [11, 47] have been proposed to better account for appearance changes. Furthermore, Mei and Ling [40] proposed a tracking approach based on sparse representation to handle corrupted appearance, and it has recently been further improved [41, 57, 64, 10, 55, 42]. In addition to templates, many other visual features have been adopted in tracking algorithms, such as color histograms [16], histograms of oriented gradients (HOG) [17, 52], covariance region descriptors [53, 46, 56], and Haar-like features [54, 22]. Recently, discriminative models have been widely adopted in tracking [15, 4], where a binary classifier is learned online to discriminate the target from the background. Numerous learning methods have been adapted to the tracking problem, such as SVM [3], structured output SVM [26], ranking SVM [7], boosting [4, 22], semi-boosting [23], and multi-instance boosting [5]. To make trackers more robust to pose variation and partial occlusion, an object can be represented by parts, where each part is represented by descriptors or histograms. In [1], several local histograms are used to represent the object in a pre-defined grid structure. Kwon and Lee [32] propose an approach to automatically update the topology of local patches to handle large pose changes. To better handle appearance variations, some approaches integrating multiple representation schemes have recently been proposed [62, 51, 33].

Search Mechanism. To estimate the state of the target objects, deterministic or stochastic methods have been used. When the tracking problem is posed within an optimization framework, assuming the objective function is differentiable with respect to the motion parameters, gradient descent methods can be used to locate the target efficiently [37, 16, 20, 49]. However, these objective functions are usually nonlinear and contain many local minima. To alleviate this problem, dense sampling methods have been adopted [22, 5, 26] at the expense of high computational load. On the other hand, stochastic search algorithms such as particle filters [28, 44] have been widely used since they are relatively insensitive to local minima and computationally efficient [47, 40, 30].

Model Update. It is crucial to update the target representation or model to account for appearance variations. Matthews et al. [39] address the template update problem for the Lucas-Kanade algorithm [37], where the template is updated with a combination of the fixed reference template extracted from the first frame and the result from the most recent frame. Effective update algorithms have also been proposed via online mixture models [29], online boosting [22], and incremental subspace update [47]. For discriminative models, the main issue has been improving the sample collection part to make the online-trained classifier more robust [23, 5, 31, 26]. While much progress has been made, it is still difficult to obtain an adaptive appearance model that avoids drift.

Context and Fusion of Trackers. Context information is also very important for tracking. Recently, some approaches have been proposed that mine auxiliary objects or local visual information surrounding the target to assist tracking [59, 24, 18]. The context information is especially helpful when the target is fully occluded or leaves the image region [24]. To improve tracking performance, some tracker fusion methods have been proposed recently. Santner et al. [48] proposed an approach that combines static, moderately adaptive, and highly adaptive trackers to account for appearance changes. Multiple trackers [34] or multiple feature sets [61] can also be maintained and selected in a Bayesian framework to better account for appearance changes.

3. Evaluated Algorithms and Datasets

For fair evaluation, we test the tracking algorithms whose original source or binary codes are publicly available, as all implementations inevitably involve technical details and specific parameter settings². Table 1 shows the list of the evaluated tracking algorithms. We also evaluate the trackers in the VIVID testbed [14], including the mean shift (MS-V), template matching (TM-V), ratio shift (RS-V), and peak difference (PD-V) methods.

In recent years, many benchmark datasets have been developed for various vision problems, such as the Berkeley segmentation [38], FERET face recognition [45], and optical flow [9] datasets. There exist some datasets for tracking in the surveillance scenario, such as the VIVID [14] and CAVIAR [21] datasets. For generic visual tracking, more sequences have been used for evaluation [47, 5]. However, most sequences do not have ground truth annotations, and the quantitative evaluation results may be generated with different initial conditions. To facilitate fair performance evaluation, we have collected and annotated the most commonly used tracking sequences. Figure 1 shows the first frame of each sequence, where the target object is initialized with a bounding box.

¹ Here, the word online means that during tracking only the information of the previous few frames is used for inference at any time instance.

² Some source codes [36, 58] were obtained through direct contact with the authors, and some methods [44, 16] are implemented on our own.

Table 1. Evaluated tracking algorithms (MU: model update, FPS: frames per second). For representation schemes, L: local, H: holistic, T: template, IH: intensity histogram, BP: binary pattern, PCA: principal component analysis, SPCA: sparse PCA, SR: sparse representation, DM: discriminative model, GM: generative model. For search mechanism, PF: particle filter, MCMC: Markov Chain Monte Carlo, LOS: local optimum search, DS: dense sampling search. For model update, N: no, Y: yes. In the Code column, M: Matlab, C: C/C++, MC: mixture of Matlab and C/C++, suffix E: executable binary code.

Attributes of a test sequence. Evaluating trackers is difficult because many factors can affect tracking performance. For better evaluation and analysis of the strengths and weaknesses of tracking approaches, we categorize the sequences by annotating them with the 11 attributes shown in Table 2.

The attribute distribution in our dataset is shown in Figure 2(a). Some attributes, e.g., OPR and IPR, occur more frequently than others. The figure also shows that one sequence is often annotated with several attributes. Aside from summarizing performance on the whole dataset, we also construct several subsets corresponding to attributes to report performance under specific challenging conditions. For example, the OCC subset contains 29 sequences which can be used to analyze how well trackers handle occlusion. The attribute distribution of the OCC subset is shown in Figure 2(b); the others are available in the supplementary material.

Table 2. List of the attributes annotated to test sequences. The threshold values used in this work are also shown.

Figure 2. (a) Attribute distribution of the entire testset, and (b) the distribution of the sequences with occlusion (OCC) attribute.

4. Evaluation Methodology

In this work, we use the precision and success rate for quantitative analysis. In addition, we evaluate the robustness of tracking algorithms in two aspects.

Precision plot. One widely used evaluation metric for tracking precision is the center location error, defined as the average Euclidean distance between the center locations of the tracked targets and the manually labeled ground truths. The average center location error over all the frames of one sequence is then used to summarize the overall performance for that sequence. However, when the tracker loses the target, the output location can be random, and the average error value may not measure the tracking performance correctly [6]. Recently, the precision plot [6, 27] has been adopted to measure overall tracking performance. It shows the percentage of frames whose estimated location is within a given threshold distance of the ground truth. As the representative precision score for each tracker, we use the score at the threshold of 20 pixels [6].

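As a concrete illustration, the precision curve can be computed from per-frame center location errors as in the following sketch. This is not the paper's released toolkit; the function names and the 0-50 pixel threshold range are our own choices for illustration.

```python
import numpy as np

def precision_curve(pred_centers, gt_centers, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center location error is within each threshold."""
    errors = np.linalg.norm(np.asarray(pred_centers, dtype=float)
                            - np.asarray(gt_centers, dtype=float), axis=1)
    # One precision value per threshold: P(error <= t)
    return np.array([(errors <= t).mean() for t in thresholds])

def precision_at_20(pred_centers, gt_centers):
    """Representative precision score: the value at the 20-pixel threshold."""
    return precision_curve(pred_centers, gt_centers, thresholds=[20])[0]
```

For example, if the tracker is on target in one frame (error 0) and drifts 30 pixels in another, the score at 20 pixels is 0.5.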
Success plot. Another evaluation metric is the bounding box overlap. Given the tracked bounding box r_t and the ground truth bounding box r_a, the overlap score is defined as S = |r_t ∩ r_a| / |r_t ∪ r_a|, where ∩ and ∪ represent the intersection and union of two regions, respectively, and |·| denotes the number of pixels in a region. To measure performance on a sequence of frames, we count the number of successful frames whose overlap S is larger than a given threshold t_o. The success plot shows the ratio of successful frames as the threshold varies from 0 to 1. Using one success rate value at a specific threshold (e.g., t_o = 0.5) for tracker evaluation may not be fair or representative. Instead, we use the area under curve (AUC) of each success plot to rank the tracking algorithms.

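A minimal sketch of the overlap score and the AUC-based ranking described above. The helper names and the 21-point threshold grid are illustrative assumptions; note that with evenly spaced thresholds, the AUC reduces to the average success rate.

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap score S = |a ∩ b| / |a ∪ b| for axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Ratio of frames whose overlap S exceeds each threshold t_o."""
    scores = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.array([(scores > t).mean() for t in thresholds])

def auc(curve):
    """Area under the success curve; with evenly spaced thresholds this is
    simply the average success rate, which is used to rank trackers."""
    return float(np.mean(curve))
```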
Figure 1. Tracking sequences for evaluation. The first frame with the bounding box of the target object is shown for each sequence. The sequences are ordered based on our ranking results (See supplementary material): the ones on the top left are more difficult for tracking than the ones on the bottom right. Note that we annotated two targets for the jogging sequence.

Robustness Evaluation. The conventional way to evaluate trackers is to run them throughout a test sequence with initialization from the ground truth position in the first frame and report the average precision or success rate. We refer to this as one-pass evaluation (OPE). However, a tracker may be sensitive to initialization, and its performance with a different initialization at a different start frame may become much worse or better. Therefore, we propose two ways to analyze a tracker's robustness to initialization, by perturbing the initialization temporally (i.e., starting at different frames) and spatially (i.e., starting with different bounding boxes). These tests are referred to as temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE), respectively.

The proposed test scenarios occur often in real-world applications, as a tracker is commonly initialized by an object detector, which is likely to introduce initialization errors in position and scale. In addition, an object detector may be used to re-initialize a tracker at different time instances. By investigating a tracker's characteristics in the robustness evaluation, a more thorough understanding and analysis of the tracking algorithm can be carried out.

Temporal Robustness Evaluation. Given one initial frame together with the ground-truth bounding box of the target, a tracker is initialized and runs to the end of the sequence, i.e., over one segment of the entire sequence. The tracker is evaluated on each segment, and the overall statistics are tallied.

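The TRE segmentation could be sketched as follows. The paper does not specify the exact start-frame spacing, so evenly spaced starts are assumed, and `run_tracker` in the usage comment is a hypothetical helper.

```python
def tre_start_frames(num_frames, num_segments=20):
    """Start frames for temporal robustness evaluation (TRE).

    The sequence is partitioned into num_segments segments; the tracker is
    re-initialized from the ground-truth box at each start frame and then
    run to the end of the sequence. Evenly spaced starts are assumed here.
    """
    step = num_frames / num_segments
    return [int(round(i * step)) for i in range(num_segments)]

# Usage sketch with a hypothetical run_tracker(frames, init_box) helper:
# for start in tre_start_frames(len(frames)):
#     results = run_tracker(frames[start:], ground_truth[start])
```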
Spatial Robustness Evaluation. We sample the initial bounding box in the first frame by shifting or scaling the ground truth. Here, we use 8 spatial shifts, including 4 center shifts and 4 corner shifts, and 4 scale variations (see supplement). The shift amount is 10% of the target size, and the scale ratio varies among 0.8, 0.9, 1.1, and 1.2 relative to the ground truth. Thus, we evaluate each tracker 12 times for SRE.

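The 12 SRE initializations (8 shifts of 10% of the target size plus 4 rescalings) could be generated as in this sketch. The exact shift directions and the choice to scale about the box center are our assumptions; the paper's supplement defines the precise sampling.

```python
def sre_initializations(gt_box, shift_frac=0.1, scales=(0.8, 0.9, 1.1, 1.2)):
    """12 perturbed initial boxes for spatial robustness evaluation (SRE):
    4 center shifts, 4 corner shifts (10% of target size), and 4 scalings."""
    x, y, w, h = gt_box
    dx, dy = shift_frac * w, shift_frac * h
    boxes = []
    # 4 center shifts: left, right, up, down
    for sx, sy in ((-dx, 0), (dx, 0), (0, -dy), (0, dy)):
        boxes.append((x + sx, y + sy, w, h))
    # 4 corner shifts: diagonal moves
    for sx, sy in ((-dx, -dy), (dx, -dy), (-dx, dy), (dx, dy)):
        boxes.append((x + sx, y + sy, w, h))
    # 4 scale variations, applied about the box center (assumption)
    cx, cy = x + w / 2, y + h / 2
    for s in scales:
        boxes.append((cx - s * w / 2, cy - s * h / 2, s * w, s * h))
    return boxes
```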
5. Evaluation Results

For each tracker, the default parameters provided with the source code are used in all evaluations. Table 1 lists the average FPS of each tracker in OPE, running on a PC with an Intel i7 3770 CPU (3.4 GHz). More detailed speed statistics, such as minimum and maximum, are available in the supplement.

For OPE, each tracker is tested on more than 29,000 frames. For SRE, each tracker is evaluated 12 times on each sequence, generating more than 350,000 bounding box results. For TRE, each sequence is partitioned into 20 segments, and thus each tracker runs on around 310,000 frames. To the best of our knowledge, this is the largest scale performance evaluation of visual tracking. We report the most important findings in this manuscript; more details and figures can be found in the supplement.

5.1. Overall Performance

The overall performance for all the trackers is summarized by the success and precision plots as shown in Fig-
