Original paper: arXiv 1608.03773
Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking
Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan, Michael Felsberg
CVL, Department of Electrical Engineering, Linköping University, Sweden
{martin.danelljan, andreas.robinson, fahad.khan, michael.felsberg}@liu.se
Abstract. Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments. Code and supplementary material are available at www.cvl.isy.liu.se/research/ob…
1 Introduction

Visual tracking is the task of estimating the trajectory of a target in a video. It is one of the fundamental problems in computer vision. Tracking of objects or feature points has numerous applications in robotics, structure-from-motion, and visual surveillance. In recent years, Discriminative Correlation Filter (DCF) based approaches have shown outstanding results on object tracking benchmarks. DCF methods train a correlation filter for the task of predicting the target classification scores. Unlike other methods, the DCF efficiently utilizes all spatial shifts of the training samples by exploiting the discrete Fourier transform.
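To make the role of the DFT concrete, the following minimal sketch (not the formulation proposed in this paper, which is introduced in Section 3) trains a single-channel correlation filter in the Fourier domain, where ridge regression over all circular shifts of the sample reduces to element-wise operations:

```python
import numpy as np

def train_dcf(x, y, lam=1e-2):
    """Single-channel DCF training (MOSSE-style sketch): ridge regression
    over all circular shifts of x reduces, via the DFT, to an element-wise
    division in the Fourier domain.
    x : 2-D feature patch; y : desired (e.g. Gaussian) response map."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(F, z):
    """Apply the learned Fourier-domain filter F to a new patch z; the peak
    of the returned response map is the predicted target location."""
    return np.real(np.fft.ifft2(F * np.fft.fft2(z)))
```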
Deep convolutional neural networks (CNNs) have shown impressive performance for many tasks, and are therefore of interest for DCF-based tracking. A CNN consists of several layers of convolution, normalization and pooling operations. Recently, activations from the last convolutional layers have been successfully employed for image classification. Features from these deep convolutional layers are discriminative while preserving spatial and structural information. Surprisingly, in the context of tracking, recent DCF-based methods [10,35] have demonstrated the importance of shallow convolutional layers. These layers provide higher spatial resolution, which is crucial for accurate target localization. However, fusing multiple layers in a DCF framework is still an open problem.
Fig. 1. Visualization of our continuous convolution operator, applied to a multi-resolution deep feature map. The feature map (left) consists of the input RGB patch along with the first and last convolutional layer of a pre-trained deep network. The second column visualizes the continuous convolution filters learned by our framework. The resulting continuous convolution outputs for each layer (third column) are combined into the final continuous confidence function (right) of the target (green box).
The conventional DCF formulation is limited to a single-resolution feature map. Therefore, all feature channels must have the same spatial resolution, as in e.g. the HOG descriptor. This limitation prohibits joint fusion of multiple convolutional layers with different spatial resolutions. A straightforward strategy to counter this restriction is to explicitly resample all feature channels to a common resolution. However, such a resampling strategy is cumbersome, adds redundant data, and introduces artifacts. Instead, a principled approach for integrating multi-resolution feature maps in the learning formulation is preferred.
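For illustration, explicit resampling amounts to something like the sketch below (with hypothetical shapes; `scipy.ndimage.zoom` stands in for any interpolation routine). It inflates low-resolution channels onto a common grid before a conventional single-resolution DCF can be applied, which is exactly the cumbersome baseline argued against above:

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_common(channels, target_shape):
    """Naive fusion of multi-resolution feature channels: bilinearly
    resample every channel to one common grid. Low-resolution channels are
    padded out with redundant, interpolated values and resampling artifacts.
    channels : list of 2-D arrays with different spatial resolutions."""
    out = []
    for c in channels:
        factors = (target_shape[0] / c.shape[0], target_shape[1] / c.shape[1])
        out.append(zoom(c, factors, order=1))
    return np.stack(out, axis=-1)  # H x W x D single-resolution feature map
```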
In this work, we propose a novel formulation for learning a convolution operator in the continuous spatial domain. The proposed learning formulation employs an implicit interpolation model of the training samples. Our approach learns a set of convolution filters to produce a continuous-domain confidence map of the target. This enables an elegant fusion of multi-resolution feature maps in a joint learning formulation. Figure 1 shows a visualization of our continuous convolution operator, when integrating multi-resolution deep feature maps. We validate the effectiveness of our approach on three object tracking benchmarks: OTB-2015 [46], Temple-Color [32] and VOT2015 [29]. On the challenging OTB-2015 with 100 videos, our object tracking framework improves the state-of-the-art from 77.3% to 82.4% in mean overlap precision.
In addition to multi-resolution fusion, our continuous domain learning formulation enables accurate sub-pixel localization. This is achieved by labeling the training samples with sub-pixel precise continuous confidence maps. Our formulation is therefore also suitable for accurate feature point tracking. Further, our learning-based approach is discriminative and does not require explicit interpolation of the image to achieve sub-pixel accuracy. We demonstrate the accuracy and robustness of our approach by performing extensive feature point tracking experiments on the popular MPI Sintel dataset [7].
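In the proposed framework, sub-pixel accuracy comes from maximizing the learned continuous confidence function itself. As a rough illustration of why moving beyond the discrete grid helps, the sketch below refines a discrete response peak by fitting a quadratic through the peak and its neighbours; this is a common lightweight stand-in, not the method of this paper:

```python
import numpy as np

def subpixel_peak(response):
    """Refine the argmax of a 2-D response map to sub-pixel precision by
    fitting a 1-D parabola through the peak and its two neighbours along
    each axis. Returns fractional (row, col) coordinates."""
    r, c = np.unravel_index(np.argmax(response), response.shape)

    def vertex(fm, f0, fp):
        # Vertex offset of the parabola through f(-1), f(0), f(1), in [-0.5, 0.5].
        denom = fm - 2.0 * f0 + fp
        return 0.0 if denom == 0.0 else 0.5 * (fm - fp) / denom

    dr = dc = 0.0
    if 0 < r < response.shape[0] - 1:
        dr = vertex(response[r - 1, c], response[r, c], response[r + 1, c])
    if 0 < c < response.shape[1] - 1:
        dc = vertex(response[r, c - 1], response[r, c], response[r, c + 1])
    return r + dr, c + dc
```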
2 Related Work
Discriminative Correlation Filters (DCF) [5,11,24] have shown promising results for object tracking. These methods exploit the properties of circular correlation for training a regressor in a sliding-window fashion. Initially, the DCF approaches were restricted to a single feature channel. The DCF framework was later extended to multi-channel feature maps. The multi-channel DCF allows high-dimensional features, such as HOG and Color Names, to be incorporated for improved tracking. In addition to the incorporation of multi-channel features, the DCF framework has recently been significantly improved by, e.g., including scale estimation, non-linear kernels, a long-term memory, and by alleviating the periodic effects of circular convolution [11,15,18].
With the advent of deep CNNs, fully connected layers of the network have been commonly employed for image representation [38,43]. Recently, the last (deep) convolutional layers were shown to be more beneficial for image classification. On the other hand, the first (shallow) convolutional layer was shown to be more suitable for visual tracking than the deeper layers [10]. The deep convolutional layers are discriminative and possess high-level visual information. In contrast, the shallow layers contain low-level features at high spatial resolution, beneficial for localization. Ma et al. [35] employed multiple convolutional layers in a hierarchical ensemble of independent DCF trackers. Instead, we propose a novel continuous formulation to fuse multiple convolutional layers with different spatial resolutions in a joint learning framework.
Unlike object tracking, feature point tracking is the task of accurately estimating the motion of distinctive key-points. It is a core component in many vision systems. Most feature point tracking methods are derived from the classic Kanade-Lucas-Tomasi (KLT) tracker [34,44]. The KLT tracker is a generative method based on minimizing the sum of squared differences between two image patches. In recent decades, significant effort has been spent on improving the KLT tracker. In contrast, we propose a discriminative learning based approach for feature point tracking.
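For contrast with the discriminative approach, a single Gauss-Newton step of the classic translational KLT update can be sketched as follows (a simplified illustration with hypothetical inputs, ignoring image pyramids, warps beyond translation, and robust weighting):

```python
import numpy as np

def klt_translation_step(I, J, pt, half=7):
    """One Gauss-Newton step of the generative KLT objective
    min_p sum_w (J(x + p) - I(x))^2 over a (2*half+1)^2 window around the
    key-point pt = (row, col). Solves the normal equations
    (sum g g^T) dp = sum g (I - J), where g is the spatial gradient of J."""
    y, x = pt
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    gy, gx = np.gradient(J)
    gy, gx = gy[win].ravel(), gx[win].ravel()
    err = (I[win] - J[win]).ravel()
    G = np.array([[gy @ gy, gy @ gx],
                  [gy @ gx, gx @ gx]])
    b = np.array([gy @ err, gx @ err])
    return np.linalg.solve(G, b)  # translation update (dy, dx)
```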
Our approach: Our main contribution is a theoretical framework for learning discriminative convolution operators in the continuous spatial domain. Our formulation has two major advantages compared to the conventional DCF framework. Firstly, it allows a natural integration of multi-resolution feature maps, e.g. combinations of convolutional layers or multi-resolution HOG and color features. This property is especially desirable for object tracking, detection and action recognition applications. Secondly, our continuous formulation enables accurate sub-pixel localization, crucial in many feature point tracking problems.
3 Learning Continuous Convolution Operators
In this section, we present a theoretical framework for learning continuous convolution operators. Our formulation is generic and can be applied for supervised learning tasks, such as visual tracking and detection.
3.1 Preliminaries and Notation
In this paper, we utilize basic concepts and results in continuous Fourier analysis. For clarity, we first formulate our learning method for data defined in a one-dimensional domain, i.e. for functions of a single spatial variable. We then describe the generalization to higher dimensions, including images, in section 3.5.
We consider the space $L^2(T)$ of complex-valued functions $g : \mathbb{R} \to \mathbb{C}$ that are periodic with period $T > 0$ and square Lebesgue integrable. The space $L^2(T)$ is a Hilbert space equipped with an inner product $\langle \cdot , \cdot \rangle$. For functions $g, h \in L^2(T)$,

$$\langle g, h \rangle = \frac{1}{T} \int_0^T g(t) \, \overline{h(t)} \, \mathrm{d}t \,, \qquad g * h \,(t) = \frac{1}{T} \int_0^T g(t - s) \, h(s) \, \mathrm{d}s \,. \tag{1}$$
Here, the bar denotes complex conjugation. In (1) we have also defined the circular convolution operation $* : L^2(T) \times L^2(T) \to L^2(T)$.
In our derivations, we use the complex exponential functions $e_k(t) = e^{i \frac{2\pi}{T} k t}$, since they are eigenfunctions of the convolution operation (1). The set $\{e_k\}_{k=-\infty}^{\infty}$ further forms an orthonormal basis for $L^2(T)$. We define the Fourier coefficients of $g \in L^2(T)$ as $\hat{g}[k] = \langle g, e_k \rangle$. For clarity, we use square brackets for functions with discrete domains. Any $g \in L^2(T)$ can be expressed in terms of its Fourier series $g = \sum_{k=-\infty}^{\infty} \hat{g}[k] \, e_k$. The Fourier coefficients satisfy Parseval's formula $\|g\|^2 = \|\hat{g}\|_{\ell^2}^2$, where $\|g\|^2 = \langle g, g \rangle$ and $\|\hat{g}\|_{\ell^2}^2 = \sum_{k=-\infty}^{\infty} |\hat{g}[k]|^2$ is the squared $\ell^2$-norm. Further, the Fourier coefficients satisfy the two convolution properties $\widehat{g * h} = \hat{g} \hat{h}$ and $\widehat{g h} = \hat{g} * \hat{h}$, where $\hat{g} * \hat{h}\,[k] = \sum_{l=-\infty}^{\infty} \hat{g}[k - l] \, \hat{h}[l]$.
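These identities are easy to verify numerically. The sketch below (our own sanity check, not from the paper) samples two effectively band-limited $T$-periodic functions at $N$ points, where the Fourier coefficients are recovered as $\hat{g}[k] \approx \mathrm{DFT}(g)[k] / N$, and confirms Parseval's formula and the first convolution property:

```python
import numpy as np

T, N = 1.0, 64
t = np.arange(N) * T / N
g = np.cos(2 * np.pi * 3 * t / T) + 0.5 * np.sin(2 * np.pi * 5 * t / T)
h = np.exp(np.cos(2 * np.pi * t / T))  # smooth, effectively band-limited

g_hat, h_hat = np.fft.fft(g) / N, np.fft.fft(h) / N

# Parseval: ||g||^2 = (1/T) int_0^T |g|^2 dt equals the squared l2-norm
# of the Fourier coefficients.
assert np.isclose(np.mean(np.abs(g) ** 2), np.sum(np.abs(g_hat) ** 2))

# Convolution property: the coefficients of g * h (the circular convolution
# defined in (1), approximated by a scaled discrete circular convolution)
# are the element-wise products g_hat[k] * h_hat[k].
conv = np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)) / N
assert np.allclose(np.fft.fft(conv) / N, g_hat * h_hat)
```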
3.2 Our Continuous Learning Formulation
Here we formulate our novel learning approach. The aim is to train a continuous convolution operator based on training samples $x_j$. The samples consist of feature maps extracted from image patches. Each sample $x_j$ contains $D$ feature channels $x_j^1, \ldots, x_j^D$, extracted from the same image patch. Conventional DCF formulations assume the feature channels to have the same spatial resolution, i.e. the same number of spatial sample points. Unlike previous works, we eliminate this restriction in our formulation and let $N_d$ denote the number of spatial samples in channel $d$. In our formulation, the feature channel $x_j^d$ is viewed as a function $x_j^d[n]$ indexed by the discrete spatial variable $n \in \{0, \ldots, N_d - 1\}$. The sample space is expressed as $\mathcal{X} = \mathbb{R}^{N_1} \times \cdots \times \mathbb{R}^{N_D}$.
To pose the learning problem in the continuous spatial domain, we introduce an implicit interpolation model of the training samples. We regard the continuous interval $[0, T) \subset \mathbb{R}$ as the spatial support of the feature map.
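Concretely, each discrete channel is mapped onto the continuous period by an interpolation operator of the form $J_d\{x^d\}(t) = \sum_{n=0}^{N_d - 1} x^d[n] \, b_d\big(t - \tfrac{T}{N_d} n\big)$ for some interpolation kernel $b_d$. A minimal sketch of this operator follows; the Gaussian kernel used here is a hypothetical stand-in for the paper's choice of interpolation function:

```python
import numpy as np

def interpolate_channel(x_d, b, t, T=1.0):
    """Implicit interpolation J_d{x^d}(t) = sum_n x^d[n] * b(t - n*T/N_d):
    a discrete feature channel x_d is viewed as a continuous T-periodic
    function, evaluated here at the continuous query points t in [0, T)."""
    N_d = len(x_d)
    n = np.arange(N_d)
    # Periodic displacements t - n*T/N_d, wrapped into [-T/2, T/2).
    d = (t[:, None] - n[None, :] * T / N_d + T / 2) % T - T / 2
    return (x_d[None, :] * b(d)).sum(axis=1)

# Hypothetical Gaussian interpolation kernel (a stand-in, for illustration):
b = lambda d: np.exp(-0.5 * (d / 0.03) ** 2)
t = np.linspace(0.0, 1.0, 200, endpoint=False)
z = interpolate_channel(np.random.randn(16), b, t)  # continuous-domain values
```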