Original paper: arXiv:1312.4659
DeepPose: Human Pose Estimation via Deep Neural Networks
Alexander Toshev
Google
1600 Amphitheatre Pkwy
Mountain View, CA 94043
toshev, szegedy@google.com
Figure 1. Besides extreme variability in articulations, many of the joints are barely visible. We can guess the location of the right arm in the left image only because we see the rest of the pose and anticipate the motion or activity of the person. Similarly, the left body half of the person on the right is not visible at all. These are examples of the need for holistic reasoning. We believe that DNNs can naturally provide such type of reasoning.
Abstract
We propose a method for human pose estimation based on Deep Neural Networks (DNNs). The pose estimation is formulated as a DNN-based regression problem towards body joints. We present a cascade of such DNN regressors which results in high precision pose estimates. The approach has the advantage of reasoning about pose in a holistic fashion and has a simple but yet powerful formulation which capitalizes on recent advances in Deep Learning. We present a detailed empirical analysis with state-of-the-art or better performance on four academic benchmarks of diverse real-world images.
1. Introduction
The problem of human pose estimation, defined as the problem of localization of human joints, has enjoyed substantial attention in the computer vision community. In Fig. 1, one can see some of the challenges of this problem - strong articulations, small and barely visible joints, occlusions and the need to capture the context.
The main stream of work in this field has been motivated mainly by the first challenge, the need to search in the large space of all possible articulated poses. Part-based models lend themselves naturally to model articulations ([16, 8]) and in the recent years a variety of models with efficient inference have been proposed ([6, 19]).
The above efficiency, however, is achieved at the cost of limited expressiveness - the use of local detectors, which reason in many cases about a single part, and most importantly by modeling only a small subset of all interactions between body parts. These limitations, as exemplified in Fig. 1, have been recognized and methods reasoning about pose in a holistic manner have been proposed [15, 21] but with limited success in real-world problems.
In this work we ascribe to this holistic view of human pose estimation. We capitalize on recent developments of deep learning and propose a novel algorithm based on a Deep Neural Network (DNN). DNNs have shown outstanding performance on visual classification tasks [14] and more recently on object localization [23, 9]. However, the question of applying DNNs for precise localization of articulated objects has largely remained unanswered. In this paper we attempt to cast a light on this question and present a simple and yet powerful formulation of holistic human pose estimation as a DNN.
We formulate the pose estimation as a joint regression problem and show how to successfully cast it in DNN settings. The location of each body joint is regressed to using the full image and a 7-layered generic convolutional DNN as an input. There are two advantages of this formulation. First, the DNN is capable of capturing the full context of each body joint - each joint regressor uses the full image as a signal. Second, the approach is substantially simpler to formulate than methods based on graphical models - no need to explicitly design feature representations and detectors for parts; no need to explicitly design a model topology and interactions between joints. Instead, we show that a generic convolutional DNN can be learned for this problem.
Further, we propose a cascade of DNN-based pose predictors. Such a cascade allows for increased precision of joint localization. Starting with an initial pose estimation based on the full image, we learn DNN-based regressors which refine the joint predictions by using higher resolution sub-images.
We show state-of-the-art results or better than state-of-the-art on four widely used benchmarks against all reported results. We show that our approach performs well on images of people which exhibit strong variation in appearance as well as articulations. Finally, we show generalization performance by cross-dataset evaluation.
2. Related Work
The idea of representing articulated objects in general, and human pose in particular, as a graph of parts has been advocated from the early days of computer vision [16]. The so-called Pictorial Structures (PSs), introduced by Fischler and Elschlager [8], were made tractable and practical by Felzenszwalb and Huttenlocher [6] using the distance transform trick. As a result, a wide variety of PS-based models with practical significance were subsequently developed.
The above tractability, however, comes with the limitation of having tree-based pose models with simple binary potentials that do not depend on image data. As a result, research has focused on enriching the representational power of the models while maintaining tractability. Earlier attempts to achieve this were based on richer part detectors [19, 1, 4]. More recently, a wide variety of models expressing complex joint relationships were proposed. Yang and Ramanan [27] use a mixture model of parts. Mixture models on the full model scale, by having mixtures of PSs, have been studied by Johnson and Everingham [13]. Richer higher-order spatial relationships were captured in a hierarchical model by Tian et al. [25]. A different approach to capture higher-order relationships is through image-dependent PS models, which can be estimated via a global classifier [26, 20, 18].
Approaches which ascribe to our philosophy of reasoning about pose in a holistic manner have shown limited practicality. Mori and Malik [15] try to find for each test image the closest exemplar from a set of labeled images and transfer the joint locations. A similar nearest neighbor setup is employed by Shakhnarovich et al. [21], who however use locality sensitive hashing. More recently, Gkioxari et al. [10] propose a semi-global classifier for part configuration. This formulation has shown very good results on real-world data, however, it is based on linear classifiers with less expressive representation than ours and is tested on arms only. Finally, the idea of pose regression has been employed by Ionescu et al. [11], however they reason about 3D pose.
The closest work to ours uses convolutional NNs together with Neighborhood Component Analysis to regress towards a point in an embedding representing pose [24]. However, this work does not employ a cascade of networks. Cascades of DNN regressors have been used for localization, however of facial points [22]. On the related problem of face pose estimation, Osadchy et al. [17] employ an NN-based pose embedding trained with a contrastive loss.
3. Deep Learning Model for Pose Estimation
We use the following notation. To express a pose, we encode the locations of all $k$ body joints in a pose vector defined as $\mathbf{y} = (\ldots, \mathbf{y}_i^T, \ldots)^T$, $i \in \{1, \ldots, k\}$, where $\mathbf{y}_i$ contains the $x$ and $y$ coordinates of the $i$-th joint. A labeled image is denoted by $(x, \mathbf{y})$, where $x$ stands for the image data and $\mathbf{y}$ is the ground truth pose vector.
Further, since the joint coordinates are in absolute image coordinates, it proves beneficial to normalize them w.r.t. a box $b$ bounding the human body or parts of it. In a trivial case, the box can denote the full image. Such a box is defined by its center $b_c \in \mathbb{R}^2$ as well as width $b_w$ and height $b_h$: $b = (b_c, b_w, b_h)$. Then the joint $\mathbf{y}_i$ can be translated by the box center and scaled by the box size, which we refer to as normalization by $b$:

$$N(\mathbf{y}_i; b) = \begin{pmatrix} 1/b_w & 0 \\ 0 & 1/b_h \end{pmatrix} (\mathbf{y}_i - b_c) \qquad (1)$$
Further, we can apply the same normalization to the elements of the pose vector, $N(\mathbf{y}; b) = (\ldots, N(\mathbf{y}_i; b)^T, \ldots)^T$, resulting in a normalized pose vector. Finally, with a slight abuse of notation, we use $N(x; b)$ to denote a crop of the image $x$ by the bounding box $b$, which de facto normalizes the image by the box. For brevity we denote by $N(\cdot)$ normalization with $b$ being the full image box.
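To make the normalization concrete, the following is a minimal sketch of $N(\mathbf{y}_i; b)$ and its inverse in Python/NumPy. The language choice and the function names are ours, purely for illustration, and do not come from the paper's implementation.

```python
import numpy as np

def normalize_joint(y_i, box):
    """N(y_i; b): translate a joint by the box center and scale by the box size."""
    center, width, height = box                      # box b = (b_c, b_w, b_h)
    return (np.asarray(y_i) - np.asarray(center)) / np.array([width, height])

def denormalize_joint(y_norm, box):
    """N^{-1}(.; b): map a normalized joint back to absolute image coordinates."""
    center, width, height = box
    return np.asarray(y_norm) * np.array([width, height]) + np.asarray(center)

def normalize_pose(pose, box):
    """Apply the same normalization to every joint of a flattened pose vector (length 2k)."""
    joints = np.asarray(pose).reshape(-1, 2)
    return np.concatenate([normalize_joint(j, box) for j in joints])
```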
3.1. Pose Estimation as DNN-based Regression
In this work, we treat the problem of pose estimation as regression, where we train and use a function $\psi(x; \theta) \in \mathbb{R}^{2k}$ which, for an image $x$, regresses to a normalized pose vector, and where $\theta$ denotes the parameters of the model. Thus, using the normalization transformation from Eq. (1), the pose prediction $\mathbf{y}^*$ in absolute image coordinates reads

$$\mathbf{y}^* = N^{-1}(\psi(N(x); \theta)) \qquad (2)$$
Despite its simple formulation, the power and complexity of the method is in $\psi$, which is based on a convolutional Deep Neural Network (DNN). Such a convolutional network consists of several layers, each being a linear transformation followed by a non-linear one. The first layer takes as input an image of predefined size and has a size equal to the number of pixels times three color channels. The last layer outputs the target values of the regression, in our case $2k$ joint coordinates.
We base the architecture of $\psi$ on the work by Krizhevsky et al. [14] for image classification since it has shown outstanding results on object localization as well [23]. In a nutshell, the network consists of 7 layers (see Fig. 2 left). Denote by $C$ a convolutional layer, by $LRN$ a local response normalization layer, by $P$ a pooling layer and by $F$ a fully connected layer. Only $C$ and $F$ layers contain learnable parameters, while the rest are parameter free. Both $C$ and $F$ layers consist of a linear transformation followed by a nonlinear one, which in our case is a rectified linear unit. For $C$ layers, the size is defined as width $\times$ height $\times$ depth, where the first two dimensions have a spatial meaning while the depth defines the number of filters. If we write the size of each layer in parentheses, then the network can be described concisely as $C(55 \times 55 \times 96) - LRN - P - C(27 \times 27 \times 256) - LRN - P - C(13 \times 13 \times 384) - C(13 \times 13 \times 384) - C(13 \times 13 \times 256) - P - F(4096) - F(4096)$.
The filter size for the first two $C$ layers is $11 \times 11$ and $5 \times 5$ and for the remaining three is $3 \times 3$. Pooling is applied after three layers and contributes to increased performance despite the reduction of resolution. The input to the net is an image of $220 \times 220$ which via a stride of 4 is fed into the network. The total number of parameters in the above model is about 40M. For further details, we refer the reader to [14].
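As an illustration only, a network of this shape could be written as follows in PyTorch (the paper predates PyTorch, so this is a sketch; the padding, pooling and dropout placements are assumptions, with the paper's nominal layer sizes noted in the comments).

```python
import torch.nn as nn

def build_pose_regressor(num_joints):
    # AlexNet-style architecture [14] with the classification head replaced by a
    # 2k-dimensional regression output. Padding/stride values are assumptions.
    return nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),      # C(55x55x96)
        nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),          # LRN - P
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),     # C(27x27x256)
        nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),          # LRN - P
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),    # C(13x13x384)
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),    # C(13x13x384)
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),    # C(13x13x256)
        nn.MaxPool2d(3, stride=2),                                    # P
        nn.Flatten(),
        nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.6),              # F(4096)
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.6),            # F(4096)
        nn.Linear(4096, 2 * num_joints),                              # 2k joint coordinates
    )
```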
Figure 2. Left: schematic view of the DNN-based pose regression. We visualize the network layers with their corresponding dimensions, where convolutional layers are in blue, while fully connected ones are in green. We do not show the parameter free layers. Right: at stage $s$, a refining regressor is applied on a sub-image to refine a prediction from the previous stage.
The use of a generic DNN architecture is motivated by its outstanding results on both classification and localization problems. In the experimental section we show that such a generic architecture can be used to learn a model resulting in state-of-the-art or better performance on pose estimation as well. Further, such a model is a truly holistic one - the final joint location estimate is based on a complex nonlinear transformation of the full image.
Additionally, the use of a DNN obviates the need to design a domain specific pose model. Instead such a model and the features are learned from the data. Although the regression loss does not model explicit interactions between joints, such are implicitly captured by all of the 7 hidden layers - all the internal features are shared by all joint regressors.
Training The difference to [14] is the loss. Instead of a classification loss, we train a linear regression on top of the last network layer to predict a pose vector by minimizing the $L_2$ distance between the prediction and the true pose vector. Since the ground truth pose vector is defined in absolute image coordinates and poses vary in size from image to image, we normalize our training set $D$ using the normalization from Eq. (1):

$$D_N = \{(N(x), N(\mathbf{y})) \mid (x, \mathbf{y}) \in D\} \qquad (3)$$
Then the $L_2$ loss for obtaining optimal network parameters reads:

$$\arg\min_{\theta} \sum_{(x, \mathbf{y}) \in D_N} \sum_{i=1}^{k} \| \mathbf{y}_i - \psi_i(x; \theta) \|_2^2 \qquad (4)$$
For clarity we write out the optimization over individual joints. It should be noted, that the above objective can be used even if for some images not all joints are labeled. In this case, the corresponding terms in the sum would be omitted.
The above parameters $\theta$ are optimized using Backpropagation in a distributed online implementation. For each mini-batch of size 128, adaptive gradient updates are computed [3]. The learning rate, as the most important parameter, is set to 0.0005. Since the model has a large number of parameters and the used datasets are of relatively small size, we augment the data using a large number of randomly translated image crops (see Sec. 3.2), left/right flips, as well as DropOut regularization for the $F$ layers set to 0.6.
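A hedged sketch of a single training step is given below, reusing the helpers and network sketched earlier. The use of torch.optim.Adagrad stands in for the adaptive gradient updates of [3] and is our assumption; only the learning rate and batch size are stated in the text above.

```python
import torch

def training_step(net, optimizer, images, target_poses_norm):
    """One mini-batch update with the L2 regression loss of Eq. (4).

    images:            (B, 3, 220, 220) crops, already normalized as N(x).
    target_poses_norm: (B, 2k) ground-truth poses, normalized as N(y).
    """
    optimizer.zero_grad()
    pred = net(images)                                    # psi(N(x); theta)
    per_joint = (pred - target_poses_norm).reshape(pred.shape[0], -1, 2)
    loss = per_joint.pow(2).sum(dim=2).sum(dim=1).mean()  # summed squared joint distances
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.Adagrad(net.parameters(), lr=0.0005)  # mini-batches of 128
```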
3.2. Cascade of Pose Regressors
The pose formulation from the previous section has the advantage that the joint estimation is based on the full image and thus relies on context. However, due to its fixed input size of $220 \times 220$, the network has limited capacity to look at detail - it learns filters capturing pose properties at coarse scale. These are necessary to estimate rough pose but insufficient to always precisely localize the body joints. Note that we cannot easily increase the input size since this will increase the already large number of parameters. In order to achieve better precision, we propose to train a cascade of pose regressors. At the first stage, the cascade starts off by estimating an initial pose as outlined in the previous section. At subsequent stages, additional DNN regressors are trained to predict a displacement of the joint locations from the previous stage to the true location. Thus, each subsequent stage can be thought of as a refinement of the currently predicted pose, as shown in Fig. 2.
Further, each subsequent stage uses the predicted joint locations to focus on the relevant parts of the image - sub-images are cropped around the predicted joint location from previous stage and the pose displacement regressor for this joint is applied on this sub-image. In this way, subsequent pose regressors see higher resolution images and thus learn features for finer scales which ultimately leads to higher precision.
We use the same network architecture for all stages of the cascade but learn different network parameters. For stage $s \in \{1, \ldots, S\}$ of the total $S$ cascade stages, we denote by $\theta_s$ the learned network parameters. Thus, the pose displacement regressor reads $\psi(x; \theta_s)$. To refine a given joint location $\mathbf{y}_i$ we will consider a joint bounding box $b_i$ capturing the sub-image around $\mathbf{y}_i$: $b_i(\mathbf{y}; \sigma) = (\mathbf{y}_i, \sigma\,\mathrm{diam}(\mathbf{y}), \sigma\,\mathrm{diam}(\mathbf{y}))$, having as center the $i$-th joint and as dimension the pose diameter scaled by $\sigma$. The diameter $\mathrm{diam}(\mathbf{y})$ of the pose is defined as the distance between opposing joints on the human torso, such as left shoulder and right hip, and depends on the concrete pose definition and dataset.
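Below is an illustrative helper (continuing the earlier NumPy sketch) for the joint box $b_i(\mathbf{y}; \sigma)$. Which pair of torso joints defines $\mathrm{diam}(\mathbf{y})$ is dataset-specific, so it is passed in explicitly here; the names are hypothetical.

```python
import numpy as np

def pose_diameter(pose, joint_a, joint_b):
    """diam(y): distance between two opposing torso joints (e.g. left shoulder, right hip)."""
    joints = np.asarray(pose).reshape(-1, 2)
    return float(np.linalg.norm(joints[joint_a] - joints[joint_b]))

def joint_box(pose, i, sigma, joint_a, joint_b):
    """b_i(y; sigma): box (center, width, height) centered on joint i with side sigma * diam(y)."""
    side = sigma * pose_diameter(pose, joint_a, joint_b)
    center = np.asarray(pose).reshape(-1, 2)[i]
    return (center, side, side)
```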
Using the above notation, at stage $s = 1$ we start with a bounding box $b^0$ which either encloses the full image or is obtained by a person detector. We obtain an initial pose:

Stage 1: $$\mathbf{y}^1 \leftarrow N^{-1}(\psi(N(x; b^0); \theta_1); b^0) \qquad (5)$$
At each subsequent stage $s \geq 2$, for all joints $i \in \{1, \ldots, k\}$, we regress first towards a refinement displacement $\mathbf{y}_i^s - \mathbf{y}_i^{(s-1)}$ by applying a regressor on the sub-image defined by $b_i^{(s-1)}$ from the previous stage $(s-1)$. Then, we estimate new joint boxes $b_i^s$:

Stage $s$: $$\mathbf{y}_i^s \leftarrow \mathbf{y}_i^{(s-1)} + N^{-1}(\psi_i(N(x; b); \theta_s); b) \quad \text{for } b = b_i^{(s-1)} \qquad (6)$$

$$b_i^s \leftarrow (\mathbf{y}_i^s, \sigma\,\mathrm{diam}(\mathbf{y}^s), \sigma\,\mathrm{diam}(\mathbf{y}^s)) \qquad (7)$$
We apply the cascade for a fixed number of stages $S$, which is determined as explained in Sec. 4.1.
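Putting the stages together, cascade inference might look like the sketch below. It reuses the helpers sketched above; `crop_and_resize(image, box)` is a hypothetical placeholder for the image normalization $N(x; b)$ (crop the box and rescale to the network input size), and `nets` is a list of per-stage regressors.

```python
def cascade_predict(image, init_box, nets, sigma, joint_a, joint_b):
    """nets[0] realizes Eq. (5); nets[1:] realize the per-joint refinements of Eqs. (6)-(7)."""
    # Stage 1: regress the full pose from the initial box (full image or person detection).
    y_norm = nets[0](crop_and_resize(image, init_box)).reshape(-1, 2)
    pose = np.concatenate([denormalize_joint(j, init_box) for j in y_norm])
    # Stages s >= 2: crop around each current joint estimate and refine it. Since the box
    # is centered on the previous estimate, denormalizing the new prediction amounts to
    # adding a scaled displacement to that estimate.
    for net in nets[1:]:
        refined = []
        for i in range(pose.size // 2):
            b = joint_box(pose, i, sigma, joint_a, joint_b)            # b_i^(s-1)
            out_i = net(crop_and_resize(image, b)).reshape(-1, 2)[i]   # psi_i(N(x; b); theta_s)
            refined.append(denormalize_joint(out_i, b))                # back to image coordinates
        pose = np.concatenate(refined)
    return pose
```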
Training The network parameters $\theta_1$ are trained as outlined in Sec. 3.1, Eq. (4). At subsequent stages $s \geq 2$, the training is done identically with one important difference. Each joint $i$ from a training example $(x, \mathbf{y})$ is normalized using a different bounding box $(\mathbf{y}_i^{(s-1)}, \sigma\,\mathrm{diam}(\mathbf{y}^{(s-1)}), \sigma\,\mathrm{diam}(\mathbf{y}^{(s-1)}))$ - the one centered at the prediction for the same joint obtained from the previous stage - so that we condition the training of the stage based on the model from the previous stage.
Since deep learning methods have large capacity, we augment the training data by using multiple normalizations for each image and joint. Instead of using the prediction from the previous stage only, we generate simulated predictions. This is done by randomly displacing the ground truth location for joint $i$ by a vector sampled at random from a 2-dimensional Normal distribution $\mathcal{N}_i^{(s-1)}$ with mean and variance equal to the mean and variance of the observed displacements $(\mathbf{y}_i^{(s-1)} - \mathbf{y}_i)$ across all examples in the training data. The full augmented training data can be defined by first sampling an example and a joint from the original data at uniform and then generating a simulated prediction based on a sampled displacement $\delta$ from $\mathcal{N}_i^{(s-1)}$:

$$D_A^s = \{(N(x; b), N(\mathbf{y}_i; b)) \mid (x, \mathbf{y}_i) \sim D, \; \delta \sim \mathcal{N}_i^{(s-1)}, \; b = (\mathbf{y}_i + \delta, \sigma\,\mathrm{diam}(\mathbf{y}), \sigma\,\mathrm{diam}(\mathbf{y}))\} \qquad (8)$$
The training objective for cascade stage $s$ is done as in Eq. (4), taking extra care to use the correct normalization for each joint:

$$\theta_s = \arg\min_{\theta} \sum_{(x, \mathbf{y}_i) \in D_A^s} \| \mathbf{y}_i - \psi_i(x; \theta) \|_2^2 \qquad (9)$$
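The augmentation of Eq. (8) can be sketched as follows: sample a displacement from the fitted 2-D normal and normalize the true joint by the resulting box. The distribution parameters (`mean_disp`, `cov_disp`) are assumed to have been estimated beforehand from the previous stage's errors, `crop_and_resize` is the same placeholder as above, and all names are illustrative.

```python
def sample_augmented_example(image, pose, i, mean_disp, cov_disp, sigma,
                             joint_a, joint_b, rng=np.random):
    """Build one simulated training pair (N(x; b), N(y_i; b)) for cascade stage s."""
    y_i = np.asarray(pose).reshape(-1, 2)[i]
    delta = rng.multivariate_normal(mean_disp, cov_disp)     # delta ~ N_i^(s-1)
    side = sigma * pose_diameter(pose, joint_a, joint_b)
    box = (y_i + delta, side, side)                           # box around the simulated prediction
    return crop_and_resize(image, box), normalize_joint(y_i, box)
```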
4. Empirical Evaluation
4.1. Setup
Datasets There is a wide variety of benchmarks for human pose estimation. In this work we use datasets which have a large number of training examples, sufficient to train a large model such as the proposed DNN, and which are realistic and challenging.
The first dataset we use is Frames Labeled In Cinema (FLIC), introduced by [20], which consists of 4000 training and 1000 test images obtained from popular Hollywood movies. The images contain people in diverse poses and especially diverse clothing. For each labeled human, 10 upper body joints are labeled.
The second dataset we use is the Leeds Sports Dataset [12] and its extension [13], which we will jointly denote by LSP. Combined, they contain 11000 training and 1000 testing images. These are images from sports activities and as such are quite challenging in terms of appearance and especially articulations. In addition, the majority of people are only about 150 pixels tall, which makes the pose estimation even more challenging. In this dataset, the full body of each person is labeled with a total of 14 joints.
For all of the above datasets, we define the diameter of a pose $\mathbf{y}$ to be the distance between a shoulder and hip from opposing sides and denote it by $\mathrm{diam}(\mathbf{y})$.