Stacked Hourglass Networks for Human Pose Estimation

Original paper: arXiv:1603.06937

Stacked Hourglass Networks for Human Pose Estimation

Alejandro Newell, Kaiyu Yang, and Jia Deng

University of Michigan, Ann Arbor {alnewell,yangky,jiadeng}@umich.edu

Abstract. This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.

Keywords: Human Pose Estimation

Fig. 1. Our network for pose estimation consists of multiple stacked hourglass modules which allow for repeated bottom-up, top-down inference.

1 Introduction

A key step toward understanding people in images and video is accurate pose estimation. Given a single RGB image, we wish to determine the precise pixel location of important keypoints of the body. Achieving an understanding of a person's posture and limb articulation is useful for higher level tasks like action recognition, and also serves as a fundamental tool in fields such as human-computer interaction and animation.

As a well established problem in vision, pose estimation has plagued researchers with a variety of formidable challenges over the years. A good pose estimation system must be robust to occlusion and severe deformation, successful on rare and novel poses, and invariant to changes in appearance due to factors like clothing and lighting. Early work tackles such difficulties using robust image features and sophisticated structured prediction [1-9]: the former is used to produce local interpretations, whereas the latter is used to infer a globally consistent pose.

This conventional pipeline, however, has been greatly reshaped by convolutional neural networks (ConvNets) [10-14], a main driver behind an explosive rise in performance across many computer vision tasks. Recent pose estimation systems [15-20] have universally adopted ConvNets as their main building block, largely replacing hand-crafted features and graphical models; this strategy has yielded drastic improvements on standard benchmarks [1, 21, 22].

We continue along this trajectory and introduce a novel "stacked hourglass" network design for predicting human pose. The network captures and consolidates information across all scales of the image. We refer to the design as an hourglass based on our visualization of the steps of pooling and subsequent upsampling used to get the final output of the network. Like many convolutional approaches that produce pixel-wise outputs, the hourglass network pools down to a very low resolution, then upsamples and combines features across multiple resolutions [15, 23]. On the other hand, the hourglass differs from prior designs primarily in its more symmetric topology.

We expand on a single hourglass by consecutively placing multiple hourglass modules together end-to-end. This allows for repeated bottom-up, top-down inference across scales. In conjunction with the use of intermediate supervision, repeated bidirectional inference is critical to the network's final performance. The final network architecture achieves a significant improvement on the state-of-the-art for two standard pose estimation benchmarks (FLIC [1] and MPII Human Pose [21]). On MPII there is over a 2% average accuracy improvement across all joints, with as much as a 4-5% improvement on more difficult joints like the knees and ankles. ${}^{1}$
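
To make the idea of intermediate supervision concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that every stacked module emits its own heatmap prediction and that each prediction is compared against the same ground-truth heatmap. The use of mean squared error, and the names `mse` and `intermediate_supervision_loss`, are illustrative assumptions that do not come from the text above.

```python
def mse(pred, target):
    """Mean squared error between two equally sized 2D heatmaps (assumed loss)."""
    count = sum(len(row) for row in target)
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / count


def intermediate_supervision_loss(stage_predictions, target):
    """Sum the per-stage loss over the output of every stacked module."""
    return sum(mse(pred, target) for pred in stage_predictions)


# Toy 2x2 ground-truth heatmap and the predictions of two hypothetical stages.
target = [[0.0, 1.0], [0.0, 0.0]]
stage_predictions = [
    [[0.5, 0.5], [0.0, 0.0]],  # earlier stage: coarse guess
    [[0.0, 1.0], [0.0, 0.0]],  # later stage: matches the target
]
total_loss = intermediate_supervision_loss(stage_predictions, target)
```

Because every stage contributes to the total, the earlier modules receive a direct training signal instead of relying only on gradients passed back from the final output.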

2 Related Work

With the introduction of "DeepPose" by Toshev et al. [24], research on human pose estimation began the shift from classic approaches [1-9] to deep networks. Toshev et al. use their network to directly regress the x,y coordinates of joints. The work by Tompson et al. [15] instead generates heatmaps by running an image through multiple resolution banks in parallel to simultaneously capture features at a variety of scales. Our network design largely builds off of their work, exploring how to capture information across scales and adapting their method for combining features across different resolutions.

${}^{1}$ Code is available at www-personal.umich.edu/~alnewell/pose

Fig. 2. Example output produced by our network. On the left we see the final pose estimate provided by the max activations across each heatmap. On the right we show sample heatmaps. (From left to right: neck, left elbow, left wrist, right knee, right ankle)
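
The caption above describes reading the final pose from the maximum activation of each heatmap. A minimal sketch of that decoding step follows, assuming one 2D heatmap per joint stored as nested lists. The function names, the joint labels, and the toy values are hypothetical and chosen only for illustration.

```python
def heatmap_argmax(heatmap):
    """Return (x, y, score) for the strongest activation in a 2D heatmap."""
    best_x, best_y, best_score = 0, 0, heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_score:
                best_x, best_y, best_score = x, y, value
    return best_x, best_y, best_score


def decode_pose(heatmaps):
    """Map each joint name to the location of its heatmap maximum."""
    return {joint: heatmap_argmax(hm) for joint, hm in heatmaps.items()}


# Toy 3x3 heatmaps for two joints (values are made up).
heatmaps = {
    "neck": [[0.0, 0.1, 0.0],
             [0.0, 0.9, 0.2],
             [0.0, 0.0, 0.0]],
    "left_wrist": [[0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.7, 0.1, 0.0]],
}
pose = decode_pose(heatmaps)
```

Coordinates here are in heatmap pixels; mapping them back to the input image would need a scale factor that depends on the network's output resolution.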

A critical feature of the method proposed by Tompson et al. [15] is the joint use of a ConvNet and a graphical model. Their graphical model learns typical spatial relationships between joints. Others have recently tackled this in similar ways [17, 20, 25] with variations on how to approach unary score generation and pairwise comparison of adjacent joints. Chen et al. [25] cluster detections into typical orientations so that when their classifier makes predictions additional information is available indicating the likely location of a neighboring joint. We achieve superior performance without the use of a graphical model or any explicit modeling of the human body.

There are several examples of methods making successive predictions for pose estimation. Carreira et al. [19] use what they refer to as Iterative Error Feedback. A set of predictions is included with the input, and each pass through the network further refines these predictions. Their method requires multi-stage training and the weights are shared across each iteration. Wei et al. [18] build on the work of multi-stage pose machines [26] but now with the use of ConvNets for feature extraction. Given our use of intermediate supervision, our work is similar in spirit to these methods, but our building block (the hourglass module) is different. Hu & Ramanan [27] have an architecture more similar to ours that can also be used for multiple stages of predictions, but their model ties weights in the bottom-up and top-down portions of computation as well as across iterations.

Tompson et al. build on their work in [15] with a cascade to refine predictions. This serves to increase efficiency and reduce memory usage of their method while improving localization performance in the high precision range [16]. One consideration is that for many failure cases a refinement of position within a local window would not offer much improvement since error cases often consist of either occluded or misattributed limbs. For both situations, any further evaluation at a local scale will not improve the prediction.

There are variations to the pose estimation problem which include the use of additional features such as depth or motion cues [28-30]. Also, there is the more challenging task of simultaneous annotation of multiple people [17, 31]. In addition, there is work like that of Oliveira et al. [32] that performs human part segmentation based on fully convolutional networks [23]. Our work focuses solely on the task of keypoint localization of a single person's pose from an RGB image.

Fig. 3. An illustration of a single "hourglass" module. Each box in the figure corresponds to a residual module as seen in Figure 4. The number of features is consistent across the whole hourglass.

Our hourglass module before stacking is closely connected to fully convolutional networks [23] and other designs that process spatial information at multiple scales for dense prediction [15, 33-41]. Xie et al. [33] give a summary of typical architectures. Our hourglass module differs from these designs mainly in its more symmetric distribution of capacity between bottom-up processing (from high resolutions to low resolutions) and top-down processing (from low resolutions to high resolutions). For example, fully convolutional networks [23] and holistically-nested architectures [33] are both heavy in bottom-up processing but light in their top-down processing, which consists only of a (weighted) merging of predictions across multiple scales. Fully convolutional networks are also trained in multiple stages.

The hourglass module before stacking is also related to conv-deconv and encoder-decoder architectures [42-45]. Noh et al. [42] use the conv-deconv architecture to do semantic segmentation, and Rematas et al. [44] use it to predict reflectance maps of objects. Zhao et al. [43] develop a unified framework for supervised, unsupervised and semi-supervised learning by adding a reconstruction loss. Yang et al. [46] employ an encoder-decoder architecture without skip connections for image generation. Rasmus et al. [47] propose a denoising auto-encoder with special, "modulated" skip connections for unsupervised/semi-supervised feature learning. The symmetric topology of these networks is similar, but the nature of the operations is quite different in that we do not use unpooling or deconv layers. Instead, we rely on simple nearest neighbor upsampling and skip connections for top-down processing. Another major difference of our work is that we perform repeated bottom-up, top-down inference by stacking multiple hourglasses.
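
The top-down path described above (nearest neighbor upsampling merged with skip connections) can be sketched on a single-channel feature map. This is a toy illustration under stated assumptions, not the authors' network: the learned residual modules are replaced by the identity, pooling is assumed to be 2x2 max pooling, and the skip branch is assumed to merge by element-wise addition. All function names are hypothetical.

```python
def max_pool_2x2(fm):
    """Halve the resolution with 2x2 max pooling (assumed pooling type)."""
    return [[max(fm[y][x], fm[y][x + 1], fm[y + 1][x], fm[y + 1][x + 1])
             for x in range(0, len(fm[0]), 2)]
            for y in range(0, len(fm), 2)]


def upsample_nearest_2x(fm):
    """Double the resolution by repeating every value in a 2x2 block."""
    out = []
    for row in fm:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized feature maps (assumed merge)."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def hourglass(fm, depth):
    """Recursive bottom-up, top-down pass; identity stands in for learned layers."""
    if depth == 0:
        return fm
    skip = fm                              # branch kept at the current resolution
    low = hourglass(max_pool_2x2(fm), depth - 1)
    return add(skip, upsample_nearest_2x(low))


features = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
output = hourglass(features, depth=2)
```

The output has the same resolution as the input, and each position combines its full-resolution value with information pooled from progressively coarser scales. The sketch requires the input height and width to be divisible by 2 raised to the power `depth`.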

3 Network Architecture

3.1 Hourglass Design

The design of the hourglass is motivated by the need to capture information at every scale. While local evidence is essential for identifying features like faces and
