
Original paper: arXiv:1802.00434

DensePose: Dense Human Pose Estimation In The Wild

Rıza Alp Güler*

INRIA-CentraleSupélec

riza.guler@inria.fr

Natalia Neverova

Facebook AI Research

nneverova@fb.com

Iasonas Kokkinos

Facebook AI Research

iasonask@fb.com

Figure 1: Dense pose estimation aims at mapping all human pixels of an RGB image to the 3D surface of the human body. We introduce DensePose-COCO, a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images, and train DensePose-RCNN to densely regress part-specific UV coordinates within every human region at multiple frames per second. Left: the image and the correspondence regressed by DensePose-RCNN. Middle: DensePose-COCO dataset annotations. Right: partitioning and UV parametrization of the body surface.

Abstract

In this work, we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We first gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence 'in the wild', namely in the presence of background, occlusions and scale variations. We improve our training set's effectiveness by training an 'inpainting' network that can fill in missing ground truth values, and report clear improvements with respect to the best results that would be achievable in the past. We experiment with fully-convolutional networks and region-based models and observe a superiority of the latter; we further improve accuracy through cascading, obtaining a system that delivers highly-accurate results in real time. Supplementary materials and videos are provided on the project page http://densepose.org.

1. Introduction

This work aims at pushing further the envelope of human understanding in images by establishing dense correspondences from a 2D image to a 3D, surface-based representation of the human body. We can understand this task as involving several other problems, such as object detection, pose estimation, part and instance segmentation either as special cases or prerequisites. Addressing this task has applications in problems that require going beyond plain landmark localization, such as graphics, augmented reality, or human-computer interaction, and could also be a stepping stone towards general 3D-based object understanding.

The task of establishing dense correspondences from an image to a surface-based model has been addressed mostly in the setting where a depth sensor is available, as in the Vitruvian manifold of [41], metric regression forests [33], or the more recent dense point cloud correspondence of [44]. By contrast, in our case we consider a single RGB image as input, based on which we establish a correspondence between surface points and image pixels.

Several other works have recently aimed at recovering dense correspondences between pairs [3] or sets of RGB images [48, 10] in an unsupervised setting. More recently, [42] used the equivariance principle in order to align sets of images to a common coordinate system, while following the general idea of groupwise image alignment, e.g. [23, 21].

While these works are aiming at general categories, our work is focused on arguably the most important visual category, humans. For humans one can simplify the task by exploiting parametric deformable surface models, such as


* Rıza Alp Güler was with Facebook AI Research during this work.


Figure 2: We annotate dense correspondence between images and a 3D surface model by asking the annotators to segment the image into semantic regions and to then localize the corresponding surface point for each of the sampled points on any of the rendered part images. The red cross indicates the currently annotated point. The surface coordinates of the rendered views localize the collected 2D points on the 3D model.

the Skinned Multi-Person Linear (SMPL) model of [2], or the more recent Adam model of [14] obtained through carefully controlled 3D surface acquisition. Turning to the task of image-to-surface mapping, in [2], the authors propose a two-stage method of first detecting human landmarks through a CNN and then fitting a parametric deformable surface model to the image through iterative minimization. In parallel to our work, [20] develop the method of [2] to operate in an end-to-end fashion, incorporating the iterative reprojection error minimization as a module of a deep network that recovers 3D camera pose and the low-dimensional body parametrization.

Our methodology differs from all these works in that we take a full-blown supervised learning approach and gather ground-truth correspondences between images and a detailed, accurate parametric surface model of the human body [27]: rather than using the SMPL model at test time, we only use it as a means of defining our problem during training. Our approach can be understood as the next step in the line of works on extending the standard for humans in [26, 1, 19, 7, 40, 18, 28]. Human part segmentation masks have been provided in the Fashionista [46], PASCAL-Parts [6], and Look-Into-People (LIP) [12] datasets; these can be understood as providing a coarsened version of image-to-surface correspondence, where rather than continuous coordinates one predicts discretized part labels. Surface-level supervision was only recently introduced for synthetic images in [43], while in [22] a dataset of 8515 images is annotated with keypoints and semi-automated fits of 3D models to images. In this work, instead of compromising the extent and realism of our training set, we introduce a novel annotation pipeline that allows us to gather ground-truth correspondences for 50K images of the COCO dataset, yielding our new DensePose-COCO dataset.

Our work is closest in spirit to the recent DenseReg framework [13], where CNNs were trained to successfully establish dense correspondences between a 3D model and images 'in the wild'. That work focused mainly on faces, and evaluated their results on datasets with moderate pose variability. Here, however, we are facing new challenges, due to the higher complexity and flexibility of the human body, as well as the larger variation in poses. We address these challenges by designing appropriate architectures, as described in Sec. 3, which yield substantial improvements over a DenseReg-type fully convolutional architecture. By combining our approach with the recent Mask-RCNN system of [15] we show that a discriminatively trained model can recover highly-accurate correspondence fields for complex scenes involving tens of persons with real-time speed: on a GTX 1080 GPU our system operates at 20-26 frames per second for a 240×320 image or 4-5 frames per second for an 800×1100 image.

Our contributions can be summarized in three points. Firstly, as described in Sec. 2, we introduce the first manually-collected ground truth dataset for the task, by gathering dense correspondences between the SMPL model [27] and persons appearing in the COCO dataset. This is accomplished through a novel annotation pipeline that exploits 3D surface information during annotation.

Secondly, as described in Sec. 3, we use the resulting dataset to train CNN-based systems that deliver dense correspondence 'in the wild', by regressing body surface coordinates at any image pixel. We experiment with both fully-convolutional architectures, relying on Deeplab [4], and also with region-based systems, relying on Mask-RCNN [15], observing a superiority of region-based models over fully-convolutional networks. We also consider cascading variants of our approach, yielding further improvements over existing architectures.

Thirdly, we explore different ways of exploiting our constructed ground truth information. Our supervision signal is defined over a randomly chosen subset of image pixels per training sample. We use these sparse correspondences to train a 'teacher' network that can 'inpaint' the supervision signal in the rest of the image domain. Using this inpainted signal results in clearly better performance when compared to either sparse points, or any other existing dataset, as shown experimentally in Sec. 4.

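As a rough sketch of how such an inpainted supervision signal could be assembled (the array layout and the helper name `build_dense_targets` below are our own assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def build_dense_targets(sparse_uv, annotated_mask, teacher_uv, fg_mask):
    """Combine sparse manual UV annotations with teacher-'inpainted' values.

    sparse_uv      : (H, W, 2) UV targets, valid only where annotated_mask holds
    annotated_mask : (H, W) bool, pixels carrying manual ground truth
    teacher_uv     : (H, W, 2) UV predictions of the trained 'teacher' network
    fg_mask        : (H, W) bool, pixels belonging to the person

    Returns dense UV targets and the mask of pixels entering the loss.
    """
    # Keep the human annotation where it exists, fall back to the teacher elsewhere.
    targets = np.where(annotated_mask[..., None], sparse_uv, teacher_uv)
    # With inpainting every foreground pixel is supervised; without it, the
    # loss would be restricted to the sparse annotated_mask alone.
    return targets, fg_mask
```
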
Figure 3: The user interface for collecting per-part correspondence annotations: We provide the annotators with six pre-rendered views of a body part such that the whole part surface is visible. Once the target point is annotated, the point is displayed on all rendered images simultaneously.

Our experiments indicate that dense human pose estimation is to a large extent feasible, but still has space for improvement. We conclude our paper with some qualitative results and directions that show the potential of the method. We will make code and data publicly available from our project's webpage, densepose.org.

2. COCO-DensePose Dataset

Gathering rich, high-quality training sets has been a catalyst for progress in the classification [38], detection and segmentation [8, 26] tasks. There currently exists no manually collected ground-truth for dense human pose estimation for real images. The works of [22] and [43] can be used as surrogates, but, as we show in Sec. 4, provide worse supervision.

In this Section we introduce our COCO-DensePose dataset, along with evaluation measures that allow us to quantify progress in the task in Sec. 4. We have gathered annotations for 50K humans, collecting more than 5 million manually annotated correspondences.

We start with a presentation of our annotation pipeline, since this required several design choices that may be more generally useful for 3D annotation. We then turn to an analysis of the accuracy of the gathered ground-truth, along with the resulting performance measures used to assess the different methods.

2.1. Annotation System

In this work, we involve human annotators to establish dense correspondences from 2D images to surface-based representations of the human body. If done naively, this would require 'hunting vertices' for every 2D image point, by manipulating a surface through rotations, which can be frustratingly inefficient. Instead, we construct an annotation pipeline through which we can efficiently gather annotations for image-to-surface correspondence.

As shown in Fig. 2, in the first stage we ask annotators to delineate regions corresponding to visible, semantically defined body parts. These include Head, Torso, Lower/Upper Arms, Lower/Upper Legs, Hands and Feet. In order to simplify the UV parametrization, we design the parts to be isomorphic to a plane, partitioning the limbs and torso into lower-upper and frontal-back parts.

For head, hands and feet, we use the manually obtained UV fields provided in the SMPL model [27]. For the rest of the parts we obtain the unwrapping via multidimensional scaling applied to pairwise geodesic distances. The UV fields for the resulting 24 parts are visualized in Fig. 1 (right).

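A minimal sketch of this unwrapping, assuming the pairwise geodesic distance matrix for one part's mesh vertices has already been computed (the mesh processing itself, and the function name `unwrap_part`, are assumptions for illustration):

```python
import numpy as np
from sklearn.manifold import MDS

def unwrap_part(geodesic_dist):
    """Embed one part's mesh vertices into a 2D UV chart via metric MDS.

    geodesic_dist : (N, N) symmetric matrix of pairwise geodesic distances
                    between the N mesh vertices of the part.
    Returns an (N, 2) array of UV coordinates, normalized to the unit square.
    """
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    uv = mds.fit_transform(geodesic_dist)
    uv -= uv.min(axis=0)   # shift the chart to the origin
    uv /= uv.max()         # scale into [0, 1] x [0, 1]
    return uv
```
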
We instruct the annotators to estimate the body part behind the clothes, so that for instance wearing a large skirt would not complicate the subsequent annotation of correspondences. In the second stage we sample every part region with a set of roughly equidistant points obtained via k-means and request the annotators to bring these points in correspondence with the surface. The number of sampled points varies based on the size of the part, and the maximum number of sampled points per part is 14. In order to simplify this task we 'unfold' the part surface by providing six pre-rendered views of the same body part and allow the user to place landmarks on any of them (Fig. 3). This allows the annotator to choose the most convenient point of view by selecting one among six options instead of manually rotating the surface.

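A sketch of this sampling step; the cap of 14 points per part follows the text, while snapping the k-means centers back onto mask pixels is our own assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_annotation_points(part_mask, max_points=14):
    """Pick roughly equidistant pixels inside a binary part mask via k-means.

    part_mask : (H, W) bool array for one body-part region.
    Returns up to max_points (y, x) pixel coordinates.
    """
    coords = np.argwhere(part_mask)                 # (N, 2) foreground pixels
    k = min(max_points, len(coords))
    if k == 0:
        return np.empty((0, 2), dtype=int)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    # Snap each cluster center to its nearest pixel so every point lies on the mask.
    nearest = np.argmin(
        ((coords[:, None] - km.cluster_centers_[None]) ** 2).sum(-1), axis=0)
    return coords[nearest]
```
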
As the user indicates a point on any of the rendered part views, its surface coordinates are used to simultaneously show its position on the remaining views; this gives a global overview of the correspondence. The image points are presented to the annotator in a horizontal/vertical succession, which makes it easier to deliver geometrically consistent annotations by avoiding self-crossings of the surface. This two-stage annotation process has allowed us to very efficiently gather highly accurate correspondences. If we quantify the complexity of the annotation task in terms of the time it takes to complete it, we have seen that the part segmentation and correspondence annotation tasks take approximately the same time, which is surprising given the more challenging nature of the latter task. Visualizations of the collected annotations are provided in Fig. 4, while the partitioning of the surface and the U, V coordinates are shown in Fig. 1.

Figure 4: Visualization of annotations: Image (left), U (middle) and V (right) values for the collected points.

2.2. Accuracy of human annotators

We assess human annotators with respect to a gold-standard measure of performance. Typically in pose estimation one asks multiple annotators to label the same landmark, which is then used to assess the variance in position, e.g. [26, 36]. In our case, we can render images where we have access to the true mesh coordinates used to render a pixel. We thereby directly compare the true position used during rendering and the one estimated by annotators, rather than first estimating a 'consensus' landmark location among multiple human annotators.

In particular, we provide annotators with synthetic images generated through the exact same surface model as the one we use in our ground-truth annotation, exploiting the rendering system and textures of [43]. We then ask annotators to bring the synthesized images into correspondence with the surface using our annotation tool, and for every image $k$ estimate the geodesic distance $d_{i,k}$ between the correct surface point $i$ and the point estimated by human annotators $\hat{i}_k$:

$$d_{i,k} = g(i, \hat{i}_k),$$

where $g(\cdot, \cdot)$ measures the geodesic distance between two surface points.

For any image $k$, we annotate and estimate the error only on a randomly sampled set of surface points $\mathcal{S}_k$ and interpolate the errors on the remainder of the surface. Finally, we average the errors across all $K$ examples used to assess annotator performance.

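A sketch of this protocol, under the assumptions that surface points are represented as mesh vertex ids and that geodesic distances are looked up in a precomputed vertex-to-vertex matrix; the interpolation over the remainder of the surface is omitted for brevity:

```python
import numpy as np

def mean_annotator_error(geo, true_vertices, annotated_vertices):
    """Average geodesic annotation error over K synthetic images.

    geo                : (V, V) precomputed geodesic distances between mesh vertices
    true_vertices      : list of K integer arrays, ground-truth vertex ids
                         for the sampled points S_k of each image
    annotated_vertices : list of K integer arrays, the annotators' estimates
    """
    per_image = [geo[t, e].mean()                  # d_{i,k} = g(i, i_hat_k)
                 for t, e in zip(true_vertices, annotated_vertices)]
    return float(np.mean(per_image))
```
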
As shown in Fig. 5 the annotation errors are substantially smaller on small surface parts with distinctive features that could help localization (face, hands, feet), while on larger uniform areas that are typically covered by clothes (torso, back, hips) the annotator errors can get larger.

2.3. Evaluation Measures

We consider two different ways of summarizing correspondence accuracy over the whole human body, including pointwise and per-instance evaluation.

Pointwise evaluation. This approach evaluates correspondence accuracy over the whole image domain through the Ratio of Correct Point (RCP) correspondences, where a correspondence is declared correct if the geodesic distance is below a certain threshold. As the threshold $t$ varies, we obtain a curve $f(t)$, whose area provides us with a scalar summary of the correspondence accuracy. For any given image we have a varying set of points coming with ground-truth signals. We summarize performance on the ensemble of such points, gathered across images. We evaluate the area under the curve (AUC), $\mathrm{AUC}_a = \frac{1}{a}\int_0^a f(t)\,dt$, for two different values of $a = 10\,\mathrm{cm}$ and $a = 30\,\mathrm{cm}$, yielding $\mathrm{AUC}_{10}$ and $\mathrm{AUC}_{30}$ respectively, where $\mathrm{AUC}_{10}$ is understood as being an accuracy measure for more refined correspondence. This performance measure is easily applicable to both single- and multi-person scenarios and can deliver directly comparable values. In Fig. 6, we provide the per-part pointwise evaluation of the human annotator performance on synthetic data, which can be seen as an upper bound for the performance of our systems.

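A sketch of this measure, assuming the geodesic errors of all ground-truth points, pooled across images, are collected in a single array (thresholds in centimeters):

```python
import numpy as np

def rcp_auc(geodesic_errors, a, num_thresholds=256):
    """RCP curve f(t) and its normalized area AUC_a = (1/a) * integral over [0, a].

    geodesic_errors : 1-D array of geodesic distances (cm) between predicted
                      and ground-truth surface points, pooled over images.
    a               : threshold cap in cm, e.g. 10 for AUC_10 or 30 for AUC_30.
    """
    t = np.linspace(0.0, a, num_thresholds)
    f = (geodesic_errors[None, :] <= t[:, None]).mean(axis=1)  # ratio correct at each t
    return t, f, np.trapz(f, t) / a

# Usage: _, _, auc10 = rcp_auc(errors, a=10); _, _, auc30 = rcp_auc(errors, a=30)
```
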
Per-instance evaluation. Inspired by the object keypoint similarity (OKS) measure used for pose evaluation on the COCO dataset [26, 36], we introduce geodesic point similarity (GPS).

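A sketch of a per-instance score of this form, assuming the Gaussian-of-geodesic-distance definition and the normalization parameter κ from the published version of the paper:

```python
import numpy as np

def gps(geodesic_errors, kappa=0.255):
    """Geodesic Point Similarity for one person instance.

    geodesic_errors : geodesic distances between matched ground-truth and
                      predicted surface points for this instance.
    kappa           : normalization parameter (0.255 follows the published
                      paper; treat the exact value as an assumption here).
    """
    g = np.asarray(geodesic_errors, dtype=float)
    return float(np.exp(-g ** 2 / (2.0 * kappa ** 2)).mean())
```
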
Figure 5: Average human annotation error as a function of surface position.
