Source: parkhi15.pdf
Deep Face Recognition
Omkar M. Parkhi
Andrea Vedaldi
Andrew Zisserman
Abstract
The goal of this paper is face recognition - from either a single photograph or from a set of faces tracked in a video. Recent progress in this area has been due to two factors: (i) end to end learning for the task using a convolutional neural network (CNN), and (ii) the availability of very large scale training datasets.
We make two contributions: first, we show how a very large scale dataset (2.6M images, over 2.6K people) can be assembled by a combination of automation and human in the loop, and discuss the trade-off between data purity and time; second, we traverse through the complexities of deep network training and face recognition to present methods and procedures to achieve comparable state of the art results on the standard LFW and YTF face benchmarks.
1 Introduction
Convolutional Neural Networks (CNNs) have taken the computer vision community by storm, significantly improving the state of the art in many applications. One of the most important ingredients for the success of such methods is the availability of large quantities of training data. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [16] was instrumental in providing this data for the general image classification task. More recently, researchers have made datasets available for segmentation, scene classification and image segmentation [12, 83].
In the world of face recognition, however, large scale public datasets have been lacking and, largely due to this factor, most of the recent advances in the community remain restricted to Internet giants such as Facebook and Google. For example, the most recent face recognition method by Google [17] was trained using 200 million images and eight million unique identities. The size of this dataset is almost three orders of magnitude larger than any publicly available face dataset (see Table 1). Needless to say, building a dataset this large is beyond the capabilities of most international research groups, particularly in academia.
This paper has two goals. The first one is to propose a procedure to create a reasonably large face dataset whilst requiring only a limited amount of person-power for annotation. To this end we propose a method for collecting face data using knowledge sources available on the web (Section 3). We employ this procedure to build a dataset with over two million faces,
(C) 2015. The copyright of this document resides with its authors.
It may be distributed unchanged freely in print or electronic forms.
Table 1: Dataset comparisons: Our dataset has the largest collection of face images outside industrial datasets by Google, Facebook, or Baidu, which are not publicly available.
and will make this freely available to the research community. The second goal is to investigate various CNN architectures for face identification and verification, including exploring face alignment and metric learning, using the novel dataset for training (Section 4). Many recent works on face recognition have proposed numerous variants of CNN architectures for faces, and we assess some of these modelling choices in order to filter what is important from irrelevant details. The outcome is a much simpler and yet effective network architecture achieving near state-of-the-art results on all popular image and video face recognition benchmarks (Sections 5 and 6). Our findings are summarised in Section 6.2.
2 Related Work
This paper focuses on face recognition in images and videos, a problem that has received significant attention in the recent past. Among the many methods proposed in the literature, we distinguish the ones that do not use deep learning, which we refer to as "shallow", from the ones that do, which we call "deep". Shallow methods start by extracting a representation of the face image using handcrafted local image descriptors such as SIFT, LBP, or HOG [8, 22, 23, 32]; then they aggregate such local descriptors into an overall face descriptor by using a pooling mechanism, for example the Fisher Vector [19, 20]. There is a large variety of such methods which cannot be described in detail here (see for example the references in [13] for an overview).
This work is concerned mainly with deep architectures for face recognition. The defining characteristic of such methods is the use of a CNN feature extractor, a learnable function obtained by composing several linear and non-linear operators. A representative system of this class of methods is DeepFace [29]. This method uses a deep CNN trained to classify faces using a dataset of 4 million examples spanning 4000 unique identities. It also uses a siamese network architecture, where the same CNN is applied to pairs of faces to obtain descriptors that are then compared using the Euclidean distance. The goal of training is to minimise the distance between congruous pairs of faces (i.e. portraying the same identity) and maximise the distance between incongruous pairs, a form of metric learning. In addition to using a very large amount of training data, DeepFace uses an ensemble of CNNs, as well as a pre-processing phase in which face images are aligned to a canonical pose using a 3D model. When introduced, DeepFace achieved the best performance on the Labelled Faces in the Wild (LFW; [8]) benchmark as well as the YouTube Faces in the Wild (YFW; [32]) benchmark. The authors later extended this work in [30], by increasing the size of the dataset by two orders of magnitude, including 10 million identities and 50 images per identity. They proposed a bootstrapping strategy to select identities to train the network and showed that the generalisation of the network can be improved by controlling the dimensionality of the fully connected layer.
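The siamese comparison described above can be sketched in a few lines. This is a minimal illustration, not the DeepFace architecture: the linear-ReLU `embed` function, its dimensions, and the random inputs are all assumptions standing in for the shared CNN.

```python
import numpy as np

def embed(face, W):
    """Toy stand-in for the shared CNN: one linear layer plus ReLU,
    followed by L2 normalisation (illustrative, not the real network)."""
    z = np.maximum(W @ face, 0.0)
    return z / (np.linalg.norm(z) + 1e-12)

def siamese_distance(face_a, face_b, W):
    """Apply the SAME network to both faces, then compare the resulting
    descriptors with the Euclidean distance, as in the siamese setup."""
    return np.linalg.norm(embed(face_a, W) - embed(face_b, W))

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 4096))       # shared weights
a = rng.standard_normal(4096)              # a face representation
a_jittered = a + 0.01 * rng.standard_normal(4096)  # "same identity"
b = rng.standard_normal(4096)              # an unrelated face

d_same = siamese_distance(a, a_jittered, W)
d_diff = siamese_distance(a, b, W)
```

Training then pushes `d_same` down and `d_diff` up; with a frozen random network, near-identical inputs already map to nearby descriptors.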
The DeepFace work was extended by the DeepID series of papers by Sun et al. [24, 25, 26, 27], each of which incrementally but steadily increased the performance on LFW and
Figure 1: Example images from our dataset for six identities.
YFW. A number of new ideas were incorporated over this series of papers, including: using multiple CNNs [25], a Bayesian learning framework [2] to train a metric, multi-task learning over classification and verification [24], different CNN architectures which branch a fully connected layer after each convolution layer [26], and very deep networks inspired by [19, 23] in [24]. Compared to DeepFace, DeepID does not use 3D face alignment, but a simpler 2D affine alignment (as we do in this paper) and trains on a combination of CelebFaces [23] and WDRef [2]. However, the final model in [27] is quite complicated, involving around 200 CNNs.
Very recently, researchers from Google [17] used a massive dataset of 200 million face images and 800 million image face pairs to train a CNN similar to [23] and [13]. A point of difference is their use of a "triplet-based" loss, in which a pair of congruous faces (a, b), portraying the same identity, and a third, incongruous face c are compared. The goal is to make a closer to b than to c; in other words, differently from other metric learning approaches, comparisons are always relative to a "pivot" face. This matches more closely how the metric is used in applications, where a query face is compared to a database of other faces to find the matching ones. In training this loss is applied at multiple layers, not just the final one. This method currently achieves the best performance on LFW and YTF.
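The relative comparison above is commonly written as a hinge loss over squared distances. The sketch below is illustrative only: the margin value and the toy 2-D vectors are assumptions, not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: require the anchor-positive distance to
    be smaller than the anchor-negative distance by at least `margin`.
    The margin value here is an assumption for illustration."""
    d_pos = np.sum((anchor - positive) ** 2)  # distance to same identity
    d_neg = np.sum((anchor - negative) ** 2)  # distance to different identity
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])   # the "pivot" (anchor) face descriptor
b = np.array([0.9, 0.1])   # same identity as the anchor
c = np.array([-1.0, 0.0])  # a different identity
```

A well-ordered triplet (`a` already closer to `b` than to `c` by more than the margin) incurs zero loss; swapping the roles of `b` and `c` produces a positive loss that training would reduce.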
3 Dataset Collection
In this section we propose a multi-stage strategy to effectively collect a large face dataset containing hundreds of example images for thousands of unique identities (Table 1). The different stages of this process and corresponding statistics are summarised in Table 2. Individual stages are discussed in detail in the following paragraphs.
Stage 1. Bootstrapping and filtering a list of candidate identity names. The first stage in building the dataset is to obtain a list of names of candidate identities for obtaining faces. The idea is to focus on celebrities and public figures, such as actors or politicians, so that a sufficient number of distinct images are likely to be found on the web, and also to avoid any privacy issues in downloading their images. An initial list of public figures is obtained by extracting males and females, ranked by popularity, from the Internet Movie Data Base (IMDB) celebrity list. This list, which contains mostly actors, is intersected with all the people in the Freebase knowledge graph [II], which has information on about 500K different identities, resulting in ranked lists of 2,500 males and 2,500 females. This forms a candidate list of names which are known to be popular (from IMDB), and for which we have attribute information such as ethnicity, age, and kinship (from the knowledge graph). The total of 5,000 names was chosen to make the subsequent annotation process manageable for a small annotator team.
The candidate list is then filtered to remove identities for which there are not enough distinct images, and to eliminate any overlap with standard benchmark datasets. To this end 200 images for each of the 5,000 names are downloaded using Google Image Search. The 200 images are then presented to human annotators (sequentially, in four groups of 50) to determine which identities result in sufficient image purity. Specifically, annotators are asked to retain an identity only if the corresponding set of 200 images is roughly 90% pure. The lack of purity could be due to homonymy or image scarcity. This filtering step reduces the candidate list to 3,250 identities. Next, any names appearing in the LFW and YTF datasets are removed in order to make it possible to train on the new dataset and still evaluate fairly on those benchmarks. In this manner, a final list of 2,622 celebrity names is obtained.
Stage 2. Collecting more images for each identity. Each of the 2,622 celebrity names is queried in both Google and Bing Image Search, and then again after appending the keyword "actor" to the names. This results in four queries per name and 500 results for each, obtaining 2,000 images for each identity.
Stage 3. Improving purity with an automatic filter. The aim of this stage is to remove any erroneous faces in each set automatically using a classifier. To achieve this, the top 50 images (based on Google search rank in the downloaded set) for each identity are used as positive training samples, and the top 50 images of all other identities are used as negative training samples. A one-vs-rest linear SVM is trained for each identity using the Fisher Vector Faces descriptor [19, 20]. The linear SVM for each identity is then used to rank the 2,000 downloaded images for that identity, and the top 1,000 are retained (the threshold number of 1,000 was chosen to favour high precision in the positive predictions).
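The ranking step of Stage 3 can be sketched as follows. A fixed linear scorer `(w, b)` stands in for the trained one-vs-rest SVM, and random vectors stand in for Fisher Vector descriptors; all dimensions and values are illustrative assumptions.

```python
import numpy as np

def rank_and_keep(descriptors, w, b, keep=1000):
    """Score each downloaded image with the identity's linear model
    (w, b play the role of a trained one-vs-rest linear SVM) and keep
    only the `keep` highest-scoring images, favouring high precision."""
    scores = descriptors @ w + b
    order = np.argsort(-scores)  # indices, highest score first
    return order[:keep]

rng = np.random.default_rng(1)
w = rng.standard_normal(64)                          # "trained" weight vector
inliers = w + 0.1 * rng.standard_normal((30, 64))    # images of the identity
outliers = -w + 0.1 * rng.standard_normal((10, 64))  # erroneous downloads
descs = np.vstack([inliers, outliers])

kept = rank_and_keep(descs, w, b=0.0, keep=30)
```

In the paper's pipeline `keep` would be 1,000 out of 2,000 downloads per identity; the toy example keeps 30 of 40, and the retained indices are exactly the inliers.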
Stage 4. Near duplicate removal. Exact duplicate images arising from the same image being found by two different search engines, or by copies of the same image being found at two different Internet locations, are removed. Near duplicates (e.g. images differing only in colour balance, or with text superimposed) are also removed. This is done by computing the VLAD descriptor [2, 8] for each image, clustering such descriptors within the 1,000 images for each identity using a very tight threshold, and retaining a single element per cluster.
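A minimal sketch of the near-duplicate removal in Stage 4, as greedy single-pass clustering with a tight Euclidean threshold: the threshold value and the toy 3-D descriptors are assumptions (the paper uses VLAD descriptors and its own threshold).

```python
import numpy as np

def dedup(descriptors, threshold=0.1):
    """Keep an image only if its descriptor is farther than `threshold`
    from every descriptor already kept -- i.e. retain a single
    representative per tight cluster of near-duplicates."""
    kept = []
    for i, d in enumerate(descriptors):
        if all(np.linalg.norm(d - descriptors[j]) > threshold for j in kept):
            kept.append(i)
    return kept

base = np.array([1.0, 2.0, 3.0])
descs = np.stack([base,            # original image
                  base + 1e-3,     # near-duplicate (e.g. colour shift)
                  base + 5.0])     # genuinely different image
kept = dedup(descs)
```

The near-duplicate collapses onto its cluster's first member, while the distinct image survives.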
Stage 5. Final manual filtering. At this point there are 2,622 identities and up to 1,000 images per identity. The aim of this final stage is to increase the purity (precision) of the data using human annotations. However, in order to make the annotation task less burdensome, and hence avoid high annotation costs, annotators are aided by using automatic ranking once more. This time, however, a multi-way CNN is trained to discriminate between the 2,622 face identities using the AlexNet architecture of [III]; then the softmax scores are used to rank images within each identity set by decreasing likelihood of being an inlier. In order to accelerate the work of the annotators, the ranked images of each identity are displayed in blocks of 200 and annotators are asked to validate blocks as a whole. In particular, a block is declared good if its approximate purity is greater than 95%. The final number of good images is 982,803, of which approximately 95% are frontal and 5% profile.
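The rank-then-block scheme of Stage 5 can be sketched as follows. The toy logits and the block size of 2 are for illustration only; the paper ranks each identity's images by its CNN softmax score and shows annotators blocks of 200.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def annotation_blocks(logits, identity, block_size=200):
    """Rank an identity's images by the softmax score for that identity
    (most likely inliers first) and split the ranking into fixed-size
    blocks that annotators accept or reject as a whole."""
    scores = softmax(logits)[:, identity]
    order = np.argsort(-scores)  # indices, highest score first
    return [order[i:i + block_size] for i in range(0, len(order), block_size)]

# Five images scored against three identities; column 0 is the identity
# being curated, so higher first-column logits mean "more likely inlier".
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [5.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [4.0, 0.0, 0.0]])
blocks = annotation_blocks(logits, identity=0, block_size=2)
```

Annotators then see the most confident images first, so a whole block can usually be validated with one decision.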
Discussion. Overall, this combination of using Internet search engines, filtering data using existing face recognition methods, and limited manual curation is able to produce an accurate