Original paper: arXiv:1502.00873
DeepID3: Face Recognition with Very Deep Neural Networks
Yi Sun, Ding Liang, Xiaogang Wang, Xiaoou Tang
Department of Information Engineering, The Chinese University of Hong Kong; SenseTime Group
Department of Electronic Engineering, The Chinese University of Hong Kong; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
sy011@ie.cuhk.edu.hk, liangding@sensetime.com, xgwang@ee.cuhk.edu.hk, xtang@ie.cuhk.edu.hk
Abstract
The state of the art in face recognition has been significantly advanced by the emergence of deep learning. Very deep neural networks have recently achieved great success in general object recognition because of their superb learning capacity. This motivates us to investigate their effectiveness for face recognition. This paper proposes two very deep neural network architectures, referred to as DeepID3, for face recognition. The two architectures are rebuilt from the stacked convolution and inception layers proposed in VGG net [10] and GoogLeNet [16] to make them suitable for face recognition. Joint face identification-verification supervisory signals are added to both intermediate and final feature extraction layers during training. An ensemble of the two proposed architectures achieves 99.53% LFW face verification accuracy and 96.0% LFW rank-1 face identification accuracy. A further discussion of the LFW face verification result is given at the end.
1. Introduction
Using deep neural networks to learn effective feature representations has become popular in face recognition [12, 20, 17, 22, 14, 13, 18, 21, 19, 15]. With better deep network architectures and supervisory methods, face recognition accuracy has improved rapidly in recent years. In particular, a few notable face representation learning techniques have emerged. An early effort to learn deep face representations in a supervised way employed face verification as the supervisory signal [12], which requires classifying a pair of training images as showing the same person or not. It greatly reduced the intra-personal variations in the face representation. Learning discriminative deep face representations through large-scale face identity classification (face identification) was then proposed by DeepID [14] and DeepFace [17, 18]. When training images are classified into a large number of identities, the last hidden layer of the deep neural network forms rich identity-related features. With this technique, deep learning came close to human performance for the first time on tightly cropped face images of the extensively evaluated LFW face verification dataset [6]. However, the learned face representation could still contain significant intra-personal variations. Motivated by both [12] and [14], an approach that learns deep face representations by joint face identification-verification was proposed in DeepID2 [13] and further improved in DeepID2+ [15]. Adding verification supervisory signals significantly reduced intra-personal variations, leading to another significant improvement in face recognition performance. Human face verification accuracy on the entire face images of LFW was finally surpassed [13, 15]. Both GoogLeNet [16] and VGG [10] ranked at the top of general image classification in ILSVRC 2014. This motivates us to investigate whether the superb learning capacity brought by very deep network structures can also benefit face recognition.
Although trained with advanced supervisory signals, the network architectures of DeepID2 and DeepID2+ are much shallower than recently proposed high-performance deep neural networks for general object recognition, such as VGG and GoogLeNet. VGG net stacks multiple convolutional layers together to form complex features. GoogLeNet goes further by incorporating multi-scale convolutions and pooling into a single feature extraction layer called inception [16]. To learn efficiently, it also introduced 1×1 convolutions for feature dimension reduction.
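The saving from 1×1 dimension reduction can be illustrated by counting weights. The sketch below is not taken from the paper: the 5×5 kernel and the channel counts (192, 16, 32) are hypothetical values chosen only for illustration, and biases are ignored.

```python
def conv_weights(kernel: int, in_ch: int, out_ch: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


def direct_5x5(in_ch: int, out_ch: int) -> int:
    """A 5x5 convolution applied directly to all input channels."""
    return conv_weights(5, in_ch, out_ch)


def reduced_5x5(in_ch: int, mid_ch: int, out_ch: int) -> int:
    """A 1x1 reduction to mid_ch channels followed by the 5x5 convolution."""
    return conv_weights(1, in_ch, mid_ch) + conv_weights(5, mid_ch, out_ch)


# Hypothetical channel counts, chosen only for illustration.
IN_CH, MID_CH, OUT_CH = 192, 16, 32

if __name__ == "__main__":
    print("direct 5x5 weights:      ", direct_5x5(IN_CH, OUT_CH))
    print("1x1 then 5x5 weights:    ", reduced_5x5(IN_CH, MID_CH, OUT_CH))
```

Under these assumed sizes the reduced branch needs roughly a tenth of the weights of the direct convolution, which is the efficiency argument behind the 1×1 layers.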
In this paper, we propose two deep neural network architectures, referred to as DeepID3, which are significantly deeper than the previous state-of-the-art DeepID2+ architecture for face recognition. DeepID3 networks are rebuilt from the basic elements (i.e., stacked convolution or inception layers) of VGG net [10] and GoogLeNet [16]. During training, joint face identification-verification supervisory signals [13] are added to the final feature extraction layer as well as a few intermediate layers of each network. In addition, to learn a richer pool of facial features, weights in the higher layers of some DeepID3 networks are unshared. Trained on the same dataset as DeepID2+, DeepID3 improves the face verification accuracy from 99.47% to 99.53% and the rank-1 face identification accuracy from 95.0% to 96.0% on LFW. The "true" face verification accuracy after wrongly labeled face pairs are corrected, along with a few hard test samples, is discussed at the end.
2. DeepID3 net
For comparison, we briefly review the previously proposed DeepID2+ net architecture [15]. As illustrated in Fig. 1, the DeepID2+ net has three convolutional layers, each followed by max-pooling (neurons in the third convolutional layer share weights only within local regions), followed by one locally connected layer and one fully connected layer. Joint identification-verification supervisory signals [13] are added to the last fully connected layer (from which the final features are extracted for face recognition) as well as to a few fully connected layers branched out from intermediate pooling layers, to better supervise the early feature extraction processes.
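The text does not restate the form of the joint identification-verification signal from [13]. As a rough sketch only, one common formulation combines a softmax cross-entropy term (identification) with a contrastive term on the feature pair (verification). The function names, the margin, and the weighting factor below are assumptions made for illustration, not values from the paper.

```python
import math
from typing import Sequence


def identification_loss(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy of one face image over the training identities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def verification_loss(f_a: Sequence[float], f_b: Sequence[float],
                      same_person: bool, margin: float = 1.0) -> float:
    """Contrastive loss: pull same-identity features together,
    push different-identity features at least `margin` apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_a, f_b)))
    if same_person:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2


def joint_loss(logits_a, label_a, logits_b, label_b, f_a, f_b,
               weight: float = 0.05, margin: float = 1.0) -> float:
    """Identification loss of both images plus a weighted verification loss.
    `weight` and `margin` are hypothetical hyperparameters."""
    ident = (identification_loss(logits_a, label_a)
             + identification_loss(logits_b, label_b))
    verif = verification_loss(f_a, f_b, label_a == label_b, margin)
    return ident + weight * verif
```

In a network such as the one described above, a loss of this kind would be attached to every supervised layer, so that both the final features and the branched-out intermediate features receive identity and pair-level gradients.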
The proposed DeepID3 net inherits a few characteristics of the DeepID2+ net, including unshared neural weights in the last few feature extraction layers and the way supervisory signals are added to early layers. However, the DeepID3 net is significantly deeper, with ten to fifteen non-linear feature extraction layers, compared to five in DeepID2+. In particular, we propose two DeepID3 net architectures, referred to as DeepID3 net1 and DeepID3 net2, illustrated in Fig. 2 and Fig. 3, respectively. The depth of the DeepID3 net comes from stacking multiple convolution/inception layers before each pooling layer. Consecutive convolution/inception layers help to form features with larger receptive fields and more complex non-linearity while restricting the number of parameters [10].
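The trade-off between receptive field and parameter count can be checked with a short calculation. The sketch below assumes stride-1 convolutions with 3×3 kernels and the same number of channels in and out of every layer; the kernel size and the channel count of 64 are illustrative assumptions, not values stated in the text.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of num_layers stacked stride-1 convolutions
    that all use the same square kernel size."""
    return 1 + num_layers * (kernel - 1)


def stacked_weights(num_layers: int, kernel: int, channels: int) -> int:
    """Weights of num_layers stacked convolutions (biases ignored)."""
    return num_layers * kernel * kernel * channels * channels


def single_layer_weights(kernel: int, channels: int) -> int:
    """Weights of one convolution with the given kernel size."""
    return kernel * kernel * channels * channels


if __name__ == "__main__":
    channels = 64  # hypothetical number of feature maps
    rf = stacked_receptive_field(2, kernel=3)
    print("receptive field of two 3x3 layers:", rf)
    print("weights, two 3x3 layers:", stacked_weights(2, 3, channels))
    print("weights, one layer with the same field:", single_layer_weights(rf, channels))
```

Two stacked 3×3 layers cover the same 5×5 window as a single 5×5 layer with 18C² instead of 25C² weights, and they apply two non-linearities instead of one.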
The proposed DeepID3 net1 takes two consecutive convolutional layers before each pooling layer. Compared to the VGG net proposed in previous literature [10, 19], we add supervisory signals to a number of fully connected layers branched out from intermediate layers, which helps to learn better mid-level features and makes the optimization of a very deep neural network easier. The top two convolutional layers are replaced by locally connected layers. With unshared parameters, the top layers can form more expressive features with a reduced feature dimension. The last locally connected layer of DeepID3 net1 is used to extract the final features without an additional fully connected layer.
Figure 1: Architecture of DeepID2+ net [15]. Solid arrows show forward-propagation directions. Dashed arrows point to the layers on which joint face identification-verification supervisory signals are added. The final feature extraction layer in the red box is used for face recognition.
DeepID3 net2 starts, like DeepID3 net1, with pairs of consecutive convolutional layers each followed by one pooling layer, but uses inception layers [16] in the later feature extraction stages: there are three consecutive inception layers before the third pooling layer and two inception layers before the fourth pooling layer. Joint identification-verification supervisory signals are added to the fully connected layers following each pooling layer.
In both proposed network architectures, rectified linear non-linearity [9] is used in all layers except the pooling layers, and dropout learning [5] is added on the final feature extraction layer. Despite their significant depth, our DeepID3 networks are much smaller than the VGG net or GoogLeNet proposed for general object recognition, because the number of feature maps in each layer is restricted.
The proposed DeepID3 nets are trained on the same 25 face regions as the DeepID2+ nets [15], with each network taking a particular face region as input. These face regions were selected by feature selection in previous work [13] and differ in position, scale, and color channel, so that different networks can learn complementary information. After training, the networks are used to extract features from their respective face regions. An additional Joint Bayesian model [3] is then learned on these features for face verification or identification. All DeepID3 networks and Joint Bayesian models are learned on the same approximately 300 thousand training samples used for DeepID2+ [15], a combination of the CelebFaces+ [14] and WDRef [3] datasets, and are tested on LFW [6]. The identities in the two training datasets and in the LFW test set are mutually exclusive. Fig. 4 compares the LFW face verification performance of individual DeepID3 nets with that of the DeepID2+ net on each of the 25 face regions (with horizontal flipping). On average, DeepID3 net1 and DeepID3 net2 reduce the error rate by 0.81% and 0.26%, respectively, compared to the DeepID2+ net.
Figure 2: Architecture of DeepID3 net1. Figure description is the same as Fig. 1.
3. Experiments
To reduce redundancy, DeepID3 net1 and net2 each extract features from either the original or the horizontally flipped version of a face region, but not both. At test time, feature extraction takes 50 forward propagations, half from DeepID3 net1 and the other half from net2. The resulting features are concatenated into a long feature vector of approximately 30,000 dimensions. PCA reduces it to 300 dimensions, on which a Joint Bayesian model is learned for face recognition.
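The bookkeeping behind the 50 forward propagations can be sketched as follows. The assignment rule (net1 on the original region, net2 on the flipped one) and the per-pass feature size of 600 are assumptions: the text only states that each net uses one of the two versions of a region and that the concatenated vector has approximately 30,000 dimensions, which works out to about 600 per pass.

```python
from itertools import product
from typing import List, Tuple

NUM_REGIONS = 25       # face regions, as stated in the text
FEATURE_DIM = 600      # assumption: roughly 30,000 / 50 dimensions per pass
PCA_DIM = 300          # target dimension after PCA, as stated in the text


def plan_forward_passes(num_regions: int = NUM_REGIONS) -> List[Tuple[int, bool, str]]:
    """Assign each (region, flipped) input to exactly one network.

    Hypothetical rule: net1 sees the original region, net2 the flipped one,
    so neither network processes both versions of the same region.
    """
    passes = []
    for region, flipped in product(range(num_regions), (False, True)):
        net = "net2" if flipped else "net1"
        passes.append((region, flipped, net))
    return passes


if __name__ == "__main__":
    passes = plan_forward_passes()
    per_net = {name: sum(1 for p in passes if p[2] == name) for name in ("net1", "net2")}
    print("forward passes:", len(passes), per_net)
    print("concatenated dimension:", len(passes) * FEATURE_DIM)
    print("dimension after PCA:", PCA_DIM)
```

The concatenated vector would then be projected to 300 dimensions with PCA before the Joint Bayesian model is fitted; those two steps are omitted here.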
Figure 3: Architecture of DeepID3 net2. Figure description is the same as Fig. 1.
Figure 4: LFW face verification accuracy of individual DeepID2+ and DeepID3 nets trained on the same face regions as in [15].
Table 1: Face verification on LFW.
We evaluate DeepID3 networks under the LFW face verification [6] and LFW face identification [1, 18] protocols. For face verification, 6000 given face pairs are verified to determine whether each pair shows the same person. We achieve a mean accuracy of 99.53% under this protocol. Comparisons with previous work on mean accuracy and ROC curves are shown in Tab. 1 and Fig. 5, respectively.
For face identification, we use one closed-set and one open-set identification protocol. For closed-set identification, the gallery set contains 4249 subjects with a single face image per subject, and the probe set contains 3143 face images from the same set of subjects as the gallery. For open-set identification, the gallery set contains 596 subjects with a single face image per subject, and the probe set contains 596 genuine probes and 9494 impostor probes. Table 2 compares the Rank-1 identification accuracy for closed-set identification and the Rank-1 Detection and Identification Rate (DIR) at a 1% False Alarm Rate (FAR) for open-set identification. We achieve 96.0% closed-set and 81.4% open-set face identification accuracy, respectively.
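To make the open-set metric concrete, the sketch below computes a Rank-1 DIR at a given FAR from similarity scores. It is an illustrative implementation under stated assumptions, not the evaluation code of the paper: each probe is summarized by its best gallery similarity score, higher scores mean more similar, and the function names and toy data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def threshold_at_far(impostor_scores: Sequence[float], far: float) -> float:
    """Threshold such that at most a fraction `far` of impostor probes
    have a best gallery score strictly above it."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = math.floor(far * len(ranked) + 1e-9)  # false alarms tolerated
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def rank1_dir(genuine: Sequence[Tuple[float, bool]],
              impostor_scores: Sequence[float], far: float = 0.01) -> float:
    """Rank-1 Detection and Identification Rate at the given FAR.

    `genuine` holds, for each genuine probe, its best gallery score and
    whether that best match is the correct subject.
    """
    threshold = threshold_at_far(impostor_scores, far)
    hits = sum(1 for score, correct in genuine if correct and score > threshold)
    return hits / len(genuine)


if __name__ == "__main__":
    # Toy data: 100 impostor probes and 4 genuine probes.
    impostors = [i / 100 for i in range(100)]
    genuine = [(0.995, True), (0.985, False), (0.5, True), (0.99, True)]
    print("threshold:", threshold_at_far(impostors, 0.01))
    print("rank-1 DIR at 1% FAR:", rank1_dir(genuine, impostors, 0.01))
```

A genuine probe counts as a hit only if it is both accepted (its score exceeds the threshold) and matched to the right subject, which is why the open-set rate is lower than the closed-set Rank-1 accuracy.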
4. Discussion
Three test face pairs are labeled as the same person but are actually different people, as announced on the LFW website. Of these three pairs, our DeepID3 algorithm classifies two as the same person and the other one as different people. Therefore, when the labels of these three face pairs are corrected, the actual face verification accuracy of DeepID3 is 99.52%. For DeepID2+ [15], the face verification accuracy before correcting the three wrong labels is 99.47%. However, DeepID2+ classified all three wrongly labeled positive face pairs as different people. When these three wrong labels are corrected, the true face verification accuracy of DeepID2+ is also 99.52% [15]. Although DeepID3 adopts very deep architectures similar to VGG and GoogLeNet, it does not improve over DeepID2+, which has a significantly shallower architecture, on the LFW face verification task. Whether such very deep architectures would take advantage of more training face data and finally surpass shallower architectures like DeepID2+ remains an open question.
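The effect of the label correction can be verified with simple counting. In the sketch below, the initial error counts (28 for DeepID3 and 32 for DeepID2+) are assumptions rather than figures from the text; they are the counts that round to accuracies of 99.53% and 99.47% on 6000 pairs.

```python
TOTAL_PAIRS = 6000  # LFW verification pairs, as stated in the text


def accuracy(errors: int, total: int = TOTAL_PAIRS) -> float:
    """Verification accuracy in percent for a given number of errors."""
    return 100.0 * (total - errors) / total


def errors_after_correction(errors_before: int, predicted_same: int,
                            predicted_different: int) -> int:
    """Update the error count after relabeling pairs from 'same' to 'different'.

    Pairs predicted 'same' were counted as correct and become errors;
    pairs predicted 'different' were counted as errors and become correct.
    """
    return errors_before + predicted_same - predicted_different


if __name__ == "__main__":
    # Assumed initial error counts (see the lead-in).
    deepid3 = errors_after_correction(28, predicted_same=2, predicted_different=1)
    deepid2plus = errors_after_correction(32, predicted_same=0, predicted_different=3)
    print("DeepID3 errors after correction: ", deepid3, round(accuracy(deepid3), 2))
    print("DeepID2+ errors after correction:", deepid2plus, round(accuracy(deepid2plus), 2))
```

Under these assumptions both systems end up with the same number of errors once the three labels are fixed, which matches the observation that the corrected accuracies coincide.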
Figure 5: ROC of face verification on LFW.
Table 2: Closed- and open-set identification tasks on LFW.
We examine the test face pairs in LFW that are wrongly classified by all the DeepID series algorithms, including DeepID [14], DeepID2 [13, 11], DeepID2+ [15], and DeepID3. There are nine common false positives and three common false negatives in total, around half of all face pairs wrongly classified by DeepID3. The three face pairs labeled as the same person but classified as different people are shown in Fig. 6. The first pair of faces shows a large difference in age. The second pair actually shows different people, due to a labeling error. The third one