Deep Learning Face Representation from Predicting 10,000 Classes


Original paper: Deep Learning Face Representation from Predicting 10,000 Classes

Deep Learning Face Representation from Predicting 10,000 Classes

Yi Sun^1

^1 Department of Information Engineering, The Chinese University of Hong Kong

^2 Department of Electronic Engineering, The Chinese University of Hong Kong

^3 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

sy011@ie.cuhk.edu.hk

Abstract

This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, these deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifier can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.

1. Introduction

Face verification in unconstrained conditions has been studied extensively in recent years [21, 15, 7, 34, 17, 26, 18, 8, 2, 9, 3, 29, 6] due to its practical applications and the publication of LFW [19], an extensively reported dataset for face verification algorithms. The current best-performing face verification algorithms typically represent faces with over-complete low-level features, followed by shallow models [9, 29, 6]. Recently, deep models such as ConvNets [24] have proved effective for extracting high-level visual features [11, 20, 14] and are used for face verification [18, 5, 31, 32, 36]. Huang et al. [18] learned a generative deep model without supervision. Cai et al. [5] learned deep nonlinear metrics. In [31], the deep models are supervised by the binary face verification target. Differently, in this paper we propose to learn high-level face identity features with deep models through face identification, i.e. classifying a training image into one of $n$ identities ($n \approx 10{,}000$ in this work). This high-dimensional prediction task is much more challenging than face verification; however, it leads to good generalization of the learned feature representations. Although learned through identification, these features are shown to be effective for face verification and new faces unseen in the training set.

Figure 1. An illustration of the feature extraction process. Arrows indicate forward propagation directions. The number of neurons in each layer of the multiple deep ConvNets are labeled beside each layer. The DeepID features are taken from the last hidden layer of each ConvNet, and predict a large number of identity classes. Feature numbers continue to reduce along the feature extraction cascade till the DeepID layer.

We propose an effective way to learn high-level over-complete features with deep ConvNets. A high-level illustration of our feature extraction process is shown in Figure 1. The ConvNets are learned to classify all the faces available for training by their identities, with the last hidden layer neuron activations as features (referred to as Deep hidden IDentity features or DeepID). Each ConvNet takes a face patch as input and extracts local low-level features in the bottom layers. Feature numbers continue to reduce along the feature extraction cascade, while gradually more global and high-level features are formed in the top layers. A highly compact 160-dimensional DeepID is acquired at the end of the cascade that contains rich identity information and directly predicts a much larger number (e.g., 10,000) of identity classes. Classifying all the identities simultaneously, instead of training binary classifiers as in [21, 2, 3], is based on two considerations. First, it is much more difficult to predict a training sample into one of many classes than to perform binary classification. This challenging task can make full use of the super learning capacity of neural networks to extract effective features for face recognition. Second, it implicitly adds a strong regularization to ConvNets, which helps to form shared hidden representations that can classify all the identities well. Therefore, the learned high-level features have good generalization ability and do not over-fit to a small subset of training faces. We constrain the DeepID dimension to be significantly smaller than the number of identity classes they predict, which is key to learning highly compact and discriminative features. We further concatenate the DeepID extracted from various face regions to form complementary and over-complete representations. The learned features generalize well to new identities in test, which are not seen in training, and can be readily integrated with any state-of-the-art face classifier (e.g., Joint Bayesian [8]) for face verification.

Our method achieves 97.45% face verification accuracy on LFW using only weakly aligned faces, which is almost as good as the human performance of 97.53%. We also observe that as the number of training identities increases, the verification performance steadily improves. Although the prediction task at the training stage becomes more challenging, the discrimination and generalization ability of the learned features increases. It leaves the door wide open for future improvement of accuracy with more training data.

2. Related work

Many face verification methods represent faces by high-dimensional over-complete face descriptors, followed by shallow models. Cao et al. [7] encoded each face image into 26K learning-based (LE) descriptors, and then calculated the $L_2$ distance between the LE descriptors after PCA. Chen et al. [9] extracted 100K LBP descriptors at dense facial landmarks with multiple scales and used Joint Bayesian [8] for verification after PCA. Simonyan et al. [29] computed 1.7M SIFT descriptors densely in scale and space, encoded the dense SIFT features into Fisher vectors, and learned linear projections for discriminative dimensionality reduction. Huang et al. [17] combined 1.2M CMD [33] and SLBP [1] descriptors, and learned sparse Mahalanobis metrics for face verification.

Some previous studies have further learned identity-related features based on low-level features. Kumar et al. [21] trained attribute and simile classifiers to detect facial attributes and measure face similarities to a set of reference people. Berg and Belhumeur [2, 3] trained classifiers to distinguish the faces of two different people. Features are the outputs of the learned classifiers. They used SVM classifiers, which are shallow structures, and their learned features are still relatively low-level. In contrast, we classify all the identities from the training set simultaneously. Moreover, we use the last hidden layer activations as features instead of the classifier outputs. In our ConvNets, the neuron number of the last hidden layer is much smaller than that of the output, which forces the last hidden layer to learn shared hidden representations for faces of different people in order to classify all of them well, resulting in highly discriminative and compact features with good generalization ability.

A few deep models have been used for face verification or identification. Chopra et al. [10] used a Siamese network [4] for deep metric learning. The Siamese network extracts features separately from two compared inputs with two identical sub-networks, taking the distance between the outputs of the two sub-networks as dissimilarity. [10] used deep ConvNets as the sub-networks. In contrast to the Siamese network in which feature extraction and recognition are jointly learned with the face verification target, we conduct feature extraction and recognition in two steps, with the first feature extraction step learned with the target of face identification, which is a much stronger supervision signal than verification. Huang et al. [18] generatively learned features with CDBNs [25], then used ITML [13] and linear SVM for face verification. Cai et al. [5] also learned deep metrics under the Siamese network framework as [10], but used a two-level ISA network [23] as the sub-networks instead. Zhu et al. [35, 36] learned deep neural networks to transform faces in arbitrary poses and illumination to frontal faces with normal illumination, and then used the last hidden layer features or the transformed faces for face recognition. Sun et al. [31] used multiple deep ConvNets to learn high-level face similarity features and trained classification RBM [22] for face verification. Their features are jointly extracted from a pair of faces instead of from a single face.

3. Learning DeepID for face verification

3.1. Deep ConvNets

Our deep ConvNets contain four convolutional layers (with max-pooling) to extract features hierarchically, followed by the fully-connected DeepID layer and the softmax output layer indicating identity classes. The input is $39 \times 31 \times k$ for rectangle patches and $31 \times 31 \times k$ for square patches, where $k = 3$ for color patches and $k = 1$ for gray patches. Figure 2 shows the detailed structure of the ConvNet which takes $39 \times 31 \times 1$ input and predicts $n$ (e.g., $n = 10{,}000$) identity classes. When the input sizes change, the height and width of maps in the following layers will change accordingly. The dimension of the DeepID layer is fixed to 160, while the dimension of the output layer varies according to the number of classes it predicts. Feature numbers continue to reduce along the feature extraction hierarchy until the last hidden layer (the DeepID layer), where highly compact and predictive features are formed, which predict a much larger number of identity classes with only a few features.

Figure 2. ConvNet structure. The length, width, and height of each cuboid denote the map number and the dimension of each map for all input, convolutional, and max-pooling layers. The inside small cuboids and squares denote the 3D convolution kernel sizes and the 2D pooling region sizes of the convolutional and max-pooling layers, respectively. Neuron numbers of the last two fully-connected layers are marked beside each layer.

The convolution operation is expressed as

$$y^{j(r)} = \max\left(0,\; b^{j(r)} + \sum_i k^{ij(r)} * x^{i(r)}\right), \tag{1}$$

where $x^i$ and $y^j$ are the $i$-th input map and the $j$-th output map, respectively, $k^{ij}$ is the convolution kernel between the $i$-th input map and the $j$-th output map, $*$ denotes convolution, and $b^j$ is the bias of the $j$-th output map. We use ReLU nonlinearity ($y = \max(0, x)$) for hidden neurons, which is shown to have better fitting abilities than the sigmoid function [20]. Weights in higher convolutional layers of our ConvNets are locally shared to learn different mid- or high-level features in different regions [18]. $r$ in Equation 1 indicates a local region where weights are shared. In the third convolutional layer, weights are locally shared in every $2 \times 2$ region, while weights in the fourth convolutional layer are totally unshared. Max-pooling is formulated as

$$y^i_{j,k} = \max_{0 \le m,\, n < s} \left\{ x^i_{j \cdot s + m,\; k \cdot s + n} \right\}, \tag{2}$$

where each neuron in the $i$-th output map $y^i$ pools over an $s \times s$ non-overlapping local region in the $i$-th input map $x^i$.
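For concreteness, the per-map convolution (Equation 1) and max-pooling operations can be sketched in plain Python as follows. This is a simplified illustration with a single input and output map, dropping the sum over input maps and the locally shared kernels; the function names are ours, not from the paper.

```python
def conv2d_relu(x, k, b):
    """Valid cross-correlation of one input map x with one kernel k,
    plus bias b, followed by ReLU (single-map form of Eq. 1)."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            s = b + sum(k[m][n] * x[i + m][j + n]
                        for m in range(kh) for n in range(kw))
            row.append(max(0.0, s))  # ReLU nonlinearity
        out.append(row)
    return out

def max_pool(x, s):
    """Non-overlapping s x s max-pooling over one map (Eq. 2)."""
    H, W = len(x), len(x[0])
    return [[max(x[j * s + m][k * s + n] for m in range(s) for n in range(s))
             for k in range(W // s)] for j in range(H // s)]
```

Stacking such conv/pool stages four times, with shrinking map sizes, gives the feature extraction cascade described above.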

Figure 3. Top: ten face regions of medium scales. The five regions in the top left are global regions taken from the weakly aligned faces; the other five in the top right are local regions centered around the five facial landmarks (two eye centers, nose tip, and two mouth corners). Bottom: three scales of two particular patches.

The last hidden layer of DeepID is fully connected to both the third and fourth convolutional layers (after max-pooling) such that it sees multi-scale features [28] (features in the fourth convolutional layer are more global than those in the third one). This is critical to feature learning because, after successive down-sampling along the cascade, the fourth convolutional layer contains too few neurons and becomes the bottleneck for information propagation. Adding the bypassing connections between the third convolutional layer (referred to as the skipping layer) and the last hidden layer reduces the possible information loss in the fourth convolutional layer. The last hidden layer takes the function

$$y_j = \max\left(0,\; \sum_i x^1_i \cdot w^1_{i,j} + \sum_i x^2_i \cdot w^2_{i,j} + b_j\right), \tag{3}$$

where $x^1, w^1, x^2, w^2$ denote the neurons and weights in the third and fourth convolutional layers, respectively. It linearly combines features in the previous two convolutional layers, followed by ReLU non-linearity.
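The skip-layer combination performed by the last hidden layer can be sketched as below; treating the third- and fourth-layer feature maps as already-flattened vectors is our simplification, and the function name is hypothetical.

```python
def deepid_layer(x1, x2, w1, w2, b):
    """DeepID activations: ReLU over a linear combination of the
    flattened third-layer (x1) and fourth-layer (x2) features."""
    out = []
    for j in range(len(b)):
        s = b[j]
        s += sum(x1[i] * w1[i][j] for i in range(len(x1)))  # skipping layer
        s += sum(x2[i] * w2[i][j] for i in range(len(x2)))  # fourth conv layer
        out.append(max(0.0, s))  # ReLU
    return out
```

In the actual network `len(b)` would be 160, the fixed DeepID dimension.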

The ConvNet output is an $n$-way softmax predicting the probability distribution over $n$ different identities,

$$y_j = \frac{\exp(y'_j)}{\sum_{k=1}^{n} \exp(y'_k)}, \tag{4}$$

where $y'_j = \sum_{i=1}^{160} x_i \cdot w_{i,j} + b_j$ linearly combines the 160 DeepID features $x_i$ as the input of neuron $j$, and $y_j$ is its output. The ConvNet is learned by minimizing $-\log y_t$, where $t$ is the target class. Stochastic gradient descent is used, with gradients calculated by back-propagation.
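A minimal sketch of the softmax output and the $-\log y_t$ training loss; the function names are ours, and the max-subtraction is a standard numerical-stability trick rather than something the paper specifies.

```python
import math

def softmax(y_lin):
    """n-way softmax over the linear outputs y'_j."""
    m = max(y_lin)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in y_lin]
    z = sum(e)
    return [v / z for v in e]

def nll(y_lin, t):
    """Training loss -log y_t for target class t."""
    return -math.log(softmax(y_lin)[t])
```

Raising the target class's linear output lowers the loss, which is what gradient descent on $-\log y_t$ drives the network to do.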

3.2. Feature extraction

We detect five facial landmarks, including the two eye centers, the nose tip, and the two mouth corners, with the facial point detection method proposed by Sun et al. [30]. Faces are globally aligned by similarity transformation according to the two eye centers and the mid-point of the two mouth corners. Features are extracted from 60 face patches with ten regions, three scales, and RGB or gray channels. Figure 3 shows the ten face regions and the three scales of two particular face regions. We trained 60 ConvNets, each of which extracts two 160-dimensional DeepID vectors from a particular patch and its horizontally flipped counterpart. A special case is the patches around the two eye centers and the two mouth corners, which are not flipped themselves; instead, the flipped counterpart of each is derived from its symmetric patch (for example, the flipped counterpart of the patch centered on the left eye is derived by flipping the patch centered on the right eye). The total length of DeepID is 19,200 ($160 \times 2 \times 60$), which is ready for the final face verification.
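The bookkeeping of the final representation (60 patches, each contributing a 160-dimensional vector for the patch and another for its flipped counterpart) can be sketched as follows; `concat_deepid` is a hypothetical helper name, not from the paper.

```python
def concat_deepid(patch_features):
    """Concatenate per-patch DeepID vectors into one long feature vector.
    patch_features: list of (original, flipped) pairs of 160-dim vectors,
    one pair per trained ConvNet/patch."""
    vec = []
    for original, flipped in patch_features:
        vec.extend(original)
        vec.extend(flipped)
    return vec
```

With 60 patch pairs this yields the 19,200-dimensional over-complete representation fed to the verification stage.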

3.3. Face verification

We use the Joint Bayesian [8] technique for face verification based on the DeepID. Joint Bayesian has been highly successful for face verification [9, 6]. It represents the extracted facial features $x$ (after subtracting the mean) as the sum of two independent Gaussian variables,

$$x = \mu + \epsilon, \tag{5}$$

where $\mu \sim N(0, S_\mu)$ represents the face identity and $\epsilon \sim N(0, S_\epsilon)$ the intra-personal variations. Joint Bayesian models the joint probability of two faces given the intra- or extra-personal variation hypothesis, $P(x_1, x_2 \mid H_I)$ and $P(x_1, x_2 \mid H_E)$. It is readily shown from Equation 5 that these two probabilities are also Gaussian, with covariances

$$\Sigma_I = \begin{bmatrix} S_\mu + S_\epsilon & S_\mu \\ S_\mu & S_\mu + S_\epsilon \end{bmatrix} \tag{6}$$

and

$$\Sigma_E = \begin{bmatrix} S_\mu + S_\epsilon & 0 \\ 0 & S_\mu + S_\epsilon \end{bmatrix}, \tag{7}$$

respectively. $S_\mu$ and $S_\epsilon$ can be learned from data with the EM algorithm. In test, it calculates the likelihood ratio

$$r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)}, \tag{8}$$

which has closed-form solutions and is efficient.

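To make the closed form concrete, below is a toy sketch of the log-likelihood ratio log P(x1, x2 | H_I) − log P(x1, x2 | H_E) for one-dimensional features, so that S_mu and S_eps are scalar variances and the 2 × 2 covariances can be inverted by hand. Real DeepID features are high-dimensional with matrix-valued S_mu and S_eps learned by EM; this sketch only illustrates the structure of the two hypotheses, and the function names are ours.

```python
import math

def log_gauss2(x1, x2, a, b):
    """log N([x1, x2]; 0, [[a, b], [b, a]]) for a symmetric 2x2 covariance."""
    det = a * a - b * b
    quad = (a * x1 * x1 - 2.0 * b * x1 * x2 + a * x2 * x2) / det
    return -0.5 * (quad + math.log(det)) - math.log(2.0 * math.pi)

def likelihood_ratio(x1, x2, s_mu, s_eps):
    """r(x1, x2): intra-personal (H_I) vs. extra-personal (H_E) hypothesis."""
    a = s_mu + s_eps                         # common diagonal term
    log_p_i = log_gauss2(x1, x2, a, s_mu)    # H_I: the two faces share mu
    log_p_e = log_gauss2(x1, x2, a, 0.0)     # H_E: identities are independent
    return log_p_i - log_p_e
```

Similar features yield a positive ratio (same person) and dissimilar features a negative one, so thresholding r decides verification.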
We also train a neural network for verification and compare it to Joint Bayesian, to see whether other models can also learn from the extracted features and how much the features and a good face verification model each contribute to the performance. The neural network contains one input layer taking the DeepID, one locally-connected layer, one fully-connected layer, and a single output neuron indicating face similarity. The input features are divided into 60 groups, each of which contains 640 features extracted from a particular patch pair with a particular ConvNet. Features in the same group are highly correlated. Neurons in the locally-connected layer only connect to a single group of features to learn their local relations and reduce the feature dimension at the same time. The second hidden layer is fully-connected to the first hidden layer to learn global relations. The single output neuron is fully connected to the second hidden layer. The hidden neurons are ReLUs and the output neuron is sigmoid. An illustration of the neural network structure is shown in Figure 4. It has 38,400 input neurons, taking the 19,200 DeepID features of each of the two faces in comparison, and 4,800 neurons in the following two hidden layers, with every 80 neurons in the first hidden layer locally connected to one of the 60 groups of input neurons.
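The locally-connected first hidden layer described above can be sketched as follows; in the text each of the 60 groups has 640 input features and 80 hidden neurons, while the test below uses toy sizes. The function name is our own.

```python
def locally_connected(groups, weights, biases):
    """First hidden layer of the verification net: each group of ReLU
    neurons is connected only to its own group of input features,
    with no weights crossing group boundaries.
    groups: list of per-group feature vectors;
    weights[g][i][j], biases[g][j]: parameters of group g."""
    hidden = []
    for g, feats in enumerate(groups):
        for j in range(len(biases[g])):
            s = biases[g][j] + sum(feats[i] * weights[g][i][j]
                                   for i in range(len(feats)))
            hidden.append(max(0.0, s))  # ReLU
    return hidden
```

Restricting each neuron to one group keeps the parameter count manageable while still letting the next, fully-connected layer model cross-group relations.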

Figure 4. The structure of the neural network used for face verification. The layer type and dimension are labeled beside each layer. The solid neurons form a subnetwork.

Dropout learning [16] is used for all the hidden neurons. The input neurons cannot be dropped because the learned features are compact and distributed representations (representing a large number of identities with very few neurons) and have to collaborate with each other to represent the identities well. On the other hand, learning high-dimensional features without dropout is difficult due to gradient diffusion. To solve this problem, we first train 60 subnetworks, each with features of a single group as input. A particular subnetwork is illustrated in Figure 4. We then use the first-layer weights of the subnetworks to initialize those of the original network, and tune the second and third layers of the original network with the first layer weights clipped.

4. Experiments

We evaluate our algorithm on LFW, which reveals the state-of-the-art of face verification in the wild. Though LFW contains 5749 people, only 85 have more than 15 images, and 4069 people have only one image. It is inadequate to train identity classifiers with so few images per person. Instead, we trained our model on CelebFaces
