FaceNet: A Unified Embedding for Face Recognition and Clustering【翻译】


原文链接:1503.03832

FaceNet: A Unified Embedding for Face Recognition and Clustering

FaceNet: 面部识别与聚类的统一嵌入

Florian Schroff

fschroff@google.com

Google Inc.

Dmitry Kalenichenko

dkalenichenko@google.com

Google Inc.

James Philbin

jphilbin@google.com

Google Inc.

Abstract

摘要

Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

尽管面部识别领域在最近取得了显著进展 [10, 14, 15, 17],但在大规模实现面部验证和识别时,当前的方法仍面临严峻的挑战。本文介绍了一种名为 FaceNet 的系统,该系统直接学习从面部图像到紧凑欧几里得空间的映射,在该空间中,距离直接对应于面部相似度的度量。一旦生成了该空间,面部识别、验证和聚类等任务就可以通过使用 FaceNet 嵌入作为特征向量的标准技术轻松实现。

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face.

我们的方法使用一个深度卷积网络,直接优化嵌入本身,而不是像以往深度学习方法中那样使用一个中间瓶颈层。为了训练,我们使用通过一种新颖的在线三元组挖掘方法生成的大致对齐的匹配/不匹配面部补丁三元组。我们方法的优点在于更高的表示效率:我们仅使用每个面部128字节就能实现最先进的面部识别性能。

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.

在广泛使用的 Labeled Faces in the Wild (LFW) 数据集上,我们的系统实现了 99.63% 的新纪录准确率。在 YouTube Faces DB 数据集上,准确率为 95.12%。与最佳公开结果 [15] 相比,我们的系统在两个数据集上的错误率降低了 30%。

We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible with each other and allow for direct comparison between each other.

我们还引入了谐波嵌入 (harmonic embeddings) 和谐波三元组损失 (harmonic triplet loss) 的概念,它们描述了由不同网络生成的、彼此兼容且可以直接相互比较的不同版本的面部嵌入。

1. Introduction

1. 引言

In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.

在本文中,我们提出了一个统一的面部验证(这是否是同一个人)、识别(这个人是谁)和聚类(在这些面孔中找到共同的人)的系统。我们的方法基于使用深度卷积网络为每张图像学习欧几里得嵌入。网络的训练方式使得嵌入空间中的平方 L2 距离直接对应于面部相似性:同一人的面孔之间的距离较小,而不同人的面孔之间的距离较大。

Figure 1. Illumination and Pose invariance. Pose and illumination have been a long standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.

图 1. 光照和姿态不变性。姿态和光照一直是面部识别中的长期难题。该图显示了在不同姿态和光照组合下,FaceNet 对同一人及不同人的面孔对输出的距离。距离为 0.0 表示两张面孔完全相同,4.0 则对应另一个极端,即两个不同的身份。可以看到,取 1.1 作为阈值可以正确分类所有面孔对。

Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.

一旦生成了这个嵌入,前面提到的任务就变得简单明了:面部验证仅涉及对两个嵌入之间的距离进行阈值处理;识别变成一个 k-NN 分类问题;聚类可以使用现成的技术,如 k-means 或凝聚聚类 (agglomerative clustering) 来实现。
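As a concrete illustration of these downstream tasks, the sketch below verifies a face pair by thresholding the squared L2 distance between unit-norm embeddings. The embeddings, the noise level, and the 1.1 threshold (the value quoted for Figure 1) are illustrative assumptions, not outputs of a real FaceNet model.

作为这些下游任务的一个具体示例,下面的草图通过对单位范数嵌入之间的平方 L2 距离进行阈值处理来验证一对面孔。其中的嵌入、噪声水平以及 1.1 的阈值(引自图 1)均为示意性假设,并非真实 FaceNet 模型的输出。

```python
import numpy as np

# Illustrative stand-ins for FaceNet outputs: random 128-D unit vectors.
rng = np.random.default_rng(0)

def l2_normalize(v):
    """Project a vector onto the unit hypersphere, as FaceNet does."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

emb_a = l2_normalize(rng.normal(size=128))                 # anchor face
emb_b = l2_normalize(emb_a + 0.05 * rng.normal(size=128))  # same person, small perturbation
emb_c = l2_normalize(rng.normal(size=128))                 # different person

def sq_dist(x, y):
    return float(np.sum((x - y) ** 2))

# Verification = thresholding the squared distance (the threshold is data
# dependent; 1.1 is the value the text quotes for Figure 1).
THRESHOLD = 1.1
same_person = sq_dist(emb_a, emb_b) < THRESHOLD
diff_person = sq_dist(emb_a, emb_c) < THRESHOLD
print(same_person, diff_person)  # True False
```

Recognition would then be a k-NN lookup over labelled embeddings, and clustering an off-the-shelf k-means or agglomerative run over the same vectors. 识别则可在带标签的嵌入上做 k-NN 查找;聚类可直接对这些向量运行现成的 k-means 或凝聚聚类。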

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.

以前基于深度网络的人脸识别方法使用一个分类层 [15, 17],该层在一组已知的人脸身份上进行训练,然后将一个中间瓶颈层作为表示,用于将识别扩展到训练中使用的身份集合之外。这种方法的缺点是间接性和低效性:必须寄希望于瓶颈表示能很好地推广到新的人脸;而且,由于使用瓶颈层,每个人脸的表示通常非常大(数千维)。一些近期的研究 [15] 已经通过主成分分析(PCA)降低了这一维度,但这只是一个线性变换,可以在网络的一个层中轻松学习。

In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area, no 2D or 3D alignment, other than scale and translation is performed.

与这些方法不同,FaceNet 使用基于 LMNN [19] 的三元组损失函数,直接将其输出训练为一个紧凑的 128 维嵌入。我们的三元组由两张匹配的人脸缩略图和一张不匹配的人脸缩略图组成,损失函数旨在通过一个距离间隔将正样本对与负样本分开。缩略图是脸部区域的紧密裁剪,除缩放和平移外,不进行任何 2D 或 3D 对齐。

Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.

选择使用哪些三元组对于实现良好的性能非常重要,受到课程学习 [1] 的启发,我们提出了一种新颖的在线负示例挖掘策略,确保随着网络训练的进行,三元组的难度持续增加。为了提高聚类准确性,我们还探索了硬正例挖掘技术,这些技术鼓励为单个人的嵌入形成球形聚类。

As an illustration of the incredible variability that our method can handle see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.

作为我们方法能处理的惊人变异性的一个例子,请参见图1。展示的是来自 PIE [13] 的图像对,之前这些图像被认为是对人脸验证系统来说非常困难的。

An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in sections 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.

本文其余部分的概述如下:在第2节中,我们回顾了该领域的文献;第3.1节定义了三元组损失,第3.2节描述了我们新颖的三元组选择和训练过程;在第3.3节中,我们描述了所使用的模型架构。最后,在第4和第5节中,我们展示了一些嵌入的定量结果,并定性探讨了一些聚类结果。

2. Related Work

2. 相关工作

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

类似于其他最近使用深度网络的工作 [15, 17],我们的方法是一种纯数据驱动的方法,它直接从面部的像素中学习其表示。我们不使用人工设计的特征,而是使用一个大型的标注人脸数据集,以获得对姿态、光照和其他变化条件的适当不变性。

In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.

在本文中,我们探讨了最近在计算机视觉领域取得巨大成功的两种不同的深度网络架构。两者都是深度卷积网络 [8, 11]。第一种架构基于 Zeiler&Fergus [22] 模型,该模型由多个交错的卷积层、非线性激活、局部响应归一化和最大池化层组成。我们还额外添加了受 [9] 工作启发的几个 1×1×d 卷积层。第二种架构基于 Szegedy 等人的 Inception 模型,该模型最近被用作 2014 年 ImageNet 竞赛的获胜方法 [16]。这些网络使用混合层,并行运行几种不同的卷积和池化层,并将它们的响应连接起来。我们发现这些模型可以将参数数量减少多达 20 倍,并有潜力减少达到可比性能所需的 FLOPS 数量。

There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper so we will only briefly discuss the most relevant recent work.

有大量的面部验证和识别研究。对其进行回顾超出了本文的范围,因此我们将仅简要讨论最相关的最近工作。

The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.

[15, 17, 23] 的工作都采用了一个复杂的多阶段系统,该系统将深度卷积网络的输出与用于降维的 PCA 和用于分类的 SVM 相结合。

Zhenyao et al. [23] employ a deep network to "warp" faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Zhenyao 等人 [23] 使用深度网络将面部“扭曲”到标准的正面视图,然后训练一个 CNN 来将每张面孔分类为某个已知身份。对于人脸验证,他们在网络输出上使用 PCA,并结合一个 SVM 集成。

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.

Taigman 等人 [17] 提出了一个多阶段方法,将面部对齐到一个通用的 3D 形状模型。他们训练了一个多类网络,在超过四千个身份上执行人脸识别任务。作者还尝试了所谓的孪生网络 (Siamese network),直接优化两个人脸特征之间的 L1 距离。他们在 LFW 上的最佳表现 (97.35%) 来自于使用不同对齐方式和颜色通道的三个网络的集成。这些网络的预测距离(基于 χ² 核的非线性 SVM 预测)通过一个非线性 SVM 进行组合。

Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.

Sun 等人 [14, 15] 提出了一种紧凑、因而计算开销相对较低的网络。他们使用了由 25 个这样的网络组成的集成,每个网络作用于不同的面部区域。为了在 LFW 上获得最终性能(99.47% [15]),作者结合了 50 个响应(常规和翻转)。他们同时采用了 PCA 和一个联合贝叶斯模型 [2],二者实际上对应于嵌入空间中的线性变换。他们的方法不需要显式的 2D/3D 对齐。网络通过结合分类损失和验证损失进行训练。验证损失类似于我们使用的三元组损失 [12, 19],它最小化同一身份面孔之间的 L2 距离,并在不同身份面孔的距离之间施加一个间隔。主要区别在于,该方法仅比较成对的图像,而三元组损失则施加一种相对距离约束。

A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.

Wang 等人 [18] 探索了一种与此处使用的类似损失,用于根据语义和视觉相似性对图像进行排序。

Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.

图 2. 模型结构。我们的网络由一个批量输入层和一个深度 CNN 组成,随后是 L2 归一化,由此得到面部嵌入。训练时在其后接三元组损失。

Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

图 3. 三元组损失最小化锚点与正样本之间的距离,二者具有相同身份,同时最大化锚点与不同身份的负样本之间的距离。

3. Method

3. 方法

FaceNet uses a deep convolutional network. We discuss two different core architectures: The Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.

FaceNet 使用深度卷积网络。我们讨论了两种不同的核心架构:Zeiler&Fergus [22] 风格的网络和最近的 Inception [16] 类型网络。这些网络的详细信息在第 3.3 节中描述。

Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding f(x), from an image x into a feature space ℝ^d, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.

给定模型细节,并将其视为一个黑箱(见图 2),我们方法中最重要的部分在于整个系统的端到端学习。为此,我们采用了三元组损失,它直接反映了我们在面部验证、识别和聚类中想要实现的目标。即,我们力求得到一个从图像 x 到特征空间 ℝ^d 的嵌入 f(x),使得同一身份的所有面孔之间的平方距离(与成像条件无关)都很小,而来自不同身份的一对面孔图像之间的平方距离则很大。

Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.

尽管我们没有直接与其他损失进行比较,例如在 [14] 的公式 (2) 中使用的正负对损失,但我们认为三元组损失更适合面部验证。其动机在于,[14] 的损失鼓励同一身份的所有面孔在嵌入空间中投影到一个单一的点上。然而,三元组损失则试图在一个人的每对面孔与所有其他面孔之间强制施加一个边际。这使得同一身份的面孔可以位于一个流形上,同时仍然强制施加与其他身份的距离,从而增强可区分性。

The following section describes this triplet loss and how it can be learned efficiently at scale.

以下部分描述了这种三元组损失及其如何在大规模上高效学习。

3.1. Triplet Loss

3.1. 三元组损失

The embedding is represented by f(x) ∈ ℝ^d. It embeds an image x into a d-dimensional Euclidean space. Additionally, we constrain this embedding to live on the d-dimensional hypersphere, i.e. ‖f(x)‖₂ = 1. This loss is motivated in [19] in the context of nearest-neighbor classification. Here we want to ensure that an image x_i^a (anchor) of a specific person is closer to all other images x_i^p (positive) of the same person than it is to any image x_i^n (negative) of any other person. This is visualized in Figure 3.

嵌入由 f(x) ∈ ℝ^d 表示,它将图像 x 嵌入到一个 d 维欧几里得空间中。此外,我们将这个嵌入约束在 d 维超球面上,即 ‖f(x)‖₂ = 1。这种损失在 [19] 中以最近邻分类为背景提出。在这里,我们希望确保某个特定人的一张图像 x_i^a(锚点)与该人所有其他图像 x_i^p(正样本)之间的距离,都小于它与任何其他人的任意图像 x_i^n(负样本)之间的距离。这一点在图 3 中进行了可视化。

Thus we want,

‖f(x_i^a) − f(x_i^p)‖₂² + α < ‖f(x_i^a) − f(x_i^n)‖₂²,  ∀(x_i^a, x_i^p, x_i^n) ∈ T  (1)

因此,我们希望:

‖f(x_i^a) − f(x_i^p)‖₂² + α < ‖f(x_i^a) − f(x_i^n)‖₂²,  ∀(x_i^a, x_i^p, x_i^n) ∈ T  (1)

where α is a margin that is enforced between positive and negative pairs. T is the set of all possible triplets in the training set and has cardinality N.

其中 α 是强加于正负样本对之间的间隔,T 是训练集中所有可能三元组的集合,其基数为 N。

The loss that is being minimized is then

L = Σᵢᴺ [ ‖f(x_i^a) − f(x_i^p)‖₂² − ‖f(x_i^a) − f(x_i^n)‖₂² + α ]₊

where [z]₊ denotes max(z, 0).

被最小化的损失为:

L = Σᵢᴺ [ ‖f(x_i^a) − f(x_i^p)‖₂² − ‖f(x_i^a) − f(x_i^n)‖₂² + α ]₊

其中 [z]₊ 表示 max(z, 0)。

Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets, that are active and can therefore contribute to improving the model. The following section talks about the different approaches we use for the triplet selection.

生成所有可能的三元组将导致许多容易满足的三元组(即满足公式 (1) 中的约束)。这些三元组不会对训练产生贡献,并且会导致收敛速度变慢,因为它们仍然会通过网络。选择困难的三元组至关重要,这些三元组是活跃的,因此可以帮助提高模型性能。以下部分讨论了我们用于三元组选择的不同方法。
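The minimized loss can be sketched numerically as follows; a minimal NumPy illustration of the hinge form max(0, ‖a−p‖² − ‖a−n‖² + α), assuming the margin α = 0.2 given later in section 3.3. It also shows why easy triplets contribute nothing to training.

上述损失可以用如下 NumPy 草图进行数值示意;这是对合页形式 max(0, ‖a−p‖² − ‖a−n‖² + α) 的最小演示,假定间隔 α = 0.2(见 3.3 节)。它同时说明了为什么容易的三元组对训练没有贡献。

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(0, ||a-p||^2 - ||a-n||^2 + alpha), averaged over the batch.
    A sketch of the loss formula only, not the paper's implementation."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # anchor-positive squared distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # anchor-negative squared distance
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + alpha)))

# An "easy" triplet already satisfies the constraint and yields zero loss:
a = np.array([[1.0, 0.0]])
p = np.array([[1.0, 0.0]])   # same identity, identical embedding
n = np.array([[0.0, 1.0]])   # distant negative: d_neg = 2.0
print(triplet_loss(a, p, n))  # 0.0, since 0.0 - 2.0 + 0.2 < 0

# The hardest possible negative (sitting on the anchor) costs the full margin:
print(triplet_loss(a, p, a))  # 0.2
```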

3.2. Triplet Selection

3.2. 三元组选择

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1). This means that, given x_i^a, we want to select an x_i^p (hard positive) such that argmax_{x_i^p} ‖f(x_i^a) − f(x_i^p)‖₂², and similarly an x_i^n (hard negative) such that argmin_{x_i^n} ‖f(x_i^a) − f(x_i^n)‖₂².

为了确保快速收敛,选择违反公式 (1) 中三元组约束的三元组至关重要。这意味着,给定 x_i^a,我们希望选择一个 x_i^p(困难正样本),使其满足 argmax_{x_i^p} ‖f(x_i^a) − f(x_i^p)‖₂²;同样地,选择一个 x_i^n(困难负样本),使其满足 argmin_{x_i^n} ‖f(x_i^a) − f(x_i^n)‖₂²。

It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:

在整个训练集上计算 argmin 和 argmax 是不可行的。此外,这可能导致训练效果不佳,因为错误标记和图像质量差的面孔会主导困难的正例和负例。有两个明显的选择可以避免这个问题:

  • Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.

  • 每隔 n 步离线生成三元组,使用最新的网络检查点,并在数据的一个子集上计算 argmin 和 argmax。

  • Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

  • 在线生成三元组。这可以通过从一个小批量中选择困难的正例/负例样本来完成。

Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.

在这里,我们专注于在线生成,并使用几千个样本的较大小批量,仅在一个小批量内计算 argmin 和 argmax。

To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.

为了对锚点-正样本距离有一个有意义的表示,需要确保每个小批量中包含每个身份的最少数量样本。在我们的实验中,我们对训练数据进行采样,使每个小批量中每个身份约有 40 张面孔。此外,还向每个小批量中加入随机采样的负样本面孔。
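A toy version of this identity-balanced sampling might look as follows; the function name and the toy data are illustrative assumptions (the paper uses around 40 faces per identity, in batches of roughly 1,800 exemplars).

这种按身份均衡采样的一个玩具版本如下;函数名和玩具数据均为示意性假设(论文中每个身份约 40 张面孔,批量约 1,800 个样本)。

```python
import random

def sample_minibatch(images_by_identity, faces_per_identity=40, seed=0):
    """Hypothetical sketch: take up to `faces_per_identity` faces from every
    identity so each mini-batch has enough anchor-positive pairs."""
    rng = random.Random(seed)
    batch = []
    for ident, faces in images_by_identity.items():
        k = min(faces_per_identity, len(faces))
        batch.extend((ident, face) for face in rng.sample(faces, k))
    rng.shuffle(batch)  # mix identities within the batch
    return batch

# Toy data: 3 identities with 5 face images each.
data = {f"id_{i}": [f"img_{i}_{j}" for j in range(5)] for i in range(3)}
batch = sample_minibatch(data, faces_per_identity=4)
print(len(batch))  # 3 identities x 4 faces = 12
```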

Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don't have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.

我们没有只挑选最困难的正样本,而是使用小批量中的所有锚点-正样本对,同时仍然挑选困难负样本。我们没有对小批量内“困难锚点-正样本对”与“所有锚点-正样本对”两种做法进行并排比较,但在实践中我们发现,使用所有锚点-正样本对的方法更稳定,并且在训练初期收敛略快。

We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.

我们还探索了离线生成三元组与在线生成相结合的方式,这可能允许使用更小的批量大小,但实验结果并不确定。

Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. f(x) = 0). In order to mitigate this, it helps to select x_i^n such that

‖f(x_i^a) − f(x_i^p)‖₂² < ‖f(x_i^a) − f(x_i^n)‖₂²

在训练早期,选择最困难的负样本在实践中可能导致不良的局部最小值,特别是可能导致模型坍缩(即 f(x) = 0)。为了缓解这一问题,选择满足以下条件的 x_i^n 会有帮助:

‖f(x_i^a) − f(x_i^p)‖₂² < ‖f(x_i^a) − f(x_i^n)‖₂²

We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin α.

我们将这些负样本称为半困难 (semi-hard) 负样本:它们距离锚点比正样本更远,但由于其平方距离接近锚点-正样本距离,它们仍然是困难的。这些负样本位于间隔 α 之内。
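The semi-hard selection rule can be sketched as follows; a hypothetical per-anchor helper, assuming precomputed embeddings and the margin α = 0.2 given in section 3.3.

半困难负样本的选择规则可以如下示意;这是一个假设性的按锚点处理的辅助函数,假定嵌入已预先计算,且间隔 α = 0.2(见 3.3 节)。

```python
import numpy as np

def semi_hard_negative(anchor, positive, negatives, alpha=0.2):
    """Among negatives that are farther from the anchor than the positive
    but still inside the margin, return the index of the closest one.
    Illustrative sketch, not the paper's implementation."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negatives) ** 2, axis=1)
    mask = (d_neg > d_pos) & (d_neg < d_pos + alpha)   # the semi-hard band
    idx = np.where(mask)[0]
    if idx.size == 0:
        return None  # no semi-hard negative in this mini-batch
    return int(idx[np.argmin(d_neg[idx])])

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])                      # d_pos = 1.0
negatives = np.array([[1.05, 0.0],                   # d = 1.1025: semi-hard
                      [2.0, 0.0],                    # d = 4.0: too easy
                      [0.5, 0.0]])                   # d = 0.25: hardest, risks collapse
print(semi_hard_negative(anchor, positive, negatives))  # 0
```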

As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars.

如前所述,正确的三元组选择对快速收敛至关重要。一方面,我们希望使用较小的迷你批次,因为这有助于在随机梯度下降 (SGD) [20] 过程中提高收敛速度。另一方面,实施细节使得包含数十到数百个样本的批次更为高效。然而,关于批次大小的主要限制是我们如何从迷你批次中选择硬相关三元组。在大多数实验中,我们使用约 1,800 个样本的批次大小。

3.3. Deep Convolutional Networks

3.3. 深度卷积网络

In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500 hours of training, but additional training can still significantly improve performance. The margin α is set to 0.2.

在我们所有的实验中,我们使用随机梯度下降 (SGD) 配合标准反向传播 [8, 11] 和 AdaGrad [5] 来训练 CNN。在大多数实验中,我们从 0.05 的学习率开始,随后降低学习率以完成模型训练。模型从随机初始化开始(类似于 [16]),并在 CPU 集群上训练 1,000 到 2,000 小时。训练 500 小时后,损失的下降(以及准确率的提高)显著减缓,但继续训练仍能显著提升性能。间隔 α 设置为 0.2。
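For illustration, a single AdaGrad parameter update with the stated starting rate of 0.05 can be sketched like this; a generic AdaGrad step, not the actual distributed training code.

作为示意,使用文中给出的 0.05 初始学习率的一次 AdaGrad 参数更新可以写成如下草图;这是一个通用的 AdaGrad 步骤,并非实际的分布式训练代码。

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update: per-coordinate rates shrink as squared
    gradients accumulate. Sketch only; lr=0.05 matches the text."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.array([1.0])
accum = np.zeros(1)
w, accum = adagrad_step(w, np.array([2.0]), accum)
print(w)  # ~0.95: 1.0 - 0.05 * 2 / sqrt(4)
```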

We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the difference of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a datacenter can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our models use rectified linear units as the non-linear activation function.

我们使用了两种类型的架构,并在实验部分更详细地探讨了它们的权衡。它们的实际差异在于参数和FLOPS的不同。最佳模型可能因应用而异。例如,在数据中心运行的模型可以有很多参数并需要大量的FLOPS,而在手机上运行的模型需要有较少的参数,以便能够适应内存。我们所有的模型都使用修正线性单元作为非线性激活函数。

Table 1. NN1. This table shows the structure of our Zeiler&Fergus [22] based model with 1×1 convolutions inspired by [9]. The input and output sizes are described in rows × cols × #filters. The kernel is specified as rows × cols, stride, and the maxout [6] pooling size as p = 2.

表 1. NN1。该表展示了我们基于 Zeiler&Fergus [22] 的模型结构,其中包含受 [9] 启发的 1×1 卷积。输入和输出大小以 行 × 列 × #filters 描述。卷积核以 行 × 列、步幅 指定,maxout [6] 池化大小为 p = 2。

The first category, shown in Table 1, adds 1×1×d convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.

第一类如表 1 所示,按照 [9] 的建议,在 Zeiler&Fergus [22] 架构的标准卷积层之间添加了 1×1×d 卷积层,得到一个 22 层深的模型。它总共有 1.4 亿个参数,每张图像需要约 16 亿 FLOPS。

The second category we use is based on GoogLeNet style Inception models [16]. These models have 20× fewer parameters (around 6.6M–7.5M) and up to 5× fewer FLOPS (between 500M–1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of 160x160. NN4 has an input size of only 96x96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2). In addition to the reduced input size it does not use 5x5 convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the 5×5 convolutions can be removed throughout

我们使用的第二类基于 GoogLeNet 风格的 Inception 模型 [16]。这些模型的参数数量少 20 倍(约 6.6M–7.5M),FLOPS 最多少 5 倍(在 500M–1.6B 之间)。其中一些模型在规模上(深度和滤波器数量)都大幅缩减,因此可以在手机上运行。其中一个,NNS1,有 26M 参数,每张图像仅需 220M FLOPS;另一个,NNS2,有 4.3M 参数和 20M FLOPS。表 2 详细描述了我们最大的网络 NN2。NN3 架构与之相同,但输入大小减少为 160x160。NN4 的输入大小仅为 96x96,从而大幅降低了 CPU 需求(285M FLOPS,相比 NN2 的 1.6B)。除了减小输入大小外,它在较高层不使用 5x5 卷积,因为到那时感受野已经太小。总体而言,我们发现 5×5 卷积可以在整个网络中去除
