I've been working on information retrieval lately. The model in this paper ranked first on Microsoft's MS MARCO leaderboard, so I'm reading through it to learn from it.
Abstract
We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small batch finetuning.
The authors use the recently proposed Condenser pre-training architecture, which learns to condense information into a dense vector through LM pre-training. On top of it, they propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. It matches RocketQA, a state-of-the-art and heavily engineered system at the time, while coCondenser only needs simple small-batch fine-tuning.
Paper Walkthrough
Introduction
The recent RocketQA system (Qu et al., 2021) significantly improves the performance of a dense retriever by designing an optimized fine-tuning pipeline that includes i) denoising hard negatives, which corrects mislabeling, and ii) large batch training. While this is very effective, the entire pipeline is very heavy in computation and not feasible for people who do not have tremendous hardware resources, especially those in academia.
RocketQA designs an optimized fine-tuning pipeline. First, it denoises mislabeled samples (learning from correctly labeled pairs gives the model a much stronger positive signal); second, it uses a very large batch size (a large batch averages out the loss and dampens the interference from bad samples), which makes training more stable.
We hypothesize that typical LMs are sensitive to mislabeling, which can cause detrimental updates to the model weights. Denoising can effectively remove the bad samples and their updates.
Because mislabeled data can cause detrimental updates to a language model's weights, denoising removes most of the bad samples.
On the other hand, for most LMs, the CLS vectors are either trained with a simple task (Devlin et al., 2019) or not explicitly trained at all (Liu et al., 2019). These vectors are far from being able to form an embedding space of passages (Lee et al., 2019).
The point here is that in typical LMs, the CLS vector is either trained with a simple task or has no explicit training objective at all (yet we usually take the CLS vector as the sentence embedding).
To this end, we want to pre-train an LM such that it is locally noise-resistant and has a well-structured global embedding space.
This states the goal of the paper: pre-train an LM so that it is locally noise-resistant and has a well-structured global embedding space.
For noise resistance, we borrow the Condenser pre-training architecture (Gao and Callan, 2021), which performs language model pre-training actively conditioning on the CLS vector. It produces an information-rich CLS representation that can robustly condense an input sequence.
The Condenser pre-training architecture is used to strengthen the CLS vector's capacity to represent the input.
We then introduce a simple corpus level contrastive learning objective: given a target corpus of documents to retrieve from, at each training step sample text span pairs from a batch of documents and train the model such that the CLS embeddings of two spans from the same document are close and spans from different documents are far apart.
This is unsupervised learning: randomly sample span pairs from a batch of documents and train the model so that the CLS embeddings of two spans from the same document are as close as possible, while embeddings of spans from different documents are pushed far apart.
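As a minimal sketch of this batch construction (pure Python; the span length and whitespace tokenization here are my own simplifications, not the paper's exact procedure):

```python
import random

def make_cocondenser_batch(documents, span_len=5, seed=0):
    """Sample two random token spans from each document.

    Returns [s_11, s_12, ..., s_n1, s_n2]: spans 2k and 2k+1 come from
    document k and act as positives for each other.
    """
    rng = random.Random(seed)
    batch = []
    for doc in documents:
        tokens = doc.split()  # crude whitespace tokenization for illustration
        for _ in range(2):    # two spans per document form the positive pair
            start = rng.randrange(max(1, len(tokens) - span_len + 1))
            batch.append(" ".join(tokens[start:start + span_len]))
    return batch

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "dense retrieval maps queries and passages into a shared embedding space",
]
batch = make_cocondenser_batch(docs)
assert len(batch) == 2 * len(docs)
```

Note that no query-passage labels are needed anywhere: positives come for free from document co-membership, which is what makes the pre-training query-agnostic.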
In this paper, we test coCondenser pre-training on two popular corpora, Wikipedia and MS-MARCO. Both have served as information sources for a wide range of tasks. This popularity justifies pre-training models specifically for each of them. We directly fine-tune the pre-trained coCondenser using small training batches without data engineering. On Natural Question, TriviaQA, and MS-MARCO passage ranking tasks, we found that the resulting models perform on-par or better than RocketQA and other contemporary methods.
Finally, the authors test coCondenser pre-training on two corpora, Wikipedia and MS-MARCO. Fine-tuning directly uses small training batches without any data engineering, and the resulting models perform on par with or better than RocketQA and other contemporary methods.
Related Work
Among them, Qu et al. (2021) proposed the RocketQA fine-tuning pipeline which hugely advanced the performance of dense retrievers.
The first paragraph surveys the recent progress of Transformer-based dense retrieval, then closes by praising RocketQA's huge performance gains, which also reads a bit like the authors complimenting themselves.
In this work, we use contrastive learning to do pre-training for dense retrieval. Different from earlier work, instead of individual representations (Giorgi et al., 2020), we are interested in the full learned embedding space, which we will use to warm start the retriever. In general, this extends to any training procedure that uses contrastive loss, including dense retrieval pre-training (Guu et al., 2020; Chang et al., 2020). Gao et al. (2021b) recently devised a gradient cache technique that upper-bounds peak memory usage of contrastive learning to almost constant.
The main approach here is contrastive learning; to ease the GPU memory limit imposed by large batches, the gradient cache technique of Gao et al. (2021b) is used, which comes up again later.
Method
Condenser
Condenser is a stack of Transformer blocks. As shown in Figure 1, these Transformer blocks are divided into three groups, early backbone encoder layers, late backbone encoder layers, and head layers. An input x = [x1, x2, ..] is first prepended a CLS, embedded, and run through the backbone layers.
Condenser's Transformer blocks are divided into three groups: early backbone encoder, late backbone encoder, and head. The bottom is the input layer, with the special token [CLS] prepended.
The formulas here are quite clear: the bottom layer corresponds to the left-hand side of the first equation, the next layers correspond to Encoder_early, the following ones to Encoder_late, and the resulting h^{late}_{cls} together with h^{early} (the early-layer token states) is fed into the head as input.
The head's output is trained as a masked language model; there should be one more fully connected layer on top, with parameters W, that predicts each masked token_i against the original token, with cross entropy as the loss.
To utilize the capacity of the late layers, Condenser is forced to learn to aggregate information into the CLS, which will then participate in the LM prediction. Leveraging the rich and effective training signal produced by MLM, Condenser learn to utilize the powerful Transformer architecture to generate dense CLS representation. We hypothesize that with this LM objective typically used to train token representation now put on the dense CLS representation, the learned LM gains improved robustness against noise.
To utilize the capacity of the late layers, Condenser is forced to learn to aggregate information into the CLS, which then participates in the LM prediction (I'm a bit fuzzy here and should read the Condenser paper itself). Leveraging the rich and effective training signal produced by MLM, Condenser learns to use the powerful Transformer architecture to generate a dense CLS representation. The hypothesis is that with this LM objective, normally used to train token representations, now placed on the dense CLS representation, the learned LM gains improved robustness against noise.
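A shape-level sketch of that skip connection may help, with random matrices standing in for the real Transformer layers (all dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 8                                    # toy sequence length and hidden size

# Stand-ins for the embedding layer and the two backbone groups (the real
# Condenser uses Transformer blocks here, not random linear projections).
h_input = rng.normal(size=(L + 1, d))          # [CLS] + L token embeddings
h_early = h_input @ rng.normal(size=(d, d))    # output of early backbone layers
h_late = h_early @ rng.normal(size=(d, d))     # output of late backbone layers

# The head conditions on the LATE CLS but the EARLY token states, so any
# late-layer information about the sequence must flow through the CLS vector.
head_input = np.vstack([h_late[:1], h_early[1:]])
assert head_input.shape == (L + 1, d)
```

This is exactly the point of the architecture: because the head only sees the early token states, the CLS is the sole channel for whatever the late layers learned about the whole input.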
coCondenser
while information embedded in the CLS can be non-linearly interpreted by the head, inner products between these vectors still lack semantics. Consequently, they do not form an effective embedding space. To this end, we augment the Condenser MLM loss with a contrastive loss. Unlike previous work that pre-trains on artificial query passage pairs, in this paper, we propose to simply pre-train the passage embedding space in a query-agnostic fashion, using a contrastive loss defined over the target search corpus. Concretely, given a random list of n documents [d1, d2, ..., dn], we extract randomly from each a pair of spans, [s11, s12, ..., sn1, sn2]. These spans then form a training batch of coCondenser. Write a span sij ’s corresponding late CLS representation hij, its corpus-aware contrastive loss is defined over the batch,
Although the information embedded in the CLS can be non-linearly interpreted by the head, the inner products between these vectors still carry no semantics. The authors therefore add a contrastive loss to improve the semantic structure of the embedding space. The approach was already mentioned in the introduction: extract a pair of spans from each document, and these spans form one training batch; the loss is computed as in the figure below.
Following the spirit of the distributional hypothesis, passages close together should have similar representations while those in different documents should have different representations.
The core idea is that spans from the same document should have similar representations, while spans from different documents should have different spatial representations. The final loss is the MLM loss plus the span contrastive loss.
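A minimal NumPy sketch of this corpus-level contrastive loss (the cosine normalization and masking details are my simplifications; the paper defines the loss over the late CLS vectors h_ij):

```python
import numpy as np

def span_contrastive_loss(spans):
    """InfoNCE-style loss for a (2n, d) batch where rows 2k and 2k+1
    are two spans drawn from the same document k."""
    z = spans / np.linalg.norm(spans, axis=1, keepdims=True)
    sim = z @ z.T
    n2 = len(z)
    losses = []
    for i in range(n2):
        pos = i ^ 1                      # the sibling span from the same doc
        mask = np.ones(n2, dtype=bool)
        mask[i] = False                  # a span is not its own negative
        logits = sim[i][mask]
        target = pos - (pos > i)         # index of the positive after masking
        losses.append(np.log(np.exp(logits).sum()) - logits[target])
    return float(np.mean(losses))

# Siblings identical and documents orthogonal -> low loss;
# every span unrelated to its sibling -> higher loss.
e = np.eye(4)
clustered = np.stack([e[0], e[0], e[1], e[1]])
scattered = np.stack([e[0], e[1], e[2], e[3]])
assert span_contrastive_loss(clustered) < span_contrastive_loss(scattered)
```

All other spans in the batch serve as in-batch negatives, which is why a larger batch gives a harder, more informative contrastive task.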
Memory Efficient Pre-training
This part is about gradient caching: large batches easily blow up GPU memory, and each loss computation here samples n documents with 2 spans each, so every batch is large. To train stably on limited hardware, the trick is to first compute and cache the h_ij representations, use them to compute the quantities in Equation (9), and then treat those as constants when computing the parameter gradients in Equations (10, 11).
The overall gradient formula is then:
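To see why this two-pass scheme is exact, here is a toy NumPy demonstration of the gradient-cache idea with a linear "encoder" (the loss, its closed-form embedding gradient, and all dimensions are my own toy choices): the batch-wide gradient with respect to the embeddings is computed once, cached, and then backpropagated chunk by chunk, reproducing the full-batch parameter gradient exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d = 8, 6, 4
X = rng.normal(size=(n, d_in))          # one "span" per row
W = rng.normal(size=(d_in, d))          # toy linear encoder: z = x @ W

def loss_grad_wrt_embeddings(Z):
    """Cross-entropy over pairwise similarities; row i's positive is row i^1.
    Returns dL/dZ in closed form."""
    S = Z @ Z.T
    np.fill_diagonal(S, -1e9)                       # mask self-similarity
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    Y = np.zeros_like(S)
    Y[np.arange(n), np.arange(n) ^ 1] = 1.0
    dS = (P - Y) / n
    return dS @ Z + dS.T @ Z                        # since S = Z @ Z.T

# Pass 1 (no gradients kept): embed the whole batch, cache dL/dZ.
Z = X @ W
dZ = loss_grad_wrt_embeddings(Z)

# Pass 2: re-encode in small chunks and backprop the cached gradient.
grad_W = np.zeros_like(W)
for i in range(0, n, 2):                            # chunks of 2 "fit in memory"
    grad_W += X[i:i + 2].T @ dZ[i:i + 2]

grad_W_full = X.T @ dZ                              # full-batch reference
assert np.allclose(grad_W, grad_W_full)
```

The equality holds because the contrastive loss couples the whole batch only through the embeddings: once dL/dZ is cached, each chunk's parameter gradient can be computed independently, so peak memory depends on the chunk size rather than the batch size.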
Fine-tuning
At the end of pre-training, we discard the Condenser head, keeping only the backbone layers. Consequently, the model reduces to its backbone, or effectively a Transformer Encoder.
After pre-training, the Condenser head is discarded and only the backbone layers are kept, so the model reduces to its backbone, effectively a plain Transformer encoder. This encoder initializes both the query and passage embeddings (using the last-layer CLS output); similarity between the two is computed, and a negative log likelihood loss compares the positive passage against the other documents.
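A small sketch of that fine-tuning loss (the vectors below are made-up toy embeddings; in the real setup they would be the last-layer CLS outputs of the query and passage encoders):

```python
import numpy as np

def retrieval_nll(q, pos, negs):
    """Negative log likelihood of the positive passage under a
    softmax over dot-product similarity scores."""
    scores = np.array([q @ pos] + [q @ neg for neg in negs])
    return float(np.log(np.exp(scores).sum()) - scores[0])

q = np.array([1.0, 0.0, 1.0])               # toy query embedding
pos = np.array([0.9, 0.1, 0.8])             # passage close to the query
negs = [np.array([0.0, 1.0, -0.5]),
        np.array([-1.0, 0.2, 0.0])]         # passages far from the query

# Treating the true positive as positive yields a lower loss than
# swapping a negative into its place.
assert retrieval_nll(q, pos, negs) < retrieval_nll(q, negs[0], [pos, negs[1]])
```

The negatives here are exactly where the two-round pipeline below plugs in: BM25 negatives in round one, mined hard negatives in round two.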
We run a two-round training as described in the DPR (Karpukhin et al., 2020) toolkit. As shown in Figure 2b, in the first round, the retrievers are trained with BM25 negatives. The first-round retriever is then used to mine hard negatives to complement the negative pool. The second round retriever trains with the negative pool generated in the first round. This is in contrast to the multi-stage pipeline of RocketQA shown in Figure 2a.
In the first round, the retriever is trained with negatives produced by BM25; in the second round, it is fine-tuned again with the hard negatives mined by the first-round retriever.
Experiments
From the table we can see a clear gap between RocketQA's batch size and the one used in this paper; yet in terms of results, once hard negatives are added, coCondenser outperforms RocketQA on MS-MARCO.
MRR@10: Mean Reciprocal Rank
Take the reciprocal of the rank at which the gold answer appears in the system's returned results as that query's accuracy, then average over all queries. For example, if the gold answer is ranked 3rd for query 1, 2nd for query 2, and 1st for query 3, the MRR is:
(1/3 + 1/2 + 1/1) / 3 ≈ 0.61
R@10: recall at top 10, i.e. the fraction of queries whose gold answer appears in the top 10 results.
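Both metrics fit in a few lines of Python (my own straightforward implementation, checked against the worked example above):

```python
def mrr_at_k(ranks, k=10):
    """ranks: 1-based rank of the gold answer per query (None if absent)."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

def recall_at_k(ranks, k=10):
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

# Gold answers ranked 3rd, 2nd, and 1st for the three example queries:
assert abs(mrr_at_k([3, 2, 1]) - (1/3 + 1/2 + 1) / 3) < 1e-12
assert recall_at_k([3, 2, 1]) == 1.0
```

Note that answers ranked outside the top k contribute 0 to both metrics, which is why the two numbers move together but MRR rewards ranking the answer higher.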