TL;DR
In dense passage retrieval there is a discrepancy between training and inference: during training only a subset of passages is used as negatives for each question, while at inference the question is compared against the entire collection. In addition, the negatives used during training contain a large number of false negatives, i.e., passages labeled as non-answers that could in fact serve as answers.
To address these two problems, the paper proposes three optimization strategies: cross-batch negatives, denoised hard negatives, and data augmentation.
Cross-batch negatives: the passage embeddings of the n samples on each of the m GPUs are computed and then shared with every GPU, so each training question gets m*n-1 negatives instead of the n-1 available with in-batch negatives, greatly increasing the number of negatives the model learns from.
Denoised hard negatives: first train a dual encoder to retrieve negative candidates, then train a cross encoder to remove false negatives, so that the negatives the model learns from are as reliable as possible; this is essentially a round of data cleaning.
Data augmentation: use the trained cross encoder to label unlabeled data, similar to semi-supervised learning, to enlarge the training set.
This approach performs well on the evaluation datasets, but it is demanding on hardware. Cross-batch negatives, for instance, benefit from using as many GPUs as possible, and the large number of extra negatives per step puts heavy pressure on each GPU's memory. The dual-encoder / cross-encoder pipeline that follows is also a multi-stage training procedure, so the overall training cost is relatively high.
Abstract
Typically, the dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to effectively train a dual-encoder due to the challenges including the discrepancy between training and inference, the existence of unlabeled positives and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improving dense passage retrieval. We make three major technical contributions in RocketQA, namely cross-batch negatives, denoised hard negatives and data augmentation.
The opening lays out the problems with training DPR (dense passage retrieval) models: the discrepancy between training and inference, and the limited training data. The paper then proposes three ways to optimize DPR: cross-batch negatives, denoised hard negatives, and data augmentation.
Though efficient with an inverted index, traditional IR retrievers with term-based sparse representations have limited capabilities in matching questions and passages, e.g., term mismatch.
Sparse retrievers such as BM25 still lack semantic understanding.
To deal with the issue of term mismatch, the dual-encoder architecture (as shown in Figure 1a) has been widely explored (Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020; Luan et al., 2020; Xiong et al., 2020) to learn dense representations of questions and passages in an end-to-end manner, which provides better representations for semantic matching.
The dual-encoder (DPR) approach can better learn the semantic matching between query and passage.
First, there exists the discrepancy between training and inference for the dual-encoder retriever. During inference, the retriever needs to identify positive (or relevant) passages for each question from a large collection containing millions of candidates. However, during training, the model is learned to estimate the probabilities of positive passages in a small candidate set for each question, due to the limited memory of a single GPU (or other device).
But some problems remain, such as the discrepancy between inference and training. At inference time every query must be compared against the entire candidate collection, whereas during training the model only estimates, over a small candidate set, which passages are positive.
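To make the discrepancy concrete, the dual encoder is typically trained with a softmax over one positive and a handful of sampled negatives (a standard formulation, assuming a similarity function s(q, p) such as the dot product), whereas at inference the ranking is effectively over the full collection of millions of passages:

$$
\mathcal{L} = -\log \frac{e^{s(q,\,p^{+})}}{e^{s(q,\,p^{+})} + \sum_{j=1}^{n} e^{s(q,\,p_{j}^{-})}}, \qquad n \ll |\text{collection}|
$$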
As will be shown in our experiments, we manually examine the top-retrieved passages that were not labeled as positives in the original MSMARCO dataset, and we find that 70% of them are actually positives. Hence, it is likely to bring false negatives when sampling hard negatives from the top-k retrieved passages.
Examining the MSMARCO dataset, the authors find that a large fraction (about 70%) of the top-retrieved passages not labeled as positive are actually positive, so sampling hard negatives from the top-k retrieved passages easily introduces false negatives.
They are created from commercial search engines, and have 516K and 300K annotated questions, respectively. However, it is still insufficient to cover all the topics of questions issued by users to search engines.
Even open datasets of this size are still not enough to cover all the questions users ask.
First, RocketQA introduces cross-batch negatives. Comparing to in-batch negatives, it increases the number of available negatives for each question during training, and alleviates the discrepancy between training and inference. Second, RocketQA introduces denoised hard negatives. It aims to remove false negatives from the top-ranked results retrieved by a retriever and derive more reliable hard negatives. Third, RocketQA leverages large-scale unsupervised data “labeled” by a cross-encoder (as shown in Figure 1b) for data augmentation.
The three optimizations in RocketQA: cross-batch negatives, denoised hard negatives, and data augmentation.
Related Work
Passage retrieval for open-domain QA: first reviews the history of passage retrieval, from the early term-based TF-IDF and BM25 algorithms to later deep-learning approaches. Current approaches are summarized into two categories: (1) self-supervised pre-training for retrieval and (2) fine-tuning pre-trained language models on labeled data.
Passage re-ranking for open-domain QA: this mainly covers second-stage re-ranking. Although a cross-encoder could in principle be used as a retriever, it is far too slow over a massive corpus, so it is only used as a second-stage re-ranker.
Approach
This section covers RocketQA's three optimization strategies.
Cross-batch negatives: with the previous in-batch approach, a query on a given GPU uses the positive passages of the other queries in the same batch as its negative samples. With cross-batch negatives, each GPU first computes the passage embeddings for its own batch and then shares them with all the other GPUs, which increases the number of negative samples each query sees (see the sketch below).
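A minimal sketch of cross-batch negatives, assuming PyTorch with `torch.distributed` already initialized; the names `q_emb`/`p_emb` and the gradient handling are illustrative, not the paper's actual implementation:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_passages(p_emb: torch.Tensor) -> torch.Tensor:
    """Collect passage embeddings from all GPUs. In this simple sketch,
    gradients only flow through the local copy, which is kept in its slot."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(p_emb) for _ in range(world_size)]
    dist.all_gather(gathered, p_emb)
    gathered[dist.get_rank()] = p_emb          # keep local embeddings with grad
    return torch.cat(gathered, dim=0)          # (m * n, dim)

def cross_batch_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb, p_emb: (n, dim) local question / positive-passage embeddings."""
    all_p = gather_passages(p_emb)              # (m * n, dim)
    scores = q_emb @ all_p.t()                  # (n, m * n) similarity matrix
    # The positive for local question i sits at column rank * n + i;
    # every other column is one of the m*n - 1 cross-batch negatives.
    n = q_emb.size(0)
    labels = torch.arange(n, device=q_emb.device) + dist.get_rank() * n
    return F.cross_entropy(scores, labels)
```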
Denoised hard negatives: picking hard negatives from the top-k ranked passages easily introduces false negatives, so a cross-encoder is trained to filter them out. Scoring the entire corpus with a cross-encoder would be far too inefficient, so the dual encoder does a first-pass retrieval and the cross-encoder then re-scores the candidates, keeping only passages it confidently predicts as negative as the hard negatives.
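A minimal sketch of this denoising step; the helpers `retrieve_topk` (dual-encoder retrieval) and `cross_encoder_score` (relevance probability in [0, 1]), as well as the 0.1 threshold, are illustrative assumptions rather than the paper's exact values:

```python
from typing import Callable, List, Set

def select_denoised_hard_negatives(
    question: str,
    positives: Set[str],
    retrieve_topk: Callable[[str, int], List[str]],
    cross_encoder_score: Callable[[str, str], float],
    k: int = 100,
    max_score: float = 0.1,
) -> List[str]:
    """Return top-ranked passages the cross-encoder is confident are NOT
    relevant, i.e. reliable hard negatives."""
    hard_negatives = []
    for passage in retrieve_topk(question, k):
        if passage in positives:
            continue  # skip labeled positives
        # Drop likely false negatives: unlabeled passages the cross-encoder
        # nevertheless scores as relevant to the question.
        if cross_encoder_score(question, passage) < max_score:
            hard_negatives.append(passage)
    return hard_negatives
```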
Data augmentation: use the cross-encoder to label unlabeled data, thereby enlarging the training set.
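A minimal sketch of this pseudo-labeling step, reusing the same assumed helpers as above; the confidence thresholds are illustrative, not the paper's exact values:

```python
from typing import Callable, List, Tuple

def pseudo_label(
    unlabeled_questions: List[str],
    retrieve_topk: Callable[[str, int], List[str]],
    cross_encoder_score: Callable[[str, str], float],
    k: int = 50,
    pos_threshold: float = 0.9,
    neg_threshold: float = 0.1,
) -> List[Tuple[str, str, int]]:
    """Return (question, passage, label) triples to add to the training set,
    keeping only predictions the cross-encoder is confident about."""
    augmented = []
    for question in unlabeled_questions:
        for passage in retrieve_topk(question, k):
            score = cross_encoder_score(question, passage)
            if score >= pos_threshold:
                augmented.append((question, passage, 1))   # confident positive
            elif score <= neg_threshold:
                augmented.append((question, passage, 0))   # confident negative
    return augmented
```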
Experiments
Overall, the results are quite good.