Reading Comprehension

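Overview

Large-scale datasets have been a major driver of progress on reading comprehension. Early datasets include MCTest, CNN/Daily Mail, the Children's Book Test, and SQuAD.

Almost all end-to-end approaches rely on attention mechanisms.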

Teaching Machines to Read and Comprehend. Karl Moritz Hermann. NIPS 2015
A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. Danqi Chen. ACL 2016
Attention-over-Attention Neural Networks for Reading Comprehension. Yiming Cui, Ting Liu. 2016
Machine Comprehension Using Match-LSTM and Answer Pointer. Shuohang Wang, Jing Jiang. 2016
(BiDAF) Bi-directional Attention Flow for Machine Comprehension. Minjoon Seo. ICLR 2017
(DrQA) Reading Wikipedia to Answer Open-Domain Questions. Danqi Chen. ACL 2017
(ORQA) Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee. ACL 2019 (Google)
Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, ..., Danqi Chen, Wen-tau Yih. EMNLP 2020
Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, ..., Danqi Chen. 2021

A paper that uses distant supervision to do relation extraction without labeled data: Distant supervision for relation extraction without labeled data. Mike Mintz. ACL/IJCNLP 2009

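Dataset: SQuAD 1.0

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP, 2016.

SQuAD 1.0 is a high-quality reading comprehension dataset built from Wikipedia articles. Its defining feature is scale: it contains 107,785 question-answer pairs, all annotated by humans. In the training set, each question has exactly one answer, which is a span of text from the passage; in the development and test sets, a question may have several reference answers.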

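Evaluation

Exact Match (EM): the percentage of predictions that exactly match any one of the reference answers.
F1: treat the prediction and each reference answer as a bag of tokens, compute the F1 between the prediction and every reference answer, take the maximum, and average over all questions in the dataset.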
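
To make the two metrics concrete, here is a minimal Python sketch. It assumes plain whitespace tokenization and skips the answer normalization (lowercasing, stripping punctuation and articles) that the official SQuAD evaluation script applies; the function names are made up for this example.

```python
from collections import Counter


def exact_match(prediction, references):
    """1.0 if the prediction equals any reference answer, else 0.0."""
    return float(any(prediction.strip() == ref.strip() for ref in references))


def token_f1(prediction, reference):
    """Bag-of-tokens F1 between one prediction and one reference."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def max_f1(prediction, references):
    """Score against every reference and keep the best F1."""
    return max(token_f1(prediction, ref) for ref in references)
```

The dataset-level EM and F1 would then be the mean of `exact_match` and `max_f1` over all questions.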

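Classic Model 1: BiDAF

Note: in Chinese, a word is a 词语 (a multi-character word) and a character is a single 汉字 (Chinese character); in English, a word is a word and a character is a letter.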

1. Character Embedding Layer

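A character-level CNN maps each word to a vector. The characters of a word first give a variable-length sequence of character embeddings; the CNN outputs are then max-pooled over the sequence to obtain a fixed-length vector.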
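
The pooling step can be sketched in NumPy as follows. This is only an illustration under assumed sizes: the filter width, embedding size, number of filters, and the random weights are all placeholders, and `char_cnn_word_vector` is a hypothetical name.

```python
import numpy as np


def char_cnn_word_vector(char_ids, char_emb, filters):
    """Map one word (a list of character ids) to a fixed-length vector.

    char_emb: (vocab, d_c) character embedding table
    filters:  (width, d_c, n_filters) 1D convolution filters
    """
    E = char_emb[char_ids]                      # (L, d_c), L varies per word
    width = filters.shape[0]
    if len(E) < width:                          # zero-pad very short words
        E = np.vstack([E, np.zeros((width - len(E), E.shape[1]))])
    # Slide the filters over every window of `width` characters.
    windows = np.stack([E[i:i + width] for i in range(len(E) - width + 1)])
    feats = np.einsum("nwc,wcf->nf", windows, filters)   # (L - width + 1, n_filters)
    # Max over positions removes the dependence on word length.
    return feats.max(axis=0)                    # (n_filters,)


rng = np.random.default_rng(0)
char_emb = rng.normal(size=(30, 8))
filters = rng.normal(size=(3, 8, 16))
short_word = char_cnn_word_vector([1, 2], char_emb, filters)
long_word = char_cnn_word_vector([3, 4, 5, 6, 7, 8, 9], char_emb, filters)
```

Both `short_word` and `long_word` end up with one entry per filter, regardless of how many characters the word has.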

2. Word Embedding Layer

Each word's embedding is taken from GloVe. The character-level and word-level embeddings are concatenated and passed through a two-layer Highway Network, producing the matrix $\bold{X} \in \Bbb{R}^{d \times T}$ for the context and $\bold{Q} \in \Bbb{R}^{d \times J}$ for the query.

3. Contextual Embedding Layer

A bidirectional LSTM. From $\bold{X}$ and $\bold{Q}$ it produces $\bold{H} \in \Bbb{R}^{2d \times T}$ and $\bold{U} \in \Bbb{R}^{2d \times J}$ respectively.

4. Attention Flow Layer

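$\bold{S} \in \Bbb{R}^{T \times J}$ is the unnormalized similarity (attention score) matrix, where $\bold{S}_{tj}$ is the similarity between the $t$-th context word (the $t$-th column of $\bold{H}$, $\bold{H}_{:t}$) and the $j$-th query word (the $j$-th column of $\bold{U}$, $\bold{U}_{:j}$):

\begin{aligned} \bold{S}_{tj} & = \alpha(\bold{H}_{:t}, \bold{U}_{:j})\\ \alpha(\bold{h}, \bold{u}) & = \bold{w}^{\top}_{(\bold{S})}[\bold{h};\bold{u};\bold{h} \circ \bold{u}] \end{aligned}

Here $\bold{w}_{(\bold{S})} \in \Bbb{R}^{6d}$ is a trainable weight vector, $\circ$ denotes element-wise multiplication, and $[;]$ denotes vector concatenation across rows.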
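
A direct, unoptimized NumPy transcription of this similarity function might look like the sketch below. The explicit double loop is chosen for readability rather than speed, and `similarity_matrix` is a name introduced here for illustration.

```python
import numpy as np


def similarity_matrix(H, U, w):
    """S[t, j] = w^T [h; u; h * u] for context column h and query column u.

    H: (2d, T) context encodings, U: (2d, J) query encodings, w: (6d,)
    """
    T, J = H.shape[1], U.shape[1]
    S = np.empty((T, J))
    for t in range(T):
        for j in range(J):
            h, u = H[:, t], U[:, j]
            S[t, j] = w @ np.concatenate([h, u, h * u])
    return S


rng = np.random.default_rng(0)
d, T, J = 2, 5, 3
H = rng.normal(size=(2 * d, T))
U = rng.normal(size=(2 * d, J))
S = similarity_matrix(H, U, np.ones(6 * d))
```

With `w` set to all ones, each entry reduces to `sum(h) + sum(u) + h @ u`, which is a convenient sanity check.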

context-to-query attention: for the $t$-th context word, compute an attention-weighted query representation $\bold{\tilde{U}}_{:t}$; stacking these columns gives $\bold{\tilde{U}} \in \Bbb{R}^{2d \times T}$.

\begin{aligned} \bold{a}_t & = \mathrm{softmax}(\bold{S}_{t:})\\ \bold{\tilde{U}}_{:t} & = \sum_j \bold{a}_{tj}\bold{U}_{:j} \end{aligned}

Here $\bold{a}_t \in \Bbb{R}^J$ is the attention vector from the $t$-th context word to the query words, with $\sum_j \bold{a}_{tj} = 1$.
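
In matrix form, context-to-query attention is a row-wise softmax followed by one matrix product. The following NumPy sketch assumes $\bold{S}$ and $\bold{U}$ are already available; `context_to_query` is an illustrative name.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)


def context_to_query(S, U):
    """S: (T, J) similarities, U: (2d, J) query encodings -> U_tilde: (2d, T)."""
    A = softmax(S, axis=1)      # row t is a_t, a distribution over the J query words
    return U @ A.T              # column t is sum_j a_tj * U[:, j]


rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))
U_tilde = context_to_query(np.zeros((5, 3)), U)
```

When all similarities are equal, as in this toy call, every column of `U_tilde` is simply the mean of the query columns.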

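query-to-context attention: attend over the context words with respect to the query to obtain a single attention-weighted context representation $\bold{\tilde{h}} \in \Bbb{R}^{2d}$, then tile it $T$ times to get $\bold{\tilde{H}} \in \Bbb{R}^{2d \times T}$.

\begin{aligned} \bold{b} & = \mathrm{softmax}(\max\nolimits_{col}(\bold{S}))\\ \bold{\tilde{h}} & = \sum_t \bold{b}_t \bold{H}_{:t} \end{aligned}

Here $\max_{col}$ takes the maximum across the columns, and $\bold{b} \in \Bbb{R}^T$.

The shape of $\bold{b}$ is worth working out: $\bold{S}$ is $T \times J$, and $\max_{col}$ takes, for each row $t$, the maximum over the $J$ columns (the query words), so exactly one score remains per context word and $\bold{b}$ has $T$ entries.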
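
The shapes are easy to check in code. In this NumPy sketch (function name chosen for illustration), the maximum is taken along `axis=1`, that is, over the query words in each row of $\bold{S}$:

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)


def query_to_context(S, H):
    """S: (T, J) similarities, H: (2d, T) context encodings."""
    b = softmax(S.max(axis=1))          # (T,): one weight per context word
    h_tilde = H @ b                     # (2d,): weighted sum of context columns
    H_tilde = np.tile(h_tilde[:, None], (1, H.shape[1]))   # repeat T times
    return b, H_tilde


rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))             # T = 5, J = 3
H = rng.normal(size=(4, 5))             # 2d = 4
b, H_tilde = query_to_context(S, H)
print(b.shape, H_tilde.shape)           # (5,) (4, 5)
```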

The contextual embeddings and the attention vectors are then combined into $\bold{G} \in \Bbb{R}^{d_{\bold{G}} \times T}$ (query-aware representations of the context words), whose $t$-th column $\bold{G}_{:t} \in \Bbb{R}^{d_{\bold{G}}}$ is computed as:

\begin{aligned} \bold{G}_{:t} & = \bold{\beta}(\bold{H}_{:t}, \bold{\tilde{U}}_{:t}, \bold{\tilde{H}}_{:t})\\ \bold{\beta}(\bold{h}, \bold{\tilde{u}}, \bold{\tilde{h}}) & = [\bold{h}; \bold{\tilde{u}}; \bold{h} \circ \bold{\tilde{u}}; \bold{h} \circ \bold{\tilde{h}}] \end{aligned}

$d_{\bold{G}}$ is the output dimension of $\bold{\beta}$; with the definition above, $d_{\bold{G}} = 8d$.
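
Since every one of the four concatenated pieces is $2d \times T$, the $8d$ figure can be confirmed with a few lines of NumPy. The inputs below are random placeholders, and `merge_attention` is a hypothetical name for $\bold{\beta}$ applied to all columns at once.

```python
import numpy as np


def merge_attention(H, U_tilde, H_tilde):
    """Stack [h; u~; h * u~; h * h~] for every context position."""
    return np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)


d, T = 3, 5
rng = np.random.default_rng(0)
H, U_tilde, H_tilde = (rng.normal(size=(2 * d, T)) for _ in range(3))
G = merge_attention(H, U_tilde, H_tilde)
print(G.shape)   # (24, 5), i.e. (8d, T)
```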

5. Modeling Layer

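The input to the modeling layer is $\bold{G}$, the query-aware representation of the context words. This differs from the output of the contextual embedding layer, which encodes the context independently of the query. Passing $\bold{G}$ through a BiLSTM yields a $2d$-dimensional vector per position, giving the matrix $\bold{M} \in \Bbb{R}^{2d \times T}$.

Open question: why does the figure show three inputs here? Besides g, what are the other two?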

6. Output Layer

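Different tasks call for different output layers. For reading comprehension, the model has to predict the start and end positions of the answer span.

Start position: concatenate $\bold{G}$ and $\bold{M}$, apply a linear layer, and take a softmax over the $T$ positions.
End position: pass $\bold{M}$ through another BiLSTM to get $\bold{M}^2$, concatenate it with $\bold{G}$, apply a linear layer, and take a softmax.

\begin{aligned} \bold{p}^1 & = \mathrm{softmax}(\bold{w}_{(\bold{p}^1)}^{\top}[\bold{G};\bold{M}])\\ \bold{p}^2 & = \mathrm{softmax}(\bold{w}_{(\bold{p}^2)}^{\top}[\bold{G};\bold{M}^2]) \end{aligned}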

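Training loss and test-time inference

The loss for each example is the sum of the negative log-likelihoods of the true start and end positions; the total loss is the average over all examples.

L(\theta) = - \frac{1}{N}\sum_{i=1}^N \left[ \log (\bold{p}_{y_i^1}^1) + \log (\bold{p}_{y_i^2}^2) \right]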
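
As a small numeric illustration of this loss, the sketch below takes per-example probability lists that are assumed to be already softmax-normalized; `span_loss` is a name made up for this example.

```python
import math


def span_loss(p_start, p_end, y_start, y_end):
    """Average negative log-likelihood of the gold start and end indices.

    p_start[i], p_end[i]: probability lists over positions for example i
    y_start[i], y_end[i]: gold start / end indices for example i
    """
    n = len(y_start)
    total = 0.0
    for i in range(n):
        total += math.log(p_start[i][y_start[i]]) + math.log(p_end[i][y_end[i]])
    return -total / n
```

A model that puts probability 1 on both gold positions of every example gets a loss of 0; any probability mass placed elsewhere makes the loss positive.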

At test time, the span $(k,l)$ with $k \leq l$ that maximizes $\bold{p}_k^1\bold{p}_l^2$ can be found in linear time with dynamic programming.
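
One way to get the linear-time behaviour is a single left-to-right scan that remembers the best start position seen so far. This is a sketch of that idea, not necessarily how the authors implemented it, and `best_span` is an illustrative name.

```python
def best_span(p_start, p_end):
    """Return ((k, l), score) maximizing p_start[k] * p_end[l] subject to k <= l."""
    best, best_score = (0, 0), -1.0
    best_start = 0                      # argmax of p_start over positions <= l
    for l in range(len(p_end)):
        if p_start[l] > p_start[best_start]:
            best_start = l
        score = p_start[best_start] * p_end[l]
        if score > best_score:
            best, best_score = (best_start, l), score
    return best, best_score


span, score = best_span([0.1, 0.6, 0.3], [0.5, 0.1, 0.4])
print(span)   # (1, 2)
```

For each end position `l`, the best partner is the largest start probability at or before `l`, so one running maximum is enough and no pair of positions is ever revisited.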

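Results

Classic Model 2: DrQA

DrQA consists of two components:

Document Retriever: uses bigram hashing and TF-IDF to find the 5 most relevant articles.
Document Reader: uses a multi-layer recurrent neural network to extract answer spans from those articles.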

Document Retriever

A simple inverted index lookup followed by term vector model scoring performs quite well on this task for many question types, compared to the built-in ElasticSearch based Wikipedia Search API. Articles and questions are compared as TF-IDF weighted bag-of-word vectors. We further improve our system by taking local word order into account with n-gram features. Our best performing system uses bigram counts while preserving speed and memory efficiency by using the hashing of Weinberger et al. 2009 to map the bigrams to 2^24 bins with an unsigned murmur3 hash.

In short: retrieval needs no learned model. Questions and articles are both turned into TF-IDF-weighted bag-of-words vectors and compared; bigram features add local word order; and hashing the bigrams into 2^24 bins keeps the index fast and small.
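
The retrieval recipe can be sketched with the standard library alone. Two things here are assumptions made for the example, not DrQA's actual choices: `zlib.crc32` stands in for the murmur3 hash (which is not in the Python standard library), and the TF-IDF weighting is one common smoothed variant. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_BINS = 2 ** 24


def hashed_ngrams(text):
    """Count unigrams and bigrams after hashing each one into NUM_BINS bins."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # crc32 is a stand-in for murmur3 (assumption of this sketch).
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BINS for g in grams)


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dict: bin -> weight) and the idf table."""
    counts = [hashed_ngrams(d) for d in docs]
    df = Counter(b for c in counts for b in c)          # document frequency per bin
    n = len(docs)
    idf = {b: math.log((n + 1) / (f + 1)) + 1.0 for b, f in df.items()}
    vecs = [{b: math.log1p(tf) * idf[b] for b, tf in c.items()} for c in counts]
    return vecs, idf


def score(query, doc_vec, idf):
    """Dot product between the query's TF-IDF vector and a document vector."""
    q = hashed_ngrams(query)
    return sum(math.log1p(tf) * idf.get(b, 0.0) * doc_vec.get(b, 0.0)
               for b, tf in q.items())


docs = ["the cat sat on the mat", "stock markets fell sharply today"]
vecs, idf = tfidf_vectors(docs)
scores = [score("cat on the mat", v, idf) for v in vecs]
```

Ranking the documents by `scores` and keeping the top 5 would mirror the retriever's role; a real system would store the vectors in a sparse matrix rather than in Python dicts.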

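Document Reader

Extracts a short answer span from the article.

Paragraph encoding: four embedding features are concatenated for each paragraph token and fed into a BiLSTM. The four features are:

Word Embeddings: 300-dimensional GloVe vectors. During training the embeddings are kept fixed, except for the 1000 most frequent question words (such as what, how, which), which are fine-tuned;
Exact Match: three simple binary features indicating whether the token exactly matches some question word in its original, lowercase, or lemma form;
Token Features: hand-crafted features, namely POS tags, NER tags, and normalized term frequency (TF);
Aligned Question Embedding: for each paragraph token, attention weights over the question tokens are computed from dot products and normalized; the weighted sum of the question word embeddings is that paragraph token's aligned question embedding.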
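
The aligned question embedding can be sketched in NumPy as below. Following the DrQA paper, the dot product is taken between ReLU-projected word embeddings; the projection weights `W` and `bias` here are random placeholders, and the function name is invented for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)


def aligned_question_embedding(P, Q, W, bias):
    """P: (m, d) paragraph word embeddings, Q: (l, d) question word embeddings.

    Returns an (m, d) matrix: one attention-weighted mix of question
    embeddings per paragraph token.
    """
    def project(E):
        return np.maximum(E @ W + bias, 0.0)    # single dense layer with ReLU

    scores = project(P) @ project(Q).T          # (m, l) dot-product similarities
    attn = softmax(scores, axis=1)              # normalize over question tokens
    return attn @ Q                             # (m, d)


rng = np.random.default_rng(0)
P = rng.normal(size=(6, 4))
Q = rng.normal(size=(3, 4))
W, bias = rng.normal(size=(4, 5)), np.zeros(5)
aligned = aligned_question_embedding(P, Q, W, bias)
```

Unlike the binary exact-match features, this gives a soft alignment, so a paragraph token can pick up signal from similar but non-identical question words.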

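Question encoding: the word embeddings of the question tokens are fed into a separate BiLSTM, and the resulting hidden states $\bold{q}_j$ are combined into a single vector using a learned weight vector $\bold{w}$:

\begin{aligned} \bold{q} & = \sum_j b_j\bold{q}_j\\ b_j & = \frac{\exp(\bold{w} \cdot \bold{q}_j)}{\sum_{j'}\exp(\bold{w} \cdot \bold{q}_{j'})} \end{aligned}

In an era awash with attention, it shows up everywhere; ordinarily one would simply take the average here.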
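
The relation between this weighted pooling and a plain average is easy to see in a short NumPy sketch (the names are illustrative, and the hidden states are random placeholders for BiLSTM outputs):

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)


def encode_question(Q_hidden, w):
    """Q_hidden: (l, h) hidden states q_j, w: (h,) learned weight vector."""
    b = softmax(Q_hidden @ w)       # (l,): importance of each question token
    return b @ Q_hidden             # (h,): weighted sum of the hidden states


rng = np.random.default_rng(0)
Q_hidden = rng.normal(size=(4, 6))
q_uniform = encode_question(Q_hidden, np.zeros(6))
```

With `w` set to zero all weights are equal and the result is exactly the mean of the hidden states, so averaging is the special case the learned vector starts from.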

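Training loss and test-time inference: given the paragraph vectors $\{\bold{p}_1, ..., \bold{p}_m\}$ and the question vector $\bold{q}$, two independent classifiers are trained to predict the start and end positions. Specifically, a bilinear term captures the similarity between $\bold{p}_i$ and $\bold{q}$, and the probability of each token being the start or the end is:

\begin{aligned} P_{start}(i) & \propto \exp(\bold{p}_i\bold{W}_s\bold{q})\\ P_{end}(i) & \propto \exp(\bold{p}_i\bold{W}_e\bold{q}) \end{aligned}

At prediction time, the best span $(i, i')$ is the one with $i \leq i' \leq i+15$ that maximizes $P_{start}(i) \times P_{end}(i')$.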
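
Because the span length is capped, a brute-force search over the window is already cheap. The sketch below assumes the two probability lists are given; `best_span_bounded` and its `max_len` parameter are names introduced for illustration.

```python
def best_span_bounded(p_start, p_end, max_len=15):
    """Return ((i, j), score) maximizing p_start[i] * p_end[j], i <= j <= i + max_len."""
    n = len(p_start)
    best, best_score = (0, 0), -1.0
    for i in range(n):
        # Only end positions inside the allowed window are considered.
        for j in range(i, min(i + max_len, n - 1) + 1):
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


p_start = [0.7, 0.2, 0.1]
p_end = [0.1, 0.1, 0.8]
print(best_span_bounded(p_start, p_end)[0])              # (0, 2)
print(best_span_bounded(p_start, p_end, max_len=1)[0])   # (1, 2)
```

The second call shows the effect of the cap: once the highest-scoring span becomes too long, the search falls back to the best span that fits in the window.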

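Experimental Results

Classic Model 3: ORQA

The retriever and reader are trained jointly from question-answer pairs alone, without relying on an existing retrieval system, so the evidence retrieved from all of Wikipedia is treated as a latent variable. Since learning this from scratch is impractical, the retriever is pre-trained with the Inverse Cloze Task.

One advantage of ORQA is that it can retrieve any text from an open corpus, rather than being limited to the closed set returned by a black-box IR system.

Retriever Component: the retrieval score is defined as the inner product of the dense vector representations of the question q and the evidence block b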

S_{retr}(b, q) = h_q^{\top}h_b\\ h_q = \bold