A BERT-based Unsupervised Grammatical Error Correction Framework


Grammar is the set of rules that governs how words are combined into well-formed sentences.

1. Model (multi-class classification)

The BERT-based unsupervised GEC framework is composed of three modules: 1) a data flow construction module, 2) a sentence perplexity scoring module, and 3) an error detection and correction module.

  • An unsupervised GEC framework that does not depend on any annotated data.
  • A pseudo-perplexity scoring method to evaluate the likely validity of a sentence.
  • Construction of an evaluation corpus for Tagalog GEC.
  • Experiments on the Tagalog corpus we constructed and an open-source Indonesian corpus.

1) Data flow construction module

Data flow construction: POS tags and confusing words are used to replace a designated token, constructing the input $I$, which serves as the input to the sentence perplexity scoring module.

image.png
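As a concrete illustration, the sketch below builds such a candidate set for one target token. The contents of `CONFUSION_SET` and the `build_candidates` helper are illustrative assumptions, not the paper's exact procedure.

```python
# Build the candidate input set I: for one target token, generate one
# candidate sentence per word in its confusion set.
from typing import List

# Hypothetical confusion set for one POS category; the paper derives these
# from a POS-tagged corpus, this toy set is only for illustration.
CONFUSION_SET = {"preposition": ["sa", "ng", "para"]}

def build_candidates(tokens: List[str], idx: int, pos: str) -> List[List[str]]:
    """Replace tokens[idx] with each confusable word of the same POS."""
    candidates = []
    for word in CONFUSION_SET.get(pos, []):
        variant = tokens.copy()
        variant[idx] = word
        candidates.append(variant)
    return candidates

# Example: candidate inputs for the 2nd token of a toy sentence.
print(build_candidates(["pumunta", "sa", "palengke"], 1, "preposition"))
```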

2) Sentence perplexity scoring module

Text fluency: N-grams [5,14], RNNLM [12, 17]

Text perplexity: multi-order perplexity scoring

  • First-order perplexity (a single masked token): a hidden state is generated from the words outside the mask, and a softmax layer converts the hidden state into a probability distribution over all words. image.png

  • Second-order perplexity (multiple masked tokens): usually two consecutive tokens are masked; for the first and last words, only a single token is masked. image.png

  • Fusing first-order and second-order perplexity:

image.png
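The sketch below scores a sentence with first-order pseudo-perplexity (one masked token at a time), assuming the HuggingFace `transformers` library; the choice of `bert-base-multilingual-cased` and the averaging details are assumptions, and the paper's second-order masking and fusion weights are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """First-order PPL: mask each token in turn and average the negative
    log-likelihood the masked LM assigns to the original token."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    # Positions 0 and -1 are [CLS] and [SEP]; never mask them.
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    # Exponentiate the mean NLL to obtain a perplexity-style score.
    return float(torch.exp(torch.tensor(sum(nlls) / len(nlls))))
```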

3) Error detection and correction module

  1. The candidate inputs in $I$ are sorted by their pseudo-perplexity scores.
  2. The confusion word $c_{pred}$ is taken from the candidate sentence $X'_1$ with the lowest predicted perplexity score.
  3. The model then checks whether $c_{pred}$ equals $x_{candidate}$. If they are the same, the original sentence is judged correct; if they differ, the model replaces $x_{candidate}$ with $c_{pred}$ to obtain the correction (see the sketch after this list).
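A minimal sketch of this detect-and-correct step, reusing `pseudo_perplexity` and `build_candidates` from the sketches above; the names `c_pred` and `x_candidate` follow the paper's notation, while the control flow is an assumption. It also assumes the original word is itself a member of the confusion set.

```python
def correct(tokens, idx, pos):
    """Pick the confusion word whose sentence has the lowest PPL;
    replace the original token only if the prediction differs."""
    x_candidate = tokens[idx]
    candidates = build_candidates(tokens, idx, pos)
    # Step 1: sort candidate sentences by pseudo-perplexity.
    scored = sorted(candidates, key=lambda t: pseudo_perplexity(" ".join(t)))
    # Step 2: take the confusion word from the best-scoring sentence.
    c_pred = scored[0][idx]
    # Step 3: unchanged prediction means the original sentence is correct.
    if c_pred == x_candidate:
        return tokens
    corrected = tokens.copy()
    corrected[idx] = c_pred
    return corrected
```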

2. Background

When computing pseudo-perplexity (PPL), the conditional probability is the key ingredient. Below is a brief, beginner-friendly introduction to how conditional probabilities are computed.

What is a conditional probability?

The conditional probability $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ denotes the probability of generating the next word $w_i$ given all preceding words.

How is a conditional probability computed?

  1. Choose a model

    • Use a language model (e.g., an N-gram model, an RNN, an LSTM, or a Transformer) to compute these conditional probabilities.
  2. N-gram models

    • In an N-gram model, conditional probabilities are estimated by counting. For example, in a 2-gram (bigram) model the conditional probability is

      $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$

      where $C(w_{i-1}, w_i)$ is the number of times $w_{i-1}$ and $w_i$ occur together, and $C(w_{i-1})$ is the total number of occurrences of $w_{i-1}$. (A minimal counting sketch follows this list.)

  3. Neural network models

    • In neural models such as RNNs or LSTMs, the conditional probability is computed at the output layer: the model produces a hidden state from the preceding words, and a softmax layer turns that hidden state into a probability distribution over the vocabulary:

      $P(w_i \mid w_1, w_2, \ldots, w_{i-1}) = \mathrm{softmax}(h_i)$

      where $h_i$ is the current hidden state.
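A minimal counting sketch of the bigram estimate above; the toy corpus and whitespace tokenization are placeholders.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def cond_prob(prev: str, word: str) -> float:
    """P(word | prev) = C(prev, word) / C(prev)."""
    if unigram[prev] == 0:
        return 0.0
    return bigram[(prev, word)] / unigram[prev]

print(cond_prob("the", "cat"))  # 2/3: "the" occurs 3 times, "the cat" twice
```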

Summary

How a conditional probability is computed depends on the model in use: N-gram models rely on counting statistics, while neural network models predict from learned representations. With these concepts in place, you will be better able to compute pseudo-perplexity.

POS: part-of-speech tagging

3. Related Work

1) Data synthesis methods for unsupervised GEC
  • pseudo-labeling: LM-Critic and the BIFI algorithm.
  • confusion functions and confusion sets collected from a spellchecker.
2) Language model scoring for unsupervised GEC

The LM-scoring-based GEC approach assumes that low-probability sentences are more likely to contain grammatical errors than high-probability ones; the GEC system uses language model probabilities to decide how to transform the former into the latter.

  • LM-based approaches with rules [1]
  • error templates: an error template is a regular expression used to detect text errors [20]
  • delta-log-perplexity for example scoring [10]
  • multi-class classification [11]
3) Sentence scoring based on pre-trained models

In recent years, BERT's ability to represent sentence fluency has been increasingly exploited.

  • pseudo-log-likelihood scores (PLLs) for sentence scoring [12, 13]

4. Datasets

Tagalog [9]

  • We sorted the vocabulary corresponding to every POS tag. From the Tagalog POS-tagged corpus created by Lin et al. [9], we removed POS categories that contain very large vocabularies, as well as POS categories that do not fit Tagalog grammar (i.e., error types that should not exist in Tagalog grammar). As shown in Table 1, we ultimately retained confusion sets for 8 categories of Tagalog grammatical errors, and invited native Tagalog experts to review them.

  • The construction of our corpus follows that of the Indonesian GEC corpus built by Lin et al. [11]. The final Tagalog GEC corpus contains 12,953 samples; its statistics are shown in Table 3.

image.png

image.png

image.png

5. Baselines

  • IndoGEC [11]

6. Experimental Results

Metrics: micro-averaged accuracy, micro-averaged recall, the micro-averaged F0.5 score ($F_{0.5}^{micro}$), and macro-averaged accuracy.

image.png

Ablation study: removing the second-order perplexity slightly improves performance on some categories.

Experiments on recommending multiple candidate words: 1) multi-order perplexity, 2) removing only the first-order perplexity, 3) removing only the second-order perplexity. The Hit@K curves can be used to find the smallest workable K, which reduces computational overhead to some extent, but the curves of the three methods show no significant differences. (A sketch of the Hit@K metric follows.)
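A minimal sketch of how a Hit@K curve could be computed from ranked candidate lists; the data format is an assumption, not the paper's evaluation code.

```python
def hit_at_k(ranked_lists, gold_words, k):
    """Fraction of examples whose gold word appears in the top-k candidates."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_lists, gold_words))
    return hits / len(gold_words)

# Toy example: two test items with their ranked candidates and gold answers.
ranked = [["sa", "ng", "para"], ["ng", "sa", "para"]]
gold = ["ng", "ng"]
print([hit_at_k(ranked, gold, k) for k in (1, 2, 3)])  # [0.5, 1.0, 1.0]
```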

image.png

Experiment on the Indonesian dataset: our method underperforms IndoGEC on the two highly confusable categories, conjunctions and prepositions, which suggests that once a confusion set grows beyond a certain size, the BERT model can no longer distinguish the words well.

References

[1] Language Model Based Grammatical Error Correction without Annotated Training Data. (unsupervised)

[9] Pre-trained Language Models for Tagalog with Multi-source Data. (a baseline comparison that was omitted)

[11] A Framework for Indonesian Grammar Error Correction

Potential Limitations and Outlook

  • Only a single comparison method (IndoGEC) is used, which weakens the persuasiveness of the results.
  • The first-order perplexity plays the decisive role, so the multi-order perplexity computation looks somewhat redundant (see the Table 5 analysis: removing the second-order PPLs has only a negligible effect).
  • The multiple-candidate-word experiments may suffer from multiple-acceptability issues and need further analysis.
  • A stronger, more general method is needed: distinguishing the two highly confusable categories, conjunctions and prepositions, remains difficult.
  • Tables 5 and 6 should use a consistent presentation.