2020 COLING Heterogeneous Recycle Generation for CGEC

Heterogeneous Recycle Generation for Chinese Grammatical Error Correction

  • Grammatical error correction (GEC) is the task of correcting grammatical and spelling errors that appear in a sentence.
  • The NLPCC 2018 shared task (Zhao et al., 2018) was the first to focus on Chinese GEC (CGEC).

1. Model (character-level tokenization, focusing on four edit-level error types)

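  • Our idea is to leverage the advantages of both neural machine translation models (NMT) and sequence editing models (SEM). NMT handles large-scope rewrites of a sentence, such as reordering; SEM handles small-scope corrections, such as inserting or deleting characters.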
  • Our GEC system is composed of three separate components: a neural machine translation system, a sequence editing system, and a spell-checker.
  • The system performs recycle generation with one model trained to do translation and another model trained to do sequence editing.

1.1 Correction process [figure: image.png]

1.2 LaserTagger processing: (1) encode the input; (2) assign each token its edit tag; (3) convert the tags into the corrected output sentence [figure: image.png]
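
Step (3) above, turning per-token edit tags into a corrected sentence, can be sketched as follows. This is a minimal illustration rather than LaserTagger's actual code: the function name `apply_edit_tags` and the tag strings ("KEEP", "DELETE", optionally followed by "|" and a phrase to insert before the token) are assumptions made for this example.

```python
from typing import List


def apply_edit_tags(tokens: List[str], tags: List[str]) -> str:
    """Realize a corrected sentence from one edit tag per token.

    Assumed tag format (illustrative): "KEEP" or "DELETE", optionally
    followed by "|<phrase>", where <phrase> is inserted before the token.
    """
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    out = []
    for token, tag in zip(tokens, tags):
        base, _, phrase = tag.partition("|")
        if phrase:
            out.append(phrase)   # inserted text goes before the token
        if base == "KEEP":
            out.append(token)    # copy the token unchanged
        elif base != "DELETE":   # DELETE simply drops the token
            raise ValueError(f"unknown tag: {tag}")
    # Character-level tokens are joined without spaces.
    return "".join(out)


# Deleting a duplicated character.
deleted = apply_edit_tags(
    list("我昨天去了了学校"),
    ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "DELETE", "KEEP", "KEEP"],
)
# Inserting a missing character before "学".
inserted = apply_edit_tags(list("我学校"), ["KEEP", "KEEP|去", "KEEP"])
```

Because the output is built only from kept tokens plus phrases drawn from a fixed vocabulary, this kind of model suits small local fixes, which matches the role the notes assign to the sequence editing component.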

1.3 Recycle generation settings (four variants) [figure: image.png]

2. Background

  • NMT: usually relies on parallel corpora.
  • Recycle generation, i.e., iterative decoding (Lichtarge et al., 2018).
  • Sequence editing models, also known as text-editing models, learn to edit a sequence by applying a fixed set of operations to the input.
  • LaserTagger (Malmi et al., 2019), a tool for sequence editing (limits: small phrase vocabulary).
  • OpenCC: converts Traditional Chinese characters to Simplified Chinese characters.
  • ERRANT: adapted here for Chinese sentences; aligns the original sentence with its correction.
    • Edit extraction: the per-sentence edits (annotations) are extracted in three steps: tokenization, alignment, and merging (using the Cilin thesaurus (Mei et al., 1996)).
  • SE - NMT: denotes NMT followed by SE.
  • Homogeneous generation: NMT^N or SE^N.
  • Heterogeneous generation: (NMT + SE)^N or (SE + NMT)^N.
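
The homogeneous and heterogeneous settings above can be read as function composition repeated N times. The sketch below illustrates that reading only. `compose`, `recycle`, `toy_nmt`, and `toy_se` are hypothetical names, the two toy functions are string-replacement stand-ins for trained models, and running the stages of one round left to right is an assumption of this example.

```python
from functools import reduce
from typing import Callable

Corrector = Callable[[str], str]


def compose(*stages: Corrector) -> Corrector:
    """Chain correctors left to right: compose(a, b)(s) == b(a(s))."""
    def run(sentence: str) -> str:
        return reduce(lambda s, stage: stage(s), stages, sentence)
    return run


def recycle(step: Corrector, rounds: int) -> Corrector:
    """Apply `step` `rounds` times, feeding each output back in as input."""
    def run(sentence: str) -> str:
        for _ in range(rounds):
            sentence = step(sentence)
        return sentence
    return run


# Hypothetical stand-ins for the trained models.
def toy_nmt(sentence: str) -> str:
    return sentence.replace("学校去", "去学校")   # large-scope fix: reordering


def toy_se(sentence: str) -> str:
    return sentence.replace("了了", "了")         # small-scope fix: deletion


homogeneous = recycle(toy_nmt, 2)                      # NMT^2
heterogeneous = recycle(compose(toy_nmt, toy_se), 2)   # (NMT + SE)^2

source = "我昨天学校去了了"
homogeneous_out = homogeneous(source)
heterogeneous_out = heterogeneous(source)
```

In this toy case the homogeneous run repairs only the word order, while the heterogeneous run also removes the duplicated character, which is the intuition behind combining the two model types.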

3. Evaluation metrics

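  • The MaxMatch (M2) scorer (Dahlmeier and Ng, 2012) can only evaluate a model's overall performance.
  • The ERRANT scorer (Bryant et al., 2017) can evaluate both edit-level operations and specific English grammatical error types, which is why it was adapted for CGEC.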
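
To make "edit-level" evaluation concrete, the sketch below extracts character-level edits from a source/target pair and scores system edits against reference edits with precision, recall, and F0.5. It is a simplified illustration, not the ERRANT or M2 implementation: it uses Python's `difflib` for alignment instead of ERRANT's alignment and merging steps, and the names `extract_edits` and `f_beta` are made up for this example.

```python
from difflib import SequenceMatcher
from typing import Set, Tuple

Edit = Tuple[int, int, str]  # (start in source, end in source, replacement)


def extract_edits(source: str, target: str) -> Set[Edit]:
    """Return the non-equal spans of a character-level alignment as edits."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return {
        (i1, i2, target[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    }


def f_beta(hyp_edits: Set[Edit], ref_edits: Set[Edit], beta: float = 0.5):
    """Precision, recall and F_beta over exactly matching edits."""
    tp = len(hyp_edits & ref_edits)
    # Convention assumed here: an empty edit set scores 1.0.
    precision = tp / len(hyp_edits) if hyp_edits else 1.0
    recall = tp / len(ref_edits) if ref_edits else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


source = "我昨天去了了学校玩儿"
reference = "我昨天去了学校玩"       # two corrections
hypothesis = "我昨天去了学校玩儿"    # system made only one of them

p, r, f = f_beta(extract_edits(source, hypothesis),
                 extract_edits(source, reference))
```

F0.5 weights precision more heavily than recall, so a system that proposes few but correct edits is penalized less than one that proposes many wrong ones.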

3.1 ERRANT for Chinese GEC

4. Dataset

  • NLPCC 2018 shared task dataset (Zhao et al., 2018)
  • Since an official validation set is not provided, we randomly select 5,000 pairs from the training set to serve as a validation set.
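
The random hold-out described above can be sketched as below. The function name `split_validation`, the fixed seed, and the list-of-pairs data layout are assumptions for illustration; the notes do not say how the selection was actually implemented.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (erroneous sentence, corrected sentence)


def split_validation(pairs: List[Pair], n_valid: int = 5000, seed: int = 0):
    """Randomly hold out `n_valid` pairs; return (train, valid)."""
    if n_valid > len(pairs):
        raise ValueError("not enough pairs to hold out a validation set")
    rng = random.Random(seed)   # fixed seed (assumed) keeps the split reproducible
    shuffled = pairs[:]         # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[n_valid:], shuffled[:n_valid]


# Tiny synthetic stand-in for the training pairs.
toy_pairs = [(f"src{i}", f"tgt{i}") for i in range(20)]
train, valid = split_validation(toy_pairs, n_valid=5)
```

Fixing the seed means the same validation pairs are held out on every run, which keeps results comparable across experiments.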

5. References

  • Qiu and Qu (2019): heterogeneous recycle generation and a spell-checker