2020 COLING Heterogeneous Recycle Generation for CGEC

Heterogeneous Recycle Generation for Chinese Grammatical Error Correction

Grammatical error correction (GEC) is the task of correcting grammatical and spelling errors that appear in a sentence.
The NLPCC 2018 shared task (Zhao et al., 2018) was the first to focus on Chinese GEC (CGEC).

1. Model (character-level tokenization; focuses on four edit-level error types)

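Our idea is to leverage the advantages of both neural machine translation (NMT) models and sequence editing models (SEM). NMT handles large-scale rewriting of a sentence, such as reordering; SEM handles small local corrections, such as inserting or deleting characters.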
Our GEC system is composed of three separate components: a neural machine translation system, a sequence editing system, and a spell-checker.
The system performs recycle generation with one model trained to do translation and another model trained to do sequence editing.

1.1 Correction process:

1.2 LaserTagger processing: (1) encode the input; (2) assign an edit tag to each token; (3) apply the tags to produce the corrected sentence
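
Step (3) can be illustrated with a minimal Python sketch, assuming step (2) has already produced one tag per character. The tag scheme used here (a base tag `KEEP` or `DELETE`, optionally followed by `|` and a phrase to insert before the token) is a simplified assumption about LaserTagger-style tagging; the function name `apply_edit_tags` and the example sentence are hypothetical and not taken from the paper.

```python
from typing import List


def apply_edit_tags(tokens: List[str], tags: List[str]) -> str:
    """Realize the output sentence from per-token edit tags.

    Assumed tag format: "KEEP" or "DELETE", optionally followed by
    "|<phrase>", where <phrase> is inserted before the token.
    """
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    out = []
    for token, tag in zip(tokens, tags):
        base, _, phrase = tag.partition("|")
        if phrase:
            out.append(phrase)      # added phrase goes before the token
        if base == "KEEP":
            out.append(token)
        elif base != "DELETE":
            raise ValueError(f"unknown base tag: {base}")
    return "".join(out)             # character-level tokens, so no separator


# Hypothetical input: one duplicated character and one missing character.
tokens = list("我很很喜吃苹果")
tags = ["KEEP", "KEEP", "DELETE", "KEEP", "KEEP|欢", "KEEP", "KEEP"]
corrected = apply_edit_tags(tokens, tags)
```

Because every insertion must come from the phrase vocabulary, a small vocabulary limits what this kind of model can add, which matches the limitation noted for LaserTagger below.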

1.3 Recycle generation settings (four variants):

NMT: usually depends on parallel corpora.
Recycle generation (i.e., iterative decoding; Lichtarge et al., 2018).
Sequence editing models, also known as text-editing models, learn to edit a sequence by applying a fixed set of operations to the input.
LaserTagger (Malmi et al., 2019): a tool for sequence editing (limitation: small phrase vocabulary).
OpenCC: converts traditional Chinese characters to simplified Chinese characters.
ERRANT: adapted for Chinese sentences; aligns the original sentences.
- Edit Extraction: The per-sentence edits (annotations) are extracted in three steps: tokenization, alignment, and merging (the Cilin (Mei et al., 1996) thesaurus).
SE - NMT: denotes NMT followed by SE
Homogeneous Generation: NMT$^N$ or SE$^N$
Heterogeneous Generation: (NMT + SE)$^N$ or (SE + NMT)$^N$
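
The four settings above differ only in which correctors are chained and how many times. A minimal Python sketch under stated assumptions: `toy_nmt` and `toy_se` are hypothetical stand-ins for the trained models, `recycle` is an illustrative helper, and the stage list is read left to right (first element runs first), which may not match the paper's own notation order.

```python
from typing import Callable, List

Corrector = Callable[[str], str]


def recycle(stages: List[Corrector], rounds: int) -> Corrector:
    """Build a corrector that applies `stages` in order, repeated `rounds` times."""
    def run(sentence: str) -> str:
        for _ in range(rounds):
            for stage in stages:
                # Each stage's output is fed back in as the next input.
                sentence = stage(sentence)
        return sentence
    return run


# Toy stand-ins (hypothetical); a real system would call trained models here.
def toy_nmt(sentence: str) -> str:
    # Pretend large-scale rewrite: fix one known word-order error.
    return sentence.replace("吃饭我", "我吃饭")


def toy_se(sentence: str) -> str:
    # Pretend local edit: remove one doubled character per pass.
    for i in range(len(sentence) - 1):
        if sentence[i] == sentence[i + 1]:
            return sentence[:i] + sentence[i + 1:]
    return sentence


homogeneous_nmt = recycle([toy_nmt], rounds=2)          # NMT^2
homogeneous_se = recycle([toy_se], rounds=2)            # SE^2
heterogeneous = recycle([toy_nmt, toy_se], rounds=2)    # NMT then SE, twice

print(heterogeneous("吃饭我了了了"))  # 我吃饭了
```

In this toy case neither homogeneous chain fixes both error kinds, while the heterogeneous chain does, which is the motivation for mixing the two model types.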

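The MaxMatch (M2) scorer (Dahlmeier and Ng, 2012) can only evaluate overall model performance.
The ERRANT scorer (Bryant et al., 2017) can evaluate both edit-level operations and specific English grammatical error types, which is why it was adapted for CGEC.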
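
Edit-level evaluation depends on first extracting edits from a (source, correction) pair. The sketch below illustrates the tokenization, alignment, and merging steps at character level. It swaps in Python's standard `difflib.SequenceMatcher` for the alignment, which is not the alignment ERRANT itself uses and ignores any thesaurus-based similarity; `extract_edits` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (operation, source start, source end, corrected text)
Edit = Tuple[str, int, int, str]


def extract_edits(source: str, target: str) -> List[Edit]:
    """Extract character-level edits that turn `source` into `target`."""
    # Step 1: character-level tokenization.
    src_chars = list(source)
    tgt_chars = list(target)
    # Step 2: alignment (difflib stands in for a linguistic aligner).
    matcher = SequenceMatcher(a=src_chars, b=tgt_chars, autojunk=False)
    edits: List[Edit] = []
    # Step 3: merging is implicit here, since get_opcodes() already groups
    # contiguous differing characters into a single span.
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        edits.append((op, i1, i2, "".join(tgt_chars[j1:j2])))
    return edits


edits = extract_edits("我很很喜吃苹果", "我很喜欢吃苹果")
for op, start, end, text in edits:
    print(op, start, end, repr(text))
```

Comparing such edit lists between a system output and a reference is what allows scoring per operation type rather than only per sentence.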

NLPCC 2018 shared task dataset (Zhao et al., 2018)
Since an official validation set is not provided, we randomly select 5,000 pairs from the training set to serve as a validation set.
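
A minimal Python sketch of such a random hold-out split. The fixed `seed` is an assumption added for reproducibility (the notes do not state one), and `split_validation` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (erroneous sentence, corrected sentence)


def split_validation(
    pairs: Sequence[Pair], n_valid: int, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Randomly hold out `n_valid` pairs; return (train, valid)."""
    if n_valid > len(pairs):
        raise ValueError("n_valid exceeds the number of available pairs")
    rng = random.Random(seed)           # local RNG, seed is an assumption
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    valid_idx = set(indices[:n_valid])
    # Preserve the original order within each split.
    train = [p for i, p in enumerate(pairs) if i not in valid_idx]
    valid = [pairs[i] for i in sorted(valid_idx)]
    return train, valid


# Toy data; with the real training set one would pass n_valid=5000.
toy_pairs = [(f"src{i}", f"tgt{i}") for i in range(20)]
train_pairs, valid_pairs = split_validation(toy_pairs, n_valid=5)
```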