Heterogeneous Recycle Generation for Chinese Grammatical Error Correction
- Grammatical error correction (GEC) is the task of correcting grammatical and spelling errors that appear in a sentence.
- The NLPCC 2018 shared task (Zhao et al., 2018) was the first to focus on Chinese GEC (CGEC).
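1. Model (character-level tokenization; focuses on four edit-level error types)
- Our idea is to leverage the advantages of both neural machine translation (NMT) models and sequence editing models (SEM). NMT handles large-scale rewriting of a sentence, such as reordering; SEM handles small-scale corrections, such as inserting or deleting characters.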
- Our GEC system is composed of three separate components: a neural machine translation system, a sequence editing system, and a spell-checker.
- The system performs recycle generation, with one model trained to do translation and another trained to do sequence editing.
1.1 Correction process:
1.2 LaserTagger processing: (1) encode the input; (2) assign each token its edit tag; (3) convert the tags into the corrected output sentence
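Step (3) can be illustrated with a minimal sketch. It assumes a simplified tag format in the spirit of LaserTagger: a base tag `KEEP` or `DELETE`, optionally followed by `|phrase`, where the phrase is inserted before the current token. The function name `realize` and the example sentence are hypothetical and not taken from the paper or the LaserTagger code base.

```python
def realize(tokens, tags):
    """Apply one edit tag per token and return the edited sentence.

    Assumed tag format (illustrative): "KEEP", "DELETE", "KEEP|phrase",
    or "DELETE|phrase"; the phrase is inserted before the current token.
    """
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    output = []
    for token, tag in zip(tokens, tags):
        base, _, phrase = tag.partition("|")
        if base not in ("KEEP", "DELETE"):
            raise ValueError(f"unknown base tag: {base}")
        if phrase:
            output.append(phrase)  # added phrase goes in front of the token
        if base == "KEEP":
            output.append(token)   # DELETE simply drops the token
    return "".join(output)         # character-level tokens need no separator


# Hypothetical example: the verb "去" is missing from the source sentence.
source = list("他昨天学校了")
tags = ["KEEP", "KEEP", "KEEP", "KEEP|去", "KEEP", "KEEP"]
corrected = realize(source, tags)
print(corrected)  # 他昨天去学校了
```

Because every output token is either copied from the input or drawn from the phrase vocabulary, the tagger can only insert phrases it has seen, which is the limitation these notes attribute to a small phrase vocabulary.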
1.3 Recycle generation settings (four variants):
2. Background
- NMT: typically relies on parallel corpora.
- Recycle generation, i.e., iterative decoding (Lichtarge et al., 2018).
- Sequence editing models, also known as text-editing models, learn to edit a sequence by applying a fixed set of operations to the input.
- LaserTagger (Malmi et al., 2019): a tool for sequence editing (limitation: a small phrase vocabulary).
- OpenCC: converts traditional Chinese characters to simplified Chinese characters.
- ERRANT: adapted for Chinese sentences; used to align the original sentences with their corrections.
- Edit extraction: the per-sentence edits (annotations) are extracted in three steps: tokenization, alignment, and merging (the Cilin thesaurus, Mei et al., 1996).
- SE - NMT: denotes NMT followed by SE.
- Homogeneous generation: NMT or SE.
- Heterogeneous generation: (NMT + SE) or (SE + NMT).
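The difference between homogeneous and heterogeneous recycle generation can be sketched as a loop that feeds each output back in as the next input. This is a minimal illustration under stated assumptions: `recycle_generate`, `toy_nmt`, and `toy_se` are hypothetical names, the two toy functions are string-replacement stand-ins for trained models, and the stopping rule (stop when a full round changes nothing, or after `max_rounds`) is an assumption rather than the paper's setting.

```python
from typing import Callable, Sequence

Corrector = Callable[[str], str]


def recycle_generate(sentence: str, models: Sequence[Corrector],
                     max_rounds: int = 5) -> str:
    """Apply the models in order, round after round, until a round is a no-op."""
    current = sentence
    for _ in range(max_rounds):
        previous = current
        for model in models:          # list order is application order
            current = model(current)
        if current == previous:       # converged: nothing left to correct
            break
    return current


# Toy stand-ins (NOT real models): one reorders, one deletes a repeated character.
def toy_nmt(s: str) -> str:
    return s.replace("学校去", "去学校")


def toy_se(s: str) -> str:
    return s.replace("了了", "了")


text = "我学校去了了了"
homogeneous = recycle_generate(text, [toy_se])             # SE only
heterogeneous = recycle_generate(text, [toy_nmt, toy_se])  # NMT followed by SE
print(homogeneous)    # 我学校去了
print(heterogeneous)  # 我去学校了
```

In this toy case the SE-only run needs two rounds to remove the repeated character and never fixes the word order, while the mixed run fixes both, which is the intuition behind combining the two model types.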
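3. Evaluation metrics:
- The MaxMatch (M2) scorer (Dahlmeier and Ng, 2012) can only evaluate the overall performance of a model.
- The ERRANT scorer (Bryant et al., 2017) can evaluate edit-level operations as well as specific English grammatical error types, so it is adapted for CGEC.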
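Both scorers are typically reported as precision, recall, and F0.5 over edits. The sketch below shows that arithmetic for a single gold annotation; it is not the M2 or ERRANT implementation. The edit representation `(start, end, replacement)`, the function names, and the convention of returning 1.0 for an empty denominator are assumptions made for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Return (precision, recall, F_beta) from edit counts."""
    # Assumed convention: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


def score_edits(hyp_edits, gold_edits, beta: float = 0.5):
    """Compare hypothesis edits with gold edits by exact match."""
    hyp, gold = set(hyp_edits), set(gold_edits)
    tp = len(hyp & gold)   # proposed and correct
    fp = len(hyp - gold)   # proposed but not in the gold annotation
    fn = len(gold - hyp)   # gold edits the system missed
    return f_beta(tp, fp, fn, beta)


# Hypothetical edits: (start, end, replacement) over character positions.
gold = [(3, 3, "去"), (6, 7, "")]
hyp = [(3, 3, "去")]
precision, recall, f05 = score_edits(hyp, gold)
print(precision, recall, round(f05, 4))  # 1.0 0.5 0.8333
```

With beta = 0.5, precision is weighted more heavily than recall, so a system that proposes fewer but correct edits scores higher than one that proposes many uncertain edits.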
3.1 ERRANT for Chinese GEC
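The three extraction steps named in the background notes (tokenization, alignment, merging) can be sketched with the standard library. This is a rough stand-in, not ERRANT: `difflib.SequenceMatcher` replaces ERRANT's alignment, its opcodes already group contiguous changed characters (a very simple form of merging), and the function name `extract_edits` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def extract_edits(source: str, target: str):
    """Return a list of (operation, start, end, replacement) edits."""
    src, tgt = list(source), list(target)      # 1. character-level tokenization
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():   # 2. alignment
        if op == "equal":
            continue
        # 3. each opcode covers one contiguous changed span (simple merging)
        edits.append((op, i1, i2, "".join(tgt[j1:j2])))
    return edits


# Hypothetical pair: one missing character and one redundant character.
edits = extract_edits("他昨天学校了了", "他昨天去学校了")
print(edits)  # [('insert', 3, 3, '去'), ('delete', 6, 7, '')]
```

The start and end offsets index the source sentence, so an insertion has an empty source span and a deletion has an empty replacement.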
4. Dataset
- NLPCC 2018 shared task dataset (Zhao et al., 2018)
- Since an official validation set is not provided, we randomly select 5,000 pairs from the training set to serve as a validation set.
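The random hold-out described above can be sketched as follows. The function name `split_train_valid`, the fixed seed, and the toy data are assumptions; the notes state only that 5,000 pairs are selected at random from the training set.

```python
import random


def split_train_valid(pairs, n_valid=5000, seed=0):
    """Randomly hold out n_valid (source, target) pairs as a validation set."""
    if n_valid > len(pairs):
        raise ValueError("not enough pairs for the requested validation size")
    rng = random.Random(seed)             # assumed seed, for reproducibility
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    valid_idx = set(indices[:n_valid])
    valid = [pairs[i] for i in indices[:n_valid]]
    train = [p for i, p in enumerate(pairs) if i not in valid_idx]
    return train, valid


# Toy stand-in for the real sentence pairs.
toy_pairs = [(f"src {i}", f"tgt {i}") for i in range(20000)]
train, valid = split_train_valid(toy_pairs)
print(len(train), len(valid))  # 15000 5000
```

Fixing the seed keeps the validation set identical across runs, so that scores from different model configurations stay comparable.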
5. References
- Qiu and Qu (2019): heterogeneous recycle generation and a spell-checker.