Chinese Grammatical Error Correction Using Pre-trained Models and Pseudo Data (ACM Transactions on Asian and Low-Resource Language Information Processing 2023)
The optimal setting for using pseudo data differs across languages, so English GEC methods cannot be transferred directly to the Chinese GEC task [20]. Our code is available at github.com/wang1369065…
Motivation: However, few studies have examined the role of pre-trained models and pseudo data in the Chinese GEC task.
1. Models
- We develop Chinese GEC models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART. (BART + Lang-8 (MaskGEC) is the ideal setting, since Wiki (MaskGEC) uses 10× more pseudo data than Lang-8 (MaskGEC).)
- Word-level errors dominate all error types, and word selection errors must be addressed.
- Training: we initialize the Transformer encoder with the parameters of Chinese BERT, while the decoder is initialized randomly; we then train the initialized model on the Chinese GEC data.
- Chinese T5 and Chinese BART are both encoder-decoder architectures.
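The BERT-encoder initialization described above can be sketched in plain PyTorch. This is a hypothetical sketch, not the paper's exact configuration: the `build_gec_model` helper, layer counts, and dimensions are illustrative assumptions; in practice the pre-trained encoder would be a loaded Chinese BERT body.

```python
import torch
import torch.nn as nn

def build_gec_model(pretrained_encoder: nn.Module,
                    vocab_size: int = 21128, d_model: int = 768) -> nn.Module:
    """Seq2seq GEC model: the encoder reuses pre-trained (Chinese BERT)
    weights passed in by the caller; the Transformer decoder, target
    embedding, and LM head are all randomly initialized."""
    layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=12,
                                       dim_feedforward=3072, batch_first=True)

    class GECModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = pretrained_encoder               # BERT-initialized
            self.tgt_embed = nn.Embedding(vocab_size, d_model)  # random init
            self.decoder = nn.TransformerDecoder(layer, num_layers=6)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, src_ids, tgt_ids):
            memory = self.encoder(src_ids)                  # (B, S, d_model)
            tgt = self.tgt_embed(tgt_ids)
            # Causal mask so each target position only attends to the past.
            t = tgt_ids.size(1)
            causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=causal)
            return self.lm_head(hidden)                     # (B, T, vocab)

    return GECModel()
```

The whole model (BERT-initialized encoder plus random decoder) is then fine-tuned end to end on the Chinese GEC pairs.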
2. Background
Grammatical error correction (GEC) is the task of correcting a variety of grammatical errors in text written typically by non-native speakers.
- Two key ingredients for strong GEC systems: pre-trained models [11, 12, 23] and pseudo data [14, 34].
1) Three Chinese pre-trained models: Chinese BERT, Chinese T5, and Chinese BART
2) Two ways of generating pseudo data for Chinese GEC: the rule-based method (MaskGEC) and the backtranslation method
- Rule-based [39]: (1) the selected token is replaced with a padding symbol; (2) the selected token is replaced with a random token from the vocabulary; (3) the selected token is replaced with a vocabulary token sampled according to frequency; (4) the selected token is replaced with a homophone sampled according to frequency. (There is no obvious difference between the static and dynamic masking strategies.)
- Backtranslation [27]: originally proposed to generate pseudo data for machine translation; here, a back-translation model (mapping correct text to erroneous text) is first trained on the Chinese GEC training data and then used to corrupt clean sentences.
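The four rule-based substitutions can be sketched as follows. This is a minimal sketch in the spirit of MaskGEC [39], assuming illustrative data structures: `vocab` maps tokens to frequencies, `homophones` maps a token to candidate homophones with frequencies, and `mask_prob` is an assumed selection probability, not the paper's setting.

```python
import random

def maskgec_noise(tokens, vocab, homophones, mask_prob=0.3, pad="[PAD]"):
    """Corrupt a clean token sequence with one of four substitution rules,
    producing a (noisy, clean) pseudo training pair for GEC."""
    words, freqs = zip(*vocab.items())  # pool for frequency-weighted sampling
    noisy = []
    for tok in tokens:
        if random.random() >= mask_prob:        # most tokens stay untouched
            noisy.append(tok)
            continue
        rule = random.randint(1, 4)
        if rule == 1:                           # (1) padding symbol
            noisy.append(pad)
        elif rule == 2:                         # (2) random vocabulary token
            noisy.append(random.choice(words))
        elif rule == 3:                         # (3) vocab token by frequency
            noisy.append(random.choices(words, weights=freqs, k=1)[0])
        else:                                   # (4) homophone by frequency
            cands = homophones.get(tok)
            if cands:
                hw, hf = zip(*cands.items())
                noisy.append(random.choices(hw, weights=hf, k=1)[0])
            else:                               # no homophone known: keep token
                noisy.append(tok)
    return noisy
```

Applying this to a large clean corpus (e.g., Wikipedia sentences) yields noisy inputs paired with the originals as targets; the dynamic-masking variant simply re-runs the corruption each epoch.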
3. Datasets
- We train and evaluate our models using the data provided by the NLPCC 2018 GEC shared task. (Training data: 1.2 million sentence pairs from Lang-8; Development data: 5,000 sentences from the training data; Test data: 2,000 sentences extracted from the PKU Chinese Learner Corpus. )
- Three kinds of pseudo data generated following [39]: Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation):
- Lang-8 (MaskGEC) <- corpus: hinative.com/lang-8
- Wiki (MaskGEC) <- corpus: nlp_chinese_corpus
- Wiki (Backtranslation) <- corpus: nlp_chinese_corpus
- Tool: wikipedia-corpus-creator
4. Experimental Results
- Error type distribution: following the annotation scheme of the HSK learner corpus, 66 sentences from the development set were annotated, yielding 100 errors in total (a sentence may contain more than one error), with types such as B, CC, CD, CQ, CJwo, CJ+, CJ-, and CJetc. Word-level errors (CC, CQ, and CD) are the most frequent.
Chinese GEC has its own characteristics: for example, spelling errors mainly stem from similarities in character shape and pronunciation, while sentence-level errors often depend on word position.
References
- [1] Yi Wang, Ruibin Yuan, Yan'gen Luo, Yufang Qin, NianYong Zhu, Peng Cheng, and Lihuan Wang. 2020. Chinese grammatical error correction based on hybrid models with data augmentation. In BEA, 78–86.
- Implemented baselines:
- awesome-transformer: github.com/ictnlp/awes….
- BERT-encoder-ChineseGEC: github.com/wang1369065….
- t5-pegasus-chinese: github.com/SunnyGJing/….
- CPT: github.com/fastnlp/CPT.
- Sentences are segmented using pkunlp: http://59.108.48.12/lcwm/pkunlp/downloads/libgrass-ui.tar.gz.
- the phrase-level edits computed using MaxMatch (M2): github.com/nusnlp/m2sc….