Chinese Grammatical Error Correction Using Pre-trained Models and Pseudo Data (ACM Transactions on Asian and Low-Resource Language Information Processing 2023)
The optimal setting for using pseudo data differs across languages, so English GEC methods cannot be transferred directly to the Chinese GEC task [20]. Our code is available at github.com/wang1369065…
Motivation: However, few studies have examined the role of pre-trained models and pseudo data in the Chinese GEC task.
1. Models
- We develop Chinese GEC models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART. (BART + Lang-8 (MaskGEC) is the ideal setting, since Wiki (MaskGEC) uses 10× more pseudo data than Lang-8 (MaskGEC).)
- Word-level errors dominate all error types, and word selection errors must be addressed.
- Training: we initialize the Transformer encoder with the parameters of Chinese BERT, while the decoder is initialized randomly; we then train the initialized model on the Chinese GEC data.
- Chinese T5 and Chinese BART are both encoder-decoder architectures.
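The BERT-encoder initialization described above can be sketched in plain PyTorch. This is a hypothetical sketch, not the paper's exact configuration: the `build_gec_model` helper, layer counts, and dimensions are illustrative assumptions; in practice the pre-trained encoder would be a loaded Chinese BERT body.

```python
import torch
import torch.nn as nn

def build_gec_model(pretrained_encoder: nn.Module,
                    vocab_size: int = 21128, d_model: int = 768) -> nn.Module:
    """Seq2seq GEC model: the encoder reuses pre-trained (Chinese BERT)
    weights passed in by the caller; the Transformer decoder, target
    embedding, and LM head are all randomly initialized."""
    layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=12,
                                       dim_feedforward=3072, batch_first=True)

    class GECModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = pretrained_encoder               # BERT-initialized
            self.tgt_embed = nn.Embedding(vocab_size, d_model)  # random init
            self.decoder = nn.TransformerDecoder(layer, num_layers=6)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, src_ids, tgt_ids):
            memory = self.encoder(src_ids)                  # (B, S, d_model)
            tgt = self.tgt_embed(tgt_ids)
            # Causal mask so each target position only attends to the past.
            t = tgt_ids.size(1)
            causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=causal)
            return self.lm_head(hidden)                     # (B, T, vocab)

    return GECModel()
```

The whole model (BERT-initialized encoder plus random decoder) is then fine-tuned end to end on the Chinese GEC pairs.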
2. Background
Grammatical error correction (GEC) is the task of correcting a variety of grammatical errors in text written typically by non-native speakers.
- Two key ingredients for strong GEC systems: pre-trained models [11, 12, 23] and pseudo data [14, 34].
1) Three Chinese pre-trained models: Chinese BERT, Chinese T5, and Chinese BART
2) Two ways of generating pseudo data for Chinese GEC: the rule-based method (MaskGEC) and the backtranslation method
- Rule-based [39]: (1) the selected token is replaced with a padding symbol; (2) the selected token is replaced with a random token from the vocabulary; (3) the selected token is replaced with a vocabulary token sampled according to frequency; (4) the selected token is replaced with a homophone sampled according to frequency. (There is no obvious difference between the static and dynamic masking strategies.)
- Backtranslation [27]: originally proposed to generate pseudo data for machine translation; here, a back-translation model (mapping correct text to erroneous text) is first trained on the Chinese GEC training data and then used to corrupt clean sentences.
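The four rule-based substitutions can be sketched as follows. This is a minimal sketch in the spirit of MaskGEC [39], assuming illustrative data structures: `vocab` maps tokens to frequencies, `homophones` maps a token to candidate homophones with frequencies, and `mask_prob` is an assumed selection probability, not the paper's setting.

```python
import random

def maskgec_noise(tokens, vocab, homophones, mask_prob=0.3, pad="[PAD]"):
    """Corrupt a clean token sequence with one of four substitution rules,
    producing a (noisy, clean) pseudo training pair for GEC."""
    words, freqs = zip(*vocab.items())  # pool for frequency-weighted sampling
    noisy = []
    for tok in tokens:
        if random.random() >= mask_prob:        # most tokens stay untouched
            noisy.append(tok)
            continue
        rule = random.randint(1, 4)
        if rule == 1:                           # (1) padding symbol
            noisy.append(pad)
        elif rule == 2:                         # (2) random vocabulary token
            noisy.append(random.choice(words))
        elif rule == 3:                         # (3) vocab token by frequency
            noisy.append(random.choices(words, weights=freqs, k=1)[0])
        else:                                   # (4) homophone by frequency
            cands = homophones.get(tok)
            if cands:
                hw, hf = zip(*cands.items())
                noisy.append(random.choices(hw, weights=hf, k=1)[0])
            else:                               # no homophone known: keep token
                noisy.append(tok)
    return noisy
```

Applying this to a large clean corpus (e.g., Wikipedia sentences) yields noisy inputs paired with the originals as targets; the dynamic-masking variant simply re-runs the corruption each epoch.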
3. Datasets
- We train and evaluate our models using the data provided by the NLPCC 2018 GEC shared task. (Training data: 1.2 million sentence pairs from Lang-8; Development data: 5,000 sentences from the training data; Test data: 2,000 sentences extracted from the PKU Chinese Learner Corpus. )
- Three kinds of pseudo data generated following [39]: Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation):
- Lang-8 (MaskGEC) <- corpus: hinative.com/lang-8
- Wiki (MaskGEC) <- corpus: nlp_chinese_corpus
- Wiki (Backtranslation) <- corpus: nlp_chinese_corpus
- Tool: wikipedia-corpus-creator
4. Experimental Results
- Error type distribution: following the annotation scheme of the HSK learner corpus, 66 sentences from the development set were annotated, yielding 100 errors in total (a sentence may contain more than one error), with types such as B, CC, CD, CQ, CJwo, CJ+, CJ-, and CJetc. Word-level errors (CC, CQ, and CD) are the most frequent.
Chinese GEC has its own characteristics: for example, spelling errors mainly stem from similarities in character shape and pronunciation, while sentence-level errors often depend on word position.
References
- [1] Yi Wang, Ruibin Yuan, Yan'gen Luo, Yufang Qin, NianYong Zhu, Peng Cheng, and Lihuan Wang. 2020. Chinese grammatical error correction based on hybrid models with data augmentation. In BEA, 78–86.
- Implemented baselines:
- awesome-transformer: github.com/ictnlp/awes….
- BERT-encoder-ChineseGEC: github.com/wang1369065….
- t5-pegasus-chinese: github.com/SunnyGJing/….
- CPT: github.com/fastnlp/CPT.
- Sentences are segmented using pkunlp: http://59.108.48.12/lcwm/pkunlp/downloads/libgrass-ui.tar.gz.
- the phrase-level edits computed using MaxMatch (M2): github.com/nusnlp/m2sc….