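1. Model:
In 2020, the Chinese Grammatical Error Diagnosis (CGED) shared task was held at NLP-TEA. -> BERT-based model:
The task targets sentences written by Chinese as a Foreign Language (CFL) learners.
Link: github.com/NYCU-NLP/NL…
Each sentence is classified either as correct or as containing one of the following four error types: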
Four error types are defined as redundant words (denoted as a capital “R”), missing words (“M”), word selection errors (“S”), and word ordering errors (“W”).
- Redundant word error (R)
- Missing word error (M)
- Word selection error (S)
- Word ordering error (W)
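
To make the label set concrete, here is a minimal sketch of how one annotated error could be represented. The names `ErrorType`, `ErrorSpan`, and `is_correct`, as well as the 1-based inclusive character offsets, are assumptions made for illustration; they are not taken from the shared-task files or the linked repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ErrorType(Enum):
    """The four CGED error tags described above."""
    REDUNDANT = "R"   # redundant word
    MISSING = "M"     # missing word
    SELECTION = "S"   # word selection error
    ORDERING = "W"    # word ordering error


@dataclass(frozen=True)
class ErrorSpan:
    """One annotated error (hypothetical layout)."""
    start: int              # assumed: 1-based index of the first character
    end: int                # assumed: 1-based index of the last character, inclusive
    error_type: ErrorType


def is_correct(spans: List[ErrorSpan]) -> bool:
    """A sentence with no annotated spans belongs to the 'correct' class."""
    return len(spans) == 0


# Hypothetical annotation: a single redundant character at position 3.
example = [ErrorSpan(start=3, end=3, error_type=ErrorType.REDUNDANT)]
print(is_correct(example), example[0].error_type.value)
```

Keeping the tag as an `Enum` means a typo such as `"X"` fails at parse time (`ErrorType("X")` raises `ValueError`) instead of silently creating a fifth class.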
2. Datasets (4):
- CGED (shared-task dataset)
- HSK (Chinese as a second language data)
- Lang8 (Mandarin learner corpus)
- School (primary and secondary school homework corpus, closed-source)
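3. Findings:
- Same-source (homologous) errors obtained by augmenting the primary and secondary school data help improve model performance
- A skewed distribution of sample proportions hinders model performance
- False Positive Rate (FPR): the misdiagnosis rate matters a great deal for this task.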
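
Since the FPR is singled out above, a small sketch of how it can be computed may help. It assumes the FPR is measured at sentence level, i.e. the share of correct sentences that the model wrongly flags as erroneous, FP / (FP + TN). The function name `false_positive_rate` and the boolean encoding are hypothetical.

```python
from typing import Sequence


def false_positive_rate(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Sentence-level FPR = FP / (FP + TN).

    gold[i] / pred[i] is True when sentence i is (gold) or is predicted
    to be (pred) erroneous, and False when it is correct.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # False positive: a correct sentence that the model flags as erroneous.
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    # True negative: a correct sentence that the model leaves alone.
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    if fp + tn == 0:
        return 0.0  # no correct sentences in the data; FPR is undefined, report 0
    return fp / (fp + tn)


# Four correct sentences and one erroneous one; the model wrongly flags
# the first correct sentence.
gold = [False, False, False, False, True]
pred = [True, False, False, False, True]
print(false_positive_rate(gold, pred))  # 0.25
```

A model that flags every sentence would score perfect recall but an FPR of 1.0, which is why the metric is worth reporting alongside precision and recall.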
4. Case:
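5. Additional notes
- NLPTEA-CGED 2014: binary classification first (correct vs. erroneous), then fine-grained classification into the four error types.
- NLPTEA-CGED 2015: identify the range in which each error occurs.
- NLPTEA-CGED 2016: a sentence may contain more than one error; the HSK corpus was added.
Comments adding related resources are welcome...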