NLPTEA 2020: Chinese Grammatical Error Detection Based on a BERT Model


1. Model:

The Chinese Grammatical Error Diagnosis (CGED) shared task was held at NLP-TEA in 2020. It targets grammatical errors in text written by Chinese as a Foreign Language (CFL) learners. The approach covered here is a BERT-based model:

image.png

Link: github.com/NYCU-NLP/NL…

The labels consist of the correct class plus the following four error types:

Four error types are defined as redundant words (denoted as a capital “R”), missing words (“M”), word selection errors (“S”), and word ordering errors (“W”).

  • Word Redundant Error
  • Word Missing Error
  • Word Selection Error
  • Word Disorder Error
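
As a rough illustration of how the four codes could be turned into per-character labels for a BERT-style tagger, here is a minimal sketch. The BIO tagging scheme, the `(start, end, type)` span format with 1-based inclusive offsets, the function name `spans_to_tags`, and the example sentence are all assumptions made for illustration; they are not taken from the linked repository.

```python
from typing import List, Tuple

# The four error codes defined by the task.
ERROR_TYPES = ("R", "M", "S", "W")


def spans_to_tags(sentence: str, errors: List[Tuple[int, int, str]]) -> List[str]:
    """Convert error spans into one BIO tag per character.

    Assumption: each span is (start, end, type) with 1-based, inclusive
    character offsets. Characters outside any span are tagged "O".
    """
    tags = ["O"] * len(sentence)
    for start, end, err_type in errors:
        if err_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {err_type!r}")
        if not (1 <= start <= end <= len(sentence)):
            raise ValueError(f"span out of range: ({start}, {end})")
        for pos in range(start - 1, end):
            prefix = "B-" if pos == start - 1 else "I-"
            tags[pos] = prefix + err_type
    return tags


# Made-up example: the second "了" (character 6) is a redundant word.
example_tags = spans_to_tags("我昨天去了了学校", [(6, 6, "R")])
print(example_tags)
```

A sentence with no error spans yields all "O" tags, which corresponds to the correct class.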

2. Datasets (4):

  • CGED (the shared-task dataset)
  • HSK (Chinese as a second language data)
  • Lang8 (Mandarin learner corpus)
  • School (primary and secondary school homework corpus, closed-source)

image.png

3. Findings:

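  • Errors of the same origin, added through augmentation with the primary and secondary school data, help improve model performance.
  • A skewed distribution of sample proportions hurts model performance.
  • False Positive Rate (FPR): the misdiagnosis rate is important for this task.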
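
To make the FPR point concrete, the sketch below computes a sentence-level false positive rate as FP / (FP + TN), that is, the share of actually correct sentences that the model flags as erroneous. The function name and the toy labels are hypothetical, and treating FPR at the sentence level is an assumption of this example.

```python
from typing import Sequence


def false_positive_rate(gold_has_error: Sequence[bool],
                        pred_has_error: Sequence[bool]) -> float:
    """Sentence-level FPR = FP / (FP + TN).

    FP: correct sentence flagged as erroneous.
    TN: correct sentence left unflagged.
    Returns 0.0 when there are no correct sentences in the gold labels.
    """
    if len(gold_has_error) != len(pred_has_error):
        raise ValueError("gold and predicted label lists differ in length")
    pairs = list(zip(gold_has_error, pred_has_error))
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    tn = sum(1 for gold, pred in pairs if not gold and not pred)
    if fp + tn == 0:
        return 0.0
    return fp / (fp + tn)


# Toy labels (made up): four correct sentences, two erroneous ones.
gold = [False, False, False, False, True, True]
pred = [True, False, False, False, True, False]
fpr = false_positive_rate(gold, pred)
print(fpr)  # 1 false positive out of 4 correct sentences -> 0.25
```

A model that flags every sentence as erroneous would catch every real error yet score an FPR of 1.0, which is why the metric is tracked separately.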

4. Case:

image.png

35ea8be9fa672ab2eaabf16fa00786d.png

5. Additional notes

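  • NLPTEA-CGED 2014: binary classification first (correct vs. erroneous), then fine-grained classification into the four error types.
  • NLPTEA-CGED 2015: identify the range in which each error occurs.
  • NLPTEA-CGED 2016: more than one error per sentence, plus the HSK corpus.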

Feel free to add related resources in the comments...