COLING 2025: Linguistically-Informed Error Typology

Improving Automatic Grammatical Error Annotation for Chinese Through Linguistically-Informed Error Typology

1. Model

Pinyin + PIL + resnet50: We compute pronunciation-based similarities using the Pinyin Python library, which is applied when the two words have equal lengths. For shape similarity, we utilize the Python Imaging Library (PIL) to convert each character into an image, followed by the application of the resnet50 model from PyTorch to evaluate visual similarity. In cases involving multiple characters, we calculate the position-based average similarity across each corresponding character pair.
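The position-based averaging described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `average_position_similarity` and `char_similarity` are hypothetical names, and `toy_similarity` is a placeholder for the real per-character score (pinyin-based, or ResNet-50 features computed from rendered character images).

```python
from typing import Callable, Optional


def average_position_similarity(
    word_a: str,
    word_b: str,
    char_similarity: Callable[[str, str], float],
) -> Optional[float]:
    """Average a per-character similarity over aligned positions.

    Returns None when the words differ in length or are empty, since
    characters can only be paired position by position.
    """
    if len(word_a) != len(word_b) or not word_a:
        return None
    scores = [char_similarity(a, b) for a, b in zip(word_a, word_b)]
    return sum(scores) / len(scores)


def toy_similarity(a: str, b: str) -> float:
    # Hypothetical placeholder: 1.0 for identical characters, else 0.0.
    # A real scorer would compare pinyin strings or image embeddings.
    return 1.0 if a == b else 0.0


# One of two aligned characters matches, so the average is 0.5.
print(average_position_similarity("权力", "权利", toy_similarity))
```

Passing the scorer as a parameter keeps the averaging logic independent of whether the underlying similarity is phonetic or visual.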

Link: open-writing-evaluation.github.io/

2. Background

1) Learner categories:

Native Language Learners -> L1
Non-native Language Learners -> L2

2) Characteristics of the Chinese writing system:

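No explicit word boundaries, e.g. 被 vs. 棉被.
Characters rather than an alphabet: Chinese has thousands of distinct characters, each representing a word or part of a word. This creates a complex relationship between sound, meaning, and written form. Some errors may indicate a sound-to-glyph mapping problem, while others may reflect visual or semantic confusion.
L1 vs. L2 comparison: L2 errors are mostly obvious and systematic; L1 writers make subtler errors than L2 writers, more often tied to advanced language use (L2 errors relate to recognizing characters correctly, whereas the most common L1 errors involve sentence structure and word order). FCGEC covers L1 errors and MuCGEC covers L2 errors.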

-- In practice, however, the difference may be driven by errors involving "pinyin and shape" similarity.

3) Three-tier hierarchical approach:

detection (classifying sentences as erroneous or correct), identification (categorizing errors into one of seven fine-grained types, such as structure confusion or illogical errors), and correction (applying specific operations like insert, delete, modify, or switch) (Xu et al., 2022).

4) Evaluation metrics:

$M^2$ (Dahlmeier and Ng, 2012),
GLEU (Napoles et al., 2015),
ERRANT (Bryant et al., 2017),
PT-$M^2$ (Gong et al., 2022).
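Edit-based scorers such as $M^2$ and ERRANT typically report precision, recall, and $F_{0.5}$ over matched edits. The sketch below shows only that final arithmetic step, assuming the edit counts are already available; `f_beta` is a hypothetical helper, and the zero-division handling is a simplification rather than the exact convention of any particular scorer.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true-positive, false-positive, and false-negative edit counts.

    beta = 0.5 weights precision more heavily than recall.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # simplification: real scorers may treat empty cases differently
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# 2 correct edits, 1 spurious edit, 2 missed edits:
# precision = 2/3, recall = 1/2, so F0.5 = 0.625.
print(round(f_beta(tp=2, fp=1, fn=2), 3))
```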

3. Datasets

L1: FCGEC: github.com/xlxwalex/FC…
L2: MuCGEC: github.com/HillZhang19…

4. Related tools

ChERRANT Scorers: github.com/HillZhang19…
LTP (ACL2021): github.com/HIT-SCIR/lt…

5. Introduction to ChERRANT

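ChERRANT annotation mainly covers three operation-level error types: redundant error (R), missing error (M), and substitution error (S), among others. In practice, the labels mostly used are R, M, S, and W.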
ChERRANT m2 files classify errors into redundant (R), missing (M), substitution (S), along with word order errors (W) and general spelling errors (S:SPELL).
- m2 format: ChERRANT includes the reference sentences in its m2 files, marked with "T0-A" at the start of the line.
Because this work uses a different tokenizer from ChERRANT, the tokenized version of the original sentence may differ as well.
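To make the m2 layout concrete, here is a minimal parser sketch. It assumes the common M2 conventions (an `S` line holding the tokenized source, and `A` lines of the form `start end|||type|||correction|||...`) plus a reference line beginning with `T`, as described above. The sample block is invented for illustration and is not taken from any dataset; `parse_m2`, `M2Block`, and `Edit` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Edit:
    start: int        # token index where the edit span begins
    end: int          # token index where the edit span ends
    error_type: str   # e.g. "R", "M", "S", "W", "S:SPELL"
    correction: str


@dataclass
class M2Block:
    source: List[str]
    targets: List[List[str]] = field(default_factory=list)
    edits: List[Edit] = field(default_factory=list)


def parse_m2(text: str) -> List[M2Block]:
    """Parse m2-style text into blocks of source tokens, references, and edits."""
    blocks: List[M2Block] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("S "):
            blocks.append(M2Block(source=line[2:].split()))
        elif not blocks:
            continue  # ignore anything before the first source line
        elif line.startswith("T"):
            # Assumed reference line: a marker token followed by the target tokens.
            blocks[-1].targets.append(line.split()[1:])
        elif line.startswith("A "):
            span, error_type, correction = line[2:].split("|||")[:3]
            start, end = (int(x) for x in span.split())
            blocks[-1].edits.append(Edit(start, end, error_type, correction))
    return blocks


# Invented example: the second 了 (token 4) is redundant.
SAMPLE = """S 我 昨天 去 了 了 学校
T0-A0 我 昨天 去 了 学校
A 4 5|||R|||-NONE-|||REQUIRED|||-NONE-|||0
"""

for block in parse_m2(SAMPLE):
    for edit in block.edits:
        print(edit.error_type, block.source[edit.start:edit.end])
```

Since edit offsets index into the tokenized source, a different tokenizer shifts those offsets, which is why annotations from two tools cannot be compared span by span without re-alignment.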

The index notation in the figure above, "[..)", is inconsistent with the "(..]" indices used in the example in the figure below. Why?

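Limitation: Because of text length and the insertion or deletion of certain characters, ChERRANT may fail to recognize word-order errors and misclassify them as insertion and deletion errors. (As a result, ChERRANT annotations can be vulnerable to sentences containing multiple complex errors.)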

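Overall, our analysis shows that automatic annotations match manual annotations at a rate of up to 95%, and at 73% for complex sentences containing multiple grammatical errors. Although automatic annotation has its limitations, it remains an important tool for drawing insights from large datasets.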

6. Refined Annotation for CGEC

New error typology
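- pinyin-based: homophones (权力 & 权利), visually similar characters (已 & 己), multiple similarity (洲 & 州), 的/地/得 (missing), character order (么什)
New implementation
- POS tagging uses Stanza (Qi et al., 2020) rather than LTP
- Operation labels: S/M/R -> R/M/U
- Finer-grained annotation: e.g. "一前": S:SPELL -> R:PINYIN
- Finding: L2 writers were expected to make more spelling errors than L1 writers, but the difference actually turned out to come from errors involving "pinyin and shape" similarity.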
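The relabeling of operation codes can be sketched as a small mapping. This reads the S/M/R to R/M/U change as substitution -> R, missing -> M, redundant -> U, which is an assumption inferred from the S:SPELL -> R:PINYIN example above; `OPERATION_MAP` and `convert_label` are hypothetical names, and the sketch only renames the operation prefix. It does not perform the finer-grained reclassification (for example SPELL to PINYIN).

```python
# Assumed mapping from ChERRANT-style operation codes to the refined scheme.
OPERATION_MAP = {"S": "R", "M": "M", "R": "U"}


def convert_label(label: str) -> str:
    """Rename the operation prefix of a label, keeping any ':DETAIL' suffix."""
    operation, sep, detail = label.partition(":")
    # Codes without an assumed mapping (e.g. "W") are passed through unchanged.
    return OPERATION_MAP.get(operation, operation) + sep + detail


for old in ["S:SPELL", "M", "R", "W"]:
    print(old, "->", convert_label(old))
```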
