I've been reading Speech and Language Processing lately and came across an interesting claim:
In fact, for most Chinese NLP tasks it turns out to work better to take characters rather than words as input, since characters are at a reasonable semantic level for most applications, and since most word standards, by contrast, result in a huge vocabulary with large numbers of very rare words (Li et al., 2019b).
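In other words, for most Chinese NLP tasks it is better to feed the model characters than words: characters already carry enough meaning for most applications, whereas most word segmentation standards produce a huge vocabulary in which a large share of the words are very rare.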
The claim comes from the 2019 paper Is Word Segmentation Necessary for Deep Learning of Chinese Representations?
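
To make the vocabulary argument concrete, here is a minimal sketch that counts vocabulary size and singleton (seen only once) tokens under word-level and character-level tokenization. The corpus is a tiny hand-segmented toy I made up for illustration, deliberately built from words that share characters; it is not data from the paper, and the helper names (`word_tokens`, `char_tokens`, `vocab_stats`) are my own.

```python
from collections import Counter

# Toy corpus, already segmented into words with spaces.
# This is an illustrative assumption, not data from the paper.
segmented = [
    "大学 学生 学 中文",
    "中学 学生 学 文学",
    "小学 学生 学 人文",
    "大人 小人 文人 人生",
]

def word_tokens(line):
    # Word-level: trust the whitespace segmentation.
    return line.split()

def char_tokens(line):
    # Character-level: every non-space character is a token.
    return [ch for ch in line if not ch.isspace()]

def vocab_stats(tokenize, corpus):
    # Returns (vocabulary size, number of tokens that occur exactly once).
    counts = Counter(tok for line in corpus for tok in tokenize(line))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons

word_vocab, word_rare = vocab_stats(word_tokens, segmented)
char_vocab, char_rare = vocab_stats(char_tokens, segmented)

print(f"words: vocab={word_vocab}, seen once={word_rare}")
print(f"chars: vocab={char_vocab}, seen once={char_rare}")
```

On this toy corpus the word vocabulary is larger than the character vocabulary and most words appear only once, while every character appears at least twice. A corpus this small proves nothing by itself, but it shows the mechanism the quote describes: a small set of characters combines into many distinct words, so word-level vocabularies grow fast and fill up with rare entries.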