Chinese Word Segmentation Is Not Always Necessary

I have recently been reading Speech and Language Processing and came across an interesting conclusion:

In fact, for most Chinese NLP tasks it turns out to work better to take characters rather than words as input, since characters are at a reasonable semantic level for most applications, and since most word standards, by contrast, result in a huge vocabulary with large numbers of very rare words (Li et al., 2019b).

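In other words, for most Chinese NLP tasks it works better to feed the model characters rather than words. Characters already sit at a reasonable semantic level for most applications, whereas most word segmentation standards produce a huge vocabulary containing a large number of very rare words.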

This conclusion comes from the 2019 paper Is Word Segmentation Necessary for Deep Learning of Chinese Representations? (Li et al., 2019b).