Segmenting a chunk of text into words is usually the first step of processing Chinese text, but its necessity has rarely been explored. In this paper, we ask the fundamental question of whether Chinese word segmentation (CWS) is necessary for deep learning-based Chinese Natural Language Processing. We benchmark neural word-based models, which rely on word segmentation, against neural char-based models, which do not involve word segmentation, on four end-to-end NLP benchmark tasks: language modeling, machine translation, sentence matching/paraphrase, and text classification. Through direct comparisons between these two types of models, we find that char-based models consistently outperform word-based models. Based on these observations, we conduct comprehensive experiments to study why word-based models underperform char-based models in these deep learning-based NLP tasks. We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting. We hope this paper will encourage researchers in the community to rethink the necessity of word segmentation in deep learning-based Chinese Natural Language Processing.
-
In particular, CWS is a relatively hard and complicated task, primarily because the boundaries of Chinese words are usually quite vague. As discussed in Chen et al. (2017c), different linguistic perspectives have different criteria for CWS. As shown in Table 1, in the two most widely adopted CWS datasets, PKU (Yu et al., 2001) and CTB (Xia, 2000), the same sentence is segmented differently.
-
Based on eye movement data, Tsai and McConkie (2003) found that fixations of Chinese readers do not land more frequently on the centers of Chinese words, suggesting that characters, rather than words, should be the basic units of Chinese reading comprehension.
-
We identify the major factor contributing to the disadvantage of word-based models, i.e., data sparsity, which in turn leads to overfitting, a prevalence of OOV words, and weak domain transfer ability.
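To make the link between data sparsity and OOV words concrete, the following minimal sketch compares the OOV rate of a word-level and a character-level vocabulary on the same text. The two sentences, their segmentation, and the helper names `oov_rate` and `to_chars` are hypothetical and chosen only for illustration; they are not taken from the datasets or code used in this paper.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def to_chars(words):
    """Flatten a list of words into a list of single characters."""
    return [ch for word in words for ch in word]


# Hypothetical, hand-segmented toy data (an assumption for illustration only).
# The test segmentation treats "北京大学" as one word, while the training
# data only contains its parts "北京" and "大学".
train_words = ["我", "喜欢", "北京", "大学"]
test_words = ["我", "喜欢", "北京大学"]

word_oov = oov_rate(train_words, test_words)
char_oov = oov_rate(to_chars(train_words), to_chars(test_words))

print(f"word-level OOV rate: {word_oov:.3f}")  # 0.333
print(f"char-level OOV rate: {char_oov:.3f}")  # 0.000
```

In this toy case one of the three test words is unseen at the word level, whereas every test character already appears in the training text. Real corpora would show a less extreme gap, but the sketch illustrates why a word vocabulary is more exposed to sparsity than a character vocabulary.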