jieba: Removing Stop Words & Part-of-Speech Tagging

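Removing stop words

Stop-word filtering is a preprocessing step in text analysis. It removes noise from the segmentation result, that is, very common words that carry little meaning on their own (for example: 的, 是, 啊).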

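import jieba

# Read the stop words (one per line) into a set for fast membership tests
filepath = r'stopwords.txt'
with open(filepath, 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}

# Load the custom dictionary; its entries are added to jieba's in-memory dictionary
jieba.load_userdict("userdict.txt")
seg_list = jieba.cut("李小福是创新办主任也是云计算方面的专家")
# Keep a token only if it is not in the stop-word list
seg_list = [i for i in seg_list if i not in stopwords]
print()
print("Result with the custom dictionary and stop words removed: ")
print(", ".join(seg_list))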
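The filtering step does not depend on jieba itself, so it can be tried without the dictionary files. The following is a minimal sketch under stated assumptions: `load_stopwords` and `remove_stopwords` are hypothetical helper names, the `io.StringIO` object stands in for stopwords.txt, and the token list is written by hand rather than produced by `jieba.cut`.

```python
import io


def load_stopwords(lines):
    """Build a set of stop words from an iterable of text lines (hypothetical helper)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are non-blank and not in the stop-word set (hypothetical helper)."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# In-memory stand-in for stopwords.txt (assumption: one stop word per line).
sample_file = io.StringIO("的\n是\n也\n")
stopwords = load_stopwords(sample_file)

# Hand-written token list standing in for the output of jieba.cut.
tokens = ["李小福", "是", "创新办", "主任", "也", "是", "云计算", "方面", "的", "专家"]
filtered = remove_stopwords(tokens, stopwords)
print(", ".join(filtered))  # prints: 李小福, 创新办, 主任, 云计算, 方面, 专家
```

Storing the stop words in a set keeps each lookup at roughly constant time, which matters once the stop-word list grows to thousands of entries. Skipping blank tokens is an extra choice made here, since a segmenter may return spaces as tokens.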

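Part-of-speech tagging

POS stands for part of speech, so this task is usually called POS tagging.

It labels each word of a segmented sentence with its part of speech.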

import jieba.posseg as pseg

# pseg.cut yields pairs that expose the token as .word and its POS tag as .flag
words = pseg.cut("我爱学习")
for w in words:
    print(w.word, w.flag)
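POS tags can also serve as a filter, much like a stop-word list. The following is a minimal sketch under stated assumptions: `keep_by_pos` is a hypothetical helper, and the (word, flag) pairs are written by hand for illustration rather than produced by jieba. It assumes that noun tags start with "n" and verb tags start with "v".

```python
def keep_by_pos(tagged, prefixes=("n", "v")):
    """Keep words whose POS flag starts with one of the given prefixes (hypothetical helper)."""
    prefixes = tuple(prefixes)
    return [word for word, flag in tagged if flag.startswith(prefixes)]


# Hand-written (word, flag) pairs for illustration; not actual jieba output.
tagged = [("我", "r"), ("爱", "v"), ("学习", "v"), ("的", "uj"), ("专家", "n")]
content_words = keep_by_pos(tagged)
print(content_words)  # prints: ['爱', '学习', '专家']
```

With jieba installed, the same helper could be fed with `[(w.word, w.flag) for w in pseg.cut(text)]`. The tags jieba actually assigns may differ from the hand-written ones above.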