基于NLP实现文本分类聚类和主题建模1 实现文本分类、聚类分类和主题建模分类和聚类：基于scikit-learn库主

1 实现文本分类、聚类分类和主题建模

分类和聚类：基于scikit-learn库
主题建模：基于gensim库

数据使用的是从UCI官网下载的数据：archive.ics.uci.edu/dataset/228…

2 导入库

In [3]:

import pandas as pd
import numpy as np

import sklearn
from sklearn.metrics import confusion_matrix, consensus_score,classification_report,precision_score

import nltk
import csv

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

3 文本清理函数

一个针对文本数据的清洗函数：

In [4]:

def preprocessing(text):
    #text = text.decode("utf8")
    
    # 列表推导式：从大到小；从文本到句子，再从句子到单词   最终得到单词列表
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]  
    
    # 删除停用词
    stop = stopwords.words("english")
    # 如果token不在stop中则删除
    tokens = [token for token in tokens if token not in stop]
    
    # 删除长度小于3个的单词
    tokens = [word for word in tokens if len(word) >= 3]
    
    # 全部变小写
    tokens = [word.lower() for word in tokens]
    
    # 使用WordNetLemmatizer类从nltk库进行词形还原  比如running变成run
    lmtzr = WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(word) for word in tokens]
    
    # 合并处理后的单词
    preprocessed_text = " ".join(tokens)
    return preprocessed_text

4 数据读取

In [5]:

sms = open("char6-SMSSpamCollection",encoding="utf8")  # 打开数据
sms

Out[5]:

<_io.TextIOWrapper name='char6-SMSSpamCollection' mode='r' encoding='utf8'>

In [6]:

sms_data = []
sms_labels = []

csv_reader = csv.reader(sms,delimiter="\t")

for line in csv_reader:
    sms_labels.append(line[0])
    sms_data.append(line[1])
    
sms.close()

In [7]:

sms_labels[:5]

Out[7]:

['ham', 'ham', 'spam', 'ham', 'ham']

In [8]:

sms_data[:2]

Out[8]:

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'Ok lar... Joking wif u oni...']

5 数据切分

In [9]:

# train:test = 7:3

train_size = int(round(len(sms_data) * 0.7))

In [10]:

X_train = np.array(["".join(el) for el in sms_data[0:train_size]])
y_train = np.array(["".join(el) for el in sms_labels[0:train_size]])

X_test = np.array(["".join(el) for el in sms_data[train_size+1:len(sms_data)]])
y_test = np.array(["".join(el) for el in sms_labels[train_size+1:len(sms_labels)]])

In [11]:

print(X_train)
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...' 'Ok lar... Joking wif u oni...' "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's" ... 'tells u 2 call 09066358152 to claim £5000 prize. U have 2 enter all ur mobile & personal details @ the prompts. Careful!' "No. Thank you. You've been wonderful" 'Otherwise had part time job na-tuition..']

In [12]:

print(y_train)
['ham' 'ham' 'spam' ... 'spam' 'ham' 'ham']

6 TDM(term-document matrix)

词汇文档矩阵TDM(term-document matrix)：将整个文本转成向量形式。

文本文档的另一种表示形式：用词袋(BOW,bag of word)表示

6.1 用Python生成类似词汇文档矩阵

需要使用scikit中的向量化器vectorizer：

In [13]:

from sklearn.feature_extraction.text import CountVectorizer

In [14]:

sms_exp = []

for line in sms_data:
    sms_exp.append(preprocessing(line))
    
sms_exp[0]

Out[14]:

'jurong point crazy available bugis great world buffet ... cine got amore wat ...'

In [15]:

vectorizer = CountVectorizer(min_df=1)

In [16]:

X_exp = vectorizer.fit_transform(sms_exp)
X_exp

Out[16]:

<5572x8028 sparse matrix of type '<class 'numpy.int64'>'
	with 47783 stored elements in Compressed Sparse Row format>

In [17]:

# print("||".join(vectorizer.get_feature_names_out()))

print(X_exp.toarray())
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

6.2 TF-IDF

计数向量在使用的过程中会遇到一个问题：较长文档所获得的平均计数值会高于较短的文档。

解决方法：用该文档中每个单词出现的次数除以该文档中的单词总数。这个特征值被称之为tf（term frequencies）

tf的另一个优化点：对语料库中许多文档中出现的单词进行降格加权。这种方式就能够减少那些只在该语料库中某一小部分出现的信息，这种方法称之为tf-idf（term frequency-inverse documnent frequency）

In [18]:

from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:

vectorizer = TfidfVectorizer(min_df=2, 
                             ngram_range=(1,2),
                             stop_words="english",
                             strip_accents="unicode", 
                             norm="l2")

In [20]:

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)  # 针对测试集不需要fit过程

7 文本分类

文本分类是NLP最为常见的任务，其本质就是利用一个单词或词组对文本文档进行分类的过程。

7.1 朴素贝叶斯Naive Bayes

7.1.1 建模

后验概率 = \frac{先验概率 * 似然概率}{证据因子}

英文原文是：

posterior = \frac{prior * likelihood}{evidence}

基于朴素贝叶斯的建模：

In [21]:

from sklearn.naive_bayes import MultinomialNB

In [22]:

X_train.shape

Out[22]:

(3900, 6497)

In [23]:

y_train.shape

Out[23]:

(3900,)

In [24]:

X_test.shape

Out[24]:

(1671, 6497)

In [25]:

y_test.shape

Out[25]:

(1671,)

In [26]:

clf = MultinomialNB().fit(X_train, y_train)
y_nb = clf.predict(X_test)  # 预测
y_nb

Out[26]:

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype='<U4')

7.1.2 模型评估

In [27]:

cm = confusion_matrix(y_test, y_nb)  # 混淆矩阵
print(cm)
[[1443    0]
 [  50  178]]

查看分类报告结果：

In [28]:

print(classification_report(y_test, y_nb))
              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1443
        spam       1.00      0.78      0.88       228

    accuracy                           0.97      1671
   macro avg       0.98      0.89      0.93      1671
weighted avg       0.97      0.97      0.97      1671

分类模型评估指标：准确率accuracy、精确率precision、召回率recall、F2得分F2_socre

Accuracy = \frac{tp+tn}{tp+tn+fp+fn}

Precision = \frac{tp}{tp+fp}

Recall = \frac{tp}{tp+fn}

F2_{score} = 2 * \frac{Precision * Recall}{Precision + Recall}

获取相关特征及其系数：

In [29]:

feature_names = vectorizer.get_feature_names_out()  # scikit-learn版本1.X后添加_out
feature_names

Out[29]:

array(['00', '00 sub', '00 subs', ..., 'zed', 'zed 08701417012', 'zoe'],
      dtype=object)

In [30]:

coefs = clf.feature_log_prob_[0]  # 原文clf.coefs
coefs

Out[30]:

array([-9.59741851, -9.59741851, -9.59741851, ..., -9.59741851,
       -9.59741851, -9.26687496])

In [31]:

coefs_with_fns = sorted(zip(clf.feature_log_prob_[0], feature_names))
coefs_with_fns[:10]

Out[31]:

[(-9.597418513754713, '00'), (-9.597418513754713, '00 sub'), (-9.597418513754713, '00 subs'), (-9.597418513754713, '000'), (-9.597418513754713, '000 bonus'), (-9.597418513754713, '000 cash'), (-9.597418513754713, '000 homeowners'), (-9.597418513754713, '000 pounds'), (-9.597418513754713, '000 prize'), (-9.597418513754713, '000 xmas')]

In [32]:

n = 10

top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n+1):-1])
#print(coefs[:n])
#print(coefs_with_fns[-(n+1):-1])

for (coef1, fn1),(coef2, fn2) in top:
    
    print('\t%.4f\t%-15s\t\t%.4f\t%-15s'%(coef1,fn1,coef2,fn2))
	-9.5974	00             		-5.3285	ok             
	-9.5974	00 sub         		-5.6802	ll             
	-9.5974	00 subs        		-5.7654	come           
	-9.5974	000            		-5.8329	know           
	-9.5974	000 bonus      		-5.8459	lt             
	-9.5974	000 cash       		-5.8492	gt             
	-9.5974	000 homeowners 		-5.8527	just           
	-9.5974	000 pounds     		-5.8874	like           
	-9.5974	000 prize      		-5.9038	good           
	-9.5974	000 xmas       		-5.9572	got

从向量化器中读取了所有的特征名称，并获取了与给定特征相关的系数。

7.2 决策树

主要是基于CART算法。

In [33]:

from sklearn import tree

In [34]:

# scikit库的树模型只接受Numpy数组：建模前先将稀疏矩阵转成Numpy数组

clf = tree.DecisionTreeClassifier().fit(X_train.toarray(), y_train)
y_tree = clf.predict(X_test.toarray())  

y_tree

Out[34]:

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype='<U4')

In [35]:

print(classification_report(y_test, y_tree))
              precision    recall  f1-score   support

         ham       0.97      0.98      0.98      1443
        spam       0.88      0.81      0.84       228

    accuracy                           0.96      1671
   macro avg       0.92      0.89      0.91      1671
weighted avg       0.96      0.96      0.96      1671

In [36]:

cm = confusion_matrix(y_test, y_tree)
print(cm)
[[1417   26]
 [  44  184]]

7.3 随机梯度下降算法Stochastic gredient descent(SGD)

非常适用于线性模型的方法。当目标样本数量（特征数量）非常庞大时，效果特别突出。SGD算法有时候也称之为最大熵（Maximum Entropy，简称为MaxEnt算法），它会使用不同的损失函数loss function和惩罚机制来适配分类问题和回归问题的线性模型。

当loss=log时，它适配的是一个对数回归模型
当loss=hinge时，它适配的是一个线性的SVM

#参数
sklearn.linear_model.SGDClassifier(
    loss='hinge',
    penalty='l2', 
    alpha=0.0001, 
    l1_ratio=0.15, 
    fit_intercept=True, 
    max_iter=1000, 
    tol=0.001, 
    shuffle=True, 
    verbose=0, 
    epsilon=0.1, 
    n_jobs=None, 
    random_state=None, 
    learning_rate='optimal', 
    eta0=0.0, 
    power_t=0.5, 
    early_stopping=False, 
    validation_fraction=0.1, 
    n_iter_no_change=5, 
    class_weight=None, 
    warm_start=False, 
    average=False
)

In [37]:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

In [38]:

# 新版本中n_iter变成n_iter_no_change

clf = SGDClassifier(alpha=0.001,n_iter_no_change=50).fit(X_train, y_train)
y_sgd = clf.predict(X_test)

In [39]:

print(classification_report(y_test, y_sgd))
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99      1443
        spam       0.98      0.83      0.90       228

    accuracy                           0.97      1671
   macro avg       0.98      0.91      0.94      1671
weighted avg       0.97      0.97      0.97      1671

7.4 Logistic Regression逻辑回归

逻辑回归是一种针对分类问题的线性回归模型，也称之为对元逻辑、最大熵MaxEnt分类法或对数线性分类器。

L2正则化：

L1正则化：

7.5 支持向量机SVM

In [40]:

from sklearn.svm import LinearSVC, LinearSVR

In [41]:

svc = LinearSVC().fit(X_train, y_train)
y_svm = svc.predict(X_test)

In [42]:

print(classification_report(y_test, y_svm))  # 分类报告
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1443
        spam       0.98      0.90      0.94       228

    accuracy                           0.98      1671
   macro avg       0.98      0.95      0.96      1671
weighted avg       0.98      0.98      0.98      1671

In [43]:

cm = confusion_matrix(y_test, y_svm)   # 混淆矩阵
print(cm)
[[1438    5]
 [  23  205]]

7.6 随机森林RandomForest

In [44]:

from sklearn.ensemble import RandomForestClassifier

In [45]:

clf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
y_rf = clf.predict(X_test)  # 预测

In [46]:

print(classification_report(y_test, y_rf))  # 分类报告
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      1443
        spam       0.99      0.75      0.86       228

    accuracy                           0.97      1671
   macro avg       0.98      0.88      0.92      1671
weighted avg       0.97      0.97      0.96      1671

In [47]:

cm = confusion_matrix(y_test, y_rf)   # 混淆矩阵
print(cm)
[[1441    2]
 [  56  172]]

8 文本聚类

最为常用的是k-means 和层次聚类hierarchical clustering

In [48]:

from sklearn.cluster import KMeans, MiniBatchKMeans

In [49]:

k = 5

km = KMeans(n_clusters=k, init="k-means++", max_iter=100, n_init=1)
kmini = MiniBatchKMeans(n_clusters=k, init="k-means++",n_init=1,init_size=1000,batch_size=1000)

In [50]:

km_model = km.fit(X_train)
kmini_model = kmini.fit(X_train)

In [51]:

import collections

clustering = collections.defaultdict(list)

for idx, label in enumerate(km_model.labels_):
    clustering[idx].append(idx)

In [52]:

for idx, label in enumerate(kmini_model.labels_):
    clustering[idx].append(idx)

9 主题建模

常用来对文本进行主题建模的方法有：

9.1 LDA

隐含狄利克雷分布（Latent Dirichlet Allocation，简称LDA）是一种主题模型，它可以将文档集中每篇文档的主题以概率分布的形式给出。LDA是基于贝叶斯模型的，涉及到贝叶斯模型离不开“先验分布”，“数据（似然）”和"后验分布"三块。

在贝叶斯学派这里：先验分布 + 数据（似然）= 后验分布。一篇文档可以包含多个主题，文档中每一个词都由其中的一个主题生成。

9.2 LSI

潜在语义索引（LSI）也是一种无监督学习方法，主要用于发现文本与单词之间的基于话题的语义关系。LSI是通过矩阵分解发现文本与单词之间的基于话题的语义关系的一种方法，也称为潜在语义索引（LSI）。

LSI解决部分一词多义和一义多词问题，也可以用于降维，但LSI不是概率模型，缺乏严谨的数理统计基础。

总的来说，两者都是用于揭示文本数据中隐藏的主题结构的无监督学习方法，但实现方式和应用场景有所不同。

9.3 基于gensim的主题建模

gensim提供了多种方法来创建不同的语料库格式，比如TFIDF、LIBSVM、market matrix等；

In [53]:

import gensim
from gensim import corpora, models, similarities
from itertools import chain
import nltk
from nltk.corpus import stopwords
from operator import itemgetter
import re

In [54]:

documents = [document for document in sms_data]
stoplist = stopwords.words("english")

In [55]:

# 常规操作：读取数据，移除停用词
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

In [56]:

# 将文档列表转成BOW模型，再转成TF-IDF语料库：

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)  # 创建了一个TF-IDF模型对象
corpus_tfidf = tfidf[corpus]  # 使用TF-IDF模型将corpus中的每个文档转换成TF-IDF表示形式

In [57]:

# 建立LSI模型

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=100)

lsi.print_topics(5)

Out[57]:

[(0,
  '0.497*"later" + 0.494*"sorry," + 0.474*"i\'ll" + 0.399*"call" + 0.116*"u" + 0.087*"i\'m" + 0.072*"ok" + 0.065*"meeting" + 0.056*"get" + 0.054*"2"'),
 (1,
  '-0.417*"u" + 0.234*"sorry," + 0.201*"later" + -0.200*"ok" + -0.188*"2" + 0.159*"i\'ll" + -0.156*"ur" + -0.151*"i\'m" + -0.147*"come" + -0.141*"get"'),
 (2,
  '0.591*"ok" + -0.389*"?" + -0.275*"..." + 0.196*"u" + 0.154*"lor..." + 0.145*"lor." + -0.127*"&lt;#&gt;" + 0.120*"ü" + -0.108*"send" + -0.108*"love"'),
 (3,
  '0.463*"ok" + 0.379*"?" + -0.295*"u" + 0.278*"..." + -0.196*"call" + -0.163*"2" + -0.143*"free" + -0.122*"ur" + 0.117*"come" + 0.110*"i\'m"'),
 (4,
  '-0.408*"u" + 0.348*"ok" + 0.210*"pls" + 0.210*"send" + 0.205*"." + 0.202*"message" + 0.202*"now." + -0.196*"?" + 0.188*"right" + 0.187*"call"')]

In [58]:

# 建立LDA模型

n_topics = 5

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=n_topics)

In [59]:

for i in range(0,n_topics):
    temp = lda.show_topic(i,10)
    #print(temp)
    terms = []
    for term in temp:
        # 原文terms.append(term[1])，修改如下
        terms.append(str(term[0]))
        
    print("Top 10 terms for topic #" + str(i) + ": " + ",".join(terms))
Top 10 terms for topic #0: u,ok...,&lt;#&gt;,like,see,i'll,you.,get,call,day.
Top 10 terms for topic #1: call,later,i'm,sorry,,u,i'll,got,time,ur,2
Top 10 terms for topic #2: ?,...,u,get,happy,i'm,going,pick,take,.
Top 10 terms for topic #3: u,ok,call,2,lor.,anything,i'm,lor...,ur,4
Top 10 terms for topic #4: &lt;#&gt;,u,ü,ur,call,text,want,free,got,get