WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This post is a new hands-on machine learning project: topic modeling as an NLP task. It covers:
- Exploratory data analysis
- Word-cloud visualization of the texts
- Text preprocessing (tokenization, stopword removal, stemming, lemmatization, vectorization)
- Topic modeling with LDA (Latent Dirichlet Allocation)
1 Importing libraries
In [1]:
import base64
import numpy as np
import pandas as pd
# Visualization libraries
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px  # the standalone plotly_express package is deprecated; px now ships with plotly
import plotly.tools as tls
from matplotlib import pyplot as plt
%matplotlib inline
from skimage.io import imread
# Modeling
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
# Silence warnings
import warnings
warnings.filterwarnings("ignore")
In [2]:
train = pd.read_csv("train/train.csv")
train.head()
Out[2]:
| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |
In [3]:
train.shape
Out[3]:
(19579, 3)
2 Exploratory data analysis (EDA)
2.1 Sample counts per author
In [4]:
data = [go.Bar(
    x = train.author.value_counts().index,   # use the value_counts index so labels match the counts
    y = train.author.value_counts().values,
    marker = dict(colorscale = "Jet",        # color settings
                  color = train.author.value_counts().values),
    text = train.author.value_counts().values  # value labels on the bars
)]
# Title is set via Layout
layout = go.Layout(
    title = "Number of samples per author"
)
fig = go.Figure(data=data, layout=layout)
fig.show()
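With plotly.express (imported above as px) the same chart is essentially a one-liner. A sketch, relying on px.bar accepting a Series and using its index as the x axis:
author_counts = train.author.value_counts()
fig = px.bar(author_counts, text=author_counts.values, title="Number of samples per author")
fig.show()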
2.2 Word counts
Split every text into words and expand into a wide frame:
In [5]:
all_words = train["text"].str.split(expand=True)  # split, then expand one word per column
all_words
Out[5]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | This | process, | however, | afforded | me | no | means | of | ascertaining | the | ... | None | None | None | None | None | None | None | None | None | None |
| 1 | It | never | once | occurred | to | me | that | the | fumbling | might | ... | None | None | None | None | None | None | None | None | None | None |
| 2 | In | his | left | hand | was | a | gold | snuff | box, | from | ... | None | None | None | None | None | None | None | None | None | None |
| 3 | How | lovely | is | spring | As | we | looked | from | Windsor | Terrace | ... | None | None | None | None | None | None | None | None | None | None |
| 4 | Finding | nothing | else, | not | even | gold, | the | Superintendent | abandoned | his | ... | None | None | None | None | None | None | None | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19574 | I | could | have | fancied, | while | I | looked | at | it, | that | ... | None | None | None | None | None | None | None | None | None | None |
| 19575 | The | lids | clenched | themselves | together | as | if | in | a | spasm. | ... | None | None | None | None | None | None | None | None | None | None |
| 19576 | Mais | il | faut | agir | that | is | to | say, | a | Frenchman | ... | None | None | None | None | None | None | None | None | None | None |
| 19577 | For | an | item | of | news | like | this, | it | strikes | us | ... | None | None | None | None | None | None | None | None | None | None |
| 19578 | He | laid | a | gnarled | claw | on | my | shoulder, | and | it | ... | None | None | None | None | None | None | None | None | None | None |

[19579 rows x 861 columns]
Unstack the wide frame above into a single Series:
In [6]:
all_words.unstack()
Out[6]:
0 0 This
1 It
2 In
3 How
4 Finding
...
860 19574 None
19575 None
19576 None
19577 None
19578 None
Length: 16857519, dtype: object
In [7]:
all_words = all_words.unstack().value_counts()
all_words
Out[7]:
the 33296
of 20851
and 17059
to 12615
I 10382
...
sunrise." 1
Averni; 1
buildin's 1
useless; 1
reduced? 1
Name: count, Length: 47556, dtype: int64
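Equivalently, pandas' explode gives the same counts without materializing the very wide intermediate frame:
word_counts = train["text"].str.split().explode().value_counts()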
In [8]:
data = [go.Bar(
    x = all_words.index.values[:50],
    y = all_words.values[:50],
    marker = dict(colorscale = "RdYlBu",   # color settings
                  color = all_words.values[:50]),  # color the same 50 bars that are shown
    text = all_words.values[:50],          # value labels on the bars
)]
# Title is set via Layout
layout = go.Layout(
    title = "Top50 words"
)
fig = go.Figure(data=data, layout=layout)
fig.show()
3 Word clouds
3.1 Extracting each author's texts
In [9]:
eap = train[train["author"]=="EAP"]["text"].values
hpl = train[train["author"]=="HPL"]["text"].values
mws = train[train["author"]=="MWS"]["text"].values
eap
Out[9]:
array(['This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.',
'In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.',
'The astronomer, perhaps, at this point, took refuge in the suggestion of non luminosity; and here analogy was suddenly let fall.',
..., 'The lids clenched themselves together as if in a spasm.',
'Mais il faut agir that is to say, a Frenchman never faints outright.',
'For an item of news like this, it strikes us it was very coolly received."'],
dtype=object)
3.2 Generating the mask images
In [10]:
from skimage.io import imread
In [11]:
# See the project repo for the three base64 strings: eap_64, mws_64, hpl_64
In [12]:
import codecs  # codecs: encoders/decoders for converting data between encodings
f1 = open("eap.png", "wb")                  # open eap.png for binary writing (created if missing)
f1.write(codecs.decode(eap_64, "base64"))   # decode the base64 string and write the bytes
f1.close()                                  # close the file
img1 = imread("eap.png")                    # read the image back as an array
hcmask = img1
In [13]:
f2 = open("mws.png", "wb")
f2.write(codecs.decode(mws_64, "base64"))
f2.close()
img2 = imread("mws.png")
hcmask2 = img2
In [14]:
f3 = open("hpl.png", "wb")
f3.write(codecs.decode(hpl_64, "base64"))
f3.close()
img3 = imread("hpl.png")
hcmask3 = img3
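The three cells above repeat one decode-and-read pattern; they could be collapsed into a small helper. A sketch, using the same eap_64/mws_64/hpl_64 strings from the project repo:
def b64_to_mask(b64_string, path):
    # decode a base64 string to bytes, save it as an image file, and read it back
    with open(path, "wb") as f:
        f.write(codecs.decode(b64_string, "base64"))
    return imread(path)

hcmask  = b64_to_mask(eap_64, "eap.png")
hcmask2 = b64_to_mask(mws_64, "mws.png")
hcmask3 = b64_to_mask(hpl_64, "hpl.png")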
3.3 Rendering the word clouds
In [15]:
# Import WordCloud and its built-in stopword list
from wordcloud import WordCloud, STOPWORDS
In [16]:
plt.figure(figsize=(12,10))
wc = WordCloud(background_color="black",
max_words=10000,
mask=hcmask3,
stopwords=STOPWORDS,
max_font_size=40)
wc.generate(" ".join(hpl))
plt.title("Author: HP Lovecraft", fontsize=15)
plt.imshow(wc.recolor(colormap="Pastel2", random_state=17), alpha=0.98)
plt.axis("off")
Out[16]:
(-0.5, 511.5, 511.5, -0.5)
Word cloud for EAP:
In [17]:
plt.figure(figsize=(14,12))
wc = WordCloud(background_color="black",
max_words=10000,
mask=hcmask,
stopwords=STOPWORDS,
max_font_size=40)
wc.generate(" ".join(eap))
plt.title("Author: Edgar Allen Poe", fontsize=15)
plt.imshow(wc.recolor(colormap="PuBu", random_state=17), alpha=0.98)
plt.axis("off")
Out[17]:
(-0.5, 639.5, 390.5, -0.5)
Word cloud for MWS:
In [18]:
plt.figure(figsize=(12,10))
wc = WordCloud(background_color="black",
max_words=10000,
mask=hcmask2,
stopwords=STOPWORDS,
max_font_size=40)
wc.generate(" ".join(mws))
plt.title("Mary Shelley", fontsize=15)
plt.imshow(wc.recolor(colormap="viridis", random_state=17), alpha=0.98)
plt.axis("off")
Out[18]:
(-0.5, 639.5, 589.5, -0.5)
4 Text preprocessing
4.1 Preprocessing steps
Whatever the task (topic modeling, word clustering, document classification, ...), raw text generally has to pass through these preprocessing steps before a model or machine can read it (a compact sketch of the pipeline follows the list):
- Tokenization
- Stopword removal
- Stemming / lemmatization
- Vectorization
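A minimal sketch of the first three steps on a single document, as a hypothetical helper (assumes the NLTK data packages punkt, stopwords, and wordnet are installed; vectorization is handled later by CountVectorizer):
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(doc):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(doc)                      # 1. tokenization
    tokens = [t.lower() for t in tokens if t.isalpha()]   # drop punctuation-only tokens
    tokens = [t for t in tokens if t not in stop]         # 2. stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]      # 3. normalization (lemmatization)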
4.2 Tokenization
In [19]:
import nltk # pip install nltk
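word_tokenize and the corpora used below rely on NLTK data packages that must be downloaded once:
nltk.download("punkt")      # tokenizer models used by word_tokenize
nltk.download("stopwords")  # stopword lists used in section 4.3
nltk.download("wordnet")    # WordNet data used by the lemmatizer in section 4.4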
In [20]:
# Take the first text as an example
first_text = train.text.values[0]
first_text
Out[20]:
'This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.'
Splitting on whitespace with split:
In [21]:
print(first_text.split(" "))
['This', 'process,', 'however,', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon;', 'as', 'I', 'might', 'make', 'its', 'circuit,', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out,', 'without', 'being', 'aware', 'of', 'the', 'fact;', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall.']
Tokenizing with nltk's word_tokenize:
In [22]:
first_text_list = nltk.word_tokenize(first_text)
print(first_text_list)
['This', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', ';', 'as', 'I', 'might', 'make', 'its', 'circuit', ',', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out', ',', 'without', 'being', 'aware', 'of', 'the', 'fact', ';', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall', '.']
Comparing the two: split cuts on spaces only, so punctuation such as ; and . stays glued to the neighboring word, whereas word_tokenize separates punctuation into tokens of its own.
4.3 Stopword removal
In [23]:
stopwords = nltk.corpus.stopwords.words("english")
stopwords[:10] # the stopwords come as a plain list
Out[23]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
In [24]:
len(stopwords)
Out[24]:
179
Keep only the tokens that do not appear in the stopword list:
In [25]:
first_text_list_cleaned = [word for word in first_text_list if word.lower() not in stopwords]
print(first_text_list_cleaned)
['process', ',', 'however', ',', 'afforded', 'means', 'ascertaining', 'dimensions', 'dungeon', ';', 'might', 'make', 'circuit', ',', 'return', 'point', 'whence', 'set', ',', 'without', 'aware', 'fact', ';', 'perfectly', 'uniform', 'seemed', 'wall', '.']
In [26]:
print("Before Stopword Removal:", len(first_text_list))
print("After Stopword Removal:", len(first_text_list_cleaned))
Before Stopword Removal: 48
After Stopword Removal: 28
4.4 Stemming and lemmatization
In NLP, stemming and lemmatization are two standard preprocessing techniques for reducing words to a base form. Both shrink the number of inflected variants, which simplifies downstream analysis and comparison.
- Stemming is a rule-based method that strips suffixes to obtain a stem. It reduces a word to a crude base form without considering grammar or meaning: "running" and "runs" both become the stem "run" (irregular forms such as "ran" are typically missed, since the rules only strip suffixes). Stemming is simple, fast, and easy to implement, but for more demanding NLP tasks it may fail to recover a word's true base form.
- Lemmatization, by contrast, is a more sophisticated method: it applies grammatical and morphological rules to map a word to its dictionary form (lemma). It is more robust, handles more inflection patterns, and recovers the original meaning more accurately: "cars" becomes "car" and "ate" becomes "eat". The price is more computation, more time, and a more complex implementation.
Both techniques are useful across NLP. Normalizing words to a base form makes lexical relations, semantic similarity, and concept mappings easier to detect, which matters in information retrieval, text classification, sentiment analysis, machine translation, and more.
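A quick side-by-side comparison (a minimal sketch; note that WordNetLemmatizer treats words as nouns unless a part-of-speech tag is supplied, so pos="v" is passed for the verbs):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"), lemmatizer.lemmatize("running", pos="v"))  # run run
print(stemmer.stem("ate"),     lemmatizer.lemmatize("ate", pos="v"))      # ate eat
print(stemmer.stem("cars"),    lemmatizer.lemmatize("cars"))              # car car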
4.4.1 Stemming with PorterStemmer
In [27]:
stemmer = nltk.stem.PorterStemmer()
stemmer
Out[27]:
<PorterStemmer>
In [28]:
print("The stemmed form of running is: {}".format(stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(stemmer.stem("run")))
The stemmed form of running is: run
The stemmed form of runs is: run
The stemmed form of run is: run
The stemmer can also go wrong and produce a non-word:
In [29]:
print("The stemmed form of leaves is: {}".format(stemmer.stem("leaves")))
The stemmed form of leaves is: leav
4.4.2 Lemmatization with WordNetLemmatizer
In [30]:
from nltk.stem import WordNetLemmatizer # lemmatizer
lemm = WordNetLemmatizer() # instantiate
Lemmatization recovers the correct base form of "leaves":
In [31]:
print(f"The lemmatized form of leaves is : {lemm.lemmatize('leaves')}")
The lemmatized form of leaves is : leaf
4.5 Vectorizing raw text
In [32]:
# Build bag-of-words vectors
sentence = ["I love to eat Burgers", "I love to eat Fries"]
vectorizer = CountVectorizer() # vectorize by term counts
sentence_transform = vectorizer.fit_transform(sentence)
sentence_transform
Out[32]:
<2x5 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>
In [33]:
vectorizer.get_feature_names_out() # the learned vocabulary
Out[33]:
array(['burgers', 'eat', 'fries', 'love', 'to'], dtype=object)
In [34]:
sentence_transform.toarray()
Out[34]:
array([[1, 1, 0, 1, 1],
[0, 1, 1, 1, 1]], dtype=int64)
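TfidfVectorizer (already imported in section 1, though not used later in this project) has the same interface, but weights the counts by inverse document frequency; a quick sketch on the same two sentences:
tfidf_vectorizer = TfidfVectorizer()
tfidf_transform = tfidf_vectorizer.fit_transform(sentence)
print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_transform.toarray().round(2))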
5 Topic modeling
5.1 Two topic-modeling approaches
Two common approaches to topic modeling:
- LDA: Latent Dirichlet Allocation
- NMF: Non-negative Matrix Factorization (a short reference sketch follows the list)
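Only LDA is used below; for reference, a minimal self-contained NMF sketch on toy documents (not the competition data). NMF factorizes a TF-IDF matrix X (documents x terms) into W (documents x topics) and H (topics x terms):
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "cats and dogs are common pets",
        "stocks fell sharply on monday"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, random_state=0).fit(tfidf)  # X ≈ W @ H
print(nmf.components_.shape)  # H: (n_topics, n_terms) per-topic term weights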
In [35]:
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        message = "\nTopic #{}:".format(index)
        # argsort()[:-n_top_words - 1:-1] picks the indices of the n_top_words largest weights, descending
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
        print("*" * 50)
5.2 Subclassing CountVectorizer with lemmatization
In [36]:
lemm = WordNetLemmatizer()
class LemmaCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # wrap the default analyzer so every extracted token is lemmatized before counting
        analyzer = super(LemmaCountVectorizer, self).build_analyzer()
        return lambda doc: (lemm.lemmatize(w) for w in analyzer(doc))
In [37]:
text = list(train["text"].values)
text[:1]
Out[37]:
['This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.']
In [38]:
tf_vectorizer = LemmaCountVectorizer(max_df=0.95,
min_df=2,
stop_words="english",
decode_error="ignore")
tf = tf_vectorizer.fit_transform(text)
tf
Out[38]:
<19579x13781 sparse matrix of type '<class 'numpy.int64'>'
with 212037 stored elements in Compressed Sparse Row format>
In [39]:
feature_names = tf_vectorizer.get_feature_names_out()
count_vec = np.asarray(tf.sum(axis=0)).ravel()
In [40]:
feature_names
Out[40]:
array(['aback', 'abandon', 'abandoned', ..., 'zone', 'ædile', 'æronaut'],
dtype=object)
In [41]:
count_vec
Out[41]:
array([ 2, 11, 29, ..., 3, 2, 2], dtype=int64)
In [42]:
zipped = list(zip(feature_names, count_vec))
zipped[:5]
Out[42]:
[('aback', 2), ('abandon', 11), ('abandoned', 29), ('abandoning', 3), ('abandonment', 5)]
In [43]:
# sort (word, count) pairs by count, descending; x = words, y = counts
x, y = (list(t) for t in zip(*sorted(zipped, key=lambda t: t[1], reverse=True)))
X = np.concatenate([x[0:15], x[:-16:-1]])  # the 15 most and 15 least frequent words
Y = np.concatenate([y[0:15], y[:-16:-1]])
X
Out[43]:
array(['time', 'man', 'day', 'thing', 'eye', 'said', 'did', 'old', 'like',
'life', 'night', 'thought', 'little', 'great', 'long', 'æronaut',
'ædile', 'zodiacal', 'zigzagging', 'zigzag', 'zerubbabel', 'zar',
'zaimi', 'yuletide', 'yule', 'youngish', 'yorktown', 'yoke',
'yelling', 'yea'], dtype='<U10')
5.2.1 Top50
In [44]:
# 50 most frequent words
data = [go.Bar(
x = x[0:50],
y = y[0:50],
marker = dict(colorscale = "Jet",
color = y[0:50]),
text="Word Counts"
)]
layout = go.Layout(title = "Top50 Frequencies after Preprocessing")
fig = go.Figure(data=data, layout=layout)
fig.show()
5.2.2 Bottom100
In [45]:
# 100 least frequent words
data = [go.Bar(
x = x[-100:],
y = y[-100:],
marker = dict(colorscale = "Portland",
color = y[-100:]),
text="Word Counts"
)]
layout = go.Layout(title = "Bottom100 Frequencies after Preprocessing")
fig = go.Figure(data=data, layout=layout)
fig.show()
5.3 Topic modeling with LDA
In [46]:
lda = LatentDirichletAllocation(n_components=11,          # number of topics
                                max_iter=5,
                                learning_method="online",  # online variational Bayes
                                learning_offset=50,
                                random_state=0
                                )
In [47]:
# tf_vectorizer = LemmaCountVectorizer(max_df=0.95,min_df=2,stop_words="english",decode_error="ignore")
# tf = tf_vectorizer.fit_transform(text)
lda.fit(tf)
Out[47]:
LatentDirichletAllocation(learning_method='online', learning_offset=50,
max_iter=5, n_components=11, random_state=0)
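As a rough sanity check, the fitted model can score the matrix it was trained on via the estimator's perplexity method (held-out data would be needed for a real evaluation; a sketch, not run in this article):
print(lda.perplexity(tf))  # lower is better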
Inspect the fitted model, the feature names, and the number of top words to display:
In [48]:
n_top_words = 40
tf_feature_names = tf_vectorizer.get_feature_names_out()
print(lda)
print(tf_feature_names)
print(n_top_words)
LatentDirichletAllocation(learning_method='online', learning_offset=50,
max_iter=5, n_components=11, random_state=0)
['aback' 'abandon' 'abandoned' ... 'zone' 'ædile' 'æronaut']
40
In [49]:
print_top_words(lda, tf_feature_names, n_top_words)
Topic #0:mean night fact young return great human looking wonder countenance difficulty greater wife finally set possessed regard struck perceived act society law health key fearful mr exceedingly evidence carried home write lady various recall accident force poet neck conduct investigation
**************************************************
Topic #1:death love raymond hope heart word child went time good man ground evil long misery replied filled passion bed till happiness memory heavy region year escape spirit grief visit doe story beauty die plague making influence thou letter appeared power
**************************************************
Topic #2:left let hand said took say little length body air secret gave right having great arm thousand character minute foot true self gentleman pleasure box clock discovered point sought pain nearly case best mere course manner balloon fear head going
**************************************************
Topic #3:called sense table suddenly sympathy machine sens unusual labour thrown mist solution suppose specie movement whispered urged frequent wine hour appears ring turk place stage noon justine ceased obscure chair completely exist sitting supply weird bottle seated drink material bell
**************************************************
Topic #4:house man old soon city room sight did believe mr light entered sir cloud order ill way dr apparently clear certain forgotten day quite door considered need great fine began journey search walked disposition view long concerning walk drawn saw
**************************************************
Topic #5:thing thought eye mind said men night like face life head dream knew saw form world away deep stone told matter morning perdita dead general man strange seen terrible sleep tell object tear know account better black say remained little
**************************************************
Topic #6:father moon stood longer attention end sure leave remember time excited period trace dream given star place able grew subject set cut visited captain consequence marie taking forward started descent atmosphere impulse departure dog men truly abyss appear magnificent quarter
**************************************************
Topic #7:day did heard life time friend new far horror nature come look tree year present soul passed known people heart felt degree scene idea hand feeling world came country adrian moment make word affection sun gone reached idris youth seen
**************************************************
Topic #8:came earth street near like sound wall window just open lay fell wind looked saw moment water eye dark spirit beneath mountain old did light foot long town space floor low happy held half voice living direction ear small end
**************************************************
Topic #9:shall place sea time think long fear know mother day person say brought expression land change question night result ye week mad month feel god rest got manner course horrible large resolved kind passage far discovery word answer eye ago
**************************************************
Topic #10:door turned close away design view doubt ordinary tried oh madness room enemy le lower exertion chamber opening candle legend occupation abode lofty author compartment breath flame accursed machinery horse iron proceeded curse ve louder desired entering appeared lock oil
**************************************************
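The fitted model can also assign topics to individual documents: transform returns a document-topic distribution (a short sketch):
doc_topic = lda.transform(tf)  # shape (19579, 11); each row sums to 1
print(doc_topic[0].round(3))   # topic mixture of the first text
print(doc_topic[0].argmax())   # its single most probable topic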
5.3.1 Extracting topic components
In [50]:
# each row of lda.components_ holds one topic's (unnormalized) word weights
topic1 = lda.components_[0]
topic2 = lda.components_[1]
topic3 = lda.components_[2]
topic4 = lda.components_[3]
In [51]:
topic1.shape, topic2.shape, topic3.shape, topic4.shape
Out[51]:
((13781,), (13781,), (13781,), (13781,))
5.3.2 Topic word clouds
In [52]:
# top-50 words of each topic, in descending weight order
topic1_words = [tf_feature_names[i] for i in topic1.argsort()[:-50 - 1 : -1]]
topic2_words = [tf_feature_names[i] for i in topic2.argsort()[:-50 - 1 : -1]]
topic3_words = [tf_feature_names[i] for i in topic3.argsort()[:-50 - 1 : -1]]
topic4_words = [tf_feature_names[i] for i in topic4.argsort()[:-50 - 1 : -1]]
In [53]:
wordcloud1 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic1_words))
plt.imshow(wordcloud1)
plt.axis("off")
plt.show()
Word cloud for topic 2:
In [54]:
wordcloud2 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic2_words))
plt.imshow(wordcloud2)
plt.axis("off")
plt.show()
Word cloud for topic 3:
In [55]:
wordcloud3 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic3_words))
plt.imshow(wordcloud3)
plt.axis("off")
plt.show()
Word cloud for topic 4:
In [56]:
wordcloud4 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic4_words))
plt.imshow(wordcloud4)
plt.axis("off")
plt.show()
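The four nearly identical cells above could also be collapsed into a single loop over a 2x2 grid (a compact alternative sketch):
topic_words = [topic1_words, topic2_words, topic3_words, topic4_words]
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
for ax, words in zip(axes.ravel(), topic_words):
    wc = WordCloud(stopwords=STOPWORDS, background_color="black",
                   width=2500, height=1800).generate(" ".join(words))
    ax.imshow(wc)
    ax.axis("off")
plt.show()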