WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
This post is a new hands-on machine learning project: topic modeling as an NLP task. It covers:
- Exploratory data analysis
- Word-cloud visualization of the texts
- Text preprocessing (tokenization, stopword removal, stemming, lemmatization, vectorization)
- Topic modeling with LDA (Latent Dirichlet Allocation)
1 Importing libraries
In [1]:
import base64
import numpy as np
import pandas as pd
# Visualization libraries
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px  # the standalone plotly_express package is deprecated; px now ships with plotly
import plotly.tools as tls
from matplotlib import pyplot as plt
%matplotlib inline
from skimage.io import imread
# Modeling
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
# Silence warnings
import warnings
warnings.filterwarnings("ignore")
In [2]:
train = pd.read_csv("train/train.csv")
train.head()
Out[2]:
| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |
In [3]:
train.shape
Out[3]:
(19579, 3)
2 Exploratory data analysis (EDA)
2.1 Sample counts per author
In [4]:
data = [go.Bar(
    x = train.author.value_counts().index,   # use the value_counts index so labels match the counts
    y = train.author.value_counts().values,
    marker = dict(colorscale = "Jet",        # color settings
                  color = train.author.value_counts().values),
    text = train.author.value_counts().values  # value labels on the bars
)]
# Title is set via Layout
layout = go.Layout(
    title = "Number of samples per author"
)
fig = go.Figure(data=data, layout=layout)
fig.show()
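With plotly.express (imported above as px) the same chart is essentially a one-liner. A sketch, relying on px.bar accepting a Series and using its index as the x axis:
author_counts = train.author.value_counts()
fig = px.bar(author_counts, text=author_counts.values, title="Number of samples per author")
fig.show()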
2.2 Word counts
Split every text into words and expand into a wide frame:
In [5]:
all_words = train["text"].str.split(expand=True)  # split, then expand one word per column
all_words
Out[5]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | This | process, | however, | afforded | me | no | means | of | ascertaining | the | ... | None | None | None | None | None | None | None | None | None | None |
| 1 | It | never | once | occurred | to | me | that | the | fumbling | might | ... | None | None | None | None | None | None | None | None | None | None |
| 2 | In | his | left | hand | was | a | gold | snuff | box, | from | ... | None | None | None | None | None | None | None | None | None | None |
| 3 | How | lovely | is | spring | As | we | looked | from | Windsor | Terrace | ... | None | None | None | None | None | None | None | None | None | None |
| 4 | Finding | nothing | else, | not | even | gold, | the | Superintendent | abandoned | his | ... | None | None | None | None | None | None | None | None | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19574 | I | could | have | fancied, | while | I | looked | at | it, | that | ... | None | None | None | None | None | None | None | None | None | None |
| 19575 | The | lids | clenched | themselves | together | as | if | in | a | spasm. | ... | None | None | None | None | None | None | None | None | None | None |
| 19576 | Mais | il | faut | agir | that | is | to | say, | a | Frenchman | ... | None | None | None | None | None | None | None | None | None | None |
| 19577 | For | an | item | of | news | like | this, | it | strikes | us | ... | None | None | None | None | None | None | None | None | None | None |
| 19578 | He | laid | a | gnarled | claw | on | my | shoulder, | and | it | ... | None | None | None | None | None | None | None | None | None | None |

[19579 rows x 861 columns]
Unstack the wide frame above into a single Series:
In [6]:
all_words.unstack()
Out[6]:
0 0 This
1 It
2 In
3 How
4 Finding
...
860 19574 None
19575 None
19576 None
19577 None
19578 None
Length: 16857519, dtype: object
In [7]:
all_words = all_words.unstack().value_counts()
all_words
Out[7]:
the 33296
of 20851
and 17059
to 12615
I 10382
...
sunrise." 1
Averni; 1
buildin's 1
useless; 1
reduced? 1
Name: count, Length: 47556, dtype: int64
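Equivalently, pandas' explode gives the same counts without materializing the very wide intermediate frame:
word_counts = train["text"].str.split().explode().value_counts()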
In [8]:
data = [go.Bar(
    x = all_words.index.values[:50],
    y = all_words.values[:50],
    marker = dict(colorscale = "RdYlBu",   # color settings
                  color = all_words.values[:50]),  # color the same 50 bars that are shown
    text = all_words.values[:50],          # value labels on the bars
)]
# Title is set via Layout
layout = go.Layout(
    title = "Top50 words"
)
fig = go.Figure(data=data, layout=layout)
fig.show()
3 Word clouds
3.1 Extracting each author's texts
In [9]:
eap = train[train["author"]=="EAP"]["text"].values
hpl = train[train["author"]=="HPL"]["text"].values
mws = train[train["author"]=="MWS"]["text"].values
eap
Out[9]:
array(['This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.',
'In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.',
'The astronomer, perhaps, at this point, took refuge in the suggestion of non luminosity; and here analogy was suddenly let fall.',
..., 'The lids clenched themselves together as if in a spasm.',
'Mais il faut agir that is to say, a Frenchman never faints outright.',
'For an item of news like this, it strikes us it was very coolly received."'],
dtype=object)
3.2 Generating the mask images
In [10]:
from skimage.io import imread
In [11]:
# See the project repo for the three base64 strings: eap_64, mws_64, hpl_64
In [12]:
import codecs  # codecs: encoders/decoders for converting data between encodings
f1 = open("eap.png", "wb")                  # open eap.png for binary writing (created if missing)
f1.write(codecs.decode(eap_64, "base64"))   # decode the base64 string and write the bytes
f1.close()                                  # close the file
img1 = imread("eap.png")                    # read the image back as an array
hcmask = img1
In [13]:
f2 = open("mws.png", "wb")
f2.write(codecs.decode(mws_64, "base64"))
f2.close()
img2 = imread("mws.png")
hcmask2 = img2
In [14]:
f3 = open("hpl.png", "wb")
f3.write(codecs.decode(hpl_64, "base64"))
f3.close()
img3 = imread("hpl.png")
hcmask3 = img3
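The three cells above repeat one decode-and-read pattern; they could be collapsed into a small helper. A sketch, using the same eap_64/mws_64/hpl_64 strings from the project repo:
def b64_to_mask(b64_string, path):
    # decode a base64 string to bytes, save it as an image file, and read it back
    with open(path, "wb") as f:
        f.write(codecs.decode(b64_string, "base64"))
    return imread(path)

hcmask  = b64_to_mask(eap_64, "eap.png")
hcmask2 = b64_to_mask(mws_64, "mws.png")
hcmask3 = b64_to_mask(hpl_64, "hpl.png")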
3.3 Rendering the word clouds
In [15]:
# Import WordCloud and its built-in stopword list
from wordcloud import WordCloud, STOPWORDS
In [16]:
plt.figure(figsize=(12,10))
wc = WordCloud(background_color="black",
max_words=10000,
mask=hcmask3,
stopwords=STOPWORDS,
max_font_size=40)
wc.generate(" ".join(hpl))
plt.title("Author: HP Lovecraft", fontsize=15)
plt.imshow(wc.recolor(colormap="Pastel2", random_state=17), alpha=0.98)
plt.axis("off")
Out[16]:
(-0.5, 511.5, 511.5, -0.5)
Word cloud for EAP:
In [17]:
plt.figure(figsize=(14,12))
wc = WordCloud(background_color="black",
max_words=10000,
mask=hcmask,
stopwords=STOPWORDS,
max_font_size=40)
wc.generate(" ".join(eap))
plt.title("Author: Edgar Allen Poe", fontsize=15)
plt.imshow(wc.recolor(colormap="PuBu", random_state=17), alpha=0.98)
plt.axis("off")
Out[17]:
(-0.5, 639.5, 390.5, -0.5)
Word cloud for MWS:
In [18]:
plt.figure(figsize=(12,10))
wc = WordCloud(background_color="black",
max_words=10000,
mask=hcmask2,
stopwords=STOPWORDS,
max_font_size=40)
wc.generate(" ".join(mws))
plt.title("Mary Shelley", fontsize=15)
plt.imshow(wc.recolor(colormap="viridis", random_state=17), alpha=0.98)
plt.axis("off")
Out[18]:
(-0.5, 639.5, 589.5, -0.5)
4 Text preprocessing
4.1 Preprocessing steps
Whatever the task (topic modeling, word clustering, document classification, ...), raw text generally has to pass through these preprocessing steps before a model or machine can read it (a compact sketch of the pipeline follows the list):
- Tokenization
- Stopword removal
- Stemming / lemmatization
- Vectorization
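A minimal sketch of the first three steps on a single document, as a hypothetical helper (assumes the NLTK data packages punkt, stopwords, and wordnet are installed; vectorization is handled later by CountVectorizer):
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(doc):
    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(doc)                      # 1. tokenization
    tokens = [t.lower() for t in tokens if t.isalpha()]   # drop punctuation-only tokens
    tokens = [t for t in tokens if t not in stop]         # 2. stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]      # 3. normalization (lemmatization)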
4.2 Tokenization
In [19]:
import nltk # pip install nltk
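word_tokenize and the corpora used below rely on NLTK data packages that must be downloaded once:
nltk.download("punkt")      # tokenizer models used by word_tokenize
nltk.download("stopwords")  # stopword lists used in section 4.3
nltk.download("wordnet")    # WordNet data used by the lemmatizer in section 4.4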
In [20]:
# Take the first text as an example
first_text = train.text.values[0]
first_text
Out[20]:
'This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.'
Splitting on whitespace with split:
In [21]:
print(first_text.split(" "))
['This', 'process,', 'however,', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon;', 'as', 'I', 'might', 'make', 'its', 'circuit,', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out,', 'without', 'being', 'aware', 'of', 'the', 'fact;', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall.']
Tokenizing with nltk's word_tokenize:
In [22]:
first_text_list = nltk.word_tokenize(first_text)
print(first_text_list)
['This', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', ';', 'as', 'I', 'might', 'make', 'its', 'circuit', ',', 'and', 'return', 'to', 'the', 'point', 'whence', 'I', 'set', 'out', ',', 'without', 'being', 'aware', 'of', 'the', 'fact', ';', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall', '.']
Comparing the two: split cuts on spaces only, so punctuation such as ; and . stays glued to the neighboring word, whereas word_tokenize separates punctuation into tokens of its own.
4.3 Stopword removal
In [23]:
stopwords = nltk.corpus.stopwords.words("english")
stopwords[:10] # the stopwords come as a plain list
Out[23]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
In [24]:
len(stopwords)
Out[24]:
179
Keep only the tokens that do not appear in the stopword list:
In [25]:
first_text_list_cleaned = [word for word in first_text_list if word.lower() not in stopwords]
print(first_text_list_cleaned)
['process', ',', 'however', ',', 'afforded', 'means', 'ascertaining', 'dimensions', 'dungeon', ';', 'might', 'make', 'circuit', ',', 'return', 'point', 'whence', 'set', ',', 'without', 'aware', 'fact', ';', 'perfectly', 'uniform', 'seemed', 'wall', '.']
In [26]:
print("Before Stopword Removal:", len(first_text_list))
print("After Stopword Removal:", len(first_text_list_cleaned))
Before Stopword Removal: 48
After Stopword Removal: 28
4.4 Stemming and lemmatization
In NLP, stemming and lemmatization are two standard preprocessing techniques for reducing words to a base form. Both shrink the number of inflected variants, which simplifies downstream analysis and comparison.
- Stemming is a rule-based method that strips suffixes to obtain a stem. It reduces a word to a crude base form without considering grammar or meaning: "running" and "runs" both become the stem "run" (irregular forms such as "ran" are typically missed, since the rules only strip suffixes). Stemming is simple, fast, and easy to implement, but for more demanding NLP tasks it may fail to recover a word's true base form.
- Lemmatization, by contrast, is a more sophisticated method: it applies grammatical and morphological rules to map a word to its dictionary form (lemma). It is more robust, handles more inflection patterns, and recovers the original meaning more accurately: "cars" becomes "car" and "ate" becomes "eat". The price is more computation, more time, and a more complex implementation.
Both techniques are useful across NLP. Normalizing words to a base form makes lexical relations, semantic similarity, and concept mappings easier to detect, which matters in information retrieval, text classification, sentiment analysis, machine translation, and more.
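A quick side-by-side comparison (a minimal sketch; note that WordNetLemmatizer treats words as nouns unless a part-of-speech tag is supplied, so pos="v" is passed for the verbs):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"), lemmatizer.lemmatize("running", pos="v"))  # run run
print(stemmer.stem("ate"),     lemmatizer.lemmatize("ate", pos="v"))      # ate eat
print(stemmer.stem("cars"),    lemmatizer.lemmatize("cars"))              # car car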
4.4.1 Stemming with PorterStemmer
In [27]:
stemmer = nltk.stem.PorterStemmer()
stemmer
Out[27]:
<PorterStemmer>
In [28]:
print("The stemmed form of running is: {}".format(stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(stemmer.stem("run")))
The stemmed form of running is: run
The stemmed form of runs is: run
The stemmed form of run is: run
The stemmer can also go wrong and produce a non-word:
In [29]:
print("The stemmed form of leaves is: {}".format(stemmer.stem("leaves")))
The stemmed form of leaves is: leav
4.4.2 Lemmatization with WordNetLemmatizer
In [30]:
from nltk.stem import WordNetLemmatizer # lemmatizer
lemm = WordNetLemmatizer() # instantiate
Lemmatization recovers the correct base form of "leaves":
In [31]:
print(f"The lemmatized form of leaves is : {lemm.lemmatize('leaves')}")
The lemmatized form of leaves is : leaf
4.5 Vectorizing raw text
In [32]:
# Build bag-of-words vectors
sentence = ["I love to eat Burgers", "I love to eat Fries"]
vectorizer = CountVectorizer() # vectorize by term counts
sentence_transform = vectorizer.fit_transform(sentence)
sentence_transform
Out[32]:
<2x5 sparse matrix of type '<class 'numpy.int64'>'
with 8 stored elements in Compressed Sparse Row format>
In [33]:
vectorizer.get_feature_names_out() # the learned vocabulary
Out[33]:
array(['burgers', 'eat', 'fries', 'love', 'to'], dtype=object)
In [34]:
sentence_transform.toarray()
Out[34]:
array([[1, 1, 0, 1, 1],
[0, 1, 1, 1, 1]], dtype=int64)
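TfidfVectorizer (already imported in section 1, though not used later in this project) has the same interface, but weights the counts by inverse document frequency; a quick sketch on the same two sentences:
tfidf_vectorizer = TfidfVectorizer()
tfidf_transform = tfidf_vectorizer.fit_transform(sentence)
print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_transform.toarray().round(2))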
5 Topic modeling
5.1 Two topic-modeling approaches
Two common approaches to topic modeling:
- LDA: Latent Dirichlet Allocation
- NMF: Non-negative Matrix Factorization (a short reference sketch follows the list)
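Only LDA is used below; for reference, a minimal self-contained NMF sketch on toy documents (not the competition data). NMF factorizes a TF-IDF matrix X (documents x terms) into W (documents x topics) and H (topics x terms):
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "cats and dogs are common pets",
        "stocks fell sharply on monday"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, random_state=0).fit(tfidf)  # X ≈ W @ H
print(nmf.components_.shape)  # H: (n_topics, n_terms) per-topic term weights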
In [35]:
def print_top_words(model, feature_names, n_top_words):
    for index, topic in enumerate(model.components_):
        message = "\nTopic #{}:".format(index)
        # argsort()[:-n_top_words - 1:-1] picks the indices of the n_top_words largest weights, descending
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
        print("*" * 50)
5.2 Subclassing CountVectorizer with lemmatization
In [36]:
lemm = WordNetLemmatizer()
class LemmaCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # wrap the default analyzer so every extracted token is lemmatized before counting
        analyzer = super(LemmaCountVectorizer, self).build_analyzer()
        return lambda doc: (lemm.lemmatize(w) for w in analyzer(doc))
In [37]:
text = list(train["text"].values)
text[:1]
Out[37]:
['This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.']
In [38]:
tf_vectorizer = LemmaCountVectorizer(max_df=0.95,
min_df=2,
stop_words="english",
decode_error="ignore")
tf = tf_vectorizer.fit_transform(text)
tf
Out[38]:
<19579x13781 sparse matrix of type '<class 'numpy.int64'>'
with 212037 stored elements in Compressed Sparse Row format>
In [39]:
feature_names = tf_vectorizer.get_feature_names_out()
count_vec = np.asarray(tf.sum(axis=0)).ravel()
In [40]:
feature_names
Out[40]:
array(['aback', 'abandon', 'abandoned', ..., 'zone', 'ædile', 'æronaut'],
dtype=object)
In [41]:
count_vec
Out[41]:
array([ 2, 11, 29, ..., 3, 2, 2], dtype=int64)
In [42]:
zipped = list(zip(feature_names, count_vec))
zipped[:5]
Out[42]:
[('aback', 2), ('abandon', 11), ('abandoned', 29), ('abandoning', 3), ('abandonment', 5)]
In [43]:
# sort (word, count) pairs by count, descending; x = words, y = counts
x, y = (list(t) for t in zip(*sorted(zipped, key=lambda t: t[1], reverse=True)))
X = np.concatenate([x[0:15], x[:-16:-1]])  # the 15 most and 15 least frequent words
Y = np.concatenate([y[0:15], y[:-16:-1]])
X
Out[43]:
array(['time', 'man', 'day', 'thing', 'eye', 'said', 'did', 'old', 'like',
'life', 'night', 'thought', 'little', 'great', 'long', 'æronaut',
'ædile', 'zodiacal', 'zigzagging', 'zigzag', 'zerubbabel', 'zar',
'zaimi', 'yuletide', 'yule', 'youngish', 'yorktown', 'yoke',
'yelling', 'yea'], dtype='<U10')
5.2.1 Top50
In [44]:
# 50 most frequent words
data = [go.Bar(
x = x[0:50],
y = y[0:50],
marker = dict(colorscale = "Jet",
color = y[0:50]),
text="Word Counts"
)]
layout = go.Layout(title = "Top50 Frequencies after Preprocessing")
fig = go.Figure(data=data, layout=layout)
fig.show()
5.2.2 Bottom100
In [45]:
# 100 least frequent words
data = [go.Bar(
x = x[-100:],
y = y[-100:],
marker = dict(colorscale = "Portland",
color = y[-100:]),
text="Word Counts"
)]
layout = go.Layout(title = "Bottom100 Frequencies after Preprocessing")
fig = go.Figure(data=data, layout=layout)
fig.show()
5.3 Topic modeling with LDA
In [46]:
lda = LatentDirichletAllocation(n_components=11,          # number of topics
                                max_iter=5,
                                learning_method="online",  # online variational Bayes
                                learning_offset=50,
                                random_state=0
                                )
In [47]:
# tf_vectorizer = LemmaCountVectorizer(max_df=0.95,min_df=2,stop_words="english",decode_error="ignore")
# tf = tf_vectorizer.fit_transform(text)
lda.fit(tf)
Out[47]:
LatentDirichletAllocation(learning_method='online', learning_offset=50,
max_iter=5, n_components=11, random_state=0)
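As a rough sanity check, the fitted model can score the matrix it was trained on via the estimator's perplexity method (held-out data would be needed for a real evaluation; a sketch, not run in this article):
print(lda.perplexity(tf))  # lower is better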
Inspect the fitted model, the feature names, and the number of top words to display:
In [48]:
n_top_words = 40
tf_feature_names = tf_vectorizer.get_feature_names_out()
print(lda)
print(tf_feature_names)
print(n_top_words)
LatentDirichletAllocation(learning_method='online', learning_offset=50,
max_iter=5, n_components=11, random_state=0)
['aback' 'abandon' 'abandoned' ... 'zone' 'ædile' 'æronaut']
40
In [49]:
print_top_words(lda, tf_feature_names, n_top_words)
Topic #0:mean night fact young return great human looking wonder countenance difficulty greater wife finally set possessed regard struck perceived act society law health key fearful mr exceedingly evidence carried home write lady various recall accident force poet neck conduct investigation
**************************************************
Topic #1:death love raymond hope heart word child went time good man ground evil long misery replied filled passion bed till happiness memory heavy region year escape spirit grief visit doe story beauty die plague making influence thou letter appeared power
**************************************************
Topic #2:left let hand said took say little length body air secret gave right having great arm thousand character minute foot true self gentleman pleasure box clock discovered point sought pain nearly case best mere course manner balloon fear head going
**************************************************
Topic #3:called sense table suddenly sympathy machine sens unusual labour thrown mist solution suppose specie movement whispered urged frequent wine hour appears ring turk place stage noon justine ceased obscure chair completely exist sitting supply weird bottle seated drink material bell
**************************************************
Topic #4:house man old soon city room sight did believe mr light entered sir cloud order ill way dr apparently clear certain forgotten day quite door considered need great fine began journey search walked disposition view long concerning walk drawn saw
**************************************************
Topic #5:thing thought eye mind said men night like face life head dream knew saw form world away deep stone told matter morning perdita dead general man strange seen terrible sleep tell object tear know account better black say remained little
**************************************************
Topic #6:father moon stood longer attention end sure leave remember time excited period trace dream given star place able grew subject set cut visited captain consequence marie taking forward started descent atmosphere impulse departure dog men truly abyss appear magnificent quarter
**************************************************
Topic #7:day did heard life time friend new far horror nature come look tree year present soul passed known people heart felt degree scene idea hand feeling world came country adrian moment make word affection sun gone reached idris youth seen
**************************************************
Topic #8:came earth street near like sound wall window just open lay fell wind looked saw moment water eye dark spirit beneath mountain old did light foot long town space floor low happy held half voice living direction ear small end
**************************************************
Topic #9:shall place sea time think long fear know mother day person say brought expression land change question night result ye week mad month feel god rest got manner course horrible large resolved kind passage far discovery word answer eye ago
**************************************************
Topic #10:door turned close away design view doubt ordinary tried oh madness room enemy le lower exertion chamber opening candle legend occupation abode lofty author compartment breath flame accursed machinery horse iron proceeded curse ve louder desired entering appeared lock oil
**************************************************
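The fitted model can also assign topics to individual documents: transform returns a document-topic distribution (a short sketch):
doc_topic = lda.transform(tf)  # shape (19579, 11); each row sums to 1
print(doc_topic[0].round(3))   # topic mixture of the first text
print(doc_topic[0].argmax())   # its single most probable topic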
5.3.1 Extracting topic components
In [50]:
# each row of lda.components_ holds one topic's (unnormalized) word weights
topic1 = lda.components_[0]
topic2 = lda.components_[1]
topic3 = lda.components_[2]
topic4 = lda.components_[3]
In [51]:
topic1.shape, topic2.shape, topic3.shape, topic4.shape
Out[51]:
((13781,), (13781,), (13781,), (13781,))
5.3.2 Topic word clouds
In [52]:
# top-50 words of each topic, in descending weight order
topic1_words = [tf_feature_names[i] for i in topic1.argsort()[:-50 - 1 : -1]]
topic2_words = [tf_feature_names[i] for i in topic2.argsort()[:-50 - 1 : -1]]
topic3_words = [tf_feature_names[i] for i in topic3.argsort()[:-50 - 1 : -1]]
topic4_words = [tf_feature_names[i] for i in topic4.argsort()[:-50 - 1 : -1]]
In [53]:
wordcloud1 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic1_words))
plt.imshow(wordcloud1)
plt.axis("off")
plt.show()
Word cloud for topic 2:
In [54]:
wordcloud2 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic2_words))
plt.imshow(wordcloud2)
plt.axis("off")
plt.show()
Word cloud for topic 3:
In [55]:
wordcloud3 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic3_words))
plt.imshow(wordcloud3)
plt.axis("off")
plt.show()
Word cloud for topic 4:
In [56]:
wordcloud4 = WordCloud(stopwords=STOPWORDS,
background_color="black",
width=2500,
height=1800,
).generate(" ".join(topic4_words))
plt.imshow(wordcloud4)
plt.axis("off")
plt.show()
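The four nearly identical cells above could also be collapsed into a single loop over a 2x2 grid (a compact alternative sketch):
topic_words = [topic1_words, topic2_words, topic3_words, topic4_words]
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
for ax, words in zip(axes.ravel(), topic_words):
    wc = WordCloud(stopwords=STOPWORDS, background_color="black",
                   width=2500, height=1800).generate(" ".join(words))
    ax.imshow(wc)
    ax.axis("off")
plt.show()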