Machine Learning: Identifying the Author of Dream of the Red Chamber (Naive Bayes Classification)


The previous article used a clustering algorithm, with no prior assumption, to examine the authorship of Dream of the Red Chamber. This article instead uses a supervised method, naive Bayes classification: we hypothesize that the last 40 chapters were not written by Cao Xueqin, and then try to verify that hypothesis.

One of the better explanations of naive Bayes I have read: A simple explanation of Naive Bayes Classification

Basic procedure

Start from the hypothesis that the first 80 chapters and the last 40 chapters were written by different people. Using the word vectors built in the previous article, draw a stratified random sample from the data set — for example, randomly pick 60 of the first 80 chapters and 30 of the last 40 for training — then predict on the held-out chapters. If the predictions are accurate, the hypothesis is supported. As a follow-up check, use only the first 80 chapters, label the first 40 and second 40 with 0 and 1, and run the same training. Since both halves were written by the same person, prediction should become harder; if the accuracy is lower than in the first experiment, that indirectly supports the hypothesis.

Preparation

import numpy as np
import pandas as pd

import jieba
import jieba.analyse

text = pd.read_csv("./datasets/hongloumeng.csv")

# Extract the top 1000 keywords of each chapter with jieba, then join them
# with spaces so CountVectorizer can tokenize them
vorc = [jieba.analyse.extract_tags(i, topK=1000) for i in text["text"]]
vorc = [" ".join(i) for i in vorc]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
train_data_features = vectorizer.fit_transform(vorc)

train_data_features = train_data_features.toarray()

train_data_features.shape

The code above is the preparation step: it generates the word vectors.

Label generation

labels = np.array([0] * 80 + [1] * 40)  # target values: 0 = first 80 chapters, 1 = last 40
labels.shape

This step generates the target labels. A 1-D array is used because scikit-learn expects `y` to be a flat vector (a column vector triggers a DataConversionWarning).

Stratified random sampling

# Stratified sampling
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(train_data_features, labels,
                                                    test_size=0.2, stratify=labels)

Not much explanation is needed here: stratify=labels means the random split preserves the class proportions of the target labels.
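A quick sketch with synthetic features (standing in for the real word vectors) shows what stratify guarantees — both splits keep the 80:40 class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 80 "front" chapters (0) and 40 "back" chapters (1)
X = np.random.rand(120, 5)
y = np.array([0] * 80 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
# Class proportions (2:1) are preserved in both splits
print(np.bincount(y_tr))  # [64 32]
print(np.bincount(y_te))  # [16  8]
```

Without stratify, a small test set could by chance contain almost no back-40 chapters, which would make the accuracy numbers unreliable.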

Model training and prediction

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, Y_train)

y_pred = gnb.predict(X_test)

Compute the accuracy of the resulting predictions:

from sklearn.metrics import accuracy_score
accuracy_score(Y_test, y_pred)
# 0.875
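Accuracy alone hides which class the errors fall into; a confusion matrix makes that visible. A sketch with hypothetical predictions for a 24-chapter test split (the flipped indices are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test split: 16 front-80 chapters (0) and 8 back-40 chapters (1)
y_true = np.array([0] * 16 + [1] * 8)
y_pred = y_true.copy()
y_pred[[0, 1, 16]] = 1 - y_pred[[0, 1, 16]]  # flip three predictions

print(accuracy_score(y_true, y_pred))   # 21/24 = 0.875
print(confusion_matrix(y_true, y_pred)) # rows = true class, cols = predicted
```

Here two front chapters are misattributed to the back 40 and one back chapter to the front 80; reporting the matrix alongside the accuracy shows whether errors concentrate in one class.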

Stratified cross-validation

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2)
scores = []
for train_index, test_index in sss.split(train_data_features, labels):
    X_train, X_test = train_data_features[train_index], train_data_features[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
    gnb = GaussianNB()
    gnb.fit(X_train, Y_train)
    Y_pred = gnb.predict(X_test)
    scores.append(accuracy_score(Y_test, Y_pred))
print(scores)
print(np.array(scores).mean())

The output is:

[0.9166666666666666, 0.8333333333333334, 1.0, 0.875, 0.75, 0.8333333333333334, 0.8333333333333334, 0.9583333333333334, 0.875, 0.8333333333333334]
0.8708333333333333

As shown above, with a mean accuracy around 0.87, the hypothesis stated at the beginning of the article looks preliminarily correct.
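The explicit loop above can also be written more compactly with cross_val_score. A sketch using synthetic data (the Poisson counts are a made-up stand-in for the real 120 × 5000 word-count matrix):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the chapter word-count matrix: two groups with
# slightly different token-frequency distributions
rng = np.random.default_rng(0)
X = np.vstack([rng.poisson(2.0, (80, 50)), rng.poisson(3.0, (40, 50))])
y = np.array([0] * 80 + [1] * 40)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)  # one accuracy per split
print(scores)
print(scores.mean())
```

cross_val_score handles the fit/predict/score bookkeeping itself, so the manual `scores` list is no longer needed.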

Cross-validation (first 80 chapters only)

labels_val = np.array([0] * 40 + [1] * 40)  # target values: first vs. second half of the first 80 chapters
sss_val = StratifiedShuffleSplit(n_splits=5, test_size=0.2)  # 5 splits, 20% test / 80% train
scores = []
train_data_features_val = train_data_features[0:80]
for train_index, test_index in sss_val.split(train_data_features_val, labels_val):
    X_train, X_test = train_data_features_val[train_index], train_data_features_val[test_index]
    Y_train, Y_test = labels_val[train_index], labels_val[test_index]
    gnb = GaussianNB()
    gnb.fit(X_train, Y_train)
    Y_pred = gnb.predict(X_test)
    scores.append(accuracy_score(Y_test, Y_pred))
print(scores)
print(np.array(scores).mean())

The output is:

[0.8125, 0.875, 0.75, 0.875, 0.75]
0.8125

Averaging over repeated runs, the mean score of this second experiment (0.8125) is consistently lower than that of the first (about 0.87), which supports the hypothesis stated at the beginning of the article.

Full code: github