If you spot any mistakes, please point them out.
Recently I have been learning about feature dimensionality reduction and feature extraction in sklearn; see my previous posts for the details. As mentioned there, a text can be encoded as a term-frequency vector, as a TF-IDF weight vector, or as a hashed vector. The hashed vector can be understood as a feature extraction step that also reduces dimensionality.
First, a disclaimer: I have not worked on NLP tasks before, and I have not used sequence networks such as RNNs or LSTMs. But once a text, which is really just another kind of sample, has been encoded as a vector, we can combine it with the machine learning classifiers and deep learning models covered earlier and train a classifier on that feature vector. That already gives us text classification.
So it does not matter if you cannot build an RNN or LSTM: as long as the text features can be extracted in some other way, that is enough, and the features can be used for whatever comes next. It is the same idea as C3D in video understanding, which encodes a video into a 4096-dimensional feature vector that is then used for downstream tasks. Likewise, once the text is encoded as a hashed vector, a classifier can be applied to it to perform the NLP task of text classification.
That is the end of the theory; now for the experiments. For the hash-encoded feature vectors I tested the classification performance of an SVM, a random forest, and a small neural network. I also tested the TF-IDF weight encoding with an SVM and looked at which features influence the classification result. Finally, an official scikit-learn example is attached for reference.
@[toc]
The dataset used here is the 20 newsgroups dataset that ships with sklearn: a news classification dataset with 20 classes and 18846 samples, 11314 of which are training samples. Its class names are:
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
Four of the categories are used for the text classification task: "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space".
1. Text Vector Extraction
1.1 Loading the Dataset
from sklearn.datasets import fetch_20newsgroups
categories = [ "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space",]
remove = ("headers", "footers", "quotes")
data_train = fetch_20newsgroups(
data_home='./dataset/', subset='train', categories=categories, remove=remove, random_state=42
)
data_test = fetch_20newsgroups(
data_home='./dataset/', subset='test', categories=categories, remove=remove, random_state=42
)
data_train.target_names, data_test.target_names
# Output:
# (['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'],
# ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'])
y_train, y_test = data_train.target, data_test.target
Note: remove is used here to strip each file's headers, signature blocks and quote blocks, which makes the data more realistic. If they are not removed, classifiers end up overfitting on a lot of metadata:
- Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.
- Another significant feature involves whether the sender is affiliated with a university, as indicated by their headers or their signature.
- The word "article" is a significant feature, based on how often people quote previous posts like this: "In article [article ID], [name] <[e-mail address]> wrote:"
- Other features match the names and e-mail addresses of particular people who were posting at the time.
With such an abundance of clues that distinguish newsgroups, the classifiers barely have to identify topics from the text at all, and they all perform at the same high level. That is why the ('headers', 'footers', 'quotes') information is removed here; the sketch below shows what actually gets stripped.
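To check the effect of remove, you can compare a raw document with its filtered counterpart. A minimal sketch (not part of the experiment above; the category picked here is just an example):
from sklearn.datasets import fetch_20newsgroups
# load the same subset twice: once raw, once with the metadata removed
raw = fetch_20newsgroups(data_home='./dataset/', subset='train', categories=['sci.space'])
filtered = fetch_20newsgroups(data_home='./dataset/', subset='train', categories=['sci.space'],
                              remove=('headers', 'footers', 'quotes'))
print(raw.data[0][:300])       # starts with "From:", "Subject:" and other header lines
print('-' * 40)
print(filtered.data[0][:300])  # only the message body remains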
1.2 Hash Encoding
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = HashingVectorizer(stop_words="english", alternate_sign=False, n_features=2 ** 10)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 1024), X_test.shape:(1353, 1024)
1.3 Chi-Square Filtering
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
select_kbest = SelectKBest(chi2, k=200)
X_train = select_kbest.fit_transform(X_train, y_train)
X_test = select_kbest.transform(X_test)
print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))
# Output:
# X_train.shape:(2034, 200), X_test.shape:(1353, 200)
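If you are curious which of the 1024 hashed columns survived the filtering, the fitted selector can report them. A minimal sketch (continuing from the variables above, not part of the original experiment):
# indices of the 200 hash buckets kept by SelectKBest, and their chi-square scores
kept = select_kbest.get_support(indices=True)
print(kept[:10])
print(select_kbest.scores_[kept][:10])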
For the details of hash encoding and chi-square filtering, see my two earlier notes:
1. sklearn特征提取方法汇总(包含字典、文本、图像的特征提取)
2. sklearn特征降维方法汇总(方差过滤,卡方,F过滤,互信息,嵌入法)
2. Machine Learning Model Training
2.1 SVM Model Test
from time import time
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
t0 = time()
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score,
      "train accuracy: %0.3f" % train_score)
Output:
train time: 0.020s
test time: 0.001s
test accuracy: 0.662 train accuracy: 0.775
2.2 Random Forest Model Test
from time import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
t0 = time()
clf = RandomForestClassifier(verbose=1, random_state=42)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score,
      "train accuracy: %0.3f" % train_score)
Output:
train time: 0.466s
test time: 0.032s
test accuracy: 0.617 train accuracy: 0.964
3. Neural Network Test
Text classification really boils down to encoding the text into features, and that step has already been done by the hash encoding and chi-square filtering above: every document has been turned into a 200-dimensional feature vector, so we can now build a neural network on top of those features and train it.
Below, a simple multilayer perceptron is used for training. It is just three fully connected layers with no special design (honestly, I would not know how to design anything fancier).
3.1 Building the Network
import torch.nn as nn
class MLP(nn.Module):
    def __init__(self, embedding=200, n_class=4):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(embedding, 256),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(512, n_class),
        )
    def forward(self, x):
        return self.model(x)
3.2 Training the Network
- Data format conversion
Before training, the data format has to be converted. The matrix coming out of the chi-square filtering is sparse, so it must be converted to a numpy array with .toarray(), e.g. X_train = X_train.toarray().
As for the exact formats needed afterwards, you can simply run the code and follow the error messages, or use the conversions below, which should basically be correct.
import torch
X_train = torch.tensor(X_train.toarray()).float()
X_test = torch.tensor(X_test.toarray()).float()
y_train = torch.tensor(y_train).long()
y_test = torch.tensor(y_test).long()
- Training the network
After the data format conversion, the network can be trained with the following reference code:
import torch
from torch import optim
from time import time
from sklearn.metrics import accuracy_score
# hyper-parameters
epochsize = 500
learning_rate = 1e-3
best_acc = 0
model = MLP()
criteon = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# training loop
t0 = time()
for epoch in range(epochsize):
    model.train()
    # forward pass and loss
    pred = model(X_train)
    loss = criteon(pred, y_train)
    # backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # check the accuracy on the test set during training
    # print(epoch, 'loss:', loss.item())
    model.eval()
    with torch.no_grad():
        category = model(X_test)
        pred = category.argmax(dim=1)
        score = accuracy_score(y_test, pred)
        if best_acc < score:
            best_acc = score
train_time = time() - t0
print("_" * 80)
print("best acc:{}".format(best_acc))
print("train time: %0.3fs" % train_time)
Output:
________________________________________________________________________________
best acc:0.6688839615668883
train time: 2.777s
In the end, the neural network gives roughly the same result as the SVM: whether it is the ensemble method, the support vector machine, or the neural network, the final accuracy is around 66%.
4. Weight (TF-IDF) Vector Encoding
For text classification, if you want to know which word features matter most for a given class, you can use the weight vector encoding together with a classifier that assigns weights to features, such as an SVM.
4.1 Getting the Indices of the N Largest Values in a List
This subsection introduces a small trick, namely how to do this with a standard-library module; the original link is reference 3.
- For lists without duplicate values
import heapq
n = 3
lis = [2, 4, 5, 1, 7]
re1 = map(lis.index, heapq.nlargest(n, lis))  # indices of the n largest values (nsmallest finds the smallest, nlargest the largest)
re2 = heapq.nlargest(n, lis)  # the n largest elements themselves
print(list(re1))  # re1 is a map object, not a list, so it cannot be printed directly; wrap it with list()
print(re2)
- For lists with duplicate values
import heapq
lis = [2, 4, 4, 1, 0]
n = 3
max_number = heapq.nlargest(n, lis)
max_index = []
for t in max_number:
    index = lis.index(t)
    max_index.append(index)
    lis[index] = 0  # zero out the matched value so list.index does not return the same position twice
print(max_number)
print(max_index)
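As a side note (an alternative not taken from the reference), heapq.nlargest can also return the indices directly and handle duplicates without modifying the list, by ranking the index positions by the values they point to:
import heapq
lis = [2, 4, 4, 1, 0]
n = 3
# rank the indices 0..len(lis)-1 by the value stored at each index
top_index = heapq.nlargest(n, range(len(lis)), key=lambda i: lis[i])
print(top_index)                     # [1, 2, 0]
print([lis[i] for i in top_index])   # [4, 4, 2]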
4.2 Viewing the N Most Informative Feature Names
Note: the clf and vectorizer used below are the TfidfVectorizer and the LinearSVC trained on the TF-IDF features in section 4.3 (a HashingVectorizer cannot map columns back to words), which is why coef_ has 26576 columns.
- Implementation with np.argsort
import numpy as np
# show the 10 most informative feature words per class
def show_top10(classifier, vectorizer, categories):
    # get the mapping from column index to word
    # vectorizer.get_feature_names_out(): returns the vocabulary as a list
    # vectorizer.vocabulary_: returns the vocabulary as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # classifier.coef_.shape: (4, 26576), i.e. the importance of every word for every class
        # np.argsort returns the indices that would sort the array: sort first, then take the last 10
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s:: %s" % (category, " ".join(feature_names[top10])))
show_top10(clf, vectorizer, data_train.target_names)
Output:
alt.atheism:: nanci islamic deletion motto islam atheist bobby atheists religion atheism
comp.graphics:: card images 42 looking hi computer 3d file image graphics
sci.space:: flight mars solar moon shuttle spacecraft launch nasa orbit space
talk.religion.misc:: commandment koresh blood jesus children rosicrucian christ fbi christians christian
The code comes from reference 4.
- Implementation with heapq
import numpy as np
import heapq
# show the 10 most informative feature words per class
def show_top10(classifier, vectorizer, categories, top_k=10):
    # get the mapping from column index to word
    # vectorizer.get_feature_names_out(): returns the vocabulary as a list
    # vectorizer.vocabulary_: returns the vocabulary as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # use heapq to get the indices of the top_k largest weights
        top = map(list(classifier.coef_[i]).index, heapq.nlargest(top_k, classifier.coef_[i]))
        top = list(top)
        print("%s:: %s" % (category, " ".join(feature_names[top])))
show_top10(clf, vectorizer, data_train.target_names)
Output:
alt.atheism:: atheism religion atheists bobby atheist islam motto deletion islamic nanci
comp.graphics:: graphics image file 3d computer hi looking 42 images card
sci.space:: space orbit nasa launch spacecraft shuttle moon solar mars flight
talk.religion.misc:: christian christians fbi christ rosicrucian children jesus blood koresh commandment
As you can see, the two methods produce the same words; the second one additionally lists them by weight from largest to smallest. From the output it is roughly clear that the larger a word's weight, the more related it is to the corresponding class.
4.3 Text Classification with the Weight Encoding
Reference code:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
# TF-IDF weight vector encoding
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
# X_train.shape, X_test.shape: ((2034, 26576), (1353, 26576))
y_train, y_test = data_train.target, data_test.target
# build the classifier and train it
t0 = time()
clf = LinearSVC(penalty='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)
# check the training results
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy: %0.3f" % test_score,
      "train accuracy: %0.3f" % train_score)
Output:
train time: 0.247s
test time: 0.002s
test accuracy: 0.780 train accuracy: 0.978
Analysis: the weight encoding clearly performs better than the hash encoding. Hash encoding is essentially a dimensionality reduction method: it shortens the training time, but it also throws away part of the information, so the hash-encoded features classify a bit worse while training a bit faster. The sketch below illustrates where that loss comes from.
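The information loss comes from hash collisions: with a small n_features, different words can be hashed into the same bucket and can no longer be told apart. A minimal sketch (toy phrases; n_features=8 is chosen only to make collisions likely):
from sklearn.feature_extraction.text import HashingVectorizer
# with only 8 buckets, unrelated words are likely to end up sharing a column
vec = HashingVectorizer(n_features=8, alternate_sign=False, norm=None)
X = vec.transform(['space shuttle', 'graphics card'])
print(X.toarray())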
5. Classifying Text Documents Using Sparse Features
Finally, here is the official scikit-learn example, included for my own study; see reference 5 for the source.
This example shows how scikit-learn can be used to classify documents by topic using a bag-of-words approach. It uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can handle sparse matrices efficiently.
The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded and then cached.
5.1 Parameter Setup
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
# Olivier Grisel <olivier.grisel@ensta.org>
# Mathieu Blondel <mathieu@mblondel.org>
# Lars Buitinck
# License: BSD 3 clause
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
op = OptionParser()
op.add_option(
"--report",
action="store_true",
dest="print_report",
help="Print a detailed classification report.",
)
op.add_option(
"--chi2_select",
action="store",
type="int",
dest="select_chi2",
help="Select some number of features using a chi-squared test",
)
op.add_option(
"--confusion_matrix",
action="store_true",
dest="print_cm",
help="Print the confusion matrix.",
)
op.add_option(
"--top10",
action="store_true",
dest="print_top10",
help="Print ten most discriminative terms per class for every classifier.",
)
op.add_option(
"--all_categories",
action="store_true",
dest="all_categories",
help="Whether to use all categories or not.",
)
op.add_option("--use_hashing", action="store_true", help="Use a hashing vectorizer.")
op.add_option(
"--n_features",
action="store",
type=int,
default=2 ** 16,
help="n_features when using the hashing vectorizer.",
)
op.add_option(
"--filtered",
action="store_true",
help=(
"Remove newsgroup information that is easily overfit: "
"headers, signatures, and quoting."
),
)
def is_interactive():
    return not hasattr(sys.modules["__main__"], "__file__")
# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)
print(__doc__)
op.print_help()
print()
Out:
Usage: plot_document_classification_20newsgroups.py [options]
Options:
-h, --help show this help message and exit
--report Print a detailed classification report.
--chi2_select=SELECT_CHI2
Select some number of features using a chi-squared
test
--confusion_matrix Print the confusion matrix.
--top10 Print ten most discriminative terms per class for
every classifier.
--all_categories Whether to use all categories or not.
--use_hashing Use a hashing vectorizer.
--n_features=N_FEATURES
n_features when using the hashing vectorizer.
--filtered Remove newsgroup information that is easily overfit:
headers, signatures, and quoting.
5.2 Loading Data from the Training Set
Load the data from the newsgroups dataset, which comprises around 18000 newsgroup posts on 20 topics split into two subsets: one for training (or development) and one for testing (or performance evaluation).
if opts.all_categories:
    categories = None
else:
    categories = [
        "alt.atheism",
        "talk.religion.misc",
        "comp.graphics",
        "sci.space",
    ]
if opts.filtered:
    remove = ("headers", "footers", "quotes")
else:
    remove = ()
print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")
data_train = fetch_20newsgroups(
subset="train", categories=categories, shuffle=True, random_state=42, remove=remove
)
data_test = fetch_20newsgroups(
subset="test", categories=categories, shuffle=True, random_state=42, remove=remove
)
print("data loaded")
# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names
def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
print(
"%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb)
)
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))
print("%d categories" % len(target_names))
print()
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(
        stop_words="english", alternate_sign=False, n_features=opts.n_features
    )
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()
print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()
# mapping from integer feature name to original token string
if opts.use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names_out()
if opts.select_chi2:
    print("Extracting %d best features by a chi-squared test" % opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names is not None:
        # keep selected feature names
        feature_names = feature_names[ch2.get_support()]
    print("done in %fs" % (time() - t0))
    print()
def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."
Out:
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data loaded
2034 documents - 3.980MB (training set)
1353 documents - 2.867MB (test set)
4 categories
Extracting features from the training data using a sparse vectorizer
done in 0.383082s at 10.388MB/s
n_samples: 2034, n_features: 33809
Extracting features from the test data using the same vectorizer
done in 0.236998s at 12.099MB/s
n_samples: 1353, n_features: 33809
5.3 Building the Classifiers
Train and test the dataset with 15 different classification models and collect the performance results of each.
def benchmark(clf):
    print("_" * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)
    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time: %0.3fs" % test_time)
    score = metrics.accuracy_score(y_test, pred)
    print("accuracy: %0.3f" % score)
    if hasattr(clf, "coef_"):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))
        if opts.print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()
    if opts.print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred, target_names=target_names))
    if opts.print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))
    print()
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, train_time, test_time
results = []
for clf, name in (
    (RidgeClassifier(tol=1e-2, solver="sag"), "Ridge Classifier"),
    (Perceptron(max_iter=50), "Perceptron"),
    (PassiveAggressiveClassifier(max_iter=50), "Passive-Aggressive"),
    (KNeighborsClassifier(n_neighbors=10), "kNN"),
    (RandomForestClassifier(), "Random forest"),
):
    print("=" * 80)
    print(name)
    results.append(benchmark(clf))
for penalty in ["l2", "l1"]:
    print("=" * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(penalty=penalty, dual=False, tol=1e-3)))
    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty=penalty)))
# Train SGD with Elastic Net penalty
print("=" * 80)
print("Elastic-Net penalty")
results.append(
benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty="elasticnet"))
)
# Train NearestCentroid without threshold
print("=" * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))
# Train sparse Naive Bayes classifiers
print("=" * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=0.01)))
results.append(benchmark(BernoulliNB(alpha=0.01)))
results.append(benchmark(ComplementNB(alpha=0.1)))
print("=" * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(
benchmark(
Pipeline(
[
(
"feature_selection",
SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3)),
),
("classification", LinearSVC(penalty="l2")),
]
)
)
)
Out:
================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(solver='sag', tol=0.01)
/home/circleci/project/sklearn/linear_model/_ridge.py:729: UserWarning: "sag" solver requires many iterations to fit an intercept with sparse inputs. Either set the solver to "auto" or "sparse_cg", or set a low "tol" and a high "max_iter" (especially if inputs are not standardized).
warnings.warn(
train time: 0.167s
test time: 0.001s
accuracy: 0.898
dimensionality: 33809
density: 1.000000
================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(max_iter=50)
train time: 0.015s
test time: 0.001s
accuracy: 0.888
dimensionality: 33809
density: 0.255302
================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(max_iter=50)
train time: 0.027s
test time: 0.001s
accuracy: 0.902
dimensionality: 33809
density: 0.711867
================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(n_neighbors=10)
train time: 0.001s
test time: 0.148s
accuracy: 0.858
================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier()
train time: 1.258s
test time: 0.079s
accuracy: 0.826
================================================================================
L2 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, tol=0.001)
train time: 0.072s
test time: 0.001s
accuracy: 0.900
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50)
train time: 0.024s
test time: 0.001s
accuracy: 0.903
dimensionality: 33809
density: 0.579424
================================================================================
L1 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, penalty='l1', tol=0.001)
train time: 0.176s
test time: 0.001s
accuracy: 0.873
dimensionality: 33809
density: 0.005553
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='l1')
train time: 0.092s
test time: 0.002s
accuracy: 0.880
dimensionality: 33809
density: 0.022509
================================================================================
Elastic-Net penalty
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='elasticnet')
train time: 0.134s
test time: 0.001s
accuracy: 0.901
dimensionality: 33809
density: 0.184685
================================================================================
NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training:
NearestCentroid()
train time: 0.004s
test time: 0.002s
accuracy: 0.855
================================================================================
Naive Bayes
________________________________________________________________________________
Training:
MultinomialNB(alpha=0.01)
train time: 0.003s
test time: 0.001s
accuracy: 0.899
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
BernoulliNB(alpha=0.01)
train time: 0.005s
test time: 0.004s
accuracy: 0.884
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000
________________________________________________________________________________
Training:
ComplementNB(alpha=0.1)
train time: 0.003s
test time: 0.001s
accuracy: 0.911
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000
================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(steps=[('feature_selection',
SelectFromModel(estimator=LinearSVC(dual=False, penalty='l1',
tol=0.001))),
('classification', LinearSVC())])
train time: 0.192s
test time: 0.002s
accuracy: 0.879
5.4 Visualization
Bar plots show each classifier's accuracy, training time (normalized) and test time (normalized).
indices = np.arange(len(results))
results = [[x[i] for x in results] for i in range(4)]
clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)
plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score, 0.2, label="score", color="navy")
plt.barh(indices + 0.3, training_time, 0.2, label="training time", color="c")
plt.barh(indices + 0.6, test_time, 0.2, label="test time", color="darkorange")
plt.yticks(())
plt.legend(loc="best")
plt.subplots_adjust(left=0.25)
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(bottom=0.05)
for i, c in zip(indices, clf_names):
    plt.text(-0.3, i, c)
plt.show()
Out: (bar chart of score, normalized training time, and normalized test time for each classifier)
References:
1. sklearn特征提取方法汇总(包含字典、文本、图像的特征提取)
2. sklearn特征降维方法汇总(方差过滤,卡方,F过滤,互信息,嵌入法)