Below is a summary of what I have learned.
# Example where the labels are strings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In the modeling process, most attention usually goes to the algorithm and its parameters, but it is the data features that determine the upper bound of the result; the algorithm and parameters only determine how closely that bound is approached. Feature engineering is about finding the most valuable information in the raw data and converting it into a form a computer can work with.
Numerical features
Here we do not need to think about what each feature actually means; we simply operate on the features themselves.
1. String encoding
Class labels may contain strings, so we first convert those strings into numbers.
# Running this directly raises an error: 'utf-8' codec can't decode byte 0xa8. The default encoding is utf-8, so the encoding has to be changed.
vg_df = pd.read_csv('./vgsales.csv',encoding="ISO-8859-1")
vg_df.iloc[0:7]
| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
5 | 6 | Tetris | GB | 1989.0 | Puzzle | Nintendo | 23.20 | 2.26 | 4.22 | 0.58 | 30.26 |
6 | 7 | New Super Mario Bros. | DS | 2006.0 | Platform | Nintendo | 11.38 | 9.23 | 6.50 | 2.90 | 30.01 |
# Find all the unique values of the Genre column and show them
genres = np.unique(vg_df['Genre'])
genres
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy'], dtype=object)
We can see there are 12 distinct genre values; we just need to convert them into numbers.
- Encoding method 1: integer label mapping
# The simplest approach
from sklearn.preprocessing import LabelEncoder
# LabelEncoder handles the mapping for us
gle = LabelEncoder()
# fit_transform does the actual work, automatically mapping each category value
genre_labels = gle.fit_transform(vg_df['Genre'])
# Show the resulting mapping
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings
{0: 'Action',
1: 'Adventure',
2: 'Fighting',
3: 'Misc',
4: 'Platform',
5: 'Puzzle',
6: 'Racing',
7: 'Role-Playing',
8: 'Shooter',
9: 'Simulation',
10: 'Sports',
11: 'Strategy'}
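As a small extra step (not shown in the original output), the integer labels can be written back into the DataFrame and decoded again; the GenreLabel column name below is just an illustrative choice:
# attach the encoded labels as a new column (hypothetical column name)
vg_df['GenreLabel'] = genre_labels
# inverse_transform recovers the original genre strings from the integers
gle.inverse_transform(genre_labels[:5])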
- Encoding method 2: custom mapping
This is a refinement of integer mapping: we define the mapping ourselves.
# Load another dataset
poke_df = pd.read_csv('./Pokemon.csv',encoding='utf-8')
poke_df.iloc[0:7]
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | Gen 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | Gen 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | Gen 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | Gen 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | Gen 1 | False |
5 | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | 80 | Gen 1 | False |
6 | 6 | Charizard | Fire | Flying | 534 | 78 | 84 | 78 | 109 | 85 | 100 | Gen 1 | False |
# Shuffle the rows and check the unique values of the Generation column
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
np.unique(poke_df['Generation'])
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)
# Define the mapping for each category value ourselves
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
# Show rows 4 to 9
poke_df.iloc[4:10]
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | GenerationLabel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 14 | Kakuna | Bug | Poison | 205 | 45 | 25 | 50 | 25 | 25 | 35 | Gen 1 | False | 1 |
5 | 462 | Magnezone | Electric | Steel | 535 | 70 | 70 | 115 | 130 | 90 | 60 | Gen 4 | False | 4 |
6 | 193 | Yanma | Bug | Flying | 390 | 65 | 65 | 45 | 75 | 45 | 95 | Gen 2 | False | 2 |
7 | 214 | HeracrossMega Heracross | Bug | Fighting | 600 | 80 | 185 | 115 | 40 | 105 | 75 | Gen 2 | False | 2 |
8 | 324 | Torkoal | Fire | NaN | 470 | 70 | 85 | 140 | 85 | 70 | 20 | Gen 3 | False | 3 |
9 | 678 | MeowsticFemale | Psychic | NaN | 466 | 74 | 48 | 76 | 83 | 81 | 104 | Gen 6 | False | 6 |
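One caveat with a hand-written mapping, added here as a side note: any value missing from the dictionary becomes NaN after .map, so it is worth checking for that (the 'Gen 7' value below is made up for illustration):
# a value not present in gen_ord_map is mapped to NaN
pd.Series(['Gen 1', 'Gen 7']).map(gen_ord_map)
# confirm that no generation in the real data was left unmapped
poke_df['GenerationLabel'].isnull().sum()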
- Encoding method 3: one-hot encoding
Implemented with the OneHotEncoder class from sklearn.
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# First apply LabelEncoder
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
# The following (commented) line would reproduce the same values as the GenerationLabel column
# poke_df['Gen_Label'] = gen_labels + 1
poke_df['Gen_Label'] = gen_labels
# Then apply OneHotEncoder
gen_ohe = OneHotEncoder()
# gen_feature_arr is now a matrix of 0/1 indicators
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
# gen_feature_labels holds the column labels
# Output: ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']
# Combine the transformed features into a DataFrame
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
# Keep only a few columns to simplify the output
poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary']]
poke_df_ohe = pd.concat([poke_df_sub, gen_features], axis=1)
poke_df_ohe.head()
| | Name | Generation | Gen_Label | Legendary | Gen 1 | Gen 2 | Gen 3 | Gen 4 | Gen 5 | Gen 6 |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Rapidash | Gen 1 | 0 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | Togekiss | Gen 4 | 3 | False | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | Basculin | Gen 5 | 4 | False | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | Zangoose | Gen 3 | 2 | False | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
4 | Kakuna | Gen 1 | 0 | False | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
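A minimal sketch, assuming we later receive new samples, of how the already-fitted encoders can be reused so the one-hot columns stay consistent with the data above (the new_gen values are made up):
# transform (not fit_transform) reuses the mapping learned above
new_gen = np.array(['Gen 6', 'Gen 1', 'Gen 3'])
new_gen_labels = gen_le.transform(new_gen)
gen_ohe.transform(pd.DataFrame({'Gen_Label': new_gen_labels})).toarray()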
- Encoding method 4: get_dummies from pandas
# The following single line replaces the whole block above; the prefix='one_hot' argument adds a prefix to the generated column names
gen_onehot_features = pd.get_dummies(poke_df['Generation'],prefix = 'one_hot')
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[0:5]
| | Name | Generation | one_hot_Gen 1 | one_hot_Gen 2 | one_hot_Gen 3 | one_hot_Gen 4 | one_hot_Gen 5 | one_hot_Gen 6 |
|---|---|---|---|---|---|---|---|---|
0 | Rapidash | Gen 1 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | Togekiss | Gen 4 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | Basculin | Gen 5 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | Zangoose | Gen 3 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | Kakuna | Gen 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Now all of the one-hot encoded features carry the "one_hot" prefix. This last method is the most convenient.
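Two optional get_dummies arguments that are often useful, sketched here as an aside rather than part of the original walkthrough: columns= encodes several columns in one call, and drop_first=True drops one redundant indicator per feature:
# encode more than one categorical column at once
pd.get_dummies(poke_df[['Name', 'Generation', 'Legendary']], columns=['Generation', 'Legendary']).iloc[0:3]
# drop the first level to avoid a redundant column
pd.get_dummies(poke_df['Generation'], prefix='one_hot', drop_first=True).iloc[0:3]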
2. Binary features
popsong_df = pd.read_csv('./song_views.csv')
popsong_df.head()
| | user_id | song_id | title | listen_count |
|---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 |
The data records each user's play count for a song; from it we build a binary feature indicating whether the user has listened to the song.
- Method 1: logical comparison
# Get the feature to compare
watched = np.array(popsong_df['listen_count'])
# Compare: if the song was played at least once, assign 1, otherwise 0; this is the binarization step
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(8)
| | user_id | song_id | title | listen_count | watched |
|---|---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 | 1 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 | 1 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 |
- Method 2: Binarizer from sklearn
from sklearn.preprocessing import Binarizer
# The threshold parameter sets the cutoff: values greater than 0.99 become 1
bn = Binarizer(threshold=0.99)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(10)
| | user_id | song_id | title | listen_count | watched | pd_watched |
|---|---|---|---|---|---|---|
0 | b6b799f34a204bd928ea014c243ddad6d0be4f8f | SOBONKR12A58A7A7E0 | You're The One | 2 | 1 | 1 |
1 | b41ead730ac14f6b6717b9cf8859d5579f3f8d4d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
2 | 4c84359a164b161496d05282707cecbd50adbfc4 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
3 | 779b5908593756abb6ff7586177c966022668b06 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
4 | dd88ea94f605a63d9fc37a214127e3f00e85e42d | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
5 | 68f0359a2f1cedb0d15c98d88017281db79f9bc6 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
6 | 116a4c95d63623a967edf2f3456c90ebbf964e6f | SOBONKR12A58A7A7E0 | You're The One | 17 | 1 | 1 |
7 | 45544491ccfcdc0b0803c34f201a6287ed4e30f8 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
8 | e701a24d9b6c59f5ac37ab28462ca82470e27cfb | SOBONKR12A58A7A7E0 | You're The One | 68 | 1 | 1 |
9 | edc8b7b1fd592a3b69c3d823a742e1a064abec95 | SOBONKR12A58A7A7E0 | You're The One | 0 | 0 | 0 |
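The same 0/1 feature can also be produced with a single vectorized comparison; a small equivalent sketch (the watched_alt column name is my own):
# a boolean comparison cast to int gives the same binary feature
popsong_df['watched_alt'] = (popsong_df['listen_count'] > 0).astype(int)
popsong_df[['listen_count', 'watched', 'pd_watched', 'watched_alt']].head()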
3. Polynomial features
That is, creating additional features from the existing ones.
atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()
| | Attack | Defense |
|---|---|---|
0 | 100 | 70 |
1 | 50 | 95 |
2 | 92 | 65 |
3 | 115 | 60 |
4 | 25 | 50 |
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
# Print the first 5 rows of the result
res[:5]
array([[ 100., 70., 10000., 7000., 4900.],
[ 50., 95., 2500., 4750., 9025.],
[ 92., 65., 8464., 5980., 4225.],
[ 115., 60., 13225., 6900., 3600.],
[ 25., 50., 625., 1250., 2500.]])
PolynomialFeatures takes three main parameters:
- degree: the degree of the polynomial; the larger it is, the more features are generated
- interaction_only: if True, only interaction terms are kept, i.e. no terms where a feature is combined with itself (such as Attack^2)
- include_bias: defaults to True, which adds an extra constant column of ones
For a clearer display, we can give the generated columns readable names:
intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)
| | Attack | Defense | Attack^2 | Attack x Defense | Defense^2 |
|---|---|---|---|---|---|
0 | 100.0 | 70.0 | 10000.0 | 7000.0 | 4900.0 |
1 | 50.0 | 95.0 | 2500.0 | 4750.0 | 9025.0 |
2 | 92.0 | 65.0 | 8464.0 | 5980.0 | 4225.0 |
3 | 115.0 | 60.0 | 13225.0 | 6900.0 | 3600.0 |
4 | 25.0 | 50.0 | 625.0 | 1250.0 | 2500.0 |
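To make the three parameters above concrete, here is a short sketch on the same Attack/Defense data; the expected columns are noted in the comments (this is my own check, not output from the original notebook):
# interaction_only=True keeps only the cross term: Attack, Defense, Attack x Defense
pf_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
pf_inter.fit_transform(atk_def)[:3]
# include_bias=True adds a leading constant column of ones
pf_bias = PolynomialFeatures(degree=2, include_bias=True)
pf_bias.fit_transform(atk_def)[:3]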
4. Discretizing continuous values
fcc_survey_df = pd.read_csv('./fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()
# fcc_survey_df.head()
| | ID.x | EmploymentField | Age | Income |
|---|---|---|---|---|
0 | cef35615d61b202f1dc794ef2746df14 | office and administrative support | 28.0 | 32000.0 |
1 | 323e5a113644d18185c743c241407754 | food and beverage | 22.0 | 15000.0 |
2 | b29a1027e5cd062e654a63764157461d | finance | 19.0 | 48000.0 |
3 | 04a11e4bcb573a1261eb0d9948d32637 | arts, entertainment, sports, or media | 26.0 | 43000.0 |
4 | 9368291c93d5d5f5c8cdb1a575e18bec | education | 20.0 | 6000.0 |
The dataset above contains an Age column; next we discretize age, i.e. split it into intervals.
# Look at the age distribution
fig, ax = plt.subplots()
# The x-axis shows age, the y-axis shows how many people fall in each bin
fcc_survey_df['Age'].hist(bins=20,color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
Ages are binned according to the following scheme:
Age Range: Bin
---------------
0 - 9 : 0
10 - 19 : 1
20 - 29 : 2
30 - 39 : 3
40 - 49 : 4
50 - 59 : 5
60 - 69 : 6
... and so on
# np.floor rounds down, so dividing by 10 and flooring gives the decade bin
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
| | ID.x | Age | Age_bin_round |
|---|---|---|---|
1071 | 6a02aa4618c99fdb3e24de522a099431 | 17.0 | 1.0 |
1072 | f0e5e47278c5f248fe861c5f7214c07a | 38.0 | 3.0 |
1073 | 6e14f6d0779b7e424fa3fdd9e4bd3bf9 | 21.0 | 2.0 |
1074 | c2654c07dc929cdf3dad4d1aec4ffbb3 | 53.0 | 5.0 |
1075 | f07449fc9339b2e57703ec7886232523 | 35.0 | 3.0 |
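The same decade bins can also be built with pd.cut using explicit edges, which is handy when the boundaries are not simple multiples of ten; a sketch assuming ages stay below 100 (the Age_bin_cut column name is my own):
# right=False makes bins like [10, 20); labels=False returns the bin index
fcc_survey_df['Age_bin_cut'] = pd.cut(fcc_survey_df['Age'], bins=list(range(0, 101, 10)), right=False, labels=False)
fcc_survey_df[['Age', 'Age_bin_round', 'Age_bin_cut']].iloc[1071:1076]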
# Look at the income distribution
fig, ax = plt.subplots()
# The x-axis shows income, the y-axis shows how many people fall in each bin
fcc_survey_df['Income'].hist(bins=20,color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
Text(0, 0.5, 'Frequency')
# Quantile-based binning: the split points are chosen so each bin holds a given proportion of the samples
quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles
0.00 6000.0
0.25 20000.0
0.50 37000.0
0.75 60000.0
1.00 200000.0
Name: Income, dtype: float64
# Divide into 4 intervals and define a label for each
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
# Add a new column Income_quantile_range
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'], q=quantile_list)
# Add a new column Income_quantile_label
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'], q=quantile_list, labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]
| | ID.x | Age | Income | Income_quantile_range | Income_quantile_label |
|---|---|---|---|---|---|
4 | 9368291c93d5d5f5c8cdb1a575e18bec | 20.0 | 6000.0 | (5999.999, 20000.0] | 0-25Q |
5 | dd0e77eab9270e4b67c19b0d6bbf621b | 34.0 | 40000.0 | (37000.0, 60000.0] | 50-75Q |
6 | 7599c0aa0419b59fd11ffede98a3665d | 23.0 | 32000.0 | (20000.0, 37000.0] | 25-50Q |
7 | 6dff182db452487f07a47596f314bddc | 35.0 | 40000.0 | (37000.0, 60000.0] | 50-75Q |
8 | 9dc233f8ed1c6eb2432672ab4bb39249 | 33.0 | 80000.0 | (60000.0, 200000.0] | 75-100Q |
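As a quick sanity check, not part of the original output, each quantile label should cover roughly a quarter of the rows that have a known income:
# the share of each label among non-missing incomes should be close to 0.25
fcc_survey_df['Income_quantile_label'].value_counts(normalize=True)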
Text features
Text features appear in data all the time: a sentence or an article is a text feature. To turn text into something a computer can work with, it first has to be converted into numbers, i.e. into vectors.
The nltk toolkit is used here; see www.nltk.org/data.html
import pandas as pd
import numpy as np
import re
import nltk  # already installed by Anaconda
First build a simple text dataset
corpus = ['The sky is blue and beautiful.',
'Love this blue and beautiful sky!',
'The quick brown fox jumps over the lazy dog.',
'The brown fox is quick and the blue dog is lazy!',
'The sky is very blue and the sky is very beautiful today',
'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
| | Document | Category |
|---|---|---|
0 | The sky is blue and beautiful. | weather |
1 | Love this blue and beautiful sky! | weather |
2 | The quick brown fox jumps over the lazy dog. | animals |
3 | The brown fox is quick and the blue dog is lazy! | animals |
4 | The sky is very blue and the sky is very beaut... | weather |
5 | The dog is lazy but the brown fox is quick! | animals |
# Running the following command opens the NLTK download dialog; the download may report quite a few errors
nltk.download()
showing info http://www.nltk.org/nltk_data/
True
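If the full download dialog is inconvenient, the only NLTK resource the preprocessing below strictly needs is the stop-word list, which can be fetched directly (a hedged alternative to the GUI):
# download just the stop-word corpus instead of everything
nltk.download('stopwords')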
1. Basic preprocessing
# Load the stop words; these two lines need some of the nltk data packages (to keep things simple I installed all of them)
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
# stop_words can be printed out if you want to inspect the list
def normalize_document(doc):
    # remove special characters
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # convert to lowercase
    doc = doc.lower()
    doc = doc.strip()
    # tokenize
    tokens = wpt.tokenize(doc)
    # remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # join the remaining tokens back into a document
    doc = ' '.join(filtered_tokens)
    return doc
# normalize_corpus = np.vectorize(normalize_document)
# norm_corpus = normalize_corpus(corpus)
norm_corpus = np.array(['sky blue beautiful',
'love blue beautiful sky',
'quick brown fox jumps lazy dog',
'brown fox quick blue dog lazy',
'sky blue sky beautiful today',
'dog lazy brown fox quick'])
norm_corpus
array(['sky blue beautiful', 'love blue beautiful sky',
'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
'sky blue sky beautiful today', 'dog lazy brown fox quick'],
dtype='<U30')
Words like "the" and "this" that contribute nothing to the topic of a sentence are all removed. Next we extract features from the text, i.e. turn each sentence into numbers.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
# Print the full vocabulary
print (cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
# Each sentence becomes a vector of word counts
cv_matrix
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']
array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
[0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
[0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]], dtype=int64)
2. Bag-of-words model
- Single words (unigrams)
# CountVectorizer builds the vocabulary and counts each word
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)
| | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
3 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
5 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
The vector dimension equals the number of distinct words in the corpus; the vector is then filled in from how often and where each word appears.
We can also take combinations of neighbouring words into account.
- Pairs of words (bigrams)
# The ngram_range parameter controls how much word context is used; (2,2) keeps only two-word combinations, while (1,2) would keep both single words and pairs
bv = CountVectorizer(ngram_range=(2,2))
# Each sentence again becomes a count vector, just as before
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
# Use the bigram vocabulary as column labels
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)
| | beautiful sky | beautiful today | blue beautiful | blue dog | blue sky | brown fox | dog lazy | fox jumps | fox quick | jumps lazy | lazy brown | lazy dog | love blue | quick blue | quick brown | sky beautiful | sky blue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
The bag-of-words model only looks at word frequency: a word's importance is determined entirely by how often it appears, and the result is usually a large sparse matrix, which is not convenient to compute with.
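To make the sparsity point concrete: CountVectorizer actually returns a SciPy sparse matrix before toarray() is called, and only the non-zero counts are stored; a small sketch reusing the unigram vectorizer:
# shape of the matrix versus how many entries are actually non-zero
sparse_matrix = cv.fit_transform(norm_corpus)
print(sparse_matrix.shape, sparse_matrix.nnz)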
3. Common ways to build text features
TF-IDF model
The TF-IDF model takes each word's importance into account.
# The code structure is the same; we just switch to a different tool
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
| | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.60 | 0.52 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.60 | 0.00 |
1 | 0.46 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.66 | 0.00 | 0.46 | 0.00 |
2 | 0.00 | 0.00 | 0.38 | 0.38 | 0.38 | 0.54 | 0.38 | 0.00 | 0.38 | 0.00 | 0.00 |
3 | 0.00 | 0.36 | 0.42 | 0.42 | 0.42 | 0.00 | 0.42 | 0.00 | 0.42 | 0.00 | 0.00 |
4 | 0.36 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.72 | 0.52 |
5 | 0.00 | 0.00 | 0.45 | 0.45 | 0.45 | 0.00 | 0.45 | 0.00 | 0.45 | 0.00 | 0.00 |
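The importance weighting can be inspected directly through the idf_ attribute of the fitted vectorizer: the rarer a word is across documents, the larger its IDF value (a short sketch; the exact numbers are not from the original notebook):
# inverse document frequency per word, sorted from most to least distinctive
pd.Series(np.round(tv.idf_, 2), index=vocab).sort_values(ascending=False)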
Similarity features
Once the features are fixed and everything has been converted into numbers, we can compute how similar the documents are to each other. Cosine similarity is used here, and the TF-IDF features extracted above are fed in directly as input.
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 1.000000 | 0.753128 | 0.000000 | 0.185447 | 0.807539 | 0.000000 |
1 | 0.753128 | 1.000000 | 0.000000 | 0.139665 | 0.608181 | 0.000000 |
2 | 0.000000 | 0.000000 | 1.000000 | 0.784362 | 0.000000 | 0.839987 |
3 | 0.185447 | 0.139665 | 0.784362 | 1.000000 | 0.109653 | 0.933779 |
4 | 0.807539 | 0.608181 | 0.000000 | 0.109653 | 1.000000 | 0.000000 |
5 | 0.000000 | 0.000000 | 0.839987 | 0.933779 | 0.000000 | 1.000000 |
Clustering features
Clustering splits the data into groups and gives each group a label. The data first has to be converted into numerical features; then the clustering result is computed.
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
| | Document | Category | ClusterLabel |
|---|---|---|---|
0 | The sky is blue and beautiful. | weather | 1 |
1 | Love this blue and beautiful sky! | weather | 1 |
2 | The quick brown fox jumps over the lazy dog. | animals | 0 |
3 | The brown fox is quick and the blue dog is lazy! | animals | 0 |
4 | The sky is very blue and the sky is very beaut... | weather | 1 |
5 | The dog is lazy but the brown fox is quick! | animals | 0 |
Topic model
This gives the topics along with the weight of each word within each topic.
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_topics=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()
Output:
[('fox', 1.7265536238698524), ('quick', 1.7264910761871224), ('dog', 1.7264019823624879), ('brown', 1.7263774760262807), ('lazy', 1.7263567668213813), ('jumps', 1.0326450363521607), ('blue', 0.7770158513472083)]
[('sky', 2.263185143458752), ('beautiful', 1.9057084998062579), ('blue', 1.7954559705805624), ('love', 1.1476805311187976), ('today', 1.0064979209198706)]
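The dt_matrix computed above also gives each document's weight on the two topics, which is worth looking at next to the original sentences; a small sketch using the features frame already defined:
# document-topic weights (each row sums to 1)
pd.concat([corpus_df, np.round(features, 2)], axis=1)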
Word-vector model
The word-vector model is word2vec; its underlying idea is based on a neural network.
# This requires installing the gensim package
from gensim.models import word2vec
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
# A few parameters need to be set
feature_size = 10 # dimensionality of the word vectors
window_context = 10 # context window size
min_word_count = 1 # minimum word frequency
w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size, window=window_context, min_count = min_word_count)
w2v_model.wv['sky']
array([-0.00243433, -0.02265637, 0.00145745, 0.0342871 , 0.0279882 ,
0.02946787, 0.02539628, 0.0080244 , 0.02513938, 0.02639634],
dtype=float32)
The result is that every word in the input corpus has been converted into a vector.
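A common follow-up step, sketched here as an assumption rather than part of the original text, is to average the word vectors of each document to get one fixed-length feature vector per document:
# average the vectors of all words in each document
doc_vectors = np.array([np.mean([w2v_model.wv[w] for w in words], axis=0)
                        for words in tokenized_corpus])
doc_vectors.shape  # (6, 10): one 10-dimensional vector per document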
(For concrete solutions, look to papers and benchmarks.)