Machine Learning: Feature Engineering



What follows is a summary of my study notes.

# An example where the labels are strings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In the modeling process, algorithms and parameters usually get first consideration, but it is the data features that determine the upper bound of the overall result; the algorithm and its parameters only determine how closely we approach that bound. Feature engineering is about finding the most valuable information in the raw data and converting it into a form a computer can understand.

Numerical Features

Here we do not need to consider the concrete meaning of each feature; we simply operate on the features themselves.

1. String Encoding

The labels of a classification task may be strings, so here we need to convert those strings into numbers.

# Running this directly raises an error: 'utf-8' codec can't decode byte 0xa8. The default encoding is utf-8, so the encoding must be changed
vg_df = pd.read_csv('./vgsales.csv',encoding="ISO-8859-1")
vg_df.iloc[0:7]
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 1 Wii Sports Wii 2006.0 Sports Nintendo 41.49 29.02 3.77 8.46 82.74
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo 29.08 3.58 6.81 0.77 40.24
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo 15.85 12.88 3.79 3.31 35.82
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo 15.75 11.01 3.28 2.96 33.00
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37
5 6 Tetris GB 1989.0 Puzzle Nintendo 23.20 2.26 4.22 0.58 30.26
6 7 New Super Mario Bros. DS 2006.0 Platform Nintendo 11.38 9.23 6.50 2.90 30.01
# Find all unique values of this attribute and display them
genres = np.unique(vg_df['Genre'])
genres
array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy'], dtype=object)

We can see there are 12 distinct attribute values in total; we just need to convert them into numbers.

  • Encoding method 1: integer mapping
# The simplest approach
from sklearn.preprocessing import LabelEncoder
# LabelEncoder handles the mapping for us
gle = LabelEncoder()
# fit_transform does the actual work, automatically mapping each attribute value to an integer
genre_labels = gle.fit_transform(vg_df['Genre'])
# Show the resulting mapping
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings
{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}
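If the integer labels ever need to be turned back into the original strings (for example, when reporting predictions), LabelEncoder can invert the mapping. A minimal sketch:

# inverse_transform recovers the original strings from the integer labels
gle.inverse_transform(genre_labels[:5])
# gives array(['Sports', 'Platform', 'Racing', 'Sports', 'Role-Playing'], ...), matching the table above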
  • Encoding method 2: custom mapping

This method is a refinement of integer mapping.

# Load another dataset
poke_df = pd.read_csv('./Pokemon.csv',encoding='utf-8')
poke_df.iloc[0:7]
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 Gen 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 Gen 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 Gen 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 Gen 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 Gen 1 False
5 5 Charmeleon Fire NaN 405 58 64 58 80 65 80 Gen 1 False
6 6 Charizard Fire Flying 534 78 84 78 109 85 100 Gen 1 False
# Shuffle the rows, then view the unique values of the label column
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
np.unique(poke_df['Generation'])
array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)
# Define a custom mapping for the attribute values
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
# Show rows 4 through 9
poke_df.iloc[4:10]
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary GenerationLabel
4 14 Kakuna Bug Poison 205 45 25 50 25 25 35 Gen 1 False 1
5 462 Magnezone Electric Steel 535 70 70 115 130 90 60 Gen 4 False 4
6 193 Yanma Bug Flying 390 65 65 45 75 45 95 Gen 2 False 2
7 214 HeracrossMega Heracross Bug Fighting 600 80 185 115 40 105 75 Gen 2 False 2
8 324 Torkoal Fire NaN 470 70 85 140 85 70 20 Gen 3 False 3
9 678 MeowsticFemale Psychic NaN 466 74 48 76 83 81 104 Gen 6 False 6
  • Encoding method 3: one-hot encoding

Implemented with the OneHotEncoder utility.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# First, the LabelEncoder step
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
# The following line would produce the same column as GenerationLabel
# poke_df['Gen_Label'] = gen_labels + 1
poke_df['Gen_Label'] = gen_labels

# Then, the OneHotEncoder step
gen_ohe = OneHotEncoder()
# gen_feature_arr is now a matrix
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
# gen_feature_labels holds the column labels
# Output: ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']

# Combine the transformed features into a DataFrame
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
# Simplify the displayed output
poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary']]
poke_df_ohe = pd.concat([poke_df_sub, gen_features], axis=1)
poke_df_ohe.head()
Name Generation Gen_Label Legendary Gen 1 Gen 2 Gen 3 Gen 4 Gen 5 Gen 6
0 Rapidash Gen 1 0 False 1.0 0.0 0.0 0.0 0.0 0.0
1 Togekiss Gen 4 3 False 0.0 0.0 0.0 1.0 0.0 0.0
2 Basculin Gen 5 4 False 0.0 0.0 0.0 0.0 1.0 0.0
3 Zangoose Gen 3 2 False 0.0 0.0 1.0 0.0 0.0 0.0
4 Kakuna Gen 1 0 False 1.0 0.0 0.0 0.0 0.0 0.0
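As a side note, in scikit-learn 0.20 and later, OneHotEncoder can consume the string column directly, which makes the intermediate LabelEncoder step unnecessary. A minimal sketch:

# One step instead of two: encode the raw string column directly
gen_ohe_direct = OneHotEncoder()
direct_arr = gen_ohe_direct.fit_transform(poke_df[['Generation']]).toarray()
# gen_ohe_direct.categories_ lists the category values backing each column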
  • Encoding method 4: the get_dummies function in pandas
# The following code can replace the whole block above; the prefix='one_hot' argument specifies a prefix used to tag the new columns
gen_onehot_features = pd.get_dummies(poke_df['Generation'],prefix = 'one_hot')
pd.concat([poke_df[['Name', 'Generation']], gen_onehot_features], axis=1).iloc[0:5]
Name Generation one_hot_Gen 1 one_hot_Gen 2 one_hot_Gen 3 one_hot_Gen 4 one_hot_Gen 5 one_hot_Gen 6
0 Rapidash Gen 1 1 0 0 0 0 0
1 Togekiss Gen 4 0 0 0 1 0 0
2 Basculin Gen 5 0 0 0 0 1 0
3 Zangoose Gen 3 0 0 1 0 0 0
4 Kakuna Gen 1 1 0 0 0 0 0

Now every one-hot encoded feature carries the "one_hot" prefix. This last method is the most convenient.

2. Binary Features

popsong_df = pd.read_csv('./song_views.csv')
popsong_df.head()
user_id song_id title listen_count
0 b6b799f34a204bd928ea014c243ddad6d0be4f8f SOBONKR12A58A7A7E0 You're The One 2
1 b41ead730ac14f6b6717b9cf8859d5579f3f8d4d SOBONKR12A58A7A7E0 You're The One 0
2 4c84359a164b161496d05282707cecbd50adbfc4 SOBONKR12A58A7A7E0 You're The One 0
3 779b5908593756abb6ff7586177c966022668b06 SOBONKR12A58A7A7E0 You're The One 0
4 dd88ea94f605a63d9fc37a214127e3f00e85e42d SOBONKR12A58A7A7E0 You're The One 0

The data records how many times each user played a song. Based on the play count, we create a binary feature indicating whether a user has listened to the song at all.

  • Method 1: logical comparison
# Grab the feature we want to compare
watched = np.array(popsong_df['listen_count'])
# Compare: assign 1 if the song was played at least once, 0 otherwise; this binarizes the feature
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(8)
user_id song_id title listen_count watched
0 b6b799f34a204bd928ea014c243ddad6d0be4f8f SOBONKR12A58A7A7E0 You're The One 2 1
1 b41ead730ac14f6b6717b9cf8859d5579f3f8d4d SOBONKR12A58A7A7E0 You're The One 0 0
2 4c84359a164b161496d05282707cecbd50adbfc4 SOBONKR12A58A7A7E0 You're The One 0 0
3 779b5908593756abb6ff7586177c966022668b06 SOBONKR12A58A7A7E0 You're The One 0 0
4 dd88ea94f605a63d9fc37a214127e3f00e85e42d SOBONKR12A58A7A7E0 You're The One 0 0
5 68f0359a2f1cedb0d15c98d88017281db79f9bc6 SOBONKR12A58A7A7E0 You're The One 0 0
6 116a4c95d63623a967edf2f3456c90ebbf964e6f SOBONKR12A58A7A7E0 You're The One 17 1
7 45544491ccfcdc0b0803c34f201a6287ed4e30f8 SOBONKR12A58A7A7E0 You're The One 0 0
  • Method 2: sklearn's Binarizer
from sklearn.preprocessing import Binarizer
# The threshold parameter sets the cutoff for the count; 0.99 means anything greater than 0.99 becomes 1
bn = Binarizer(threshold=0.99)
pd_watched = bn.transform([popsong_df['listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(10)
user_id song_id title listen_count watched pd_watched
0 b6b799f34a204bd928ea014c243ddad6d0be4f8f SOBONKR12A58A7A7E0 You're The One 2 1 1
1 b41ead730ac14f6b6717b9cf8859d5579f3f8d4d SOBONKR12A58A7A7E0 You're The One 0 0 0
2 4c84359a164b161496d05282707cecbd50adbfc4 SOBONKR12A58A7A7E0 You're The One 0 0 0
3 779b5908593756abb6ff7586177c966022668b06 SOBONKR12A58A7A7E0 You're The One 0 0 0
4 dd88ea94f605a63d9fc37a214127e3f00e85e42d SOBONKR12A58A7A7E0 You're The One 0 0 0
5 68f0359a2f1cedb0d15c98d88017281db79f9bc6 SOBONKR12A58A7A7E0 You're The One 0 0 0
6 116a4c95d63623a967edf2f3456c90ebbf964e6f SOBONKR12A58A7A7E0 You're The One 17 1 1
7 45544491ccfcdc0b0803c34f201a6287ed4e30f8 SOBONKR12A58A7A7E0 You're The One 0 0 0
8 e701a24d9b6c59f5ac37ab28462ca82470e27cfb SOBONKR12A58A7A7E0 You're The One 68 1 1
9 edc8b7b1fd592a3b69c3d823a742e1a064abec95 SOBONKR12A58A7A7E0 You're The One 0 0 0
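For completeness, the same binarization can also be written as a single pandas expression; a minimal sketch (the column name watched_alt is just for illustration):

# Boolean comparison cast to int: 1 if played at least once, else 0
popsong_df['watched_alt'] = (popsong_df['listen_count'] >= 1).astype(int)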

3. Polynomial Features

In other words, an operation that creates additional features from existing ones.

atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()
Attack Defense
0 100 70
1 50 95
2 92 65
3 115 60
4 25 50
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
# Print the first 5 rows of the result
res[:5]
array([[  100.,    70., 10000.,  7000.,  4900.],
       [   50.,    95.,  2500.,  4750.,  9025.],
       [   92.,    65.,  8464.,  5980.,  4225.],
       [  115.,    60., 13225.,  6900.,  3600.],
       [   25.,    50.,   625.,  1250.,  2500.]])

The PolynomialFeatures function involves the following 3 parameters (a short demonstration follows the list):

  • degree: controls the degree of the polynomial; the larger the value, the more features are produced
  • interaction_only: controls whether to keep the terms where a feature combines with itself (such as Attack^2); when True, only interaction terms are kept
  • include_bias: defaults to True, which adds an extra column of ones
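A minimal sketch of how the last two parameters change the output, reusing atk_def from above:

# interaction_only=True drops Attack^2 and Defense^2, keeping only Attack x Defense
pf_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(pf_inter.fit_transform(atk_def).shape[1])  # 3 columns: Attack, Defense, Attack x Defense
# include_bias=True prepends a constant column of ones
pf_bias = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
print(pf_bias.fit_transform(atk_def).shape[1])   # 6 columns: the bias column plus the 5 seen earlier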

For a clearer display, we can give the generated columns descriptive names:

intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'])
intr_features.head(5)
Attack Defense Attack^2 Attack x Defense Defense^2
0 100.0 70.0 10000.0 7000.0 4900.0
1 50.0 95.0 2500.0 4750.0 9025.0
2 92.0 65.0 8464.0 5980.0 4225.0
3 115.0 60.0 13225.0 6900.0 3600.0
4 25.0 50.0 625.0 1250.0 2500.0

4. Discretizing Continuous Values

fcc_survey_df = pd.read_csv('./fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()
# fcc_survey_df.head()
ID.x EmploymentField Age Income
0 cef35615d61b202f1dc794ef2746df14 office and administrative support 28.0 32000.0
1 323e5a113644d18185c743c241407754 food and beverage 22.0 15000.0
2 b29a1027e5cd062e654a63764157461d finance 19.0 48000.0
3 04a11e4bcb573a1261eb0d9948d32637 arts, entertainment, sports, or media 26.0 43000.0
4 9368291c93d5d5f5c8cdb1a575e18bec education 20.0 6000.0

The dataset above includes an age column. Next we discretize age, i.e., split it into intervals (bins).

# View the age distribution
fig, ax = plt.subplots()
# The x-axis shows age and the y-axis shows the count
fcc_survey_df['Age'].hist(bins=20,color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
[Figure: Developer Age Histogram]

The ages are binned according to the following scheme:

Age Range: Bin
---------------
 0 -  9  : 0
10 - 19  : 1
20 - 29  : 2
30 - 39  : 3
40 - 49  : 4
50 - 59  : 5
60 - 69  : 6
  ... and so on
# np.floor rounds down to the nearest integer
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
ID.x Age Age_bin_round
1071 6a02aa4618c99fdb3e24de522a099431 17.0 1.0
1072 f0e5e47278c5f248fe861c5f7214c07a 38.0 3.0
1073 6e14f6d0779b7e424fa3fdd9e4bd3bf9 21.0 2.0
1074 c2654c07dc929cdf3dad4d1aec4ffbb3 53.0 5.0
1075 f07449fc9339b2e57703ec7886232523 35.0 3.0
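The same 10-year bins can also be produced with pandas' cut, which attaches readable interval labels; a minimal sketch (the bin edges and column name here are illustrative):

# Explicit 10-year bin edges; right=False makes intervals left-closed, e.g. [10, 20)
age_bins = list(range(0, 101, 10))
fcc_survey_df['Age_bin_cut'] = pd.cut(fcc_survey_df['Age'], bins=age_bins, right=False)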
# View the income distribution
fig, ax = plt.subplots()
# The x-axis shows income and the y-axis shows the count
fcc_survey_df['Income'].hist(bins=20,color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
[Figure: Developer Income Histogram]

# Quantile-based binning: the splits below follow the proportion of samples
quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles
0.00      6000.0
0.25     20000.0
0.50     37000.0
0.75     60000.0
1.00    200000.0
Name: Income, dtype: float64
# Split into 4 intervals and define a label for each
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
# Add a new column, Income_quantile_range
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'], q=quantile_list)
# Add a new column, Income_quantile_label
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'], q=quantile_list, labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_quantile_range', 'Income_quantile_label']].iloc[4:9]
ID.x Age Income Income_quantile_range Income_quantile_label
4 9368291c93d5d5f5c8cdb1a575e18bec 20.0 6000.0 (5999.999, 20000.0] 0-25Q
5 dd0e77eab9270e4b67c19b0d6bbf621b 34.0 40000.0 (37000.0, 60000.0] 50-75Q
6 7599c0aa0419b59fd11ffede98a3665d 23.0 32000.0 (20000.0, 37000.0] 25-50Q
7 6dff182db452487f07a47596f314bddc 35.0 40000.0 (37000.0, 60000.0] 50-75Q
8 9dc233f8ed1c6eb2432672ab4bb39249 33.0 80000.0 (60000.0, 200000.0] 75-100Q
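Incidentally, since quantile_list is just the four quartiles, passing q=4 lets qcut compute the boundaries itself; a sketch (the _alt column name is illustrative):

# q=4 asks qcut to derive the quartile boundaries on its own
fcc_survey_df['Income_quantile_label_alt'] = pd.qcut(fcc_survey_df['Income'], q=4, labels=quantile_labels)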

Text Features

Text features appear in data all the time: a sentence or an entire article is a text feature. To turn text into features a computer can recognize, we first need to convert it into numbers, that is, into vectors.

Here we need to introduce the nltk toolkit; its website is www.nltk.org/data.html

import pandas as pd
import numpy as np
import re
import nltk   # Anaconda installed this for me already

First, build a simple text dataset:

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
Document Category
0 The sky is blue and beautiful. weather
1 Love this blue and beautiful sky! weather
2 The quick brown fox jumps over the lazy dog. animals
3 The brown fox is quick and the blue dog is lazy! animals
4 The sky is very blue and the sky is very beaut... weather
5 The dog is lazy but the brown fox is quick! animals
# Running the following command pops up the download interface; it may report quite a few errors
nltk.download() 
showing info http://www.nltk.org/nltk_data/
True

1. Basic Preprocessing

# Load stop words; these two lines require some of the nltk data packages (to be on the safe side, I installed all of them)
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
# You can print stop_words to inspect the list yourself

def normalize_document(doc):
    # Remove special characters (note: flags must be passed as flags=re.I; passed positionally it would be read as the count argument)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # Convert to lowercase
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the tokens into a single document string
    doc = ' '.join(filtered_tokens)
    return doc

# normalize_corpus = np.vectorize(normalize_document)
# norm_corpus = normalize_corpus(corpus)

norm_corpus = np.array(['sky blue beautiful',
                        'love blue beautiful sky',
                        'quick brown fox jumps lazy dog',
                        'brown fox quick blue dog lazy',
                        'sky blue sky beautiful today',
                        'dog lazy brown fox quick'])
norm_corpus
array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
       'sky blue sky beautiful today', 'dog lazy brown fox quick'],
      dtype='<U30')

Words such as "the" and "this", which contribute nothing to a sentence's topic, have all been removed. Next we extract features from the text, that is, we turn each sentence into numbers.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
# Print the full vocabulary (in scikit-learn >= 1.2 this method is named get_feature_names_out)
print(cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
# Each sentence becomes a vector of word counts over this vocabulary
cv_matrix
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']
array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]], dtype=int64)

2. Bag-of-Words Model

  • Single words
# CountVectorizer set up for single-word terms
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)
beautiful blue brown dog fox jumps lazy love quick sky today
0 1 1 0 0 0 0 0 0 0 1 0
1 1 1 0 0 0 0 0 1 0 1 0
2 0 0 1 1 1 1 1 0 1 0 0
3 0 1 1 1 1 0 1 0 1 0 0
4 1 1 0 0 0 0 0 0 0 2 1
5 0 0 1 1 1 0 1 0 1 0 0

The dimensionality of each vector equals the number of distinct words that appear in the corpus; from each word's occurrence count and position, the vectors can then be constructed.

Word-to-word combinations can also be taken into account:

  • Pairwise combinations
# The ngram_range parameter controls the word context considered; (2,2) means only two-word combinations. Set to (1, 2), it would include both single words and word pairs
bv = CountVectorizer(ngram_range=(2,2))
# As above, each sentence becomes a count vector, now over two-word terms
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
# Set the column labels
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)
beautiful sky beautiful today blue beautiful blue dog blue sky brown fox dog lazy fox jumps fox quick jumps lazy lazy brown lazy dog love blue quick blue quick brown sky beautiful sky blue
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0
3 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0
4 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
5 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0

The bag-of-words model considers only word frequency: a word's importance is tied entirely to how often it appears. The result is usually a large sparse matrix, which is inefficient to compute with, as the sketch below illustrates.
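To see the sparsity concretely, skip the .toarray() call and inspect the sparse matrix that CountVectorizer actually returns; a minimal sketch:

# fit_transform returns a scipy.sparse matrix that stores only the nonzero entries
sparse_matrix = cv.fit_transform(norm_corpus)
print(sparse_matrix.shape)  # (6, 11): 6 documents x 11 vocabulary words
print(sparse_matrix.nnz)    # number of nonzero entries actually stored
# On a real corpus with tens of thousands of words, densifying such a matrix wastes memory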

3. Common Ways to Construct Text Features

The TF-IDF Model

The TF-IDF model takes each word's importance into account: a word scores high when it is frequent in one document but rare across the corpus.

# The code structure is the same as before; only the tool changes
from sklearn.feature_extraction.text import TfidfVectorizer 
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
beautiful blue brown dog fox jumps lazy love quick sky today
0 0.60 0.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.60 0.00
1 0.46 0.39 0.00 0.00 0.00 0.00 0.00 0.66 0.00 0.46 0.00
2 0.00 0.00 0.38 0.38 0.38 0.54 0.38 0.00 0.38 0.00 0.00
3 0.00 0.36 0.42 0.42 0.42 0.00 0.42 0.00 0.42 0.00 0.00
4 0.36 0.31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.72 0.52
5 0.00 0.00 0.45 0.45 0.45 0.00 0.45 0.00 0.45 0.00 0.00
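As a sanity check, the 0.66 that "love" receives in row 1 can be reproduced by hand. With scikit-learn's defaults, tf is the raw count, idf = ln((1 + n) / (1 + df)) + 1 where n is the number of documents and df the document frequency, and each row is then L2-normalized. A minimal sketch assuming those defaults:

# Reproduce the TF-IDF score of 'love' in document 1: 'love blue beautiful sky'
n = 6                                              # number of documents
idf = lambda df_: np.log((1 + n) / (1 + df_)) + 1
# document frequencies: beautiful=3, blue=4, love=1, sky=3; each tf is 1 here
row = np.array([idf(3), idf(4), idf(1), idf(3)])
row = row / np.linalg.norm(row)                    # L2 normalization
print(np.round(row[2], 2))                         # 0.66, matching the table above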
Similarity Features

Once the features are fixed and everything has been converted into numerical data, we can compute the similarity between documents. Here cosine similarity is used, and the feature-extraction results above serve directly as the input.

from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
0 1 2 3 4 5
0 1.000000 0.753128 0.000000 0.185447 0.807539 0.000000
1 0.753128 1.000000 0.000000 0.139665 0.608181 0.000000
2 0.000000 0.000000 1.000000 0.784362 0.000000 0.839987
3 0.185447 0.139665 0.784362 1.000000 0.109653 0.933779
4 0.807539 0.608181 0.000000 0.109653 1.000000 0.000000
5 0.000000 0.000000 0.839987 0.933779 0.000000 1.000000
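Because TfidfVectorizer L2-normalizes each row by default, the cosine similarity here reduces to a plain dot product; a minimal sketch:

# For L2-normalized rows, cos(a, b) = a.b / (|a||b|) = a.b
manual_similarity = tv_matrix @ tv_matrix.T
# np.allclose(manual_similarity, similarity_matrix) should come out True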
Clustering Features

Clustering partitions the data into groups and assigns each group a concrete label. The data must first be converted into numerical features; then the clustering result can be computed.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
Document Category ClusterLabel
0 The sky is blue and beautiful. weather 1
1 Love this blue and beautiful sky! weather 1
2 The quick brown fox jumps over the lazy dog. animals 0
3 The brown fox is quick and the blue dog is lazy! animals 0
4 The sky is very blue and the sky is very beaut... weather 1
5 The dog is lazy but the brown fox is quick! animals 0
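To check how well the clusters line up with the true categories, a cross-tabulation helps; a sketch (the result variable is hypothetical):

# Rows: true categories; columns: cluster IDs. A concentrated table means good agreement
result = pd.concat([corpus_df, cluster_labels], axis=1)
pd.crosstab(result['Category'], result['ClusterLabel'])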
Topic Models

This yields the topics along with the weight of each word inside them.

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)  # n_components was named n_topics in old scikit-learn versions
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])

tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()

Output:

[('fox', 1.7265536238698524), ('quick', 1.7264910761871224), ('dog', 1.7264019823624879), ('brown', 1.7263774760262807), ('lazy', 1.7263567668213813), ('jumps', 1.0326450363521607), ('blue', 0.7770158513472083)]

[('sky', 2.263185143458752), ('beautiful', 1.9057084998062579), ('blue', 1.7954559705805624), ('love', 1.1476805311187976), ('today', 1.0064979209198706)]

Word Vector Models

The word vector model is word2vec, whose underlying principle is based on neural networks.

# This requires installing the gensim package
from gensim.models import word2vec

wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

# A few parameters need to be set
feature_size = 10    # word-vector dimensionality
window_context = 10  # sliding-window size
min_word_count = 1   # minimum word frequency

# Note: in gensim >= 4.0 this argument is vector_size; versions before 4.0 call it size
w2v_model = word2vec.Word2Vec(tokenized_corpus, vector_size=feature_size, window=window_context, min_count=min_word_count)
w2v_model.wv['sky']
array([-0.00243433, -0.02265637,  0.00145745,  0.0342871 ,  0.0279882 ,
        0.02946787,  0.02539628,  0.0080244 ,  0.02513938,  0.02639634],
      dtype=float32)

The output converts every word in the input corpus into a vector.
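To turn these per-word vectors into document-level features, a common approach (sketched here under the gensim 4.x API; not part of the original text) is to average the vectors of the words in each document:

# Average the word vectors of each document into one fixed-length feature vector
def average_word_vectors(tokens, model, num_features):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(num_features)

doc_features = np.array([average_word_vectors(doc, w2v_model, feature_size)
                         for doc in tokenized_corpus])
doc_features.shape  # (6, 10): one 10-dimensional vector per document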

(For solutions beyond this, look to papers and benchmarks.)