Preface
In the big-data era, our lives can be quantified and described through all kinds of indicators. Can our living conditions and emotional states also be analyzed with big data? This article explores a happiness score and a "love at first sight" index using datasets from the Tianchi platform.
Happiness Prediction
1.1 Jupyter Setup, Imports, and Dataset Loading
Import the required modules.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings
import sklearn
import seaborn as sns
Suppress warnings (ConvergenceWarning has to be imported before it can be filtered).
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
Set a Chinese font for matplotlib and seaborn so that Chinese labels render correctly.
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')
Load the data.
data = pd.read_csv('happiness_train_complete.csv',encoding='gbk')
1.2 Exploratory Analysis
1.2.1 Dataset Preview
The dataset's shape is (8000, 140): happiness is the target to predict and the remaining columns are features. Preview the first and last five rows:
pd.concat([data.head(), data.tail()])
————————————————————————————————————————————————————————
id happiness survey_type province city county survey_time gender birth nationality ... neighbor_familiarity public_service_1 public_service_2 public_service_3 public_service_4 public_service_5 public_service_6 public_service_7 public_service_8 public_service_9
0 1 4 1 12 32 59 2015/8/4 14:18 1 1959 1 ... 4 50 60 50 50 30.0 30 50 50 50
1 2 4 2 18 52 85 2015/7/21 15:04 1 1992 1 ... 3 90 70 70 80 85.0 70 90 60 60
2 3 4 2 29 83 126 2015/7/21 13:24 2 1967 1 ... 4 90 80 75 79 80.0 90 90 90 75
3 4 5 2 10 28 51 2015/7/25 17:33 2 1943 1 ... 3 100 90 70 80 80.0 90 90 80 80
4 5 4 1 7 18 36 2015/8/10 9:50 2 1994 1 ... 2 50 50 50 50 50.0 50 50 50 50
7995 7996 2 2 29 82 124 2015/7/21 19:36 1 1981 1 ... 3 40 50 50 50 40.0 50 50 60 50
7996 7997 3 1 12 32 61 2015/7/31 16:00 2 1945 1 ... 4 80 80 80 80 80.0 60 60 80 80
7997 7998 4 1 16 46 78 2015/8/1 17:48 2 1967 1 ... 4 75 70 70 80 80.0 70 75 70 75
7998 7999 3 1 1 1 8 2015/9/22 18:52 2 1978 1 ... 2 56 67 70 69 78.0 60 70 80 70
7999 8000 4 1 1 1 3 2015/9/28 20:22 2 1991 1 ... 3 80 80 80 80 80.0 80 80 80 80
10 rows × 140 columns
Preview the overall distribution. The dataset holds 8000 rows; the mean of happiness is about 3.85, and invalid values (-8) appear, so rows with happiness less than or equal to 0 are removed from the training set.
data.describe()
df_train = data[data['happiness']>0]
Check the median and mode to see the distribution's shape. The median and mode are the same while the mean is slightly smaller, so the data is close to normally distributed with a mild left skew.
data['happiness'].median()
data['happiness'].mode()[0]
————————————————————————————————————————————————————————
4.0
4
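As a sanity check on that reading, the median, mode, and mean can be compared on any rating series; a toy sketch with made-up ratings (not the real data):

```python
import pandas as pd

# Made-up 1-5 ratings with a slight left skew, mimicking the happiness column.
ratings = pd.Series([1, 2, 3, 3, 4, 4, 4, 4, 5, 5])

# median == mode while the mean sits slightly below -> mild left skew
print(ratings.median(), ratings.mode()[0], ratings.mean())  # 4.0 4 3.5
```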
1.2.2 Happiness Category Analysis
77.97% of respondents consider their life happy, 14.51% take no clear stance, and the remaining 7.52% consider themselves unhappy. Most people, then, view their life positively.
# draw side-by-side subplots with pandas
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(20,10))
df_train['happiness'].value_counts().plot.pie(autopct='%.2f%%',ax=ax1,fontsize=22)
ax1.set_title('Happiness',fontsize=25)
ax1.set_ylabel('')
df_train['happiness'].value_counts().plot.bar(ax=ax2,color=['r','g','b','black','y'],
alpha=0.7,fontsize=22)
ax2.set_title('Happiness',fontsize=25)
plt.show()
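The percentages quoted above come straight from value_counts; a minimal sketch on a hypothetical miniature series (grouping answers 4-5 as happy, 3 as undecided, 1-2 as unhappy is this article's reading of the scale):

```python
import pandas as pd

# Hypothetical stand-in for df_train['happiness'].
happiness = pd.Series([4, 5, 4, 3, 1, 2, 4, 5, 3, 4])

pct = happiness.value_counts(normalize=True) * 100
happy = pct.loc[[4, 5]].sum()      # answered 4 or 5
undecided = pct.loc[3]             # answered 3
unhappy = pct.loc[[1, 2]].sum()    # answered 1 or 2
print(happy, undecided, unhappy)   # 60.0 20.0 20.0
```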
1.2.3 Gender and Happiness
- In overall counts, women score higher on happiness than men.
- Among those who report being happy, women outnumber men.
- Among those who report being unhappy, women again outnumber men.
- Among those who take no clear stance, men outnumber women.
- Taken together, this suggests women attach more weight to happiness, while men pay less attention to it overall.
fig,(ax0,ax1,ax2)=plt.subplots(1,3,figsize=(20,10))
df_train.groupby('gender')['happiness'].count().plot.bar(ax=ax0,
color=['b','r'],alpha=0.7,fontsize=22)
ax0.set_title('样本总数',fontsize=22)
ax0.set_xlabel('gender',fontsize=20)
sns.countplot('gender',hue='happiness',data=df_train,ax=ax1)
ax1.set_title('男女幸福指数',fontsize=22,)
ax1.set_xlabel('gender',fontsize=20)
ax1.set_ylabel('count',fontsize=20)
sns.countplot('happiness',hue='gender',data=df_train,ax=ax2)
ax2.set_title('男女幸福指数对比',fontsize=22,)
ax2.set_xlabel('happiness',fontsize=20)
ax2.set_ylabel('count',fontsize=20)
1.2.4 Age and Happiness
First compute each respondent's age and add an 'age' column to both datasets.
# load the test set (it has not been read in yet)
df_test = pd.read_csv('happiness_test_complete.csv',encoding='gbk')
df_train['survey_time'] = pd.to_datetime(df_train['survey_time']).dt.year
df_test['survey_time'] = pd.to_datetime(df_test['survey_time']).dt.year
df_train['age'] = df_train['survey_time'] - df_train['birth']
df_test['age'] = df_test['survey_time'] - df_test['birth']
df_train.drop(['survey_time','birth'],axis=1,inplace=True)
df_test.drop(['survey_time','birth'],axis=1,inplace=True)
The histogram shows the sample's age distribution; people aged 40-70 make up more than half of it.
fig=plt.figure(figsize=(15,8))
df_train['age'].plot.hist(color='m',alpha=0.75)
plt.title('年龄分布',fontsize=23)
Now combine age and gender with happiness.
- Most people under 40 affirm their happiness, and women's affirmation is somewhat higher.
- In the 40-60 group, feelings about happiness are the most pronounced, whether happy or not.
- Over 60, most people consider their life happy, and few consider it unhappy.
fig = plt.figure(figsize=(25, 10))
sns.violinplot(x='happiness', y='age',
hue='gender', data=df_train,
split=True,alpha=0.9,
palette={1: "r", 2: "g"});
plt.title('年龄-幸福感分布图',fontsize=25)
plt.xlabel('happiness',fontsize=22)
plt.ylabel('age',fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
1.2.5 Region and Happiness
- In overall counts, urban respondents score higher on happiness than rural ones.
- Among those who report being happy, urban respondents outnumber rural ones.
- Among those who report being unhappy, the urban and rural counts are close.
- Among those who take no clear stance, urban respondents outnumber rural ones.
- Taken together, urban residents feel happiness more keenly.
fig,(ax0,ax1,ax2)=plt.subplots(1,3,figsize=(20,10))
df_train.groupby('survey_type')['happiness'].count().plot.bar(ax=ax0,
color=['r','g'],alpha=0.9,fontsize=22)
ax0.set_title('样本总数',fontsize=22)
ax0.set_xlabel('survey_type',fontsize=20)
sns.countplot('survey_type',hue='happiness',data=df_train,ax=ax1)
ax1.set_title('城村幸福指数',fontsize=22,)
ax1.set_xlabel('survey_type',fontsize=20)
ax1.set_ylabel('count',fontsize=20)
sns.countplot('happiness',hue='survey_type',data=df_train,ax=ax2)
ax2.set_title('城村幸福指数对比',fontsize=22,)
ax2.set_xlabel('happiness',fontsize=20)
ax2.set_ylabel('count',fontsize=20)
1.3 Feature Engineering
1.3.1 Handling Missing Values
Append the test set to the training set so that feature engineering is applied to both.
def combined_data():
    # the files are GBK-encoded (see the earlier load)
    data_train = pd.read_csv('happiness_train_complete.csv',encoding='gbk')
    data_test = pd.read_csv('happiness_test_complete.csv',encoding='gbk')
    df_train = data_train[data_train['happiness']>0]
    targets = df_train.happiness
    df_train = df_train.drop(['happiness'],axis=1)
    combined = df_train.append(data_test)
    combined.reset_index(inplace=True)
    combined.drop(['id','index'],inplace=True,axis=1)
    return combined,df_train,targets
combined,df_train,targets=combined_data()
combined
Find the columns that contain missing values.
df_nan=pd.DataFrame(combined.isnull().sum(),columns=['count'])
index_nan =df_nan[df_nan['count']!=0].index
combined[index_nan]
————————————————————————————————————————————————————————
edu_other edu_status edu_yr join_party property_other hukou_loc social_neighbor social_friend work_status work_yr ... marital_1st s_birth marital_now s_edu s_political s_hukou s_income s_work_exper s_work_status s_work_type
0 NaN 4.0 -2.0 NaN NaN 2.0 3.0 3.0 3.0 30.0 ... 1984.0 1958.0 1984.0 6.0 1.0 5.0 40000.0 5.0 NaN NaN
1 NaN 4.0 2013.0 NaN NaN 1.0 6.0 2.0 3.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN 4.0 -2.0 NaN NaN 1.0 2.0 5.0 NaN NaN ... 1990.0 1968.0 1990.0 3.0 1.0 1.0 6000.0 3.0 NaN NaN
3 NaN 4.0 1959.0 NaN NaN 2.0 1.0 6.0 NaN NaN ... 1960.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN 1.0 2014.0 NaN NaN 3.0 7.0 5.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10963 NaN 2.0 NaN NaN NaN 2.0 1.0 1.0 3.0 6.0 ... 1982.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
10964 NaN 2.0 -1.0 NaN NaN 1.0 3.0 4.0 NaN NaN ... 1998.0 1975.0 1998.0 3.0 1.0 1.0 10000.0 3.0 NaN NaN
10965 NaN NaN NaN NaN NaN 1.0 2.0 2.0 NaN NaN ... 1989.0 1962.0 1989.0 4.0 1.0 1.0 15000.0 3.0 NaN NaN
10966 NaN 4.0 -2.0 NaN NaN 1.0 NaN NaN NaN NaN ... 1988.0 1950.0 1988.0 7.0 1.0 -8.0 -1.0 4.0 NaN NaN
10967 NaN 2.0 1954.0 NaN NaN 1.0 2.0 5.0 NaN NaN ... 1970.0 1941.0 1970.0 3.0 1.0 2.0 12000.0 5.0 NaN NaN
10968 rows × 25 columns
1.3.1.1 Education
- edu_other means "other", which is already encoded as 14 in edu, so drop it.
- NaN in edu_status means no education was received; replace it with 0. Replace NaN in edu_yr with 0 as well.
- Where edu_status and edu_yr are both negative, set both to 0; where edu_status is -8 but edu_yr holds a concrete year, set edu_status to 4.
- Remaining negative values of edu_yr mean the graduation year is unknown; replace them with 0.
edu_other edu_status edu_yr
2950 NaN -8.0 2004.0
3124 NaN -8.0 1977.0
5127 NaN -8.0 1965.0
6612 NaN -8.0 2012.0
7148 NaN -8.0 1962.0
9271 NaN -8.0 1969.0
10079 NaN -8.0 1960.0
10447 NaN -8.0 2001.0
Putting this together:
def modify_edu():
    global combined
    combined.drop(['edu_other'],axis=1,inplace=True)
    combined['edu_status'].fillna(0,inplace=True)
    combined['edu_yr'].fillna(0,inplace=True)
    combined.loc[(combined['edu_status']<0) & (combined['edu_yr']<0),
                 ['edu_status','edu_yr']]=[0,0]
    combined.loc[combined['edu_status']==(-8),'edu_status']=4
    combined.loc[combined['edu_yr']<0,'edu_yr']=0
    combined['edu_yr'] = combined['edu_yr'].astype(int)
    return combined
combined =modify_edu()
1.3.1.2 Political Affiliation
- Fill NaN with fillna
- Replace values of 4 and -2 with 0
def modify_join_party():
    global combined
    combined['join_party'].fillna(0,inplace=True)
    combined['join_party'] = combined['join_party'].astype(int)
    combined.loc[(combined['join_party'] ==4) | (combined['join_party'] ==(-2)),
                 'join_party']=0
    return combined
combined = modify_join_party()
1.3.1.3 Property Ownership
- Replace NaN with 0
- Non-null entries mean the respondent reported property information; replace them with 1
def modify_property_other():
    global combined
    combined['property_other'].fillna(0,inplace=True)
    combined.loc[combined['property_other'] !=0,'property_other'] =1
    return combined
combined=modify_property_other()
1.3.1.4 Hukou (Household Registration)
- Since 4 means the hukou status is undetermined, replace NaN and 0 with 4
def modify_hukou_loc():
    global combined
    combined['hukou_loc'].fillna(4,inplace=True)
    combined.loc[combined['hukou_loc']==0,'hukou_loc']=4
    return combined
combined=modify_hukou_loc()
1.3.1.5 Social Activity
- Fill the NaN and invalid values of social_neighbor and social_friend with the median
def modify_social_activity():
    global combined
    social_neighbor_median = combined['social_neighbor'].median()
    combined['social_neighbor'].fillna(social_neighbor_median,inplace=True)
    combined.loc[combined['social_neighbor']<0,'social_neighbor']=social_neighbor_median
    social_friend_median = combined['social_friend'].median()
    combined['social_friend'].fillna(social_friend_median,inplace=True)
    combined.loc[combined['social_friend']<0,'social_friend']=social_friend_median
    return combined
combined=modify_social_activity()
1.3.1.6 Work Status
- Fill NaN in work_status with 9, i.e. "other"; fill the remaining work features with 0.
- The only negative value in work_status is -8; replace it with its absolute value, 8.
- Replace negative values in work_yr with their absolute values 1, 2, 3.
- Replace -8 in work_type and work_manage with 0.
def modify_work_situation():
    global combined
    combined['work_status'].fillna(9,inplace=True)
    combined['work_yr'].fillna(0,inplace=True)
    combined['work_type'].fillna(0,inplace=True)
    combined['work_manage'].fillna(0,inplace=True)
    combined.loc[combined['work_status']<0,'work_status']=8
    combined.loc[combined['work_yr']==(-1),'work_yr'] =1
    combined.loc[combined['work_yr']==(-2),'work_yr'] =2
    combined.loc[combined['work_yr']==(-3),'work_yr'] =3
    combined.loc[combined['work_type']==-8,'work_type']=0
    combined.loc[combined['work_manage']==-8,'work_manage']=0
    return combined
combined=modify_work_situation()
1.3.1.7 Family Income
- Fill NaN with the median
- Replace negative values of family_income with 0
def modify_family_income():
    global combined
    combined['family_income'].fillna(combined['family_income'].median(),inplace=True)
    combined.loc[combined['family_income'] <0,'family_income'] =0
    return combined
combined =modify_family_income()
1.3.1.8 Investments
- Replace NaN with 0
- Entries with a textual note become 1; purely numeric entries become 0
def modify_invest_other():
    global combined
    combined['invest_other'].fillna(0,inplace=True)
    invest_other_01 = []
    for i in combined['invest_other']:
        # after fillna(0), missing entries are the int 0; textual notes stay str
        if type(i) == int:
            invest_other_01.append(0)
        else:
            invest_other_01.append(1)
    combined['invest_other'] = invest_other_01
    return combined
combined=modify_invest_other()
1.3.1.9 Minor Children
- Replace NaN and negative values with 0
def modify_minor_child():
    global combined
    combined['minor_child'].fillna(0,inplace=True)
    combined.loc[combined['minor_child']<0,'minor_child']=0
    return combined
combined =modify_minor_child()
1.3.1.10 Marriage Dates
- Replace NaN and invalid values with 0
def modify_marital_date():
    global combined
    combined['marital_1st'].fillna(0,inplace=True)
    combined.loc[combined['marital_1st']<0,'marital_1st']=0
    combined.loc[combined['marital_1st']==4,'marital_1st']=0
    combined['marital_1st'] = combined['marital_1st'].astype(int)
    combined['marital_now'].fillna(0,inplace=True)
    combined.loc[combined['marital_now']<0,'marital_now']=0
    combined['marital_now'] = combined['marital_now'].astype(int)
    return combined
combined =modify_marital_date()
1.3.1.11 Spouse Information
- Fill NaN and negative values with 0 or an appropriate default code
def modify_s_situation():
    global combined
    combined['s_birth'].fillna(0,inplace=True)
    combined['s_birth'] = combined['s_birth'].astype(int)
    combined['s_edu'].fillna(14,inplace=True)
    combined.loc[(combined['s_edu']==0) | (combined['s_edu']==-8),'s_edu']=14
    combined['s_political'].fillna(0,inplace=True)
    combined.loc[combined['s_political']==-8,'s_political']=0
    combined['s_hukou'].fillna(0,inplace=True)
    combined.loc[(combined['s_hukou']==0) | (combined['s_hukou']==-8),'s_hukou']=8
    combined['s_income'].fillna(combined['s_income'].mean(),inplace=True)
    combined.loc[combined['s_income']<0,'s_income']=0
    combined['s_work_exper'].fillna(0,inplace=True)
    combined['s_work_status'].fillna(9,inplace=True)
    combined.loc[(combined['s_work_status']==0) | (combined['s_work_status']==-8),'s_work_status']=9
    combined['s_work_type'].fillna(0,inplace=True)
    combined.loc[(combined['s_work_type']==3)|(combined['s_work_type']==8)|(combined['s_work_type']==-8),
                 's_work_type']=0
    return combined
combined =modify_s_situation()
1.3.2 Binning Age
- Subtract the birth year from the survey year
def get_age():
    global combined
    combined['age']=pd.to_datetime(combined['survey_time']).dt.year-combined['birth']
    combined.drop(['survey_time','birth'],axis=1,inplace=True)
    return combined
combined =get_age()
- Use binning to split age into five groups
- age <= 18 is coded as 0
- 18 < age <= 39 is coded as 1
- 39 < age <= 60 is coded as 2
- 60 < age <= 75 is coded as 3
- age > 75 is coded as 4
def modify_age():
    global combined
    combined.loc[combined['age'] <= 18, 'age'] = 0
    combined.loc[(combined['age'] > 18) & (combined['age'] <= 39), 'age'] = 1
    combined.loc[(combined['age'] > 39) & (combined['age'] <= 60), 'age'] = 2
    combined.loc[(combined['age'] > 60) & (combined['age'] <= 75), 'age'] = 3
    combined.loc[combined['age'] > 75, 'age'] = 4
    return combined
combined=modify_age()
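The same five bins can be expressed in one call with pd.cut; an equivalent sketch (right-closed intervals match the rules above):

```python
import pandas as pd

ages = pd.Series([10, 25, 45, 65, 80])
# (-inf,18] -> 0, (18,39] -> 1, (39,60] -> 2, (60,75] -> 3, (75,inf) -> 4
bins = [-float('inf'), 18, 39, 60, 75, float('inf')]
age_group = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_group.tolist())  # [0, 1, 2, 3, 4]
```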
1.3.3 Feature Distribution and Combination
A feature's distribution reflects how balanced the sample is: if one class dominates a feature (say 99% of the rows), the feature carries almost no information for analysis or model training, so such features can be removed.
- Build a plotting and percentage helper to inspect each feature's distribution
def features_distribution(feature):
    global combined
    sns.countplot(x=feature,data=combined)
    feature_counts = combined[feature].value_counts()
    feature_pct = feature_counts.max()/feature_counts.sum()*100
    print('%s中数量最多占比为: %.2f%%' % (feature,feature_pct))
- Build a removal helper to drop heavily imbalanced features
def remove_feature(feature):
    global combined
    combined.drop([feature],axis=1,inplace=True)
    return combined
1.3.3.1 Ethnicity
- Han accounts for 92.06% of nationality; such a dominant class is of no use for training, so the feature is dropped.
features_distribution('nationality')
remove_feature('nationality')
——————————————————————————————————————————————————————
nationality中数量最多占比为: 92.06%
1.3.3.2 Religion
- In religion, 87.88% report no religious belief; drop it.
- In religion_freq, 86.00% never attend religious activities; drop it.
features_distribution('religion')
features_distribution('religion_freq')
remove_feature('religion')
remove_feature('religion_freq')
——————————————————————————————————————————————————————
religion中数量最多占比为: 87.88%
religion_freq中数量最多占比为: 86.00%
1.3.3.3 Political Affiliation Distribution
- In political, ordinary citizens account for 84.10%; drop it.
features_distribution('political')
remove_feature('political')
——————————————————————————————————————————————————————
political中数量最多占比为: 84.10%
1.3.3.4 Property Distribution
- Keep only property_1 and property_2; drop the rest.
features_distribution('property_0')
features_distribution('property_1')
features_distribution('property_2')
features_distribution('property_3')
features_distribution('property_4')
features_distribution('property_5')
features_distribution('property_6')
features_distribution('property_7')
remove_feature('property_0')
remove_feature('property_3')
remove_feature('property_4')
remove_feature('property_5')
remove_feature('property_6')
remove_feature('property_7')
——————————————————————————————————————————————————————
property_0中数量最多占比为: 99.20%
property_1中数量最多占比为: 52.82%
property_2中数量最多占比为: 73.29%
property_3中数量最多占比为: 89.92%
property_4中数量最多占比为: 89.60%
property_5中数量最多占比为: 97.46%
property_6中数量最多占比为: 99.60%
property_7中数量最多占比为: 97.76%
1.3.3.5 Health
- health and health_problem have similar distributions; drop one of them.
features_distribution('health')
features_distribution('health_problem')
remove_feature('health_problem')
——————————————————————————————————————————————————————
health中数量最多占比为: 38.81%
health_problem中数量最多占比为: 36.56%
1.3.3.6 Media Usage
- media_2 and media_3 have similar distributions; drop one of them.
features_distribution('media_1')
features_distribution('media_2')
features_distribution('media_3')
features_distribution('media_4')
features_distribution('media_5')
features_distribution('media_6')
remove_feature('media_2')
——————————————————————————————————————————————————————
media_1中数量最多占比为: 50.23%
media_2中数量最多占比为: 54.81%
media_3中数量最多占比为: 55.24%
media_4中数量最多占比为: 40.38%
media_5中数量最多占比为: 53.13%
media_6中数量最多占比为: 69.55%
1.3.3.7 Family Class
- class_10_after and class_10_before describe the change in family class over ten years, so they are combined into class_variable.
combined['class_variable'] =combined['class_10_after'] -combined['class_10_before']
features_distribution('class')
features_distribution('class_variable')
features_distribution('class_14')
remove_feature('class_10_after')
remove_feature('class_10_before')
——————————————————————————————————————————————————————
class中数量最多占比为: 34.65%
class_variable中数量最多占比为: 22.25%
class_14中数量最多占比为: 21.08%
1.3.3.8 Social Insurance
- insur_3 and insur_4 both describe commercial insurance and trend together; keep only one of them.
features_distribution('insur_1')
features_distribution('insur_2')
features_distribution('insur_3')
features_distribution('insur_4')
remove_feature('insur_4')
——————————————————————————————————————————————————————
insur_1中数量最多占比为: 90.80%
insur_2中数量最多占比为: 68.81%
insur_3中数量最多占比为: 89.32%
insur_4中数量最多占比为: 91.81%
1.3.3.9 Investment Distribution
- Keep invest_2; the other investment features show almost no investment at all, so drop them all.
features_distribution('invest_0')
features_distribution('invest_1')
features_distribution('invest_2')
features_distribution('invest_3')
features_distribution('invest_4')
features_distribution('invest_5')
features_distribution('invest_6')
features_distribution('invest_7')
features_distribution('invest_8')
remove_feature('invest_0')
remove_feature('invest_1')
remove_feature('invest_3')
remove_feature('invest_4')
remove_feature('invest_5')
remove_feature('invest_6')
remove_feature('invest_7')
remove_feature('invest_8')
——————————————————————————————————————————————————————
invest_0中数量最多占比为: 98.48%
invest_1中数量最多占比为: 90.94%
invest_2中数量最多占比为: 93.97%
invest_3中数量最多占比为: 97.79%
invest_4中数量最多占比为: 99.52%
invest_5中数量最多占比为: 99.84%
invest_6中数量最多占比为: 99.99%
invest_7中数量最多占比为: 99.93%
invest_8中数量最多占比为: 99.94%
1.3.4 Data Type Handling
1.3.4.1 Date Columns
- survey_time and birth were handled earlier; the parents' birth years, the spouse's birth year, marriage dates, and similar columns add no value for analyzing the target, so they are dropped.
combined.drop(['s_birth','f_birth','m_birth','marital_now','edu_yr','join_party','marital_1st'],axis=1,inplace=True)
1.3.4.2 Numeric Data (Standardization)
- Extract the numeric columns and compute their mean and standard deviation
- Standardize the numeric data
numeric = ['income','height_cm','weight_jin','s_income',
           'family_income','family_m','house','car',
           'son','daughter','minor_child','inc_exp',
           'public_service_1','public_service_2','public_service_3',
           'public_service_4','public_service_5','public_service_6',
           'public_service_7','public_service_8','public_service_9',
           'floor_area']
def normalizing():
    global combined,numeric
    numeric_means = combined[numeric].mean()
    numeric_std = combined[numeric].std()
    combined[numeric] = (combined[numeric]-numeric_means)/numeric_std
    return combined
combined = normalizing()
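The z-score transform inside normalizing() is just (x - mean) / std; a toy example with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'income': [10.0, 20.0, 30.0]})
standardized = (df - df.mean()) / df.std()  # pandas .std() uses ddof=1
print(standardized['income'].tolist())  # [-1.0, 0.0, 1.0]
```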
1.3.4.3 Categorical Data (One-Hot Encoding)
- One-hot encode the categorical columns
def one_hot_encoder():
    global combined,numeric
    df_object = combined.drop(numeric,axis=1)
    df_object = df_object.astype(str)
    for name in list(df_object):
        object_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, object_dummies], axis=1)
        combined.drop(name, inplace=True, axis=1)
    return combined
combined = one_hot_encoder()
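For reference, pd.get_dummies expands one column into indicator columns named with the prefix; a toy example:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'])
dummies = pd.get_dummies(s, prefix='cat')
print(list(dummies.columns))                  # ['cat_a', 'cat_b']
print(dummies['cat_a'].astype(int).tolist())  # [1, 0, 1]
```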
1.4 Model Training
- Recover the training and test sets
def recover_train_test_target():
    global combined
    targets = pd.read_csv('happiness_train_complete.csv', usecols=['happiness'])['happiness'].values
    train = combined.iloc[:7988]
    test = combined.iloc[7988:]
    targets = targets[targets > 0]
    return train, test, targets
train, test, targets = recover_train_test_target()
targets = pd.DataFrame(targets,columns=['happiness'])
- Import the modules needed for model training
import sklearn
from sklearn.linear_model import Lasso, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
1.4.1 Dimensionality Reduction
- Use random-forest feature importances to reduce the feature dimension
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets)
model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
train_reduced.shape
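To inspect which columns SelectFromModel keeps, get_support() can be mapped back to the column names; a sketch on synthetic data (the 'signal'/'noise' columns are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
# 'signal' fully determines the target; 'noise' is irrelevant.
X = pd.DataFrame({'signal': rng.rand(200), 'noise': rng.rand(200)})
y = (X['signal'] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
selector = SelectFromModel(clf, prefit=True)
kept = X.columns[selector.get_support()].tolist()
print(kept)  # ['signal']
```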
1.4.2 Building Models
Run a first-pass evaluation with several models.
logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
Las = Lasso()
svc=SVC()
xgb_r = xgb.XGBRegressor()
models = [logreg_cv, rf, svc, Las, xgb_r]
Evaluate the models with a scoring helper.
def compute_score(clf,x,y,cv=5,scoring='accuracy'):
    xval = cross_val_score(clf,x,y,cv=cv,scoring=scoring)
    return np.mean(xval)
# note: 'accuracy' only applies to the classifiers; the regressors (Lasso, XGBRegressor)
# need a regression metric such as 'neg_mean_squared_error'
for model in models:
    print(f'Cross-validation of: {model.__class__}')
    score = compute_score(clf=model, x=train_reduced, y=targets, scoring='accuracy')
    print(f'CV score = {score}')
1.4.3 Hyperparameter Tuning
Tune the parameters with grid search.
model_box= []
xgb_re=xgb.XGBRegressor(seed=27,learning_rate=0.1, n_estimators=300,silent=0, objective='reg:linear',
gamma=0,subsample=0.8,colsample_bytree=0.8,nthread=4,scale_pos_weight=1)
xgb_params ={'n_estimators':[50,100,120],'min_child_weight':list(range(1,4,2)),}
model_box.append([xgb_re,xgb_params])
rf=RandomForestClassifier(random_state=2021,max_features='auto')
rf_params ={'n_estimators':[50,120,300],'max_depth':[5,8,15],
'min_samples_leaf':[2,5,10],'min_samples_split':[2,5,10]}
model_box.append([rf,rf_params])
for i in range(len(model_box)):
    best_model = GridSearchCV(model_box[i][0],param_grid=model_box[i][1],refit=True,
                              cv=5,scoring='neg_mean_squared_error').fit(train_reduced,targets)
    print('best_parameters:',best_model.best_params_)
————————————————————————————————————————————————————————
best_parameters: {'min_child_weight': 1, 'n_estimators': 120}
best_parameters: {'max_depth': 15, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 120}
1.4.4 Prediction
XGBoost performs somewhat better than the random forest here, so it is chosen as the final model.
xgb_r=xgb.XGBRegressor(learning_rate=0.1, n_estimators=120,
silent=0, objective='reg:linear',
gamma=0,subsample=0.8,colsample_bytree=0.8,
nthread=4,scale_pos_weight=1,seed=27,min_child_weight=1)
xgb_r.fit(train,targets)
predictions=xgb_r.predict(test)
Generate the submission file.
df_predictions = pd.DataFrame()
df_test_raw = pd.read_csv('happiness_test_complete.csv',encoding='gbk')
df_predictions['id'] = df_test_raw['id']
df_predictions['happiness'] = predictions
df_predictions[['id','happiness']].to_csv('submit.csv', index=False)