Happiness Prediction


Preface

In the era of big data, many aspects of our lives can be quantified by different indicators. Can our living conditions and emotional states be analyzed with big data as well? This article explores happiness and a "love at first sight" index using datasets from the Tianchi platform.

Happiness Prediction

1.1 Jupyter Setup, Imports, and Dataset Loading

Import the required modules.

import warnings

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

Suppress warnings.

from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

Set a Chinese font for matplotlib and seaborn so Chinese labels render correctly.

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')

Load the data.

data = pd.read_csv('happiness_train_complete.csv',encoding='gbk')

1.2 Exploratory Analysis

1.2.1 Dataset Preview

The dataset's shape is (8000, 140): happiness is the prediction target and the remaining columns are features. Preview the first and last five rows.

pd.concat([data.head(), data.tail()])
————————————————————————————————————————————————————————
	id	happiness	survey_type	province	city	county	survey_time	gender	birth	nationality	...	neighbor_familiarity	public_service_1	public_service_2	public_service_3	public_service_4	public_service_5	public_service_6	public_service_7	public_service_8	public_service_9
0	1	4	1	12	32	59	2015/8/4 14:18	1	1959	1	...	4	50	60	50	50	30.0	30	50	50	50
1	2	4	2	18	52	85	2015/7/21 15:04	1	1992	1	...	3	90	70	70	80	85.0	70	90	60	60
2	3	4	2	29	83	126	2015/7/21 13:24	2	1967	1	...	4	90	80	75	79	80.0	90	90	90	75
3	4	5	2	10	28	51	2015/7/25 17:33	2	1943	1	...	3	100	90	70	80	80.0	90	90	80	80
4	5	4	1	7	18	36	2015/8/10 9:50	2	1994	1	...	2	50	50	50	50	50.0	50	50	50	50
7995	7996	2	2	29	82	124	2015/7/21 19:36	1	1981	1	...	3	40	50	50	50	40.0	50	50	60	50
7996	7997	3	1	12	32	61	2015/7/31 16:00	2	1945	1	...	4	80	80	80	80	80.0	60	60	80	80
7997	7998	4	1	16	46	78	2015/8/1 17:48	2	1967	1	...	4	75	70	70	80	80.0	70	75	70	75
7998	7999	3	1	1	1	8	2015/9/22 18:52	2	1978	1	...	2	56	67	70	69	78.0	60	70	80	70
7999	8000	4	1	1	1	3	2015/9/28 20:22	2	1991	1	...	3	80	80	80	80	80.0	80	80	80	80
10 rows × 140 columns

Preview the dataset's distribution with describe(). The set has 8,000 rows; the mean of happiness is 3.85, and invalid codes (-8) appear, so rows where happiness is less than or equal to 0 are dropped from the training set.

data.describe()
df_train = data[data['happiness']>0]
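As a quick illustration on synthetic data (the demo frame below is made up, not the survey file), the boolean filter drops the invalid codes like this:

```python
import pandas as pd

# Synthetic stand-in for the happiness column: valid scores 1-5 plus the -8 refusal code
demo = pd.DataFrame({'happiness': [4, 5, -8, 3, -8, 4]})

# Keep only rows with a valid (positive) score, as done for df_train above
demo_train = demo[demo['happiness'] > 0]
```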

Check the median and mode to inspect the distribution. Both are 4, while the mean is slightly smaller, so the distribution is close to normal with a mild left skew.

data['happiness'].median()
data['happiness'].mode()[0]
————————————————————————————————————————————————————————
4.0
4
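To see why mean below a shared median and mode signals a mild left skew, here is a sketch on a made-up score distribution (not the survey data):

```python
import pandas as pd

# Made-up 1-5 score distribution, skewed slightly toward low values
s = pd.Series([1]*8 + [2]*7 + [3]*15 + [4]*50 + [5]*20)

mean, median, mode = s.mean(), s.median(), s.mode()[0]
# mean (3.67) sits below median and mode (both 4): a mild left skew,
# mirroring the survey's 3.85 vs. 4.0
```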

1.2.2 Happiness Level Breakdown

77.97% of respondents say their life is happy, 14.51% give no clear answer, and the remaining 7.52% say they are unhappy; most people take a positive view of their life.

# Draw pie and bar subplots with pandas
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(20,10))
df_train['happiness'].value_counts().plot.pie(autopct='%.2f%%',ax=ax1,fontsize=22)
ax1.set_title('Happiness',fontsize=25,color='w')
ax1.set_ylabel('')
df_train['happiness'].value_counts().plot.bar(ax=ax2,color=['r','g','b','black','y'],
                                    alpha=0.7,fontsize=22)
ax2.set_title('Happiness',fontsize=25,color='w')
plt.show()

[Figure: pie and bar charts of happiness counts]

1.2.3 Gender and Happiness

  • In total, women's reported happiness is higher than men's
  • Among those who feel happy, women outnumber men
  • Among those who feel unhappy, women also outnumber men
  • Among those who gave no clear answer, men outnumber women
  • Together this suggests women attend more closely to their sense of happiness, while men pay relatively less attention to it.
fig,(ax0,ax1,ax2)=plt.subplots(1,3,figsize=(20,10))
df_train.groupby('gender')['happiness'].count().plot.bar(ax=ax0,
                                        color=['b','r'],alpha=0.7,fontsize=22)
ax0.set_title('样本总数',fontsize=22)
ax0.set_xlabel('gender',fontsize=20)
sns.countplot(x='gender',hue='happiness',data=df_train,ax=ax1)
ax1.set_title('男女幸福指数',fontsize=22,)
ax1.set_xlabel('gender',fontsize=20)
ax1.set_ylabel('count',fontsize=20)
sns.countplot(x='happiness',hue='gender',data=df_train,ax=ax2)
ax2.set_title('男女幸福指数对比',fontsize=22,)
ax2.set_xlabel('happiness',fontsize=20)
ax2.set_ylabel('count',fontsize=20)

[Figure: happiness counts by gender]

1.2.4 Age and Happiness

First compute each respondent's age from the survey year and birth year, adding an 'age' column to both the training and test sets.

# the test set has not been loaded yet; read it here so age can be derived for both sets
df_test = pd.read_csv('happiness_test_complete.csv', encoding='gbk')
df_train['survey_time'] = pd.to_datetime(df_train['survey_time']).dt.year
df_test['survey_time'] = pd.to_datetime(df_test['survey_time']).dt.year
df_train['age'] = df_train['survey_time'] - df_train['birth']
df_test['age'] = df_test['survey_time'] - df_test['birth']
df_train.drop(['survey_time','birth'], axis=1, inplace=True)
df_test.drop(['survey_time','birth'], axis=1, inplace=True)

The histogram shows the age distribution of the sample; people aged 40-70 make up more than half of it.

[Figure: age distribution histogram]

fig=plt.figure(figsize=(15,8))
df_train['age'].plot.hist(color='m',alpha=0.75)
plt.title('年龄分布',fontsize=23)

Combine age and gender with happiness.

  • Most people under 40 affirm their happiness, with women somewhat more affirming than men.
  • People aged 40-60 feel happiness (or its absence) most keenly, whichever way they lean.
  • Among people over 60, most consider life happy, and few consider it unhappy.
fig = plt.figure(figsize=(25, 10))
sns.violinplot(x='happiness', y='age', 
               hue='gender', data=df_train, 
               split=True,alpha=0.9,
               palette={1: "r", 2: "g"});
plt.title('年龄-幸福感分布图',fontsize=25)
plt.xlabel('happiness',fontsize=22)
plt.ylabel('age',fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)

[Figure: violin plot of age vs. happiness, split by gender]

1.2.5 Region and Happiness

  • In total, urban respondents report higher happiness than rural ones
  • Among those who feel happy, urban respondents outnumber rural ones
  • Among those who feel unhappy, the urban and rural counts are similar
  • Among those who gave no clear answer, urban respondents outnumber rural ones
  • Overall, city dwellers appear more attuned to their sense of happiness.

[Figure: happiness counts by survey type (urban vs. rural)]
fig,(ax0,ax1,ax2)=plt.subplots(1,3,figsize=(20,10))
df_train.groupby('survey_type')['happiness'].count().plot.bar(ax=ax0,
                                        color=['r','g'],alpha=0.9,fontsize=22)
ax0.set_title('样本总数',fontsize=22)
ax0.set_xlabel('survey_type',fontsize=20)
sns.countplot(x='survey_type',hue='happiness',data=df_train,ax=ax1)
ax1.set_title('城村幸福指数',fontsize=22,)
ax1.set_xlabel('survey_type',fontsize=20)
ax1.set_ylabel('count',fontsize=20)
sns.countplot(x='happiness',hue='survey_type',data=df_train,ax=ax2)
ax2.set_title('城村幸福指数对比',fontsize=22,)
ax2.set_xlabel('happiness',fontsize=20)
ax2.set_ylabel('count',fontsize=20)

1.3 Feature Engineering

1.3.1 Handling Missing Values

Append the test set to the training set so feature engineering can be done on both at once.

def combined_data():
    data_train = pd.read_csv('happiness_train_complete.csv', encoding='gbk')
    df_test = pd.read_csv('happiness_test_complete.csv', encoding='gbk')
    df_train = data_train[data_train['happiness'] > 0]
    targets = df_train.happiness
    df_train = df_train.drop(['happiness'], axis=1)
    combined = pd.concat([df_train, df_test])
    combined.reset_index(inplace=True)
    combined.drop(['id','index'], inplace=True, axis=1)
    return combined, df_train, targets
combined, df_train, targets = combined_data()
combined

Find the columns that contain missing values.

df_nan=pd.DataFrame(combined.isnull().sum(),columns=['count'])
index_nan =df_nan[df_nan['count']!=0].index
combined[index_nan]
————————————————————————————————————————————————————————

edu_other	edu_status	edu_yr	join_party	property_other	hukou_loc	social_neighbor	social_friend	work_status	work_yr	...	marital_1st	s_birth	marital_now	s_edu	s_political	s_hukou	s_income	s_work_exper	s_work_status	s_work_type
0	NaN	4.0	-2.0	NaN	NaN	2.0	3.0	3.0	3.0	30.0	...	1984.0	1958.0	1984.0	6.0	1.0	5.0	40000.0	5.0	NaN	NaN
1	NaN	4.0	2013.0	NaN	NaN	1.0	6.0	2.0	3.0	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	4.0	-2.0	NaN	NaN	1.0	2.0	5.0	NaN	NaN	...	1990.0	1968.0	1990.0	3.0	1.0	1.0	6000.0	3.0	NaN	NaN
3	NaN	4.0	1959.0	NaN	NaN	2.0	1.0	6.0	NaN	NaN	...	1960.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	1.0	2014.0	NaN	NaN	3.0	7.0	5.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
10963	NaN	2.0	NaN	NaN	NaN	2.0	1.0	1.0	3.0	6.0	...	1982.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10964	NaN	2.0	-1.0	NaN	NaN	1.0	3.0	4.0	NaN	NaN	...	1998.0	1975.0	1998.0	3.0	1.0	1.0	10000.0	3.0	NaN	NaN
10965	NaN	NaN	NaN	NaN	NaN	1.0	2.0	2.0	NaN	NaN	...	1989.0	1962.0	1989.0	4.0	1.0	1.0	15000.0	3.0	NaN	NaN
10966	NaN	4.0	-2.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	...	1988.0	1950.0	1988.0	7.0	1.0	-8.0	-1.0	4.0	NaN	NaN
10967	NaN	2.0	1954.0	NaN	NaN	1.0	2.0	5.0	NaN	NaN	...	1970.0	1941.0	1970.0	3.0	1.0	2.0	12000.0	5.0	NaN	NaN
10968 rows × 25 columns
1.3.1.1 Education
  • edu_other means "other", which is already coded as 14 in edu, so drop it
  • NaN in edu_status means no education received; replace it with 0, and replace NaN in edu_yr with 0
  • Where edu_status is -8 but edu_yr holds a concrete year, replace edu_status with 4; where edu_yr is also negative, replace both with 0
  • edu_yr values below 0 mean an unknown graduation year; replace them with 0
	edu_other	edu_status	edu_yr
2950	NaN	-8.0	2004.0
3124	NaN	-8.0	1977.0
5127	NaN	-8.0	1965.0
6612	NaN	-8.0	2012.0
7148	NaN	-8.0	1962.0
9271	NaN	-8.0	1969.0
10079	NaN	-8.0	1960.0
10447	NaN	-8.0	2001.0

Putting the above together:

def modify_edu():
    global combined
    combined.drop(['edu_other'],axis=1,inplace=True)
    combined['edu_status'].fillna(0,inplace=True)  
    combined['edu_yr'].fillna(0,inplace=True)  
    combined.loc[(combined['edu_status']<0) & (combined['edu_yr']<0),
                 ['edu_status','edu_yr']]=[0,0]
    combined.loc[combined['edu_status']==(-8),'edu_status']=4
    combined.loc[combined['edu_yr']<0,'edu_yr']=0  
    combined['edu_yr'] = combined['edu_yr'].astype(int)
    return combined
combined =modify_edu()
1.3.1.2 Party Membership
  • Fill missing values with fillna
  • Replace values of 4 and -2 with 0
def modify_join_party():
    global combined
    combined['join_party'].fillna(0,inplace=True)
    combined['join_party'] = combined['join_party'].astype(int)
    combined.loc[(combined['join_party'] ==4) | (combined['join_party'] ==(-2)),
                                                'join_party']=0
    return combined
combined = modify_join_party()
1.3.1.3 Property Ownership
  • Replace missing values with 0
  • A non-empty value indicates some recorded property information; replace it with 1
def modify_property_other():
    global combined
    combined['property_other'].fillna(0,inplace=True)
    combined.loc[combined['property_other'] !=0,'property_other'] =1
    return combined
combined=modify_property_other()
1.3.1.4 Hukou Registration
  • Since 4 means the hukou is undetermined, replace missing values and 0 with 4
def modify_hukou_loc():
    global combined
    combined['hukou_loc'].fillna(4,inplace=True)
    combined.loc[combined['hukou_loc']==0,'hukou_loc']=4
    return combined
combined=modify_hukou_loc()
1.3.1.5 Social Activity
  • Fill missing and invalid values in social_neighbor and social_friend with the median
def modify_social_activity():
    global combined
    social_neighbor_median = combined['social_neighbor'].median()
    combined['social_neighbor'].fillna(social_neighbor_median,inplace=True)
    combined.loc[combined['social_neighbor']<0,'social_neighbor']=social_neighbor_median
    social_friend_median = combined['social_friend'].median()
    combined['social_friend'].fillna(social_friend_median,inplace=True)
    combined.loc[combined['social_friend']<0,'social_friend']=social_friend_median
    return combined
combined=modify_social_activity()
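A small worked example of the median fill (a synthetic series, not the real column); note that, as in the function above, the median is computed before the negative sentinel is replaced:

```python
import pandas as pd

# Toy social_neighbor-like column: a NaN and a -8 sentinel among valid codes
s = pd.Series([3.0, None, 6.0, -8.0, 3.0])

med = s.median()      # median of [3, 6, -8, 3] -> 3.0 (NaN is ignored)
s = s.fillna(med)
s[s < 0] = med        # replace the negative sentinel with the median too
```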
1.3.1.6 Work Status
  • Fill missing work_status with 9 ('other'); fill the remaining work features with 0
  • The only negative work_status value is -8; replace it with its absolute value, 8
  • Replace negative work_yr values with their absolute values 1, 2, 3
  • Replace negative work_type and work_manage values with 0
def modify_work_situation():
    global combined
    combined['work_status'].fillna(9,inplace=True)
    combined['work_yr'].fillna(0,inplace=True)
    combined['work_type'].fillna(0,inplace=True)
    combined['work_manage'].fillna(0,inplace=True)
    combined.loc[combined['work_status']<0,'work_status']=8
    combined.loc[combined['work_yr']==(-1),'work_yr'] =1
    combined.loc[combined['work_yr']==(-2),'work_yr'] =2
    combined.loc[combined['work_yr']==(-3),'work_yr'] =3
    combined.loc[combined['work_type']==-8,'work_type']=0
    combined.loc[combined['work_manage']==-8,'work_manage']=0
    return combined
combined=modify_work_situation()
1.3.1.7 Family Income
  • Fill missing values with the median
  • Replace negative family_income values with 0
def modify_family_income():
    global combined
    combined['family_income'].fillna(combined['family_income'].median(),inplace=True)
    combined.loc[combined['family_income'] <0,'family_income'] =0
    return combined
combined =modify_family_income()
1.3.1.8 Investments
  • Replace missing values with 0
  • Entries with a text note become 1; purely numeric entries become 0

def modify_invest_other():
    global combined
    combined['invest_other'].fillna(0,inplace=True)
    invest_other_01 =[]
    for i in combined['invest_other']:
        if type(i) == int :
            invest_other_01.append(0)
        else:
            invest_other_01.append(1)
    combined['invest_other'] =invest_other_01
    return combined
combined=modify_invest_other()
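The type-based loop above works because, after fillna(0), real notes are strings while the fill value is the int 0; the same flag can be computed without a loop. A sketch on made-up values:

```python
import pandas as pd

# After fillna(0): text notes are strings, missing entries are the int 0
col = pd.Series(['股票', 0, 0, '基金', 0])

# 1 where a text note exists, 0 otherwise (vectorized version of the loop)
flag = (col != 0).astype(int)
```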
1.3.1.9 Minor Children
  • Replace missing and negative values with 0
def modify_minor_child():
    global combined
    combined['minor_child'].fillna(0,inplace=True)
    combined.loc[combined['minor_child']<0,'minor_child']=0
    return combined
combined =modify_minor_child()
1.3.1.10 Marriage Dates
  • Replace missing and invalid values with 0
def modify_marital_date():
    global combined
    combined['marital_1st'].fillna(0,inplace=True)
    combined.loc[combined['marital_1st']<0,'marital_1st']=0
    combined.loc[combined['marital_1st']==4,'marital_1st']=0
    combined['marital_1st'] = combined['marital_1st'].astype(int)
    combined['marital_now'].fillna(0,inplace=True)
    combined.loc[combined['marital_now']<0,'marital_now']=0
    combined['marital_now'] = combined['marital_now'].astype(int)
    return combined
combined =modify_marital_date()
1.3.1.11 Spouse Information
  • Fill missing and negative values with 0 or a sentinel value
def modify_s_situation():
    global combined
    combined['s_birth'].fillna(0,inplace=True)  
    combined['s_birth'] = combined['s_birth'].astype(int)
    
    combined['s_edu'].fillna(14,inplace=True)  
    combined.loc[(combined['s_edu']==0) | (combined['s_edu']==-8),'s_edu']=14   
    
    combined['s_political'].fillna(0,inplace=True)
    combined.loc[combined['s_political']==-8,'s_political']=0
    
    combined['s_hukou'].fillna(0,inplace=True)
    combined.loc[(combined['s_hukou']==0) | (combined['s_hukou']==-8),'s_hukou']=8
    
    combined['s_income'].fillna(combined['s_income'].mean(),inplace=True)
    combined.loc[combined['s_income']<0,'s_income']=0
    
    combined['s_work_exper'].fillna(0,inplace=True)
    combined['s_work_status'].fillna(9,inplace=True)
    combined.loc[(combined['s_work_status']==0) | (combined['s_work_status']==-8),'s_work_status']=9
    
    combined['s_work_type'].fillna(0,inplace=True)
    combined.loc[(combined['s_work_type']==3)|(combined['s_work_type']==8)|(combined['s_work_type']==-8),
                                                            's_work_type']=0
    return combined
combined =modify_s_situation()

1.3.2 Binning Age

  • Compute age as the survey year minus the birth year
def get_age():
    global combined
    combined['age']=pd.to_datetime(combined['survey_time']).dt.year-combined['birth']
    combined.drop(['survey_time','birth'],axis=1,inplace=True)
    return combined
combined =get_age()
  • Use binning to split age into five groups
  • age <= 18 → 0
  • 18 < age <= 39 → 1
  • 39 < age <= 60 → 2
  • 60 < age <= 75 → 3
  • age > 75 → 4
def modify_age():
    global combined
    combined.loc[combined['age']<=18,'age']=0
    combined.loc[(combined['age'] > 18) & (combined['age'] <= 39), 'age'] = 1
    combined.loc[(combined['age'] > 39) & (combined['age'] <= 60), 'age'] = 2
    combined.loc[(combined['age'] > 60) & (combined['age'] <= 75), 'age'] = 3
    combined.loc[ combined['age'] > 75, 'age'] = 4
    return combined
combined=modify_age()
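The chained .loc assignments above can also be written as a single pd.cut call; a sketch on example ages:

```python
import pandas as pd

ages = pd.Series([15, 25, 45, 65, 80])

# Same bins as modify_age(): (-inf,18], (18,39], (39,60], (60,75], (75,inf)
binned = pd.cut(ages,
                bins=[-float('inf'), 18, 39, 60, 75, float('inf')],
                labels=[0, 1, 2, 3, 4]).astype(int)
```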

1.3.3 Feature Distributions and Combinations

A feature's distribution reflects how balanced the sample is: if one category dominates a feature (say, reaching 99%), the feature carries no practical value for analysis or model training and can be dropped.

  • Build a plotting and percentage helper to inspect each feature's distribution
def features_distribution(feature):
    global combined
    sns.countplot(x=feature,data=combined)
    feature_sum=combined.groupby(feature)[feature].value_counts().sum()
    feature_max=combined.groupby(feature)[feature].value_counts().max()
    feature_pct=feature_max/feature_sum*100
    print('%s中数量最多占比为: %.2f%%'% (feature,feature_pct))
  • Build a feature-removal helper for dropping badly imbalanced features
def remove_feature(feature):
    global combined
    combined.drop([feature],axis=1,inplace=True)
    return combined
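The dominant-category share computed by features_distribution can also be obtained directly with value_counts(normalize=True); a sketch on a synthetic column:

```python
import pandas as pd

# Toy column where one category takes 92% of the sample
col = pd.Series(['han'] * 92 + ['other'] * 8)

top_share = col.value_counts(normalize=True).max() * 100
# a feature this imbalanced carries little signal and is a candidate for removal
```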
1.3.3.1 Ethnicity Distribution
  • Han Chinese make up 92.06% of nationality; such a dominant share has no training value, so the feature is dropped.
features_distribution('nationality')
remove_feature('nationality')
——————————————————————————————————————————————————————
nationality中数量最多占比为: 92.06%

[Figure: nationality distribution]

1.3.3.2 Religion Distribution
  • 87.88% of religion is "no religious belief"; drop it.
  • 86.00% of religion_freq is "never attends religious activities"; drop it.
features_distribution('religion')
features_distribution('religion_freq')
remove_feature('religion')
remove_feature('religion_freq')
——————————————————————————————————————————————————————
religion中数量最多占比为: 87.88%
religion_freq中数量最多占比为: 86.00%

[Figure: religion distribution]

[Figure: religion_freq distribution]

1.3.3.3 Political Status Distribution
  • 84.10% of political is "ordinary citizen"; drop it.
features_distribution('political')
remove_feature('political')
——————————————————————————————————————————————————————
political中数量最多占比为: 84.10%

[Figure: political distribution]

1.3.3.4 Property Distribution
  • Keep only property_1 and property_2; drop the rest.
for i in range(8):
    features_distribution(f'property_{i}')
for i in [0, 3, 4, 5, 6, 7]:
    remove_feature(f'property_{i}')
——————————————————————————————————————————————————————
property_0中数量最多占比为: 99.20%
property_1中数量最多占比为: 52.82%
property_2中数量最多占比为: 73.29%
property_3中数量最多占比为: 89.92%
property_4中数量最多占比为: 89.60%
property_5中数量最多占比为: 97.46%
property_6中数量最多占比为: 99.60%
property_7中数量最多占比为: 97.76%
1.3.3.5 Health Distribution
  • health and health_problem have similar distributions; drop one of them.
features_distribution('health')
features_distribution('health_problem')
remove_feature('health_problem')
——————————————————————————————————————————————————————
health中数量最多占比为: 38.81%
health_problem中数量最多占比为: 36.56%

[Figure: health distribution]

[Figure: health_problem distribution]

1.3.3.6 Media Use Distribution

  • media_2 and media_3 have similar distributions; drop one of them.

for i in range(1, 7):
    features_distribution(f'media_{i}')
remove_feature('media_2')
——————————————————————————————————————————————————————
media_1中数量最多占比为: 50.23%
media_2中数量最多占比为: 54.81%
media_3中数量最多占比为: 55.24%
media_4中数量最多占比为: 40.38%
media_5中数量最多占比为: 53.13%
media_6中数量最多占比为: 69.55%
1.3.3.7 Family Class Distribution
  • class_10_after and class_10_before capture the change in family class over ten years, so combine them into class_variable.
combined['class_variable'] =combined['class_10_after'] -combined['class_10_before']
features_distribution('class')
features_distribution('class_variable')
features_distribution('class_14')
remove_feature('class_10_after')
remove_feature('class_10_before')
——————————————————————————————————————————————————————
class中数量最多占比为: 34.65%
class_variable中数量最多占比为: 22.25%
class_14中数量最多占比为: 21.08%
1.3.3.8 Social Insurance Distribution
  • insur_3 and insur_4 both describe commercial insurance and trend together; keep only one.
features_distribution('insur_1')
features_distribution('insur_2')
features_distribution('insur_3')
features_distribution('insur_4')
remove_feature('insur_4')
——————————————————————————————————————————————————————
insur_1中数量最多占比为: 90.80%
insur_2中数量最多占比为: 68.81%
insur_3中数量最多占比为: 89.32%
insur_4中数量最多占比为: 91.81%
1.3.3.9 Investment Distribution
  • Keep invest_2; the others show almost no investment, so drop them all.
for i in range(9):
    features_distribution(f'invest_{i}')
for i in [0, 1, 3, 4, 5, 6, 7, 8]:
    remove_feature(f'invest_{i}')
——————————————————————————————————————————————————————
invest_0中数量最多占比为: 98.48%
invest_1中数量最多占比为: 90.94%
invest_2中数量最多占比为: 93.97%
invest_3中数量最多占比为: 97.79%
invest_4中数量最多占比为: 99.52%
invest_5中数量最多占比为: 99.84%
invest_6中数量最多占比为: 99.99%
invest_7中数量最多占比为: 99.93%
invest_8中数量最多占比为: 99.94%

1.3.4 Data Type Handling

1.3.4.1 Datetime Features
  • survey_time and birth were handled earlier; the parents' birth years, the spouse's birth year, marriage dates, and similar fields add no value for the target, so drop them.
combined.drop(['s_birth','f_birth','m_birth','marital_now','edu_yr','join_party','marital_1st'],axis=1,inplace=True)
1.3.4.2 Numeric Features (Standardization)
  • Extract the numeric columns and compute their means and standard deviations
  • Standardize the data
numeric = ['income','height_cm','weight_jin','s_income',
           'family_income','family_m','house','car',
           'son','daughter','minor_child','inc_exp','public_service_1',
           'public_service_2','public_service_3','public_service_4',
           'public_service_5','public_service_6','public_service_7',
           'public_service_8','public_service_9','floor_area']
def normalizing():
    global combined, numeric
    numeric_means = combined[numeric].mean()
    numeric_std = combined[numeric].std()
    combined[numeric] = (combined[numeric] - numeric_means) / numeric_std
    return combined
combined = normalizing()
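The z-score formula used for standardization, on a toy frame (synthetic numbers, just to show the arithmetic):

```python
import pandas as pd

df = pd.DataFrame({'income': [10.0, 20.0, 30.0]})

# (x - mean) / std, column-wise; pandas .std() is the sample standard deviation
z = (df - df.mean()) / df.std()
```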
1.3.4.3 Categorical Features (One-Hot Encoding)
  • One-hot encode the categorical columns
def one_hot_encoder():
    global combined, numeric
    df_object = combined.drop(numeric, axis=1)
    df_object = df_object.astype(str)
    for name in list(df_object):
        object_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, object_dummies], axis=1)
        combined.drop(name, inplace=True, axis=1)
    return combined
combined = one_hot_encoder()
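What pd.get_dummies does for a single column, on a made-up example:

```python
import pandas as pd

s = pd.Series(['1', '2', '1'], name='gender')

# one 0/1 indicator column per category level, prefixed with the feature name
dummies = pd.get_dummies(s, prefix='gender')
```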

1.4 Model Training

  • Recover the training and test sets
def recover_train_test_target():
    global combined
    targets = pd.read_csv('happiness_train_complete.csv', usecols=['happiness'])['happiness'].values
    train = combined.iloc[:7988]
    test = combined.iloc[7988:]
    targets =targets[targets>0]
    return train, test, targets
train, test, targets=recover_train_test_target()
targets =pd.DataFrame(targets,columns=['happiness'])
  • Import the modules needed for model training
import sklearn
from sklearn.linear_model import Lasso, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

1.4.1 Feature Selection

  • Use random-forest feature importances to reduce the feature dimensionality
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets.values.ravel())
model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
train_reduced.shape
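How SelectFromModel prunes features, on synthetic data where only the first feature is informative (everything here is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] > 0.5).astype(int)   # only feature 0 determines the label

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# default threshold: keep features whose importance exceeds the mean importance
X_reduced = SelectFromModel(clf, prefit=True).transform(X)
```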

1.4.2 Building Models

Run a first-pass evaluation with several models.

logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
Las = Lasso()
svc = SVC()
xgb_r = xgb.XGBRegressor()
models = [logreg_cv, rf, svc, Las, xgb_r]

Evaluate each model with a scoring helper.

def compute_score(clf, x, y, cv=5, scoring='accuracy'):
    xval = cross_val_score(clf, x, y, cv=cv, scoring=scoring)
    return np.mean(xval)
for model in models:
    print(f'Cross-validation of: {model.__class__}')
    # note: 'accuracy' only applies to the classifiers; the regressors need an error metric
    score = compute_score(clf=model, x=train_reduced, y=targets, scoring='accuracy')
    print(score)
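Because 'accuracy' is undefined for continuous predictions, the regressors in the list (Lasso, XGBRegressor) need an error-based metric instead; a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.rand(100)

# sklearn maximizes scores, so MSE is exposed as its negative
mse = -cross_val_score(Lasso(alpha=0.001), X, y, cv=5,
                       scoring='neg_mean_squared_error').mean()
```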

1.4.3 Hyperparameter Tuning

Tune parameters with grid search.

model_box= []
xgb_re=xgb.XGBRegressor(seed=27,learning_rate=0.1, n_estimators=300,silent=0, objective='reg:linear',
                         gamma=0,subsample=0.8,colsample_bytree=0.8,nthread=4,scale_pos_weight=1)
xgb_params ={'n_estimators':[50,100,120],'min_child_weight':list(range(1,4,2)),}
model_box.append([xgb_re,xgb_params])

rf=RandomForestClassifier(random_state=2021,max_features='auto')
rf_params ={'n_estimators':[50,120,300],'max_depth':[5,8,15],
            'min_samples_leaf':[2,5,10],'min_samples_split':[2,5,10]}
model_box.append([rf,rf_params])
for i in range(len(model_box)):
    best_model= GridSearchCV(model_box[i][0],param_grid=model_box[i][1],refit=True,
                             cv=5,scoring='neg_mean_squared_error').fit(train_reduced,targets)
    print('best_parameters:',best_model.best_params_)  
————————————————————————————————————————————————————————
best_parameters: {'min_child_weight': 1, 'n_estimators': 120}
best_parameters: {'max_depth': 15, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 120}

1.4.4 Prediction

XGBoost performs somewhat better than the random forest, so it is chosen as the final model.

xgb_r=xgb.XGBRegressor(learning_rate=0.1, n_estimators=120,
                  silent=0, objective='reg:linear',
                  gamma=0,subsample=0.8,colsample_bytree=0.8,
                  nthread=4,scale_pos_weight=1,seed=27,min_child_weight=1)
xgb_r.fit(train,targets)
predictions=xgb_r.predict(test)

Generate the submission file.

df_predictions = pd.DataFrame()
test_data = pd.read_csv('happiness_test_complete.csv', encoding='gbk')
df_predictions['id'] = test_data['id']
df_predictions['happiness'] = predictions
df_predictions[['id','happiness']].to_csv('submit.csv', index=False)