Data Analysis in Practice: House Price Prediction for King County, USA

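Competition background:

The data covers house sale prices and basic house information for King County, USA, from May 2014 to May 2015. It is split into training data and test data, stored in kc_train.csv and kc_test.csv respectively. The training data contains 10,000 records with 14 fields, described below: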

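Column 1, "sale date" (销售日期): the date the house was sold, between May 2014 and May 2015

Column 2, "sale price" (销售价格): the transaction price in US dollars; this is the prediction target

Column 3, "bedrooms" (卧室数): the number of bedrooms in the house

Column 4, "bathrooms" (浴室数): the number of bathrooms in the house

Column 5, "house area" (房屋面积): the living area of the house

Column 6, "parking area" (停车面积): the area of the parking lot

Column 7, "floors" (楼层数): the number of floors in the house

Column 8, "house grade" (房屋评分): the overall grade given by the King County housing grading system

Column 9, "building area" (建筑面积): the building area excluding the basement

Column 10, "basement area" (地下室面积): the area of the basement

Column 11, "year built" (建筑年份): the year the house was built

Column 12, "year renovated" (修复年份): the year the house was last renovated

Column 13, "latitude" (纬度): the latitude of the house

Column 14, "longitude" (经度): the longitude of the house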

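The test data contains 3,000 records with 13 fields. Unlike the training data, it does not include the sale price. The task is to build a model on the training data and use it to predict the sale price for each record in the test data.

Scoring criteria

A regression model is judged by its average prediction error: the smaller the error, the better the model.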
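
The competition text does not say which measure "average prediction error" refers to. As a sketch, assuming it is something like mean absolute error (MAE) or root mean squared error (RMSE), both can be computed with NumPy. The function names and the toy prices below are illustrative and not part of the competition.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| (assumed metric, not confirmed by the source)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference (also an assumed metric)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices in USD, made up for illustration
actual = [300000, 450000, 520000]
predicted = [310000, 430000, 520000]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae, rmse)
```

RMSE penalises large misses more heavily than MAE, so the two can rank models differently when a few houses are badly mispriced.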

I. Data preprocessing

1. Inspect the data
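
The snippets in this walkthrough use DataFrames that are already in memory. A minimal loading sketch follows, assuming both CSV files sit in the working directory and have no header row (step 2 states that the raw data has no column names). `load_data` is a hypothetical helper name.

```python
import pandas as pd

def load_data(train_path='kc_train.csv', test_path='kc_test.csv'):
    """Read both files without a header row, so columns are labelled 0, 1, 2, ..."""
    df_train = pd.read_csv(train_path, header=None)
    df_test = pd.read_csv(test_path, header=None)
    return df_train, df_test
```

Calling `df_train, df_test = load_data()` yields the two DataFrames used in the later steps. The `df` in the inspection snippet that follows can be either one.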

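# df is the DataFrame being inspected, e.g. the training data
print(df.head())          # preview the first five rows
print(df.isnull().sum())  # count missing values in each column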

The output is shown in the figure below:

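2. The raw data has no column names, so we add them and separate out the label.

# Names for the 13 feature columns: sale date, bedrooms, bathrooms, house area,
# parking area, floors, house grade, building area, basement area, year built,
# year renovated, latitude, longitude
columns = ['销售日期', '卧室数', '浴室数', '房屋面积', '停车面积', '楼层数', '房屋评分',
           '建筑面积', '地下室面积', '建筑年份', '修复年份', '纬度', '经度']

# Pull the prediction target (sale price, column index 1) into its own DataFrame
df_price = pd.DataFrame(columns=['销售价格'])
df_price['销售价格'] = df_train[1]

# Drop the sale price column from the training features
df_train.drop(columns=[1], inplace=True)

# Attach the column names to both data sets
df_train.columns = columns
df_test.columns = columns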

3. Handle the date. The original sale date contains the year, month and day in one value; split it into separate year, month and day columns. Reference code:

# Convert the sale date into separate columns
def timeFormat(df):
    # The first eight characters of each date are YYYYMMDD
    date_str = df['销售日期'].astype(str)
    df['销售年份'] = date_str.str[0:4].astype(int)  # sale year
    df['销售月份'] = date_str.str[4:6].astype(int)  # sale month
    df['销售日'] = date_str.str[6:8].astype(int)    # sale day
    df.drop(columns=['销售日期'], inplace=True)

    return df
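
As an alternative to slicing the string by position, pandas can parse the date itself. This sketch assumes the first eight characters of each value are `YYYYMMDD`, which is also what the slicing above assumes. `split_sale_date` is an illustrative name, and the demo rows are made up.

```python
import pandas as pd

def split_sale_date(df, col='销售日期'):
    """Return a copy with the sale-date column replaced by year, month and day."""
    parsed = pd.to_datetime(df[col].astype(str).str[:8], format='%Y%m%d')
    out = df.drop(columns=[col])
    out['销售年份'] = parsed.dt.year
    out['销售月份'] = parsed.dt.month
    out['销售日'] = parsed.dt.day
    return out

# Made-up rows for illustration
demo = pd.DataFrame({'销售日期': [20140502, 20150127], '卧室数': [3, 4]})
result = split_sale_date(demo)
print(result)
```

Parsing with an explicit format has the side benefit of raising an error on malformed dates instead of silently producing a wrong integer.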

4. Next, process the year built and the year renovated to get the age of the building at sale and the number of years since its last renovation. Reference code:

df['建筑年龄'] = df['销售年份'] - df['建筑年份']  # building age at sale
# Years since renovation; a renovation year of 0 is treated as "never renovated" and mapped to 0
df['修复年龄'] = (df['销售年份'] - df['修复年份']).where(df['修复年份'] != 0, 0)
df.drop(columns=['修复年份', '建筑年份'], inplace=True)

5. Handle latitude and longitude. One option is to convert them to polar coordinates. Reference code:

import numpy as np

# Convert latitude/longitude to polar coordinates
def pointChange(lat, lon):
    loc_x = lat.values  # latitude
    loc_y = lon.values  # longitude

    # Use the minimum latitude and longitude as the pole
    dx = loc_x - loc_x.min()
    dy = loc_y - loc_y.min()

    # arctan2 avoids the division by zero that arctan(dy / dx) hits when dx == 0
    radius = np.sqrt(dx ** 2 + dy ** 2).round(decimals=8)    # radius
    angle = np.degrees(np.arctan2(dy, dx)).round(decimals=8)  # angle in degrees

    return radius, angle

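Another idea is to use latitude and longitude to divide the area into a grid of rectangular cells and assign each house to the cell it falls in.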
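
The grid idea can be sketched with `pandas.cut`, which assigns each value to one of a fixed number of equal-width bins. The number of bins per axis (`n_bins`), the helper name `add_grid_cell` and the new column name `区域编号` are assumptions chosen for illustration.

```python
import pandas as pd

def add_grid_cell(df, n_bins=10, lat_col='纬度', lon_col='经度'):
    """Return a copy with the index of the rectangular cell each house falls in."""
    # labels=False makes pd.cut return integer bin indices 0..n_bins-1
    lat_bin = pd.cut(df[lat_col], bins=n_bins, labels=False)
    lon_bin = pd.cut(df[lon_col], bins=n_bins, labels=False)
    out = df.copy()
    # Combine the two bin indices into a single cell id
    out['区域编号'] = lat_bin * n_bins + lon_bin
    return out

# Made-up coordinates for illustration
demo = pd.DataFrame({'纬度': [47.2, 47.5, 47.7],
                     '经度': [-122.4, -122.2, -121.8]})
gridded = add_grid_cell(demo, n_bins=2)
print(gridded)
```

For the training and test sets to share the same cells, the bin edges would have to be computed once (for example from the training data) and passed to `pd.cut` explicitly for both sets. The sketch skips that step.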

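6. Combine existing features into new ones. One option is to sum related columns, such as the various areas. The reference code below sums the bedroom and bathroom counts:

# Combined feature: bedrooms + bathrooms
df['卧室_浴室数'] = df['卧室数'] + df['浴室数']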
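
The area columns can be combined in the same way. A minimal sketch, assuming the house area and the parking area are measured in the same unit. The new column name `总面积` and the demo values are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({'房屋面积': [1180, 2570], '停车面积': [5650, 7242]})

# Vectorised column addition; no row-wise apply is needed
demo['总面积'] = demo['房屋面积'] + demo['停车面积']
print(demo)
```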

At this point, feature processing is complete.

II. Modeling

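1. Feature selection. We need to choose which features to keep and drop the unimportant ones. The model here is a random forest, which exposes feature_importances_ for ranking features. Reference code:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Feature selection and visualization
def featureChoice(feature, label):
    x = feature.values
    y = label.values.ravel()  # flatten to 1-D, as scikit-learn expects
    rfr = RandomForestRegressor(n_estimators=100, max_depth=10, n_jobs=-1)
    rfr.fit(x, y)
    df_features = pd.DataFrame({'column': feature.columns,
                                'importance': rfr.feature_importances_}).sort_values(by='importance')

    # Horizontal bar chart, least important feature at the bottom
    plt.figure(figsize=(12, 9))
    plt.barh(range(len(df_features)), df_features['importance'], height=0.8, alpha=0.6)
    plt.yticks(range(len(df_features)), df_features['column'])
    plt.show()

    # Rows are sorted by ascending importance, so the last four are the most important
    feature_choice = df_features['column'].values.tolist()
    return feature_choice[-4:]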

The plot is shown below. Based on it we pick the most important features; here, the top four.
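
To see the ranking behave as expected, it helps to try it on data where the answer is known. The sketch below uses synthetic data in which only two of four columns drive the target. `top_k_features` and the column names `a` to `d` are illustrative, and this is a plot-free variant of the function above rather than the author's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def top_k_features(feature, label, k=4, random_state=42):
    """Return the names of the k features with the highest importance."""
    rfr = RandomForestRegressor(n_estimators=50, random_state=random_state)
    rfr.fit(feature.values, np.ravel(label))
    ranked = pd.Series(rfr.feature_importances_, index=feature.columns)
    return ranked.sort_values(ascending=False).head(k).index.tolist()

# Synthetic data: only 'a' and 'b' influence the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y = 5 * X['a'] + 2 * X['b'] + rng.normal(scale=0.1, size=200)

selected = top_k_features(X, y, k=2)
print(selected)
```

The returned names can then be used to subset both data sets, for example `df_train[selected]` and `df_test[selected]`.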

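2. Build the model. I chose a random forest. Start by setting its parameters; the main ones to consider are:

n_estimators: the number of trees in the forest.
max_depth: the maximum depth of each tree.
min_samples_split: the minimum number of samples required to split an internal node.
min_samples_leaf: the minimum number of samples required at a leaf node.

For more, see the official scikit-learn documentation.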

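3. Tune the parameters

First tune n_estimators. Reference code:

from sklearn.model_selection import cross_val_score

# Random forest parameter tuning
def rfMOdelparams(feature, label):
    x = feature.values
    y = label.values.ravel()  # flatten to 1-D, as scikit-learn expects

    # First pass over n_estimators: 50, 100, ..., 1000
    candidates = range(50, 1001, 50)
    test_score = []
    for n in candidates:
        rfr = RandomForestRegressor(n_estimators=n, random_state=42, n_jobs=-1)
        score = cross_val_score(rfr, x, y, cv=10).mean()
        test_score.append(score)

    # Plot the mean cross-validation score against the n_estimators values actually tried
    plt.figure(figsize=(12, 8))
    plt.plot(candidates, test_score)
    plt.show()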

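The plot is shown below:

Then, building on the first pass, search n_estimators on a finer grid. Reference code:

# Second pass over n_estimators: 800, 810, ..., 900 (x and y as defined above)
candidates = range(800, 901, 10)
test_score = []
for n in candidates:
    rfr = RandomForestRegressor(n_estimators=n, random_state=42, n_jobs=-1)
    score = cross_val_score(rfr, x, y, cv=10).mean()
    test_score.append(score)

# Plot the mean cross-validation score against the n_estimators values actually tried
plt.figure(figsize=(12, 8))
plt.plot(candidates, test_score)
plt.show()

The plot is shown below:

From the plot we settle on n_estimators=850.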
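
The loop-and-plot pattern used for n_estimators can also be written with scikit-learn's `validation_curve`, which runs the cross-validation for each candidate value in one call. The sketch below uses synthetic data and deliberately small candidate values so that it runs quickly; these values are assumptions and not the ones used in the article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the real features and prices
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

param_range = [10, 20, 40]  # small values to keep the sketch fast
train_scores, cv_scores = validation_curve(
    RandomForestRegressor(random_state=42),
    X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=3,
)

# Each row of cv_scores holds the per-fold R^2 for one candidate value
mean_cv = cv_scores.mean(axis=1)
best = param_range[int(np.argmax(mean_cv))]
print(dict(zip(param_range, mean_cv.round(3))), 'best:', best)
```

Because the training scores are returned as well, the same call can be reused later to check whether a parameter setting overfits.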

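Next, tune max_depth. Reference code:

import numpy as np
from sklearn.model_selection import GridSearchCV

# Tune max_depth over 1..20; the best value found was 20
param_grid = {'max_depth': np.arange(1, 21, 1)}
rfr = RandomForestRegressor(n_estimators=850, random_state=42, n_jobs=-1)
rfr_cv = GridSearchCV(rfr, param_grid, cv=10)
rfr_cv.fit(x, y)
print('The best score is: ', rfr_cv.best_score_)
print('The best params are: ', rfr_cv.best_params_)

The code above gives max_depth=20. Since 20 is the upper edge of the search range, larger values may also be worth trying.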

Then tune min_samples_split. Reference code:

# Tune min_samples_split over 2..10; the best value found was 2
param_grid = {'min_samples_split': np.arange(2, 11, 1)}
rfr = RandomForestRegressor(n_estimators=850, max_depth=20, random_state=42, n_jobs=-1)
rfr_cv = GridSearchCV(rfr, param_grid, cv=10)
rfr_cv.fit(x, y)
print('The best score is: ', rfr_cv.best_score_)
print('The best params are: ', rfr_cv.best_params_)

The code above gives min_samples_split=2.

Finally, tune min_samples_leaf. Reference code:

# Tune min_samples_leaf over 1..5; the best value found was 2
# The other parameters are fixed at the values chosen in the previous steps
param_grid = {'min_samples_leaf': np.arange(1, 6, 1)}
rfr = RandomForestRegressor(n_estimators=850, max_depth=20, min_samples_split=2, random_state=42, n_jobs=-1)
rfr_cv = GridSearchCV(rfr, param_grid, cv=10)
rfr_cv.fit(x, y)
print('The best score is: ', rfr_cv.best_score_)
print('The best params are: ', rfr_cv.best_params_)

The code above gives min_samples_leaf=2, so the final parameters are:

n_estimators = 850, max_depth = 20, 
min_samples_split = 2, min_samples_leaf = 2

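Another approach is to search all parameters jointly with GridSearchCV instead of one at a time. It can find better combinations than the step-by-step procedure above, but it is very time-consuming. Reference code:

# Joint parameter search
def paramsTest(feature, label):
    x = feature.values
    y = label.values.ravel()  # flatten to 1-D, as scikit-learn expects

    rfr = RandomForestRegressor()
    # Every value must be a list of candidates; min_samples_split must be at least 2
    params = {'n_estimators': list(range(100, 2000, 10)),
              'max_depth': list(range(1, 21)),
              'min_samples_split': list(range(2, 11)),
              'min_samples_leaf': list(range(1, 6)),
              # this criterion was named 'mse' before scikit-learn 1.0
              'criterion': ['squared_error']
              }
    rfr_cv = GridSearchCV(estimator=rfr, param_grid=params, cv=10, n_jobs=-1)

    # Fit on the training set; the grid is very large, so this takes a long time
    rfr_cv.fit(x, y)
    print('The best params are:', rfr_cv.best_params_)

    return rfr_cv.best_params_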
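
When an exhaustive grid is too slow, one commonly used compromise is scikit-learn's `RandomizedSearchCV`, which evaluates only a fixed number of randomly sampled combinations. This is a swapped-in technique and not what the article ran. The data is synthetic and the ranges and `n_iter` budget below are assumptions kept small for speed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real features and prices
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

param_distributions = {
    'n_estimators': list(range(10, 60, 10)),
    'max_depth': list(range(2, 11)),
    'min_samples_split': list(range(2, 11)),
    'min_samples_leaf': list(range(1, 6)),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,         # number of sampled combinations (assumed budget)
    cv=3,
    random_state=42,  # makes the sampled combinations reproducible
)
search.fit(X, y)
print('Best params:', search.best_params_)
```

The cost is controlled directly by `n_iter` times the number of folds, independent of how large the full grid would be.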

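At this point the model is built and we can predict on the test set. If the model does not perform well, consider switching to a different model.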
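
The final step, fitting with the chosen parameters and predicting the test prices, might look like the sketch below. `fit_and_predict` is a hypothetical helper, the data is synthetic, and n_estimators is set to 20 instead of the tuned 850 only so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_and_predict(train_features, train_price, test_features, **rf_params):
    """Fit a random forest on the training data and predict test prices."""
    model = RandomForestRegressor(random_state=42, n_jobs=-1, **rf_params)
    model.fit(train_features.values, np.ravel(train_price))
    return pd.DataFrame({'销售价格': model.predict(test_features.values)})

# Tiny synthetic stand-in; with the real data, pass the selected feature columns
rng = np.random.default_rng(0)
train_X = pd.DataFrame(rng.uniform(500, 4000, size=(100, 2)),
                       columns=['房屋面积', '建筑面积'])
train_y = 200 * train_X['房屋面积'] + rng.normal(scale=1000, size=100)
test_X = train_X.head(5)

pred = fit_and_predict(train_X, train_y, test_X,
                       n_estimators=20, max_depth=20,
                       min_samples_split=2, min_samples_leaf=2)
print(pred)
```

The predictions could then be saved with `pred.to_csv(...)`. The required submission format is not described in the source, so the file name and layout are left open here.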

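III. Summary

Throughout the tuning process, n_estimators should be considered first, followed by max_depth. The max_depth results show which side of the complexity versus generalization-error curve the model sits on, which in turn tells us which parameters to adjust next and in which direction.
Grid search is worth trying: automated tuning usually gives better results than manual tuning, but it is very time-consuming. If time allows, run it and see how large the gap to manual tuning is.
This article only offers a rough approach. The submitted result was not very good and the random forest did not perform well here, so a neural network or other ensemble methods may be worth trying.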
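
One way to judge which side of the complexity curve a model sits on is to compare training and cross-validation scores as max_depth grows. The sketch below does this with `cross_validate` on synthetic data; the depths tried and the data set are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real features and prices
X, y = make_regression(n_samples=150, n_features=5, noise=20.0, random_state=0)

results = {}
for depth in (1, 3, 10):
    cv = cross_validate(
        RandomForestRegressor(n_estimators=30, max_depth=depth, random_state=42),
        X, y, cv=3, return_train_score=True,
    )
    # Store (mean training R^2, mean cross-validation R^2) for this depth
    results[depth] = (cv['train_score'].mean(), cv['test_score'].mean())
    print(f'max_depth={depth}: train R^2={results[depth][0]:.3f}, '
          f'CV R^2={results[depth][1]:.3f}')
```

If both scores are low, the model is too simple and more depth should help. If the training score is high while the cross-validation score lags well behind, the model is overfitting, and parameters such as min_samples_leaf or min_samples_split can be raised to restrain it.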