📊 Boston House Price Prediction (Machine Learning, Beginner Edition)


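A machine learning project that uses the Boston housing dataset to predict and analyze house prices.

I have only just started learning, so experts, please treat this as light entertainment! (The free resource link is at the end.)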

🔬 Tech Stack

  • numpy, pandas, matplotlib (the classic trio)
  • seaborn (simple plotting tool)
  • scikit-learn (the main focus ✅)

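🏳️‍🌈 Project Background

I am learning machine learning. The first topics I met were linear regression and polynomial regression, so I chose the classic Boston house price prediction task as a practice project.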

🪢 Project Structure

boston_houseprice/
├── housing.csv			# dataset
├── img/				# project images
├── Forecasts.ipynb		# main analysis code
└── README.md			# project description

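🎯 Project Description

This project uses a linear regression model (including a gradient descent implementation) and a polynomial regression model, and validates them with several evaluation methods: RMSE, R2, and cross-validation.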
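As a quick reference for the two metrics named above, here is a minimal sketch that computes RMSE and R2 by hand and checks them against scikit-learn. The numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values, invented for illustration (units: $1000s, like MEDV).
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 20.0, 33.0, 35.0, 34.0])

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Both should agree with scikit-learn's implementations.
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(f"RMSE = {rmse:.3f}, R2 = {r2:.3f}")
```

RMSE is in the same units as the target, so lower is better; R2 is unitless, and values closer to 1 mean more of the target's variance is explained.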

img.png

🔢 Features

(The code below is for illustration only; get the resource for the complete version.)

  • Exploratory data analysis
# Correlation of every feature with the house price (MEDV)
correlation = data.corr()['MEDV'].sort_values(ascending=False)
correlation
# Output: correlations
MEDV       1.000000
RM         0.695360
ZN         0.360445
B          0.333461
DIS        0.249929
CHAS       0.175260
AGE       -0.376955
RAD       -0.381626
CRIM      -0.388305
NOX       -0.427321
TAX       -0.468536
INDUS     -0.483725
PTRATIO   -0.507787
LSTAT     -0.737663
Name: MEDV, dtype: float64
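A correlation series like the one above can also drive feature selection by column label rather than by position. The sketch below is only an illustration on a made-up DataFrame: the column names mirror the dataset, but the values, the coefficients, and the variable names `toy` and `selected` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up numbers; only the column names mirror the dataset.
rng = np.random.default_rng(15)
rm = rng.normal(6.3, 0.7, size=300)
lstat = rng.normal(12.0, 7.0, size=300)
chas = rng.integers(0, 2, size=300)
medv = 9.0 * rm - 0.9 * lstat - 25.0 + rng.normal(0, 3.0, size=300)
toy = pd.DataFrame({"RM": rm, "LSTAT": lstat, "CHAS": chas, "MEDV": medv})

# Correlation of each column with the target, strongest positive first.
corr = toy.corr()["MEDV"].sort_values(ascending=False)

# Keep features whose absolute correlation reaches the threshold,
# dropping the target itself from the list.
selected = corr[corr.abs() >= 0.5].index.drop("MEDV")

X = toy[selected].to_numpy()
y = toy[["MEDV"]].to_numpy()
print(list(selected), X.shape, y.shape)
```

Selecting by label keeps the feature matrix and the column names in sync even if the column order of the CSV changes.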
  • Feature engineering
# Shuffle the rows
np.random.seed(15)
data_shuffle = np.random.permutation(data)
data_shuffle
# Select the highly correlated features
# (`names` is the list of column names, defined earlier in the full notebook)
high_corr = correlation[correlation.abs() >= 0.5].index
m = len(data_shuffle)
# Column positions of the selected features; [:-1] drops the target MEDV
index = [i for i in range(len(names)) if names[i] in high_corr][:-1]
high_corr_data = data_shuffle[:, index]
target = data_shuffle[:, -1].reshape(m, 1)
high_corr_data[:10]
# Standardize the data (speeds up model convergence)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
high_corr_data_scaler = scaler.fit_transform(high_corr_data)
high_corr_data_scaler[:10]
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(high_corr_data_scaler, target, test_size=0.2, random_state=15)
m1 = len(X_train)
m2 = len(X_test)
m1
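The snippet above standardizes the whole dataset before splitting, so the test rows contribute to the scaler's mean and variance. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the feature means and scales are invented stand-ins, not values from housing.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three selected features and the target.
rng = np.random.default_rng(15)
X = rng.normal(loc=[6.0, 18.0, 12.0], scale=[0.7, 2.0, 7.0], size=(200, 3))
y = rng.normal(loc=22.0, scale=9.0, size=(200, 1))

# Split first, so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=15
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # zero up to floating-point error
print(X_test_scaled.mean(axis=0))   # near zero, but not exactly
```

On a dataset this small the difference in scores is usually minor, but this order keeps the test set a fair estimate of performance on unseen data.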
  • Comparing several regression models (linear regression / polynomial regression)
# Train a linear regression model (using the library estimator directly)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# Train a linear regression model (gradient descent)
# The dataset is small, so try batch gradient descent first
n_iterations = 10000
alpha = 0.01
theta = np.zeros((4, 1))  # 3 features + 1 bias term
X_train_bias = np.c_[np.ones((m1, 1)), X_train]

for _ in range(n_iterations):
    gradient = (1/m1) * X_train_bias.T @ (X_train_bias @ theta - y_train)
    theta -= alpha * gradient
theta
# Train a polynomial regression model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
pipeline = Pipeline([
    # With include_bias=True a column of ones is added automatically, which is useful for gradient descent
    ('PolynomialFeatures', PolynomialFeatures(degree=3, include_bias=False)), 
    ('StandardScaler', StandardScaler()), 
    ('LinearRegression', LinearRegression())
])
# X_train3 / y_train3 are defined in the full notebook
pipeline.fit(X_train3, y_train3)
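One way to sanity-check a hand-written gradient descent loop like the one above is to compare its result with the closed-form least-squares solution. The following is a minimal sketch on synthetic data; the coefficients in `true_theta` are made up, and `np.linalg.lstsq` stands in for whatever reference solver one prefers.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))  # roughly standardized features
true_theta = np.array([[22.0], [3.0], [-2.0], [-4.0]])  # made-up coefficients
X_bias = np.c_[np.ones((m, 1)), X]
y = X_bias @ true_theta + rng.normal(scale=0.5, size=(m, 1))

# Batch gradient descent on the cost (1 / 2m) * ||X_bias @ theta - y||^2.
alpha, n_iterations = 0.01, 10000
theta = np.zeros((n + 1, 1))
for _ in range(n_iterations):
    gradient = (1 / m) * X_bias.T @ (X_bias @ theta - y)
    theta -= alpha * gradient

# Closed-form least-squares solution for comparison.
theta_exact, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

# Largest coefficient difference between the two solutions.
print(np.abs(theta - theta_exact).max())
```

If the two solutions differ noticeably, the usual suspects are a learning rate that is too large or too small, too few iterations, or unscaled features.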
  • Model evaluation and visualization
# Predict on the test set
y_predict1 = model.predict(X_test)
y_predict2 = np.c_[np.ones((m2, 1)), X_test] @ theta
# RMSE
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_predict1))
np.sqrt(mean_squared_error(y_test, y_predict2))
# Normalize the RMSE (4.62) by the range of the target
4.62 / (target.max() - target.min())
# R2
from sklearn.metrics import r2_score
r2_score(y_test, y_predict1)
r2_score(y_test, y_predict2)
# Use cross-validation to check for overfitting
# Pipeline helper function
from sklearn.linear_model import Ridge

def polynomial_regression(train_data, target_data, val_data):
    pipeline_val = Pipeline([
        ('PolynomialFeatures', PolynomialFeatures(degree=12, include_bias=False)), 
        ('StandardScaler', StandardScaler()), 
        ('Ridge', Ridge())
    ])
    pipeline_val.fit(train_data, target_data)
    return pipeline_val.predict(val_data)

# Cross-validation loop
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=15)

rmse_result = []
r2_result = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train_new2, y_train_new2)):
    X_fold_train, y_fold_train = X_train_new2[train_idx], y_train_new2[train_idx]
    X_val, y_val = X_train_new2[val_idx], y_train_new2[val_idx]
    y_val_predict = polynomial_regression(X_fold_train, y_fold_train, X_val)
    rmse_result.append(np.sqrt(mean_squared_error(y_val, y_val_predict)))
    r2_result.append(r2_score(y_val, y_val_predict))

# Evaluation results (mean over the 10 folds)
sum(rmse_result) / 10
# Output: np.float64(5.635728794857327)
sum(r2_result) / 10
# Output: 0.46012499964060816

# RMSE goes up and R2 goes down: overfitting
# After tuning the parameters, degree = 3 works best: RMSE = 4.49, R2 = 0.75
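The degree comparison described in the comments above can also be written more compactly with `cross_val_score`. This is only a sketch on synthetic data (a quadratic signal plus noise, invented for illustration), not the notebook's actual tuning code; the candidate degrees are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: a quadratic signal plus noise (invented for illustration).
rng = np.random.default_rng(15)
X = rng.uniform(-2, 2, size=(150, 2))
y = 1.5 * X[:, 0] ** 2 - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

kf = KFold(n_splits=10, shuffle=True, random_state=15)
cv_rmse = {}
for degree in (1, 2, 3, 6):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    # scikit-learn maximizes scores, so RMSE is exposed with a negative sign.
    scores = cross_val_score(
        model, X, y, cv=kf, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[degree] = -scores.mean()

for degree, rmse in cv_rmse.items():
    print(f"degree={degree}: mean CV RMSE = {rmse:.3f}")
```

Because the scaler and the polynomial expansion live inside the pipeline, they are refit on each training fold, so no validation rows leak into preprocessing.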
  • House price prediction results
# Linear regression prediction using only three features
new_data = scaler.transform([[7, 10, 5]])
predicted_price = model.predict(new_data)
print(f"Predicted MEDV: ${predicted_price[0, 0] * 1000:.2f}")
# Predicted MEDV: $38298.70
new_data = scaler.transform([[4, 20, 30]])
predicted_price = model.predict(new_data)
print(f"Predicted MEDV: ${predicted_price[0, 0] * 1000:.2f}")
# Predicted MEDV: $1089.19

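🍎 Conclusions

Predicting Boston house prices shows that five features, RM, TAX, INDUS, PTRATIO, and LSTAT, have a noticeable relationship with the price: each has an absolute correlation with MEDV above 0.45.

The polynomial regression model adopted in the end fits best, with RMSE = 7% and R2 = 85.5%.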

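🛹 Sample Results


Scatter plot of number of rooms vs. house price

# Plot the relationship between number of rooms and house price
import matplotlib.pyplot as plt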
plt.plot(data['RM'], data['MEDV'], 'b.')
plt.xlabel('RM')
plt.ylabel('MEDV')
plt.title('RM vs. MEDV')

RM_vs_MEDV.png


Scatter matrix of the features with correlation above 0.45 vs. house price

# Inspect the linear relationships
import seaborn as sns
data_index = high_corr2.drop('MEDV')
matrix = sns.pairplot(data[high_corr2], x_vars=data_index, y_vars=['MEDV'], plot_kws={'alpha': 0.6})
plt.suptitle('Scatter matrix', y=1.04)

Scatter_matrix.png


Scatter plot of true vs. predicted values

# Visualize the predictions
plt.plot(y_test, y_predict1, 'g.')
plt.plot([0, 50], [0, 50], 'r--')
plt.xlabel('True Prices')
plt.ylabel('Predicted Prices')
plt.title('True vs Predicted Prices')

True_vs_Predicted_Prices.png


Residual scatter plots: linear regression vs. polynomial regression

# Analyze where the errors come from
plt.figure(figsize=(14, 6))
# Linear
plt.subplot(121)
residuals1 = y_test2 - y_predict1_add
plt.plot(y_predict1_add, residuals1, 'b.')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Linear')
plt.axis((-5, 50, -15, 15))
# Polynomial
plt.subplot(122)
residuals2 = y_test3 - y_predict3
plt.plot(y_predict3, residuals2, 'b.')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.title('Polynomial')
plt.axis((-5, 50, -15, 15))

Linear_vs_Polynomial.png


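🚨 Shortcomings

  • I am still inexperienced at building models, and the code is fairly rough
  • My reasoning may not have covered everything
  • There are few visualizations, which may hinder observation and judgment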

❤️‍🔥 Free Resource Link

Thank you all for reading!!!