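📊 Boston House Price Prediction (machine learning, beginner edition)
A machine learning project that predicts house prices using the Boston housing dataset.
I have only just started learning, so experts, please treat this as light reading! (The free resource link is at the end.)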
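🔬 Tech stack
- numpy, pandas, matplotlib (the classic trio)
- seaborn (a simple plotting tool)
- scikit-learn (the main focus ✅)
🏳️‍🌈 Project background
I am learning machine learning. The first models I met were linear regression and polynomial regression, so I chose the classic Boston house price prediction as a practice project.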
🪢 Project structure
boston_houseprice/
├── housing.csv # dataset
├── img/ # project images
├── Forecasts.ipynb # main analysis code
└── README.md # project description
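🎯 Project description
The project uses a linear regression model (including a gradient descent version) and a polynomial regression model, and validates them with several evaluation methods: RMSE, R2, and cross-validation.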
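🔢 Features
(The code here is only an excerpt; get the resources for the complete version.)
- Exploratory data analysis
# Correlation of every feature with the house price (MEDV)
# data is the DataFrame holding the dataset (loaded in the full notebook)
correlation = data.corr()['MEDV'].sort_values(ascending=False)
correlation
# Output: correlation with MEDV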
MEDV 1.000000
RM 0.695360
ZN 0.360445
B 0.333461
DIS 0.249929
CHAS 0.175260
AGE -0.376955
RAD -0.381626
CRIM -0.388305
NOX -0.427321
TAX -0.468536
INDUS -0.483725
PTRATIO -0.507787
LSTAT -0.737663
Name: MEDV, dtype: float64
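To make the `corr()` call above more concrete, here is a minimal sketch on synthetic data. The column names (`up`, `down`, `noise`, `target`) and the data itself are made up for illustration and are not the Boston dataset. It ranks columns by their correlation with a target and recomputes one Pearson coefficient by hand.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Boston dataset); all column names are hypothetical
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "up": x,                                              # moves with the target
    "down": -x + rng.normal(scale=0.5, size=200),         # moves against the target
    "noise": rng.normal(size=200),                        # unrelated to the target
    "target": 2 * x + rng.normal(scale=0.5, size=200),
})

# Same pattern as the README: correlation of every column with the target, sorted
corr_with_target = toy.corr()["target"].sort_values(ascending=False)

# Pearson r for one column computed by hand, to show what corr() returns
xc = toy["up"] - toy["up"].mean()
yc = toy["target"] - toy["target"].mean()
manual_r = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(corr_with_target)
print("manual r for 'up':", manual_r)
```

A coefficient near +1 or -1 suggests a strong linear relationship, while a value near 0 only rules out a linear one; a feature can still matter in a non-linear way.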
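- Feature engineering
# Shuffle the rows (np.random.permutation returns a NumPy array)
import numpy as np
np.random.seed(15)
data_shuffle = np.random.permutation(data)
data_shuffle
# Select the highly correlated features (|r| >= 0.5: RM, PTRATIO, LSTAT, plus MEDV itself)
high_corr = correlation[correlation.abs() >= 0.5].index
m = len(data_shuffle)
# names is the list of column names in file order (defined in the full notebook);
# [:-1] drops the last match, MEDV, which is the target rather than a feature
index = [i for i in range(len(names)) if names[i] in high_corr][:-1]
high_corr_data = data_shuffle[:, index]
target = data_shuffle[:, -1].reshape(m, 1)
high_corr_data[:10]
# Standardize the data (speeds up model convergence)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
high_corr_data_scaler = scaler.fit_transform(high_corr_data)
high_corr_data_scaler[:10]
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(high_corr_data_scaler, target, test_size=0.2, random_state=15)
m1 = len(X_train)
m2 = len(X_test)
m1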
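One thing worth noting about the steps above: the scaler is fitted on the whole dataset before the split, so the test rows influence the mean and standard deviation used for scaling. A commonly recommended alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the arrays and the names `X_tr`, `X_te`, and `scaler_tr` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (NOT the Boston dataset)
rng = np.random.default_rng(15)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]]) + rng.normal(size=(100, 1))

# 1) Split first
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=15)

# 2) Fit the scaler on the training rows only
scaler_tr = StandardScaler().fit(X_tr)

# 3) Apply the same training statistics to both splits
X_tr_s = scaler_tr.transform(X_tr)
X_te_s = scaler_tr.transform(X_te)

print(X_tr_s.mean(axis=0))  # close to 0 by construction
print(X_te_s.mean(axis=0))  # usually close to 0, but not exactly
```

On a dataset this small the difference in scores is likely to be minor, but this ordering keeps the test set fully unseen.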
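- Comparison of several regression models (linear / polynomial)
# Train a linear regression model (using the library estimator directly)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# Train a linear regression model (gradient descent)
# The dataset is small, so try batch gradient descent first
n_iterations = 10000
alpha = 0.01  # learning rate
theta = np.zeros((4, 1))  # bias term + 3 feature weights
X_train_bias = np.c_[np.ones((m1, 1)), X_train]
for _ in range(n_iterations):
    gradient = (1/m1) * X_train_bias.T @ (X_train_bias @ theta - y_train)
    theta -= alpha * gradient
theta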
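As a sanity check on the gradient descent loop, the sketch below runs the same update rule on synthetic data and compares the result with the closed-form least-squares solution from `np.linalg.lstsq`. The data and the "true" coefficients are made up; only the update rule mirrors the excerpt above.

```python
import numpy as np

# Synthetic, roughly standardized features (NOT the Boston dataset)
rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
true_theta = np.array([[22.0], [3.0], [-2.0], [-4.0]])  # hypothetical coefficients
X_bias = np.c_[np.ones((m, 1)), X]                      # prepend the bias column
y = X_bias @ true_theta + rng.normal(scale=0.5, size=(m, 1))

# Batch gradient descent with the same update rule as the excerpt
theta_gd = np.zeros((n + 1, 1))
alpha, n_iterations = 0.01, 10000
for _ in range(n_iterations):
    gradient = (1 / m) * X_bias.T @ (X_bias @ theta_gd - y)
    theta_gd -= alpha * gradient

# Closed-form least-squares solution for comparison
theta_exact, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

max_gap = np.abs(theta_gd - theta_exact).max()
print("largest coefficient difference:", max_gap)
```

If the two sets of coefficients agree closely, the learning rate and iteration count are sufficient; a large gap usually points to too few iterations or unscaled features.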
# Train a polynomial regression model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
pipeline = Pipeline([
    # include_bias=True would add a column of ones automatically, which is handy for gradient descent
    ('PolynomialFeatures', PolynomialFeatures(degree=3, include_bias=False)),
    ('StandardScaler', StandardScaler()),
    ('LinearRegression', LinearRegression())
])
# X_train3 and y_train3 are the training split prepared for this model in the full notebook
pipeline.fit(X_train3, y_train3)
- Model evaluation and visualization
# Predict on the test set
y_predict1 = model.predict(X_test)  # scikit-learn linear regression
y_predict2 = np.c_[np.ones((m2, 1)), X_test] @ theta  # gradient descent version
# RMSE
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_predict1))
np.sqrt(mean_squared_error(y_test, y_predict2))
# RMSE as a fraction of the target range (4.62 is the RMSE value plugged in by hand)
4.62 / (target.max() - target.min())
# R2
from sklearn.metrics import r2_score
r2_score(y_test, y_predict1)
r2_score(y_test, y_predict2)
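For readers who want to see what these metrics compute, the sketch below evaluates RMSE, R2, and the range-normalized RMSE by hand on a few made-up numbers and compares them with the scikit-learn functions. The values in `y_true` and `y_pred` are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted values (NOT results from the notebook)
y_true = np.array([[24.0], [21.6], [34.7], [33.4], [36.2]])
y_pred = np.array([[25.0], [20.0], [33.0], [35.0], [30.0]])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The same quantities via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

# RMSE relative to the target range, as in the README's 4.62 / (max - min) line
nrmse = rmse_manual / (y_true.max() - y_true.min())

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
print(nrmse)
```

Dividing RMSE by the target range gives a unitless figure that is easier to compare across datasets, at the cost of being sensitive to outliers in the target.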
# Use cross-validation to check for overfitting
# Wrap the pipeline in a function
from sklearn.linear_model import Ridge
def polynomial_regression(train_data, target_data, val_data):
    pipeline_val = Pipeline([
        ('PolynomialFeatures', PolynomialFeatures(degree=12, include_bias=False)),
        ('StandardScaler', StandardScaler()),
        ('LinearRegression', Ridge())  # note: the final step is Ridge, despite the step name
    ])
    pipeline_val.fit(train_data, target_data)
    return pipeline_val.predict(val_data)
# Cross-validation loop
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=15)
mse_result = []  # holds the per-fold RMSE values
r2_result = []
# X_train_new2 and y_train_new2 are training arrays prepared in the full notebook
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train_new2, y_train_new2)):
    X_fold_train, y_fold_train = X_train_new2[train_idx], y_train_new2[train_idx]
    X_val, y_val = X_train_new2[val_idx], y_train_new2[val_idx]
    y_val_predict = polynomial_regression(X_fold_train, y_fold_train, X_val)
    mse_result.append(np.sqrt(mean_squared_error(y_val, y_val_predict)))
    r2_result.append(r2_score(y_val, y_val_predict))
# Evaluation results (averages over the 10 folds)
sum(mse_result) / 10
# Output: np.float64(5.635728794857327)
sum(r2_result) / 10
# Output: 0.46012499964060816
# RMSE goes up and R2 goes down: overfitting
# After tuning, degree = 3 works best: RMSE = 4.49, R2 = 0.75
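The degree tuning mentioned in the last comment can also be written as a loop. The sketch below is one possible way to do it, not the notebook's own code: it uses `cross_val_score` with the `neg_root_mean_squared_error` scorer on synthetic data with a known quadratic relationship, and the candidate degrees are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic relationship (NOT the Boston dataset)
rng = np.random.default_rng(15)
X = rng.uniform(-2, 2, size=(150, 2))
y = 1.5 * X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=150)

kf = KFold(n_splits=10, shuffle=True, random_state=15)

cv_rmse = {}
for degree in (1, 2, 3, 6):  # arbitrary candidate degrees
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        Ridge(),
    )
    # The scorer returns negative RMSE, so flip the sign before averaging
    scores = cross_val_score(model, X, y, cv=kf, scoring="neg_root_mean_squared_error")
    cv_rmse[degree] = -scores.mean()

best_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print("degree with the lowest mean RMSE:", best_degree)
```

Because the whole pipeline is passed to `cross_val_score`, the scaler and the polynomial expansion are refitted inside every fold, which matches what the hand-written loop above does.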
- House price prediction demo
# Linear regression trained on only three features
# MEDV is in units of $1000, hence the * 1000
new_data = scaler.transform([[7, 10, 5]])
predicted_price = model.predict(new_data)
print(f"Predicted MEDV: ${predicted_price[0, 0] * 1000:.2f}")
# Predicted MEDV: $38298.70
new_data = scaler.transform([[4, 20, 30]])
predicted_price = model.predict(new_data)
print(f"Predicted MEDV: ${predicted_price[0, 0] * 1000:.2f}")
# Predicted MEDV: $1089.19
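🍎 Conclusions
From the Boston house price prediction, the five features RM, TAX, INDUS, PTRATIO, and LSTAT turn out to have a noticeable influence on house prices.
The polynomial regression model adopted in the end fits best, with RMSE = 7% and R2 = 85.5%.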
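🛹 Sample results
Scatter plot of room count vs. house price
# Plot the relationship between room count and house price
import matplotlib.pyplot as plt
plt.plot(data['RM'], data['MEDV'], 'b.')
plt.xlabel('RM')
plt.ylabel('MEDV')
plt.title('RM vs. MEDV')
Scatter matrix of features with |correlation| above 0.45 vs. house price
# Inspect the linear relationships
import seaborn as sns
# high_corr2 is the index of the selected columns, including MEDV (built in the full notebook)
data_index = high_corr2.drop('MEDV')
matrix = sns.pairplot(data[high_corr2], x_vars=data_index, y_vars=['MEDV'], plot_kws={'alpha': 0.6})
plt.suptitle('Scatter matrix', y=1.04)
Scatter plot of true vs. predicted values
# Visualize the predictions
plt.plot(y_test, y_predict1, 'g.')
plt.plot([0, 50], [0, 50], 'r--')  # reference line: perfect prediction
plt.xlabel('True Prices')
plt.ylabel('Predicted Prices')
plt.title('True vs Predicted Prices')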
Residual scatter plots: linear vs. polynomial regression
# Analyze where the errors come from
# y_test2, y_predict1_add, y_test3, and y_predict3 come from the full notebook
plt.figure(figsize=(14, 6))
# Linear
plt.subplot(121)
residuals1 = y_test2 - y_predict1_add
plt.plot(y_predict1_add, residuals1, 'b.')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Linear')
plt.axis((-5, 50, -15, 15))
# Polynomial
plt.subplot(122)
residuals2 = y_test3 - y_predict3
plt.plot(y_predict3, residuals2, 'b.')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.title('Polynomial')
plt.axis((-5, 50, -15, 15))
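🚨 Shortcomings
- I am still inexperienced at building models, and the code is rather rough
- The analysis may not have considered everything
- There are few visualizations, which may limit observation and judgment
❤️‍🔥 Free resource link
Thank you all for reading!!!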