【机器学习与实战】回归分析与预测：线性回归-05-多元线性回归与实战【机器学习与实战】回归分析与预测：线性回归-05-多

【机器学习与实战】回归分析与预测：线性回归-05-多元线性回归与实战

配套视频教程：www.bilibili.com/video/BV1Rx…

一、什么是多元线性回归

在 advertising.csv 中，是存在多个广告渠道投放数据的，一元线性回归主要根据热力图分析了wechat对销售额的影响，得出了y=0.663x+0.173的函数，但是并没有考虑到webo, others两个广告投放对销售额的影响，一旦在线性回归中，只要自变量超过1个，就称为多元线性回归，此时，我们的求解参数就会变成多个。

而由于 $\large w_1x_1 + w_2x_2 + w_3x_3 + …$ 可以视为一个矩阵乘，所以，多元线性回归也可以表示为：

$\huge y = W^TX+b$

此处由于要进行矩阵运算，所以务必要注意张量的维度，比如在 advertising.csv 中，W是各个参数，而为了与X相乘，必须要对其进行转置。而y’的值是由三个广告渠道来共同决定，而每一条数据对应一个y’，所以y’的形状应该为：(200, 1)。

其实，在上述公式中，还可以所b也看作权重 $w_0$ ，并引入 $x_0$ ，这样公式就可以直接表示为

$\huge y=w_0x_0 + w_1x_1 + w_2x_2 + w_3x_3 …$

引入 $x_0$ ，相当于给数据集添加一个新的哑特征，值为：1, b和这个哑特征相乘， $w_0x_0=b\times1=b$ ，值保持不变也就是说，直接将公式修改为： $y=W^TX$ 就够了。

注意，无论何种情况，都要先对训练数据和测试数据进行归一化处理。

二、构建数据集

由于是多元线性回归，所以需要保留所有字段，包括wechat, webo, others三列数据，再利用numpy的delete方法删除标签字段，构造X和Y的数据集。

import numpy as np
import pandas as pd
ads = pd.read_csv('./advertising.csv')
x = np.array(ads)
x = np.delete(x, [3], axis=1)   # 表示删除第4列数据，axis=1代表列，默认为行
y = np.array(ads.sales)         # 获取标签数据y
y = y.reshape(len(y), 1)        # 重新将Y向量转换为矩阵
# print(x, y)                   # 打印矩阵，确认x和y均为矩阵，（200, 3)和(200, 1)

多元线性回归的重点就是W和X点积运算的实现，因为W不再是一个标量，而是一个1D向量 $[w_0, w_1, w_2, w_3]$ ，此时假设函数h(x)在程序中的实现就变成了一个循环，类似如下代码：

# 一元线性回归
y_hat = w * x + b
# 多元线性回归
for i in N:
    y_hat = y_hat + weight[i] * X[i] + bias

所以，此时可以给X最前面加上一列，这个维度所有的数值全为1，是一个哑特征然后将偏置b看作 $w_0$ ，代码如下：

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x0_train = np.ones((len(x_train), 1))
# print(x0_train.shape)           # 形状为(160, 1)
x_train = np.append(x0_train, x_train, axis=1)  # 把X增加1列变成（160, 4)

三、损失函数和梯度下降

损失函数通过向量化来实现：

def loss_function(X, y, W):
    y_hat = X.dot(W)
    loss = y_hat - y  # 矩阵直接相减
    cost = np.sum(loss**2)/(2*len(X))
    return cost

梯度下降的公式仍然是：

$\huge w=w-\alpha\cdot 梯度= w-\frac{\alpha}{N}\Sigma_{i=1}^N((w\cdot x^i-y^i) \cdot x^i$

封装后的梯度下降函数：

def gradient_descent(X, y, W, lr, loop):
    losses = np.zeros(loop)
    weights = np.zeros((loop, len(W)))
    for i in range(loop):
        y_hat = X.dot(W.T)
        loss = y_hat.reshape((len(y_hat), 1)) - y
        d_W = X.T.dot(loss) / len(X)
        d_W = d_W.reshape(len(W))
        W = W - lr*d_W
        losses[i] = loss_function(X, y, W)
        weights[i] = W
    return losses, weights

注意上述函数与一元线性回归的区别，主要在于W的转置运算，以及由此衍生出来的其他转置运算。

四、完成训练代码

def train_draw(X, y, W, lr, loop):
    losses, weights = gradient_descent(X, y, W, lr, loop)
    print("训练最终损失: ", losses[-1])
    y_pred = X.dot(weights[-1])
    train_acc = 100 - np.mean(np.abs(y_pred-y)) * 100
    print("线性回归训练准确率: ", train_acc)
    # 损失函数图像
    # plt.plot(losses, 'g--')
    # plt.show()
    return losses, weights

五、测试代码

def test(x_test, y_test, W):
    x0_test = np.ones((len(x_test), 1))
    x_test = np.append(x0_test, x_test, axis=1)
    print(loss_function(x_test, y_test, W))

六、主调代码

if __name__ == '__main__':
    ads = pd.read_csv('./advertising.csv')
    x = np.array(ads)
    x = np.delete(x, [3], axis=1)   # 表示删除第4列数据，axis=1代表列，默认为行
    y = np.array(ads.sales)         # 获取标签数据y
    y = y.reshape(len(y), 1)        # 重新将Y向量转换为矩阵
    # print(x, y)                   # 打印矩阵，确认x和y均为矩阵，（200, 3)和(200, 1)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)
    # 归一化处理
    x_train, x_test = scaler(x_train, x_test)
    y_train, y_test = scaler(y_train, y_test)
    x0_train = np.ones((len(x_train), 1))
    # print(x0_train.shape)           # 形状为(160, 1)
    x_train = np.append(x0_train, x_train, axis=1)  # 把X增加1列变成（160, 4)
    W = np.array([0.5, 1, 1, 1])      # 等价于np.ones(4)
    print(loss_function(x_train, y_train, W))
    losses, weights = train_draw(x_train, y_train, W, 0.15, 300)
    print("W最终参数为：", weights[-1])
    test(x_test, y_test, weights[-1])

输出结果为：

1.5810075118523153
训练最终损失:  0.00514388426870273
线性回归训练准确率:  75.86100998924522
W最终参数为： [0.08183125 0.65514423 0.17562431 0.16692169]
0.005040333619873452

也就是说，最终拟合的函数为： $\large y=0.655x_1 + 0.176x_2 + 0.167x_3 + 0.082$

七、预测销售额

假设现决定投入250, 50, 50元的广告费，预测结果的代码如下：

x_plan = [250, 50, 50]
W = [0.08183125,0.65514423,0.17562431,0.16692169]
x_train, x_plan = scaler(x_train, x_plan)
x_plan = np.append([1], x_plan )
y_plan = np.dot(W, x_plan)
y_max = y_train.max(axis=0)
y_min = y_train.min(axis=0)
y_pred = y_plan * (y_max - y_min) + y_min
print(y_pred)

预测结果为：7.51505114

八、利用sklearn实现多元线性回归

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
ads = pd.read_csv("./advertising.csv")
x = ads.drop("sales", axis=1)
y = ads.sales
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
model = LinearRegression()
model.fit(x_train, y_train)
print("截距为：", model.intercept_)
print("系数为：", model.coef_)
y_pred = model.predict(x_test)
print("真值（测试集）为：", y_test)
print("预测（测试集）为：", y_pred)
print("预测评分：", model.score(x_test, y_test))
plt.scatter(x_train.wechat, y_train, color="brown")
plt.plot(x_test.wechat, y_pred, color="green", linewidth=1)
plt.xlabel('wechat')
plt.ylabel('sales')
plt.show()

九、加州房价预测

# 加州房价预测
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
house = pd.read_csv("./house.csv")      # 读取
print(house.shape)                       # 显示
x = house.drop("median_house_value", axis=1)    # 将房价一列去掉，作为因变量y的值, axis中：0为行，1为列
y = house.median_house_value            # 将median_house_value这一列赋值给y
# 以80%-20%的数据集比例拆分数据集，80%为训练数据，20%为测试数据
# random_state表示随机数种子，默认为None，表示每次的测试集和数据集都随机读取，一旦设置了值，则每次都一样
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# 选择LinearRegression线性回归作为训练模型，并调用fit方法进行训练，进行函数的拟合（即找到最优解）
model = LinearRegression()
model.fit(x_train, y_train)         # 拟合一个最优函数（y=ax+b的泛化函数的最优解：多元线性回归)
# 输出截距（即y=a1x1+a2x2+...a8x8+b中的b）
print("截距为：", model.intercept_)
# 输出系数（即y=a1x1+a2x2+...a8x8+b中的a），为什么是a8？
print("系数为：", model.coef_)
y_pred = model.predict(x_test)
print("房价的真值（测试集）为：", y_test)
print("预测的房价（测试集）为：", y_pred)
# 预测评分
print("预测评分：", model.score(x_test, y_test))            # 0.6321，拟合度勉强及格

利用matplotlib将散点图绘制出来（收入与预测值）

import matplotlib.pyplot as plt
# 散点图显示家庭收入中位数和房价中位数的分布
plt.scatter(x_test.median_income, y_test, color="brown")
# 画出回归函数（从特征到预测标签）
plt.plot(x_test.median_income, y_pred, color="green", linewidth=1)
plt.xlabel('Income')
plt.ylabel('Price')
plt.show()

可以看出，收入的增加与房价基本成正比。这一规律基本上拟合得比较明显。当然，也可以选择其他列来绘制散点图，以查看变化趋势，除了选择使用线性回归模型外，也可以考虑选择其他回归模型，进而找到拟合度更高的模型。