Exercise 5 - Bias and Variance in Regularized Linear Regression



I. Introduction

In this exercise we implement regularized linear regression and use it to study models with different bias-variance properties. As before, the exercise requires loading a dataset: "ex5data1.mat".

II. Regularized Linear Regression

In the first half of the exercise we implement regularized linear regression to predict the amount of water flowing out of a dam from the change in a reservoir's water level. In the second half we run some diagnostics to debug the learning algorithm and examine the effects of bias vs. variance.

1. Visualizing the dataset

The dataset contains historical records of the change in water level, x, and the amount of water flowing out of the dam, y. It is split into three parts:

  • Training set (X, y): used to train the model
  • Cross-validation set (Xval, yval): used to select the regularization parameter and guard against overfitting
  • Test set (Xtest, ytest): used to evaluate the final model; this data is never touched during training or validation
import numpy as np
import scipy.io as sio
import scipy.optimize as opt
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
def load_data():
    """for ex5
    d['X'] shape = (12, 1)
    pandas has trouble taking this 2d ndarray to construct a dataframe, so I ravel
    the results
    """
    d = sio.loadmat('ex5data1.mat')
    return map(np.ravel, [d['X'], d['y'], d['Xval'], d['yval'], d['Xtest'], d['ytest']])
# Training set, cross-validation set, and test set, in that order
X, y, Xval, yval, Xtest, ytest = load_data()
# Check the dimensions of each set
X.shape,y.shape, Xval.shape, yval.shape, Xtest.shape, ytest.shape
((12,), (12,), (21,), (21,), (21,), (21,))
# Put the training set into a pandas DataFrame
df = pd.DataFrame({'water_level':X, 'flow':y})

# Visualize the training set (pass column names as keyword args; `size` was renamed to `height` in recent seaborn)
sns.lmplot(x='water_level', y='flow', data=df, fit_reg=False, height=7)
plt.show()



(Figure: scatter plot of the training data, flow vs. water_level)

# Use a list comprehension to insert a column of ones as the first column of each X set
X, Xval, Xtest = [np.insert(x.reshape(x.shape[0], 1), 0, np.ones(x.shape[0]), axis=1) for x in (X, Xval, Xtest)]
# Check the dimensions after inserting the column of ones
X.shape,y.shape, Xval.shape, yval.shape, Xtest.shape, ytest.shape
((12, 2), (12,), (21, 2), (21,), (21, 2), (21,))

2. Cost function

Recall that the cost function for linear regression is:

$$J(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$$

def cost(theta, X, y):
    """Unregularized linear regression cost.
    theta: (n+1,), X: (m, n+1) with the bias column, y: (m,)
    """
    m = X.shape[0]

    inner = X @ theta - y  # residual vector, shape (m,)

    # For 1-D arrays numpy's transpose is a no-op, so inner.T @ inner is just
    # the inner product of the residuals with themselves (sum of squared errors)
    square_sum = inner.T @ inner
    cost = square_sum / (2 * m)

    return cost
theta = np.ones(X.shape[1])
cost(theta, X, y)
303.9515255535976

3. Gradient

The gradient, and the gradient with regularization:

$$\frac{\partial J(\theta)}{\partial \theta_{0}}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \qquad \text{for } j=0$$

$$\frac{\partial J(\theta)}{\partial \theta_{j}}=\left(\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}\right)+\frac{\lambda}{m} \theta_{j} \qquad \text{for } j \geq 1$$

# Unregularized gradient
def gradient(theta, X, y):
    m = X.shape[0]

    inner = X.T @ (X @ theta - y)  # (n+1, m) @ (m,) -> (n+1,)

    return inner / m
# Regularized gradient. Note: the parameter is named `learningrate` in this notebook,
# but it is really the regularization parameter lambda (default 1).
def regularized_gradient(theta, X, y, learningrate=1):
    m = X.shape[0]
    # The regularization term has the same shape as theta
    regularized_term = theta.copy()
    # Do not regularize the intercept term (the inserted bias column)
    regularized_term[0] = 0

    regularized_term = (learningrate / m) * regularized_term
    # Returns a 1-D array with the same shape as theta
    return gradient(theta, X, y) + regularized_term
regularized_gradient(theta, X, y)
array([-15.30301567, 598.25074417])
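As a quick sanity check (a sketch added here, not part of the original exercise), the analytic gradient can be compared against a central-difference approximation of the unregularized cost, reusing the `cost`, `gradient`, `X`, `y` and `theta` defined above:

# Sanity check (illustrative only): finite-difference approximation of the gradient
def numerical_gradient(f, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

np.allclose(gradient(theta, X, y),
            numerical_gradient(lambda t: cost(t, X, y), theta))
# Should evaluate to True if the analytic gradient is correct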

4. Fitting the data

Once the cost function and gradient are correct, the next step is to find the optimal value of θ, i.e. to train the model. In this part the regularization parameter λ is set to 0: since this linear model only fits a 2-dimensional θ, regularization is of little help. Note that the training routine below calls regularized_cost, which is defined in the next subsection, so make sure that function is defined before running the training code.

def linear_regression_np(X, y, learningrate):
    """Linear regression
    args:
        X: feature matrix, (m, n+1) # with intercept x0=1
        y: target vector, (m, )
        learningrate: the regularization parameter lambda

    return: trained parameters
    """
    # Initialize theta (one entry per column of X, including the bias column)
    theta = np.ones(X.shape[1])

    # Train with scipy's TNC optimizer
    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, learningrate),
                       method='TNC',
                       jac=regularized_gradient,
                       options={'disp': True})
    return res

5. Regularized cost function

$$J(\theta)=\frac{1}{2 m}\left(\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}\right)+\frac{\lambda}{2 m}\left(\sum_{j=1}^{n} \theta_{j}^{2}\right)$$

def regularized_cost(theta, X, y, learningrate=1):
    m = X.shape[0]
    # The intercept term theta[0] is not regularized, so sum over theta[1:] only
    regularized_term = (learningrate / (2 * m)) * np.power(theta[1:], 2).sum()

    return cost(theta, X, y) + regularized_term
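As a rough spot check (my own addition), with θ = [1, 1] and the default λ = 1 the penalty term is 1/(2·12) ≈ 0.042, so the regularized cost should be about 0.04 above the unregularized 303.95 computed earlier:

regularized_cost(theta, X, y)  # expected: roughly 303.99 = 303.95 + 1/(2*12)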
theta = np.ones(X.shape[1])

final_theta = linear_regression_np(X, y, learningrate=0).get('x')
# Intercept of the fitted line
b = final_theta[0]
# Slope of the fitted line
m = final_theta[1]

plt.scatter(X[:,1], y, label="Training data")
plt.plot(X[:, 1], X[:, 1]*m + b, label="Prediction")
plt.legend(loc=2)
plt.show()

(Figure: training data with the fitted straight line)

Next we plot the learning curve for the linear model. Keep the following in mind:

  1. Fit the model on subsets of the training set of increasing size
  2. Do not use regularization when computing the training cost and the cross-validation cost
  3. Compute the training cost on the same training subset that was used for fitting

training_cost, cv_cost = [], []
m = X.shape[0]
# Train on the first i examples of the training set and record the costs
for i in range(1, m+1):
    res = linear_regression_np(X[:i, :], y[:i], learningrate=0)

    tc = regularized_cost(res.x, X[:i, :], y[:i], learningrate=0)
    cv = regularized_cost(res.x, Xval, yval, learningrate=0)

    training_cost.append(tc)
    cv_cost.append(cv)
plt.plot(np.arange(1, m+1), training_cost, label='training cost')
plt.plot(np.arange(1, m+1), cv_cost, label='cv cost')
plt.legend(loc=1)
plt.show()

(Figure: learning curve of the linear model, training cost vs. cross-validation cost)

This model does not fit the data well: it is underfitting (high bias).

III. Creating Polynomial Features

The linear model above is too simple (a straight line) to fit the data well, so it underfits. In this part we address the problem by adding more features. With polynomial regression, the hypothesis takes the form:

$$h_{\theta}(x)=\theta_{0}+\theta_{1} \cdot(\text{waterLevel})+\theta_{2} \cdot(\text{waterLevel})^{2}+\cdots+\theta_{p} \cdot(\text{waterLevel})^{p}=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\cdots+\theta_{p} x_{p}$$

  • Note: in the expression above, each $x_i$ stands for the $i$-th power of the water level feature.

Besides changing the hypothesis, we also have to construct these extra features ourselves; they can be derived from the existing data by raising the water level to higher powers.

### Prepare the polynomial datasets
def prepare_poly_data(*args, power):
    """
    args: keep feeding in X, Xval, or Xtest
        will return in the same order
    """
    def prepare(x):
        # Add the polynomial features
        df = poly_features(x, power=power)

        # Normalize. Note: DataFrame.as_matrix() has been removed from pandas,
        # so use .values instead
        ndarr = normalize_feature(df).values

        # Insert the column of ones
        return np.insert(ndarr, 0, np.ones(ndarr.shape[0]), axis=1)

    return [prepare(x) for x in args]
# Build the polynomial features: f1 = x, f2 = x^2, ..., fp = x^p
def poly_features(x, power, as_ndarray=False):
    data = {'f{}'.format(i): np.power(x, i) for i in range(1, power + 1)}
    df = pd.DataFrame(data)

    return df.values if as_ndarray else df

X, y, Xval, yval, Xtest, ytest = load_data()
# The three features are the first, second and third powers of the water level
poly_features(X, power=3)
f1 f2 f3
0 -15.936758 253.980260 -4047.621971
1 -29.152979 849.896197 -24777.006175
2 36.189549 1309.683430 47396.852168
3 37.492187 1405.664111 52701.422173
4 -48.058829 2309.651088 -110999.127750
5 -8.941458 79.949670 -714.866612
6 15.307793 234.328523 3587.052500
7 -34.706266 1204.524887 -41804.560890
8 1.389154 1.929750 2.680720
9 -44.383760 1969.918139 -87432.373590
10 7.013502 49.189211 344.988637
11 22.762749 518.142738 11794.353058
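For reference (an aside not in the original exercise), scikit-learn's PolynomialFeatures produces the same power terms when there is a single input feature, so it can serve as a cross-check of poly_features (assuming sklearn is installed):

# Cross-check with scikit-learn (illustrative only)
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=3, include_bias=False)
pf.fit_transform(X.reshape(-1, 1))  # columns: water level, squared, cubed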

1. Preparing the regression data

  1. Expand the features up to the 8th power (or whatever degree you need)
  2. Use normalization to put each $x^n$ feature on a comparable scale
  3. Don't forget to insert the extra column of ones
# Feature normalization: z-score each column
def normalize_feature(df):
    """Applies function along input axis(default 0) of DataFrame."""
    return df.apply(lambda column: (column - column.mean()) / column.std())
X_poly, Xval_poly, Xtest_poly= prepare_poly_data(X, Xval, Xtest, power=8)
X_poly[:3, :]
array([[ 1.00000000e+00, -3.62140776e-01, -7.55086688e-01,
         1.82225876e-01, -7.06189908e-01,  3.06617917e-01,
        -5.90877673e-01,  3.44515797e-01, -5.08481165e-01],
       [ 1.00000000e+00, -8.03204845e-01,  1.25825266e-03,
        -2.47936991e-01, -3.27023420e-01,  9.33963187e-02,
        -4.35817606e-01,  2.55416116e-01, -4.48912493e-01],
       [ 1.00000000e+00,  1.37746700e+00,  5.84826715e-01,
         1.24976856e+00,  2.45311974e-01,  9.78359696e-01,
        -1.21556976e-02,  7.56568484e-01, -1.70352114e-01]])
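As a quick check (my own addition), the non-bias columns of X_poly should each have mean close to 0 and sample standard deviation close to 1, since normalize_feature uses pandas' std (ddof=1):

# Verify the normalization (illustrative only)
X_poly[:, 1:].mean(axis=0), X_poly[:, 1:].std(axis=0, ddof=1)
# means ~0, sample standard deviations ~1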

2. Plotting the learning curve

Since we do not use regularization for now, learningrate (i.e. λ) is given a default value of 0.

def plot_learning_curve(X, y, Xval, yval, learningrate=0):
    training_cost, cv_cost = [], []
    m = X.shape[0]

    for i in range(1, m + 1):
        # regularization applies here for fitting parameters
        res = linear_regression_np(X[:i, :], y[:i], learningrate=learningrate)

        # remember, when you compute the cost here, you are computing
        # non-regularized cost. Regularization is used to fit parameters only
        tc = cost(res.x, X[:i, :], y[:i])
        cv = cost(res.x, Xval, yval)

        training_cost.append(tc)
        cv_cost.append(cv)

    plt.plot(np.arange(1, m + 1), training_cost, label='training cost')
    plt.plot(np.arange(1, m + 1), cv_cost, label='cv cost')
    plt.legend(loc=1)
plot_learning_curve(X_poly, y, Xval_poly, yval, learningrate=0)
plt.show()

(Figure: learning curve of the polynomial model with λ = 0)

Notice how low the training cost is, essentially zero, which is too good to be true: the model is clearly overfitting.

3. Trying λ = 1

plot_learning_curve(X_poly, y, Xval_poly, yval, learningrate=1)
plt.show()

(Figure: learning curve of the polynomial model with λ = 1)

The training cost has increased somewhat and is no longer 0, which means the tendency to overfit has been reduced.

4. Trying λ = 100

plot_learning_curve(X_poly, y, Xval_poly, yval, learningrate=100)
plt.show()

(Figure: learning curve of the polynomial model with λ = 100)

The regularization term is now too large, so the model underfits.

IV. Finding the Best λ

# Candidate values for the regularization parameter λ
l_candidate = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
training_cost, cv_cost = [], []
# For each candidate, record the training error and the cross-validation error
for l in l_candidate:
    res = linear_regression_np(X_poly, y, l)
    
    tc = cost(res.x, X_poly, y)
    cv = cost(res.x, Xval_poly, yval)
    
    training_cost.append(tc)
    cv_cost.append(cv)
plt.plot(l_candidate, training_cost, label='training')
plt.plot(l_candidate, cv_cost, label='cross validation')
plt.legend(loc=2)

plt.xlabel('lambda')

plt.ylabel('cost')
plt.show()

(Figure: training and cross-validation cost as a function of λ)

# The best λ according to the cross-validation error
l_candidate[np.argmin(cv_cost)]
1
# Evaluate each candidate λ on the test set
for l in l_candidate:
    theta = linear_regression_np(X_poly, y, l).x
    print('test cost(l={}) = {}'.format(l, cost(theta, Xtest_poly, ytest)))
test cost(l=0) = 10.122298845834932
test cost(l=0.001) = 10.989357236615056
test cost(l=0.003) = 11.26731092609127
test cost(l=0.01) = 10.881623900868235
test cost(l=0.03) = 10.02232745596236
test cost(l=0.1) = 8.632062332318977
test cost(l=0.3) = 7.336513212074589
test cost(l=1) = 7.466265914249742
test cost(l=3) = 11.643931713037912
test cost(l=10) = 27.7150802906621

Judging by the test set, λ = 0.3 gives the smallest test cost, whereas cross-validation above selected λ = 1; both point to a moderate amount of regularization.
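As a final illustration (a sketch added here, not part of the original notebook), we can refit with the cross-validated λ = 1 and draw the resulting polynomial curve over the raw training data:

# Illustrative sketch: visualize the polynomial fit with lambda = 1
final_theta = linear_regression_np(X_poly, y, learningrate=1).x
order = np.argsort(X)  # sort by water level so the curve is drawn left to right
plt.scatter(X, y, label='Training data')
plt.plot(X[order], X_poly[order] @ final_theta, label='Polynomial fit (lambda=1)')
plt.legend(loc=2)
plt.xlabel('water_level')
plt.ylabel('flow')
plt.show()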