Deep Learning: Applying Regularization



Regularization for Linear Regression

For linear regression, we previously covered two algorithms:

  • Gradient descent
  • Normal equation

Regularized linear regression

$$\begin{aligned} &J(\theta)=\frac{1}{2 m}\left[\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\lambda \sum_{j=1}^{n} \theta_{j}^{2}\right] \\ &\min _{\theta} J(\theta) \end{aligned}$$

We want to find the $\theta$ that minimizes this cost function.
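As a concrete reference, here is a minimal Octave-style sketch of this regularized cost. The helper name `linearCost` is hypothetical, and it assumes `X` is the $m \times (n+1)$ design matrix with a leading column of ones, `y` the $m \times 1$ targets, and `theta` the $(n+1) \times 1$ parameter vector:

```matlab
% Regularized linear-regression cost (sketch).
% Assumes X: m x (n+1) design matrix (first column all ones),
%         y: m x 1 targets, theta: (n+1) x 1 parameters.
function jVal = linearCost(theta, X, y, lambda)
  m = length(y);
  h = X * theta;                          % h_theta(x) for every example
  sqErr = sum((h - y) .^ 2);              % squared-error term
  reg = lambda * sum(theta(2:end) .^ 2);  % penalty starts at theta_1, i.e. theta(2)
  jVal = (sqErr + reg) / (2 * m);
end
```

Octave indexes from 1, so `theta(1)` holds $\theta_0$ and is deliberately left out of the penalty.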

Gradient descent

Recall that ordinary gradient descent looks like this:

Repeat {

$$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \qquad (j=0,1,2,\ldots,n)$$

}

Now pull $\theta_0$ out of the gradient-descent update on its own, and add the penalty term to the remaining updates. As discussed before, the penalty starts from $\theta_{1}$.

Repeat {

$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$

$$\theta_{j}:=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda}{m} \theta_{j}\right] \qquad (j=1,2,\ldots,n)$$

}

After simplification, this can be written as:

Repeat {

$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$

$$\theta_{j}:=\theta_{j}\left(1-\alpha \frac{\lambda}{m}\right)-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \qquad (j=1,2,\ldots,n)$$

}
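The same update can be implemented as one vectorized step. A sketch under the assumptions above (`gradientStep` and the learning rate `alpha` are illustrative names):

```matlab
% One regularized gradient-descent step (sketch).
function theta = gradientStep(theta, X, y, alpha, lambda)
  m = length(y);
  grad = (X' * (X * theta - y)) / m;                        % 1/m * sum_i (h - y) * x_j, all j at once
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % penalty term, skipping theta_0
  theta = theta - alpha * grad;
end
```

Expanding `theta - alpha * grad` for $j \ge 1$ reproduces exactly the $\theta_{j}\left(1-\alpha \frac{\lambda}{m}\right)$ shrinkage form above.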

Normal equation

The original normal equation:

$$\theta=\left(X^{T} X\right)^{-1} X^{T} y$$

where $X$ stacks the training examples as rows and $y$ stacks the labels:

$$X=\begin{bmatrix} \left(x^{(1)}\right)^{T} \\ \left(x^{(2)}\right)^{T} \\ \vdots \\ \left(x^{(m)}\right)^{T} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \qquad y=\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$$

After adding regularization:

$$\theta=\left(X^{T} X+\lambda\begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}\right)^{-1} X^{T} y$$

where the diagonal matrix $\begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$ has dimension $\mathbb{R}^{(n+1) \times (n+1)}$.
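A sketch of this closed-form solution under the same assumptions (`normalEqn` is a hypothetical name):

```matlab
% Regularized normal equation (sketch).
function theta = normalEqn(X, y, lambda)
  L = eye(size(X, 2));  % (n+1) x (n+1) identity...
  L(1, 1) = 0;          % ...with the top-left entry zeroed so theta_0 is not penalized
  theta = (X' * X + lambda * L) \ (X' * y);  % backslash solves without forming an explicit inverse
end
```

A side benefit: with $\lambda > 0$ (and the usual all-ones first column in $X$), the matrix being solved is invertible even when $m \le n$.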

Regularization for Logistic Regression

The logistic-regression cost function:

$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$$

We likewise append a regularization term at the end; the result is:

$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]+\frac{\lambda}{2 m} \sum_{j=1}^{n} \theta_{j}^{2}$$

With this term added, even if you fit many parameters and use high-order features, keeping the parameters small still gives you a reasonable decision boundary.
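In code, the only changes from the linear case are the sigmoid hypothesis and the log loss. A sketch under the same assumptions (`logisticCost` is an illustrative name; `y` holds 0/1 labels):

```matlab
% Regularized logistic-regression cost (sketch).
function jVal = logisticCost(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                       % sigmoid hypothesis
  loss = -(y' * log(h) + (1 - y)' * log(1 - h)) / m;    % cross-entropy term
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);    % penalty skips theta_0
  jVal = loss + reg;
end
```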

Gradient descent

We already know that the gradient-descent updates for linear regression and logistic regression look identical in form, so we carry them over directly, with $\theta_0$ written out separately:

Repeat {

$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$

$$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \qquad (j=1,2,\ldots,n)$$

}

To regularize this for logistic regression, we add the penalty term to the second update as well:

Repeat {

$$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$$

$$\theta_{j}:=\theta_{j}-\alpha\left[\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}+\frac{\lambda}{m} \theta_{j}\right] \qquad (j=1,2,\ldots,n)$$

}

Although this looks identical to the linear-regression update, keep in mind that the hypothesis $h(x)$ differs between the two: for logistic regression, $h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T} x}}$.
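In code the difference is a single line, a sketch:

```matlab
% Same update rule, different hypothesis:
h_linear   = X * theta;                    % linear regression
h_logistic = 1 ./ (1 + exp(-X * theta));   % logistic regression (sigmoid)
```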

Advanced optimization

When we covered logistic regression, we mentioned other advanced algorithms besides gradient descent, but did not go into detail. So how do we use regularization with those advanced algorithms?

function [jVal, gradient] = costFunction(theta)

  jVal = [code to compute $J(\theta)$]

  gradient(1) = [code to compute $\frac{\partial}{\partial \theta_{0}} J(\theta)$]

  gradient(2) = [code to compute $\frac{\partial}{\partial \theta_{1}} J(\theta)$]

  ...

  gradient(n+1) = [code to compute $\frac{\partial}{\partial \theta_{n}} J(\theta)$]

You still have to write the costFunction yourself. Inside this function:

  • `function [jVal, gradient] = costFunction(theta)` requires `theta` to be passed in, where theta $=\left[\begin{array}{c}\theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n}\end{array}\right]$
  • `jVal = [code to compute J(theta)]` is where you write the expression for the cost $J(\theta)$
  • `gradient(1) = [code to compute dJ/dtheta_0]` computes $\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$, with no penalty term for $\theta_0$
  • `gradient(n+1) = [code to compute dJ/dtheta_n]` computes $\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{n}^{(i)}+\frac{\lambda}{m} \theta_{n}$
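Putting the pieces together, here is a fuller sketch of such a `costFunction` for regularized logistic regression, with the gradient filled in, plus a hypothetical `fminunc` call (the extra `X`, `y`, `lambda` arguments are passed through an anonymous function, one common pattern):

```matlab
% Regularized logistic-regression cost and gradient for an advanced optimizer (sketch).
function [jVal, gradient] = costFunction(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                                  % sigmoid hypothesis
  jVal = -(y' * log(h) + (1 - y)' * log(1 - h)) / m ...
         + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);            % regularized cost
  gradient = (X' * (h - y)) / m;                                   % gradient for every j
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end); % penalty, skipping theta_0
end

% Usage sketch (assuming X, y, lambda are already defined):
%   options   = optimset('GradObj', 'on', 'MaxIter', 400);
%   initTheta = zeros(size(X, 2), 1);
%   [theta, cost] = fminunc(@(t) costFunction(t, X, y, lambda), initTheta, options);
```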