Linear Regression with multiple variables

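Linear regression with multiple variables is essentially a generalization of linear regression with a single variable.

I. Multiple features

1. Recap:

Dataset for single-feature regression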

| Size (feet^2) | Price ($1000) |
| --- | --- |
| x | y |
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |

Hypothesis for single-feature regression

$h_{\theta}(x)=\theta_{0}+\theta_{1} x$


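2. Multi-feature regression

Dataset

| Size (feet^2) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000) |
| --- | --- | --- | --- | --- |
| $x_{1}$ | $x_{2}$ | $x_{3}$ | $x_{4}$ | y |
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |
| … | … | … | … | … |

Notation
- n is the number of features; in the table above, n = 4.
- $x^{(i)}$ is the i-th training example, given as a column vector. For example, $x^{(3)}$ = $\begin{bmatrix} 1534 \\ 3\\ 2\\ 30 \end{bmatrix}$ .
- $x^{(i)}_{j}$ is the j-th feature of the i-th training example. For example, $x^{(2)}_{3}$ = 2.

Hypothesis (multivariate equation)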

$h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+ \theta_{3} x_{3} +\theta_{4} x_{4}+ ... + \theta_{n} x_{n}$ .

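In this equation $x_{0}$ is usually set to 1, so that $\theta_{0}$ acts as the constant (intercept) term. This makes it easy to simplify the computation later with matrix multiplication.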


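3. Simplifying the computation with vectors

$h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+ \theta_{3} x_{3} +\theta_{4} x_{4}+ ... + \theta_{n} x_{n}$ . By convention, the vector of unknowns ( $\theta_{i}$ ) is written first and the vector of known values ( $x_{i}$ ) second. Taking the house-price prediction above as an example, with the four training examples as columns:

$\begin{bmatrix} \theta_{0} & \theta_{1} & \theta_{2} & \theta_{3} & \theta_{4} \end{bmatrix} \begin{bmatrix} x^{(1)}_{0} & x^{(2)}_{0} & x^{(3)}_{0} & x^{(4)}_{0}\\ x^{(1)}_{1} & x^{(2)}_{1} & x^{(3)}_{1} & x^{(4)}_{1}\\ x^{(1)}_{2} & x^{(2)}_{2} & x^{(3)}_{2} & x^{(4)}_{2}\\ x^{(1)}_{3} & x^{(2)}_{3} & x^{(3)}_{3} & x^{(4)}_{3}\\ x^{(1)}_{4} & x^{(2)}_{4} & x^{(3)}_{4} & x^{(4)}_{4} \end{bmatrix}$ = $\begin{bmatrix} h_{\theta}(x^{(1)}) & h_{\theta}(x^{(2)}) & h_{\theta}(x^{(3)}) & h_{\theta}(x^{(4)}) \end{bmatrix}$ . Every $x^{(i)}_{0}$ is set to 1, so the predictions for all examples come out of a single matrix product. Each entry on the right is a prediction, which training tries to bring close to the true value $y^{(i)}$ .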
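As a rough illustration of the product above, the following plain-Python sketch computes $h_{\theta}(x)$ for the four sample houses. The values in `theta` are made up to show the arithmetic and are not fitted to the data; `predict` is a hypothetical helper name.

```python
# Features (x_1..x_4) of the four sample houses from the table.
houses = [
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [852, 2, 1, 36],
]

def predict(theta, features):
    """Return h_theta(x) = theta^T x for one example, with x_0 = 1 prepended."""
    x = [1] + list(features)
    return sum(t * xj for t, xj in zip(theta, x))

# Hypothetical parameters theta_0..theta_4, chosen only for illustration.
theta = [80, 0.1, 10, 5, -1]

# One prediction per training example, i.e. one entry of the row vector above.
predictions = [predict(theta, row) for row in houses]
```

Prepending the constant 1 inside `predict` plays the role of $x_{0}$, so $\theta_{0}$ needs no special handling.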

II. Gradient descent for multiple variables

1. Recap (single-feature regression):

Cost function

$J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$

Iteration procedure

Gradient descent algorithm repeat until convergence {

$\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J\left(\theta_{0}, \theta_{1}\right) \quad \text { (for } j=0 \text { and } j=1 \text { ) }$

}

Correct: simultaneous update (all parameters are updated together, using the same learning rate $\alpha$, which must be positive)

$\begin{array}{l} \text { temp0 }:=\theta_{0}-\alpha \frac{\partial}{\partial \theta_{0}} J\left(\theta_{0}, \theta_{1}\right) \\ \text { temp1 }:=\theta_{1}-\alpha \frac{\partial}{\partial \theta_{1}} J\left(\theta_{0}, \theta_{1}\right) \\ \theta_{0}:=\text { temp0 } \\ \theta_{1}:=\text { temp1 } \end{array}$

Partial derivatives

Computing the partial derivatives for j = 0 and j = 1:

$\begin{array}{l} \frac{\partial}{\partial \theta_{j}} J\left(\theta_{0}, \theta_{1}\right) =\frac{\partial}{\partial \theta_{j}} \frac{1}{2 m} \sum_{i=1}^{m}\left(\theta_{0}+\theta_{1} x^{(i)}-y^{(i)}\right)^{2} \\ j=0: \frac{\partial}{\partial \theta_{0}} J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \\ j=1: \frac{\partial}{\partial \theta_{1}} J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x^{(i)} \end{array}$
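The simultaneous update and the two partial derivatives can be sketched in plain Python as follows. The toy data, the learning rate and the iteration count are assumptions chosen so that the example converges; `gradient_descent_1d` is a hypothetical name.

```python
def gradient_descent_1d(xs, ys, alpha, iterations):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors h(x^(i)) - y^(i) under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent_1d(xs, ys, alpha=0.1, iterations=2000)
```

Computing both gradients before assigning either parameter is what makes the update simultaneous; updating `theta0` first and then reusing it for `grad1` would be the incorrect variant.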

2. Multi-feature regression

Cost function

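$J\left(\theta\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$ . The right-hand side looks the same as in the single-feature case, but $h_{\theta}\left(x^{(i)}\right)$ is now computed from all n features.

Iteration procedure

[Figure: 多元迭代.PNG (gradient descent update for multiple features)]

Repeat until convergence, updating every $\theta_{j}$ simultaneously for j = 0, ..., n:

$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$

Partial derivatives

The partial derivative is the factor $\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}$ in the update rule. The two quantities play different roles: the cost function measures how well the model fits, while its partial derivatives set the direction and size of each update step.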
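A minimal sketch of gradient descent for any number of features, in plain Python. It assumes each row of `X` already starts with $x_{0} = 1$; the toy data and hyperparameters are made up for illustration, and `gradient_descent` is a hypothetical name.

```python
def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for h(x) = theta^T x; each row of X starts with x_0 = 1."""
    m, n_plus_1 = len(X), len(X[0])
    theta = [0.0] * n_plus_1
    for _ in range(iterations):
        # Prediction errors h(x^(i)) - y^(i) for all m examples.
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        # dJ/dtheta_j = (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
        grads = [sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j in range(n_plus_1)]
        # All theta_j are updated together from the same old theta.
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta

# Toy data generated from y = 1 + 2*x1 + 3*x2 (not the housing data).
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
```

With n = 1 this reduces to the single-feature update, because $x_{0}^{(i)} = 1$ turns the j = 0 gradient into the plain average error.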

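3. Practical tricks

Feature scaling

Idea: make sure all features are on a similar scale.

For example: $x_{1}$ ∈ [1, 2000] (feet^2), $x_{2}$ ∈ [1, 5] (number of bedrooms). The contours of J are then extremely elongated, so gradient descent converges very slowly.

[Figure: 特征缩放.PNG (feature scaling)]

How is it done?

Divide each $x_{i}$ by the length of its range. This is commonly called normalization.

In the example above: $x_{1}$ = size(feet^2) / 2000, $x_{2}$ = number of bedrooms / 5. Of course, $x_{i}$ may also be negative.

Another form of normalization (mean normalization):

$\begin{array}{l} x_{1}=\frac{\text { size-1000 }}{2000}\\ x_{2}=\frac{\# \text { bedrooms-2 }}{5}\\ -0.5 \leq x_{1} \leq 0.5,-0.5 \leq x_{2} \leq 0.5 \end{array}$ . Each $x_{i}$ then lies in [-0.5, 0.5]. The benefit is that the parameters converge faster, so fewer iterations are needed.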
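Mean normalization can be sketched as below. One assumption differs from the formulas above: instead of the fixed constants (1000 and 2000, 2 and 5), the sketch uses the mean and the range (max minus min) of the sample itself. `mean_normalize` is a hypothetical helper name.

```python
def mean_normalize(column):
    """Rescale one feature column to (value - mean) / (max - min)."""
    mean = sum(column) / len(column)
    spread = max(column) - min(column)  # assumes the column is not constant
    return [(v - mean) / spread for v in column]

# Two feature columns from the housing table.
sizes = [2104, 1416, 1534, 852]
bedrooms = [5, 3, 3, 2]

scaled_sizes = mean_normalize(sizes)
scaled_bedrooms = mean_normalize(bedrooms)
```

After scaling, each column is centered on 0 and spans a range of exactly 1, so both features contribute on a comparable scale.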

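Learning rate $\alpha$
1. Relationship between the number of iterations and the cost function J

  J should decrease after every iteration. If J increases during the iterations, decrease $\alpha$.
2. Two ways to decide when to stop iterating
  1. Use an automatic convergence test
  2. Declare a threshold and stop once the decrease of J in one iteration falls below it
3. Typical iteration behavior when $\alpha$ is badly chosen
If J does not decrease steadily, decrease $\alpha$.

4. Summary

If $\alpha$ is too large, J may diverge. If $\alpha$ is too small, convergence may be slow (inefficient).

5. Choosing $\alpha$ (multiply by about 3 at each step)

..., 0.001, ..., [0.003], ..., 0.01, ..., [0.03], ..., 0.1, ..., [0.3], ..., 1, ...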
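One way to try these candidates is to record J after every iteration and check whether it keeps decreasing. The sketch below does this on made-up single-feature data; the data, the iteration count and the name `cost_history` are assumptions for illustration.

```python
def cost_history(xs, ys, alpha, iterations):
    """Run gradient descent on h(x) = theta0 + theta1 * x and record J after each step."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        # Cost J evaluated with the freshly updated parameters.
        new_errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in new_errors) / (2 * m))
    return history

# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    J = cost_history(xs, ys, alpha, iterations=50)
    decreasing = all(b <= a for a, b in zip(J, J[1:]))
    print(f"alpha={alpha}: final J={J[-1]:.4g}, decreasing={decreasing}")
```

A reasonable choice is the largest candidate for which J still decreases at every step; candidates where J grows are too large for this data.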

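III. Features and polynomial regression

1. Simplifying features

We again use the house-price prediction as the example.

Suppose we take the frontage (width) and the depth of the house as two features. The hypothesis is then:

$h_{\theta}(x) = {\theta_{0}}+{\theta_{1}}*frontage+{\theta_{2}}*depth$ .
We can simplify the features in order to simplify the iteration. For instance, frontage and depth can be replaced by a single feature, the area (Area = frontage * depth).

That is: $h_{\theta}(x) = {\theta_{0}}+{\theta_{1}}*Area$ . This makes the iteration considerably simpler.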

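2. Choosing the hypothesis function

Choose a suitable hypothesis function based on the overall shape of the training data, as in the figure below:

[Figure: 样本特征图像.PNG (plot of the training samples)]

We could choose a quadratic function as the hypothesis: $h_{\theta}(x) = {\theta_{0}}+{\theta_{1}}*x_{1}+{\theta_{2}}*x_{2}^{2}$ , where in fact $x_{1} = x_{2} = Size$ .

However, this quadratic only models the first part of the Size - Price relationship. Because the parabola opens downward, Price would eventually decrease as Size increases, which does not fit the problem. Two possible adjustments:

Add a cubic term: $h_{\theta}(x) = {\theta_{0}}+{\theta_{1}}*x_{1}+{\theta_{2}}*x_{2}^{2} + {\theta_{3}}*x_{3}^{3}$ , where $x_{1} = x_{2} = x_{3} = Size$ . The drawback is that the computation becomes much heavier.
Use a square root instead: $h_{\theta}(x) = {\theta_{0}}+{\theta_{1}}*x_{1}+{\theta_{2}}*x_{2}^{1/2}$ (the best choice here).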
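Because the hypothesis stays linear in $\theta$, the square-root model can be trained with the same linear regression machinery once the extra feature has been computed. A small sketch, with `sqrt_features` as a hypothetical helper name:

```python
import math

def sqrt_features(size):
    """Map one raw size to the feature vector [x_0, size, sqrt(size)]."""
    return [1.0, float(size), math.sqrt(size)]

# Sizes taken from the housing table.
sizes = [2104, 1416, 1534, 852]

# Design matrix for h(x) = theta0 + theta1 * size + theta2 * sqrt(size).
X = [sqrt_features(s) for s in sizes]
```

The two derived columns have very different ranges, so feature scaling is worth applying before running gradient descent on them.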

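IV. Normal equation

Unlike the iterative method used so far, the normal equation is a direct solution. Gradient descent approaches the optimal ${\theta_{i}}$ step by step over many iterations, whereas the normal equation solves for all ${\theta_{i}}$ analytically in one step.

Introductory example:

Suppose the cost function depends on ${\theta}$ as J( ${\theta}$ ) = a ${\theta^{2}}$ + b ${\theta}$ + c, where ${\theta}$ is a scalar, as in the figure below:

[Figure: J - θ.PNG (plot of J against θ)]

Using high-school calculus: to find the minimum of J( ${\theta}$ ), differentiate J( ${\theta}$ ), set the derivative to 0, and solve for ${\theta}$ .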
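
For the scalar quadratic above, this step can be written out explicitly, assuming a > 0 so that the stationary point is a minimum:

$\frac{d}{d \theta} J(\theta)=2 a \theta+b=0 \quad \Longrightarrow \quad \theta=-\frac{b}{2 a}$
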
When ${\theta}$ is a vector, the same idea is applied to every component ${\theta_{j}}$ :

$\begin{array}{c}\theta \in \mathbb{R}^{n+1} \quad J\left(\theta_{0}, \theta_{1}, \ldots, \theta_{n}\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2} \\ \frac{\partial}{\partial \theta_{j}} J(\theta)=\cdots=0 \quad \text { (for every } j \text { ) }\end{array}$

Solve for $\theta_{0}, \theta_{1}, \ldots, \theta_{n}$ . Because ${\theta}$ is a vector here, working through these derivatives by hand is very tedious.

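Improvement:

Example: m = 4

| | Size (feet^2) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000) |
| --- | --- | --- | --- | --- | --- |
| $x_{0}$ | $x_{1}$ | $x_{2}$ | $x_{3}$ | $x_{4}$ | y |
| 1 | 2104 | 5 | 1 | 45 | 460 |
| 1 | 1416 | 3 | 2 | 40 | 232 |
| 1 | 1534 | 3 | 2 | 30 | 315 |
| 1 | 852 | 2 | 1 | 36 | 178 |
| 1 | … | … | … | … | … |

Put the dataset and the corresponding target values into a matrix and a vector, respectively.

$X=\left[\begin{array}{ccccc} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{array}\right] \quad y=\left[\begin{array}{c} 460 \\ 232 \\ 315 \\ 178 \end{array}\right]$ .

The key result: there is no need to work through tedious derivatives or to iterate on ${\theta_{i}}$ . Computing $\theta=\left(X^{T} X\right)^{-1} X^{T} y$ gives the optimal ${\theta}$ directly.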
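A minimal plain-Python sketch of the normal equation. Instead of forming $\left(X^{T} X\right)^{-1}$ explicitly, it solves the linear system $\left(X^{T} X\right) \theta = X^{T} y$ by Gaussian elimination, which yields the same ${\theta}$ whenever $X^{T} X$ is invertible. The toy data is an assumption: with only the four rows shown above and five columns, $X^{T} X$ would be singular, so more rows than columns are needed. `normal_equation` is a hypothetical name.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination with partial pivoting."""
    n = len(X[0])
    # A = X^T X (n x n) and b = X^T y (length n).
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    # Forward elimination on the augmented matrix [A | b].
    M = [A[i] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:  # crude absolute tolerance
            raise ValueError("X^T X is singular (not invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    theta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (M[i][n] - tail) / M[i][i]
    return theta

# Toy data generated from y = 1 + 2*x1 + 3*x2; each row starts with x_0 = 1.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
theta = normal_equation(X, y)
```

No learning rate and no iteration count appear anywhere; the cost is the elimination step, which grows quickly with the number of features.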
Generalization:
1. m examples $\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(m)}, y^{(m)}\right)$ ; n features.
  
  $\underline{x^{(i)}}=\left[\begin{array}{c} x_{0}^{(i)} \\ x_{1}^{(i)} \\ x_{2}^{(i)} \\ \vdots \\ x_{n}^{(i)} \end{array}\right] \in \mathbb{R}^{n+1}$ , where $x_{0}^{(i)} = 1$ makes the vector product convenient. $\underline{y}=\left[\begin{array}{c} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{array}\right] \in \mathbb{R}^{m}$ . The matrix X is the m × (n+1) matrix whose i-th row is $\left(x^{(i)}\right)^{T}$ .

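Performance comparison of the normal equation and gradient descent:

| Gradient descent | Normal equation |
| --- | --- |
| Needs a choice of ${\alpha}$ | No step size ${\alpha}$ needed |
| Needs many iterations | No iterations, but needs to compute $\left(X^{T} X\right)^{-1}$ |
| Works well even when n is large (n >= 10^6) | Performs better when n is small |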

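What to do when the matrix is not invertible:
1. Problem: when the normal equation is used to solve for ${\theta}$ , $X^{T} X$ is not invertible, i.e. $\left(X^{T} X\right)^{-1}$ does not exist.
2. Causes:
  1. Redundant features: $x_{1}$ in $feet^{2}$ and $x_{2}$ in $m^{2}$ both describe the area of the house; remove one of them
  2. Too many features: delete some features or use regularization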
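
The redundant-feature case can be checked exactly with rational arithmetic. The sketch below computes $\det\left(X^{T} X\right)$ with `fractions.Fraction`; the conversion factor 929/10000 is an approximate value used only for illustration, and `gram_determinant` is a hypothetical name.

```python
from fractions import Fraction

def gram_determinant(X):
    """Return det(X^T X) exactly, using rational arithmetic."""
    n = len(X[0])
    A = [[sum(Fraction(row[i]) * Fraction(row[j]) for row in X) for j in range(n)]
         for i in range(n)]
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: the matrix is singular
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return det

# Columns: x_0 = 1, area in square feet, the same area in square metres.
FT2_TO_M2 = Fraction(929, 10000)  # approximate conversion factor
sizes = [2104, 1416, 1534, 852]
X_redundant = [[1, s, s * FT2_TO_M2] for s in sizes]
X_ok = [[1, s] for s in sizes]

print(gram_determinant(X_redundant) == 0)  # redundant column present
print(gram_determinant(X_ok) == 0)         # redundant column removed
```

Since the third column is a constant multiple of the second, $X^{T} X$ loses rank and its determinant vanishes; dropping the duplicate column restores invertibility.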
