Andrew Ng's Machine Learning


Supervised Learning

Each sample in the given training set comes with the correct answer (label); the task is to predict the answer for new inputs.

Linear Regression

Predicts a continuous-valued output.

Hypothesis Function:

h_\theta(x) = \theta_0 + \theta_1x

\theta_i: Parameters

Different values of \theta_i produce different hypothesis functions.

Cost Function

J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)^2

This turns the linear regression problem into the problem of minimizing the cost function.

Gradient Descent

Gradient descent can minimize many kinds of functions; the cost function above is one of them.

  1. Start with some initial \theta_0, \theta_1 (commonly both set to 0).
  2. Keep changing \theta_0, \theta_1 to reduce J(\theta_0, \theta_1) until we end up at a minimum.

\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)

:= is the assignment operator

\alpha: learning rate, which controls the size of each parameter update

The \theta_j must be updated simultaneously:

  1. temp0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)
  2. temp1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)
  3. \theta_0 := temp0
  4. \theta_1 := temp1

Substituting the cost function into the gradient descent rule (taking the partial derivatives):

\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j}\frac{1}{2m} \sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)^2

j = 0: \frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = \frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)

j = 1: \frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = \frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)x^i

\theta_0 := \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)

\theta_1 := \theta_1-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)x^i
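A minimal NumPy sketch of these two update rules; the learning rate, iteration count, and sample data below are illustrative choices, not values from the course:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1500):
    """Univariate linear regression fit by batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0              # common initialization: both zero
    for _ in range(num_iters):
        h = theta0 + theta1 * x            # h_theta(x^i) for every sample at once
        # compute both new values before assigning -> simultaneous update
        temp0 = theta0 - alpha * (1 / m) * np.sum(h - y)
        temp1 = theta1 - alpha * (1 / m) * np.sum((h - y) * x)
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# illustrative data, roughly y = 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.1, 7.9, 11.2, 13.8])
print(gradient_descent(x, y))
```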

Multivariate Linear Regression

Linear regression with multiple features, again solved with gradient descent.

n: number of features

x^{(i)}: input (features) of i^{th} training example

x^{(i)}_j: value of feature j in i^{th} training example

Hypothesis function: h_\theta(x) = x\theta^T = \theta_0x_0 + \theta_1x_1 + \cdots + \theta_nx_n (where x_0 \equiv 1)

Parameters: \theta_0, \theta_1, \ldots, \theta_n, written collectively as the vector \theta

Cost function:

J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^m(h_\theta{(x^i) - y^i})^2

Gradient Descent:

\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)

\theta_j := \theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^i) - y^i)x_j^i
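The same rule in vectorized form, assuming the design matrix X already contains the x_0 = 1 column (a sketch, not course code):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Multivariate linear regression; X must include the x_0 = 1 column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = X @ theta                                       # h_theta(x^i) for all examples
        theta = theta - alpha * (1 / m) * (X.T @ (h - y))   # every theta_j updated together
    return theta
```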

Feature Scaling

When feature values differ greatly in scale, gradient descent takes a long time to find the minimum; feature scaling addresses this.

Generally aim for -1 \leq x_i \leq 1; ranges not far from [-1, 1] are also acceptable. This speeds up convergence.

Learning rate: too large and J may not decrease on every iteration; too small and convergence is slow.

Mean normalization

x_i = \frac{x_i-\mu_i}{s_i}

\mu_i: average value of feature i over the training set

s_i: range of feature i (maximum minus minimum)
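A sketch of mean normalization; returning mu and s so that later inputs can be scaled the same way is an added convenience, not something spelled out above:

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column to (x - mean) / (max - min).
    Do not include the x_0 = 1 column here (its range is zero)."""
    mu = X.mean(axis=0)                      # mu_i: average of feature i
    s = X.max(axis=0) - X.min(axis=0)        # s_i: range of feature i
    return (X - mu) / s, mu, s               # keep mu, s to scale new inputs identically
```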

Polynomial Regression

Treat each polynomial term as a separate feature, which reduces the problem to the multivariate linear regression above.
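For example, to fit h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2, build x and x^2 as two features and reuse the multivariate machinery (a sketch with made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# treat x and x^2 as separate features, plus the x_0 = 1 column
X_poly = np.column_stack([np.ones_like(x), x, x ** 2])
# X_poly can now go through mean normalization (x^2 has a much larger range)
# and then the multivariate gradient descent above
```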

Normal Equation

Gradient descent needs many iterations to reach the minimizing \theta, while the normal equation computes it in a single step and requires no feature scaling:

\theta = (X^TX)^{-1}X^Ty

| x_0 | x_1  | x_2 | x_3 | x_4 | y   |
|-----|------|-----|-----|-----|-----|
| 1   | 2104 | 5   | 1   | 45  | 460 |
| 1   | 1416 | 3   | 2   | 40  | 232 |
| 1   | 1534 | 3   | 2   | 30  | 315 |
| 1   | 852  | 2   | 1   | 36  | 178 |

X = \begin{bmatrix}
1 & 2104 & 5 & 1 & 45 \\
1 & 1416 & 3 & 2 & 40 \\
1 & 1534 & 3 & 2 & 30 \\
1 & 852  & 2 & 1 & 36 \\
\end{bmatrix}
y = \begin{bmatrix}
460 \\
232 \\
315 \\
178 \\
\end{bmatrix}
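A sketch of the normal equation applied to the table above; np.linalg.pinv is used in place of a plain inverse in case X^TX is singular:

```python
import numpy as np

X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```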

| Gradient Descent | Normal Equation |
|---|---|
| Need to choose \alpha | No need to choose \alpha |
| Needs many iterations | No iteration needed |
| Works well even when n is large | Needs to compute (X^TX)^{-1}, which is slow if n is very large |

Classification

Logistic Regression

Logistic function == sigmoid function

g(z) = \frac{1}{1+e^{-z}}

h_\theta(x) = g(x\theta^T) = \frac{1}{1+e^{-x\theta^T}} = P(y=1|x;\theta)

From the sigmoid function, g(z) \geq 0.5 exactly when z \geq 0; so h_\theta(x) = g(x\theta^T) \geq 0.5 exactly when x\theta^T \geq 0, which is when we predict y = 1.

Cost function

J(\theta) = \frac{1}{m}\sum\limits_{i=1}^mCost(h_\theta(x^i), y^i)

Cost(h_\theta(x), y) = 
\begin{cases}
-\log (h_\theta(x)) & \mbox{if }y\mbox{ = 1} \\
-\log (1 - h_\theta(x)) & \mbox{if }y\mbox{ = 0} \\
\end{cases}
=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))

Gradient Descent

The update rule has the same form as for linear regression (though h_\theta is now the sigmoid hypothesis); update all \theta_j simultaneously.

\theta_j := \theta_j -\alpha\frac{\partial}{\partial\theta_j}J(\theta)

\theta_j := \theta_j -\alpha\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^i) - y^i)x_j^i
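A sketch of logistic regression trained with this rule; X is again assumed to carry the x_0 = 1 column and y to hold 0/1 labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Same update shape as linear regression, but h_theta(x) = g(x theta^T)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                              # P(y = 1 | x; theta) for each example
        theta = theta - alpha * (1 / m) * (X.T @ (h - y))
    return theta
```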

Multiclass Classification

One-vs-all

Train one classifier per class; for an input x, pick the class whose classifier outputs the highest value.
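A sketch of the prediction step, assuming Theta stacks one trained parameter row per class:

```python
import numpy as np

def one_vs_all_predict(Theta, x):
    """Theta: one row of parameters per class; x includes the x_0 = 1 entry.
    Return the class whose classifier outputs the largest h_theta(x)."""
    scores = 1.0 / (1.0 + np.exp(-(Theta @ x)))   # sigmoid output of each classifier
    return int(np.argmax(scores))
```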

Overfitting

Choosing too many features in order to fit the training set produces a contorted hypothesis that fails to generalize to new examples. Two remedies:

  • Reduce the number of features
  • Regularization

Regularization

Linear Regression

Gradient Descent

J(\theta) = \frac{1}{2m}\left[\sum\limits_{i=1}^m(h_\theta{(x^i) - y^i})^2 + \lambda\sum\limits_{j=1}^n\theta_j^2\right]

\theta_0 := \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)

\theta_j := \theta_j-\alpha\frac{1}{m}\left[\sum\limits_{i=1}^m(h_\theta(x^i) - y^i)x_j^i + \lambda\theta_j\right]
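A sketch of the regularized update; theta_0 is left unpenalized, matching the separate update for \theta_0 above:

```python
import numpy as np

def regularized_gradient_descent(X, y, alpha=0.01, lam=1.0, num_iters=1000):
    """Regularized linear regression; the bias theta_0 is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = X @ theta
        grad = (1 / m) * (X.T @ (h - y))
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1 only
        theta = theta - alpha * grad
    return theta
```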

Normal Equation

\theta = \left(X^TX + \lambda
\begin{bmatrix}
0 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
\end{bmatrix}
\right)^{-1}X^Ty \quad (\lambda > 0)
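A sketch of the regularized normal equation; np.linalg.solve is used rather than forming the inverse explicitly:

```python
import numpy as np

def regularized_normal_equation(X, y, lam=1.0):
    """theta = (X^T X + lambda * L)^{-1} X^T y, where L is the identity
    matrix with its top-left entry (for theta_0) zeroed out."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                              # theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```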

Logistic Regression

Gradient Descent

J(\theta) = \frac{-1}{m}\left[\sum\limits_{i=1}^m(y^i\log(h_\theta(x^i))+(1-y^i)\log(1-h_\theta(x^i)))\right] + \frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2

\theta_0 := \theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)

\theta_j := \theta_j-\alpha\frac{1}{m}\left[\sum\limits_{i=1}^m(h_\theta(x^i) - y^i)x_j^i + \lambda\theta_j\right]
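The corresponding sketch for regularized logistic regression differs from the linear case only in the hypothesis:

```python
import numpy as np

def regularized_logistic_gd(X, y, alpha=0.1, lam=1.0, num_iters=1000):
    """Regularized logistic regression; only h_theta changes vs. the linear case."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid hypothesis
        grad = (1 / m) * (X.T @ (h - y))
        grad[1:] += (lam / m) * theta[1:]        # theta_0 left unpenalized
        theta = theta - alpha * grad
    return theta
```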