$$
\begin{gathered}
D=\left\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots ,(x_{N},y_{N})\right\}\\
x_{i}\in \mathbb{R}^{p},y_{i}\in \mathbb{R},i=1,2,\cdots ,N\\
X=\begin{pmatrix}
x_{1} & x_{2} & \cdots & x_{N}
\end{pmatrix}^{T}=\begin{pmatrix}
x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T}
\end{pmatrix}=\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np}
\end{pmatrix}_{N \times p}\\
Y=\begin{pmatrix}
y_{1} \\ y_{2} \\ \vdots \\ y_{N}
\end{pmatrix}_{N \times 1}
\end{gathered}
$$
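As a concrete illustration of the shapes involved (a minimal numpy sketch; the synthetic data and the names `X`, `Y`, `true_w` are made up for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                # N samples, each with p features

# Design matrix X: row i is x_i^T, so X has shape (N, p)
X = rng.normal(size=(N, p))

# Target vector Y: shape (N,), one response per sample
true_w = np.array([1.5, -2.0, 0.5])
Y = X @ true_w + rng.normal(scale=0.1, size=N)
```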
For the least-squares estimate, we then have
$$
\begin{aligned}
L(\omega)&=\sum\limits_{i=1}^{N}||\omega^{T}x_{i}-y_{i}||^{2}\\
&=\sum\limits_{i=1}^{N}(\omega^{T}x_{i}-y_{i})^{2}\\
&=\begin{pmatrix}
\omega^{T}x_{1}-y_{1} & \omega^{T}x_{2}-y_{2} & \cdots & \omega^{T}x_{N}-y_{N}
\end{pmatrix}\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=\left[\begin{pmatrix}
\omega^{T}x_{1} & \omega^{T}x_{2} & \cdots & \omega^{T}x_{N}
\end{pmatrix}-\begin{pmatrix}
y_{1} & y_{2} & \cdots & y_{N}
\end{pmatrix}\right]\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=\left[\omega^{T}\begin{pmatrix}
x_{1} & x_{2} & \cdots & x_{N}
\end{pmatrix}-\begin{pmatrix}
y_{1} & y_{2} & \cdots & y_{N}
\end{pmatrix}\right]\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})\begin{pmatrix}
\omega^{T}x_{1}-y_{1} \\ \omega^{T}x_{2}-y_{2} \\ \vdots \\ \omega^{T}x_{N}-y_{N}
\end{pmatrix}\\
&=(\omega^{T}X^{T}-Y^{T})(X \omega-Y)\\
&=\omega^{T}X^{T}X \omega-2 \omega^{T}X^{T}Y+Y^{T}Y
\end{aligned}
$$
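A quick numerical sanity check of the final expansion (a sketch; `X`, `Y`, and `w` below are arbitrary synthetic values):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)
w = rng.normal(size=p)

# Elementwise form: sum_i (w^T x_i - y_i)^2
loss_sum = sum((w @ x_i - y_i) ** 2 for x_i, y_i in zip(X, Y))

# Vectorized form: w^T X^T X w - 2 w^T X^T Y + Y^T Y
loss_vec = w @ X.T @ X @ w - 2 * w @ X.T @ Y + Y @ Y

assert np.isclose(loss_sum, loss_vec)
```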
对于ω ^ \hat{\omega} ω ^ ,有
$$
\begin{aligned}
\hat{\omega}&=\mathop{argmin}\limits_{\omega}L(\omega)\\
\frac{\partial L(\omega)}{\partial \omega}&=2X^{T}X \omega-2X^{T}Y\\
2X^{T}X \omega-2X^{T}Y&=0\\
\omega&=(X^{T}X)^{-1}X^{T}Y
\end{aligned}
$$
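This closed form maps directly to code (a minimal sketch; for numerical stability it solves the normal equations rather than inverting $X^{T}X$, and cross-checks against numpy's built-in least-squares solver):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
true_w = np.array([1.5, -2.0, 0.5])
Y = X @ true_w + rng.normal(scale=0.1, size=N)

# w = (X^T X)^{-1} X^T Y, computed via a linear solve
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against the library least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
print(w_hat)   # close to true_w when the noise is small
```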
Supplement: rules for matrix differentiation
$$
\begin{aligned}
x&=\begin{pmatrix}x_{1} & x_{2} & \cdots & x_{n}\end{pmatrix}\\
f(x)&=Ax,\ \text{then}\ \frac{\partial f(x)}{\partial x^{T}} = \frac{\partial (Ax)}{\partial x^{T}} = A\\
f(x)&=x^{T}Ax,\ \text{then}\ \frac{\partial f(x)}{\partial x} = \frac{\partial (x^{T}Ax)}{\partial x} = Ax+A^{T}x\\
f(x)&=a^{T}x,\ \text{then}\ \frac{\partial a^{T}x}{\partial x} = \frac{\partial x^{T}a}{\partial x} = a\\
f(x)&=x^{T}Ay,\ \text{then}\ \frac{\partial x^{T}Ay}{\partial x} = Ay,\ \frac{\partial x^{T}Ay}{\partial A} = xy^{T}
\end{aligned}
$$
Source: zealscott, 矩阵求导法则与性质 (rules and properties of matrix differentiation)
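These identities are easy to sanity-check with finite differences (a sketch; the quadratic-form rule $\partial(x^{T}Ax)/\partial x = Ax + A^{T}x$ is verified below for an arbitrary non-symmetric $A$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))   # deliberately non-symmetric
x = rng.normal(size=n)

f = lambda v: v @ A @ v       # f(x) = x^T A x

# Analytic gradient from the rule: Ax + A^T x
grad_rule = A @ x + A.T @ x

# Central finite differences along each coordinate direction
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

assert np.allclose(grad_rule, grad_fd, atol=1e-6)
```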
Geometrically, least squares minimizes the sum of squared distances between the model (here, a line) and the observed values. Suppose the sample vectors span a $p$-dimensional space (in the full-rank case): $X=\mathrm{Span}(x_1,\cdots,x_N)$. The model can be written as $f(w)=x_{i}^{T}\beta$, i.e., some combination of $x_1,\cdots,x_N$, and least squares asks that $Y$ be as close to this model as possible, so their difference should be perpendicular to the spanned space:
$$
X\bot(Y-X\beta)\longrightarrow X^{T}\cdot(Y-X\beta)=0_{p\times 1}\longrightarrow \beta=(X^{T}X)^{-1}X^{T}Y
$$
Source: tsyw, 线性回归 · 语雀 (yuque.com)
A few points from my own understanding of this:
Since $X=\begin{pmatrix}x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T}\end{pmatrix}$, stacking the values $x_{i}^{T}\beta$ over $i$ gives exactly $X\beta$.
In general, $Y$ does not lie in this $p$-dimensional space.
$$
\begin{aligned}
X \beta&=\begin{pmatrix}x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np}\end{pmatrix}\begin{pmatrix}\beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p}\end{pmatrix}\\
&=\beta_{1}\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix}+\beta_{2}\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix}+\cdots +\beta_{p}\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}
\end{aligned}
$$
This can be read as $\beta$, under the action of the matrix $X$, being re-expressed from the standard basis $\begin{pmatrix}1 \\ 0 \\ \vdots \\ 0\end{pmatrix},\begin{pmatrix}0 \\ 1 \\ \vdots \\ 0\end{pmatrix},\cdots ,\begin{pmatrix}0 \\ 0 \\ \vdots \\ 1\end{pmatrix}$ in terms of the new basis $\begin{pmatrix}x_{11} \\ x_{21} \\ \vdots \\ x_{N1}\end{pmatrix},\begin{pmatrix}x_{12} \\ x_{22} \\ \vdots \\ x_{N2}\end{pmatrix},\cdots ,\begin{pmatrix}x_{1p} \\ x_{2p} \\ \vdots \\ x_{Np}\end{pmatrix}$, so the vector $X\beta$ always lies in the $p$-dimensional space. Since $Y$ is generally not in that space, minimizing the distance between $Y$ and $X\beta$ means adjusting $\beta$ so that the vector $Y-X\beta$ is exactly perpendicular to the $p$-dimensional space, at which point the distance is smallest. Hence $X^{T}(Y-X\beta)=\boldsymbol{0}$, as the check below illustrates.
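Numerically, $X\hat{\beta}$ is the orthogonal projection of $Y$ onto the column space of $X$, so the residual is perpendicular to every column (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)          # generic Y, not in the column space of X

beta = np.linalg.solve(X.T @ X, X.T @ Y)
residual = Y - X @ beta

# X^T (Y - X beta) = 0: residual orthogonal to the column space of X
assert np.allclose(X.T @ residual, 0, atol=1e-8)
```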
For the one-dimensional case, write $y=\omega^{T}x+\epsilon$ with $\epsilon \sim N(0,\sigma^{2})$. Then
$$
y|x;\omega \sim N(\omega^{T}x, \sigma^{2})
$$
Note that here $x$ is the known data and $\omega$ is the parameter, so $y$ follows the same distribution as $\epsilon$, shifted by $\omega^{T}x$. Thus
$$
P(y|x;\omega)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left[- \frac{(y-\omega^{T}x)^{2}}{2\sigma^{2}}\right]
$$
The maximum likelihood estimate is then
$$
\begin{aligned}
L(\omega)&=\log P(Y|X;\omega)\\
&=\log \prod\limits_{i=1}^{N}P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\log P(y_{i}|x_{i};\omega)\\
&=\sum\limits_{i=1}^{N}\left\{\log \frac{1}{\sqrt{2\pi}\sigma}+\log \exp\left[- \frac{(y_{i}-\omega^{T}x_{i})^{2}}{2\sigma^{2}}\right]\right\}\\
\hat{\omega}&=\mathop{argmax}\limits_{\omega}L(\omega)\\
&=\mathop{argmax}\limits_{\omega}\sum\limits_{i=1}^{N}\left[- \frac{1}{2\sigma^{2}}(y_{i}-\omega^{T}x_{i})^{2}\right]\\
&=\mathop{argmin}\limits_{\omega}\sum\limits_{i=1}^{N}(y_{i}-\omega^{T}x_{i})^{2}
\end{aligned}
$$
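The equivalence can also be seen numerically: minimizing the negative Gaussian log-likelihood recovers the least-squares solution (a sketch assuming `scipy` is available; $\sigma$ is held fixed):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
true_w = np.array([1.5, -2.0, 0.5])
Y = X @ true_w + rng.normal(scale=0.1, size=N)
sigma = 0.1

def neg_log_likelihood(w):
    # -log P(Y|X; w) for y_i ~ N(w^T x_i, sigma^2)
    r = Y - X @ w
    return N * 0.5 * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, np.zeros(p)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(w_mle, w_ols, atol=1e-4)
```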
Up to this point, for the problem of determining $\omega$, maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function defined by

$$
E(\omega)=\frac{1}{2}\sum\limits_{n=1}^{N}[y(x_{n},\omega)-t_{n}]^{2}
$$

Thus, under the assumption of Gaussian noise, the sum-of-squares error function arises as a natural consequence of maximizing the likelihood function.
Source: 《PRML Translation》, p. 27, translated by 马春鹏
Original: *Pattern Recognition and Machine Learning*, Christopher M. Bishop
PRML also derives the maximum likelihood estimate of the precision $\beta$, which corresponds to $1/\sigma^{2}$ here. The $y$ above is PRML's $t$. (PRML's notation is used below unless stated otherwise.)
$$
\begin{aligned}
\ln p(T|X,\omega,\beta)&=- \frac{\beta}{2}\sum\limits_{n=1}^{N}[y(x_{n},\omega)-t_{n}]^{2}+ \frac{N}{2}\ln \beta- \frac{N}{2}\ln (2 \pi)\\
\hat{\beta}&=\mathop{argmax}\limits_{\beta}L(\beta),\quad L(\beta)=- \beta\sum\limits_{n=1}^{N}[y(x_{n},\omega)-t_{n}]^{2}+ N\ln \beta\\
\frac{\partial L(\beta)}{\partial \beta}&=-\sum\limits_{n=1}^{N}[y(x_{n},\omega_\text{MLE})-t_{n}]^{2}+ \frac{N}{\beta_\text{MLE}}=0\\
\frac{1}{\beta_\text{MLE}}&=\frac{1}{N}\sum\limits_{n=1}^{N}[y(x_{n},\omega_\text{MLE})-t_{n}]^{2}
\end{aligned}
$$
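In code, $1/\beta_{\text{MLE}}$ is simply the mean squared residual at the maximum-likelihood weights (a sketch, using PRML's $t$ for the targets):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
true_w = np.array([1.5, -2.0, 0.5])
t = X @ true_w + rng.normal(scale=0.1, size=N)   # targets (PRML's t)

w_mle = np.linalg.solve(X.T @ X, X.T @ t)

# 1 / beta_MLE = (1/N) * sum_n (y(x_n, w_MLE) - t_n)^2
inv_beta_mle = np.mean((X @ w_mle - t) ** 2)
print(inv_beta_mle)   # close to the true noise variance 0.1**2 = 0.01
```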