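Notes on "Introduction to Machine Learning Theory" (《机器学习理论导引》): Contents
0 Reflections
While putting this chapter together I felt fairly lost: the formulas all looked familiar, yet I could not prove them, and writing everything out this way promised to be a huge amount of work. In the end I decided that learning without some pain leaves understanding shallow. If the formulas are only skimmed and never truly understood, everything that follows becomes harder, and to learn something deeply you cannot fool yourself. Fortunately, proofs of these formulas can be found on Wikipedia and in several blog columns, which made things much clearer. I have reproduced the proofs of most of the basic results in the first part myself; for the later ones, I hope to come back and look them up when I meet them in related work.
I originally planned one note per chapter, but the workload turned out to be too large, so each chapter is split into two or three parts to keep the notes readable and to keep myself motivated to post.
Update log, 2022-04-15
I found slides (PPT) prepared earlier by members of my group, which are excellent existing material to draw on and extend. Many thanks to them!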
1.1 Properties of Functions
1.1.1 Convex set
A set $C$ is convex if the line segment joining any two points $x_1,x_2\in C$ stays inside $C$, that is,
$$\theta x_1+(1-\theta)x_2\in C,\quad \forall\ 0\leqslant\theta\leqslant1.$$
1.1.2 Convex function
Let $f:\mathbb{R}^d\mapsto\mathbb{R}$ be a function defined on a convex set, and let $\Psi$ denote its domain. If for all $\mathbf{x},\mathbf{z}\in\Psi$
$$f(\theta\mathbf{x}+(1-\theta)\mathbf{z})\leqslant\theta f(\mathbf{x})+(1-\theta)f(\mathbf{z})\quad(\forall\ 0\leqslant\theta\leqslant1),$$
then $f(\cdot)$ is called a convex function.
Figure 1. Illustration of a convex function.
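The defining inequality can be probed numerically. The sketch below is only an illustration under stated assumptions: the helper name `violates_convexity`, the interval $[-5,5]$ and the two test functions are choices made here, not part of the original notes. Random sampling can expose a violation, but it can never prove convexity.

```python
import random

def violates_convexity(f, lo, hi, trials=10_000, seed=0, tol=1e-9):
    """Return True if some sampled (x, z, theta) breaks
    f(theta*x + (1-theta)*z) <= theta*f(x) + (1-theta)*f(z)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, z = rng.uniform(lo, hi), rng.uniform(lo, hi)
        theta = rng.random()
        lhs = f(theta * x + (1 - theta) * z)
        rhs = theta * f(x) + (1 - theta) * f(z)
        if lhs > rhs + tol:
            return True
    return False

# x^2 is convex on the whole line; x^3 is concave on the negative half-line.
print(violates_convexity(lambda x: x * x, -5.0, 5.0))
print(violates_convexity(lambda x: x ** 3, -5.0, 5.0))
```

A `False` result only means that no counterexample was found among the sampled points.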
1.1.3 Concave function
If the inequality in the definition of a convex function is reversed, that is,
$$f(\theta\mathbf{x}+(1-\theta)\mathbf{z})\geqslant\theta f(\mathbf{x})+(1-\theta)f(\mathbf{z})\quad(\forall\ 0\leqslant\theta\leqslant1),$$
then $f(\cdot)$ is called a concave function.
1.1.4 Gradient
For a differentiable function $f:\mathbb{R}^d\mapsto\mathbb{R}$ defined on a convex set, the gradient is written $\nabla f(\mathbf{x})=\left(\frac{\partial f(\mathbf{x})}{\partial x_1},\dots,\frac{\partial f(\mathbf{x})}{\partial x_d}\right)\in\mathbb{R}^d$. Such an $f$ is convex if and only if its domain $\Psi$ is a convex set and for all $\mathbf{x},\mathbf{z}\in\Psi$
$$f(\mathbf{z})\geqslant f(\mathbf{x})+\nabla f(\mathbf{x})^T(\mathbf{z}-\mathbf{x}).$$
1.1.5 Strongly convex function
Let $f:\mathbb{R}^d\mapsto\mathbb{R}$ be a function defined on a convex set. If there exists $\lambda\in\mathbb{R}_{+}$ such that for all $\mathbf{x},\mathbf{z}\in\Psi$
$$f(\theta\mathbf{x}+(1-\theta)\mathbf{z})\leqslant\theta f(\mathbf{x})+(1-\theta)f(\mathbf{z})-\frac{\lambda}{2}\theta(1-\theta)\lVert\mathbf{z}-\mathbf{x}\rVert^2\quad(\forall\ 0\leqslant\theta\leqslant1),$$
then $f(\cdot)$ is called a $\lambda$-strongly convex function. If $f(\cdot)$ is differentiable, this is equivalent to
$$f(\mathbf{z})\geqslant f(\mathbf{x})+\nabla f(\mathbf{x})^T(\mathbf{z}-\mathbf{x})+\frac{\lambda}{2}\lVert\mathbf{z}-\mathbf{x}\rVert^2.$$
How to read strong convexity: the graph of the function must not only lie above its tangent, it must exceed the tangent by at least the quadratic term $\frac{\lambda}{2}\lVert\mathbf{z}-\mathbf{x}\rVert^2$.
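As a rough illustration of the first-order condition, the sketch below evaluates the gap between $f(z)$ and the quadratic lower bound for the one-dimensional function $f(x)=x^2+e^x$. This test function, the sampling interval $[-2,2]$ and the names `first_order_gap` and `min_gap` are assumptions made for this example. Since $f''(x)=2+e^x\geqslant2$, the function is $2$-strongly convex, while $\lambda=10$ is too large on this interval.

```python
import math
import random

def first_order_gap(x, z, lam):
    """f(z) - [f(x) + f'(x)(z - x) + (lam/2)(z - x)^2] for f(x) = x^2 + e^x."""
    f = lambda t: t * t + math.exp(t)
    grad = lambda t: 2.0 * t + math.exp(t)
    return f(z) - (f(x) + grad(x) * (z - x) + 0.5 * lam * (z - x) ** 2)

def min_gap(lam, trials=10_000, seed=0):
    """Smallest sampled gap on [-2, 2]; a negative value refutes lam-strong convexity."""
    rng = random.Random(seed)
    return min(first_order_gap(rng.uniform(-2, 2), rng.uniform(-2, 2), lam)
               for _ in range(trials))

print(min_gap(2.0))   # nonnegative up to rounding
print(min_gap(10.0))  # negative: lambda = 10 is too demanding here
```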
1.1.6 $l$-Lipschitz continuity and $l$-smoothness
Let $f:\mathbb{R}^d\mapsto\mathbb{R}$ be a convex function defined on a convex set. If there exists $l\in\mathbb{R}_{+}$ such that for all $\mathbf{x},\mathbf{z}\in\Psi$
$$f(\mathbf{z})-f(\mathbf{x})\leqslant l\lVert\mathbf{z}-\mathbf{x}\rVert,$$
then $f(\cdot)$ is called $l$-Lipschitz continuous. Because $\mathbf{x}$ and $\mathbf{z}$ can be swapped, this is the same as $|f(\mathbf{z})-f(\mathbf{x})|\leqslant l\lVert\mathbf{z}-\mathbf{x}\rVert$. If the gradient $\nabla f(\cdot)$ of a differentiable function $f(\cdot)$ is $l$-Lipschitz continuous, that is, $\lVert\nabla f(\mathbf{z})-\nabla f(\mathbf{x})\rVert\leqslant l\lVert\mathbf{z}-\mathbf{x}\rVert$, then $f(\cdot)$ is called $l$-smooth.
1.1.7 The Hessian matrix and convexity
For a function $f:\mathbb{R}^d\mapsto\mathbb{R}$ defined on a convex set, the matrix of second derivatives (the Hessian matrix) is written $\nabla^2 f(\mathbf{x})\in\mathbb{R}^{d\times d}$, where $\nabla^2 f(\mathbf{x})_{ij}=\frac{\partial^2 f(\mathbf{x})}{\partial x_i\partial x_j}$. If $f(\cdot)$ is twice differentiable, it is convex if and only if $\nabla^2 f(\mathbf{x})\succeq0$ for every $\mathbf{x}$ in its domain, that is, the Hessian is positive semidefinite.
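For a function of two variables with a constant Hessian, the criterion reduces to checking the eigenvalues of a symmetric $2\times2$ matrix. The sketch below assumes two illustrative functions chosen for this example: $f(x,y)=x^2+xy+y^2$ with Hessian $\begin{pmatrix}2&1\\1&2\end{pmatrix}$, and $g(x,y)=x^2-y^2$ with Hessian $\begin{pmatrix}2&0\\0&-2\end{pmatrix}$. The helper names are hypothetical.

```python
import math

def eigenvalues_sym2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius

def is_psd_sym2(a, b, c, tol=1e-12):
    """True if [[a, b], [b, c]] is positive semidefinite (up to tol)."""
    return eigenvalues_sym2(a, b, c)[0] >= -tol

print(eigenvalues_sym2(2.0, 1.0, 2.0), is_psd_sym2(2.0, 1.0, 2.0))    # Hessian of f
print(eigenvalues_sym2(2.0, 0.0, -2.0), is_psd_sym2(2.0, 0.0, -2.0))  # Hessian of g
```

For general $d$ one would normally rely on a linear algebra library; the closed form above is used only to keep the example dependency-free.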
1.1.8 Conjugate function
The conjugate of a function $f:\mathbb{R}^d\mapsto\mathbb{R}$ is defined as
$$f_*(\mathbf{z})=\sup_{\mathbf{x}\in\Psi}\left(\mathbf{z}^T\mathbf{x}-f(\mathbf{x})\right),$$
with domain
$$\Psi_*=\left\{\mathbf{z}\ \middle|\ \sup_{\mathbf{x}\in\Psi}\left(\mathbf{z}^T\mathbf{x}-f(\mathbf{x})\right)<\infty\right\}.$$
Intuitively, the conjugate function $f_*(\mathbf{z})$ reflects the largest gap between the linear function $\mathbf{z}^T\mathbf{x}$ and $f(\mathbf{x})$. The conjugate also has some useful properties.
The conjugate function $f_*(\mathbf{z})$ is always convex.
If $f$ is differentiable (and convex, so that the supremum is attained at $\mathbf{x}$), then
$$f_*(\nabla f(\mathbf{x}))=\nabla f(\mathbf{x})^T\mathbf{x}-f(\mathbf{x})=-\left[f(\mathbf{x})+\nabla f(\mathbf{x})^T(\mathbf{0}-\mathbf{x})\right].$$
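A crude way to see the definition at work is to approximate the supremum on a grid. The sketch below assumes the one-dimensional example $f(x)=x^2/2$, whose conjugate is $f_*(z)=z^2/2$; the grid range and the name `conjugate_on_grid` are illustrative choices, and the grid must be wide enough to contain the maximiser $x=z$.

```python
def conjugate_on_grid(f, z, xs):
    """Approximate f_*(z) = sup_x (z*x - f(x)) by a maximum over the grid xs."""
    return max(z * x - f(x) for x in xs)

xs = [i / 1000 for i in range(-10_000, 10_001)]  # grid on [-10, 10], step 0.001
f = lambda x: 0.5 * x * x

for z in (-2.0, 0.0, 1.5):
    # second and third columns should agree: grid value vs closed form z^2 / 2
    print(z, conjugate_on_grid(f, z, xs), 0.5 * z * z)
```

For this $f$ we have $\nabla f(x)=x$, so the identity above reads $f_*(x)=x\cdot x-x^2/2=x^2/2$, consistent with the closed form.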
1.2 Important Inequalities
1.2.1 Jensen's inequality
For any convex function $f(\cdot)$,
$$f(\mathbb{E}[X])\leqslant\mathbb{E}[f(X)].$$
Taking $f(x)=x^2$, Jensen's inequality gives $(\mathbb{E}[X])^2\leqslant\mathbb{E}[X^2]$.
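Jensen's inequality holds for every distribution, so it also holds exactly for the empirical distribution of any sample. The sketch below checks this for $f(x)=x^2$ and $f(x)=e^x$ on simulated Gaussian data; the distribution, seed and variable names are assumptions made for this example.

```python
import math
import random

rng = random.Random(0)
sample = [rng.gauss(1.0, 2.0) for _ in range(10_000)]

def mean(values):
    return sum(values) / len(values)

# f(E[X]) versus E[f(X)] under the empirical distribution of the sample
square_lhs, square_rhs = mean(sample) ** 2, mean([x * x for x in sample])
exp_lhs, exp_rhs = math.exp(mean(sample)), mean([math.exp(x) for x in sample])

print(square_lhs, "<=", square_rhs)
print(exp_lhs, "<=", exp_rhs)
```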
1.2.2 Hölder's inequality
For $p,q\in\mathbb{R}_+$ with $\frac{1}{p}+\frac{1}{q}=1$,
$$\mathbb{E}[|XY|]\leqslant(\mathbb{E}[|X|^p])^{\frac{1}{p}}(\mathbb{E}[|Y|^q])^{\frac{1}{q}}.$$
Proof:
Lemma 1 (Young's inequality). Let $a,b,p,q$ be positive real numbers with $\frac{1}{p}+\frac{1}{q}=1$. Then $ab\leqslant\frac{a^p}{p}+\frac{b^q}{q}$.
Indeed, since $e^x$ is convex, Jensen's inequality gives
$$ab=e^{\ln a}e^{\ln b}=e^{\frac{1}{p}\ln a^p+\frac{1}{q}\ln b^q}\leqslant\frac{1}{p}e^{\ln a^p}+\frac{1}{q}e^{\ln b^q}=\frac{a^p}{p}+\frac{b^q}{q}.$$
Back to the main proof. Write $f$ for $X$ and $g$ for $Y$, with $\lVert f\rVert_p=(\mathbb{E}[|X|^p])^{\frac{1}{p}}$ and $\lVert g\rVert_q=(\mathbb{E}[|Y|^q])^{\frac{1}{q}}$. The claim is equivalent to
$$\frac{\mathbb{E}[|XY|]}{(\mathbb{E}[|X|^p])^{\frac{1}{p}}(\mathbb{E}[|Y|^q])^{\frac{1}{q}}}\leqslant1,$$
and this ratio does not change when $X$ and $Y$ are rescaled, so we may assume $\lVert f\rVert_p=\lVert g\rVert_q=1$. Applying Young's inequality pointwise and integrating both sides,
$$\begin{aligned}
|f(s)g(s)|&\leqslant\frac{|f(s)|^p}{p}+\frac{|g(s)|^q}{q},\\
\lVert fg\rVert_1&\leqslant\frac{\lVert f\rVert_p^p}{p}+\frac{\lVert g\rVert_q^q}{q}=\frac{1}{p}+\frac{1}{q}=1.
\end{aligned}$$
Q.E.D.
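The two sides of Hölder's inequality can be compared on the empirical distribution of paired samples, for which the inequality holds exactly. In the sketch below the data, the seed, the chosen exponents and the name `holder_sides` are assumptions for illustration only.

```python
import random

def holder_sides(xs, ys, p):
    """Return (E|XY|, (E|X|^p)^(1/p) * (E|Y|^q)^(1/q)) with 1/p + 1/q = 1, p > 1."""
    q = p / (p - 1.0)
    n = len(xs)
    lhs = sum(abs(x * y) for x, y in zip(xs, ys)) / n
    norm_x = (sum(abs(x) ** p for x in xs) / n) ** (1.0 / p)
    norm_y = (sum(abs(y) ** q for y in ys) / n) ** (1.0 / q)
    return lhs, norm_x * norm_y

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5_000)]
ys = [rng.gauss(0.0, 3.0) for _ in range(5_000)]

for p in (1.5, 2.0, 3.0):
    print(p, holder_sides(xs, ys, p))  # first entry never exceeds the second
```

With $p=q=2$ the same function evaluates both sides of the Cauchy-Schwarz inequality.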
1.2.3 Cauchy-Schwarz inequality
$$\mathbb{E}[|XY|]\leqslant\sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}$$
For any vectors $x,y\in\mathbb{R}^d$,
$$|x^Ty|\leqslant\lVert x\rVert\lVert y\rVert.$$
For any vectors $x,y\in\mathbb{R}^d$ and any positive definite matrix $\mathbf{A}\in\mathbb{R}^{d\times d}$,
$$|x^Ty|\leqslant\lVert x\rVert_{\mathbf{A}}\lVert y\rVert_{\mathbf{A}^{-1}},$$
where $\lVert x\rVert_{\mathbf{A}}=\sqrt{x^T\mathbf{A}x}$.
Proof of the last form. Diagonalise $A=O^TDO$ with $O$ orthogonal and $D$ diagonal with positive entries, and set $P=\sqrt{D}O$, $Q=\sqrt{D^{-1}}O$, so that $P^TQ=I$, $P^TP=A$ and $Q^TQ=A^{-1}$. Then
$$|x^Ty|=|x^TP^TQy|=|(Px)^T(Qy)|\leqslant\lVert Px\rVert\lVert Qy\rVert=\sqrt{x^TAx\cdot y^TA^{-1}y}.$$
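The matrix-weighted form can be spot-checked for a small example. The sketch below assumes a particular symmetric positive definite matrix $\mathbf{A}=\begin{pmatrix}2&1\\1&3\end{pmatrix}$ chosen for illustration and inverts it by the $2\times2$ closed form; `quad_norm` and `worst_violation` are hypothetical helper names.

```python
import math
import random

A = [[2.0, 1.0], [1.0, 3.0]]  # symmetric positive definite (illustrative choice)
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det, A[0][0] / det]]

def quad_norm(M, v):
    """sqrt(v^T M v) for a 2x2 matrix M and a 2-vector v."""
    Mv = [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]
    return math.sqrt(v[0] * Mv[0] + v[1] * Mv[1])

def worst_violation(trials=1_000, seed=0):
    """Largest sampled value of |x^T y| - ||x||_A * ||y||_{A^-1}; should be <= 0."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        lhs = abs(x[0] * y[0] + x[1] * y[1])
        rhs = quad_norm(A, x) * quad_norm(A_inv, y)
        worst = max(worst, lhs - rhs)
    return worst

print(worst_violation())
```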
1.2.4 Lyapunov's inequality
For $0<r\leqslant s$,
$$(\mathbb{E}[|X|^r])^{\frac{1}{r}}\leqslant(\mathbb{E}[|X|^s])^{\frac{1}{s}}.$$
Proof. Raising both sides to the power $s$, the claim is equivalent to
$$(\mathbb{E}[|X|^r])^{\frac{s}{r}}\leqslant\mathbb{E}[|X|^s].$$
Since $\frac{s}{r}\geqslant1$, the function $f(x)=|x|^{\frac{s}{r}}$ is convex (when $s=r$ the claim is a trivial equality). Applying Jensen's inequality to the random variable $|X|^r$ gives $f(\mathbb{E}[|X|^r])\leqslant\mathbb{E}[f(|X|^r)]=\mathbb{E}[|X|^s]$, which is the claim.
1.2.5 Minkowski's inequality
For $1\leqslant p$,
$$(\mathbb{E}[|X+Y|^p])^{\frac{1}{p}}\leqslant(\mathbb{E}[|X|^p])^{\frac{1}{p}}+(\mathbb{E}[|Y|^p])^{\frac{1}{p}}.$$
Proof. For $p=1$ this is the triangle inequality, so take $p>1$ and let $q$ satisfy $\frac{1}{p}+\frac{1}{q}=1$. Write the expectation as an integral, with $f$ and $g$ standing for $X$ and $Y$:
$$\begin{aligned}
\mathbb{E}[|X+Y|^p]&=\int_a^b|f(x)+g(x)|^p{\rm d}x=\int_a^b|f(x)+g(x)||f(x)+g(x)|^{p-1}{\rm d}x\\
&\leqslant\int_a^b|f(x)||f(x)+g(x)|^{p-1}{\rm d}x+\int_a^b|g(x)||f(x)+g(x)|^{p-1}{\rm d}x\quad(\text{triangle inequality})\\
&\leqslant\left(\int_a^b|f(x)|^p{\rm d}x\right)^{\frac{1}{p}}\left(\int_a^b|f(x)+g(x)|^{q(p-1)}{\rm d}x\right)^{\frac{1}{q}}+\left(\int_a^b|g(x)|^p{\rm d}x\right)^{\frac{1}{p}}\left(\int_a^b|f(x)+g(x)|^{q(p-1)}{\rm d}x\right)^{\frac{1}{q}}\quad(\text{use Hölder})\\
&=\left[\left(\int_a^b|f(x)|^p{\rm d}x\right)^{\frac{1}{p}}+\left(\int_a^b|g(x)|^p{\rm d}x\right)^{\frac{1}{p}}\right]\left(\int_a^b|f(x)+g(x)|^{q(p-1)}{\rm d}x\right)^{\frac{1}{q}}\\
&=\left[\left(\int_a^b|f(x)|^p{\rm d}x\right)^{\frac{1}{p}}+\left(\int_a^b|g(x)|^p{\rm d}x\right)^{\frac{1}{p}}\right]\left(\int_a^b|f(x)+g(x)|^{p}{\rm d}x\right)^{\frac{1}{q}}\quad(q(p-1)=pq-q=p)
\end{aligned}$$
Dividing both sides by $\left(\int_a^b|f(x)+g(x)|^p{\rm d}x\right)^{\frac{1}{q}}$ (the claim is trivial when this is zero) and using $1-\frac{1}{q}=\frac{1}{p}$ gives
$$\begin{aligned}
(\mathbb{E}[|X+Y|^p])^{\frac{1}{p}}&=\left(\int_a^b|f(x)+g(x)|^p{\rm d}x\right)^{\frac{1}{p}}\leqslant\left(\int_a^b|f(x)|^p{\rm d}x\right)^{\frac{1}{p}}+\left(\int_a^b|g(x)|^p{\rm d}x\right)^{\frac{1}{p}}\\
&=(\mathbb{E}[|X|^p])^{\frac{1}{p}}+(\mathbb{E}[|Y|^p])^{\frac{1}{p}}.
\end{aligned}$$
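Minkowski's inequality says that $X\mapsto(\mathbb{E}[|X|^p])^{1/p}$ satisfies the triangle inequality, and Lyapunov's inequality from 1.2.4 says that this quantity is nondecreasing in $p$. Both can be checked on empirical distributions, where they hold exactly. The data, seed and the name `lp_norm` below are assumptions made for this sketch.

```python
import random

def lp_norm(values, p):
    """(E|X|^p)^(1/p) under the empirical distribution of `values`."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5_000)]
ys = [rng.uniform(-2.0, 2.0) for _ in range(5_000)]
sums = [x + y for x, y in zip(xs, ys)]

for p in (1.0, 2.0, 3.5):
    # Minkowski: norm of the sum versus sum of the norms
    print(p, lp_norm(sums, p), "<=", lp_norm(xs, p) + lp_norm(ys, p))

# Lyapunov: the norms increase with the exponent
print(lp_norm(xs, 1.0), "<=", lp_norm(xs, 2.0), "<=", lp_norm(xs, 4.0))
```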
1.2.6 Bhatia-Davis inequality
For $X\in[a,b]$,
$$\mathbb{D}[X]\leqslant(b-\mathbb{E}[X])(\mathbb{E}[X]-a)\leqslant\frac{(b-a)^2}{4}.$$
Proof. For the left inequality, $X\in[a,b]$ means $\left|X-\frac{a+b}{2}\right|\leqslant\frac{b-a}{2}$, so clearly $\mathbb{E}\left[\left(X-\frac{a+b}{2}\right)^2\right]\leqslant\frac{(b-a)^2}{4}$. Subtracting the right-hand side from the left-hand side,
$$\mathbb{E}\left[\left(X-\frac{a+b}{2}\right)^2\right]-\frac{(b-a)^2}{4}=\mathbb{E}[X^2]-(b+a)\mathbb{E}[X]+ba=\mathbb{D}[X]-(b-\mathbb{E}[X])(\mathbb{E}[X]-a)\leqslant0.$$
For the right inequality, the AM-GM inequality gives
$$(b-\mathbb{E}[X])(\mathbb{E}[X]-a)\leqslant\left(\frac{(b-\mathbb{E}[X])+(\mathbb{E}[X]-a)}{2}\right)^2=\frac{(b-a)^2}{4}.$$
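The chain of two inequalities can be evaluated on the empirical distribution of a bounded sample, using the population variance (division by $n$). The sketch below draws from a rescaled Beta distribution on $[a,b]=[1,4]$; this distribution, the seed and the name `bhatia_davis_chain` are illustrative assumptions.

```python
import random

def bhatia_davis_chain(sample, a, b):
    """Return (D[X], (b - E[X])(E[X] - a), (b - a)^2 / 4) for the empirical distribution."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    return var, (b - mean) * (mean - a), (b - a) ** 2 / 4.0

rng = random.Random(0)
a, b = 1.0, 4.0
sample = [a + (b - a) * rng.betavariate(2.0, 5.0) for _ in range(5_000)]

var, middle, top = bhatia_davis_chain(sample, a, b)
print(var, "<=", middle, "<=", top)
```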
1.2.7 Union bound
For any two events $X$ and $Y$,
$$P(X\cup Y)\leqslant P(X)+P(Y).$$
1.2.8 Markov's inequality
For $X\geqslant0$ and all $\epsilon>0$,
$$P(X\geqslant\epsilon)\leqslant\frac{\mathbb{E}[X]}{\epsilon}.$$
Proof. Since $X\geqslant0$, the integral over $\{x<\epsilon\}$ is nonnegative, so
$$\begin{aligned}
\mathbb{E}[X]&=\int_{\mathbb{R}}x\,{\rm d}F(x)=\int_{x\geqslant\epsilon}x\,{\rm d}F(x)+\int_{x<\epsilon}x\,{\rm d}F(x)\geqslant\int_{x\geqslant\epsilon}\epsilon\,{\rm d}F(x)=\epsilon P(X\geqslant\epsilon)\\
&\rightarrow\frac{\mathbb{E}[X]}{\epsilon}\geqslant P(X\geqslant\epsilon).
\end{aligned}$$
1.2.9 Chebyshev's inequality
For all $\epsilon>0$,
$$P(|X-\mathbb{E}[X]|\geqslant\epsilon)\leqslant\frac{\mathbb{D}[X]}{\epsilon^2}.$$
Proof. Apply Markov's inequality to $(X-\mathbb{E}[X])^2$:
$$P(|X-\mathbb{E}[X]|\geqslant\epsilon)=P(|X-\mathbb{E}[X]|^2\geqslant\epsilon^2)\leqslant\frac{\mathbb{E}[(X-\mathbb{E}[X])^2]}{\epsilon^2}=\frac{\mathbb{D}[X]}{\epsilon^2}.$$
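Markov's and Chebyshev's inequalities can be compared with empirical tail frequencies. Because both hold for every distribution, they hold exactly for the empirical distribution of a sample. The sketch below assumes exponentially distributed data and hypothetical helper names `markov_pair` and `chebyshev_pair`.

```python
import random

rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(10_000)]  # nonnegative data
n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / n

def markov_pair(eps):
    """(empirical P(X >= eps), Markov bound E[X] / eps)."""
    return sum(x >= eps for x in sample) / n, mean / eps

def chebyshev_pair(eps):
    """(empirical P(|X - E[X]| >= eps), Chebyshev bound D[X] / eps^2)."""
    return sum(abs(x - mean) >= eps for x in sample) / n, var / eps ** 2

for eps in (0.5, 1.0, 2.0, 4.0):
    print(eps, markov_pair(eps), chebyshev_pair(eps))
```

Bounds larger than $1$ are still valid, just uninformative.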
1.2.10 Cantelli's inequality
For all $\epsilon>0$,
$$P(X-\mathbb{E}[X]\geqslant\epsilon)\leqslant\frac{\mathbb{D}[X]}{\mathbb{D}[X]+\epsilon^2},\qquad P(X-\mathbb{E}[X]\leqslant-\epsilon)\leqslant\frac{\mathbb{D}[X]}{\mathbb{D}[X]+\epsilon^2}.$$
Proof. Let $Y=X-\mathbb{E}[X]$, so that $\mathbb{E}[Y]=0$ and $\mathbb{D}[Y]=\mathbb{D}[X]$. For any $\lambda\geqslant0$,
$$\begin{aligned}
0&\leqslant\int_{Y<\epsilon}(Y+\lambda)^2{\rm d}F(Y)=\int_{Y<\epsilon}(Y^2+2\lambda Y+\lambda^2){\rm d}F(Y),\\
\int_{Y<\epsilon}Y^2{\rm d}F(Y)&=\mathbb{D}[Y]-\int_{Y\geqslant\epsilon}Y^2{\rm d}F(Y)\leqslant\mathbb{D}[Y]-\epsilon^2P(Y\geqslant\epsilon),\\
\int_{Y<\epsilon}Y{\rm d}F(Y)&=-\int_{Y\geqslant\epsilon}Y{\rm d}F(Y)\leqslant-\epsilon P(Y\geqslant\epsilon),\\
\int_{Y<\epsilon}{\rm d}F(Y)&=1-P(Y\geqslant\epsilon).
\end{aligned}$$
Combining these,
$$0\leqslant\int_{Y<\epsilon}(Y+\lambda)^2{\rm d}F(Y)\leqslant\mathbb{D}[Y]+\lambda^2-P(Y\geqslant\epsilon)(\epsilon^2+2\lambda\epsilon+\lambda^2)=\mathbb{D}[Y]+\lambda^2-P(Y\geqslant\epsilon)(\epsilon+\lambda)^2,$$
and therefore
$$P(Y\geqslant\epsilon)\leqslant\inf_{\lambda\geqslant0}\frac{\mathbb{D}[Y]+\lambda^2}{(\epsilon+\lambda)^2}=\inf_{\lambda\geqslant0}\left[\frac{(\mathbb{D}[Y]-\epsilon\lambda)^2}{(\epsilon+\lambda)^2(\mathbb{D}[Y]+\epsilon^2)}+\frac{\mathbb{D}[Y]}{\mathbb{D}[Y]+\epsilon^2}\right]=\frac{\mathbb{D}[Y]}{\mathbb{D}[Y]+\epsilon^2},$$
with the infimum attained at $\lambda=\mathbb{D}[Y]/\epsilon$. Applying the same argument to $-Y$ gives $P(Y\leqslant-\epsilon)\leqslant\frac{\mathbb{D}[Y]}{\mathbb{D}[Y]+\epsilon^2}$. Q.E.D.
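The optimisation over $\lambda$ in the proof can be reproduced on a grid, and the resulting one-sided bound compared with Chebyshev's $\mathbb{D}[X]/\epsilon^2$. The values $\mathbb{D}[X]=1$, $\epsilon=2$, the grid and the function names below are assumptions chosen for this sketch.

```python
def chebyshev_bound(var, eps):
    return var / eps ** 2

def cantelli_bound(var, eps):
    return var / (var + eps ** 2)

def cantelli_objective(var, eps, lam):
    """The quantity (D + lam^2) / (eps + lam)^2 minimised over lam >= 0 in the proof."""
    return (var + lam ** 2) / (eps + lam) ** 2

var, eps = 1.0, 2.0
grid = [i / 1000 for i in range(0, 5001)]  # lam in [0, 5], contains lam* = var/eps = 0.5
best = min(cantelli_objective(var, eps, lam) for lam in grid)

print(best, cantelli_bound(var, eps), chebyshev_bound(var, eps))
```

For a one-sided tail the Cantelli bound is never larger than the Chebyshev bound, since $\mathbb{D}[X]+\epsilon^2\geqslant\epsilon^2$.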
1.2.11 Chernoff's inequality
$$\forall t>0,\quad P(X\geqslant\epsilon)=P(e^{tX}\geqslant e^{t\epsilon})\leqslant\frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}}$$
$$\forall t<0,\quad P(X\leqslant\epsilon)=P(e^{tX}\geqslant e^{t\epsilon})\leqslant\frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}}$$
Both bounds are Markov's inequality applied to $e^{tX}$.
Its multivariate form: for $m$ i.i.d. random variables $X_i\in\{0,1\}$, $i\in[m]$, let $\bar{X}=\frac{1}{m}\sum_{i=1}^mX_i$. Then for $r\in[0,1]$,
$$P(\bar{X}\geqslant(1+r)\mathbb{E}[\bar{X}])\leqslant e^{-mr^2\mathbb{E}[\bar{X}]/3}$$
$$P(\bar{X}\leqslant(1-r)\mathbb{E}[\bar{X}])\leqslant e^{-mr^2\mathbb{E}[\bar{X}]/2}$$
Only the multivariate form is proved here.
Let $P(X_i=1)=p_i$ and $\mathbb{E}[\bar{X}]=\mu$, so that $\sum_{i=1}^mp_i=m\mu$. By the univariate Chernoff inequality, for all $t>0$,
$$\begin{aligned}
P(\bar{X}\geqslant(1+r)\mu)&\leqslant e^{-t(1+r)\mu}\mathbb{E}[e^{t\bar{X}}]\\
&=e^{-t(1+r)\mu}\mathbb{E}\left[\prod_{i=1}^me^{\frac{tX_i}{m}}\right]=e^{-t(1+r)\mu}\prod_{i=1}^m\mathbb{E}\left[e^{\frac{tX_i}{m}}\right]=e^{-t(1+r)\mu}\prod_{i=1}^m\left(1-p_i+p_ie^{\frac{t}{m}}\right)\\
&\leqslant e^{-t(1+r)\mu}\prod_{i=1}^m\exp\left(p_i\left(e^{\frac{t}{m}}-1\right)\right)\quad(\text{use}\ 1+x\leqslant e^x)\\
&=\exp\left(-t\mu(1+r)+m\mu\left(e^{\frac{t}{m}}-1\right)\right).
\end{aligned}$$
The right-hand side is minimised at $t=m\ln(1+r)$, where it equals $\left(\frac{e^r}{(1+r)^{(1+r)}}\right)^{m\mu}$. Noting that $\frac{2r}{2+r}\leqslant\ln(1+r)$, we get
$$\left(\frac{e^r}{(1+r)^{(1+r)}}\right)^{m\mu}\leqslant e^{-\frac{mr^2\mu}{2+r}}\leqslant e^{-\frac{mr^2\mu}{3}}.$$
For the other inequality, for all $t>0$,
$$\begin{aligned}
P(\bar{X}\leqslant(1-r)\mu)&=P(-\bar{X}\geqslant-(1-r)\mu)\\
&\leqslant e^{t(1-r)\mu}\mathbb{E}[e^{-t\bar{X}}]\quad(\text{use the univariate Chernoff inequality})\\
&=e^{t(1-r)\mu}\prod_{i=1}^m\left(1-p_i+p_ie^{-\frac{t}{m}}\right)\leqslant\exp\left(t\mu(1-r)+m\mu\left(e^{-\frac{t}{m}}-1\right)\right)\quad(\text{use}\ 1+x\leqslant e^x).
\end{aligned}$$
For $r<1$ this is minimised at $t=-m\ln(1-r)$, where it equals $\left(\frac{e^{-r}}{(1-r)^{(1-r)}}\right)^{m\mu}$. The inequality $s-\frac{1}{s}\leqslant2\ln s$ holds for $s\in(0,1]$; applying it with $s=1-r$ gives $(1-r)\ln(1-r)\geqslant\frac{r^2}{2}-r$, hence
$$P(\bar{X}\leqslant(1-r)\mu)\leqslant\left(\frac{e^{-r}}{(1-r)^{(1-r)}}\right)^{m\mu}\leqslant e^{-\frac{mr^2\mu}{2}}.$$
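For Bernoulli variables the tail probabilities are binomial and can be computed exactly, which makes it possible to see how loose each step of the proof is. The sketch below assumes illustrative values $m=100$, $p=0.3$, $r=0.5$; the function names are hypothetical, and only `math.comb`, `math.exp`, `math.ceil` and `math.floor` from the standard library are used.

```python
import math

def binom_upper_tail(m, p, k):
    """Exact P(S >= k) for S ~ Binomial(m, p)."""
    return sum(math.comb(m, j) * p ** j * (1 - p) ** (m - j) for j in range(k, m + 1))

def binom_lower_tail(m, p, k):
    """Exact P(S <= k) for S ~ Binomial(m, p)."""
    return sum(math.comb(m, j) * p ** j * (1 - p) ** (m - j) for j in range(0, k + 1))

m, p, r = 100, 0.3, 0.5  # illustrative values; mu = E[X_bar] = p
mu = p

# Upper tail: exact probability, optimised Chernoff bound, simplified bound
upper_exact = binom_upper_tail(m, p, math.ceil((1 + r) * m * mu - 1e-9))
upper_tight = (math.exp(r) / (1 + r) ** (1 + r)) ** (m * mu)
upper_loose = math.exp(-m * r * r * mu / 3)

# Lower tail: same three quantities
lower_exact = binom_lower_tail(m, p, math.floor((1 - r) * m * mu + 1e-9))
lower_tight = (math.exp(-r) / (1 - r) ** (1 - r)) ** (m * mu)
lower_loose = math.exp(-m * r * r * mu / 2)

print(upper_exact, upper_tight, upper_loose)
print(lower_exact, lower_tight, lower_loose)
```

In each printed row the values should be nondecreasing from left to right.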
1.2.12 Hoeffding's inequality
For $m$ independent random variables $X_i\in[0,1]$, $i\in[m]$, let $\bar{X}=\frac{1}{m}\sum_{i=1}^mX_i$. Then
$$P(\bar{X}-\mathbb{E}[\bar{X}]\geqslant\epsilon)\leqslant e^{-2m\epsilon^2}.$$
Another form of Hoeffding's inequality: setting $\delta=e^{-2m\epsilon^2}$, with probability at least $1-\delta$,
$$\bar{X}\leqslant\mathbb{E}[\bar{X}]+\sqrt{\frac{1}{2m}\ln\frac{1}{\delta}}.$$
For $X_i\in[a,b]$, $i\in[m]$, we obtain the more general form
$$P(\bar{X}-\mathbb{E}[\bar{X}]\geqslant\epsilon)\leqslant e^{-\frac{2m\epsilon^2}{(b-a)^2}}$$
$$P(\bar{X}-\mathbb{E}[\bar{X}]\leqslant-\epsilon)\leqslant e^{-\frac{2m\epsilon^2}{(b-a)^2}}$$
Proof:
We first introduce a lemma (Hoeffding's lemma): let $X$ be a bounded random variable with $X\in[a,b]$. Then $\mathbb{E}[e^{\lambda(X-\mathbb{E}[X])}]\leqslant\exp\left(\frac{\lambda^2(b-a)^2}{8}\right)$.
Proof of the lemma. First consider the case $\mathbb{E}[X]=0$. Since $f(x)=e^{\lambda x}$ is convex, the two-point form of Jensen's inequality gives, for any $s\in[0,1]$,
$$f(sa+(1-s)b)\leqslant sf(a)+(1-s)f(b).$$
Substituting $s=\frac{b-X}{b-a}$, so that $sa+(1-s)b=X$, and then taking expectations with $\mathbb{E}[X]=0$,
$$\begin{aligned}
e^{\lambda X}&\leqslant\frac{b-X}{b-a}e^{\lambda a}+\frac{X-a}{b-a}e^{\lambda b}\\
\mathbb{E}[e^{\lambda X}]&\leqslant\frac{b\cdot e^{\lambda a}-a\cdot e^{\lambda b}}{b-a}=(1-\theta)e^{\lambda a}+\theta e^{\lambda b}\quad\left(\text{let}\ \theta=-\frac{a}{b-a}\right)\\
&=\left(1-\theta+\theta\cdot e^{\lambda(b-a)}\right)e^{-\theta\lambda(b-a)}=\exp\left(\ln\left(1-\theta+\theta e^h\right)-\theta h\right)\quad(\text{let}\ h=\lambda(b-a)).
\end{aligned}$$
Now let $L(h)=\ln(1-\theta+\theta e^h)-h\theta$. Note that $L(0)=L'(0)=0$ and
$$L''(x)=\frac{\theta\cdot e^x}{1-\theta+\theta\cdot e^x}\left(1-\frac{\theta\cdot e^x}{1-\theta+\theta\cdot e^x}\right)\leqslant\frac{1}{4}.$$
By Taylor's theorem with the Lagrange remainder, there exists $\phi$ between $0$ and $h$ such that $L(h)=L(0)+L'(0)h+\frac{1}{2}L''(\phi)h^2\leqslant\frac{1}{8}h^2$, hence $\mathbb{E}[e^{\lambda X}]\leqslant\exp\left(\frac{\lambda^2(b-a)^2}{8}\right)$.
Next consider the case $\mathbb{E}[X]\ne0$. Let $\hat{X}=X-\mathbb{E}[X]$, so that $\hat{X}\in[a-\mathbb{E}[X],b-\mathbb{E}[X]]:=[\hat{a},\hat{b}]$, and let $\hat{h}:=\lambda(\hat{b}-\hat{a})=\lambda(b-a)=h$. This reduces to the case $\mathbb{E}[X]=0$, which proves the lemma.
Corollary: if $X\in[a,b]$ satisfies $\mathbb{E}[X|\mathcal{F}]=0$, the same steps give
$$\mathbb{E}[e^{\lambda X}|\mathcal{F}]\leqslant\exp\left(\frac{1}{8}\lambda^2(b-a)^2\right).$$
Back to the original inequality, with $X_i\in[0,1]$ and any $\lambda>0$:
$$\begin{aligned}
P(\bar{X}-\mathbb{E}[\bar{X}]\geqslant\epsilon)&\leqslant e^{-\lambda\epsilon}\mathbb{E}[e^{\lambda(\bar{X}-\mathbb{E}[\bar{X}])}]\quad(\text{use the univariate Chernoff inequality})\\
&=e^{-\lambda\epsilon}\prod_{i=1}^m\mathbb{E}\left[e^{\frac{\lambda}{m}(X_i-\mathbb{E}[X_i])}\right]\leqslant e^{-\lambda\epsilon}\prod_{i=1}^m\exp\left(\frac{\lambda^2}{8m^2}\right)\quad(\text{use the lemma above})\\
&=\exp\left(\frac{m\lambda^2}{8m^2}-\lambda\epsilon\right).
\end{aligned}$$
Since $\lambda>0$ is arbitrary,
$$P(\bar{X}-\mathbb{E}[\bar{X}]\geqslant\epsilon)\leqslant\min_{\lambda>0}\exp\left(\frac{\lambda^2}{8m}-\lambda\epsilon\right)=e^{-2m\epsilon^2},$$
where the minimum is attained at $\lambda=4m\epsilon$. For $X_i\in[a,b]$ the lemma contributes the factor $\exp\left(\frac{\lambda^2(b-a)^2}{8m^2}\right)$ instead, which leads to $e^{-\frac{2m\epsilon^2}{(b-a)^2}}$; the lower-tail bound follows by applying the same argument to $-X_i$.
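In practice the $1-\delta$ form is often used to answer two questions: how wide is the deviation term for a given sample size, and how many samples are needed for a target accuracy. The sketch below simply solves $e^{-2m\epsilon^2/(b-a)^2}=\delta$ for $\epsilon$ or for $m$; the function names and the `width` parameter (standing for $b-a$) are naming choices made here.

```python
import math

def hoeffding_radius(m, delta, width=1.0):
    """eps solving exp(-2 m eps^2 / width^2) = delta, for X_i in an interval of length `width`."""
    return width * math.sqrt(math.log(1.0 / delta) / (2.0 * m))

def hoeffding_sample_size(eps, delta, width=1.0):
    """Smallest integer m with exp(-2 m eps^2 / width^2) <= delta."""
    return math.ceil(width ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))

m_needed = hoeffding_sample_size(0.05, 0.05)
print(m_needed, hoeffding_radius(m_needed, 0.05))
```

This is a one-sided statement; a two-sided guarantee would combine both tails through the union bound, replacing $\delta$ by $\delta/2$.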
1.2.13 McDiarmid's inequality
For $m$ independent random variables $X_i\in\mathcal{X}$, $i\in[m]$, let $f:\mathcal{X}^m\rightarrow\mathbb{R}$ be a real-valued function of the $X_i$ such that for all $x_1,\dots,x_m,x_i'\in\mathcal{X}$
$$|f(x_1,\dots,x_i,\dots,x_m)-f(x_1,\dots,x_i',\dots,x_m)|\leqslant c_i.$$
Then for all $\epsilon>0$,
$$P(f(X_1,\dots,X_m)-\mathbb{E}[f(X_1,\dots,X_m)]\geqslant\epsilon)\leqslant e^{-\frac{2\epsilon^2}{\sum_{i=1}^mc_i^2}}$$
$$P(f(X_1,\dots,X_m)-\mathbb{E}[f(X_1,\dots,X_m)]\leqslant-\epsilon)\leqslant e^{-\frac{2\epsilon^2}{\sum_{i=1}^mc_i^2}}$$
Proof.
Let $\mathcal{F}_i$ be the $\sigma$-algebra generated by $X_1,\dots,X_i$, and take $Z_i=\mathbb{E}[f|\mathcal{F}_i]$, $Z_0=\mathbb{E}[f]$. By the law of total expectation, $\mathbb{E}[Z_i]=\mathbb{E}[\mathbb{E}[f|\mathcal{F}_i]]=\mathbb{E}[f]<\infty$ and $\mathbb{E}[Z_i|\mathcal{F}_{i-1}]=\mathbb{E}[\mathbb{E}[f|\mathcal{F}_i]|\mathcal{F}_{i-1}]=\mathbb{E}[f|\mathcal{F}_{i-1}]=Z_{i-1}$, so $\{Z_i\}$ is a martingale.
Next we bound $Z_i-Z_{i-1}=\mathbb{E}[f|\mathcal{F}_i]-\mathbb{E}[f|\mathcal{F}_{i-1}]$. Let
$$U_i:=\sup_{x\in\mathcal{X}}\mathbb{E}[f|\mathcal{F}_{i-1},X_i=x]-\mathbb{E}[f|\mathcal{F}_{i-1}]$$
$$L_i:=\inf_{x\in\mathcal{X}}\mathbb{E}[f|\mathcal{F}_{i-1},X_i=x]-\mathbb{E}[f|\mathcal{F}_{i-1}]$$
Then $L_i\leqslant Z_i-Z_{i-1}\leqslant U_i$.
Moreover, since the $X_i$ are independent,
$$\begin{aligned}
U_i-L_i&=\sup_{x_u,x_l\in\mathcal{X}}\left(\mathbb{E}[f|\mathcal{F}_{i-1},X_i=x_u]-\mathbb{E}[f|\mathcal{F}_{i-1},X_i=x_l]\right)\\
&=\sup_{x_u,x_l\in\mathcal{X}}\int_{\mathcal{X}_{i+1}\times\cdots\times\mathcal{X}_m}\left[f(X_1,\cdots,X_{i-1},x_u,X_{i+1},\cdots,X_m)-f(X_1,\cdots,X_{i-1},x_l,X_{i+1},\cdots,X_m)\right]{\rm d}P_{X_{i+1},\cdots,X_m}(X_{i+1},\cdots,X_m)\\
&\leqslant\int c_i\,{\rm d}P=c_i.
\end{aligned}$$
By the strengthened form of Azuma's inequality (for martingale differences confined to intervals of length $c_i$), and since $Z_m=f$,
$$P(Z_m-Z_0\geqslant\epsilon)=P(f-\mathbb{E}[f]\geqslant\epsilon)\leqslant e^{-\frac{2\epsilon^2}{\sum_{i=1}^mc_i^2}}.$$
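A quick consistency check: for the sample mean of variables in $[0,1]$, changing one coordinate moves $f$ by at most $c_i=\frac{1}{m}$, and McDiarmid's bound then reduces to Hoeffding's $e^{-2m\epsilon^2}$. The sketch below evaluates both expressions; the values of $m$ and $\epsilon$ and the name `mcdiarmid_bound` are illustrative assumptions.

```python
import math

def mcdiarmid_bound(eps, cs):
    """exp(-2 eps^2 / sum_i c_i^2) for bounded-difference constants cs."""
    return math.exp(-2.0 * eps ** 2 / sum(c * c for c in cs))

m, eps = 200, 0.1
bound_for_mean = mcdiarmid_bound(eps, [1.0 / m] * m)  # f = sample mean, c_i = 1/m
hoeffding = math.exp(-2.0 * m * eps ** 2)

print(bound_for_mean, hoeffding)
```

The same function applies to any statistic with known bounded differences, which is what makes the inequality convenient for generalisation bounds.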
1.2.14 Bennett 不等式
对m个独立同分布的随机变量 X i , i ∈ [ m ] X_i,i\in[m] X i , i ∈ [ m ] ,令 X ˉ = 1 m ∑ i = 1 m X i \bar{X}= \frac{1}{m}\sum_{i=1}^{m}X_i X ˉ = m 1 ∑ i = 1 m X i ,若 X i − E [ X i ] ⩽ 1 X_i-\mathbb{E}[X_i]\leqslant1 X i − E [ X i ] ⩽ 1 ,有
P ( X ˉ ⩾ E [ X ˉ ] + ϵ ) ⩽ exp ( − m ϵ 2 2 D [ X 1 ] + 2 ϵ / 3 ) P(\bar{X}\geqslant\mathbb{E}[\bar{X}]+\epsilon)\leqslant\exp{(\frac{-m\epsilon^2}{2\mathbb{D}[X_1]+2\epsilon/3})} P ( X ˉ ⩾ E [ X ˉ ] + ϵ ) ⩽ exp ( 2 D [ X 1 ] + 2 ϵ /3 − m ϵ 2 )
Machine learning research often uses another form of Bennett's inequality. If
$$
P(\bar{X}\geqslant\mathbb{E}[\bar{X}]+\epsilon)\leqslant\exp\left(\frac{-m\epsilon^2}{2\mathbb{D}[X_1]+2\epsilon/3}\right)=\delta
$$
then with probability at least $1-\delta$,
$$
\bar{X}\leqslant\mathbb{E}[\bar{X}]+\epsilon\leqslant\mathbb{E}[\bar{X}]+\frac{2\ln(1/\delta)}{3m}+\sqrt{\frac{2\mathbb{D}[X_1]}{m}\ln\frac{1}{\delta}}
$$
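The second inequality comes from solving the equation for $\epsilon$. A worked sketch, using the shorthand $L=\ln(1/\delta)$ and $\sigma^2=\mathbb{D}[X_1]$ (both abbreviations are introduced here and do not appear in the original notes):
$$
\begin{aligned}
\exp\left(\frac{-m\epsilon^2}{2\sigma^2+2\epsilon/3}\right)=\delta
&\iff m\epsilon^2-\frac{2L}{3}\epsilon-2\sigma^2L=0\\
&\iff\epsilon=\frac{L}{3m}+\sqrt{\frac{L^2}{9m^2}+\frac{2\sigma^2L}{m}}\quad(\text{positive root})\\
&\implies\epsilon\leqslant\frac{L}{3m}+\frac{L}{3m}+\sqrt{\frac{2\sigma^2L}{m}}=\frac{2L}{3m}+\sqrt{\frac{2\sigma^2L}{m}}
\end{aligned}
$$
The last step uses $\sqrt{a+b}\leqslant\sqrt{a}+\sqrt{b}$ for $a,b\geqslant0$.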
Proof: Since $|X_i-\mathbb{E}[X_i]|\leqslant1$, for every $k\geqslant2$ we have $\mathbb{E}[|X_i-\mathbb{E}[X_i]|^k]\leqslant1^{k-2}\,\mathbb{E}[|X_i-\mathbb{E}[X_i]|^2]=\mathbb{D}[X_i]$. Because $k!\,(1/3)^{k-2}/2\geqslant1$ for all $k\geqslant2$, the choice $b=\frac{1}{3}$ satisfies the moment condition of Bernstein's inequality (Section 1.2.15), and substituting it gives the result.
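Why $b=\frac{1}{3}$ is admissible can be checked directly: Bernstein's condition asks for $\mathbb{E}[|X_i|^k]\leqslant k!\,b^{k-2}\,\mathbb{D}[X_1]/2$, so it suffices that the factor $k!\,b^{k-2}/2$ is at least 1. A small sketch that verifies this for a finite range of $k$ using exact rational arithmetic (the helper name `bernstein_moment_factor` and the cutoff $k\leqslant20$ are choices made here, not from the original notes):

```python
import math
from fractions import Fraction


def bernstein_moment_factor(k, b):
    """Return k! * b^(k-2) / 2, the factor multiplying D[X_1] in Bernstein's moment condition."""
    return math.factorial(k) * b ** (k - 2) / 2


# Exact arithmetic avoids floating-point rounding at the boundary cases k = 2, 3.
b = Fraction(1, 3)
factors = [bernstein_moment_factor(k, b) for k in range(2, 21)]
print([str(f) for f in factors[:4]])
```

For general $k$ the same conclusion follows because consecutive factors differ by the ratio $(k+1)/3\geqslant1$ when $k\geqslant2$, so the sequence never drops below its value at $k=2$.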
1.2.15 Bernstein's inequality
For $m$ independent and identically distributed random variables $X_i$, $i\in[m]$, let $\bar{X}=\frac{1}{m}\sum_{i=1}^{m}X_i$. If there exists $b>0$ such that $\mathbb{E}[|X_i|^k]\leqslant k!\,b^{k-2}\,\mathbb{D}[X_1]/2$ holds for all $k\geqslant2$, then
$$
P(\bar{X}\geqslant\mathbb{E}[\bar{X}]+\epsilon)\leqslant\exp\left(\frac{-m\epsilon^2}{2\mathbb{D}[X_1]+2b\epsilon}\right)
$$
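To get a feel for the variance term in the denominator, the sketch below simply evaluates the right-hand side for a few variances and compares it with Hoeffding's bound $e^{-2m\epsilon^2}$ for variables taking values in $[0,1]$. It only evaluates formulas and does not check the moment condition; the function names and the sample values of `m`, `eps`, and `var` are illustrative assumptions.

```python
import math


def bernstein_bound(m, eps, var, b):
    """exp(-m eps^2 / (2 var + 2 b eps)): right-hand side of Bernstein's inequality."""
    return math.exp(-m * eps ** 2 / (2 * var + 2 * b * eps))


def hoeffding_bound(m, eps):
    """exp(-2 m eps^2): Hoeffding's one-sided bound for variables in [0, 1]."""
    return math.exp(-2 * m * eps ** 2)


m, eps = 1000, 0.05
for var in (0.25, 0.05, 0.01):
    # b = 1/3 is the value used for Bennett's inequality in Section 1.2.14
    print(var, bernstein_bound(m, eps, var, b=1 / 3), hoeffding_bound(m, eps))
```

With the largest possible variance for $[0,1]$-valued variables (0.25), Hoeffding's bound is slightly smaller; as the variance shrinks, the Bernstein-type bound becomes much smaller, which is the usual reason to prefer it.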
Proof: For any $\lambda\geqslant0$,
$$
\begin{aligned}
\mathbb{E}[e^{\lambda(X-\mathbb{E}[X])}]&=1+\lambda\mathbb{E}[X-\mathbb{E}[X]]+\frac{\lambda^2}{2}\mathbb{E}[(X-\mathbb{E}[X])^2]+\sum_{k=3}^{\infty}\frac{\lambda^k\mathbb{E}[(X-\mathbb{E}[X])^k]}{k!}\quad\text{(use Fubini)}\\
&=1+\frac{\lambda^2}{2}\mathbb{D}[X]+\sum_{k=3}^{\infty}\frac{\lambda^k\mathbb{E}[(X-\mathbb{E}[X])^k]}{k!}\\
&\leqslant1+\frac{\lambda^2}{2}\mathbb{D}[X]+\frac{\lambda^2}{2}\mathbb{D}[X]\sum_{k=3}^{\infty}\lambda^{k-2}b^{k-2}\\
&\leqslant1+\frac{\lambda^2\mathbb{D}[X]}{2(1-\lambda b)}\quad\left(\text{take }\lambda<\frac{1}{b}\right)\\
&\leqslant\exp\left(\frac{\lambda^2\mathbb{D}[X]}{2(1-\lambda b)}\right)
\end{aligned}
$$
Then, by the single-variable form of Chernoff's inequality,
$$
\begin{aligned}
P(\bar{X}-\mathbb{E}[\bar{X}]\geqslant\epsilon)&\leqslant e^{-\lambda\epsilon}\mathbb{E}[e^{\lambda(\bar{X}-\mathbb{E}[\bar{X}])}]=e^{-\lambda\epsilon}\prod^m_{i=1}\mathbb{E}[e^{\frac{\lambda}{m}(X_i-\mathbb{E}[X_i])}]\\
&\leqslant\exp\left(-\lambda\epsilon+\sum_{i=1}^m\frac{\left(\frac{\lambda}{m}\right)^2\mathbb{D}[X_i]}{2\left(1-\frac{\lambda b}{m}\right)}\right)
\end{aligned}
$$
Take $\lambda=\frac{m\epsilon}{b\epsilon+\frac{1}{m}\sum^m_{i=1}\mathbb{D}[X_i]}$, so that $\frac{\lambda}{m}<\frac{1}{b}$. Since the $X_i$ are identically distributed, $\frac{1}{m}\sum^m_{i=1}\mathbb{D}[X_i]=\mathbb{D}[X_1]$, and therefore
$$
P(\bar{X}-\mathbb{E}[\bar{X}]\geqslant\epsilon)\leqslant\exp\left(\frac{-m\epsilon^2}{2\mathbb{D}[X_1]+2b\epsilon}\right)
$$
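As a sketch of the substitution step, write $t=\lambda/m$ for the parameter that multiplies each $X_i-\mathbb{E}[X_i]$ in the product above, and $\sigma^2=\mathbb{D}[X_1]$ (shorthand introduced here; the variables are i.i.d., so every $\mathbb{D}[X_i]$ equals $\sigma^2$). The exponent of the Chernoff bound is then $m\left(-t\epsilon+\frac{t^2\sigma^2}{2(1-tb)}\right)$, and the choice $t=\frac{\epsilon}{b\epsilon+\sigma^2}<\frac{1}{b}$ gives:
$$
\begin{aligned}
1-tb&=\frac{\sigma^2}{b\epsilon+\sigma^2},\\
-t\epsilon+\frac{t^2\sigma^2}{2(1-tb)}&=-\frac{\epsilon^2}{b\epsilon+\sigma^2}+\frac{\epsilon^2}{2(b\epsilon+\sigma^2)}=-\frac{\epsilon^2}{2(b\epsilon+\sigma^2)},\\
m\left(-t\epsilon+\frac{t^2\sigma^2}{2(1-tb)}\right)&=\frac{-m\epsilon^2}{2\sigma^2+2b\epsilon}.
\end{aligned}
$$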
1.2.16 Azuma's inequality
For a martingale $\{Z_m,m\geqslant1\}$ with mean $\mu$, let $Z_0=\mu$ and write $X_i:=Z_i-Z_{i-1}$ for the increments, so that $\sum_{i=1}^mX_i=Z_m-Z_0$. If $-c_i\leqslant Z_i-Z_{i-1}\leqslant c_i$, then $\forall\epsilon>0$
$$
\begin{aligned}
P\left(\sum_{i=1}^mX_i\geqslant\epsilon\right)&\leqslant\exp\left(-\frac{\epsilon^2}{2\sum_{i=1}^mc_i^2}\right)\\
P\left(\sum_{i=1}^mX_i\leqslant-\epsilon\right)&\leqslant\exp\left(-\frac{\epsilon^2}{2\sum_{i=1}^mc_i^2}\right)
\end{aligned}
$$
Proof: We prove a stronger result below, which directly implies the inequality in the book.
For a martingale $\{Z_m\}_{m\geqslant1}$, let $Z_0=\mathbb{E}[Z]$. If
$A_i\leqslant Z_i-Z_{i-1}\leqslant B_i$ and $B_i-A_i\leqslant C_i$, then for any $\epsilon>0$,
$$
P(Z_n-Z_0\geqslant\epsilon)\leqslant\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^nC_i^2}\right)
$$
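To see how the stronger statement recovers the inequality stated at the start of this section, a short check: under $-c_i\leqslant Z_i-Z_{i-1}\leqslant c_i$ one can take $A_i=-c_i$, $B_i=c_i$, and hence $C_i=2c_i$, which gives
$$
P(Z_n-Z_0\geqslant\epsilon)\leqslant\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n(2c_i)^2}\right)=\exp\left(-\frac{\epsilon^2}{2\sum_{i=1}^nc_i^2}\right)
$$
The lower-tail bound follows by applying the same argument to the martingale $\{-Z_m\}$, whose increments satisfy the same symmetric bounds.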
Proof of the strengthened theorem
$$
\begin{aligned}
\mathbb{E}[Z_n-Z_{n-1}|\mathcal{F}_{n-1}]&=\mathbb{E}[Z_n|\mathcal{F}_{n-1}]-\mathbb{E}[Z_{n-1}|\mathcal{F}_{n-1}]=Z_{n-1}-Z_{n-1}=0\\
\text{(use Chernoff's inequality)}\quad P(Z_n-Z_0\geqslant\epsilon)&\leqslant e^{-\lambda\epsilon}\mathbb{E}[e^{\lambda(Z_n-Z_0)}]\\
&=e^{-\lambda\epsilon}\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n(Z_i-Z_{i-1})\right)\right]\\
&=e^{-\lambda\epsilon}\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n-1}(Z_i-Z_{i-1})\right)\mathbb{E}\left[\exp(\lambda(Z_n-Z_{n-1}))\,\middle|\,\mathcal{F}_{n-1}\right]\right]
\end{aligned}
$$
Then, by the lemma proved in the proof of Hoeffding's inequality,
$$
\begin{aligned}
\mathbb{E}[\exp(\lambda(Z_n-Z_{n-1}))|\mathcal{F}_{n-1}]&\leqslant\exp\left(\frac{\lambda^2(B_n-A_n)^2}{8}\right)\leqslant\exp\left(\frac{\lambda^2C_n^2}{8}\right)\\
P(Z_n-Z_0\geqslant\epsilon)&\leqslant e^{-\lambda\epsilon}\exp\left(\frac{\lambda^2C_n^2}{8}\right)\cdot\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^{n-1}(Z_i-Z_{i-1})\right)\right]\\
&\leqslant\cdots\leqslant e^{-\lambda\epsilon}\exp\left(\frac{\lambda^2}{8}\sum_{i=1}^nC_i^2\right)
\end{aligned}
$$
This expression is minimized at $\lambda=\frac{4\epsilon}{\sum_{i=1}^nC_i^2}$. Since the original inequality holds for arbitrary $\lambda$, substituting this value gives
$$
P(Z_n-Z_0\geqslant\epsilon)\leqslant\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^nC_i^2}\right)
$$
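The minimization over $\lambda$ can be checked numerically. The sketch below writes $S=\sum_{i=1}^nC_i^2$ and scans a grid of $\lambda$ values around the claimed minimizer; the function name `chernoff_exponent`, the sample values of `eps` and `S`, and the grid resolution are all illustrative assumptions rather than anything from the original notes.

```python
def chernoff_exponent(lam, eps, S):
    """Exponent -lam * eps + lam^2 * S / 8 of the bound, with S = sum of C_i^2."""
    return -lam * eps + lam ** 2 * S / 8


eps, S = 1.5, 10.0
lam_star = 4 * eps / S  # claimed minimizer from setting the derivative to zero

# Scan lambda over (0, 3 * lam_star]; the grid point k = 1000 is lam_star itself.
grid = [lam_star * k / 1000 for k in range(1, 3001)]
best = min(grid, key=lambda lam: chernoff_exponent(lam, eps, S))

print(best, lam_star)
print(chernoff_exponent(lam_star, eps, S), -2 * eps ** 2 / S)
```

The grid minimizer should coincide with $4\epsilon/S$, and the exponent there should match $-2\epsilon^2/S$, in line with setting the derivative $-\epsilon+\lambda S/4$ to zero.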