1. Overview
Broadly speaking, the task of inference is to compute probabilities. Given the joint distribution $P(x)=P(x_{1},x_{2},\cdots,x_{p})$, inference methods are needed to compute:

$$\begin{aligned}
&\text{Marginal probability: } P(x_{i})=\sum_{x_{1}}\cdots\sum_{x_{i-1}}\sum_{x_{i+1}}\cdots\sum_{x_{p}}P(x)\\
&\text{Conditional probability: } P(x_{A}\mid x_{B}),\quad x=x_{A}\cup x_{B}\\
&\text{MAP inference: } \hat{z}=\underset{z}{\mathrm{argmax}}\;P(z\mid x)\propto\underset{z}{\mathrm{argmax}}\;P(z,x)
\end{aligned}$$
Some common inference methods:
① Exact inference:
Variable Elimination (VE; for tree structures);
Belief Propagation (BP, the Sum-Product algorithm; for tree structures);
Junction Tree algorithm (for general graph structures)
② Approximate inference:
2. Variable Elimination
Variable Elimination
For the graph above (a Markov chain $a\rightarrow b\rightarrow c\rightarrow d$), suppose we want the marginal probability $P(d)$; we can apply variable elimination:
$$\begin{aligned}
P(d)&=\sum_{a,b,c}P(a,b,c,d)\\
&=\underbrace{\sum_{a,b,c}P(a)P(b\mid a)P(c\mid b)P(d\mid c)}_{\text{factorization}}\\
&=\sum_{b,c}P(c\mid b)P(d\mid c)\underbrace{\sum_{a}P(a)P(b\mid a)}_{\phi_{a}(b)}\\
&=\sum_{c}P(d\mid c)\underbrace{\sum_{b}P(c\mid b)\phi_{a}(b)}_{\phi_{b}(c)}\\
&=\sum_{c}P(d\mid c)\phi_{b}(c)\\
&=\phi_{c}(d)
\end{aligned}$$
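The elimination above can be sketched numerically. A minimal NumPy sketch, with randomly generated placeholder CPTs standing in for real ones; it checks the eliminated result against brute-force summation of the full joint:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(t, axis):
    return t / t.sum(axis=axis, keepdims=True)

# Hypothetical CPTs for the binary chain a -> b -> c -> d
P_a  = normalize(rng.random(2), 0)          # P(a)
P_ba = normalize(rng.random((2, 2)), 1)     # P(b|a), row index a, column index b
P_cb = normalize(rng.random((2, 2)), 1)     # P(c|b)
P_dc = normalize(rng.random((2, 2)), 1)     # P(d|c)

# Brute force: build the full joint and sum out a, b, c (8 terms per value of d)
joint = (P_a[:, None, None, None] * P_ba[:, :, None, None]
         * P_cb[None, :, :, None] * P_dc[None, None, :, :])
P_d_brute = joint.sum(axis=(0, 1, 2))

# Variable elimination: push each sum inward and eliminate one variable at a time
phi_b = P_a @ P_ba      # phi_a(b) = sum_a P(a) P(b|a)
phi_c = phi_b @ P_cb    # phi_b(c) = sum_b phi_a(b) P(c|b)
P_d   = phi_c @ P_dc    # phi_c(d) = sum_c phi_b(c) P(d|c)

assert np.allclose(P_d, P_d_brute)
```

Each `@` eliminates exactly one variable, so the work grows linearly in the chain length instead of exponentially.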
Explanation
We can understand what variable elimination buys us by expanding the direct computation of $P(d)$. First assume $a$, $b$, $c$, $d$ are discrete binary random variables taking only the values 0 and 1, then expand $P(d)$ directly:
$$\begin{aligned}
P(d)&=\sum_{a,b,c}P(a,b,c,d)\\
&=\sum_{a,b,c}P(a)P(b\mid a)P(c\mid b)P(d\mid c)\\
&=P(a=0)P(b=0\mid a=0)P(c=0\mid b=0)P(d\mid c=0)\\
&\quad+P(a=0)P(b=0\mid a=0)P(c=1\mid b=0)P(d\mid c=1)\\
&\quad+P(a=0)P(b=1\mid a=0)P(c=0\mid b=1)P(d\mid c=0)\\
&\quad+P(a=0)P(b=1\mid a=0)P(c=1\mid b=1)P(d\mid c=1)\\
&\quad+P(a=1)P(b=0\mid a=1)P(c=0\mid b=0)P(d\mid c=0)\\
&\quad+P(a=1)P(b=0\mid a=1)P(c=1\mid b=0)P(d\mid c=1)\\
&\quad+P(a=1)P(b=1\mid a=1)P(c=0\mid b=1)P(d\mid c=0)\\
&\quad+P(a=1)P(b=1\mid a=1)P(c=1\mid b=1)P(d\mid c=1)\\
&=8\cdot\text{products of factors}
\end{aligned}$$
Computing every term in this expansion and then summing them requires a great deal of computation, and this is only the cost when every variable is binary; if each variable can take more values, the cost grows further. Variable elimination simplifies this computation. It exploits the fact that each node interacts only with its neighbors in the graph, and applies the distributive law $(ab+ac=a(b+c))$ to avoid evaluating every term separately before summing. On the expansion above, variable elimination proceeds as follows:
$$\begin{aligned}
P(d)=&\\
&(\text{group the terms involving } a)\\
=&{\color{Red}P(c=0\mid b=0)P(d\mid c=0)\cdot P(a=0)P(b=0\mid a=0)}\\
&+{\color{Green}P(c=1\mid b=0)P(d\mid c=1)\cdot P(a=0)P(b=0\mid a=0)}\\
&+{\color{Blue}P(c=0\mid b=1)P(d\mid c=0)\cdot P(a=0)P(b=1\mid a=0)}\\
&+{\color{Yellow}P(c=1\mid b=1)P(d\mid c=1)\cdot P(a=0)P(b=1\mid a=0)}\\
&+{\color{Red}P(c=0\mid b=0)P(d\mid c=0)\cdot P(a=1)P(b=0\mid a=1)}\\
&+{\color{Green}P(c=1\mid b=0)P(d\mid c=1)\cdot P(a=1)P(b=0\mid a=1)}\\
&+{\color{Blue}P(c=0\mid b=1)P(d\mid c=0)\cdot P(a=1)P(b=1\mid a=1)}\\
&+{\color{Yellow}P(c=1\mid b=1)P(d\mid c=1)\cdot P(a=1)P(b=1\mid a=1)}\\
&(\text{apply the distributive law})\\
=&{\color{Red}P(c=0\mid b=0)P(d\mid c=0)\cdot \phi_{a}(b=0)}\\
&+{\color{Green}P(c=1\mid b=0)P(d\mid c=1)\cdot \phi_{a}(b=0)}\\
&+{\color{Blue}P(c=0\mid b=1)P(d\mid c=0)\cdot \phi_{a}(b=1)}\\
&+{\color{Yellow}P(c=1\mid b=1)P(d\mid c=1)\cdot \phi_{a}(b=1)}\\
&(\text{group the terms involving } b)\\
=&{\color{Red}P(d\mid c=0)\cdot P(c=0\mid b=0)\phi_{a}(b=0)}\\
&+{\color{Green}P(d\mid c=1)\cdot P(c=1\mid b=0)\phi_{a}(b=0)}\\
&+{\color{Red}P(d\mid c=0)\cdot P(c=0\mid b=1)\phi_{a}(b=1)}\\
&+{\color{Green}P(d\mid c=1)\cdot P(c=1\mid b=1)\phi_{a}(b=1)}\\
&(\text{apply the distributive law})\\
=&{\color{Red}P(d\mid c=0)\cdot \phi_{b}(c=0)}\\
&+{\color{Green}P(d\mid c=1)\cdot \phi_{b}(c=1)}\\
=&\phi_{c}(d)
\end{aligned}$$
Drawbacks
The drawbacks of variable elimination are clear:
① Intermediate results are not cached: every time we compute one marginal, the elimination must be re-run over the entire graph;
② Finding the optimal elimination order is NP-hard: for a complex graph, identifying the best order in which to eliminate variables is intractable.
3. Belief Propagation
The repeated-computation problem of Variable Elimination
For the following graph structure (a chain $a\rightarrow b\rightarrow c\rightarrow d\rightarrow e$), the joint probability is known:

$$P(a,b,c,d,e)=P(a)P(b\mid a)P(c\mid b)P(d\mid c)P(e\mid d)$$
Computing the marginal of $e$ with variable elimination proceeds as follows:
$$\begin{aligned}
P(e)&=\sum_{a,b,c,d}P(a,b,c,d,e)\\
&=\sum_{a,b,c,d}P(a)P(b\mid a)P(c\mid b)P(d\mid c)P(e\mid d)\\
&=\underbrace{\sum_{d}P(e\mid d)\underbrace{\sum_{c}P(d\mid c)\underbrace{\sum_{b}P(c\mid b)\underbrace{\sum_{a}P(b\mid a)P(a)}_{m_{a\rightarrow b}(b)}}_{m_{b\rightarrow c}(c)}}_{m_{c\rightarrow d}(d)}}_{m_{d\rightarrow e}(e)}
\end{aligned}$$
Computing the marginal of $c$ with variable elimination proceeds as follows:
$$\begin{aligned}
P(c)&=\sum_{a,b,d,e}P(a,b,c,d,e)\\
&=\sum_{a,b,d,e}P(a)P(b\mid a)P(c\mid b)P(d\mid c)P(e\mid d)\\
&=\Big(\sum_{b}P(c\mid b)\sum_{a}P(b\mid a)P(a)\Big)\cdot\Big(\sum_{d}P(d\mid c)\sum_{e}P(e\mid d)\Big)
\end{aligned}$$
Notice that the first part of the computation for the marginal of $c$ duplicates part of the computation for the marginal of $e$. One can imagine that computing the remaining marginals involves a great deal of similar repetition; the Belief Propagation algorithm is designed to eliminate exactly this redundancy.
Deriving Belief Propagation
So far we have been working with a directed Markov chain. We now move from chains to trees, and from directed graphs to undirected graphs (Belief Propagation applies only to tree structures). As an example, consider the following undirected tree:
The factorization of its joint probability can be written as:

$$P(a,b,c,d)=\frac{1}{Z}\psi_{a}(a)\psi_{b}(b)\psi_{c}(c)\psi_{d}(d)\cdot\psi_{ab}(a,b)\psi_{bc}(b,c)\psi_{bd}(b,d)$$
To obtain the marginal probability $P(a)$ we again apply variable elimination: roughly, first eliminate $c$ and $d$, then eliminate $b$. The process is as follows:
$$p(a)=\psi_{a}\underbrace{\sum_{b}\psi_{b}\cdot\psi_{ab}\Big(\underbrace{\sum_{c}\psi_{c}\cdot\psi_{bc}}_{m_{c\rightarrow b}(b)}\Big)\Big(\underbrace{\sum_{d}\psi_{d}\cdot\psi_{bd}}_{m_{d\rightarrow b}(b)}\Big)}_{m_{b\rightarrow a}(a)}$$
The solution therefore boils down to computing the following two quantities (written more formally here, e.g. $a$ is written as $x_a$):
$$\left\{\begin{aligned}
m_{b\rightarrow a}(x_{a})&=\sum_{x_{b}}\psi_{ab}\cdot\psi_{b}\cdot m_{c\rightarrow b}(x_{b})\cdot m_{d\rightarrow b}(x_{b})\\
p(x_{a})&=\psi_{a}\cdot m_{b\rightarrow a}(x_{a})
\end{aligned}\right.$$
We can now abstract the procedure for the marginal of $x_a$ into a general procedure for the marginal of any $x_i$:
$$\left\{\begin{aligned}
m_{j\rightarrow i}(x_{i})&=\sum_{x_{j}}\psi_{ij}\cdot\psi_{j}\cdot\prod_{k\in Neighbor(j)\setminus\{i\}}m_{k\rightarrow j}(x_{j})\\
p(x_{i})&=\psi_{i}\cdot\prod_{k\in Neighbor(i)}m_{k\rightarrow i}(x_{i})
\end{aligned}\right.$$
Looking further at the formula for the marginal of $x_i$, we can give names to some of its parts:
$$\left\{\begin{aligned}
m_{j\rightarrow i}(x_{i})&=\sum_{x_{j}}\psi_{ij}\cdot\underbrace{\underbrace{\psi_{j}}_{self}\cdot\underbrace{\prod_{k\in Neighbor(j)\setminus\{i\}}m_{k\rightarrow j}(x_{j})}_{children}}_{belief(x_{j})}\\
p(x_{i})&=\psi_{i}\cdot\prod_{k\in Neighbor(i)}m_{k\rightarrow i}(x_{i})
\end{aligned}\right.$$
Computing $m_{j\rightarrow i}(x_{i})$ therefore takes two steps:
$$\left\{\begin{aligned}
belief(x_{j})&=self\cdot children\\
m_{j\rightarrow i}(x_{i})&=\sum_{x_{j}}\psi_{ij}\cdot belief(x_{j})
\end{aligned}\right.$$
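The two-step update above can be implemented directly. A minimal sum-product sketch for the tree with edges $a\text{–}b$, $b\text{–}c$, $b\text{–}d$; the potentials are random placeholders, and each marginal is checked against brute-force enumeration of the joint:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical potentials for the tree a - b, b - c, b - d (binary variables)
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("b", "d")]
psi_node = {v: rng.random(2) for v in nodes}
psi_edge = {e: rng.random((2, 2)) for e in edges}   # indexed [x_i, x_j]

neighbors = {v: [] for v in nodes}
for i, j in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def edge_pot(i, j):
    # psi_edge stores one orientation per edge; transpose for the other
    return psi_edge[(i, j)] if (i, j) in psi_edge else psi_edge[(j, i)].T

msgs = {}  # cache: this is exactly the "BP = VE + Caching" idea
def message(j, i):
    # m_{j->i}(x_i) = sum_{x_j} psi_ij * belief(x_j),
    # belief(x_j)   = psi_j * prod_{k in N(j)\i} m_{k->j}(x_j)
    if (j, i) not in msgs:
        belief = psi_node[j].copy()
        for k in neighbors[j]:
            if k != i:
                belief = belief * message(k, j)
        msgs[(j, i)] = edge_pot(i, j) @ belief
    return msgs[(j, i)]

def marginal(i):
    p = psi_node[i].copy()
    for k in neighbors[i]:
        p = p * message(k, i)
    return p / p.sum()   # normalization absorbs the constant 1/Z

# Brute-force check against enumeration of the joint
def joint(x):
    val = np.prod([psi_node[v][x[v]] for v in nodes])
    for i, j in edges:
        val *= psi_edge[(i, j)][x[i], x[j]]
    return val

for v in nodes:
    brute = np.zeros(2)
    for assign in itertools.product([0, 1], repeat=len(nodes)):
        x = dict(zip(nodes, assign))
        brute[x[v]] += joint(x)
    assert np.allclose(marginal(v), brute / brute.sum())
```

Because every message is cached, computing all four marginals evaluates each directed message only once.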
The figure shows the elimination (message-passing) process for the marginal of $x_a$:
One can imagine that computing the other marginals would again repeat many elimination steps. But since we now have a general formula for $m_{j\rightarrow i}(x_{i})$, we can use it to avoid the repeated computation; this is exactly what the Belief Propagation algorithm does.
Belief Propagation
The idea of the Belief Propagation algorithm is:
Do not compute $P(a)$, $P(b)$, $P(c)$, $P(d)$ directly; just compute all the messages $m_{j\rightarrow i}$.
Belief Propagation first runs all the message passing (collecting and distributing) to obtain every $m_{j\rightarrow i}$ (a traversal of the graph), and then plugs them into the formula to compute the marginals. In short, BP = VE + Caching.
One way for the Belief Propagation algorithm to traverse the graph (Sequential Implementation) is:
① Get root; assume $a$ is the root;
② Collect messages:
for $x_i$ in Neighbor(Root):
    collectMsg($x_i$)
③ Distribute messages:
for $x_i$ in Neighbor(Root):
    distributeMsg($x_i$)
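The sequential schedule can be sketched as follows. The tree and the stub `send_message` are illustrative assumptions; a real implementation would compute $m_{j\rightarrow i}$ inside it, and here we only record the order in which messages are produced:

```python
# Hypothetical neighbor lists for the tree rooted at a: a - b, b - c, b - d
neighbors = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
sent = []  # records the order in which messages j -> i are produced

def send_message(j, i):
    sent.append((j, i))  # placeholder: a real BP would compute m_{j->i} here

def collect(i, parent=None):
    # post-order: a node sends to its parent only after hearing from its subtree
    for k in neighbors[i]:
        if k != parent:
            collect(k, parent=i)
            send_message(k, i)

def distribute(i, parent=None):
    # pre-order: a node sends to each child before that child recurses
    for k in neighbors[i]:
        if k != parent:
            send_message(i, k)
            distribute(k, parent=i)

collect("a")      # produces m_{c->b}, m_{d->b}, m_{b->a}
distribute("a")   # produces m_{a->b}, m_{b->c}, m_{b->d}
assert len(sent) == 6  # every directed edge carries exactly one message
```

After one collect pass and one distribute pass, all $2|E|$ messages exist, so any marginal can be read off without further elimination.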
There is also another traversal scheme (Parallel Implementation), used in distributed computing, which can run in parallel; we will not go into it here.
Max-product
Note that the belief propagation algorithm comes in two forms: Max-product and Sum-product. What we discussed above is Sum-product; Max-product differs only in replacing the summation with a maximization $\max$. Max-product is a modified version of the Sum-product algorithm, and it generalizes the Viterbi algorithm used in hidden Markov models (HMMs).
We again use the following graph structure as an example; only the nodes to be solved for ($a,b,c,d$) are drawn, and the other (evidence) nodes $E$ are omitted:
The purpose of Max-product is to find a sequence of values that maximizes the posterior probability, that is:

$$(x_{a}^{*},x_{b}^{*},x_{c}^{*},x_{d}^{*})=\underset{x_{a},x_{b},x_{c},x_{d}}{\mathrm{argmax}}\;P(x_{a},x_{b},x_{c},x_{d}\mid E)$$
The solution proceeds as follows:
$$\begin{aligned}
&①\; m_{c\rightarrow b}=\underset{x_{c}}{\max}\;\psi_{c}\cdot\psi_{bc}\\
&②\; m_{d\rightarrow b}=\underset{x_{d}}{\max}\;\psi_{d}\cdot\psi_{bd}\\
&③\; m_{b\rightarrow a}=\underset{x_{b}}{\max}\;\psi_{b}\cdot\psi_{ab}\cdot m_{c\rightarrow b}\cdot m_{d\rightarrow b}\\
&④\; \max P(x_{a},x_{b},x_{c},x_{d})=\underset{x_{a}}{\max}\;\psi_{a}\cdot m_{b\rightarrow a}
\end{aligned}$$
This is again a collect-like pass over the tree:
Unlike Sum-product, when computing $\max P(x_a,x_b,x_c,x_d)$ we do not need $m_{a\rightarrow b}$, $m_{b\rightarrow c}$, $m_{b\rightarrow d}$, because what we want is the value of $\max P(x_a,x_b,x_c,x_d)$ together with the maximizing sequence $x_a^{*},x_b^{*},x_c^{*},x_d^{*}$.
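Steps ①–④ can be sketched as follows. A minimal max-product example on the same tree, with random placeholder potentials; the argmax of each message is stored so the maximizing sequence can be recovered by back-tracking, and the result is checked against brute force:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical potentials on the tree a - b, b - c, b - d (binary variables)
nodes = ["a", "b", "c", "d"]
psi_n = {v: rng.random(2) for v in nodes}
psi_e = {("a", "b"): rng.random((2, 2)),   # indexed [x_a, x_b]
         ("b", "c"): rng.random((2, 2)),   # indexed [x_b, x_c]
         ("b", "d"): rng.random((2, 2))}   # indexed [x_b, x_d]

def joint(x):
    p = np.prod([psi_n[v][x[v]] for v in nodes])
    for (i, j), pot in psi_e.items():
        p *= pot[x[i], x[j]]
    return p

# Steps 1-4: max-product messages toward the root a, keeping argmaxes
m_cb = np.array([(psi_n["c"] * psi_e[("b", "c")][xb]).max() for xb in (0, 1)])
bt_c = np.array([(psi_n["c"] * psi_e[("b", "c")][xb]).argmax() for xb in (0, 1)])
m_db = np.array([(psi_n["d"] * psi_e[("b", "d")][xb]).max() for xb in (0, 1)])
bt_d = np.array([(psi_n["d"] * psi_e[("b", "d")][xb]).argmax() for xb in (0, 1)])
m_ba = np.array([(psi_n["b"] * psi_e[("a", "b")][xa] * m_cb * m_db).max()
                 for xa in (0, 1)])
bt_b = np.array([(psi_n["b"] * psi_e[("a", "b")][xa] * m_cb * m_db).argmax()
                 for xa in (0, 1)])

best = (psi_n["a"] * m_ba).max()          # step 4: the maximal joint value
xa = int((psi_n["a"] * m_ba).argmax())    # back-track the maximizing sequence
xb = int(bt_b[xa]); xc = int(bt_c[xb]); xd = int(bt_d[xb])

# Check against brute force over all 16 assignments
brute = max((dict(zip(nodes, v)) for v in itertools.product([0, 1], repeat=4)),
            key=joint)
assert np.isclose(best, joint(brute))
assert np.isclose(joint({"a": xa, "b": xb, "c": xc, "d": xd}), best)
```

The back-pointers `bt_c`, `bt_d`, `bt_b` play the same role as the traceback table in the Viterbi algorithm.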
Moral graph
We often want to convert a directed graph into an undirected one so that the more general expression applies. The three basic structures in a directed graph are converted differently. For the head-to-tail chain $A\rightarrow B\rightarrow C$:
$$P(A,B,C)=\underbrace{P(A)P(B\mid A)}_{\phi(A,B)}\underbrace{P(C\mid B)}_{\phi(B,C)}$$
This shows that $\{A,B\}$ and $\{B,C\}$ are cliques, so we can simply remove the arrowheads. For the tail-to-tail structure $A\leftarrow B\rightarrow C$:
$$P(A,B,C)=\underbrace{P(B)P(A\mid B)}_{\phi(A,B)}\underbrace{P(C\mid B)}_{\phi(B,C)}$$
Again $\{A,B\}$ and $\{B,C\}$ are cliques, so the arrowheads can simply be removed. For the head-to-head (collider) structure $A\rightarrow B\leftarrow C$:
$$P(A,B,C)=\underbrace{P(A)P(C)P(B\mid A,C)}_{\phi(A,B,C)}$$
This shows that $A,B,C$ together form one clique, so an edge must be added between $A$ and $C$:
Observing these three cases, the conversion from a directed graph to an undirected one can be summarized in two steps:
① connect the parents of every node pairwise;
② replace every directed edge with an undirected edge.
The resulting undirected graph is the moral graph.
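The two steps above can be sketched directly. A minimal moralization sketch; the example DAG (with the collider $a\rightarrow c\leftarrow b$) is an illustrative assumption:

```python
from itertools import combinations

def moralize(parents):
    """parents: dict mapping each node to the list of its parents in the DAG.
    Returns the moral graph as a set of undirected edges."""
    und = set()
    for child, ps in parents.items():
        # step 2: drop edge directions
        for p in ps:
            und.add(frozenset((p, child)))
        # step 1: "marry" the parents, connecting them pairwise
        for p, q in combinations(ps, 2):
            und.add(frozenset((p, q)))
    return und

# Hypothetical DAG: a -> c, b -> c (collider), c -> d
dag = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
moral = moralize(dag)
assert frozenset(("a", "b")) in moral   # the parents of c get married
assert len(moral) == 4                  # edges: a-c, b-c, a-b, c-d
```

Note how marrying the parents of `c` introduces the cycle $a$–$b$–$c$, which is exactly why the factor-graph construction below is needed to restore acyclicity.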
Factor graph
When a directed graph is converted into an undirected (tree-like) graph, the resulting moral graph may contain cycles. But the BP algorithm above only works on acyclic graphs; a factor graph lets us turn such a graph back into an acyclic one.
The factor-graph decomposition of a joint probability is:

$$P(x)=\prod_{S}f_{S}(x_{S})$$
where:
① $S$: a subset of the graph's nodes;
② $x_{S}$: the subset of random variables indexed by $S$.
Consider the following undirected graph:
It can be converted into a simple factor graph:
Here $f=f(a,b,c)$. Comparing with the undirected factorization $P(x)=\frac{1}{Z}\psi(a,b,c)$, we see that the factorization itself corresponds to one particular factor graph.
A factor graph is not unique; it can be seen as a further decomposition of the factorization. For example, the following decomposition:
corresponds to $P(x)=f_1(a,b)\,f_2(a,c)\,f_3(b,c)\,f_a(a)\,f_b(b)\,f_c(c)$. The decomposition into factors is not unique; all that is required is that the product equals the probability $P(x)$.
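A quick numeric illustration of this non-uniqueness: a minimal sketch with random placeholder factors, showing that the six-factor decomposition and a single merged factor $f(a,b,c)$ define the same (unnormalized) distribution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical factors over three binary variables, following
# P(x) = f1(a,b) f2(a,c) f3(b,c) fa(a) fb(b) fc(c)   (unnormalized)
f1, f2, f3 = rng.random((2, 2)), rng.random((2, 2)), rng.random((2, 2))
fa, fb, fc = rng.random(2), rng.random(2), rng.random(2)

def fine(a, b, c):
    """The finer factor graph: six small factors."""
    return f1[a, b] * f2[a, c] * f3[b, c] * fa[a] * fb[b] * fc[c]

# Merging all six into one factor f(a,b,c) gives the coarser factor
# graph P(x) = f(a,b,c) for the very same distribution
f = np.array([[[fine(a, b, c) for c in (0, 1)] for b in (0, 1)] for a in (0, 1)])

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert np.isclose(fine(a, b, c), f[a, b, c])
```

Which decomposition to use is a modeling choice; finer decompositions expose more structure for message passing.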
In the decomposition above, the factor graph can be viewed as having two layers:
In other words, in a factor graph variable nodes are never directly connected to each other: variables connect only to factor nodes, and factor nodes connect only to variable nodes.