HMM Model -- Discrete Case


HMM

Formula Derivation

An HMM makes two basic assumptions:

  1. Homogeneous Markov assumption (the future depends only on the present):

    $$p(i_{t+1}|i_t,i_{t-1},\cdots,i_1,o_t,o_{t-1},\cdots,o_1)=p(i_{t+1}|i_t)$$
  2. Observation independence assumption (each observation depends only on the current state):

    $$p(o_t|i_t,i_{t-1},\cdots,i_1,o_{t-1},\cdots,o_1)=p(o_t|i_t)$$

HMM has three problems to solve:

  1. Evaluation: compute $p(O|\lambda)$, solved with the Forward-Backward algorithm
  2. Learning: $\lambda=\mathop{argmax}\limits_{\lambda}p(O|\lambda)$, solved with the EM algorithm (Baum-Welch)
  3. Decoding: $I=\mathop{argmax}\limits_{I}p(I|O,\lambda)$, solved with the Viterbi algorithm
    1. Prediction problem: $p(i_{t+1}|o_1,o_2,\cdots,o_t)$
    2. Filtering problem: $p(i_t|o_1,o_2,\cdots,o_t)$
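
Throughout, the parameters are $\lambda=(\pi,A,B)$: the initial state distribution $\pi=(\pi_i)$, the state transition matrix $A=[a_{ij}]$, and the emission probabilities $b_j(o)$. The code sketches below reuse the following minimal toy setup (the variable names and numbers are purely illustrative, not from the original derivation):

```python
import numpy as np

# Toy discrete HMM: N = 3 hidden states, M = 2 observation symbols.
pi = np.array([0.5, 0.3, 0.2])           # pi[i]   = p(i_1 = q_i)
A = np.array([[0.6, 0.3, 0.1],           # A[i, j] = a_ij = p(i_{t+1} = q_j | i_t = q_i)
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
B = np.array([[0.9, 0.1],                # B[j, k] = b_j(k) = p(o_t = k | i_t = q_j)
              [0.5, 0.5],
              [0.2, 0.8]])
O = np.array([0, 1, 1, 0])               # observation sequence, T = 4
```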

Evaluation

$$p(O|\lambda)=\sum\limits_{I}p(I,O|\lambda)=\sum\limits_{I}p(O|I,\lambda)p(I|\lambda)$$
$$p(I|\lambda)=p(i_1,i_2,\cdots,i_t|\lambda)=p(i_t|i_1,i_2,\cdots,i_{t-1},\lambda)p(i_1,i_2,\cdots,i_{t-1}|\lambda)$$

By the homogeneous Markov assumption:

$$p(i_t|i_1,i_2,\cdots,i_{t-1},\lambda)=p(i_t|i_{t-1})=a_{i_{t-1},i_t}$$

Therefore:

$$p(I|\lambda)=\pi_{i_1}\prod\limits_{t=2}^Ta_{i_{t-1},i_t}$$

And since:

$$p(O|I,\lambda)=\prod\limits_{t=1}^Tb_{i_t}(o_t)$$

Thus:

$$p(O|\lambda)=\sum\limits_{I}\pi_{i_1}\prod\limits_{t=2}^Ta_{i_{t-1},i_t}\prod\limits_{t=1}^Tb_{i_t}(o_t)$$

Note that the summation above runs over all possible hidden state sequences $I$, so the complexity is $O(N^T)$.
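
As a sanity check, this brute-force sketch (assuming the toy pi, A, B, O defined above) computes $p(O|\lambda)$ by enumerating all $N^T$ state sequences exactly as in the formula; it is only feasible for very small examples:

```python
from itertools import product

def evaluate_brute_force(pi, A, B, O):
    """p(O | lambda) by summing over every hidden state sequence I -- O(N^T) work."""
    N, T = len(pi), len(O)
    total = 0.0
    for I in product(range(N), repeat=T):
        p = pi[I[0]] * B[I[0], O[0]]                 # pi_{i_1} * b_{i_1}(o_1)
        for t in range(1, T):
            p *= A[I[t - 1], I[t]] * B[I[t], O[t]]   # a_{i_{t-1}, i_t} * b_{i_t}(o_t)
        total += p
    return total
```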

Next, define $\alpha_t(i)=p(o_1,o_2,\cdots,o_t,i_t=q_i|\lambda)$, so that $\alpha_T(i)=p(O,i_T=q_i|\lambda)$. We see that:

$$p(O|\lambda)=\sum\limits_{i=1}^Np(O,i_T=q_i|\lambda)=\sum\limits_{i=1}^N\alpha_T(i)$$

For $\alpha_{t+1}(j)$:

$$\begin{aligned}\alpha_{t+1}(j)&=p(o_1,o_2,\cdots,o_{t+1},i_{t+1}=q_j|\lambda)\\
&=\sum\limits_{i=1}^Np(o_1,o_2,\cdots,o_{t+1},i_{t+1}=q_j,i_t=q_i|\lambda)\\
&=\sum\limits_{i=1}^Np(o_{t+1}|o_1,\cdots,o_t,i_{t+1}=q_j,i_t=q_i,\lambda)p(o_1,\cdots,o_t,i_t=q_i,i_{t+1}=q_j|\lambda)\end{aligned}$$

Using the observation independence assumption:

$$\begin{aligned}\alpha_{t+1}(j)&=\sum\limits_{i=1}^Np(o_{t+1}|i_{t+1}=q_j)p(o_1,\cdots,o_t,i_t=q_i,i_{t+1}=q_j|\lambda)\\
&=\sum\limits_{i=1}^Np(o_{t+1}|i_{t+1}=q_j)p(i_{t+1}=q_j|o_1,\cdots,o_t,i_t=q_i,\lambda)p(o_1,\cdots,o_t,i_t=q_i|\lambda)\\
&=\sum\limits_{i=1}^Nb_j(o_{t+1})a_{ij}\alpha_t(i)\end{aligned}$$

The last step uses the homogeneous Markov assumption, yielding a recursion; this algorithm is called the forward algorithm.
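
A minimal forward-pass sketch under the same toy setup (the function name forward is illustrative); the cost drops from $O(N^T)$ to $O(TN^2)$ because the sum over paths collapses into one matrix-vector product per step:

```python
import numpy as np

def forward(pi, A, B, O):
    """Row t-1 holds alpha_t(i) = p(o_1, ..., o_t, i_t = q_i | lambda)."""
    N, T = len(pi), len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                 # alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, T):
        # alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i a_ij * alpha_t(i)
        alpha[t] = B[:, O[t]] * (alpha[t - 1] @ A)
    return alpha

# p(O | lambda) = sum_i alpha_T(i); should match the brute-force value above.
p_O = forward(pi, A, B, O)[-1].sum()
```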

There is also a backward algorithm. Define $\beta_t(i)=p(o_{t+1},o_{t+2},\cdots,o_T|i_t=q_i,\lambda)$:

$$\begin{aligned}p(O|\lambda)&=p(o_1,\cdots,o_T|\lambda)\\
&=\sum\limits_{i=1}^Np(o_1,o_2,\cdots,o_T,i_1=q_i|\lambda)\\
&=\sum\limits_{i=1}^Np(o_1,o_2,\cdots,o_T|i_1=q_i,\lambda)\pi_i\\
&=\sum\limits_{i=1}^Np(o_1|o_2,\cdots,o_T,i_1=q_i,\lambda)p(o_2,\cdots,o_T|i_1=q_i,\lambda)\pi_i\\
&=\sum\limits_{i=1}^Nb_i(o_1)\pi_i\beta_1(i)\end{aligned}$$

For this $\beta_1(i)$, we derive a recursion for $\beta_t(i)$ in general:

$$\begin{aligned}\beta_t(i)&=p(o_{t+1},\cdots,o_T|i_t=q_i)\\
&=\sum\limits_{j=1}^Np(o_{t+1},o_{t+2},\cdots,o_T,i_{t+1}=q_j|i_t=q_i)\\
&=\sum\limits_{j=1}^Np(o_{t+1},\cdots,o_T|i_{t+1}=q_j,i_t=q_i)p(i_{t+1}=q_j|i_t=q_i)\\
&=\sum\limits_{j=1}^Np(o_{t+1},\cdots,o_T|i_{t+1}=q_j)a_{ij}\\
&=\sum\limits_{j=1}^Np(o_{t+1}|o_{t+2},\cdots,o_T,i_{t+1}=q_j)p(o_{t+2},\cdots,o_T|i_{t+1}=q_j)a_{ij}\\
&=\sum\limits_{j=1}^Nb_j(o_{t+1})a_{ij}\beta_{t+1}(j)\end{aligned}$$

Working backwards through this recursion yields $\beta_1(i)$, and hence the expression for $p(O|\lambda)$ above.
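
A matching backward-pass sketch (same toy setup; it uses the standard convention $\beta_T(i)=1$, which the derivation above leaves implicit):

```python
import numpy as np

def backward(pi, A, B, O):
    """Row t-1 holds beta_t(i) = p(o_{t+1}, ..., o_T | i_t = q_i, lambda)."""
    N, T = len(pi), len(O)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                          # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

# p(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i); should equal the forward result.
p_O_check = (pi * B[:, O[0]] * backward(pi, A, B, O)[0]).sum()
```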

Learning

To learn the optimal parameter values, MLE gives:

$$\lambda_{MLE}=\mathop{argmax}_\lambda p(O|\lambda)$$

We use the EM algorithm (here also called the Baum-Welch algorithm), with superscripts denoting the iteration:

$$\theta^{t+1}=\mathop{argmax}_{\theta}\int_z\log p(X,Z|\theta)p(Z|X,\theta^t)dz$$

where $X$ is the observed variable and $Z$ is the sequence of hidden variables. Thus:

$$\begin{aligned}\lambda^{t+1}&=\mathop{argmax}_\lambda\sum\limits_I\log p(O,I|\lambda)p(I|O,\lambda^t)\\
&=\mathop{argmax}_\lambda\sum\limits_I\log p(O,I|\lambda)p(O,I|\lambda^t)\end{aligned}$$

Here we used $p(I|O,\lambda^t)=p(O,I|\lambda^t)/p(O|\lambda^t)$ and the fact that $p(O|\lambda^t)$ does not depend on $\lambda$, so it can be dropped from the argmax. Substituting the expression from Evaluation:

$$\sum\limits_I\log p(O,I|\lambda)p(O,I|\lambda^t)=\sum\limits_I[\log \pi_{i_1}+\sum\limits_{t=2}^T\log a_{i_{t-1},i_t}+\sum\limits_{t=1}^T\log b_{i_t}(o_t)]p(O,I|\lambda^t)$$

The three terms in the bracket depend on $\pi$, $A$, and $B$ separately, so each can be maximized on its own. For $\pi^{t+1}$:

$$\begin{aligned}\pi^{t+1}&=\mathop{argmax}_\pi\sum\limits_I[\log \pi_{i_1}\cdot p(O,I|\lambda^t)]\\
&=\mathop{argmax}_\pi\sum\limits_I[\log \pi_{i_1}\cdot p(O,i_1,i_2,\cdots,i_T|\lambda^t)]\end{aligned}$$

In the expression above, summing over $i_2,i_3,\cdots,i_T$ marginalizes those variables out:

$$\pi^{t+1}=\mathop{argmax}_\pi\sum\limits_{i_1}[\log \pi_{i_1}\cdot p(O,i_1|\lambda^t)]$$

This maximization is subject to the constraint $\sum\limits_i\pi_i=1$. Define the Lagrangian:

$$L(\pi,\eta)=\sum\limits_{i=1}^N\log \pi_i\cdot p(O,i_1=q_i|\lambda^t)+\eta(\sum\limits_{i=1}^N\pi_i-1)$$

Thus:

$$\frac{\partial L}{\partial\pi_i}=\frac{1}{\pi_i}p(O,i_1=q_i|\lambda^t)+\eta=0$$

Multiplying by $\pi_i$ and summing over $i$:

$$\sum\limits_{i=1}^N[p(O,i_1=q_i|\lambda^t)+\pi_i\eta]=0\Rightarrow\eta=-p(O|\lambda^t)$$

Therefore:

$$\pi_i^{t+1}=\frac{p(O,i_1=q_i|\lambda^t)}{p(O|\lambda^t)}$$
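
Since $p(O,i_1=q_i|\lambda^t)=\alpha_1(i)\beta_1(i)$ and $p(O|\lambda^t)=\sum_i\alpha_1(i)\beta_1(i)$, this update can be computed from the forward and backward sketches in the Evaluation section. A minimal sketch of just the $\pi$ re-estimation step (a full Baum-Welch iteration would also re-estimate $A$ and $B$, which is not derived here; the function name is illustrative):

```python
def reestimate_pi(pi, A, B, O):
    """One Baum-Welch update for pi: pi_i <- p(O, i_1 = q_i | lambda) / p(O | lambda)."""
    alpha = forward(pi, A, B, O)    # forward/backward sketches from the Evaluation section
    beta = backward(pi, A, B, O)
    joint = alpha[0] * beta[0]      # p(O, i_1 = q_i | lambda)
    return joint / joint.sum()      # normalize by p(O | lambda)
```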

Decoding

The Decoding problem is stated as:

$$I=\mathop{argmax}\limits_{I}p(I|O,\lambda)$$

We need to find the state sequence with the largest probability; such a sequence is a path through the state space, so dynamic programming applies.

Define:

$$\delta_{t}(j)=\max\limits_{i_1,\cdots,i_{t-1}}p(o_1,\cdots,o_t,i_1,\cdots,i_{t-1},i_t=q_j)$$

Thus:

$$\delta_{t+1}(j)=\max\limits_{1\le i\le N}\delta_t(i)a_{ij}b_j(o_{t+1})$$

This expression maximizes, over the states at step $t$, the probability of extending the path by one step. To recover the path, record the maximizing predecessor:

$$\psi_{t+1}(j)=\mathop{argmax}\limits_{1\le i\le N}\delta_t(i)a_{ij}$$
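
A minimal Viterbi sketch built from the $\delta$ and $\psi$ recursions, again reusing the toy setup (names are illustrative); the most probable path is recovered by backtracking through $\psi$ from the best final state:

```python
import numpy as np

def viterbi(pi, A, B, O):
    """Most likely hidden state sequence argmax_I p(I | O, lambda)."""
    N, T = len(pi), len(O)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                       # delta_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_t(i) * a_ij
        psi[t] = scores.argmax(axis=0)               # psi_{t+1}(j): best predecessor i
        delta[t] = scores.max(axis=0) * B[:, O[t]]   # delta_{t+1}(j)
    path = np.zeros(T, dtype=int)                    # backtrack from the best final state
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```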