Deep Reinforcement Learning (DRL) Algorithms, Appendix 1: The Bellman Equation


The Bellman Equation

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}\left[G_t \mid S_t=s\right] \\
&= \mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\
&= \mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \,\mathbb{E}\left[G_{t+1} \mid S_t=s\right] \\
&= \sum_a \pi(a \mid s)\, \mathbb{E}\left[R_{t+1} \mid S_t=s, A_t=a\right]
   + \gamma \sum_{s^{\prime}} \mathbb{E}\left[G_{t+1} \mid S_t=s, S_{t+1}=s^{\prime}\right] p\left(s^{\prime} \mid s\right) \\
&= \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r
   + \gamma \sum_{s^{\prime}} v_\pi\left(s^{\prime}\right) p\left(s^{\prime} \mid s\right) \\
&= \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r
   + \gamma \sum_{s^{\prime}} v_\pi\left(s^{\prime}\right) \sum_a p\left(s^{\prime} \mid s, a\right) \pi(a \mid s) \\
&= \underbrace{\sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r}_{\text{mean of immediate rewards}}
   + \underbrace{\gamma \sum_a \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)}_{\text{mean of future rewards}} \\
&= \sum_a \pi(a \mid s) \underbrace{\left[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right]}_{q_\pi(s, a)} \\
&= \sum_a \pi(a \mid s) \underbrace{\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]}_{q_\pi(s, a)} \\
&= \sum_a \pi(a \mid s)\, q_\pi(s, a), \quad \forall s \in \mathcal{S}.
\end{aligned}
$$
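The last line is the element-wise Bellman equation: given a policy, $v_\pi$ is the fixed point of a linear system and can be found by simple fixed-point iteration once the model is known. Below is a minimal iterative policy-evaluation sketch, assuming a small tabular MDP whose transition tensor `P[s, a, s']` $= p(s' \mid s, a)$ and expected-reward matrix `R[s, a]` $= \sum_r p(r \mid s, a)\, r$ are given; the names and shapes are illustrative, not code from the original post.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation on a tabular MDP (illustrative layout).

    P  : (S, A, S) array, P[s, a, s'] = p(s' | s, a)
    R  : (S, A)    array, R[s, a]     = sum_r p(r | s, a) * r  (expected immediate reward)
    pi : (S, A)    array, pi[s, a]    = pi(a | s)
    Returns v with v[s] approximating v_pi(s).
    """
    v = np.zeros(P.shape[0])
    while True:
        # q(s, a) = sum_r p(r | s, a) r + gamma * sum_{s'} p(s' | s, a) v(s')
        q = R + gamma * (P @ v)          # shape (S, A)
        # v(s) = sum_a pi(a | s) q(s, a)
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

With `pi` given as an (S, A) stochastic matrix (each row summing to 1), `policy_evaluation(P, R, pi)` returns the state values of that policy.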

The Bellman Optimality Equation

$$
\begin{aligned}
v(s) &= \max_\pi \sum_a \pi(a \mid s)\left(\sum_r p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v\left(s^{\prime}\right)\right) \\
&= \max_\pi \sum_a \pi(a \mid s)\, q(s, a), \quad \forall s \in \mathcal{S}
\end{aligned}
$$
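Since $\pi(\cdot \mid s)$ is a probability distribution, the maximization over $\pi$ is attained by a greedy deterministic policy that puts all probability on an action maximizing $q(s, a)$, so $v(s) = \max_a q(s, a)$. The fixed point can then be computed by value iteration; the sketch below reuses the same illustrative `P`/`R` layout as the policy-evaluation example above and is likewise not code from the original post.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for the Bellman optimality equation (same P/R layout as above)."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * (P @ v)             # q(s, a) under the current value estimate
        v_new = q.max(axis=1)               # v(s) = max_a q(s, a): the greedy policy attains the max over pi
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # one optimal (deterministic) policy
        v = v_new
```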

By the contraction mapping theorem, the Bellman optimality equation has a unique solution for v (the optimal state value); the policy attaining it need not be unique, so there can be multiple optimal policies.
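Concretely, writing the right-hand side of the Bellman optimality equation as an operator $f$, the standard argument (following the references below) is that $f$ is a $\gamma$-contraction in the max norm:

$$
(f(v))(s) = \max_a \left(\sum_r p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v\left(s^{\prime}\right)\right),
\qquad
\left\|f(v_1) - f(v_2)\right\|_\infty \le \gamma\, \left\|v_1 - v_2\right\|_\infty .
$$

With $\gamma \in (0, 1)$, the contraction mapping theorem guarantees a unique fixed point $v^*$, and the iteration $v_{k+1} = f(v_k)$ converges to it from any initial $v_0$, which is exactly the value-iteration loop sketched above.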

References

  1. 强化学习的数学原理——从零开始透彻理解强化学习
  2. Book-Mathematical-Foundation-of-Reinforcement-Learning