第十二章 Actor-Critic演员评论家

我们在上一章中介绍了策略梯度(Policy Gradient)方法，并代码练习了蒙特卡罗策略梯度reinforce算法。但是由于该算法需要完整的状态序列，同时单独对策略函数进行迭代更新，不太容易收敛。在本章我们将讨论一种将策略(Policy Based)和价值(Value Based)相结合的方法：Actor-Critic算法，在强化学习领域最受欢迎的A3C算法，DDPG算法，PPO算法等都是AC框架，所以AC重要性不言而喻。

12.1 Actor-Critic算法简介

Actor-Critic从名字上看包括两部分，演员(Actor)和评价家(Critic)。其中Actor使用的是上一章讲到的策略函数，负责生成动作(Action)并和环境交互。而Critic使用我们之前讲到了的价值函数，负责评估Actor的表现，并指导Actor下一阶段的动作。

回想上一篇的策略梯度，策略函数就是我们的Actor，但是那里是没有Critic的，当时使用了蒙特卡罗法来计算每一步的价值部分替代了Critic的功能，但是场景比较受限。因此现在我们使用类似DQN中用的价值函数来替代蒙特卡罗法，作为一个比较通用的Critic。

也就是说在Actor-Critic算法中，我们需要做两组近似，第一组是策略函数的近似： $\pi_{\theta}(s,a) = P(a|s,\theta)\approx \pi(a|s)$

第二组是价值函数的近似，对于状态价值和动作价值函数分别是： $\hat{v}(s, w) \approx v_{\pi}(s)$ $\hat{q}(s,a,w) \approx q_{\pi}(s,a)$

对于我们上一节讲到的蒙特卡罗策略梯度reinforce算法，需要进行改造才能变成Actor-Critic算法，将会在下一节中详细介绍。

首先，在蒙特卡罗策略梯度reinforce算法中，策略的参数更新公式是： $\theta = \theta + \alpha \nabla_{\theta}log \pi_{\theta}(s_t,a_t) v_t$

梯度更新部分中， $\nabla_{\theta}log \pi_{\theta}(s_t,a_t)$ 是得分函数，无需改变，要变成Actor的话改动的是 $v_t$ ，这块不能再使用蒙特卡罗法来得到，而应该从Critic得到。

而对于Critic来说，完全可以参考之前DQN的做法，即用一个Q网络来做为Critic, 这个Q网络的输入可以是状态，而输出是每个动作的价值或者最优动作的价值。

总体上来说，就是Critic通过Q网络计算状态的最优价值vt, 而Actor利用 $v_t$ 这个最优价值迭代更新策略函数的参数θ,进而选择动作，并得到反馈和新的状态，Critic使用反馈和新的状态更新Q网络参数w, 在后面Critic会使用新的网络参数w来帮Actor计算状态的最优价值 $v_t$ 。

12.2 Actor-Critic框架引出

在上一章中我们已经得到策略梯度的更新公式如下： $\nabla_{\theta} J(\theta)\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}}\left(\sum_{t=t'}^{T_{n}} \gamma^{t'-t} r_{t'}^{n}-b\right) \nabla \log \pi_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)$ 这里将 $\sum_{t=t'}^{T_{n}} \gamma^{t'-t} r_{t'}^{n}$ 记为 $G_t^n$ ，由于 $G_t^n$ 通过交互得到，其值非常不稳定（由于环境的动态性， $G_t^n$ 本身也是一个分布），方差会比较大，因此需要寻找减少方差的办法。一种方法就是在上一章中采用的添加基线b，这个b会使得 $G_t-b$ 的期望不变，但是方差会变小，常用的baseline函数就是 $V(s_t)$ ，在此基础上，为了进一步降低 $G_t$ 的随机性，我们用 $G_t^n$ 的期望 $E(G_t^n)$ 替代 $G_t^n$ ，这样上面的更新公式变为： $\nabla_{\theta} J(\theta)\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}}\left(E(G_t^n)-V(s_t^n)\right) \nabla \log \pi_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)$

在根据Q学习部分（），可知期望 $E(G_t^n)$ 就是在状态 $s_t$ 下执行动作 $a_t$ ，并遵循策略 $\pi$ 所能得到的Q值，即 $E(G_t)$ = $Q^{\pi_{\theta} }\left(s^n_t,a^n_t\right)$ ,由此得到下式： $\nabla_{\theta} J(\theta)\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}}\left(Q^{\pi_{\theta} }\left(s^n_t,a^n_t\right)-V(s_t^n)\right) \nabla \log \pi_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)$

上式中存在的问题是，需要两个网络来分别预测Q和V，这就无形中增加了误差来源，考虑到贝尔曼等式，即：

$Q^{\pi_{\theta} }\left(s^n_t,a^n_t\right)=E\left[ r_t^n+V^\pi(s_{t+1}^n)\right]$

这里将期望去掉（个人理解，虽然去掉期望会导致有偏，但是最终还是会收敛到真实值）：

$Q^{\pi_{\theta} }\left(s^n_t,a^n_t\right)= r_t^n+V^{\pi_\theta}(s_{t+1}^n)$

那么最终就得到： $\nabla_{\theta} J(\theta)\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}}\left(r_t^n+V^{\pi_\theta}(s_{t+1}^n)-V(s_t^n)\right) \nabla \log \pi_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)$

这样只需要一个网络就可以估算出V值了，而估算V的网络正是我们在Q-learning 中做的，所以我们就把这个网络叫做Critic。这样就在Policy Gradient算法的基础上引进了Q-learning 算法了。

12.3 Actor-Critic算法流程

Critic使用神经网络来计算TD误差并更新网络参数，Actor将TD误差作为输入，也使用神经网络来更新网络参数　　

算法输入：迭代轮数T，状态特征维度n, 动作集A, 步长α,β，衰减因子γ, 探索率ϵ, Critic网络结构和Actor网络结构。

输出：Actor 网络参数θ, Critic网络参数w 1.随机初始化所有的状态和动作对应的价值Q 2.for i from 1 to T，进行迭代。

a)初始化S为当前状态序列的第一个状态, 得到其特征向量 $\phi(S)$

b)在Actor网络中使用 $\phi(S)$ 作为输入，输出动作A,基于动作A得到新的状态S′,奖励r。

c)在Critic网络中分别使用 $\phi(S)$ ， $\phi(S’)$ 作为输入，得到Q值输出V(S)，V(S′)

d)计算TD误差 $\delta = R +\gamma V(S’) -V(S)$

e)使用均方差损失函数 $\sum\limits(R +\gamma V(S’) -V(S,w))^2$ 作Critic网络参数w的梯度更新

f)更新Actor网络参数θ:

$\theta = \theta + \alpha \nabla_{\theta}log \pi_{\theta}(S_t,A)\delta$

对于Actor的得分函数∇θlogπθ(St,A),可以选择softmax或者高斯分值函数。

12.4 代码练习

代码针对的环境的是 CliffWalkingEnv，在该环境中智能体在一个 4x12 的网格中移动，状态编号如下所示：

[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
 [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
 [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
 [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]

在任何阶段开始时，初始状态都是状态 36，状态 47 是唯一的终止状态，悬崖对应的是状态 37 到 46。智能体有 4 个可选动作（UP = 0，RIGHT = 1，DOWN = 2，LEFT = 3）。智能体每走一步都会得到-1 的奖励，跌入悬崖会得到-100 的奖励并重置到起点，当达到目标时，片段结束。

AC完整的代码如下：

import gym
import itertools
import matplotlib
import numpy as np
import sys
import tensorflow as tf
import collections

if "../" not in sys.path:
    sys.path.append("../")
from Lib.envs.cliff_walking import CliffWalkingEnv
from Lib import plotting

matplotlib.style.use('ggplot')

env = CliffWalkingEnv()


class PolicyEstimator():
    """
    策略函数逼近
    """

    def __init__(self, learning_rate=0.01, scope="policy_estimator"):
        with tf.variable_scope(scope):
            self.state = tf.placeholder(tf.int32, [], "state")
            self.action = tf.placeholder(dtype=tf.int32, name="action")
            self.target = tf.placeholder(dtype=tf.float32, name="target")

            # This is just table lookup estimator
            state_one_hot = tf.one_hot(self.state, int(env.observation_space.n))
            self.output_layer = tf.contrib.layers.fully_connected(
                inputs=tf.expand_dims(state_one_hot, 0),
                num_outputs=env.action_space.n,
                activation_fn=None,
                weights_initializer=tf.zeros_initializer)

            self.action_probs = tf.squeeze(tf.nn.softmax(self.output_layer))
            self.picked_action_prob = tf.gather(self.action_probs, self.action)

            self.loss = -tf.log(self.picked_action_prob) * self.target

            self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
            self.train_op = self.optimizer.minimize(
                self.loss, global_step=tf.contrib.framework.get_global_step())

    def predict(self, state, sess=None):
        sess = sess or tf.get_default_session()
        return sess.run(self.action_probs, {self.state: state})

    def update(self, state, target, action, sess=None):
        sess = sess or tf.get_default_session()
        feed_dict = {self.state: state, self.target: target, self.action: action}
        _, loss = sess.run([self.train_op, self.loss], feed_dict)
        return loss


class ValueEstimator():
    """
    值函数逼近器
    """

    def __init__(self, learning_rate=0.1, scope="value_estimator"):
        with tf.variable_scope(scope):
            self.state = tf.placeholder(tf.int32, [], "state")
            self.target = tf.placeholder(dtype=tf.float32, name="target")

            # This is just table lookup estimator
            state_one_hot = tf.one_hot(self.state, int(env.observation_space.n))
            self.output_layer = tf.contrib.layers.fully_connected(
                inputs=tf.expand_dims(state_one_hot, 0),
                num_outputs=1,
                activation_fn=None,
                weights_initializer=tf.zeros_initializer)

            self.value_estimate = tf.squeeze(self.output_layer)
            self.loss = tf.squared_difference(self.value_estimate, self.target)

            self.optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
            self.train_op = self.optimizer.minimize(
                self.loss, global_step=tf.contrib.framework.get_global_step())

    def predict(self, state, sess=None):
        sess = sess or tf.get_default_session()
        return sess.run(self.value_estimate, {self.state: state})

    def update(self, state, target, sess=None):
        sess = sess or tf.get_default_session()
        feed_dict = {self.state: state, self.target: target}
        _, loss = sess.run([self.train_op, self.loss], feed_dict)
        return loss


def actor_critic(env, estimator_policy, estimator_value, num_episodes, discount_factor=1.0):
    """
    Actor Critic 算法.通过策略梯度优化策略函数逼近器

    参数:
        env: OpenAI环境.
        estimator_policy: 待优化的策略函数
        estimator_value: 值函数逼近器，用作评论家
        num_episodes: 回合数
        discount_factor: 折扣因子

    返回值:
        EpisodeStats对象，包含两个numpy数组，分别存储片段长度和片段奖励
    """

    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    Transition = collections.namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

    for i_episode in range(num_episodes):
        state = env.reset()

        episode = []

        for t in itertools.count():

            action_probs = estimator_policy.predict(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)

            episode.append(Transition(
                state=state, action=action, reward=reward, next_state=next_state, done=done))

            stats.episode_rewards[i_episode] += reward
            stats.episode_lengths[i_episode] = t

            # 计算TD目标
            value_next = estimator_value.predict(next_state)
            td_target = reward + discount_factor * value_next
            td_error = td_target - estimator_value.predict(state)

            # 更新值函数逼近
            estimator_value.update(state, td_target)

            # 更新策略逼近
            # 使用TD误差作为优势估计
            estimator_policy.update(state, td_error, action)

            print("\rStep {} @ Episode {}/{} ({})".format(
                t, i_episode + 1, num_episodes, stats.episode_rewards[i_episode - 1]), end="")

            if done:
                break

            state = next_state

    return stats


tf.reset_default_graph()

global_step = tf.Variable(0, name="global_step", trainable=False)
policy_estimator = PolicyEstimator()
value_estimator = ValueEstimator()

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    stats = actor_critic(env, policy_estimator, value_estimator, 300)

plotting.plot_episode_stats(stats, smoothing_window=10)

结果如下：

第十二章 演员评论家（Actor-Critic）-强化学习理论学习与代码实现（强化学习导论第二版）

第十二章 Actor-Critic演员评论家

12.1 Actor-Critic算法简介

12.2 Actor-Critic框架引出

12.3 Actor-Critic算法流程

12.4 代码练习

第十二章演员评论家（Actor-Critic）-强化学习理论学习与代码实现（强化学习导论第二版）