Reinforcement Learning: Policy Gradient Notes

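Policy gradient methods are an important family of reinforcement learning algorithms. Unlike value-based algorithms, which learn an optimal value function and derive the policy from it, policy gradient methods parameterize the policy directly and optimize its expected long-term return. They search for the best policy by following the gradient of the objective function. In the basic form covered here, learning happens only after a whole episode has finished, so each update reflects the entire trajectory. In his reinforcement learning lectures, David Silver of DeepMind summarizes policy-based methods as follows:
Advantages of policy-based reinforcement learning:
– Better convergence properties
– Effective in high-dimensional or continuous problems
– Can learn stochastic policies

Disadvantages:
– Tends to converge to a local optimum
– Evaluating a policy is relatively inefficient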

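Basic Theory

In theory, policy gradient is one of the easier methods to understand, since it is gradient-based optimization, which we already know well. The key is to understand the objective function. As noted earlier, the goal of reinforcement learning is to find a policy that maximizes the expected return. The objective function is: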

$$J(\theta) = \int p_\theta(x)\, r(x)\, dx$$

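Here $x$ is the behavior over one episode, that is, the whole sequence of states and actions (it can be a vector); $p_\theta(x)$ is the probability of that behavior under the policy with parameters $\theta$; and $r(x)$ is its total reward. $J(\theta)$ is therefore the expected return of an episode. The goal of the algorithm is to maximize this expectation, and the maximization is carried out by following the gradient.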

The gradient of the objective function is:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(x)\, r(x)\, dx \\
&= \int p_\theta(x)\, \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\, r(x)\, dx \\
&= \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, r(x)\, dx \\
&= E_{x \sim p_\theta(x)}\left[\nabla_\theta \log p_\theta(x)\, r(x)\right]
\end{aligned}
$$
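The identity can be checked numerically. The sketch below assumes a three-outcome softmax distribution with made-up parameters theta and rewards r (neither comes from the original post) and compares a finite-difference gradient of J with the expectation form on the last line of the derivation.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

theta = np.array([0.2, -0.5, 1.0])  # hypothetical parameters
r = np.array([1.0, 3.0, -2.0])      # hypothetical rewards r(x) for 3 outcomes

def J(th):
    # J(theta) = sum_x p_theta(x) r(x), the discrete form of the objective
    return softmax(th) @ r

# Direct gradient of J by central finite differences
eps = 1e-6
grad_fd = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

# Expectation form: sum_x p(x) * grad log p(x) * r(x).
# For a softmax, d log p(x) / d theta_k = 1[x == k] - p_k.
p = softmax(theta)
score = np.eye(len(theta)) - p   # row x holds grad_theta log p(x)
grad_score = (p * r) @ score

print(np.allclose(grad_fd, grad_score, atol=1e-6))
```

The two gradients agree up to finite-difference error, which is what the derivation predicts. In practice the expectation is estimated from sampled episodes rather than summed exactly.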

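The derivation above is written for the continuous case; in the discrete case the integral becomes a sum and the result is essentially the same. The log-probability term can be expanded further. The probability of an episode factors as $p_\theta(x) = p(s_0) \prod_{t=0}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$, and neither the initial-state distribution nor the transition dynamics depend on $\theta$, so only the policy terms survive differentiation: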

$$\nabla_\theta \log p_\theta(x) = \sum_{t=0}^{T} \nabla_\theta \log p_\theta(a_t \mid s_t)$$

Approximating the expectation with $N$ sampled episodes gives the policy gradient estimate:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_\theta \log p_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=0}^{T} r(s_{i,t}, a_{i,t}) \right) \right]$$

This formula has a problem: the full-episode return $\sum_{t=0}^{T} r(s_{i,t}, a_{i,t})$ multiplies the gradient term of every time step. An action taken at step $t$ can only affect what happens from step $t$ onward, and the rewards before it contribute only noise to the estimate. The formula can therefore be corrected to:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=0}^{T} \nabla_\theta \log p_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right) \right]$$
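The difference between the two weightings is easiest to see on a tiny episode. The following is a minimal sketch; the helper name reward_to_go and the three reward values are illustrative and not part of the original post.

```python
import numpy as np

def reward_to_go(rewards):
    """Return sum_{t'=t}^{T} r_{t'} for every step t (undiscounted)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Reverse, take the running sum, then reverse back.
    return np.cumsum(rewards[::-1])[::-1]

episode_rewards = [1.0, 2.0, 3.0]  # hypothetical rewards of one episode

# First formula: every step is weighted by the full-episode return.
total = np.full(len(episode_rewards), np.sum(episode_rewards))
# Corrected formula: step t is weighted only by the rewards from t onward.
to_go = reward_to_go(episode_rewards)

print(total.tolist())  # [6.0, 6.0, 6.0]
print(to_go.tolist())  # [6.0, 5.0, 3.0]
```

The last action is weighted by 3 rather than 6, because the rewards earned before it was taken cannot have been caused by it.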

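On closer inspection, this formula still has a weakness. In many tasks the rewards are all positive, so the summed return is positive for every action and every sampled action has its probability pushed up. That weakens what the reward is supposed to signal. In practice, implementations therefore shift the returns by their mean, or even standardize them.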
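As a small illustration of the mean shift and the standardization, the sketch below uses four made-up returns that are all positive. The values and the constant 1e-8 are assumptions chosen for the example.

```python
import numpy as np

returns = np.array([10.0, 12.0, 8.0, 14.0])  # hypothetical all-positive returns

# Mean shift: returns below average become negative, so the corresponding
# actions are discouraged instead of being reinforced slightly less.
centered = returns - returns.mean()

# Standardization: also rescale to unit standard deviation.
# The small constant avoids dividing by zero when all returns are equal.
standardized = centered / (returns.std() + 1e-8)

print(centered.tolist())  # [-1.0, 1.0, -3.0, 3.0]
```

After the shift, only the actions that did better than average are made more likely, which restores the contrast that the raw positive returns had lost.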

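TensorFlow Implementation

Plain policy gradient is rarely used on its own any more, but reading an implementation still helps in understanding it. I again read Morvan Zhou's (周莫凡) implementation, so treat this as a code-reading exercise.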

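# Written against the TensorFlow 1.x graph API (tf.Session, tf.placeholder, tf.layers).
import numpy as np
import tensorflow as tf

# Fix the random seeds so that runs are reproducible.
np.random.seed(1)
tf.set_random_seed(1)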

class PolicyGradient:
    def __init__(self,
                 n_actions,
                 n_features,
                 learning_rate=0.01,
                 reward_decay=0.95,
                 output_graph=False):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay

        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        self._build_net()
        self.sess = tf.Session()
        if output_graph:
            tf.summary.FileWriter("logs/", self.sess.graph)

        self.sess.run(tf.global_variables_initializer())

    def _build_net(self):
        with tf.name_scope('inputs'):
            self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features], name="observations")
            self.tf_acts = tf.placeholder(tf.int32, [None, ], name="actions_num")
            self.tf_vt = tf.placeholder(tf.float32, [None, ], name="actions_value")

        layer = tf.layers.dense(
            inputs=self.tf_obs,
            units=10,
            activation=tf.nn.tanh,
            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1),
            name='fc1'
        )

        all_act = tf.layers.dense(
            inputs=layer,
            units=self.n_actions,
            activation=None,
            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1),
            name='fc2'
        )

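        # Turn the logits into a probability for each action.
        self.all_act_prob = tf.nn.softmax(all_act, name='act_prob')
        with tf.name_scope('loss'):
            # The one-hot mask keeps only -log pi(a_t | s_t) for the action actually taken.
            neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob) * tf.one_hot(self.tf_acts, self.n_actions), axis=1)
            # Weight each step by its normalized return; minimizing this loss ascends the policy gradient.
            loss = tf.reduce_sum(neg_log_prob * self.tf_vt)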

        with tf.name_scope('train'):
            self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)

    def choose_action(self, observation):
        prob_weights = self.sess.run(self.all_act_prob, feed_dict={self.tf_obs: observation[np.newaxis, :]})
        action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())
        return action

    def store_transition(self, s, a, r):
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)

    def learn(self):
        discounted_ep_rs_norm = self._discount_and_norm_rewards()

        self.sess.run(self.train_op, feed_dict={
            self.tf_obs: np.vstack(self.ep_obs),
            self.tf_acts: np.array(self.ep_as),
            self.tf_vt: discounted_ep_rs_norm,
        })

        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        return discounted_ep_rs_norm

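    def _discount_and_norm_rewards(self):
        # Force a float array: with integer rewards, zeros_like would create an
        # integer array, truncating the discounted values and breaking the
        # in-place normalization below.
        discounted_ep_rs = np.zeros_like(self.ep_rs, dtype=np.float64)
        running_add = 0
        # Walk the episode backwards to accumulate the discounted return from t to T.
        for t in reversed(range(0, len(self.ep_rs))):
            running_add = running_add * self.gamma + self.ep_rs[t]
            discounted_ep_rs[t] = running_add
        # Shift to zero mean, then scale to unit standard deviation.
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        # The small constant guards against division by zero when all returns are equal.
        discounted_ep_rs /= (np.std(discounted_ep_rs) + 1e-8)
        return discounted_ep_rs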

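Start with the _discount_and_norm_rewards function, which contains the mean-shift and standardization logic. The rewards are accumulated backwards, from the last step to the first, so entry t holds the return from step t to T. Note that the code also applies the discount factor gamma (the reward_decay argument), which the formulas above leave out.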
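The backward accumulation is easier to follow in isolation. This is a standalone sketch of the same recursion without the normalization step; the function name discounted_returns and the choice gamma=0.5 are illustrative only (the class above defaults to 0.95).

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, filled in from the last step backwards."""
    out = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        out[t] = running
    return out

g = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(g.tolist())  # [1.75, 1.5, 1.0]
```

Reading from the end: the last step keeps its own reward 1.0, the middle step gets 1.0 + 0.5 * 1.0 = 1.5, and the first step gets 1.0 + 0.5 * 1.5 = 1.75.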

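The TensorFlow network is simple: just two fully connected layers. A softmax gives the probability of each action, the one-hot mask picks out the probability of the action actually taken, and its negative logarithm is multiplied by the return. The gradient from the formula never appears explicitly in the code. TensorFlow computes it by automatic differentiation when the optimizer minimizes this loss, and minimizing the return-weighted negative log-probability is the same as ascending the policy gradient.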
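To make the link between the loss and the gradient formula concrete, here is a NumPy sketch of the same loss on a made-up batch. The logits, actions and vt values are assumptions, and pg_loss is a hypothetical helper rather than part of the class above. It checks that the gradient with respect to the logits is (probs - one_hot) * vt, which is the return-weighted gradient of the negative log-probability.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def pg_loss(logits, actions, vt):
    """sum_t -log pi(a_t | s_t) * vt_t, mirroring the loss in the graph."""
    probs = softmax(logits)
    picked = probs[np.arange(len(actions)), actions]  # same effect as the one-hot mask
    return np.sum(-np.log(picked) * vt)

# Hypothetical batch: 3 time steps, 2 actions.
logits = np.array([[0.1, -0.3], [0.5, 0.2], [-1.0, 0.4]])
actions = np.array([0, 1, 1])
vt = np.array([1.2, -0.4, 0.7])

# Analytic gradient with respect to the logits.
one_hot = np.eye(logits.shape[1])[actions]
grad_analytic = (softmax(logits) - one_hot) * vt[:, None]

# Central finite differences as an independent check.
eps = 1e-6
grad_fd = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        d = np.zeros_like(logits)
        d[i, j] = eps
        grad_fd[i, j] = (pg_loss(logits + d, actions, vt)
                         - pg_loss(logits - d, actions, vt)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))
```

A positive vt lowers the loss by raising the probability of the action taken, and a negative vt does the opposite, which is why the sign of the normalized return matters.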