4 REINFORCE Policy Gradient



Theory

Policy gradient is a very intuitive method. Why intuitive? Because it directly fits the policy with a neural network, and the optimization objective is directly the expected cumulative return.

Fitting the policy with a neural network means, concretely, that the policy has parameters $\theta$ and is usually written $\pi_\theta(\cdot|s)$. The network input is an observation vector, and the output is a vector whose length equals the number of actions (consider discrete actions first), giving the probability of choosing each action.

The optimization objective is simply to maximize the expected cumulative return $\mathbb{E}[\sum_t\gamma^t R_t]$. To optimize network parameters we usually take the gradient of the objective and update the parameters along it: gradient ascent when maximizing, gradient descent otherwise. So the next step is to take the gradient of this objective with respect to the policy parameters $\theta$.
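
For reference, the resulting update is plain gradient ascent on this objective (a standard textbook step; $\alpha$ here is just a generic learning-rate symbol, not something defined elsewhere in this post):

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta\, \mathbb{E}\Big[\sum_t \gamma^t R_t\Big]$$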

It turns out this gradient is not easy to compute directly, because the parameters do not appear explicitly in the expectation. Solving this requires a few manipulation tricks.

$$
\begin{aligned}
\nabla_\theta \mathbb{E}\Big[\sum_t\gamma^t R_t\Big]
& = \nabla_\theta \sum_\tau R(\tau)\, p(\tau) \\
& = \nabla_\theta \sum_\tau R(\tau)\, p(S_0)\prod_{t=0}^{n} p(S_{t+1}, R_t|S_t, A_t)\,\pi_\theta(A_{t}|S_{t}) \\
& = \sum_\tau R(\tau)\, p(S_0)\prod_{t=0}^{n} p(S_{t+1}, R_t|S_t, A_t)\,\nabla_\theta \prod_{t=0}^{n}\pi_\theta(A_{t}|S_{t}) \\
& = \sum_\tau R(\tau)\, p(S_0)\prod_{t=0}^{n} p(S_{t+1}, R_t|S_t, A_t)\,\pi_\theta(A_{t}|S_{t})\, \nabla_\theta \sum_t \ln\pi_\theta(A_{t}|S_{t}) \\
& = \sum_\tau R(\tau)\, p(\tau)\, \nabla_\theta \sum_t \ln\pi_\theta(A_{t}|S_{t}) \\
& = \mathbb{E}\Big[R(\tau)\, \nabla_\theta \sum_t \ln\pi_\theta(A_{t}|S_{t})\Big] \\
& = \nabla_\theta \mathbb{E}\Big[R(\tau)\sum_t \ln \pi_\theta(A_{t}|S_{t})\Big] = \nabla_\theta \mathbb{E}\Big[\sum_t \gamma^t R_t \ln \pi_\theta(A_{t}|S_{t})\Big]
\end{aligned}
$$

Line 1 is the definition: $R(\tau)$ is the total return of an episode $\tau$, and $p(\tau)$ is the probability of that episode occurring. Line 2 expands the probability of the episode as a decision process, as derived in Chapter 1. Line 3 uses the linearity of the gradient operator; it acts only on the policy factors, since they are the only terms containing the parameters. Line 4 applies the identity $\nabla \ln f(x)=\frac{1}{f(x)}\nabla f(x)$. Line 5: thanks to this identity a policy factor reappears, so the original trajectory probability can be pieced back together. Line 6 further recovers the expectation. Line 7 pulls the gradient operator back out by linearity and rewrites the cumulative return as a sum.
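
The reason lines 3–5 work out so cleanly is that only the policy factors depend on $\theta$: the initial-state distribution and the transition probabilities drop out when the log-derivative of the trajectory probability is taken,

$$\nabla_\theta \ln p(\tau) = \nabla_\theta\Big[\ln p(S_0) + \sum_t \ln p(S_{t+1}, R_t|S_t, A_t) + \sum_t \ln \pi_\theta(A_t|S_t)\Big] = \sum_t \nabla_\theta \ln \pi_\theta(A_t|S_t).$$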

The manipulation above shows that although the expected cumulative return is awkward to differentiate directly, taking $\mathbb{E}[\sum_t \gamma^t R_t \ln \pi_\theta(A_{t}|S_{t})]$ as the objective and differentiating it gives the same gradient as the original objective. So we use this expression as the optimization objective instead.

In practice, the expectation is a mathematical concept; it is computed by averaging over a batch of sampled data, so the objective becomes $\frac{1}{N}\sum_{n=1}^N\sum_{t=0}^{T_n} \gamma^t R_t \ln \pi_\theta(A_{t}|S_{t})$.
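
As a toy sketch of this Monte Carlo estimate (all numbers made up; pretend a single sampled episode of length 3):

import torch

N = 1                                          # number of sampled episodes
log_probs = torch.tensor([-0.2, -1.1, -0.7])   # ln pi_theta(A_t | S_t) at each step
weights = torch.tensor([1.0, 0.9, 0.81])       # gamma^t * R_t with gamma = 0.9 and R_t = 1
surrogate = (weights * log_probs).sum() / N    # sample average approximating the expectation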

Looking at this formula more closely: after taking the gradient it becomes $\frac{1}{N}\sum_{n=1}^N\sum_{t=0}^{T_n} \gamma^t R_t \nabla_\theta\ln \pi_\theta(A_{t}|S_{t})$. You can also read it as follows: for every pair $(S_t, A_t)$ the algorithm computes a gradient $\nabla_\theta\ln \pi_\theta(A_{t}|S_{t})$, the factor $\gamma^t R_t$ in front acts as a weight on that gradient, and summing everything gives the overall update direction. Weighting all the gradients in this indiscriminate way causes two problems:

  • 1 In many cases the rewards are all positive. In the classic CartPole, for example, every step gives +1, so the cumulative return of an episode is always positive. Even for a bad action, the weight in front of its gradient is positive, so the update still reinforces it; meanwhile many state-action pairs are never sampled, so the parameters never move toward reinforcing them, and reinforcing the bad actions effectively weakens the probability of choosing those unsampled ones. The fix is to subtract a baseline from the return so that the weights can be both positive and negative: $\frac{1}{N}\sum_{n=1}^N\sum_{t=0}^{T_n} (\gamma^t R_t - b)\nabla_\theta\ln \pi_\theta(A_{t}|S_{t})$. Then if a sampled episode's return is positive but too low, we still do not encourage reinforcing that behavior.
  • 2 Within an episode, not every action deserves the same credit. Intuitively, the later states probably deserve lower scores, because those may be the moves that wrecked the situation and ended the episode. So we only count the rewards that come after an action toward the weight of its gradient, and the update becomes $\frac{1}{N}\sum_{n=1}^N\sum_{t=0}^{T_n} (\sum_{t'=t}^{T_n}\gamma^{t'-t} R_{t'} - b)\nabla_\theta\ln \pi_\theta(A_{t}|S_{t})$ (see the reward-to-go sketch right after this list).
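
A minimal numpy sketch of the reward-to-go term $\sum_{t'=t}^{T_n}\gamma^{t'-t} R_{t'}$, computed in one reverse pass (the helper name reward_to_go is only for illustration; the agent's update method below does the same thing inline):

import numpy as np

def reward_to_go(rewards, gamma):
    # G_t = R_t + gamma * G_{t+1}, accumulated from the last step backwards
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(reward_to_go([1.0, 1.0, 1.0], gamma=0.9))  # -> [2.71, 1.9, 1.0]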

To recap: define a network $\pi_\theta(\cdot|s)$ and design an objective function of $\theta$, $J(\theta)=\frac{1}{N}\sum_{n=1}^N\sum_{t=0}^{T_n} (\sum_{t'=t}^{T_n}\gamma^{t'-t} R_{t'} - b)\ln \pi_\theta(A_{t}|S_{t})$.

For the implementation, the main details to watch are:

  • 1 How to choose the baseline $b$: $b$ can be the value function $V(S)$, and more sophisticated methods introduce an extra neural network to estimate $V$; a simpler option is to directly normalize (standardize) the collected discounted rewards (see the short sketch after this list).
  • 2 Doesn't this $J(\theta)$ look very inefficient, since it needs an average over many episodes before a single update? It is not quite that bad: in practice we update once per episode, although even that is fairly inefficient. Note that REINFORCE is an on-policy method, so the policy being updated must be the same policy that sampled the data. In my experiments, REINFORCE with epsilon-greedy is very unstable, because strictly speaking epsilon-greedy is no longer the policy's true distribution. For the same reason the update cannot be done batch by batch either: once one batch is applied, the policy has changed and no longer matches the policy that collected the data. To avoid feeding an entire episode in at once and creating a huge matrix, the sum can still be accumulated in batches, but there is only one parameter update.
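
A tiny sketch of the "normalize the returns" baseline from point 1 above (reusing the toy returns from the previous sketch): subtracting the mean plays the role of $b$, and dividing by the standard deviation just rescales the weights.

import numpy as np

returns = np.array([2.71, 1.9, 1.0])                           # e.g. discounted reward-to-go
weights = (returns - returns.mean()) / (returns.std() + 1e-8)  # now both positive and negative
print(weights)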

Implementation

Implementing an agent generally comes down to four questions:

  • 1 How to define the network
  • 2 How to choose an action
  • 3 How to update the gradient
  • 4 How to collect data

The following goes through each of the four aspects with the actual code.

1 How to define the network?

For decoupling, we usually do not put the network and the agent in the same class. The network is its own class, and its forward method returns the raw network output; exactly how to sample from that output is left to the agent.

class DiscretePolicy(nn.Module):
    def __init__(self, observ_dim: int, n_action: int, device: torch.device) -> None:
        super().__init__()
        n_hidden: int = 20
        self.observ_dim = observ_dim
        self.fc1 = nn.Linear(observ_dim, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_action)

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        observations = observations.view(-1, self.observ_dim)
        hidden = func.relu(self.fc1(observations))
        return func.softmax(self.fc2(hidden), dim=-1)
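
A quick sanity check (a minimal sketch; the dimensions are just CartPole-like placeholders, and the imports are those of the full listing at the end): every row of the output should be a probability distribution over the actions.

policy = DiscretePolicy(observ_dim=4, n_action=2, device=torch.device("cpu"))
probs = policy(torch.randn(3, 4))        # a batch of 3 fake observations
print(probs.shape, probs.sum(dim=-1))    # torch.Size([3, 2]); each row sums to 1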

2 How to choose an action

Here we use PyTorch's torch.distributions.Categorical, which treats the network output as probabilities and samples from them.

Since choose_action interacts with the environment, its input is a single observation; for a discrete action space the output is simply an integer, the index of the chosen action.

class Agent:
    # ...
    def choose_action(self, single_observ: np.array) -> int:
        with torch.no_grad():
            single_observ = torch.Tensor(single_observ).to(self.device)
            return torch.distributions.Categorical(self.net(single_observ)).sample().squeeze().item()
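
For intuition, a small standalone sketch of what Categorical does with those probabilities: sample() draws an action index with exactly the given probability.

import torch

probs = torch.tensor([0.1, 0.7, 0.2])
dist = torch.distributions.Categorical(probs)
samples = dist.sample((10000,))
print(torch.bincount(samples, minlength=3) / 10000.0)  # empirically close to [0.1, 0.7, 0.2]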

3 How to update the gradient

As mentioned above, we update once per episode, so the observations, actions and rewards passed in all come from a single episode.

First the discounted rewards are computed backwards with gamma, i.e. the cumulative return starting from each time step. The normalization is exactly the baseline-subtraction trick. To avoid building an overly large matrix, the loss is accumulated batch by batch. PyTorch's gather function picks out the probability of each chosen action.
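
Before the actual update code, a small sketch of what gather does here: with dim=1, each row of the index tensor picks one column, i.e. the probability of the action that was actually taken.

import torch

probs = torch.tensor([[0.9, 0.1],
                      [0.3, 0.7]])           # net output: batch of 2 rows, 2 actions
actions = torch.tensor([[0], [1]])           # chosen action per row, shape (batch, 1)
print(probs.gather(dim=1, index=actions))    # tensor([[0.9000], [0.7000]])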

class Agent:
    # init and other functions
    def update(self, observations: Union[List[np.array], np.array],
               actions: Union[List[np.array], np.array], rewards: Union[List[np.array], np.array]) -> float:
        episode_len = len(observations)
        assert episode_len > 1  # avoid a single-sample episode: the std in the normalization below would be degenerate

        discounted_rewards = np.zeros(episode_len)
        discounted_rewards[-1] = rewards[-1]

        for i in reversed(range(episode_len - 1)):
            discounted_rewards[i] = self.gamma * discounted_rewards[i + 1] + rewards[i]
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 0.01)

        # accumulate the loss batch by batch to avoid a too-large matrix
        loss = torch.zeros(1).to(self.device)
        size = 200
        for i in range(0, episode_len, size):
            left, right = i, min(episode_len, i + size)
            batch_rewards = torch.Tensor(np.array(discounted_rewards[left: right])).to(self.device).view(-1, 1)
            batch_observs = torch.Tensor(np.array(observations[left: right])).to(self.device).view(-1, self.observ_dim)
            batch_actions = torch.Tensor(np.array(actions[left: right])).to(self.device, dtype=torch.long).view(-1, 1)
            # self.net outputs a 2-D matrix: rows are the batch, columns the actions, values the action probabilities.
            # gather with dim=1 selects, for every row, the entry indexed by batch_actions,
            # so the result is a (batch, 1) column of the chosen actions' probabilities.
            # Taking the log gives log pi(a|s); we sum here and take the average after all batches are added.
            loss += (batch_rewards * self.net(batch_observs).gather(dim=1, index=batch_actions).log()).sum()
        # average over the episode and negate: the optimizer minimizes, while we want to maximize J
        loss = -loss / episode_len
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.cpu().data.numpy()

4 How to collect data

As the update method above shows, the observations, actions and rewards of each episode are needed for training, so we simply collect them episode by episode.

def train(env: gym.Env, agent: REINFORCEAgent, n_episode: int = 100) -> np.array:
    memory = collections.namedtuple("memory", ["observs", "actions", "rewards"])
    losses = np.zeros(n_episode)
    for i in range(n_episode):
        mem = memory([], [], [])
        # collect data process
        observ, done = env.reset(), False
        while not done:
            action: int = agent.choose_action(observ)
            next_observ, reward, done, info = env.step(action)
            # insert into memories
            mem.observs.append(observ)
            mem.actions.append(action)
            mem.rewards.append(reward)
            observ = next_observ
        losses[i] = agent.update(mem.observs, mem.actions, mem.rewards)
        del mem
    return losses

Extending to continuous actions

The idea is the same; the only difference is that the policy no longer outputs a probability mass function but the parameters of a distribution.

For example, assume the policy is Gaussian with independent dimensions. Then the network outputs two vectors, both of length action_dim: one for the mean and one for the variance. That said, vanilla REINFORCE does not perform very well here.
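
For reference, the quantity that the continuous policy's log_probs method implements below is just the standard Gaussian log-density, summed over the independent action dimensions:

$$\ln \mathcal{N}(a;\mu,\sigma^2) = -\ln\sqrt{2\pi\sigma^2} - \frac{(a-\mu)^2}{2\sigma^2}$$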

Three changes are needed:

  • 1 choose_action needs to sample an action from the distribution (see the sketch after this list)
  • 2 The log-prob computation in the loss has to be adapted accordingly
  • 3 batch_actions no longer needs to be cast to long. For simplicity, the action is assumed to follow a normal distribution here, so only a mean and a variance need to be fit.
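
A minimal sketch of point 1 using torch.distributions.Normal (the full code at the end samples by hand with mu + sigma * randn instead; this is only an equivalent illustration, and the clamp range assumes Pendulum's action bounds):

import torch

mu, sigma2 = torch.tensor([0.3]), torch.tensor([0.5])  # pretend these came from the network
dist = torch.distributions.Normal(mu, sigma2.sqrt())   # Normal takes the std, not the variance
action = torch.clamp(dist.sample(), -2.0, 2.0)         # Pendulum's action range
print(action, dist.log_prob(action))                   # log_prob could equally serve in the loss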

Full code and experiment results

The code is open-sourced on GitHub: github.com/sdycodes/RL…

My full code puts the discrete and continuous versions together.

The discrete version is run on CartPole-v0. All torch and numpy seeds are set to 0, and the env seed is varied over 0, 1, 2 to check the results on both CPU and GPU.

The continuous version is run on Pendulum-v0, but the results are not great and I did not tune it carefully; PPO is the one that is really stable.

[Figure: reinforce-cartpole.gif]

import collections
import gym
import torch.nn as nn
import torch
import numpy as np
import torch.nn.functional as func
from typing import Union, Any, List, Tuple

class DiscretePolicy(nn.Module):

    def __init__(self, observ_dim: int, n_action: int) -> None:
        super().__init__()
        n_hidden: int = 20
        self.observ_dim = observ_dim
        self.fc1 = nn.Linear(observ_dim, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_action)

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        observations = observations.view(-1, self.observ_dim)
        hidden = func.relu(self.fc1(observations))
        return func.softmax(self.fc2(hidden), dim=-1)

    def log_probs(self, batch_observs: torch.Tensor, batch_actions: torch.Tensor) -> torch.Tensor:
        return self(batch_observs).gather(dim=1, index=batch_actions).log()

class ContinuousPolicy(nn.Module):

    def __init__(self, observ_dim: int, action_dim: int) -> None:
        super().__init__()
        hidden_size = 30
        self.observ_dim = observ_dim
        self.action_dim = action_dim
        self.shared_fc = nn.Linear(observ_dim, hidden_size)
        self.mu_fc = nn.Linear(hidden_size, action_dim)
        self.sigma2_fc = nn.Linear(hidden_size, action_dim)

    def forward(self, observations: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        observations = observations.view(-1, self.observ_dim)
        hidden_layer = func.relu(self.shared_fc(observations))
        return self.mu_fc(hidden_layer), func.softplus(self.sigma2_fc(hidden_layer)) # ensure positive

    def log_probs(self, batch_observs: torch.Tensor, batch_actions: torch.Tensor) -> torch.Tensor:
        mus, sigma2s = self(batch_observs)
        left = -(2 * np.pi * sigma2s).pow(0.5).log().sum(dim=1, keepdim=True)  # -0.5 * log(2 * pi * sigma^2)
        right = ((batch_actions - mus).square() / (2. * sigma2s)).sum(dim=1, keepdim=True)
        return (left - right).view(-1, 1)  # keep shape (batch, 1) so it broadcasts against the (batch, 1) rewards


class REINFORCEAgent:
    def __init__(self, gamma: float, observ_dim: int, n_action: int, lr: float, device: torch.device, model: nn.Module):
        super().__init__()
        self.observ_dim = observ_dim
        self.device = device
        self.net = model
        self.discrete = isinstance(model, DiscretePolicy)
        self.net.to(self.device)
        self.gamma = gamma
        self.n_action = n_action
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=lr)

    def choose_action(self, single_observ: np.array) -> Union[int, np.array]:
        single_observ = torch.Tensor(single_observ).to(self.device)
        with torch.no_grad():
            if self.discrete:
                return torch.distributions.Categorical(self.net(single_observ)).sample().squeeze().item()
            else:
                mu, sigma2 = self.net(single_observ)
                action = (mu + sigma2.sqrt() * torch.randn(self.n_action).to(self.device))
                action = torch.clamp(action, -2, 2)
                action = action.cpu().numpy().reshape(self.n_action)
                return action

    def update(self, observations: Union[List[np.array], np.array],
    actions: Union[List[np.array], np.array], rewards: Union[List[np.array], np.array]) -> float:
        episode_len = len(observations)
        assert episode_len > 1
        discounted_rewards = np.zeros(episode_len)
        discounted_rewards[-1] = rewards[-1]
        for i in reversed(range(episode_len - 1)):
            discounted_rewards[i] = self.gamma * discounted_rewards[i + 1] + rewards[i]
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 0.01)
        # accumulate the loss batch by batch to avoid a too-large matrix
        size = 200
        loss = torch.zeros(1).to(self.device)
        for i in range(0, episode_len, size):
            left, right = i, min(episode_len, i + size)
            batch_rewards = torch.Tensor(np.array(discounted_rewards[left: right])).to(self.device).view(-1, 1)
            batch_observs = torch.Tensor(np.array(observations[left: right])).to(self.device).view(-1, self.observ_dim)
            if self.discrete:
                batch_actions = torch.tensor(np.array(actions[left: right], dtype=np.int64)).to(self.device).view(-1, 1)
            else:
                batch_actions = torch.Tensor(np.array(actions[left: right])).to(self.device).view(-1, 1)
            log_probs = self.net.log_probs(batch_observs, batch_actions)
            loss += (batch_rewards * log_probs).sum()
        # average over the episode and negate: the optimizer minimizes, while we want to maximize J
        loss = -loss / episode_len
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.cpu().data.numpy()

def train(env: gym.Env, agent: REINFORCEAgent, n_episode: int = 100) -> np.array:
    memory = collections.namedtuple("memory", ["observs", "actions", "rewards"])
    losses = np.zeros(n_episode)
    for i in range(n_episode):
        mem = memory([], [], [])
        # collect data process
        observ, done = env.reset(), False
        while not done:
            action = agent.choose_action(observ)
            next_observ, reward, done, info = env.step(action)
            # insert into memories
            mem.observs.append(observ)
            mem.actions.append(action)
            mem.rewards.append(reward)
            observ = next_observ
        losses[i] = agent.update(mem.observs, mem.actions, mem.rewards)
        del mem
    return losses

def evaluation(env: gym.Env, agent: Any, n_episode: int = 1000, render: bool = True, verbose: int = 3):
    if verbose >= 1:
        print(f"observation space: {env.observation_space}")
        print(f"action space: {env.action_space}")
    total_reward: float = 0.
    for episode in range(n_episode):
        # initialize
        episode_reward, step = 0., 0
        observation, reward, done = env.reset(), 0., False
        while not done:
            if render:
                env.render()
            # take action here
            action = agent.choose_action(observation)
            next_observation, reward, done, info = env.step(action)
            observation = next_observation
            step += 1
            episode_reward += reward
            if verbose >= 3:
                print(f"step={step}, reward={reward}, info={info}")
        total_reward += episode_reward
        if verbose >= 2:
            print(f"ep={episode}, reward={episode_reward}")
    if verbose >= 1:
        print(f"total episodes:{n_episode}, average reward per episode: {total_reward / n_episode}")
    if render:
        env.close()
    return total_reward / n_episode

def main_discrete():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    env = gym.make("CartPole-v1")
    env.seed(2)
    # change different env seed only to check the stability
    # cpu env
    # 0 - 143.46, 1 - 104.58, 2 - 200.0
    # cuda env
    # 0 - 181.5, 1 - 151.26, 2 - 159.99
    n_action = env.action_space.n
    observ_dim = env.observation_space.shape[0]
    gamma = 0.9
    discrete_model = DiscretePolicy(observ_dim, n_action)
    reinforce_agent = REINFORCEAgent(gamma=gamma, observ_dim=observ_dim, n_action=n_action, lr=0.01,
    device=device, model=discrete_model)
    _ = train(env, reinforce_agent, n_episode=500)
    evaluation(env, reinforce_agent, 100, render=True, verbose=1)

def main_continuous():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    env = gym.make("Pendulum-v1")
    env.seed(0)
    action_dim = env.action_space.shape[0]
    observ_dim = env.observation_space.shape[0]
    gamma = 0.9
    continuous_model = ContinuousPolicy(observ_dim, action_dim)
    reinforce_agent = REINFORCEAgent(gamma=gamma, observ_dim=observ_dim, n_action=action_dim, lr=0.01,
    device=device, model=continuous_model)
    _ = train(env, reinforce_agent, n_episode=1000)
    evaluation(env, reinforce_agent, 100, render=True, verbose=1)

if __name__ == "__main__":
    np.random.seed(0)
    torch.manual_seed(0)
    torch.cuda.manual_seed(0)
    torch.cuda.manual_seed_all(0)
    main_discrete()
    main_continuous()