Policy Gradient: REINFORCE
Theory
Policy gradient is a very intuitive method: it directly fits the policy with a neural network, and the optimization objective is simply the expected cumulative return.
Concretely, the policy is fitted by a neural network whose parameters are $\theta$, so it is usually written as $\pi_\theta(a \mid s)$. The network takes an observation vector as input; its output vector has length equal to the number of actions (assume discrete actions for now) and gives the probability of choosing each action.
The optimization objective is to maximize the expected cumulative return, $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. To optimize the network parameters we usually take the gradient of the objective and update the parameters along it: gradient ascent for maximization, gradient descent otherwise. So the next step is to compute the gradient of the objective with respect to the policy parameters, $\nabla_\theta J(\theta)$.
It turns out this gradient is not easy to compute directly, because $\theta$ does not appear explicitly in the expectation; some algebraic tricks are needed.
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \sum_{\tau} P(\tau;\theta)\, R(\tau) \\
&= \nabla_\theta \sum_{\tau} \Big[ p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \Big] R(\tau) \\
&= \sum_{\tau} R(\tau)\, \nabla_\theta P(\tau;\theta) \\
&= \sum_{\tau} R(\tau)\, P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta) \\
&= \sum_{\tau} P(\tau;\theta)\, \big[ R(\tau)\, \nabla_\theta \log P(\tau;\theta) \big] \\
&= \mathbb{E}_{\tau \sim \pi_\theta} \big[ R(\tau)\, \nabla_\theta \log P(\tau;\theta) \big] \\
&= \mathbb{E}_{\tau \sim \pi_\theta} \Big[ \Big( \sum_{t=0}^{T-1} r_t \Big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
\end{aligned}
$$
Line 1 is the definition: $R(\tau)$ is the total return of an episode and $P(\tau;\theta)$ is the probability of that episode occurring. Line 2 expands the probability of the episode as a sequential decision process (derived in Chapter 1). Line 3 uses the linearity of the gradient operator, which only acts on the term containing the parameters, i.e. the policy. Line 4 applies the identity $\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)$. Line 5 shows that thanks to this identity a probability factor reappears, so the original $P(\tau;\theta)$ is recovered. Line 6 packs it back into an expectation. Line 7 pushes the gradient operator through by linearity, keeping only the policy terms of $\log P(\tau;\theta)$ (the dynamics terms do not depend on $\theta$), and rewrites the cumulative return as a sum of rewards.
This manipulation shows that although the expected cumulative return itself is awkward to differentiate, if we instead take $\mathbb{E}_{\tau \sim \pi_\theta} \big[ R(\tau) \sum_{t} \log \pi_\theta(a_t \mid s_t) \big]$ as the objective, its gradient is exactly the gradient of the original objective. So this expression is what we optimize.
In actual computation the expectation is only a mathematical concept; it is estimated by averaging over a set of sampled episodes, so it becomes $\frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \sum_{t} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$.
Looking at this formula more closely, its gradient is $\frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \sum_{t} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$. It can also be read as follows: for every $s_t$ and $a_t$, the algorithm computes a gradient $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$; the $R(\tau)$ in front can be seen as a weight on that gradient, and the weighted sum gives the overall update direction. This assigns the same weight to every gradient within an episode, which causes two problems:
- 1 The rewards are often all positive. In the classic CartPole, for instance, every step earns +1, so the cumulative return of an episode is always positive. Even if an action is bad, the weight in front of its gradient is still positive and the policy is pushed toward reinforcing it; meanwhile many state-action pairs are never sampled, the parameters never move toward reinforcing them, and reinforcing the bad actions indirectly suppresses the probability of selecting those unsampled ones. The fix is to subtract a baseline $b$ from the return so that the weights can be positive or negative, i.e. use $R(\tau) - b$; then even if a sampled episode has a positive but low return, we still do not encourage reinforcing that behavior.
- 2 Within an episode, the credit of each action is not the same. Intuitively the later steps probably deserve lower scores, since those actions may be what wrecked the situation and ended the episode. So only the rewards collected after an action should count toward the weight of its gradient, and the gradient update becomes (a small numeric illustration follows this list)
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \Big( \sum_{t' \ge t} \gamma^{\,t'-t}\, r_{t'}^{(i)} - b \Big) \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$$
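To make the reward-to-go weighting concrete, here is a tiny standalone illustration (the helper name and the numbers are made up, not part of the code later in this post): with $\gamma = 0.9$ and a reward of 1 at every step, the first action is weighted by the whole discounted tail while the last action only gets its own reward.
import numpy as np

def discounted_rewards_to_go(rewards, gamma: float) -> np.ndarray:
    # G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}, computed backwards in a single pass
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.9))  # [2.71 1.9  1.  ]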
To recap: define a network $\pi_\theta$, build from the sampled data an objective function of $\theta$,
$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t} \big( G_t^{(i)} - b \big) \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}), \qquad G_t^{(i)} = \sum_{t' \ge t} \gamma^{\,t'-t}\, r_{t'}^{(i)},$$
and minimize it with gradient descent.
For the implementation, the main details to watch are:
- 1 How to choose the baseline. The value function $V(s)$ can be used, and more advanced methods introduce an extra neural network to estimate it (see the sketch after this list); a simpler approach is to directly standardize the collected discounted rewards.
- 2 Doesn't this look inefficient, since many episodes have to be averaged before a single update? It is not quite that bad: in practice we update once per episode, although even that is fairly inefficient. Note that REINFORCE is an on-policy method, so the policy being updated must be the same policy that sampled the data. In my experiments, REINFORCE with epsilon-greedy becomes very unstable, because epsilon-greedy is, strictly speaking, no longer the policy's true distribution. For the same reason the update cannot be done batch by batch: after the first mini-batch update the policy has changed and no longer matches the policy that collected the data. To avoid pushing one enormous matrix through the network, the loss can still be summed in chunks, but there is only a single optimizer step.
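The learned baseline from point 1 is mentioned but not implemented in this post; below is only a minimal sketch of what it could look like, with a hypothetical ValueNet fitted by regression to the discounted returns (all names here are illustrative, not the author's API).
import torch
import torch.nn as nn
import torch.nn.functional as func

class ValueNet(nn.Module):
    # hypothetical baseline network: estimates V(s) for each observation in a batch
    def __init__(self, observ_dim: int, n_hidden: int = 20) -> None:
        super().__init__()
        self.fc1 = nn.Linear(observ_dim, n_hidden)
        self.fc2 = nn.Linear(n_hidden, 1)

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        return self.fc2(func.relu(self.fc1(observations)))  # shape (batch, 1)

# Inside update(), the value estimate would replace the mean/std standardization:
#   values = value_net(batch_observs)                    # V(s_t), shape (batch, 1)
#   weights = batch_rewards - values.detach()            # (G_t - V(s_t)) weights the log-probs
#   value_loss = func.mse_loss(values, batch_rewards)    # fit V(s_t) to the discounted returns
# with value_loss minimized by its own optimizer.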
Implementation
Implementing an agent generally involves four questions:
- 1 How to define the network
- 2 How to choose an action
- 3 How to update the gradient
- 4 How to sample data
The following goes through these four aspects with the actual code.
1 How to define the network?
To keep things decoupled, the network and the agent are not put in one class: the network is a separate class whose forward method returns the raw network output, and how to sample from that output is left to the agent.
class DiscretePolicy(nn.Module):
    def __init__(self, observ_dim: int, n_action: int, device: torch.device) -> None:
        super().__init__()
        n_hidden: int = 20
        self.observ_dim = observ_dim
        self.fc1 = nn.Linear(observ_dim, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_action)

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        observations = observations.view(-1, self.observ_dim)
        hidden = func.relu(self.fc1(observations))
        return func.softmax(self.fc2(hidden), dim=-1)
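A quick sanity check of this network (a hypothetical usage snippet, not part of the original code): each output row is a probability distribution over actions and should sum to 1.
import torch

net = DiscretePolicy(observ_dim=4, n_action=2, device=torch.device("cpu"))  # CartPole-like sizes
fake_observ = torch.randn(4)          # a single made-up observation
probs = net(fake_observ)              # view(-1, observ_dim) makes this shape (1, 2)
print(probs, probs.sum(dim=-1))       # the row sums to 1 thanks to the softmax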
2 How to choose an action
PyTorch's Categorical distribution helps here: it treats the network output as a probability distribution and samples from it.
Since choose_action is the method that interacts with the environment, its input is a single observation; with discrete actions, its output is simply an integer, the index of the chosen action.
class Agent:
    # ...
    def choose_action(self, single_observ: np.array) -> int:
        with torch.no_grad():
            single_observ = torch.Tensor(single_observ).to(self.device)
            return torch.distributions.Categorical(self.net(single_observ)).sample().squeeze().item()
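For reference, a tiny illustration of what Categorical does with such an output vector (the probabilities are made up): sampling many times roughly reproduces the given probabilities.
import torch

probs = torch.tensor([0.1, 0.7, 0.2])                  # e.g. a softmax output for 3 actions
dist = torch.distributions.Categorical(probs)          # interprets the vector as a probability mass function
samples = torch.stack([dist.sample() for _ in range(1000)])
print(torch.bincount(samples, minlength=3) / 1000.0)   # roughly tensor([0.1, 0.7, 0.2])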
3 How to update the gradient
As stated above, we update once per episode, so the observations, actions and rewards passed in all come from that single episode.
First, the discounted rewards, i.e. the cumulative return starting from each time step, are computed backwards using gamma. The standardization step is just the baseline-subtraction trick. To avoid building an overly large matrix, the loss is summed chunk by chunk. PyTorch's gather function picks out the probability of each chosen action.
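Before the update code, a small illustration of the gather call used below (the numbers are made up): with dim=1 it selects, row by row, the probability of the action that was actually taken.
import torch

probs = torch.tensor([[0.7, 0.3],                 # action probabilities for 3 steps, 2 actions
                      [0.2, 0.8],
                      [0.5, 0.5]])
actions = torch.tensor([[0], [1], [0]])           # the action taken at each step
picked = probs.gather(dim=1, index=actions)       # tensor([[0.7], [0.8], [0.5]])
print(picked.log())                               # these are the log pi(a_t | s_t) terms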
class Agent:
    # init and other functions
    def update(self, observations: Union[List[np.array], np.array],
               actions: Union[List[np.array], np.array], rewards: Union[List[np.array], np.array]) -> float:
        episode_len = len(observations)
        assert episode_len > 1  # with a single sample the std in the standardization below is ill-defined
        discounted_rewards = np.zeros(episode_len)
        discounted_rewards[-1] = rewards[-1]
        for i in reversed(range(episode_len - 1)):
            discounted_rewards[i] = self.gamma * discounted_rewards[i + 1] + rewards[i]
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 0.01)
        # accumulate the loss chunk by chunk to avoid building one huge matrix
        loss = torch.zeros(1).to(self.device)
        size = 200
        for i in range(0, episode_len, size):
            left, right = i, min(episode_len, i + size)
            batch_rewards = torch.Tensor(np.array(discounted_rewards[left: right])).to(self.device).view(-1, 1)
            batch_observs = torch.Tensor(np.array(observations[left: right])).to(self.device).view(-1, self.observ_dim)
            batch_actions = torch.Tensor(np.array(actions[left: right])).to(self.device, dtype=torch.long).view(-1, 1)
            # self.net outputs a 2-D matrix: rows are the batch, columns are the actions, values are probabilities.
            # gather with dim=1 selects, per row, the probability of the action in batch_actions,
            # giving a (batch, 1) tensor of probabilities; log turns them into log pi(a|s). We sum here
            # and divide by the episode length only after all chunks have been added.
            loss += (batch_rewards * self.net(batch_observs).gather(dim=1, index=batch_actions).log()).sum()
        # average over the episode (negated because the optimizer minimizes)
        loss = -loss / episode_len
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
4 How to sample data
As the update method above shows, the observations, actions and rewards of each episode need to be collected for training, so we simply gather them episode by episode.
def train(env: gym.Env, agent: REINFORCEAgent, n_episode: int = 100) -> np.array:
    memory = collections.namedtuple("memory", ["observs", "actions", "rewards"])
    losses = np.zeros(n_episode)
    for i in range(n_episode):
        mem = memory([], [], [])
        # collect data for one episode
        observ, done = env.reset(), False
        while not done:
            action: int = agent.choose_action(observ)
            next_observ, reward, done, info = env.step(action)
            # insert into memories
            mem.observs.append(observ)
            mem.actions.append(action)
            mem.rewards.append(reward)
            observ = next_observ
        losses[i] = agent.update(mem.observs, mem.actions, mem.rewards)
        del mem
    return losses
Extending to continuous actions
The idea is exactly the same; the only difference is that the policy no longer outputs a probability mass function but the parameters of a distribution.
For example, assume the policy is a normal distribution whose dimensions are mutually independent. The network then outputs two vectors, each of length action_dim: one for the means and one for the variances. Vanilla REINFORCE does not perform particularly well here, though.
Three places need to change:
- 1 choose_action must sample an action from the distribution.
- 2 The log-prob term in the loss must be computed accordingly (a cross-check against torch.distributions.Normal follows this list).
- 3 batch_actions no longer needs to be cast to long. For simplicity, the action space is simply treated as normally distributed, so only a mean and a variance need to be fitted.
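As a cross-check for point 2, the hand-written Gaussian log-density in the full code below can also be expressed with torch.distributions.Normal. This is only a sketch of the equivalent computation under the same independence assumption, not the implementation used in this post; the function name and the example shapes are made up.
import torch

def gaussian_log_probs(mus: torch.Tensor, sigma2s: torch.Tensor,
                       batch_actions: torch.Tensor) -> torch.Tensor:
    # log-density of batch_actions under independent Normals N(mu, sigma2) per action dimension,
    # i.e. the sum over dimensions of -0.5*log(2*pi*sigma2) - (a - mu)^2 / (2*sigma2)
    dist = torch.distributions.Normal(mus, sigma2s.sqrt())   # Normal takes the std, not the variance
    return dist.log_prob(batch_actions).sum(dim=1, keepdim=True)

# illustrative shapes: a batch of 5 observations with a 1-dimensional action, as in Pendulum
mus, sigma2s = torch.zeros(5, 1), torch.ones(5, 1)
actions = torch.randn(5, 1)
print(gaussian_log_probs(mus, sigma2s, actions).shape)       # torch.Size([5, 1])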
Full code and experiment results
The code is open-sourced on GitHub: github.com/sdycodes/RL…
My full code puts the discrete and the continuous version together.
The discrete version is run on CartPole-v0: all torch and numpy seeds are set to 0, and the env seed is varied over 0, 1, 2 to compare the results on CPU and GPU.
The continuous version is run on Pendulum-v0, but it does not work well and I did not tune it carefully; PPO is the one that is truly stable.
import collections
import gym
import torch.nn as nn
import torch
import numpy as np
import torch.nn.functional as func
from typing import Union, Any, List, Tuple
class DiscretePolicy(nn.Module):
    def __init__(self, observ_dim: int, n_action: int) -> None:
        super().__init__()
        n_hidden: int = 20
        self.observ_dim = observ_dim
        self.fc1 = nn.Linear(observ_dim, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_action)

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        observations = observations.view(-1, self.observ_dim)
        hidden = func.relu(self.fc1(observations))
        return func.softmax(self.fc2(hidden), dim=-1)

    def log_probs(self, batch_observs: torch.Tensor, batch_actions: torch.Tensor) -> torch.Tensor:
        return self(batch_observs).gather(dim=1, index=batch_actions).log()
class ContinuousPolicy(nn.Module):
    def __init__(self, observ_dim: int, action_dim: int) -> None:
        super().__init__()
        hidden_size = 30
        self.observ_dim = observ_dim
        self.action_dim = action_dim
        self.shared_fc = nn.Linear(observ_dim, hidden_size)
        self.mu_fc = nn.Linear(hidden_size, action_dim)
        self.sigma2_fc = nn.Linear(hidden_size, action_dim)

    def forward(self, observations: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        observations = observations.view(-1, self.observ_dim)
        hidden_layer = func.relu(self.shared_fc(observations))
        return self.mu_fc(hidden_layer), func.softplus(self.sigma2_fc(hidden_layer))  # softplus keeps the variance positive

    def log_probs(self, batch_observs: torch.Tensor, batch_actions: torch.Tensor) -> torch.Tensor:
        mus, sigma2s = self(batch_observs)
        left = -(2 * np.pi * sigma2s).pow(0.5).log().sum(dim=1, keepdim=True)  # log(1 / sqrt(2*pi*sigma^2))
        right = ((batch_actions - mus).square() / (2. * sigma2s)).sum(dim=1, keepdim=True)
        return (left - right).view(-1, 1)  # shape (batch, 1) so it broadcasts correctly against batch_rewards
class REINFORCEAgent:
    def __init__(self, gamma: float, observ_dim: int, n_action: int, lr: float, device: torch.device, model: nn.Module):
        super().__init__()
        self.observ_dim = observ_dim
        self.device = device
        self.net = model
        self.discrete = isinstance(model, DiscretePolicy)
        self.net.to(self.device)
        self.gamma = gamma
        self.n_action = n_action
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=lr)

    def choose_action(self, single_observ: np.array) -> Union[int, np.array]:
        single_observ = torch.Tensor(single_observ).to(self.device)
        with torch.no_grad():
            if self.discrete:
                return torch.distributions.Categorical(self.net(single_observ)).sample().squeeze().item()
            else:
                mu, sigma2 = self.net(single_observ)
                action = (mu + sigma2.sqrt() * torch.randn(self.n_action).to(self.device))
                action = torch.clamp(action, -2, 2)
                action = action.cpu().numpy().reshape(self.n_action)
                return action
    def update(self, observations: Union[List[np.array], np.array],
               actions: Union[List[np.array], np.array], rewards: Union[List[np.array], np.array]) -> float:
        episode_len = len(observations)
        assert episode_len > 1
        discounted_rewards = np.zeros(episode_len)
        discounted_rewards[-1] = rewards[-1]
        for i in reversed(range(episode_len - 1)):
            discounted_rewards[i] = self.gamma * discounted_rewards[i + 1] + rewards[i]
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 0.01)
        # accumulate the loss chunk by chunk to avoid building one huge matrix
        size = 200
        loss = torch.zeros(1).to(self.device)
        for i in range(0, episode_len, size):
            left, right = i, min(episode_len, i + size)
            batch_rewards = torch.Tensor(np.array(discounted_rewards[left: right])).to(self.device).view(-1, 1)
            batch_observs = torch.Tensor(np.array(observations[left: right])).to(self.device).view(-1, self.observ_dim)
            if self.discrete:
                batch_actions = torch.tensor(np.array(actions[left: right], dtype=np.int64)).to(self.device).view(-1, 1)
            else:
                # continuous case: Pendulum's action is 1-dimensional, hence view(-1, 1)
                batch_actions = torch.Tensor(np.array(actions[left: right])).to(self.device).view(-1, 1)
            log_probs = self.net.log_probs(batch_observs, batch_actions)
            loss += (batch_rewards * log_probs).sum()
        loss = -loss / episode_len
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
def train(env: gym.Env, agent: REINFORCEAgent, n_episode: int = 100) -> np.array:
    memory = collections.namedtuple("memory", ["observs", "actions", "rewards"])
    losses = np.zeros(n_episode)
    for i in range(n_episode):
        mem = memory([], [], [])
        # collect data for one episode
        observ, done = env.reset(), False
        while not done:
            action: int = agent.choose_action(observ)
            next_observ, reward, done, info = env.step(action)
            # insert into memories
            mem.observs.append(observ)
            mem.actions.append(action)
            mem.rewards.append(reward)
            observ = next_observ
        losses[i] = agent.update(mem.observs, mem.actions, mem.rewards)
        del mem
    return losses
def evaluation(env: gym.Env, agent: Any, n_episode: int = 1000, render: bool = True, verbose: int = 3):
    if verbose >= 1:
        print(f"observation space: {env.observation_space}")
        print(f"action space: {env.action_space}")
    total_reward: float = 0.
    for episode in range(n_episode):
        # initialize
        episode_reward, step = 0., 0
        observation, reward, done = env.reset(), 0., False
        while not done:
            if render:
                env.render()
            # take action here
            action = agent.choose_action(observation)
            next_observation, reward, done, info = env.step(action)
            observation = next_observation
            step += 1
            episode_reward += reward
            if verbose >= 3:
                print(f"step={step}, reward={reward}, info={info}")
        total_reward += episode_reward
        if verbose >= 2:
            print(f"ep={episode}, reward={episode_reward}")
    if verbose >= 1:
        print(f"total episodes:{n_episode}, average reward per episode: {total_reward / n_episode}")
    if render:
        env.close()
    return total_reward / n_episode
def main_discrete():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    env = gym.make("CartPole-v1")
    env.seed(2)
    # vary only the env seed to check stability
    # cpu env
    # 0 - 143.46, 1 - 104.58, 2 - 200.0
    # cuda env
    # 0 - 181.5, 1 - 151.26, 2 - 159.99
    n_action = env.action_space.n
    observ_dim = env.observation_space.shape[0]
    gamma = 0.9
    discrete_model = DiscretePolicy(observ_dim, n_action)
    reinforce_agent = REINFORCEAgent(gamma=gamma, observ_dim=observ_dim, n_action=n_action, lr=0.01,
                                     device=device, model=discrete_model)
    _ = train(env, reinforce_agent, n_episode=500)
    evaluation(env, reinforce_agent, 100, render=True, verbose=1)
def main_continuous():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    env = gym.make("Pendulum-v1")
    env.seed(0)
    action_dim = env.action_space.shape[0]
    observ_dim = env.observation_space.shape[0]
    gamma = 0.9
    continuous_model = ContinuousPolicy(observ_dim, action_dim)
    reinforce_agent = REINFORCEAgent(gamma=gamma, observ_dim=observ_dim, n_action=action_dim, lr=0.01,
                                     device=device, model=continuous_model)
    _ = train(env, reinforce_agent, n_episode=1000)
    evaluation(env, reinforce_agent, 100, render=True, verbose=1)
if __name__ == "__main__":
    np.random.seed(0)
    torch.manual_seed(0)
    torch.cuda.manual_seed(0)
    torch.cuda.manual_seed_all(0)
    main_discrete()
    main_continuous()