Introduction
Gymnasium is the upgraded successor to the gym library. It is a powerful toolkit for simulating real-world environments: games, autonomous driving, even rocket-recovery scenarios reminiscent of SpaceX boosters. In short, combined with a reinforcement learning library such as TensorFlow's, it lets a machine learn on its own.
Prerequisites
Without further ado, let's jump straight in and run an official demo, the lunar lander. Before writing any code, install the following packages:
- python 3.10
- tensorflow 2.9
- gymnasium 1.0.0
- moviepy2 1.0.4 (an enhanced version of moviepy, used to write the episode videos)
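Assuming a standard pip setup, the list above can be installed roughly like this. Note that the `gymnasium[box2d]` extra is an assumption on my part: LunarLander needs the Box2D physics backend, which the plain `gymnasium` package does not pull in. Exact version pins may vary on your system.

```shell
pip install "tensorflow==2.9.*" "gymnasium[box2d]==1.0.0" moviepy
```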
show code
Basic version
import gymnasium as gym

env = gym.make("LunarLander-v3", render_mode="human")

for episode in range(3):
    # reset the per-episode state at the start of every episode, otherwise
    # the while loop only runs on the first iteration of the outer loop
    obs, info = env.reset()
    episode_over = False
    episode_reward = 0  # total reward accumulated this episode

    while not episode_over:
        action = env.action_space.sample()  # random policy; replace with a trained agent
        observation, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        episode_over = terminated or truncated

    # LunarLander does not report landing status in `info`; a total reward of
    # 200 or more is the conventional threshold for a successful landing
    if terminated and episode_reward >= 200:
        print(f"Success! Landed with total reward {episode_reward:.1f}")

env.close()
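For reference, LunarLander-v3 has a discrete action space with four actions, which is what `env.action_space.sample()` draws from. A minimal sketch of that mapping, with labels taken from the environment's documentation (the `ACTIONS` dict itself is my own illustration, not part of the gymnasium API):

```python
import random

# LunarLander-v3 uses Discrete(4): the agent picks one of four engine commands
ACTIONS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}

# sampling from Discrete(4) is equivalent to a uniform draw over 0..3
action = random.randrange(4)
print(f"action {action}: {ACTIONS[action]}")
```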
Advanced version
To monitor the agent's behavior in every episode, I use RecordVideo to capture a video of each episode, and RecordEpisodeStatistics to collect a few key metrics.
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

n_episodes = 100_000  # size of the statistics buffer

env = gym.make("LunarLander-v3", render_mode="rgb_array")
# record a video of every episode (episode_trigger returns True for all) into ./video2
env = RecordVideo(env, video_folder="./video2", name_prefix="lunar_",
                  episode_trigger=lambda episode_id: True)
# keep per-episode return, length, and wall-clock time for up to n_episodes episodes
env = RecordEpisodeStatistics(env, buffer_length=n_episodes)

for episode in range(3):
    # reset the per-episode state at the start of every episode, otherwise
    # the while loop only runs on the first iteration of the outer loop
    obs, info = env.reset()
    episode_over = False
    episode_reward = 0  # total reward accumulated this episode

    while not episode_over:
        action = env.action_space.sample()  # random policy; replace with a trained agent
        observation, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        episode_over = terminated or truncated

    # LunarLander does not report landing status in `info`; a total reward of
    # 200 or more is the conventional threshold for a successful landing
    if terminated and episode_reward >= 200:
        print(f"Success! Landed with total reward {episode_reward:.1f}")

env.close()

print(f'Episode time taken: {env.time_queue}')
print(f'Episode total rewards: {env.return_queue}')
print(f'Episode lengths: {env.length_queue}')
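The three queues printed above are plain deques of per-episode values, so they can be summarized directly, for example with numpy. The sample returns below are made-up illustration values (a random policy typically scores well below zero), not real output:

```python
import numpy as np
from collections import deque

# hypothetical contents of env.return_queue after three random-policy episodes
return_queue = deque([-150.2, -87.5, -210.3])

print(f"mean return: {np.mean(return_queue):.1f}")
print(f"best return: {np.max(return_queue):.1f}")
```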