Learning gymnasium


Introduction

gymnasium is the upgraded successor to gym. It is a powerful tool for simulating real-world environments: games, autonomous driving, even rocket-recovery scenarios like Musk's boosters. In short, combined with a reinforcement-learning library built on TensorFlow, it lets a machine learn on its own.

Prerequisites

Without further ado, let's run an official demo, the lunar lander. Before writing any code, install the following packages:

  • python 3.10
  • tensorflow 2.9
  • gymnasium 1.0.0
  • moviepy2 1.0.4 (an enhanced version of moviepy)

Note that the LunarLander environment also depends on Box2D, which can be installed via the gymnasium[box2d] extra.

show code

Basic version


import gymnasium as gym

env = gym.make("LunarLander-v3", render_mode="human")

for _ in range(3):
    obs, info = env.reset()   # reset at the start of every episode
    episode_over = False
    episode_reward = 0        # total reward for the current episode

    while not episode_over:
        action = env.action_space.sample()  # random policy; a real agent would use obs and info
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        episode_over = terminated or truncated

    # LunarLander's info dict has no 'landed' key; the environment is
    # considered solved at 200 points, so use the episode reward instead
    if terminated and episode_reward >= 200:
        print(f"Success! Landed with reward {episode_reward:.1f}")

env.close()
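The loop above follows Gymnasium's step/reset protocol: reset() begins an episode and returns (observation, info), while step() returns (observation, reward, terminated, truncated, info). A common pitfall is forgetting to call reset() and clear the per-episode flags at the top of each episode. The pattern can be sketched with a hypothetical stub environment (no gymnasium install needed; StubEnv is made up for illustration):

```python
import random

class StubEnv:
    """Hypothetical toy environment mimicking the Gymnasium API:
    each episode terminates after a random number of steps (1-5)."""

    def reset(self):
        self._steps_left = random.randint(1, 5)
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self._steps_left -= 1
        terminated = self._steps_left == 0
        # (observation, reward, terminated, truncated, info)
        return 0.0, 1.0, terminated, False, {}

env = StubEnv()
for episode in range(3):
    obs, info = env.reset()   # reset at the START of every episode
    episode_over = False
    episode_reward = 0.0
    while not episode_over:
        obs, reward, terminated, truncated, info = env.step(None)
        episode_reward += reward
        episode_over = terminated or truncated
    print(f"episode {episode}: reward {episode_reward}")
```

Each pass through the outer loop runs one complete episode, because the termination flag and reward accumulator are re-initialized right after reset().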

Advanced version

To monitor the agent's behavior in every episode, I used RecordVideo to record a video of each episode, and RecordEpisodeStatistics to collect a few key metrics.

import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

n_episodes = 100_000
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env = RecordVideo(env, video_folder="./video2", name_prefix="lunar_",
                  episode_trigger=lambda x: True)  # record every episode
env = RecordEpisodeStatistics(env, buffer_length=n_episodes)  # keep stats for up to n_episodes

for _ in range(3):
    obs, info = env.reset()   # reset at the start of every episode
    episode_over = False
    episode_reward = 0        # total reward for the current episode

    while not episode_over:
        action = env.action_space.sample()  # random policy; a real agent would use obs and info
        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        episode_over = terminated or truncated

    # LunarLander's info dict has no 'landed' key; the environment is
    # considered solved at 200 points, so use the episode reward instead
    if terminated and episode_reward >= 200:
        print(f"Success! Landed with reward {episode_reward:.1f}")

env.close()

print(f"Episode time taken: {env.time_queue}")
print(f"Episode total rewards: {env.return_queue}")
print(f"Episode lengths: {env.length_queue}")
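The three queues printed above are deques of per-episode statistics maintained by RecordEpisodeStatistics. Summarizing them is just ordinary deque handling; a sketch with made-up sample returns (the real values come from the wrapper) might look like:

```python
from collections import deque
from statistics import mean

# Hypothetical per-episode returns, as RecordEpisodeStatistics would collect them
return_queue = deque([-180.5, -95.2, 12.7], maxlen=100_000)

print(f"Episodes recorded: {len(return_queue)}")
print(f"Mean return: {mean(return_queue):.1f}")
print(f"Best return: {max(return_queue):.1f}")
```

A mean return trending toward 200 is a convenient signal that the agent is approaching a reliable landing.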