A Simple Reproduction of the 10-Armed Bandit Experiment

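I have been studying reinforcement learning recently, and since I want to learn it carefully and thoroughly, I decided to work through the exercises in the book as well.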

This is a figure from Ch. 2 of the book, which I reproduced in Python.

Testbed

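"Testbed" here roughly means a standard set of environments used to evaluate an algorithm; I don't know what the Chinese term for it is. When I first came to the US last year, there were many concepts I couldn't express when talking to people. Now, after studying here for a while, I find there are terms I learned in English that I don't know how to say in Chinese. I really have no talent for languages; I still remember studying for the TOEFL long ago and getting nowhere. I sincerely hope the world will someday have only one language, because time spent on language barriers feels wasted.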

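In short, the experiment creates 2000 independent bandit problems. Each problem has 10 actions (one action corresponds to one arm of the bandit), and each action's true value q*(a) is drawn from a Gaussian distribution (mean 0 and variance 1 in the code below). q*(a) is the ground truth, and the learning agent cannot observe it; it only sees noisy rewards.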

import numpy as np
import matplotlib.pyplot as plt

class Bandit:
    def __init__(self, num_arms=10):
        # True action values q*(a) sampled from N(0,1)
        self.q_star = np.random.normal(0, 1, num_arms)
    
    def get_reward(self, action):
        # Reward sampled from N(q*(a),1) given an action
        return np.random.normal(self.q_star[action], 1)
    
    def optimal_action(self):
        return np.argmax(self.q_star)
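
As a quick sanity check on the testbed described above, the following sketch draws the true values for 2000 problems at once and averages the best arm's value in each problem. That average is a rough ceiling on the average reward any agent can reach on this testbed, and it should come out near 1.5. The seed and the variable names (`rng`, `best_values`, `mean_best`) are my own choices for illustration and are not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
num_problems, num_arms = 2000, 10

# One row per bandit problem, one column per arm: q*(a) ~ N(0, 1)
q_star = rng.normal(0.0, 1.0, size=(num_problems, num_arms))

# Value of the optimal arm in each problem, then the average over problems
best_values = q_star.max(axis=1)
mean_best = best_values.mean()
print(f"Mean value of the best arm over {num_problems} problems: {mean_best:.2f}")
```

An agent that always pulled the optimal arm would earn about this much per step on average, so the reward curves later in the post should stay below it.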

Agent (reinforcement learning policy)

class Agent:
    def __init__(self, num_arms=10, epsilon=0.1):
        self.epsilon = epsilon
        self.q_estimates = np.zeros(num_arms)  # Initialize Q-values to 0
        self.action_counts = np.zeros(num_arms)  # Track action selection counts
    
    def select_action(self):
        if np.random.rand() < self.epsilon:
            return np.random.randint(len(self.q_estimates))  # Random action (exploration)
        else:
            return np.argmax(self.q_estimates)  # Greedy action (exploitation)
    
    def update(self, action, reward):
        self.action_counts[action] += 1
        alpha = 1 / self.action_counts[action]  # Incremental sample averaging
        self.q_estimates[action] += alpha * (reward - self.q_estimates[action])
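
The `update` method above uses a step size of 1/n, which makes each estimate the plain average of the rewards seen so far for that arm, without storing the rewards. A minimal sketch to check that claim numerically is below; `incremental_mean` is a hypothetical helper written only for this check, and the sample rewards are arbitrary.

```python
import numpy as np

def incremental_mean(rewards):
    """Apply Q <- Q + (R - Q) / n one reward at a time, starting from Q = 0."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n
    return q

rng = np.random.default_rng(1)  # arbitrary seed
sample = rng.normal(0.5, 1.0, size=1000)  # stand-in rewards for a single arm

# The incremental result matches the batch mean up to floating-point error
print(incremental_mean(sample), sample.mean())
```

Because the first update uses n = 1, the initial value of 0 is overwritten completely by the first reward, so the starting estimate only matters for arms that have not been tried yet.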

Run a single experiment and observe the Q-value updates

num_steps = 1000
bandit = Bandit()
agent = Agent(epsilon=0.1)

rewards = []
optimal_action_counts = []

for step in range(num_steps):
    action = agent.select_action()
    reward = bandit.get_reward(action)
    agent.update(action, reward)

    rewards.append(reward)
    optimal_action_counts.append(action == bandit.optimal_action())

print("Final Q estimates:", agent.q_estimates)
print("True q* values:", bandit.q_star)

plt.plot(rewards)
plt.xlabel("Steps")
plt.ylabel("Reward")
plt.title("Reward over time")
plt.show()


Run 2000 experiments and compute the average reward

num_experiments = 2000
epsilons = [0, 0.01, 0.1]
num_arms = 10

avg_rewards = {eps: np.zeros(num_steps) for eps in epsilons}
optimal_action_pct = {eps: np.zeros(num_steps) for eps in epsilons}

for experiment in range(num_experiments):
    bandit = Bandit()
    
    for eps in epsilons:
        agent = Agent(num_arms=num_arms, epsilon=eps)
        optimal_action = bandit.optimal_action()

        for step in range(num_steps):
            action = agent.select_action()
            reward = bandit.get_reward(action)
            agent.update(action, reward)

            avg_rewards[eps][step] += reward
            optimal_action_pct[eps][step] += (action == optimal_action)

# Compute final averages
for eps in epsilons:
    avg_rewards[eps] /= num_experiments
    optimal_action_pct[eps] = (optimal_action_pct[eps] / num_experiments) * 100

print(f"Finished {num_experiments} experiments!")

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for eps in epsilons:
    plt.plot(avg_rewards[eps], label=f'ε={eps}')
plt.xlabel("Steps")
plt.ylabel("Average Reward")
plt.legend()
plt.title("Average Reward vs Steps")

plt.subplot(1, 2, 2)
for eps in epsilons:
    plt.plot(optimal_action_pct[eps], label=f'ε={eps}')
plt.xlabel("Steps")
plt.ylabel("% Optimal Action")
plt.legend()
plt.title("Optimal Action Selection vs Steps")

plt.show()
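
The nested loops above perform 2000 x 3 x 1000 Python-level steps, which can take a while. One possible alternative, sketched here under my own assumptions, is to simulate all 2000 bandits in parallel with NumPy arrays. `run_testbed` and its parameters are hypothetical names, not part of the original code, and ties in the greedy choice go to the lowest-index arm, as with `np.argmax` in the `Agent` class.

```python
import numpy as np

def run_testbed(epsilon, num_runs=2000, num_arms=10, num_steps=1000, seed=0):
    """Run num_runs independent epsilon-greedy agents, one per bandit, in parallel."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=(num_runs, num_arms))  # true values
    best_arm = q_star.argmax(axis=1)
    q_est = np.zeros((num_runs, num_arms))   # sample-average estimates
    counts = np.zeros((num_runs, num_arms))  # pulls per arm
    rows = np.arange(num_runs)

    avg_reward = np.zeros(num_steps)
    pct_optimal = np.zeros(num_steps)
    for t in range(num_steps):
        # Epsilon-greedy choice for every run at once
        greedy = q_est.argmax(axis=1)
        random_arm = rng.integers(num_arms, size=num_runs)
        explore = rng.random(num_runs) < epsilon
        action = np.where(explore, random_arm, greedy)

        # Reward ~ N(q*(a), 1), then the incremental sample-average update
        reward = rng.normal(q_star[rows, action], 1.0)
        counts[rows, action] += 1
        q_est[rows, action] += (reward - q_est[rows, action]) / counts[rows, action]

        avg_reward[t] = reward.mean()
        pct_optimal[t] = 100.0 * (action == best_arm).mean()
    return avg_reward, pct_optimal

avg, pct = run_testbed(epsilon=0.1, num_runs=500, num_steps=300)
print(f"Average reward over the last 50 steps: {avg[-50:].mean():.2f}")
```

Calling `run_testbed` once per ε with the same seed gives every ε the same set of true values, which mirrors how the loop above shares one `Bandit` across the three agents. The exact curves will differ from the loop version because the random draws happen in a different order.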


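Summary

This experiment covered the following:

Multi-armed bandit testbed: implemented a 10-armed bandit whose true action values are sampled from a normal distribution.
Agent (reinforcement learning policy): implemented ε-greedy action selection with incremental Q-value updates.
Single test run: showed how the Q-values are updated and how the reward changes over time.
Multiple experiments (2000 runs): compared average reward and the percentage of optimal action selections for different values of ε (0, 0.01, 0.1).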