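In most reinforcement learning (RL) problems the model of the environment is unknown, so dynamic programming cannot be applied directly. One approach is to learn the MDP; the article "Understanding Policy Iteration and Value Iteration in Reinforcement Learning" in this series describes the idea. This approach still has problems, mostly in sampling: with random sampling some states are very hard to reach, while sampling under a particular policy makes it hard to recover the true transition probabilities. Once the learned model is wrong, both value iteration and policy iteration go wrong as well.
Hence model-free reinforcement learning: interact with the environment directly and learn the value function or the policy straight from data, without learning a model first.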
Model-free Reinforcement Learning
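Model-free reinforcement learning first estimates values from data (for a state or a state-action pair), that is, the expected cumulative reward. With these estimates in hand it then performs model-free control: it optimizes the current policy so that the value function is maximized.
So how does model-free prediction of the value function work?
In model-free RL we have no access to the state transition or the reward function; all we have is a set of episodes. Previously these episodes were used to learn a model. In model-free methods they are used to learn the value function or the policy directly, without learning the MDP. There are two key steps: 1. estimate the value function; 2. optimize the policy.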
Value Function Estimate
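In model-based RL (MDP), the value function is calculated by dynamic programming:

$$V^\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s, \pi \right]$$

In model-free methods the state transition is unknown, so the expectation in this equation cannot be computed.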
Monte-Carlo Methods
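Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. For example, to estimate the area of a circle, draw the circle inside a square.
Scatter points uniformly over the square, then compute:

$$\text{circle area} \approx \text{square area} \times \frac{\#\,\text{points inside the circle}}{\#\,\text{points in total}}$$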
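The sampling idea can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_circle_area`, the fixed seed and the choice of a circle centred at the origin are assumptions made for the example, not part of the original text.

```python
import random


def estimate_circle_area(num_points: int, radius: float = 1.0, seed: int = 0) -> float:
    """Estimate the area of a circle by uniform sampling in its bounding square."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    inside = 0
    for _ in range(num_points):
        # Draw a point uniformly from the square [-radius, radius]^2.
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    square_area = (2.0 * radius) ** 2
    # circle area ~= square area * (fraction of points that fell inside the circle)
    return square_area * inside / num_points


if __name__ == "__main__":
    print(estimate_circle_area(200_000))  # should be close to pi for radius 1
```

The estimate gets closer to the true area as the number of points grows, which is the same mechanism Monte-Carlo value estimation relies on.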
Monte-Carlo Value Estimation
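We have many episodes. For each of them we compute the total discounted reward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$$

The value function is the expected return, which can be approximated by a sample average:

$$V^\pi(s) = \mathbb{E}\left[ G_t \mid S_t = s, \pi \right] \approx \frac{1}{N} \sum_{i=1}^{N} G_t^{i}$$

where $G_t^{i}$ is the return observed in the $i$-th of $N$ sampled episodes.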
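The method boils down to two steps: 1. use policy $\pi$ to sample $N$ episodes starting from state $s$; 2. compute the average of the cumulative reward. This sampling-based approach computes the value in one shot, without ever computing the state transitions of the MDP.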
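A minimal sketch of these two steps, under stated assumptions: `discounted_return`, `mc_value_estimate` and the toy environment `toy_episode` are hypothetical names introduced for illustration. The toy episode pays reward 1 at every step and terminates with probability 0.5 after each step, so with $\gamma = 0.9$ the true value of the start state is $1 / (1 - 0.45) \approx 1.818$.

```python
import random
from typing import Callable, List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_value_estimate(sample_rewards: Callable[[], List[float]],
                      num_episodes: int, gamma: float) -> float:
    """Step 1: sample episodes from the start state. Step 2: average their returns."""
    total = 0.0
    for _ in range(num_episodes):
        total += discounted_return(sample_rewards(), gamma)
    return total / num_episodes


def make_toy_episode(seed: int = 0) -> Callable[[], List[float]]:
    """Hypothetical toy environment: reward 1 per step, stop with probability 0.5."""
    rng = random.Random(seed)

    def toy_episode() -> List[float]:
        rewards = [1.0]
        while rng.random() < 0.5:  # continue with probability 0.5
            rewards.append(1.0)
        return rewards

    return toy_episode


if __name__ == "__main__":
    print(mc_value_estimate(make_toy_episode(), 20_000, gamma=0.9))
```

No transition probabilities appear anywhere in the estimator; only sampled reward sequences are needed.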
The same idea can be written out in more detail as follows:
- Sample episodes of policy $\pi$.
- Every time-step $t$ that state $s$ is visited in an episode
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By law of large numbers $V(s) \to V^\pi(s)$ as $N(s) \to \infty$.
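The counting procedure in the list above can be sketched as an every-visit Monte-Carlo estimator. The episode format used here, a list of `(state, reward)` pairs where the reward is the one received after leaving that state, is an assumption made for the example, as is the function name `every_visit_mc`.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

# One episode: [(S_0, R_1), (S_1, R_2), ...] (assumed format for this sketch).
Episode = List[Tuple[Hashable, float]]


def every_visit_mc(episodes: List[Episode], gamma: float) -> Dict[Hashable, float]:
    """Estimate V(s) as the mean return over every visit to s."""
    counts = defaultdict(int)    # N(s): number of visits to s
    totals = defaultdict(float)  # S(s): sum of returns observed from s
    for episode in episodes:
        g = 0.0
        # Walk backwards so the return G_t is available at each time-step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            counts[state] += 1
            totals[state] += g
    # V(s) = S(s) / N(s)
    return {s: totals[s] / counts[s] for s in counts}


if __name__ == "__main__":
    data = [[("A", 0.0), ("B", 1.0)], [("B", 3.0)]]
    print(every_visit_mc(data, gamma=1.0))
```

In the usage example, state `B` is visited twice with returns 1 and 3, so its estimate is their mean.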
Incremental Monte-Carlo Updates
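- Update $V(s)$ incrementally after each episode
- For each state $S_t$ with cumulative return $G_t$
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \left( G_t - V(S_t) \right)$
- For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes
  - $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$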
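If the state transition and reward function of the environment keep changing over time, the environment is called non-stationary (the environment itself is stochastic in either case). If these distributions do not change, the environment is stationary. When the distributions do change, old episodes need to be forgotten. Plain averaging weights old and new episodes equally, so it can never forget the old ones.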
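To make the difference concrete, here is a small sketch comparing the two update rules on a stream of returns whose mean jumps from 0 to 10 halfway through. The function names and the synthetic return stream are assumptions for illustration only.

```python
from typing import List, Tuple


def incremental_mean_update(value: float, count: int, g: float) -> Tuple[float, int]:
    """V <- V + (G - V) / N: exact running mean, weights all episodes equally."""
    count += 1
    value += (g - value) / count
    return value, count


def constant_alpha_update(value: float, g: float, alpha: float) -> float:
    """V <- V + alpha * (G - V): exponentially forgets old episodes."""
    return value + alpha * (g - value)


def track(returns: List[float], alpha: float) -> Tuple[float, float]:
    """Run both update rules over the same stream of returns."""
    mean_value, count = 0.0, 0
    alpha_value = 0.0
    for g in returns:
        mean_value, count = incremental_mean_update(mean_value, count, g)
        alpha_value = constant_alpha_update(alpha_value, g, alpha)
    return mean_value, alpha_value


if __name__ == "__main__":
    # Synthetic non-stationary stream: the return shifts from 0 to 10.
    stream = [0.0] * 50 + [10.0] * 50
    print(track(stream, alpha=0.1))
```

On this stream the running mean ends halfway between the old and new levels, while the constant step size tracks the new level closely.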
Some properties of Monte-Carlo value estimation:
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping (discussed later)
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate
Temporal-Difference Learning
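TD methods bring an estimate of the future value function into the update:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$$

TD algorithms have four main properties: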
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
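A minimal sketch of tabular TD prediction, assuming transitions are given as `(state, reward, next_state)` tuples with `next_state = None` marking termination. This format and the names `td0_update` and `td0_prediction` are illustrative assumptions, not taken from the original.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Optional, Tuple

# (S_t, R_{t+1}, S_{t+1}); S_{t+1} is None when the episode terminates.
Transition = Tuple[Hashable, float, Optional[Hashable]]


def td0_update(values: Dict[Hashable, float], state: Hashable, reward: float,
               next_state: Optional[Hashable], alpha: float, gamma: float) -> float:
    """Apply one TD update in place and return the TD error."""
    next_value = 0.0 if next_state is None else values[next_state]
    td_target = reward + gamma * next_value  # a guess built from another guess
    td_error = td_target - values[state]
    values[state] += alpha * td_error
    return td_error


def td0_prediction(transitions: Iterable[Transition],
                   alpha: float = 0.1, gamma: float = 1.0) -> Dict[Hashable, float]:
    """Learn V online, one transition at a time, without waiting for episode ends."""
    values = defaultdict(float)  # unseen states start at 0
    for state, reward, next_state in transitions:
        td0_update(values, state, reward, next_state, alpha, gamma)
    return dict(values)


if __name__ == "__main__":
    # Deterministic chain A -> B -> terminal, reward 1 on the last step.
    experience = [("A", 0.0, "B"), ("B", 1.0, None)] * 2000
    print(td0_prediction(experience))
```

Each update uses only a single transition, which is what allows learning from incomplete episodes.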
Monte Carlo vs. Temporal Difference
The Monte Carlo and Temporal Difference methods compare as follows:
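- The same goal: learn $V^\pi$ from episodes of experience under policy $\pi$.
- Incremental every-visit Monte-Carlo
  - Update value $V(S_t)$ toward actual return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$
- Simplest temporal-difference learning algorithm: TD
  - Update value $V(S_t)$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$: $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$
  - TD target: $R_{t+1} + \gamma V(S_{t+1})$
  - TD error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$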
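As a small worked example with made-up numbers (all values below are assumptions chosen for illustration): let $V(S_t) = 0.5$, $V(S_{t+1}) = 1.0$, $R_{t+1} = 1$, $\gamma = 0.9$, $\alpha = 0.1$, and suppose the episode eventually yields the actual return $G_t = 2$.

$$
\begin{aligned}
\text{MC:}\quad & V(S_t) \leftarrow 0.5 + 0.1\,(2 - 0.5) = 0.65 \\
\text{TD target:}\quad & R_{t+1} + \gamma V(S_{t+1}) = 1 + 0.9 \times 1.0 = 1.9 \\
\text{TD error:}\quad & \delta_t = 1.9 - 0.5 = 1.4 \\
\text{TD:}\quad & V(S_t) \leftarrow 0.5 + 0.1 \times 1.4 = 0.64
\end{aligned}
$$

The MC update has to wait for $G_t$ at the end of the episode, while the TD update is available as soon as $R_{t+1}$ and $S_{t+1}$ are observed.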
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
  - TD can learn online after every step
  - MC must wait until end of episode before return is known
- TD can learn without the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
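- Return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $V^\pi(S_t)$.
Sampling under the current policy and averaging the returns gives an unbiased estimate.
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $V^\pi(S_t)$.
The TD target contains an estimate of the future value. If that estimate were exact, i.e. $V(S_{t+1}) = V^\pi(S_{t+1})$, the TD target would also be unbiased. Because an exact estimate is hard to obtain, the TD target is biased in practice.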
- TD target is of much lower variance than the return
The variance of the TD target is generally smaller than that of the return: the return $G_t$ depends on many random actions, transitions and rewards, whereas the TD target depends on only one random action, transition and reward.
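The variance gap can be illustrated with a small simulation. The setup is a hypothetical toy process chosen for the example: every step pays an independent reward drawn from a standard normal distribution, so the true value of every state is 0. To isolate the variance effect, the TD target here bootstraps from that true value, which also makes it unbiased in this sketch.

```python
import random
import statistics
from typing import Tuple


def sample_return_and_td_target(rng: random.Random, horizon: int,
                                gamma: float) -> Tuple[float, float]:
    """Sample one episode and return (Monte-Carlo return, one-step TD target)."""
    rewards = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # depends on every random reward in the episode
    true_next_value = 0.0  # all rewards have mean 0 in this toy process
    td_target = rewards[0] + gamma * true_next_value  # depends on one random reward
    return g, td_target


def compare_variance(num_samples: int = 5000, horizon: int = 10,
                     gamma: float = 1.0, seed: int = 0) -> Tuple[float, float]:
    """Return (sample variance of the return, sample variance of the TD target)."""
    rng = random.Random(seed)
    samples = [sample_return_and_td_target(rng, horizon, gamma)
               for _ in range(num_samples)]
    returns = [g for g, _ in samples]
    targets = [t for _, t in samples]
    return statistics.variance(returns), statistics.variance(targets)


if __name__ == "__main__":
    print(compare_variance())
```

With a horizon of 10 and no discounting, the return sums ten independent rewards, so its variance is roughly ten times that of the one-step target.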
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
MC has good convergence properties (even with function approximation), is not very sensitive to the initial value, and is very simple to understand and use. However, it needs more samples to bring the variance down.
- TD has low variance, some bias
TD is usually more efficient than MC. TD converges to $V^\pi(s)$, but not always with function approximation. It is also more sensitive to the initial value than MC.
n-step model-free prediction
Due to time constraints, the n-step prediction section may be skipped in favor of heading directly to model-free control.
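- Define the n-step return: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$
- n-step temporal-difference learning: $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)$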
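A short sketch of the n-step return and the corresponding update, where the n observed rewards are followed by a bootstrap from the current value estimate of the state reached after n steps. The function names and argument layout are assumptions for illustration.

```python
from typing import List


def n_step_return(rewards: List[float], bootstrap_value: float, gamma: float) -> float:
    """n-step return from rewards [R_{t+1}, ..., R_{t+n}] and V(S_{t+n}).

    Computes R_{t+1} + gamma * R_{t+2} + ... + gamma^(n-1) * R_{t+n}
    + gamma^n * V(S_{t+n}), working backwards from the bootstrap value.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def n_step_td_update(value: float, rewards: List[float], bootstrap_value: float,
                     alpha: float, gamma: float) -> float:
    """V(S_t) <- V(S_t) + alpha * (n-step return - V(S_t))."""
    return value + alpha * (n_step_return(rewards, bootstrap_value, gamma) - value)


if __name__ == "__main__":
    # n = 2: two observed rewards, then bootstrap from V(S_{t+2}) = 2.0.
    print(n_step_return([1.0, 1.0], bootstrap_value=2.0, gamma=0.5))
    print(n_step_td_update(0.0, [1.0, 1.0], bootstrap_value=2.0, alpha=0.5, gamma=0.5))
```

With a single reward (n = 1) this reduces to the one-step TD target; with all rewards up to termination and a bootstrap value of 0 it reduces to the Monte-Carlo return.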
Once we have the value function, the next step is policy improvement.