This is a review of Reinforcement Learning.
It has three parts: 1) foundations of RL, 2) RL based on value functions, 3) RL based on direct policy search.
Foundations of RL
-
Markov Decision Process | Day 1
What is an MDP?
An MDP can be thought of as a state-transition model extended with actions and rewards. It is used to simulate a world (for example, a grid) by describing it in terms of states, actions, rewards, and a transition model.
In the example figure, each circle is a state, {Facebook, Quit, Study, Pub, Sleep} are the actions, and R is the reward. An MDP is defined by the following:
- States: S
- Actions: A(s), A
- Transition model: T(s,a,s') ~ P(s'|s,a)
- Rewards: R(s), R(s,a), R(s,a,s')
- Policy: π
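As a concrete (if tiny) illustration, here is one way these pieces might be laid out in Python. The state/action names and all of the numbers below are made-up placeholders, not values from the example figure above.

```python
# A toy MDP expressed as plain Python data structures.
# All names and numbers here are hypothetical placeholders.

states = ["s0", "s1"]                         # S
actions = {"s0": ["a0", "a1"], "s1": ["a0"]}  # A(s)

# Transition model T(s, a, s') ~ P(s' | s, a)
T = {
    ("s0", "a0"): {"s0": 0.2, "s1": 0.8},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s1": 1.0},
}

# Rewards R(s, a, s')
R = {
    ("s0", "a0", "s0"): -1.0,
    ("s0", "a0", "s1"): -1.0,
    ("s0", "a1", "s1"): 0.0,
    ("s1", "a0", "s1"): 10.0,
}

# A deterministic policy π: state -> action
policy = {"s0": "a1", "s1": "a0"}
```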
-
MDP and RL | Day 2
A reinforcement learning (RL) task composed of states, actions, and rewards that satisfies the Markov property can be treated as an MDP.
The goal of RL is to find an optimal policy for a given MDP. A policy is a mapping from states to actions; more generally, it gives the probability that each action is selected in a given state.
How to find the optimal policy => compute the optimal value function, then derive the optimal policy from it.
The definition of the value function and the Bellman equation (the form used when programming) are given below.
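For reference, these are the standard textbook forms (the discount factor γ is not introduced in the notes above, so it is added here; the notation follows the transition model T(s,a,s') ~ P(s'|s,a) from Day 1):

```latex
% State-value function: expected discounted return starting from s and following \pi
v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]

% Bellman expectation equation: the recursive form that is used when programming
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\big[R(s,a,s') + \gamma \, v_\pi(s')\big]
```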
-
Implementing MDP | Day 3
Here is an example. We use a 3*4 grid to represent the world (shown below).
p(desired action) = 0.8, p(each action perpendicular to the desired action) = 0.1, reward = -3.
Check out the code here.
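For readers who do not want to follow the link, here is a minimal self-contained sketch of that transition model. The wall position and the two terminal cells are assumptions modelled on the classic 3*4 gridworld; -3 is used as the per-step reward, following the description above.

```python
# 3*4 gridworld: p(desired move) = 0.8, p(each perpendicular move) = 0.1,
# step reward = -3. Wall and terminal cells below are assumptions.

ROWS, COLS = 3, 4
WALL = (1, 1)                            # blocked cell (assumption)
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}  # terminal cells and rewards (assumption)
STEP_REWARD = -3

STATES = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != WALL]

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def move(state, action):
    """Deterministic move; bumping into the wall or the edge keeps the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt == WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt

def transition(state, action):
    """T(s, a, s'): return {next_state: probability} for the stochastic move."""
    probs = {}
    outcomes = [(action, 0.8)] + [(a, 0.1) for a in PERPENDICULAR[action]]
    for a, p in outcomes:
        s_next = move(state, a)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs
```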
-
Dynamic Programming for solving MDP | Day 4
The key idea of dynamic programming (DP) is to use the value function to find the optimal policy.
An RL task can be modelled as an MDP, which satisfies the two conditions required for DP.
These two conditions are:
1) the main problem can be decomposed into subproblems; 2) the solutions of the subproblems can be stored and reused.
There are two ways to obtain the optimal policy: policy iteration and value iteration. Policy iteration consists of policy evaluation and policy improvement.
The difference between policy iteration and value iteration is when the state values are updated: policy iteration evaluates the current policy until its value function converges before improving the policy, whereas improving the policy after a single evaluation sweep is value iteration.
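As a rough sketch, here is iterative policy evaluation, the inner loop of policy iteration, applied to the gridworld model from Day 3 (it reuses STATES, TERMINALS, STEP_REWARD, and transition from that sketch; gamma and the stopping threshold theta are arbitrary choices):

```python
def policy_evaluation(policy, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman expectation backup for a fixed policy
    until the largest change in any state value falls below theta."""
    V = {s: TERMINALS.get(s, 0.0) for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:          # terminal values stay fixed
                continue
            a = policy[s]               # action chosen by the (deterministic) policy
            new_v = sum(p * (STEP_REWARD + gamma * V[s2])
                        for s2, p in transition(s, a).items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V
```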
-
Policy iteration and Value iteration | Day 5
Policy iteration:
1) Evaluate the current policy until its value function converges, then use this value function to improve the policy. 2) Repeat the process until the policy is stable.
Value iteration:
1) For each state, take the best one-step lookahead value over all actions and assign it to the value function. 2) Repeat the process until the value function is stable, then read the optimal policy off the converged value function.
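A minimal sketch of both procedures, continuing the gridworld sketches from Day 3 and Day 4 (again, gamma and theta are arbitrary, and the code assumes the deterministic-policy representation used above):

```python
def q_value(V, s, a, gamma=0.9):
    """One-step lookahead: expected return of taking action a in state s under values V."""
    return sum(p * (STEP_REWARD + gamma * V[s2])
               for s2, p in transition(s, a).items())

def policy_iteration(gamma=0.9):
    """Evaluate the current policy to convergence, then improve it greedily;
    stop when the policy no longer changes."""
    policy = {s: "up" for s in STATES if s not in TERMINALS}
    while True:
        V = policy_evaluation(policy, gamma)
        stable = True
        for s in policy:
            best = max(ACTIONS, key=lambda a: q_value(V, s, a, gamma))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V

def value_iteration(gamma=0.9, theta=1e-6):
    """Back up each state with the best one-step lookahead value (no separate
    evaluation-to-convergence phase), then extract the greedy policy."""
    V = {s: TERMINALS.get(s, 0.0) for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            new_v = max(q_value(V, s, a, gamma) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a, gamma))
              for s in STATES if s not in TERMINALS}
    return policy, V
```

On this small grid, policy_iteration() and value_iteration() should agree on the greedy policy; they differ only in how much evaluation work is done before each improvement step.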