Introduction
In value-based methods such as Q-Learning and Sarsa, all function approximation effort goes into estimating a value function, and the policy simply selects the action with the highest estimated value in each state. Although this works well in many problems, it has several limitations ([1]):
- It’s oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities.
- An arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence guarantees for algorithms that follow the value-function approach. For example, Q-learning, Sarsa, and dynamic programming methods have all been shown to be unable to converge to any policy even for simple MDPs and simple function approximators.
Rather than approximating a value function, policy-based methods represent a stochastic policy $\pi_{\theta}(s,a) = P(a\mid s, \theta)$ directly with an independent function approximator that has its own parameters $\theta$. The goal is then to adjust $\theta$ so as to obtain more reward, and the main mechanism for doing so is gradient ascent (a concrete parameterization is sketched after the list below). The advantages of the policy-based approach are:
- It has better convergence properties
- It is effective in high-dimensional or continuous action spaces
- It can learn stochastic policies
However, policy-based methods typically converge to a local rather than a global optimum, and evaluating a policy is often inefficient and suffers from high variance.
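As a concrete example of such a parameterization, here is a minimal sketch of a softmax (Gibbs) policy over linear action preferences; this particular form, the feature matrix `phi`, and the function names are illustrative assumptions rather than something prescribed by the text:

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi_theta(s, a): action probabilities for a linear softmax (Gibbs) policy.

    phi   -- feature matrix of shape (num_actions, num_features) for the current state s
    theta -- parameter vector of shape (num_features,)
    """
    prefs = phi @ theta                # one scalar preference per action
    prefs -= prefs.max()               # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def score(theta, phi, a):
    """Score function grad_theta log pi_theta(s, a) = phi(s,a) - E_{b ~ pi}[phi(s,b)]."""
    probs = softmax_policy(theta, phi)
    return phi[a] - probs @ phi
```

Gradient ascent then moves $\theta$ in the direction of this score, scaled by an estimate of how good the chosen action was, which is exactly what the algorithms in the following sections do.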
Policy Gradient
In reinforcement learning, two ways of formulating the agent’s objective are widely used. One is the start-state formulation, in which there is a designated start state $s_0$ and the agent cares about the long-term reward obtained from it. It is defined as

$$J(\theta) = \mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t-1} r_t \,\Big|\, s_0, \pi_{\theta}\right],$$

and the value of a state-action pair given policy $\pi_{\theta}$ is defined as

$$Q^{\pi_{\theta}}(s,a) = \mathbb{E}\left[\sum_{k=1}^{\infty}\gamma^{k-1} r_{t+k} \,\Big|\, s_t = s,\; a_t = a,\; \pi_{\theta}\right].$$

The second formulation is the average reward formulation, where the objective is

$$J(\theta) = \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[r_1 + r_2 + \cdots + r_n \mid \pi_{\theta}\right] = \sum_s d^{\pi}(s)\sum_a \pi_{\theta}(s,a)\, R_s^a,$$

where $d^{\pi}(s)$ is the stationary distribution of states under $\pi_{\theta}$ and $R_s^a$ is the expected immediate reward, and the value of a state-action pair is defined as

$$Q^{\pi_{\theta}}(s,a) = \sum_{t=1}^{\infty}\mathbb{E}\left[r_t - J(\theta) \mid s_0 = s,\; a_0 = a,\; \pi_{\theta}\right].$$
In both formulations, the policy gradient theorem ([1]) gives the same expression for the gradient of the objective:

$$\nabla_{\theta}J(\theta) = \sum_s d^{\pi}(s)\sum_a \nabla_{\theta}\pi_{\theta}(s,a)\, Q^{\pi_{\theta}}(s,a),$$

where in the start-state formulation $d^{\pi}(s)$ is the discounted weighting of states encountered starting from $s_0$ and following $\pi_{\theta}$. One key point to emphasize is that there is no term of the form $\frac{\partial d^{\pi}(s)}{\partial \theta}$: the effect of policy changes on the distribution of states does not appear in the gradient. Therefore we can approximate the gradient by sampling states and actions obtained by following $\pi_{\theta}$.
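To make the sampling step explicit, one standard manipulation (consistent with the theorem above, though not written out in the original text) uses the likelihood-ratio identity $\nabla_{\theta}\pi_{\theta}(s,a) = \pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)$:

$$\nabla_{\theta}J(\theta) = \sum_s d^{\pi}(s)\sum_a \pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)\, Q^{\pi_{\theta}}(s,a) = \mathbb{E}_{\pi_{\theta}}\!\left[\nabla_{\theta}\log\pi_{\theta}(s,a)\, Q^{\pi_{\theta}}(s,a)\right],$$

so states and actions generated by simply running $\pi_{\theta}$ give an unbiased estimate of the gradient; the $\nabla_{\theta}\log\pi_{\theta}(s,a)$ term is exactly the one that appears in the updates below.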
Monte-Carlo Policy Gradient (REINFORCE)
To approximate $\nabla_{\theta}J(\theta)$, we still need to know $Q^{\pi_{\theta}}(s,a)$. One straightforward idea is to use the actual return $R_t$ as an unbiased sample of $Q^{\pi_{\theta}}(s_t,a_t)$; this results in the REINFORCE algorithm.
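Concretely, if $R_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T$ denotes the return following time $t$, then by the definition of $Q^{\pi_{\theta}}$ given earlier,

$$\mathbb{E}_{\pi_{\theta}}\left[R_t \mid s_t = s,\; a_t = a\right] = Q^{\pi_{\theta}}(s,a),$$

so substituting $R_t$ for $Q^{\pi_{\theta}}(s_t,a_t)$ in the gradient estimate leaves it unbiased, at the cost of high variance.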
function REINFORCE
Initialize $\theta$ arbitrarily
for each episode $\{s_1, a_1, r_2, \dots, s_{T-1}, a_{T-1}, r_T\} $
$\quad$for t=1 to T-1 do
$\qquad$ $\theta \leftarrow \theta + \alpha \nabla_{\theta}\log\pi_{\theta}(s_t,a_t)R_t $
$\quad$ end for
end for
return $\theta$
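As a concrete illustration, here is a minimal, self-contained Python sketch of this update; the toy chain environment, the tabular softmax parameterization, and the hyperparameters are illustrative assumptions, not part of the algorithm above.

```python
import numpy as np

# Toy setup (illustrative, not from the post): a 5-state chain MDP with states 0..4 and
# actions {0: left, 1: right}; reaching state 4 gives reward +1 and ends the episode.
# The policy is a tabular softmax with one parameter per (state, action) pair.
N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 0.99, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))

def pi(s):
    prefs = theta[s] - theta[s].max()          # softmax with max subtracted for stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, float(done), done               # reward 1.0 only when the goal is reached

for episode in range(2000):
    # Generate an episode {s_1, a_1, r_2, ..., r_T} by following pi_theta
    s, done, traj = 0, False, []
    while not done:
        a = np.random.choice(N_ACTIONS, p=pi(s))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
    # Walk backwards so R accumulates the return R_t from each step onward
    R = 0.0
    for s, a, r in reversed(traj):
        R = r + GAMMA * R
        grad_log = -pi(s)                      # grad_theta log pi_theta(s, a) for a
        grad_log[a] += 1.0                     # tabular softmax: one_hot(a) - pi_theta(s, .)
        theta[s] += ALPHA * grad_log * R       # theta <- theta + alpha * grad log pi * R_t
```

Note that $\theta$ is only updated after the episode finishes, since the return $R_t$ needs all the remaining rewards of that episode.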
For a fuller implementation of the REINFORCE method, Andrej Karpathy’s post Deep Reinforcement Learning: Pong from Pixels is a very good place to start.
The REINFORCE method has several drawbacks. First, you have to wait until the whole episode ends to obtain $R_t$. Second, REINFORCE learns much more slowly than value-based methods because the sampled returns have high variance.
Actor-Critic Policy Gradient
To improve on REINFORCE, actor-critic methods learn a value function and use it to reduce the variance of the gradient estimate for faster learning. The main idea is to use a learned critic to estimate value explicitly instead of relying on the sampled return $R_t$, for example an action-value critic $Q_w(s,a) \approx Q^{\pi_{\theta}}(s,a)$. Two sets of parameters are now maintained: the value-function parameters (critic) and the policy parameters $\theta$ (actor). The variant below uses a state-value critic $V_v(s)$ with parameters $v$ and takes the TD error $\delta = r + \gamma V_v(s') - V_v(s)$ as an estimate of how much better the chosen action was than expected.
function Actor-Critic
Initialize $s$, $\theta$, $v$ arbitrarily
Sample $a \sim \pi_{\theta}$
for each step do
$\quad$ Sample reward $r \sim R_s^a$; sample transition $s' \sim P_s^a$
$\quad$ Sample action $a'\sim \pi_{\theta}(s', a')$
$\quad$ $\delta = r + \gamma V_v(s')-V_v(s)$
$\quad$ $\theta = \theta + \alpha \nabla_{\theta}\log\pi_{\theta}(s,a)\delta$
$\quad$ $v = v + \beta \delta\nabla_{v}V_v(s)$
$\quad$ $a \leftarrow a'$, $s\leftarrow s'$
end for
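Below is a minimal Python sketch of this loop on the same kind of toy 5-state chain MDP used for the REINFORCE sketch; the environment and the tabular parameterization are illustrative assumptions, not part of the pseudocode.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA, ALPHA, BETA = 5, 2, 0.99, 0.1, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))        # actor: tabular softmax policy parameters
v = np.zeros(N_STATES)                         # critic: tabular state values V_v(s)

def pi(s):
    prefs = theta[s] - theta[s].max()          # softmax with max subtracted for stability
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, float(done), done               # reward 1.0 only when the goal is reached

s = 0
a = np.random.choice(N_ACTIONS, p=pi(s))
for t in range(20000):                         # one continuing stream of steps
    s2, r, done = step(s, a)
    a2 = np.random.choice(N_ACTIONS, p=pi(s2))
    delta = r + (0.0 if done else GAMMA * v[s2]) - v[s]   # TD error (terminal value is 0)
    grad_log = -pi(s)
    grad_log[a] += 1.0                         # grad_theta log pi_theta(s, a)
    theta[s] += ALPHA * grad_log * delta       # actor: theta <- theta + alpha grad log pi * delta
    v[s] += BETA * delta                       # critic: v <- v + beta * delta * grad_v V_v(s)
    if done:                                   # restart the episode at the initial state
        s, a = 0, np.random.choice(N_ACTIONS, p=pi(0))
    else:
        s, a = s2, a2
```

Because $\delta$ is computed from a single transition, the actor can be updated online at every step instead of waiting for the episode to end, which addresses the first drawback of REINFORCE mentioned above.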
References
- [1] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” Advances in Neural Information Processing Systems. 2000.