Policy gradient methods learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameters, but it is not needed for choosing actions. We use the notation $\theta \in \mathbb{R}^d$ for the policy’s parameter vector, and write $\pi(a\mid s, \theta) = \Pr\{ A_t = a \mid S_t = s, \theta_t=\theta \}$ for the probability that action $a$ is taken at time $t$ given that the environment is in state $s$ at time $t$ with parameter $\theta$.

This chapter discusses methods that learn the policy parameters based on the gradient of some scalar performance measure $J(\theta)$ with respect to those parameters. These methods seek to maximize performance, so their updates approximate gradient ascent in $J$:

\[\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}\]

where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$. All methods that follow this general schema are called policy gradient methods, whether or not they also learn an approximate value function. Methods that learn approximations to both the policy and a value function are often called actor-critic methods, where ‘actor’ refers to the learned policy and ‘critic’ refers to the learned value function, usually a state-value function.

Policy Approximation and its Advantages

In policy gradient methods, the policy can be parameterized in any way as long as $\pi(a\mid s, \theta)$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a\mid s, \theta)$ exists and is finite for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. In practice, to ensure exploration we generally require that the policy never becomes deterministic (i.e., that $\pi(a\mid s, \theta) \in (0, 1)$ for all $s$, $a$, $\theta$).

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form a parameterized numerical preference $h(s, a, \theta) \in \mathbb{R}$ for each state-action pair. The actions with the highest preferences in each state are given the highest probabilities of being selected.
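For example, the action probabilities might be given by an exponential soft-max distribution over the preferences (the same parameterization used in the problem at the end of this section):

\[\pi(a\mid s, \theta) \doteq \frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))}.\]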

The action preferences themselves can be parameterized arbitrarily. One advantage of parameterizing policies according to the soft-max in action preferences is that the approximate policy can approach a deterministic policy, whereas with $\epsilon$-greedy policies the probability of selecting any action is always at least $\epsilon$. One could instead select according to a soft-max distribution based on action values, but this alone would not allow the policy to approach a deterministic policy. Instead, the action-value estimates would converge to their corresponding true values, which differ by finite amounts, translating to specific probabilities other than 0 or 1. If the soft-max distribution included a temperature parameter, then the temperature could be reduced over time to approach determinism, but in practice it would be difficult to choose the reduction schedule, or even the initial temperature, without more prior knowledge of the true action values than we would like to assume. Action preferences are different because they do not approach specific values; instead they are driven to produce the optimal stochastic policy. If the optimal policy is deterministic, then the preferences of the optimal actions will be driven infinitely higher than those of all suboptimal actions.

A second advantage of parameterizing policies according to the soft-max in action preferences is that it enables the selection of actions with arbitrary probabilities. In problems with significant function approximation, the best approximate policy may be stochastic.

The Policy Gradient Theorem

In addition to the practical advantages of policy parameterization over $\epsilon$-greedy action selection, there is also an important theoretical advantage. With continuous policy parameterization the action probabilities change smoothly as a function of the learned parameters, whereas with $\epsilon$-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values, if that change results in a different action having the maximal value. Largely because of this, stronger convergence guarantees are available for policy gradient methods than for action-value methods. In particular, it is the continuity of the policy's dependence on the parameters that enables policy-gradient methods to approximate gradient ascent.

The episodic and continuing cases define the performance measure, $J(\theta)$, differently, and thus they have to be treated separately to some extent.

In this part we treat the episodic case, for which we define the performance measure as the value of the start state of the episode. We can simplify the notation without losing any meaningful generality by assuming that every episode starts in some particular state $s_0$. Then the performance measure is simply the value of $s_0$:

\[J(\theta) \dot{=} v_{\pi_\theta}(s_0),\]

where $v_{\pi_\theta}(s_0)$ is the true value function for $\pi_\theta$, the policy determined by $\theta$.

With function approximation, it may seem challenging to change the policy parameters in a way that ensures improvement. The problem is that performance depends on both the action selections and the distribution of states in which those selections are made, and both of these are affected by the policy parameters. Given a state, the effect of the policy parameters on the actions, and thus on reward, can be computed in a relatively straightforward way from knowledge of the parameterization. But the effect of the policy on the state distribution is a function of the environment and is typically unknown. How can we estimate the performance gradient with respect to the policy parameters when it depends on this unknown effect of policy changes on the state distribution? The answer is the policy gradient theorem, which provides an analytic expression for the gradient of performance with respect to the policy parameters that does not involve the derivative of the state distribution. The policy gradient theorem for the episodic case establishes that

\[\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla \pi(a\mid s, \theta),\]

where the gradients are column vectors of partial derivatives with respect to the components of $\theta$, and $\pi$ denotes the policy corresponding to parameter vector $\theta$. In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1, so that the relationship is actually an equality. The distribution $\mu$ here is the on-policy distribution under $\pi$.

Proof of the Policy Gradient Theorem (episodic)

\[\begin{align} \nabla v_\pi(s) &= \nabla \left[ \sum_a \pi(a\mid s) q_\pi(s, a) \right], \qquad \forall s \in \mathcal{S} \\ &= \sum_a \left[\nabla \pi (a \mid s)\, q_\pi(s, a) + \pi(a\mid s)\, \nabla q_\pi(s, a) \right] \\ &= \sum_a \left[\nabla \pi (a \mid s)\, q_\pi(s, a) + \pi(a\mid s)\, \nabla \sum_{s', r} p(s', r\mid s, a)\big(r + \gamma v_\pi(s')\big) \right] \\ &= \sum_a \left[\nabla \pi (a \mid s)\, q_\pi(s, a) + \pi(a\mid s) \sum_{s', r} p(s', r\mid s, a)\, \gamma \nabla v_\pi(s') \right] \\ &= \sum_a \left[\nabla \pi (a \mid s)\, q_\pi(s, a) + \pi(a\mid s) \sum_{s'} p(s' \mid s, a)\, \gamma \nabla v_\pi(s') \right] \\ & \because \text{neither } p(s', r\mid s, a) \text{ nor } r \text{ is a function of } \theta, \text{ and } p(s'\mid s, a) = \sum_r p(s',r \mid s, a) \end{align}\]

This equation has a recursive form: $\nabla v_\pi(s)$ appears again on the right-hand side, evaluated at the successor state $s'$. To unroll it, let $p^{\pi}(s \rightarrow x, k)$ denote the probability of transitioning from state $s$ to state $x$ in $k$ steps under policy $\pi$.

With $k=1$, $p^{\pi}(s \rightarrow s', 1) = \sum_a \pi(a\mid s)\, p(s' \mid s, a)$.

Now suppose it takes $k$ steps to reach an intermediate state $s'$, and from $s'$ one more step to reach $x$; the transition $s \rightarrow x$ then takes $k+1$ steps. Thus,

\[\begin{align} p^{\pi}(s \rightarrow x, k+1) = \sum_{s'} p^{\pi}(s \rightarrow s', k) p^{\pi}(s' \rightarrow x, 1), \end{align}\]

where $s' \in \mathcal{S}$.
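As a quick aside, in the tabular case these $k$-step probabilities can be computed by chaining one-step transitions, which is exactly the recursion above. Here is a minimal NumPy sketch, where the transition array `P[s, a, x]` and the policy array `pi[s, a]` are assumed inputs:

```python
import numpy as np

def k_step_probs(P, pi, k):
    """Return the matrix whose (s, x) entry is p^pi(s -> x, k).

    P[s, a, x]  - assumed environment dynamics p(x | s, a)
    pi[s, a]    - assumed policy probabilities pi(a | s, theta)
    """
    # One-step state-to-state matrix: p^pi(s -> x, 1) = sum_a pi(a|s) p(x|s, a)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    # Chaining one-step transitions k times realizes the recursion
    # p^pi(s -> x, k+1) = sum_{s'} p^pi(s -> s', k) p^pi(s' -> x, 1).
    return np.linalg.matrix_power(P_pi, k)
```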

So the original equation becomes (writing $K(s) \doteq \sum_a \nabla \pi(a\mid s)\, q_\pi(s, a)$ for the first term):

\[\begin{align} \nabla v_\pi(s) &= \sum_a \nabla \pi (a \mid s)\, q_\pi(s, a) + \gamma \sum_a \pi(a\mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \\ &= \sum_a \nabla \pi (a \mid s)\, q_\pi(s, a) + \gamma \sum_{s'} \sum_a \pi(a\mid s)\, p(s' \mid s, a)\, \nabla v_\pi(s') \\ &= K(s) + \gamma \sum_{s'} p^{\pi}(s \rightarrow s', 1)\, \nabla v_\pi(s') \\ &= K(s) + \gamma \sum_{s'} p^{\pi}(s \rightarrow s', 1) \left[ K(s') + \gamma \sum_{s''} p^{\pi}(s' \rightarrow s'', 1)\, \nabla v_\pi(s'') \right] \\ &= K(s) + \gamma \sum_{s'} p^{\pi}(s \rightarrow s', 1)\, K(s') + \gamma^2 \sum_{s'} p^{\pi}(s \rightarrow s', 1) \sum_{s''} p^{\pi}(s' \rightarrow s'', 1)\, \nabla v_\pi(s'') \\ &= K(s) + \gamma \sum_{s'} p^{\pi}(s \rightarrow s', 1)\, K(s') + \gamma^2 \sum_{s''} \sum_{s'} p^{\pi}(s \rightarrow s', 1)\, p^{\pi}(s' \rightarrow s'', 1)\, \nabla v_\pi(s'') \\ &= K(s) + \gamma \sum_{s'} p^{\pi}(s \rightarrow s', 1)\, K(s') + \gamma^2 \sum_{s''} p^{\pi}(s \rightarrow s'', 2)\, \nabla v_\pi(s'') \\ &= \cdots \\ &= \sum_{x \in \mathcal{S}}\sum_{k=0}^{\infty} \gamma^k\, p^{\pi}(s \rightarrow x, k) \sum_a \nabla \pi(a\mid x)\, q_\pi(x, a) \end{align}\]

In this chapter we are assuming $\gamma = 1$, so the equation becomes

\[\nabla v_\pi(s) = \sum_{x \in \mathcal{S}}\sum_{k=0}^{\infty} p^{\pi}(s \rightarrow x, k) \sum_a \nabla \pi(a\mid x)\, q_\pi(x, a)\]

We defined $J(\theta) = v_{\pi_\theta}(s_0)$, so

\[\begin{align} \nabla J(\theta) &= \nabla v_\pi(s_0) \\ &= \sum_{x \in \mathcal{S}}\left(\sum_{k=0}^{\infty} p^{\pi}(s_0 \rightarrow x, k) \right)\sum_a \nabla \pi(a\mid x)\, q_\pi(x, a) \\ &= \sum_s \eta(s) \sum_a \nabla \pi(a\mid s)\, q_\pi(s, a) \\ &= \sum_{s'} \eta(s') \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')} \sum_a \nabla \pi(a\mid s)\, q_\pi(s, a) \\ &= \sum_{s'} \eta(s') \sum_s \mu(s) \sum_a \nabla \pi(a\mid s)\, q_\pi(s, a) \\ &\propto \sum_s \mu(s) \sum_a \nabla \pi(a\mid s)\, q_\pi(s, a) \\ \end{align}\]

Here $\eta(s) = \sum_{k=0}^{\infty} p^{\pi}(s_0 \rightarrow s, k)$ is the expected number of time steps spent in state $s$ per episode, and $\mu(s) = \eta(s)/\sum_{s'} \eta(s')$ is the on-policy distribution. As mentioned above, the constant of proportionality $\sum_{s'} \eta(s')$ is the average length of an episode in the episodic case, and 1 in the continuing case.

REINFORCE: Monte Carlo Policy Gradient

We can rewrite the policy gradient theorem above as an expectation:

\[\begin{align} \nabla J(\theta) &\propto \sum_s \mu(s) \sum_a \nabla \pi(a\mid s)\, q_\pi(s, a) \\ &= \mathbb{E}_{\pi} \left[ \sum_a \nabla \pi(a\mid S_t)\, q_\pi(S_t, a) \right] \\ &= \mathbb{E}_{\pi} \left[ \sum_a \pi(a\mid S_t)\, \frac{\nabla \pi(a\mid S_t)}{\pi(a\mid S_t)}\, q_\pi(S_t, a) \right] \\ &= \mathbb{E}_{\pi} \left[\frac{\nabla \pi(A_t\mid S_t)}{\pi(A_t\mid S_t)}\, q_\pi(S_t, A_t) \right] \qquad \text{(replacing } a \text{ with the sample } A_t \sim \pi\text{)} \\ &= \mathbb{E}_\pi \left[ \nabla \ln \pi(A_t\mid S_t)\, q_\pi(S_t, A_t) \right] \qquad \left(\because \nabla \ln x = \tfrac{\nabla x}{x}\right) \\ &= \mathbb{E}_\pi \left[ \nabla \ln \pi(A_t\mid S_t)\, G_t \right] \qquad \left(\because q_\pi(S_t, A_t) = \mathbb{E}_\pi [G_t \mid S_t, A_t]\right) \\ \end{align}\]

Now we can instantiate the general update $\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$ by using the sample quantity $G_t \nabla \ln \pi(A_t\mid S_t, \theta_t)$ as the stochastic estimate $\widehat{\nabla J(\theta_t)}$, whose expectation is proportional to the gradient of the performance measure:

\[\theta_{t+1} = \theta_t + \alpha\, G_t\, \nabla \ln \pi(A_t\mid S_t, \theta_t).\]

We call this algorithm REINFORCE. Its update has an intuitive interpretation: each increment is proportional to the product of a return $G_t$ and a vector, the gradient of the probability of taking the action actually taken divided by the probability of taking that action. This vector is the direction in parameter space that most increases the probability of repeating the action $A_t$ on future visits to state $S_t$. The update therefore moves the parameter vector in this direction by an amount proportional to the return and inversely proportional to the probability of taking the action. The former makes sense because it moves the parameters most in the directions that yield the highest return; the latter makes sense because otherwise actions that are selected frequently would be at an advantage and might win out even if they do not yield a higher return.

Pseudocode: An MC Policy Gradient Algorithm

REINFORCE, A Monte Carlo Policy Gradient Method (EPISODIC CASE)
  • Input: a differentiable policy parameterization $\pi(a\mid s, \theta)$
  • Parameter: step size $\alpha > 0$
  • Initialize policy parameters $\theta \in \mathbb{R}^d$
  • Repeat forever:
    • Generate an episode $S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_T$ using $\pi(\cdot\mid \cdot, \theta)$
    • For each step $t$ of the episode:
      • $G_t \leftarrow$ return from step $t$
      • $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla \ln \pi(A_t\mid S_t, \theta)$

The vector $\nabla \ln \pi(A_t\mid S_t, \theta)$ is the only place where the policy parameterization appears in the algorithm. We refer to it as the eligibility vector.
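As a concrete illustration, here is a minimal NumPy sketch of this algorithm for a linear soft-max policy. The per-action feature function and the `(features, action, reward)` episode format are assumptions made for the example, not part of the pseudocode above.

```python
import numpy as np

def softmax_probs(theta, x_s):
    """x_s: array of shape (num_actions, d), row a holding the features x(s, a)."""
    prefs = x_s @ theta              # h(s, a, theta) = theta^T x(s, a)
    prefs = prefs - prefs.max()      # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, x_s, a):
    """Eligibility vector: x(s, a) - sum_b pi(b|s, theta) x(s, b)."""
    pi = softmax_probs(theta, x_s)
    return x_s[a] - pi @ x_s

def reinforce_update(theta, episode, alpha, gamma=1.0):
    """One REINFORCE update per episode.

    episode: list of (x_s, a, r) tuples, where x_s are the per-action features
    at step t, a is A_t, and r is R_{t+1}.
    """
    T = len(episode)
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):     # G_t = R_{t+1} + gamma * G_{t+1}
        G = episode[t][2] + gamma * G
        returns[t] = G
    for t, (x_s, a, _) in enumerate(episode):
        theta = theta + alpha * gamma**t * returns[t] * grad_log_pi(theta, x_s, a)
    return theta
```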

REINFORCE with Baseline

The policy gradient theorem can be generalized to include a comparison of the action value to an arbitrary baseline $b(s)$:

\[\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \nabla_{\theta} \pi(a\mid s, \theta) \left[ q_\pi(s, a) - b(s) \right]\]

The baseline can be any function, even a random variable, as long as it does not vary with $a$; the equation remains valid because the subtracted quantity is zero:

\[\sum_a b(s) \nabla_\theta \pi(a\mid s, \theta) = b(s) \nabla_\theta \sum_a \pi(a\mid s, \theta) = b(s) \nabla_\theta 1 = 0\]

The update rule then becomes:

\[\theta_{t+1} = \theta_t + \alpha \left[ G_t - b(S_t) \right] \nabla \ln \pi(A_t\mid S_t, \theta)\]

A natural choice of baseline is an estimate of the state value, $\hat{v}(S_t, w)$, where $w \in \mathbb{R}^d$ is a weight vector.

Pseudocode: REINFORCE with Baseline (episodic)

This algorithm has two step sizes, denoted $\alpha^\theta$ and $\alpha^w$. The step size for the value weights ($\alpha^w$) is relatively easy to set; in the linear case we have rules of thumb for choosing it, such as $\alpha^w = 0.1/\mathbb{E}[\|\nabla_w \hat{v}(S_t, w)\|^2]$. The step size for the policy parameters ($\alpha^\theta$) is more difficult to choose, and it depends on the range of variation of the rewards and on the policy parameterization.

REINFORCE with Baseline (episodic)
  • Input: a differentiable policy parameterization $\pi(a\mid s, \theta)$
  • Input: a differentiable state-value parameterization $\hat{v}(s, w)$
  • Parameters: step sizes $\alpha^\theta$ > 0, $\alpha^w$ > 0
  • Initialize policy parameters $\theta \in \mathbb{R}^{d'}$ and state-value weights $w \in \mathbb{R}^d$
  • Repeat forever:
    • Generate an episode using $\pi(\cdot\mid \cdot, \theta)$
    • For each step $t$ of the episode:
      • $G_t \leftarrow$ return from step $t$
      • $\delta \leftarrow G_t - \hat{v}(S_t, w)$
      • $w \leftarrow w + \alpha^w \gamma^t \delta \nabla_w \hat{v}(S_t, w)$
      • $\theta \leftarrow \theta + \alpha^\theta \gamma^t \delta \nabla_\theta \ln \pi(A_t\mid S_t, \theta)$
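Under the same assumptions as the earlier sketch (and reusing its `grad_log_pi`), the baseline version only adds a linear value estimate $\hat{v}(s, w) = w^\top x_v(s)$; the state feature function $x_v$ is another assumption of the example:

```python
import numpy as np

def reinforce_baseline_update(theta, w, episode, alpha_theta, alpha_w, gamma=1.0):
    """One REINFORCE-with-baseline update per episode.

    episode: list of (x_s, x_v, a, r) tuples, where x_s are the per-action policy
    features, x_v is the state feature vector for the baseline, a is A_t,
    and r is R_{t+1}. Reuses grad_log_pi from the earlier sketch.
    """
    T = len(episode)
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = episode[t][3] + gamma * G
        returns[t] = G
    for t, (x_s, x_v, a, _) in enumerate(episode):
        delta = returns[t] - w @ x_v                        # G_t - v_hat(S_t, w)
        w = w + alpha_w * gamma**t * delta * x_v            # grad_w v_hat = x_v (linear case)
        theta = theta + alpha_theta * gamma**t * delta * grad_log_pi(theta, x_s, a)
    return theta, w
```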

Actor-Critic Methods

Although REINFORCE with baseline learns both a policy and a state-value function, we do not consider it an actor-critic method, because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated. REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly and to be inconvenient to implement online or for continuing tasks. With temporal-difference methods we can eliminate these problems, and with multi-step methods we can flexibly choose the degree of bootstrapping. To gain these advantages for policy gradient methods, we use actor-critic methods with a bootstrapping critic.

Algorithm: Actor-Critic with Eligibility Traces (episodic)

Actor-Critic with Eligibility Traces (episodic)
  • Input: a differentiable policy parameterization $\pi(a\mid s, \theta)$
  • Input: a differentiable state-value parameterization $\hat{v}(s, w)$
  • Parameters: trace-decay rates $\lambda^\theta \in [0, 1], \lambda^w \in [0, 1]$; step sizes $\alpha^\theta$ > 0, $\alpha^w$ > 0
  • Initialize policy parameters $\theta \in \mathbb{R}^{d'}$ and state-value weights $w \in \mathbb{R}^d$
  • Repeat forever:
    • Initialize $S$ (first state of episode)
    • $z^\theta \leftarrow 0$ (d'-component eligibility trace vector)
    • $z^w \leftarrow 0$ (d-component eligibility trace vector)
    • $I \leftarrow 1$
    • Repeat until S is terminal:
      • $A \sim \pi(\cdot\mid S, \theta)$
      • Take action A, observe R, S'
      • $\delta \leftarrow R + \gamma \hat{v}(S', w) - \hat{v}(S, w)$ (if $S'$ is terminal, then $\hat{v}(S', w) = 0$)
      • $z^w \leftarrow \gamma \lambda^w z^w + I \nabla_w \hat{v}(S, w)$
      • $z^\theta \leftarrow \gamma \lambda^\theta z^\theta + I \nabla_\theta \ln \pi(A\mid S, \theta)$
      • $w \leftarrow w + \alpha^w \delta z^w$
      • $\theta \leftarrow \theta + \alpha^\theta \delta z^\theta$
      • $I \leftarrow \gamma I$
      • $S \leftarrow S'$
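Here is a minimal sketch of one episode of this loop, under the same linear parameterizations as before. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the feature helpers `policy_features` / `value_features` are hypothetical, and `softmax_probs` / `grad_log_pi` come from the earlier sketch:

```python
import numpy as np

def actor_critic_episode(env, theta, w, alpha_theta, alpha_w,
                         lambda_theta, lambda_w, gamma=1.0):
    """Run one episode of actor-critic with eligibility traces."""
    s = env.reset()                                        # assumed environment API
    z_theta = np.zeros_like(theta)                         # actor trace (d'-vector)
    z_w = np.zeros_like(w)                                 # critic trace (d-vector)
    I = 1.0
    done = False
    while not done:
        x_s, x_v = policy_features(s), value_features(s)   # assumed feature helpers
        a = np.random.choice(len(x_s), p=softmax_probs(theta, x_s))
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else w @ value_features(s_next)  # v_hat(S', w) = 0 at terminal
        delta = r + gamma * v_next - w @ x_v               # TD error
        z_w = gamma * lambda_w * z_w + I * x_v             # grad_w v_hat = x_v (linear case)
        z_theta = gamma * lambda_theta * z_theta + I * grad_log_pi(theta, x_s, a)
        w = w + alpha_w * delta * z_w
        theta = theta + alpha_theta * delta * z_theta
        I *= gamma
        s = s_next
    return theta, w
```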

Problems

1. Prove (13.7) using the definitions and elementary calculus.

We have to prove that

\[\begin{align} \nabla_\theta \ln \pi(a\mid s, \theta) = x(s, a) - \sum_b \pi(b\mid s, \theta)\, x(s, b) \end{align}\]

This is for the exponential soft-max policy with linear action preferences.

We know that for the exponential soft-max policy, $\pi(a\mid s, \theta) = \frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))}$,

and with linear action preferences, $h(s, a, \theta) = \theta^T x(s, a)$.

Now moving to the proof,

\[\begin{align} \nabla_\theta \ln \pi(a\mid s, \theta) &= \nabla_\theta \ln \left(\frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))}\right) \\ &= \nabla_\theta \left(\ln \exp(h(s, a, \theta)) - \ln \sum_b \exp(h(s, b, \theta))\right) \\ &= \nabla_\theta h(s, a, \theta) - \nabla_\theta \ln \sum_b \exp(h(s, b, \theta)) \\ &= \nabla_\theta h(s, a, \theta) - \frac{\nabla_\theta \sum_b \exp(h(s, b, \theta))}{\sum_b \exp(h(s, b, \theta))} \\ &= \nabla_\theta \theta^T x(s, a) - \frac{\sum_b \nabla_\theta \exp(\theta^T x(s, b))}{\sum_b \exp(\theta^T x(s, b))} \\ &= x(s, a) - \frac{\sum_b \exp(\theta^T x(s, b))\, x(s, b)}{\sum_b \exp(\theta^T x(s, b))} \\ &= x(s, a) - \sum_b x(s, b)\, \frac{\exp(\theta^T x(s, b))}{\sum_c \exp(\theta^T x(s, c))} \\ &= x(s, a) - \sum_b \pi(b\mid s, \theta)\, x(s, b) \end{align}\]
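As a sanity check, the identity can be verified numerically with central finite differences on a small random instance (the shapes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_actions = 5, 3
theta = rng.normal(size=d)
x = rng.normal(size=(num_actions, d))      # x(s, b) for each action b, in one fixed state s

def log_pi(th, a):
    """log pi(a|s, theta) for the soft-max policy with linear preferences."""
    prefs = x @ th
    m = prefs.max()
    return prefs[a] - (m + np.log(np.exp(prefs - m).sum()))

a = 1
prefs = x @ theta
pi = np.exp(prefs - prefs.max())
pi = pi / pi.sum()
analytic = x[a] - pi @ x                   # x(s,a) - sum_b pi(b|s,theta) x(s,b)

eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(d)[i], a)
                     - log_pi(theta - eps * np.eye(d)[i], a)) / (2 * eps)
                    for i in range(d)])
print(np.max(np.abs(analytic - numeric)))  # should be tiny (finite-difference error)
```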