From REINFORCE to GRPO
21 May 2026 #AI

From WALL-E (2008)

This blog is an aggregate of various papers, blogs, and videos I have been going through over the last few weeks. The goal is to give a step-by-step mathematical explanation of how Large Language Models (LLMs) learn to reason using Reinforcement Learning (RL). We will start from the fundamentals, move through the REINFORCE algorithm, then Actor-Critic methods, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), and eventually arrive at Group Relative Policy Optimization (GRPO) and some of its variants such as Dr. GRPO. Rather than explaining each method independently, I will try to show how they are connected, why each method was introduced, and how one naturally leads to the next.

1. Introduction

I don't think reinforcement learning needs a long explanation, to be honest. A simple classical example is usually enough.

Say you have a dog and you want to teach it something, like fetching a ball. You throw the ball, and the dog runs after it, picks it up, and brings it back to you. If the dog does this correctly, you give it a treat (a reward). If it does something else, you don't give it a treat (no reward). Over time, the dog learns that fetching the ball leads to a reward, so it starts doing that behavior more often.

This is essentially what we do in reinforcement learning. We have an agent (the dog) that interacts with an environment (the world) and learns to take actions (fetching the ball) that maximize some notion of cumulative reward (treats).

Formalizing the Objective

Let the initial prompt be state $s_0$. The language model acts as our policy, denoted by $\pi_\theta(a_t \mid s_t)$, parameterized by $\theta$. It takes the current state $s_t$ (the prompt + previous tokens) and outputs a probability distribution over the next token $a_t$. A full sequence of states and actions is called a path, or a trajectory, denoted by $\tau$:

$$\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$$

The environment provides a reward at each time step, denoted by $r_t$. There is also a discount factor $\gamma \in [0, 1]$, which controls how much we value future rewards relative to immediate ones. When $\gamma = 1$, all rewards are weighted equally (undiscounted); when $\gamma < 1$, future rewards are exponentially down-weighted. The total discounted reward for a trajectory is:

$$R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$$
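
To make the formula concrete, here is a minimal sketch that computes $R(\tau)$ for one trajectory. The reward sequence is a made-up example (only the last step is rewarded, which loosely mirrors an LLM being scored once at the end of a response), and `discounted_return` is just an illustrative helper name.

```python
def discounted_return(rewards, gamma=1.0):
    """Compute R(tau) = sum_t gamma^t * r_t for a single trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical rewards r_0..r_3: only the final step is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]

undiscounted = discounted_return(rewards, gamma=1.0)  # all steps weighted equally
discounted = discounted_return(rewards, gamma=0.9)    # final reward scaled by 0.9**3

print(undiscounted, discounted)
```

With $\gamma = 1$ the return is just the final reward, while with $\gamma = 0.9$ that same reward is scaled down to $0.9^3 \approx 0.729$ because it arrives three steps into the trajectory.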

Goal: We want to find the parameters $\theta$ that maximize the expected reward over all possible trajectories:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

I would also like to point out three quantities that we will see throughout this blog:

  • Value function $V(s)$: expected future return starting from state $s$. $V(s) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \mid s_0 = s]$
  • Action-value function $Q(s, a)$: expected future return starting from state $s$ and taking action $a$. $Q(s, a) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \mid s_0 = s, a_0 = a]$
  • Advantage function $A(s, a)$: how much better action $a$ is than the average action at state $s$. If it's positive, then reinforce; if it's negative, then suppress; and if it's zero, then do nothing. $A(s, a) = Q(s, a) - V(s)$
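
A tiny numeric sketch may help tie the three quantities together. It assumes a single state with three candidate actions and made-up numbers for $\pi(a \mid s)$ and $Q(s, a)$, and it uses the fact that $V(s)$ is the policy-weighted average of $Q(s, a)$, which follows from the definitions above.

```python
import numpy as np

# Hypothetical single state with three candidate actions.
pi = np.array([0.5, 0.3, 0.2])   # pi(a | s), sums to 1
q = np.array([1.0, 0.0, -1.0])   # Q(s, a) for each action

v = float(np.dot(pi, q))         # V(s) = sum_a pi(a | s) * Q(s, a)
adv = q - v                      # A(s, a) = Q(s, a) - V(s)

print("V(s) =", v)
print("A(s, a) =", adv)
```

Here the first action has a positive advantage (reinforce) and the other two are negative (suppress). The advantages average to zero under the policy, which is what "better than the average action" means.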

2. REINFORCE

The REINFORCE algorithm, introduced by Williams in 1992, estimates policy gradients using Monte-Carlo sampled trajectories to optimize the policy parameters $\theta$.

REINFORCE belongs to the broader family of Vanilla Policy Gradient (VPG) methods. VPG refers to the basic policy gradient approach where the policy parameters are updated directly using gradient ascent on expected return, without additional stabilization techniques such as clipping, trust regions, or actor-critic bootstrapping.

Figure 1: Basic REINFORCE architecture for language models. The shaped reward combines the reward model score with a KL penalty from the reference model. Credits: rlhfbook.com

We begin with the objective:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

This expectation can be written as a sum over all trajectories:

$$J(\theta) = \sum_{\tau} P(\tau;\theta) R(\tau)$$

where the probability of a trajectory is

$$P(\tau;\theta) = \rho(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)$$

Here:

  • $\rho(s_0)$ is the initial state distribution
  • $\pi_\theta(a_t \mid s_t)$ is the policy
  • $P(s_{t+1} \mid s_t, a_t)$ represents environment dynamics

Taking the gradient with respect to $\theta$:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \sum_{\tau} P(\tau;\theta) R(\tau) \\
&= \sum_{\tau} R(\tau) \nabla_\theta P(\tau;\theta) \\
&= \sum_{\tau} R(\tau) P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) \\
&\qquad \because \nabla_\theta P(\tau;\theta) = P(\tau;\theta) \nabla_\theta \log P(\tau;\theta)
\end{aligned}
$$

Expanding the trajectory probability:

$$P(\tau;\theta) = \rho(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)$$

Taking the logarithm:

$$\log P(\tau;\theta) = \log \rho(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t)$$

Now differentiate with respect to $\theta$:

  • The environment dynamics do not depend on $\theta$
  • The initial state distribution also does not depend on $\theta$

Therefore,

$$\nabla_\theta \log P(\tau;\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
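
As a sanity check on this identity, here is a small sketch under simplifying assumptions that are mine, not part of the derivation: the policy is a softmax over three logits `theta` and ignores the state, and the trajectory is a made-up action sequence. For a softmax, $\nabla_\theta \log \pi_\theta(a)$ is the one-hot vector for $a$ minus the probability vector, so the trajectory-level gradient is just the sum of those per-step terms. The code compares that sum against a finite-difference gradient of the trajectory log-probability.

```python
import numpy as np


def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()


def log_prob(theta, action):
    return np.log(softmax(theta)[action])


def score(theta, action):
    """Analytic grad of log pi_theta(action): onehot(action) - softmax(theta)."""
    grad = -softmax(theta)
    grad[action] += 1.0
    return grad


def traj_log_prob(theta, actions):
    # Policy part of log P(tau; theta); dynamics terms are constant in theta.
    return sum(log_prob(theta, a) for a in actions)


theta = np.array([0.2, -0.5, 1.0])   # hypothetical logits
actions = [2, 0, 2]                  # hypothetical sampled actions

# Sum of per-step score functions.
analytic = sum(score(theta, a) for a in actions)

# Central finite differences on the trajectory log-probability.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    step = np.zeros_like(theta)
    step[i] = eps
    numeric[i] = (traj_log_prob(theta + step, actions)
                  - traj_log_prob(theta - step, actions)) / (2 * eps)

print(analytic)
print(numeric)
```

The two vectors should agree up to finite-difference error, which is the practical content of the identity: we never need the environment dynamics to compute the gradient.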

Substituting this back into the objective gradient, and recognizing the sum over trajectories weighted by $P(\tau;\theta)$ as an expectation, gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

This is the REINFORCE gradient estimator, which forms the basis of Vanilla Policy Gradient methods.
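
Here is a minimal Monte-Carlo sketch of this estimator. To keep it checkable, I assume the simplest possible setting: one-step episodes (so $R(\tau)$ is just the reward of the single action), a state-independent softmax policy over three actions, and made-up rewards. In that setting the exact gradient has the closed form $\pi_i (r_i - J)$ for each logit, so the sampled estimate can be compared against it.

```python
import numpy as np


def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()


rng = np.random.default_rng(0)
theta = np.array([0.0, 0.5, -0.5])     # hypothetical policy logits
rewards = np.array([1.0, 0.0, -1.0])   # hypothetical reward for each action

probs = softmax(theta)
n = 200_000
actions = rng.choice(3, size=n, p=probs)   # sample a ~ pi_theta

# grad log pi(a) for a softmax policy: onehot(a) - probs, one row per sample.
scores = np.eye(3)[actions] - probs

# REINFORCE estimate: average of R(tau) * grad log pi(a) over samples.
grad_estimate = (rewards[actions][:, None] * scores).mean(axis=0)

# Closed-form gradient of J = sum_a pi(a) r(a) for this toy problem.
expected_reward = float(probs @ rewards)
grad_exact = probs * (rewards - expected_reward)

print(grad_estimate)
print(grad_exact)
```

The estimate only needs sampled actions, their rewards, and the gradient of the log-probability, which is exactly why this estimator is usable for language models where the "environment" is not differentiable.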

However, using the full trajectory return $R(\tau)$ for every action can lead to very high variance. Every action is weighted using the final outcome of the entire episode, even though many rewards may not actually be caused by that specific action.

Instead, we use the future return starting from timestep $t$:

$$G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$$

The gradient now becomes

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right]$$

This works because an action at timestep $t$ cannot influence rewards that occurred before timestep $t$.
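
The future return $G_t$ can be computed for every timestep in a single backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$. The sketch below does this for a made-up reward sequence; `rewards_to_go` is an illustrative helper name.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] where G_t = sum_{k>=t} gamma^(k-t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical rewards r_0, r_1, r_2.
returns = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0]
```

Note that $G_2 = 2.0$ ignores the earlier reward $r_0 = 1.0$ entirely, so the last action is no longer credited for something that happened before it.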

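Even with future returns, the gradient estimate can still have high variance. To reduce this variance, we subtract a baseline $b(s_t)$ that does not depend on the sampled action. This leaves the gradient unbiased, because the baseline term vanishes in expectation over the action:

$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t) \right] = 0$$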
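
To see why this expectation is zero, we can write it out for a discrete action space (which is the case for token-level policies) and reuse the same log-derivative identity as before. This is the standard argument, sketched here for a fixed state $s_t$:

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, b(s_t) \right]
&= b(s_t) \sum_{a} \pi_\theta(a \mid s_t) \, \nabla_\theta \log \pi_\theta(a \mid s_t) \\
&= b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t) \\
&= b(s_t) \, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t) \\
&= b(s_t) \, \nabla_\theta 1 = 0
\end{aligned}
```

The key step is that $b(s_t)$ can be pulled out of the sum over actions, which is only valid because it does not depend on the sampled action.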

Therefore, the gradient can be rewritten as

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \bigl( G_t - b(s_t) \bigr) \right]$$

When the baseline is chosen as the value function $V^\pi(s_t)$, the quantity

$$\hat{A}^\pi(s_t, a_t) = G_t - V^\pi(s_t)$$

is a single-sample (Monte-Carlo) estimate of the advantage $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, since $G_t$ is an unbiased sample of $Q^\pi(s_t, a_t)$. It measures whether an action performed better or worse than the policy's average expectation for that state.

Why use a baseline?

Without a baseline, every action is weighted directly by the raw return $G_t$. But this only tells us whether the episode outcome was good or bad overall, not whether the action itself was better or worse than expected.

This creates high variance because the return mixes together:

  • The quality of the action
  • The quality of the state the agent was already in

As a result, a good action taken in a bad state and a bad action taken in a good state can produce similar returns.

Subtracting a baseline centers the learning signal:

$$G_t - b(s_t)$$

Instead of asking:

"Was the return high?"

the policy gradient now asks:

"Was this action better or worse than expected in this state?"

Commonly used baselines include:

  • Moving average reward
  • Batch mean reward (used carefully in practice)
  • Learned value function $V^\pi(s)$
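
To illustrate the variance reduction, here is a rough sketch using the batch mean reward as the baseline. The setup is an assumption for illustration only: one-step episodes, a state-independent softmax policy over three actions, and made-up rewards that are all large and positive (so the raw return says little about which action was better). Because the batch mean is computed from the same samples it is applied to, it introduces a small bias that shrinks with batch size, which is one reason this baseline needs care in practice.

```python
import numpy as np


def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()


rng = np.random.default_rng(0)
theta = np.array([0.0, 0.5, -0.5])      # hypothetical policy logits
rewards = np.array([11.0, 10.0, 9.0])   # hypothetical rewards, all large

probs = softmax(theta)
n = 100_000
actions = rng.choice(3, size=n, p=probs)
returns = rewards[actions]
scores = np.eye(3)[actions] - probs     # grad log pi(a) for a softmax policy

baseline = returns.mean()               # batch mean reward

g_raw = returns[:, None] * scores                    # no baseline
g_centered = (returns - baseline)[:, None] * scores  # with baseline

# Total per-sample variance of each estimator (summed over parameters).
var_raw = g_raw.var(axis=0).sum()
var_centered = g_centered.var(axis=0).sum()

# Closed-form gradient for this toy problem, for reference.
grad_exact = probs * (rewards - probs @ rewards)

print("variance without baseline:", var_raw)
print("variance with baseline:   ", var_centered)
print("estimate with baseline:   ", g_centered.mean(axis=0))
print("exact gradient:           ", grad_exact)
```

Both estimators target the same gradient, but the centered one has far lower per-sample variance because the weight $G_t - b$ now reflects only how much better or worse the sampled action was than average.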

3. References

  1. Nathan Lambert, Louis Castricato, Leandro von Werra, and Alex Havrilla. "Reinforcement Learning from Human Feedback: From Zero to Hero." rlhfbook.com, 2024.
Tags: #RL #NLP
