Reinforcement Learning (6): PPO and TRPO - Trust Region Policy Optimization
Chen Kai

In Chapter 3, we introduced the basic principles of policy gradient methods: computing gradients through trajectory sampling to directly optimize policy parameters. However, vanilla policy gradient has a fundamental problem — update instability. A single overly large policy update can cause dramatic performance collapse, and since the policy has already changed, recovery from errors becomes extremely difficult. It's like walking a tightrope at the edge of a cliff: one wrong step, and everything is lost.

Trust Region Methods address this with a core idea: limit the magnitude of each policy update to ensure the new policy doesn't deviate too far from the old one. TRPO (Trust Region Policy Optimization) achieves this through KL divergence constraints, while PPO (Proximal Policy Optimization) uses a simpler clipping mechanism for similar effects. Due to its simplicity and stability, PPO has become the most widely used reinforcement learning algorithm today — from OpenAI Five defeating professional Dota 2 players to ChatGPT's RLHF training pipeline, PPO is everywhere.

This chapter starts from the instability problem of policy gradients, dives deep into the theoretical foundations of trust region methods, explains the mathematical derivation of TRPO and the geometric intuition of natural gradients, then introduces how PPO approximates TRPO's effect with a simple clipping mechanism. We'll also explore practical techniques and PPO's application in RLHF.

The Instability Problem of Policy Gradients

Vanilla Policy Gradient Recap

Let's first review the core idea of policy gradient methods. Policy gradient methods directly parameterize the policy $\pi_\theta(a|s)$, aiming to maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

Through the REINFORCE algorithm, we obtain an unbiased estimate of the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted return from time $t$. The intuition behind this formula is: if an action sequence receives high return, increase the probability of taking those actions; otherwise decrease it.

After introducing a baseline, we typically use the advantage function $A^\pi(s_t, a_t)$ instead of $G_t$, which reduces variance while maintaining unbiasedness:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^\pi(s_t, a_t)\right]$$
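The estimator above can be checked numerically on a one-state bandit. A minimal sketch (the two-arm setup and return values are made up for illustration; the constant baseline is the mean return):

```python
import torch

# Two-armed bandit: policy is a softmax over two logits
logits = torch.zeros(2, requires_grad=True)
returns = torch.tensor([1.0, 0.0])   # arm 0 pays 1, arm 1 pays 0 (illustrative)
baseline = returns.mean()            # constant baseline reduces variance

# Monte Carlo estimate of the policy gradient: E[grad log pi(a) * (G - b)]
torch.manual_seed(0)
probs = torch.softmax(logits, dim=0)
actions = torch.multinomial(probs.detach(), num_samples=10000, replacement=True)
log_probs = torch.log(probs)[actions]
advantages = returns[actions] - baseline
loss = -(log_probs * advantages).mean()   # minimize the negative objective
loss.backward()

# The gradient pushes probability mass toward the high-return arm 0
print(logits.grad)
```

Running this shows a negative loss-gradient on arm 0's logit and a positive one on arm 1's, i.e. gradient descent increases the high-return arm's probability.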

Why Are Policy Updates Unstable?

Policy gradient looks elegant, but in practice often encounters training instability. Let's understand this problem through a concrete example.

Example: Dramatic Changes in Action Probabilities

Suppose the current policy selects action $a_1$ with probability 0.9 and action $a_2$ with probability 0.1 in some state. In one sampling run, we happen to sample action $a_2$, and this trajectory receives very high return.

What will policy gradient do? It will dramatically increase $\pi_\theta(a_2|s)$. But here are the problems:

  1. High Variance: Since $a_2$ has low sampling probability (only 0.1), such "lucky" trajectories are very rare. Most of the time we sample $a_1$, and even if $a_1$'s return is mediocre, the gradient will point toward increasing $a_1$'s probability. This leads to extremely high variance in gradient estimates.

  2. Distribution Shift: Once we update the policy, the new policy $\pi_{\theta'}$ has a different data distribution than $\pi_\theta$. But the data we used to compute the gradient came from the old policy! This means our gradient estimate may be completely inaccurate under the new policy.

  3. Performance Collapse: A single large update might push the policy into a very poor region. For example, if we increase $a_2$'s probability from 0.1 to 0.8, but $a_2$ only received high return due to luck, the new policy's performance will dramatically decrease. Worse, since the data distribution has changed, it becomes very difficult to recover from this error.

Mathematical Analysis: The Impact of Step Size

Consider a gradient update of the policy parameters: $\theta' = \theta + \alpha \nabla_\theta J(\theta)$.

Taylor expansion of the objective function:

$$J(\theta') \approx J(\theta) + \nabla_\theta J(\theta)^\top \Delta\theta + \frac{1}{2}\,\Delta\theta^\top H\, \Delta\theta$$

where $H$ is the Hessian matrix. Substituting $\Delta\theta = \alpha \nabla_\theta J(\theta)$:

$$J(\theta') \approx J(\theta) + \alpha \|\nabla_\theta J(\theta)\|^2 + \frac{\alpha^2}{2}\, \nabla_\theta J(\theta)^\top H\, \nabla_\theta J(\theta)$$

If the step size $\alpha$ is too large, the second-order term may become very large. If $H$ has negative eigenvalues (non-convex objective), this term may be negative, leading to $J(\theta') < J(\theta)$ — performance degradation!
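The effect of step size is easy to see on a toy non-convex objective with a narrow peak (the function here is made up purely for illustration): a small step climbs the peak, a large step overshoots it and lands in a flat region it is hard to recover from.

```python
import numpy as np

def J(theta):
    # Toy objective with a narrow peak around theta = 0
    return np.exp(-50 * theta ** 2)

def grad_J(theta):
    return -100 * theta * np.exp(-50 * theta ** 2)

theta = -0.1
small, large = 0.005, 0.5
theta_small = theta + small * grad_J(theta)   # cautious step: improves
theta_large = theta + large * grad_J(theta)   # huge step: overshoots the peak

print(J(theta), J(theta_small), J(theta_large))
```

The small step increases the objective; the large step lands far past the peak, where the objective is essentially zero and the gradient vanishes, so recovery is hard.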

Parameter Space vs Policy Space

A deeper problem is: Euclidean distance in parameter space doesn't reflect the true change in policy.

Consider two Gaussian policies:

  • $\pi_1 = \mathcal{N}(\mu, \sigma_1^2)$ with small variance $\sigma_1^2$
  • $\pi_2 = \mathcal{N}(\mu, \sigma_2^2)$ with large variance $\sigma_2^2$

These two policies differ in only one parameter (the variance), so they are close in parameter space. But their behaviors are completely different! $\pi_1$ almost deterministically selects actions near $\mu$, while $\pi_2$'s actions are nearly random.
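The mismatch is easy to quantify: for a 1-D Gaussian policy, the same parameter-space step on $\mu$ corresponds to wildly different KL divergences depending on $\sigma$. A sketch (the step size and $\sigma$ values are chosen for illustration):

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    # KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )
    return (np.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5)

# Same Euclidean step of 0.1 on mu, applied to a wide and a narrow policy
step = 0.1
kl_wide = kl_gauss(0.0, 10.0, step, 10.0)     # wide policy: distribution barely moves
kl_narrow = kl_gauss(0.0, 0.01, step, 0.01)   # narrow policy: distribution changes drastically

print(kl_wide, kl_narrow)
```

An identical parameter-space step produces a KL divergence about six orders of magnitude larger for the narrow policy, which is exactly why Euclidean step-size constraints are the wrong tool.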

This tells us: we shouldn't constrain update step size in parameter space, but rather in policy distribution space. This is the core idea of trust region methods.

An Intuitive Analogy

Imagine you're hiking in unfamiliar mountains. Your goal is to reach the summit (maximize return).

Vanilla Policy Gradient is like this: you glance at the compass (gradient direction), then take a big step in that direction. Problems:

  • You might step off a cliff
  • You might fall into a pit
  • Once you go wrong, it's hard to turn back

Trust Region Methods are more cautious: you ensure each step isn't too far, always staying within a "safe region." Even if the direction is slightly off, you won't end up in danger.

Importance Sampling and Policy Optimization

Importance Sampling: Reusing Old Data

In policy gradient, after each policy update, old data becomes "outdated" — it was collected from the old policy, with a different distribution than the new policy. This leads to low sample efficiency: we need to constantly collect new data.

Importance Sampling provides a way to reuse old data. The basic idea: if we want to compute an expectation under distribution $p$, but only have samples from distribution $q$, we can correct through weighting:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)}\, f(x)\right]$$

The weight $w(x) = p(x)/q(x)$ is called the importance weight, correcting for distribution differences.
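The correction can be verified with a quick Monte Carlo check — estimating an expectation under $p$ using only samples drawn from $q$ (distributions and target chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[x^2] for p = N(1, 1), using samples from q = N(0, 1)
samples = rng.normal(0.0, 1.0, size=200_000)    # x ~ q

def log_pdf(x, mu):
    # log density of N(mu, 1), up to a constant that cancels in the ratio
    return -0.5 * (x - mu) ** 2

weights = np.exp(log_pdf(samples, 1.0) - log_pdf(samples, 0.0))
estimate = np.mean(weights * samples ** 2)

print(estimate)   # true value: E_p[x^2] = mu^2 + sigma^2 = 2
```

The weighted average over $q$-samples recovers the $p$-expectation, even though no sample was ever drawn from $p$.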

Application to Policy Optimization

Suppose we have trajectory data collected by the old policy $\pi_{\theta_{old}}$, and now want to evaluate the new policy $\pi_\theta$'s performance. Applying importance sampling:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\left[\frac{P_\theta(\tau)}{P_{\theta_{old}}(\tau)}\, R(\tau)\right]$$

where the trajectory's probability ratio is:

$$\frac{P_\theta(\tau)}{P_{\theta_{old}}(\tau)} = \frac{\rho(s_0) \prod_t \pi_\theta(a_t|s_t)\, P(s_{t+1}|s_t, a_t)}{\rho(s_0) \prod_t \pi_{\theta_{old}}(a_t|s_t)\, P(s_{t+1}|s_t, a_t)}$$

Notice that the initial state distribution $\rho(s_0)$ and transition probabilities $P(s_{t+1}|s_t, a_t)$ cancel out:

$$\frac{P_\theta(\tau)}{P_{\theta_{old}}(\tau)} = \prod_t \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

This is a key result: the importance weight of a trajectory only depends on policy action probability ratios, independent of environment dynamics!

Surrogate Objective Function

Directly optimizing $J(\theta)$ in this form still has problems: the trajectory importance weight is a product of all timestep weights, and when the horizon $T$ is large, this product's variance grows exponentially.

A more practical approach is to define a surrogate objective function, considering only single-step importance weights:

$$L(\theta) = \mathbb{E}_{s, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, A^{\pi_{\theta_{old}}}(s, a)\right]$$

where $A^{\pi_{\theta_{old}}}$ is the old policy's advantage function. We typically abbreviate this as:

$$L(\theta) = \mathbb{E}_t\left[r_t(\theta)\, \hat{A}_t\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio.

Key properties of the surrogate objective:

  1. First-order approximation: At $\theta = \theta_{old}$, $L(\theta)$ and the true objective $J(\theta)$ have the same gradient: $\nabla_\theta L(\theta)\big|_{\theta = \theta_{old}} = \nabla_\theta J(\theta)\big|_{\theta = \theta_{old}}$

  2. Equal value: At $\theta = \theta_{old}$, $r_t(\theta_{old}) = 1$, so $L(\theta_{old}) = \mathbb{E}_t[\hat{A}_t] = 0$ — the surrogate measures improvement over the old policy, which is zero for the old policy itself.

This means the surrogate objective is a good local approximation of the true objective! But the problem is: when $\theta$ is far from $\theta_{old}$, this approximation may completely fail.

Variance Problems with Importance Sampling

When old and new policies differ significantly, importance weight variance increases dramatically. Consider the weight's expectation and variance under the old policy:

$$\mathbb{E}_{a \sim \pi_{old}}\left[\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}\right] = \sum_a \pi_{old}(a|s)\, \frac{\pi_\theta(a|s)}{\pi_{old}(a|s)} = \sum_a \pi_\theta(a|s) = 1$$

The expectation is always 1, good. But the variance?

$$\mathrm{Var}\left[\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}\right] = \mathbb{E}_{a \sim \pi_{old}}\left[\left(\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}\right)^2\right] - 1$$

If some action $a$ has high probability under the new policy ($\pi_\theta(a|s) = 0.9$) but low probability under the old policy ($\pi_{old}(a|s) = 0.1$), the weight is 9 and the weight squared is 81! This causes:

  1. Gradient estimate variance explosion
  2. A few high-weight samples dominate the entire gradient
  3. Extremely unstable training

This is why we need to limit the difference between old and new policies — the core of trust region methods.
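The blow-up is easy to demonstrate exactly for a discrete action distribution: hold the old policy fixed and watch the importance-weight variance as the new policy drifts away (the probabilities below are illustrative).

```python
import numpy as np

def weight_stats(p_new, p_old):
    # Exact mean and variance of the importance weight pi_new / pi_old
    # under the old policy, for a discrete action distribution
    w = p_new / p_old
    mean = np.sum(p_old * w)            # always 1
    var = np.sum(p_old * w ** 2) - mean ** 2
    return mean, var

p_old = np.array([0.9, 0.1])
variances = []
for p2 in [0.1, 0.5, 0.9]:              # new policy's probability of action 2
    p_new = np.array([1 - p2, p2])
    mean, var = weight_stats(p_new, p_old)
    variances.append(var)
    print(p2, mean, var)
```

The mean stays at 1 throughout, but the variance grows from 0 (identical policies) to above 7 once the rare action's probability rises from 0.1 to 0.9 — exactly the 9-vs-81 effect described above.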

TRPO: Trust Region Policy Optimization

Theoretical Foundation: Policy Improvement Lower Bound

TRPO's theoretical foundation comes from an important result by Kakade and Langford (2002).

Define policy $\pi$'s expected return: $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$

Theorem: The new policy's expected return can be expressed as:

$$J(\pi') = J(\pi) + \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right]$$

where $A^\pi$ is the old policy's advantage function.

Proof sketch: using the relation $A^\pi(s_t, a_t) = \mathbb{E}\left[r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right]$ and a telescoping sum, we get:

$$\mathbb{E}_{\tau \sim \pi'}\left[\sum_t \gamma^t A^\pi(s_t, a_t)\right] = \mathbb{E}_{\tau \sim \pi'}\left[\sum_t \gamma^t r_t\right] - \mathbb{E}_{s_0}\left[V^\pi(s_0)\right] = J(\pi') - J(\pi)$$

This theorem tells us: the new policy's performance equals the old policy's performance plus an "advantage term". If we can make the new policy always select positive-advantage actions at each state, performance will improve!

From Exact Equality to Optimizable Approximation

The equation above looks beautiful, but has a problem: the expectation on the right side is computed under $\pi'$'s trajectory distribution, but we don't have $\pi'$ yet! This is a chicken-and-egg problem.

Approximation 1: Use old policy's state distribution

Define a surrogate objective under the old policy's state distribution:

$$L_\pi(\pi') = J(\pi) + \sum_s \rho_\pi(s) \sum_a \pi'(a|s)\, A^\pi(s, a)$$

where $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s)$ is the old policy's discounted state visitation frequency.

This $L_\pi(\pi')$ can be estimated using the old policy's data, applying importance sampling to convert the expectation over $a \sim \pi'$ to one over $a \sim \pi$:

$$L_\pi(\pi') = J(\pi) + \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}\, A^\pi(s, a)\right]$$

Key question: how far apart are $L_\pi(\pi')$ and the true objective $J(\pi')$?

Conservative Lower Bound on Policy Improvement

Schulman et al. (2015) proved a key theorem:

Theorem (Policy Improvement Lower Bound):

$$J(\pi') \geq L_\pi(\pi') - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, \alpha^2$$

where:

  • $\epsilon = \max_{s,a} |A^\pi(s, a)|$ is the maximum absolute advantage
  • $\alpha = \max_s D_{TV}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)$ is the maximum total variation distance between the policies
  • $D_{TV}$ is total variation distance

Since total variation distance relates to KL divergence via $D_{TV}(p \,\|\, q)^2 \leq D_{KL}(p \,\|\, q)$, we get:

$$J(\pi') \geq L_\pi(\pi') - C \cdot D_{KL}^{\max}(\pi \,\|\, \pi')$$

where $C = \dfrac{4\epsilon\gamma}{(1-\gamma)^2}$.

Significance of this theorem:

It provides a conservative lower bound on the new policy's performance. As long as we:

  1. Maximize the surrogate objective $L_\pi(\pi')$
  2. Constrain $D_{KL}^{\max}(\pi \,\|\, \pi')$ to not be too large

we can guarantee that the new policy's performance won't be much worse than the old policy's. This is the theoretical foundation of trust region methods.

TRPO's Optimization Problem

Based on the above theory, TRPO formulates policy optimization as a constrained optimization problem:

$$\max_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, A^{\theta_{old}}(s, a)\right] \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$$

where $\delta$ is the trust region size, typically around 0.01.

Note: we use the average KL divergence $\bar{D}_{KL}(\theta_{old}, \theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\right]$ instead of the maximum KL divergence. This is a practical approximation — the maximum KL is hard to compute and optimize in practice.

Natural Gradient: Geometry of Policy Space

To understand TRPO's optimization method, we need to first understand the concept of natural gradient.

Problem: What's wrong with Euclidean distance in parameter space?

Consider standard gradient ascent: $\theta' = \theta + \alpha \nabla_\theta J(\theta)$. This is equivalent to solving:

$$\theta' = \arg\max_{\theta'} \; \nabla_\theta J(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad \|\theta' - \theta\|_2 \leq \epsilon$$

The constraint $\|\theta' - \theta\|_2 \leq \epsilon$ is a Euclidean distance constraint in parameter space. But as we discussed, parameter-space Euclidean distance doesn't reflect true policy change.

The Idea of Natural Gradient

Natural gradient doesn't constrain in parameter space, but in probability distribution space. Specifically, we use KL divergence to measure policy change:

$$\theta' = \arg\max_{\theta'} \; \nabla_\theta J(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta'}\big) \leq \epsilon$$

For small steps, the KL divergence is well approximated by its second-order Taylor expansion:

$$D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta'}\big) \approx \frac{1}{2}\, (\theta' - \theta)^\top F\, (\theta' - \theta)$$

where $F = \mathbb{E}_{s,a}\left[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top\right]$ is the Fisher information matrix. The Fisher matrix defines the Riemannian metric of policy space, telling us at each parameter point how sensitive the policy distribution is to parameter changes.

Form of Natural Gradient

Using Lagrange multipliers to solve the constrained optimization problem above, we get the natural gradient update:

$$\theta' = \theta + \alpha\, F^{-1} \nabla_\theta J(\theta)$$

where $\tilde{\nabla}_\theta J = F^{-1} \nabla_\theta J$ is the natural gradient.

Intuitive understanding:

Ordinary gradient points in the "steepest" direction in parameter space. But if the parameter space coordinate system is poorly chosen (some directions affect policy greatly, others weakly), this direction may not be the fastest for improving performance.

Natural gradient "corrects" this through the Fisher matrix. It finds the steepest direction in policy distribution space, then maps it back to parameter space. It's like navigating on Earth: although latitude and longitude are orthogonal coordinates, at different latitudes, one degree of longitude corresponds to different actual distances. Natural gradient accounts for this "metric variation."
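The rescaling can be shown analytically for a 1-D Gaussian policy $\mathcal{N}(\mu, \sigma^2)$ parameterized by $(\mu, \log\sigma)$, whose Fisher matrix is $\mathrm{diag}(1/\sigma^2,\, 2)$. A sketch (the vanilla gradient vector is made up for illustration):

```python
import numpy as np

def fisher_gaussian(sigma):
    # Fisher information of N(mu, sigma^2) w.r.t. parameters (mu, log sigma)
    return np.diag([1.0 / sigma ** 2, 2.0])

grad = np.array([1.0, 1.0])          # some vanilla gradient (illustrative)

# Natural gradient = F^{-1} grad, at a narrow and a wide policy
nat_narrow = np.linalg.solve(fisher_gaussian(0.1), grad)
nat_wide = np.linalg.solve(fisher_gaussian(10.0), grad)

print(nat_narrow, nat_wide)
```

For the narrow policy the $\mu$-component shrinks to $\sigma^2 = 0.01$ (a small $\mu$ shift already changes the distribution a lot), while for the wide policy it grows to $100$ (large $\mu$ steps are safe) — the same "metric correction" as in the latitude/longitude analogy.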

TRPO Implementation Details

TRPO uses Conjugate Gradient (CG) to efficiently solve the constrained optimization problem, avoiding explicit computation and storage of $F$ or its inverse.

Algorithm Steps:

  1. Collect data: Sample trajectories using current policy

  2. Estimate advantages: Compute the advantage $\hat{A}(s, a)$ for each state-action pair (usually with GAE)

  3. Compute policy gradient: $g = \nabla_\theta L(\theta)\big|_{\theta = \theta_{old}}$

  4. Solve $Fx = g$ with conjugate gradient: no need to explicitly form $F$, only the matrix-vector product $Fv$, which can be computed efficiently with one forward and one (double) backward pass

  5. Compute step size: $\beta = \sqrt{\dfrac{2\delta}{x^\top F x}}$ (so the KL constraint is exactly satisfied at the full step)

  6. Backtracking line search: Find maximum step that satisfies constraint and actually improves objective

  7. Update parameters: $\theta \leftarrow \theta + \beta x$

Core of Conjugate Gradient

Conjugate gradient solves $Fx = g$, needing only the ability to compute $Fv$. The Fisher matrix-vector product can be computed efficiently via double backpropagation:

$$Fv = \nabla_\theta\left(\big(\nabla_\theta \bar{D}_{KL}\big)^\top v\right)$$

This requires two backward passes, but doesn't need to store the entire $F$ matrix.

import torch
import torch.nn as nn
import numpy as np
from torch.distributions import Categorical, Normal

class TRPOAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=64, delta=0.01,
                 gamma=0.99, lam=0.95, continuous=False):
        self.continuous = continuous
        self.delta = delta
        self.gamma = gamma
        self.lam = lam

        # Policy network
        if continuous:
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim)
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )

        # Value network
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        self.value_optimizer = torch.optim.Adam(self.value.parameters(), lr=1e-3)

    def get_policy_params(self):
        """Get policy network parameters"""
        if self.continuous:
            return list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            return list(self.policy.parameters())

    def get_action_and_log_prob(self, state):
        """Get action and log probability"""
        state = torch.FloatTensor(state)
        if self.continuous:
            mean = self.policy_mean(state)
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum()
            return action.detach().numpy(), log_prob.detach()
        else:
            probs = self.policy(state)
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action).detach()

    def compute_advantages(self, rewards, values, dones):
        """Compute GAE advantage estimates"""
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        return torch.FloatTensor(advantages)

    def flat_grad(self, grads):
        """Flatten gradients to 1D vector"""
        return torch.cat([g.view(-1) for g in grads if g is not None])

    def flat_params(self):
        """Flatten parameters to 1D vector"""
        params = self.get_policy_params()
        return torch.cat([p.view(-1) for p in params])

    def set_flat_params(self, flat_params):
        """Restore parameters from 1D vector"""
        params = self.get_policy_params()
        idx = 0
        for p in params:
            p.data.copy_(flat_params[idx:idx + p.numel()].view(p.shape))
            idx += p.numel()

    def compute_kl(self, states, old_dist=None):
        """KL(old || new) averaged over states.

        If old_dist is None, a detached copy of the current policy plays the
        role of the old policy: the KL value is then zero, but its second
        derivative is the Fisher matrix, which is exactly what
        hessian_vector_product needs. During line search we must instead pass
        the old policy's distribution explicitly — otherwise the constraint
        check would compare the new policy with itself and trivially pass.
        """
        if self.continuous:
            mean = self.policy_mean(states)
            std = torch.exp(self.policy_log_std).expand_as(mean)
            if old_dist is None:
                mean_old, std_old = mean.detach(), std.detach()
            else:
                mean_old, std_old = old_dist
            # KL(old || new) for diagonal Gaussians
            kl = (torch.log(std / std_old)
                  + (std_old ** 2 + (mean_old - mean) ** 2) / (2 * std ** 2)
                  - 0.5).sum(dim=-1).mean()
        else:
            probs = self.policy(states)
            probs_old = probs.detach() if old_dist is None else old_dist
            kl = (probs_old * (torch.log(probs_old + 1e-8)
                               - torch.log(probs + 1e-8))).sum(dim=-1).mean()
        return kl

    def hessian_vector_product(self, states, vector, damping=0.1):
        """Compute Fisher matrix-vector product Fv via double backprop"""
        kl = self.compute_kl(states)

        params = self.get_policy_params()
        grads = torch.autograd.grad(kl, params, create_graph=True)
        flat_grads = self.flat_grad(grads)

        grad_vector_product = (flat_grads * vector).sum()
        hv = torch.autograd.grad(grad_vector_product, params)
        flat_hv = self.flat_grad(hv)

        # Damping improves the conditioning of the CG solve
        return flat_hv + damping * vector

    def conjugate_gradient(self, states, b, n_steps=10, residual_tol=1e-10):
        """Conjugate gradient to solve Fx = b"""
        x = torch.zeros_like(b)
        r = b.clone()
        p = r.clone()
        rdotr = r.dot(r)

        for _ in range(n_steps):
            Ap = self.hessian_vector_product(states, p)
            alpha = rdotr / (p.dot(Ap) + 1e-8)
            x += alpha * p
            r -= alpha * Ap
            new_rdotr = r.dot(r)
            if new_rdotr < residual_tol:
                break
            p = r + (new_rdotr / rdotr) * p
            rdotr = new_rdotr

        return x

    def surrogate(self, states, actions, advantages, old_log_probs):
        """Importance-sampled surrogate objective L(theta)"""
        if self.continuous:
            mean = self.policy_mean(states)
            std = torch.exp(self.policy_log_std)
            log_probs = Normal(mean, std).log_prob(actions).sum(dim=-1)
        else:
            log_probs = Categorical(self.policy(states)).log_prob(actions)
        return (torch.exp(log_probs - old_log_probs) * advantages).mean()

    def line_search(self, states, actions, advantages, old_log_probs,
                    step_direction, max_kl):
        """Backtracking line search: accept the largest step that satisfies
        the KL constraint AND actually improves the surrogate objective"""
        old_params = self.flat_params().clone()

        # Snapshot the old policy's distribution for the KL check,
        # and the surrogate value at the old parameters (ratio = 1)
        with torch.no_grad():
            if self.continuous:
                mean_old = self.policy_mean(states)
                std_old = torch.exp(self.policy_log_std).expand_as(mean_old)
                old_dist = (mean_old, std_old)
            else:
                old_dist = self.policy(states)
            old_surrogate = self.surrogate(states, actions, advantages,
                                           old_log_probs)

        # Maximum step size: beta = sqrt(2 * delta / (s^T F s))
        sAs = 0.5 * step_direction.dot(
            self.hessian_vector_product(states, step_direction))
        max_step = torch.sqrt(max_kl / (sAs + 1e-8))

        # Backtracking line search
        for shrink_factor in [0.5 ** i for i in range(10)]:
            step = max_step * shrink_factor
            new_params = old_params + step * step_direction
            self.set_flat_params(new_params)

            with torch.no_grad():
                kl = self.compute_kl(states, old_dist)
                new_surrogate = self.surrogate(states, actions, advantages,
                                               old_log_probs)

            if kl.item() < max_kl and new_surrogate > old_surrogate:
                return True

        # No acceptable step found: revert to old parameters
        self.set_flat_params(old_params)
        return False

Pros and Cons of TRPO

Advantages: 1. Theoretical guarantee: Has clear performance improvement lower bound 2. Stable updates: KL constraint ensures each update isn't too aggressive 3. Sample efficiency: Can reuse same batch of data multiple times (within KL constraint)

Disadvantages: 1. Complex implementation: Requires conjugate gradient and line search 2. Computationally expensive: Each update needs multiple backward passes 3. Hard to tune: Trust region size $\delta$ requires experience to choose

PPO: Proximal Policy Optimization

Motivation: From TRPO to PPO

TRPO is theoretically elegant but complex to implement and computationally expensive. Schulman et al. proposed PPO in 2017, with the goal: approximate TRPO's effect with first-order optimization while keeping implementation simple.

PPO's core idea: since we want to limit old-new policy difference, why not directly add this constraint in the objective function?

PPO-Clip: The Clipped Version

PPO's most popular variant is the clipped version, which limits policy updates by clipping the probability ratio:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$

where:

  • $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio
  • $\epsilon$ is the clip parameter, typically 0.1 or 0.2
  • $\mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)$ restricts $r_t$ to $[1-\epsilon,\, 1+\epsilon]$

Detailed Analysis of the Objective Function:

Let's discuss by case:

Case 1: $\hat{A}_t > 0$ (good action, we want to increase its probability)

  • If $r_t < 1$ (new policy decreased probability): objective is $r_t \hat{A}_t$, gradient encourages increasing $r_t$
  • If $1 \leq r_t \leq 1+\epsilon$: objective is still $r_t \hat{A}_t$, gradient continues encouraging increase
  • If $r_t > 1+\epsilon$: objective becomes the constant $(1+\epsilon)\hat{A}_t$, gradient is 0!

The last case is key: when the probability ratio exceeds $1+\epsilon$, the objective becomes constant and the gradient is 0. This prevents the policy from continuing to update in this direction — even with a large advantage, we can't overly increase the action's probability.

Case 2: $\hat{A}_t < 0$ (bad action, we want to decrease its probability)

  • If $r_t > 1$ (new policy increased probability): objective is $r_t \hat{A}_t$ (negative), gradient encourages decreasing $r_t$
  • If $1-\epsilon \leq r_t \leq 1$: objective is still $r_t \hat{A}_t$, gradient continues encouraging decrease
  • If $r_t < 1-\epsilon$: objective becomes the constant $(1-\epsilon)\hat{A}_t$, gradient is 0!

Similarly, when the probability ratio falls below $1-\epsilon$, the gradient becomes 0, preventing further decrease.

Intuition Summary: PPO-Clip's objective is "pessimistic" — it takes the smaller of the two objectives. This ensures:

  • The policy won't change dramatically because of one high-advantage sample
  • Even if sampling noise makes advantage estimates inaccurate, the policy won't fluctuate wildly
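The zero-gradient behavior at the clip boundary can be checked directly in PyTorch. A sketch for the positive-advantage case (ratio values and advantage are chosen for illustration, $\epsilon = 0.2$):

```python
import torch

def clip_objective(ratio, advantage, eps=0.2):
    # PPO-Clip for a single sample: min(r * A, clip(r) * A)
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return torch.min(surr1, surr2)

advantage = torch.tensor(2.0)        # a "good" action (positive advantage)
grads = {}
for r0 in [0.9, 1.1, 1.5]:
    ratio = torch.tensor(r0, requires_grad=True)
    clip_objective(ratio, advantage).backward()
    grads[r0] = ratio.grad.item()

print(grads)                          # gradient vanishes once ratio > 1 + eps
```

Inside $[1-\epsilon, 1+\epsilon]$ the gradient equals the advantage, but at ratio 1.5 the clipped branch wins the min and the gradient is exactly 0; the mirror image holds for negative advantages below $1-\epsilon$.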

PPO-Penalty: The Penalty Version

Another PPO variant adds KL divergence as a penalty term:

$$L^{KLPEN}(\theta) = \mathbb{E}_t\left[r_t(\theta)\,\hat{A}_t - \beta\, D_{KL}\big(\pi_{\theta_{old}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\big)\right]$$

where $\beta$ is the KL penalty coefficient. This is equivalent to TRPO's Lagrangian form, but converts the constraint to a penalty.

Adaptive KL Penalty:

The choice of $\beta$ is important. If $\beta$ is too small, the KL may grow too large; if too large, learning is too conservative. PPO-Penalty uses an adaptive adjustment:

def adaptive_kl_penalty(kl, d_target=0.01, beta=1.0):
    """Adaptively adjust KL penalty coefficient"""
    if kl < d_target / 1.5:
        # KL too small, can be more aggressive
        beta /= 2
    elif kl > d_target * 1.5:
        # KL too large, need to be more conservative
        beta *= 2
    return beta

In practice, PPO-Clip is more commonly used than PPO-Penalty because it doesn't require tuning $\beta$.

Complete PPO Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.distributions import Categorical, Normal

class PPOAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=64,
                 lr=3e-4, gamma=0.99, lam=0.95, eps_clip=0.2,
                 c1=0.5, c2=0.01, k_epochs=10, continuous=False):
        """
        PPO Agent

        Args:
            state_dim: State dimension
            action_dim: Action dimension
            hidden_dim: Hidden layer dimension
            lr: Learning rate
            gamma: Discount factor
            lam: GAE lambda parameter
            eps_clip: PPO clipping parameter
            c1: Value loss coefficient
            c2: Entropy coefficient (encourages exploration)
            k_epochs: Number of update epochs per batch
            continuous: Whether action space is continuous
        """
        self.continuous = continuous
        self.gamma = gamma
        self.lam = lam
        self.eps_clip = eps_clip
        self.c1 = c1
        self.c2 = c2
        self.k_epochs = k_epochs

        if continuous:
            # Continuous action space: output mean and std
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Tanh()
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
            policy_params = list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            # Discrete action space: output action probabilities
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )
            policy_params = list(self.policy.parameters())

        # Value network
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )

        # Unified optimizer over policy and value parameters
        self.optimizer = optim.Adam(
            policy_params + list(self.value.parameters()),
            lr=lr
        )

    def get_action(self, state):
        """Sample action from current policy"""
        state = torch.FloatTensor(state)

        if self.continuous:
            mean = self.policy_mean(state)
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum()
            return action.detach().numpy(), log_prob.detach()
        else:
            probs = self.policy(state)
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action).detach()

    def evaluate(self, states, actions):
        """Evaluate log probability, value, and entropy for given state-action pairs"""
        if self.continuous:
            mean = self.policy_mean(states)
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            log_probs = dist.log_prob(actions).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1).mean()
        else:
            probs = self.policy(states)
            dist = Categorical(probs)
            log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

        values = self.value(states).squeeze()
        return log_probs, values, entropy

    def compute_gae(self, rewards, values, dones, next_value):
        """
        Compute Generalized Advantage Estimation (GAE)

        GAE balances bias and variance:
        - lambda=0: Use single-step TD error, low variance but high bias
        - lambda=1: Use Monte Carlo return, low bias but high variance
        - lambda=0.95: Commonly used balance point in practice
        """
        advantages = []
        gae = 0
        values = list(values) + [next_value]

        for t in reversed(range(len(rewards))):
            # TD error
            delta = rewards[t] + self.gamma * values[t + 1] * (1 - dones[t]) - values[t]
            # GAE recursion
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)

        return torch.FloatTensor(advantages)

    def update(self, states, actions, old_log_probs, rewards, dones, next_state):
        """
        PPO Update

        Core idea:
        1. Compute the advantage function (with GAE)
        2. Run multiple optimization epochs, each using the PPO-Clip objective
        3. Simultaneously optimize policy and value function
        """
        # Convert to tensors
        states = torch.FloatTensor(np.array(states))
        if self.continuous:
            actions = torch.FloatTensor(np.array(actions))
        else:
            actions = torch.LongTensor(actions)
        old_log_probs = torch.stack(old_log_probs)

        # Compute advantages and returns (no gradient needed)
        with torch.no_grad():
            values = self.value(states).squeeze().numpy()
            next_value = self.value(torch.FloatTensor(next_state)).item()
            advantages = self.compute_gae(rewards, values, dones, next_value)
            returns = advantages + torch.FloatTensor(values)

        # Normalize advantages (important! reduces variance)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Record losses for monitoring
        policy_losses = []
        value_losses = []
        entropies = []

        # Multiple optimization epochs
        for _ in range(self.k_epochs):
            # Evaluate current policy
            log_probs, value_pred, entropy = self.evaluate(states, actions)

            # Compute probability ratio
            ratios = torch.exp(log_probs - old_log_probs)

            # PPO-Clip objective
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value loss (can also use a clipped version)
            value_loss = ((value_pred - returns) ** 2).mean()

            # Total loss = policy loss + c1 * value loss - c2 * entropy
            loss = policy_loss + self.c1 * value_loss - self.c2 * entropy

            # Gradient update
            self.optimizer.zero_grad()
            loss.backward()
            # Gradient clipping (optional but recommended)
            if self.continuous:
                nn.utils.clip_grad_norm_(
                    list(self.policy_mean.parameters()) + [self.policy_log_std] +
                    list(self.value.parameters()),
                    0.5
                )
            else:
                nn.utils.clip_grad_norm_(
                    list(self.policy.parameters()) + list(self.value.parameters()),
                    0.5
                )
            self.optimizer.step()

            policy_losses.append(policy_loss.item())
            value_losses.append(value_loss.item())
            entropies.append(entropy.item())

        return {
            'policy_loss': np.mean(policy_losses),
            'value_loss': np.mean(value_losses),
            'entropy': np.mean(entropies)
        }


def train_ppo(env_name='CartPole-v1', num_episodes=500,
              update_freq=2048, render=False):
    """
    PPO Training Loop

    Uses the classic Gym API (reset() returns obs, step() returns a 4-tuple);
    adapt the reset/step unpacking if running on gymnasium.
    """
    import gym
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]

    # Check action space type
    if isinstance(env.action_space, gym.spaces.Discrete):
        action_dim = env.action_space.n
        continuous = False
    else:
        action_dim = env.action_space.shape[0]
        continuous = True

    agent = PPOAgent(state_dim, action_dim, continuous=continuous)

    # Data buffer
    states, actions, log_probs, rewards, dones = [], [], [], [], []

    episode_rewards = []
    total_steps = 0

    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        done = False

        while not done:
            if render:
                env.render()

            # Sample action
            action, log_prob = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)

            # Store experience
            states.append(state)
            actions.append(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            dones.append(done)

            state = next_state
            episode_reward += reward
            total_steps += 1

            # Update when reaching update frequency
            if total_steps % update_freq == 0 and len(states) > 0:
                info = agent.update(states, actions, log_probs, rewards, dones, state)
                # Clear buffer
                states, actions, log_probs, rewards, dones = [], [], [], [], []

        episode_rewards.append(episode_reward)

        if episode % 10 == 0:
            avg_reward = np.mean(episode_rewards[-10:]) if len(episode_rewards) >= 10 else np.mean(episode_rewards)
            print(f"Episode {episode}, Avg Reward (last 10): {avg_reward:.2f}")

    env.close()
    return agent, episode_rewards

Key Techniques for PPO

Generalized Advantage Estimation (GAE)

GAE is a technique for computing the advantage function, balancing bias and variance through the parameter $\lambda$:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

Special cases:

  • $\lambda = 0$: $\hat{A}_t = \delta_t$ (single-step TD, low variance, high bias)
  • $\lambda = 1$: $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = G_t - V(s_t)$ (Monte Carlo, high variance, low bias)

In practice, $\lambda = 0.95$ is commonly used as a good balance.
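Both special cases can be verified against the recursive implementation (same scheme as `compute_gae` above, without episode-end masking; the reward and value numbers are arbitrary):

```python
import numpy as np

def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    # Recursive GAE, matching the compute_gae recursion above
    advantages = []
    g = 0.0
    vals = list(values) + [next_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * vals[t + 1] - vals[t]
        g = delta + gamma * lam * g
        advantages.insert(0, g)
    return np.array(advantages)

gamma = 0.99
rewards = [1.0, 0.5, 2.0]
values = [0.3, 0.1, 0.7]
next_value = 0.2
vals = values + [next_value]

# lambda = 0 reduces to one-step TD errors
adv0 = gae(rewards, values, next_value, gamma, lam=0.0)
deltas = np.array([rewards[t] + gamma * vals[t + 1] - vals[t] for t in range(3)])

# lambda = 1 reduces to the bootstrapped discounted return minus the baseline
adv1 = gae(rewards, values, next_value, gamma, lam=1.0)
mc = np.array([sum(gamma ** (k - t) * rewards[k] for k in range(t, 3))
               + gamma ** (3 - t) * next_value - values[t]
               for t in range(3)])

print(np.allclose(adv0, deltas), np.allclose(adv1, mc))
```

The $\lambda = 1$ equivalence follows from telescoping the sum of discounted TD errors.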

Value Function Clipping

To further stabilize training, clipping can be applied to the value function as well:

$$L^V = \mathbb{E}_t\left[\max\left(\big(V_\theta(s_t) - \hat{R}_t\big)^2,\; \big(\mathrm{clip}\big(V_\theta(s_t),\, V_{old}(s_t) - \epsilon,\, V_{old}(s_t) + \epsilon\big) - \hat{R}_t\big)^2\right)\right]$$

This prevents value function updates from being too aggressive.
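A sketch of this loss in PyTorch (variable names are illustrative; `values_old` would be the value predictions recorded at sampling time):

```python
import torch

def clipped_value_loss(values, values_old, returns, eps=0.2):
    # Take the max of clipped and unclipped squared errors, so a value update
    # that moves more than eps past the old prediction gains nothing
    values_clipped = values_old + torch.clamp(values - values_old, -eps, eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()

values = torch.tensor([1.0, 2.0])       # current predictions (illustrative)
values_old = torch.tensor([0.5, 2.1])   # predictions at sampling time
returns = torch.tensor([1.2, 1.8])      # GAE returns
loss = clipped_value_loss(values, values_old, returns)
print(loss)
```

In the first element the new prediction moved 0.5 past the old one, so the clipped branch dominates and penalizes the overshoot; in the second the move is within the range and the ordinary squared error applies.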

Learning Rate Annealing

Lowering the learning rate late in training can improve final performance:

def linear_schedule(initial_lr, final_lr, current_step, total_steps):
    """Linear annealing"""
    progress = current_step / total_steps
    return initial_lr + (final_lr - initial_lr) * progress

# Cosine annealing is another common choice
def cosine_schedule(initial_lr, final_lr, current_step, total_steps):
    """Cosine annealing"""
    progress = current_step / total_steps
    return final_lr + 0.5 * (initial_lr - final_lr) * (1 + np.cos(np.pi * progress))

Parallel Environment Sampling

Using multiple parallel environments can greatly improve sampling efficiency and data diversity:

class ParallelEnvs:
    """Parallel environments wrapper (synchronous, single-process)"""
    def __init__(self, env_name, n_envs=8):
        import gym
        self.envs = [gym.make(env_name) for _ in range(n_envs)]
        self.n_envs = n_envs

    def reset(self):
        return np.array([env.reset() for env in self.envs])

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        states, rewards, dones, infos = zip(*results)

        # Auto-reset completed environments
        states = list(states)
        for i, done in enumerate(dones):
            if done:
                states[i] = self.envs[i].reset()

        return np.array(states), np.array(rewards), np.array(dones), infos

Deep Comparison: TRPO vs PPO

Theoretical Guarantees

| Aspect | TRPO | PPO-Clip |
| --- | --- | --- |
| Policy improvement guarantee | Strict theoretical lower bound | No strict theoretical guarantee |
| KL constraint | Hard constraint (ensured via line search) | Soft constraint (clipping is heuristic) |
| Update stability | Theoretical guarantee against degradation | Empirically stable, but exceptions possible |
| Convergence | Has convergence analysis | No strict convergence proof |

Implementation Complexity

| Aspect | TRPO | PPO-Clip |
| --- | --- | --- |
| Optimization method | Second-order (natural gradient) | First-order (SGD/Adam) |
| Requires Fisher matrix | Yes | No |
| Requires conjugate gradient | Yes | No |
| Requires line search | Yes | No |
| Lines of code (core) | ~300 | ~100 |
| Tuning complexity | Medium ($\delta$ choice) | Low ($\epsilon$ not sensitive) |

Computational Efficiency

| Aspect | TRPO | PPO-Clip |
| --- | --- | --- |
| Backward passes per update | ~20 (CG + line search) | 1 per epoch |
| Memory usage | Higher (Hessian-vector product intermediates) | Lower |
| Parallelization difficulty | Higher | Lower |
| GPU utilization | Medium | High |

Practical Performance

On most standard RL benchmarks, PPO matches or sometimes exceeds TRPO performance. This may be because:

  1. Multiple update epochs: PPO can perform multiple update epochs on the same batch (usually around 10), while TRPO typically takes a single step per batch. This lets PPO utilize data more fully.

  2. Better exploration: PPO's entropy regularization encourages exploration, while TRPO has no explicit exploration mechanism.

  3. More stable value function: PPO typically trains policy and value networks together, sharing underlying representations, which helps learn better features.

  4. Implementation details: PPO's simplicity makes implementation easier to get right, reducing bug probability.
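The multi-epoch data reuse described in point 1 can be sketched as a shuffled minibatch loop. This is a minimal illustration; `ppo_epochs_sketch`, `update_fn`, and the batch layout are my own names, not from any specific library:

```python
import numpy as np

def ppo_epochs_sketch(update_fn, batch, n_epochs=10, minibatch_size=64, rng=None):
    """Run several PPO update epochs over one collected batch.

    update_fn(minibatch) performs one clipped-objective gradient step;
    batch is a dict of equally sized arrays (obs, actions, advantages, ...).
    """
    rng = rng or np.random.default_rng(0)
    n = len(next(iter(batch.values())))
    for _ in range(n_epochs):
        perm = rng.permutation(n)  # reshuffle the batch each epoch
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            update_fn({k: v[idx] for k, v in batch.items()})
```

TRPO, by contrast, would consume the same batch in a single natural-gradient step.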

PPO in RLHF

PPO is the core algorithm for large language model alignment (RLHF) today. Let's understand its application in this context in detail.

RLHF Pipeline

RLHF (Reinforcement Learning from Human Feedback) aims to make language models generate responses that better align with human preferences. The pipeline is:

  1. Supervised Fine-Tuning (SFT): Fine-tune pretrained model on high-quality dialogue data
  2. Train Reward Model (RM): Collect human preference data, train a model to predict which response humans prefer
  3. PPO Fine-Tuning: Use reward model outputs as reward signal, fine-tune language model with PPO

PPO Objective in RLHF

In RLHF, PPO's objective function has some special features:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$

where:
- $x$ is the user input (prompt)
- $y$ is the model-generated response
- $r_\phi(x, y)$ is the reward model's score for the response
- $\pi_{\text{ref}}$ is the reference model (usually the SFT model)
- $\beta$ is the KL penalty coefficient

Why KL penalty?

Without the KL penalty, the model might find "dark magic" to fool the reward model: generating responses that the reward model scores highly but that are actually low quality. The KL penalty keeps the model from deviating too far from the SFT model, preserving the reasonableness of its generations.

Implementation Details

import torch

class RLHFPPOTrainer:
    def __init__(self, policy_model, ref_model, reward_model,
                 kl_coef=0.1, clip_range=0.2):
        """
        RLHF PPO Trainer

        Args:
            policy_model: Policy model to train (language model)
            ref_model: Reference model (frozen SFT model)
            reward_model: Reward model
            kl_coef: KL penalty coefficient
            clip_range: PPO clipping range
        """
        self.policy = policy_model
        self.ref = ref_model
        self.reward = reward_model
        self.kl_coef = kl_coef
        self.clip_range = clip_range

        # Freeze reference model
        for param in self.ref.parameters():
            param.requires_grad = False

    def compute_rewards(self, prompts, responses):
        """
        Compute rewards (including KL penalty)

        Returns:
            rewards: Final reward = reward model score - KL penalty
        """
        # Reward model scores
        rm_scores = self.reward(prompts, responses)

        # Compute KL divergence (no gradients flow through the reward)
        with torch.no_grad():
            ref_logprobs = self.ref.get_log_probs(prompts, responses)
            policy_logprobs = self.policy.get_log_probs(prompts, responses)

        kl = policy_logprobs - ref_logprobs           # Per-token KL
        kl_penalty = self.kl_coef * kl.sum(dim=1)     # Whole-sequence KL

        rewards = rm_scores - kl_penalty
        return rewards, rm_scores, kl_penalty

    def ppo_step(self, prompts, old_responses, old_logprobs, advantages):
        """PPO update step"""
        # Get log probs under new policy
        new_logprobs = self.policy.get_log_probs(prompts, old_responses)

        # Compute probability ratio
        ratio = torch.exp(new_logprobs - old_logprobs)

        # PPO-Clip objective
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - self.clip_range,
                            1 + self.clip_range) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()

        return policy_loss

Challenges of PPO in RLHF

  1. Reward Hacking: Model might find ways to fool reward model
  2. KL Divergence Explosion: If KL penalty too weak, model may rapidly deviate from reference
  3. Training Instability: Language model output space is huge, optimization landscape complex
  4. Computational Cost: Need to run multiple large models simultaneously

Recent Advances: DPO and Alternatives

Due to the complexity of RLHF with PPO, some simpler alternatives have emerged:

  • DPO (Direct Preference Optimization): Directly optimize from preference data, skipping reward model and PPO
  • IPO (Identity Preference Optimization): Improved version of DPO
  • KTO (Kahneman-Tversky Optimization): Optimization based on prospect theory
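To make the contrast with PPO concrete, here is a rough sketch of the DPO loss from Rafailov et al. (2023): it pushes the policy to prefer the chosen response over the rejected one, measured relative to the frozen reference model, with no reward model or sampling loop. The function name and NumPy formulation are my own:

```python
import numpy as np

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), averaged over the batch.

    The margin compares how much more the policy (vs. the reference)
    favors the chosen response over the rejected one.
    """
    margin = (policy_chosen_logps - ref_chosen_logps) \
           - (policy_rejected_logps - ref_rejected_logps)
    # log(1 + exp(-x)) written in a numerically stable form
    return np.mean(np.logaddexp(0.0, -beta * margin))
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; training decreases it by widening the margin.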

But PPO remains the preferred choice in many scenarios, especially when fine-grained control or online learning is needed.

Practical Advice and Tuning Tips

Hyperparameter Selection Guide

| Hyperparameter | Typical Range | Notes |
| --- | --- | --- |
| Learning rate | 1e-4 ~ 3e-4 | Too large: unstable; too small: slow |
| Clip parameter $\epsilon$ | 0.1 ~ 0.3 | 0.2 is most common |
| GAE $\lambda$ | 0.9 ~ 0.99 | 0.95 usually works well |
| Discount factor $\gamma$ | 0.99 ~ 0.999 | The longer the task, the closer to 1 |
| Batch size | 2048 ~ 8192 | Larger is more stable, but updates are slower |
| Epochs per batch | 3 ~ 10 | Too many may overfit the current data |
| Entropy coefficient | 0.0 ~ 0.01 | Encourages exploration |
| Value loss coefficient | 0.5 ~ 1.0 | |
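As a starting point, the typical values above can be gathered into a single config. The key names below are illustrative, not from any specific library:

```python
# Common PPO starting values from the table above; tune per task.
PPO_DEFAULTS = {
    "learning_rate": 3e-4,
    "clip_range": 0.2,        # epsilon
    "gae_lambda": 0.95,
    "gamma": 0.99,
    "batch_size": 2048,
    "n_epochs": 10,
    "entropy_coef": 0.01,
    "value_loss_coef": 0.5,
}
```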

Common Problems and Solutions

Problem 1: Performance not improving or oscillating

Possible causes:

  • Learning rate too large: reduce it
  • Batch too small: increase batch size
  • Advantage estimate inaccurate: check your GAE implementation
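As a reference point for checking a GAE implementation, here is a minimal sketch (the function name and rollout layout are assumptions, not from any specific library):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards, values, dones are arrays of length T; last_value bootstraps
    the state after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]  # cut bootstrapping and the trace at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # regression targets for the value function
    return advantages, returns
```

A quick sanity check: with $\gamma = \lambda = 1$ and zero values, the advantages reduce to reversed cumulative sums of the rewards.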

Problem 2: Policy collapses to deterministic

Possible causes:

  • Entropy coefficient too small: increase it
  • Clip parameter too small: increase it

Problem 3: Value function predictions are poor

Possible causes:

  • Value network has insufficient capacity: increase hidden layer size
  • Return scale too large: normalize returns
  • Learning rate mismatch: tune the value network learning rate separately

Debugging Tips

  1. Monitor KL divergence: If too large (>0.1), updates too aggressive

  2. Monitor probability ratios: Most should lie in $[1-\epsilon,\, 1+\epsilon]$ (e.g., $[0.8, 1.2]$ for $\epsilon = 0.2$)

  3. Monitor entropy: Should gradually decrease, but not to 0

  4. Visualize advantage distribution: Should be roughly symmetric, mean near 0
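Tips 1 and 2 can be computed directly from the stored log-probabilities. A minimal sketch (the function name is illustrative; the KL estimator is the simple first-order one):

```python
import numpy as np

def ppo_diagnostics(old_logprobs, new_logprobs, clip_range=0.2):
    """Quick health metrics for one PPO update."""
    ratio = np.exp(new_logprobs - old_logprobs)
    # Simple estimate of KL(old || new); should stay small
    approx_kl = float(np.mean(old_logprobs - new_logprobs))
    # Fraction of ratios that fell outside the clip interval
    clip_fraction = float(np.mean(np.abs(ratio - 1.0) > clip_range))
    return {"approx_kl": approx_kl, "clip_fraction": clip_fraction}
```

If `approx_kl` repeatedly exceeds ~0.1 or `clip_fraction` stays very high, the updates are too aggressive: lower the learning rate or the number of epochs.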

Summary

The core idea of trust region methods is constraining policy update magnitude to ensure stable learning:

  1. TRPO provides strict theoretical guarantees through KL divergence constraints and natural gradients, but is complex to implement and computationally expensive

  2. PPO achieves similar effects through simple clipping mechanism, becoming the de facto standard algorithm

  3. Key differences:

    • TRPO: Second-order optimization, hard constraint, theoretical guarantee
    • PPO: First-order optimization, soft constraint, practically effective
  4. PPO's success secrets:

    • Simple implementation, easy to debug
    • Can reuse data multiple times
    • Seamlessly integrates with other techniques (GAE, parallel sampling)
    • Thrives in new domains like RLHF

From an algorithm development perspective, PPO represents an important engineering philosophy: simple and effective approximations are often more valuable than complex exact methods. In the deep learning era, this pragmatic approach is particularly important — we pursue not mathematical perfection, but practical effectiveness.

In the next chapter, we'll explore another important direction —Imitation Learning and Inverse Reinforcement Learning: how do agents learn when we have expert demonstrations but no explicit reward function?

References

  1. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML.
  2. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv.
  3. Kakade, S., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. ICML.
  4. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
  5. Achiam, J. (2018). Spinning Up in Deep RL. OpenAI Documentation.
  6. Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation.
  7. Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.

Q&A: Frequently Asked Questions

Q1: Which is better for beginners, PPO or TRPO?

A: PPO. It's simpler to implement, easier to tune, and performance is stable. I recommend mastering PPO first, then learning TRPO's theory.

Q2: How to choose clip parameter?

A: 0.2 is the most commonly used value and suits most tasks. If training is unstable, lower it to 0.1; if learning is too slow, try 0.3.

Q3: Why does PPO need multiple update epochs?

A: Because PPO's clipping mechanism limits the update magnitude in each epoch, multiple epochs let it use the data more fully within the clipping constraint. But don't run too many epochs, or you may overfit the current batch.

Q4: What's the difference between GAE's $\lambda$ and the discount factor $\gamma$?

A: $\gamma$ controls how much we value future rewards (task level); $\lambda$ controls the decay of TD errors in advantage estimation (estimation level). Both lie between 0 and 1, but they have different effects.

Q5: Can PPO be used for continuous action spaces?

A: Yes. Parameterize actions with Gaussian distribution, learn mean and variance. Be careful about log probability computation in implementation.
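A minimal sketch of the diagonal-Gaussian log-probability involved (this is equivalent to what `torch.distributions.Normal(...).log_prob(...).sum(-1)` computes; written in NumPy here for self-containment):

```python
import numpy as np

def gaussian_log_prob(actions, mean, log_std):
    """Log probability of actions under a diagonal Gaussian policy.

    Summing over the action dimension gives the joint log-prob used
    in the PPO ratio for continuous control.
    """
    std = np.exp(log_std)
    per_dim = (-0.5 * ((actions - mean) / std) ** 2
               - log_std - 0.5 * np.log(2 * np.pi))
    return per_dim.sum(axis=-1)
```

A common pitfall is forgetting the sum over action dimensions, which makes the probability ratio per-dimension instead of per-action.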

Q6: Why does RLHF use PPO instead of other RL algorithms?

A: PPO is stable and unlikely to cause the language model to "collapse" (generate gibberish). Also, PPO's clipping mechanism naturally complements the KL penalty.

Q7: How to tell if PPO training is normal?

A: Monitor these metrics:

  • Policy loss should decrease
  • Average KL divergence shouldn't be too large (<0.02 is usually good)
  • Clip fraction shouldn't be too high (<20%)
  • Average return should increase

Q8: What's the relationship between PPO and Actor-Critic?

A: PPO is a type of Actor-Critic. Actor is policy network, Critic is value network. PPO's specialty is its objective function design (clipping).

  • Post title: Reinforcement Learning (6): PPO and TRPO - Trust Region Policy Optimization
  • Post author: Chen Kai
  • Create time: 2024-09-03 14:30:00
  • Post link: https://www.chenk.top/reinforcement-learning-6-ppo-and-trpo/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.