In Chapter 3, we introduced the basic principles of policy gradient methods: computing gradients through trajectory sampling to directly optimize policy parameters. However, vanilla policy gradient has a fundamental problem: update instability. A single overly large policy update can cause dramatic performance collapse, and since the policy has already changed, recovery from errors becomes extremely difficult. It's like walking a tightrope at the edge of a cliff: one wrong step, and everything is lost.
Trust Region Methods address this with a core idea: limit the magnitude of each policy update to ensure the new policy doesn't deviate too far from the old one. TRPO (Trust Region Policy Optimization) achieves this through KL divergence constraints, while PPO (Proximal Policy Optimization) uses a simpler clipping mechanism for similar effects. Due to its simplicity and stability, PPO has become the most widely used reinforcement learning algorithm today — from OpenAI Five defeating professional Dota 2 players to ChatGPT's RLHF training pipeline, PPO is everywhere.
This chapter starts from the instability problem of policy gradients, dives deep into the theoretical foundations of trust region methods, explains the mathematical derivation of TRPO and the geometric intuition of natural gradients, then introduces how PPO approximates TRPO's effect with a simple clipping mechanism. We'll also explore practical techniques and PPO's application in RLHF.
The Instability Problem of Policy Gradients
Vanilla Policy Gradient Recap
Let's first review the core idea of policy gradient methods. Policy gradient methods directly parameterize the policy $\pi_\theta(a|s)$ and optimize the expected return $J(\theta)$ by gradient ascent:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

After introducing a baseline $b(s_t)$ (typically the value function $V(s_t)$), the return is replaced by an advantage estimate $\hat{A}_t$, which reduces variance without biasing the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
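In code, the baselined estimator is just a negative log-likelihood weighted by the advantage. A minimal PyTorch sketch (the function name is ours):

```python
import torch

def pg_loss(logprobs, returns, baselines):
    """REINFORCE with a baseline: weight each log-prob by the advantage.

    `logprobs` carries gradients; returns and baselines are treated as constants.
    """
    advantages = (returns - baselines).detach()
    # negative sign: optimizers minimize, but we want to ascend the objective
    return -(logprobs * advantages).mean()
```

Minimizing this loss with any standard optimizer performs one policy gradient step.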
Why Are Policy Updates Unstable?
Policy gradient looks elegant, but in practice it often runs into training instability. Let's understand this problem through a concrete example.
Example: Dramatic Changes in Action Probabilities
Suppose the current policy selects action $a_1$ with probability 0.1 and action $a_2$ with probability 0.9. By chance, a sampled trajectory containing $a_1$ receives a very high return.

What will policy gradient do? It will dramatically increase $\pi_\theta(a_1 \mid s)$, potentially in a single large step. Three problems follow:
1. **High Variance**: Since $a_1$ has low sampling probability (only 0.1), such "lucky" trajectories are very rare. Most of the time we sample $a_2$, and even if $a_2$'s return is mediocre, the gradient will point toward increasing $a_2$'s probability. This leads to extremely high variance in gradient estimates.
2. **Distribution Shift**: Once we update the policy, the new policy $\pi_{\theta'}$ has a different data distribution than $\pi_\theta$. But the data we used to compute the gradient came from the old policy! This means our gradient estimate may be completely inaccurate under the new policy.
3. **Performance Collapse**: A single large update might push the policy into a very poor region. For example, if we increase $a_1$'s probability from 0.1 to 0.8, but $a_1$ only received a high return due to luck, the new policy's performance will drop sharply. Worse, since the data distribution has changed, it becomes very difficult to recover from this error.
Mathematical Analysis: The Impact of Step Size
Consider a gradient update of the policy parameters:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k)$$

A Taylor expansion of the objective function around $\theta_k$ gives:

$$J(\theta_{k+1}) \approx J(\theta_k) + \alpha \|\nabla_\theta J(\theta_k)\|^2 + O(\alpha^2)$$

The first-order term always predicts improvement, but it is only trustworthy for small steps: once $\alpha$ is large, the $O(\alpha^2)$ terms, driven by the curvature of $J$, can dominate, and performance may actually decrease even though the gradient pointed "uphill."
Parameter Space vs Policy Space
A deeper problem is: Euclidean distance in parameter space doesn't reflect the true change in policy.
Consider two Gaussian policies that differ only in their variance parameter, say $\pi_1 = \mathcal{N}(0, 0.1^2)$ and $\pi_2 = \mathcal{N}(0, 0.01^2)$. Their parameter vectors differ by just 0.09 in standard deviation, so they are close in parameter space. But their behaviors are completely different: $\pi_2$ is nearly deterministic, while $\pi_1$ explores over a range ten times wider.
This tells us: we shouldn't constrain update step size in parameter space, but rather in policy distribution space. This is the core idea of trust region methods.
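We can quantify this with the closed-form KL divergence between univariate Gaussians (a standard formula; the specific numbers below are illustrative):

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) for 1-D Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

# Two policies whose std parameters differ by only 0.09 in Euclidean distance:
kl = gaussian_kl(0.0, 0.1, 0.0, 0.01)   # roughly 47 nats: an enormous change
```

A tiny step in parameter space can be a giant step in distribution space, which is exactly why the constraint must live in the latter.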
An Intuitive Analogy
Imagine you're hiking in unfamiliar mountains. Your goal is to reach the summit (maximize return).
Vanilla Policy Gradient is like this: you glance at the compass (gradient direction), then take a big step in that direction. Problems:
- You might step off a cliff
- You might fall into a pit
- Once you go wrong, it's hard to turn back
Trust Region Methods are more cautious: you ensure each step isn't too far, always staying within a "safe region." Even if the direction is slightly off, you won't end up in danger.
Importance Sampling and Policy Optimization
Importance Sampling: Reusing Old Data
In policy gradient, after each policy update the old data becomes "outdated": it was collected from the old policy, whose distribution differs from the new policy's. This leads to low sample efficiency: we need to constantly collect new data.
Importance Sampling provides a way to reuse old data. The basic idea: if we want to compute an expectation under distribution $p$ but can only sample from distribution $q$, we can reweight the samples:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$$

The ratio $p(x)/q(x)$ is called the importance weight.
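A quick numerical check of the identity (plain-Python sketch; the distributions are chosen for illustration): estimate the mean of $\mathcal{N}(1, 1)$ using samples drawn only from $\mathcal{N}(0, 1)$.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
# sample only from q = N(0, 1)
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]

# reweight by p/q to estimate E_{x~p}[x] with p = N(1, 1); true value is 1.0
weights = [normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0) for x in samples]
estimate = sum(w * x for w, x in zip(weights, samples)) / len(samples)
```

The estimate lands near 1.0 even though no sample came from the target distribution.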
Application to Policy Optimization
Suppose we have trajectory data collected by an old policy $\pi_{\theta_{\text{old}}}$. Using importance sampling, we can express the new policy's objective using old-policy samples:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{old}}}(\tau)} R(\tau)\right], \qquad \frac{\pi_\theta(\tau)}{\pi_{\theta_{\text{old}}}(\tau)} = \prod_{t} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

Note that the trajectory-level weight is a product over all timesteps, which can explode or vanish for long trajectories.
Surrogate Objective Function
Directly optimizing $J(\theta)$ with the trajectory-level importance weight is impractical: the product of per-step ratios has enormous variance for long trajectories.

A more practical approach is to define a surrogate objective function, considering only single-step importance weights:

$$L_{\theta_{\text{old}}}(\theta) = J(\theta_{\text{old}}) + \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}^{\pi_{\theta_{\text{old}}}}(s, a)\right]$$

Key properties of the surrogate objective:

1. **First-order approximation**: At $\theta = \theta_{\text{old}}$, $L_{\theta_{\text{old}}}$ and the true objective $J$ have the same gradient: $\nabla_\theta L_{\theta_{\text{old}}}(\theta)\big|_{\theta_{\text{old}}} = \nabla_\theta J(\theta)\big|_{\theta_{\text{old}}}$.
2. **Equal value**: At $\theta = \theta_{\text{old}}$, the ratio equals 1 and the expected advantage under the old policy is 0, so $L_{\theta_{\text{old}}}(\theta_{\text{old}}) = J(\theta_{\text{old}})$.

This means the surrogate objective is a good local approximation of the true objective! But the problem is: when $\theta$ moves far from $\theta_{\text{old}}$, the approximation degrades, since states are still drawn from the old policy's distribution and the importance weights grow unreliable.
Variance Problems with Importance Sampling
When the old and new policies differ significantly, the variance of the importance weight $w(x) = p(x)/q(x)$ increases dramatically. Consider the weight's expectation and variance under $q$:

$$\mathbb{E}_{x \sim q}[w(x)] = 1, \qquad \mathrm{Var}_{x \sim q}[w(x)] = \mathbb{E}_{x \sim q}\big[w(x)^2\big] - 1$$

The second moment $\mathbb{E}_q[w^2]$ grows rapidly as $p$ and $q$ drift apart (it equals 1 plus the $\chi^2$ divergence between them). When the policies differ significantly, this causes:
- Gradient estimate variance explosion
- A few high-weight samples dominate the entire gradient
- Extremely unstable training
This is why we need to limit the difference between old and new policies, which is exactly the core idea of trust region methods.
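The blow-up is easy to see numerically. For $q = \mathcal{N}(0,1)$ and $p = \mathcal{N}(\mu,1)$, the weight variance is $e^{\mu^2} - 1$, which explodes as the distributions separate (illustrative sketch):

```python
import math
import random

def weight_variance(mu, n=100_000, seed=0):
    """Empirical variance of w = p/q with q = N(0,1) and p = N(mu,1)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # w(x) = exp(mu*x - mu^2/2) is the likelihood ratio of the two Gaussians
    ws = [math.exp(mu * x - mu * mu / 2) for x in xs]
    mean = sum(ws) / n
    return sum((w - mean) ** 2 for w in ws) / n

var_close = weight_variance(0.1)   # policies nearly identical: variance ~ 0.01
var_far = weight_variance(2.0)     # policies far apart: variance in the tens
```

With $\mu = 2$, a handful of rare, huge weights dominates the estimate, which is exactly the instability described above.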
TRPO: Trust Region Policy Optimization
Theoretical Foundation: Policy Improvement Lower Bound
TRPO's theoretical foundation comes from an important result by Kakade and Langford (2002).
Define policy $\pi$'s expected discounted return $\eta(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t r_t\right]$ and its advantage function $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$.

**Theorem**: A new policy $\tilde{\pi}$'s performance can be written exactly in terms of the old policy's advantage:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_t \gamma^t A_\pi(s_t, a_t)\right]$$

**Proof sketch**: Expand $A_\pi(s_t, a_t) = \mathbb{E}[r_t + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)]$ and sum over the trajectory; the value terms telescope, leaving $\mathbb{E}_{\tilde{\pi}}\left[\sum_t \gamma^t r_t\right] - \mathbb{E}[V_\pi(s_0)] = \eta(\tilde{\pi}) - \eta(\pi)$.
From Exact Equality to Optimizable Approximation
The equation above looks beautiful, but has a problem: the right side's expectation is computed under $\tilde{\pi}$, the very policy we are trying to find. We cannot sample from it before we have it.

**Approximation 1: Use the old policy's state distribution.** Define the surrogate objective under the old policy's state distribution $\rho_\pi$:

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$

This $L_\pi$ matches $\eta$ to first order: $L_\pi(\pi) = \eta(\pi)$, and their gradients agree at $\tilde{\pi} = \pi$.

**Key question**: How far apart are $L_\pi(\tilde{\pi})$ and $\eta(\tilde{\pi})$ when $\tilde{\pi} \neq \pi$?
Conservative Lower Bound on Policy Improvement
Schulman et al. (2015) proved a key theorem:
**Theorem (Policy Improvement Lower Bound)**:

$$\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} \left(D_{TV}^{\max}(\pi, \tilde{\pi})\right)^2, \qquad \epsilon = \max_{s, a} |A_\pi(s, a)|$$

Since total variation distance relates to KL divergence, $D_{TV}(p \,\|\, q)^2 \leq D_{KL}(p \,\|\, q)$, the bound can be restated as:

$$\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - C \cdot D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4 \epsilon \gamma}{(1 - \gamma)^2}$$
Significance of this theorem:
It provides a conservative lower bound on the new policy's performance. As long as we:

1. Maximize the surrogate objective $L_{\theta_{\text{old}}}(\theta)$
2. Keep $D_{KL}^{\max}(\theta_{\text{old}}, \theta)$ small

we can guarantee that the new policy's true performance improves monotonically, or at least does not get worse.
TRPO's Optimization Problem
Based on the theory above, TRPO formulates policy optimization as a constrained optimization problem:

$$\max_\theta \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s, a)\right] \quad \text{subject to} \quad \bar{D}_{KL}(\theta_{\text{old}}, \theta) \leq \delta$$

Note: in practice we use the average KL divergence $\bar{D}_{KL} = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\right]$ rather than the theorem's max KL, because the max is hard to estimate from samples.
Natural Gradient: Geometry of Policy Space
To understand TRPO's optimization method, we need to first understand the concept of natural gradient.
Problem: What's wrong with Euclidean distance in parameter space?
Consider standard gradient descent: $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L(\theta_k)$. Each step can be viewed as solving

$$\min_{\Delta\theta} \; \nabla_\theta L^\top \Delta\theta \quad \text{subject to} \quad \|\Delta\theta\|_2 \leq \epsilon$$

that is, "move in the steepest direction within a small Euclidean ball in parameter space." As we saw above, Euclidean distance in parameter space can be a terrible measure of how much the policy actually changes.
The Idea of Natural Gradient
Natural gradient doesn't constrain the step in parameter space, but in probability distribution space. Specifically, we use KL divergence to measure policy change:

$$\min_{\Delta\theta} \; \nabla_\theta L^\top \Delta\theta \quad \text{subject to} \quad D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta + \Delta\theta}\big) \leq \epsilon$$

For small steps, the KL divergence is well approximated by a quadratic form, $D_{KL} \approx \frac{1}{2} \Delta\theta^\top F(\theta)\, \Delta\theta$, where $F(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta \, \nabla_\theta \log \pi_\theta^\top\right]$ is the Fisher information matrix.
Form of Natural Gradient
Using Lagrange multipliers to solve the constrained optimization problem above, we get the natural gradient update:

$$\theta_{k+1} = \theta_k + \alpha\, F(\theta_k)^{-1} \nabla_\theta J(\theta_k)$$
Intuitive understanding:
Ordinary gradient points in the "steepest" direction in parameter space. But if the parameter space coordinate system is poorly chosen (some directions affect policy greatly, others weakly), this direction may not be the fastest for improving performance.
Natural gradient "corrects" this through the Fisher matrix. It finds the steepest direction in policy distribution space, then maps it back to parameter space. It's like navigating on Earth: although latitude and longitude are orthogonal coordinates, at different latitudes, one degree of longitude corresponds to different actual distances. Natural gradient accounts for this "metric variation."
TRPO Implementation Details
TRPO uses Conjugate Gradient (CG) to efficiently solve the constrained optimization problem, avoiding explicit computation and storage of the Fisher matrix $F$: for a network with $n$ parameters, $F$ is $n \times n$, far too large to materialize.
Algorithm Steps:
1. **Collect data**: Sample trajectories using the current policy $\pi_{\theta_k}$
2. **Estimate advantages**: Compute the advantage $\hat{A}(s, a)$ for each state-action pair (usually with GAE)
3. **Compute policy gradient**: $g = \nabla_\theta L_{\theta_k}(\theta)\big|_{\theta = \theta_k}$
4. **Solve with conjugate gradient**: Find $x \approx F^{-1} g$ by solving $F x = g$
   - No need to explicitly compute $F$; CG only needs the matrix-vector product $F v$
   - $F v$ can be computed efficiently with one extra forward and backward pass
5. **Compute step size**: $\beta = \sqrt{\dfrac{2\delta}{x^\top F x}}$ (so the quadratic approximation of the KL constraint is exactly satisfied)
6. **Backtracking line search**: Find the largest step $\beta' \leq \beta$ that satisfies the true KL constraint and actually improves the surrogate objective
7. **Update parameters**: $\theta_{k+1} = \theta_k + \beta' x$
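Step 4's matrix-vector product can be implemented with a double backward pass through the KL (a sketch; the damping term is a standard numerical safeguard):

```python
import torch

def fisher_vector_product(kl, params, v, damping=0.1):
    """Compute (F + damping * I) v without ever forming F.

    At theta = theta_old, the Hessian of the KL w.r.t. the parameters
    equals the Fisher matrix, so a Hessian-vector product suffices.
    """
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = (flat_grad * v).sum()           # scalar: (dKL/dtheta) . v
    hv = torch.autograd.grad(grad_dot_v, params) # second backward: H v
    flat_hv = torch.cat([h.reshape(-1) for h in hv])
    return flat_hv + damping * v
```

The conjugate gradient solver then calls this function repeatedly instead of touching $F$ itself.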
Core of Conjugate Gradient
Conjugate gradient iteratively solves the linear system $F x = g$ using only matrix-vector products $F v$, without ever forming $F$. For well-conditioned systems it converges in far fewer iterations than the parameter dimension; TRPO typically uses around 10.
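A minimal PyTorch conjugate-gradient solver, taking only the matrix-vector product `mvp(v)` (names are ours):

```python
import torch

def conjugate_gradient(mvp, b, iters=10, tol=1e-10):
    """Approximately solve A x = b given only v -> A v, for symmetric PD A."""
    x = torch.zeros_like(b)
    r = b.clone()                 # residual b - A x (x starts at zero)
    p = r.clone()                 # current search direction
    rs_old = torch.dot(r, r)
    for _ in range(iters):
        Ap = mvp(p)
        alpha = rs_old / torch.dot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = torch.dot(r, r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p   # next conjugate direction
        rs_old = rs_new
    return x
```

TRPO calls this with `mvp` set to the Fisher-vector product and `b` set to the policy gradient.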
Pros and Cons of TRPO
Advantages:
1. **Theoretical guarantee**: Has a clear performance-improvement lower bound
2. **Stable updates**: The KL constraint ensures each update isn't too aggressive
3. **Sample efficiency**: Can reuse the same batch of data multiple times (within the KL constraint)

Disadvantages:
1. **Complex implementation**: Requires conjugate gradient and line search
2. **Computationally expensive**: Each update needs multiple backward passes
3. **Hard to tune**: The trust region size $\delta$ is sensitive and task-dependent
PPO: Proximal Policy Optimization
Motivation: From TRPO to PPO
TRPO is theoretically elegant but complex to implement and computationally expensive. Schulman et al. proposed PPO in 2017, with the goal: approximate TRPO's effect with first-order optimization while keeping implementation simple.
PPO's core idea: since we want to limit old-new policy difference, why not directly add this constraint in the objective function?
PPO-Clip: The Clipped Version
PPO's most popular variant is the clipped version, which limits policy updates by clipping the probability ratio $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\Big(r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t\Big)\right]$$

where $\epsilon$ is a small constant (typically 0.2).
Let's discuss by case:
**Case 1: $\hat{A}_t > 0$** (the action was better than average)

- If $r_t < 1$ (new policy decreased the probability): the objective is $r_t \hat{A}_t$, and the gradient encourages increasing $r_t$
- If $1 \leq r_t \leq 1 + \epsilon$: the objective is still $r_t \hat{A}_t$, and the gradient continues encouraging the increase
- If $r_t > 1 + \epsilon$: the objective is clipped to the constant $(1 + \epsilon)\hat{A}_t$, and the gradient is 0!
The last case is key: once the probability ratio exceeds $1 + \epsilon$, further increases provide no gradient, so the update stops pushing that action's probability up.
**Case 2: $\hat{A}_t < 0$** (the action was worse than average)

- If $r_t > 1$ (new policy increased the probability): the objective is $r_t \hat{A}_t$ (negative), and the gradient encourages decreasing $r_t$
- If $1 - \epsilon \leq r_t \leq 1$: the objective is still $r_t \hat{A}_t$, and the gradient continues encouraging the decrease
- If $r_t < 1 - \epsilon$: the objective is clipped to the constant $(1 - \epsilon)\hat{A}_t$, and the gradient is 0!
Similarly, once the probability ratio falls below $1 - \epsilon$, further decreases provide no gradient.
**Intuition Summary**: PPO-Clip's objective is "pessimistic": it takes the smaller of the two terms. This ensures:
- The policy won't change dramatically because of one high-advantage sample
- Even if sampling noise makes advantage estimates inaccurate, the policy won't fluctuate wildly
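The "gradient is 0" claims above are easy to verify with autograd (a toy check; the function name is ours):

```python
import math
import torch

def clip_objective_grad(ratio_value, advantage, eps=0.2):
    """Gradient of the PPO-Clip objective w.r.t. the log-ratio, at one sample."""
    log_ratio = torch.tensor(math.log(ratio_value), requires_grad=True)
    ratio = log_ratio.exp()
    obj = torch.min(ratio * advantage,
                    torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
    obj.backward()
    return log_ratio.grad.item()
```

With a positive advantage the gradient vanishes above $1 + \epsilon$; with a negative advantage it vanishes below $1 - \epsilon$; inside the band it is simply the ratio times the advantage.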
PPO-Penalty: The Penalty Version
Another PPO variant adds KL divergence as a penalty term:

$$L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\left[r_t(\theta)\, \hat{A}_t - \beta\, D_{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right]$$

**Adaptive KL Penalty**: After each update, compare the measured KL $d$ to a target $d_{\text{targ}}$: if $d > 1.5\, d_{\text{targ}}$, double $\beta$; if $d < d_{\text{targ}} / 1.5$, halve $\beta$. The choice of $\beta$ is thereby automated rather than hand-tuned.
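The doubling/halving heuristic from the PPO paper fits in a few lines:

```python
def adaptive_kl_penalty(kl, d_target=0.01, beta=1.0):
    """Adapt the KL penalty coefficient toward a target KL."""
    if kl > 1.5 * d_target:
        beta *= 2.0          # policy moved too far: penalize harder
    elif kl < d_target / 1.5:
        beta /= 2.0          # policy barely moved: relax the penalty
    return beta
```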
In practice, PPO-Clip is more commonly used than PPO-Penalty because it doesn't require tuning the penalty coefficient $\beta$ or a KL target.
Complete PPO Implementation
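A full trainer is too long to show here; its core, the clipped surrogate loss plus the multi-epoch update, looks like this (PyTorch sketch; names are ours, and `policy` is assumed to map observations to a `torch.distributions` object):

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective (to be minimized)."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def ppo_update(policy, optimizer, batch, epochs=10, clip_eps=0.2):
    """Run several epochs of SGD on one batch of on-policy data."""
    for _ in range(epochs):
        new_logprobs = policy(batch["obs"]).log_prob(batch["actions"])
        loss = ppo_clip_loss(new_logprobs, batch["old_logprobs"],
                             batch["advantages"], clip_eps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Real implementations add a value loss, an entropy bonus, and minibatching on top of this skeleton.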
Key Techniques for PPO
Generalized Advantage Estimation (GAE)
GAE is a technique for computing the advantage function, balancing bias and variance through a parameter $\lambda \in [0, 1]$:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Special cases:
- $\lambda = 0$: $\hat{A}_t = \delta_t$, the one-step TD residual (low variance, high bias)
- $\lambda = 1$: $\hat{A}_t = \sum_l \gamma^l r_{t+l} - V(s_t)$, the Monte Carlo advantage (high variance, low bias)

In practice, $\lambda \approx 0.95$ usually gives a good trade-off.
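The sum above satisfies the backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$, so GAE is a few lines of code (sketch; assumes `values` includes a bootstrap value for the final state):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion.

    `values` must have len(rewards) + 1 entries (bootstrap at the end).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```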
Value Function Clipping
To further stabilize training, clipping can be applied to the value function as well:

$$L^V = \mathbb{E}_t\left[\max\Big(\big(V_\theta(s_t) - R_t\big)^2,\; \big(\mathrm{clip}\big(V_\theta(s_t),\, V_{\text{old}}(s_t) - \epsilon,\, V_{\text{old}}(s_t) + \epsilon\big) - R_t\big)^2\Big)\right]$$
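In code (a sketch; `old_values` are the value predictions recorded at collection time):

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """PPO-style value loss: take the worse of the clipped and unclipped errors."""
    values_clipped = old_values + torch.clamp(values - old_values,
                                              -clip_eps, clip_eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()
```

Taking the max (rather than the min, as in the policy loss) keeps the value update pessimistic as well.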
Learning Rate Annealing
Lowering learning rate in later training can improve final performance:
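A minimal linear anneal:

```python
def linear_schedule(initial_lr, final_lr, current_step, total_steps):
    """Linearly anneal the learning rate from initial_lr to final_lr."""
    frac = min(current_step / total_steps, 1.0)   # training progress in [0, 1]
    return initial_lr + frac * (final_lr - initial_lr)
```

Call it each iteration and write the result into the optimizer's `lr` field.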
Parallel Environment Sampling
Using multiple parallel environments can greatly improve sampling efficiency and data diversity:
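A sequential sketch of the batched interface (real implementations run each env in its own subprocess, e.g. `gymnasium.vector`); the toy `CountEnv` is ours, purely for demonstration:

```python
class ParallelEnvs:
    """Step several environments with a single batched call."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        obs, rewards, dones = map(list, zip(*results))
        # auto-reset finished environments so the batch never stalls
        for i, done in enumerate(dones):
            if done:
                obs[i] = self.envs[i].reset()
        return obs, rewards, dones

class CountEnv:
    """Toy env: observation counts steps, episode ends after 2 steps."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 2
```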
Deep Comparison: TRPO vs PPO
Theoretical Guarantees
| Aspect | TRPO | PPO-Clip |
|---|---|---|
| Policy improvement guarantee | Has strict theoretical lower bound | No strict theoretical guarantee |
| KL constraint | Hard constraint (ensured via line search) | Soft constraint (clipping is heuristic) |
| Update stability | Theoretical guarantee won't degrade | Empirically stable, but exceptions possible |
| Convergence | Has convergence analysis | No strict convergence proof |
Implementation Complexity
| Aspect | TRPO | PPO-Clip |
|---|---|---|
| Optimization method | Second-order (natural gradient) | First-order (SGD/Adam) |
| Requires Fisher matrix | Yes | No |
| Requires conjugate gradient | Yes | No |
| Requires line search | Yes | No |
| Lines of code (core) | ~300 | ~100 |
| Tuning complexity | Medium ($\delta$ is sensitive) | Low ($\epsilon$ is robust) |
Computational Efficiency
| Aspect | TRPO | PPO-Clip |
|---|---|---|
| Backward passes per update | ~20 (CG + line search) | 1 |
| Memory usage | Higher (stores Hessian-vector product intermediates) | Lower |
| Parallelization difficulty | Higher | Lower |
| GPU utilization | Medium | High |
Practical Performance
On most standard RL benchmarks, PPO matches or sometimes exceeds TRPO performance. This may be because:
Multiple update epochs: PPO can perform multiple update epochs on same batch (usually 10), while TRPO typically updates once. This lets PPO more fully utilize data.
Better exploration: PPO's entropy regularization encourages exploration, while TRPO has no explicit exploration mechanism.
More stable value function: PPO typically trains policy and value networks together, sharing underlying representations, which helps learn better features.
Implementation details: PPO's simplicity makes implementation easier to get right, reducing bug probability.
PPO in RLHF
PPO is the core algorithm for large language model alignment (RLHF) today. Let's understand its application in this context in detail.
RLHF Pipeline
RLHF (Reinforcement Learning from Human Feedback) aims to make language models generate responses that better align with human preferences. The pipeline is:
- Supervised Fine-Tuning (SFT): Fine-tune pretrained model on high-quality dialogue data
- Train Reward Model (RM): Collect human preference data, train a model to predict which response humans prefer
- PPO Fine-Tuning: Use reward model outputs as reward signal, fine-tune language model with PPO
PPO Objective in RLHF
In RLHF, PPO's objective function has a special feature: the reward combines the reward model's score with a KL penalty against the frozen SFT (reference) model:

$$r_{\text{total}} = r_{\text{RM}}(x, y) - \beta\, D_{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)$$
Why KL penalty?
Without a KL penalty, the model might find "dark magic" to fool the reward model: generating responses that the reward model scores highly but that are actually low quality. The KL penalty ensures the model doesn't deviate too far from the SFT model, maintaining reasonable generation.
Implementation Details
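The reward actually fed to PPO is assembled per token. A common pattern in InstructGPT-style implementations (a sketch, not the exact production code): the KL penalty is applied at every token, and the reward model's scalar score is added at the final token.

```python
def rlhf_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Per-token PPO rewards: -kl_coef * KL estimate, RM score on the last token."""
    # per-token KL estimate: log pi_theta(token) - log pi_ref(token)
    rewards = [-kl_coef * (lp - lr)
               for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score   # sequence-level score lands on the final token
    return rewards
```

A trainer built around this needs four models in memory at once: the policy, the frozen reference, the reward model, and the value network, which is exactly the computational cost noted below.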
Challenges of PPO in RLHF
- Reward Hacking: Model might find ways to fool reward model
- KL Divergence Explosion: If KL penalty too weak, model may rapidly deviate from reference
- Training Instability: Language model output space is huge, optimization landscape complex
- Computational Cost: Need to run multiple large models simultaneously
Recent Advances: DPO and Alternatives
Due to RLHF + PPO complexity, some simpler alternatives have emerged:
- DPO (Direct Preference Optimization): Directly optimize from preference data, skipping reward model and PPO
- IPO (Identity Preference Optimization): Improved version of DPO
- KTO (Kahneman-Tversky Optimization): Optimization based on prospect theory
But PPO remains the choice in many scenarios, especially when fine-grained control or online learning is needed.
Practical Advice and Tuning Tips
Hyperparameter Selection Guide
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning rate | 1e-4 ~ 3e-4 | Too large unstable, too small slow |
| Clip parameter $\epsilon$ | 0.1 ~ 0.3 | 0.2 most common |
| GAE $\lambda$ | 0.9 ~ 0.99 | 0.95 usually works well |
| Discount factor $\gamma$ | 0.99 ~ 0.999 | Longer tasks call for values closer to 1 |
| Batch size | 2048 ~ 8192 | Larger more stable, slower updates |
| Epochs per batch | 3 ~ 10 | Too many may overfit current data |
| Entropy coefficient | 0.0 ~ 0.01 | Encourages exploration |
| Value loss coefficient | 0.5 ~ 1.0 | Mainly matters when actor and critic share a network |
Common Problems and Solutions
Problem 1: Performance not improving or oscillating
Possible causes: - Learning rate too large: reduce it - Batch too small: increase batch size - Advantage estimate inaccurate: check GAE implementation
Problem 2: Policy collapses to deterministic
Possible causes:
- Entropy coefficient too small: increase it
- Clip parameter too small: increase $\epsilon$

Problem 3: Value loss not decreasing

Possible causes:
- Value network has insufficient capacity: increase hidden layer size
- Return scale too large: normalize returns
- Learning rate mismatch: separately tune the value network's learning rate
Debugging Tips
Monitor KL divergence: If too large (>0.1), updates too aggressive
Monitor probability ratios: most should lie in $[1 - \epsilon, 1 + \epsilon]$ (e.g. $[0.8, 1.2]$ for $\epsilon = 0.2$)
Monitor entropy: Should gradually decrease, but not to 0
Visualize advantage distribution: Should be roughly symmetric, mean near 0
Summary
The core idea of trust region methods is constraining policy update magnitude to ensure stable learning:
TRPO provides strict theoretical guarantees through KL divergence constraints and natural gradients, but is complex to implement and computationally expensive
PPO achieves similar effects through simple clipping mechanism, becoming the de facto standard algorithm
Key differences:
- TRPO: Second-order optimization, hard constraint, theoretical guarantee
- PPO: First-order optimization, soft constraint, practically effective
PPO's success secrets:
- Simple implementation, easy to debug
- Can reuse data multiple times
- Seamlessly integrates with other techniques (GAE, parallel sampling)
- Thrives in new domains like RLHF
From an algorithm development perspective, PPO represents an important engineering philosophy: simple and effective approximations are often more valuable than complex exact methods. In the deep learning era, this pragmatic approach is particularly important — we pursue not mathematical perfection, but practical effectiveness.
In the next chapter, we'll explore another important direction: Imitation Learning and Inverse Reinforcement Learning. How do agents learn when we have expert demonstrations but no explicit reward function?
References
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv.
- Kakade, S., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. ICML.
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
- Achiam, J. (2018). Spinning Up in Deep RL. OpenAI Documentation.
- Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation.
- Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
Q&A: Frequently Asked Questions
Q1: Which is better for beginners, PPO or TRPO?
A: PPO. It's simpler to implement, easier to tune, and performance is stable. I recommend mastering PPO first, then learning TRPO's theory.
Q2: How to choose the clip parameter $\epsilon$?
A: 0.2 is the most commonly used value, suitable for most tasks. If training is unstable, can lower to 0.1; if learning is too slow, can try 0.3.
Q3: Why does PPO need multiple update epochs?
A: Because PPO's clipping mechanism limits update magnitude each epoch. Multiple epochs can more fully utilize data within clipping constraints. But not too many epochs, or may overfit current batch.
Q4: What's the difference between GAE's $\lambda$ and the discount factor $\gamma$?

A: $\gamma$ is part of the problem definition: it determines how much future rewards matter in the return itself. $\lambda$ is purely an estimation knob: it interpolates between the one-step TD estimate ($\lambda = 0$, low variance, high bias) and the Monte Carlo estimate ($\lambda = 1$, high variance, low bias). Changing $\gamma$ changes the objective; changing $\lambda$ only changes how we estimate advantages for it.
Q5: Can PPO be used for continuous action spaces?
A: Yes. Parameterize actions with Gaussian distribution, learn mean and variance. Be careful about log probability computation in implementation.
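The log-probability computation mentioned here, for a diagonal Gaussian policy (sketch; the function name is ours):

```python
import math
import torch
from torch.distributions import Normal

def gaussian_log_prob(mean, log_std, action):
    """Joint log-probability of an action vector under a diagonal Gaussian."""
    dist = Normal(mean, log_std.exp())
    # independent dimensions: sum the per-dimension log-probs
    return dist.log_prob(action).sum(-1)
```

Parameterizing the standard deviation through its log keeps it positive without extra constraints.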
Q6: Why does RLHF use PPO instead of other RL algorithms?
A: PPO has good stability, unlikely to cause language model "collapse" (generating gibberish). Also, PPO's clipping mechanism naturally works well with KL penalty.
Q7: How to tell if PPO training is normal?
A: Monitor these metrics: - Policy loss should decrease - Average KL divergence shouldn't be too large (<0.02 usually good) - Clip fraction shouldn't be too high (<20%) - Average return should increase
Q8: What's the relationship between PPO and Actor-Critic?
A: PPO is a type of Actor-Critic. Actor is policy network, Critic is value network. PPO's specialty is its objective function design (clipping).
- Post title: Reinforcement Learning (6): PPO and TRPO - Trust Region Policy Optimization
- Post author: Chen Kai
- Create time: 2024-09-03 14:30:00
- Post link: https://www.chenk.top/reinforcement-learning-6-ppo-and-trpo/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.