Reinforcement Learning (6): PPO and TRPO - Trust Region Policy Optimization
Chen Kai

In Chapter 3, we introduced the basic principles of policy gradient methods: computing gradients through trajectory sampling to directly optimize policy parameters. However, vanilla policy gradient has a fundamental problem — update instability. A single overly large policy update can cause dramatic performance collapse, and since the policy has already changed, recovery from errors becomes extremely difficult. It's like walking a tightrope at the edge of a cliff: one wrong step, and everything is lost.

Trust Region Methods address this with a core idea: limit the magnitude of each policy update to ensure the new policy doesn't deviate too far from the old one. TRPO (Trust Region Policy Optimization) achieves this through KL divergence constraints, while PPO (Proximal Policy Optimization) uses a simpler clipping mechanism for similar effects. Due to its simplicity and stability, PPO has become the most widely used reinforcement learning algorithm today — from OpenAI Five defeating professional Dota 2 players to ChatGPT's RLHF training pipeline, PPO is everywhere.

This chapter starts from the instability problem of policy gradients, dives deep into the theoretical foundations of trust region methods, explains the mathematical derivation of TRPO and the geometric intuition of natural gradients, then introduces how PPO approximates TRPO's effect with a simple clipping mechanism. We'll also explore practical techniques and PPO's application in RLHF.

The Instability Problem of Policy Gradients

Vanilla Policy Gradient Recap

Let's first review the core idea of policy gradient methods. Policy gradient methods directly parameterize the policy $\pi_\theta(a|s)$, aiming to maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

Through the REINFORCE algorithm, we obtain an unbiased estimate of the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted return from time $t$. The intuition behind this formula is: if an action sequence receives high return, increase the probability of taking those actions; otherwise decrease it.

After introducing a baseline, we typically use the advantage function $A^\pi(s_t, a_t)$ instead of $G_t$, which reduces variance while maintaining unbiasedness:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^\pi(s_t, a_t)\right]$$
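The estimator above can be checked numerically on a one-state bandit. A minimal sketch (the two-arm setup and return values are made up for illustration; the constant baseline is the mean return):

```python
import torch

# Two-armed bandit: policy is a softmax over two logits
logits = torch.zeros(2, requires_grad=True)
returns = torch.tensor([1.0, 0.0])   # arm 0 pays 1, arm 1 pays 0 (illustrative)
baseline = returns.mean()            # constant baseline reduces variance

# Monte Carlo estimate of the policy gradient: E[grad log pi(a) * (G - b)]
torch.manual_seed(0)
probs = torch.softmax(logits, dim=0)
actions = torch.multinomial(probs.detach(), num_samples=10000, replacement=True)
log_probs = torch.log(probs)[actions]
advantages = returns[actions] - baseline
loss = -(log_probs * advantages).mean()   # minimize the negative objective
loss.backward()

# The gradient pushes probability mass toward the high-return arm 0
print(logits.grad)
```

Running this shows a negative loss-gradient on arm 0's logit and a positive one on arm 1's, i.e. gradient descent increases the high-return arm's probability.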

Why Are Policy Updates Unstable?

Policy gradient looks elegant, but in practice often encounters training instability. Let's understand this problem through a concrete example.

Example: Dramatic Changes in Action Probabilities

Suppose the current policy selects action $a_1$ with probability 0.9 and action $a_2$ with probability 0.1 in some state. In one sampling run, we happen to sample action $a_2$, and this trajectory receives very high return.

What will policy gradient do? It will dramatically increase $\pi_\theta(a_2|s)$. But here are the problems:

  1. High Variance: Since $a_2$ has low sampling probability (only 0.1), such "lucky" trajectories are very rare. Most of the time we sample $a_1$, and even if $a_1$'s return is mediocre, the gradient will point toward increasing $a_1$'s probability. This leads to extremely high variance in gradient estimates.

  2. Distribution Shift: Once we update the policy, the new policy $\pi_{\theta'}$ has a different data distribution than $\pi_\theta$. But the data we used to compute the gradient came from the old policy! This means our gradient estimate may be completely inaccurate under the new policy.

  3. Performance Collapse: A single large update might push the policy into a very poor region. For example, if we increase $a_2$'s probability from 0.1 to 0.8, but $a_2$ only received high return due to luck, the new policy's performance will dramatically decrease. Worse, since the data distribution has changed, it becomes very difficult to recover from this error.

Mathematical Analysis: The Impact of Step Size

Consider a gradient update of the policy parameters: $\theta' = \theta + \alpha \nabla_\theta J(\theta)$.

Taylor expansion of the objective function:

$$J(\theta') \approx J(\theta) + \nabla_\theta J(\theta)^\top \Delta\theta + \frac{1}{2}\,\Delta\theta^\top H\, \Delta\theta$$

where $H$ is the Hessian matrix. Substituting $\Delta\theta = \alpha \nabla_\theta J(\theta)$:

$$J(\theta') \approx J(\theta) + \alpha \|\nabla_\theta J(\theta)\|^2 + \frac{\alpha^2}{2}\, \nabla_\theta J(\theta)^\top H\, \nabla_\theta J(\theta)$$

If the step size $\alpha$ is too large, the second-order term may become very large. If $H$ has negative eigenvalues (non-convex objective), this term may be negative, leading to $J(\theta') < J(\theta)$ — performance degradation!
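The effect of step size is easy to see on a toy non-convex objective with a narrow peak (the function here is made up purely for illustration): a small step climbs the peak, a large step overshoots it and lands in a flat region it is hard to recover from.

```python
import numpy as np

def J(theta):
    # Toy objective with a narrow peak around theta = 0
    return np.exp(-50 * theta ** 2)

def grad_J(theta):
    return -100 * theta * np.exp(-50 * theta ** 2)

theta = -0.1
small, large = 0.005, 0.5
theta_small = theta + small * grad_J(theta)   # cautious step: improves
theta_large = theta + large * grad_J(theta)   # huge step: overshoots the peak

print(J(theta), J(theta_small), J(theta_large))
```

The small step increases the objective; the large step lands far past the peak, where the objective is essentially zero and the gradient vanishes, so recovery is hard.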

Parameter Space vs Policy Space

A deeper problem is: Euclidean distance in parameter space doesn't reflect the true change in policy.

Consider two Gaussian policies:

  • $\pi_1 = \mathcal{N}(\mu, \sigma_1^2)$ with small variance $\sigma_1^2$
  • $\pi_2 = \mathcal{N}(\mu, \sigma_2^2)$ with large variance $\sigma_2^2$

These two policies differ in only one parameter (the variance), so they are close in parameter space. But their behaviors are completely different! $\pi_1$ almost deterministically selects actions near $\mu$, while $\pi_2$'s actions are nearly random.
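The mismatch is easy to quantify: for a 1-D Gaussian policy, the same parameter-space step on $\mu$ corresponds to wildly different KL divergences depending on $\sigma$. A sketch (the step size and $\sigma$ values are chosen for illustration):

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    # KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )
    return (np.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5)

# Same Euclidean step of 0.1 on mu, applied to a wide and a narrow policy
step = 0.1
kl_wide = kl_gauss(0.0, 10.0, step, 10.0)     # wide policy: distribution barely moves
kl_narrow = kl_gauss(0.0, 0.01, step, 0.01)   # narrow policy: distribution changes drastically

print(kl_wide, kl_narrow)
```

An identical parameter-space step produces a KL divergence about six orders of magnitude larger for the narrow policy, which is exactly why Euclidean step-size constraints are the wrong tool.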

This tells us: we shouldn't constrain update step size in parameter space, but rather in policy distribution space. This is the core idea of trust region methods.

An Intuitive Analogy

Imagine you're hiking in unfamiliar mountains. Your goal is to reach the summit (maximize return).

Vanilla Policy Gradient is like this: you glance at the compass (gradient direction), then take a big step in that direction. Problems:

  • You might step off a cliff
  • You might fall into a pit
  • Once you go wrong, it's hard to turn back

Trust Region Methods are more cautious: you ensure each step isn't too far, always staying within a "safe region." Even if the direction is slightly off, you won't end up in danger.

Importance Sampling and Policy Optimization

Importance Sampling: Reusing Old Data

In policy gradient, after each policy update, old data becomes "outdated" — it was collected from the old policy, with a different distribution than the new policy. This leads to low sample efficiency: we need to constantly collect new data.

Importance Sampling provides a way to reuse old data. The basic idea: if we want to compute an expectation under distribution $p$, but only have samples from distribution $q$, we can correct through weighting:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)}\, f(x)\right]$$

The weight $w(x) = p(x)/q(x)$ is called the importance weight, correcting for distribution differences.
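The correction can be verified with a quick Monte Carlo check — estimating an expectation under $p$ using only samples drawn from $q$ (distributions and target chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[x^2] for p = N(1, 1), using samples from q = N(0, 1)
samples = rng.normal(0.0, 1.0, size=200_000)    # x ~ q

def log_pdf(x, mu):
    # log density of N(mu, 1), up to a constant that cancels in the ratio
    return -0.5 * (x - mu) ** 2

weights = np.exp(log_pdf(samples, 1.0) - log_pdf(samples, 0.0))
estimate = np.mean(weights * samples ** 2)

print(estimate)   # true value: E_p[x^2] = mu^2 + sigma^2 = 2
```

The weighted average over $q$-samples recovers the $p$-expectation, even though no sample was ever drawn from $p$.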

Application to Policy Optimization

Suppose we have trajectory data collected by the old policy $\pi_{\theta_{old}}$, and now want to evaluate the new policy $\pi_\theta$'s performance. Applying importance sampling:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}}\left[\frac{P_\theta(\tau)}{P_{\theta_{old}}(\tau)}\, R(\tau)\right]$$

where the trajectory's probability ratio is:

$$\frac{P_\theta(\tau)}{P_{\theta_{old}}(\tau)} = \frac{\rho(s_0) \prod_t \pi_\theta(a_t|s_t)\, P(s_{t+1}|s_t, a_t)}{\rho(s_0) \prod_t \pi_{\theta_{old}}(a_t|s_t)\, P(s_{t+1}|s_t, a_t)}$$

Notice that the initial state distribution $\rho(s_0)$ and transition probabilities $P(s_{t+1}|s_t, a_t)$ cancel out:

$$\frac{P_\theta(\tau)}{P_{\theta_{old}}(\tau)} = \prod_t \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

This is a key result: the importance weight of a trajectory only depends on policy action probability ratios, independent of environment dynamics!

Surrogate Objective Function

Directly optimizing $J(\theta)$ in this form still has problems: the trajectory importance weight is a product of all timestep weights, and when the horizon $T$ is large, this product's variance grows exponentially.

A more practical approach is to define a surrogate objective function, considering only single-step importance weights:

$$L(\theta) = \mathbb{E}_{s, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, A^{\pi_{\theta_{old}}}(s, a)\right]$$

where $A^{\pi_{\theta_{old}}}$ is the old policy's advantage function. We typically abbreviate this as:

$$L(\theta) = \mathbb{E}_t\left[r_t(\theta)\, \hat{A}_t\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio.

Key properties of the surrogate objective:

  1. First-order approximation: At $\theta = \theta_{old}$, $L(\theta)$ and the true objective $J(\theta)$ have the same gradient: $\nabla_\theta L(\theta)\big|_{\theta = \theta_{old}} = \nabla_\theta J(\theta)\big|_{\theta = \theta_{old}}$

  2. Equal value: At $\theta = \theta_{old}$, $r_t(\theta_{old}) = 1$, so $L(\theta_{old}) = \mathbb{E}_t[\hat{A}_t] = 0$ — the surrogate measures improvement over the old policy, which is zero for the old policy itself.

This means the surrogate objective is a good local approximation of the true objective! But the problem is: when $\theta$ is far from $\theta_{old}$, this approximation may completely fail.

Variance Problems with Importance Sampling

When old and new policies differ significantly, importance weight variance increases dramatically. Consider the weight's expectation and variance under the old policy:

$$\mathbb{E}_{a \sim \pi_{old}}\left[\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}\right] = \sum_a \pi_{old}(a|s)\, \frac{\pi_\theta(a|s)}{\pi_{old}(a|s)} = \sum_a \pi_\theta(a|s) = 1$$

The expectation is always 1, good. But the variance?

$$\mathrm{Var}\left[\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}\right] = \mathbb{E}_{a \sim \pi_{old}}\left[\left(\frac{\pi_\theta(a|s)}{\pi_{old}(a|s)}\right)^2\right] - 1$$

If some action $a$ has high probability under the new policy ($\pi_\theta(a|s) = 0.9$) but low probability under the old policy ($\pi_{old}(a|s) = 0.1$), the weight is 9 and the weight squared is 81! This causes:

  1. Gradient estimate variance explosion
  2. A few high-weight samples dominate the entire gradient
  3. Extremely unstable training

This is why we need to limit the difference between old and new policies — the core of trust region methods.
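The blow-up is easy to demonstrate exactly for a discrete action distribution: hold the old policy fixed and watch the importance-weight variance as the new policy drifts away (the probabilities below are illustrative).

```python
import numpy as np

def weight_stats(p_new, p_old):
    # Exact mean and variance of the importance weight pi_new / pi_old
    # under the old policy, for a discrete action distribution
    w = p_new / p_old
    mean = np.sum(p_old * w)            # always 1
    var = np.sum(p_old * w ** 2) - mean ** 2
    return mean, var

p_old = np.array([0.9, 0.1])
variances = []
for p2 in [0.1, 0.5, 0.9]:              # new policy's probability of action 2
    p_new = np.array([1 - p2, p2])
    mean, var = weight_stats(p_new, p_old)
    variances.append(var)
    print(p2, mean, var)
```

The mean stays at 1 throughout, but the variance grows from 0 (identical policies) to above 7 once the rare action's probability rises from 0.1 to 0.9 — exactly the 9-vs-81 effect described above.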

TRPO: Trust Region Policy Optimization

Theoretical Foundation: Policy Improvement Lower Bound

TRPO's theoretical foundation comes from an important result by Kakade and Langford (2002).

Define policy $\pi$'s expected return: $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$

Theorem: The new policy's expected return can be expressed as:

$$J(\pi') = J(\pi) + \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right]$$

where $A^\pi$ is the old policy's advantage function.

Proof sketch: using the relation $A^\pi(s_t, a_t) = \mathbb{E}\left[r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right]$ and a telescoping sum, we get:

$$\mathbb{E}_{\tau \sim \pi'}\left[\sum_t \gamma^t A^\pi(s_t, a_t)\right] = \mathbb{E}_{\tau \sim \pi'}\left[\sum_t \gamma^t r_t\right] - \mathbb{E}_{s_0}\left[V^\pi(s_0)\right] = J(\pi') - J(\pi)$$

This theorem tells us: the new policy's performance equals the old policy's performance plus an "advantage term". If we can make the new policy always select positive-advantage actions at each state, performance will improve!

From Exact Equality to Optimizable Approximation

The equation above looks beautiful, but has a problem: the expectation on the right side is computed under $\pi'$'s trajectory distribution, but we don't have $\pi'$ yet! This is a chicken-and-egg problem.

Approximation 1: Use old policy's state distribution

Define a surrogate objective under the old policy's state distribution:

$$L_\pi(\pi') = J(\pi) + \sum_s \rho_\pi(s) \sum_a \pi'(a|s)\, A^\pi(s, a)$$

where $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s)$ is the old policy's discounted state visitation frequency.

This $L_\pi(\pi')$ can be estimated using the old policy's data, applying importance sampling to convert the expectation over $a \sim \pi'$ to one over $a \sim \pi$:

$$L_\pi(\pi') = J(\pi) + \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}\, A^\pi(s, a)\right]$$

Key question: how far apart are $L_\pi(\pi')$ and the true objective $J(\pi')$?

Conservative Lower Bound on Policy Improvement

Schulman et al. (2015) proved a key theorem:

Theorem (Policy Improvement Lower Bound):

$$J(\pi') \geq L_\pi(\pi') - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, \alpha^2$$

where:

  • $\epsilon = \max_{s,a} |A^\pi(s, a)|$ is the maximum absolute advantage
  • $\alpha = \max_s D_{TV}\big(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\big)$ is the maximum total variation distance between the policies
  • $D_{TV}$ is total variation distance

Since total variation distance relates to KL divergence via $D_{TV}(p \,\|\, q)^2 \leq D_{KL}(p \,\|\, q)$, we get:

$$J(\pi') \geq L_\pi(\pi') - C \cdot D_{KL}^{\max}(\pi \,\|\, \pi')$$

where $C = \dfrac{4\epsilon\gamma}{(1-\gamma)^2}$.

Significance of this theorem:

It provides a conservative lower bound on the new policy's performance. As long as we:

  1. Maximize the surrogate objective $L_\pi(\pi')$
  2. Constrain $D_{KL}^{\max}(\pi \,\|\, \pi')$ to not be too large

we can guarantee that the new policy's performance won't be much worse than the old policy's. This is the theoretical foundation of trust region methods.

TRPO's Optimization Problem

Based on the above theory, TRPO formulates policy optimization as a constrained optimization problem:

$$\max_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, A^{\theta_{old}}(s, a)\right] \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$$

where $\delta$ is the trust region size, typically around 0.01.

Note: we use the average KL divergence $\bar{D}_{KL}(\theta_{old}, \theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\right]$ instead of the maximum KL divergence. This is a practical approximation — the maximum KL is hard to compute and optimize in practice.

Natural Gradient: Geometry of Policy Space

To understand TRPO's optimization method, we need to first understand the concept of natural gradient.

Problem: What's wrong with Euclidean distance in parameter space?

Consider standard gradient ascent: $\theta' = \theta + \alpha \nabla_\theta J(\theta)$. This is equivalent to solving:

$$\theta' = \arg\max_{\theta'} \; \nabla_\theta J(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad \|\theta' - \theta\|_2 \leq \epsilon$$

The constraint $\|\theta' - \theta\|_2 \leq \epsilon$ is a Euclidean distance constraint in parameter space. But as we discussed, parameter-space Euclidean distance doesn't reflect true policy change.

The Idea of Natural Gradient

Natural gradient doesn't constrain in parameter space, but in probability distribution space. Specifically, we use KL divergence to measure policy change:

$$\theta' = \arg\max_{\theta'} \; \nabla_\theta J(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta'}\big) \leq \epsilon$$

For small steps, the KL divergence is well approximated by its second-order Taylor expansion:

$$D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta'}\big) \approx \frac{1}{2}\, (\theta' - \theta)^\top F\, (\theta' - \theta)$$

where $F = \mathbb{E}_{s,a}\left[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top\right]$ is the Fisher information matrix. The Fisher matrix defines the Riemannian metric of policy space, telling us at each parameter point how sensitive the policy distribution is to parameter changes.

Form of Natural Gradient

Using Lagrange multipliers to solve the constrained optimization problem above, we get the natural gradient update:

$$\theta' = \theta + \alpha\, F^{-1} \nabla_\theta J(\theta)$$

where $\tilde{\nabla}_\theta J = F^{-1} \nabla_\theta J$ is the natural gradient.

Intuitive understanding:

Ordinary gradient points in the "steepest" direction in parameter space. But if the parameter space coordinate system is poorly chosen (some directions affect policy greatly, others weakly), this direction may not be the fastest for improving performance.

Natural gradient "corrects" this through the Fisher matrix. It finds the steepest direction in policy distribution space, then maps it back to parameter space. It's like navigating on Earth: although latitude and longitude are orthogonal coordinates, at different latitudes, one degree of longitude corresponds to different actual distances. Natural gradient accounts for this "metric variation."
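The rescaling can be shown analytically for a 1-D Gaussian policy $\mathcal{N}(\mu, \sigma^2)$ parameterized by $(\mu, \log\sigma)$, whose Fisher matrix is $\mathrm{diag}(1/\sigma^2,\, 2)$. A sketch (the vanilla gradient vector is made up for illustration):

```python
import numpy as np

def fisher_gaussian(sigma):
    # Fisher information of N(mu, sigma^2) w.r.t. parameters (mu, log sigma)
    return np.diag([1.0 / sigma ** 2, 2.0])

grad = np.array([1.0, 1.0])          # some vanilla gradient (illustrative)

# Natural gradient = F^{-1} grad, at a narrow and a wide policy
nat_narrow = np.linalg.solve(fisher_gaussian(0.1), grad)
nat_wide = np.linalg.solve(fisher_gaussian(10.0), grad)

print(nat_narrow, nat_wide)
```

For the narrow policy the $\mu$-component shrinks to $\sigma^2 = 0.01$ (a small $\mu$ shift already changes the distribution a lot), while for the wide policy it grows to $100$ (large $\mu$ steps are safe) — the same "metric correction" as in the latitude/longitude analogy.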

TRPO Implementation Details

TRPO uses Conjugate Gradient (CG) to efficiently solve the constrained optimization problem, avoiding explicit computation and storage of $F$ or its inverse.

Algorithm Steps:

  1. Collect data: Sample trajectories using current policy

  2. Estimate advantages: Compute the advantage $\hat{A}(s, a)$ for each state-action pair (usually with GAE)

  3. Compute policy gradient: $g = \nabla_\theta L(\theta)\big|_{\theta = \theta_{old}}$

  4. Solve $Fx = g$ with conjugate gradient: no need to explicitly form $F$, only the matrix-vector product $Fv$, which can be computed efficiently with one forward and one (double) backward pass

  5. Compute step size: $\beta = \sqrt{\dfrac{2\delta}{x^\top F x}}$ (so the KL constraint is exactly satisfied at the full step)

  6. Backtracking line search: Find maximum step that satisfies constraint and actually improves objective

  7. Update parameters: $\theta \leftarrow \theta + \beta x$

Core of Conjugate Gradient

Conjugate gradient solves $Fx = g$, needing only the ability to compute $Fv$. The Fisher matrix-vector product can be computed efficiently via double backpropagation:

$$Fv = \nabla_\theta\left(\big(\nabla_\theta \bar{D}_{KL}\big)^\top v\right)$$

This requires two backward passes, but doesn't need to store the entire $F$ matrix.

import torch
import torch.nn as nn
import numpy as np
from torch.distributions import Categorical, Normal

class TRPOAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=64, delta=0.01,
                 gamma=0.99, lam=0.95, continuous=False):
        self.continuous = continuous
        self.delta = delta
        self.gamma = gamma
        self.lam = lam

        # Policy network
        if continuous:
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim)
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )

        # Value network
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        self.value_optimizer = torch.optim.Adam(self.value.parameters(), lr=1e-3)

    def get_policy_params(self):
        """Get policy network parameters"""
        if self.continuous:
            return list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            return list(self.policy.parameters())

    def get_action_and_log_prob(self, state):
        """Get action and log probability"""
        state = torch.FloatTensor(state)
        if self.continuous:
            mean = self.policy_mean(state)
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum()
            return action.detach().numpy(), log_prob.detach()
        else:
            probs = self.policy(state)
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action).detach()

    def compute_advantages(self, rewards, values, dones):
        """Compute GAE advantage estimates"""
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        return torch.FloatTensor(advantages)

    def flat_grad(self, grads):
        """Flatten gradients to 1D vector"""
        return torch.cat([g.view(-1) for g in grads if g is not None])

    def flat_params(self):
        """Flatten parameters to 1D vector"""
        params = self.get_policy_params()
        return torch.cat([p.view(-1) for p in params])

    def set_flat_params(self, flat_params):
        """Restore parameters from 1D vector"""
        params = self.get_policy_params()
        idx = 0
        for p in params:
            p.data.copy_(flat_params[idx:idx + p.numel()].view(p.shape))
            idx += p.numel()

    def compute_kl(self, states, old_dist=None):
        """KL(old || new) averaged over states.

        If old_dist is None, a detached copy of the current policy plays the
        role of the old policy: the KL value is then zero, but its second
        derivative is the Fisher matrix, which is exactly what
        hessian_vector_product needs. During line search we must instead pass
        the old policy's distribution explicitly — otherwise the constraint
        check would compare the new policy with itself and trivially pass.
        """
        if self.continuous:
            mean = self.policy_mean(states)
            std = torch.exp(self.policy_log_std).expand_as(mean)
            if old_dist is None:
                mean_old, std_old = mean.detach(), std.detach()
            else:
                mean_old, std_old = old_dist
            # KL(old || new) for diagonal Gaussians
            kl = (torch.log(std / std_old)
                  + (std_old ** 2 + (mean_old - mean) ** 2) / (2 * std ** 2)
                  - 0.5).sum(dim=-1).mean()
        else:
            probs = self.policy(states)
            probs_old = probs.detach() if old_dist is None else old_dist
            kl = (probs_old * (torch.log(probs_old + 1e-8)
                               - torch.log(probs + 1e-8))).sum(dim=-1).mean()
        return kl

    def hessian_vector_product(self, states, vector, damping=0.1):
        """Compute Fisher matrix-vector product Fv via double backprop"""
        kl = self.compute_kl(states)

        params = self.get_policy_params()
        grads = torch.autograd.grad(kl, params, create_graph=True)
        flat_grads = self.flat_grad(grads)

        grad_vector_product = (flat_grads * vector).sum()
        hv = torch.autograd.grad(grad_vector_product, params)
        flat_hv = self.flat_grad(hv)

        # Damping improves the conditioning of the CG solve
        return flat_hv + damping * vector

    def conjugate_gradient(self, states, b, n_steps=10, residual_tol=1e-10):
        """Conjugate gradient to solve Fx = b"""
        x = torch.zeros_like(b)
        r = b.clone()
        p = r.clone()
        rdotr = r.dot(r)

        for _ in range(n_steps):
            Ap = self.hessian_vector_product(states, p)
            alpha = rdotr / (p.dot(Ap) + 1e-8)
            x += alpha * p
            r -= alpha * Ap
            new_rdotr = r.dot(r)
            if new_rdotr < residual_tol:
                break
            p = r + (new_rdotr / rdotr) * p
            rdotr = new_rdotr

        return x

    def surrogate(self, states, actions, advantages, old_log_probs):
        """Importance-sampled surrogate objective L(theta)"""
        if self.continuous:
            mean = self.policy_mean(states)
            std = torch.exp(self.policy_log_std)
            log_probs = Normal(mean, std).log_prob(actions).sum(dim=-1)
        else:
            log_probs = Categorical(self.policy(states)).log_prob(actions)
        return (torch.exp(log_probs - old_log_probs) * advantages).mean()

    def line_search(self, states, actions, advantages, old_log_probs,
                    step_direction, max_kl):
        """Backtracking line search: accept the largest step that satisfies
        the KL constraint AND actually improves the surrogate objective"""
        old_params = self.flat_params().clone()

        # Snapshot the old policy's distribution for the KL check,
        # and the surrogate value at the old parameters (ratio = 1)
        with torch.no_grad():
            if self.continuous:
                mean_old = self.policy_mean(states)
                std_old = torch.exp(self.policy_log_std).expand_as(mean_old)
                old_dist = (mean_old, std_old)
            else:
                old_dist = self.policy(states)
            old_surrogate = self.surrogate(states, actions, advantages,
                                           old_log_probs)

        # Maximum step size: beta = sqrt(2 * delta / (s^T F s))
        sAs = 0.5 * step_direction.dot(
            self.hessian_vector_product(states, step_direction))
        max_step = torch.sqrt(max_kl / (sAs + 1e-8))

        # Backtracking line search
        for shrink_factor in [0.5 ** i for i in range(10)]:
            step = max_step * shrink_factor
            new_params = old_params + step * step_direction
            self.set_flat_params(new_params)

            with torch.no_grad():
                kl = self.compute_kl(states, old_dist)
                new_surrogate = self.surrogate(states, actions, advantages,
                                               old_log_probs)

            if kl.item() < max_kl and new_surrogate > old_surrogate:
                return True

        # No acceptable step found: revert to old parameters
        self.set_flat_params(old_params)
        return False

Pros and Cons of TRPO

Advantages: 1. Theoretical guarantee: Has clear performance improvement lower bound 2. Stable updates: KL constraint ensures each update isn't too aggressive 3. Sample efficiency: Can reuse same batch of data multiple times (within KL constraint)

Disadvantages: 1. Complex implementation: Requires conjugate gradient and line search 2. Computationally expensive: Each update needs multiple backward passes 3. Hard to tune: Trust region size $\delta$ requires experience to choose

PPO: Proximal Policy Optimization

Motivation: From TRPO to PPO

TRPO is theoretically elegant but complex to implement and computationally expensive. Schulman et al. proposed PPO in 2017, with the goal: approximate TRPO's effect with first-order optimization while keeping implementation simple.

PPO's core idea: since we want to limit old-new policy difference, why not directly add this constraint in the objective function?

PPO-Clip: The Clipped Version

PPO's most popular variant is the clipped version, which limits policy updates by clipping the probability ratio:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$

where:

  • $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio
  • $\epsilon$ is the clip parameter, typically 0.1 or 0.2
  • $\mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)$ restricts $r_t$ to $[1-\epsilon,\, 1+\epsilon]$

Detailed Analysis of the Objective Function:

Let's discuss by case:

Case 1: $\hat{A}_t > 0$ (good action, we want to increase its probability)

  • If $r_t < 1$ (new policy decreased probability): objective is $r_t \hat{A}_t$, gradient encourages increasing $r_t$
  • If $1 \leq r_t \leq 1+\epsilon$: objective is still $r_t \hat{A}_t$, gradient continues encouraging increase
  • If $r_t > 1+\epsilon$: objective becomes the constant $(1+\epsilon)\hat{A}_t$, gradient is 0!

The last case is key: when the probability ratio exceeds $1+\epsilon$, the objective becomes constant and the gradient is 0. This prevents the policy from continuing to update in this direction — even with a large advantage, we can't overly increase the action's probability.

Case 2: $\hat{A}_t < 0$ (bad action, we want to decrease its probability)

  • If $r_t > 1$ (new policy increased probability): objective is $r_t \hat{A}_t$ (negative), gradient encourages decreasing $r_t$
  • If $1-\epsilon \leq r_t \leq 1$: objective is still $r_t \hat{A}_t$, gradient continues encouraging decrease
  • If $r_t < 1-\epsilon$: objective becomes the constant $(1-\epsilon)\hat{A}_t$, gradient is 0!

Similarly, when the probability ratio falls below $1-\epsilon$, the gradient becomes 0, preventing further decrease.

Intuition Summary: PPO-Clip's objective is "pessimistic" — it takes the smaller of the two objectives. This ensures:

  • The policy won't change dramatically because of one high-advantage sample
  • Even if sampling noise makes advantage estimates inaccurate, the policy won't fluctuate wildly
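The zero-gradient behavior at the clip boundary can be checked directly in PyTorch. A sketch for the positive-advantage case (ratio values and advantage are chosen for illustration, $\epsilon = 0.2$):

```python
import torch

def clip_objective(ratio, advantage, eps=0.2):
    # PPO-Clip for a single sample: min(r * A, clip(r) * A)
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return torch.min(surr1, surr2)

advantage = torch.tensor(2.0)        # a "good" action (positive advantage)
grads = {}
for r0 in [0.9, 1.1, 1.5]:
    ratio = torch.tensor(r0, requires_grad=True)
    clip_objective(ratio, advantage).backward()
    grads[r0] = ratio.grad.item()

print(grads)                          # gradient vanishes once ratio > 1 + eps
```

Inside $[1-\epsilon, 1+\epsilon]$ the gradient equals the advantage, but at ratio 1.5 the clipped branch wins the min and the gradient is exactly 0; the mirror image holds for negative advantages below $1-\epsilon$.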

PPO-Penalty: The Penalty Version

Another PPO variant adds KL divergence as a penalty term:

$$L^{KLPEN}(\theta) = \mathbb{E}_t\left[r_t(\theta)\,\hat{A}_t - \beta\, D_{KL}\big(\pi_{\theta_{old}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\big)\right]$$

where $\beta$ is the KL penalty coefficient. This is equivalent to TRPO's Lagrangian form, but converts the constraint to a penalty.

Adaptive KL Penalty:

The choice of $\beta$ is important. If $\beta$ is too small, the KL may grow too large; if too large, learning is too conservative. PPO-Penalty uses an adaptive adjustment:

def adaptive_kl_penalty(kl, d_target=0.01, beta=1.0):
    """Adaptively adjust KL penalty coefficient"""
    if kl < d_target / 1.5:
        # KL too small, can be more aggressive
        beta /= 2
    elif kl > d_target * 1.5:
        # KL too large, need to be more conservative
        beta *= 2
    return beta

In practice, PPO-Clip is more commonly used than PPO-Penalty because it doesn't require tuning $\beta$.

Complete PPO Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.distributions import Categorical, Normal

class PPOAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=64,
                 lr=3e-4, gamma=0.99, lam=0.95, eps_clip=0.2,
                 c1=0.5, c2=0.01, k_epochs=10, continuous=False):
        """
        PPO Agent

        Args:
            state_dim: State dimension
            action_dim: Action dimension
            hidden_dim: Hidden layer dimension
            lr: Learning rate
            gamma: Discount factor
            lam: GAE lambda parameter
            eps_clip: PPO clipping parameter
            c1: Value loss coefficient
            c2: Entropy coefficient (encourages exploration)
            k_epochs: Number of update epochs per batch
            continuous: Whether action space is continuous
        """
        self.continuous = continuous
        self.gamma = gamma
        self.lam = lam
        self.eps_clip = eps_clip
        self.c1 = c1
        self.c2 = c2
        self.k_epochs = k_epochs

        if continuous:
            # Continuous action space: output mean and std
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Tanh()
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
            policy_params = list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            # Discrete action space: output action probabilities
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )
            policy_params = list(self.policy.parameters())

        # Value network
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )

        # Unified optimizer over policy and value parameters
        self.optimizer = optim.Adam(
            policy_params + list(self.value.parameters()),
            lr=lr
        )

    def get_action(self, state):
        """Sample action from current policy"""
        state = torch.FloatTensor(state)

        if self.continuous:
            mean = self.policy_mean(state)
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum()
            return action.detach().numpy(), log_prob.detach()
        else:
            probs = self.policy(state)
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action).detach()

    def evaluate(self, states, actions):
        """Evaluate log probability, value, and entropy for given state-action pairs"""
        if self.continuous:
            mean = self.policy_mean(states)
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            log_probs = dist.log_prob(actions).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1).mean()
        else:
            probs = self.policy(states)
            dist = Categorical(probs)
            log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

        values = self.value(states).squeeze()
        return log_probs, values, entropy

    def compute_gae(self, rewards, values, dones, next_value):
        """
        Compute Generalized Advantage Estimation (GAE)

        GAE balances bias and variance:
        - lambda=0: Use single-step TD error, low variance but high bias
        - lambda=1: Use Monte Carlo return, low bias but high variance
        - lambda=0.95: Commonly used balance point in practice
        """
        advantages = []
        gae = 0
        values = list(values) + [next_value]

        for t in reversed(range(len(rewards))):
            # TD error
            delta = rewards[t] + self.gamma * values[t + 1] * (1 - dones[t]) - values[t]
            # GAE recursion
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)

        return torch.FloatTensor(advantages)

    def update(self, states, actions, old_log_probs, rewards, dones, next_state):
        """
        PPO Update

        Core idea:
        1. Compute the advantage function (with GAE)
        2. Run multiple optimization epochs, each using the PPO-Clip objective
        3. Simultaneously optimize policy and value function
        """
        # Convert to tensors
        states = torch.FloatTensor(np.array(states))
        if self.continuous:
            actions = torch.FloatTensor(np.array(actions))
        else:
            actions = torch.LongTensor(actions)
        old_log_probs = torch.stack(old_log_probs)

        # Compute advantages and returns (no gradient needed)
        with torch.no_grad():
            values = self.value(states).squeeze().numpy()
            next_value = self.value(torch.FloatTensor(next_state)).item()
            advantages = self.compute_gae(rewards, values, dones, next_value)
            returns = advantages + torch.FloatTensor(values)

        # Normalize advantages (important! reduces variance)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Record losses for monitoring
        policy_losses = []
        value_losses = []
        entropies = []

        # Multiple optimization epochs
        for _ in range(self.k_epochs):
            # Evaluate current policy
            log_probs, value_pred, entropy = self.evaluate(states, actions)

            # Compute probability ratio
            ratios = torch.exp(log_probs - old_log_probs)

            # PPO-Clip objective
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value loss (can also use a clipped version)
            value_loss = ((value_pred - returns) ** 2).mean()

            # Total loss = policy loss + c1 * value loss - c2 * entropy
            loss = policy_loss + self.c1 * value_loss - self.c2 * entropy

            # Gradient update
            self.optimizer.zero_grad()
            loss.backward()
            # Gradient clipping (optional but recommended)
            if self.continuous:
                nn.utils.clip_grad_norm_(
                    list(self.policy_mean.parameters()) + [self.policy_log_std] +
                    list(self.value.parameters()),
                    0.5
                )
            else:
                nn.utils.clip_grad_norm_(
                    list(self.policy.parameters()) + list(self.value.parameters()),
                    0.5
                )
            self.optimizer.step()

            policy_losses.append(policy_loss.item())
            value_losses.append(value_loss.item())
            entropies.append(entropy.item())

        return {
            'policy_loss': np.mean(policy_losses),
            'value_loss': np.mean(value_losses),
            'entropy': np.mean(entropies)
        }


def train_ppo(env_name='CartPole-v1', num_episodes=500,
              update_freq=2048, render=False):
    """
    PPO Training Loop

    Uses the classic Gym API (reset() returns obs, step() returns a 4-tuple);
    adapt the reset/step unpacking if running on gymnasium.
    """
    import gym
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]

    # Check action space type
    if isinstance(env.action_space, gym.spaces.Discrete):
        action_dim = env.action_space.n
        continuous = False
    else:
        action_dim = env.action_space.shape[0]
        continuous = True

    agent = PPOAgent(state_dim, action_dim, continuous=continuous)

    # Data buffer
    states, actions, log_probs, rewards, dones = [], [], [], [], []

    episode_rewards = []
    total_steps = 0

    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0
        done = False

        while not done:
            if render:
                env.render()

            # Sample action
            action, log_prob = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)

            # Store experience
            states.append(state)
            actions.append(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            dones.append(done)

            state = next_state
            episode_reward += reward
            total_steps += 1

            # Update when reaching update frequency
            if total_steps % update_freq == 0 and len(states) > 0:
                info = agent.update(states, actions, log_probs, rewards, dones, state)
                # Clear buffer
                states, actions, log_probs, rewards, dones = [], [], [], [], []

        episode_rewards.append(episode_reward)

        if episode % 10 == 0:
            avg_reward = np.mean(episode_rewards[-10:]) if len(episode_rewards) >= 10 else np.mean(episode_rewards)
            print(f"Episode {episode}, Avg Reward (last 10): {avg_reward:.2f}")

    env.close()
    return agent, episode_rewards

Key Techniques for PPO

Generalized Advantage Estimation (GAE)

GAE is a technique for computing the advantage function, balancing bias and variance through the parameter $\lambda$:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

Special cases:

  • $\lambda = 0$: $\hat{A}_t = \delta_t$ (single-step TD, low variance, high bias)
  • $\lambda = 1$: $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = G_t - V(s_t)$ (Monte Carlo, high variance, low bias)

In practice, $\lambda = 0.95$ is commonly used as a good balance.
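Both special cases can be verified against the recursive implementation (same scheme as `compute_gae` above, without episode-end masking; the reward and value numbers are arbitrary):

```python
import numpy as np

def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    # Recursive GAE, matching the compute_gae recursion above
    advantages = []
    g = 0.0
    vals = list(values) + [next_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * vals[t + 1] - vals[t]
        g = delta + gamma * lam * g
        advantages.insert(0, g)
    return np.array(advantages)

gamma = 0.99
rewards = [1.0, 0.5, 2.0]
values = [0.3, 0.1, 0.7]
next_value = 0.2
vals = values + [next_value]

# lambda = 0 reduces to one-step TD errors
adv0 = gae(rewards, values, next_value, gamma, lam=0.0)
deltas = np.array([rewards[t] + gamma * vals[t + 1] - vals[t] for t in range(3)])

# lambda = 1 reduces to the bootstrapped discounted return minus the baseline
adv1 = gae(rewards, values, next_value, gamma, lam=1.0)
mc = np.array([sum(gamma ** (k - t) * rewards[k] for k in range(t, 3))
               + gamma ** (3 - t) * next_value - values[t]
               for t in range(3)])

print(np.allclose(adv0, deltas), np.allclose(adv1, mc))
```

The $\lambda = 1$ equivalence follows from telescoping the sum of discounted TD errors.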

Value Function Clipping

To further stabilize training, clipping can be applied to the value function as well:

$$L^V = \mathbb{E}_t\left[\max\left(\big(V_\theta(s_t) - \hat{R}_t\big)^2,\; \big(\mathrm{clip}\big(V_\theta(s_t),\, V_{old}(s_t) - \epsilon,\, V_{old}(s_t) + \epsilon\big) - \hat{R}_t\big)^2\right)\right]$$

This prevents value function updates from being too aggressive.
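A sketch of this loss in PyTorch (variable names are illustrative; `values_old` would be the value predictions recorded at sampling time):

```python
import torch

def clipped_value_loss(values, values_old, returns, eps=0.2):
    # Take the max of clipped and unclipped squared errors, so a value update
    # that moves more than eps past the old prediction gains nothing
    values_clipped = values_old + torch.clamp(values - values_old, -eps, eps)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()

values = torch.tensor([1.0, 2.0])       # current predictions (illustrative)
values_old = torch.tensor([0.5, 2.1])   # predictions at sampling time
returns = torch.tensor([1.2, 1.8])      # GAE returns
loss = clipped_value_loss(values, values_old, returns)
print(loss)
```

In the first element the new prediction moved 0.5 past the old one, so the clipped branch dominates and penalizes the overshoot; in the second the move is within the range and the ordinary squared error applies.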

Learning Rate Annealing

Lowering the learning rate late in training can improve final performance:

def linear_schedule(initial_lr, final_lr, current_step, total_steps):
    """Linear annealing"""
    progress = current_step / total_steps
    return initial_lr + (final_lr - initial_lr) * progress

# Cosine annealing is another common choice
def cosine_schedule(initial_lr, final_lr, current_step, total_steps):
    """Cosine annealing"""
    progress = current_step / total_steps
    return final_lr + 0.5 * (initial_lr - final_lr) * (1 + np.cos(np.pi * progress))

Parallel Environment Sampling

Using multiple parallel environments can greatly improve sampling efficiency and data diversity:

class ParallelEnvs:
    """Parallel environments wrapper (synchronous, single-process)"""
    def __init__(self, env_name, n_envs=8):
        import gym
        self.envs = [gym.make(env_name) for _ in range(n_envs)]
        self.n_envs = n_envs

    def reset(self):
        return np.array([env.reset() for env in self.envs])

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        states, rewards, dones, infos = zip(*results)

        # Auto-reset completed environments
        states = list(states)
        for i, done in enumerate(dones):
            if done:
                states[i] = self.envs[i].reset()

        return np.array(states), np.array(rewards), np.array(dones), infos

Deep Comparison: TRPO vs PPO

Theoretical Guarantees

| Aspect | TRPO | PPO-Clip |
| --- | --- | --- |
| Policy improvement guarantee | Strict theoretical lower bound | No strict theoretical guarantee |
| KL constraint | Hard constraint (ensured via line search) | Soft constraint (clipping is heuristic) |
| Update stability | Theoretical guarantee against degradation | Empirically stable, but exceptions possible |
| Convergence | Has convergence analysis | No strict convergence proof |

Implementation Complexity

| Aspect | TRPO | PPO-Clip |
| --- | --- | --- |
| Optimization method | Second-order (natural gradient) | First-order (SGD/Adam) |
| Requires Fisher matrix | Yes | No |
| Requires conjugate gradient | Yes | No |
| Requires line search | Yes | No |
| Lines of code (core) | ~300 | ~100 |
| Tuning complexity | Medium ($\delta$ choice) | Low ($\epsilon$ not sensitive) |

Computational Efficiency

| Aspect | TRPO | PPO-Clip |
| --- | --- | --- |
| Backward passes per update | ~20 (CG + line search) | 1 per epoch |
| Memory usage | Higher (Hessian-vector product intermediates) | Lower |
| Parallelization difficulty | Higher | Lower |
| GPU utilization | Medium | High |

Practical Performance

On most standard RL benchmarks, PPO matches or sometimes exceeds TRPO performance. This may be because:

  1. Multiple update epochs: PPO can perform multiple update epochs on the same batch (usually around 10), while TRPO typically takes a single step per batch. This lets PPO utilize data more fully.

  2. Better exploration: PPO's entropy regularization encourages exploration, while TRPO has no explicit exploration mechanism.

  3. More stable value function: PPO typically trains policy and value networks together, sharing underlying representations, which helps learn better features.

  4. Implementation details: PPO's simplicity makes implementation easier to get right, reducing bug probability.
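The multi-epoch data reuse described in point 1 can be sketched as a shuffled minibatch loop. This is a minimal illustration; `ppo_epochs_sketch`, `update_fn`, and the batch layout are my own names, not from any specific library:

```python
import numpy as np

def ppo_epochs_sketch(update_fn, batch, n_epochs=10, minibatch_size=64, rng=None):
    """Run several PPO update epochs over one collected batch.

    update_fn(minibatch) performs one clipped-objective gradient step;
    batch is a dict of equally sized arrays (obs, actions, advantages, ...).
    """
    rng = rng or np.random.default_rng(0)
    n = len(next(iter(batch.values())))
    for _ in range(n_epochs):
        perm = rng.permutation(n)  # reshuffle the batch each epoch
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            update_fn({k: v[idx] for k, v in batch.items()})
```

TRPO, by contrast, would consume the same batch in a single natural-gradient step.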

PPO in RLHF

PPO is the core algorithm for large language model alignment (RLHF) today. Let's understand its application in this context in detail.

RLHF Pipeline

RLHF (Reinforcement Learning from Human Feedback) aims to make language models generate responses that better align with human preferences. The pipeline is:

  1. Supervised Fine-Tuning (SFT): Fine-tune pretrained model on high-quality dialogue data
  2. Train Reward Model (RM): Collect human preference data, train a model to predict which response humans prefer
  3. PPO Fine-Tuning: Use reward model outputs as reward signal, fine-tune language model with PPO

PPO Objective in RLHF

In RLHF, PPO's objective function has some special features:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$

where:
- $x$ is the user input (prompt)
- $y$ is the model-generated response
- $r_\phi(x, y)$ is the reward model's score for the response
- $\pi_{\text{ref}}$ is the reference model (usually the SFT model)
- $\beta$ is the KL penalty coefficient

Why KL penalty?

Without the KL penalty, the model might find "dark magic" to fool the reward model: generating responses that the reward model scores highly but that are actually low quality. The KL penalty keeps the model from deviating too far from the SFT model, preserving the reasonableness of its generations.

Implementation Details

import torch

class RLHFPPOTrainer:
    def __init__(self, policy_model, ref_model, reward_model,
                 kl_coef=0.1, clip_range=0.2):
        """
        RLHF PPO Trainer

        Args:
            policy_model: Policy model to train (language model)
            ref_model: Reference model (frozen SFT model)
            reward_model: Reward model
            kl_coef: KL penalty coefficient
            clip_range: PPO clipping range
        """
        self.policy = policy_model
        self.ref = ref_model
        self.reward = reward_model
        self.kl_coef = kl_coef
        self.clip_range = clip_range

        # Freeze reference model
        for param in self.ref.parameters():
            param.requires_grad = False

    def compute_rewards(self, prompts, responses):
        """
        Compute rewards (including KL penalty)

        Returns:
            rewards: Final reward = reward model score - KL penalty
        """
        # Reward model scores
        rm_scores = self.reward(prompts, responses)

        # Compute KL divergence (no gradients flow through the reward)
        with torch.no_grad():
            ref_logprobs = self.ref.get_log_probs(prompts, responses)
            policy_logprobs = self.policy.get_log_probs(prompts, responses)

        kl = policy_logprobs - ref_logprobs           # Per-token KL
        kl_penalty = self.kl_coef * kl.sum(dim=1)     # Whole-sequence KL

        rewards = rm_scores - kl_penalty
        return rewards, rm_scores, kl_penalty

    def ppo_step(self, prompts, old_responses, old_logprobs, advantages):
        """PPO update step"""
        # Get log probs under new policy
        new_logprobs = self.policy.get_log_probs(prompts, old_responses)

        # Compute probability ratio
        ratio = torch.exp(new_logprobs - old_logprobs)

        # PPO-Clip objective
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - self.clip_range,
                            1 + self.clip_range) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()

        return policy_loss

Challenges of PPO in RLHF

  1. Reward Hacking: Model might find ways to fool reward model
  2. KL Divergence Explosion: If KL penalty too weak, model may rapidly deviate from reference
  3. Training Instability: Language model output space is huge, optimization landscape complex
  4. Computational Cost: Need to run multiple large models simultaneously

Recent Advances: DPO and Alternatives

Due to the complexity of RLHF with PPO, some simpler alternatives have emerged:

  • DPO (Direct Preference Optimization): Directly optimize from preference data, skipping reward model and PPO
  • IPO (Identity Preference Optimization): Improved version of DPO
  • KTO (Kahneman-Tversky Optimization): Optimization based on prospect theory
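To make the contrast with PPO concrete, here is a rough sketch of the DPO loss from Rafailov et al. (2023): it pushes the policy to prefer the chosen response over the rejected one, measured relative to the frozen reference model, with no reward model or sampling loop. The function name and NumPy formulation are my own:

```python
import numpy as np

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), averaged over the batch.

    The margin compares how much more the policy (vs. the reference)
    favors the chosen response over the rejected one.
    """
    margin = (policy_chosen_logps - ref_chosen_logps) \
           - (policy_rejected_logps - ref_rejected_logps)
    # log(1 + exp(-x)) written in a numerically stable form
    return np.mean(np.logaddexp(0.0, -beta * margin))
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; training decreases it by widening the margin.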

But PPO remains the preferred choice in many scenarios, especially when fine-grained control or online learning is needed.

Practical Advice and Tuning Tips

Hyperparameter Selection Guide

| Hyperparameter | Typical Range | Notes |
| --- | --- | --- |
| Learning rate | 1e-4 ~ 3e-4 | Too large: unstable; too small: slow |
| Clip parameter $\epsilon$ | 0.1 ~ 0.3 | 0.2 is most common |
| GAE $\lambda$ | 0.9 ~ 0.99 | 0.95 usually works well |
| Discount factor $\gamma$ | 0.99 ~ 0.999 | The longer the task, the closer to 1 |
| Batch size | 2048 ~ 8192 | Larger is more stable, but updates are slower |
| Epochs per batch | 3 ~ 10 | Too many may overfit the current data |
| Entropy coefficient | 0.0 ~ 0.01 | Encourages exploration |
| Value loss coefficient | 0.5 ~ 1.0 | |
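As a starting point, the typical values above can be gathered into a single config. The key names below are illustrative, not from any specific library:

```python
# Common PPO starting values from the table above; tune per task.
PPO_DEFAULTS = {
    "learning_rate": 3e-4,
    "clip_range": 0.2,        # epsilon
    "gae_lambda": 0.95,
    "gamma": 0.99,
    "batch_size": 2048,
    "n_epochs": 10,
    "entropy_coef": 0.01,
    "value_loss_coef": 0.5,
}
```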

Common Problems and Solutions

Problem 1: Performance not improving or oscillating

Possible causes:

  • Learning rate too large: reduce it
  • Batch too small: increase batch size
  • Advantage estimate inaccurate: check your GAE implementation
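As a reference point for checking a GAE implementation, here is a minimal sketch (the function name and rollout layout are assumptions, not from any specific library):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards, values, dones are arrays of length T; last_value bootstraps
    the state after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]  # cut bootstrapping and the trace at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # regression targets for the value function
    return advantages, returns
```

A quick sanity check: with $\gamma = \lambda = 1$ and zero values, the advantages reduce to reversed cumulative sums of the rewards.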

Problem 2: Policy collapses to deterministic

Possible causes:

  • Entropy coefficient too small: increase it
  • Clip parameter too small: increase it

Problem 3: Value function predictions are poor

Possible causes:

  • Value network has insufficient capacity: increase hidden layer size
  • Return scale too large: normalize returns
  • Learning rate mismatch: tune the value network learning rate separately

Debugging Tips

  1. Monitor KL divergence: If too large (>0.1), updates too aggressive

  2. Monitor probability ratios: Most should lie in $[1-\epsilon,\, 1+\epsilon]$ (e.g., $[0.8, 1.2]$ for $\epsilon = 0.2$)

  3. Monitor entropy: Should gradually decrease, but not to 0

  4. Visualize advantage distribution: Should be roughly symmetric, mean near 0
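Tips 1 and 2 can be computed directly from the stored log-probabilities. A minimal sketch (the function name is illustrative; the KL estimator is the simple first-order one):

```python
import numpy as np

def ppo_diagnostics(old_logprobs, new_logprobs, clip_range=0.2):
    """Quick health metrics for one PPO update."""
    ratio = np.exp(new_logprobs - old_logprobs)
    # Simple estimate of KL(old || new); should stay small
    approx_kl = float(np.mean(old_logprobs - new_logprobs))
    # Fraction of ratios that fell outside the clip interval
    clip_fraction = float(np.mean(np.abs(ratio - 1.0) > clip_range))
    return {"approx_kl": approx_kl, "clip_fraction": clip_fraction}
```

If `approx_kl` repeatedly exceeds ~0.1 or `clip_fraction` stays very high, the updates are too aggressive: lower the learning rate or the number of epochs.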

Summary

The core idea of trust region methods is constraining policy update magnitude to ensure stable learning:

  1. TRPO provides strict theoretical guarantees through KL divergence constraints and natural gradients, but is complex to implement and computationally expensive

  2. PPO achieves similar effects through simple clipping mechanism, becoming the de facto standard algorithm

  3. Key differences:

    • TRPO: Second-order optimization, hard constraint, theoretical guarantee
    • PPO: First-order optimization, soft constraint, practically effective
  4. PPO's success secrets:

    • Simple implementation, easy to debug
    • Can reuse data multiple times
    • Seamlessly integrates with other techniques (GAE, parallel sampling)
    • Thrives in new domains like RLHF

From an algorithm development perspective, PPO represents an important engineering philosophy: simple and effective approximations are often more valuable than complex exact methods. In the deep learning era, this pragmatic approach is particularly important — we pursue not mathematical perfection, but practical effectiveness.

In the next chapter, we'll explore another important direction —Imitation Learning and Inverse Reinforcement Learning: how do agents learn when we have expert demonstrations but no explicit reward function?

References

  1. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML.
  2. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv.
  3. Kakade, S., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. ICML.
  4. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
  5. Achiam, J. (2018). Spinning Up in Deep RL. OpenAI Documentation.
  6. Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation.
  7. Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.

Q&A: Frequently Asked Questions

Q1: Which is better for beginners, PPO or TRPO?

A: PPO. It's simpler to implement, easier to tune, and performance is stable. I recommend mastering PPO first, then learning TRPO's theory.

Q2: How to choose clip parameter?

A: 0.2 is the most commonly used value and suits most tasks. If training is unstable, lower it to 0.1; if learning is too slow, try 0.3.

Q3: Why does PPO need multiple update epochs?

A: Because PPO's clipping mechanism limits the update magnitude in each epoch, multiple epochs let it use the data more fully within the clipping constraint. But don't run too many epochs, or you may overfit the current batch.

Q4: What's the difference between GAE's $\lambda$ and the discount factor $\gamma$?

A: $\gamma$ controls how much we value future rewards (task level); $\lambda$ controls the decay of TD errors in advantage estimation (estimation level). Both lie between 0 and 1, but they have different effects.

Q5: Can PPO be used for continuous action spaces?

A: Yes. Parameterize actions with Gaussian distribution, learn mean and variance. Be careful about log probability computation in implementation.
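A minimal sketch of the diagonal-Gaussian log-probability involved (this is equivalent to what `torch.distributions.Normal(...).log_prob(...).sum(-1)` computes; written in NumPy here for self-containment):

```python
import numpy as np

def gaussian_log_prob(actions, mean, log_std):
    """Log probability of actions under a diagonal Gaussian policy.

    Summing over the action dimension gives the joint log-prob used
    in the PPO ratio for continuous control.
    """
    std = np.exp(log_std)
    per_dim = (-0.5 * ((actions - mean) / std) ** 2
               - log_std - 0.5 * np.log(2 * np.pi))
    return per_dim.sum(axis=-1)
```

A common pitfall is forgetting the sum over action dimensions, which makes the probability ratio per-dimension instead of per-action.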

Q6: Why does RLHF use PPO instead of other RL algorithms?

A: PPO is stable and unlikely to cause the language model to "collapse" (generate gibberish). Also, PPO's clipping mechanism naturally complements the KL penalty.

Q7: How to tell if PPO training is normal?

A: Monitor these metrics:

  • Policy loss should decrease
  • Average KL divergence shouldn't be too large (<0.02 is usually good)
  • Clip fraction shouldn't be too high (<20%)
  • Average return should increase

Q8: What's the relationship between PPO and Actor-Critic?

A: PPO is a type of Actor-Critic. Actor is policy network, Critic is value network. PPO's specialty is its objective function design (clipping).

  • Post title: Reinforcement Learning (6): PPO and TRPO - Trust Region Policy Optimization
  • Post author: Chen Kai
  • Create time: 2024-09-03 14:30:00
  • Post link: https://www.chenk.top/reinforcement-learning-6-ppo-and-trpo/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.