If value function methods learn policies indirectly by "evaluating
action quality," then policy gradient methods directly optimize the
policy itself. DQN's success proved deep learning's tremendous potential
in reinforcement learning, but its limitations are also obvious — it can
only handle discrete action spaces and struggles with continuous control
tasks like robot control and autonomous driving. Policy Gradient methods
parameterize policies as neural networks and use gradient ascent to
directly maximize expected returns, naturally supporting continuous
actions. From the earliest REINFORCE algorithm to Actor-Critic
architectures combining value functions, from asynchronous parallel A3C
to breakthrough DDPG, from sample-efficient TD3 to industrially
widespread PPO, to SAC under the maximum entropy framework — policy
gradient methods have become the mainstream technical approach in deep
reinforcement learning. This chapter systematically traces this
evolution path, deeply analyzing each algorithm's design motivations,
mathematical principles, and implementation details.
## Policy Gradient Fundamentals: Direct Policy Optimization

### Why Policy Gradient is Needed
In Chapter 2, we saw DQN learn a Q-function $Q(s, a; \theta)$ and greedily select actions to obtain a policy. This indirect approach has several problems:
**Problem 1: Only Handles Discrete Actions**

DQN needs to compute $\arg\max_a Q(s, a)$, which is simple in discrete spaces (like Atari's 18 actions) by enumeration, but in continuous spaces (like robot joint angles) requires solving the optimization problem $\max_{a \in \mathcal{A}} Q(s, a)$ at every action selection — computationally expensive and imprecise.
**Problem 2: Exploration Dilemma of Deterministic Policies**

The greedy policy $\pi(s) = \arg\max_a Q(s, a)$ is completely deterministic; exploration can only rely on heuristics like $\epsilon$-greedy, lacking principled guidance.
**Problem 3: Accumulation of Value Function Approximation Errors**

In high-dimensional state spaces, Q-function approximation errors accumulate and amplify through the $\max$ operation (like the overestimation problem in Chapter 2), affecting final policy quality.
**Problem 4: Cannot Represent Stochastic Policies**

The optimal policy for some problems is inherently stochastic. A classic example is rock-paper-scissors — any deterministic policy can be exploited by the opponent; the optimal policy is uniformly random.
Policy Gradient methods circumvent these problems by directly parameterizing the policy $\pi_\theta(a|s)$:

- Policies can output action probability distributions (discrete) or distribution parameters (continuous), naturally supporting stochastic policies
- For continuous actions, a Gaussian distribution is typically used, with the network directly outputting mean and variance
- The optimization objective is the expected return, optimized directly through gradient ascent
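As a concrete sketch of these two output types (class names and layer sizes are my own illustration, not from a specific library), a discrete policy can emit a Categorical distribution while a continuous policy emits the parameters of a Gaussian:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """Outputs a categorical distribution over a finite action set."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return Categorical(logits=self.net(s))

class GaussianPolicy(nn.Module):
    """Outputs the mean of a Gaussian; log-std is a learned free parameter."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, s):
        return Normal(self.mu(s), self.log_std.exp())

s = torch.randn(4)                               # a dummy 4-dim state
a_discrete = DiscretePolicy(4, 2)(s).sample()    # an action index (0 or 1)
a_continuous = GaussianPolicy(4, 3)(s).sample()  # a 3-dim continuous action
```

Either way, sampling is a single forward pass through the network, which is what makes continuous action spaces painless.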
### Policy Gradient Theorem
Let the policy $\pi_\theta(a|s)$ be parameterized by $\theta$; the goal is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right], \qquad R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$$

where $p_\theta(\tau)$ is the trajectory distribution induced by the policy and $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory.
Intuitively, we want the gradient $\nabla_\theta J(\theta)$. But the problem is that the distribution inside the expectation depends on $\theta$ (when the policy changes, the trajectory distribution changes), so we can't simply interchange gradient and expectation.
The Policy Gradient Theorem (Sutton et al., 2000) gives the exact gradient expression:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s, a)\right]$$

This formula is very elegant:

- $\nabla_\theta \log \pi_\theta(a|s)$ is the score function, measuring how parameter changes affect the probability of selecting $a$
- $Q^{\pi_\theta}(s, a)$ is the long-term value of that action
- Their product: actions with high value get increased probability; actions with low value get decreased probability

More elegantly, the gradient doesn't depend on the state transition probability $p(s'|s, a)$ — even if the environment model is unknown, as long as we can sample trajectories, we can estimate the gradient.
### Derivation: From Trajectory Distribution to Policy Gradient
The complete derivation requires a few techniques. First, write the objective in terms of the trajectory distribution:

$$J(\theta) = \int p_\theta(\tau) R(\tau)\, d\tau$$

where $p_\theta(\tau) = p(s_0) \prod_{t} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$ is the trajectory probability.

Take the derivative with respect to $\theta$:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) R(\tau)\, d\tau$$

Using the log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$, we get:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]$$

Expand:

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t} \log \pi_\theta(a_t|s_t) + \sum_{t} \log p(s_{t+1}|s_t, a_t)$$

When taking the derivative with respect to $\theta$, the terms $\log p(s_0)$ and $\log p(s_{t+1}|s_t, a_t)$ don't depend on $\theta$, so they disappear:

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

Substituting:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right]$$

This is the REINFORCE algorithm form. Further, noting that the action at time $t$ only affects rewards after $t$ (causality), we can replace $R(\tau)$ with $G_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}$, obtaining:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]$$

where $G_t$ is precisely an unbiased estimate of $Q^{\pi_\theta}(s_t, a_t)$. This is the policy gradient theorem.
### Baseline: Reducing Variance
One problem with policy gradients is high variance. Consider that $G_t$ might range from -100 to +100, causing very unstable gradient estimates.

A simple but effective trick is subtracting a baseline:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\left(G_t - b(s_t)\right)\right]$$

As long as $b(s_t)$ doesn't depend on the action $a_t$, the gradient's expectation is unchanged (because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$), but variance can be greatly reduced.

The most commonly used baseline is the state value function $V^\pi(s)$, at which point

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

is called the advantage function. Intuitively, $V^\pi(s)$ is the "average" value of that state, and $A^\pi(s, a)$ measures how much better action $a$ is than average. Using $A^\pi$ instead of $Q^\pi$ only reinforces "better than average" actions, avoiding increasing the probabilities of all actions (even poor ones).
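The variance reduction is easy to verify numerically. The toy experiment below (a single-state, two-action problem; the probabilities and action values are made-up numbers) estimates the same score-function gradient with and without a baseline — both agree in expectation, but the baselined version has far lower variance:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.3])          # current policy probabilities (toy)
q = np.array([10.0, 13.0])        # true action values (toy)
baseline = p @ q                  # V(s) = 10.9, the state's average value

def grad_samples(b, n=100_000):
    """Score-function estimates of dJ/dtheta_1 (the logit of action 1)."""
    a = rng.choice(2, size=n, p=p)
    score = (a == 1).astype(float) - p[1]   # d log pi(a) / d theta_1 for softmax
    return score * (q[a] - b)

g_plain = grad_samples(0.0)        # no baseline
g_base = grad_samples(baseline)    # subtract V(s)

# Both estimators have the same mean (~0.63, the true gradient),
# but the baselined one has a far smaller variance.
print(g_plain.mean(), g_base.mean())
print(g_plain.var(), g_base.var())
```

In this toy case the baseline shrinks the variance by roughly two orders of magnitude while leaving the mean untouched, which is exactly the claim above.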
## REINFORCE Algorithm: Monte Carlo Policy Gradient

### Algorithm Flow
REINFORCE (Williams, 1992) is the simplest policy gradient algorithm. Key points:

- Sample a complete trajectory with the current policy (on-policy)
- Compute the discounted return $G_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}$ from step $t$ to termination (Monte Carlo estimate)
- Apply the policy gradient formula $\nabla_\theta J \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$
- Update the parameters by gradient ascent
### REINFORCE with Baseline

Adding the state value function $V_\phi(s)$ as baseline:

Algorithm: REINFORCE with Baseline
1. Initialize policy parameters $\theta$ and value function parameters $\phi$
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import gym
import numpy as np


def compute_returns(rewards, gamma=0.99):
    """Compute discounted returns"""
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns
```
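A minimal REINFORCE update built on `compute_returns` might look like the sketch below (the `PolicyNet` architecture and the return normalization are my own additions, and the environment loop is replaced by a fake episode so the snippet runs standalone):

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Softmax policy over discrete actions (sizes are illustrative)."""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        return F.softmax(self.fc2(F.relu(self.fc1(state))), dim=-1)

def compute_returns(rewards, gamma=0.99):
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

def reinforce_update(policy, optimizer, log_probs, rewards, gamma=0.99):
    """One gradient-ascent step from a finished episode."""
    G = torch.tensor(compute_returns(rewards, gamma), dtype=torch.float32)
    G = (G - G.mean()) / (G.std() + 1e-8)        # normalize for stability
    loss = -(torch.stack(log_probs) * G).sum()   # minimize -J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake 3-step episode so the sketch runs without an environment
policy = PolicyNet(4, 2)
opt = optim.Adam(policy.parameters(), lr=1e-3)
log_probs, rewards = [], []
for _ in range(3):
    dist = Categorical(policy(torch.randn(4)))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))
    rewards.append(1.0)
loss_value = reinforce_update(policy, opt, log_probs, rewards)
```

In a real run, the fake episode would be replaced by rolling out the current policy in a `gym` environment until termination.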
Cons:

- High variance: $G_t$ has large randomness, even with a baseline
- Low sample efficiency: on-policy, each trajectory is used only once
- Training instability: the learning curve oscillates severely
These drawbacks motivated researchers to explore more advanced
methods.
## Actor-Critic Architecture: Combining Policy and Value

### From REINFORCE to Actor-Critic
REINFORCE uses the complete return $G_t$ to estimate the Q-value, which is a Monte Carlo method with high variance. Can we use temporal difference (TD) to reduce variance?

Recall the policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, Q^\pi(s_t, a_t)\right]$$

In REINFORCE, $Q^\pi$ is estimated with $G_t$. Actor-Critic's idea is: use a neural network $Q_w(s, a)$ or $V_w(s)$ to approximate the value, then train this network with TD methods.
The architecture splits into two parts:

- **Actor**: the policy network $\pi_\theta(a|s)$, responsible for selecting actions
- **Critic**: the value network $V_w(s)$ or $Q_w(s, a)$, responsible for evaluating actions

The Actor updates based on the Critic's feedback; the Critic updates based on environment rewards. This "actor-critic" interaction is where the name comes from.
### Advantage Actor-Critic (A2C)
Using the state value function $V_w(s)$ as Critic, the advantage function is estimated as:

$$\hat{A}_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$

This is the 1-step TD error. Compared to $G_t - V_w(s_t)$, it has lower variance (it depends only on a one-step transition) but introduces bias (because $V_w$ is approximate).
Note:

- The TD error serves as the advantage estimate for the Actor
- The Critic is trained by minimizing the squared TD error of the value function
- The same TD error guides the policy update
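A one-step A2C update can be sketched as follows (the `TinyAC` network and the `a2c_step` helper are my own illustration, not the chapter's full implementation):

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class TinyAC(nn.Module):
    """Minimal actor-critic with a shared body (illustrative sizes)."""
    def __init__(self, state_dim=4, n_actions=2, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, s):
        h = self.body(s)
        return torch.softmax(self.actor(h), dim=-1), self.critic(h)

def a2c_step(net, opt, s, log_prob, r, s_next, done, gamma=0.99):
    """One-step A2C: the TD error is both the advantage (Actor loss)
    and the regression error (Critic loss)."""
    _, v = net(s)
    with torch.no_grad():
        _, v_next = net(s_next)
        td_target = r + gamma * v_next * (1.0 - done)
    td_error = td_target - v
    actor_loss = -log_prob * td_error.detach()   # stop gradient through advantage
    critic_loss = td_error.pow(2)
    loss = (actor_loss + 0.5 * critic_loss).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(td_error.mean())

net = TinyAC()
opt = optim.Adam(net.parameters(), lr=7e-4)
s, s2 = torch.randn(1, 4), torch.randn(1, 4)
probs, _ = net(s)
dist = Categorical(probs)
a = dist.sample()
delta = a2c_step(net, opt, s, dist.log_prob(a), r=1.0, s_next=s2, done=0.0)
```

The `detach()` on the TD error is important: the advantage should drive the Actor, but its gradient must not flow back into the Critic through the policy loss.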
### A3C: Asynchronous Advantage Actor-Critic
A3C (Asynchronous Advantage Actor-Critic, Mnih et al., 2016) is a parallel version of A2C, and the first on-policy algorithm to compete with DQN on Atari.
Core idea: Run multiple environment instances in
parallel, each worker samples independently and computes gradients,
asynchronously updating shared global parameters.
Why is parallelization effective?

- Breaks sample correlation: experiences from different workers come from different states, reducing correlation
- Accelerates training: multi-core CPUs can sample simultaneously, with the GPU handling network updates
- Exploration diversity: different workers can use different exploration strategies (like different $\epsilon$)
A3C achieved great success in 2016, reaching performance close to DQN on Atari with faster training (leveraging multi-core CPUs). But it has a drawback: asynchronous updates may cause stale gradients (by the time one worker finishes computing a gradient, the global parameters may already have been modified by others), affecting stability.
Modern implementations typically use the synchronous version, A2C (removing the "Asynchronous"), sampling in parallel across multiple environments and updating the parameters in lockstep, avoiding the asynchrony problems.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import gym
import numpy as np


class ActorCritic(nn.Module):
    """Actor-Critic network with shared parameters"""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        # Actor head
        self.actor = nn.Linear(hidden_dim, action_dim)
        # Critic head
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        action_probs = F.softmax(self.actor(x), dim=-1)
        state_value = self.critic(x)
        return action_probs, state_value


# Run
# model, rewards = a2c_multi_env(n_envs=8, episodes=500)
```
This implementation uses 8 parallel environments, sampling 5 steps
each time (n-step return), greatly improving sample efficiency and
training speed.
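The n-step bootstrapped return used by such a loop can be computed per worker as in this sketch (the function name and the `dones` masking convention are my own):

```python
import numpy as np

def n_step_returns(rewards, dones, bootstrap_value, gamma=0.99):
    """n-step returns for one worker's rollout; `bootstrap_value` is the
    critic's V(s_{t+n}) estimate, and a done flag cuts off the bootstrap."""
    returns = np.zeros_like(rewards)
    R = bootstrap_value
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R * (1.0 - dones[t])
        returns[t] = R
    return returns

# 5-step rollout where the episode terminates at step index 3
r = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
d = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
ret = n_step_returns(r, d, bootstrap_value=10.0)
# ret[3] == 1.0: the done flag prevents bootstrapping across episode ends
```

Without the `(1 - dones[t])` mask, value estimates would leak across episode boundaries — a classic and hard-to-spot bug in parallel rollout code.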
## Continuous Control: DDPG and TD3

### From Discrete to Continuous: Challenges
The previous algorithms (REINFORCE, A2C) all target discrete action spaces. For continuous actions $a \in \mathbb{R}^d$ (like robot joint angles, or the steering wheel angle in autonomous driving), policies are typically modeled as a Gaussian distribution:

$$\pi_\theta(a|s) = \mathcal{N}\left(\mu_\theta(s), \sigma_\theta(s)^2\right)$$

The network outputs the mean $\mu_\theta(s)$ and standard deviation $\sigma_\theta(s)$, and the action is sampled as $a \sim \pi_\theta(\cdot|s)$.
But this stochastic policy has a problem: in some tasks (like precise
control), the optimal policy may be deterministic. Sampling each time
introduces unnecessary noise, degrading performance.
DDPG (Deep Deterministic Policy Gradient) idea: learn a deterministic policy $\mu_\theta(s)$, directly outputting the action $a = \mu_\theta(s)$, no sampling needed.
### DDPG: Deterministic Policy Gradient
DDPG (Lillicrap et al., 2016) combines ideas from DQN and Actor-Critic:

- Like DQN, it uses experience replay and target networks
- Like Actor-Critic, it separates policy (Actor) and value (Critic)
The Deterministic Policy Gradient Theorem (Silver et al., 2014) states that, for a deterministic policy $\mu_\theta$, the gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)}\right]$$

Intuition: the value function $Q(s, a)$ tells us "how good is action $a$ in state $s$"; we want the policy-outputted action $\mu_\theta(s)$ to move in the direction that increases $Q$. $\nabla_a Q$ is Q's gradient with respect to the action, pointing toward Q increase; $\nabla_\theta \mu_\theta(s)$ is the policy's gradient with respect to the parameters; the chain rule connects them.
DDPG Algorithm:

Algorithm: DDPG
1. Initialize Actor $\mu_\theta$ and Critic $Q_w$
2. Initialize target networks with $\theta' \leftarrow \theta$, $w' \leftarrow w$
3. Initialize replay buffer $\mathcal{D}$
4. for episode = 1 to M do
5.  Initialize random exploration noise $\mathcal{N}$
6.  Observe initial state $s_1$
7.  for $t = 1, 2, \ldots, T$ do
8.   Select action $a_t = \mu_\theta(s_t) + \mathcal{N}_t$
9.   Execute $a_t$, observe $r_t, s_{t+1}$
10.  Store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
11.  Sample a minibatch $(s_i, a_i, r_i, s'_i)$ from $\mathcal{D}$
12.  Compute the TD target $y_i = r_i + \gamma Q_{w'}(s'_i, \mu_{\theta'}(s'_i))$
13.  Update the Critic by minimizing $\frac{1}{N}\sum_i (y_i - Q_w(s_i, a_i))^2$
14.  Update the Actor: $\nabla_\theta J \approx \frac{1}{N}\sum_i \nabla_\theta \mu_\theta(s_i)\, \nabla_a Q_w(s_i, a)\big|_{a=\mu_\theta(s_i)}$
15.  Soft update: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, $w' \leftarrow \tau w + (1-\tau)w'$
16.  end for
17. end for
Key points:

- Line 8: deterministic policy plus exploration noise (typically an Ornstein-Uhlenbeck process)
- Line 12: the target networks compute the TD target; note the action also comes from the target Actor
- Line 15: soft update, moving the targets a little each step (small $\tau$, e.g. $10^{-3}$), smoother than DQN's hard update
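The Critic update, Actor update, and soft update can be sketched in PyTorch as follows (the single-layer networks are stand-ins for real actors/critics, and the minibatch is fabricated):

```python
import copy
import torch
import torch.nn as nn

def soft_update(target, source, tau=0.005):
    """theta' <- tau*theta + (1-tau)*theta' (Polyak averaging)."""
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

# Single-layer stand-ins for a real actor mu(s) and critic Q(s, a)
actor = nn.Linear(3, 1)
critic = nn.Linear(4, 1)          # input = concat(state, action)
actor_t = copy.deepcopy(actor)
critic_t = copy.deepcopy(critic)
a_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
c_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# A fake replay-buffer minibatch
s, a = torch.randn(32, 3), torch.randn(32, 1)
r, s2 = torch.randn(32, 1), torch.randn(32, 3)
gamma = 0.99

# Critic update: the TD target uses the *target* actor and critic
with torch.no_grad():
    y = r + gamma * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
critic_loss = ((critic(torch.cat([s, a], dim=1)) - y) ** 2).mean()
c_opt.zero_grad()
critic_loss.backward()
c_opt.step()

# Actor update: ascend Q(s, mu(s)), i.e. minimize -Q
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
a_opt.zero_grad()
actor_loss.backward()
a_opt.step()

soft_update(actor_t, actor)
soft_update(critic_t, critic)
```

The actor loss is just `-Q(s, mu(s))`: backpropagation through the critic into the actor is exactly the chain-rule product in the deterministic policy gradient theorem.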
### TD3: Twin Delayed DDPG
DDPG has a serious problem: Q-value overestimation. The reason is similar to DQN — the target $y = r + \gamma Q_{w'}(s', \mu_{\theta'}(s'))$ uses an action selected by the Actor, and the Actor is trained to maximize the Q-value; the two mutually reinforce, causing Q-values to spiral upward.
TD3 (Twin Delayed DDPG, Fujimoto et al., 2018) introduces three
tricks to mitigate this:
**Trick 1: Clipped Double Q-Learning**

Learn two Critics and take the smaller value as the target:

$$y = r + \gamma \min_{j=1,2} Q_{w'_j}(s', a')$$

Intuition: the probability of both (approximately independent) estimates overestimating is lower; taking the minimum is conservative, suppressing overestimation.
**Trick 2: Delayed Policy Updates**

The Actor updates less frequently than the Critic — the Actor updates once for every 2 Critic updates. Reason: the Critic needs to converge toward an accurate Q-value before the Actor can optimize against it. If updated synchronously, the Actor might exploit the Critic's errors, learning a wrong policy.
**Trick 3: Target Policy Smoothing**

When computing the target, add noise to the target action:

$$a' = \mu_{\theta'}(s') + \epsilon, \qquad \epsilon \sim \text{clip}\left(\mathcal{N}(0, \sigma), -c, c\right)$$

Intuition: the smoothed target won't fluctuate drastically due to a single action's anomalous Q-value, acting like regularization.
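Combining Tricks 1 and 3, the TD3 target computation might be sketched as below (single-layer stand-in networks; the noise defaults $\sigma = 0.2$, $c = 0.5$ follow the TD3 paper):

```python
import torch
import torch.nn as nn

def td3_target(r, s_next, actor_t, q1_t, q2_t,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 target: smoothed target action + minimum of the twin target critics."""
    with torch.no_grad():
        a_next = actor_t(s_next)
        # Trick 3: clipped Gaussian noise smooths the target policy
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)
        sa = torch.cat([s_next, a_next], dim=1)
        # Trick 1: pessimistic minimum of the two target critics
        q_min = torch.min(q1_t(sa), q2_t(sa))
        return r + gamma * q_min

# Stand-in target networks (a real agent would use trained copies)
actor_t = nn.Linear(3, 1)
q1_t, q2_t = nn.Linear(4, 1), nn.Linear(4, 1)
y = td3_target(torch.zeros(8, 1), torch.randn(8, 3), actor_t, q1_t, q2_t)
```

Trick 2 (delayed policy updates) lives in the training loop rather than here: the Actor and target networks are only updated every second Critic step.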
TD3 surpassed DDPG on MuJoCo continuous control tasks, becoming the
baseline for off-policy continuous control.
## Trust Region Methods: TRPO and PPO

### Policy Update Dilemma
Policy gradient methods have a fundamental problem: the learning rate is hard to tune. Too small, and training is slow; too large, and the new policy may be much worse than the old one (the gradient is only local information), causing performance collapse.

One improvement idea: limit each update's step size, ensuring the new policy isn't "too far" from the old. But how do we measure "distance"? The Euclidean distance $\|\theta - \theta_{\text{old}}\|$ isn't suitable, because distance in parameter space doesn't equal distance in policy space.
Trust region methods measure policy distance with KL divergence:

$$D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\right)$$

and constrain it below a threshold $\delta$ (like $\delta = 0.01$).
### TRPO: Rigorous Trust Region Optimization
TRPO (Trust Region Policy Optimization, Schulman et al., 2015) writes policy optimization as constrained optimization:

$$\max_\theta\; \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A^{\pi_{\text{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}\left[D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}\,\|\,\pi_\theta\right)\right] \leq \delta$$

The importance sampling weight $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$ in the objective allows evaluating the new policy with data collected by the old policy.

TRPO solves this constrained optimization with the conjugate gradient method, theoretically guaranteeing monotonic improvement (the new policy is no worse than the old). But the implementation is complex and computationally expensive.
### PPO: Proximal Policy Optimization

PPO (Schulman et al., 2017) replaces TRPO's hard constraint with a clipped objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio.

Intuition:

- If $\hat{A}_t > 0$ (good action), we want to increase $\pi_\theta(a_t|s_t)$, i.e., increase $r_t$. But clipping limits $r_t \leq 1+\epsilon$, preventing too-fast growth.
- If $\hat{A}_t < 0$ (bad action), we want to decrease $\pi_\theta(a_t|s_t)$, i.e., decrease $r_t$. But clipping limits $r_t \geq 1-\epsilon$, preventing too-fast decrease.
PPO advantages:

- Simple implementation: just add clipping to the loss
- No need to compute KL divergences or Hessian matrices
- Performance close to TRPO but faster
PPO has become the most commonly used policy gradient algorithm in
industry, widely used by OpenAI, DeepMind, etc.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import gym
import numpy as np
```
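Building on these imports, the PPO clipped surrogate loss can be sketched as below (the function name and the toy numbers are my own illustration):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO-Clip policy loss (to be minimized): -E[min(r*A, clip(r, 1±eps)*A)]."""
    ratio = (log_probs_new - log_probs_old).exp()          # r_t(theta)
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy batch: ratios 1.5 and 0.5 fall outside [0.8, 1.2] and get clipped
lp_old = torch.zeros(4)
lp_new = torch.log(torch.tensor([1.5, 1.1, 0.5, 0.9]))
adv = torch.tensor([1.0, 1.0, -1.0, -1.0])
loss = ppo_clip_loss(lp_new, lp_old, adv)   # -(1.2 + 1.1 - 0.8 - 0.9) / 4
```

In a full PPO implementation, this loss is combined with a value-function loss and an entropy bonus, and optimized for several epochs over each batch of rollout data.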
PPO typically solves CartPole within 100-200 episodes with very
smooth training curves (compared to REINFORCE).
## Maximum Entropy Reinforcement Learning: SAC

### Motivation for Entropy Regularization
Traditional RL goal is maximizing expected return. But this has a
problem: policies may prematurely converge to local optima, lacking
exploration.
One improvement idea is encouraging policy "diversity" — don't always select the same action; maintain some randomness. The metric for randomness is entropy:

$$H(\pi(\cdot|s)) = -\mathbb{E}_{a \sim \pi(\cdot|s)}\left[\log \pi(a|s)\right]$$

Higher entropy means a more random policy; zero entropy means completely deterministic.
The maximum entropy reinforcement learning objective:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\left[r(s_t, a_t) + \alpha H(\pi(\cdot|s_t))\right]$$

where $\alpha$ is the temperature coefficient controlling the entropy weight. This objective encourages policies to maximize return while maintaining exploration.
Benefits:

- Automatic exploration: no need to manually design exploration strategies (like $\epsilon$-greedy)
- Robustness: smoother policies, insensitive to environment perturbations
- Avoids local optima: the entropy bonus prevents premature policy convergence
### SAC: Soft Actor-Critic
SAC (Soft Actor-Critic, Haarnoja et al., 2018) is an off-policy algorithm under the maximum entropy framework, combining:

- Actor-Critic architecture
- Experience replay and target networks (like DDPG)
- Entropy regularization
Core ideas:

- **Soft Q-function**: the Q-value includes an entropy bonus, $Q_{\text{soft}}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s', a' \sim \pi}\left[Q_{\text{soft}}(s', a') - \alpha \log \pi(a'|s')\right]$
- **Policy update**: maximize Q-value plus entropy, $\max_\theta \mathbb{E}_{a \sim \pi_\theta}\left[Q(s, a) - \alpha \log \pi_\theta(a|s)\right]$
- **Automatic temperature tuning**: $\alpha$ is not fixed, but automatically adjusted based on a target entropy
Pseudocode (simplified):

Algorithm: SAC
1. Initialize Actor $\pi_\theta$, two Critics $Q_{w_1}, Q_{w_2}$, temperature $\alpha$
2. Initialize target networks $Q_{w'_1}, Q_{w'_2}$
3. Initialize replay buffer $\mathcal{D}$
4. for step = 1, 2, ... do
5.  Sample action $a_t \sim \pi_\theta(\cdot|s_t)$, execute and store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
6.  Sample a minibatch $(s_i, a_i, r_i, s'_i)$; sample $a'_i \sim \pi_\theta(\cdot|s'_i)$
7.  Compute the soft TD target $y_i = r_i + \gamma\left(\min_j Q_{w'_j}(s'_i, a'_i) - \alpha \log \pi_\theta(a'_i|s'_i)\right)$
8.  Update the Critics by minimizing $\frac{1}{N}\sum_i (y_i - Q_{w_j}(s_i, a_i))^2$ for $j = 1, 2$
9.  Update the Actor by minimizing $\mathbb{E}_{s, a \sim \pi_\theta}\left[\alpha \log \pi_\theta(a|s) - \min_j Q_{w_j}(s, a)\right]$
10. Update the temperature by minimizing $-\alpha\left(\log \pi_\theta(a_t|s_t) + \mathcal{H}_{\text{target}}\right)$
11. Soft update the target networks
12. end for
SAC performs excellently on continuous control tasks, balancing
sample efficiency (off-policy) and stability (maximum entropy), widely
applied to robot control.
### Code Framework

The complete SAC implementation is long (about 500 lines); here's the core part:
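Since the full listing is long, here is only a condensed sketch of SAC's three losses (the `policy(s)` interface returning a reparameterized action with its log-probability, and all network shapes, are my own assumptions):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

def sac_losses(policy, q1, q2, q1_t, q2_t, log_alpha, batch,
               gamma=0.99, target_entropy=-1.0):
    """Condensed SAC losses. `policy(s)` returns a reparameterized action
    and its log-prob; q-networks take concat(s, a)."""
    s, a, r, s2, done = batch
    alpha = log_alpha.exp()

    with torch.no_grad():                  # soft TD target with entropy bonus
        a2, logp2 = policy(s2)
        sa2 = torch.cat([s2, a2], dim=1)
        y = r + gamma * (1 - done) * (torch.min(q1_t(sa2), q2_t(sa2)) - alpha * logp2)

    sa = torch.cat([s, a], dim=1)
    q_loss = ((q1(sa) - y) ** 2).mean() + ((q2(sa) - y) ** 2).mean()

    a_new, logp = policy(s)                # reparameterized sample keeps gradients
    sa_new = torch.cat([s, a_new], dim=1)
    pi_loss = (alpha.detach() * logp - torch.min(q1(sa_new), q2(sa_new))).mean()

    alpha_loss = -(log_alpha.exp() * (logp.detach() + target_entropy)).mean()
    return q_loss, pi_loss, alpha_loss

# Toy networks and batch, just to exercise the function
mu_net = nn.Linear(3, 1)
def policy(s):
    dist = Normal(mu_net(s), torch.ones(1))
    act = dist.rsample()                   # reparameterization trick
    return act, dist.log_prob(act).sum(-1, keepdim=True)

q1, q2 = nn.Linear(4, 1), nn.Linear(4, 1)
q1_t, q2_t = nn.Linear(4, 1), nn.Linear(4, 1)
log_alpha = torch.zeros(1, requires_grad=True)
batch = (torch.randn(16, 3), torch.randn(16, 1), torch.randn(16, 1),
         torch.randn(16, 3), torch.zeros(16, 1))
q_loss, pi_loss, alpha_loss = sac_losses(policy, q1, q2, q1_t, q2_t, log_alpha, batch)
```

A real implementation adds the tanh-squashing correction to the log-probability and three separate optimizers; the loss structure, however, is exactly what's shown here.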
## Algorithm Selection

**1. By Action Space:**
- Discrete actions: DQN, A2C, PPO
- Continuous actions: DDPG, TD3, SAC, PPO

**2. By Sample Budget:**
- Expensive samples (like real robots): SAC, TD3 (off-policy, high sample efficiency)
- Abundant samples (like simulators): PPO (more stable)
**3. By Implementation Resources:**
- Fast prototyping: PPO (OpenAI Baselines and Stable Baselines have ready implementations)
- From scratch: REINFORCE or A2C (simple code)

**4. By Task Characteristics:**
- Sparse rewards: SAC (entropy encourages exploration)
- Dense rewards: PPO or TD3
- Partial observability: LSTM + A2C or PPO
- Multi-agent: MADDPG (multi-agent extension of DDPG)

**5. Industrial Applications:**
- OpenAI, DeepMind: PPO for large-scale training (like Dota 2, StarCraft II)
- Robot control: SAC (recommended by the Berkeley RL lab)
- Autonomous driving: SAC or TD3
## Deep Q&A
### Q1: Why can Policy Gradient handle continuous actions while DQN cannot?
A: The fundamental difference is policy representation. DQN learns the Q-function $Q(s, a; \theta)$, implicitly obtaining a policy through $\pi(s) = \arg\max_a Q(s, a)$. In discrete spaces (like 18 actions), enumerating all $a$ to compute $Q(s, a)$ and taking the maximum is simple. But in a continuous space (like a robot with 7 joint angles), we need to solve:

$$a^* = \arg\max_{a \in \mathbb{R}^7} Q(s, a)$$

This is a continuous optimization problem with no analytical solution, requiring iterative algorithms (like gradient ascent). Running an optimizer for every action selection is computationally expensive and imprecise.
Policy Gradient methods directly parameterize the policy:

- Discrete actions: output a Softmax distribution; sample or take the argmax
- Continuous actions: output a Gaussian mean and variance; sample $a \sim \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$

One forward pass yields an action, no optimization needed, naturally supporting continuous spaces.
### Q2: Why does REINFORCE have high variance? How to reduce it?
A: REINFORCE uses the complete return $G_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}$ to estimate the Q-value. Variance sources:

1. Trajectory randomness: $G_t$ may vary greatly across trajectories
2. Long-term accumulation: randomness accumulates over time; a longer horizon means higher variance

Methods to reduce variance:

**Method 1: Baseline.** Subtract the state value $V(s_t)$, keeping only the advantage $A_t = G_t - V(s_t)$. $V(s_t)$ is the "average" return of that state; subtracting it eliminates the state's inherent quality, focusing only on the action's relative merit. Experiments show variance can be reduced by 50%-90%.

**Method 2: Critic (Actor-Critic).** Replace the Monte Carlo estimate with a function approximation $Q_w(s, a)$ or $V_w(s)$, introducing bias but greatly reducing variance. The TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ depends only on a one-step transition, so its variance is much smaller than $G_t$'s.

**Method 3: Multi-step Return.** Use the n-step return $G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$ to trade off bias and variance. $n = 1$ is TD (low variance, high bias); $n = \infty$ is Monte Carlo (high variance, low bias). In practice moderate values (e.g., $n = 5$, as used in A3C) work well.
### Q3: How does PPO's clipping mechanism work?

A: PPO's objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio between new and old policies.

**Case 1: $\hat{A}_t > 0$** (good action, want to increase probability)

- If $r_t \leq 1+\epsilon$: the min takes the first term $r_t \hat{A}_t$, increasing normally
- If $r_t > 1+\epsilon$: the min takes the clipped term $(1+\epsilon)\hat{A}_t$, whose gradient is 0 (the clip is a constant), so growth stops

Intuition: a good action's probability can't grow too fast — at most to $1+\epsilon$ times the old policy (like 1.2x).

**Case 2: $\hat{A}_t < 0$** (bad action, want to decrease probability)

- If $r_t \geq 1-\epsilon$: the min takes the first term $r_t \hat{A}_t$ (note $\hat{A}_t < 0$, so this is decreasing)
- If $r_t < 1-\epsilon$: the min takes the clipped term $(1-\epsilon)\hat{A}_t$, whose gradient is 0, so the decrease stops

Intuition: a bad action's probability can't decrease too fast — at least $1-\epsilon$ times the old policy (like 0.8x).

This mechanism ensures the new policy's KL divergence from the old won't be too large, avoiding the "one step too far" performance collapse.
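The case analysis can be checked numerically with a few lines of plain Python (the ratios and advantages are hypothetical values):

```python
def clipped_term(ratio, adv, eps=0.2):
    """Value of min(r*A, clip(r, 1-eps, 1+eps)*A) for one sample."""
    clipped = min(max(ratio, 1 - eps), 1 + eps) * adv
    return min(ratio * adv, clipped)

# Good action (A = +2): improvement stops once r exceeds 1 + eps
print(clipped_term(1.1, 2.0))    # 2.2  (unclipped branch, still improving)
print(clipped_term(1.5, 2.0))    # 2.4  (capped at (1 + eps) * A, zero gradient)

# Bad action (A = -2): the decrease stops once r falls below 1 - eps
print(clipped_term(0.9, -2.0))   # -1.8 (unclipped branch)
print(clipped_term(0.5, -2.0))   # -1.6 (floored at (1 - eps) * A, zero gradient)
```

Once the ratio leaves the $[1-\epsilon, 1+\epsilon]$ band, the surviving term is constant in $\theta$, so its gradient vanishes and the update stalls — exactly the "proximal" behavior described above.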
### Q4: What are the core differences between DDPG and TD3?
A: TD3 adds three tricks on top of DDPG to solve
Q-value overestimation:
**1. Clipped Double Q-Learning**

- DDPG: single Critic, target $y = r + \gamma Q_{w'}(s', \mu_{\theta'}(s'))$
- TD3: two Critics, target $y = r + \gamma \min_{j=1,2} Q_{w'_j}(s', a')$

Taking the minimum suppresses overestimation. Reason: the probability of two independent Q-networks simultaneously overestimating the same action is lower, so the minimum is conservative.
**2. Delayed Policy Updates**

- DDPG: Actor and Critic update synchronously, both every step
- TD3: the Critic updates twice for every single Actor update
Reason: the Critic needs to converge toward an accurate Q-value before the Actor can optimize based on an accurate gradient. If updated synchronously, the Actor might exploit the Critic's errors and learn a wrong policy.
**3. Target Policy Smoothing**

- DDPG: target action $a' = \mu_{\theta'}(s')$
- TD3: $a' = \mu_{\theta'}(s') + \epsilon$, $\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)$

Adding noise smooths the Q-function, avoiding drastic target fluctuation due to a single action's anomalous Q-value.
Experiments show all three tricks are important; combined TD3
comprehensively surpasses DDPG on MuJoCo, becoming new off-policy
continuous control baseline.
### Q5: How does SAC's automatic temperature tuning work?
A: SAC's objective:

$$J(\pi) = \sum_t \mathbb{E}\left[r_t + \alpha H(\pi(\cdot|s_t))\right]$$

The temperature $\alpha$ controls the entropy weight: a large $\alpha$ means a more random policy (exploration); a small $\alpha$ means a more deterministic one (exploitation).

Early SAC set $\alpha$ manually, but the optimal $\alpha$ varies greatly across tasks. Later work (Haarnoja et al., 2018) proposed automatic tuning:

**Target Entropy Constraint**: set a target entropy $\mathcal{H}_{\text{target}}$ (like $-\dim(\mathcal{A})$, the negative action dimension), requiring:

$$\mathbb{E}_{s}\left[H(\pi(\cdot|s))\right] \geq \mathcal{H}_{\text{target}}$$

meaning the policy's average entropy must not fall below the target.

**Dual Optimization**: treat $\alpha$ as a Lagrange multiplier; the optimization becomes minimizing:

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi}\left[-\alpha \log \pi(a_t|s_t) - \alpha\, \mathcal{H}_{\text{target}}\right]$$

The gradient with respect to $\alpha$ is the entropy surplus $H(\pi) - \mathcal{H}_{\text{target}}$, so:

- If the current policy entropy $H < \mathcal{H}_{\text{target}}$ (too deterministic), $\alpha$ increases, strengthening the entropy bonus and encouraging exploration
- If $H > \mathcal{H}_{\text{target}}$ (too random), $\alpha$ decreases, pushing the policy toward determinism

In implementation, parameterizing $\alpha = \exp(\log \alpha)$ ensures non-negativity, and we optimize:

$$J(\log \alpha) = \mathbb{E}_{a_t \sim \pi}\left[-\alpha\left(\log \pi(a_t|s_t) + \mathcal{H}_{\text{target}}\right)\right]$$

Note that $\mathbb{E}[\log \pi(a_t|s_t)]$ is the negative entropy. This way $\alpha$ automatically adjusts to make the policy entropy approach the target, no manual tuning needed.
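A minimal sketch of the temperature update (the learning rate and the batch of fake log-probabilities are arbitrary choices for illustration):

```python
import torch

target_entropy = -1.0                      # e.g. -dim(A) for a 1-D action space
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi):
    """One gradient step on J(log alpha) = E[-alpha (log pi + H_target)]."""
    loss = -(log_alpha.exp() * (log_pi.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()

# Fake log-probs of 2.0 => entropy of -2.0, below the target of -1.0
# (too deterministic), so alpha should rise over the updates.
for _ in range(200):
    alpha = update_alpha(torch.full((32,), 2.0))
```

In a real SAC loop, `log_pi` comes from the actor's freshly sampled actions each step, so the feedback between policy entropy and $\alpha$ is continuous.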
### Q6: Why is PPO more popular than TRPO?
A: TRPO is theoretically more rigorous (monotonic improvement guarantee), but PPO is more popular in practice, for several reasons:
**1. Simple Implementation**

- TRPO: needs to compute the Hessian inverse or use the conjugate gradient method; involves second-order optimization; complex code (about 1000 lines)
- PPO: just modify the loss function by adding the clip; first-order optimization (Adam); simple code (about 200 lines)
**2. Computational Efficiency**

PPO needs only first-order gradients and supports multiple minibatch epochs per batch of collected data; TRPO's conjugate gradient and line search make each update far more expensive.
**3. Comparable Performance**

Experiments show PPO's performance is close to or surpasses TRPO's on most tasks. PPO's clip mechanism, though heuristic, is very effective in practice.

**4. Hyperparameter Robustness**

- TRPO: sensitive to the choice of the KL constraint $\delta$, requiring careful tuning
- PPO: the clip range $\epsilon = 0.2$ works on most tasks; good robustness

**5. Easy Extension**

PPO's clip mechanism easily combines with other techniques (like GAE, reward shaping, curriculum learning), while TRPO's constrained optimization framework is less flexible.
OpenAI used PPO to train the Dota 2 AI and in ChatGPT's RLHF phase, validating its feasibility for large-scale applications.
### Q7: On-policy vs. off-policy — what are the pros and cons?
A:

**On-policy (REINFORCE, A2C, PPO, TRPO):**

Pros:
- Good stability: training data matches the current policy, so distributions are consistent and gradient estimates are more accurate
- Simple theory: directly optimizes the expected return; no importance sampling correction needed
- Easy implementation: doesn't need experience replay or other complex mechanisms

Cons:
- Low sample efficiency: each experience is used only once (generated by the current policy), then discarded
- Limited data reuse: though sampling can be parallelized across environments (like A3C), the data must come from the current policy
- Insufficient exploration: relies on the policy's own randomness; may converge prematurely
**Off-policy (DQN, DDPG, TD3, SAC):**

Pros:
- High sample efficiency: experience replay allows reusing data, each experience used dozens of times
- Flexible exploration: can use any exploration policy (like $\epsilon$-greedy or OU noise) to collect data
- Supports offline data: can learn from demonstrations or historical data (imitation learning)

Cons:
- Training instability: the data distribution doesn't match the target policy; needs importance sampling corrections or target networks for stabilization
- Complex theory: involves off-policy correction and distributional shift issues
- May diverge: function approximation + off-policy + bootstrapping (the deadly triad) easily fails to converge
**Selection Advice:**
- Expensive samples (like real robots): off-policy (SAC, TD3)
- Abundant samples (like simulators, Atari): on-policy (PPO)
- Need stable training: on-policy
- Need to learn from demonstrations: off-policy
### Q8: How do the Actor and Critic in Actor-Critic mutually promote each other?
A: Actor-Critic is an iterative improvement process.

**Critic → Actor (critic guides actor):**

The Critic learns a value function $V_w(s)$ or $Q_w(s, a)$, evaluating "how good is this state" or "how good is this action". The Actor optimizes the policy based on this evaluation:

- If the Critic says $A(s, a) > 0$, the Actor increases $\pi(a|s)$; if $A(s, a) < 0$, it decreases it
- The policy gradient formula $\nabla_\theta \log \pi_\theta(a|s)\, \hat{A}(s, a)$ precisely embodies this

This way the Actor isn't blindly doing trial-and-error, but improves directionally (toward higher Q-value).
**Actor → Critic (actor helps critic):**

The Critic needs data to learn the value function. The Actor generates new trajectories, providing training data:

- On-policy: the Critic uses data generated by the current Actor for TD updates
- Off-policy: the Critic uses data from the experience replay, covering a broader state space
More importantly, the Actor's improvement lets the policy reach higher-value states, so the Critic can learn a more accurate value function (more positive samples).
**Positive Feedback Loop:**

Better Critic → Actor improves faster → accesses better states → Critic learns more accurately → Actor further improves

This loop eventually converges to the optimal policy and optimal value function (theoretically; in practice it may converge to a local optimum).
The key is balancing their learning rates:

- If the Actor learns too fast, the Critic can't keep up and provides wrong value estimates, so the Actor learns incorrectly
- If the Critic learns too fast while the Actor updates too slowly, accurate value information is wasted

Typically the Actor's learning rate is set lower than the Critic's (e.g., $10^{-4}$ vs. $10^{-3}$, as in DDPG), letting the Critic learn first.
### Q9: How to choose the discount factor $\gamma$?
A: The discount factor $\gamma$ controls the emphasis on future rewards:

- $\gamma = 0$: only care about the immediate reward (myopic)
- $\gamma \to 1$: future rewards weighted nearly equally with immediate ones (farsighted)

**Theoretical Considerations:**

Task duration:

- Short-term tasks (like CartPole, tens of steps to termination): $\gamma = 0.9$ to $0.99$
- Long-term tasks (like some Atari games, thousands of steps): $\gamma = 0.99$ or $0.999$
- Infinite-horizon tasks: must have $\gamma < 1$ to ensure bounded returns

The effective planning horizon is roughly $1/(1-\gamma)$. If a task needs 100-step planning but $\gamma = 0.9$, the agent can at most look about 10 steps ahead and can't learn a long-term strategy.

Variance and bias:

- Large $\gamma$: high return variance (accumulates more randomness), but low bias (accurately reflects long-term value)
- Small $\gamma$: low variance but myopic (underestimates long-term value)
**Practical Advice:**

- Start with $\gamma = 0.99$ (common default)
- If training is unstable (high variance), reduce it appropriately (like 0.95)
- If the task obviously needs long-term planning but the agent can't learn it, increase $\gamma$ (like 0.995 or 0.999)
- Some tasks can use a varying $\gamma$ (curriculum learning): initially a small $\gamma$ quickly learns short-term strategy; later increase $\gamma$ to learn long-term planning

Examples:

- CartPole: a short task; $\gamma = 0.99$ is sufficient
- Atari Pong: $\gamma = 0.99$ (hundreds of steps per game; need to predict the ball trajectory)
- Go: $\gamma = 1$ (or very close to 1, like 0.9999), because global planning is needed
### Q10: How to debug policy gradient algorithms?
A: Policy gradient algorithms are harder to debug than supervised learning because reward signals are sparse and delayed. A systematic debugging process:
**1. Check Environment and Data**

- Random-policy average return: if it beats the agent, the agent has problems
- Manual-policy return: what can a human expert achieve? Where's the upper bound?
- Reward distribution: check for outliers (like a sudden +1000) that may cause training instability
**2. Simplify the Problem to Verify the Code**

- Test on a simple environment (like CartPole): it should be solved within 100-200 episodes
- If even the simple environment doesn't work, the code has a bug (check gradient computation, advantage estimation, etc.)
**3. Monitor Key Metrics**

- Policy entropy $H(\pi)$: should gradually decrease (policy going from random to deterministic), but shouldn't drop to 0 too fast (premature convergence)
- Advantage mean and variance: the mean should approach 0 (baseline effective); the variance should gradually decrease
- Value function error $(V_w(s) - G)^2$: should gradually decrease; if it stays large long-term, the Critic didn't learn well
- KL divergence (PPO/TRPO): $D_{\text{KL}}(\pi_{\text{old}} \| \pi_{\text{new}})$ per update should be in the target range (like 0.01-0.05)
**4. Visualize Policy Behavior**

- Render several episodes and watch what the agent is doing: random wandering, or a clear strategy?
- Check the action distribution: are some actions never selected? (possible network initialization issue)
**5. Check Hyperparameters**

- Learning rate too large: the training curve oscillates drastically
- Learning rate too small: convergence is extremely slow
- Batch size too small: high gradient-estimate variance, unstable
- $\gamma$ too small: myopic, can't learn long-term strategy
**6. Common Bugs**

- Forgot to detach the target: the value function's target $G_t$ or the TD target $r + \gamma V(s')$ shouldn't carry gradients
- Advantages not normalized: $\hat{A}_t$'s scale affects learning; it's best to normalize
- Rewards not clipped: some environments have vastly different reward scales; normalize or clip
- Insufficient exploration: the deterministic policy didn't add noise, or the noise is too small
- Gradient explosion: apply gradient clipping
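Two of these fixes — advantage normalization and gradient clipping — take only a few lines (the helper name is my own):

```python
import torch

def normalize_advantages(adv, eps=1e-8):
    """Zero-mean, unit-std advantages keep the policy-loss scale stable."""
    return (adv - adv.mean()) / (adv.std() + eps)

adv = torch.tensor([100.0, -50.0, 3.0, -7.0])
norm_adv = normalize_advantages(adv)

# Gradient clipping is applied between backward() and step(), e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Normalizing per batch means the policy loss has a comparable scale regardless of the environment's reward magnitudes, which also makes learning rates transfer better across tasks.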
**7. Compare with a Baseline**

- Run the same task with a mature library like Stable Baselines3 and compare performance
- If the library's results are much better, your implementation has problems
- If the library also fails, the task may be too hard or the hyperparameters may need special tuning
**8. Gradually Increase Complexity**

- First use the simplest REINFORCE to verify the environment and data flow
- Then add a baseline and check whether the variance decreases
- Then upgrade to A2C and check whether the Critic is effective
- Finally upgrade to PPO/SAC etc., and enjoy the performance improvement
Debugging RL requires patience and systematic methods; recording all metrics with tools like TensorBoard makes comparison and backtracking easy.
## References

Core papers in Policy Gradient and Actor-Critic:
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. *Machine Learning*, 8(3-4), 229-256. — REINFORCE, pioneering work in policy gradient methods.
- Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. *NIPS*. — Rigorous proof of the policy gradient theorem.
- Silver, D., Lever, G., Heess, N., et al. (2014). Deterministic policy gradient algorithms. *ICML*. — Deterministic policy gradient, theoretical foundation of DDPG.
- Mnih, V., Badia, A. P., Mirza, M., et al. (2016). Asynchronous methods for deep reinforcement learning. *ICML*. arXiv:1602.01783. — A3C, the first on-policy method to compete with DQN on Atari.
- Schulman, J., Levine, S., Abbeel, P., et al. (2015). Trust region policy optimization. *ICML*. arXiv:1502.05477. — TRPO, introducing a trust region constraint guaranteeing monotonic improvement.
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., et al. (2016). Continuous control with deep reinforcement learning. *ICLR*. arXiv:1509.02971. — DDPG, extending DQN to continuous action spaces.
- Schulman, J., Wolski, F., Dhariwal, P., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347. — PPO, the most commonly used policy gradient algorithm in industry.
- Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. *ICML*. arXiv:1802.09477. — TD3, solving DDPG's Q-value overestimation problem.
- Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. *ICML*. arXiv:1801.01290. — SAC, an off-policy algorithm under the maximum entropy framework.
- Haarnoja, T., Zhou, A., Hartikainen, K., et al. (2018). Soft actor-critic algorithms and applications. arXiv:1812.05905. — SAC applications and automatic temperature tuning.
- Schulman, J., Moritz, P., Levine, S., et al. (2016). High-dimensional continuous control using generalized advantage estimation. *ICLR*. arXiv:1506.02438. — GAE, an important technique for reducing policy gradient variance.
From REINFORCE's Monte Carlo policy gradient to Actor-Critic's TD
methods, from A3C's asynchronous parallelization to PPO's clipping
tricks, from DDPG's deterministic policy to SAC's maximum entropy
framework — policy gradient methods have developed a rich technical
stack over the past thirty years. These algorithms not only broke
through DQN's discrete action limitation, shining in continuous control
tasks, but also provided diverse solutions for exploration-exploitation
balance, sample efficiency, and training stability. PPO has become
industry's first choice with its simplicity and robustness, while SAC
and TD3 dominate in robot control where performance is paramount.
However, model-free methods' sample efficiency remains a bottleneck —
even the most advanced SAC requires millions of interactions on complex
tasks. The next chapter will explore model-based methods: by learning
environment models and planning within models, dramatically reducing
real environment interactions, leading us into the world of algorithms
like Dyna, MuZero, and Dreamer.
Post title: Reinforcement Learning (3): Policy Gradient and Actor-Critic Methods
Post author: Chen Kai
Create time: 2024-08-16 10:45:00
Post link: https://www.chenk.top/reinforcement-learning-3-policy-gradient-and-actor-critic/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.