One of the central challenges in reinforcement learning is the
exploration-exploitation dilemma. An agent that only exploits known good
policies may never discover better solutions; but excessive exploration
wastes time on low-reward behaviors. Traditional methods like $\epsilon$-greedy and Boltzmann exploration
rely on randomness, which becomes extremely inefficient in
high-dimensional state spaces with sparse rewards — imagine in a game
like Montezuma's Revenge, where the agent needs hundreds of precise
steps to receive the first reward; pure random exploration virtually
never succeeds. In recent years, inspired by cognitive science
theories of "intrinsic motivation," researchers have proposed
curiosity-driven learning — rewarding agents with intrinsic rewards for
exploring novel states, enabling continuous learning even when external
rewards are zero. From count-based methods to ICM's prediction error,
from RND's random network distillation to NGU's episodic memory,
exploration strategies have evolved into a complete theoretical and
engineering framework. This chapter systematically traces this
evolution, deeply analyzing the design motivation, mathematical
principles, and implementation details of each method, ultimately
validating their effectiveness on Atari hard-exploration games.
The Exploration-Exploitation Dilemma: The Essence of the Problem
Why Exploration is So Difficult
In the DQN of Chapter 2 and the policy gradient methods of Chapter 3,
we assumed agents could collect effective training data through
exploration. But in practice, exploration itself is a core
challenge:
Problem 1: Sparse Rewards and Delayed Feedback
Consider the classic Montezuma's Revenge game: the agent must first
get a key, open a door, and jump over enemies — executing over a hundred
operations before receiving the first +100 point reward. Before that,
all states have zero return. If using $\epsilon$-greedy random exploration, the
probability of finding an effective trajectory is on the order of
$(1/10)^{100} = 10^{-100}$ (assuming 10 candidate actions at each step),
which is practically impossible.
Problem 2: Sparsity in High-Dimensional State Spaces
Atari game states are $210 \times 160$ pixel images, giving on the order
of $256^{210 \times 160}$ possible states. Even if the agent explores one
million states, the coverage is essentially zero — the vast majority
of states will never be visited.
Problem 3: Attraction Basins of Local Optima
In environments with deceptive rewards, agents easily fall into local optima.
For example, in a maze, if there's a small reward in a dead end, the
agent might repeatedly collect this small reward instead of exploring
the path to larger rewards.
Problem 4: Deceptiveness of Deterministic Environments
In deterministic environments, $\epsilon$-greedy causes agents to
repeatedly try the same random action in the same state, wasting
significant time. Worse, certain "interesting" states (like TV noise)
make agents engage in "pseudo-exploration," thinking they're learning
something new when it's actually worthless.
Limitations of Traditional Exploration Strategies
$\epsilon$-greedy Exploration
During early training, with probability $\epsilon$ select a random action, and with
probability $1-\epsilon$ select the greedy action:

$$a_t = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1-\epsilon \end{cases}$$

Advantages: Simple and effective, suitable for low-dimensional discrete spaces (like FrozenLake).
Disadvantages:
- Exploration is aimless — in high-dimensional spaces, random actions almost never encounter meaningful states
- The $\epsilon$-decay schedule is hard to tune — decaying too fast leads to premature convergence, too slow hurts final performance
- Even worse in continuous action spaces — adding noise to actions easily produces unreasonable actions
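The selection rule above can be sketched in a few lines (the function name and array-based Q-values are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise act greedily on the Q-value estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy(q, epsilon=0.0, rng=rng))  # epsilon=0 is purely greedy -> 1
```

With $\epsilon = 0$ the rule is purely greedy; with $\epsilon = 1$ it is uniform random — the decay schedule interpolates between the two.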
Boltzmann Exploration (Softmax)
Sample actions according to the softmax distribution of Q-values:

$$\pi(a \mid s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}$$

where $\tau$ is the temperature parameter controlling exploration intensity. As $\tau \to 0$ it approaches greedy selection; as $\tau \to \infty$ it approaches uniform random.
Advantages: Adaptively adjusts exploration based on Q-value magnitudes, theoretically more elegant.
Disadvantages:
- Still random, aimless exploration
- When Q-value estimates are inaccurate (early in training), easily produces bias
- Computing the softmax requires enumerating all actions, infeasible in large action spaces
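The temperature behavior is easy to verify numerically (a minimal sketch; the max-subtraction trick is for numerical stability):

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Softmax over Q-values with temperature tau; subtracting the max
    before exponentiating keeps the computation numerically stable."""
    z = (np.asarray(q_values) - np.max(q_values)) / tau
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 3.0])
print(boltzmann_policy(q, tau=0.1))    # low tau: nearly one-hot on the argmax
print(boltzmann_policy(q, tau=100.0))  # high tau: nearly uniform
```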
Upper Confidence Bound (UCB)
In the multi-armed bandit problem, UCB selects:

$$a_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right]$$

where $t$ is the total number of samples, $N(a)$ is the number of samples for action $a$, and $c$ is the exploration coefficient. The second term is an "optimistic bonus" — the less an action is tried, the higher its confidence upper bound.
Advantages: Theoretically has regret bound guarantees, near-optimal in bandit problems.
Disadvantages:
- Requires explicitly counting visits to each state-action pair
- In large state spaces, most $(s,a)$ pairs are visited at most once, so the confidence intervals are uninformative
- Cannot generalize to unseen states
Thompson Sampling
Maintain a posterior distribution $p(Q)$ over Q-values; at each step sample a Q-function and act greedily on it:

$$\tilde{Q} \sim p(Q), \qquad a_t = \arg\max_a \tilde{Q}(s_t, a)$$

Advantages: Elegant Bayesian framework, exploration naturally arises from uncertainty.
Disadvantages:
- Requires maintaining a distribution over the entire Q-function, computationally expensive with deep networks (though approximations like Bootstrapped DQN exist)
- Still relies on Q-value estimation; with sparse rewards, when all Q-values are zero it degenerates to random exploration
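In the Bernoulli-bandit case the posterior is a Beta distribution, and Thompson sampling reduces to a few lines (a sketch; the per-arm success/failure counts are the sufficient statistics):

```python
import numpy as np

def thompson_action(successes, failures, rng):
    """Bernoulli-bandit Thompson sampling: draw one mean-reward sample per
    arm from its Beta(successes+1, failures+1) posterior, then act greedily."""
    samples = rng.beta(np.asarray(successes) + 1, np.asarray(failures) + 1)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
print(thompson_action(successes=[90, 5], failures=[10, 5], rng=rng))
```

Arms with few observations have wide posteriors and are occasionally sampled high — exploration emerges from uncertainty rather than explicit randomness.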
The Essence of Exploration: Seeking Novelty
Observing human and animal exploration behavior, we find a common
feature: not random wandering, but actively seeking novel,
unexpected, and uncertain things. Infants repeatedly play with
new toys (not old ones), monkeys actively explore unknown maze areas,
scientists pursue experimental phenomena that violate common sense.
This "curiosity" can be formalized as intrinsic
reward — even when the external environment provides no reward,
agents receive internal rewards for state novelty:

$$r_t = r_t^{\text{ext}} + \beta\, r_t^{\text{int}}$$

where:
- $r_t^{\text{ext}}$ is the external reward from the environment (task objective)
- $r_t^{\text{int}}$ is the intrinsic reward computed by the agent itself (exploration motivation)
- $\beta$ is the balance coefficient

The key question: How to define and compute $r_t^{\text{int}}$ so it truly measures state novelty?
The most intuitive "novelty" definition is: How many times
have I seen this state before? The less visited, the more
novel, the higher the intrinsic reward.
The simplest implementation uses the reciprocal of the visit count $N(s)$ as the intrinsic reward:

$$r^{\text{int}}(s) = \frac{\beta}{\sqrt{N(s)}}$$

or the more aggressive form $r^{\text{int}}(s) = \beta / N(s)$.
Theoretical basis: This is equivalent to maximizing
state-visit entropy, encouraging uniform coverage of the state space.
Naive Implementation: Hash Table Counting
In small discrete state spaces (like GridWorld), one can directly use a
dictionary:
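A sketch of that dictionary-based counter (the class name and the $\beta/\sqrt{N(s)}$ bonus form are illustrative):

```python
from collections import defaultdict

class TabularCountBonus:
    """Exact visit counts for small discrete state spaces:
    r_int(s) = beta / sqrt(N(s))."""
    def __init__(self, beta=1.0):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state):
        # States must be hashable: tuples work for grid coordinates.
        self.counts[state] += 1
        return self.beta / self.counts[state] ** 0.5

b = TabularCountBonus(beta=1.0)
print(b.bonus((0, 0)))  # first visit  -> 1.0
b.bonus((0, 0)); b.bonus((0, 0))
print(b.bonus((0, 0)))  # fourth visit -> 0.5
```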
Problem: In high-dimensional continuous state spaces
(like Atari pixels), states almost never repeat, so $N(s) = 1$ always
holds, degenerating the bonus into a constant reward.
Improvement: Density Models and Pseudo-Counts
Bellemare et al. (2016), in the paper Unifying
Count-Based Exploration and Intrinsic Motivation, proposed using a
density model $\rho(s)$ to estimate the state's "occurrence frequency":

$$\hat{N}(s) = \frac{\rho(s)\,(1 - \rho'(s))}{\rho'(s) - \rho(s)}$$

where $\rho$ is learned by a generative model (like PixelCNN) and $\rho'(s)$ is the "recoding probability" — the density the model assigns to $s$ immediately after being updated on $s$. Specifically:

Maintain a generative model and estimate state density via its likelihood
Online update: for each visited state $s_t$, take a gradient step maximizing $\log \rho(s_t)$, then evaluate $\rho'(s_t)$

Mathematical Derivation
In the ideal case, if $\rho$ is the empirical distribution, then:

$$\rho(s) = \frac{N(s)}{n}, \qquad \rho'(s) = \frac{N(s) + 1}{n + 1}$$

where $\rho'$ is the distribution after visiting $s$. Solving these two equations for $N(s)$ yields the pseudo-count formula above. In practice, $\rho$ is approximated with a neural network, and the likelihood is computed before and after the visit:
import torch

class CountBasedIntrinsic:
    def __init__(self, obs_shape, beta=0.05, lr=1e-4):
        # DensityModel is assumed to be a generative model returning log p(obs)
        self.model = DensityModel(obs_shape)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.beta = beta

    def compute_intrinsic_reward(self, obs):
        """Compute intrinsic reward and update the density model."""
        obs_tensor = torch.FloatTensor(obs).unsqueeze(0)

        # Density before the update
        with torch.no_grad():
            log_p_old = self.model(obs_tensor)
            p_old = torch.exp(log_p_old)

        # One gradient step on the log-likelihood of the visited state
        log_p = self.model(obs_tensor)
        loss = -log_p
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Density after the update (the "recoding probability")
        with torch.no_grad():
            log_p_new = self.model(obs_tensor)
            p_new = torch.exp(log_p_new)

        # Pseudo-count: N_hat = p_old * (1 - p_new) / (p_new - p_old)
        pseudo_count = p_old * (1 - p_new) / (p_new - p_old + 1e-8)
        pseudo_count = torch.clamp(pseudo_count, 1, 1e6)  # prevent numerical issues

        intrinsic_reward = self.beta / torch.sqrt(pseudo_count)
        return intrinsic_reward.item()
Advantages and Limitations
Advantages:
- Clear theory, directly derived from the statistical meaning of visit frequency
- Significant gains on small-scale problems (like early versions of Montezuma's Revenge)

Limitations:
- Density models are hard to train — models like PixelCNN are computationally expensive and easily overfit
- "Noisy-TV problem": in stochastic environments, every state looks "new" (due to pixel noise), so the intrinsic reward never decays
- Cannot generalize — two visually similar but pixel-different states are treated as completely different
ICM: Measuring Novelty Through Prediction Error
Motivation: From Pixels to Features
Count-based methods count directly on raw pixels, ignoring a fact:
some pixel changes are irrelevant (like background
leaves swaying), while others are critical (like enemy movement).
Curiosity-Driven Exploration by Self-Supervised
Prediction (Pathak et al., ICML 2017) proposed ICM (Intrinsic
Curiosity Module): instead of comparing pixels directly, test whether
the agent can predict the features of the next state.
Core insight: - If state transition is predictable → agent already
understands environment dynamics → low reward - If state transition is
unpredictable → agent encountered new phenomenon → high reward
ICM Architecture:
Forward and Inverse Models
ICM contains two modules:
1. Feature Encoder
Map the raw state $s_t$ to a low-dimensional feature:

$$\phi_t = \phi(s_t)$$

Usually implemented with convolutional networks, preserving only information related to agent actions and filtering out background noise.
2. Forward Dynamics Model
Predict the next state's features:

$$\hat{\phi}_{t+1} = f(\phi_t, a_t)$$

where $f$ is a small MLP. The forward model's prediction error is the intrinsic reward:

$$r_t^{\text{int}} = \frac{\eta}{2} \left\| \hat{\phi}_{t+1} - \phi_{t+1} \right\|^2$$

3. Inverse Dynamics Model
Predict the action from features (self-supervised auxiliary task):

$$\hat{a}_t = g(\phi_t, \phi_{t+1})$$

The inverse model's role is to force the feature encoder to preserve only action-relevant information. If $\phi$ only encodes the background, the inverse model cannot predict actions; only by encoding the agent and interactive objects can it succeed.
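The three modules can be sketched as a single torch module. This is a minimal sketch for vector observations — the paper uses a convolutional encoder on pixels, and the layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Minimal ICM sketch: forward-model error gives the intrinsic reward;
    the inverse model's action-prediction loss shapes the feature encoder."""
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.n_actions = n_actions

    def forward(self, obs, next_obs, action):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        a_onehot = nn.functional.one_hot(action, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        # Intrinsic reward = forward prediction error per transition
        r_int = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(-1)
        # Inverse model: predict the action from consecutive features
        logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inv_loss = nn.functional.cross_entropy(logits, action)
        return r_int.detach(), r_int.mean() + inv_loss  # (reward, training loss)
```

In practice the returned loss is added to the PPO objective, and the detached `r_int` is mixed into the environment reward.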
Mathematical Intuition
Why does prediction error measure novelty? Consider two extremes:
Case 1: Agent randomly walks in an empty room
The state transition is deterministic: $s_{t+1} = f(s_t, a_t)$. The forward model quickly learns perfect prediction, $r^{\text{int}} \to 0$, and the agent stops exploring this area.
Case 2: Agent enters a new room, encounters a moving enemy
The enemy's behavior is unknown, so the forward model has large prediction error, $r^{\text{int}}$ is high, and the agent is motivated to explore this area. As interaction increases, the model gradually learns the enemy's behavior patterns and $r^{\text{int}}$ decreases.
Case 3: TV noise (random pixels)
Raw pixels are completely random, but if the feature encoder works correctly, it should map the noise to the same features (because the noise is irrelevant to actions). The inverse model's supervision signal forces $\phi$ to ignore the noise, so $\hat{\phi}_{t+1} \approx \phi_{t+1}$ and $r^{\text{int}} \to 0$, solving the noisy-TV problem!
That said, ICM has its own limitations:
- Depends on environment determinism: in stochastic environments, unpredictability doesn't necessarily equal novelty (it might just be noise)
- Feature encoder quality: if $\phi$ fails to learn, the entire ICM fails
- Computational overhead: three additional networks must be trained (encoder, forward, inverse), roughly 2x slower than vanilla PPO
RND: Random Network Distillation
Motivation: Avoid Training Dynamics Models
One problem with ICM is the need to train a forward model $f$ to predict
$\phi(s_{t+1})$. In complex environments, the dynamics model itself is
hard to learn; if it doesn't learn well, intrinsic rewards become unreliable.
Exploration by Random Network Distillation (Burda et
al., ICLR 2019) proposed RND: completely avoid dynamics modeling,
instead predict outputs of a random network.
Core idea:
- Fix a randomly initialized target network $f(s)$ (never trained)
- Train a predictor network $\hat{f}(s; \theta)$ to fit the target network's outputs
- The prediction error $r_t^{\text{int}} = \| \hat{f}(s_t) - f(s_t) \|^2$ serves as the intrinsic reward

Why does this work?
For frequently visited states, the predictor has seen them many times, so the prediction error is small → low $r^{\text{int}}$
For new states, the predictor hasn't seen them, so the prediction error is large → high $r^{\text{int}}$
The random network acts as a "hash function," mapping similar states to similar outputs, naturally achieving generalization
Better yet, this avoids the "noisy-TV problem": even if pixels are
random, random network outputs are similar for states of the same type
(like "noisy screen"), predictor network quickly learns to ignore
noise.
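The target/predictor pair can be sketched in a few lines (vector observations and MLP sizes are illustrative; the paper uses convolutional networks on pixels):

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """RND sketch: a frozen, randomly initialized target network and a
    trained predictor; the per-state prediction error is the intrinsic reward."""
    def __init__(self, obs_dim, out_dim=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, out_dim))
        self.target, self.predictor = mlp(), mlp()
        for p in self.target.parameters():  # the target is never trained
            p.requires_grad_(False)

    def forward(self, obs):
        with torch.no_grad():
            y = self.target(obs)
        # Per-state squared error: detach it when using it as a reward,
        # take .mean() over the batch as the predictor's training loss.
        return (self.predictor(obs) - y).pow(2).mean(dim=-1)
```

Training the predictor on visited states drives their error (and thus their intrinsic reward) toward zero, while unvisited states keep a high error.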
Burda et al. demonstrated RND's remarkable effectiveness in their
paper:
Montezuma's Revenge: Average score 8152 (first time
surpassing human-level 7385)
Pitfall: Score 70.4 (previous SOTA was 0)
Gravitar: Improved from DQN's 0 to 3500+
points
Key findings:
1. RND outperforms ICM on all Atari hard-exploration games
2. Intrinsic returns are treated as non-episodic — exploration is a long-term goal that shouldn't end with the episode
3. Normalization is crucial — without it, the intrinsic reward scale explodes
RND has an implicit assumption: once a state has been visited, it is
never novel again. But in actual exploration, this is too
strict:
Scenario 1: Key-door task
In Montezuma's Revenge, the agent needs to repeatedly "get key → open
door → enter room." If after getting the key the first time, RND
considers the key location no longer novel, the agent loses motivation
to get the key again.
Scenario 2: Reversible exploration
Agent first explores right room, then explores left room. When
returning to starting point, RND gives low reward (because starting
point was visited many times), but this "return to start" is necessary
to explore left side.
Never Give Up (NGU) (Badia et al., ICLR 2020)
proposed episodic memory: consider not only the global novelty of states,
but also whether they are novel in the current episode.
Core Design
NGU's intrinsic reward contains two parts:

$$r_t^{\text{int}} = r_t^{\text{episodic}} \cdot \min\{\max\{\alpha_t, 1\}, L\}$$

where:
- $r_t^{\text{episodic}}$: Episodic novelty — how far $s_t$ is from its nearest-neighbor states in the current episode
- $\alpha_t$: Lifetime novelty — a multiplier reflecting how frequently $s_t$ has been visited over the entire training history (computed like RND)
- $L$: Upper bound, prevents reward explosion
1. Episodic Novelty
Maintain an episodic memory $M$, storing the embeddings $f(s)$ of all states in the current episode (learned by a trainable network).
Compute the distances from $f(s_t)$ to its $k$ nearest neighbors in memory:

$$d_i = d(f(s_t), f_i), \qquad f_i \in N_k(f(s_t))$$

where $d$ is the Euclidean distance, with a small constant added to prevent division by zero.
Then convert distance to similarity using a kernel function:

$$K(f(s_t), f_i) = \frac{\epsilon}{\frac{d_i^2}{d_m^2} + \epsilon}$$

where $d_m^2$ is the running average of squared neighbor distances and $\epsilon$ is a small constant.
Episodic novelty is defined as:

$$r_t^{\text{episodic}} = \frac{1}{\sqrt{\sum_{f_i \in N_k} K(f(s_t), f_i)} + c}$$

Intuition: If $f(s_t)$ is far from all states in memory, the denominator is small and $r_t^{\text{episodic}}$ is large; if $f(s_t)$ is close to some stored state (visited before), the denominator is large and the reward is small. The memory is cleared at the end of each episode and rebuilt in the next.
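The episodic bonus can be sketched numerically. This is a simplified sketch: the kernel constants are illustrative, and the real NGU maintains the running average of neighbor distances across steps rather than per call:

```python
import numpy as np

def episodic_novelty(embedding, memory, k=10, kernel_eps=1e-3, c=1e-3):
    """Simplified NGU episodic bonus: inverse kernel similarity to the k
    nearest embeddings stored during the current episode."""
    if len(memory) == 0:
        return 1.0
    mem = np.asarray(memory, dtype=float)
    d2 = np.sum((mem - np.asarray(embedding)) ** 2, axis=1)
    d2 = np.sort(d2)[:k]
    d2 = d2 / (d2.mean() + 1e-8)              # normalize by the mean neighbor distance
    kernel = kernel_eps / (d2 + kernel_eps)   # small distance -> similarity near 1
    return 1.0 / (np.sqrt(kernel.sum()) + c)

memory = [[0.0, 0.0], [1.0, 1.0]]
near = episodic_novelty([0.0, 0.0], memory)   # revisited state: low bonus
far = episodic_novelty([10.0, 10.0], memory)  # unseen state: high bonus
print(near < far)  # -> True
```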
2. Lifetime Novelty
Use the RND prediction error, normalized by its running mean $\mu_e$ and standard deviation $\sigma_e$ as a long-term exploration signal:

$$\alpha_t = 1 + \frac{\text{err}(s_t) - \mu_e}{\sigma_e}, \qquad \text{err}(s_t) = \| \hat{g}(s_t) - g(s_t) \|^2$$
3. Adaptive Exploration Coefficients
NGU trains multiple agents, each with a different exploration coefficient $\beta_i$:

$$r_t^{(i)} = r_t^{\text{ext}} + \beta_i\, r_t^{\text{int}}$$

High-$\beta$ agents focus on exploration, low-$\beta$ agents focus on exploitation. All agents share the experience pool, forming an "exploration-exploitation spectrum."
NGU achieved unprecedented exploration performance on DeepMind Lab
and Atari:
Pitfall: Improved from RND's 70 to 5000+
points
Private Eye: First to solve this game (69000
points)
Montezuma's Revenge: Average 11000 points,
surpassing human experts
Key findings:
1. Episodic memory is crucial — removing it causes a 50%+ performance drop
2. The multi-agent exploration spectrum accelerates learning — equivalent to simultaneously training explorers and exploiters
3. The embedding network $f$ requires careful training — the paper optimizes it with a self-supervised inverse-dynamics (action-prediction) objective

Paper link: arXiv:2002.06038
Practical Case: Implementing ICM+PPO on Montezuma's Revenge
Intrinsic reward coefficient: Start from 0.01, can
increase to 0.1 in sparse reward environments
Normalization: Normalize intrinsic and external
rewards separately to avoid scale mismatch
Frame skip: In Atari, repeat each action for 4
frames to accelerate training
Gradient clipping: ICM gradients can be large, clip
to 0.5 to prevent explosion
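The normalization tip above is usually implemented with running statistics, in the style RND-type codebases commonly use (a parallel-variance update; the class name is illustrative):

```python
import numpy as np

class RunningMeanStd:
    """Running mean/variance via the parallel (Chan et al.) update,
    used to scale intrinsic rewards to a stable magnitude."""
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        x = np.asarray(x, dtype=float)
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / tot)
        self.mean += delta * batch_count / tot
        self.var, self.count = m2 / tot, tot

    def normalize(self, x):
        return (np.asarray(x) - self.mean) / np.sqrt(self.var + 1e-8)
```

Keeping separate trackers for intrinsic and external rewards avoids the scale mismatch mentioned above.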
Expected Results
Training on single GPU (RTX 3090): - 10 million
steps: Agent learns to reach first room (100 points) -
30 million steps: Agent learns to get key and open door
(400 points) - 100 million steps: Average score 6000+,
surpassing DQN's 0
Theoretical Analysis and Open Problems
Q&A: Common Questions About Exploration Strategies
Q1: Why is ICM's inverse model so important?
A: The inverse model forces the feature encoder $\phi$ to preserve only
action-relevant information. Imagine a counterexample: without the
inverse model, $\phi$ might only encode the background; then the forward
model would give high rewards to background changes, causing the agent
to "pseudo-explore" in front of random backgrounds. By having to predict
actions, the inverse model forces $\phi$ to encode the positions of the
agent and interactive objects.
Q2: Why doesn't RND get fooled by TV noise?
A: The random network $f$ acts as a "hash function" — although each
frame's pixels differ, states belonging to the same type (noise screen)
are mapped to similar outputs. The predictor network $\hat{f}$ quickly
learns to map noise screens to roughly fixed vectors, the prediction
error drops to 0, and the intrinsic reward vanishes. The key is: the
random network's outputs are insensitive to irrelevant changes (like
noise) but sensitive to structural changes (like a new object appearing).
Q3: Why can NGU's episodic memory solve "key-door"
tasks?
A: At the start of each new episode, episodic memory is cleared, so
the "get key" state becomes novel again. Even if the agent visited this
state 100 times historically, in the current episode it still gives high
reward. This ensures the agent doesn't abandon exploring certain
necessary states due to frequent global visits.
Q4: Can intrinsic rewards cause agents to ignore external
rewards?
A: This is a real problem, called the "exploration trap." Solutions:
1. Gradually decrease $\beta$ (though the NGU paper found a fixed $\beta$ also works)
2. Dual value networks: one predicts external returns, one predicts intrinsic returns, optimized separately
3. Reset intrinsic rewards after task success (avoid over-exploring solved tasks)
Q5: How to use exploration strategies in continuous control
tasks (like robots)?
A: Continuous-control exploration is more complex because the action
space is continuous. Common methods:
- Add Gaussian noise to actions: $a = \mu(s) + \mathcal{N}(0, \sigma^2)$
- Parameter-space noise: add noise to the policy network's parameters
- ICM/RND apply equally to continuous control; just include the action $a$ as an input
Q6: Can exploration strategies be used for imitation
learning?
A: Yes! When Behavioral Cloning fails, exploration strategies help
agents discover states missing from demonstration trajectories. Combined
approaches: - Use ICM to explore, but prefer states near expert
trajectories - Use RND to identify states different from expert
distribution, actively learn
Q7: How to explore in multi-agent environments?
A: Multi-agent exploration has unique challenges: - Other agents'
policies are changing, environment is non-stationary - Need to explore
"joint policy space," combinatorial explosion - Common methods: each
agent's independent ICM/RND + centralized episodic memory
Q8: How to evaluate exploration strategy
quality?
A: Besides task rewards, can also use: - State
coverage: Number of unique states visited - Exploration
efficiency: Steps needed to reach certain coverage -
Novelty curve: Change in intrinsic rewards over time
(should gradually decrease) - Success rate: Probability
of completing task within specified steps
Q9: Why do count methods fail in high
dimensions?
A: Due to the curse of dimensionality. In $d$-dimensional continuous
space, covering the space at a per-axis resolution of $\epsilon$ requires
on the order of $(1/\epsilon)^d$ sample points. When $d$ is large (e.g.,
thousands of pixel dimensions in Atari), this number is astronomical.
Density models and feature embeddings try to alleviate this by learning
low-dimensional representations.
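The scaling claim can be checked with a one-line log computation (illustrative; $\epsilon$ here is the per-axis resolution, not the $\epsilon$-greedy parameter):

```python
import math

def log10_samples_to_cover(eps, d):
    """log10 of (1/eps)**d: the order of the sample count needed to cover
    a d-dimensional unit cube at per-axis resolution eps."""
    return d * math.log10(1.0 / eps)

print(log10_samples_to_cover(0.1, 2))     # 2.0  -> 100 points suffice
print(log10_samples_to_cover(0.1, 10))    # 10.0 -> 10 billion points
print(log10_samples_to_cover(0.1, 7056))  # 84x84 pixels: ~10^7056 points
```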
Q10: Future research directions for exploration
strategies?
A: Several frontier directions: - Active
Exploration: Not just explore new states, but actively design
experiments to rapidly learn environment dynamics - Skill-Based
Exploration: First learn reusable skills (like "jump," "open
door"), then combine skills to explore - Goal
Generation: Automatically generate challenging goals to guide
exploration - Social Learning: Learn from other agents'
exploration, avoid repetition
Recommended Papers
Curiosity-Driven Exploration
Curiosity-Driven Exploration by Self-Supervised Prediction
(Pathak et al., ICML 2017)
Exploration strategies are key to moving reinforcement learning from
toy problems to real applications. Traditional $\epsilon$-greedy and
Boltzmann exploration are extremely inefficient in high-dimensional
sparse-reward
environments, while intrinsic motivation-based curiosity-driven methods
achieve true autonomous exploration by rewarding state novelty. From
count-based statistical methods to ICM's prediction error, from RND's
random network distillation to NGU's episodic memory, each improvement
stems from deep understanding of exploration's essence — not random
wandering, but actively seeking unexpected, uncertain, and unknown
things.
However, the exploration problem is far from solved. Current methods
still rely on massive trial-and-error (NGU requires billions of training
steps), making them difficult to apply in safety-critical systems (like
autonomous driving). Future research may integrate active learning,
causal reasoning, and human prior knowledge to achieve more efficient,
interpretable, and transferable exploration strategies. Just as babies
don't need to try millions of times to learn to walk, truly intelligent
exploration should be sample-efficient, goal-directed, and continually
learning — this is one of the core challenges for reinforcement learning
research in the next decade.
From the perspective of exploration strategies, the next chapter will
delve into Model-Based reinforcement learning — learning environment
models for imagination and planning, from passive exploration to active
planning, from reactive control to forward-looking decision-making,
revealing another core path in reinforcement learning.
Post title: Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
Post author: Chen Kai
Create time: 2024-08-23 15:30:00
Post link: https://www.chenk.top/reinforcement-learning-4-exploration-and-curiosity-driven-learning/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.