Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
Chen Kai

One of the central challenges in reinforcement learning is the exploration-exploitation dilemma. An agent that only exploits known good policies may never discover better solutions, but excessive exploration wastes time on low-reward behaviors. Traditional methods like ε-greedy and Boltzmann exploration rely on randomness, which becomes extremely inefficient in high-dimensional state spaces with sparse rewards — in a game like Montezuma's Revenge, where the agent needs hundreds of precise steps to receive the first reward, pure random exploration virtually never succeeds. In recent years, inspired by cognitive-science theories of "intrinsic motivation," researchers have proposed curiosity-driven learning: rewarding agents with intrinsic rewards for exploring novel states, enabling continuous learning even when external rewards are zero. From count-based methods to ICM's prediction error, from RND's random network distillation to NGU's episodic memory, exploration strategies have evolved into a complete theoretical and engineering framework. This chapter systematically traces this evolution, analyzing the design motivation, mathematical principles, and implementation details of each method, and finally validates their effectiveness on Atari hard-exploration games.

The Exploration-Exploitation Dilemma: The Essence of the Problem

Why Exploration is So Difficult

In the DQN of Chapter 2 and the policy gradient methods of Chapter 3, we assumed agents could collect effective training data through exploration. But in practice, exploration itself is a core challenge:

Problem 1: Sparse Rewards and Delayed Feedback

Consider the classic Montezuma's Revenge game: the agent must first get a key, open a door, and jump over enemies — executing over a hundred operations before receiving the first +100 point reward. Before that, all states have zero return. With ε-greedy random exploration, the probability of finding an effective trajectory is on the order of $(1/10)^{100} = 10^{-100}$ (assuming 10 candidate actions per step), which is practically impossible.

Problem 2: Sparsity in High-Dimensional State Spaces

Atari game states are $210 \times 160$ pixel images (commonly downsampled to $84 \times 84$), giving on the order of $256^{84 \times 84}$ possible states. Even if the agent explores one million states, the coverage is a vanishingly small fraction — the vast majority of states will never be visited.

Problem 3: Attraction Basins of Local Optima

In environments with deceptive rewards, agents easily fall into local optima. For example, in a maze with a small reward in a dead end, the agent may repeatedly collect this small reward instead of exploring the path to larger rewards.

Problem 4: Deceptiveness of Deterministic Environments

In deterministic environments, ε-greedy makes the agent repeatedly try the same actions in the same states, wasting significant time. Worse, certain "interesting" stochastic states (like TV static) induce "pseudo-exploration": the agent thinks it is learning something new when the signal is actually worthless noise.

Limitations of Traditional Exploration Strategies

ε-greedy Exploration

During training, with probability $\epsilon$ select a random action, and with probability $1-\epsilon$ select the greedy action:

$$a_t = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1-\epsilon \end{cases}$$

Advantages: Simple and effective, suitable for low-dimensional discrete spaces (like FrozenLake).

Disadvantages:
- Exploration is aimless — in high-dimensional spaces, random actions almost never reach meaningful states
- The $\epsilon$ decay schedule is hard to tune — too fast leads to premature convergence, too slow hurts final performance
- Even worse in continuous action spaces — adding random noise to actions easily produces unreasonable behavior
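The scheme above can be sketched in a few lines of NumPy (the `q_values` array is a hypothetical stand-in for a learned Q-function, and the decay constants are purely illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(t, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponential decay schedule: eps_t = max(eps_min, eps_start * decay^t)."""
    return max(eps_min, eps_start * decay ** t)

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])  # toy Q-values; action 1 is greedy
actions = [epsilon_greedy(q, decayed_epsilon(t), rng) for t in range(1000)]
```

Note that `eps_start`, `eps_min`, and `decay` are exactly the knobs the disadvantages list complains about: set them wrong and the agent either converges prematurely or keeps acting randomly.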

Boltzmann Exploration (Softmax)

Sample actions according to the softmax distribution of Q-values:

$$\pi(a \mid s) = \frac{\exp\!\left( Q(s,a) / \tau \right)}{\sum_{a'} \exp\!\left( Q(s,a') / \tau \right)}$$

where $\tau$ is the temperature parameter controlling exploration intensity. As $\tau \to 0$ the policy approaches greedy selection; as $\tau \to \infty$ it approaches uniform random.

Advantages: Adaptively adjusts exploration based on Q-value magnitudes, theoretically more elegant.

Disadvantages:
- Still random, aimless exploration
- When Q-value estimates are inaccurate (early in training), the distribution is easily biased
- Computing the softmax requires enumerating all actions, infeasible in large action spaces
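For comparison, a minimal Boltzmann sampler (plain NumPy with toy Q-values; the max-shift before exponentiating guards against overflow):

```python
import numpy as np

def boltzmann(q_values, tau, rng):
    """Sample an action from softmax(Q / tau)."""
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

rng = np.random.default_rng(0)
q = [1.0, 2.0, 0.5]
# Low temperature -> near-greedy; high temperature -> near-uniform
cold = [boltzmann(q, 0.05, rng) for _ in range(500)]
hot = [boltzmann(q, 50.0, rng) for _ in range(500)]
```

Running it shows the temperature effect described above: at $\tau = 0.05$ the sampler is essentially greedy on action 1, while at $\tau = 50$ all three actions are drawn almost uniformly.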

Upper Confidence Bound (UCB)

In the multi-armed bandit problem, UCB selects:

$$a_t = \arg\max_a \left[ \hat{Q}(a) + c \sqrt{\frac{\ln t}{N(a)}} \right]$$

where $t$ is the total number of samples, $N(a)$ is the number of times action $a$ has been sampled, and $c$ is the exploration coefficient. The second term is an "optimistic bonus" — the less an action is tried, the higher its confidence upper bound.

Advantages: Comes with regret-bound guarantees ($O(\log T)$ regret for UCB1), near-optimal in bandit problems.

Disadvantages:
- Requires explicitly counting visits to each state-action pair
- In large state spaces, most $(s, a)$ pairs are visited at most once, so $N(s, a) \in \{0, 1\}$ and the confidence intervals are uninformative
- Cannot generalize to unseen states
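A minimal UCB1 sketch on a toy Bernoulli bandit (the arms and their means are hypothetical) makes the bonus term concrete:

```python
import numpy as np

def ucb1(counts, values, c, t):
    """Pick argmax of Q_hat(a) + c * sqrt(ln t / N(a)); untried arms go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                            # each arm gets one forced try
    bonus = c * np.sqrt(np.log(t) / np.asarray(counts))
    return int(np.argmax(np.asarray(values) + bonus))

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]                    # hidden Bernoulli arm means
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]
for t in range(1, 2001):
    a = ucb1(counts, values, c=1.0, t=t)
    reward = float(rng.random() < true_means[a])
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]  # incremental mean
```

After a few thousand pulls the count vector concentrates on the best arm, which is the logarithmic-regret behavior UCB1 promises — and it also shows why the method needs explicit per-arm counts, the very thing that fails in large state spaces.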

Thompson Sampling

Maintain a posterior distribution $p(Q)$ over Q-functions; at each step sample $\hat{Q} \sim p(Q)$ and select greedily:

$$a_t = \arg\max_a \hat{Q}(s_t, a)$$

Advantages: Elegant Bayesian framework — exploration arises naturally from posterior uncertainty.

Disadvantages:
- Requires maintaining a distribution over the entire Q-function, which is computationally expensive for deep networks (though approximations like Bootstrapped DQN exist)
- Still relies on Q-value estimation; with sparse rewards, when all Q-values are zero, it degenerates to random exploration
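For intuition, here is the classic Beta-Bernoulli instantiation of Thompson sampling (a toy sketch with hypothetical arm means; deep-RL versions replace the exact Beta posterior with approximations such as Bootstrapped DQN):

```python
import numpy as np

def thompson_step(alpha, beta, rng):
    """Sample one mean estimate per arm from its Beta posterior, act greedily."""
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]     # hidden Bernoulli arm means
alpha = np.ones(3)               # Beta(1, 1) uniform priors
beta = np.ones(3)
pulls = np.zeros(3, dtype=int)
for _ in range(2000):
    a = thompson_step(alpha, beta, rng)
    reward = rng.random() < true_means[a]
    alpha[a] += reward           # posterior update: successes
    beta[a] += 1 - reward        # posterior update: failures
    pulls[a] += 1
```

The posterior of the best arm tightens as it is pulled more, so exploration of the worse arms dies off automatically — no $\epsilon$ schedule to tune.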

The Essence of Exploration: Seeking Novelty

Observing human and animal exploration behavior, we find a common feature: not random wandering, but actively seeking novel, unexpected, and uncertain things. Infants repeatedly play with new toys (not old ones), monkeys actively explore unknown maze areas, scientists pursue experimental phenomena that violate common sense.

This "curiosity" can be formalized as an intrinsic reward $r^i_t$ — even when the external environment provides no reward, the agent rewards itself for state novelty:

$$r_t = r^e_t + \beta \, r^i_t$$

where:
- $r^e_t$ is the external reward from the environment (the task objective)
- $r^i_t$ is the intrinsic reward computed by the agent itself (the exploration drive)
- $\beta$ is the balance coefficient

The key question: how to define and compute $r^i_t$ so that it truly measures state novelty?

Count-Based Methods: Explicitly Counting Visit Frequencies

Core Idea

The most intuitive "novelty" definition is: How many times have I seen this state before? The less visited, the more novel, the higher the intrinsic reward.

The simplest implementation uses the visit count $N(s)$ as the basis of the intrinsic reward:

$$r^i_t = \frac{\beta}{\sqrt{N(s_t)}}$$

or, more aggressively:

$$r^i_t = \frac{\beta}{N(s_t)}$$

Theoretical basis: This is equivalent to maximizing state visit entropy, encouraging uniform coverage of the state space.

Naive Implementation: Hash Table Counting

In small discrete state spaces (like GridWorld), can directly use a dictionary:

class CountBasedExploration:
    def __init__(self, beta=0.1):
        self.counts = {}  # state -> visit count
        self.beta = beta

    def get_intrinsic_reward(self, state):
        # Convert state to a hashable tuple
        state_key = tuple(state.flatten())
        count = self.counts.get(state_key, 0)
        self.counts[state_key] = count + 1
        return self.beta / (count + 1) ** 0.5

Problem: In high-dimensional continuous state spaces (like Atari pixels), states almost never repeat, so $N(s) = 1$ essentially always holds and the intrinsic reward degenerates to a constant.

Improvement: Density Models and Pseudo-Counts

Bellemare et al. (2016), in the paper Unifying Count-Based Exploration and Intrinsic Motivation, proposed using a density model $\rho(s)$ to estimate a state's "occurrence frequency" and derive a pseudo-count:

$$\hat{N}(s) = \frac{\rho(s)\,\left(1 - \rho'(s)\right)}{\rho'(s) - \rho(s)}$$

where $\rho(s)$ is the model's density before observing $s$, $\rho'(s)$ is the density after one more update on $s$, and $\rho$ is learned by a generative model (like PixelCNN). Specifically:

  1. Maintain a generative model $\rho_\theta$; estimate state density via the likelihood $\rho_\theta(s)$
  2. Online update: for each visited state $s_t$, take one gradient step maximizing $\log \rho_\theta(s_t)$
  3. Use the likelihoods before and after the update, $\rho(s_t)$ and $\rho'(s_t)$, to compute the pseudo-count $\hat{N}(s_t)$

Mathematical Derivation

In the ideal case, if $\rho_n(s) = N(s)/n$ is the empirical distribution after $n$ observed states, then after one more visit to $s$:

$$\rho'_n(s) = \frac{N(s) + 1}{n + 1}$$

where $\rho'_n$ is the distribution after visiting $s$. Solving the two equations for $N(s)$ gives:

$$\hat{N}(s) = \frac{\rho_n(s)\,\left(1 - \rho'_n(s)\right)}{\rho'_n(s) - \rho_n(s)}$$

In practice, $\rho$ is approximated with a neural network, and the likelihoods before and after the update step are used in place of $\rho_n$ and $\rho'_n$:

Code Implementation (Simplified)

import torch
import torch.nn as nn
import torch.nn.functional as F

class DensityModel(nn.Module):
    """Estimate state density with a (heavily simplified) PixelCNN-style per-pixel classifier"""
    def __init__(self, obs_shape):
        super().__init__()
        # Simplified: a plain CNN outputting 256 class logits per pixel
        # (a real PixelCNN would use masked, autoregressive convolutions)
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 256, 1)  # 256 class logits per pixel
        )

    def forward(self, obs):
        """obs: (B, 1, H, W), single-channel integer pixel values in [0, 255]; returns log p(obs)"""
        logits = self.net(obs.float() / 255.0)  # (B, 256, H, W)
        # Per-pixel log-likelihood: treat each pixel value as a 256-way class label
        log_probs = -F.cross_entropy(
            logits.permute(0, 2, 3, 1).reshape(-1, 256),  # (B*H*W, 256)
            obs.reshape(-1).long(),                       # (B*H*W,)
            reduction='none'
        )
        return log_probs.sum()

class CountBasedIntrinsic:
    def __init__(self, obs_shape, beta=0.05, lr=1e-4):
        self.model = DensityModel(obs_shape)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.beta = beta

    def compute_intrinsic_reward(self, obs):
        """Compute intrinsic reward and update the density model"""
        obs_tensor = torch.as_tensor(obs).unsqueeze(0)

        # Density before the update
        with torch.no_grad():
            log_p_old = self.model(obs_tensor)
            p_old = torch.exp(log_p_old)

        # One gradient step maximizing the likelihood of this state
        log_p_new = self.model(obs_tensor)
        loss = -log_p_new
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Density after the update
        with torch.no_grad():
            log_p_new = self.model(obs_tensor)
            p_new = torch.exp(log_p_new)

        # Pseudo-count: N ≈ p_old / (p_new - p_old), valid when densities are small
        pseudo_count = p_old / (p_new - p_old + 1e-8)
        pseudo_count = torch.clamp(pseudo_count, 1, 1e6)  # Guard against numerical issues

        intrinsic_reward = self.beta / torch.sqrt(pseudo_count)
        return intrinsic_reward.item()

Advantages and Limitations

Advantages:
- Clear theory, derived directly from the statistical meaning of visit frequency
- Significant gains on small-scale problems (like early Montezuma's Revenge experiments)

Limitations:
- Density models are hard to train — models like PixelCNN are computationally expensive and easily overfit
- The "noisy-TV problem": in stochastic environments every state looks "new" (because of pixel noise), and the intrinsic reward fails
- No generalization — two visually similar but pixel-wise different states are treated as completely distinct

ICM: Measuring Novelty Through Prediction Error

Motivation: From Pixels to Features

Count-based methods count directly on raw pixels, ignoring a fact: some pixel changes are irrelevant (like background leaves swaying), while others are critical (like enemy movement).

Curiosity-Driven Exploration by Self-Supervised Prediction (Pathak et al., ICML 2017) proposed ICM (Intrinsic Curiosity Module): instead of comparing pixels directly, compare whether the agent can predict features of the next state.

Core insight:
- If a state transition is predictable → the agent already understands the environment dynamics → low reward
- If a state transition is unpredictable → the agent has encountered a new phenomenon → high reward

ICM Architecture: Forward and Inverse Models

ICM contains two modules:

1. Feature Encoder

Map the raw state $s_t$ to a low-dimensional feature vector:

$$\phi(s_t) = \text{Encoder}_\theta(s_t)$$

Usually implemented with convolutional networks, preserving only information relevant to the agent's actions and filtering out background noise.

2. Forward Dynamics Model

Predict the next state's features:

$$\hat{\phi}(s_{t+1}) = f\!\left( \phi(s_t), a_t \right)$$

where $f$ is a small MLP. The forward model's prediction error is the intrinsic reward:

$$r^i_t = \frac{\eta}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|^2$$

3. Inverse Dynamics Model

Predict the action from consecutive features (a self-supervised auxiliary task):

$$\hat{a}_t = g\!\left( \phi(s_t), \phi(s_{t+1}) \right)$$

The inverse model's role is to force the feature encoder to preserve only action-relevant information. If $\phi$ encoded only the background, the inverse model could not predict actions; it succeeds only by encoding the agent and the objects it can interact with.

Mathematical Intuition

Why does prediction error measure novelty? Consider two extremes:

Case 1: Agent randomly walks in empty room

The state transition is deterministic and simple. The forward model quickly learns to predict it perfectly, $r^i_t \to 0$, and the agent stops exploring this area.

Case 2: Agent enters new room, encounters moving enemy

The enemy's behavior is unknown, so the forward model has a large prediction error and $r^i_t$ is high, motivating the agent to explore this area. As interaction accumulates, the model gradually learns the enemy's behavior patterns and $r^i_t$ decreases.

Case 3: TV noise (random pixels)

Raw pixels are completely random, but if the feature encoder works correctly it maps the noise to (nearly) the same features, because the noise is irrelevant to the agent's actions. The inverse model's supervision forces $\phi$ to ignore the noise, so $\hat{\phi}(s_{t+1}) \approx \phi(s_{t+1})$ and $r^i_t \approx 0$ — solving the noisy-TV problem!

Complete Code Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Intrinsic Curiosity Module"""
    def __init__(self, obs_shape, action_dim, feature_dim=256, eta=0.1):
        super().__init__()
        self.eta = eta                # Intrinsic reward scaling factor
        self.action_dim = action_dim

        # Feature encoder: obs -> feature_dim features
        self.feature_encoder = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim)  # Assumes 84x84 input
        )

        # Forward model: (feature_t, action) -> feature_{t+1}
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, feature_dim)
        )

        # Inverse model: (feature_t, feature_{t+1}) -> action
        self.inverse_model = nn.Sequential(
            nn.Linear(feature_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim)
        )

    def forward(self, obs, next_obs, action):
        """
        Compute intrinsic reward and losses
        Args:
            obs: (B, C, H, W)
            next_obs: (B, C, H, W)
            action: (B,) discrete indices or (B, action_dim) continuous
        Returns:
            intrinsic_reward: (B,)
            forward_loss: scalar
            inverse_loss: scalar
        """
        # Feature extraction
        phi = self.feature_encoder(obs)
        phi_next = self.feature_encoder(next_obs)

        # Forward model
        if action.ndim == 1:  # Discrete action: one-hot encode
            action_onehot = F.one_hot(action, num_classes=self.action_dim).float()
        else:
            action_onehot = action

        phi_next_pred = self.forward_model(torch.cat([phi, action_onehot], dim=1))
        forward_loss = F.mse_loss(phi_next_pred, phi_next.detach(), reduction='none').sum(dim=1)

        # Intrinsic reward = forward model prediction error
        intrinsic_reward = self.eta * forward_loss

        # Inverse model
        action_pred_logits = self.inverse_model(torch.cat([phi, phi_next], dim=1))
        if action.ndim == 1:
            inverse_loss = F.cross_entropy(action_pred_logits, action, reduction='none')
        else:
            inverse_loss = F.mse_loss(action_pred_logits, action, reduction='none').sum(dim=1)

        return intrinsic_reward, forward_loss.mean(), inverse_loss.mean()

    def compute_intrinsic_reward(self, obs, next_obs, action):
        """Only compute the intrinsic reward (for rollouts)"""
        with torch.no_grad():
            reward, _, _ = self.forward(obs, next_obs, action)
        return reward

class PPO_with_ICM:
    """PPO with an integrated ICM module"""
    def __init__(self, env, obs_shape, action_dim, lr=3e-4, beta_int=0.01):
        self.env = env
        self.beta_int = beta_int  # Intrinsic reward coefficient

        # PPO policy network (definition omitted, same as Chapter 3)
        self.policy = PolicyNetwork(obs_shape, action_dim)

        # ICM module
        self.icm = ICM(obs_shape, action_dim)

        # Optimizer: jointly optimize policy and ICM
        self.optimizer = torch.optim.Adam(
            list(self.policy.parameters()) + list(self.icm.parameters()),
            lr=lr
        )

    def collect_trajectories(self, num_steps):
        """Collect trajectories and compute mixed rewards"""
        trajectories = []
        obs = self.env.reset()

        for _ in range(num_steps):
            # Select action
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0)
            with torch.no_grad():
                action_dist = self.policy(obs_tensor)
                action = action_dist.sample()

            # Environment interaction
            next_obs, reward_ext, done, _ = self.env.step(action.item())

            # Compute intrinsic reward
            next_obs_tensor = torch.FloatTensor(next_obs).unsqueeze(0)
            reward_int = self.icm.compute_intrinsic_reward(
                obs_tensor, next_obs_tensor, action
            ).item()

            # Mixed reward
            reward_total = reward_ext + self.beta_int * reward_int

            trajectories.append({
                'obs': obs,
                'action': action.item(),
                'reward': reward_total,
                'reward_ext': reward_ext,
                'reward_int': reward_int,
                'next_obs': next_obs,
                'done': done
            })

            obs = next_obs if not done else self.env.reset()

        return trajectories

    def update(self, trajectories):
        """PPO update + ICM update"""
        # Convert to tensors
        obs = torch.FloatTensor([t['obs'] for t in trajectories])
        actions = torch.LongTensor([t['action'] for t in trajectories])
        next_obs = torch.FloatTensor([t['next_obs'] for t in trajectories])
        rewards = torch.FloatTensor([t['reward'] for t in trajectories])

        # ICM losses
        _, forward_loss, inverse_loss = self.icm(obs, next_obs, actions)
        icm_loss = forward_loss + inverse_loss

        # PPO loss (computation omitted, see Chapter 3)
        policy_loss = self.compute_ppo_loss(trajectories)

        # Total loss
        total_loss = policy_loss + icm_loss

        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()

        return {
            'policy_loss': policy_loss.item(),
            'forward_loss': forward_loss.item(),
            'inverse_loss': inverse_loss.item()
        }

Experimental Results

In Pathak et al.'s paper, ICM achieved breakthroughs on sparse reward tasks like VizDoom and Super Mario Bros:

  • No external rewards ($r^e_t = 0$): the agent learns to explore maps and avoid enemies using only intrinsic rewards
  • Montezuma's Revenge: Achieves average score of 6600 in 25 million steps (DQN gets 0)
  • Generalization ability: On new Mario levels, ICM explores 3x farther than baseline

Paper link: arXiv:1705.05363

Limitations

  1. Depends on environment determinism: In stochastic environments, unpredictability doesn't necessarily equal novelty (might just be noise)
  2. Feature encoder quality: if $\phi$ fails to learn well, the entire ICM fails
  3. Computational overhead: Need to train 3 additional networks (encoder, forward, inverse), about 2x slower than vanilla PPO

RND: Random Network Distillation

Motivation: Avoid Training Dynamics Models

One problem with ICM is the need to train a forward model to predict $\phi(s_{t+1})$. In complex environments the dynamics model itself is hard to learn, and if it learns poorly, the intrinsic rewards become unreliable.

Exploration by Random Network Distillation (Burda et al., ICLR 2019) proposed RND: completely avoid dynamics modeling, instead predict outputs of a random network.

Core idea:
- Fix a randomly initialized target network $f$ (never trained)
- Train a predictor network $\hat{f}_\theta$ to fit the target network's outputs
- The prediction error $\|\hat{f}_\theta(s) - f(s)\|^2$ serves as the intrinsic reward

Why does this work?

  • For frequently visited states, the predictor network has seen them many times, so the prediction error is small → low $r^i$
  • For new states, the predictor network has not seen them, so the prediction error is large → high $r^i$
  • Random network acts as "hash function," mapping similar states to similar outputs, naturally achieving generalization

Better yet, this avoids the "noisy-TV problem": even if pixels are random, random network outputs are similar for states of the same type (like "noisy screen"), predictor network quickly learns to ignore noise.

Mathematical Formalization

Definitions:
- Target network: $f : \mathcal{S} \to \mathbb{R}^k$ (randomly initialized, then frozen)
- Predictor network: $\hat{f}_\theta : \mathcal{S} \to \mathbb{R}^k$ (trained)

Intrinsic reward:

$$r^i_t = \left\| \hat{f}_\theta(s_{t+1}) - f(s_{t+1}) \right\|^2$$

Predictor network loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{s} \left[ \left\| \hat{f}_\theta(s) - f(s) \right\|^2 \right]$$

Key detail: To give intrinsic rewards a meaningful scale, they are usually normalized:

$$\tilde{r}^i_t = \frac{r^i_t - \mu_r}{\sigma_r}$$

where $\mu_r$ and $\sigma_r$ are the running mean and standard deviation of the intrinsic rewards.

Complete Code Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class RandomNetwork(nn.Module):
    """Randomly initialized target network (never trained)"""
    def __init__(self, obs_shape, output_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, output_dim)
        )
        # Freeze parameters
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, obs):
        return self.net(obs)

class PredictorNetwork(nn.Module):
    """Trained predictor network"""
    def __init__(self, obs_shape, output_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, obs):
        return self.net(obs)

class RND:
    """Random Network Distillation module"""
    def __init__(self, obs_shape, output_dim=512, lr=1e-4):
        self.target_net = RandomNetwork(obs_shape, output_dim)
        self.predictor_net = PredictorNetwork(obs_shape, output_dim)
        self.optimizer = torch.optim.Adam(self.predictor_net.parameters(), lr=lr)

        # Normalization statistics
        self.reward_mean = 0.0
        self.reward_std = 1.0
        self.reward_history = []

    def compute_intrinsic_reward(self, obs):
        """
        Compute intrinsic reward
        Args:
            obs: (B, C, H, W)
        Returns:
            intrinsic_reward: (B,)
        """
        with torch.no_grad():
            target_feat = self.target_net(obs)
            pred_feat = self.predictor_net(obs)
            reward = torch.pow(pred_feat - target_feat, 2).sum(dim=1)

        return reward

    def update(self, obs_batch):
        """Update the predictor network"""
        target_feat = self.target_net(obs_batch).detach()
        pred_feat = self.predictor_net(obs_batch)
        loss = F.mse_loss(pred_feat, target_feat)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

    def normalize_reward(self, reward):
        """Normalize intrinsic rewards with running statistics"""
        self.reward_history.append(reward.mean().item())
        if len(self.reward_history) > 1000:
            self.reward_history.pop(0)

        self.reward_mean = np.mean(self.reward_history)
        self.reward_std = np.std(self.reward_history) + 1e-8

        return (reward - self.reward_mean) / self.reward_std

class PPO_with_RND:
    """PPO with an integrated RND module"""
    def __init__(self, env, obs_shape, action_dim, lr=3e-4, beta_int=1.0, gamma_int=0.99):
        self.env = env
        self.beta_int = beta_int      # Intrinsic reward coefficient
        self.gamma_int = gamma_int    # Discount for intrinsic returns (usually close to 1)

        # Policy network (definition omitted)
        self.policy = PolicyNetwork(obs_shape, action_dim)

        # RND module
        self.rnd = RND(obs_shape)

        # Two value networks: one predicts external returns, one intrinsic returns
        self.value_ext = ValueNetwork(obs_shape)
        self.value_int = ValueNetwork(obs_shape)

        self.optimizer = torch.optim.Adam(
            list(self.policy.parameters()) +
            list(self.value_ext.parameters()) +
            list(self.value_int.parameters()),
            lr=lr
        )

    def collect_trajectories(self, num_steps):
        """Collect trajectories with both reward streams"""
        trajectories = []
        obs = self.env.reset()

        for _ in range(num_steps):
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0)

            with torch.no_grad():
                # Select action
                action = self.policy(obs_tensor).sample()

                # Predict both values
                v_ext = self.value_ext(obs_tensor)
                v_int = self.value_int(obs_tensor)

            # Environment interaction
            next_obs, reward_ext, done, _ = self.env.step(action.item())

            # Compute intrinsic reward on the next observation
            next_obs_tensor = torch.FloatTensor(next_obs).unsqueeze(0)
            reward_int = self.rnd.compute_intrinsic_reward(next_obs_tensor)
            reward_int = self.rnd.normalize_reward(reward_int).item()

            trajectories.append({
                'obs': obs,
                'action': action.item(),
                'reward_ext': reward_ext,
                'reward_int': reward_int,
                'v_ext': v_ext.item(),
                'v_int': v_int.item(),
                'done': done
            })

            obs = next_obs if not done else self.env.reset()

        return trajectories

    def compute_gae(self, trajectories, gamma, lam):
        """Compute GAE separately for the external and intrinsic streams"""
        advantages_ext, advantages_int = [], []
        gae_ext = gae_int = 0.0
        T = len(trajectories)

        for t in reversed(range(T)):
            # Bootstrap from the next step's value (0 at the end of the rollout)
            next_v_ext = trajectories[t + 1]['v_ext'] if t + 1 < T else 0.0
            next_v_int = trajectories[t + 1]['v_int'] if t + 1 < T else 0.0

            delta_ext = trajectories[t]['reward_ext'] + gamma * next_v_ext * (1 - trajectories[t]['done']) - trajectories[t]['v_ext']
            # The intrinsic stream is non-episodic: ignore done flags
            delta_int = trajectories[t]['reward_int'] + self.gamma_int * next_v_int - trajectories[t]['v_int']

            gae_ext = delta_ext + gamma * lam * (1 - trajectories[t]['done']) * gae_ext
            gae_int = delta_int + self.gamma_int * lam * gae_int

            advantages_ext.insert(0, gae_ext)
            advantages_int.insert(0, gae_int)

        # Combine advantages
        advantages_total = [a_e + self.beta_int * a_i for a_e, a_i in zip(advantages_ext, advantages_int)]
        return advantages_total, advantages_ext, advantages_int

    def update(self, trajectories):
        """Update policy, value networks, and RND"""
        obs = torch.FloatTensor([t['obs'] for t in trajectories])

        # Update RND
        rnd_loss = self.rnd.update(obs)

        # Compute GAE and run the PPO update (details omitted)
        # ...

        return {'rnd_loss': rnd_loss}

Experimental Results

Burda et al. demonstrated RND's remarkable effectiveness in their paper:

  • Montezuma's Revenge: Average score 8152 (first time surpassing human-level 7385)
  • Pitfall: Score 70.4 (previous SOTA was 0)
  • Gravitar: Improved from DQN's 0 to 3500+ points

Key findings:
1. RND outperforms ICM on all Atari hard-exploration games
2. Intrinsic rewards need almost no discounting ($\gamma_{\text{int}}$ close to 1), since exploration is a long-term objective
3. Normalization is crucial — without it the intrinsic reward scale explodes

Paper link: arXiv:1810.12894

NGU: Episodic Memory and Fast-Slow Exploration

Motivation: From Global to Episodic Novelty

RND has an implicit assumption: once a state is visited once, it's never novel again. But in actual exploration, this is too strict:

Scenario 1: Key-door task

In Montezuma's Revenge, the agent needs to repeatedly "get key → open door → enter room." If after getting the key the first time, RND considers the key location no longer novel, the agent loses motivation to get the key again.

Scenario 2: Reversible exploration

Agent first explores right room, then explores left room. When returning to starting point, RND gives low reward (because starting point was visited many times), but this "return to start" is necessary to explore left side.

Never Give Up (NGU) (Badia et al., ICLR 2020) proposed episodic memory: consider not only global novelty of states, but also whether novel in current episode.

Core Design

NGU's intrinsic reward combines two parts:

$$r^i_t = r^{\text{episodic}}_t \cdot \min\!\left( \max\!\left( \alpha_t, 1 \right), L \right)$$

where:
- $r^{\text{episodic}}_t$: episodic novelty — how far $s_t$ is from its nearest-neighbor states within the current episode
- $\alpha_t$: lifetime novelty — how rarely $s_t$ has been visited over the entire training history (computed with RND)
- $L$: an upper bound that prevents reward explosion

1. Episodic Novelty

Maintain an episodic memory $M$, storing the embeddings $f(s_t)$ of all states visited in the current episode (produced by a trainable embedding network).

Compute the distances from $f(s_t)$ to its $k$ nearest neighbors $\{f_1, \dots, f_k\}$ in memory:

$$d_j = d\!\left( f(s_t), f_j \right)$$

where $d(\cdot, \cdot)$ is Euclidean distance; a small constant $c$ is added in the final expression to prevent division by zero.

Then convert each distance to a similarity with a kernel function:

$$K\!\left( f(s_t), f_j \right) = \frac{\epsilon}{\frac{d_j^2}{\bar{d}} + \epsilon}$$

where $\bar{d}$ is a running average of the neighbor distances (keeping the kernel scale-invariant) and $\epsilon$ is a small constant.

Episodic novelty is then defined as:

$$r^{\text{episodic}}_t = \frac{1}{\sqrt{\sum_{j=1}^{k} K\!\left( f(s_t), f_j \right)} + c}$$

Intuition: If $f(s_t)$ is far from everything in memory, the kernel sum in the denominator is small and $r^{\text{episodic}}_t$ is large; if it is close to some previously visited state, the denominator is large and the reward is small. The memory is cleared at the end of each episode and rebuilt in the next.

2. Lifetime Novelty

Use the RND prediction error, normalized by its running statistics:

$$\alpha_t = 1 + \frac{\left\| \hat{f}_\theta(s_t) - f(s_t) \right\|^2 - \mu_e}{\sigma_e}$$

This serves as the long-term exploration signal.

3. Adaptive Exploration Coefficients

NGU trains multiple agents in parallel, each with a different exploration coefficient $\beta_i$:

$$r^{(i)}_t = r^e_t + \beta_i \, r^i_t$$

High-$\beta$ agents focus on exploration, low-$\beta$ agents on exploitation. All agents share one experience pool, forming an "exploration-exploitation spectrum."
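A sketch of the coefficient spectrum (linear spacing for clarity; the NGU paper actually spaces the $\beta_i$ with a sigmoid schedule, and the values here are illustrative):

```python
import numpy as np

def beta_spectrum(n_agents, beta_max=0.3):
    """An exploration-exploitation spectrum: agent 0 is purely exploitative
    (beta = 0), agent n-1 is the most exploratory (beta = beta_max).
    Linear spacing is a simplification of the paper's sigmoid schedule."""
    return np.linspace(0.0, beta_max, n_agents)

betas = beta_spectrum(8)
# Each agent i maximizes r_ext + betas[i] * r_int; all share one replay buffer
```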

Simplified Implementation (Pseudocode)

import torch
import torch.nn as nn
import numpy as np
from collections import deque

class EpisodicMemory:
    """Episodic memory module"""
    def __init__(self, capacity=5000, k=10):
        self.memory = deque(maxlen=capacity)
        self.k = k  # number of nearest neighbors
        self.running_mean_dist = 1.0
        self.epsilon = 0.001
        self.c = 0.001

    def add(self, embedding):
        """Add a state embedding to memory"""
        self.memory.append(embedding.cpu().numpy())

    def compute_novelty(self, embedding):
        """Compute episodic novelty"""
        if len(self.memory) == 0:
            return 1.0  # First state of the episode: maximal novelty

        # Distances to all stored embeddings
        emb_np = embedding.cpu().numpy()
        distances = [np.linalg.norm(emb_np - mem) for mem in self.memory]

        # Keep only the k nearest neighbors
        k_nearest = sorted(distances)[:self.k]

        # Update the running average neighbor distance (normalizes the kernel)
        self.running_mean_dist = 0.99 * self.running_mean_dist + 0.01 * np.mean(k_nearest)

        # Kernel similarity over the k nearest neighbors
        similarities = [
            self.epsilon / ((d ** 2 / self.running_mean_dist) + self.epsilon)
            for d in k_nearest
        ]

        # Episodic novelty
        novelty = 1.0 / (np.sqrt(sum(similarities)) + self.c)
        return novelty

    def reset(self):
        """Clear memory at the end of the episode"""
        self.memory.clear()

class EmbeddingNetwork(nn.Module):
    """Trainable state embedding network"""
    def __init__(self, obs_shape, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embed_dim)
        )

    def forward(self, obs):
        return self.net(obs)

class NGU:
    """Never Give Up exploration module"""
    def __init__(self, obs_shape, beta=0.3, L=5.0):
        self.beta = beta
        self.L = L

        # Episodic memory
        self.episodic_memory = EpisodicMemory()

        # Trainable embedding network (trained separately with a self-supervised objective)
        self.embedding_net = EmbeddingNetwork(obs_shape)

        # RND for lifetime novelty
        self.rnd = RND(obs_shape)

    def compute_intrinsic_reward(self, obs):
        """Compute the combined intrinsic reward"""
        # Episodic novelty
        with torch.no_grad():
            embedding = self.embedding_net(obs)
        episodic_novelty = self.episodic_memory.compute_novelty(embedding)
        self.episodic_memory.add(embedding)

        # Lifetime novelty, clipped to [1, L]
        lifetime_novelty = self.rnd.compute_intrinsic_reward(obs).item()
        lifetime_novelty = max(min(lifetime_novelty, self.L), 1.0)

        # Combine
        intrinsic_reward = episodic_novelty * lifetime_novelty
        return intrinsic_reward

    def reset_episode(self):
        """Reset episodic memory at the end of the episode"""
        self.episodic_memory.reset()

Experimental Results

NGU achieved unprecedented exploration performance on DeepMind Lab and Atari:

  • Pitfall: Improved from RND's 70 to 5000+ points
  • Private Eye: First to solve this game (69000 points)
  • Montezuma's Revenge: Average 11000 points, surpassing human experts

Key findings:
1. Episodic memory is crucial — removing it causes a 50%+ performance drop
2. The multi-agent exploration spectrum accelerates learning — equivalent to training explorers and exploiters simultaneously
3. The embedding network requires careful training — the paper optimizes it with a self-supervised (contrastive) objective

Paper link: arXiv:2002.06038

Practical Case: Implementing ICM+PPO on Montezuma's Revenge

Environment Setup

# Install dependencies (extras are quoted so the brackets survive shells like zsh)
pip install "gym[atari]" torch torchvision opencv-python

# Install Atari ROMs
pip install "autorom[accept-rom-license]"

Complete Training Code

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import deque
import cv2

# ========== Environment Wrapper ==========
class AtariWrapper:
    """Atari environment preprocessing (classic gym API: 4-tuple step)"""
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.frame_stack = deque(maxlen=4)

    def reset(self):
        obs = self.env.reset()
        obs = self._preprocess(obs)
        for _ in range(4):
            self.frame_stack.append(obs)
        return np.array(self.frame_stack)

    def step(self, action):
        total_reward = 0
        for _ in range(4):  # Frame skip
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break

        obs = self._preprocess(obs)
        self.frame_stack.append(obs)
        return np.array(self.frame_stack), total_reward, done, info

    def _preprocess(self, obs):
        """Convert to grayscale and resize to 84x84"""
        gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
        resized = cv2.resize(gray, (84, 84))
        return resized / 255.0

# ========== Neural Networks ==========
class ActorCritic(nn.Module):
    """Actor-Critic network"""
    def __init__(self, action_dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten()
        )

        self.fc = nn.Linear(64 * 7 * 7, 512)
        self.actor = nn.Linear(512, action_dim)
        self.critic = nn.Linear(512, 1)

    def forward(self, obs):
        x = self.conv(obs)
        x = F.relu(self.fc(x))
        logits = self.actor(x)
        value = self.critic(x)
        return logits, value

# ========== ICM Module (already defined above) ==========
# ...

# ========== Training Loop ==========
def train_icm_ppo(env_name='MontezumaRevengeNoFrameskip-v4', total_steps=10_000_000):
    env = AtariWrapper(env_name)
    action_dim = env.env.action_space.n
    obs_shape = (4, 84, 84)

    # Initialize networks
    policy = ActorCritic(action_dim).cuda()
    icm = ICM(obs_shape, action_dim).cuda()

    optimizer = torch.optim.Adam(
        list(policy.parameters()) + list(icm.parameters()),
        lr=3e-4
    )

    # Hyperparameters
    gamma = 0.99
    lam = 0.95
    clip_eps = 0.1
    beta_int = 0.01
    rollout_len = 128

    # Statistics
    episode_rewards = []
    episode_return = 0.0  # running extrinsic return of the current episode
    obs = env.reset()

    for update in range(total_steps // rollout_len):
        # ===== Collect trajectories =====
        trajectories = []
        for _ in range(rollout_len):
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0).cuda()

            with torch.no_grad():
                logits, value = policy(obs_tensor)
                dist = torch.distributions.Categorical(logits=logits)
                action = dist.sample()
                log_prob = dist.log_prob(action)

            next_obs, reward, done, _ = env.step(action.item())

            # ICM intrinsic reward (no gradients needed during collection)
            next_obs_tensor = torch.FloatTensor(next_obs).unsqueeze(0).cuda()
            with torch.no_grad():
                reward_int, _, _ = icm(obs_tensor, next_obs_tensor, action)
            reward_int = reward_int.item()

            trajectories.append({
                'obs': obs,
                'action': action.item(),
                'reward_ext': reward,
                'reward_int': reward_int,
                'value': value.item(),
                'log_prob': log_prob.item(),
                'done': done
            })

            obs = next_obs
            episode_return += reward

            if done:
                # Track the full episode return even when an episode spans rollouts
                episode_rewards.append(episode_return)
                episode_return = 0.0
                obs = env.reset()

        # ===== Compute GAE =====
        advantages = []
        returns = []
        gae = 0
        # Bootstrap from the value of the state following the rollout
        with torch.no_grad():
            _, next_value = policy(torch.FloatTensor(obs).unsqueeze(0).cuda())
        next_value = next_value.item()

        for t in reversed(range(len(trajectories))):
            reward = trajectories[t]['reward_ext'] + beta_int * trajectories[t]['reward_int']
            delta = reward + gamma * next_value * (1 - trajectories[t]['done']) - trajectories[t]['value']
            gae = delta + gamma * lam * (1 - trajectories[t]['done']) * gae

            advantages.insert(0, gae)
            returns.insert(0, gae + trajectories[t]['value'])
            next_value = trajectories[t]['value']

        # ===== PPO update =====
        obs_batch = torch.FloatTensor(np.stack([t['obs'] for t in trajectories])).cuda()
        # Next observations for the ICM loss (last step reuses its own obs)
        next_obs_batch = torch.FloatTensor(np.stack(
            [trajectories[i + 1]['obs'] if i + 1 < len(trajectories) else trajectories[i]['obs']
             for i in range(len(trajectories))]
        )).cuda()
        actions_batch = torch.LongTensor([t['action'] for t in trajectories]).cuda()
        old_log_probs = torch.FloatTensor([t['log_prob'] for t in trajectories]).cuda()
        advantages = torch.FloatTensor(advantages).cuda()
        returns = torch.FloatTensor(returns).cuda()

        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(4):  # 4 epochs
            logits, values = policy(obs_batch)
            dist = torch.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions_batch)
            entropy = dist.entropy().mean()

            # PPO loss
            ratio = torch.exp(log_probs - old_log_probs)
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            critic_loss = F.mse_loss(values.squeeze(), returns)

            # ICM loss
            _, forward_loss, inverse_loss = icm(obs_batch, next_obs_batch, actions_batch)

            # Total loss
            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy + forward_loss + inverse_loss

            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(list(policy.parameters()) + list(icm.parameters()), 0.5)
            optimizer.step()

        # ===== Logging =====
        if update % 100 == 0:
            avg_reward = np.mean(episode_rewards[-10:]) if episode_rewards else 0
            print(f"Update {update}, Avg Reward: {avg_reward:.2f}, Episodes: {len(episode_rewards)}")

if __name__ == '__main__':
    train_icm_ppo()

Tuning Tips

  1. Intrinsic reward coefficient: Start from 0.01, can increase to 0.1 in sparse reward environments
  2. Normalization: Normalize intrinsic and external rewards separately to avoid scale mismatch
  3. Frame skip: In Atari, repeat each action for 4 frames to accelerate training
  4. Gradient clipping: ICM gradients can be large, clip to 0.5 to prevent explosion
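As a concrete sketch of tip 2, a separate running normalizer can be kept for intrinsic and extrinsic rewards. The class name and the Welford-style update below are illustrative, not from any specific paper:

```python
class RunningRewardNorm:
    """Tracks a running standard deviation of rewards and divides by it.
    One instance per reward stream keeps the two scales independent."""
    def __init__(self, eps=1e-8):
        self.count = eps   # small initial count avoids division by zero
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations (Welford)
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = (self.m2 / self.count) ** 0.5
        return x / (std + self.eps)

# Separate normalizers for the two reward streams
int_norm, ext_norm = RunningRewardNorm(), RunningRewardNorm()
```

In the training loop one would call `update` on each raw reward and feed the normalized values into the advantage computation.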

Expected Results

Training on a single GPU (RTX 3090):

  • 10 million steps: agent learns to reach the first room (100 points)
  • 30 million steps: agent learns to get the key and open the door (400 points)
  • 100 million steps: average score 6000+, surpassing DQN's 0

Theoretical Analysis and Open Problems

Q&A: Common Questions About Exploration Strategies

Q1: Why is ICM's inverse model so important?

A: The inverse model forces the feature encoder φ to preserve only action-relevant information. Imagine a counterexample: without the inverse model, φ might only encode the background; the forward model would then assign high rewards to background changes, causing the agent to "pseudo-explore" in front of random backgrounds. By predicting actions, the inverse model forces φ to encode the positions of the agent and interactive objects.

Q2: Why doesn't RND get fooled by TV noise?

A: The random network f acts as a "hash function" — although each frame's pixels differ, states of the same type (noise screens) are mapped to similar outputs. The predictor network f̂ quickly learns to map noise screens to a roughly fixed vector, the prediction error drops toward 0, and the intrinsic reward vanishes. The key is that the random network's outputs are insensitive to irrelevant changes (like noise) but sensitive to structural changes (like a new object appearing).
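This "prediction error collapses on revisited states" behavior is easy to reproduce on toy data. The sketch below uses tiny MLPs on random vectors in place of the conv networks RND runs on pixels; all sizes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_net():
    # Toy MLPs stand in for the conv networks used on Atari frames
    return nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))

target = make_net()                 # fixed, randomly initialized network f
for p in target.parameters():
    p.requires_grad_(False)
predictor = make_net()              # trained predictor network f_hat
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_reward(obs):
    """Per-state novelty = prediction error; takes one gradient step per call."""
    with torch.no_grad():
        t = target(obs)
    pred = predictor(obs)
    err = ((pred - t) ** 2).mean(dim=-1)
    loss = err.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return err.detach()

# Revisiting the same batch of "states" drives their novelty toward zero
obs = torch.randn(16, 8)
first = intrinsic_reward(obs).mean().item()
for _ in range(200):
    last = intrinsic_reward(obs).mean().item()
```

After 200 revisits `last` is well below `first` — the novelty bonus has been consumed, which is exactly the anti-"noisy TV" property described above.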

Q3: Why can NGU's episodic memory solve "key-door" tasks?

A: At the start of each new episode, episodic memory is cleared, so the "get key" state becomes novel again. Even if the agent visited this state 100 times historically, in the current episode it still gives high reward. This ensures the agent doesn't abandon exploring certain necessary states due to frequent global visits.
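A minimal sketch of such a per-episode novelty signal is a k-nearest-neighbor lookup over stored embeddings. The class and parameter names are illustrative, and the NGU paper's kernel includes extra terms (a running distance scale and pseudo-count) omitted here:

```python
import numpy as np

class SimpleEpisodicMemory:
    """Novelty is the inverse of summed similarity to the k nearest
    stored embeddings; clearing the memory restores full novelty."""
    def __init__(self, k=10, eps=1e-3):
        self.embeddings = []
        self.k = k
        self.eps = eps

    def novelty(self, e):
        if not self.embeddings:
            return 1.0  # first visit in this episode: maximally novel
        d = np.array([np.sum((e - m) ** 2) for m in self.embeddings])
        knn = np.sort(d)[:self.k]
        # Nearby stored states contribute large kernel values -> low novelty
        sim = np.sum(self.eps / (knn + self.eps))
        return 1.0 / np.sqrt(sim + 1e-8)

    def add(self, e):
        self.embeddings.append(e)

    def reset(self):
        # Called at episode start: every state becomes novel again
        self.embeddings.clear()
```

Calling `reset()` at each episode boundary is what makes the "get key" state rewarding again in every new attempt.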

Q4: Can intrinsic rewards cause agents to ignore external rewards?

A: This is a real problem, sometimes called the "exploration trap." Solutions:

  1. Gradually decay the intrinsic reward coefficient β (though the NGU paper found a fixed β also works)
  2. Dual value networks: one predicts extrinsic returns, one predicts intrinsic returns, optimized separately
  3. Reset intrinsic rewards after task success (to avoid over-exploring already-solved tasks)

Q5: How to use exploration strategies in continuous control tasks (like robots)?

A: Exploration in continuous control is more complex because the action space is continuous. Common methods:

  • Add Gaussian noise to actions: a = π(s) + ε, with ε ~ N(0, σ²)
  • Parameter space noise: add noise to the policy network's parameters
  • ICM/RND apply equally to continuous control — the action a just needs to be included as a model input
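The Gaussian action-noise scheme fits in a few lines; the σ value and the clipping range below are illustrative defaults, not prescriptions:

```python
import numpy as np

def noisy_action(policy_mean, sigma=0.1, low=-1.0, high=1.0):
    """Continuous-control exploration: a = pi(s) + N(0, sigma^2),
    clipped back into the valid action range."""
    noise = np.random.normal(0.0, sigma, size=np.shape(policy_mean))
    return np.clip(policy_mean + noise, low, high)
```

In practice σ is often annealed over training, trading early exploration for late-stage precision.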

Q6: Can exploration strategies be used for imitation learning?

A: Yes! When Behavioral Cloning fails, exploration strategies help agents discover states missing from the demonstration trajectories. Combined approaches:

  • Use ICM to explore, but prefer states near expert trajectories
  • Use RND to identify states that differ from the expert distribution and actively learn them

Q7: How to explore in multi-agent environments?

A: Multi-agent exploration has unique challenges:

  • Other agents' policies keep changing, so the environment is non-stationary
  • The "joint policy space" to explore grows combinatorially
  • A common recipe: an independent ICM/RND per agent plus a centralized episodic memory

Q8: How to evaluate exploration strategy quality?

A: Besides task rewards, useful metrics include:

  • State coverage: number of unique states visited
  • Exploration efficiency: steps needed to reach a given coverage
  • Novelty curve: intrinsic reward over time (it should gradually decrease)
  • Success rate: probability of completing the task within a step budget
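State coverage, the first metric, can be approximated by counting discretized states. The class below is a toy sketch; the bin size is an arbitrary choice that must match the state scale:

```python
from collections import Counter

class CoverageTracker:
    """Counts unique discretized states visited — a simple proxy
    for exploration quality in low-dimensional state spaces."""
    def __init__(self, bin_size=0.25):
        self.bin_size = bin_size
        self.visits = Counter()

    def record(self, state):
        # Round each coordinate to its bin so nearby states share a key
        key = tuple(round(x / self.bin_size) for x in state)
        self.visits[key] += 1

    def unique_states(self):
        return len(self.visits)
```

For pixel observations one would first map frames through a feature encoder (or a hash, as in count-based methods) before binning.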

Q9: Why do count methods fail in high dimensions?

A: Due to the curse of dimensionality. In a d-dimensional continuous space, covering a fraction ε of the volume requires on the order of (1/ε)^d samples. When d is on the scale of raw Atari pixels (84 × 84), this number is astronomical. Density models and feature embeddings try to alleviate this by learning low-dimensional representations.
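The failure of naive counting comes down to simple arithmetic: a grid with `res` bins per dimension has `res ** d` cells, so filling even one sample per cell is hopeless in high dimensions:

```python
# Samples needed for one visit per cell of a res-per-dimension grid
res = 10
for d in (2, 10, 100):
    print(f"d={d}: {res ** d:.3g} samples")
```

At d = 2 this is a modest 100 samples; at d = 100 it already exceeds the number of atoms in the observable universe, long before reaching Atari's pixel dimensionality.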

Q10: Future research directions for exploration strategies?

A: Several frontier directions:

  • Active Exploration: not just visiting new states, but actively designing experiments to rapidly learn environment dynamics
  • Skill-Based Exploration: first learn reusable skills (like "jump" or "open door"), then combine skills to explore
  • Goal Generation: automatically generate challenging goals to guide exploration
  • Social Learning: learn from other agents' exploration and avoid repetition

  1. Curiosity-Driven Exploration
    • Curiosity-Driven Exploration by Self-Supervised Prediction (Pathak et al., ICML 2017)
    • Link: arXiv:1705.05363
  2. Random Network Distillation
    • Exploration by Random Network Distillation (Burda et al., ICLR 2019)
    • Link: arXiv:1810.12894
  3. Never Give Up
    • Never Give Up: Learning Directed Exploration Strategies (Badia et al., ICLR 2020)
    • Link: arXiv:2002.06038
  4. Count-Based Exploration
    • Unifying Count-Based Exploration and Intrinsic Motivation (Bellemare et al., NeurIPS 2016)
    • Link: arXiv:1606.01868
  5. Empowerment
    • Empowerment - An Introduction (Klyubin et al., 2005)
    • Variational Intrinsic Control (Gregor et al., ICLR 2017)
    • Link: arXiv:1611.07507
  6. Go-Explore
    • First Return Then Explore (Ecoffet et al., Nature 2021)
    • Link: arXiv:2004.12919
  7. Model-Based Exploration
    • Planning to Explore via Self-Supervised World Models (Sekar et al., ICML 2020)
    • Link: arXiv:2005.05960
  8. Disagreement-Based Exploration
  9. Information Gain Exploration
    • VIME: Variational Information Maximizing Exploration (Houthooft et al., NeurIPS 2016)
    • Link: arXiv:1605.09674
  10. Successor Features for Exploration
    • Deep Successor Reinforcement Learning (Kulkarni et al., 2016)
    • Link: arXiv:1606.02396

Core Formula Summary

Count-Based Exploration

$$r^{\text{int}}(s) = \frac{\beta}{\sqrt{N(s)}}$$

ICM Intrinsic Reward

$$r^{\text{int}}_t = \frac{\eta}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert^2$$

where $\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t\big)$ is the forward model's prediction.

RND Intrinsic Reward

$$r^{\text{int}}(s) = \big\lVert \hat{f}(s; \theta) - f(s) \big\rVert^2$$

NGU Mixed Reward

$$r^{\text{int}}_t = r^{\text{episodic}}_t \cdot \min\big(\max(\alpha_t, 1), L\big)$$

Episodic Novelty

$$r^{\text{episodic}}_t = \frac{1}{\sqrt{\sum_{m \in N_k} K(\phi(s_t), m)} + c}$$

where $N_k$ is the set of $k$ nearest neighbors of $\phi(s_t)$ in episodic memory and $K$ is a similarity kernel.

Summary and Outlook

Exploration strategies are key to moving reinforcement learning from toy problems to real applications. Traditional ε-greedy and Boltzmann exploration are extremely inefficient in high-dimensional sparse reward environments, while intrinsic motivation-based curiosity-driven methods achieve true autonomous exploration by rewarding state novelty. From count-based statistical methods to ICM's prediction error, from RND's random network distillation to NGU's episodic memory, each improvement stems from deep understanding of exploration's essence — not random wandering, but actively seeking unexpected, uncertain, and unknown things.

However, the exploration problem is far from solved. Current methods still rely on massive trial-and-error (NGU requires billions of training steps), making them difficult to apply in safety-critical systems (like autonomous driving). Future research may integrate active learning, causal reasoning, and human prior knowledge to achieve more efficient, interpretable, and transferable exploration strategies. Just as babies don't need to try millions of times to learn to walk, truly intelligent exploration should be sample-efficient, goal-directed, and continually learning — this is one of the core challenges for reinforcement learning research in the next decade.

From the perspective of exploration strategies, the next chapter will delve into Model-Based reinforcement learning — learning environment models for imagination and planning, from passive exploration to active planning, from reactive control to forward-looking decision-making, revealing another core path in reinforcement learning.

  • Post title: Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
  • Post author: Chen Kai
  • Create time: 2024-08-23 15:30:00
  • Post link: https://www.chenk.top/reinforcement-learning-4-exploration-and-curiosity-driven-learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.