Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
Chen Kai

One of the central challenges in reinforcement learning is the exploration-exploitation dilemma. An agent that only exploits known good policies may never discover better solutions, but excessive exploration wastes time on low-reward behaviors. Traditional methods like ε-greedy and Boltzmann exploration rely on randomness, which becomes extremely inefficient in high-dimensional state spaces with sparse rewards — in a game like Montezuma's Revenge, where the agent needs hundreds of precise steps to receive the first reward, pure random exploration virtually never succeeds. In recent years, inspired by cognitive-science theories of "intrinsic motivation," researchers have proposed curiosity-driven learning: rewarding agents with intrinsic rewards for exploring novel states, enabling continuous learning even when external rewards are zero. From count-based methods to ICM's prediction error, from RND's random network distillation to NGU's episodic memory, exploration strategies have evolved into a complete theoretical and engineering framework. This chapter systematically traces this evolution, analyzing the design motivation, mathematical principles, and implementation details of each method, and finally validates their effectiveness on Atari hard-exploration games.

The Exploration-Exploitation Dilemma: The Essence of the Problem

Why Exploration is So Difficult

In the DQN of Chapter 2 and the policy gradient methods of Chapter 3, we assumed agents could collect effective training data through exploration. But in practice, exploration itself is a core challenge:

Problem 1: Sparse Rewards and Delayed Feedback

Consider the classic Montezuma's Revenge game: the agent must first get a key, open a door, and jump over enemies — executing over a hundred operations before receiving the first +100 point reward. Before that, all states have zero return. With ε-greedy random exploration, the probability of finding an effective trajectory is on the order of $(1/10)^{100} = 10^{-100}$ (assuming 10 candidate actions per step), which is practically impossible.

Problem 2: Sparsity in High-Dimensional State Spaces

Atari game states are $210 \times 160$ pixel images (commonly downsampled to $84 \times 84$), giving on the order of $256^{84 \times 84}$ possible states. Even if the agent explores one million states, the coverage is a vanishingly small fraction — the vast majority of states will never be visited.

Problem 3: Attraction Basins of Local Optima

In environments with deceptive rewards, agents easily fall into local optima. For example, in a maze with a small reward in a dead end, the agent may repeatedly collect this small reward instead of exploring the path to larger rewards.

Problem 4: Deceptiveness of Deterministic Environments

In deterministic environments, ε-greedy makes the agent repeatedly try the same actions in the same states, wasting significant time. Worse, certain "interesting" stochastic states (like TV static) induce "pseudo-exploration": the agent thinks it is learning something new when the signal is actually worthless noise.

Limitations of Traditional Exploration Strategies

ε-greedy Exploration

During training, with probability $\epsilon$ select a random action, and with probability $1-\epsilon$ select the greedy action:

$$a_t = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1-\epsilon \end{cases}$$

Advantages: Simple and effective, suitable for low-dimensional discrete spaces (like FrozenLake).

Disadvantages:
- Exploration is aimless — in high-dimensional spaces, random actions almost never reach meaningful states
- The $\epsilon$ decay schedule is hard to tune — too fast leads to premature convergence, too slow hurts final performance
- Even worse in continuous action spaces — adding random noise to actions easily produces unreasonable behavior
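The scheme above can be sketched in a few lines of NumPy (the `q_values` array is a hypothetical stand-in for a learned Q-function, and the decay constants are purely illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(t, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponential decay schedule: eps_t = max(eps_min, eps_start * decay^t)."""
    return max(eps_min, eps_start * decay ** t)

rng = np.random.default_rng(0)
q = np.array([0.1, 0.5, 0.2])  # toy Q-values; action 1 is greedy
actions = [epsilon_greedy(q, decayed_epsilon(t), rng) for t in range(1000)]
```

Note that `eps_start`, `eps_min`, and `decay` are exactly the knobs the disadvantages list complains about: set them wrong and the agent either converges prematurely or keeps acting randomly.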

Boltzmann Exploration (Softmax)

Sample actions according to the softmax distribution of Q-values:

$$\pi(a \mid s) = \frac{\exp\!\left( Q(s,a) / \tau \right)}{\sum_{a'} \exp\!\left( Q(s,a') / \tau \right)}$$

where $\tau$ is the temperature parameter controlling exploration intensity. As $\tau \to 0$ the policy approaches greedy selection; as $\tau \to \infty$ it approaches uniform random.

Advantages: Adaptively adjusts exploration based on Q-value magnitudes, theoretically more elegant.

Disadvantages:
- Still random, aimless exploration
- When Q-value estimates are inaccurate (early in training), the distribution is easily biased
- Computing the softmax requires enumerating all actions, infeasible in large action spaces
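For comparison, a minimal Boltzmann sampler (plain NumPy with toy Q-values; the max-shift before exponentiating guards against overflow):

```python
import numpy as np

def boltzmann(q_values, tau, rng):
    """Sample an action from softmax(Q / tau)."""
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

rng = np.random.default_rng(0)
q = [1.0, 2.0, 0.5]
# Low temperature -> near-greedy; high temperature -> near-uniform
cold = [boltzmann(q, 0.05, rng) for _ in range(500)]
hot = [boltzmann(q, 50.0, rng) for _ in range(500)]
```

Running it shows the temperature effect described above: at $\tau = 0.05$ the sampler is essentially greedy on action 1, while at $\tau = 50$ all three actions are drawn almost uniformly.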

Upper Confidence Bound (UCB)

In the multi-armed bandit problem, UCB selects:

$$a_t = \arg\max_a \left[ \hat{Q}(a) + c \sqrt{\frac{\ln t}{N(a)}} \right]$$

where $t$ is the total number of samples, $N(a)$ is the number of times action $a$ has been sampled, and $c$ is the exploration coefficient. The second term is an "optimistic bonus" — the less an action is tried, the higher its confidence upper bound.

Advantages: Comes with regret-bound guarantees ($O(\log T)$ regret for UCB1), near-optimal in bandit problems.

Disadvantages:
- Requires explicitly counting visits to each state-action pair
- In large state spaces, most $(s, a)$ pairs are visited at most once, so $N(s, a) \in \{0, 1\}$ and the confidence intervals are uninformative
- Cannot generalize to unseen states
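A minimal UCB1 sketch on a toy Bernoulli bandit (the arms and their means are hypothetical) makes the bonus term concrete:

```python
import numpy as np

def ucb1(counts, values, c, t):
    """Pick argmax of Q_hat(a) + c * sqrt(ln t / N(a)); untried arms go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                            # each arm gets one forced try
    bonus = c * np.sqrt(np.log(t) / np.asarray(counts))
    return int(np.argmax(np.asarray(values) + bonus))

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]                    # hidden Bernoulli arm means
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]
for t in range(1, 2001):
    a = ucb1(counts, values, c=1.0, t=t)
    reward = float(rng.random() < true_means[a])
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]  # incremental mean
```

After a few thousand pulls the count vector concentrates on the best arm, which is the logarithmic-regret behavior UCB1 promises — and it also shows why the method needs explicit per-arm counts, the very thing that fails in large state spaces.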

Thompson Sampling

Maintain a posterior distribution $p(Q)$ over Q-functions; at each step sample $\hat{Q} \sim p(Q)$ and select greedily:

$$a_t = \arg\max_a \hat{Q}(s_t, a)$$

Advantages: Elegant Bayesian framework — exploration arises naturally from posterior uncertainty.

Disadvantages:
- Requires maintaining a distribution over the entire Q-function, which is computationally expensive for deep networks (though approximations like Bootstrapped DQN exist)
- Still relies on Q-value estimation; with sparse rewards, when all Q-values are zero, it degenerates to random exploration
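For intuition, here is the classic Beta-Bernoulli instantiation of Thompson sampling (a toy sketch with hypothetical arm means; deep-RL versions replace the exact Beta posterior with approximations such as Bootstrapped DQN):

```python
import numpy as np

def thompson_step(alpha, beta, rng):
    """Sample one mean estimate per arm from its Beta posterior, act greedily."""
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]     # hidden Bernoulli arm means
alpha = np.ones(3)               # Beta(1, 1) uniform priors
beta = np.ones(3)
pulls = np.zeros(3, dtype=int)
for _ in range(2000):
    a = thompson_step(alpha, beta, rng)
    reward = rng.random() < true_means[a]
    alpha[a] += reward           # posterior update: successes
    beta[a] += 1 - reward        # posterior update: failures
    pulls[a] += 1
```

The posterior of the best arm tightens as it is pulled more, so exploration of the worse arms dies off automatically — no $\epsilon$ schedule to tune.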

The Essence of Exploration: Seeking Novelty

Observing human and animal exploration behavior, we find a common feature: not random wandering, but actively seeking novel, unexpected, and uncertain things. Infants repeatedly play with new toys (not old ones), monkeys actively explore unknown maze areas, scientists pursue experimental phenomena that violate common sense.

This "curiosity" can be formalized as an intrinsic reward $r^i_t$ — even when the external environment provides no reward, the agent rewards itself for state novelty:

$$r_t = r^e_t + \beta \, r^i_t$$

where:
- $r^e_t$ is the external reward from the environment (the task objective)
- $r^i_t$ is the intrinsic reward computed by the agent itself (the exploration drive)
- $\beta$ is the balance coefficient

The key question: how to define and compute $r^i_t$ so that it truly measures state novelty?

Count-Based Methods: Explicitly Counting Visit Frequencies

Core Idea

The most intuitive "novelty" definition is: How many times have I seen this state before? The less visited, the more novel, the higher the intrinsic reward.

The simplest implementation uses the visit count $N(s)$ as the basis of the intrinsic reward:

$$r^i_t = \frac{\beta}{\sqrt{N(s_t)}}$$

or, more aggressively:

$$r^i_t = \frac{\beta}{N(s_t)}$$

Theoretical basis: This is equivalent to maximizing state visit entropy, encouraging uniform coverage of the state space.

Naive Implementation: Hash Table Counting

In small discrete state spaces (like GridWorld), can directly use a dictionary:

class CountBasedExploration:
    def __init__(self, beta=0.1):
        self.counts = {}  # state -> visit count
        self.beta = beta

    def get_intrinsic_reward(self, state):
        # Convert state to a hashable tuple
        state_key = tuple(state.flatten())
        count = self.counts.get(state_key, 0)
        self.counts[state_key] = count + 1
        return self.beta / (count + 1) ** 0.5

Problem: In high-dimensional continuous state spaces (like Atari pixels), states almost never repeat, so $N(s) = 1$ essentially always holds and the intrinsic reward degenerates to a constant.

Improvement: Density Models and Pseudo-Counts

Bellemare et al. (2016), in the paper Unifying Count-Based Exploration and Intrinsic Motivation, proposed using a density model $\rho(s)$ to estimate a state's "occurrence frequency" and derive a pseudo-count:

$$\hat{N}(s) = \frac{\rho(s)\,\left(1 - \rho'(s)\right)}{\rho'(s) - \rho(s)}$$

where $\rho(s)$ is the model's density before observing $s$, $\rho'(s)$ is the density after one more update on $s$, and $\rho$ is learned by a generative model (like PixelCNN). Specifically:

  1. Maintain a generative model $\rho_\theta$; estimate state density via the likelihood $\rho_\theta(s)$
  2. Online update: for each visited state $s_t$, take one gradient step maximizing $\log \rho_\theta(s_t)$
  3. Use the likelihoods before and after the update, $\rho(s_t)$ and $\rho'(s_t)$, to compute the pseudo-count $\hat{N}(s_t)$

Mathematical Derivation

In the ideal case, if $\rho_n(s) = N(s)/n$ is the empirical distribution after $n$ observed states, then after one more visit to $s$:

$$\rho'_n(s) = \frac{N(s) + 1}{n + 1}$$

where $\rho'_n$ is the distribution after visiting $s$. Solving the two equations for $N(s)$ gives:

$$\hat{N}(s) = \frac{\rho_n(s)\,\left(1 - \rho'_n(s)\right)}{\rho'_n(s) - \rho_n(s)}$$

In practice, $\rho$ is approximated with a neural network, and the likelihoods before and after the update step are used in place of $\rho_n$ and $\rho'_n$:

Code Implementation (Simplified)

import torch
import torch.nn as nn
import torch.nn.functional as F

class DensityModel(nn.Module):
    """Estimate state density with a (heavily simplified) PixelCNN-style per-pixel classifier"""
    def __init__(self, obs_shape):
        super().__init__()
        # Simplified: a plain CNN outputting 256 class logits per pixel
        # (a real PixelCNN would use masked, autoregressive convolutions)
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 256, 1)  # 256 class logits per pixel
        )

    def forward(self, obs):
        """obs: (B, 1, H, W), single-channel integer pixel values in [0, 255]; returns log p(obs)"""
        logits = self.net(obs.float() / 255.0)  # (B, 256, H, W)
        # Per-pixel log-likelihood: treat each pixel value as a 256-way class label
        log_probs = -F.cross_entropy(
            logits.permute(0, 2, 3, 1).reshape(-1, 256),  # (B*H*W, 256)
            obs.reshape(-1).long(),                       # (B*H*W,)
            reduction='none'
        )
        return log_probs.sum()

class CountBasedIntrinsic:
    def __init__(self, obs_shape, beta=0.05, lr=1e-4):
        self.model = DensityModel(obs_shape)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.beta = beta

    def compute_intrinsic_reward(self, obs):
        """Compute intrinsic reward and update the density model"""
        obs_tensor = torch.as_tensor(obs).unsqueeze(0)

        # Density before the update
        with torch.no_grad():
            log_p_old = self.model(obs_tensor)
            p_old = torch.exp(log_p_old)

        # One gradient step maximizing the likelihood of this state
        log_p_new = self.model(obs_tensor)
        loss = -log_p_new
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Density after the update
        with torch.no_grad():
            log_p_new = self.model(obs_tensor)
            p_new = torch.exp(log_p_new)

        # Pseudo-count: N ≈ p_old / (p_new - p_old), valid when densities are small
        pseudo_count = p_old / (p_new - p_old + 1e-8)
        pseudo_count = torch.clamp(pseudo_count, 1, 1e6)  # Guard against numerical issues

        intrinsic_reward = self.beta / torch.sqrt(pseudo_count)
        return intrinsic_reward.item()

Advantages and Limitations

Advantages:
- Clear theory, derived directly from the statistical meaning of visit frequency
- Significant gains on small-scale problems (like early Montezuma's Revenge experiments)

Limitations:
- Density models are hard to train — models like PixelCNN are computationally expensive and easily overfit
- The "noisy-TV problem": in stochastic environments every state looks "new" (because of pixel noise), and the intrinsic reward fails
- No generalization — two visually similar but pixel-wise different states are treated as completely distinct

ICM: Measuring Novelty Through Prediction Error

Motivation: From Pixels to Features

Count-based methods count directly on raw pixels, ignoring a fact: some pixel changes are irrelevant (like background leaves swaying), while others are critical (like enemy movement).

Curiosity-Driven Exploration by Self-Supervised Prediction (Pathak et al., ICML 2017) proposed ICM (Intrinsic Curiosity Module): instead of comparing pixels directly, compare whether the agent can predict features of the next state.

Core insight:
- If a state transition is predictable → the agent already understands the environment dynamics → low reward
- If a state transition is unpredictable → the agent has encountered a new phenomenon → high reward

ICM Architecture: Forward and Inverse Models

ICM contains two modules:

1. Feature Encoder

Map the raw state $s_t$ to a low-dimensional feature vector:

$$\phi(s_t) = \text{Encoder}_\theta(s_t)$$

Usually implemented with convolutional networks, preserving only information relevant to the agent's actions and filtering out background noise.

2. Forward Dynamics Model

Predict the next state's features:

$$\hat{\phi}(s_{t+1}) = f\!\left( \phi(s_t), a_t \right)$$

where $f$ is a small MLP. The forward model's prediction error is the intrinsic reward:

$$r^i_t = \frac{\eta}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|^2$$

3. Inverse Dynamics Model

Predict the action from consecutive features (a self-supervised auxiliary task):

$$\hat{a}_t = g\!\left( \phi(s_t), \phi(s_{t+1}) \right)$$

The inverse model's role is to force the feature encoder to preserve only action-relevant information. If $\phi$ encoded only the background, the inverse model could not predict actions; it succeeds only by encoding the agent and the objects it can interact with.

Mathematical Intuition

Why does prediction error measure novelty? Consider two extremes:

Case 1: Agent randomly walks in empty room

The state transition is deterministic and simple. The forward model quickly learns to predict it perfectly, $r^i_t \to 0$, and the agent stops exploring this area.

Case 2: Agent enters new room, encounters moving enemy

The enemy's behavior is unknown, so the forward model has a large prediction error and $r^i_t$ is high, motivating the agent to explore this area. As interaction accumulates, the model gradually learns the enemy's behavior patterns and $r^i_t$ decreases.

Case 3: TV noise (random pixels)

Raw pixels are completely random, but if the feature encoder works correctly it maps the noise to (nearly) the same features, because the noise is irrelevant to the agent's actions. The inverse model's supervision forces $\phi$ to ignore the noise, so $\hat{\phi}(s_{t+1}) \approx \phi(s_{t+1})$ and $r^i_t \approx 0$ — solving the noisy-TV problem!

Complete Code Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Intrinsic Curiosity Module"""
    def __init__(self, obs_shape, action_dim, feature_dim=256, eta=0.1):
        super().__init__()
        self.eta = eta                # Intrinsic reward scaling factor
        self.action_dim = action_dim

        # Feature encoder: obs -> feature_dim features
        self.feature_encoder = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim)  # Assumes 84x84 input
        )

        # Forward model: (feature_t, action) -> feature_{t+1}
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, feature_dim)
        )

        # Inverse model: (feature_t, feature_{t+1}) -> action
        self.inverse_model = nn.Sequential(
            nn.Linear(feature_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim)
        )

    def forward(self, obs, next_obs, action):
        """
        Compute intrinsic reward and losses
        Args:
            obs: (B, C, H, W)
            next_obs: (B, C, H, W)
            action: (B,) discrete indices or (B, action_dim) continuous
        Returns:
            intrinsic_reward: (B,)
            forward_loss: scalar
            inverse_loss: scalar
        """
        # Feature extraction
        phi = self.feature_encoder(obs)
        phi_next = self.feature_encoder(next_obs)

        # Forward model
        if action.ndim == 1:  # Discrete action: one-hot encode
            action_onehot = F.one_hot(action, num_classes=self.action_dim).float()
        else:
            action_onehot = action

        phi_next_pred = self.forward_model(torch.cat([phi, action_onehot], dim=1))
        forward_loss = F.mse_loss(phi_next_pred, phi_next.detach(), reduction='none').sum(dim=1)

        # Intrinsic reward = forward model prediction error
        intrinsic_reward = self.eta * forward_loss

        # Inverse model
        action_pred_logits = self.inverse_model(torch.cat([phi, phi_next], dim=1))
        if action.ndim == 1:
            inverse_loss = F.cross_entropy(action_pred_logits, action, reduction='none')
        else:
            inverse_loss = F.mse_loss(action_pred_logits, action, reduction='none').sum(dim=1)

        return intrinsic_reward, forward_loss.mean(), inverse_loss.mean()

    def compute_intrinsic_reward(self, obs, next_obs, action):
        """Only compute the intrinsic reward (for rollouts)"""
        with torch.no_grad():
            reward, _, _ = self.forward(obs, next_obs, action)
        return reward

class PPO_with_ICM:
    """PPO with an integrated ICM module"""
    def __init__(self, env, obs_shape, action_dim, lr=3e-4, beta_int=0.01):
        self.env = env
        self.beta_int = beta_int  # Intrinsic reward coefficient

        # PPO policy network (definition omitted, same as Chapter 3)
        self.policy = PolicyNetwork(obs_shape, action_dim)

        # ICM module
        self.icm = ICM(obs_shape, action_dim)

        # Optimizer: jointly optimize policy and ICM
        self.optimizer = torch.optim.Adam(
            list(self.policy.parameters()) + list(self.icm.parameters()),
            lr=lr
        )

    def collect_trajectories(self, num_steps):
        """Collect trajectories and compute mixed rewards"""
        trajectories = []
        obs = self.env.reset()

        for _ in range(num_steps):
            # Select action
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0)
            with torch.no_grad():
                action_dist = self.policy(obs_tensor)
                action = action_dist.sample()

            # Environment interaction
            next_obs, reward_ext, done, _ = self.env.step(action.item())

            # Compute intrinsic reward
            next_obs_tensor = torch.FloatTensor(next_obs).unsqueeze(0)
            reward_int = self.icm.compute_intrinsic_reward(
                obs_tensor, next_obs_tensor, action
            ).item()

            # Mixed reward
            reward_total = reward_ext + self.beta_int * reward_int

            trajectories.append({
                'obs': obs,
                'action': action.item(),
                'reward': reward_total,
                'reward_ext': reward_ext,
                'reward_int': reward_int,
                'next_obs': next_obs,
                'done': done
            })

            obs = next_obs if not done else self.env.reset()

        return trajectories

    def update(self, trajectories):
        """PPO update + ICM update"""
        # Convert to tensors
        obs = torch.FloatTensor([t['obs'] for t in trajectories])
        actions = torch.LongTensor([t['action'] for t in trajectories])
        next_obs = torch.FloatTensor([t['next_obs'] for t in trajectories])
        rewards = torch.FloatTensor([t['reward'] for t in trajectories])

        # ICM losses
        _, forward_loss, inverse_loss = self.icm(obs, next_obs, actions)
        icm_loss = forward_loss + inverse_loss

        # PPO loss (computation omitted, see Chapter 3)
        policy_loss = self.compute_ppo_loss(trajectories)

        # Total loss
        total_loss = policy_loss + icm_loss

        self.optimizer.zero_grad()
        total_loss.backward()
        self.optimizer.step()

        return {
            'policy_loss': policy_loss.item(),
            'forward_loss': forward_loss.item(),
            'inverse_loss': inverse_loss.item()
        }

Experimental Results

In Pathak et al.'s paper, ICM achieved breakthroughs on sparse reward tasks like VizDoom and Super Mario Bros:

  • No external rewards ($r^e_t = 0$): the agent learns to explore maps and avoid enemies using only intrinsic rewards
  • Montezuma's Revenge: Achieves average score of 6600 in 25 million steps (DQN gets 0)
  • Generalization ability: On new Mario levels, ICM explores 3x farther than baseline

Paper link: arXiv:1705.05363

Limitations

  1. Depends on environment determinism: In stochastic environments, unpredictability doesn't necessarily equal novelty (might just be noise)
  2. Feature encoder quality: if $\phi$ fails to learn well, the entire ICM fails
  3. Computational overhead: Need to train 3 additional networks (encoder, forward, inverse), about 2x slower than vanilla PPO

RND: Random Network Distillation

Motivation: Avoid Training Dynamics Models

One problem with ICM is the need to train a forward model to predict $\phi(s_{t+1})$. In complex environments the dynamics model itself is hard to learn, and if it learns poorly, the intrinsic rewards become unreliable.

Exploration by Random Network Distillation (Burda et al., ICLR 2019) proposed RND: completely avoid dynamics modeling, instead predict outputs of a random network.

Core idea:
- Fix a randomly initialized target network $f$ (never trained)
- Train a predictor network $\hat{f}_\theta$ to fit the target network's outputs
- The prediction error $\|\hat{f}_\theta(s) - f(s)\|^2$ serves as the intrinsic reward

Why does this work?

  • For frequently visited states, the predictor network has seen them many times, so the prediction error is small → low $r^i$
  • For new states, the predictor network has not seen them, so the prediction error is large → high $r^i$
  • Random network acts as "hash function," mapping similar states to similar outputs, naturally achieving generalization

Better yet, this avoids the "noisy-TV problem": even if pixels are random, random network outputs are similar for states of the same type (like "noisy screen"), predictor network quickly learns to ignore noise.

Mathematical Formalization

Definitions:
- Target network: $f : \mathcal{S} \to \mathbb{R}^k$ (randomly initialized, then frozen)
- Predictor network: $\hat{f}_\theta : \mathcal{S} \to \mathbb{R}^k$ (trained)

Intrinsic reward:

$$r^i_t = \left\| \hat{f}_\theta(s_{t+1}) - f(s_{t+1}) \right\|^2$$

Predictor network loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{s} \left[ \left\| \hat{f}_\theta(s) - f(s) \right\|^2 \right]$$

Key detail: To give intrinsic rewards a meaningful scale, they are usually normalized:

$$\tilde{r}^i_t = \frac{r^i_t - \mu_r}{\sigma_r}$$

where $\mu_r$ and $\sigma_r$ are the running mean and standard deviation of the intrinsic rewards.

Complete Code Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class RandomNetwork(nn.Module):
    """Randomly initialized target network (never trained)"""
    def __init__(self, obs_shape, output_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, output_dim)
        )
        # Freeze parameters
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, obs):
        return self.net(obs)

class PredictorNetwork(nn.Module):
    """Trained predictor network"""
    def __init__(self, obs_shape, output_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.LeakyReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim)
        )

    def forward(self, obs):
        return self.net(obs)

class RND:
    """Random Network Distillation module"""
    def __init__(self, obs_shape, output_dim=512, lr=1e-4):
        self.target_net = RandomNetwork(obs_shape, output_dim)
        self.predictor_net = PredictorNetwork(obs_shape, output_dim)
        self.optimizer = torch.optim.Adam(self.predictor_net.parameters(), lr=lr)

        # Normalization statistics
        self.reward_mean = 0.0
        self.reward_std = 1.0
        self.reward_history = []

    def compute_intrinsic_reward(self, obs):
        """
        Compute intrinsic reward
        Args:
            obs: (B, C, H, W)
        Returns:
            intrinsic_reward: (B,)
        """
        with torch.no_grad():
            target_feat = self.target_net(obs)
            pred_feat = self.predictor_net(obs)
            reward = torch.pow(pred_feat - target_feat, 2).sum(dim=1)

        return reward

    def update(self, obs_batch):
        """Update the predictor network"""
        target_feat = self.target_net(obs_batch).detach()
        pred_feat = self.predictor_net(obs_batch)
        loss = F.mse_loss(pred_feat, target_feat)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

    def normalize_reward(self, reward):
        """Normalize intrinsic rewards with running statistics"""
        self.reward_history.append(reward.mean().item())
        if len(self.reward_history) > 1000:
            self.reward_history.pop(0)

        self.reward_mean = np.mean(self.reward_history)
        self.reward_std = np.std(self.reward_history) + 1e-8

        return (reward - self.reward_mean) / self.reward_std

class PPO_with_RND:
    """PPO with an integrated RND module"""
    def __init__(self, env, obs_shape, action_dim, lr=3e-4, beta_int=1.0, gamma_int=0.99):
        self.env = env
        self.beta_int = beta_int      # Intrinsic reward coefficient
        self.gamma_int = gamma_int    # Discount for intrinsic returns (usually close to 1)

        # Policy network (definition omitted)
        self.policy = PolicyNetwork(obs_shape, action_dim)

        # RND module
        self.rnd = RND(obs_shape)

        # Two value networks: one predicts external returns, one intrinsic returns
        self.value_ext = ValueNetwork(obs_shape)
        self.value_int = ValueNetwork(obs_shape)

        self.optimizer = torch.optim.Adam(
            list(self.policy.parameters()) +
            list(self.value_ext.parameters()) +
            list(self.value_int.parameters()),
            lr=lr
        )

    def collect_trajectories(self, num_steps):
        """Collect trajectories with both reward streams"""
        trajectories = []
        obs = self.env.reset()

        for _ in range(num_steps):
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0)

            with torch.no_grad():
                # Select action
                action = self.policy(obs_tensor).sample()

                # Predict both values
                v_ext = self.value_ext(obs_tensor)
                v_int = self.value_int(obs_tensor)

            # Environment interaction
            next_obs, reward_ext, done, _ = self.env.step(action.item())

            # Compute intrinsic reward on the next observation
            next_obs_tensor = torch.FloatTensor(next_obs).unsqueeze(0)
            reward_int = self.rnd.compute_intrinsic_reward(next_obs_tensor)
            reward_int = self.rnd.normalize_reward(reward_int).item()

            trajectories.append({
                'obs': obs,
                'action': action.item(),
                'reward_ext': reward_ext,
                'reward_int': reward_int,
                'v_ext': v_ext.item(),
                'v_int': v_int.item(),
                'done': done
            })

            obs = next_obs if not done else self.env.reset()

        return trajectories

    def compute_gae(self, trajectories, gamma, lam):
        """Compute GAE separately for the external and intrinsic streams"""
        advantages_ext, advantages_int = [], []
        gae_ext = gae_int = 0.0
        T = len(trajectories)

        for t in reversed(range(T)):
            # Bootstrap from the next step's value (0 at the end of the rollout)
            next_v_ext = trajectories[t + 1]['v_ext'] if t + 1 < T else 0.0
            next_v_int = trajectories[t + 1]['v_int'] if t + 1 < T else 0.0

            delta_ext = trajectories[t]['reward_ext'] + gamma * next_v_ext * (1 - trajectories[t]['done']) - trajectories[t]['v_ext']
            # The intrinsic stream is non-episodic: ignore done flags
            delta_int = trajectories[t]['reward_int'] + self.gamma_int * next_v_int - trajectories[t]['v_int']

            gae_ext = delta_ext + gamma * lam * (1 - trajectories[t]['done']) * gae_ext
            gae_int = delta_int + self.gamma_int * lam * gae_int

            advantages_ext.insert(0, gae_ext)
            advantages_int.insert(0, gae_int)

        # Combine advantages
        advantages_total = [a_e + self.beta_int * a_i for a_e, a_i in zip(advantages_ext, advantages_int)]
        return advantages_total, advantages_ext, advantages_int

    def update(self, trajectories):
        """Update policy, value networks, and RND"""
        obs = torch.FloatTensor([t['obs'] for t in trajectories])

        # Update RND
        rnd_loss = self.rnd.update(obs)

        # Compute GAE and run the PPO update (details omitted)
        # ...

        return {'rnd_loss': rnd_loss}

Experimental Results

Burda et al. demonstrated RND's remarkable effectiveness in their paper:

  • Montezuma's Revenge: Average score 8152 (first time surpassing human-level 7385)
  • Pitfall: Score 70.4 (previous SOTA was 0)
  • Gravitar: Improved from DQN's 0 to 3500+ points

Key findings:
1. RND outperforms ICM on all Atari hard-exploration games
2. Intrinsic rewards need almost no discounting ($\gamma_{\text{int}}$ close to 1), since exploration is a long-term objective
3. Normalization is crucial — without it the intrinsic reward scale explodes

Paper link: arXiv:1810.12894

NGU: Episodic Memory and Fast-Slow Exploration

Motivation: From Global to Episodic Novelty

RND has an implicit assumption: once a state is visited once, it's never novel again. But in actual exploration, this is too strict:

Scenario 1: Key-door task

In Montezuma's Revenge, the agent needs to repeatedly "get key → open door → enter room." If after getting the key the first time, RND considers the key location no longer novel, the agent loses motivation to get the key again.

Scenario 2: Reversible exploration

Agent first explores right room, then explores left room. When returning to starting point, RND gives low reward (because starting point was visited many times), but this "return to start" is necessary to explore left side.

Never Give Up (NGU) (Badia et al., ICLR 2020) proposed episodic memory: consider not only global novelty of states, but also whether novel in current episode.

Core Design

NGU's intrinsic reward combines two parts:

$$r^i_t = r^{\text{episodic}}_t \cdot \min\!\left( \max\!\left( \alpha_t, 1 \right), L \right)$$

where:
- $r^{\text{episodic}}_t$: episodic novelty — how far $s_t$ is from its nearest-neighbor states within the current episode
- $\alpha_t$: lifetime novelty — how rarely $s_t$ has been visited over the entire training history (computed with RND)
- $L$: an upper bound that prevents reward explosion

1. Episodic Novelty

Maintain an episodic memory $M$, storing the embeddings $f(s_t)$ of all states visited in the current episode (produced by a trainable embedding network).

Compute the distances from $f(s_t)$ to its $k$ nearest neighbors $\{f_1, \dots, f_k\}$ in memory:

$$d_j = d\!\left( f(s_t), f_j \right)$$

where $d(\cdot, \cdot)$ is Euclidean distance; a small constant $c$ is added in the final expression to prevent division by zero.

Then convert each distance to a similarity with a kernel function:

$$K\!\left( f(s_t), f_j \right) = \frac{\epsilon}{\frac{d_j^2}{\bar{d}} + \epsilon}$$

where $\bar{d}$ is a running average of the neighbor distances (keeping the kernel scale-invariant) and $\epsilon$ is a small constant.

Episodic novelty is then defined as:

$$r^{\text{episodic}}_t = \frac{1}{\sqrt{\sum_{j=1}^{k} K\!\left( f(s_t), f_j \right)} + c}$$

Intuition: If $f(s_t)$ is far from everything in memory, the kernel sum in the denominator is small and $r^{\text{episodic}}_t$ is large; if it is close to some previously visited state, the denominator is large and the reward is small. The memory is cleared at the end of each episode and rebuilt in the next.

2. Lifetime Novelty

Use the RND prediction error, normalized by its running statistics:

$$\alpha_t = 1 + \frac{\left\| \hat{f}_\theta(s_t) - f(s_t) \right\|^2 - \mu_e}{\sigma_e}$$

This serves as the long-term exploration signal.

3. Adaptive Exploration Coefficients

NGU trains multiple agents in parallel, each with a different exploration coefficient $\beta_i$:

$$r^{(i)}_t = r^e_t + \beta_i \, r^i_t$$

High-$\beta$ agents focus on exploration, low-$\beta$ agents on exploitation. All agents share one experience pool, forming an "exploration-exploitation spectrum."
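A sketch of the coefficient spectrum (linear spacing for clarity; the NGU paper actually spaces the $\beta_i$ with a sigmoid schedule, and the values here are illustrative):

```python
import numpy as np

def beta_spectrum(n_agents, beta_max=0.3):
    """An exploration-exploitation spectrum: agent 0 is purely exploitative
    (beta = 0), agent n-1 is the most exploratory (beta = beta_max).
    Linear spacing is a simplification of the paper's sigmoid schedule."""
    return np.linspace(0.0, beta_max, n_agents)

betas = beta_spectrum(8)
# Each agent i maximizes r_ext + betas[i] * r_int; all share one replay buffer
```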

Simplified Implementation (Pseudocode)

import torch
import torch.nn as nn
import numpy as np
from collections import deque

class EpisodicMemory:
    """Episodic memory module"""
    def __init__(self, capacity=5000, k=10):
        self.memory = deque(maxlen=capacity)
        self.k = k  # number of nearest neighbors
        self.running_mean_dist = 1.0
        self.epsilon = 0.001
        self.c = 0.001

    def add(self, embedding):
        """Add a state embedding to memory"""
        self.memory.append(embedding.cpu().numpy())

    def compute_novelty(self, embedding):
        """Compute episodic novelty"""
        if len(self.memory) == 0:
            return 1.0  # First state of the episode: maximal novelty

        # Distances to all stored embeddings
        emb_np = embedding.cpu().numpy()
        distances = [np.linalg.norm(emb_np - mem) for mem in self.memory]

        # Keep only the k nearest neighbors
        k_nearest = sorted(distances)[:self.k]

        # Update the running average neighbor distance (normalizes the kernel)
        self.running_mean_dist = 0.99 * self.running_mean_dist + 0.01 * np.mean(k_nearest)

        # Kernel similarity over the k nearest neighbors
        similarities = [
            self.epsilon / ((d ** 2 / self.running_mean_dist) + self.epsilon)
            for d in k_nearest
        ]

        # Episodic novelty
        novelty = 1.0 / (np.sqrt(sum(similarities)) + self.c)
        return novelty

    def reset(self):
        """Clear memory at the end of the episode"""
        self.memory.clear()

class EmbeddingNetwork(nn.Module):
    """Trainable state embedding network"""
    def __init__(self, obs_shape, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_shape[0], 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embed_dim)
        )

    def forward(self, obs):
        return self.net(obs)

class NGU:
    """Never Give Up exploration module"""
    def __init__(self, obs_shape, beta=0.3, L=5.0):
        self.beta = beta
        self.L = L

        # Episodic memory
        self.episodic_memory = EpisodicMemory()

        # Trainable embedding network (trained separately with a self-supervised objective)
        self.embedding_net = EmbeddingNetwork(obs_shape)

        # RND for lifetime novelty
        self.rnd = RND(obs_shape)

    def compute_intrinsic_reward(self, obs):
        """Compute the combined intrinsic reward"""
        # Episodic novelty
        with torch.no_grad():
            embedding = self.embedding_net(obs)
        episodic_novelty = self.episodic_memory.compute_novelty(embedding)
        self.episodic_memory.add(embedding)

        # Lifetime novelty, clipped to [1, L]
        lifetime_novelty = self.rnd.compute_intrinsic_reward(obs).item()
        lifetime_novelty = max(min(lifetime_novelty, self.L), 1.0)

        # Combine
        intrinsic_reward = episodic_novelty * lifetime_novelty
        return intrinsic_reward

    def reset_episode(self):
        """Reset episodic memory at the end of the episode"""
        self.episodic_memory.reset()

Experimental Results

NGU achieved unprecedented exploration performance on DeepMind Lab and Atari:

  • Pitfall: Improved from RND's 70 to 5000+ points
  • Private Eye: First to solve this game (69000 points)
  • Montezuma's Revenge: Average 11000 points, surpassing human experts

Key findings:
1. Episodic memory is crucial — removing it causes a 50%+ performance drop
2. The multi-agent exploration spectrum accelerates learning — equivalent to training explorers and exploiters simultaneously
3. The embedding network requires careful training — the paper optimizes it with a self-supervised (contrastive) objective

Paper link: arXiv:2002.06038

Practical Case: Implementing ICM+PPO on Montezuma's Revenge

Environment Setup

# Install dependencies (extras are quoted so the brackets survive shells like zsh)
pip install "gym[atari]" torch torchvision opencv-python

# Install Atari ROMs
pip install "autorom[accept-rom-license]"

Complete Training Code

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from collections import deque
import cv2

# ========== Environment Wrapper ==========
class AtariWrapper:
    """Atari environment preprocessing (classic gym API: 4-tuple step)"""
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.frame_stack = deque(maxlen=4)

    def reset(self):
        obs = self.env.reset()
        obs = self._preprocess(obs)
        for _ in range(4):
            self.frame_stack.append(obs)
        return np.array(self.frame_stack)

    def step(self, action):
        total_reward = 0
        for _ in range(4):  # Frame skip
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break

        obs = self._preprocess(obs)
        self.frame_stack.append(obs)
        return np.array(self.frame_stack), total_reward, done, info

    def _preprocess(self, obs):
        """Convert to grayscale and resize to 84x84"""
        gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
        resized = cv2.resize(gray, (84, 84))
        return resized / 255.0

# ========== Neural Networks ==========
class ActorCritic(nn.Module):
    """Actor-Critic network"""
    def __init__(self, action_dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten()
        )

        self.fc = nn.Linear(64 * 7 * 7, 512)
        self.actor = nn.Linear(512, action_dim)
        self.critic = nn.Linear(512, 1)

    def forward(self, obs):
        x = self.conv(obs)
        x = F.relu(self.fc(x))
        logits = self.actor(x)
        value = self.critic(x)
        return logits, value

# ========== ICM Module (already defined above) ==========
# ...

# ========== Training Loop ==========
def train_icm_ppo(env_name='MontezumaRevengeNoFrameskip-v4', total_steps=10_000_000):
    env = AtariWrapper(env_name)
    action_dim = env.env.action_space.n
    obs_shape = (4, 84, 84)

    # Initialize networks
    policy = ActorCritic(action_dim).cuda()
    icm = ICM(obs_shape, action_dim).cuda()

    optimizer = torch.optim.Adam(
        list(policy.parameters()) + list(icm.parameters()),
        lr=3e-4
    )

    # Hyperparameters
    gamma = 0.99
    lam = 0.95
    clip_eps = 0.1
    beta_int = 0.01
    rollout_len = 128

    # Statistics
    episode_rewards = []
    episode_return = 0.0  # running extrinsic return of the current episode
    obs = env.reset()

    for update in range(total_steps // rollout_len):
        # ===== Collect trajectories =====
        trajectories = []
        for _ in range(rollout_len):
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0).cuda()

            with torch.no_grad():
                logits, value = policy(obs_tensor)
                dist = torch.distributions.Categorical(logits=logits)
                action = dist.sample()
                log_prob = dist.log_prob(action)

            next_obs, reward, done, _ = env.step(action.item())

            # ICM intrinsic reward (no gradients needed during collection)
            next_obs_tensor = torch.FloatTensor(next_obs).unsqueeze(0).cuda()
            with torch.no_grad():
                reward_int, _, _ = icm(obs_tensor, next_obs_tensor, action)
            reward_int = reward_int.item()

            trajectories.append({
                'obs': obs,
                'action': action.item(),
                'reward_ext': reward,
                'reward_int': reward_int,
                'value': value.item(),
                'log_prob': log_prob.item(),
                'done': done
            })

            obs = next_obs
            episode_return += reward

            if done:
                # Track the full episode return even when an episode spans rollouts
                episode_rewards.append(episode_return)
                episode_return = 0.0
                obs = env.reset()

        # ===== Compute GAE =====
        advantages = []
        returns = []
        gae = 0
        # Bootstrap from the value of the state following the rollout
        with torch.no_grad():
            _, next_value = policy(torch.FloatTensor(obs).unsqueeze(0).cuda())
        next_value = next_value.item()

        for t in reversed(range(len(trajectories))):
            reward = trajectories[t]['reward_ext'] + beta_int * trajectories[t]['reward_int']
            delta = reward + gamma * next_value * (1 - trajectories[t]['done']) - trajectories[t]['value']
            gae = delta + gamma * lam * (1 - trajectories[t]['done']) * gae

            advantages.insert(0, gae)
            returns.insert(0, gae + trajectories[t]['value'])
            next_value = trajectories[t]['value']

        # ===== PPO update =====
        obs_batch = torch.FloatTensor(np.stack([t['obs'] for t in trajectories])).cuda()
        # Next observations for the ICM loss (last step reuses its own obs)
        next_obs_batch = torch.FloatTensor(np.stack(
            [trajectories[i + 1]['obs'] if i + 1 < len(trajectories) else trajectories[i]['obs']
             for i in range(len(trajectories))]
        )).cuda()
        actions_batch = torch.LongTensor([t['action'] for t in trajectories]).cuda()
        old_log_probs = torch.FloatTensor([t['log_prob'] for t in trajectories]).cuda()
        advantages = torch.FloatTensor(advantages).cuda()
        returns = torch.FloatTensor(returns).cuda()

        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(4):  # 4 epochs
            logits, values = policy(obs_batch)
            dist = torch.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions_batch)
            entropy = dist.entropy().mean()

            # PPO loss
            ratio = torch.exp(log_probs - old_log_probs)
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            critic_loss = F.mse_loss(values.squeeze(), returns)

            # ICM loss
            _, forward_loss, inverse_loss = icm(obs_batch, next_obs_batch, actions_batch)

            # Total loss
            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy + forward_loss + inverse_loss

            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(list(policy.parameters()) + list(icm.parameters()), 0.5)
            optimizer.step()

        # ===== Logging =====
        if update % 100 == 0:
            avg_reward = np.mean(episode_rewards[-10:]) if episode_rewards else 0
            print(f"Update {update}, Avg Reward: {avg_reward:.2f}, Episodes: {len(episode_rewards)}")

if __name__ == '__main__':
    train_icm_ppo()

Tuning Tips

  1. Intrinsic reward coefficient: Start from 0.01, can increase to 0.1 in sparse reward environments
  2. Normalization: Normalize intrinsic and external rewards separately to avoid scale mismatch
  3. Frame skip: In Atari, repeat each action for 4 frames to accelerate training
  4. Gradient clipping: ICM gradients can be large, clip to 0.5 to prevent explosion
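As a concrete sketch of tip 2, a separate running normalizer can be kept for intrinsic and extrinsic rewards. The class name and the Welford-style update below are illustrative, not from any specific paper:

```python
class RunningRewardNorm:
    """Tracks a running standard deviation of rewards and divides by it.
    One instance per reward stream keeps the two scales independent."""
    def __init__(self, eps=1e-8):
        self.count = eps   # small initial count avoids division by zero
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations (Welford)
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = (self.m2 / self.count) ** 0.5
        return x / (std + self.eps)

# Separate normalizers for the two reward streams
int_norm, ext_norm = RunningRewardNorm(), RunningRewardNorm()
```

In the training loop one would call `update` on each raw reward and feed the normalized values into the advantage computation.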

Expected Results

Training on a single GPU (RTX 3090):

  • 10 million steps: agent learns to reach the first room (100 points)
  • 30 million steps: agent learns to get the key and open the door (400 points)
  • 100 million steps: average score 6000+, surpassing DQN's 0

Theoretical Analysis and Open Problems

Q&A: Common Questions About Exploration Strategies

Q1: Why is ICM's inverse model so important?

A: The inverse model forces the feature encoder φ to preserve only action-relevant information. Imagine a counterexample: without the inverse model, φ might only encode the background; the forward model would then assign high rewards to background changes, causing the agent to "pseudo-explore" in front of random backgrounds. By predicting actions, the inverse model forces φ to encode the positions of the agent and interactive objects.

Q2: Why doesn't RND get fooled by TV noise?

A: The random network f acts as a "hash function" — although each frame's pixels differ, states of the same type (noise screens) are mapped to similar outputs. The predictor network f̂ quickly learns to map noise screens to a roughly fixed vector, the prediction error drops toward 0, and the intrinsic reward vanishes. The key is that the random network's outputs are insensitive to irrelevant changes (like noise) but sensitive to structural changes (like a new object appearing).
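This "prediction error collapses on revisited states" behavior is easy to reproduce on toy data. The sketch below uses tiny MLPs on random vectors in place of the conv networks RND runs on pixels; all sizes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_net():
    # Toy MLPs stand in for the conv networks used on Atari frames
    return nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))

target = make_net()                 # fixed, randomly initialized network f
for p in target.parameters():
    p.requires_grad_(False)
predictor = make_net()              # trained predictor network f_hat
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_reward(obs):
    """Per-state novelty = prediction error; takes one gradient step per call."""
    with torch.no_grad():
        t = target(obs)
    pred = predictor(obs)
    err = ((pred - t) ** 2).mean(dim=-1)
    loss = err.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return err.detach()

# Revisiting the same batch of "states" drives their novelty toward zero
obs = torch.randn(16, 8)
first = intrinsic_reward(obs).mean().item()
for _ in range(200):
    last = intrinsic_reward(obs).mean().item()
```

After 200 revisits `last` is well below `first` — the novelty bonus has been consumed, which is exactly the anti-"noisy TV" property described above.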

Q3: Why can NGU's episodic memory solve "key-door" tasks?

A: At the start of each new episode, episodic memory is cleared, so the "get key" state becomes novel again. Even if the agent visited this state 100 times historically, in the current episode it still gives high reward. This ensures the agent doesn't abandon exploring certain necessary states due to frequent global visits.
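A minimal sketch of such a per-episode novelty signal is a k-nearest-neighbor lookup over stored embeddings. The class and parameter names are illustrative, and the NGU paper's kernel includes extra terms (a running distance scale and pseudo-count) omitted here:

```python
import numpy as np

class SimpleEpisodicMemory:
    """Novelty is the inverse of summed similarity to the k nearest
    stored embeddings; clearing the memory restores full novelty."""
    def __init__(self, k=10, eps=1e-3):
        self.embeddings = []
        self.k = k
        self.eps = eps

    def novelty(self, e):
        if not self.embeddings:
            return 1.0  # first visit in this episode: maximally novel
        d = np.array([np.sum((e - m) ** 2) for m in self.embeddings])
        knn = np.sort(d)[:self.k]
        # Nearby stored states contribute large kernel values -> low novelty
        sim = np.sum(self.eps / (knn + self.eps))
        return 1.0 / np.sqrt(sim + 1e-8)

    def add(self, e):
        self.embeddings.append(e)

    def reset(self):
        # Called at episode start: every state becomes novel again
        self.embeddings.clear()
```

Calling `reset()` at each episode boundary is what makes the "get key" state rewarding again in every new attempt.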

Q4: Can intrinsic rewards cause agents to ignore external rewards?

A: This is a real problem, sometimes called the "exploration trap." Solutions:

  1. Gradually decay the intrinsic reward coefficient β (though the NGU paper found a fixed β also works)
  2. Dual value networks: one predicts extrinsic returns, one predicts intrinsic returns, optimized separately
  3. Reset intrinsic rewards after task success (to avoid over-exploring already-solved tasks)

Q5: How to use exploration strategies in continuous control tasks (like robots)?

A: Exploration in continuous control is more complex because the action space is continuous. Common methods:

  • Add Gaussian noise to actions: a = π(s) + ε, with ε ~ N(0, σ²)
  • Parameter space noise: add noise to the policy network's parameters
  • ICM/RND apply equally to continuous control — the action a just needs to be included as a model input
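The Gaussian action-noise scheme fits in a few lines; the σ value and the clipping range below are illustrative defaults, not prescriptions:

```python
import numpy as np

def noisy_action(policy_mean, sigma=0.1, low=-1.0, high=1.0):
    """Continuous-control exploration: a = pi(s) + N(0, sigma^2),
    clipped back into the valid action range."""
    noise = np.random.normal(0.0, sigma, size=np.shape(policy_mean))
    return np.clip(policy_mean + noise, low, high)
```

In practice σ is often annealed over training, trading early exploration for late-stage precision.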

Q6: Can exploration strategies be used for imitation learning?

A: Yes! When Behavioral Cloning fails, exploration strategies help agents discover states missing from the demonstration trajectories. Combined approaches:

  • Use ICM to explore, but prefer states near expert trajectories
  • Use RND to identify states that differ from the expert distribution and actively learn them

Q7: How to explore in multi-agent environments?

A: Multi-agent exploration has unique challenges:

  • Other agents' policies keep changing, so the environment is non-stationary
  • The "joint policy space" to explore grows combinatorially
  • A common recipe: an independent ICM/RND per agent plus a centralized episodic memory

Q8: How to evaluate exploration strategy quality?

A: Besides task rewards, useful metrics include:

  • State coverage: number of unique states visited
  • Exploration efficiency: steps needed to reach a given coverage
  • Novelty curve: intrinsic reward over time (it should gradually decrease)
  • Success rate: probability of completing the task within a step budget
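State coverage, the first metric, can be approximated by counting discretized states. The class below is a toy sketch; the bin size is an arbitrary choice that must match the state scale:

```python
from collections import Counter

class CoverageTracker:
    """Counts unique discretized states visited — a simple proxy
    for exploration quality in low-dimensional state spaces."""
    def __init__(self, bin_size=0.25):
        self.bin_size = bin_size
        self.visits = Counter()

    def record(self, state):
        # Round each coordinate to its bin so nearby states share a key
        key = tuple(round(x / self.bin_size) for x in state)
        self.visits[key] += 1

    def unique_states(self):
        return len(self.visits)
```

For pixel observations one would first map frames through a feature encoder (or a hash, as in count-based methods) before binning.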

Q9: Why do count methods fail in high dimensions?

A: Due to the curse of dimensionality. In a d-dimensional continuous space, covering a fraction ε of the volume requires on the order of (1/ε)^d samples. When d is on the scale of raw Atari pixels (84 × 84), this number is astronomical. Density models and feature embeddings try to alleviate this by learning low-dimensional representations.
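The failure of naive counting comes down to simple arithmetic: a grid with `res` bins per dimension has `res ** d` cells, so filling even one sample per cell is hopeless in high dimensions:

```python
# Samples needed for one visit per cell of a res-per-dimension grid
res = 10
for d in (2, 10, 100):
    print(f"d={d}: {res ** d:.3g} samples")
```

At d = 2 this is a modest 100 samples; at d = 100 it already exceeds the number of atoms in the observable universe, long before reaching Atari's pixel dimensionality.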

Q10: Future research directions for exploration strategies?

A: Several frontier directions:

  • Active Exploration: not just visiting new states, but actively designing experiments to rapidly learn environment dynamics
  • Skill-Based Exploration: first learn reusable skills (like "jump" or "open door"), then combine skills to explore
  • Goal Generation: automatically generate challenging goals to guide exploration
  • Social Learning: learn from other agents' exploration and avoid repetition

  1. Curiosity-Driven Exploration
    • Curiosity-Driven Exploration by Self-Supervised Prediction (Pathak et al., ICML 2017)
    • Link: arXiv:1705.05363
  2. Random Network Distillation
    • Exploration by Random Network Distillation (Burda et al., ICLR 2019)
    • Link: arXiv:1810.12894
  3. Never Give Up
    • Never Give Up: Learning Directed Exploration Strategies (Badia et al., ICLR 2020)
    • Link: arXiv:2002.06038
  4. Count-Based Exploration
    • Unifying Count-Based Exploration and Intrinsic Motivation (Bellemare et al., NeurIPS 2016)
    • Link: arXiv:1606.01868
  5. Empowerment
    • Empowerment - An Introduction (Klyubin et al., 2005)
    • Variational Intrinsic Control (Gregor et al., ICLR 2017)
    • Link: arXiv:1611.07507
  6. Go-Explore
    • First Return Then Explore (Ecoffet et al., Nature 2021)
    • Link: arXiv:2004.12919
  7. Model-Based Exploration
    • Planning to Explore via Self-Supervised World Models (Sekar et al., ICML 2020)
    • Link: arXiv:2005.05960
  8. Disagreement-Based Exploration
  9. Information Gain Exploration
    • VIME: Variational Information Maximizing Exploration (Houthooft et al., NeurIPS 2016)
    • Link: arXiv:1605.09674
  10. Successor Features for Exploration
    • Deep Successor Reinforcement Learning (Kulkarni et al., 2016)
    • Link: arXiv:1606.02396

Core Formula Summary

Count-Based Exploration

$$r^{\text{int}}(s) = \frac{\beta}{\sqrt{N(s)}}$$

ICM Intrinsic Reward

$$r^{\text{int}}_t = \frac{\eta}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert^2$$

where $\hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t\big)$ is the forward model's prediction.

RND Intrinsic Reward

$$r^{\text{int}}(s) = \big\lVert \hat{f}(s; \theta) - f(s) \big\rVert^2$$

NGU Mixed Reward

$$r^{\text{int}}_t = r^{\text{episodic}}_t \cdot \min\big(\max(\alpha_t, 1), L\big)$$

Episodic Novelty

$$r^{\text{episodic}}_t = \frac{1}{\sqrt{\sum_{m \in N_k} K(\phi(s_t), m)} + c}$$

where $N_k$ is the set of $k$ nearest neighbors of $\phi(s_t)$ in episodic memory and $K$ is a similarity kernel.

Summary and Outlook

Exploration strategies are key to moving reinforcement learning from toy problems to real applications. Traditional ε-greedy and Boltzmann exploration are extremely inefficient in high-dimensional sparse reward environments, while intrinsic motivation-based curiosity-driven methods achieve true autonomous exploration by rewarding state novelty. From count-based statistical methods to ICM's prediction error, from RND's random network distillation to NGU's episodic memory, each improvement stems from deep understanding of exploration's essence — not random wandering, but actively seeking unexpected, uncertain, and unknown things.

However, the exploration problem is far from solved. Current methods still rely on massive trial-and-error (NGU requires billions of training steps), making them difficult to apply in safety-critical systems (like autonomous driving). Future research may integrate active learning, causal reasoning, and human prior knowledge to achieve more efficient, interpretable, and transferable exploration strategies. Just as babies don't need to try millions of times to learn to walk, truly intelligent exploration should be sample-efficient, goal-directed, and continually learning — this is one of the core challenges for reinforcement learning research in the next decade.

From the perspective of exploration strategies, the next chapter will delve into Model-Based reinforcement learning — learning environment models for imagination and planning, from passive exploration to active planning, from reactive control to forward-looking decision-making, revealing another core path in reinforcement learning.

  • Post title: Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning
  • Post author: Chen Kai
  • Create time: 2024-08-23 15:30:00
  • Post link: https://www.chenk.top/reinforcement-learning-4-exploration-and-curiosity-driven-learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.