Reinforcement Learning (11): Hierarchical Reinforcement Learning and Meta-Learning
Chen Kai BOSS

Traditional reinforcement learning treats complex tasks as flat decision sequences: an atomic action is selected at each timestep, and the policy is gradually optimized through temporal differences. This paradigm works for simple tasks but becomes inefficient and hard to generalize in tasks requiring long-term planning and multi-level decisions (such as robot manipulation, game completion, and autonomous driving). Humans solve complex problems with hierarchical strategies: a large goal is decomposed into subgoals (e.g., "make breakfast" decomposes into "brew coffee," "fry eggs," "toast bread"), each subgoal corresponding to a temporally extended action sequence. Hierarchical Reinforcement Learning (Hierarchical RL) brings this hierarchical thinking into RL, learning multi-level policies through temporal abstraction to improve sample efficiency and interpretability. Meanwhile, traditional RL must learn each new task from scratch, whereas humans quickly adapt to new scenarios (e.g., quickly learning to ride a motorcycle after learning to ride a bicycle). Meta-Reinforcement Learning (Meta-RL) studies how to "learn to learn": by training across multiple related tasks, agents can rapidly adapt to new tasks with minimal samples. From the Options framework's semi-Markov decision processes to MAML's second-order gradient optimization, from MAXQ's value decomposition to RL²'s memory augmentation, hierarchical and meta-learning demonstrate reinforcement learning's enormous potential in structure and generalization. This chapter systematically examines these cutting-edge methods and helps you implement the Options and MAML algorithms through complete code.

Hierarchical Reinforcement Learning: Temporal Abstraction and Options Framework

Why Do We Need Hierarchy?

Limitations of Flat RL:

- Difficult Long-term Credit Assignment: in long episodes (e.g., 1000 steps), it is hard to determine which actions led to the final reward
- Inefficient Exploration: the atomic-action exploration space grows exponentially with time
- Non-reusable Policies: a policy learned for an "open door" task cannot transfer to an "open window" task
- Hard to Interpret: flat policies are black-box mappings from states to actions, lacking human-understandable structure

Advantages of Hierarchy:

- Temporal Abstraction: package multi-step action sequences into "macro-actions," reducing the number of decision points
- Modularity: learn reusable sub-policies (e.g., "walk to door," "turn handle")
- Faster Exploration: exploring at the abstract level, each step "jumps" across more states
- Interpretability: policies decompose into high-level goals and low-level execution, which is easy to understand and debug

Options Framework: Semi-Markov Decision Processes

Options Definition: An Option is a tuple $o = (\mathcal{I}_o, \pi_o, \beta_o)$:

- Initiation Set $\mathcal{I}_o \subseteq \mathcal{S}$: the set of states where the Option can be initiated
- Internal Policy $\pi_o(a \mid s)$: the policy executed while the Option is active
- Termination Condition $\beta_o(s) \in [0, 1]$: the probability of terminating the Option at state $s$

Execution Flow:

1. At state $s$, the agent selects an Option $o$ with $s \in \mathcal{I}_o$
2. Actions are executed according to $\pi_o$ until $\beta_o$ triggers termination (typically by Bernoulli sampling at each step)
3. Select the next Option and repeat

Relationship with MDP: Options transform the original MDP into a Semi-Markov Decision Process (SMDP):

- The state space is unchanged, but the action space expands from the atomic actions $\mathcal{A}$ to the Options set $\mathcal{O}$
- Time discretization changes from single steps to variable-length Option execution periods
- The reward is the cumulative discounted reward accrued during Option execution

Value Function: Define the Option-value function $Q^\mu(s, o)$: the expected return of selecting Option $o$ at state $s$ and subsequently selecting Options according to policy $\mu$:

$$Q^\mu(s, o) = R(s, o) + \sum_{s', k} \gamma^k \, p(s', k \mid s, o) \, V^\mu(s')$$

where $p(s', k \mid s, o)$ is the probability of reaching $s'$ in $k$ steps after executing $o$, and $R(s, o)$ is the cumulative discounted reward during the execution of $o$.

Bellman Equation:

$$Q(s, o) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^k \max_{o'} Q(s_{t+k}, o') \;\middle|\; s_t = s,\, o \right]$$

where the expectation is over the event produced by executing $o$ (including the terminal state $s_{t+k}$, the duration $k$, and the rewards along the way).

Options Learning: Intra-Option Q-Learning

Core Idea: No need to wait for Option execution to complete, can update Q-values at every step during Option execution.

Intra-Option Update: Suppose the agent is currently executing Option $o$, selects action $a_t$ at state $s_t$, transitions to $s_{t+1}$, and receives reward $r_{t+1}$:

$$Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \left[ r_{t+1} + \gamma U(s_{t+1}, o) - Q(s_t, o) \right]$$

where $U(s', o)$ is the "continuation value":

$$U(s', o) = \big(1 - \beta_o(s')\big)\, Q(s', o) + \beta_o(s') \max_{o'} Q(s', o')$$

Intuition:

- If the Option continues at $s'$ ($\beta_o(s') = 0$), the value is $Q(s', o)$
- If the Option terminates at $s'$ ($\beta_o(s') = 1$), the value is $\max_{o'} Q(s', o')$ (select the optimal next Option)
- In general, $U$ is the weighted combination of the two

Advantages:

- Updates at every step, wasting no experience
- Supports off-policy learning (while executing one Option, other Options' Q-values are also updated)
- Converges to the optimal Option-value function (proven in the tabular case)
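The update above in a minimal NumPy sketch; the dict-of-arrays Q table and the `beta` argument (the termination probability of the executing Option at the next state) are illustrative choices:

```python
import numpy as np

def intra_option_update(Q, s, o, r, s_next, beta, alpha=0.1, gamma=0.99):
    """One intra-option Q-learning step for the currently executing option o.

    Q:    dict mapping state -> np.array of per-option values
    beta: termination probability beta_o(s_next)
    """
    # Continuation value U(s', o): keep running o with prob (1 - beta),
    # otherwise terminate and pick the best next option.
    U = (1 - beta) * Q[s_next][o] + beta * np.max(Q[s_next])
    Q[s][o] += alpha * (r + gamma * U - Q[s][o])
    return Q[s][o]

Q = {0: np.zeros(2), 1: np.array([0.5, 1.0])}
q = intra_option_update(Q, s=0, o=0, r=1.0, s_next=1, beta=0.5)  # U = 0.75
```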

Options Discovery: Automatically Learning Subgoals

Problem with Manual Options Design: Requires domain knowledge, difficult to scale to new tasks.

Automatic Discovery Methods:

1. Bottleneck States Based:

- Analyze the state transition graph to find "must-pass" states (e.g., doorways between rooms)
- Use bottleneck states as subgoals and train Options to reach them
- Algorithms: Betweenness Centrality, Graph Clustering

2. State Coverage Based:

- Train diverse Options that maximize the diversity of visited states
- Objective: $\max_\pi \mathcal{H}(S)$, where $\mathcal{H}(S)$ is the entropy of the state distribution
- Algorithm: Diversity Is All You Need (DIAYN)

3. Skill Learning Based:

- Discover useful behavioral primitives using unsupervised learning
- Objective: maximize the mutual information $I(S; Z)$, where $Z$ is the skill ID
- Algorithms: Variational Intrinsic Control (VIC), Unsupervised Meta-RL

DIAYN Algorithm:

- Learn skill-conditioned policies: each skill $z$ corresponds to a policy $\pi(a \mid s, z)$
- Reward: $r = \log q_\phi(z \mid s) - \log p(z)$, where $q_\phi$ is a discriminator predicting the skill ID from the state
- Objective: make the trajectories of different skills easily distinguishable by their state distributions
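A minimal sketch of the DIAYN pseudo-reward with a stand-in discriminator output; the logits are hypothetical, and a real implementation trains $q_\phi$ jointly with the skill policies:

```python
import numpy as np

def diayn_reward(disc_logits, z, num_skills):
    """DIAYN pseudo-reward r = log q(z|s) - log p(z), with uniform p(z).

    disc_logits: discriminator logits over skills at the current state
    z:           index of the skill currently being executed
    """
    log_q = disc_logits - np.log(np.sum(np.exp(disc_logits)))  # log-softmax
    return float(log_q[z] - np.log(1.0 / num_skills))

# A state the discriminator confidently attributes to skill 0: positive reward
r_good = diayn_reward(np.array([5.0, 0.0, 0.0]), z=0, num_skills=3)
# A state that looks like skill 1 while skill 0 executes: negative reward
r_bad = diayn_reward(np.array([0.0, 5.0, 0.0]), z=0, num_skills=3)
```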

Four Rooms Experiment: Classic Options Example

Environment: A grid world with 4 rooms, passages between rooms, agent starts from random position, navigates to goal position.

Manual Options:

- 4 Options, each corresponding to "go to room X's exit"
- Initiation set: all grid cells in room X
- Termination condition: reaching the exit

Experimental Results:

- Q-learning with Options converges 3-5x faster than vanilla Q-learning
- Options can be reused across different goal positions without retraining

Code implementation provided later.

MAXQ: Value Function Decomposition

Core Idea: Task Hierarchization and Value Decomposition

MAXQ decomposes a complex task into a Task Hierarchy Graph:

- Root node: the main task (e.g., "make breakfast")
- Intermediate nodes: subtasks (e.g., "brew coffee," "fry eggs")
- Leaf nodes: atomic actions (e.g., "turn on stove," "pour water")

Value Function Decomposition: The Q-value of executing subtask $a$ within task $i$ decomposes into two parts:

$$Q(i, s, a) = V(a, s) + C(i, s, a)$$

- Subtask Value $V(a, s)$: the intrinsic value of executing subtask $a$ itself (not considering the parent task)
- Completion Value $C(i, s, a)$: the value of continuing to complete the parent task $i$ after subtask $a$ finishes

Recursive Definition:

Leaf Nodes (atomic action $a$):

$$V(a, s) = \mathbb{E}\left[ r \mid s, a \right]$$

Non-leaf Nodes (composite task $i$):

$$V(i, s) = \max_a Q(i, s, a) = \max_a \left[ V(a, s) + C(i, s, a) \right]$$
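A tiny numeric sketch of the decomposition with hypothetical $V$ and $C$ tables; the point is that the greedy choice maximizes $V + C$, not $V$ alone:

```python
# MAXQ decomposition sketch: Q(i, s, a) = V(a, s) + C(i, s, a),
# with made-up values for a "Root" task that has two subtasks.
V = {("fetch", "s0"): 2.0, ("deliver", "s0"): 1.0}                  # subtask values
C = {("Root", "s0", "fetch"): 0.5, ("Root", "s0", "deliver"): 3.0}  # completion values

def Q(i, s, a):
    """Recombine the decomposed value for subtask a under parent task i."""
    return V[(a, s)] + C[(i, s, a)]

# "fetch" has the higher intrinsic value V, but "deliver" wins once the
# completion value of the parent task is taken into account.
best = max(["fetch", "deliver"], key=lambda a: Q("Root", "s0", a))
```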

MAXQ Learning Algorithm

MAXQ-Q Learning: Recursively update $V$ and $C$:

```
def MAXQ-Q(task i, state s):
    if i is a leaf node (atomic action):
        execute action i, observe r, s'
        return (r, s', 1)                       # (reward, new state, steps)

    while s is not a terminal state of i:
        select subtask a = argmax_a [V(a, s) + C(i, s, a)]
        (r, s', N) = MAXQ-Q(a, s)               # recursive execution

        # Update V(a, s)
        V(a, s) += alpha * [r + gamma^N * V(a, s') - V(a, s)]

        # Update C(i, s, a)
        C(i, s, a) += alpha * [gamma^N * V(i, s') - C(i, s, a)]

        s = s'

    return (cumulative reward, s, total steps)
```

Key Features:

- State Abstraction: a subtask's $V$ and $C$ can ignore irrelevant features of $s$ (e.g., an "open door" task does not care about the door's color)
- Parallel Learning: multiple parent tasks can share a subtask's $V$ values, accelerating learning
- Convergence Guarantee: in the tabular setting, MAXQ-Q converges to a hierarchically optimal policy

MAXQ Advantages and Limitations

Advantages:

- Sample Efficiency: value decomposition and state abstraction reduce the number of parameters to learn
- Transferability: a subtask's $V$ values can be reused across different parent tasks
- Interpretability: the task hierarchy graph explicitly shows the decision logic

Limitations:

- Requires Manual Task Hierarchy Design: automatically discovering the optimal decomposition is difficult in practice
- Only Hierarchically Optimal: the result may not be globally optimal (limited by the predefined hierarchy structure)
- Tabular Limitation: extending to large state spaces requires function approximation, but the convergence proof no longer holds

Feudal Reinforcement Learning: Goal-Conditioned Hierarchical Learning

Feudal RL's Idea

Feudal System Analogy:

- Manager (high level): sets subgoals (e.g., "go to coordinates (10, 5)")
- Worker (low level): executes atomic actions to reach the subgoals

Difference from Options:

- The Options framework's high-level policy selects among discrete Options
- Feudal RL's Manager outputs continuous subgoals $g$, and the Worker learns a goal-conditioned policy $\pi(a \mid s, g)$

FuN: FeUdal Networks for Hierarchical RL

Architecture:

- Manager: outputs a subgoal $g_t$ (in a latent space) every $c$ steps
- Worker: learns a policy $\pi(a \mid s, g)$ to maximize the intrinsic reward for approaching the subgoal

Manager's Objective: Maximize the extrinsic cumulative reward $\sum_t \gamma^t r_t$ from the environment.

Worker's Objective: Maximize the intrinsic reward (movement along the subgoal direction):

$$r^I_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}\big(\phi(s_t) - \phi(s_{t-i}),\, g_{t-i}\big)$$

where $\phi$ is the state embedding and $d_{\cos}$ is cosine similarity.
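Under those definitions, the intrinsic reward can be sketched with NumPy; `cosine`, `z_hist`, and `g_hist` are illustrative names, and the bookkeeping is simplified relative to the full FuN implementation:

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def worker_intrinsic_reward(z_hist, g_hist, t, c):
    """Average cosine similarity between latent-state movement and the
    Manager's goal directions over the last c steps (simplified FuN form)."""
    return float(np.mean([cosine(z_hist[t] - z_hist[t - i], g_hist[t - i])
                          for i in range(1, c + 1)]))

# Moving exactly along the goal direction yields a reward close to 1
z_hist = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
g_hist = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 0.0])]
r = worker_intrinsic_reward(z_hist, g_hist, t=2, c=2)
```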

Training:

- The Worker is trained with A3C, optimizing the intrinsic reward
- The Manager is trained with a policy gradient, optimizing the extrinsic reward
- Gradients propagate from the Worker to the Manager (the Worker's state embedding is the Manager's input)

Advantages:

- Automatically learns subgoals; no manual design needed
- The Manager focuses on long-term planning, the Worker on short-term execution
- Significantly improves performance in Atari games (e.g., Montezuma's Revenge)

HIRO: Data Efficient Hierarchical RL

Improvement: FuN's training is unstable (the Manager's and Worker's objectives may conflict). HIRO proposes an off-policy correction:

Core Idea:

- The Manager outputs a subgoal $g_t$ every $c$ steps (in the raw state space, e.g., "reach these coordinates")
- The Worker learns $\pi(a \mid s, g)$
- The Worker's training data does not directly use the Manager's output; instead it uses relabeled subgoals

Relabeling Mechanism: For a transition starting at $s_t$, the Manager originally set subgoal $g_t$, but the state actually reached after $c$ steps may deviate from it. HIRO relabels the goal as

$$\tilde{g}_t = s_{t+c} - s_t$$

i.e., "pretend the Manager's goal was the place actually reached," so the Worker always learns from effective experience.
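A sketch of the relative-goal bookkeeping, assuming goals live in state-difference space as described above; `goal_transition` and `hindsight_relabel` are illustrative names, and HIRO's full off-policy correction additionally searches several candidate goals for the one that best explains the Worker's actions:

```python
import numpy as np

def goal_transition(s_t, g_t, s_next):
    """Between Manager decisions the relative goal is shifted so that
    s_t + g_t remains a fixed target point in state space."""
    return s_t + g_t - s_next

def hindsight_relabel(s_t, s_t_plus_c):
    """Relabel the Manager's goal as the displacement actually achieved:
    pretend the goal was the place the Worker really reached."""
    return s_t_plus_c - s_t

s0 = np.array([0.0, 0.0])
g0 = np.array([5.0, 0.0])          # "move 5 units to the right"
s1 = np.array([1.0, 1.0])          # actual next state
g1 = goal_transition(s0, g0, s1)   # remaining displacement to the target
g_relabel = hindsight_relabel(s0, np.array([2.0, 3.0]))
```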

Effect: HIRO more stable than FuN in continuous control tasks (e.g., Ant Maze), sample efficiency improves 2-3x.

Meta-Reinforcement Learning: Learning to Learn Fast

Why Do We Need Meta-RL?

Limitations of Traditional RL:

- Each new task is learned from scratch, wasting knowledge from similar tasks
- Low sample efficiency: millions of steps are needed to converge
- Cannot "learn by analogy" the way humans do

Meta-RL's Goal:

- Train on multiple related tasks
- Learn a meta-policy or meta-parameters
- On a new task, rapidly adapt with minimal samples (e.g., 10-100 episodes)

Application Scenarios:

- Robotics: quickly adapt grasping policies to different objects, weights, and friction coefficients
- Games: quickly learn new levels and new characters
- Recommendation systems: quickly adapt to new user preferences

MAML: Model-Agnostic Meta-Learning

Core Idea: Learn a "good initialization" $\theta$ such that, starting from $\theta$, any task in the distribution can reach high performance within a few gradient steps.

Algorithm Flow:

1. Meta-Training:
   - Sample a task batch $\{\mathcal{T}_i\} \sim p(\mathcal{T})$
   - For each task $\mathcal{T}_i$:
     - Starting from $\theta$, sample trajectories and compute the loss $\mathcal{L}_{\mathcal{T}_i}(\theta)$
     - Take one gradient step: $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$
     - Use $\theta_i'$ to sample new trajectories and compute $\mathcal{L}_{\mathcal{T}_i}(\theta_i')$
   - Meta-optimize: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}(\theta_i')$
2. Meta-Testing:
   - Given a new task $\mathcal{T}_{\text{new}}$
   - Starting from $\theta$, take $k$ gradient steps to obtain $\theta'$
   - Deploy the policy $\pi_{\theta'}$

Mathematical Form:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)\big)$$

Key: This is second-order gradient optimization: the gradient of the meta-loss w.r.t. $\theta$ requires differentiating through the inner gradient step:

$$\nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta_i') = \big(\nabla_\theta \theta_i'\big)^\top \nabla_{\theta_i'} \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

Expanding the inner term:

$$\nabla_\theta \theta_i' = I - \alpha \nabla^2_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$$

where $\nabla^2_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)$ is the Hessian matrix.

FOMAML (First-Order MAML): Ignore the second-order term and use only $\nabla_{\theta_i'} \mathcal{L}_{\mathcal{T}_i}(\theta_i')$:

$$\theta \leftarrow \theta - \beta \sum_i \nabla_{\theta_i'} \mathcal{L}_{\mathcal{T}_i}(\theta_i')$$

In practice FOMAML's performance is close to MAML's, but computation is about 10x faster.
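The role of the second-order term can be checked numerically on a quadratic loss $L(\theta) = \frac{1}{2}\theta^\top A \theta$, where the Hessian is simply $A$ and the exact meta-gradient is available in closed form (a toy construction for illustration, not part of any RL pipeline):

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 0.5]])  # Hessian of the quadratic loss
theta = np.array([1.0, 1.0])
alpha = 0.1                              # inner-loop step size

theta_adapt = theta - alpha * (A @ theta)              # one inner gradient step

# Exact MAML meta-gradient: (I - alpha*H)^T * grad L(theta')
maml_grad = (np.eye(2) - alpha * A).T @ (A @ theta_adapt)
# FOMAML drops the (I - alpha*H) factor
fomaml_grad = A @ theta_adapt

# Finite-difference check of the exact meta-gradient
def meta_loss(th):
    th_ad = th - alpha * (A @ th)
    return 0.5 * th_ad @ A @ th_ad

eps = 1e-6
fd = np.array([(meta_loss(theta + eps * e) - meta_loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
# fd matches maml_grad, confirming the (I - alpha*H) correction matters.
```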

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

Core Idea: Use RNN to memorize historical experience, encode "learning" as RNN hidden state updates.

Architecture:

- Input: $(s_t, a_{t-1}, r_{t-1}, d_{t-1})$ (current state, previous action, previous reward, previous done flag)
- An RNN (e.g., LSTM) processes the sequence and outputs a hidden state $h_t$
- Policy: $\pi(a_t \mid h_t)$
- Training: train the RNN on trajectories across multiple tasks, optimizing cumulative reward

Difference from MAML:

- MAML adapts to a new task through explicit gradient updates
- RL² adapts implicitly through the RNN's memory (the hidden state $h_t$ encodes the "knowledge learned so far")

Training Process:

- At the start of each trial (a new task), the RNN hidden state resets to $h_0$; it persists across the episodes within the trial
- The agent executes multiple episodes on the task, and the RNN gradually "learns" it
- Training objective: maximize the return of the second and subsequent episodes (the first episode serves exploration)
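The trial structure above can be sketched as follows; `rnn_step` is a stand-in for a real LSTM cell, and the point is only the bookkeeping: the hidden state resets once per trial, while the environment state resets per episode:

```python
import numpy as np

def rnn_step(h, x):
    """Stand-in recurrent cell (a real implementation would use an LSTM)."""
    return float(np.tanh(0.9 * h + 0.1 * np.sum(x)))

def run_trial(num_episodes, episode_len):
    """RL^2 bookkeeping: the hidden state is reset once per trial (new task)
    and persists across the episodes within it; each input concatenates
    (state, previous action, previous reward, previous done flag)."""
    h = 0.0                                   # reset only at trial start
    prev_a, prev_r, prev_d = 0.0, 0.0, 1.0
    for _ in range(num_episodes):
        s = 0.0                               # environment resets per episode
        for _ in range(episode_len):
            x = np.array([s, prev_a, prev_r, prev_d])
            h = rnn_step(h, x)                # h carries knowledge across episodes
            prev_a, prev_r, prev_d = 1.0, 0.5, 0.0
            s += 0.1
        prev_d = 1.0                          # done flag fed at episode boundary
    return h

h_final = run_trial(num_episodes=2, episode_len=3)
```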

Effect: RL² rapidly adapts to new tasks in bandit problems and tabular MDPs, but is less stable than MAML in complex environments (e.g., MuJoCo).

Meta-Learning Challenges and Frontiers

Challenges:

- Task Distribution: Meta-RL assumes training and test tasks come from the same distribution, which is hard to guarantee in practice
- Meta-Overfitting: overfitting to the training tasks and failing to generalize to new ones
- Computational Cost: MAML's second-order gradients are expensive; FOMAML sacrifices some performance

Frontier Directions:

1. Unsupervised Meta-RL:

- No task labels; automatically construct a task distribution from a single environment
- Algorithms: CARML, UML

2. Model-Based Meta-RL:

- Learn an environment model and adapt rapidly inside the model
- Algorithms: MAESN, Context-based Meta-RL

3. Transformer for Meta-RL:

- Replace the RNN with a Transformer to handle long sequences
- Algorithm: Decision Transformer + meta-learning

4. Directed-MAML (2024 new development):

- Uses a task-directed approximation to reduce second-order gradient computation
- Outperforms MAML on CartPole and LunarLander

Complete Code Implementation: Options in Four Rooms

Below, the Options framework is implemented in the Four Rooms environment, including:

- Four Rooms environment construction
- Manual Options definition (reach the exit of each room)
- Intra-Option Q-learning training
- Comparison with vanilla Q-learning

```python
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict


# ============ Four Rooms Environment ============
class FourRoomsEnv:
    """Four Rooms grid world (13x13).

    A vertical wall and a horizontal wall split the grid into four rooms;
    four doorways (one per wall segment) connect adjacent rooms.
    """
    def __init__(self, size=13):
        self.size = size
        self.grid = np.zeros((size, size), dtype=int)

        # Walls (1 = wall)
        self.grid[:, size // 2] = 1  # vertical wall
        self.grid[size // 2, :] = 1  # horizontal wall

        # Doors (0 = passable)
        self.grid[size // 2, size // 4] = 0      # left door
        self.grid[size // 2, 3 * size // 4] = 0  # right door
        self.grid[size // 4, size // 2] = 0      # top door
        self.grid[3 * size // 4, size // 2] = 0  # bottom door

        # Action space: 0=up, 1=right, 2=down, 3=left
        self.actions = [(-1, 0), (0, 1), (1, 0), (0, -1)]

        # Room partitioning (for defining Options)
        self.rooms = {
            0: [(i, j) for i in range(size // 2) for j in range(size // 2)
                if self.grid[i, j] == 0],  # top-left
            1: [(i, j) for i in range(size // 2) for j in range(size // 2 + 1, size)
                if self.grid[i, j] == 0],  # top-right
            2: [(i, j) for i in range(size // 2 + 1, size) for j in range(size // 2)
                if self.grid[i, j] == 0],  # bottom-left
            3: [(i, j) for i in range(size // 2 + 1, size) for j in range(size // 2 + 1, size)
                if self.grid[i, j] == 0],  # bottom-right
        }

        # Room exits (door positions)
        self.hallways = [
            (size // 2, size // 4),      # Room 0 exit
            (size // 4, size // 2),      # Room 1 exit
            (3 * size // 4, size // 2),  # Room 2 exit
            (size // 2, 3 * size // 4),  # Room 3 exit
        ]

    def reset(self, goal=None):
        """Randomly initialize agent (and goal) on free cells"""
        while True:
            self.agent_pos = (np.random.randint(0, self.size), np.random.randint(0, self.size))
            if self.grid[self.agent_pos] == 0:
                break

        if goal is None:
            while True:
                self.goal_pos = (np.random.randint(0, self.size), np.random.randint(0, self.size))
                if self.grid[self.goal_pos] == 0 and self.goal_pos != self.agent_pos:
                    break
        else:
            self.goal_pos = goal

        return self.agent_pos

    def step(self, action):
        """Execute action"""
        dy, dx = self.actions[action]
        new_pos = (self.agent_pos[0] + dy, self.agent_pos[1] + dx)

        # Check boundaries and walls
        if 0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and self.grid[new_pos] == 0:
            self.agent_pos = new_pos

        # Reward and termination
        done = (self.agent_pos == self.goal_pos)
        reward = 1.0 if done else -0.01

        return self.agent_pos, reward, done

    def get_room(self, pos):
        """Get room ID of a position (-1 for doorways)"""
        for room_id, positions in self.rooms.items():
            if pos in positions:
                return room_id
        return -1


# ============ Options Definition ============
class Option:
    """Option = (initiation_set, policy, termination)"""
    def __init__(self, name, initiation_set, goal_states):
        self.name = name
        self.initiation_set = initiation_set  # states where the Option can start
        self.goal_states = goal_states        # subgoal states
        self.policy = {}                      # internal policy: state -> action (learned via Q-learning)

    def can_initiate(self, state):
        """Whether the Option can be initiated at this state"""
        return state in self.initiation_set

    def should_terminate(self, state):
        """Whether the Option should terminate at this state"""
        return state in self.goal_states

    def act(self, state, epsilon=0.1):
        """Select an action according to the internal policy (ε-greedy)"""
        if state not in self.policy or np.random.rand() < epsilon:
            return np.random.randint(4)
        return self.policy[state]


def create_options(env):
    """Create 4 Options for Four Rooms, each corresponding to a room exit.

    The doorways are added to every initiation set so that Options can be
    chained when the agent stands in a doorway (which belongs to no room).
    """
    options = []
    for room_id in range(4):
        option = Option(
            name=f"GoToHallway{room_id}",
            initiation_set=set(env.rooms[room_id]) | set(env.hallways),
            goal_states=[env.hallways[room_id]],
        )
        options.append(option)
    return options


# ============ Intra-Option Q-Learning ============
class IntraOptionQLearning:
    def __init__(self, env, options, alpha=0.5, gamma=0.99, epsilon=0.1):
        self.env = env
        self.options = options
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon

        # Option-value function: Q(s, o)
        self.Q = defaultdict(lambda: np.zeros(len(options)))

        # Primitive Q-function, shared by all Options' internal policies
        # (a simplification: one table learns to reach the environment goal)
        self.Q_primitive = defaultdict(lambda: np.zeros(4))

    def select_option(self, state):
        """ε-greedy selection among Options that can start (and do not
        immediately terminate) at this state"""
        available_options = [i for i, o in enumerate(self.options)
                             if o.can_initiate(state) and not o.should_terminate(state)]
        if not available_options:
            return None

        if np.random.rand() < self.epsilon:
            return np.random.choice(available_options)

        q_values = self.Q[state][available_options]
        return available_options[int(np.argmax(q_values))]

    def train_episode(self):
        """Train one episode"""
        state = self.env.reset()
        total_reward = 0
        steps = 0

        while steps < 1000:
            # Select Option
            option_id = self.select_option(state)
            if option_id is None:
                break

            option = self.options[option_id]

            # Execute the Option until it terminates
            while not option.should_terminate(state) and steps < 1000:
                action = option.act(state, self.epsilon)
                next_state, reward, done = self.env.step(action)

                total_reward += reward
                steps += 1

                # ===== Intra-Option Q Update =====
                # Continuation value U(s', o)
                if option.should_terminate(next_state):
                    U = np.max(self.Q[next_state])     # terminate: best next Option
                else:
                    U = self.Q[next_state][option_id]  # continue current Option

                self.Q[state][option_id] += self.alpha * (
                    reward + self.gamma * U - self.Q[state][option_id])

                # ===== Update Option Internal Policy =====
                self.Q_primitive[state][action] += self.alpha * (
                    reward + self.gamma * np.max(self.Q_primitive[next_state])
                    - self.Q_primitive[state][action])
                option.policy[state] = int(np.argmax(self.Q_primitive[state]))

                state = next_state

                if done:
                    return total_reward, steps

        return total_reward, steps


# ============ Vanilla Q-Learning (Baseline) ============
class FlatQLearning:
    def __init__(self, env, alpha=0.5, gamma=0.99, epsilon=0.1):
        self.env = env
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.Q = defaultdict(lambda: np.zeros(4))

    def train_episode(self):
        state = self.env.reset()
        total_reward = 0
        steps = 0

        while steps < 1000:
            # ε-greedy
            if np.random.rand() < self.epsilon:
                action = np.random.randint(4)
            else:
                action = int(np.argmax(self.Q[state]))

            next_state, reward, done = self.env.step(action)

            # Q-learning update
            self.Q[state][action] += self.alpha * (
                reward + self.gamma * np.max(self.Q[next_state]) - self.Q[state][action])

            total_reward += reward
            steps += 1
            state = next_state

            if done:
                break

        return total_reward, steps


# ============ Experimental Comparison ============
def run_comparison(num_episodes=500, num_trials=10):
    """Compare Options and flat Q-learning"""
    results = {
        'options': {'rewards': [], 'steps': []},
        'flat': {'rewards': [], 'steps': []},
    }

    for trial in range(num_trials):
        print(f"Trial {trial + 1}/{num_trials}")

        # Independent environments for the two agents
        env_opt = FourRoomsEnv()
        env_flat = FourRoomsEnv()

        # Options learning
        options = create_options(env_opt)
        agent_opt = IntraOptionQLearning(env_opt, options)
        rewards_opt, steps_opt = [], []

        for ep in range(num_episodes):
            r, s = agent_opt.train_episode()
            rewards_opt.append(r)
            steps_opt.append(s)

        results['options']['rewards'].append(rewards_opt)
        results['options']['steps'].append(steps_opt)

        # Flat Q-learning
        agent_flat = FlatQLearning(env_flat)
        rewards_flat, steps_flat = [], []

        for ep in range(num_episodes):
            r, s = agent_flat.train_episode()
            rewards_flat.append(r)
            steps_flat.append(s)

        results['flat']['rewards'].append(rewards_flat)
        results['flat']['steps'].append(steps_flat)

    # Average over trials
    avg_rewards_opt = np.mean(results['options']['rewards'], axis=0)
    avg_rewards_flat = np.mean(results['flat']['rewards'], axis=0)
    avg_steps_opt = np.mean(results['options']['steps'], axis=0)
    avg_steps_flat = np.mean(results['flat']['steps'], axis=0)

    # Plotting
    plt.figure(figsize=(14, 5))

    plt.subplot(1, 2, 1)
    plt.plot(avg_rewards_opt, label='Intra-Option Q-Learning', linewidth=2)
    plt.plot(avg_rewards_flat, label='Flat Q-Learning', linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Average Reward')
    plt.title('Learning Curves: Rewards')
    plt.legend()
    plt.grid(True)

    plt.subplot(1, 2, 2)
    plt.plot(avg_steps_opt, label='Intra-Option Q-Learning', linewidth=2)
    plt.plot(avg_steps_flat, label='Flat Q-Learning', linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Steps to Goal')
    plt.title('Learning Curves: Steps')
    plt.legend()
    plt.grid(True)

    plt.tight_layout()
    plt.savefig('options_vs_flat.png', dpi=300)
    print("Plot saved as options_vs_flat.png")

    return results


# ============ Main Program ============
if __name__ == "__main__":
    print("Starting Four Rooms Options experiment...")
    results = run_comparison(num_episodes=500, num_trials=10)
    print("Experiment complete!")
    print(f"Options final average reward: {np.mean([r[-50:] for r in results['options']['rewards']]):.2f}")
    print(f"Flat Q final average reward: {np.mean([r[-50:] for r in results['flat']['rewards']]):.2f}")
```

Code Analysis

Four Rooms Environment:

- 13x13 grid, 4 rooms, doorways between rooms
- The agent starts from a random position and receives a reward for reaching the goal position

Options Definition:

- 4 Options, each corresponding to "go to room X's exit"
- Initiation set: all positions in the room
- Termination condition: reaching the exit (doorway)

Intra-Option Q-Learning:

- High level: Q(s, o) selects Options
- Low level: the Option's internal policy π_o(a|s) executes atomic actions
- Q(s, o) is updated at every step using the continuation value U
- The Options' internal Q-function is updated simultaneously

Experimental Results (expected):

- Options converge faster in the first 100 episodes (smaller exploration space)
- Final performance is similar, but Options need fewer steps (jumping across rooms directly)

Complete Code Implementation: MAML for RL

Below, MAML is implemented on a simple RL task (2D navigation, where different tasks correspond to different goal positions):

```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt


# ============ 2D Navigation Environment ============
class Navigation2DEnv:
    """2D plane navigation task.

    The agent starts from the origin and navigates to a goal point.
    Different tasks = different goal positions.
    """
    def __init__(self, goal=(5.0, 5.0), max_steps=50):
        self.goal = np.array(goal, dtype=np.float32)
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.pos = np.zeros(2, dtype=np.float32)
        self.steps = 0
        return self.pos.copy()

    def step(self, action):
        """action: (dx, dy) ∈ [-1, 1]²"""
        self.pos += action * 0.5  # velocity coefficient
        self.steps += 1

        # Reward: negative distance to goal
        dist = np.linalg.norm(self.pos - self.goal)
        reward = -dist

        # Termination condition
        done = (dist < 0.5) or (self.steps >= self.max_steps)

        return self.pos.copy(), reward, done

    @staticmethod
    def sample_task():
        """Sample a random task (goal position)"""
        goal = np.random.uniform(-10, 10, size=2)
        return Navigation2DEnv(goal=goal)


# ============ Policy Network ============
class PolicyNetwork(nn.Module):
    """Gaussian policy network"""
    def __init__(self, state_dim=2, action_dim=2, hidden_dim=64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        x = self.fc(state)
        mean = torch.tanh(self.mean(x))  # limit mean to [-1, 1]
        std = torch.exp(self.log_std).clamp(min=1e-3, max=1.0)
        return mean, std

    def sample(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob


# ============ MAML Algorithm ============
class MAML:
    def __init__(self, policy, inner_lr=0.1, outer_lr=0.001, num_inner_steps=1):
        self.policy = policy
        self.inner_lr = inner_lr
        self.outer_lr = outer_lr
        self.num_inner_steps = num_inner_steps
        self.optimizer = optim.Adam(policy.parameters(), lr=outer_lr)

    def collect_trajectory(self, env, policy, max_steps=50):
        """Collect one trajectory"""
        states, actions, rewards, log_probs = [], [], [], []

        state = env.reset()
        for _ in range(max_steps):
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            action, log_prob = policy.sample(state_tensor)
            action_np = action.detach().numpy()[0]

            next_state, reward, done = env.step(action_np)

            states.append(state)
            actions.append(action_np)
            rewards.append(reward)
            log_probs.append(log_prob)

            state = next_state
            if done:
                break

        return states, actions, rewards, log_probs

    def compute_loss(self, states, actions, rewards, log_probs):
        """Policy gradient loss (REINFORCE)"""
        # Discounted returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + 0.99 * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        # Policy gradient loss
        log_probs = torch.stack(log_probs)
        loss = -(log_probs * returns).mean()
        return loss

    def inner_loop(self, task_env):
        """Inner loop: one gradient update on a single task"""
        states, actions, rewards, log_probs = self.collect_trajectory(task_env, self.policy)
        loss = self.compute_loss(states, actions, rewards, log_probs)

        # Manual gradient step. create_graph=True would enable true
        # second-order MAML; since the adapted parameters are applied via
        # .data in the outer loop, the meta-gradient here is effectively
        # first-order (FOMAML-style).
        grads = torch.autograd.grad(loss, self.policy.parameters(), create_graph=True)

        adapted_params = [param - self.inner_lr * grad
                          for param, grad in zip(self.policy.parameters(), grads)]
        return adapted_params, loss.item()

    def outer_loop(self, task_batch):
        """Outer loop: meta-optimization"""
        meta_loss = 0

        for task_env in task_batch:
            # Inner update
            adapted_params, train_loss = self.inner_loop(task_env)

            # Temporarily apply the adapted parameters
            original_params = [p.clone() for p in self.policy.parameters()]
            for param, adapted_param in zip(self.policy.parameters(), adapted_params):
                param.data = adapted_param.data

            # Collect a post-adaptation trajectory and its loss
            states, actions, rewards, log_probs = self.collect_trajectory(task_env, self.policy)
            loss = self.compute_loss(states, actions, rewards, log_probs)
            meta_loss += loss

            # Restore the original parameters
            for param, original_param in zip(self.policy.parameters(), original_params):
                param.data = original_param.data

        meta_loss /= len(task_batch)

        # Meta-gradient step
        self.optimizer.zero_grad()
        meta_loss.backward()
        self.optimizer.step()

        return meta_loss.item()

    def adapt_to_task(self, task_env, num_steps=5):
        """Adapt to a new task with a few gradient steps"""
        adapted_policy = PolicyNetwork()
        adapted_policy.load_state_dict(self.policy.state_dict())
        optimizer = optim.SGD(adapted_policy.parameters(), lr=self.inner_lr)

        for step in range(num_steps):
            states, actions, rewards, log_probs = self.collect_trajectory(task_env, adapted_policy)
            loss = self.compute_loss(states, actions, rewards, log_probs)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        return adapted_policy


# ============ Train MAML ============
def train_maml(num_iterations=500, task_batch_size=10):
    """Train MAML"""
    policy = PolicyNetwork()
    maml = MAML(policy, inner_lr=0.1, outer_lr=0.001)

    meta_losses = []

    for iteration in range(num_iterations):
        # Sample a task batch
        task_batch = [Navigation2DEnv.sample_task() for _ in range(task_batch_size)]

        # Meta-train
        meta_loss = maml.outer_loop(task_batch)
        meta_losses.append(meta_loss)

        if iteration % 50 == 0:
            print(f"Iteration {iteration}: Meta Loss = {meta_loss:.4f}")

    # Plot the learning curve
    plt.figure(figsize=(10, 5))
    plt.plot(meta_losses)
    plt.xlabel('Iteration')
    plt.ylabel('Meta Loss')
    plt.title('MAML Training Curve')
    plt.grid(True)
    plt.savefig('maml_training.png', dpi=300)
    print("Training curve saved as maml_training.png")

    return maml


# ============ Evaluate MAML ============
def evaluate_maml(maml, num_test_tasks=10, num_adapt_steps=5):
    """Evaluate MAML's rapid adaptation on new tasks"""
    results = {'before': [], 'after': []}

    for _ in range(num_test_tasks):
        # Sample a test task
        test_env = Navigation2DEnv.sample_task()

        # Before adaptation: use the meta-policy directly
        states, actions, rewards, _ = maml.collect_trajectory(test_env, maml.policy)
        results['before'].append(sum(rewards))

        # After adaptation: a few gradient steps
        adapted_policy = maml.adapt_to_task(test_env, num_steps=num_adapt_steps)
        states, actions, rewards, _ = maml.collect_trajectory(test_env, adapted_policy)
        results['after'].append(sum(rewards))

    print(f"Average return before adaptation: {np.mean(results['before']):.2f}")
    print(f"Average return after adaptation: {np.mean(results['after']):.2f}")
    print(f"Improvement: {np.mean(results['after']) - np.mean(results['before']):.2f}")

    return results


# ============ Main Program ============
if __name__ == "__main__":
    print("Starting MAML training...")
    maml = train_maml(num_iterations=500, task_batch_size=10)

    print("\nEvaluating MAML rapid adaptation...")
    results = evaluate_maml(maml, num_test_tasks=20, num_adapt_steps=5)

    print("\nExperiment complete!")
```

Code Analysis

Environment: - 2D plane navigation, agent starts from (0,0), navigates to goal point - Different tasks = different goal positions (random sampling) - Reward: negative Euclidean distance

MAML Flow: 1. Sample Task Batch: 10 random goal positions 2. Inner Loop (for each task): - Collect trajectory, compute loss - Take 1 gradient step to obtain adapted parameters θ'_i

  3. Outer Loop:
    • Use θ'_i to collect trajectories again, compute the meta-loss
    • Take a gradient step on meta-parameters θ
  4. Rapid Adaptation:
    • Given a new task, take 5 gradient steps starting from θ
    • Evaluate the adapted performance

Experimental Results (expected): - Before adaptation: Meta-policy performs moderately on random tasks - After adaptation: 5 gradient steps significantly improve performance (e.g., return from -50 to -20) - Shows MAML learned "good initialization," can rapidly adapt to new tasks

In-Depth Q&A

Q1: Why Does Options Framework Accelerate Learning?

Temporal Abstraction Reduces Decision Points: - Flat RL: Selects an atomic action at each step, decision points = T (episode length) - Options: Selects one Option every k steps on average, decision points = T/k - Fewer decision points → shorter credit assignment path → faster convergence

Improved Exploration Efficiency: - Flat RL: Exploration space |A|^T (exponential in episode length) - Options: Exploration space |O|^(T/k) (much smaller) - Options' "big-step exploration" can reach distant states faster

Value Function Reuse: - Option Q-values can be shared across different tasks - E.g., "open door" Option useful in both "exit room" and "enter room" tasks

Experimental Verification: In Four Rooms task, Options converges 3-5x faster than flat Q-learning.
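The framework's core object can be sketched in code. Below is a minimal, hypothetical Option as the triple (I, π, β) — initiation set, internal policy, termination function — for a Four Rooms-style grid; the state layout, action names, and naive internal policy are all illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple

State = Tuple[int, int]  # (row, col) in a grid world

@dataclass
class Option:
    """An option o = (I, pi, beta): initiation set, internal policy, termination."""
    name: str
    initiation_set: Set[State]             # I: states where the option may start
    policy: Callable[[State], str]         # pi: maps state -> atomic action
    termination: Callable[[State], float]  # beta: probability of terminating in s

# Hypothetical "go to the hallway at (2, 5)" option, starting anywhere in the left room
hallway = (2, 5)
goto_hallway = Option(
    name="goto-hallway",
    initiation_set={(r, c) for r in range(5) for c in range(5)},  # left room
    policy=lambda s: "right" if s[1] < hallway[1] else "down",    # naive heading
    termination=lambda s: 1.0 if s == hallway else 0.0,           # stop at hallway
)

# The high-level policy only decides *which* option to run; the option then
# executes atomic actions until beta says stop.
state = (2, 0)
assert state in goto_hallway.initiation_set
assert goto_hallway.policy(state) == "right"
assert goto_hallway.termination(hallway) == 1.0
```

Because `goto_hallway` is a reusable behavioral primitive, the same object could serve both an "exit room" and an "enter room" task, which is exactly the value-reuse argument above.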

Q2: What's the Difference Between MAXQ's "Hierarchically Optimal" and "Globally Optimal"?

Hierarchically Optimal: - Optimal policy under given task decomposition - Each subtask independently optimized - Constrained by predefined task hierarchy structure

Globally Optimal: - Optimal policy on original MDP - Not constrained by task decomposition

Example: Suppose task "make breakfast" decomposes to "brew coffee" → "fry eggs" → "toast bread": - Globally optimal: May brew coffee and fry eggs in parallel (faster) - Hierarchically optimal: Constrained to serial decomposition, cannot parallelize

When Consistent? When task decomposition perfectly captures optimal policy's structure, hierarchically optimal = globally optimal. But in practice, designing perfect decomposition is difficult.

MAXQ's Value: Even if not globally optimal, hierarchically optimal is much better than random exploration, and learns faster.
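MAXQ's value decomposition Q(i, s, a) = V(a, s) + C(i, s, a) can be illustrated with toy numbers — the subtasks and values below are invented for illustration, not learned:

```python
# V(a, s): expected reward accumulated *inside* subtask a from state s
# C(i, s, a): expected reward after subtask a completes, while finishing parent i
V = {("brew_coffee", "kitchen"): 2.0, ("fry_eggs", "kitchen"): 3.0}
C = {("root", "kitchen", "brew_coffee"): 5.0, ("root", "kitchen", "fry_eggs"): 3.5}

def Q(parent, state, subtask):
    # MAXQ decomposition: Q(i, s, a) = V(a, s) + C(i, s, a)
    return V[(subtask, state)] + C[(parent, state, subtask)]

# The hierarchically optimal choice maximizes V + C, not the subtask reward alone:
best = max(["brew_coffee", "fry_eggs"], key=lambda a: Q("root", "kitchen", a))
print(best, Q("root", "kitchen", best))  # brew_coffee: 2.0 + 5.0 = 7.0
```

Note how "fry_eggs" has the higher in-subtask reward (3.0 vs 2.0), yet the completion term C makes "brew_coffee" the better choice for the parent task — the decomposition is what lets subtasks be learned separately and still combined coherently.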

Q3: Core Difference Between Feudal RL and Options?

Options: - Discrete behavioral primitives (e.g., "open door," "go to A") - High-level policy selects discrete Option ID - Each Option has fixed internal policy

Feudal RL: - Continuous subgoals (e.g., "reach coordinates (x, y)") - Manager outputs continuous subgoal vector - Worker learns conditional policy, adapts to any subgoal

Advantage Comparison: - Options easier to interpret (discrete behaviors have clear semantics) - Feudal more flexible (continuous subgoals cover infinite behaviors)

Example: Robot grasping task: - Options: Predefine "open hand," "close hand," "move to A" - Feudal: Manager outputs target position, Worker learns to reach any position

Q4: Why Does MAML Need Second-Order Gradients?

Problem: The meta-loss gradient w.r.t. meta-parameters θ depends on the adapted parameters θ', while θ' is itself a function of θ:

θ' = θ − α ∇_θ L_train(θ)

Meta-loss: L_meta(θ) = Σ_i L_test^(i)(θ'_i)

Gradient Expansion: ∇_θ L_meta = Σ_i (I − α ∇²_θ L_train^(i)(θ)) ∇_{θ'} L_test^(i)(θ'_i), where ∇²_θ L_train is the Hessian matrix (second-order derivative).

Why Important? - The second-order term captures the "gradient of the gradient," guiding θ toward regions where a gradient update has a large effect - Ignoring the second-order term (FOMAML) assumes ∇²_θ L_train ≈ 0, losing curvature information

In Practice: FOMAML's performance is close to MAML's (gap typically under 5%), but it computes roughly 10x faster, so it is often used as a MAML alternative.
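The effect of the second-order term can be checked exactly on a one-dimensional quadratic loss, where the Hessian is the constant 2. The losses and learning rate here are illustrative assumptions, not taken from the chapter's code:

```python
# Toy losses: L_train(t) = (t - c_tr)^2, L_test(t) = (t - c_te)^2
alpha, c_tr, c_te = 0.1, 1.0, 3.0
theta = 0.0

grad_train = 2 * (theta - c_tr)            # dL_train/dtheta
theta_prime = theta - alpha * grad_train   # inner-loop adapted parameter

grad_test = 2 * (theta_prime - c_te)       # dL_test/dtheta' at the adapted point

# Exact MAML meta-gradient: chain rule through the inner update,
# d(theta')/d(theta) = 1 - alpha * L_train''(theta) = 1 - 2*alpha  (Hessian = 2)
maml_grad = grad_test * (1 - 2 * alpha)

# FOMAML drops the Hessian term and uses grad_test directly
fomaml_grad = grad_test

print(maml_grad, fomaml_grad)  # -4.48 vs -5.6: they differ by (1 - 2*alpha)
```

The two gradients point in the same direction here but differ by the curvature factor; in higher dimensions the Hessian can also rotate the meta-gradient, which is the information FOMAML discards.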

Q5: How Does RL² Encode "Learning" as RNN Hidden State?

Core Idea: - Input sequence: (s_t, a_{t−1}, r_{t−1}, d_{t−1}) at each timestep - An RNN (LSTM) processes the sequence; its hidden state h_t records the "observed experience" - Policy π(a_t | s_t, h_t) is based on the current state and historical experience

"Learning" Process: - Initial episode: h_0 = 0, the RNN has no experience, the policy explores randomly - Subsequent episodes: h_t accumulates information about "which actions are good, which are bad" - The RNN implicitly learns the task structure without explicit gradient updates

Comparison with MAML: - MAML: Explicitly updates parameters with gradients - RL²: Implicitly updates the "policy" through RNN memory

Example: Bandit task (10 arms, each arm with a fixed reward): - Episode 1: The RNN explores randomly, discovers arm 3 has high reward - Episode 2: The RNN's h remembers "arm 3 is good," the policy tends to select arm 3 - No network weight updates needed, only the RNN hidden state is updated

Limitation: RNN memory capacity limited, difficult to handle complex tasks (e.g., tasks requiring understanding long-term causality).
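The input convention can be sketched concretely. A minimal helper that builds the per-step RNN input vector; observation dimensions and the one-hot action encoding are illustrative assumptions:

```python
# RL^2-style input: at each step the RNN sees the current observation plus the
# *previous* action, reward, and done flag, concatenated into one vector.
def rl2_input(obs, prev_action, prev_reward, prev_done, num_actions):
    one_hot = [0.0] * num_actions
    one_hot[prev_action] = 1.0
    return list(obs) + one_hot + [prev_reward, float(prev_done)]

# First step of a fresh trial: a dummy previous action and zero reward
x0 = rl2_input(obs=[0.5, -0.2], prev_action=0, prev_reward=0.0,
               prev_done=False, num_actions=3)
# Later step: the tuple carries "action 1 gave reward 1.0" into the hidden state
x1 = rl2_input(obs=[0.5, -0.2], prev_action=1, prev_reward=1.0,
               prev_done=False, num_actions=3)

assert len(x0) == 2 + 3 + 2  # obs dim + one-hot action + (reward, done)
assert x1[3] == 1.0 and x1[5] == 1.0  # action one-hot and reward visible to the RNN
```

Feeding the previous reward back in is what makes "learning" possible without weight updates: the recurrent state can correlate actions with outcomes across episodes of the same trial.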

Q6: How to Automatically Discover Useful Options?

Challenge: Manual Options design requires domain knowledge, not necessarily optimal.

Method 1: Bottleneck States Based: - Analyze state transition graph, find "must-pass paths" (e.g., room doors, level checkpoints) - Algorithm: Graph clustering, Betweenness Centrality - Advantage: Matches human intuition, easy to interpret - Limitation: Requires complete state transition graph (infeasible for large-scale environments)

Method 2: DIAYN (Diversity Is All You Need): - Learn diverse skills by maximizing the mutual information I(S; Z), where z is the skill ID - Reward: r = log q_φ(z | s) − log p(z) - The discriminator q_φ(z | s) learns to "predict the skill from the state" - Advantage: Unsupervised, automatically discovers diverse behaviors - Limitation: Diversity ≠ usefulness, may learn irrelevant skills

Method 3: Task Success Rate Based: - Train multiple Options, evaluate each's contribution to main task - Remove low-contribution Options, keep high-contribution ones - Algorithms: Evolutionary Options, Option-Critic

Practical Recommendation: - Early stage: Manually design few Options (based on domain knowledge) - Mid stage: Use DIAYN to expand Options library - Late stage: Filter useful Options based on task performance
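DIAYN's intrinsic reward from Method 2 can be computed directly. A sketch with a uniform skill prior; the discriminator probability fed in is a hypothetical placeholder, not a trained model:

```python
import math

# DIAYN-style intrinsic reward: r = log q(z|s) - log p(z),
# with a uniform skill prior p(z) = 1/num_skills.
def diayn_reward(q_z_given_s, num_skills):
    p_z = 1.0 / num_skills
    return math.log(q_z_given_s) - math.log(p_z)

# If the discriminator confidently recognizes the skill from the state,
# the reward is positive; at chance level it is exactly zero.
assert diayn_reward(0.9, num_skills=4) > 0
assert abs(diayn_reward(0.25, num_skills=4)) < 1e-12
assert diayn_reward(0.1, num_skills=4) < 0
```

The zero-at-chance property is what drives diversity: a skill earns reward only by visiting states that distinguish it from the other skills.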

Q7: Which Tasks Work Best for Meta-RL?

Suitable Meta-RL Task Characteristics:

1. Shared Structure Between Tasks: - E.g.: Different maze layouts, but navigation strategies similar - E.g.: Different robot weights, but balancing strategies similar

2. Moderate Task Diversity: - Too similar: Direct transfer learning simpler - Too different: Meta-RL cannot generalize

3. Limited Single-Task Samples: - Meta-RL's advantage is few-shot rapid adaptation - If single task has millions of samples, direct training suffices

Success Cases: - Robot Pushing: Different box weights, friction coefficients, MAML adapts quickly - Bandit Problems: Different arm reward distributions, RL² converges in few rounds - Atari Game Levels: Different levels of same game, Meta-RL quickly learns new levels

Failure Cases: - Completely Random Tasks: E.g., each task has different physics laws, Meta-RL cannot find commonalities - Ultra-Long Time Dependencies: E.g., requires memorizing information 1000 steps ago, RNN difficult to handle

Practical Recommendation: - First verify shared structure between tasks (e.g., using human prior knowledge) - Before trying Meta-RL, try simple transfer learning (e.g., fine-tuning pretrained model)

Q8: Relationship Between Options and Hierarchical DQN?

Hierarchical DQN (h-DQN): - Proposed 2016, for Atari games - Two-layer structure: Meta-Controller selects subgoals, Controller executes atomic actions - Subgoals are discrete (e.g., "reach certain key position")

Relationship with Options: - h-DQN can be viewed as deep learning version of Options - Meta-Controller = high-level Option selection policy - Controller = Option internal policy

h-DQN Innovation: - Uses neural networks for approximation, suitable for large-scale state spaces (e.g., Atari pixels) - Subgoals learned end-to-end, no manual design needed - Intrinsic reward mechanism: Controller optimizes reward for reaching subgoals

Example: Atari game Montezuma's Revenge: - Flat DQN: Difficult to explore (requires consecutive correct actions to get reward) - h-DQN: Meta-Controller sets "get key" → "open door" → "climb stairs," Controller executes step by step, significantly improves performance

Subsequent Developments: - HAM-DQN, Feudal Networks, HIRO are all extensions of h-DQN - Core idea consistent: hierarchical decision-making + temporal abstraction

Q9: Difference Between Meta-RL and Transfer Learning?

Transfer Learning: - Train on source task, fine-tune on target task - Typically assumes single source task (or few source tasks) - Goal: Maximize target task performance

Meta-RL: - Train on task distribution - Assumes multiple related tasks (typically 10+) - Goal: Maximize rapid adaptation ability on new tasks

Mathematical Form:

Transfer Learning: θ* = argmin_θ L_target(θ), with θ initialized at the pretrained parameters θ_pre; i.e., fine-tune from pretrained parameters to the target task.

Meta-RL: θ* = argmin_θ Σ_i L_i(θ − α ∇_θ L_i(θ)); i.e., learn the initial parameters with the strongest "adaptation ability."

Example: - Transfer: Pretrain vision model on ImageNet, fine-tune on medical images - Meta-RL: Train on 100 maze tasks, rapidly adapt on new maze

Which to Choose? - Source and target tasks highly related (e.g., same domain different datasets): Transfer Learning - Have multiple related tasks, goal is rapid adaptation to new tasks: Meta-RL

Q10: Future Directions for Hierarchical RL?

1. End-to-End Hierarchy Discovery: - Current: Most methods require manual task hierarchy design - Future: Completely automatically learn hierarchical structure from data - Challenge: How to define "good hierarchy"? How to efficiently search hierarchy space?

2. Language as Subgoals: - Use natural language to describe subgoals (e.g., "pick up red cup") - Combine Large Language Models (LLMs) to generate subgoals - Advantage: Easy human-machine interaction, strong interpretability

3. Multi-Modal Hierarchical Learning: - Different levels use different modalities (e.g., high-level uses language, low-level uses actions) - Vision-language-action hierarchical alignment

4. Meta-Learning + Hierarchical: - Meta-learn hierarchical structures across multiple tasks - Rapidly adapt hierarchical policies on new tasks - Algorithms: Meta-HAM, Hierarchical MAML

5. Large-Scale Pretraining: - Pretrain hierarchical policies on large-scale offline data - Similar to LLM pretraining paradigm - Challenge: How to design hierarchical pretraining objectives?

Q11: What Improvements Does Directed-MAML (2024) Bring?

MAML's Computational Bottleneck: - Second-order gradient computation requires Hessian matrix - High memory usage (needs to save intermediate gradients) - Long training time (each task requires multiple forward-backward passes)

Directed-MAML's Idea: - Apply task-directed approximation before second-order gradient - Only compute gradient components most relevant to current task - Reduces computational complexity while maintaining performance

Technical Details: - Use a task feature vector to weight gradient directions - Only update the parameter subset related to the task - Similar to the Attention mechanism, but operating in gradient space

Experimental Results (2024 paper): - CartPole-v1: 40% faster convergence - LunarLander-v2: 30% improved sample efficiency - 50% reduction in computation time

Compatibility: Directed-MAML can combine with FOMAML, Meta-SGD etc. to further improve performance.

Q12: How to Combine Options with Deep Learning?

Challenge: - Options framework originally designed for Tabular setting - Large-scale state spaces require function approximation

Option-Critic Architecture: - Parameterize with neural networks: - High-level policy π_Ω(ω | s): selects an Option - Option internal policy π_{ω,θ}(a | s): executes actions - Termination function β_{ω,ϑ}(s): decides when to terminate - End-to-end training, optimizing all parameter groups simultaneously

Gradients (following Bacon et al., 2017):

High-level Policy: derived from the option value function Q_Ω(s, ω) (e.g., ε-greedy), learned via intra-option Q-learning

Option Internal Policy Gradient: E[ ∇_θ log π_{ω,θ}(a | s) · Q_U(s, ω, a) ]

Termination Gradient: −E[ ∇_ϑ β_{ω,ϑ}(s') · A_Ω(s', ω) ], i.e., raise the termination probability when the current Option loses its advantage

Experiments: Option-Critic learns meaningful Options in Atari games (e.g., "avoid enemies," "collect items").

Advantages: - No manual Options design needed - End-to-end training, easy to optimize - Scalable to high-dimensional state spaces

Limitations: - Training unstable (three parameter groups interdependent) - Reduced interpretability (neural network black box)
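The direction of the Option-Critic termination update can be sanity-checked with a toy sigmoid termination function; the logit, advantage values, and learning rate are illustrative:

```python
import math

# One gradient step on the termination logit, following the sign of the
# Option-Critic termination gradient: ascend -beta(s') * A(s', omega).
def termination_update(beta_logit, advantage, lr=0.5):
    beta = 1 / (1 + math.exp(-beta_logit))  # sigmoid termination probability
    # d(beta)/d(logit) = beta * (1 - beta); move logit against beta * advantage
    return beta_logit - lr * beta * (1 - beta) * advantage

# Negative advantage (option no longer worth continuing) -> logit rises,
# i.e., terminate more often; positive advantage -> keep the option running.
print(termination_update(0.0, -1.0))  # > 0: termination probability increases
print(termination_update(0.0, +1.0))  # < 0: termination probability decreases
```

This captures the intuition stated in Q12's gradient: options persist exactly as long as they retain an advantage over switching.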

Core Papers

Hierarchical Reinforcement Learning:

  1. Options Framework:
    Sutton et al. (1999). "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning". Artificial Intelligence.
    https://arxiv.org/abs/cs/9905014

  2. MAXQ:
    Dietterich (2000). "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition". JAIR.
    https://arxiv.org/abs/cs/9905015

  3. HAM:
    Parr & Russell (1998). "Reinforcement Learning with Hierarchies of Machines". NIPS.

  4. Feudal Networks:
    Vezhnevets et al. (2017). "FeUdal Networks for Hierarchical Reinforcement Learning". ICML.
    https://arxiv.org/abs/1703.01161

  5. HIRO:
    Nachum et al. (2018). "Data-Efficient Hierarchical Reinforcement Learning". NeurIPS.
    https://arxiv.org/abs/1805.08296

  6. Option-Critic:
    Bacon et al. (2017). "The Option-Critic Architecture". AAAI.
    https://arxiv.org/abs/1609.05140

Meta-Reinforcement Learning:

  1. MAML:
    Finn et al. (2017). "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks". ICML.
    https://arxiv.org/abs/1703.03400

  2. RL²:
    Duan et al. (2016). "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning". arXiv.
    https://arxiv.org/abs/1611.02779

  3. Directed-MAML:
    (2024). "Directed-MAML: Meta Reinforcement Learning Algorithm with Task-directed Approximation". arXiv.
    https://arxiv.org/abs/2510.00212

  4. DIAYN:
    Eysenbach et al. (2018). "Diversity is All You Need: Learning Skills without a Reward Function". ICLR.
    https://arxiv.org/abs/1802.06070

Surveys and Resources

  • Hierarchical RL Survey:
    Barto & Mahadevan (2003). "Recent Advances in Hierarchical Reinforcement Learning". Discrete Event Dynamic Systems.

  • Meta-Learning Survey:
    Hospedales et al. (2021). "Meta-Learning in Neural Networks: A Survey". IEEE TPAMI.
    https://arxiv.org/abs/2004.05439

Code Libraries

Summary

Hierarchical reinforcement learning and meta-learning represent RL's paradigm shift from "flat single-task" toward "structured multi-task."

Hierarchical RL decomposes complex tasks into manageable subproblems through temporal abstraction and modularity: - Options provides semi-Markov decision framework, supporting temporally extended behavioral primitives - MAXQ decomposes value functions, enabling parallel learning of subtasks - Feudal RL implements flexible hierarchical control with continuous subgoals

Meta-reinforcement learning enables agents to "learn to learn," rapidly adapting to new tasks: - MAML learns "good initialization," achieving rapid gradient adaptation through second-order optimization - RL² uses RNN memory to encode the learning process, implicitly achieving task adaptation - Directed-MAML accelerates MAML with task-directed approximation, maintaining performance while reducing computation

In the future, hierarchical and meta-learning will deeply integrate — meta-learning hierarchical structures across tasks, using language to guide subgoal generation, combining large-scale pretraining to achieve general agents. From robotics to games, from recommendations to autonomous driving, hierarchical and meta-learning are reshaping RL's application boundaries.

  • Post title: Reinforcement Learning (11): Hierarchical Reinforcement Learning and Meta-Learning
  • Post author: Chen Kai
  • Create time: 2024-09-27 09:00:00
  • Post link: https://www.chenk.top/reinforcement-learning-11-hierarchical-and-meta-rl/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stating additionally.