Traditional reinforcement learning treats complex tasks as flat decision sequences — selecting atomic actions at each timestep, gradually optimizing policies through temporal differences. This paradigm works for simple tasks but becomes inefficient and difficult to generalize in tasks requiring long-term planning and multi-level decisions (such as robot manipulation, game completion, autonomous driving). Humans solve complex problems using hierarchical strategies: decomposing large goals into subgoals (e.g., "make breakfast" decomposes to "brew coffee," "fry eggs," "toast bread"), with each subgoal corresponding to a temporally extended action sequence. Hierarchical Reinforcement Learning (Hierarchical RL) introduces this hierarchical thinking into RL — learning multi-level policies through Temporal Abstraction, improving sample efficiency and interpretability. Meanwhile, traditional RL requires learning from scratch on new tasks, whereas humans can quickly adapt to new scenarios (e.g., quickly learning to ride a motorcycle after learning to ride a bicycle). Meta-Reinforcement Learning (Meta-RL) studies how to "learn to learn" — training across multiple related tasks to enable agents to rapidly adapt to new tasks with minimal samples. From Options' semi-Markov decision processes to MAML's second-order gradient optimization, from MAXQ's value decomposition to RL²'s memory augmentation, hierarchical and meta-learning demonstrate reinforcement learning's enormous potential in structure and generalization. This chapter systematically examines these cutting-edge methods and helps you implement Options and MAML algorithms through complete code.
Hierarchical Reinforcement Learning: Temporal Abstraction and Options Framework
Why Do We Need Hierarchy?
Limitations of Flat RL: - Difficult Long-term Credit Assignment: in long episodes (e.g., 1,000 steps), it is hard to determine which actions led to the final reward - Inefficient Exploration: the atomic-action exploration space grows exponentially with the horizon (on the order of |A|^T for T steps), so deep goals are rarely reached by chance
Advantages of Hierarchy: - Temporal Abstraction: Package multi-step actions into "macro-actions," reducing decision points - Modularity: Learn reusable sub-policies (e.g., "walk to door," "turn handle") - Faster Exploration: Exploring at abstract level, each step "jumps" across more states - Interpretability: Policies decompose into high-level goals and low-level execution, easy to understand and debug
Options Framework: Semi-Markov Decision Processes
Options Definition: An Option is a triple ⟨I, π_o, β⟩: an initiation set I ⊆ S of states where the Option may start, an internal policy π_o(a|s) that selects atomic actions, and a termination condition β(s) ∈ [0, 1] giving the probability that the Option terminates in state s.
Relationship with MDP: Options transform the original MDP into a Semi-Markov Decision Process (SMDP): - State space unchanged, but the action space expands from A to A ∪ O, and a single decision can span multiple time steps
Value Function: Define the Option-value function Q(s, o): the expected return from starting Option o in state s and behaving optimally thereafter.
Bellman Equation: Q(s, o) = E[r_{t+1} + γ r_{t+2} + ... + γ^{k−1} r_{t+k} + γ^k max_{o'} Q(s_{t+k}, o')], where k is the (random) number of steps o runs before terminating.
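Written out in display form (following Sutton et al., 1999), with k the random duration of the Option:

```latex
Q^*(s, o) = \mathbb{E}\!\left[\, r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k}
  + \gamma^{k} \max_{o' \in \mathcal{O}} Q^*(s_{t+k}, o') \;\middle|\; s_t = s,\ o_t = o \,\right]
```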
Options Learning: Intra-Option Q-Learning
Core Idea: No need to wait for Option execution to complete, can update Q-values at every step during Option execution.
Intra-Option Update: Suppose the agent is currently executing Option o and observes (s, a, r, s'). The update is Q(s, o) ← Q(s, o) + α [r + γ U(s', o) − Q(s, o)], with the continuation value U(s', o) = (1 − β(s')) Q(s', o) + β(s') max_{o'} Q(s', o').
Intuition: - If the Option continues at s' (probability 1 − β(s')), bootstrap from Q(s', o) - If it terminates (probability β(s')), bootstrap from the best available Option, max_{o'} Q(s', o')
Advantages: - Updates at every step, doesn't waste experience - Supports off-policy learning (while executing one Option, every Option whose policy would have taken the same action can be updated simultaneously)
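In tabular form, one intra-option step looks like this (the dict-backed Q table and the callable β are illustrative choices, not a fixed API):

```python
def intra_option_update(Q, s, o, r, s_next, beta, options, alpha=0.1, gamma=0.99):
    """One intra-option Q-learning step for the currently executing option o.

    Q       : dict mapping (state, option) -> value
    beta    : beta(s, o) -> termination probability of o in state s
    options : options available in s_next
    """
    b = beta(s_next, o)
    # Continuation value U(s', o): keep running o with prob 1 - beta,
    # otherwise terminate and bootstrap from the best available option.
    u = (1 - b) * Q.get((s_next, o), 0.0) \
        + b * max(Q.get((s_next, o2), 0.0) for o2 in options)
    old = Q.get((s, o), 0.0)
    Q[(s, o)] = old + alpha * (r + gamma * u - old)
    return Q[(s, o)]
```

Because the update only needs (s, a, r, s') and β, it can be applied at every environment step, to every consistent Option at once.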
Options Discovery: Automatically Learning Subgoals
Problem with Manual Options Design: Requires domain knowledge, difficult to scale to new tasks.
Automatic Discovery Methods:
1. Bottleneck States Based: - Analyze state transition graph, find "must-pass places" (e.g., doors between rooms) - Use bottleneck states as subgoals, train Options to reach those states - Algorithms: Betweenness Centrality, Graph Clustering
2. State Coverage Based: - Train diverse Options, maximize visited-state diversity - Objective: maximize the entropy of the state distribution the Options induce
3. Skill Learning Based: - Discover useful behavioral primitives using unsupervised learning - Objective: maximize the mutual information I(S; Z) between visited states S and a skill latent Z
DIAYN Algorithm: - Learn skill policies π(a|s, z) conditioned on a latent skill z - Train a discriminator q_φ(z|s) to infer the skill from the state - Intrinsic reward: r = log q_φ(z|s) − log p(z), so each skill is rewarded for visiting states that identify it
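The intrinsic reward above is simple to compute from discriminator logits; a minimal sketch (the uniform skill prior is DIAYN's default choice):

```python
import numpy as np

def diayn_reward(disc_logits, z, n_skills):
    """DIAYN-style intrinsic reward: log q(z|s) - log p(z).

    disc_logits : discriminator logits over skills for the current state
    z           : index of the skill currently being executed
    With a uniform prior p(z), log p(z) = -log(n_skills).
    """
    log_q = disc_logits - np.log(np.sum(np.exp(disc_logits)))  # log-softmax
    return float(log_q[z] + np.log(n_skills))
```

When the discriminator is at chance (uniform logits), the reward is zero; it becomes positive exactly when the state betrays which skill produced it.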
Four Rooms Experiment: Classic Options Example
Environment: A grid world with 4 rooms, passages between rooms, agent starts from random position, navigates to goal position.
Manual Options: - 4 Options, each corresponding to "go to room X's exit" - Initiation set: all grids in room X - Termination condition: reach exit
Experimental Results: - Q-learning with Options 3-5x faster than vanilla Q-learning - Options can be reused across different goal positions without retraining
Code implementation provided later.
MAXQ: Value Function Decomposition
Core Idea: Task Hierarchization and Value Decomposition
MAXQ decomposes complex tasks into Task Hierarchy Graph: - Root node: main task (e.g., "make breakfast") - Intermediate nodes: subtasks (e.g., "brew coffee," "fry eggs") - Leaf nodes: atomic actions (e.g., "turn on stove," "pour water")
Value Function Decomposition: The Q-value of each task decomposes as Q(i, s, a) = V(a, s) + C(i, s, a), where V(a, s) is the value of completing subtask a from s, and C(i, s, a) is the completion function: the expected return for finishing parent task i after a completes.
Leaf Nodes (atomic action a): V(a, s) is the expected immediate reward of executing a in s.
Non-leaf Nodes (composite task i): V(i, s) = max_a Q(i, s, a), recursing down the hierarchy.
MAXQ Learning Algorithm
MAXQ-Q Learning: Recursively update the value and completion functions. A function MAXQ-Q(task i, state s) executes the primitive action at leaf nodes; at internal nodes it repeatedly picks a child subtask, invokes itself recursively, and updates C(i, s, a) from the result.
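Under illustrative assumptions (a toy environment object with step/terminated methods and dict-backed V and C tables; none of these interfaces are from the original post), the recursion can be sketched as:

```python
import random

def maxq_q(i, s, env, C, V, children, is_primitive, alpha=0.1, gamma=0.95, eps=0.1):
    """Recursively execute and learn subtask i from state s.
    Returns (steps_taken, resulting_state)."""
    if is_primitive(i):
        s_next, r = env.step(i)                 # execute the atomic action
        V[(i, s)] = V.get((i, s), 0.0) + alpha * (r - V.get((i, s), 0.0))
        return 1, s_next
    steps = 0
    while not env.terminated(i, s):
        acts = children[i]
        # epsilon-greedy over child subtasks using Q(i,s,a) = V(a,s) + C(i,s,a)
        if random.random() < eps:
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda a2: V.get((a2, s), 0.0) + C.get((i, s, a2), 0.0))
        n, s_next = maxq_q(a, s, env, C, V, children, is_primitive, alpha, gamma, eps)
        # completion update: C(i,s,a) moves toward gamma^n * max_a' Q(i,s',a')
        best = max(V.get((a2, s_next), 0.0) + C.get((i, s_next, a2), 0.0) for a2 in acts)
        key = (i, s, a)
        C[key] = C.get(key, 0.0) + alpha * (gamma ** n * best - C.get(key, 0.0))
        steps += n
        s = s_next
    return steps, s
```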
Key Features: - State Abstraction: each subtask conditions only on the state variables relevant to it, greatly reducing what must be learned
MAXQ Advantages and Limitations
Advantages: - Sample Efficiency: value decomposition and state abstraction reduce the parameters to learn - Transferability: a subtask's value function can be reused wherever the same subtask appears in other tasks
Limitations: - Requires Manual Task Hierarchy Design: Difficult to automatically discover optimal decomposition in practice - Only Converges to Hierarchically Optimal: May not be globally optimal (limited by predefined hierarchy structure) - Tabular Limitation: Extending to large state spaces requires function approximation, but convergence proof no longer holds
Feudal Reinforcement Learning: Goal-Conditioned Hierarchical Learning
Feudal RL's Idea
Feudal System Analogy: - Manager (high-level): Sets subgoals (e.g., "go to coordinates (10, 5)") - Worker (low-level): Executes atomic actions to reach subgoals
Difference from Options: - Options' high-level policy selects among discrete Options - Feudal RL's Manager outputs continuous subgoals g, and the Worker learns a goal-conditioned policy π(a|s, g)
FuN: FeUdal Networks for Hierarchical RL
Architecture: - Manager: operates at a lower temporal resolution, emitting a subgoal g_t (a direction in a learned latent state space) every c steps - Worker: outputs atomic actions at every step, conditioned on the Manager's recent goals
Manager's Objective: Maximize extrinsic reward; its gradient moves g_t toward state-space directions that yielded high external return.
Worker's Objective: Maximize intrinsic reward (proximity to the subgoal direction): r^I_t = (1/c) Σ_{i=1..c} cos(s_t − s_{t−i}, g_{t−i}), the average cosine similarity between recent state displacements and the Manager's goals.
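The Worker reward can be sketched directly; the state-embedding arrays and horizon c below are illustrative stand-ins for FuN's learned latent space:

```python
import numpy as np

def worker_intrinsic_reward(states, goals, t, c=10):
    """FuN-style intrinsic reward: mean cosine similarity between recent
    state displacements and the Manager's goals over horizon c.
    states: array [T, d] of state embeddings; goals: array [T, d]."""
    def cos(u, v):
        n = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / n) if n > 0 else 0.0
    sims = [cos(states[t] - states[t - i], goals[t - i]) for i in range(1, c + 1)]
    return sum(sims) / c
```

When the agent's displacement lines up exactly with the goal direction, the reward saturates at 1.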
Training: - Worker trained with A3C, optimizing intrinsic reward - Manager trained with policy gradient, optimizing extrinsic reward - Gradients propagate from Worker to Manager (Worker's state embedding is Manager's input)
Advantages: - Automatically learns subgoals, no manual design needed - Manager focuses on long-term planning, Worker focuses on short-term execution - Significantly improves performance in Atari games (e.g., Montezuma's Revenge)
HIRO: Data Efficient Hierarchical RL
Improvement: FuN's training is unstable (Manager and Worker objectives may conflict). HIRO proposes an off-policy correction:
Core Idea: - Manager outputs a subgoal g every c steps, expressed directly in the raw state space (a desired state offset) - Worker receives intrinsic reward −‖s_t + g_t − s_{t+1}‖ for moving toward the subgoal - Both levels train off-policy, which is far more sample efficient
Relabeling Mechanism: For a stored high-level transition (s_t, g_t, Σr, s_{t+c}), the Worker's policy has changed since the data was collected, so the old subgoal no longer explains the stored low-level actions. HIRO relabels g_t with the candidate goal that maximizes the likelihood of the actions actually taken, keeping old data usable.
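A sketch of the relabeling step, assuming a generic log_prob callable for the current low-level policy and Gaussian candidate sampling as in the paper (all helper names and the noise scale are hypothetical):

```python
import numpy as np

def relabel_goal(states, actions, log_prob, orig_goal, n_samples=8, rng=None):
    """HIRO-style off-policy goal relabeling.

    Candidates: the original goal, the actual state displacement, and
    Gaussian perturbations around it. Keep whichever makes the stored
    low-level actions most likely under the *current* low-level policy.
    log_prob(s, g, a) -> log pi_lo(a | s, g).
    """
    rng = rng or np.random.default_rng(0)
    displacement = states[-1] - states[0]
    candidates = [orig_goal, displacement]
    candidates += [displacement + rng.normal(scale=0.5, size=displacement.shape)
                   for _ in range(n_samples)]

    def score(g):
        total, goal = 0.0, g
        for s, s_next, a in zip(states[:-1], states[1:], actions):
            total += log_prob(s, goal, a)
            goal = s + goal - s_next   # goal transition: h(s, g, s') = s + g - s'
        return total

    return max(candidates, key=score)
```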
Effect: HIRO more stable than FuN in continuous control tasks (e.g., Ant Maze), sample efficiency improves 2-3x.
Meta-Reinforcement Learning: Learning to Learn Fast
Why Do We Need Meta-RL?
Limitations of Traditional RL: - Each new task learned from scratch, wastes knowledge from similar tasks - Low sample efficiency, requires millions of steps to converge - Cannot "learn by analogy" like humans
Meta-RL's Goal: - Train on multiple related tasks drawn from a task distribution p(T) - Learn a meta-policy (or initialization) that adapts to a new task from the same distribution with only a handful of episodes
Application Scenarios: - Robotics: quickly adapt grasping policies under different objects, weights, friction coefficients - Games: quickly learn new levels, new characters - Recommendation systems: quickly adapt to new user preferences
MAML: Model-Agnostic Meta-Learning
Core Idea: Learn "good initialization" parameters θ such that one or a few gradient steps on a new task yield a strong task-specific policy.
Algorithm Flow:
- Meta-Training:
- Sample task batch {T_i} ~ p(T)
- For each task T_i: - Starting from θ, sample trajectories D_i, compute loss L_{T_i}(θ) - Take one gradient step: θ'_i = θ − α ∇_θ L_{T_i}(θ) - Use θ'_i to sample new trajectories D'_i, compute loss L_{T_i}(θ'_i)
- Meta-optimize: θ ← θ − β ∇_θ Σ_i L_{T_i}(θ'_i)
- Meta-Testing:
- Given new task T_new
- Starting from θ, take a few gradient steps to obtain θ'; deploy policy π_{θ'}
Mathematical Form: min_θ Σ_{T_i ~ p(T)} L_{T_i}(θ − α ∇_θ L_{T_i}(θ))
Key: This is second-order gradient optimization: the meta-loss gradient w.r.t. θ passes through the inner update θ'_i = θ − α ∇_θ L_{T_i}(θ), so it contains the Hessian ∇²_θ L_{T_i}(θ).
FOMAML (First-Order MAML): Ignore the second-order term, only use ∇_{θ'} L_{T_i}(θ'_i), the gradient evaluated at the adapted parameters.
RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
Core Idea: Use RNN to memorize historical experience, encode "learning" as RNN hidden state updates.
Architecture: - Input: at each step the RNN receives (s_t, a_{t−1}, r_{t−1}, d_{t−1}), i.e., the current observation plus the previous action, reward, and done flag - Output: a policy over actions, with the hidden state accumulating experience
Difference from MAML: - MAML explicitly does gradient updates to adapt to new tasks - RL² implicitly adapts to new tasks through RNN memory (the RNN hidden state plays the role of the adapted parameters)
Training Process: - At the start of each trial (a new task), the RNN hidden state resets to zero - Within a trial, the hidden state is preserved across episode boundaries, so later episodes can exploit what earlier episodes revealed - The RNN is trained end-to-end with a standard ("slow") RL algorithm to maximize total reward over the whole trial
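A minimal recurrent policy matching this input convention might look like the following (network sizes are illustrative):

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """RL^2-style recurrent policy: the GRU hidden state carries experience
    across episodes within a trial, acting as the 'fast' learning algorithm."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # input = observation + one-hot prev action + prev reward + done flag
        self.gru = nn.GRU(obs_dim + n_actions + 2, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def forward(self, obs, prev_action, prev_reward, prev_done, h=None):
        a_onehot = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, a_onehot, prev_reward, prev_done], dim=-1).unsqueeze(1)
        out, h = self.gru(x, h)     # pass h back in to persist across episodes
        return torch.softmax(self.pi(out.squeeze(1)), dim=-1), h
```

The caller resets h to None at trial boundaries and threads it through otherwise; that threading is the entire "adaptation" mechanism.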
Effect: RL² rapidly adapts to new tasks in Bandit problems and Tabular MDPs, but is less stable than MAML in complex environments (e.g., MuJoCo).
Meta-Learning Challenges and Frontiers
Challenges: - Task Distribution: Meta-RL assumes training and test tasks come from the same distribution; performance degrades sharply when new tasks fall outside it
Frontier Directions:
1. Unsupervised Meta-RL: - No task labels, automatically construct task distribution from single environment - Algorithms: CARML, UML
2. Model-Based Meta-RL: - Learn environment model, rapidly adapt in model - Algorithms: MAESN, Context-based Meta-RL
3. Transformer for Meta-RL: - Replace RNN with Transformer, handle long sequences - Algorithm: Decision Transformer + Meta-learning
4. Directed-MAML (2024 new development): - Use task-directed approximation to reduce second-order gradient computation - Outperforms MAML on CartPole, LunarLander
Complete Code Implementation: Options in Four Rooms
Below implements Options framework in Four Rooms environment, including: - Four Rooms environment construction - Manual Options definition (reach exits of each room) - Intra-Option Q-learning training - Comparison with vanilla Q-learning
Code Analysis
Four Rooms Environment: - 13x13 grid, 4 rooms, doors between rooms - Agent starts from random position, reaches goal position for reward
Options Definition: - 4 Options, each corresponding to "go to room X's exit" - Initiation set: all positions in room - Termination condition: reach exit (door)
Intra-Option Q-Learning: - High-level: Q(s, o) selects Options - Low-level: the Option's internal policy π_o(a|s) executes atomic actions - Update: Q(s, o) is updated at every step, using the continuation value U(s', o) = (1 − β(s')) Q(s', o) + β(s') max_{o'} Q(s', o')
Experimental Results (expected): - Options converges faster in first 100 episodes (smaller exploration space) - Final performance similar, but Options has fewer steps (jumps across rooms directly)
Complete Code Implementation: MAML for RL
Below implements MAML on simple RL task (2D navigation task, different tasks correspond to different goal positions):
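A compact sketch of this setup follows. To keep it short, the known linear dynamics s' = s + a make the return differentiable end-to-end (a model-based stand-in for the policy-gradient inner loop), and all network sizes and learning rates are illustrative:

```python
import torch

torch.manual_seed(0)

def init_params():
    g = torch.Generator().manual_seed(0)
    def mk(*shape):
        return (0.1 * torch.randn(*shape, generator=g)).requires_grad_()
    return [mk(2, 32), mk(32), mk(32, 2), mk(2)]

def policy(params, s):
    w1, b1, w2, b2 = params
    return torch.tanh(s @ w1 + b1) @ w2 + b2

def task_loss(params, goal, horizon=5):
    """Roll the policy from the origin; loss = squared final distance to goal."""
    s = torch.zeros(2)
    for _ in range(horizon):
        s = s + policy(params, s)
    return torch.sum((s - goal) ** 2)

def adapt(params, goal, inner_lr=0.05, create_graph=False):
    """One inner-loop step; create_graph=True keeps the second-order path."""
    grads = torch.autograd.grad(task_loss(params, goal), params,
                                create_graph=create_graph)
    return [p - inner_lr * g for p, g in zip(params, grads)]

def maml_train(meta_steps=150, task_batch=8, meta_lr=0.01):
    params = init_params()
    opt = torch.optim.Adam(params, lr=meta_lr)
    for _ in range(meta_steps):
        opt.zero_grad()
        meta_loss = 0.0
        for _ in range(task_batch):
            goal = torch.rand(2) * 2 - 1               # task = random goal
            fast = adapt(params, goal, create_graph=True)
            meta_loss = meta_loss + task_loss(fast, goal)  # post-adaptation loss
        meta_loss.backward()                           # second-order meta-gradient
        opt.step()
    return params
```

Note the policy never sees the goal: task identity enters only through the adaptation gradients, which is what forces MAML to learn an adaptable initialization rather than a single fixed behavior.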
Code Analysis
Environment: - 2D plane navigation, agent starts from (0,0), navigates to goal point - Different tasks = different goal positions (random sampling) - Reward: negative Euclidean distance
MAML Flow: 1. Sample Task Batch: 10 random goal positions 2. Inner Loop (for each task): - Collect a trajectory, compute the loss - Take 1 gradient step, obtain adapted parameters θ'_i 3. Outer Loop: - Use θ'_i to collect a trajectory again, compute the meta-loss - Take a gradient step on the meta-parameters θ 4. Rapid Adaptation: - Given a new task, take 5 gradient steps starting from θ - Evaluate the adapted performance
Experimental Results (expected): - Before adaptation: Meta-policy performs moderately on random tasks - After adaptation: 5 gradient steps significantly improve performance (e.g., return from -50 to -20) - Shows MAML learned "good initialization," can rapidly adapt to new tasks
In-Depth Q&A
Q1: Why Does Options Framework Accelerate Learning?
Temporal Abstraction Reduces Decision Points: - Flat RL: selects an atomic action each step, so decision points = episode length T - With Options: a decision is made only when an Option terminates, so decision points ≈ the number of Option switches, far fewer than T
Improved Exploration Efficiency: - Flat RL: the exploration space grows like |A|^T - With Options: each Option "jumps" across many states at once, so exploration effectively operates over a much shorter abstract horizon
Value Function Reuse: - Option Q-values can be shared across different tasks - E.g., "open door" Option useful in both "exit room" and "enter room" tasks
Experimental Verification: In Four Rooms task, Options converges 3-5x faster than flat Q-learning.
Q2: What's the Difference Between MAXQ's "Hierarchically Optimal" and "Globally Optimal"?
Hierarchically Optimal: - Optimal policy under given task decomposition - Each subtask independently optimized - Constrained by predefined task hierarchy structure
Globally Optimal: - Optimal policy on original MDP - Not constrained by task decomposition
Example: Suppose task "make breakfast" decomposes to "brew coffee" → "fry eggs" → "toast bread": - Globally optimal: May brew coffee and fry eggs in parallel (faster) - Hierarchically optimal: Constrained to serial decomposition, cannot parallelize
When Consistent? When task decomposition perfectly captures optimal policy's structure, hierarchically optimal = globally optimal. But in practice, designing perfect decomposition is difficult.
MAXQ's Value: Even if not globally optimal, hierarchically optimal is much better than random exploration, and learns faster.
Q3: Core Difference Between Feudal RL and Options?
Options: - Discrete behavioral primitives (e.g., "open door," "go to A") - High-level policy selects discrete Option ID - Each Option has fixed internal policy
Feudal RL: - Continuous subgoals (e.g., "reach coordinates (x, y)") - Manager outputs a continuous subgoal vector g; the Worker learns a single goal-conditioned policy π(a|s, g) rather than a set of fixed Options
Advantage Comparison: - Options easier to interpret (discrete behaviors have clear semantics) - Feudal more flexible (continuous subgoals cover infinite behaviors)
Example: Robot grasping task: - Options: predefine "open hand," "close hand," "move to A" - Feudal: the Manager outputs a target position (x, y, z) and the Worker moves the gripper toward it
Q4: Why Does MAML Need Second-Order Gradients?
Problem: The meta-loss gradient w.r.t. the meta-parameters θ must pass through the inner-loop update.
Meta-loss: L(θ'), where θ' = θ − α ∇_θ L(θ).
Gradient Expansion: ∇_θ L(θ') = (I − α ∇²_θ L(θ)) ∇_{θ'} L(θ'); the Hessian factor ∇²_θ L(θ) is the second-order part.
Why Important? - The second-order term captures the "gradient of the gradient": how moving θ changes the inner update itself - It guides θ toward regions where a single gradient step is maximally effective, which is exactly MAML's objective
In Practice: FOMAML's performance is close to MAML (typically within a few percent), but computation is roughly 10x faster, so it is often used as a MAML alternative.
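The gap between the two is easy to see on a toy scalar loss, where create_graph controls whether the inner step's derivative enters the meta-gradient:

```python
import torch

# Toy inner loss L(theta) = theta^4, inner step theta' = theta - a * L'(theta).
# MAML meta-gradient:  dL(theta')/dtheta = (1 - a * L''(theta)) * L'(theta')
# FOMAML keeps only:   L'(theta')
theta = torch.tensor(1.0, requires_grad=True)
a = 0.1

def loss(t):
    return t ** 4

# MAML: keep the graph of the inner update so the Hessian term survives
g_inner, = torch.autograd.grad(loss(theta), theta, create_graph=True)
theta_prime = theta - a * g_inner
maml_grad, = torch.autograd.grad(loss(theta_prime), theta)

# FOMAML: treat theta' as a constant w.r.t. theta
theta_detached = theta_prime.detach()
fomaml_grad = 4 * theta_detached ** 3        # L'(theta') evaluated directly

# Analytically: theta' = 1 - 0.1*4 = 0.6, L'(0.6) = 0.864,
# and MAML multiplies by (1 - 0.1 * L''(1)) = (1 - 1.2) = -0.2, giving -0.1728
```

Here the second-order factor even flips the sign of the meta-gradient, which is the kind of information FOMAML discards.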
Q5: How Does RL² Encode "Learning" as RNN Hidden State?
Core Idea: - Input sequence: (s_t, a_{t−1}, r_{t−1}, d_{t−1}) at every step, spanning multiple episodes of the same task
"Learning" Process: - Initial episodes: the hidden state carries no task information, so the policy explores - Later episodes: the hidden state has encoded which behaviors earned reward, so the policy exploits; the RNN's forward pass is itself the learning algorithm
Comparison with MAML: - MAML: Explicitly gradient updates parameters - RL²: Implicitly updates "policy" through RNN memory
Example: Bandit task (10 arms, each arm a fixed reward): - Episode 1: the RNN explores randomly, discovers arm 3 has high reward - Episode 2: the RNN's hidden state remembers arm 3, and the policy pulls it immediately
Limitation: RNN memory capacity limited, difficult to handle complex tasks (e.g., tasks requiring understanding long-term causality).
Q6: How to Automatically Discover Useful Options?
Challenge: Manual Options design requires domain knowledge, not necessarily optimal.
Method 1: Bottleneck States Based: - Analyze state transition graph, find "must-pass paths" (e.g., room doors, level checkpoints) - Algorithm: Graph clustering, Betweenness Centrality - Advantage: Matches human intuition, easy to interpret - Limitation: Requires complete state transition graph (infeasible for large-scale environments)
Method 2: DIAYN (Diversity Is All You Need): - Learn diverse skills by maximizing the mutual information I(S; Z) between skills and the states they visit - The learned skills can then serve as Options for downstream tasks - Advantage: requires no reward signal - Limitation: discovered skills are not guaranteed to be useful for the task at hand
Method 3: Task Success Rate Based: - Train multiple Options, evaluate each's contribution to main task - Remove low-contribution Options, keep high-contribution ones - Algorithms: Evolutionary Options, Option-Critic
Practical Recommendation: - Early stage: Manually design few Options (based on domain knowledge) - Mid stage: Use DIAYN to expand Options library - Late stage: Filter useful Options based on task performance
Q7: Which Tasks Work Best for Meta-RL?
Suitable Meta-RL Task Characteristics:
1. Shared Structure Between Tasks: - E.g.: Different maze layouts, but navigation strategies similar - E.g.: Different robot weights, but balancing strategies similar
2. Moderate Task Diversity: - Too similar: Direct transfer learning simpler - Too different: Meta-RL cannot generalize
3. Limited Single-Task Samples: - Meta-RL's advantage is few-shot rapid adaptation - If single task has millions of samples, direct training suffices
Success Cases: - Robot Pushing: Different box weights, friction coefficients, MAML adapts quickly - Bandit Problems: Different arm reward distributions, RL² converges in few rounds - Atari Game Levels: Different levels of same game, Meta-RL quickly learns new levels
Failure Cases: - Completely Random Tasks: E.g., each task has different physics laws, Meta-RL cannot find commonalities - Ultra-Long Time Dependencies: E.g., requires memorizing information 1000 steps ago, RNN difficult to handle
Practical Recommendation: - First verify shared structure between tasks (e.g., using human prior knowledge) - Before trying Meta-RL, try simple transfer learning (e.g., fine-tuning pretrained model)
Q8: Relationship Between Options and Hierarchical DQN?
Hierarchical DQN (h-DQN): - Proposed 2016, for Atari games - Two-layer structure: Meta-Controller selects subgoals, Controller executes atomic actions - Subgoals are discrete (e.g., "reach certain key position")
Relationship with Options: - h-DQN can be viewed as deep learning version of Options - Meta-Controller = high-level Option selection policy - Controller = Option internal policy
h-DQN Innovation: - Uses neural networks for approximation, suitable for large-scale state spaces (e.g., Atari pixels) - Subgoals learned end-to-end, no manual design needed - Intrinsic reward mechanism: Controller optimizes reward for reaching subgoals
Example: Atari game Montezuma's Revenge: - Flat DQN: Difficult to explore (requires consecutive correct actions to get reward) - h-DQN: Meta-Controller sets "get key" → "open door" → "climb stairs," Controller executes step by step, significantly improves performance
Subsequent Developments: - HAM-DQN, Feudal Networks, HIRO are all extensions of h-DQN - Core idea consistent: hierarchical decision-making + temporal abstraction
Q9: Difference Between Meta-RL and Transfer Learning?
Transfer Learning: - Train on source task, fine-tune on target task - Typically assumes single source task (or few source tasks) - Goal: Maximize target task performance
Meta-RL: - Train on a task distribution p(T) - Optimize post-adaptation performance rather than performance on any single task - Goal: rapid adaptation to unseen tasks from the distribution
Mathematical Form:
Transfer Learning: min_θ L_target(FineTune(θ_source))
Meta-RL: min_θ E_{T ~ p(T)} [L_T(Adapt(θ, T))], e.g., for MAML, Adapt(θ, T) = θ − α ∇_θ L_T(θ)
Example: - Transfer: Pretrain vision model on ImageNet, fine-tune on medical images - Meta-RL: Train on 100 maze tasks, rapidly adapt on new maze
Which to Choose? - Source and target tasks highly related (e.g., same domain different datasets): Transfer Learning - Have multiple related tasks, goal is rapid adaptation to new tasks: Meta-RL
Q10: Future Directions for Hierarchical RL?
1. End-to-End Hierarchy Discovery: - Current: Most methods require manual task hierarchy design - Future: Completely automatically learn hierarchical structure from data - Challenge: How to define "good hierarchy"? How to efficiently search hierarchy space?
2. Language as Subgoals: - Use natural language to describe subgoals (e.g., "pick up red cup") - Combine Large Language Models (LLMs) to generate subgoals - Advantage: Easy human-machine interaction, strong interpretability
3. Multi-Modal Hierarchical Learning: - Different levels use different modalities (e.g., high-level uses language, low-level uses actions) - Vision-language-action hierarchical alignment
4. Meta-Learning + Hierarchical: - Meta-learn hierarchical structures across multiple tasks - Rapidly adapt hierarchical policies on new tasks - Algorithms: Meta-HAM, Hierarchical MAML
5. Large-Scale Pretraining: - Pretrain hierarchical policies on large-scale offline data - Similar to LLM pretraining paradigm - Challenge: How to design hierarchical pretraining objectives?
Q11: What Improvements Does Directed-MAML (2024) Bring?
MAML's Computational Bottleneck: - Second-order gradient computation requires Hessian matrix - High memory usage (needs to save intermediate gradients) - Long training time (each task requires multiple forward-backward passes)
Directed-MAML's Idea: - Apply task-directed approximation before second-order gradient - Only compute gradient components most relevant to current task - Reduces computational complexity while maintaining performance
Technical Details: - Use a task feature vector to identify the gradient components most relevant to the current task - Approximate the second-order update along only those directions, skipping the rest of the Hessian computation
Experimental Results (2024 paper): - CartPole-v1: 40% faster convergence - LunarLander-v2: 30% improved sample efficiency - 50% reduction in computation time
Compatibility: Directed-MAML can combine with FOMAML, Meta-SGD etc. to further improve performance.
Q12: How to Combine Options with Deep Learning?
Challenge: - Options framework originally designed for Tabular setting - Large-scale state spaces require function approximation
Option-Critic Architecture: - Parameterize everything with neural networks: - High-level policy over Options (in practice ε-greedy over Q_Ω(s, o)) - Option internal policies π_{o,θ}(a|s) - Termination functions β_{o,ϑ}(s)
Loss Functions:
High-level Policy: Q_Ω(s, o) is learned with SMDP-style TD updates.
Option Internal Policy Gradient: ∇_θ = E[∇_θ log π_{o,θ}(a|s) Q_U(s, o, a)]; each Option's policy ascends the value of its own action choices.
Termination Gradient: ∇_ϑ = −E[∇_ϑ β_{o,ϑ}(s') A_Ω(s', o)]; an Option learns to terminate where its advantage A_Ω(s', o) = Q_Ω(s', o) − V_Ω(s') is negative, i.e., where some other Option is better.
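As a sketch, the termination update can be implemented as a descent step on β(s', o) · A(s', o); the network, optimizer, and the max-based value estimate below are illustrative choices:

```python
import torch
import torch.nn as nn

def termination_step(beta_net, opt, s_next, o, q_omega):
    """One termination-gradient update (Option-Critic style, sketch).

    beta_net(s) -> [n_options] termination probabilities (sigmoid output).
    q_omega     -> [n_options] critic values Q_Omega(s_next, .), held fixed.
    Minimizing beta(s',o) * A(s',o) raises beta when the option's advantage
    is negative (terminate it) and lowers it when positive (keep running).
    """
    advantage = (q_omega[o] - q_omega.max()).detach()   # V approximated by max_o Q
    loss = beta_net(s_next)[o] * advantage
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```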
Experiments: Option-Critic learns meaningful Options in Atari games (e.g., "avoid enemies," "collect items").
Advantages: - No manual Options design needed - End-to-end training, easy to optimize - Scalable to high-dimensional state spaces
Limitations: - Training unstable (three parameter groups interdependent) - Reduced interpretability (neural network black box)
Related Papers and Resources
Core Papers
Hierarchical Reinforcement Learning:
Options Framework:
Sutton et al. (1999). "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning". Artificial Intelligence.
https://arxiv.org/abs/cs/9905014
MAXQ:
Dietterich (2000). "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition". JAIR.
https://arxiv.org/abs/cs/9905015
HAM:
Parr & Russell (1998). "Reinforcement Learning with Hierarchies of Machines". NIPS.
Feudal Networks:
Vezhnevets et al. (2017). "FeUdal Networks for Hierarchical Reinforcement Learning". ICML.
https://arxiv.org/abs/1703.01161
HIRO:
Nachum et al. (2018). "Data-Efficient Hierarchical Reinforcement Learning". NeurIPS.
https://arxiv.org/abs/1805.08296
Option-Critic:
Bacon et al. (2017). "The Option-Critic Architecture". AAAI.
https://arxiv.org/abs/1609.05140
Meta-Reinforcement Learning:
MAML:
Finn et al. (2017). "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks". ICML.
https://arxiv.org/abs/1703.03400
RL²:
Duan et al. (2016). "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning". arXiv.
https://arxiv.org/abs/1611.02779
Directed-MAML:
(2024). "Directed-MAML: Meta Reinforcement Learning Algorithm with Task-directed Approximation". arXiv.
https://arxiv.org/abs/2510.00212
DIAYN:
Eysenbach et al. (2018). "Diversity is All You Need: Learning Skills without a Reward Function". ICLR.
https://arxiv.org/abs/1802.06070
Surveys and Resources
Hierarchical RL Survey:
Barto & Mahadevan (2003). "Recent Advances in Hierarchical Reinforcement Learning". Discrete Event Dynamic Systems.
Meta-Learning Survey:
Hospedales et al. (2021). "Meta-Learning in Neural Networks: A Survey". IEEE TPAMI.
https://arxiv.org/abs/2004.05439
Code Libraries
- OpenAI Baselines: https://github.com/openai/baselines
- learn2learn: https://github.com/learnables/learn2learn (MAML implementation)
- Option-Critic: https://github.com/jeanharb/option_critic
Summary
Hierarchical reinforcement learning and meta-learning represent RL's paradigm shift from "flat single-task" toward "structured multi-task."
Hierarchical RL decomposes complex tasks into manageable subproblems through temporal abstraction and modularity: - Options provides semi-Markov decision framework, supporting temporally extended behavioral primitives - MAXQ decomposes value functions, enabling parallel learning of subtasks - Feudal RL implements flexible hierarchical control with continuous subgoals
Meta-reinforcement learning enables agents to "learn to learn," rapidly adapting to new tasks: - MAML learns "good initialization," achieving rapid gradient adaptation through second-order optimization - RL² uses RNN memory to encode the learning process, implicitly achieving task adaptation - Directed-MAML accelerates MAML with task-directed approximation, maintaining performance while reducing computation
In the future, hierarchical and meta-learning will deeply integrate — meta-learning hierarchical structures across tasks, using language to guide subgoal generation, combining large-scale pretraining to achieve general agents. From robotics to games, from recommendations to autonomous driving, hierarchical and meta-learning are reshaping RL's application boundaries.
- Post title:Reinforcement Learning (11): Hierarchical Reinforcement Learning and Meta-Learning
- Post author:Chen Kai
- Create time:2024-09-27 09:00:00
- Post link:https://www.chenk.top/reinforcement-learning-11-hierarchical-and-meta-rl/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless otherwise stated.