In previous chapters, we learned various reinforcement learning algorithms — from Q-Learning to PPO — all relying on an explicit reward function to guide learning. However, in many real-world scenarios, designing an appropriate reward function is extremely difficult:
- Autonomous driving: What constitutes "good" driving behavior? Safety first? Comfort priority? Maximum efficiency? How do we balance these goals? How do we quantify "driving like an experienced driver" with a single number?
- Robot manipulation: How do we write a reward function for teaching a robot to fold clothes, cook, or tidy a room? The final state is easy to define, but how much reward should each intermediate step receive?
- Game AI: Making an AI learn human player styles, not just maximize scores. Some players prefer aggressive play, others prefer defensive strategies — how do we make AI imitate specific styles?
- Dialogue systems: What makes a "good" conversation? Interesting? Helpful? Polite? How do we balance these objectives?
Imitation Learning provides a different path: instead of laboriously designing reward functions, learn directly from expert demonstrations. This is a very natural way of learning — humans learn this way too. Infants learn to walk and talk by imitating their parents, apprentices learn crafts by observing masters, and students learn math by imitating their teachers' problem-solving methods.
This chapter systematically introduces core imitation learning methods: from the simplest Behavioral Cloning to distribution shift-solving DAgger, from reward-recovering Inverse Reinforcement Learning to end-to-end adversarial GAIL. We'll dive deep into each method's principles, pros and cons, applicable scenarios, and implementation details.
Imitation Learning Problem Setting
From Expert Demonstrations to Policy
Suppose we have an expert (a human or another agent) who performs excellently on some task. We observe the expert's behavior and collect a demonstration dataset:

$$\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}, \quad \tau_i = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$$

Imitation learning's goal is: learn a policy $\pi_\theta(a \mid s)$ whose behavior is as close as possible to the expert's.

Key points to note:
1. We don't know the expert's true policy $\pi^*$ — we only observe finite samples from it
2. We don't know the environment's reward function — the demonstrations are the only supervision signal
Differences from Reinforcement Learning
Let's compare imitation learning and reinforcement learning:
| Aspect | Reinforcement Learning | Imitation Learning |
|---|---|---|
| Supervision signal | Reward function | Expert demonstrations |
| Signal characteristics | Sparse, delayed, requires trial-and-error | Direct, immediate, readily available |
| Interaction requirement | Must interact extensively with environment | Can learn completely offline |
| Goal | Maximize cumulative reward | Imitate expert behavior |
| Optimization | Trial-and-error (may need millions of interactions) | Similar to supervised learning (usually needs less data) |
| Exploration | Needs explicit exploration strategy | No exploration needed (expert already did) |
| Safety | Exploration may be risky | Relatively safe (imitating expert) |
Applicable scenarios for each method:
- Reinforcement learning better when:
- Clear reward function available
- Safe extensive trial-and-error possible
- Want to exceed human level
- Imitation learning better when:
- Reward function hard to define
- High-quality expert demonstrations available
- Want to replicate expert style
- High safety requirements
Main Imitation Learning Methods
Imitation learning methods can be categorized as:
- Behavioral Cloning (BC)
- Simplest, most direct method
- Treats imitation learning as supervised learning
- Problem: distribution shift
- Interactive Imitation Learning (Interactive IL)
- Representative method: DAgger
- Allows querying expert during learning
- Solves distribution shift problem
- Inverse Reinforcement Learning (Inverse RL)
- Recovers reward function from demonstrations
- Then optimizes with standard RL
- Deeper understanding of expert's objective
- Adversarial Imitation Learning (Adversarial IL)
- Representative method: GAIL
- Uses adversarial training to match expert distribution
- End-to-end learning, no explicit reward needed
Behavioral Cloning
Basic Idea
Behavioral cloning is the most direct, simplest imitation learning method. Its core idea is:
Treat expert $(s, a^*)$ pairs as supervised learning training data, and learn a mapping from state to action.

Formally, we minimize the difference between the expert action and the predicted action:

$$\min_\theta \; \mathbb{E}_{(s, a^*) \sim \mathcal{D}} \left[ \mathcal{L}\big(\pi_\theta(s), a^*\big) \right]$$

Loss function choices:
1. Cross-entropy (discrete action spaces): $\mathcal{L} = -\log \pi_\theta(a^* \mid s)$
2. Mean squared error (deterministic continuous policy): $\mathcal{L} = \|\pi_\theta(s) - a^*\|^2$
3. Negative log-likelihood (Gaussian policy): $\mathcal{L} = -\log \mathcal{N}\big(a^*; \mu_\theta(s), \sigma_\theta^2(s)\big)$
4. Mixture density network (multimodal action distributions): model $\pi_\theta(a \mid s)$ as a mixture of Gaussians and minimize its negative log-likelihood
Detailed Implementation
A minimal behavioral cloning trainer in PyTorch can be sketched as follows (network sizes and the `expert_states` / `expert_actions` tensors are illustrative):

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """MLP policy trained with behavioral cloning (continuous actions, MSE loss)."""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state):
        return self.net(state)

def train_bc(policy, expert_states, expert_actions, epochs=100, lr=1e-3):
    """Plain supervised training on expert (state, action) pairs."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(policy(expert_states), expert_actions)
        loss.backward()
        optimizer.step()
    return policy
```
The Distribution Shift Problem
Behavioral cloning appears simple and elegant, but has a serious problem — distribution shift. Let's analyze this in detail.

The essence of the problem:

During training, we train the model on the expert's state distribution $p_{\pi^*}(s)$. But at test time, the policy acts in the environment and encounters states drawn from its own state distribution $p_{\pi_\theta}(s)$.

Key problem: once the learned policy makes even a small mistake, it drifts into states the expert never visited — states outside the training distribution — where its predictions are unreliable, so errors compound.
A concrete example: Autonomous driving
Suppose we train an autonomous driving model with behavioral cloning. The expert (human driver) always keeps the car in the center of the lane, so training data states are all "car in lane center."
The model learns how to drive in this state, but it's imperfect — sometimes slightly left or right. Once the car slightly deviates from center:
- This is a new state the model has never seen
- The model may continue driving in the wrong direction
- The car deviates more and more, and eventually goes off the road
Mathematical Analysis: Error Accumulation
Suppose at each timestep, the learned policy makes a mistake with probability $\epsilon$ on states from the expert's distribution.

Let $T$ be the task horizon. Once a mistake happens, the policy enters states it was never trained on, where its error rate can no longer be bounded — in the worst case, a single early mistake costs errors for all remaining steps.

More rigorous analysis (Ross et al., 2011) shows the expected total error is:

$$\mathbb{E}[\text{total error}] = O(\epsilon T^2)$$

Quadratic growth! This means:
- If the task needs 100 steps, the total error is amplified by a factor of roughly 100 relative to the per-step error rate
- Doubling the horizon quadruples the worst-case total error
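The compounding can be seen in a toy simulation of the pessimistic model above: the policy errs with probability $\epsilon$ while on-distribution, and after its first mistake it is off-distribution and errs every remaining step. (This simulation is illustrative, not from the original analysis.)

```python
import random

def simulate_bc_error(T, eps, n_episodes=20000, seed=0):
    """Monte Carlo estimate of expected total error under the
    worst-case compounding model: on-distribution the policy errs
    with probability eps; after the first error it is off-distribution
    and (pessimistically) errs at every remaining step."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        cost = 0
        off_distribution = False
        for _ in range(T):
            if off_distribution:
                cost += 1                 # worst case: every step is an error
            elif rng.random() < eps:
                cost += 1                 # first mistake: fall off-distribution
                off_distribution = True
        total += cost
    return total / n_episodes
```

Doubling the horizon roughly quadruples the expected cost, matching the $O(\epsilon T^2)$ bound, while an i.i.d. supervised learner would only pay $\epsilon T$.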
Mitigating Distribution Shift
Before introducing DAgger, let's look at some simple mitigation methods:
1. Data Augmentation
Add noise to states, simulating non-expert states policy might visit:
```python
import numpy as np

def augment_data(states, actions, noise_std=0.01, n_copies=5):
    """Augment data by adding Gaussian noise to states."""
    augmented_states = []
    augmented_actions = []
    for s, a in zip(states, actions):
        # Original data
        augmented_states.append(s)
        augmented_actions.append(a)
        # Noisy copies
        for _ in range(n_copies):
            noisy_s = s + np.random.normal(0, noise_std, s.shape)
            augmented_states.append(noisy_s)
            augmented_actions.append(a)  # Action unchanged
    return np.array(augmented_states), np.array(augmented_actions)
```
2. Expert Noise Injection
During data collection, have the expert intentionally make small mistakes, then demonstrate how to recover:

```python
import numpy as np

def collect_data_with_noise(expert, env, n_episodes, noise_prob=0.1):
    """Collect data with recovery demonstrations."""
    data = []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            if np.random.random() < noise_prob:
                # Inject a random action to push the state off-distribution
                action = env.action_space.sample()
            else:
                # Expert action
                action = expert.get_action(state)
            next_state, reward, done, _ = env.step(action)
            # Record what the expert would do in this state
            # (even if a random action was actually executed)
            expert_action = expert.get_action(state)
            data.append((state, expert_action))
            state = next_state
    return data
```
3. Regularization and Ensembles
- Use Dropout, L2 regularization to prevent overfitting
- Train multiple models, average or vote
But these methods cannot fundamentally solve distribution shift. The real solution requires obtaining new expert labels during learning.
DAgger: Dataset Aggregation
Core Idea
DAgger's (Dataset Aggregation) core idea is simple:
During learning, use current policy to interact with environment, collect new states, then query expert for correct actions in these states.
This way, even if policy makes mistakes and enters new states, we can get expert's correct actions in these states.
Algorithm Flow:
1. Collect an initial dataset $\mathcal{D}$ with the expert policy $\pi^*$, and train an initial policy $\hat{\pi}_1$ on it
2. For $i = 1, 2, \ldots, N$:
   - Run the current policy $\hat{\pi}_i$ in the environment to collect visited states $\{s_1, s_2, \ldots\}$
   - For each state, query the expert action $a^* = \pi^*(s)$
   - Add the new data to the dataset: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a^*)\}$
   - Retrain the policy $\hat{\pi}_{i+1}$ on $\mathcal{D}$

Key Insight:
DAgger breaks the vicious cycle of distribution shift: - Policy makes mistake → enters new state → gets expert label → learns correct action in new state
Theoretical Guarantee
DAgger has rigorous theoretical guarantees. Let $\epsilon$ be the error rate of the learned policy on its own induced state distribution, and $T$ the task horizon.

Theorem (Ross et al., 2011): With enough iterations, DAgger finds a policy $\hat{\pi}$ with

$$J(\hat{\pi}) \le J(\pi^*) + O(\epsilon T)$$

compared with behavioral cloning's $O(\epsilon T^2)$.

Intuitive understanding:
- Behavioral cloning's error accumulates and amplifies along the trajectory
- DAgger, by covering all states the policy may actually visit, transforms the problem into "probability of error at each state"
- Each moment independently has probability $\epsilon$ of error, so the total error grows only linearly in $T$
Detailed Implementation
The overall loop can be sketched as follows (the `policy.fit()` / `get_action()` interfaces are illustrative):

```python
import numpy as np

class DAgger:
    """Dataset Aggregation: iteratively collect states with the current
    policy, label them with the expert, and retrain."""
    def __init__(self, policy, expert, env):
        self.policy = policy    # learnable policy with fit() / get_action()
        self.expert = expert    # expert with get_action()
        self.env = env
        self.states, self.actions = [], []

    def collect_episode(self, use_expert=False):
        state = self.env.reset()
        done = False
        while not done:
            # The expert labels every visited state...
            expert_action = self.expert.get_action(state)
            self.states.append(state)
            self.actions.append(expert_action)
            # ...but the CURRENT policy chooses what to execute, so we
            # visit exactly the states the learner actually induces
            actor = self.expert if use_expert else self.policy
            state, _, done, _ = self.env.step(actor.get_action(state))

    def train(self, n_iterations=10, episodes_per_iter=5):
        self.collect_episode(use_expert=True)        # initial expert data
        for _ in range(n_iterations):
            self.policy.fit(np.array(self.states), np.array(self.actions))
            for _ in range(episodes_per_iter):
                self.collect_episode(use_expert=False)
        return self.policy
```
DAgger Variants
1. SafeDAgger
In some applications (like autonomous driving), letting learner fully control may be dangerous. SafeDAgger uses "guardrail" mechanism:
```python
import numpy as np

def safe_dagger_step(state, learner, expert, safety_threshold):
    """Let the learner act only when it stays close to the expert;
    otherwise the expert takes over (the 'guardrail')."""
    learner_action = learner.get_action(state)
    expert_action = expert.get_action(state)
    # Deviation between learner and expert actions
    deviation = np.linalg.norm(learner_action - expert_action)
    if deviation < safety_threshold:
        return learner_action, False   # learner in control
    else:
        return expert_action, True     # expert intervenes
```
2. Sample-Efficient DAgger
Not every state needs expert labeling. We can query selectively:
- Only query on "uncertain" states (use an ensemble of models to measure uncertainty)
- Only query on "important" states (use the advantage function or TD error to measure importance)
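The first criterion can be sketched with a simple ensemble-disagreement rule (the `predict` interface on ensemble members and the threshold value are assumptions for illustration):

```python
def should_query_expert(state, ensemble, disagreement_threshold=0.1):
    """Query the expert only when ensemble members disagree on the action.

    Disagreement is measured as the per-dimension standard deviation of
    the members' predicted action vectors, averaged over dimensions."""
    predictions = [m.predict(state) for m in ensemble]   # list of action vectors
    n, dim = len(predictions), len(predictions[0])
    disagreement = 0.0
    for d in range(dim):
        vals = [p[d] for p in predictions]
        mean = sum(vals) / n
        disagreement += (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
    return disagreement / dim > disagreement_threshold
```

When the members agree, the learner is likely on-distribution and no label is needed; disagreement flags exactly the novel states DAgger cares about.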
3. HG-DAgger (Human-Gated DAgger)
Let human expert decide when to intervene:
```python
def hgdagger_episode(env, learner, human_expert):
    """Run one episode where the human decides when to intervene;
    only human-labeled states are added to the dataset."""
    data = []
    state = env.reset()
    done = False
    while not done:
        if human_expert.wants_to_intervene(state):
            # Human takes over and provides the label
            action = human_expert.get_action(state)
            data.append((state, action))
        else:
            # Learner stays in control; no label is collected
            action = learner.get_action(state)
        state, _, done, _ = env.step(action)
    return data
```
DAgger Limitations
1. Requires an interactive expert: in many scenarios we cannot query the expert at will (the expert may only exist as historical data, or may no longer be available)
2. Heavy expert burden: the expert must label many states, which can be very time-consuming
3. Expert must be reliable: DAgger assumes the expert always gives correct answers, but human experts also make mistakes or are inconsistent
4. Safety: the learner may visit dangerous states during training
When we cannot use interactive expert, we need other methods — inverse reinforcement learning and GAIL.
Inverse Reinforcement Learning (IRL)
Problem Setting and Motivation
The methods above (BC and DAgger) are direct state-to-action mappings. But there's a deeper question: why does the expert act this way?
If we can understand the expert's objective (reward function), we can:
1. Generalize to situations the expert hasn't demonstrated
2. Understand the "intent" behind the expert's behavior
3. Apply the same objective in different environments
Inverse Reinforcement Learning (IRL) takes this approach:
Infer reward function from expert demonstrations, then use standard RL methods to optimize this reward.
Formally, given expert demonstrations $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$, IRL seeks a reward function $r_\psi$ under which the expert is (near-)optimal:

$$\mathbb{E}_{\pi^*}\left[\sum_t r_\psi(s_t, a_t)\right] \ge \mathbb{E}_{\pi}\left[\sum_t r_\psi(s_t, a_t)\right] \quad \forall \pi$$
Reward Ambiguity Problem
A fundamental challenge in IRL is reward ambiguity: given demonstrations, there may be infinitely many consistent reward functions!
Example: Consider the most extreme case — the reward function that is identically zero:

$$r(s, a) \equiv 0$$

Under this reward, every policy is optimal — including the expert's — so it trivially "explains" any demonstration.
More generally, any reward function that can be optimized by expert policy is valid. We need some regularization or assumption to select "good" reward functions.
Maximum Entropy Inverse Reinforcement Learning
Maximum Entropy IRL (Ziebart et al., 2008) solves this problem with an elegant assumption:
Expert policy, among all "equally good" choices, prefers the one with maximum entropy.
In other words, expert won't arbitrarily prefer certain actions — if two actions are equally good, expert will randomly choose.
This leads to the expert policy inducing a trajectory distribution of the form:

$$p(\tau) = \frac{1}{Z} \exp\big(R_\psi(\tau)\big), \quad R_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$$
Intuition: High-reward trajectories are exponentially preferred, but not deterministically choosing optimal trajectory. This is a kind of "soft optimal."
Objective Function
Maximum Entropy IRL's objective is maximizing the likelihood of the expert trajectories:

$$\max_\psi \; \sum_{\tau \in \mathcal{D}} \log p_\psi(\tau), \quad p_\psi(\tau) = \frac{1}{Z(\psi)} \exp\big(R_\psi(\tau)\big)$$

where $Z(\psi) = \sum_\tau \exp\big(R_\psi(\tau)\big)$ is the partition function. Thus the objective becomes:

$$\mathcal{L}(\psi) = \sum_{\tau \in \mathcal{D}} R_\psi(\tau) - N \log Z(\psi)$$
Gradient Computation
Taking the gradient with respect to $\psi$, the partition-function term becomes an expectation under the current trajectory distribution. The final gradient (normalized per demonstration) is:

$$\nabla_\psi \mathcal{L} \propto \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \nabla_\psi R_\psi(\tau) \right] - \mathbb{E}_{\tau \sim p_\psi} \left[ \nabla_\psi R_\psi(\tau) \right]$$

Intuitive explanation:
- First term: increase the reward of expert trajectories
- Second term: decrease the reward of trajectories sampled from the current policy
- At convergence both terms are equal, meaning the current policy's trajectory distribution matches the expert's
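For a linear reward $r_\psi(s) = \psi^\top \phi(s)$, this gradient reduces to the difference between expert and policy feature expectations. A tiny tabular sketch (the feature map and trajectories below are made up for illustration):

```python
def feature_expectation(trajectories, phi):
    """Average total feature vector over a set of state trajectories."""
    dim = len(phi(trajectories[0][0]))
    total = [0.0] * dim
    for traj in trajectories:
        for s in traj:
            f = phi(s)
            for d in range(dim):
                total[d] += f[d]
    return [t / len(trajectories) for t in total]

def maxent_gradient(expert_trajs, policy_trajs, phi):
    """MaxEnt IRL gradient for a linear reward r = psi . phi(s):
    expert feature expectation minus policy feature expectation."""
    mu_expert = feature_expectation(expert_trajs, phi)
    mu_policy = feature_expectation(policy_trajs, phi)
    return [e - p for e, p in zip(mu_expert, mu_policy)]
```

The gradient is positive on features the expert visits more than the current policy, pushing their reward weights up, and negative on features the policy over-visits.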
Detailed Implementation
A deep MaxEnt IRL update can be sketched as follows (the trajectory-sampling subroutine for the current policy — the inner RL loop — is assumed and not shown):

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Neural-network reward r_psi(s)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, states):            # states: (T, state_dim)
        return self.net(states).sum()     # trajectory return R_psi(tau)

def maxent_irl_step(reward_net, optimizer, expert_trajs, policy_trajs):
    """One gradient step: raise expert returns, lower policy-sample returns."""
    expert_return = torch.stack([reward_net(t) for t in expert_trajs]).mean()
    policy_return = torch.stack([reward_net(t) for t in policy_trajs]).mean()
    loss = -(expert_return - policy_return)   # ascend the likelihood gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
IRL Challenges and Extensions
1. Computational Complexity
After each reward update, need to retrain policy (inner loop). This makes IRL much slower than direct imitation learning.
2. Reward Shaping
Learned reward function may not be the "true" reward, just one function that can explain expert behavior.
3. Deep IRL
Modern methods parameterize reward function with neural networks, can handle high-dimensional states. Representative methods include: - Deep MaxEnt IRL - Guided Cost Learning - AIRL (Adversarial IRL)
GAIL: Generative Adversarial Imitation Learning
Core Idea
GAIL (Generative Adversarial Imitation Learning) combines imitation learning with GANs, providing an end-to-end solution.
Core idea:
Train a discriminator to distinguish expert trajectories from policy-generated trajectories, while training policy to "fool" discriminator.
This is exactly like GANs:
- Generator = policy $\pi_\theta$, which generates trajectories
- Discriminator = classifier $D(s, a)$, which distinguishes expert data from policy data
- "Real" data = expert trajectories; "fake" data = policy trajectories
When policy successfully fools discriminator, its behavior becomes indistinguishable from expert — exactly the goal of imitation learning!
Mathematical Formulation
GAIL optimizes the following objective:

$$\min_\pi \max_D \; \mathbb{E}_{\pi_E}\big[\log D(s, a)\big] + \mathbb{E}_{\pi}\big[\log\big(1 - D(s, a)\big)\big] - \lambda H(\pi)$$

where $D(s, a)$ outputs the probability that $(s, a)$ comes from the expert, and $H(\pi)$ is a policy-entropy regularizer.

Discriminator optimization:

For a fixed $\pi$, the optimal discriminator is

$$D^*(s, a) = \frac{\rho_{\pi_E}(s, a)}{\rho_{\pi_E}(s, a) + \rho_{\pi}(s, a)}$$

where $\rho$ denotes the occupancy measure.

Policy optimization:

The policy wants to minimize $\mathbb{E}_{\pi}[\log(1 - D(s, a))]$ — that is, to make its state-action pairs look like expert data to the discriminator.

Key insight: the discriminator output can serve as a reward signal!

$$r(s, a) = -\log\big(1 - D(s, a)\big)$$

In practice, for numerical stability, $D$ is typically clamped away from 0 and 1 before taking the log, and the policy is updated with a standard RL algorithm such as TRPO or PPO.
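The reward transformation is small enough to write out directly — a sketch, where the clamping constant is an illustrative choice:

```python
import math

def gail_reward(d_expert_prob, eps=1e-8):
    """Turn a discriminator output D(s, a) = P((s, a) is expert) into a reward.

    r = -log(1 - D) is unbounded above as D -> 1, so clamp D away from
    the endpoints for numerical stability."""
    d = min(max(d_expert_prob, eps), 1.0 - eps)
    return -math.log(1.0 - d)
```

The reward grows as the discriminator becomes more convinced a pair is expert-like, and stays finite even when $D$ saturates at 0 or 1.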
Relationship to IRL
GAIL can be viewed as implicit IRL:
- Traditional IRL: explicitly learns a reward function $r_\psi(s, a)$, then solves an RL problem under it
- GAIL: the discriminator implicitly defines the reward function, jointly optimized with the policy
Ho & Ermon (2016) proved GAIL is equivalent to Maximum Entropy IRL in terms of occupancy measure matching.
Detailed Implementation
A GAIL discriminator update can be sketched as follows (the policy's RL update, e.g. PPO on the returned rewards, is assumed and not shown):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a): probability that a state-action pair comes from the expert."""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1))

def discriminator_step(disc, optimizer, expert_batch, policy_batch):
    """Train D to label expert pairs 1 and policy pairs 0,
    then compute GAIL rewards for the policy batch."""
    exp_s, exp_a = expert_batch
    pol_s, pol_a = policy_batch
    bce = nn.BCELoss()
    loss = bce(disc(exp_s, exp_a), torch.ones(len(exp_s), 1)) + \
           bce(disc(pol_s, pol_a), torch.zeros(len(pol_s), 1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Reward for the policy's RL update: r = -log(1 - D)
    with torch.no_grad():
        rewards = -torch.log(1.0 - disc(pol_s, pol_a) + 1e-8)
    return rewards
```
GAIL Advantages and Limitations
Advantages:
- No explicit reward needed: Discriminator implicitly learns reward structure
- End-to-end training: Policy and "reward" jointly optimized, no two-stage
- Sample efficient: More efficient than MaxEnt IRL
- Handles high-dimensional: Can handle high-dimensional states and continuous actions
- Theoretical guarantee: Has guarantees in occupancy measure matching sense
Limitations:
- Requires environment interaction: Cannot learn purely offline
- Training unstable: GAN training itself is unstable
- Mode collapse: May only learn part of expert behavior
- Hard to interpret: No explicit reward function
GAIL Variants
1. AIRL (Adversarial Inverse Reinforcement Learning)
AIRL modifies the discriminator structure so that it can recover an explicit reward function:

$$D(s, a) = \frac{\exp\big(f_\psi(s, a)\big)}{\exp\big(f_\psi(s, a)\big) + \pi(a \mid s)}$$

At optimality, $f_\psi$ recovers the (entropy-regularized) advantage, from which a reward can be extracted.

2. VAIL (Variational Adversarial Imitation Learning)

VAIL uses a variational information bottleneck on the discriminator to improve training stability.
3. SAM (State-only Adversarial Mimicking)
When actions are unobservable (like learning from video), SAM only uses state matching.
Method Comparison and Selection Guide
Comprehensive Comparison
| Method | Interactive Expert | Env Interaction | Sample Efficiency | Implementation Complexity | Theoretical Guarantee | Interpretability |
|---|---|---|---|---|---|---|
| BC | Not needed | Not needed | High (but with drift) | Low | Weak | Medium |
| DAgger | Needed | Needed | Medium-High | Low | Strong | Medium |
| MaxEnt IRL | Not needed | Needed | Low | High | Strong | High |
| GAIL | Not needed | Needed | Medium | Medium | Medium | Low |
Selection Guide
Choose Behavioral Cloning when:
- You have lots of expert data
- The task is relatively simple (short time horizon)
- Compute resources are limited
- You cannot interact with the environment

Choose DAgger when:
- You can query the expert at any time
- The expert's labeling cost is not high
- You need to handle long time-horizon tasks
- You need a theoretical guarantee

Choose Inverse RL when:
- You need to understand the expert's objective
- You need to generalize across different environments
- You need an interpretable reward function
- You have sufficient compute resources

Choose GAIL when:
- You cannot query the expert
- You need high-quality imitation
- You can interact with the environment
- The state/action spaces are large
Advanced Topics
Multimodal Expert Behavior
Expert may take different actions in same state. For example, avoiding obstacle can turn left or right.
Standard BC learns "average" behavior — may crash directly into obstacle!
Solutions:
- Mixture Density Networks (MDN)
A mixture density network policy can be sketched as follows (the number of modes and network sizes are illustrative):

```python
import torch
import torch.nn as nn

class MDNPolicy(nn.Module):
    """Mixture density network: predicts K Gaussian modes per state,
    so multimodal expert behavior is not averaged away."""
    def __init__(self, state_dim, action_dim, n_modes=5, hidden_dim=128):
        super().__init__()
        self.n_modes, self.action_dim = n_modes, action_dim
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.pi_head = nn.Linear(hidden_dim, n_modes)                # mixture weights
        self.mu_head = nn.Linear(hidden_dim, n_modes * action_dim)   # means
        self.log_sigma_head = nn.Linear(hidden_dim, n_modes * action_dim)

    def forward(self, state):
        h = self.trunk(state)
        log_pi = torch.log_softmax(self.pi_head(h), dim=-1)
        mu = self.mu_head(h).view(-1, self.n_modes, self.action_dim)
        sigma = self.log_sigma_head(h).view(-1, self.n_modes, self.action_dim).exp()
        return log_pi, mu, sigma

    def loss(self, state, expert_action):
        """Negative log-likelihood of the expert action under the mixture."""
        log_pi, mu, sigma = self.forward(state)
        dist = torch.distributions.Normal(mu, sigma)
        # Sum log-probs over action dims, then log-sum-exp over modes
        log_prob = dist.log_prob(expert_action.unsqueeze(1)).sum(dim=-1)
        return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```
- Conditional VAE (CVAE)
Learn latent behavior modes
- Info-GAIL
Add latent variables in GAIL, learn different behavior modes.
Learning from Suboptimal Demonstrations
In reality, expert demonstrations often aren't optimal. How to handle?
1. Weighted Behavioral Cloning
Give higher-quality demonstrations higher weights:
```python
def weighted_bc_loss(predictions, expert_actions, quality_scores):
    """MSE behavioral cloning loss, weighted by per-demonstration quality.

    quality_scores: one non-negative scalar per sample (higher = better demo)."""
    per_sample = ((predictions - expert_actions) ** 2).mean(dim=-1)
    weights = quality_scores / quality_scores.sum()
    return (weights * per_sample).sum()
```
2. Learning to Rank

Instead of learning absolutely good actions, learn which trajectory is better than which. Given a preference $\tau_i \succ \tau_j$ and a learned score $R_\psi$, a Bradley-Terry style loss is:

$$\mathcal{L} = -\log \frac{\exp\big(R_\psi(\tau_i)\big)}{\exp\big(R_\psi(\tau_i)\big) + \exp\big(R_\psi(\tau_j)\big)}$$
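This preference loss is easy to write in plain Python, assuming trajectory scores are precomputed scalars:

```python
import math

def preference_loss(score_better, score_worse):
    """Bradley-Terry loss: -log P(tau_i preferred over tau_j),
    given learned scores for the two trajectories."""
    # log-sum-exp for numerical stability
    m = max(score_better, score_worse)
    log_z = m + math.log(math.exp(score_better - m) + math.exp(score_worse - m))
    return -(score_better - log_z)
```

The loss is $\log 2$ when the two scores tie, and shrinks toward zero as the preferred trajectory's score pulls ahead — so minimizing it pushes the score function to rank trajectories as the preferences say.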
3. Self-Improvement
First imitate, then improve with RL:
A high-level sketch — `rl_finetune`, `collect_top_trajectories`, and `behavioral_clone` are placeholder names for the corresponding subroutines, not library functions:

```python
def iterative_improvement(bc_agent, env, n_rounds=5):
    """First imitate, then improve with RL, alternating."""
    agent = bc_agent
    for _ in range(n_rounds):
        # Fine-tune with RL, starting from the imitation policy
        agent = rl_finetune(agent, env)
        # Distill the agent's best rollouts back via behavioral cloning
        best_trajs = collect_top_trajectories(agent, env)
        agent = behavioral_clone(agent, best_trajs)
    return agent
```
Cross-Domain Imitation Learning
When expert and learner have different state/action spaces:
1. Third-Person Imitation Learning
Learn from video (third-person view), but execute from first-person view.
2. Cross-Morphology Transfer
Robot A demonstrates, Robot B imitates (different body structures).
3. Domain Adaptation
Align state representations from different domains.
Practical Advice
Data Collection
- Data quality more important than quantity
- Ensure covering diverse scenarios
- Record expert's "recovery" behavior (correcting from mistakes)
- Avoid obvious errors in demonstrations
Training Tips
- State normalization: Standardize input states
- Data augmentation: Add noise, crop, rotate, etc.
- Regularization: Dropout, L2 regularization to prevent overfitting
- Early stopping: Monitor validation set performance
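The state-normalization tip above can be implemented with a running mean/variance tracker (Welford's algorithm); a minimal pure-Python sketch:

```python
class RunningNormalizer:
    """Track running mean and variance of states (Welford's algorithm)
    and normalize inputs to roughly zero mean, unit variance."""
    def __init__(self, dim, eps=1e-8):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim
        self.eps = eps

    def update(self, state):
        self.n += 1
        for d, x in enumerate(state):
            delta = x - self.mean[d]
            self.mean[d] += delta / self.n
            self.m2[d] += delta * (x - self.mean[d])

    def normalize(self, state):
        var = [m2 / max(self.n - 1, 1) for m2 in self.m2]
        return [(x - mu) / (v + self.eps) ** 0.5
                for x, mu, v in zip(state, self.mean, var)]
```

Update the statistics only on training data, and reuse the frozen statistics at deployment so train and test inputs are scaled identically.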
Evaluation Methods
- Trajectory similarity: Distance to expert trajectories
- Task success rate: Ratio of completed tasks
- Cumulative reward: Total reward obtained in environment
- Human evaluation: Let humans judge behavior quality
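Trajectory similarity, the first metric above, can be as simple as a mean per-step distance — a sketch assuming equal-length, time-aligned state trajectories (DTW or the like is needed when lengths differ):

```python
def trajectory_distance(traj_a, traj_b):
    """Mean per-step Euclidean distance between two equal-length
    state trajectories (lists of state vectors)."""
    assert len(traj_a) == len(traj_b), "trajectories must be time-aligned"
    total = 0.0
    for sa, sb in zip(traj_a, traj_b):
        total += sum((x - y) ** 2 for x, y in zip(sa, sb)) ** 0.5
    return total / len(traj_a)
```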
Summary
Imitation learning provides a learning paradigm that doesn't rely on explicit reward functions, learning policy by observing expert demonstrations:
Behavioral Cloning is simple and direct, but suffers from distribution shift, suitable for short-horizon tasks with lots of data
DAgger mitigates drift through interactive learning, requires queryable expert, has theoretical guarantees
Inverse RL recovers reward function, provides interpretability and generalization, but computationally expensive
GAIL uses adversarial training for end-to-end imitation, currently most popular method, balances performance and implementation complexity
These methods each have pros and cons, choice depends on specific application: whether there's interactive expert, whether environment interaction needed, interpretability requirements, compute resources, etc.
Imitation learning has broad applications in robotics, autonomous driving, game AI, dialogue systems. It complements reinforcement learning — when reward functions are hard to define, imitation learning provides another path; when need to exceed expert, reinforcement learning is more suitable. Combining both (e.g., initialize with imitation learning, fine-tune with RL) often achieves best results.
In the next chapter, we'll study AlphaGo and Monte Carlo Tree Search — seeing how deep learning combined with classical planning methods achieves superhuman performance in complex games like Go.
References
- Pomerleau, D. A. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS.
- Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.
- Ziebart, B. D., et al. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI.
- Ho, J., & Ermon, S. (2016). Generative Adversarial Imitation Learning. NIPS.
- Fu, J., Luo, K., & Levine, S. (2018). Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. ICLR.
- Abbeel, P., & Ng, A. Y. (2004). Apprenticeship Learning via Inverse Reinforcement Learning. ICML.
- Finn, C., Levine, S., & Abbeel, P. (2016). Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. ICML.
Q&A: Frequently Asked Questions
Q1: What's the difference between imitation learning and supervised learning?
A: Main difference is data distribution. Supervised learning assumes training and test data come from same distribution (i.i.d. assumption), but in imitation learning, test-time state distribution depends on learned policy, different from training-time expert state distribution. This is the root of distribution shift problem.
Q2: How to tell if expert data is enough?
A: You can judge through:
- Learning curve: does adding more data still improve performance?
- Validation error: is it close to the training error (overfitting check)?
- State-space coverage: does the data cover the states the policy is likely to encounter?
Q3: GAIL training is unstable, what to do?
A: Try:
- Adjusting the relative update frequency of discriminator and policy (usually the discriminator updates more often)
- Gradient penalty or spectral normalization to stabilize the discriminator
- Lowering the learning rate
- Increasing the entropy regularization coefficient
- WGAN-style variants
Q4: When to use BC, when to use GAIL?
A: BC suitable for: simple tasks, lots of data, cannot interact with environment. GAIL suitable for: complex tasks, limited data, can interact with environment. If BC can solve it, prioritize BC (simpler, faster).
Q5: How to handle noise and errors in expert demonstrations?
A:
- Data cleaning: filter out obviously wrong demonstrations
- Weighted learning: give higher-quality demonstrations higher weights
- Learning to rank: learn relative preferences instead of absolute actions
- Robust loss functions: e.g., Huber loss is more robust to outliers
Q6: Can imitation learning exceed expert?
A: Pure imitation learning theoretically cannot exceed the expert (the goal is to copy the expert). But you can:
- Initialize with imitation learning, then fine-tune with RL
- Aggregate the strengths of multiple experts
- Self-improve in states where the expert performs poorly
Q7: How to handle multiple experts?
A:
- If the expert policies are similar: directly mix the data
- If expert styles differ: learn a multimodal policy (MDN, CVAE)
- If expert skill levels differ: weight the data, or use only the best expert
Q8: How to evaluate imitation learning effectiveness?
A: Evaluate from multiple angles:
- Cumulative reward (if a reward function is available)
- Similarity to expert trajectories
- Task success rate
- Human subjective evaluation
- Robustness on out-of-distribution states
- Post title:Reinforcement Learning (7): Imitation Learning and Inverse Reinforcement Learning
- Post author:Chen Kai
- Create time:2024-09-06 10:15:00
- Post link:https://www.chenk.top/reinforcement-learning-7-imitation-learning/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.