Reinforcement Learning (7): Imitation Learning and Inverse Reinforcement Learning
Chen Kai BOSS

In previous chapters, we learned various reinforcement learning algorithms — from Q-Learning to PPO — all relying on an explicit reward function to guide learning. However, in many real-world scenarios, designing an appropriate reward function is extremely difficult:

  • Autonomous driving: What constitutes "good" driving behavior? Safety first? Comfort priority? Maximum efficiency? How do we balance these goals? How do we quantify "driving like an experienced driver" with a single number?
  • Robot manipulation: How do we write a reward function for teaching a robot to fold clothes, cook, or tidy a room? The final state is easy to define, but how much reward should each intermediate step receive?
  • Game AI: Making an AI learn human player styles, not just maximize scores. Some players prefer aggressive play, others prefer defensive strategies — how do we make AI imitate specific styles?
  • Dialogue systems: What makes a "good" conversation? Interesting? Helpful? Polite? How do we balance these objectives?

Imitation Learning provides a different path: instead of laboriously designing reward functions, learn directly from expert demonstrations. This is a very natural way of learning — humans learn this way too. Infants learn to walk and talk by imitating their parents, apprentices learn crafts by observing masters, and students learn math by imitating teachers' problem-solving methods.

This chapter systematically introduces the core imitation learning methods: from the simplest Behavioral Cloning to DAgger, which solves distribution shift; from Inverse Reinforcement Learning, which recovers a reward function, to end-to-end adversarial GAIL. We'll dive deep into each method's principles, pros and cons, applicable scenarios, and implementation details.

Imitation Learning Problem Setting

From Expert Demonstrations to Policy

Suppose we have an expert (can be human or another agent) who performs excellently on some task. We observe the expert's behavior and collect a demonstration dataset:

$$\mathcal{D} = \{(s_i, a_i^*)\}_{i=1}^{N}$$

where $(s_i, a_i^*)$ represents the expert's action $a_i^*$ taken in state $s_i$. This data may come from:

  • Human operator recordings (e.g., driving videos)
  • Teleoperation-collected data (e.g., controlling a robot with a joystick)
  • Another trained AI's demonstrations
  • An expert's historical decision records (e.g., a doctor's diagnoses)

Imitation learning's goal is: Learn a policy $\pi_\theta$ that behaves as closely as possible to the expert policy $\pi^*$.

Key points to note:

  1. We don't know what the expert's true policy $\pi^*$ is; we can only observe its behavior
  2. We don't have a reward function, so we can't evaluate whether an action is good or bad
  3. We usually cannot interact with the expert in real time (the expert may be busy or expensive)

Differences from Reinforcement Learning

Let's compare imitation learning and reinforcement learning:

| Aspect | Reinforcement Learning | Imitation Learning |
|---|---|---|
| Supervision signal | Reward function | Expert demonstrations |
| Signal characteristics | Sparse, delayed, requires trial-and-error | Direct, immediate, readily available |
| Interaction requirement | Must interact extensively with environment | Can learn completely offline |
| Goal | Maximize cumulative reward | Imitate expert behavior |
| Optimization | Trial-and-error (may need millions of interactions) | Similar to supervised learning (usually needs less data) |
| Exploration | Needs explicit exploration strategy | No exploration needed (expert already did) |
| Safety | Exploration may be risky | Relatively safe (imitating expert) |

Applicable scenarios for each method:

  • Reinforcement learning better when:
    • Clear reward function available
    • Safe extensive trial-and-error possible
    • Want to exceed human level
  • Imitation learning better when:
    • Reward function hard to define
    • High-quality expert demonstrations available
    • Want to replicate expert style
    • High safety requirements

Main Imitation Learning Methods

Imitation learning methods can be categorized as:

  1. Behavioral Cloning (BC)
    • Simplest, most direct method
    • Treats imitation learning as supervised learning
    • Problem: distribution shift
  2. Interactive Imitation Learning (Interactive IL)
    • Representative method: DAgger
    • Allows querying expert during learning
    • Solves distribution shift problem
  3. Inverse Reinforcement Learning (Inverse RL)
    • Recovers reward function from demonstrations
    • Then optimizes with standard RL
    • Deeper understanding of expert's objective
  4. Adversarial Imitation Learning (Adversarial IL)
    • Representative method: GAIL
    • Uses adversarial training to match expert distribution
    • End-to-end learning, no explicit reward needed

Behavioral Cloning

Basic Idea

Behavioral cloning is the most direct, simplest imitation learning method. Its core idea is:

Treat $(s_i, a_i^*)$ pairs as supervised learning training data, and learn a mapping from states to actions.

Formally, we minimize the difference between the expert action and the predicted action:

$$\theta^* = \arg\min_\theta \, \mathbb{E}_{(s, a^*) \sim \mathcal{D}} \left[ \mathcal{L}\big(\pi_\theta(s), a^*\big) \right]$$

Loss function choices:

For discrete action spaces, use cross-entropy loss:

$$\mathcal{L} = -\log \pi_\theta(a^* \mid s)$$

For continuous action spaces, there are multiple options:

  1. Mean squared error (deterministic policy): $\mathcal{L} = \|\pi_\theta(s) - a^*\|^2$
  2. Negative log-likelihood (Gaussian policy): $\mathcal{L} = -\log \mathcal{N}\big(a^*;\, \mu_\theta(s), \sigma_\theta^2(s)\big)$
  3. Mixture density network (multimodal distributions): $\mathcal{L} = -\log \sum_k w_k(s)\, \mathcal{N}\big(a^*;\, \mu_k(s), \sigma_k^2(s)\big)$
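Option 2 can be made concrete in a few lines. A minimal sketch of the Gaussian negative log-likelihood for a diagonal-Gaussian policy (the function name and interface are illustrative, not from the original):

```python
import numpy as np

def gaussian_nll(pred_mean, log_std, expert_action):
    """NLL of an expert action under a diagonal Gaussian policy
    with predicted mean `pred_mean` and log standard deviation `log_std`."""
    var = np.exp(2.0 * log_std)
    return 0.5 * np.sum(
        (expert_action - pred_mean) ** 2 / var
        + 2.0 * log_std
        + np.log(2.0 * np.pi)
    )
```

With a fixed standard deviation, minimizing this over the dataset reduces to mean squared error, which is why option 1 is a special case of option 2.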

Detailed Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader, TensorDataset


class BehavioralCloning:
    """
    Behavioral Cloning Agent

    Converts imitation learning to supervised learning,
    learning a policy from expert (state, action) pairs.
    """

    def __init__(self, state_dim, action_dim, hidden_dims=[256, 256],
                 lr=1e-3, continuous=False, dropout=0.1):
        """
        Initialize behavioral cloning model

        Args:
            state_dim: State space dimension
            action_dim: Action space dimension
            hidden_dims: Hidden layer dimensions list
            lr: Learning rate
            continuous: Whether action space is continuous
            dropout: Dropout rate (prevents overfitting)
        """
        self.continuous = continuous
        self.action_dim = action_dim

        # Build network
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_dim = hidden_dim

        if continuous:
            # Continuous actions: output mean (optionally variance)
            layers.append(nn.Linear(prev_dim, action_dim))
            layers.append(nn.Tanh())  # Assume action range [-1, 1]
            self.policy = nn.Sequential(*layers)
            self.criterion = nn.MSELoss()
        else:
            # Discrete actions: output action logits
            layers.append(nn.Linear(prev_dim, action_dim))
            self.policy = nn.Sequential(*layers)
            self.criterion = nn.CrossEntropyLoss()

        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)

        # Normalization statistics
        self.state_mean = None
        self.state_std = None

    def compute_normalization(self, states):
        """Compute state normalization parameters"""
        self.state_mean = np.mean(states, axis=0)
        self.state_std = np.std(states, axis=0) + 1e-8

    def normalize_state(self, state):
        """Normalize state"""
        if self.state_mean is not None:
            return (state - self.state_mean) / self.state_std
        return state

    def train(self, states, actions, epochs=100, batch_size=64,
              validation_split=0.1, early_stopping_patience=10):
        """
        Train behavioral cloning policy

        Args:
            states: State array [N, state_dim]
            actions: Action array [N] or [N, action_dim]
            epochs: Training epochs
            batch_size: Batch size
            validation_split: Validation set ratio
            early_stopping_patience: Early stopping patience
        """
        # Compute normalization parameters
        self.compute_normalization(states)
        states = self.normalize_state(states)

        # Split training and validation sets
        n_samples = len(states)
        n_val = int(n_samples * validation_split)
        indices = np.random.permutation(n_samples)
        train_idx, val_idx = indices[n_val:], indices[:n_val]

        train_states = torch.FloatTensor(states[train_idx])
        val_states = torch.FloatTensor(states[val_idx])

        if self.continuous:
            train_actions = torch.FloatTensor(actions[train_idx])
            val_actions = torch.FloatTensor(actions[val_idx])
        else:
            train_actions = torch.LongTensor(actions[train_idx])
            val_actions = torch.LongTensor(actions[val_idx])

        # Create data loader
        train_dataset = TensorDataset(train_states, train_actions)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        # Training loop
        best_val_loss = float('inf')
        patience_counter = 0
        best_state = None
        train_losses = []
        val_losses = []

        for epoch in range(epochs):
            # Training
            self.policy.train()
            epoch_loss = 0
            for batch_states, batch_actions in train_loader:
                pred = self.policy(batch_states)
                loss = self.criterion(pred, batch_actions)

                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                epoch_loss += loss.item() * len(batch_states)

            avg_train_loss = epoch_loss / len(train_idx)
            train_losses.append(avg_train_loss)

            # Validation
            self.policy.eval()
            with torch.no_grad():
                val_pred = self.policy(val_states)
                val_loss = self.criterion(val_pred, val_actions).item()
            val_losses.append(val_loss)

            # Early stopping check
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
                # Clone tensors: state_dict() alone holds references that
                # later optimizer steps would overwrite in place
                best_state = {k: v.clone() for k, v in self.policy.state_dict().items()}
            else:
                patience_counter += 1
                if patience_counter >= early_stopping_patience:
                    print(f"Early stopping at epoch {epoch}")
                    self.policy.load_state_dict(best_state)
                    break

            if epoch % 10 == 0:
                print(f"Epoch {epoch}: train_loss={avg_train_loss:.4f}, "
                      f"val_loss={val_loss:.4f}")

        return train_losses, val_losses

    def get_action(self, state, deterministic=True):
        """Get action"""
        state = self.normalize_state(state)
        state = torch.FloatTensor(state).unsqueeze(0)

        self.policy.eval()
        with torch.no_grad():
            if self.continuous:
                action = self.policy(state).squeeze(0).numpy()
            else:
                logits = self.policy(state)
                if deterministic:
                    action = logits.argmax(dim=1).item()
                else:
                    probs = torch.softmax(logits, dim=1)
                    action = torch.multinomial(probs, 1).item()
        return action

    def evaluate(self, env, n_episodes=10, render=False):
        """Evaluate policy performance in environment"""
        rewards = []
        for _ in range(n_episodes):
            state = env.reset()
            done = False
            episode_reward = 0
            while not done:
                if render:
                    env.render()
                action = self.get_action(state)
                state, reward, done, _ = env.step(action)
                episode_reward += reward
            rewards.append(episode_reward)
        return np.mean(rewards), np.std(rewards)

The Distribution Shift Problem

Behavioral cloning appears simple and elegant, but it has a serious problem — distribution shift. Let's analyze this in detail.

The essence of the problem:

During training, we train the model on the expert trajectory state distribution $p_{\pi^*}(s)$. During testing, the policy $\pi_\theta$ generates its own trajectories and visits the state distribution $p_{\pi_\theta}(s)$.

Key problem: Since $\pi_\theta$ isn't perfect, it will:

  1. Make small mistakes (select suboptimal actions)
  2. These small mistakes lead to states the expert never visited
  3. In these new states, $\pi_\theta$ has seen no training data and may make bigger mistakes
  4. Errors accumulate, and the trajectory deviates further and further from the expert's

A concrete example: Autonomous driving

Suppose we train an autonomous driving model with behavioral cloning. The expert (human driver) always keeps the car in the center of the lane, so training data states are all "car in lane center."

The model learns how to drive in this state, but it's imperfect — sometimes drifting slightly left or right. Once the car slightly deviates from center:

  • This is a new state the model has never seen
  • The model may continue driving in the wrong direction
  • The car deviates more and more, and eventually goes off the road

Mathematical Analysis: Error Accumulation

Suppose at each timestep, the learned policy has probability $\epsilon$ of making an error (selecting a different action from the expert). In a $T$-step trajectory, how does the total error accumulate?

Let $p_t$ be the probability of being in a "correct" state (a state the expert would visit) at time $t$. Then:

  • $p_0 = 1$ (same initial state)
  • $p_{t+1} \geq p_t (1 - \epsilon)$ (we only stay correct if no error occurs)

So $p_t \geq (1 - \epsilon)^t$.

More rigorous analysis shows the expected total error is:

$$\mathbb{E}[\text{errors}] = O(\epsilon T^2)$$

Quadratic growth! This means:

  • If a task needs $T = 100$ steps, the error is amplified by a factor on the order of $T^2 = 10{,}000$
  • Even with 99% single-step accuracy, long tasks will fail
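The quadratic blow-up is easy to verify numerically. A small sketch under the pessimistic assumption that each on-distribution step errs with probability $\epsilon$ and an error is unrecoverable (the function name is illustrative):

```python
def time_off_distribution(eps, T):
    """Expected number of steps spent outside the expert's state
    distribution when each on-distribution step errs with
    probability eps and errors are unrecoverable."""
    p_on = 1.0   # probability of still being on-distribution
    off = 0.0    # expected off-distribution steps accumulated so far
    for _ in range(T):
        off += 1.0 - p_on
        p_on *= 1.0 - eps
    return off
```

For small $\epsilon T$, the result is close to $\epsilon T^2 / 2$: doubling the horizon roughly quadruples the time spent off-distribution.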

Mitigating Distribution Shift

Before introducing DAgger, let's look at some simple mitigation methods:

1. Data Augmentation

Add noise to states, simulating non-expert states policy might visit:

def augment_data(states, actions, noise_std=0.01):
    """Augment data by adding noise"""
    augmented_states = []
    augmented_actions = []

    for s, a in zip(states, actions):
        # Original data
        augmented_states.append(s)
        augmented_actions.append(a)

        # Noisy data
        for _ in range(5):
            noisy_s = s + np.random.normal(0, noise_std, s.shape)
            augmented_states.append(noisy_s)
            augmented_actions.append(a)  # Action unchanged

    return np.array(augmented_states), np.array(augmented_actions)

2. Expert Noise Injection

During data collection, have expert intentionally make small mistakes, then show how to recover:

def collect_data_with_noise(expert, env, n_episodes, noise_prob=0.1):
    """Collect data with recovery demonstrations"""
    data = []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            if np.random.random() < noise_prob:
                # Inject random action
                action = env.action_space.sample()
            else:
                # Expert action
                action = expert.get_action(state)

            next_state, reward, done, _ = env.step(action)

            # Record what the expert would do in this state
            # (even if a random action was executed)
            expert_action = expert.get_action(state)
            data.append((state, expert_action))

            state = next_state
    return data

3. Regularization and Ensembles

  • Use Dropout, L2 regularization to prevent overfitting
  • Train multiple models, average or vote
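The voting idea can be sketched in a few lines for discrete actions (the `models` interface — callables from state to action — is an assumption for illustration):

```python
import numpy as np

def ensemble_action(models, state):
    """Majority vote over an ensemble of discrete BC policies.
    `models` is a list of callables mapping a state to an action."""
    votes = [m(state) for m in models]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```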

But these methods cannot fundamentally solve distribution shift. The real solution requires obtaining new expert labels during learning.

DAgger: Dataset Aggregation

Core Idea

DAgger's (Dataset Aggregation) core idea is simple:

During learning, use current policy to interact with environment, collect new states, then query expert for correct actions in these states.

This way, even if policy makes mistakes and enters new states, we can get expert's correct actions in these states.

Algorithm Flow:

  1. Collect an initial dataset $\mathcal{D}_1$ with the expert policy $\pi^*$, and train an initial policy $\hat{\pi}_1$ on it
  2. For $i = 1, 2, \ldots, N$:
    • Run $\hat{\pi}_i$ in the environment to collect states $\{s_1, s_2, \ldots\}$
    • For each state, query the expert action $\pi^*(s)$
    • Add the new data to the dataset: $\mathcal{D}_{i+1} = \mathcal{D}_i \cup \{(s, \pi^*(s))\}$
    • Retrain policy $\hat{\pi}_{i+1}$ on $\mathcal{D}_{i+1}$

Key Insight:

DAgger breaks the vicious cycle of distribution shift:

Policy makes a mistake → enters a new state → receives an expert label → learns the correct action in that state

Theoretical Guarantee

DAgger has strong theoretical guarantees. Let $\epsilon$ be the training error (the average error rate on the training set) and $T$ the trajectory length.

Theorem (Ross et al., 2011): After a sufficient number of DAgger iterations, the best policy $\hat{\pi}$ among the iterates satisfies:

$$J(\hat{\pi}) \leq J(\pi^*) + O(\epsilon T)$$

Compared to behavioral cloning's $O(\epsilon T^2)$, this is linear, not quadratic!

Intuitive understanding:

  • Behavioral cloning's error accumulates and amplifies
  • By collecting expert labels in the states the policy actually visits, DAgger reduces the problem to the probability of error at each state
  • Each timestep independently contributes error probability $\epsilon$, for $O(\epsilon T)$ total over $T$ timesteps

Detailed Implementation

class DAgger:
    """
    DAgger (Dataset Aggregation) Algorithm

    Solves distribution shift by iteratively collecting data:
    1. Collect trajectories with current policy
    2. Query expert for correct actions in these states
    3. Add data to training set
    4. Retrain policy
    """

    def __init__(self, state_dim, action_dim, hidden_dims=[256, 256],
                 lr=1e-3, continuous=False):
        """Initialize DAgger"""
        self.bc = BehavioralCloning(
            state_dim, action_dim, hidden_dims, lr, continuous
        )
        self.dataset = {'states': [], 'actions': []}
        self.continuous = continuous
        self.action_dim = action_dim

    def collect_data_with_expert(self, env, expert_policy, n_episodes,
                                 use_learner=True, beta=0.5):
        """
        Collect data: use learner or mixed policy to collect trajectories,
        expert labels the actions

        Args:
            env: Environment
            expert_policy: Expert policy function
            n_episodes: Number of episodes to collect
            use_learner: Whether to use learner policy (vs pure expert)
            beta: Expert ratio in mixed policy (for safe learning)

        Returns:
            new_states: Newly collected states
            new_actions: Expert actions in these states
            episode_rewards: Episode rewards (using actually executed actions)
        """
        new_states = []
        new_actions = []
        episode_rewards = []

        for ep in range(n_episodes):
            state = env.reset()
            done = False
            episode_reward = 0

            while not done:
                # Decide which action to execute
                if not use_learner:
                    # Pure expert data collection
                    action = expert_policy(state)
                elif np.random.random() < beta:
                    # Mixed policy: use expert with probability beta
                    action = expert_policy(state)
                else:
                    # Use learner policy
                    action = self.bc.get_action(state)

                # Key: regardless of executed action, record expert's action as label
                expert_action = expert_policy(state)

                new_states.append(state)
                new_actions.append(expert_action)

                # Environment interaction
                next_state, reward, done, _ = env.step(action)
                state = next_state
                episode_reward += reward

            episode_rewards.append(episode_reward)

        return (np.array(new_states), np.array(new_actions),
                episode_rewards)

    def train(self, env, expert_policy, n_iterations=10,
              n_episodes_init=50, n_episodes_per_iter=20,
              epochs_per_iter=50, beta_schedule='linear'):
        """
        DAgger training main loop

        Args:
            env: Environment
            expert_policy: Expert policy
            n_iterations: Number of iterations
            n_episodes_init: Episodes for initial data collection
            n_episodes_per_iter: Episodes per iteration
            epochs_per_iter: Training epochs per iteration
            beta_schedule: Mixing ratio schedule
                - 'linear': Linear decay
                - 'constant': Keep constant
                - 'exponential': Exponential decay
        """
        rewards_history = []

        # Round 0: Collect initial data with expert
        print("Collecting initial expert data...")
        states, actions, _ = self.collect_data_with_expert(
            env, expert_policy, n_episodes_init, use_learner=False
        )
        self.dataset['states'].extend(states)
        self.dataset['actions'].extend(actions)

        # Train initial policy
        all_states = np.array(self.dataset['states'])
        all_actions = np.array(self.dataset['actions'])
        self.bc.train(all_states, all_actions, epochs=epochs_per_iter)

        # DAgger iterations
        for iteration in range(n_iterations):
            # Compute current beta (mixing ratio)
            if beta_schedule == 'linear':
                beta = max(0.1, 1 - iteration / n_iterations)
            elif beta_schedule == 'exponential':
                beta = 0.5 ** (iteration + 1)
            else:  # constant
                beta = 0.5

            # Collect new data
            states, actions, episode_rewards = self.collect_data_with_expert(
                env, expert_policy, n_episodes_per_iter,
                use_learner=True, beta=beta
            )

            # Aggregate dataset
            self.dataset['states'].extend(states)
            self.dataset['actions'].extend(actions)

            # Retrain on complete dataset
            all_states = np.array(self.dataset['states'])
            all_actions = np.array(self.dataset['actions'])
            self.bc.train(all_states, all_actions, epochs=epochs_per_iter)

            # Evaluate current policy
            eval_reward, _ = self.bc.evaluate(env, n_episodes=10)
            rewards_history.append(eval_reward)

            print(f"Iteration {iteration+1}: beta={beta:.2f}, "
                  f"dataset_size={len(self.dataset['states'])}, "
                  f"eval_reward={eval_reward:.2f}")

        return rewards_history

    def get_action(self, state):
        """Get action"""
        return self.bc.get_action(state)

DAgger Variants

1. SafeDAgger

In some applications (like autonomous driving), letting the learner take full control may be dangerous. SafeDAgger adds a "guardrail" mechanism:

def safe_dagger_step(state, learner, expert, safety_threshold, continuous=False):
    """Safe DAgger execution step"""
    learner_action = learner.get_action(state)
    expert_action = expert(state)

    # Compute difference between learner and expert actions
    if continuous:
        diff = np.linalg.norm(learner_action - expert_action)
    else:
        diff = 0 if learner_action == expert_action else 1

    # If difference too large, use expert action (safety measure)
    if diff > safety_threshold:
        return expert_action, expert_action  # Execute expert, label expert
    else:
        return learner_action, expert_action  # Execute learner, label expert

2. Sample-Efficient DAgger

Not every state needs expert labeling. We can query selectively:

  • Only query on "uncertain" states (use an ensemble of models to measure uncertainty)
  • Only query on "important" states (use the advantage function or TD error to measure importance)
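The uncertainty-gated query can be sketched with ensemble disagreement (the threshold value and the interface are illustrative assumptions):

```python
import numpy as np

def should_query_expert(prob_list, threshold=0.05):
    """Query the expert only when ensemble members disagree.

    prob_list: action-probability vectors, one per ensemble member.
    Disagreement is the mean per-action variance across members."""
    probs = np.stack(prob_list)
    disagreement = probs.var(axis=0).mean()
    return disagreement > threshold
```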

3. HG-DAgger (Human-Gated DAgger)

Let human expert decide when to intervene:

def hgdagger_episode(env, learner, human_expert):
    """DAgger with human deciding when to intervene"""
    state = env.reset()
    done = False
    data = []

    while not done:
        learner_action = learner.get_action(state)

        # Show to human, ask if correction needed
        human_action = human_expert.maybe_correct(state, learner_action)

        if human_action is not None:
            # Human intervenes
            action = human_action
            data.append((state, human_action))
        else:
            # Human thinks learner is doing right
            action = learner_action

        state, _, done, _ = env.step(action)

    return data

DAgger Limitations

  1. Requires interactive expert: In many scenarios, we cannot query expert anytime (expert may be historical data, deceased master, etc.)

  2. Heavy expert burden: Expert needs to provide labels for many states, which may be very time-consuming

  3. Expert must be perfect: DAgger assumes expert always gives correct answers, but human experts also make mistakes or are inconsistent

  4. Safety: May visit dangerous states during learning

When we cannot use interactive expert, we need other methods — inverse reinforcement learning and GAIL.

Inverse Reinforcement Learning (IRL)

Problem Setting and Motivation

The methods above (BC and DAgger) learn direct state-to-action mappings. But there's a deeper question: why does the expert act this way?

If we can understand the expert's objective (its reward function), we can:

  1. Generalize to situations the expert hasn't demonstrated
  2. Understand the "intent" behind the expert's behavior
  3. Apply the same objective in different environments

Inverse Reinforcement Learning (IRL) takes this approach:

Infer reward function from expert demonstrations, then use standard RL methods to optimize this reward.

Formally, given expert demonstrations $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$, IRL seeks a reward function $r$ such that:

  1. Under $r$, the expert policy is (approximately) optimal
  2. The expert policy achieves higher cumulative reward under $r$ than other policies

Reward Ambiguity Problem

A fundamental challenge in IRL is reward ambiguity: given demonstrations, there may be infinitely many consistent reward functions!

Example: Consider the most extreme case — the reward function that is identically zero: $r(s, a) \equiv 0$. Under this reward, all policies are optimal (every cumulative reward is 0), so the expert policy is of course also optimal. But this reward is completely uninformative.

More generally, any reward function under which the expert policy happens to be optimal is consistent with the demonstrations. We need some regularization or assumptions to select a "good" reward function.

Maximum Entropy Inverse Reinforcement Learning

Maximum Entropy IRL (Ziebart et al., 2008) solves this problem with an elegant assumption:

Among all choices that explain the demonstrations equally well, the expert is assumed to prefer the one with maximum entropy.

In other words, the expert doesn't arbitrarily prefer certain actions — if two actions are equally good, the expert chooses between them at random.

This leads to an expert policy of the form:

$$\pi^*(a \mid s) \propto \exp\big(Q^*(s, a)\big)$$

Or, more generally, for entire trajectories:

$$p(\tau) \propto \exp\left(\sum_t r(s_t, a_t)\right)$$

Intuition: High-reward trajectories are exponentially preferred, but the optimal trajectory is not chosen deterministically. This is a kind of "soft optimality."
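For a finite set of candidate trajectories, the MaxEnt distribution $p(\tau) \propto \exp(R(\tau))$ is just a softmax over returns; a minimal sketch:

```python
import numpy as np

def soft_optimal_trajectory_probs(returns):
    """MaxEnt trajectory distribution p(tau) ∝ exp(R(tau)),
    computed stably by subtracting the max return before exponentiating."""
    z = np.exp(returns - returns.max())
    return z / z.sum()
```

A trajectory with one extra unit of reward becomes $e \approx 2.72$ times more likely, but lower-return trajectories keep nonzero probability — the "soft" in soft optimality.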

Objective Function

Maximum Entropy IRL's objective is maximizing the likelihood of the expert trajectories:

$$\max_\psi \sum_{\tau \in \mathcal{D}} \log p_\psi(\tau)$$

Expanding the probability:

$$p_\psi(\tau) = \frac{1}{Z} \exp\big(R_\psi(\tau)\big), \qquad Z = \int \exp\big(R_\psi(\tau)\big)\, d\tau$$

where $Z$ is the partition function, integrating over all possible trajectories.

Thus the objective becomes:

$$\mathcal{L}(\psi) = \sum_{\tau \in \mathcal{D}} R_\psi(\tau) - N \log Z$$

Gradient Computation

Taking the gradient with respect to $\psi$:

$$\nabla_\psi \mathcal{L} = \sum_{\tau \in \mathcal{D}} \nabla_\psi R_\psi(\tau) - N \nabla_\psi \log Z$$

The second term $\nabla_\psi \log Z$ can be written as:

$$\nabla_\psi \log Z = \mathbb{E}_{\tau \sim p_\psi} \big[\nabla_\psi R_\psi(\tau)\big]$$

where the expectation is over trajectories of the soft-optimal policy under the current reward.

Final gradient:

$$\nabla_\psi \mathcal{L} = \sum_{\tau \in \mathcal{D}} \nabla_\psi R_\psi(\tau) - N\, \mathbb{E}_{\tau \sim p_\psi}\big[\nabla_\psi R_\psi(\tau)\big]$$

Intuitive explanation:

  • First term: increase the reward of expert trajectories
  • Second term: decrease the reward of current-policy trajectories
  • At convergence both terms are equal, meaning the current policy's trajectory distribution matches the expert's
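For the classic linear-reward case $r_\psi(s, a) = \psi^\top f(s, a)$, this gradient reduces to a difference of feature expectations; a minimal sketch (the function names and the precomputed feature arrays are illustrative assumptions):

```python
import numpy as np

def maxent_grad(expert_feats, policy_feats):
    """MaxEnt IRL gradient for a linear reward r(s,a) = psi . f(s,a):
    expert feature expectation minus current-policy feature expectation."""
    return expert_feats.mean(axis=0) - policy_feats.mean(axis=0)

def maxent_update(psi, expert_feats, policy_feats, lr=0.1):
    """One gradient-ascent step on the reward weights."""
    return psi + lr * maxent_grad(expert_feats, policy_feats)
```

The weights move toward features the expert exhibits and away from features the current policy over-produces; at convergence the two feature expectations match.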

Detailed Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


class MaxEntIRL:
    """
    Maximum Entropy Inverse Reinforcement Learning

    Learns a reward function from expert demonstrations, then optimizes with RL.

    Core idea:
    1. Assume expert policy is soft-optimal: π*(a|s) ∝ exp(Q*(s,a))
    2. Maximize expert trajectory likelihood
    3. Gradient = expert feature expectation - current policy feature expectation
    """

    def __init__(self, state_dim, action_dim, hidden_dim=128,
                 reward_lr=1e-3, policy_lr=1e-3, continuous=False):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.continuous = continuous

        # Reward network: r(s, a) -> scalar
        self.reward_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        self.reward_optimizer = optim.Adam(
            self.reward_net.parameters(), lr=reward_lr
        )

        # Policy network
        if continuous:
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, action_dim),
                nn.Tanh()
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
            policy_params = list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )
            policy_params = self.policy.parameters()

        self.policy_optimizer = optim.Adam(policy_params, lr=policy_lr)

    def compute_reward(self, states, actions):
        """Compute reward r(s, a)"""
        if not isinstance(states, torch.Tensor):
            states = torch.FloatTensor(states)
        if not isinstance(actions, torch.Tensor):
            if self.continuous:
                actions = torch.FloatTensor(actions)
            else:
                actions = torch.LongTensor(actions)
                actions = torch.nn.functional.one_hot(
                    actions, self.action_dim
                ).float()

        if len(states.shape) == 1:
            states = states.unsqueeze(0)
        if len(actions.shape) == 1:
            actions = actions.unsqueeze(0)

        inputs = torch.cat([states, actions], dim=-1)
        return self.reward_net(inputs).squeeze(-1)

    def get_action(self, state, deterministic=False):
        """Sample action from policy"""
        state = torch.FloatTensor(state).unsqueeze(0)

        if self.continuous:
            mean = self.policy_mean(state)
            if deterministic:
                return mean.squeeze(0).detach().numpy()
            std = torch.exp(self.policy_log_std)
            action = mean + std * torch.randn_like(mean)
            return action.squeeze(0).detach().numpy()
        else:
            probs = self.policy(state)
            if deterministic:
                return probs.argmax(dim=1).item()
            return torch.multinomial(probs, 1).item()

IRL Challenges and Extensions

1. Computational Complexity

After each reward update, need to retrain policy (inner loop). This makes IRL much slower than direct imitation learning.

2. Reward Shaping

Learned reward function may not be the "true" reward, just one function that can explain expert behavior.

3. Deep IRL

Modern methods parameterize reward function with neural networks, can handle high-dimensional states. Representative methods include: - Deep MaxEnt IRL - Guided Cost Learning - AIRL (Adversarial IRL)

GAIL: Generative Adversarial Imitation Learning

Core Idea

GAIL (Generative Adversarial Imitation Learning) combines imitation learning with GANs, providing an end-to-end solution.

Core idea:

Train a discriminator to distinguish expert trajectories from policy-generated trajectories, while training policy to "fool" discriminator.

This is exactly like GANs:

  • Generator = policy $\pi_\theta$: generates trajectories
  • Discriminator = $D_\phi$: distinguishes expert trajectories from generated ones

When policy successfully fools discriminator, its behavior becomes indistinguishable from expert — exactly the goal of imitation learning!

Mathematical Formulation

GAIL optimizes the following objective:

$$\min_\pi \max_D \; \mathbb{E}_\pi[\log D(s, a)] + \mathbb{E}_{\pi^*}[\log(1 - D(s, a))] - \lambda H(\pi)$$

where:

  • $D(s, a)$: discriminator, outputs the probability that $(s, a)$ is from the policy $\pi$ (not the expert)
  • $H(\pi)$: policy entropy, $H(\pi) = \mathbb{E}_\pi[-\log \pi(a \mid s)]$
  • $\lambda$: entropy regularization coefficient

Discriminator optimization:

For fixed $\pi$, the discriminator maximizes its ability to distinguish. The optimal discriminator is:

$$D^*(s, a) = \frac{\rho_\pi(s, a)}{\rho_\pi(s, a) + \rho_{\pi^*}(s, a)}$$

where $\rho_\pi$ is the policy's state-action occupancy measure.

Policy optimization:

The policy wants to minimize $\mathbb{E}_\pi[\log D(s, a)]$, i.e., make the discriminator believe its trajectories come from the expert.

Key insight: The discriminator output can serve as a reward signal!

  • If $D(s, a) \approx 0$ (discriminator thinks expert), then $r(s, a) = \log(1 - D(s, a)) \approx 0$ (no penalty)
  • If $D(s, a) \approx 1$ (discriminator thinks policy), then $r(s, a) \to -\infty$ (heavy penalty)

In practice, a small constant is added inside the logarithm for numerical stability:

$$r(s, a) = \log\big(1 - D(s, a) + \epsilon\big)$$
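The reward computation is a one-liner worth sanity-checking numerically (the function name is illustrative; the sign convention is the one used in this chapter, where $D \to 1$ means "from policy"):

```python
import numpy as np

def gail_reward(d, eps=1e-8):
    """GAIL reward from discriminator output d = P((s,a) from policy).
    d near 0 (expert-like): reward near 0 (the maximum).
    d near 1 (policy-like): large negative reward (penalty)."""
    return np.log(1.0 - d + eps)
```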

Relationship to IRL

GAIL can be viewed as implicit IRL:

  • Traditional IRL: Explicitly learns reward function, then solves RL
  • GAIL: Discriminator implicitly defines reward function, jointly optimized with policy

Ho & Ermon (2016) proved GAIL is equivalent to Maximum Entropy IRL in terms of occupancy measure matching.

Detailed Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.distributions import Categorical, Normal

class GAIL:
    """
    Generative Adversarial Imitation Learning

    Core idea:
    1. Discriminator learns to distinguish expert and policy-generated (s, a) pairs
    2. Policy learns to fool the discriminator
    3. The discriminator output serves as the policy's reward signal
    4. Use PPO to optimize the policy
    """

    def __init__(self, state_dim, action_dim, hidden_dim=256,
                 disc_lr=3e-4, policy_lr=3e-4, continuous=False):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.continuous = continuous

        # Discriminator: D(s, a) -> [0, 1]
        # Output near 1 means from policy, near 0 means from expert.
        # Discrete actions are one-hot encoded, so the input size is
        # state_dim + action_dim in both cases
        disc_input_dim = state_dim + action_dim
        self.discriminator = nn.Sequential(
            nn.Linear(disc_input_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )
        self.disc_optimizer = optim.Adam(
            self.discriminator.parameters(), lr=disc_lr
        )

        # Policy network
        if continuous:
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Tanh()
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
            policy_params = list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )
            policy_params = self.policy.parameters()

        self.policy_optimizer = optim.Adam(policy_params, lr=policy_lr)

        # Value network (for PPO)
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        self.value_optimizer = optim.Adam(self.value.parameters(), lr=policy_lr)

    def get_action(self, state, deterministic=False):
        """Sample an action (deterministic mode returns only the action)"""
        state = torch.FloatTensor(state).unsqueeze(0)

        if self.continuous:
            mean = self.policy_mean(state)
            if deterministic:
                return mean.squeeze(0).detach().numpy()
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum(dim=-1)
            return action.squeeze(0).detach().numpy(), log_prob.detach()
        else:
            probs = self.policy(state)
            if deterministic:
                return probs.argmax(dim=-1).item()
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action).detach()

    def compute_gail_reward(self, states, actions):
        """
        Compute the GAIL reward

        r(s, a) = log(1 - D(s, a))

        D near 0 (expert-like) -> high reward
        D near 1 (policy-like) -> low reward (penalty)
        """
        disc_input = self.get_disc_input(states, actions)

        with torch.no_grad():
            d = self.discriminator(disc_input)
            rewards = torch.log(1 - d + 1e-8).squeeze(-1)

        return rewards.numpy()

    def get_disc_input(self, states, actions):
        """Prepare discriminator input"""
        if not isinstance(states, torch.Tensor):
            states = torch.FloatTensor(states)

        if self.continuous:
            if not isinstance(actions, torch.Tensor):
                actions = torch.FloatTensor(actions)
        else:
            if not isinstance(actions, torch.Tensor):
                actions = torch.LongTensor(actions)
            actions = torch.nn.functional.one_hot(
                actions, self.action_dim
            ).float()

        return torch.cat([states, actions], dim=-1)

    def update_discriminator(self, expert_states, expert_actions,
                             policy_states, policy_actions,
                             n_updates=1):
        """
        Update the discriminator

        Goal: distinguish expert and policy-generated (s, a) pairs
        - Expert data label: 0 (D output should be low)
        - Policy data label: 1 (D output should be high)
        """
        expert_input = self.get_disc_input(expert_states, expert_actions)
        policy_input = self.get_disc_input(policy_states, policy_actions)

        n_expert = len(expert_states)
        n_policy = len(policy_states)
        batch_size = min(64, n_expert, n_policy)

        total_loss = 0
        for _ in range(n_updates):
            expert_idx = np.random.choice(n_expert, batch_size, replace=False)
            policy_idx = np.random.choice(n_policy, batch_size, replace=False)

            expert_batch = expert_input[expert_idx]
            policy_batch = policy_input[policy_idx]

            expert_pred = self.discriminator(expert_batch)
            policy_pred = self.discriminator(policy_batch)

            # Binary cross-entropy: push expert predictions toward 0
            # and policy predictions toward 1
            expert_loss = -torch.log(1 - expert_pred + 1e-8).mean()
            policy_loss = -torch.log(policy_pred + 1e-8).mean()

            disc_loss = expert_loss + policy_loss

            self.disc_optimizer.zero_grad()
            disc_loss.backward()
            self.disc_optimizer.step()

            total_loss += disc_loss.item()

        return total_loss / n_updates
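To get a feel for the reward r(s, a) = log(1 - D(s, a)) used in `compute_gail_reward`, here is the mapping from discriminator outputs to rewards, evaluated at a few points:

```python
import math

# r(s, a) = log(1 - D(s, a)): expert-like pairs (D near 0) get a reward
# near 0, policy-like pairs (D near 1) get a large negative reward.
for d in [0.01, 0.5, 0.99]:
    print(round(math.log(1 - d + 1e-8), 3))  # -0.01, -0.693, -4.605
```

So the policy is never given positive rewards; it is only penalized less the more expert-like its behavior looks to the discriminator.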

GAIL Advantages and Limitations

Advantages:

  1. No explicit reward needed: Discriminator implicitly learns reward structure
  2. End-to-end training: Policy and "reward" jointly optimized, no two-stage
  3. Sample efficient: More efficient than MaxEnt IRL
  4. Handles high-dimensional: Can handle high-dimensional states and continuous actions
  5. Theoretical guarantee: Has guarantees in occupancy measure matching sense

Limitations:

  1. Requires environment interaction: Cannot learn purely offline
  2. Training unstable: GAN training itself is unstable
  3. Mode collapse: May only learn part of expert behavior
  4. Hard to interpret: No explicit reward function

GAIL Variants

1. AIRL (Adversarial Inverse Reinforcement Learning)

AIRL modifies the discriminator structure so that it can recover an explicit reward function:

D(s, a, s') = exp(f(s, a, s')) / (exp(f(s, a, s')) + π(a|s))

where f decomposes into a reward term and a shaping term:

f(s, a, s') = g(s, a) + γ h(s') - h(s)

Here g(s, a) recovers the reward and h(s) is a potential-based shaping function.

2. VAIL (Variational Adversarial Imitation Learning)

VAIL uses a variational information bottleneck to improve training stability.

3. SAM (State-only Adversarial Mimicking)

When actions are unobservable (like learning from video), SAM only uses state matching.

Method Comparison and Selection Guide

Comprehensive Comparison

Method       Interactive Expert   Env Interaction   Sample Efficiency       Implementation Complexity   Theoretical Guarantee   Interpretability
BC           Not needed           Not needed        High (but with drift)   Low                         Weak                    Medium
DAgger       Needed               Needed            Medium-High             Low                         Strong                  Medium
MaxEnt IRL   Not needed           Needed            Low                     High                        Strong                  High
GAIL         Not needed           Needed            Medium                  Medium                      Medium                  Low

Selection Guide

Choose Behavioral Cloning when:

  • You have lots of expert data
  • The task is relatively simple (short time horizon)
  • Compute resources are limited
  • You cannot interact with the environment

Choose DAgger when:

  • You can query the expert at any time
  • Expert labeling cost is not high
  • You need to handle long-horizon tasks
  • You need a theoretical guarantee

Choose Inverse RL when:

  • You need to understand the expert's objective
  • You need to generalize across different environments
  • You need an interpretable reward function
  • You have sufficient compute resources

Choose GAIL when:

  • You cannot query the expert
  • You need high-quality imitation
  • You can interact with the environment
  • The state/action spaces are large

Advanced Topics

Multimodal Expert Behavior

An expert may take different actions in the same state. For example, when avoiding an obstacle, the expert can turn either left or right.

Standard BC trained with a unimodal loss learns the "average" of these behaviors, which may drive straight into the obstacle!
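A tiny numerical illustration of this failure, using hypothetical steering data: when half the demonstrations turn left (-1) and half turn right (+1), the MSE-optimal constant prediction is their mean, which steers straight ahead.

```python
import numpy as np

# Hypothetical steering demonstrations at the same obstacle state:
# half the experts turn left (-1.0), half turn right (+1.0).
expert_actions = np.array([-1.0, 1.0, -1.0, 1.0])

# The MSE-optimal constant prediction is the mean of the targets.
mse_optimal = expert_actions.mean()
print(mse_optimal)  # 0.0 -> steer straight into the obstacle
```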

Solutions:

  1. Mixture Density Networks (MDN)
class MDNPolicy(nn.Module):
    """Mixture Density Network: outputs a Gaussian mixture distribution"""
    def __init__(self, state_dim, action_dim, n_components=5, hidden_dim=128):
        super().__init__()
        self.n_components = n_components
        self.action_dim = action_dim

        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Mixture weights
        self.pi_layer = nn.Linear(hidden_dim, n_components)
        # Mean and standard deviation for each component
        self.mu_layer = nn.Linear(hidden_dim, n_components * action_dim)
        self.sigma_layer = nn.Linear(hidden_dim, n_components * action_dim)

    def forward(self, state):
        h = self.shared(state)

        pi = torch.softmax(self.pi_layer(h), dim=-1)  # Mixture weights
        mu = self.mu_layer(h).view(-1, self.n_components, self.action_dim)
        sigma = torch.exp(self.sigma_layer(h)).view(-1, self.n_components, self.action_dim)

        return pi, mu, sigma

    def sample(self, state):
        pi, mu, sigma = self(state)

        # Select a component for each batch element
        k = torch.multinomial(pi, 1).squeeze(-1)

        # Sample from the selected component (independent noise per element)
        idx = torch.arange(len(mu))
        action = mu[idx, k] + sigma[idx, k] * torch.randn_like(mu[idx, k])
        return action
  2. Conditional VAE (CVAE)

Learn latent behavior modes, then generate actions conditioned on a sampled latent code.
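A minimal sketch of such a conditional policy, under assumed names (`CVAEPolicy`, latent size `z_dim`): the encoder infers a latent mode from (s, a), the decoder reconstructs the action from (s, z), and at test time a mode is sampled from the prior.

```python
import torch
import torch.nn as nn

class CVAEPolicy(nn.Module):
    """Conditional VAE: encode (s, a) into a latent mode z, decode (s, z) into an action."""
    def __init__(self, state_dim, action_dim, z_dim=2, hidden_dim=64):
        super().__init__()
        self.z_dim = z_dim
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * z_dim)  # mean and log-variance of q(z|s,a)
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + z_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state, action):
        h = self.encoder(torch.cat([state, action], dim=-1))
        mu, log_var = h.chunk(2, dim=-1)
        # Reparameterization trick
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        recon = self.decoder(torch.cat([state, z], dim=-1))
        # KL divergence of q(z|s,a) from the standard normal prior
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1).mean()
        return recon, kl

    def act(self, state):
        # At test time, sample a behavior mode from the prior and decode
        z = torch.randn(state.shape[0], self.z_dim)
        return self.decoder(torch.cat([state, z], dim=-1))
```

Training minimizes reconstruction error plus the KL term; different samples of z at test time produce different behavior modes.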

  3. Info-GAIL

Add latent variables to GAIL to learn distinct behavior modes.

Learning from Suboptimal Demonstrations

In reality, expert demonstrations often aren't optimal. How can we handle this?

1. Weighted Behavioral Cloning

Give higher-quality demonstrations higher weights:

def weighted_bc_loss(predictions, expert_actions, quality_scores,
                     temperature=1.0):
    """
    Weighted behavioral cloning loss

    quality_scores: quality score for each demonstration
                    (can be return, human rating, etc.)
    """
    weights = torch.softmax(quality_scores / temperature, dim=0)
    # Per-sample losses, e.g. MSE for continuous actions
    losses = ((predictions - expert_actions) ** 2).mean(dim=-1)
    return (weights * losses).sum()

2. Learning to Rank

Instead of learning absolutely good actions, learn which demonstrations are better than others (relative preferences).
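One common instantiation of this idea, in the spirit of preference-based reward learning methods such as T-REX, is a Bradley-Terry style loss over pairs of trajectories; the function name and signature here are assumptions for illustration:

```python
import torch

def ranking_loss(reward_better, reward_worse):
    """
    Bradley-Terry style pairwise ranking loss.

    reward_better / reward_worse: predicted per-step rewards (1-D tensors)
    for a trajectory judged better / worse. The loss pushes the summed
    predicted return of the better trajectory above that of the worse one.
    """
    r_b = reward_better.sum()
    r_w = reward_worse.sum()
    # -log P(better > worse) under the Bradley-Terry model
    return -torch.log_softmax(torch.stack([r_b, r_w]), dim=0)[0]
```

A reward model trained with this loss needs only rankings of demonstrations, not absolute quality labels.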

3. Self-Improvement

First imitate, then improve with RL:

def iterative_improvement(bc_agent, env, n_rounds=5):
    """Iterative self-improvement (collect_trajectories, filter_by_return
    and bc_agent.train are placeholders for your own implementations)."""
    for round_idx in range(n_rounds):
        # Collect data with the current policy
        trajectories = collect_trajectories(bc_agent, env)

        # Keep only the best trajectories (e.g. top 20% by return)
        good_trajectories = filter_by_return(trajectories, top_k=0.2)

        # Retrain on the filtered data
        bc_agent.train(good_trajectories)
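The `filter_by_return` helper above is left unspecified; a minimal version, assuming each trajectory is a dict with a `rewards` list, might look like:

```python
def filter_by_return(trajectories, top_k=0.2):
    """Keep the top_k fraction of trajectories, ranked by total return."""
    scored = sorted(trajectories,
                    key=lambda traj: sum(traj["rewards"]),
                    reverse=True)
    n_keep = max(1, int(len(scored) * top_k))
    return scored[:n_keep]
```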

Cross-Domain Imitation Learning

When expert and learner have different state/action spaces:

1. Third-Person Imitation Learning

Learn from video (third-person view), but execute from first-person view.

2. Cross-Morphology Transfer

Robot A demonstrates, Robot B imitates (different body structures).

3. Domain Adaptation

Align state representations from different domains.

Practical Advice

Data Collection

  1. Data quality more important than quantity
  2. Ensure covering diverse scenarios
  3. Record expert's "recovery" behavior (correcting from mistakes)
  4. Avoid obvious errors in demonstrations

Training Tips

  1. State normalization: Standardize input states
  2. Data augmentation: Add noise, crop, rotate, etc.
  3. Regularization: Dropout, L2 regularization to prevent overfitting
  4. Early stopping: Monitor validation set performance
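For the state normalization tip, a simple running normalizer (a common pattern, not tied to any specific library; names here are illustrative) can be implemented with Welford's online algorithm:

```python
import numpy as np

class RunningNormalizer:
    """Tracks a running mean/std of states and standardizes inputs."""
    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)  # sum of squared deviations (Welford)
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)
```

Update it on every observed state during data collection, and apply `normalize` to all policy inputs at both training and test time.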

Evaluation Methods

  1. Trajectory similarity: Distance to expert trajectories
  2. Task success rate: Ratio of completed tasks
  3. Cumulative reward: Total reward obtained in environment
  4. Human evaluation: Let humans judge behavior quality
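As one concrete (assumed) instance of trajectory similarity, a simple metric is the mean distance between time-aligned states of a policy rollout and the nearest expert trajectory:

```python
import numpy as np

def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between time-aligned states of two
    equal-length trajectories (arrays of shape [T, state_dim])."""
    return float(np.linalg.norm(traj_a - traj_b, axis=1).mean())

def similarity_to_experts(policy_traj, expert_trajs):
    """Distance from a policy trajectory to the closest expert trajectory."""
    return min(trajectory_distance(policy_traj, e) for e in expert_trajs)
```

For variable-length or time-warped trajectories, a dynamic-time-warping distance is the usual refinement.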

Summary

Imitation learning provides a learning paradigm that doesn't rely on explicit reward functions, learning a policy by observing expert demonstrations:

  1. Behavioral Cloning is simple and direct, but suffers from distribution shift; it suits short-horizon tasks with plenty of data

  2. DAgger mitigates drift through interactive learning; it requires a queryable expert and has theoretical guarantees

  3. Inverse RL recovers a reward function, providing interpretability and generalization, but is computationally expensive

  4. GAIL uses adversarial training for end-to-end imitation; it is currently the most popular method and balances performance and implementation complexity

Each method has its pros and cons; the choice depends on the specific application: whether an interactive expert is available, whether environment interaction is possible, interpretability requirements, compute resources, and so on.

Imitation learning has broad applications in robotics, autonomous driving, game AI, and dialogue systems. It complements reinforcement learning: when reward functions are hard to define, imitation learning provides another path; when we need to exceed the expert, reinforcement learning is more suitable. Combining both (e.g., initializing with imitation learning and fine-tuning with RL) often achieves the best results.

In the next chapter, we'll learn about AlphaGo and Monte Carlo Tree Search, seeing how deep learning combined with traditional planning methods achieves superhuman performance in complex games like Go.

References

  1. Pomerleau, D. A. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS.
  2. Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.
  3. Ziebart, B. D., et al. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI.
  4. Ho, J., & Ermon, S. (2016). Generative Adversarial Imitation Learning. NIPS.
  5. Fu, J., Luo, K., & Levine, S. (2018). Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. ICLR.
  6. Abbeel, P., & Ng, A. Y. (2004). Apprenticeship Learning via Inverse Reinforcement Learning. ICML.
  7. Finn, C., Levine, S., & Abbeel, P. (2016). Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. ICML.

Q&A: Frequently Asked Questions

Q1: What's the difference between imitation learning and supervised learning?

A: The main difference is the data distribution. Supervised learning assumes training and test data come from the same distribution (the i.i.d. assumption), but in imitation learning the test-time state distribution depends on the learned policy, which differs from the expert state distribution seen at training time. This is the root of the distribution shift problem.

Q2: How to tell if expert data is enough?

A: You can judge through:

  • Learning curve: does adding more data still improve performance?
  • Validation error: is it close to the training error (a check for overfitting)?
  • State space coverage: do the demonstrations cover the states the policy is likely to encounter?

Q3: GAIL training is unstable, what to do?

A: Try:

  • Adjust the discriminator and policy update frequencies (usually the discriminator updates more often)
  • Use gradient penalty or spectral normalization to stabilize the discriminator
  • Lower the learning rate
  • Increase the entropy regularization coefficient
  • Use WGAN variants
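As one example of the gradient penalty mentioned here, in the style of WGAN-GP, a sketch assuming a PyTorch discriminator `disc` and batches of expert and policy discriminator inputs:

```python
import torch

def gradient_penalty(disc, expert_input, policy_input):
    """WGAN-GP style penalty: push the discriminator's gradient norm
    toward 1 on random interpolations of expert and policy samples."""
    alpha = torch.rand(expert_input.size(0), 1)
    mixed = alpha * expert_input + (1 - alpha) * policy_input
    mixed.requires_grad_(True)

    d_mixed = disc(mixed)
    grads = torch.autograd.grad(outputs=d_mixed.sum(), inputs=mixed,
                                create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```

The penalty (scaled by a coefficient, often 10) is added to the discriminator loss before the backward pass.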

Q4: When to use BC, when to use GAIL?

A: BC is suitable for: simple tasks, lots of data, no ability to interact with the environment. GAIL is suitable for: complex tasks, limited data, the ability to interact with the environment. If BC can solve the problem, prefer BC (simpler and faster).

Q5: How to handle noise and errors in expert demonstrations?

A:

  • Data cleaning: filter out obviously wrong demonstrations
  • Weighted learning: give higher-quality demonstrations higher weights
  • Learning to rank: learn relative preferences instead of absolute actions
  • Robust loss functions: e.g., Huber loss is more robust to outliers

Q6: Can imitation learning exceed expert?

A: Pure imitation learning theoretically cannot exceed the expert (the goal is to copy the expert). But you can:

  • Initialize with imitation learning, then fine-tune with RL
  • Aggregate the strengths of multiple experts
  • Self-improve in states where the expert performs poorly

Q7: How to handle multiple experts?

A:

  • If the expert policies are similar: directly mix the data
  • If the expert styles differ: learn a multimodal policy (MDN, CVAE)
  • If the expert skill levels differ: weight the data or use only the best expert

Q8: How to evaluate imitation learning effectiveness?

A: Evaluate from multiple angles:

  • Cumulative reward (if a reward function is available)
  • Similarity to expert trajectories
  • Task success rate
  • Human subjective evaluation
  • Robustness on out-of-distribution states

  • Post title: Reinforcement Learning (7): Imitation Learning and Inverse Reinforcement Learning
  • Post author: Chen Kai
  • Create time: 2024-09-06 10:15:00
  • Post link: https://www.chenk.top/reinforcement-learning-7-imitation-learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.