Reinforcement Learning (7): Imitation Learning and Inverse Reinforcement Learning
Chen Kai BOSS

In previous chapters, we learned various reinforcement learning algorithms — from Q-Learning to PPO — all relying on an explicit reward function to guide learning. However, in many real-world scenarios, designing an appropriate reward function is extremely difficult:

  • Autonomous driving: What constitutes "good" driving behavior? Safety first? Comfort priority? Maximum efficiency? How do we balance these goals? How do we quantify "driving like an experienced driver" with a single number?
  • Robot manipulation: How do we write a reward function for teaching a robot to fold clothes, cook, or tidy a room? The final state is easy to define, but how much reward should each intermediate step receive?
  • Game AI: Making an AI learn human player styles, not just maximize scores. Some players prefer aggressive play, others prefer defensive strategies — how do we make AI imitate specific styles?
  • Dialogue systems: What makes a "good" conversation? Interesting? Helpful? Polite? How do we balance these objectives?

Imitation Learning provides a different path: instead of laboriously designing reward functions, learn directly from expert demonstrations. This is a very natural way of learning — humans learn this way too. Infants learn to walk and talk by imitating their parents, apprentices learn crafts by observing masters, and students learn math by imitating teachers' problem-solving methods.

This chapter systematically introduces the core imitation learning methods: from the simplest Behavioral Cloning to DAgger, which solves distribution shift; from Inverse Reinforcement Learning, which recovers a reward function, to end-to-end adversarial GAIL. We'll dive deep into each method's principles, pros and cons, applicable scenarios, and implementation details.

Imitation Learning Problem Setting

From Expert Demonstrations to Policy

Suppose we have an expert (can be human or another agent) who performs excellently on some task. We observe the expert's behavior and collect a demonstration dataset:

$$\mathcal{D} = \{(s_i, a_i^*)\}_{i=1}^{N}$$

where $(s_i, a_i^*)$ represents the expert's action $a_i^*$ taken in state $s_i$. This data may come from:

  • Human operator recordings (e.g., driving videos)
  • Teleoperation-collected data (e.g., controlling a robot with a joystick)
  • Another trained AI's demonstrations
  • An expert's historical decision records (e.g., a doctor's diagnoses)

Imitation learning's goal is: Learn a policy $\pi_\theta$ that behaves as closely as possible to the expert policy $\pi^*$.

Key points to note:

  1. We don't know what the expert's true policy $\pi^*$ is; we can only observe its behavior
  2. We don't have a reward function, so we can't evaluate whether an action is good or bad
  3. We usually cannot interact with the expert in real time (the expert may be busy or expensive)

Differences from Reinforcement Learning

Let's compare imitation learning and reinforcement learning:

| Aspect | Reinforcement Learning | Imitation Learning |
|---|---|---|
| Supervision signal | Reward function | Expert demonstrations |
| Signal characteristics | Sparse, delayed, requires trial-and-error | Direct, immediate, readily available |
| Interaction requirement | Must interact extensively with environment | Can learn completely offline |
| Goal | Maximize cumulative reward | Imitate expert behavior |
| Optimization | Trial-and-error (may need millions of interactions) | Similar to supervised learning (usually needs less data) |
| Exploration | Needs explicit exploration strategy | No exploration needed (expert already did) |
| Safety | Exploration may be risky | Relatively safe (imitating expert) |

Applicable scenarios for each method:

  • Reinforcement learning better when:
    • Clear reward function available
    • Safe extensive trial-and-error possible
    • Want to exceed human level
  • Imitation learning better when:
    • Reward function hard to define
    • High-quality expert demonstrations available
    • Want to replicate expert style
    • High safety requirements

Main Imitation Learning Methods

Imitation learning methods can be categorized as:

  1. Behavioral Cloning (BC)
    • Simplest, most direct method
    • Treats imitation learning as supervised learning
    • Problem: distribution shift
  2. Interactive Imitation Learning (Interactive IL)
    • Representative method: DAgger
    • Allows querying expert during learning
    • Solves distribution shift problem
  3. Inverse Reinforcement Learning (Inverse RL)
    • Recovers reward function from demonstrations
    • Then optimizes with standard RL
    • Deeper understanding of expert's objective
  4. Adversarial Imitation Learning (Adversarial IL)
    • Representative method: GAIL
    • Uses adversarial training to match expert distribution
    • End-to-end learning, no explicit reward needed

Behavioral Cloning

Basic Idea

Behavioral cloning is the most direct, simplest imitation learning method. Its core idea is:

Treat $(s_i, a_i^*)$ pairs as supervised learning training data, and learn a mapping from states to actions.

Formally, we minimize the difference between the expert action and the predicted action:

$$\theta^* = \arg\min_\theta \, \mathbb{E}_{(s, a^*) \sim \mathcal{D}} \left[ \mathcal{L}\big(\pi_\theta(s), a^*\big) \right]$$

Loss function choices:

For discrete action spaces, use cross-entropy loss:

$$\mathcal{L} = -\log \pi_\theta(a^* \mid s)$$

For continuous action spaces, there are multiple options:

  1. Mean squared error (deterministic policy): $\mathcal{L} = \|\pi_\theta(s) - a^*\|^2$
  2. Negative log-likelihood (Gaussian policy): $\mathcal{L} = -\log \mathcal{N}\big(a^*;\, \mu_\theta(s), \sigma_\theta^2(s)\big)$
  3. Mixture density network (multimodal distributions): $\mathcal{L} = -\log \sum_k w_k(s)\, \mathcal{N}\big(a^*;\, \mu_k(s), \sigma_k^2(s)\big)$
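Option 2 can be made concrete in a few lines. A minimal sketch of the Gaussian negative log-likelihood for a diagonal-Gaussian policy (the function name and interface are illustrative, not from the original):

```python
import numpy as np

def gaussian_nll(pred_mean, log_std, expert_action):
    """NLL of an expert action under a diagonal Gaussian policy
    with predicted mean `pred_mean` and log standard deviation `log_std`."""
    var = np.exp(2.0 * log_std)
    return 0.5 * np.sum(
        (expert_action - pred_mean) ** 2 / var
        + 2.0 * log_std
        + np.log(2.0 * np.pi)
    )
```

With a fixed standard deviation, minimizing this over the dataset reduces to mean squared error, which is why option 1 is a special case of option 2.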

Detailed Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader, TensorDataset


class BehavioralCloning:
    """
    Behavioral Cloning Agent

    Converts imitation learning to supervised learning,
    learning a policy from expert (state, action) pairs.
    """

    def __init__(self, state_dim, action_dim, hidden_dims=[256, 256],
                 lr=1e-3, continuous=False, dropout=0.1):
        """
        Initialize behavioral cloning model

        Args:
            state_dim: State space dimension
            action_dim: Action space dimension
            hidden_dims: Hidden layer dimensions list
            lr: Learning rate
            continuous: Whether action space is continuous
            dropout: Dropout rate (prevents overfitting)
        """
        self.continuous = continuous
        self.action_dim = action_dim

        # Build network
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_dim = hidden_dim

        if continuous:
            # Continuous actions: output mean (optionally variance)
            layers.append(nn.Linear(prev_dim, action_dim))
            layers.append(nn.Tanh())  # Assume action range [-1, 1]
            self.policy = nn.Sequential(*layers)
            self.criterion = nn.MSELoss()
        else:
            # Discrete actions: output action logits
            layers.append(nn.Linear(prev_dim, action_dim))
            self.policy = nn.Sequential(*layers)
            self.criterion = nn.CrossEntropyLoss()

        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)

        # Normalization statistics
        self.state_mean = None
        self.state_std = None

    def compute_normalization(self, states):
        """Compute state normalization parameters"""
        self.state_mean = np.mean(states, axis=0)
        self.state_std = np.std(states, axis=0) + 1e-8

    def normalize_state(self, state):
        """Normalize state"""
        if self.state_mean is not None:
            return (state - self.state_mean) / self.state_std
        return state

    def train(self, states, actions, epochs=100, batch_size=64,
              validation_split=0.1, early_stopping_patience=10):
        """
        Train behavioral cloning policy

        Args:
            states: State array [N, state_dim]
            actions: Action array [N] or [N, action_dim]
            epochs: Training epochs
            batch_size: Batch size
            validation_split: Validation set ratio
            early_stopping_patience: Early stopping patience
        """
        # Compute normalization parameters
        self.compute_normalization(states)
        states = self.normalize_state(states)

        # Split training and validation sets
        n_samples = len(states)
        n_val = int(n_samples * validation_split)
        indices = np.random.permutation(n_samples)
        train_idx, val_idx = indices[n_val:], indices[:n_val]

        train_states = torch.FloatTensor(states[train_idx])
        val_states = torch.FloatTensor(states[val_idx])

        if self.continuous:
            train_actions = torch.FloatTensor(actions[train_idx])
            val_actions = torch.FloatTensor(actions[val_idx])
        else:
            train_actions = torch.LongTensor(actions[train_idx])
            val_actions = torch.LongTensor(actions[val_idx])

        # Create data loader
        train_dataset = TensorDataset(train_states, train_actions)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

        # Training loop
        best_val_loss = float('inf')
        patience_counter = 0
        best_state = None
        train_losses = []
        val_losses = []

        for epoch in range(epochs):
            # Training
            self.policy.train()
            epoch_loss = 0
            for batch_states, batch_actions in train_loader:
                pred = self.policy(batch_states)
                loss = self.criterion(pred, batch_actions)

                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                epoch_loss += loss.item() * len(batch_states)

            avg_train_loss = epoch_loss / len(train_idx)
            train_losses.append(avg_train_loss)

            # Validation
            self.policy.eval()
            with torch.no_grad():
                val_pred = self.policy(val_states)
                val_loss = self.criterion(val_pred, val_actions).item()
            val_losses.append(val_loss)

            # Early stopping check
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
                # Clone tensors: state_dict() alone holds references that
                # later optimizer steps would overwrite in place
                best_state = {k: v.clone() for k, v in self.policy.state_dict().items()}
            else:
                patience_counter += 1
                if patience_counter >= early_stopping_patience:
                    print(f"Early stopping at epoch {epoch}")
                    self.policy.load_state_dict(best_state)
                    break

            if epoch % 10 == 0:
                print(f"Epoch {epoch}: train_loss={avg_train_loss:.4f}, "
                      f"val_loss={val_loss:.4f}")

        return train_losses, val_losses

    def get_action(self, state, deterministic=True):
        """Get action"""
        state = self.normalize_state(state)
        state = torch.FloatTensor(state).unsqueeze(0)

        self.policy.eval()
        with torch.no_grad():
            if self.continuous:
                action = self.policy(state).squeeze(0).numpy()
            else:
                logits = self.policy(state)
                if deterministic:
                    action = logits.argmax(dim=1).item()
                else:
                    probs = torch.softmax(logits, dim=1)
                    action = torch.multinomial(probs, 1).item()
        return action

    def evaluate(self, env, n_episodes=10, render=False):
        """Evaluate policy performance in environment"""
        rewards = []
        for _ in range(n_episodes):
            state = env.reset()
            done = False
            episode_reward = 0
            while not done:
                if render:
                    env.render()
                action = self.get_action(state)
                state, reward, done, _ = env.step(action)
                episode_reward += reward
            rewards.append(episode_reward)
        return np.mean(rewards), np.std(rewards)

The Distribution Shift Problem

Behavioral cloning appears simple and elegant, but it has a serious problem — distribution shift. Let's analyze this in detail.

The essence of the problem:

During training, we train the model on the expert trajectory state distribution $p_{\pi^*}(s)$. During testing, the policy $\pi_\theta$ generates its own trajectories and visits the state distribution $p_{\pi_\theta}(s)$.

Key problem: Since $\pi_\theta$ isn't perfect, it will:

  1. Make small mistakes (select suboptimal actions)
  2. These small mistakes lead to states the expert never visited
  3. In these new states, $\pi_\theta$ has seen no training data and may make bigger mistakes
  4. Errors accumulate, and the trajectory deviates further and further from the expert's

A concrete example: Autonomous driving

Suppose we train an autonomous driving model with behavioral cloning. The expert (human driver) always keeps the car in the center of the lane, so training data states are all "car in lane center."

The model learns how to drive in this state, but it's imperfect — sometimes drifting slightly left or right. Once the car slightly deviates from center:

  • This is a new state the model has never seen
  • The model may continue driving in the wrong direction
  • The car deviates more and more, and eventually goes off the road

Mathematical Analysis: Error Accumulation

Suppose at each timestep, the learned policy has probability $\epsilon$ of making an error (selecting a different action from the expert). In a $T$-step trajectory, how does the total error accumulate?

Let $p_t$ be the probability of being in a "correct" state (a state the expert would visit) at time $t$. Then:

  • $p_0 = 1$ (same initial state)
  • $p_{t+1} \geq p_t (1 - \epsilon)$ (we only stay correct if no error occurs)

So $p_t \geq (1 - \epsilon)^t$.

More rigorous analysis shows the expected total error is:

$$\mathbb{E}[\text{errors}] = O(\epsilon T^2)$$

Quadratic growth! This means:

  • If a task needs $T = 100$ steps, the error is amplified by a factor on the order of $T^2 = 10{,}000$
  • Even with 99% single-step accuracy, long tasks will fail
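The quadratic blow-up is easy to verify numerically. A small sketch under the pessimistic assumption that each on-distribution step errs with probability $\epsilon$ and an error is unrecoverable (the function name is illustrative):

```python
def time_off_distribution(eps, T):
    """Expected number of steps spent outside the expert's state
    distribution when each on-distribution step errs with
    probability eps and errors are unrecoverable."""
    p_on = 1.0   # probability of still being on-distribution
    off = 0.0    # expected off-distribution steps accumulated so far
    for _ in range(T):
        off += 1.0 - p_on
        p_on *= 1.0 - eps
    return off
```

For small $\epsilon T$, the result is close to $\epsilon T^2 / 2$: doubling the horizon roughly quadruples the time spent off-distribution.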

Mitigating Distribution Shift

Before introducing DAgger, let's look at some simple mitigation methods:

1. Data Augmentation

Add noise to states, simulating non-expert states policy might visit:

def augment_data(states, actions, noise_std=0.01):
    """Augment data by adding noise"""
    augmented_states = []
    augmented_actions = []

    for s, a in zip(states, actions):
        # Original data
        augmented_states.append(s)
        augmented_actions.append(a)

        # Noisy data
        for _ in range(5):
            noisy_s = s + np.random.normal(0, noise_std, s.shape)
            augmented_states.append(noisy_s)
            augmented_actions.append(a)  # Action unchanged

    return np.array(augmented_states), np.array(augmented_actions)

2. Expert Noise Injection

During data collection, have expert intentionally make small mistakes, then show how to recover:

def collect_data_with_noise(expert, env, n_episodes, noise_prob=0.1):
    """Collect data with recovery demonstrations"""
    data = []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            if np.random.random() < noise_prob:
                # Inject random action
                action = env.action_space.sample()
            else:
                # Expert action
                action = expert.get_action(state)

            next_state, reward, done, _ = env.step(action)

            # Record what the expert would do in this state
            # (even if a random action was executed)
            expert_action = expert.get_action(state)
            data.append((state, expert_action))

            state = next_state
    return data

3. Regularization and Ensembles

  • Use Dropout, L2 regularization to prevent overfitting
  • Train multiple models, average or vote
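The voting idea can be sketched in a few lines for discrete actions (the `models` interface — callables from state to action — is an assumption for illustration):

```python
import numpy as np

def ensemble_action(models, state):
    """Majority vote over an ensemble of discrete BC policies.
    `models` is a list of callables mapping a state to an action."""
    votes = [m(state) for m in models]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```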

But these methods cannot fundamentally solve distribution shift. The real solution requires obtaining new expert labels during learning.

DAgger: Dataset Aggregation

Core Idea

DAgger's (Dataset Aggregation) core idea is simple:

During learning, use current policy to interact with environment, collect new states, then query expert for correct actions in these states.

This way, even if policy makes mistakes and enters new states, we can get expert's correct actions in these states.

Algorithm Flow:

  1. Collect an initial dataset $\mathcal{D}_1$ with the expert policy $\pi^*$, and train an initial policy $\hat{\pi}_1$ on it
  2. For $i = 1, 2, \ldots, N$:
    • Run $\hat{\pi}_i$ in the environment to collect states $\{s_1, s_2, \ldots\}$
    • For each state, query the expert action $\pi^*(s)$
    • Add the new data to the dataset: $\mathcal{D}_{i+1} = \mathcal{D}_i \cup \{(s, \pi^*(s))\}$
    • Retrain policy $\hat{\pi}_{i+1}$ on $\mathcal{D}_{i+1}$

Key Insight:

DAgger breaks the vicious cycle of distribution shift:

Policy makes a mistake → enters a new state → receives an expert label → learns the correct action in that state

Theoretical Guarantee

DAgger has strong theoretical guarantees. Let $\epsilon$ be the training error (the average error rate on the training set) and $T$ the trajectory length.

Theorem (Ross et al., 2011): After a sufficient number of DAgger iterations, the best policy $\hat{\pi}$ among the iterates satisfies:

$$J(\hat{\pi}) \leq J(\pi^*) + O(\epsilon T)$$

Compared to behavioral cloning's $O(\epsilon T^2)$, this is linear, not quadratic!

Intuitive understanding:

  • Behavioral cloning's error accumulates and amplifies
  • By collecting expert labels in the states the policy actually visits, DAgger reduces the problem to the probability of error at each state
  • Each timestep independently contributes error probability $\epsilon$, for $O(\epsilon T)$ total over $T$ timesteps

Detailed Implementation

class DAgger:
    """
    DAgger (Dataset Aggregation) Algorithm

    Solves distribution shift by iteratively collecting data:
    1. Collect trajectories with current policy
    2. Query expert for correct actions in these states
    3. Add data to training set
    4. Retrain policy
    """

    def __init__(self, state_dim, action_dim, hidden_dims=[256, 256],
                 lr=1e-3, continuous=False):
        """Initialize DAgger"""
        self.bc = BehavioralCloning(
            state_dim, action_dim, hidden_dims, lr, continuous
        )
        self.dataset = {'states': [], 'actions': []}
        self.continuous = continuous
        self.action_dim = action_dim

    def collect_data_with_expert(self, env, expert_policy, n_episodes,
                                 use_learner=True, beta=0.5):
        """
        Collect data: use learner or mixed policy to collect trajectories,
        expert labels the actions

        Args:
            env: Environment
            expert_policy: Expert policy function
            n_episodes: Number of episodes to collect
            use_learner: Whether to use learner policy (vs pure expert)
            beta: Expert ratio in mixed policy (for safe learning)

        Returns:
            new_states: Newly collected states
            new_actions: Expert actions in these states
            episode_rewards: Episode rewards (using actually executed actions)
        """
        new_states = []
        new_actions = []
        episode_rewards = []

        for ep in range(n_episodes):
            state = env.reset()
            done = False
            episode_reward = 0

            while not done:
                # Decide which action to execute
                if not use_learner:
                    # Pure expert data collection
                    action = expert_policy(state)
                elif np.random.random() < beta:
                    # Mixed policy: use expert with probability beta
                    action = expert_policy(state)
                else:
                    # Use learner policy
                    action = self.bc.get_action(state)

                # Key: regardless of executed action, record expert's action as label
                expert_action = expert_policy(state)

                new_states.append(state)
                new_actions.append(expert_action)

                # Environment interaction
                next_state, reward, done, _ = env.step(action)
                state = next_state
                episode_reward += reward

            episode_rewards.append(episode_reward)

        return (np.array(new_states), np.array(new_actions),
                episode_rewards)

    def train(self, env, expert_policy, n_iterations=10,
              n_episodes_init=50, n_episodes_per_iter=20,
              epochs_per_iter=50, beta_schedule='linear'):
        """
        DAgger training main loop

        Args:
            env: Environment
            expert_policy: Expert policy
            n_iterations: Number of iterations
            n_episodes_init: Episodes for initial data collection
            n_episodes_per_iter: Episodes per iteration
            epochs_per_iter: Training epochs per iteration
            beta_schedule: Mixing ratio schedule
                - 'linear': Linear decay
                - 'constant': Keep constant
                - 'exponential': Exponential decay
        """
        rewards_history = []

        # Round 0: Collect initial data with expert
        print("Collecting initial expert data...")
        states, actions, _ = self.collect_data_with_expert(
            env, expert_policy, n_episodes_init, use_learner=False
        )
        self.dataset['states'].extend(states)
        self.dataset['actions'].extend(actions)

        # Train initial policy
        all_states = np.array(self.dataset['states'])
        all_actions = np.array(self.dataset['actions'])
        self.bc.train(all_states, all_actions, epochs=epochs_per_iter)

        # DAgger iterations
        for iteration in range(n_iterations):
            # Compute current beta (mixing ratio)
            if beta_schedule == 'linear':
                beta = max(0.1, 1 - iteration / n_iterations)
            elif beta_schedule == 'exponential':
                beta = 0.5 ** (iteration + 1)
            else:  # constant
                beta = 0.5

            # Collect new data
            states, actions, episode_rewards = self.collect_data_with_expert(
                env, expert_policy, n_episodes_per_iter,
                use_learner=True, beta=beta
            )

            # Aggregate dataset
            self.dataset['states'].extend(states)
            self.dataset['actions'].extend(actions)

            # Retrain on complete dataset
            all_states = np.array(self.dataset['states'])
            all_actions = np.array(self.dataset['actions'])
            self.bc.train(all_states, all_actions, epochs=epochs_per_iter)

            # Evaluate current policy
            eval_reward, _ = self.bc.evaluate(env, n_episodes=10)
            rewards_history.append(eval_reward)

            print(f"Iteration {iteration+1}: beta={beta:.2f}, "
                  f"dataset_size={len(self.dataset['states'])}, "
                  f"eval_reward={eval_reward:.2f}")

        return rewards_history

    def get_action(self, state):
        """Get action"""
        return self.bc.get_action(state)

DAgger Variants

1. SafeDAgger

In some applications (like autonomous driving), letting the learner take full control may be dangerous. SafeDAgger adds a "guardrail" mechanism:

def safe_dagger_step(state, learner, expert, safety_threshold, continuous=False):
    """Safe DAgger execution step"""
    learner_action = learner.get_action(state)
    expert_action = expert(state)

    # Compute difference between learner and expert actions
    if continuous:
        diff = np.linalg.norm(learner_action - expert_action)
    else:
        diff = 0 if learner_action == expert_action else 1

    # If difference too large, use expert action (safety measure)
    if diff > safety_threshold:
        return expert_action, expert_action  # Execute expert, label expert
    else:
        return learner_action, expert_action  # Execute learner, label expert

2. Sample-Efficient DAgger

Not every state needs expert labeling. We can query selectively:

  • Only query on "uncertain" states (use an ensemble of models to measure uncertainty)
  • Only query on "important" states (use the advantage function or TD error to measure importance)
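The uncertainty-gated query can be sketched with ensemble disagreement (the threshold value and the interface are illustrative assumptions):

```python
import numpy as np

def should_query_expert(prob_list, threshold=0.05):
    """Query the expert only when ensemble members disagree.

    prob_list: action-probability vectors, one per ensemble member.
    Disagreement is the mean per-action variance across members."""
    probs = np.stack(prob_list)
    disagreement = probs.var(axis=0).mean()
    return disagreement > threshold
```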

3. HG-DAgger (Human-Gated DAgger)

Let human expert decide when to intervene:

def hgdagger_episode(env, learner, human_expert):
    """DAgger with human deciding when to intervene"""
    state = env.reset()
    done = False
    data = []

    while not done:
        learner_action = learner.get_action(state)

        # Show to human, ask if correction needed
        human_action = human_expert.maybe_correct(state, learner_action)

        if human_action is not None:
            # Human intervenes
            action = human_action
            data.append((state, human_action))
        else:
            # Human thinks learner is doing right
            action = learner_action

        state, _, done, _ = env.step(action)

    return data

DAgger Limitations

  1. Requires interactive expert: In many scenarios, we cannot query expert anytime (expert may be historical data, deceased master, etc.)

  2. Heavy expert burden: Expert needs to provide labels for many states, which may be very time-consuming

  3. Expert must be perfect: DAgger assumes expert always gives correct answers, but human experts also make mistakes or are inconsistent

  4. Safety: May visit dangerous states during learning

When we cannot use interactive expert, we need other methods — inverse reinforcement learning and GAIL.

Inverse Reinforcement Learning (IRL)

Problem Setting and Motivation

The methods above (BC and DAgger) learn direct state-to-action mappings. But there's a deeper question: why does the expert act this way?

If we can understand the expert's objective (its reward function), we can:

  1. Generalize to situations the expert hasn't demonstrated
  2. Understand the "intent" behind the expert's behavior
  3. Apply the same objective in different environments

Inverse Reinforcement Learning (IRL) takes this approach:

Infer reward function from expert demonstrations, then use standard RL methods to optimize this reward.

Formally, given expert demonstrations $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$, IRL seeks a reward function $r$ such that:

  1. Under $r$, the expert policy is (approximately) optimal
  2. The expert policy achieves higher cumulative reward under $r$ than other policies

Reward Ambiguity Problem

A fundamental challenge in IRL is reward ambiguity: given demonstrations, there may be infinitely many consistent reward functions!

Example: Consider the most extreme case — the reward function that is identically zero: $r(s, a) \equiv 0$. Under this reward, all policies are optimal (every cumulative reward is 0), so the expert policy is of course also optimal. But this reward is completely uninformative.

More generally, any reward function under which the expert policy happens to be optimal is consistent with the demonstrations. We need some regularization or assumptions to select a "good" reward function.

Maximum Entropy Inverse Reinforcement Learning

Maximum Entropy IRL (Ziebart et al., 2008) solves this problem with an elegant assumption:

Among all choices that explain the demonstrations equally well, the expert is assumed to prefer the one with maximum entropy.

In other words, the expert doesn't arbitrarily prefer certain actions — if two actions are equally good, the expert chooses between them at random.

This leads to an expert policy of the form:

$$\pi^*(a \mid s) \propto \exp\big(Q^*(s, a)\big)$$

Or, more generally, for entire trajectories:

$$p(\tau) \propto \exp\left(\sum_t r(s_t, a_t)\right)$$

Intuition: High-reward trajectories are exponentially preferred, but the optimal trajectory is not chosen deterministically. This is a kind of "soft optimality."
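For a finite set of candidate trajectories, the MaxEnt distribution $p(\tau) \propto \exp(R(\tau))$ is just a softmax over returns; a minimal sketch:

```python
import numpy as np

def soft_optimal_trajectory_probs(returns):
    """MaxEnt trajectory distribution p(tau) ∝ exp(R(tau)),
    computed stably by subtracting the max return before exponentiating."""
    z = np.exp(returns - returns.max())
    return z / z.sum()
```

A trajectory with one extra unit of reward becomes $e \approx 2.72$ times more likely, but lower-return trajectories keep nonzero probability — the "soft" in soft optimality.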

Objective Function

Maximum Entropy IRL's objective is maximizing the likelihood of the expert trajectories:

$$\max_\psi \sum_{\tau \in \mathcal{D}} \log p_\psi(\tau)$$

Expanding the probability:

$$p_\psi(\tau) = \frac{1}{Z} \exp\big(R_\psi(\tau)\big), \qquad Z = \int \exp\big(R_\psi(\tau)\big)\, d\tau$$

where $Z$ is the partition function, integrating over all possible trajectories.

Thus the objective becomes:

$$\mathcal{L}(\psi) = \sum_{\tau \in \mathcal{D}} R_\psi(\tau) - N \log Z$$

Gradient Computation

Taking the gradient with respect to $\psi$:

$$\nabla_\psi \mathcal{L} = \sum_{\tau \in \mathcal{D}} \nabla_\psi R_\psi(\tau) - N \nabla_\psi \log Z$$

The second term $\nabla_\psi \log Z$ can be written as:

$$\nabla_\psi \log Z = \mathbb{E}_{\tau \sim p_\psi} \big[\nabla_\psi R_\psi(\tau)\big]$$

where the expectation is over trajectories of the soft-optimal policy under the current reward.

Final gradient:

$$\nabla_\psi \mathcal{L} = \sum_{\tau \in \mathcal{D}} \nabla_\psi R_\psi(\tau) - N\, \mathbb{E}_{\tau \sim p_\psi}\big[\nabla_\psi R_\psi(\tau)\big]$$

Intuitive explanation:

  • First term: increase the reward of expert trajectories
  • Second term: decrease the reward of current-policy trajectories
  • At convergence both terms are equal, meaning the current policy's trajectory distribution matches the expert's
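For the classic linear-reward case $r_\psi(s, a) = \psi^\top f(s, a)$, this gradient reduces to a difference of feature expectations; a minimal sketch (the function names and the precomputed feature arrays are illustrative assumptions):

```python
import numpy as np

def maxent_grad(expert_feats, policy_feats):
    """MaxEnt IRL gradient for a linear reward r(s,a) = psi . f(s,a):
    expert feature expectation minus current-policy feature expectation."""
    return expert_feats.mean(axis=0) - policy_feats.mean(axis=0)

def maxent_update(psi, expert_feats, policy_feats, lr=0.1):
    """One gradient-ascent step on the reward weights."""
    return psi + lr * maxent_grad(expert_feats, policy_feats)
```

The weights move toward features the expert exhibits and away from features the current policy over-produces; at convergence the two feature expectations match.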

Detailed Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np


class MaxEntIRL:
    """
    Maximum Entropy Inverse Reinforcement Learning

    Learns a reward function from expert demonstrations, then optimizes with RL.

    Core idea:
    1. Assume expert policy is soft-optimal: π*(a|s) ∝ exp(Q*(s,a))
    2. Maximize expert trajectory likelihood
    3. Gradient = expert feature expectation - current policy feature expectation
    """

    def __init__(self, state_dim, action_dim, hidden_dim=128,
                 reward_lr=1e-3, policy_lr=1e-3, continuous=False):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.continuous = continuous

        # Reward network: r(s, a) -> scalar
        self.reward_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        self.reward_optimizer = optim.Adam(
            self.reward_net.parameters(), lr=reward_lr
        )

        # Policy network
        if continuous:
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, action_dim),
                nn.Tanh()
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
            policy_params = list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )
            policy_params = self.policy.parameters()

        self.policy_optimizer = optim.Adam(policy_params, lr=policy_lr)

    def compute_reward(self, states, actions):
        """Compute reward r(s, a)"""
        if not isinstance(states, torch.Tensor):
            states = torch.FloatTensor(states)
        if not isinstance(actions, torch.Tensor):
            if self.continuous:
                actions = torch.FloatTensor(actions)
            else:
                actions = torch.LongTensor(actions)
                actions = torch.nn.functional.one_hot(
                    actions, self.action_dim
                ).float()

        if len(states.shape) == 1:
            states = states.unsqueeze(0)
        if len(actions.shape) == 1:
            actions = actions.unsqueeze(0)

        inputs = torch.cat([states, actions], dim=-1)
        return self.reward_net(inputs).squeeze(-1)

    def get_action(self, state, deterministic=False):
        """Sample action from policy"""
        state = torch.FloatTensor(state).unsqueeze(0)

        if self.continuous:
            mean = self.policy_mean(state)
            if deterministic:
                return mean.squeeze(0).detach().numpy()
            std = torch.exp(self.policy_log_std)
            action = mean + std * torch.randn_like(mean)
            return action.squeeze(0).detach().numpy()
        else:
            probs = self.policy(state)
            if deterministic:
                return probs.argmax(dim=1).item()
            return torch.multinomial(probs, 1).item()

IRL Challenges and Extensions

1. Computational Complexity

After each reward update, need to retrain policy (inner loop). This makes IRL much slower than direct imitation learning.

2. Reward Shaping

Learned reward function may not be the "true" reward, just one function that can explain expert behavior.

3. Deep IRL

Modern methods parameterize reward function with neural networks, can handle high-dimensional states. Representative methods include: - Deep MaxEnt IRL - Guided Cost Learning - AIRL (Adversarial IRL)

GAIL: Generative Adversarial Imitation Learning

Core Idea

GAIL (Generative Adversarial Imitation Learning) combines imitation learning with GANs, providing an end-to-end solution.

Core idea:

Train a discriminator to distinguish expert trajectories from policy-generated trajectories, while training policy to "fool" discriminator.

This is exactly like GANs:

  • Generator = policy $\pi_\theta$: generates trajectories
  • Discriminator = $D_\phi$: distinguishes expert trajectories from generated ones

When policy successfully fools discriminator, its behavior becomes indistinguishable from expert — exactly the goal of imitation learning!

Mathematical Formulation

GAIL optimizes the following objective:

$$\min_\pi \max_D \; \mathbb{E}_\pi[\log D(s, a)] + \mathbb{E}_{\pi^*}[\log(1 - D(s, a))] - \lambda H(\pi)$$

where:

  • $D(s, a)$: discriminator, outputs the probability that $(s, a)$ is from the policy $\pi$ (not the expert)
  • $H(\pi)$: policy entropy, $H(\pi) = \mathbb{E}_\pi[-\log \pi(a \mid s)]$
  • $\lambda$: entropy regularization coefficient

Discriminator optimization:

For fixed $\pi$, the discriminator maximizes its ability to distinguish. The optimal discriminator is:

$$D^*(s, a) = \frac{\rho_\pi(s, a)}{\rho_\pi(s, a) + \rho_{\pi^*}(s, a)}$$

where $\rho_\pi$ is the policy's state-action occupancy measure.

Policy optimization:

The policy wants to minimize $\mathbb{E}_\pi[\log D(s, a)]$, i.e., make the discriminator believe its trajectories come from the expert.

Key insight: The discriminator output can serve as a reward signal!

  • If $D(s, a) \approx 0$ (discriminator thinks expert), then $r(s, a) = \log(1 - D(s, a)) \approx 0$ (no penalty)
  • If $D(s, a) \approx 1$ (discriminator thinks policy), then $r(s, a) \to -\infty$ (heavy penalty)

In practice, a small constant is added inside the logarithm for numerical stability:

$$r(s, a) = \log\big(1 - D(s, a) + \epsilon\big)$$
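The reward computation is a one-liner worth sanity-checking numerically (the function name is illustrative; the sign convention is the one used in this chapter, where $D \to 1$ means "from policy"):

```python
import numpy as np

def gail_reward(d, eps=1e-8):
    """GAIL reward from discriminator output d = P((s,a) from policy).
    d near 0 (expert-like): reward near 0 (the maximum).
    d near 1 (policy-like): large negative reward (penalty)."""
    return np.log(1.0 - d + eps)
```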

Relationship to IRL

GAIL can be viewed as implicit IRL:

  • Traditional IRL: Explicitly learns reward function, then solves RL
  • GAIL: Discriminator implicitly defines reward function, jointly optimized with policy

Ho & Ermon (2016) proved GAIL is equivalent to Maximum Entropy IRL in terms of occupancy measure matching.

Detailed Implementation

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.distributions import Categorical, Normal

class GAIL:
    """
    Generative Adversarial Imitation Learning

    Core idea:
    1. Discriminator learns to distinguish expert and policy-generated (s, a) pairs
    2. Policy learns to fool the discriminator
    3. The discriminator output serves as the policy's reward signal
    4. Use PPO to optimize the policy
    """

    def __init__(self, state_dim, action_dim, hidden_dim=256,
                 disc_lr=3e-4, policy_lr=3e-4, continuous=False):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.continuous = continuous

        # Discriminator: D(s, a) -> [0, 1]
        # Output near 1 means from policy, near 0 means from expert.
        # Discrete actions are one-hot encoded, so the input size is
        # state_dim + action_dim in both cases
        disc_input_dim = state_dim + action_dim
        self.discriminator = nn.Sequential(
            nn.Linear(disc_input_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )
        self.disc_optimizer = optim.Adam(
            self.discriminator.parameters(), lr=disc_lr
        )

        # Policy network
        if continuous:
            self.policy_mean = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Tanh()
            )
            self.policy_log_std = nn.Parameter(torch.zeros(action_dim))
            policy_params = list(self.policy_mean.parameters()) + [self.policy_log_std]
        else:
            self.policy = nn.Sequential(
                nn.Linear(state_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, action_dim),
                nn.Softmax(dim=-1)
            )
            policy_params = self.policy.parameters()

        self.policy_optimizer = optim.Adam(policy_params, lr=policy_lr)

        # Value network (for PPO)
        self.value = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1)
        )
        self.value_optimizer = optim.Adam(self.value.parameters(), lr=policy_lr)

    def get_action(self, state, deterministic=False):
        """Sample an action (deterministic mode returns only the action)"""
        state = torch.FloatTensor(state).unsqueeze(0)

        if self.continuous:
            mean = self.policy_mean(state)
            if deterministic:
                return mean.squeeze(0).detach().numpy()
            std = torch.exp(self.policy_log_std)
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum(dim=-1)
            return action.squeeze(0).detach().numpy(), log_prob.detach()
        else:
            probs = self.policy(state)
            if deterministic:
                return probs.argmax(dim=-1).item()
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action).detach()

    def compute_gail_reward(self, states, actions):
        """
        Compute the GAIL reward

        r(s, a) = log(1 - D(s, a))

        D near 0 (expert-like) -> high reward
        D near 1 (policy-like) -> low reward (penalty)
        """
        disc_input = self.get_disc_input(states, actions)

        with torch.no_grad():
            d = self.discriminator(disc_input)
            rewards = torch.log(1 - d + 1e-8).squeeze(-1)

        return rewards.numpy()

    def get_disc_input(self, states, actions):
        """Prepare discriminator input"""
        if not isinstance(states, torch.Tensor):
            states = torch.FloatTensor(states)

        if self.continuous:
            if not isinstance(actions, torch.Tensor):
                actions = torch.FloatTensor(actions)
        else:
            if not isinstance(actions, torch.Tensor):
                actions = torch.LongTensor(actions)
            actions = torch.nn.functional.one_hot(
                actions, self.action_dim
            ).float()

        return torch.cat([states, actions], dim=-1)

    def update_discriminator(self, expert_states, expert_actions,
                             policy_states, policy_actions,
                             n_updates=1):
        """
        Update the discriminator

        Goal: distinguish expert and policy-generated (s, a) pairs
        - Expert data label: 0 (D output should be low)
        - Policy data label: 1 (D output should be high)
        """
        expert_input = self.get_disc_input(expert_states, expert_actions)
        policy_input = self.get_disc_input(policy_states, policy_actions)

        n_expert = len(expert_states)
        n_policy = len(policy_states)
        batch_size = min(64, n_expert, n_policy)

        total_loss = 0
        for _ in range(n_updates):
            expert_idx = np.random.choice(n_expert, batch_size, replace=False)
            policy_idx = np.random.choice(n_policy, batch_size, replace=False)

            expert_batch = expert_input[expert_idx]
            policy_batch = policy_input[policy_idx]

            expert_pred = self.discriminator(expert_batch)
            policy_pred = self.discriminator(policy_batch)

            # Binary cross-entropy: push expert predictions toward 0
            # and policy predictions toward 1
            expert_loss = -torch.log(1 - expert_pred + 1e-8).mean()
            policy_loss = -torch.log(policy_pred + 1e-8).mean()

            disc_loss = expert_loss + policy_loss

            self.disc_optimizer.zero_grad()
            disc_loss.backward()
            self.disc_optimizer.step()

            total_loss += disc_loss.item()

        return total_loss / n_updates
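To get a feel for the reward r(s, a) = log(1 - D(s, a)) used in `compute_gail_reward`, here is the mapping from discriminator outputs to rewards, evaluated at a few points:

```python
import math

# r(s, a) = log(1 - D(s, a)): expert-like pairs (D near 0) get a reward
# near 0, policy-like pairs (D near 1) get a large negative reward.
for d in [0.01, 0.5, 0.99]:
    print(round(math.log(1 - d + 1e-8), 3))  # -0.01, -0.693, -4.605
```

So the policy is never given positive rewards; it is only penalized less the more expert-like its behavior looks to the discriminator.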

GAIL Advantages and Limitations

Advantages:

  1. No explicit reward needed: Discriminator implicitly learns reward structure
  2. End-to-end training: Policy and "reward" jointly optimized, no two-stage
  3. Sample efficient: More efficient than MaxEnt IRL
  4. Handles high-dimensional: Can handle high-dimensional states and continuous actions
  5. Theoretical guarantee: Has guarantees in occupancy measure matching sense

Limitations:

  1. Requires environment interaction: Cannot learn purely offline
  2. Training unstable: GAN training itself is unstable
  3. Mode collapse: May only learn part of expert behavior
  4. Hard to interpret: No explicit reward function

GAIL Variants

1. AIRL (Adversarial Inverse Reinforcement Learning)

AIRL modifies the discriminator structure so that it can recover an explicit reward function:

D(s, a, s') = exp(f(s, a, s')) / (exp(f(s, a, s')) + π(a|s))

where f decomposes into a reward term and a shaping term:

f(s, a, s') = g(s, a) + γ h(s') - h(s)

Here g(s, a) recovers the reward and h(s) is a potential-based shaping function.

2. VAIL (Variational Adversarial Imitation Learning)

VAIL uses a variational information bottleneck to improve training stability.

3. SAM (State-only Adversarial Mimicking)

When actions are unobservable (like learning from video), SAM only uses state matching.

Method Comparison and Selection Guide

Comprehensive Comparison

Method       Interactive Expert   Env Interaction   Sample Efficiency       Implementation Complexity   Theoretical Guarantee   Interpretability
BC           Not needed           Not needed        High (but with drift)   Low                         Weak                    Medium
DAgger       Needed               Needed            Medium-High             Low                         Strong                  Medium
MaxEnt IRL   Not needed           Needed            Low                     High                        Strong                  High
GAIL         Not needed           Needed            Medium                  Medium                      Medium                  Low

Selection Guide

Choose Behavioral Cloning when:

  • You have lots of expert data
  • The task is relatively simple (short time horizon)
  • Compute resources are limited
  • You cannot interact with the environment

Choose DAgger when:

  • You can query the expert at any time
  • Expert labeling cost is not high
  • You need to handle long-horizon tasks
  • You need a theoretical guarantee

Choose Inverse RL when:

  • You need to understand the expert's objective
  • You need to generalize across different environments
  • You need an interpretable reward function
  • You have sufficient compute resources

Choose GAIL when:

  • You cannot query the expert
  • You need high-quality imitation
  • You can interact with the environment
  • The state/action spaces are large

Advanced Topics

Multimodal Expert Behavior

An expert may take different actions in the same state. For example, when avoiding an obstacle, the expert can turn either left or right.

Standard BC trained with a unimodal loss learns the "average" of these behaviors, which may drive straight into the obstacle!
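A tiny numerical illustration of this failure, using hypothetical steering data: when half the demonstrations turn left (-1) and half turn right (+1), the MSE-optimal constant prediction is their mean, which steers straight ahead.

```python
import numpy as np

# Hypothetical steering demonstrations at the same obstacle state:
# half the experts turn left (-1.0), half turn right (+1.0).
expert_actions = np.array([-1.0, 1.0, -1.0, 1.0])

# The MSE-optimal constant prediction is the mean of the targets.
mse_optimal = expert_actions.mean()
print(mse_optimal)  # 0.0 -> steer straight into the obstacle
```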

Solutions:

  1. Mixture Density Networks (MDN)
class MDNPolicy(nn.Module):
    """Mixture Density Network: outputs a Gaussian mixture distribution"""
    def __init__(self, state_dim, action_dim, n_components=5, hidden_dim=128):
        super().__init__()
        self.n_components = n_components
        self.action_dim = action_dim

        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Mixture weights
        self.pi_layer = nn.Linear(hidden_dim, n_components)
        # Mean and standard deviation for each component
        self.mu_layer = nn.Linear(hidden_dim, n_components * action_dim)
        self.sigma_layer = nn.Linear(hidden_dim, n_components * action_dim)

    def forward(self, state):
        h = self.shared(state)

        pi = torch.softmax(self.pi_layer(h), dim=-1)  # Mixture weights
        mu = self.mu_layer(h).view(-1, self.n_components, self.action_dim)
        sigma = torch.exp(self.sigma_layer(h)).view(-1, self.n_components, self.action_dim)

        return pi, mu, sigma

    def sample(self, state):
        pi, mu, sigma = self(state)

        # Select a component for each batch element
        k = torch.multinomial(pi, 1).squeeze(-1)

        # Sample from the selected component (independent noise per element)
        idx = torch.arange(len(mu))
        action = mu[idx, k] + sigma[idx, k] * torch.randn_like(mu[idx, k])
        return action
  2. Conditional VAE (CVAE)

Learn latent behavior modes, then generate actions conditioned on a sampled latent code.
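A minimal sketch of such a conditional policy, under assumed names (`CVAEPolicy`, latent size `z_dim`): the encoder infers a latent mode from (s, a), the decoder reconstructs the action from (s, z), and at test time a mode is sampled from the prior.

```python
import torch
import torch.nn as nn

class CVAEPolicy(nn.Module):
    """Conditional VAE: encode (s, a) into a latent mode z, decode (s, z) into an action."""
    def __init__(self, state_dim, action_dim, z_dim=2, hidden_dim=64):
        super().__init__()
        self.z_dim = z_dim
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * z_dim)  # mean and log-variance of q(z|s,a)
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + z_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state, action):
        h = self.encoder(torch.cat([state, action], dim=-1))
        mu, log_var = h.chunk(2, dim=-1)
        # Reparameterization trick
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        recon = self.decoder(torch.cat([state, z], dim=-1))
        # KL divergence of q(z|s,a) from the standard normal prior
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1).mean()
        return recon, kl

    def act(self, state):
        # At test time, sample a behavior mode from the prior and decode
        z = torch.randn(state.shape[0], self.z_dim)
        return self.decoder(torch.cat([state, z], dim=-1))
```

Training minimizes reconstruction error plus the KL term; different samples of z at test time produce different behavior modes.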

  3. Info-GAIL

Add latent variables to GAIL to learn distinct behavior modes.

Learning from Suboptimal Demonstrations

In reality, expert demonstrations often aren't optimal. How can we handle this?

1. Weighted Behavioral Cloning

Give higher-quality demonstrations higher weights:

def weighted_bc_loss(predictions, expert_actions, quality_scores,
                     temperature=1.0):
    """
    Weighted behavioral cloning loss

    quality_scores: quality score for each demonstration
                    (can be return, human rating, etc.)
    """
    weights = torch.softmax(quality_scores / temperature, dim=0)
    # Per-sample losses, e.g. MSE for continuous actions
    losses = ((predictions - expert_actions) ** 2).mean(dim=-1)
    return (weights * losses).sum()

2. Learning to Rank

Instead of learning absolutely good actions, learn which demonstrations are better than others (relative preferences).
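One common instantiation of this idea, in the spirit of preference-based reward learning methods such as T-REX, is a Bradley-Terry style loss over pairs of trajectories; the function name and signature here are assumptions for illustration:

```python
import torch

def ranking_loss(reward_better, reward_worse):
    """
    Bradley-Terry style pairwise ranking loss.

    reward_better / reward_worse: predicted per-step rewards (1-D tensors)
    for a trajectory judged better / worse. The loss pushes the summed
    predicted return of the better trajectory above that of the worse one.
    """
    r_b = reward_better.sum()
    r_w = reward_worse.sum()
    # -log P(better > worse) under the Bradley-Terry model
    return -torch.log_softmax(torch.stack([r_b, r_w]), dim=0)[0]
```

A reward model trained with this loss needs only rankings of demonstrations, not absolute quality labels.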

3. Self-Improvement

First imitate, then improve with RL:

def iterative_improvement(bc_agent, env, n_rounds=5):
    """Iterative self-improvement (collect_trajectories, filter_by_return
    and bc_agent.train are placeholders for your own implementations)."""
    for round_idx in range(n_rounds):
        # Collect data with the current policy
        trajectories = collect_trajectories(bc_agent, env)

        # Keep only the best trajectories (e.g. top 20% by return)
        good_trajectories = filter_by_return(trajectories, top_k=0.2)

        # Retrain on the filtered data
        bc_agent.train(good_trajectories)
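The `filter_by_return` helper above is left unspecified; a minimal version, assuming each trajectory is a dict with a `rewards` list, might look like:

```python
def filter_by_return(trajectories, top_k=0.2):
    """Keep the top_k fraction of trajectories, ranked by total return."""
    scored = sorted(trajectories,
                    key=lambda traj: sum(traj["rewards"]),
                    reverse=True)
    n_keep = max(1, int(len(scored) * top_k))
    return scored[:n_keep]
```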

Cross-Domain Imitation Learning

When expert and learner have different state/action spaces:

1. Third-Person Imitation Learning

Learn from video (third-person view), but execute from first-person view.

2. Cross-Morphology Transfer

Robot A demonstrates, Robot B imitates (different body structures).

3. Domain Adaptation

Align state representations from different domains.

Practical Advice

Data Collection

  1. Data quality more important than quantity
  2. Ensure covering diverse scenarios
  3. Record expert's "recovery" behavior (correcting from mistakes)
  4. Avoid obvious errors in demonstrations

Training Tips

  1. State normalization: Standardize input states
  2. Data augmentation: Add noise, crop, rotate, etc.
  3. Regularization: Dropout, L2 regularization to prevent overfitting
  4. Early stopping: Monitor validation set performance
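For the state normalization tip, a simple running normalizer (a common pattern, not tied to any specific library; names here are illustrative) can be implemented with Welford's online algorithm:

```python
import numpy as np

class RunningNormalizer:
    """Tracks a running mean/std of states and standardizes inputs."""
    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)  # sum of squared deviations (Welford)
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)
```

Update it on every observed state during data collection, and apply `normalize` to all policy inputs at both training and test time.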

Evaluation Methods

  1. Trajectory similarity: Distance to expert trajectories
  2. Task success rate: Ratio of completed tasks
  3. Cumulative reward: Total reward obtained in environment
  4. Human evaluation: Let humans judge behavior quality
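As one concrete (assumed) instance of trajectory similarity, a simple metric is the mean distance between time-aligned states of a policy rollout and the nearest expert trajectory:

```python
import numpy as np

def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between time-aligned states of two
    equal-length trajectories (arrays of shape [T, state_dim])."""
    return float(np.linalg.norm(traj_a - traj_b, axis=1).mean())

def similarity_to_experts(policy_traj, expert_trajs):
    """Distance from a policy trajectory to the closest expert trajectory."""
    return min(trajectory_distance(policy_traj, e) for e in expert_trajs)
```

For variable-length or time-warped trajectories, a dynamic-time-warping distance is the usual refinement.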

Summary

Imitation learning provides a learning paradigm that doesn't rely on explicit reward functions, learning a policy by observing expert demonstrations:

  1. Behavioral Cloning is simple and direct, but suffers from distribution shift; it suits short-horizon tasks with plenty of data

  2. DAgger mitigates drift through interactive learning; it requires a queryable expert and has theoretical guarantees

  3. Inverse RL recovers a reward function, providing interpretability and generalization, but is computationally expensive

  4. GAIL uses adversarial training for end-to-end imitation; it is currently the most popular method and balances performance and implementation complexity

Each method has its pros and cons; the choice depends on the specific application: whether an interactive expert is available, whether environment interaction is possible, interpretability requirements, compute resources, and so on.

Imitation learning has broad applications in robotics, autonomous driving, game AI, and dialogue systems. It complements reinforcement learning: when reward functions are hard to define, imitation learning provides another path; when we need to exceed the expert, reinforcement learning is more suitable. Combining both (e.g., initializing with imitation learning and fine-tuning with RL) often achieves the best results.

In the next chapter, we'll learn about AlphaGo and Monte Carlo Tree Search, seeing how deep learning combined with traditional planning methods achieves superhuman performance in complex games like Go.

References

  1. Pomerleau, D. A. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS.
  2. Ross, S., Gordon, G., & Bagnell, D. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS.
  3. Ziebart, B. D., et al. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI.
  4. Ho, J., & Ermon, S. (2016). Generative Adversarial Imitation Learning. NIPS.
  5. Fu, J., Luo, K., & Levine, S. (2018). Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. ICLR.
  6. Abbeel, P., & Ng, A. Y. (2004). Apprenticeship Learning via Inverse Reinforcement Learning. ICML.
  7. Finn, C., Levine, S., & Abbeel, P. (2016). Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. ICML.

Q&A: Frequently Asked Questions

Q1: What's the difference between imitation learning and supervised learning?

A: The main difference is the data distribution. Supervised learning assumes training and test data come from the same distribution (the i.i.d. assumption), but in imitation learning the test-time state distribution depends on the learned policy, which differs from the expert state distribution seen at training time. This is the root of the distribution shift problem.

Q2: How to tell if expert data is enough?

A: You can judge through:

  • Learning curve: does adding more data still improve performance?
  • Validation error: is it close to the training error (a check for overfitting)?
  • State space coverage: do the demonstrations cover the states the policy is likely to encounter?

Q3: GAIL training is unstable, what to do?

A: Try:

  • Adjust the discriminator and policy update frequencies (usually the discriminator updates more often)
  • Use gradient penalty or spectral normalization to stabilize the discriminator
  • Lower the learning rate
  • Increase the entropy regularization coefficient
  • Use WGAN variants
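As one example of the gradient penalty mentioned here, in the style of WGAN-GP, a sketch assuming a PyTorch discriminator `disc` and batches of expert and policy discriminator inputs:

```python
import torch

def gradient_penalty(disc, expert_input, policy_input):
    """WGAN-GP style penalty: push the discriminator's gradient norm
    toward 1 on random interpolations of expert and policy samples."""
    alpha = torch.rand(expert_input.size(0), 1)
    mixed = alpha * expert_input + (1 - alpha) * policy_input
    mixed.requires_grad_(True)

    d_mixed = disc(mixed)
    grads = torch.autograd.grad(outputs=d_mixed.sum(), inputs=mixed,
                                create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```

The penalty (scaled by a coefficient, often 10) is added to the discriminator loss before the backward pass.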

Q4: When to use BC, when to use GAIL?

A: BC is suitable for: simple tasks, lots of data, no ability to interact with the environment. GAIL is suitable for: complex tasks, limited data, the ability to interact with the environment. If BC can solve the problem, prefer BC (simpler and faster).

Q5: How to handle noise and errors in expert demonstrations?

A:

  • Data cleaning: filter out obviously wrong demonstrations
  • Weighted learning: give higher-quality demonstrations higher weights
  • Learning to rank: learn relative preferences instead of absolute actions
  • Robust loss functions: e.g., Huber loss is more robust to outliers

Q6: Can imitation learning exceed expert?

A: Pure imitation learning theoretically cannot exceed the expert (the goal is to copy the expert). But you can:

  • Initialize with imitation learning, then fine-tune with RL
  • Aggregate the strengths of multiple experts
  • Self-improve in states where the expert performs poorly

Q7: How to handle multiple experts?

A:

  • If the expert policies are similar: directly mix the data
  • If the expert styles differ: learn a multimodal policy (MDN, CVAE)
  • If the expert skill levels differ: weight the data or use only the best expert

Q8: How to evaluate imitation learning effectiveness?

A: Evaluate from multiple angles:

  • Cumulative reward (if a reward function is available)
  • Similarity to expert trajectories
  • Task success rate
  • Human subjective evaluation
  • Robustness on out-of-distribution states

  • Post title: Reinforcement Learning (7): Imitation Learning and Inverse Reinforcement Learning
  • Post author: Chen Kai
  • Create time: 2024-09-06 10:15:00
  • Post link: https://www.chenk.top/reinforcement-learning-7-imitation-learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.