Reinforcement Learning (9): Multi-Agent Reinforcement Learning
Chen Kai BOSS

When multiple agents interact in the same environment, the fundamental assumption of single-agent reinforcement learning - environment stationarity - breaks down. In autonomous driving, each vehicle is an agent whose decisions affect others; in multiplayer games, opponents' strategy evolution determines your optimal strategy; in robot collaboration, team success depends on each member's coordination. Multi-Agent Reinforcement Learning (MARL) studies how to enable multiple agents to learn cooperative, competitive, or mixed strategies in complex interactions. This field's challenges far exceed single-agent RL: the environment appears non-stationary from each agent's perspective (other agents are constantly learning), credit assignment becomes difficult (how to attribute individual contributions when the team succeeds?), and partial observability intensifies uncertainty (you can't see what teammates are doing). But these challenges also bring new opportunities - through modeling other agents, communication protocols, centralized training with decentralized execution, and other techniques, MARL has achieved breakthroughs in complex tasks like StarCraft, Dota 2, and autonomous driving simulation. DeepMind's AlphaStar reached StarCraft Grandmaster level, OpenAI Five defeated world champions in Dota 2 - these successes demonstrate MARL as a key path toward general intelligence. This chapter will start from game theory's mathematical foundations, progressively delving into independent learning, value decomposition, multi-agent Actor-Critic methods, with complete code to implement QMIX algorithm in cooperative tasks.

Core Challenges of Multi-Agent Systems

Challenge 1: Non-Stationarity

In single-agent RL, the environment transition probability $P(s'|s, a)$ is fixed (or slowly changing). But in multi-agent systems, from agent $i$'s perspective:

$$P(s'|s, a^i) = \sum_{a^{-i}} P(s'|s, a^i, a^{-i}) \prod_{j \neq i} \pi^j(a^j|s)$$

where $a^{-i}$ represents the other agents' actions. When other agents' policies $\pi^j$ constantly change during training, the environment becomes non-stationary. This causes:

  • Past collected experience $(s, a^i, r, s')$ quickly becomes outdated
  • The Q-function target $r + \gamma \max_{a'} Q(s', a')$ continuously drifts
  • Convergence guarantees fail - even simple Q-learning may oscillate or diverge

Example: In a soccer game, if opponents evolve from "random kicking" to "defensive counterattack", your "full-court press" strategy's value suddenly collapses.

Challenge 2: Credit Assignment

In cooperative tasks, the team receives a global reward $R$, but each agent's individual contribution must be inferred. Two extreme problems:

Lazy Agent: If agent $i$ discovers that "doing nothing" still earns the team reward (because its teammates are strong), it stops learning useful behaviors.

Relative Overgeneralization: Suppose each agent has two actions $A$ and $B$, with shared joint rewards:

  • All agents choose $A$: reward 10
  • All agents choose $B$: reward 8
  • Mixed choices: reward 0

Ideally training converges to $(A, \dots, A)$, but if one agent learns $B$ first, the others also learn $B$ to coordinate, and the team ends up stuck at the suboptimal equilibrium $(B, \dots, B)$.

Mathematically: The global reward $R$ isn't simply the sum of individual rewards:

$$R(s, \mathbf{a}) \neq \sum_{i=1}^n r^i(s, a^i)$$

How to infer each agent's contribution $r^i$ from $R$ is a core challenge.

Challenge 3: Partial Observability

In many tasks, each agent only observes local information (like robot sensor range, game field of view). This forms a Partially Observable Stochastic Game (POSG):

  • A global state $s$ exists, but agent $i$ only sees a local observation $o^i = O(s, i)$
  • Agents need to make decisions based on the observation history $\tau^i = (o^i_1, a^i_1, \dots, o^i_t)$
  • This requires memory mechanisms (like RNNs) to infer hidden states

Example: In StarCraft, you can't see enemy forces in unexplored map areas, must infer opponent strategy from scouting history.

Challenge 4: Scalability

The joint action space grows exponentially with agent count:

$$|\mathcal{A}| = \prod_{i=1}^n |\mathcal{A}^i|$$

For $n = 10$ agents, each with 5 actions, the joint space reaches $5^{10} \approx 10^7$. A centralized Q-function $Q(s, a^1, \dots, a^n)$ cannot handle this.
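A minimal sketch of this blow-up, assuming the 5-actions-per-agent setting from the text:

```python
# Joint action space size grows exponentially with the number of agents.
def joint_action_space_size(n_agents, n_actions_per_agent=5):
    return n_actions_per_agent ** n_agents

sizes = {n: joint_action_space_size(n) for n in (2, 5, 10)}
print(sizes)  # {2: 25, 5: 3125, 10: 9765625}
```

At 10 agents the table a centralized Q-learner would need is already nearly $10^7$ entries per state.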

Game Theory Foundations: Understanding Multi-Agent Interaction

Markov Game

Multi-agent systems can be formalized as a Markov Game (also called a stochastic game) $\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}, P, \{r^i\}, \gamma \rangle$:

  • $\mathcal{N} = \{1, \dots, n\}$: agent set
  • $\mathcal{S}$: global state space
  • $\mathcal{A}^i$: agent $i$'s action space
  • $P(s'|s, \mathbf{a})$: transition function, where $\mathbf{a} = (a^1, \dots, a^n)$ is the joint action
  • $r^i(s, \mathbf{a})$: agent $i$'s reward function
  • $\gamma$: discount factor

Each agent $i$ learns a policy $\pi^i(a^i|s)$, aiming to maximize its cumulative reward:

$$J^i(\pi^1, \dots, \pi^n) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r^i_t\right]$$

Note that $J^i$ depends on all agents' policies, not just $\pi^i$.

Nash Equilibrium

Definition: The policy combination $(\pi^{1*}, \dots, \pi^{n*})$ is a Nash equilibrium if for all agents $i$ and all alternative policies $\pi^i$:

$$J^i(\pi^{i*}, \pi^{-i*}) \geq J^i(\pi^i, \pi^{-i*})$$

That is: given the other agents' policies, no agent can gain higher reward by unilaterally changing its strategy.

Example: Prisoner's Dilemma

Two prisoners simultaneously choose "cooperate" or "defect", with reward matrix (row player, column player):

              Cooperate   Defect
Cooperate     (3, 3)      (0, 5)
Defect        (5, 0)      (1, 1)

  • (Cooperate, Cooperate) is Pareto optimal (total reward 6 is highest)
  • But (Defect, Defect) is the only Nash equilibrium - defecting is always better regardless of the opponent's choice

Properties:
1. Existence: Any finite game has at least one mixed-strategy Nash equilibrium (Nash, 1950)
2. Non-uniqueness: Multiple equilibria may exist (as in coordination games)
3. Suboptimality: A Nash equilibrium isn't necessarily Pareto optimal (as in the prisoner's dilemma)
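The Prisoner's Dilemma is small enough to verify the equilibrium claim by brute force. A quick sketch (actions encoded as 0 = Cooperate, 1 = Defect; the payoff table is the one above):

```python
import itertools

# Payoff matrix: payoff[(a1, a2)] = (reward to player 1, reward to player 2)
payoff = {
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

def pure_nash_equilibria(payoff, n_actions=2):
    equilibria = []
    for a1, a2 in itertools.product(range(n_actions), repeat=2):
        r1, r2 = payoff[(a1, a2)]
        # Nash condition: no unilateral deviation improves either player's reward
        best1 = all(payoff[(d, a2)][0] <= r1 for d in range(n_actions))
        best2 = all(payoff[(a1, d)][1] <= r2 for d in range(n_actions))
        if best1 and best2:
            equilibria.append((a1, a2))
    return equilibria

print(pure_nash_equilibria(payoff))  # [(1, 1)] -> (Defect, Defect) is the only one
```

The enumeration confirms that (Defect, Defect) is the unique pure-strategy equilibrium even though (Cooperate, Cooperate) Pareto-dominates it.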

Pareto Optimality

Definition: The policy combination $\pi = (\pi^1, \dots, \pi^n)$ is Pareto optimal if no other policy combination $\pi'$ exists such that:

  • For all agents: $J^i(\pi') \geq J^i(\pi)$
  • For at least one agent: $J^j(\pi') > J^j(\pi)$

That is: one cannot improve any agent's reward without hurting some other agent.

Special case for cooperative tasks: If all agents share the reward, $r^1 = \dots = r^n = r$, the objective becomes:

$$\max_{\pi^1, \dots, \pi^n} \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r_t\right]$$

Here Nash equilibrium and Pareto optimality coincide, and the problem simplifies to team optimization.

Zero-Sum Games and Minimax

In zero-sum games, $\sum_i r^i(s, \mathbf{a}) = 0$ (one side's gain is the other's loss). Here the Nash equilibrium can be computed via the minimax theorem:

$$V^* = \max_{\pi^1} \min_{\pi^2} \mathbb{E}[R] = \min_{\pi^2} \max_{\pi^1} \mathbb{E}[R]$$

This is the theoretical foundation for game-playing AIs like AlphaGo.
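A small numerical sketch of why mixed strategies matter here, using matching pennies as an assumed example game (not from the text above):

```python
import numpy as np

# Matching pennies: A[i, j] is player 1's payoff (zero-sum: player 2 gets -A).
A = np.array([[1, -1],
              [-1, 1]])

# With pure strategies, the max-min and min-max values do not meet:
maxmin_pure = A.min(axis=1).max()  # best payoff player 1 can guarantee
minmax_pure = A.max(axis=0).min()  # best bound player 2 can enforce
print(maxmin_pure, minmax_pure)    # -1 1 -> no pure-strategy equilibrium

# The uniform mixed strategy closes the gap: the game's value is 0.
p = q = np.array([0.5, 0.5])
value = p @ A @ q
print(value)  # 0.0
```

This gap between $-1$ and $1$ under pure play, closed at $0$ by randomization, is exactly the content of the minimax theorem.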

Independent Learning vs Joint Learning

Independent Learning

Simplest approach: each agent treats other agents as part of environment, independently runs single-agent RL algorithms (like DQN, PPO).

Advantages:
  • Simple: directly reuses existing algorithms
  • Scalable: complexity linear in agent count

Disadvantages:
  • Non-stationarity: other agents' policy changes make the environment unstable
  • No coordination: agents cannot predict or adapt to teammate behavior
  • Poor convergence: learning may oscillate or get stuck in suboptimal equilibria

Empirical success: Despite the theoretical issues, independent learning still works in some tasks - when the environment is stable enough, or when the agents are numerous enough that the other agents' average behavior is relatively stable.

Joint Learning

Learn a joint policy $\pi(\mathbf{a}|s)$ or a joint Q-function $Q(s, \mathbf{a})$.

Advantages:
  • Considers coordination between agents
  • Can converge to the team optimum

Disadvantages:
  • Exponential complexity: the joint action space is $\prod_i |\mathcal{A}^i|$
  • Centralized execution: needs a centralized controller, unsuitable for distributed scenarios

Compromise: Centralized Training with Decentralized Execution (CTDE) - use global information during training, while each agent relies only on its local observation during execution.

Value Decomposition Methods: From VDN to QMIX

Core idea of value decomposition: decompose the global Q-function $Q_{tot}(s, \mathbf{a})$ into a combination of local Q-functions $Q^i(o^i, a^i)$, maintaining decentralized execution while leveraging centralized training.

VDN: Value Decomposition Networks

VDN (Value Decomposition Networks, 2017) assumes the global Q-function is a simple sum of local Q-functions:

$$Q_{tot}(s, \mathbf{a}) = \sum_{i=1}^n Q^i(o^i, a^i)$$

Training:
  • Centralized: train $Q_{tot}$ against a TD target built from the global reward: $y = r + \gamma \max_{\mathbf{a}'} Q_{tot}(s', \mathbf{a}')$
  • Decentralized: during execution, each agent $i$ independently selects $a^i = \arg\max_{a} Q^i(o^i, a)$

Key property: Individual greedy = joint greedy:

$$\arg\max_{\mathbf{a}} Q_{tot}(s, \mathbf{a}) = \left(\arg\max_{a^1} Q^1(o^1, a^1), \dots, \arg\max_{a^n} Q^n(o^n, a^n)\right)$$

because the summation is decomposable: maximizing each term independently maximizes the sum.

Limitation: Linear decomposition limits expressiveness. Consider a scenario:

  • Two agents in a corridor: if both move they collide (negative reward), otherwise the reward is 0
  • The true $Q_{tot}$ needs to capture this exclusivity: the value of one agent's "move" depends on the other agent's action
  • But a linear sum $Q^1(o^1, a^1) + Q^2(o^2, a^2)$ cannot represent this interaction
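The corridor argument can be checked numerically: fit the best additive model $Q^1(a^1) + Q^2(a^2)$ to a joint payoff with an interaction term and observe a nonzero residual. The reward values here ($-1$ for a collision, $0$ otherwise) are assumed for illustration:

```python
import numpy as np

# Joint Q-table: rows = agent 1's action (stay, move), cols = agent 2's.
# Both moving collides (-1); every other joint action is fine (0).
Q_true = np.array([[0.0, 0.0],
                   [0.0, -1.0]])

# Best additive fit Q1(a1) + Q2(a2): least squares over the 4 joint actions.
# Parameter vector: [Q1(stay), Q1(move), Q2(stay), Q2(move)]
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
y = Q_true.flatten()
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
fit = X @ theta
print(np.abs(fit - y).max())  # 0.25 -> the interaction cannot be represented
```

Even the optimal additive decomposition is off by 0.25 on every joint action, which is what motivates the nonlinear mixer in QMIX.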

QMIX: Monotonic Value Decomposition

QMIX (2018) uses a neural network to mix the $Q^i$, but maintains a monotonicity constraint:

$$\frac{\partial Q_{tot}}{\partial Q^i} \geq 0, \quad \forall i$$

This ensures that individually greedy actions remain jointly greedy (Individual-Global-Max, or IGM, consistency).

Network architecture:

$$Q_{tot} = f_{mix}(Q^1, \dots, Q^n; s)$$

where the mixing network $f_{mix}$ is:
1. Hypernetworks: generate mixing weights $W_1, W_2$ and biases $b_1, b_2$ from the global state $s$
2. A two-layer feedforward network: $Q_{tot} = W_2^\top \, \text{ELU}(W_1^\top \mathbf{q} + b_1) + b_2$, with $\mathbf{q} = (Q^1, \dots, Q^n)$
3. Monotonicity: $W_1, W_2 \geq 0$ (ensured by taking the absolute value of the hypernetwork outputs)

Intuition:
  • When $Q^i$ increases, the global $Q_{tot}$ doesn't decrease → an agent improving its local value never hurts the team
  • The weights $W_1, W_2$ depend on the state $s$ → agents' importance varies across situations (e.g., in StarCraft, scouts matter early, the main force matters late)

Training is the same as in VDN: update all parameters with the global TD error.

Performance: QMIX significantly outperforms VDN on StarCraft Multi-Agent Challenge (SMAC) because it captures nonlinear coordination patterns.
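The monotonicity property is easy to verify empirically. A minimal sketch of an abs-weight mixer (biases omitted, sizes arbitrary): bumping any single agent's Q-value never lowers the mixed $Q_{tot}$, regardless of the random hypernetwork weights.

```python
import torch

torch.manual_seed(0)

# Toy QMIX-style mixer: weights come from hypernetworks over the state,
# and torch.abs keeps them non-negative (the monotonicity constraint).
n_agents, state_dim, hidden = 3, 4, 8
hyper_w1 = torch.nn.Linear(state_dim, n_agents * hidden)
hyper_w2 = torch.nn.Linear(state_dim, hidden)

def mix(agent_qs, state):
    w1 = torch.abs(hyper_w1(state)).view(n_agents, hidden)
    w2 = torch.abs(hyper_w2(state))
    hidden_layer = torch.relu(agent_qs @ w1)
    return (hidden_layer * w2).sum()

state = torch.randn(state_dim)
qs = torch.randn(n_agents)
base = mix(qs, state)

# Finite-difference check: raising any single agent's Q never lowers Q_tot.
for i in range(n_agents):
    bumped = qs.clone()
    bumped[i] += 1.0
    assert mix(bumped, state) >= base
```

Non-negative weights plus monotone activations (ReLU here) make every path from $Q^i$ to $Q_{tot}$ non-decreasing, which is exactly the IGM guarantee.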

QTRAN: More General Decomposition

QMIX's limitation: The monotonicity constraint still limits expressiveness. For example, if the optimal joint action requires some agent $i$ to choose a locally suboptimal $a^i$ (to coordinate with the team), QMIX may fail.

QTRAN (2019) proposes a more general decomposition that does not require monotonicity, instead enforcing the IGM condition with additional networks: it learns a joint $Q_{jt}(s, \mathbf{a})$ and a state value $V(s)$ alongside the local $Q^i$, and requires

$$\sum_{i=1}^n Q^i(o^i, a^i) - Q_{jt}(s, \mathbf{a}) + V(s) \geq 0$$

with equality when $\mathbf{a}$ is the per-agent greedy joint action.

This is implemented through two additional losses:
1. Optimal action loss $L_{opt}$: enforces the equality at the greedy joint action
2. Suboptimal action penalty $L_{nopt}$: enforces the inequality for all other joint actions

Effect: QTRAN outperforms QMIX on some tasks requiring complex coordination, but training is unstable and computationally expensive.

QPLEX: Duplex Dueling Architecture

QPLEX (2021) combines the dueling architecture with value decomposition, splitting $Q_{tot}$ into a state value and an advantage:

$$Q_{tot}(s, \mathbf{a}) = V(s) + A(s, \mathbf{a})$$

and maintains the IGM condition through transformation layers on the advantages. This improves sample efficiency and stability.

Multi-Agent Actor-Critic Methods

MADDPG: Multi-Agent DDPG

MADDPG (Multi-Agent DDPG, 2017) extends DDPG to multi-agent, adopting CTDE paradigm:

Training phase:
  • Each agent $i$ has an Actor $\mu^i(o^i)$ and a Critic $Q^i(s, a^1, \dots, a^n)$
  • The Critic uses global information, capturing other agents' influence
  • The Actor only uses the local observation $o^i$

Update rules:
1. Critic update (similar to DDPG):

$$L(\theta^i) = \mathbb{E}\left[\left(Q^i(s, a^1, \dots, a^n) - y\right)^2\right], \quad y = r^i + \gamma\, Q^i_{targ}\big(s', \mu^1_{targ}(o'^1), \dots, \mu^n_{targ}(o'^n)\big)$$

where the $\mu^j_{targ}$ are target Actor networks.

2. Actor update (deterministic policy gradient):

$$\nabla_{\theta^i} J = \mathbb{E}\left[\nabla_{\theta^i} \mu^i(o^i) \, \nabla_{a^i} Q^i(s, a^1, \dots, a^n)\big|_{a^i = \mu^i(o^i)}\right]$$

Execution phase: Each agent only uses its own Actor $\mu^i(o^i)$.

Advantages:
  • Applicable to continuous action spaces
  • Each agent's Critic conditions on the other agents' actions → mitigates non-stationarity

Challenges:
  • Requires communication, or the assumption that other agents' actions are observable during training
  • When the agent count $n$ is large, the Critic's input dimension explodes

COMA: Counterfactual Multi-Agent Policy Gradients

COMA (2018) uses counterfactual baseline to solve credit assignment:

Core idea: Agent $i$'s advantage function isn't $Q(s, \mathbf{a}) - V(s)$, but:

$$A^i(s, \mathbf{a}) = Q(s, \mathbf{a}) - \sum_{a^{i\prime}} \pi^i(a^{i\prime} \mid \tau^i) \, Q\big(s, (a^{-i}, a^{i\prime})\big)$$

The latter term is the counterfactual baseline - the expected Q-value if agent $i$ chose randomly according to its current policy while the other agents' actions stay fixed.

Intuition: $A^i$ measures the marginal contribution of choosing $a^i$ compared to agent $i$'s average behavior. This correctly attributes agent $i$'s contribution, even when the global reward is shared.

Network architecture:
  • Centralized Critic: $Q(s, \mathbf{a})$
  • Decentralized Actors: $\pi^i(a^i \mid \tau^i)$

Implementation details:
  • The Critic outputs Q-values for all of agent $i$'s actions in a single forward pass, so the $|\mathcal{A}^i|$ counterfactual terms $Q(s, (a^{-i}, a^{i\prime}))$ are cheap to evaluate
  • An RNN processes the partial observation history

Performance: COMA excels in tasks requiring fine-grained credit assignment (like StarCraft micromanagement).
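The counterfactual baseline is easy to work out on a toy Q-table. A sketch with assumed numbers (2 agents, 2 actions, a shared $Q(s, a^1, a^2)$ table and a uniform policy for agent 1):

```python
import numpy as np

# Hypothetical shared Q-table: rows = agent 1's action, cols = agent 2's action.
Q = np.array([[10.0, 0.0],
              [0.0, 8.0]])
pi1 = np.array([0.5, 0.5])  # agent 1's current (uniform) policy

# Counterfactual baseline for agent 1, with agent 2 fixed at a2 = 0:
# marginalize out agent 1's own action under its policy.
a2 = 0
baseline = sum(pi1[a] * Q[a, a2] for a in range(2))  # 0.5*10 + 0.5*0 = 5.0

# Advantage of each of agent 1's actions, holding a2 fixed:
adv = [Q[a, a2] - baseline for a in range(2)]
print(adv)  # [5.0, -5.0]
```

Action 0 scores $+5$ above agent 1's own average while action 1 scores $-5$, so the gradient pushes probability toward the action that actually helps - even though the reward itself is shared.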

MAPPO: Multi-Agent PPO

MAPPO is PPO's direct extension to the multi-agent setting, using a centralized (shared) value function. Research in 2021 showed that a well-tuned MAPPO matches QMIX/MADDPG on many tasks with more stable training.

Key tricks:
  • A centralized value function $V(s)$ instead of independent $V(o^i)$
  • Global state and value normalization
  • Long rollout lengths (like 2048 steps)

AlphaStar: Multi-Agent RL in StarCraft

Task Challenges

StarCraft II is an extremely complex multi-agent task:
  • State space: astronomically large (far exceeding Go's $\sim 10^{170}$ states), including the map, units, resources, and tech tree
  • Action space: each step can select hundreds of units, with hundreds of actions per unit
  • Partial observability: fog of war hides unexplored areas
  • Long-term planning: an average game lasts 20 minutes (~30k steps)
  • Multi-agent coordination: control dozens to hundreds of combat units

AlphaStar's Architecture

AlphaStar (2019) combines multiple techniques:

1. Policy network:
  • Input: spatial features (map, unit positions) + scalar features (resources, population)
  • Encoder: a ResNet processes spatial features, an MLP processes scalar features, a Transformer integrates them
  • Core: an LSTM maintains long-term memory (hidden state)
  • Output: hierarchical action selection - select which unit (or unit group), then the action type (move/attack/build...), then the target location or unit

2. Value network: estimates the win rate

3. Training process:
  • Supervised learning: learn an imitation policy from human replays (like AlphaGo)
  • Reinforcement learning: optimize through self-play
  • Policy gradient: the V-trace algorithm (off-policy correction)
  • Opponent pool: maintain historical policies to prevent cycles (like rock-paper-scissors)
  • League training:
    - Main Agents: continuously learning
    - Main Exploiters: specifically find the main agents' weaknesses
    - League Exploiters: counter the entire opponent pool

4. Multi-agent control:
  • Each combat unit is an "agent", sharing the policy network
  • An attention mechanism (Transformer) aggregates all units' representations
  • A pointer network selects target units

Performance and Impact

In January 2019, AlphaStar won exhibition matches against human professional players 10:1. In October 2019, its final version reached Grandmaster level (top 0.2% of ranked players) on the public ladder.

Key innovations:
  • League training solves strategy diversity - it avoids overfitting to a single strategy
  • The hierarchical action space decomposes the raw combinatorial action space into tractable subproblems
  • Long-term memory (LSTM) enables agents to track scouting information from minutes ago

Limitations:
  • Requires huge computing power (16 TPUs per agent, 14 days of training)
  • Fixed races (only trained on specific race matchups)
  • The APM (actions per minute) limit was still higher than the human average

Complete Code Implementation: QMIX Cooperative Task

Below is a simplified QMIX implementation for the Multi-Agent Particle Environment's cooperative navigation task. Task: 3 agents must cover 3 landmarks, and each agent only sees local information.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

# ============ Simplified Multi-Agent Environment ============
class CooperativeNavigation:
    """3 agents cooperatively cover 3 landmarks"""
    def __init__(self, n_agents=3, n_landmarks=3, world_size=2.0):
        self.n_agents = n_agents
        self.n_landmarks = n_landmarks
        self.world_size = world_size
        self.reset()

    def reset(self):
        # Randomly initialize agent and landmark positions
        self.agent_pos = np.random.uniform(-self.world_size, self.world_size, (self.n_agents, 2))
        self.landmark_pos = np.random.uniform(-self.world_size, self.world_size, (self.n_landmarks, 2))
        self.t = 0
        return self._get_obs()

    def _get_obs(self):
        """Each agent's local observation: own position + all landmarks' relative positions"""
        obs = []
        for i in range(self.n_agents):
            agent_obs = np.concatenate([
                self.agent_pos[i],  # own position (2D)
                (self.landmark_pos - self.agent_pos[i]).flatten()  # landmark relative positions (6D)
            ])
            obs.append(agent_obs)
        return obs

    def _get_state(self):
        """Global state: all agents' and landmarks' absolute positions"""
        return np.concatenate([
            self.agent_pos.flatten(),
            self.landmark_pos.flatten()
        ])

    def step(self, actions):
        """
        actions: list of [dx, dy] for each agent
        """
        # Move agents (limit max speed 0.1)
        for i, action in enumerate(actions):
            velocity = np.clip(action, -0.1, 0.1)
            self.agent_pos[i] += velocity
            # Boundary constraint
            self.agent_pos[i] = np.clip(self.agent_pos[i], -self.world_size, self.world_size)

        # Reward: negative sum over landmarks of the distance to the nearest agent
        reward = 0
        for landmark in self.landmark_pos:
            min_dist = min(np.linalg.norm(agent - landmark) for agent in self.agent_pos)
            reward -= min_dist

        self.t += 1
        done = self.t >= 25  # max steps

        return self._get_obs(), reward, done, self._get_state()

# ============ Agent Network ============
class AgentNetwork(nn.Module):
    """Each agent's Q network"""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs):
        x = torch.relu(self.fc1(obs))
        x = torch.relu(self.fc2(x))
        q = self.fc3(x)
        return q

# ============ QMIX Mixer Network ============
class QMixerNetwork(nn.Module):
    """Mix agents' Q-values"""
    def __init__(self, n_agents, state_dim, hidden_dim=32):
        super().__init__()
        self.n_agents = n_agents

        # Hypernetwork: generate weights and biases from state
        # First layer weights: (n_agents, hidden_dim)
        self.hyper_w1 = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_agents * hidden_dim)
        )
        # First layer bias: (hidden_dim,)
        self.hyper_b1 = nn.Linear(state_dim, hidden_dim)

        # Second layer weights: (hidden_dim, 1)
        self.hyper_w2 = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # Second layer bias: (1,)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, agent_qs, state):
        """
        agent_qs: (batch, n_agents) - selected Q-values of agents
        state: (batch, state_dim)
        returns: (batch, 1) - mixed global Q-value
        """
        batch_size = agent_qs.size(0)

        # Generate first layer weights and biases
        w1 = torch.abs(self.hyper_w1(state))  # Keep positive (monotonicity)
        w1 = w1.view(batch_size, self.n_agents, -1)  # (batch, n_agents, hidden_dim)
        b1 = self.hyper_b1(state).unsqueeze(1)  # (batch, 1, hidden_dim)

        # First layer computation
        agent_qs = agent_qs.unsqueeze(2)  # (batch, n_agents, 1)
        hidden = torch.relu((w1 * agent_qs).sum(dim=1) + b1.squeeze(1))  # (batch, hidden_dim)

        # Generate second layer weights and biases
        w2 = torch.abs(self.hyper_w2(state)).unsqueeze(1)  # (batch, 1, hidden_dim)
        b2 = self.hyper_b2(state)  # (batch, 1)

        # Second layer computation
        q_tot = (w2 * hidden.unsqueeze(1)).sum(dim=2) + b2
        return q_tot

# ============ QMIX Agent ============
class QMIXAgent:
    def __init__(self, n_agents, obs_dim, n_actions, state_dim,
                 lr=5e-4, gamma=0.99, buffer_size=5000, batch_size=32):
        self.n_agents = n_agents
        self.n_actions = n_actions
        self.gamma = gamma
        self.batch_size = batch_size

        # Create Q network for each agent
        self.agent_nets = [AgentNetwork(obs_dim, n_actions) for _ in range(n_agents)]
        self.target_agent_nets = [AgentNetwork(obs_dim, n_actions) for _ in range(n_agents)]

        # QMIX mixer network
        self.mixer = QMixerNetwork(n_agents, state_dim)
        self.target_mixer = QMixerNetwork(n_agents, state_dim)

        # Synchronize target networks
        for target_net, net in zip(self.target_agent_nets, self.agent_nets):
            target_net.load_state_dict(net.state_dict())
        self.target_mixer.load_state_dict(self.mixer.state_dict())

        # Optimizer (shared by all networks)
        params = []
        for net in self.agent_nets:
            params += list(net.parameters())
        params += list(self.mixer.parameters())
        self.optimizer = optim.Adam(params, lr=lr)

        # Experience replay
        self.replay_buffer = deque(maxlen=buffer_size)

        # epsilon-greedy
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.05

    def select_actions(self, obs, explore=True):
        """Select actions (epsilon-greedy)"""
        actions = []
        for i, o in enumerate(obs):
            if explore and random.random() < self.epsilon:
                # Explore: randomly select discrete action ID
                action_id = random.randint(0, self.n_actions - 1)
            else:
                # Exploit: select action with max Q-value
                with torch.no_grad():
                    o_tensor = torch.FloatTensor(o).unsqueeze(0)
                    q_values = self.agent_nets[i](o_tensor)
                    action_id = q_values.argmax(dim=1).item()

            # Convert discrete action ID to continuous control (5 actions: up/down/left/right/stop)
            action = self._id_to_action(action_id)
            actions.append(action)

        return actions

    def _id_to_action(self, action_id):
        """Discrete action ID -> continuous control"""
        actions_map = {
            0: [0.0, 0.0],   # stop
            1: [0.0, 0.1],   # up
            2: [0.0, -0.1],  # down
            3: [-0.1, 0.0],  # left
            4: [0.1, 0.0]    # right
        }
        return actions_map[action_id]

    def store_transition(self, obs, actions, reward, next_obs, done, state, next_state):
        """Store experience"""
        # Convert continuous actions back to IDs
        action_ids = [self._action_to_id(a) for a in actions]
        self.replay_buffer.append((obs, action_ids, reward, next_obs, done, state, next_state))

    def _action_to_id(self, action):
        """Continuous control -> discrete action ID"""
        actions_map = {
            (0.0, 0.0): 0,
            (0.0, 0.1): 1,
            (0.0, -0.1): 2,
            (-0.1, 0.0): 3,
            (0.1, 0.0): 4
        }
        return actions_map[tuple(action)]

    def train(self):
        """Train one step"""
        if len(self.replay_buffer) < self.batch_size:
            return None

        # Sample batch
        batch = random.sample(self.replay_buffer, self.batch_size)
        obs, actions, rewards, next_obs, dones, states, next_states = zip(*batch)

        # Convert to tensors
        # obs: list of list of arrays -> (batch, n_agents, obs_dim)
        obs_tensor = torch.FloatTensor(np.array(obs))
        next_obs_tensor = torch.FloatTensor(np.array(next_obs))
        actions_tensor = torch.LongTensor(actions)  # (batch, n_agents)
        rewards_tensor = torch.FloatTensor(rewards).unsqueeze(1)  # (batch, 1)
        dones_tensor = torch.FloatTensor(dones).unsqueeze(1)
        states_tensor = torch.FloatTensor(states)
        next_states_tensor = torch.FloatTensor(next_states)

        # Compute current Q-values
        chosen_action_qs = []
        for i in range(self.n_agents):
            q_values = self.agent_nets[i](obs_tensor[:, i, :])  # (batch, n_actions)
            chosen_q = q_values.gather(1, actions_tensor[:, i].unsqueeze(1))  # (batch, 1)
            chosen_action_qs.append(chosen_q)
        chosen_action_qs = torch.cat(chosen_action_qs, dim=1)  # (batch, n_agents)

        # Mix current Q-values
        q_tot = self.mixer(chosen_action_qs, states_tensor)  # (batch, 1)

        # Compute target Q-values
        with torch.no_grad():
            # Max Q-values for next state
            next_max_qs = []
            for i in range(self.n_agents):
                next_q_values = self.target_agent_nets[i](next_obs_tensor[:, i, :])
                next_max_q = next_q_values.max(dim=1, keepdim=True)[0]  # (batch, 1)
                next_max_qs.append(next_max_q)
            next_max_qs = torch.cat(next_max_qs, dim=1)  # (batch, n_agents)

            # Mix target Q-values
            target_q_tot = self.target_mixer(next_max_qs, next_states_tensor)
            target = rewards_tensor + self.gamma * target_q_tot * (1 - dones_tensor)

        # TD loss
        loss = nn.MSELoss()(q_tot, target)

        # Update networks
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.optimizer.param_groups[0]['params'], 10.0)
        self.optimizer.step()

        return loss.item()

    def update_target_networks(self):
        """Soft update target networks"""
        tau = 0.01
        for target_net, net in zip(self.target_agent_nets, self.agent_nets):
            for target_param, param in zip(target_net.parameters(), net.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

        for target_param, param in zip(self.target_mixer.parameters(), self.mixer.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    def decay_epsilon(self):
        """Decay exploration rate"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

# ============ Training Loop ============
def train_qmix(n_episodes=2000):
    env = CooperativeNavigation(n_agents=3, n_landmarks=3)

    obs_dim = 8     # own position(2) + landmark relative positions(3*2)
    n_actions = 5   # up/down/left/right/stop
    state_dim = 12  # all agent positions(3*2) + landmark positions(3*2)

    agent = QMIXAgent(n_agents=3, obs_dim=obs_dim, n_actions=n_actions, state_dim=state_dim)

    episode_rewards = []

    for episode in range(n_episodes):
        obs = env.reset()
        state = env._get_state()
        episode_reward = 0

        for step in range(25):
            # Select actions
            actions = agent.select_actions(obs, explore=True)

            # Execute
            next_obs, reward, done, next_state = env.step(actions)

            # Store
            agent.store_transition(obs, actions, reward, next_obs, done, state, next_state)

            # Train
            loss = agent.train()

            obs = next_obs
            state = next_state
            episode_reward += reward

            if done:
                break

        # Update target networks
        if episode % 10 == 0:
            agent.update_target_networks()

        # Decay epsilon
        agent.decay_epsilon()

        episode_rewards.append(episode_reward)

        if episode % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}")

    return agent, episode_rewards

# ============ Main Program ============
if __name__ == "__main__":
    agent, rewards = train_qmix(n_episodes=2000)

    # Visualization (requires matplotlib)
    # import matplotlib.pyplot as plt
    # plt.plot(rewards)
    # plt.xlabel('Episode')
    # plt.ylabel('Reward')
    # plt.title('QMIX Training on Cooperative Navigation')
    # plt.show()

Code Explanation

Environment part:
  • CooperativeNavigation: 3 agents need to cover 3 landmarks
  • Local observation: own position + all landmarks' relative positions (8D)
  • Global state: all agents' and landmarks' absolute positions (12D)
  • Reward: negative sum of distances from each landmark to its nearest agent (closer to 0 is better)

Network part:
  • AgentNetwork: each agent's Q network; inputs the local observation, outputs Q-values for 5 actions
  • QMixerNetwork: the QMIX mixer network
    - Hypernetworks generate weights $W_1, W_2$ and biases $b_1, b_2$ from the global state
    - A two-layer feedforward network mixes the agents' Q-values
    - torch.abs ensures positive weights (the monotonicity constraint)

Training part:
  • select_actions: epsilon-greedy action selection; execution only needs local observations
  • train:
    - Sample a batch from the replay buffer
    - Compute each agent's $Q^i(o^i, a^i)$
    - Mix with the mixer to get $Q_{tot}$
    - Compute the TD target and optimize the MSE loss
  • update_target_networks: soft-update the target networks (tau=0.01)

Running example:
  • After training for 2000 episodes, the average episode reward improves from about -15 to around -3
  • Agents learn to cooperate: they spread out to different landmarks instead of clustering at one

Deep Q&A

Q1: Why does QMIX need monotonicity constraint?

Intuition: Monotonicity $\partial Q_{tot} / \partial Q^i \geq 0$ ensures: if agent $i$ improves its local Q-value (i.e., selects a better local action), the global Q-value doesn't decrease. This makes greedy selection during decentralized execution consistent with the joint optimum during centralized training.

Mathematical sketch: Suppose $\partial Q_{tot} / \partial Q^i \geq 0$ for all $i$. We want to prove:

$$\arg\max_{\mathbf{a}} Q_{tot}(s, \mathbf{a}) = \left(\arg\max_{a^1} Q^1(o^1, a^1), \dots, \arg\max_{a^n} Q^n(o^n, a^n)\right)$$

By contradiction: suppose the joint optimum $\mathbf{a}^*$ contains some $a^{i*}$ with $Q^i(o^i, a^{i*}) < \max_{a^i} Q^i(o^i, a^i)$. Replacing $a^{i*}$ with $\arg\max_{a^i} Q^i(o^i, a^i)$ does not decrease $Q_{tot}$, by monotonicity. So the per-agent greedy joint action achieves at least $Q_{tot}(s, \mathbf{a}^*)$ - contradicting the assumption that $\mathbf{a}^*$ is strictly better than it.

Why does VDN naturally satisfy monotonicity? The linear sum $Q_{tot} = \sum_i Q^i$ has derivative $\partial Q_{tot} / \partial Q^i = 1 > 0$.

Q2: Why does MADDPG's Critic use other agents' actions?

Non-stationarity problem: If the Critic $Q^i(o^i, a^i)$ only depends on the agent's own observation and action, then when the other agents' policies $\pi^j$ change, the same $(o^i, a^i)$ leads to different returns - the environment appears non-stationary from agent $i$'s perspective.

MADDPG's solution: The Critic uses $Q^i(s, a^1, \dots, a^n)$, including all agents' actions. This way:
1. The Critic models the complete environment dynamics, so its target is stationary
2. The Actor $\mu^i$ still only uses the local observation, maintaining decentralized execution
3. During training, the Actor learns through the Critic's gradient how to make optimal decisions given the other agents' behavior

Analogy: Like in team project, you need to know teammates' abilities (Critic's input) to plan your tasks (Actor's output), but during execution only use your own information (decentralized execution).

Q3: Why does COMA's counterfactual baseline correctly attribute credit?

Problem: Under a global reward, the standard advantage $A(s, \mathbf{a}) = Q(s, \mathbf{a}) - V(s)$ cannot distinguish agents' contributions - if the team succeeds, all agents get the same advantage, even if some agents were "free-riding".

COMA's counterfactual baseline:

$$b^i(s, a^{-i}) = \sum_{a^{i\prime}} \pi^i(a^{i\prime} \mid \tau^i) \, Q\big(s, (a^{-i}, a^{i\prime})\big)$$

The right side is the expected Q-value if agent $i$ acted randomly according to its current policy, with the other agents fixed at $a^{-i}$.

Intuition: $A^i(s, \mathbf{a}) = Q(s, \mathbf{a}) - b^i(s, a^{-i})$ measures the marginal contribution of agent $i$ choosing $a^i$ compared to its own average. For example:
  • If $A^i > 0$: this action is better than average, so increase its probability
  • If $A^i < 0$: this action drags the team down, so decrease its probability

Why "counterfactual"?: We're asking "what would happen if agentmade other choices?" - this is counterfactual reasoning.

Q4: When do multi-agent systems get stuck in suboptimal equilibria?

Relative Overgeneralization typical example:

10 0
0 8
  • Optimal joint action:, reward 10
  • Suboptimal joint action:, reward 8
  • Uncoordinated:or, reward 0

Training process:
1. Early on, agent 1 explores randomly and learns $R$ first (possibly because agent 2 is also exploring: $(L, R)$ and $(R, L)$ frequently yield 0, while the 8 from $(R, R)$ is relatively reliable)
2. Agent 2 observes that agent 1 tends toward $R$; to avoid the 0 reward, it also learns $R$
3. The system locks into $(R, R)$, even though $(L, L)$ is better

Root causes:
  • Insufficient exploration: not enough samples observe $(L, L)$'s high reward
  • Information asymmetry: agents don't know their teammates' Q-functions and can only infer from observed teammate actions

Solutions:
  • Joint exploration: apply epsilon-greedy to joint actions
  • Communication: agents share Q-values or intentions
  • Opponent modeling: learn models of the other agents' policies
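The mechanism is easy to see numerically. The 2x2 matrix above is mild; the classic "climbing game" payoffs from the cooperative-game literature (assumed here, not from this chapter) show it starkly: while a teammate is still exploring uniformly, each action gets scored against the teammate's average behavior rather than its best response.

```python
import numpy as np

# Climbing game: shared reward R[a1, a2] for two agents with 3 actions each.
R = np.array([[11.0, -30.0, 0.0],
              [-30.0, 7.0, 6.0],
              [0.0, 0.0, 5.0]])

# Expected reward of each of agent 1's actions against a uniformly
# exploring teammate:
expected = R.mean(axis=1)
print(expected)        # [-6.33..., -5.66...,  1.66...]
print(int(expected.argmax()))  # 2
```

Action 2 looks best under the teammate's exploratory mixture, even though the optimal joint action is $(0, 0)$ with reward 11 - so independent learners drift toward the safe suboptimal equilibrium $(2, 2)$.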

Q5: How does sample complexity in multi-agent RL scale with agent count?

Theoretical results (informal):
  • Independent learning: sample complexity exponential in the agent count $n$
  • Centralized learning: also exponential (the joint action space is $\prod_i |\mathcal{A}^i|$), but with a better constant factor
  • Value decomposition (like QMIX): in some structured tasks, roughly linear in $n$

Key assumption: Value decomposition methods assume the global Q-function can be approximately decomposed, which holds in many cooperative tasks (like each agent being responsible for a local sub-goal).

Experimental observations (SMAC, StarCraft micromanagement):
  • 3 agents vs 3 enemies: QMIX converges in about 2M steps
  • 5 vs 5: needs about 5M steps
  • 10 vs 10: needs about 20M steps - close to linear growth

Q6: How to handle partial observability in multi-agent settings?

Method 1: Memory (RNN/LSTM)
  • Each agent maintains a hidden state $h^i_t$ encoding its observation history
  • Policy $\pi^i(a^i \mid o^i_t, h^i_{t-1})$, value $V(o^i_t, h^i_{t-1})$
  • AlphaStar and OpenAI Five both use LSTMs
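A minimal sketch of such a recurrent agent (an assumed architecture, not the chapter's code; a GRU cell stands in for the LSTM): the hidden state carried across steps is what lets Q-values depend on the whole observation history.

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    """Q-network whose output conditions on the observation history via a GRU."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h):
        x = torch.relu(self.encoder(obs))
        h = self.rnn(x, h)  # hidden state summarizes o_1, ..., o_t
        return self.q_head(h), h

agent = RecurrentAgent(obs_dim=8, n_actions=5)
h = torch.zeros(1, 64)
for t in range(3):  # roll the agent over a short observation history
    obs = torch.randn(1, 8)
    q, h = agent(obs, h)
print(q.shape)  # torch.Size([1, 5])
```

Swapping the feedforward AgentNetwork in the QMIX code above for a network like this is the standard way to handle partial observability in SMAC-style tasks.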

Method 2: Communication
  • Agents exchange messages, sharing observations or intentions
  • Algorithms like CommNet and TarMAC study learning communication protocols
  • Challenge: limited communication bandwidth; agents must learn "what to say"

Method 3: Centralized State Estimation
  • Training phase: use the global state $s$ to train the Critic
  • Execution phase: use the local observation $o^i$ + an RNN to infer the global state
  • Similar to the belief state in POMDPs

Example: In StarCraft, you can't see enemies behind fog of war. LSTM remembers "scout saw enemy barracks in top-left 5 minutes ago", infers "enemy probably has 10 soldiers now".

Q7: Why is AlphaStar's League training effective?

Problem: Pure self-play in complex games may fall into cycles - like rock-paper-scissors, strategy A beats B, B beats C, C beats A, no global optimum.

League training design:
1. Main Agents: continuously learning primary agents that play against all opponents in the pool
2. Main Exploiters: specifically find the main agents' weaknesses, preventing the main agents from settling into exploitable strategies
3. League Exploiters: counter the entire opponent pool's average level, ensuring diversity

Analogy:
  • Main Agents are like professional players competing against various opponents
  • Main Exploiters are like training partners who design tactics against the professionals' weaknesses
  • League Exploiters are like amateur experts who keep the strategy pool diverse

Mathematically: League training approximates a multi-agent Nash equilibrium by maintaining a distribution over policies instead of a single policy, which breaks the cycles.
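The cycle and its fix can both be shown in a few lines on rock-paper-scissors itself (a toy illustration, not AlphaStar's actual league machinery): naive best-response to the latest policy cycles forever, while the uniform mixture - the Nash equilibrium of this game - is unexploitable by any pure strategy.

```python
import numpy as np

# Row player's payoff for rock-paper-scissors: 0 tie, +1 win, -1 loss.
P = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def best_response(opponent_action):
    # Pure best response to a pure opponent strategy.
    return int(np.argmax(P[:, opponent_action]))

# Self-play against only the latest policy cycles: rock -> paper -> scissors -> rock ...
a = 0
trace = []
for _ in range(6):
    a = best_response(a)
    trace.append(a)
print(trace)        # [1, 2, 0, 1, 2, 0]

# Against the uniform mixture (the Nash equilibrium), every pure strategy
# has expected payoff 0, so nothing can exploit the population.
uniform = np.ones(3) / 3
print(P @ uniform)  # [0. 0. 0.]
```

Maintaining and playing against the whole pool, as the League does, approximates this mixture in games far too large to solve exactly.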

Q8: What are limitations of multi-agent RL in real-world applications?

Safety:
- In autonomous driving, a multi-vehicle coordination failure may cause accidents
- RL exploration may produce dangerous behaviors (like running red lights)
- Requires constrained optimization or safe-RL techniques

Communication delay and partial failure:
- Real networks have latency and packet loss
- Agents may disconnect or sensors may fail
- Requires robustness by design (e.g., redundancy, graceful degradation)

Heterogeneity:
- In real systems, agents have different capabilities (e.g., drones vs ground vehicles)
- Goals may conflict (e.g., competition between enterprises)
- Requires richer game-theoretic models

Interpretability:
- Human operators struggle to understand joint multi-agent decisions
- "Why did drone A go left while B went right?"
- Requires visualization and natural-language explanations

Q9: How to implement communication learning in multi-agent systems?

CommNet (2016) architecture:
- Each agent $i$ has observation $o_i$ and hidden state $h_i$
- Average pooling: $c_i = \frac{1}{N-1} \sum_{j \neq i} h_j$
- Update: $h_i' = \tanh(H h_i + C c_i)$
- Output: action $a_i$ decoded from $h_i'$

TarMAC (2018) architecture:
- Each agent generates a message consisting of a key $k_j$ and a value $v_j$
- Attention mechanism for selective receiving: $\alpha_{ij} = \mathrm{softmax}_j(q_i^\top k_j)$
- Aggregate: $c_i = \sum_j \alpha_{ij} v_j$

Challenges:
- Symbol grounding: the "language" (vectors) agents invent is uninterpretable to humans
- Bandwidth limits: real systems have communication costs, so agents need to learn "when to speak" and "what to say"
- Adversarial robustness: communication may be eavesdropped or jammed
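One CommNet-style communication round can be sketched in numpy (weights and sizes are illustrative; the real model learns $H$ and $C$ by backpropagation through the whole team):

```python
import numpy as np

rng = np.random.default_rng(0)
N, hid = 3, 4                        # 3 agents, hidden size 4 (illustrative)

H = rng.normal(0, 0.1, (hid, hid))   # self-transform
C = rng.normal(0, 0.1, (hid, hid))   # communication transform

h = rng.normal(size=(N, hid))        # per-agent hidden states, row i = agent i

def commnet_round(h):
    # c_i = mean of the OTHER agents' hidden states (average pooling).
    total = h.sum(axis=0)
    c = (total - h) / (h.shape[0] - 1)
    # Update: h_i' = tanh(H h_i + C c_i), applied to all agents at once.
    return np.tanh(h @ H.T + c @ C.T)

h2 = commnet_round(h)
```

TarMAC replaces the uniform average with the attention weights $\alpha_{ij}$, so each agent can listen selectively instead of to everyone equally.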

Q10: What are future directions for multi-agent RL?

1. Large-scale cooperation:
- Current: dozens of agents (like AlphaStar's 200 units)
- Goal: thousands of agents (like traffic networks, power grids)
- Needs: hierarchical control, graph neural networks, distributed optimization

2. Human-AI collaboration:
- Current: AI vs AI
- Goal: AI collaborating with human teammates (like assisted driving, medical diagnosis)
- Needs: modeling human intent, interpretable decisions, adapting to human habits

3. Open-ended environments:
- Current: fixed-rule games (like StarCraft)
- Goal: the real world (like disaster rescue, scientific exploration)
- Needs: transfer learning, meta-learning, lifelong learning

4. Theoretical guarantees:
- Current: empirical success
- Goal: theory on convergence, sample complexity, and Nash equilibrium computation
- Needs: deep integration of game theory, optimization theory, and learning theory

Core Papers

  1. VDN:
    Sunehag et al. (2017). "Value-Decomposition Networks For Cooperative Multi-Agent Learning". AAMAS.
    https://arxiv.org/abs/1706.05296

  2. QMIX:
    Rashid et al. (2018). "QMIX: Monotonic Value Function Factorisation for Decentralised Multi-Agent Reinforcement Learning". ICML.
    https://arxiv.org/abs/1803.11485

  3. MADDPG:
    Lowe et al. (2017). "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments". NeurIPS.
    https://arxiv.org/abs/1706.02275

  4. COMA:
    Foerster et al. (2018). "Counterfactual Multi-Agent Policy Gradients". AAAI.
    https://arxiv.org/abs/1705.08926

  5. AlphaStar:
    Vinyals et al. (2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning". Nature.
    https://www.nature.com/articles/s41586-019-1724-z

  6. QTRAN:
    Son et al. (2019). "QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning". ICML.
    https://arxiv.org/abs/1905.05408

  7. CommNet:
    Sukhbaatar et al. (2016). "Learning Multiagent Communication with Backpropagation". NeurIPS.
    https://arxiv.org/abs/1605.07736

  8. MAPPO:
    Yu et al. (2021). "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games". NeurIPS.
    https://arxiv.org/abs/2103.01955

Summary

Multi-agent reinforcement learning elevates RL challenges to a new dimension - agents must not only learn environment dynamics but also understand, predict, and coordinate other agents' behaviors. From game theory's Nash equilibrium to QMIX's value decomposition, from MADDPG's centralized training with decentralized execution to AlphaStar's League training, MARL methodology demonstrates rich creativity.

Value decomposition methods (VDN/QMIX/QTRAN) decompose global Q-function to maintain decentralized execution while leveraging centralized training, suitable for cooperative tasks.

Multi-agent Actor-Critic methods (MADDPG/COMA) use a centralized Critic that conditions on other agents' actions, mitigating non-stationarity, and COMA's counterfactual baseline attributes credit accurately to individual agents.

AlphaStar's success demonstrates MARL's potential in ultra-complex environments - through League training, hierarchical action spaces, and long-term memory, agents reached human top-tier level.

Future MARL will move toward larger scale (thousands of agents), more realistic (human-AI collaboration, open environments), and more reliable (theoretical guarantees, safety constraints) applications. From autonomous vehicle fleets to smart grids, from robot collaboration to online ad bidding, MARL is becoming the key technology for solving complex multi-agent systems.

  • Post title: Reinforcement Learning (9): Multi-Agent Reinforcement Learning
  • Post author: Chen Kai
  • Create time: 2024-09-13 10:00:00
  • Post link: https://www.chenk.top/reinforcement-learning-9-multi-agent-rl/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.