Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
Chen Kai BOSS

In March 2016, when AlphaGo defeated world Go champion Lee Sedol 4-1, the world was stunned - Go had long been considered the ultimate fortress for AI, with a search space of roughly $10^{170}$ legal positions (far exceeding the $\sim 10^{80}$ atoms in the observable universe), making brute-force search completely ineffective. AlphaGo's success wasn't the victory of a single algorithm, but the fusion of Monte Carlo Tree Search (MCTS), deep learning, and reinforcement learning. Even more astonishing, AlphaGo Zero, developed 18 months later, completely discarded human game records and learned purely through self-play from scratch, surpassing millennia of human Go wisdom in just 3 days. Subsequently, AlphaZero generalized this paradigm to chess and shogi, while MuZero went further by breaking free from game rules, learning to plan without knowing the environment dynamics. This series of breakthroughs not only transformed the landscape of game-playing AI but also provided a new paradigm for combining reinforcement learning with search - how to plan efficiently in vast state spaces, how to let AI discover knowledge autonomously, and how to plan forward without a model. This chapter starts from MCTS's mathematical foundations, progressively delving into the core designs of AlphaGo, AlphaGo Zero, AlphaZero, and MuZero, with a complete code implementation to help you truly understand the essence of these algorithms.

Monte Carlo Tree Search: Balancing Search and Sampling

Why Do We Need MCTS?

In Go, each position has an average of about 250 legal moves and games average 150 moves, resulting in a complete search tree with roughly $250^{150} \approx 10^{360}$ branches. Chess has a branching factor of about 35 and averages 80 moves, giving a search space of "only" $35^{80} \approx 10^{123}$ - still astronomical, but much smaller than Go's. Traditional alpha-beta pruning works for chess but fails completely for Go. MCTS's core idea is: don't search all branches uniformly, but concentrate computational resources on the most promising paths.

Four Steps of MCTS

MCTS builds a search tree by iteratively executing four steps:

1. Selection: Starting from the root node, use a strategy (like UCT) to select child nodes until reaching a leaf node.

2. Expansion: If the leaf node isn't a terminal state, randomly select an unexplored action and add a new child node.

3. Simulation: From the new node, simulate to the end of the game using a fast policy (random or a simple heuristic), obtaining a reward $z$.

4. Backpropagation: Propagate the reward $z$ backward along the path, updating the statistics (visit count $N(s,a)$ and accumulated value $W(s,a)$) of every node on the path.

These four steps repeat continuously, with the search tree gradually growing toward promising directions. Finally, at the root node, select the action with the most visits as the actual decision.

UCT Algorithm: Mathematical Balance of Exploration and Exploitation

The key to the selection step is balancing exploration and exploitation. The UCT (Upper Confidence bounds applied to Trees) algorithm selects the action maximizing:

$$\text{UCT}(s,a) = \frac{Q(s,a)}{N(s,a)} + c\sqrt{\frac{\ln N(s)}{N(s,a)}}$$

where:
  • $Q(s,a)$: accumulated value of selecting action $a$ in state $s$
  • $N(s,a)$: visit count of selecting action $a$ in state $s$
  • $N(s)$: visit count of state $s$
  • $c$: exploration constant (typically $\sqrt{2}$)

The first term $Q(s,a)/N(s,a)$ is the average value, representing exploitation - selecting known good actions.

The second term $c\sqrt{\ln N(s)/N(s,a)}$ is the confidence upper bound, representing exploration - encouraging trying less-visited actions.

This formula has deep mathematical roots - it comes from the UCB1 algorithm for the Multi-Armed Bandit problem. When action $a$'s visit count $N(s,a)$ is small, the second term is large and the algorithm tends to explore it; when $a$ has been visited many times, the second term decreases and the algorithm relies more on the average value.

Key property: UCT guarantees asymptotic optimality - as the simulation count approaches infinity, UCT's action selection converges to the optimal action. Mathematically, this is proven through Hoeffding's inequality:

$$P\left(\left|\bar{X}_n - \mu\right| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}$$

This inequality shows that as the sample count $n$ increases, the estimate $\bar{X}_n$ converges to the true value $\mu$ at an exponential rate.
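To make the trade-off concrete, here is a minimal sketch of the UCT score (the function name `uct_score` and the toy numbers are our own illustration, not from any paper). Two actions with the same average value get different scores depending on how often they have been tried:

```python
import math

def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Average value (exploitation) + confidence bound (exploration)."""
    if visits == 0:
        return float('inf')  # unvisited actions are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Two actions, both with average value 0.5, under a parent visited 100 times:
rarely_tried = uct_score(total_value=1.0, visits=2, parent_visits=100)
well_tried = uct_score(total_value=25.0, visits=50, parent_visits=100)
# The rarely-tried action gets a much larger exploration bonus,
# so the search will prefer it despite identical empirical means.
```

As the rarely-tried action accumulates visits, its bonus shrinks and the score converges to the plain average value.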

MCTS Advantages and Limitations

Advantages:
  • No domain knowledge needed: basic MCTS only requires the game rules (a simulator), no heuristic evaluation function
  • Anytime algorithm: it can stop at any time and return the current best action; longer computation yields better results
  • Adaptive: it automatically invests more computational resources in promising branches

Limitations:
  • Computationally expensive: reliable estimates require many simulations
  • Random policy is inefficient: in complex games, random simulation rarely reaches meaningful states
  • Cannot directly handle continuous action spaces: UCT assumes actions are discrete and finite

AlphaGo's breakthrough precisely solved the latter two problems - replacing random simulation with neural networks and providing prior knowledge to guide search.

AlphaGo's Three-Stage Training

AlphaGo's training consists of three stages:

Stage 1: Supervised Learning Policy Network $p_\sigma$
  • Using 160,000 human expert games, train a 13-layer convolutional neural network to predict human moves
  • Input: $19 \times 19 \times 48$ board feature planes (current stones, historical stones, legality, etc.)
  • Output: 361-dimensional action probability distribution (the 19x19 board positions)
  • Accuracy: 57.0% consistency with human experts on the test set

Stage 2: Reinforcement Learning Policy Network $p_\rho$
  • Initialize with the supervised network
  • Optimize through self-play using the policy gradient $\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} z_t$, where $z_t$ is the game outcome (win/loss)
  • Opponents are randomly sampled from a pool of historical policy versions, preventing overfitting
  • Compared to the supervised policy, the RL policy wins 80% of games

Stage 3: Value Network $v_\theta$
  • Train a network to predict the win rate at state $s$ under the RL policy: $v_\theta(s) \approx v^{p_\rho}(s)$
  • Data generation: self-play with the RL policy generates 30 million (state, outcome) pairs
  • Key trick: sample only one state per game, avoiding overfitting to the highly correlated states within a single game
  • Test set MSE: 0.226 (a significant reduction from the random baseline of 0.5)

AlphaGo's MCTS Integration

AlphaGo's MCTS differs from traditional four steps:

1. Selection: Uses a modified UCT formula:

$$a_t = \arg\max_a \left( Q(s,a) + u(s,a) \right), \qquad u(s,a) = c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$$

Here $P(s,a)$ is the supervised policy network's prior probability. Note:
  • The prior $P(s,a)$ guides search toward directions human experts approve of
  • When $N(s,a) = 0$, $u(s,a)$ is large, encouraging exploration of high-prior actions
  • As $N(s,a)$ increases, $u(s,a)$ decreases and the algorithm relies more on the value $Q(s,a)$

2. Expansion: Upon reaching a leaf node, initialize every action's prior with the policy network: $P(s,a) = p_\sigma(a \mid s)$

3. Evaluation: No longer performs random simulation; instead uses a weighted average of the value network $v_\theta$ and the fast rollout policy $p_\pi$ (a small linear softmax):

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

where $z_L$ is the result of simulating to a terminal state with $p_\pi$, and $\lambda = 0.5$.

4. Backpropagation: Update the statistics of each node on the path:

$$N(s,a) \leftarrow N(s,a) + 1, \qquad W(s,a) \leftarrow W(s,a) + V(s_L), \qquad Q(s,a) = \frac{W(s,a)}{N(s,a)}$$
Final decision: At the root node, the action selection probability is proportional to the visit count:

$$\pi(a \mid s_0) \propto N(s_0, a)^{1/\tau}$$

where the temperature parameter $\tau$ controls exploration:
  • The first 30 moves use $\tau = 1$ (proportional selection, maintains exploration)
  • After that $\tau \to 0$ (select the action with the most visits)
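The visit-count-to-policy conversion above can be sketched in a few lines of Python (the helper name `visits_to_policy` is our own, not from the papers):

```python
import numpy as np

def visits_to_policy(visit_counts, tau=1.0):
    """pi(a) proportional to N(s0, a)^(1/tau); tau -> 0 approaches argmax."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:
        # Greedy limit: all probability mass on the most-visited action
        policy = np.zeros_like(counts)
        policy[np.argmax(counts)] = 1.0
        return policy
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

counts = [10, 30, 60]
pi_explore = visits_to_policy(counts, tau=1.0)  # proportional to visit counts
pi_greedy = visits_to_policy(counts, tau=0)     # all mass on the most-visited action
```

With `tau=1` the policy is exactly the normalized visit counts; lowering `tau` sharpens the distribution toward the most-visited action.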

AlphaGo's Network Architecture Details

Policy Networks $p_\sigma$ and $p_\rho$:
  • Input: $19 \times 19 \times 48$ feature planes: 8 historical time steps of own/opponent stones (16 planes), 1 color feature (black/white), plus legality, captured stones, ladder features, etc.
  • Convolutional layers: 13 layers, each with 256 $3 \times 3$ filters, ReLU activation
  • Output: $19 \times 19$ softmax distribution

Value Network $v_\theta$:
  • Architecture similar to the policy network, with the final layer changed to a single tanh neuron outputting a scalar value in $[-1, 1]$
  • Input features include rotation data augmentation

Fast Rollout Policy $p_\pi$:
  • A linear softmax using only local pattern features
  • Each move takes only 2 microseconds (the policy network needs 3 milliseconds)
  • Used to accelerate simulation in the evaluation phase

AlphaGo's Performance Against Lee Sedol

AlphaGo used 1920 CPUs and 280 GPUs, averaging 40 seconds per move and performing approximately 100,000 MCTS simulations. Key moves like move 37 in Game 2 demonstrated AlphaGo's creativity - placements that initially seemed unreasonable but were actually the result of globally optimal deep planning. Lee Sedol's "divine move" at move 78 in Game 4, in turn, proved the unique value of human intuition: AlphaGo fell into confusion for 19 moves afterward, revealing insufficient generalization to certain rare patterns.

AlphaGo Zero: Self-Play from Scratch

Three Major Breakthroughs

AlphaGo Zero, published in Nature in October 2017, had three fundamental improvements over AlphaGo:

1. Completely discards human game records: Uses only black/white stones and game rules, learning from a completely random policy through self-play.

2. Single network: Merges the policy and value networks into one two-headed network $f_\theta(s) = (\mathbf{p}, v)$:
  • $\mathbf{p}$ is the action probability vector
  • $v$ is the win-rate estimate

This design is based on the intuition that both policy and value depend on understanding the position, so shared representation can accelerate learning.

3. No fast rollout: Relies entirely on the network's value output $v$ to evaluate leaf nodes; no random simulation is performed. This makes evaluation more accurate and simplifies the algorithm.

Training Process

AlphaGo Zero's training is a self-play closed loop:

Step 1: Self-play
  • Using the current best network, generate self-play data through MCTS
  • At each move, MCTS runs 1600 simulations and computes the improved policy $\pi(a \mid s) \propto N(s,a)^{1/\tau}$
  • Save $(s_t, \pi_t, z_t)$ triples: state, improved policy, final outcome

Step 2: Train network
  • Sample mini-batches from the replay buffer and optimize the loss function:

$$\ell = (z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p} + c\|\theta\|^2$$

The first term is the value loss (MSE), the second the policy loss (cross-entropy), the third L2 regularization.

Step 3: Evaluation and update
  • Let the new network $f_{\theta'}$ play 400 games against the current best network
  • If the new network wins more than 55%, it becomes the new best network
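The three-term loss of Step 2 can be sketched in PyTorch as follows (a minimal illustration; the name `alphazero_loss` and the toy batch are our own, and in practice the L2 term is usually handled by the optimizer's `weight_decay`, as in the full implementation later in this chapter):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(log_p, v, target_pi, target_z, model=None, c=1e-4):
    """(z - v)^2  -  pi^T log p  +  c * ||theta||^2."""
    value_loss = F.mse_loss(v.squeeze(-1), target_z)
    policy_loss = -(target_pi * log_p).sum(dim=1).mean()
    l2 = 0.0
    if model is not None:  # normally delegated to weight_decay instead
        l2 = c * sum((p ** 2).sum() for p in model.parameters())
    return value_loss + policy_loss + l2

# Toy batch: 2 positions, 3 actions
log_p = torch.log(torch.tensor([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]))
v = torch.tensor([[0.1], [-0.4]])
target_pi = torch.tensor([[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]])
target_z = torch.tensor([1.0, -1.0])
loss = alphazero_loss(log_p, v, target_pi, target_z)
```

Note that the policy loss is a cross-entropy against a *distribution* (the MCTS visit distribution), not a one-hot label - this is what distills the search improvement back into the network.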

AlphaGo Zero's MCTS

The selection step's UCT formula becomes:

$$a = \arg\max_a \left( Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right)$$

where $P(s,a)$ comes from the network's policy output, not a supervised-learning prior.

The evaluation step directly uses the network value $v$; there is no rollout mixing.

Why Does Self-Play Work?

Self-play's success relies on an implicit curriculum (Curriculum Learning):
  • Early stage: both sides are random policies; games end quickly, giving a strong gradient signal
  • Middle stage: the policy gradually strengthens, but so does the opponent, always maintaining a moderate challenge
  • Late stage: both sides approach perfect play, exploring subtle tactical details

This is more efficient than learning from fixed human game records because:
  • Human game records cover a limited part of the state space (mainly opening patterns)
  • Humans have biases and errors, which the AI might learn
  • Self-play can continuously explore unknown positions

Mathematically, self-play resembles Fictitious Self-Play: each iteration computes a best response to the previous policies, $\pi_{t+1} = \text{BR}(\bar{\pi}_t)$. In two-player zero-sum games, this iteration converges to a Nash equilibrium. AlphaGo Zero's MCTS computes a "local best response" at each step, then distills it into the network through supervised learning.

Performance Comparison

AlphaGo Zero with 3 days training (4.9 million self-play games) defeated AlphaGo Lee version 100-0; with 40 days training, defeated AlphaGo Master version (which had 60 consecutive wins against top human players online) 89-11. More amazingly, AlphaGo Zero's Elo rating growth curve is almost linear - from 0 to 5000 points took only 72 hours, while humans need years to progress from beginner to professional.

AlphaZero: Generalization to Chess and Shogi

From Specialized to General

AlphaZero, demonstrated in the 2018 Science paper, showed the same algorithm's dominance across three board games (Go, chess, shogi). The only modifications were:

1. Game rule encoding: Each game uses a different state representation:
  • Go: $19 \times 19 \times 17$
  • Chess: $8 \times 8 \times 119$ (piece types, move history, castling rights, etc.)
  • Shogi: $9 \times 9 \times 362$ (including captured pieces in hand)

2. Hyperparameter tuning:
  • Chess and shogi MCTS simulations reduced to 800 (due to the smaller branching factor)
  • The exploration constant $c_{\text{puct}}$ optimized per game

3. Network architecture: Unified to 20 residual blocks, each with 256 filters, totaling approximately 46 million parameters.

Training Details and Results

  • Go: 700k steps (equivalent to 21 million self-play games), Elo 5018, surpassing AlphaGo Zero
  • Chess: 300k steps (44 million games), surpassed Stockfish (traditional strongest engine) after 4 hours, reached Elo 4450 after 9 hours
  • Shogi: 100k steps (4 million games), surpassed Elmo (Japanese champion program) after 2 hours, post-training Elo 4650

In 100 games against Stockfish, AlphaZero won 28, drew 72, and lost none. More importantly, AlphaZero's playing style is more "human-like" - willing to sacrifice material for positional advantage, while traditional engines rely more on material calculation. Chess.com grandmasters described AlphaZero as "a player from the future".

Why Can It Generalize?

AlphaZero's generalization ability comes from:
  • A universal search framework: MCTS doesn't depend on game-specific heuristics
  • Powerful representation learning: residual networks learn features automatically, with no hand-crafted design
  • Self-play robustness: not limited by human biases, it explores a broader strategy space

This paradigm's success shows that: search + learning is closer to general intelligence than pure memorization (like opening books) or pure brute-force computation.

MuZero: Planning Without Rules

Core Challenge

AlphaZero relies on game rules to perform MCTS simulations - in chess, it knows "knight moves in an L-shape", "en passant capture", etc. But many real-world problems (like robot control or video games) have unknown or hard-to-program rules. MuZero's breakthrough is learning an implicit model and planning in latent space.

Three Learned Functions

MuZero learns three functions:

1. Representation function $h_\theta$: Maps the observation history to a hidden state, $s^0 = h_\theta(o_1, \ldots, o_t)$.

2. Dynamics function $g_\theta$: Predicts the next hidden state and immediate reward, $(s^k, r^k) = g_\theta(s^{k-1}, a^k)$. Note: $s^k$ is not the real environment state, but a learned abstract representation; $g_\theta$ doesn't need to predict the actual next observation, only to be sufficient for planning.

3. Prediction function $f_\theta$: Outputs policy and value, $(\mathbf{p}^k, v^k) = f_\theta(s^k)$. All three functions share parameters $\theta$ and are trained jointly.
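The division of labor among the three functions can be sketched as tiny PyTorch modules (shapes, layer sizes, and class names here are our own toy choices, not DeepMind's architecture):

```python
import torch
import torch.nn as nn

LATENT, N_ACTIONS = 32, 4  # toy sizes for illustration

class Representation(nn.Module):  # h_theta(o) -> s^0
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, LATENT))
    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):  # g_theta(s, a) -> (s', r)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT + N_ACTIONS, 64), nn.ReLU())
        self.state_head = nn.Linear(64, LATENT)
        self.reward_head = nn.Linear(64, 1)
    def forward(self, state, action_onehot):
        x = self.net(torch.cat([state, action_onehot], dim=-1))
        return self.state_head(x), self.reward_head(x)

class Prediction(nn.Module):  # f_theta(s) -> (p, v)
    def __init__(self):
        super().__init__()
        self.policy_head = nn.Linear(LATENT, N_ACTIONS)
        self.value_head = nn.Linear(LATENT, 1)
    def forward(self, state):
        return (torch.log_softmax(self.policy_head(state), dim=-1),
                torch.tanh(self.value_head(state)))

# One imagined step, entirely in latent space -- no environment simulator:
h, g, f = Representation(obs_dim=8), Dynamics(), Prediction()
s0 = h(torch.randn(1, 8))
a = torch.nn.functional.one_hot(torch.tensor([2]), N_ACTIONS).float()
s1, r1 = g(s0, a)  # dynamics rolls the latent forward and predicts reward
p1, v1 = f(s1)     # prediction reads policy and value off the latent
```

The key point the sketch makes explicit: `g` consumes and produces only latent states, so a whole search tree can be unrolled without ever calling the real environment.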

MuZero's MCTS

At root node, MuZero executes MCTS:

Selection: The same UCT formula as AlphaZero:

$$a = \arg\max_a \left( Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right)$$

Expansion: Upon reaching a leaf node, expand a new action $a$ and use the dynamics model to generate the virtual next state: $(s', r) = g_\theta(s, a)$. Then initialize the new node with the prediction function: $(\mathbf{p}, v) = f_\theta(s')$.

Evaluation: Directly use $v$; during backpropagation, accumulate the rewards along the path into a bootstrapped return:

$$G = \sum_{i=0}^{n-1} \gamma^i r_i + \gamma^n v$$

Key difference: the entire search tree unfolds in latent space and needs no real environment simulator. Both $g_\theta$ and $f_\theta$ are learned neural networks.

Training Objective

MuZero's loss function includes three terms:
  • $\ell^p(\pi_{t+k}, \mathbf{p}^k_t)$: policy loss, the cross-entropy between the MCTS improved policy $\pi$ and the network prediction $\mathbf{p}$
  • $\ell^v(z_{t+k}, v^k_t)$: value loss, the MSE between the true return $z$ and the prediction $v$
  • $\ell^r(u_{t+k}, r^k_t)$: reward loss, the MSE between the true immediate reward $u$ and the prediction $r$

summed over $k = 0, \ldots, K$, where $K$ is the number of unroll steps (typically 5 for Atari games). Note:
  • Only at $k = 0$ does $s^0$ come from a real observation
  • For $k > 0$, $s^k$ comes from recursive prediction by the dynamics model
  • The true $u_{t+k}$ and $z_{t+k}$ come from actual experience in the replay buffer
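The unrolled objective can be sketched in plain Python (the function `muzero_loss`, the simple `cross_entropy` helper, and the scalar stand-in "networks" below are our own toy constructions; a real implementation uses neural networks and gradient descent):

```python
import math

def cross_entropy(target, log_probs):
    return -sum(t * lp for t, lp in zip(target, log_probs))

def muzero_loss(obs, actions, targets, h, g, f, K=2):
    """Sum of policy, value, and reward losses over a K-step latent unroll."""
    pis, zs, us = targets  # MCTS policies, returns, observed rewards
    s = h(obs)             # k = 0: latent state from the real observation
    loss = 0.0
    for k in range(K + 1):
        log_p, v = f(s)
        loss += cross_entropy(pis[k], log_p) + (zs[k] - v) ** 2
        if k < K:
            s, r = g(s, actions[k])  # k > 0: latent rolled forward by the model
            loss += (us[k] - r) ** 2
    return loss

# Scalar stand-ins for the three networks, just to exercise the loop:
h = lambda o: o * 0.5
g = lambda s, a: (s + a, 0.1 * s)  # (next latent, predicted reward)
f = lambda s: ([math.log(0.5), math.log(0.5)], math.tanh(s))

pis = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
zs = [0.0, 0.0, 0.0]
us = [0.0, 0.0]
loss = muzero_loss(obs=1.0, actions=[0.0, 0.0], targets=(pis, zs, us), h=h, g=g, f=f, K=2)
```

The structure mirrors the text: the observation enters only at $k = 0$ through `h`, while every later step is supervised against real targets but computed from the model's own recursive predictions.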

This design cleverly combines Model-Based and Model-Free:
  • Uses the learned model during planning (Model-Based)
  • Is supervised by real experience during training (Model-Free data efficiency)

Reanalysis Loss

MuZero introduces a key trick, Reanalysis: reanalyze old experience with the new network.
  • The replay buffer stores $(o_t, a_t, u_t, z_t)$: observation, action, reward, return
  • When training the new network, rerun MCTS on $o_t$ to obtain an improved policy $\pi_t'$
  • This increases the value of old data as the network improves, similar to Dyna-2's idea
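A toy sketch of the idea (the function name `reanalyze` and the dict-based buffer entries are our own stand-ins; `run_mcts` represents a full MCTS call with the latest network):

```python
def reanalyze(buffer, run_mcts):
    """Refresh stored policy targets with the current (stronger) search.
    Observations, actions, rewards, and outcomes stay as recorded."""
    for entry in buffer:
        entry["pi"] = run_mcts(entry["obs"])  # fresher improved-policy target
    return buffer

buffer = [{"obs": 0, "pi": [0.9, 0.1], "z": 1.0},
          {"obs": 1, "pi": [0.5, 0.5], "z": -1.0}]
# Pretend the new network's search now strongly prefers action 1 everywhere:
buffer = reanalyze(buffer, run_mcts=lambda obs: [0.2, 0.8])
```

Only the policy targets are recomputed; the observed rewards and final outcomes are ground truth and never change.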

MuZero's Performance

In 57 Atari games, MuZero achieves AlphaZero-level performance without needing game rules. More importantly, MuZero surpasses Model-Free Agent57 in some highly stochastic games (like Ms. Pac-Man) because planning helps maintain consistency in long-term strategy.

In Go and chess, MuZero matches AlphaZero - proving that implicit models sufficiently express key game structures, even without predicting pixel-level next frames.

Complete Code Implementation: MCTS + Neural Network for Gomoku

The following code implements a simplified AlphaZero for Gomoku (five in a row). It includes:
  • MCTS's four steps
  • A policy-value dual-head network
  • Self-play data generation
  • The training loop

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

# ============ Gomoku Environment ============
class GomokuEnv:
    def __init__(self, board_size=15):
        self.size = board_size
        self.reset()

    def reset(self):
        self.board = np.zeros((self.size, self.size), dtype=np.int8)
        self.current_player = 1  # 1=black, -1=white
        self.done = False
        self.winner = 0
        return self.get_state()

    def get_state(self):
        """Returns 3-channel state: [current player stones, opponent stones, color indicator]"""
        state = np.zeros((3, self.size, self.size), dtype=np.float32)
        state[0] = (self.board == self.current_player).astype(np.float32)
        state[1] = (self.board == -self.current_player).astype(np.float32)
        state[2] = np.full((self.size, self.size), self.current_player)
        return state

    def legal_actions(self):
        """Returns list of legal action coordinates"""
        return list(zip(*np.where(self.board == 0)))

    def step(self, action):
        """Execute action (row, col)"""
        if self.board[action] != 0:
            raise ValueError(f"Illegal move: {action}")

        self.board[action] = self.current_player

        # Check win
        if self._check_win(action):
            self.done = True
            self.winner = self.current_player
            return self.get_state(), self.winner, True

        # Check draw
        if len(self.legal_actions()) == 0:
            self.done = True
            return self.get_state(), 0, True

        self.current_player *= -1
        return self.get_state(), 0, False

    def _check_win(self, last_move):
        """Check if last_move forms five in a row"""
        directions = [(0, 1), (1, 0), (1, 1), (1, -1)]  # horizontal, vertical, two diagonals
        player = self.board[last_move]

        for dr, dc in directions:
            count = 1  # including last_move itself
            # Positive direction
            r, c = last_move[0] + dr, last_move[1] + dc
            while 0 <= r < self.size and 0 <= c < self.size and self.board[r, c] == player:
                count += 1
                r += dr
                c += dc
            # Negative direction
            r, c = last_move[0] - dr, last_move[1] - dc
            while 0 <= r < self.size and 0 <= c < self.size and self.board[r, c] == player:
                count += 1
                r -= dr
                c -= dc

            if count >= 5:
                return True
        return False

# ============ Policy-Value Network ============
class PolicyValueNet(nn.Module):
    def __init__(self, board_size=15, num_channels=128):
        super().__init__()
        self.size = board_size

        # Shared representation
        self.conv1 = nn.Conv2d(3, num_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(num_channels, num_channels, 3, padding=1)
        self.conv3 = nn.Conv2d(num_channels, num_channels, 3, padding=1)

        # Policy head
        self.policy_conv = nn.Conv2d(num_channels, 2, 1)
        self.policy_fc = nn.Linear(2 * board_size * board_size, board_size * board_size)

        # Value head
        self.value_conv = nn.Conv2d(num_channels, 1, 1)
        self.value_fc1 = nn.Linear(board_size * board_size, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, x):
        # Shared layers
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))

        # Policy head
        p = torch.relu(self.policy_conv(x))
        p = p.view(-1, 2 * self.size * self.size)
        p = torch.log_softmax(self.policy_fc(p), dim=1)

        # Value head
        v = torch.relu(self.value_conv(x))
        v = v.view(-1, self.size * self.size)
        v = torch.relu(self.value_fc1(v))
        v = torch.tanh(self.value_fc2(v))

        return p, v

# ============ MCTS Node ============
class MCTSNode:
    def __init__(self, prior, parent=None):
        self.prior = prior  # P(s,a)
        self.parent = parent
        self.children = {}  # action -> MCTSNode

        self.visit_count = 0
        self.value_sum = 0.0

    def value(self):
        if self.visit_count == 0:
            return 0
        return self.value_sum / self.visit_count

    def is_expanded(self):
        return len(self.children) > 0

    def select_child(self, c_puct=1.0):
        """UCT selection"""
        best_score = -float('inf')
        best_action = None
        best_child = None

        for action, child in self.children.items():
            # Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
            u = c_puct * child.prior * np.sqrt(self.visit_count) / (1 + child.visit_count)
            score = child.value() + u

            if score > best_score:
                best_score = score
                best_action = action
                best_child = child

        return best_action, best_child

    def expand(self, actions, priors):
        """Expand node"""
        for action, prior in zip(actions, priors):
            if action not in self.children:
                self.children[action] = MCTSNode(prior, parent=self)

    def backup(self, value):
        """Update this node's statistics. The search loop walks the path
        and flips the sign at each ply (opponent's perspective); updating
        only this node avoids double-counting the ancestors."""
        self.visit_count += 1
        self.value_sum += value

# ============ MCTS ============
class MCTS:
    def __init__(self, model, c_puct=1.0, num_simulations=400):
        self.model = model
        self.c_puct = c_puct
        self.num_simulations = num_simulations

    @torch.no_grad()
    def search(self, env):
        """Execute MCTS, return improved policy"""
        root = MCTSNode(prior=0)

        # Initialize root node
        state = env.get_state()
        legal_actions = env.legal_actions()
        log_probs, value = self.model(torch.FloatTensor(state).unsqueeze(0))
        probs = torch.exp(log_probs).squeeze(0).numpy()

        # Only consider legal actions
        action_probs = []
        for action in legal_actions:
            idx = action[0] * env.size + action[1]
            action_probs.append(probs[idx])
        action_probs = np.array(action_probs)
        action_probs /= action_probs.sum()

        root.expand(legal_actions, action_probs)

        # Simulate
        for _ in range(self.num_simulations):
            node = root
            env_copy = self._copy_env(env)
            search_path = [node]

            # 1. Selection
            while node.is_expanded() and not env_copy.done:
                action, node = node.select_child(self.c_puct)
                search_path.append(node)
                env_copy.step(action)

            # 2&3. Expansion and evaluation
            if not env_copy.done:
                state = env_copy.get_state()
                legal_actions = env_copy.legal_actions()
                log_probs, value = self.model(torch.FloatTensor(state).unsqueeze(0))
                probs = torch.exp(log_probs).squeeze(0).numpy()

                action_probs = []
                for action in legal_actions:
                    idx = action[0] * env_copy.size + action[1]
                    action_probs.append(probs[idx])
                action_probs = np.array(action_probs)
                action_probs /= action_probs.sum()

                node.expand(legal_actions, action_probs)
                value = value.item()  # from the perspective of the player to move
            else:
                # Terminal state: the previous mover just won (or it's a draw),
                # so from the player to move it is a loss or a draw
                value = -1.0 if env_copy.winner != 0 else 0.0

            # 4. Backpropagation: flip the sign at each ply going up the path
            for node in reversed(search_path):
                value = -value
                node.backup(value)

        # Return improved policy
        visit_counts = np.zeros(env.size * env.size)
        for action, child in root.children.items():
            idx = action[0] * env.size + action[1]
            visit_counts[idx] = child.visit_count

        return visit_counts / visit_counts.sum()

    def _copy_env(self, env):
        """Deep copy environment"""
        new_env = GomokuEnv(env.size)
        new_env.board = env.board.copy()
        new_env.current_player = env.current_player
        new_env.done = env.done
        new_env.winner = env.winner
        return new_env

# ============ AlphaZero Trainer ============
class AlphaZeroTrainer:
    def __init__(self, board_size=15, num_simulations=100):
        self.env = GomokuEnv(board_size)
        self.model = PolicyValueNet(board_size)
        self.mcts = MCTS(self.model, num_simulations=num_simulations)
        self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3, weight_decay=1e-4)

        self.replay_buffer = deque(maxlen=10000)
        self.batch_size = 256

    def self_play(self, num_games=10, temperature=1.0):
        """Self-play to generate data"""
        for _ in range(num_games):
            states, policies, current_players = [], [], []
            self.env.reset()

            step = 0
            while not self.env.done:
                state = self.env.get_state()
                policy = self.mcts.search(self.env)

                states.append(state)
                policies.append(policy)
                current_players.append(self.env.current_player)

                # Temperature sampling
                if step < 30:
                    action_probs = policy ** (1.0 / temperature)
                    action_probs /= action_probs.sum()
                    action_idx = np.random.choice(len(policy), p=action_probs)
                else:
                    action_idx = np.argmax(policy)

                action = (action_idx // self.env.size, action_idx % self.env.size)
                self.env.step(action)
                step += 1

            # Save data
            winner = self.env.winner
            for state, policy, player in zip(states, policies, current_players):
                # Value from current player's perspective
                value = winner * player
                self.replay_buffer.append((state, policy, value))

    def train_step(self):
        """Train one step"""
        if len(self.replay_buffer) < self.batch_size:
            return None

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, target_policies, target_values = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        target_policies = torch.FloatTensor(np.array(target_policies))
        target_values = torch.FloatTensor(np.array(target_values)).unsqueeze(1)

        # Forward pass
        log_probs, values = self.model(states)

        # Loss
        policy_loss = -torch.mean(torch.sum(target_policies * log_probs, dim=1))
        value_loss = torch.mean((target_values - values) ** 2)
        loss = policy_loss + value_loss

        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return {
            'loss': loss.item(),
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item()
        }

    def train(self, num_iterations=100, games_per_iter=10, train_steps_per_iter=100):
        """Complete training loop"""
        for i in range(num_iterations):
            print(f"\n=== Iteration {i+1}/{num_iterations} ===")

            # Self-play
            print("Self-playing...")
            self.self_play(num_games=games_per_iter)

            # Train
            print("Training network...")
            losses = []
            for _ in range(train_steps_per_iter):
                loss_dict = self.train_step()
                if loss_dict:
                    losses.append(loss_dict['loss'])

            if losses:
                print(f"Average loss: {np.mean(losses):.4f}")
            print(f"Buffer size: {len(self.replay_buffer)}")

# ============ Play Against AI ============
def play_against_ai(model, board_size=15):
    """Play against trained AI"""
    env = GomokuEnv(board_size)
    mcts = MCTS(model, num_simulations=400)

    print("You are black (X), AI is white (O)")
    print("Enter coordinates like: 7 7")

    while not env.done:
        # Display board
        print("\nCurrent board:")
        for i in range(board_size):
            row = []
            for j in range(board_size):
                if env.board[i, j] == 1:
                    row.append('X')
                elif env.board[i, j] == -1:
                    row.append('O')
                else:
                    row.append('.')
            print(' '.join(row))

        if env.current_player == 1:
            # Human turn
            while True:
                try:
                    r, c = map(int, input("Your move: ").split())
                    if (r, c) in env.legal_actions():
                        break
                    else:
                        print("Illegal position!")
                except ValueError:
                    print("Input format: row col")
            action = (r, c)
        else:
            # AI turn
            print("AI thinking...")
            policy = mcts.search(env)
            action_idx = np.argmax(policy)
            action = (action_idx // board_size, action_idx % board_size)
            print(f"AI moves: {action}")

        env.step(action)

    # Game over
    print("\nFinal board:")
    for i in range(board_size):
        row = []
        for j in range(board_size):
            if env.board[i, j] == 1:
                row.append('X')
            elif env.board[i, j] == -1:
                row.append('O')
            else:
                row.append('.')
        print(' '.join(row))

    if env.winner == 1:
        print("\nYou won!")
    elif env.winner == -1:
        print("\nAI won!")
    else:
        print("\nDraw!")

# ============ Main Program ============
if __name__ == "__main__":
    # Train
    trainer = AlphaZeroTrainer(board_size=9, num_simulations=100)
    trainer.train(num_iterations=50, games_per_iter=5, train_steps_per_iter=50)

    # Save model
    torch.save(trainer.model.state_dict(), "alphazero_gomoku.pth")

    # Play against AI
    # model = PolicyValueNet(board_size=9)
    # model.load_state_dict(torch.load("alphazero_gomoku.pth"))
    # model.eval()
    # play_against_ai(model, board_size=9)

Code Explanation

Environment part:
  • GomokuEnv implements Gomoku (five in a row), including state representation, legal actions, and win detection
  • get_state() returns a 3-channel tensor: current player's stones, opponent's stones, color indicator

Network part:
  • PolicyValueNet is a dual-head network: the policy head outputs log action probabilities $\log \mathbf{p}$, the value head outputs $v \in [-1, 1]$
  • Uses 3 convolutional layers to extract features, then separate fully connected layers for policy and value

MCTS part:
  • MCTSNode stores the prior, visit count, and value sum
  • select_child() implements the UCT formula
  • expand() adds child nodes and initializes their priors
  • backup() updates node statistics, applied back along the search path with alternating signs

Training part:
  • self_play() generates $(s, \pi, z)$ data; the first 30 moves use temperature $\tau = 1$ sampling, then greedy selection
  • train_step() optimizes the policy loss $-\boldsymbol{\pi}^\top \log \mathbf{p}$ and the value loss $(z - v)^2$

Running example:
  • On a 9x9 board with 100 simulations, the AI reaches a decent level after 50 training iterations
  • On a 15x15 board with 400 simulations, training for hundreds of iterations can approach amateur level

Deep Q&A: Understanding MCTS and AlphaGo Series

Q1: Where does the $\sqrt{\ln N / n}$ term in the UCT formula come from?

Intuition: It comes from confidence intervals in statistics. Suppose action $a$'s true value is $\mu$ and we have observed $n$ samples with average $\bar{X}_n$. By the central limit theorem, the standard error of the sample mean is proportional to $1/\sqrt{n}$. To avoid missing good actions with high confidence, we add an upper bound of this order to the empirical mean.

Mathematical derivation: UCT is based on Hoeffding's inequality:

$$P\left(\bar{X}_n \ge \mu + \epsilon\right) \le e^{-2n\epsilon^2}$$

Set the right side equal to $N^{-4}$ (i.e., confidence $1 - N^{-4}$, where $N$ is the total number of plays):

$$e^{-2n\epsilon^2} = N^{-4}$$

Solving:

$$\epsilon = \sqrt{\frac{2\ln N}{n}}$$

Therefore the confidence upper bound is $\bar{X}_n + \sqrt{2\ln N / n}$. This is exactly UCT's form (up to constant factors).

Physical meaning:
  • When $n$ is small, $\sqrt{\ln N / n}$ is large: the algorithm "optimistically" assumes this action might be good
  • When $N$ increases while $n$ doesn't, the bound grows: the algorithm is driven by "curiosity" to explore
  • When $n$ is large, the bound is small: the algorithm "confidently" relies on the empirical estimate

Q2: Why did AlphaGo need fast rollout policy?

Historical reason: When AlphaGo was developed (2015-2016), value network training wasn't mature enough and was prone to overfitting to the training distribution. The fast rollout provided an independent evaluation signal, like a "voting" mechanism:
  • $v_\theta$: deep learning's global judgment (slow but accurate)
  • $p_\pi$'s rollout: Monte Carlo sampling's local verification (fast but noisy)

The mixture $V(s_L) = (1-\lambda) v_\theta(s_L) + \lambda z_L$ reduced the variance of either estimate alone.

Why doesn't AlphaGo Zero need it? By 2017, residual networks and batch normalization had matured, greatly improving the value network's generalization. More importantly, AlphaGo Zero's self-play data covered a broader state space, while AlphaGo's data mainly came from (biased) human game records. Experiments showed pure value-network evaluation was accurate enough; the rollout actually introduced noise.

Numerical comparison: each AlphaGo MCTS simulation took ~5 ms (including a 2 ms rollout); AlphaGo Zero's pure value evaluation needed only 3 ms, with a 600 Elo improvement.

Q3: Can self-play fall into local optima?

Theoretical concern: In game theory, self-play is similar to Fictitious Self-Play, $\pi_{t+1} = \text{BR}(\bar{\pi}_t)$, where $\text{BR}$ is the best-response operator. In general games this iteration doesn't guarantee convergence to a Nash equilibrium and may cycle (as in rock-paper-scissors).

Why does AlphaGo Zero succeed?
1. The specificity of two-player zero-sum perfect-information games: this class of games has a unique equilibrium value obtainable by minimax, and fictitious self-play has convergence guarantees here.
2. Exploration mechanism: MCTS's UCT formula forces exploring all actions, avoiding premature policy collapse onto a single move.
3. Random initialization: the early random policy produces diverse games, laying a broad foundation.
4. Opponent pool: although AlphaGo Zero mainly plays against the current best network, mini-batch sampling during training includes data from different periods, acting as an implicit opponent pool.

Experimental evidence: The paper's ablation study shows that if exploration is removed (setting $c_{\mathrm{puct}} = 0$), training stalls; but as long as $c_{\mathrm{puct}} > 0$, policy diversity suffices to support continuous improvement.

Q4: Was AlphaZero's victory over Stockfish fair?

Controversial points:
- Stockfish ran on a single-core CPU, while AlphaZero used 4 TPUs (equivalent to thousands of cores of computing power)
- Stockfish used fixed-depth alpha-beta search (~60 million nodes/second); AlphaZero searched only 80k nodes/second, but evaluated each node with a neural network
- Stockfish had opening books and endgame tablebases; AlphaZero relied entirely on real-time computation

DeepMind's defense:
- Match time was fixed (1 minute per side), so raw computational resources weren't the main factor
- Although Stockfish searched nodes quickly, most of that was ineffective exploration; AlphaZero's effective search depth was deeper (average 60 plies vs 30)
- In follow-up tests against multi-core Stockfish, AlphaZero still maintained its advantage

Deeper issue: This reflects the difference between two AI paradigms:
- Symbolic AI: Stockfish uses hand-crafted heuristics (like "control the center", "king safety") plus brute-force search
- Connectionist AI: AlphaZero uses learned implicit knowledge and highly selective search

AlphaZero's playing style is closer to human masters - willing to sacrifice material for long-term positional advantage, while traditional engines are more "materialistic".

Q5: What exactly does MuZero's hidden state $s^k$ represent?

Key insight: $s^k$ doesn't need to represent the complete environment state, only an abstract representation sufficient for planning. For example, in Atari's Ms. Pac-Man:
- The complete state includes pixels, enemy positions, score, level information, etc.
- But planning only needs high-level information like "which direction to go without being caught"
- $s^k$ might only encode "the direction of safe zones" and "the rough distribution of dots", ignoring irrelevant details (like background patterns)

Mathematically: Define the real environment's transition function as $T(s,a)$ and reward function as $R(s,a)$. MuZero's learned dynamics $g_\theta$ satisfies

$$\pi\big(g_\theta(s, a)\big) \approx \pi^*\big(T(s, a)\big), \qquad v\big(g_\theta(s, a)\big) \approx v^*\big(T(s, a)\big),$$

i.e., the policy and value computed in the latent model are equivalent in the planning sense to the true policy and value: a search tree expanded with $g_\theta$ produces the same optimal action as a tree expanded with the true $T$.

Why not directly predict the next frame?
- Pixel-level prediction must model irrelevant visual details (like drifting clouds), wasting capacity
- Latent-space transitions can be more abstract, similar to human "mental simulation": we don't need to imagine realistic images to plan, only functional causal relationships

Experimental verification: The paper shows that $s^k$'s dimension (like 512D) is much smaller than the raw pixel observation, yet sufficient to support 50-step unrolled planning, proving the abstraction's effectiveness.

Q6: What's the fundamental difference between MCTS and Model-Free RL (like PPO)?

Planning vs Learning:
- MCTS: at each decision step, it executes search in real time, computing the locally optimal action for that state. This is planning: using a model to look ahead.
- PPO and other Model-Free methods: they train a policy network $\pi_\theta$ on massive experience and directly query the network during execution. This is learning: memorizing past experience.

Computational overhead:
- MCTS: each decision needs hundreds to thousands of simulations, computationally expensive, but no training is required
- PPO: the training phase needs millions of environment interactions, but execution is a single forward pass, extremely fast

Generalization ability:
- MCTS: as long as a model is available, it can immediately adapt to new states (like mid-game variations)
- PPO: it only generalizes to states near the training distribution and may fail in completely new situations

AlphaZero's fusion combines the advantages of both:
- During training, MCTS generates a high-quality search policy $\pi_{\mathrm{MCTS}}$ (planning)
- Supervised learning distills $\pi_{\mathrm{MCTS}}$ into the network (learning)
- During execution, the network accelerates MCTS search (synergy)

This "System 1 (fast intuition) + System 2 (slow reasoning)" dual-system architecture coincides with human cognitive science models.

Q7: Why does AlphaGo's value network sample only one state per game?

Overfitting risk: Different states in the same game are highly correlated; the values of the first 10 moves all depend on the same final outcome. Sampling them all would mean:
- The same game contributes hundreds of samples to the training set
- These samples aren't independent and identically distributed (i.i.d.), violating the supervised-learning assumption
- The network memorizes specific game patterns rather than learning general value estimation

Mathematically: Suppose a game has $T$ steps; sampling them all produces $T$ samples $(s_1, z), \ldots, (s_T, z)$ with identical labels. The i.i.d. variance formula $\mathrm{Var}(\hat{g}) = \sigma^2/(nT)$ then falsely understates the gradient noise, because the samples aren't independent: the effective sample count is closer to the number of games $n$ than to $nT$. The optimizer becomes overconfident and stops exploring too early.

Paper experiment: Three sampling strategies were compared:
1. Sample all states per game: value network training-set MSE = 0.01, test-set MSE = 0.37 (severe overfitting)
2. Sample one state per game: training-set MSE = 0.19, test-set MSE = 0.23 (good generalization)
3. Sample a fixed proportion per game (like 10%): in between

Finally chose strategy 2, using 30 million games to generate 30 million independent samples.
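Strategy 2 can be sketched as follows; the game representation here (a list of states plus a ±1 outcome) is a simplification for illustration:

```python
import random

def sample_training_set(games, rng=random.Random(0)):
    """One (state, outcome) pair per game, so labels are (nearly) independent.
    Each game is a (states, outcome) tuple with outcome in {+1, -1}."""
    data = []
    for states, outcome in games:
        s = rng.choice(states)      # a single random position from this game
        data.append((s, outcome))
    return data

# 1000 toy games of 150 positions each -> exactly 1000 training samples
games = [(["g%d_move%d" % (g, t) for t in range(150)], random.choice([1, -1]))
         for g in range(1000)]
data = sample_training_set(games)
print(len(data))  # -> 1000
```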

Q8: How many parameters does AlphaZero's residual network have, and how is it trained?

Architecture details:
- 20 residual blocks, each containing:
  - $3 \times 3$ convolution, 256 filters
  - Batch normalization
  - ReLU activation
  - Skip connection
- Total parameters: approximately 46 million

Training configuration:
- Batch size: 4096 (using 64 TPUs in parallel)
- Optimizer: SGD with momentum 0.9
- Learning rate: initial 0.2, decayed to 0.02 and then 0.002, stepping every 100k steps
- Weight decay: $10^{-4}$
- Training time: Go 72 hours, chess 9 hours, shogi 13 hours

Key tricks:
1. Data augmentation: AlphaGo Zero exploited Go's 8-fold symmetry (rotations + reflections); AlphaZero dropped augmentation to stay game-agnostic, since chess and shogi lack such symmetries
2. Temperature annealing: the first 30 moves use $\tau = 1$ (sample moves in proportion to visit counts); afterwards $\tau \to 0$ (play the most-visited move)

  3. Dirichlet noise: add noise to the root node's prior to encourage exploration: $P(s,a) = (1-\varepsilon) p_a + \varepsilon \eta_a$, where $\eta \sim \mathrm{Dir}(\alpha)$ with $\varepsilon = 0.25$ and $\alpha = 0.03$ for Go
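The Dirichlet-noise trick can be sketched in a few lines, using the Go settings reported in the paper ($\varepsilon = 0.25$, $\alpha = 0.03$); the function name is mine, and the toy uniform prior is illustrative:

```python
import numpy as np

def add_root_noise(priors, eps=0.25, alpha=0.03, rng=np.random.default_rng(0)):
    """Mix Dirichlet noise into the root prior: P = (1 - eps) * p + eps * eta.
    eps = 0.25 and alpha = 0.03 are the Go settings from the AlphaGo Zero paper."""
    eta = rng.dirichlet([alpha] * len(priors))
    return (1 - eps) * np.asarray(priors) + eps * eta

priors = np.full(8, 1 / 8)   # toy uniform prior over 8 candidate moves
noisy = add_root_noise(priors)
print(noisy.sum())           # still a probability distribution (sums to 1)
```

Because $\alpha$ is small, the noise concentrates on a few random moves, occasionally forcing the search to try actions the prior rates poorly.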

Q9: How does MuZero avoid model error accumulation?

Challenge: In Model-Based RL, model prediction error compounds with planning steps, roughly like $O\big((1+\epsilon)^K\big)$, where $\epsilon$ is the single-step error and $K$ is the number of unroll steps. This makes long-term planning unreliable.

MuZero's solution:

1. Latent-space modeling: $s^k$ is an abstract representation that doesn't predict pixel-level details, reducing the single-step error
2. Short-horizon unrolling: training unrolls only 5 steps, not dozens, limiting error accumulation
3. Value guidance: the model doesn't need to perfectly predict the future, only to keep the value estimate $v^k$ accurate. Even if the dynamics $g_\theta$'s prediction is biased, as long as the prediction head $f_\theta$ corrects the bias, planning remains effective
4. End-to-end training: the loss directly optimizes policy and value, not reconstruction accuracy, so the model automatically learns to ignore details that don't affect decisions

Mathematically: Traditional Model-Based RL optimizes reconstruction error, $\min_\theta \lVert \hat{s}_{t+1} - s_{t+1} \rVert^2$, while MuZero optimizes decision quality, $\min_\theta \sum_{k=0}^{K} \big( \ell^p(\pi_{t+k}, p^k) + \ell^v(z_{t+k}, v^k) + \ell^r(u_{t+k}, r^k) \big)$. The latter allows the model to "cheat" wherever it doesn't affect $\pi$ and $v$, avoiding wasted capacity.
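A toy sketch of the unrolled objective, assuming user-supplied representation/dynamics/prediction functions `h`, `g`, `f` and squared-error terms in place of the paper's cross-entropy losses; note that nothing here penalizes reconstructing the observation:

```python
def muzero_loss(obs, actions, target_policies, target_values, target_rewards,
                h, g, f, K=5):
    """K-step unrolled MuZero-style loss (sketch). h encodes the observation,
    g is the latent dynamics (reward, next state), f predicts (policy, value).
    Only decision quality is penalized, never pixel reconstruction."""
    s = h(obs)                       # s^0 = h(o)
    loss = 0.0
    for k in range(K + 1):
        p, v = f(s)                  # predict policy and value from latent state
        loss += (v - target_values[k]) ** 2
        loss += sum((pi - ti) ** 2 for pi, ti in zip(p, target_policies[k]))
        if k < K:
            r, s = g(s, actions[k])  # latent transition, no frame prediction
            loss += (r - target_rewards[k]) ** 2
    return loss

# toy stubs: the latent state is a single float, policy is over 2 actions
h = lambda o: float(o)
g = lambda s, a: (0.0, s + a)        # (reward, next latent state)
f = lambda s: ([0.5, 0.5], s)        # (policy, value)
loss_val = muzero_loss(1.0, [0, 1, 0, 1, 0],
                       [[0.5, 0.5]] * 6, [0.0] * 6, [0.0] * 5, h, g, f)
print(loss_val)  # -> 28.0
```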

Q10: Can MCTS be used for continuous action spaces?

Challenge: UCT assumes a finite action set whose child nodes can all be enumerated. In a continuous space (like robot joint angles), actions are infinite and cannot be exhaustively expanded.

Possible extensions:
1. Progressive Widening: gradually increase the number of expanded actions; whenever a node's visit count $n$ grows, allow up to $O(n^\alpha)$ children (sampled from the policy distribution)
2. Local optimization: use gradient-based optimization at leaf nodes to find the best action rather than enumerating
3. Discretization: divide the continuous space into a grid, at the cost of discretization error
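Extension 1 can be sketched as follows; the widening rule $\lceil c \cdot n^\alpha \rceil$ and its constants are a common formulation in the literature, not from the AlphaGo papers:

```python
import math

def max_children(visits, c=1.0, alpha=0.5):
    """Progressive widening: a node visited n times may expand at most
    ceil(c * n^alpha) actions, so branching grows sublinearly with visits
    (c and alpha are tunable; 0 < alpha < 1 is typical)."""
    return max(1, math.ceil(c * visits ** alpha))

def maybe_expand(node_visits, children, sample_action):
    """Expand a new child only when the widening budget allows it;
    sample_action draws a fresh continuous action, e.g. from a policy."""
    if len(children) < max_children(node_visits):
        children.append(sample_action())
    return children

# budget grows as the square root of the visit count
print([max_children(n) for n in (1, 4, 16, 64, 256)])  # -> [1, 2, 4, 8, 16]
```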

AlphaGo series' choice: All versions currently only apply to discrete actions (Go, chess, Atari). DeepMind's robotics projects (like dexterous hand manipulation) use Model-Free DDPG/SAC, not MCTS.

Future direction: Combining MCTS with continuous control is an open problem, may require new tree structures or search strategies.

Core Papers

  1. MCTS Foundations:
Kocsis & Szepesvári (2006). "Bandit based Monte-Carlo Planning". ECML.
    https://arxiv.org/abs/cs/0703062

  2. AlphaGo:
    Silver et al. (2016). "Mastering the game of Go with deep neural networks and tree search". Nature.
    https://www.nature.com/articles/nature16961

  3. AlphaGo Zero:
    Silver et al. (2017). "Mastering the game of Go without human knowledge". Nature.
    https://www.nature.com/articles/nature24270

  4. AlphaZero:
    Silver et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". Science.
    https://arxiv.org/abs/1712.01815

  5. MuZero:
    Schrittwieser et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model". Nature.
    https://arxiv.org/abs/1911.08265

  6. EfficientZero:
    Ye et al. (2021). "Mastering Atari Games with Limited Data". NeurIPS.
    https://arxiv.org/abs/2111.00210
    Improves MuZero's sample efficiency by 100x

  7. Gumbel AlphaZero:
    Danihelka et al. (2022). "Policy Improvement by Planning with Gumbel". ICLR.
    https://arxiv.org/abs/2106.06613
    Improves MCTS exploration using Gumbel sampling

  8. Sampled MuZero:
    Hubert et al. (2021). "Learning and Planning in Complex Action Spaces". ICML.
    https://arxiv.org/abs/2104.06303
    Extends MuZero to large discrete action spaces

Open Source Implementations

Summary and Outlook

The evolution from MCTS to the AlphaGo series demonstrates how AI achieved superiority at the pinnacle of human intelligence - Go. This success wasn't the victory of a single technology, but the perfect fusion of search, learning, and planning:

MCTS provides mathematical balance of exploration and exploitation, focusing limited computational resources on the most promising paths.

Deep learning through policy and value networks compresses millennia of human Go wisdom (AlphaGo) or pure self-play (AlphaGo Zero) into neural network weights, providing high-quality priors and evaluation for search.

Self-play constructs a continuously improving closed loop - policy strengthens, opponent also strengthens, always maintaining moderate challenge, ultimately breaking through the ceiling of human game records.

AlphaZero's generalization proves this paradigm's universality - the same algorithm achieves superhuman level in Go, chess, and shogi, showing "search + learning" potential as foundation for general intelligence.

MuZero further breaks free from rule limitations, achieving planning through learning implicit models, advancing toward the ultimate goal of "planning the world without understanding the world".

Future directions include: - Extension to continuous spaces: Combining MCTS with robot control - Transfer learning: Can models trained on one game quickly adapt to new games? - Multi-agent games: Extending from two-player zero-sum to multi-party cooperation-competition - Open-ended environments: How to plan in tasks without clear win/loss definitions?

AlphaGo's victory isn't an endpoint, but the starting point of combining reinforcement learning with search. It tells us: intelligence is not only memory, but also planning; not only imitation, but also creation. The evolution from AlphaGo to MuZero is the best interpretation of this philosophy.

  • Post title: Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
  • Post author: Chen Kai
  • Create time: 2024-09-06 14:45:00
  • Post link: https://www.chenk.top/reinforcement-learning-8-alphago-and-mcts/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.