In March 2016, when AlphaGo defeated world Go champion Lee Sedol 4-1,
the world was stunned. Go had long been considered the ultimate
fortress for AI: with a search space of roughly $10^{170}$ legal positions
(far exceeding the $\sim 10^{80}$ atoms in the observable universe),
brute-force search is completely ineffective. AlphaGo's success wasn't the
victory of a single algorithm, but the perfect fusion of Monte Carlo
Tree Search (MCTS), deep learning, and reinforcement learning. Even more
astonishing, AlphaGo Zero, developed 18 months later, completely
discarded human game records and learned purely through self-play from
scratch, surpassing millennia of human Go wisdom in just 3 days.
Subsequently, AlphaZero generalized this paradigm to chess and shogi,
while MuZero went further by breaking free from game rules, learning to
plan without knowing the environment dynamics. This series of
breakthroughs not only transformed the landscape of game-playing AI but
also provided a new paradigm for combining reinforcement learning with
search - how to efficiently plan in vast state spaces, how to let AI
discover knowledge autonomously, and how to perform forward planning
without a model. This chapter will start from MCTS's mathematical
foundations, progressively delving into the core designs of AlphaGo,
AlphaGo Zero, AlphaZero, and MuZero, with complete code implementations
to help you truly understand the essence of these algorithms.
Monte
Carlo Tree Search: Balancing Search and Sampling
Why Do We Need MCTS?
In Go, each move has an average of 250 possible placements, with
games averaging 150 moves, resulting in a complete search tree with
roughly $250^{150} \approx 10^{360}$ branches. Chess
has a branching factor of about 35 and averages 80 moves, giving a
search space of "only" $35^{80} \approx 10^{123}$ -
still astronomical, but much smaller than Go. Traditional alpha-beta
pruning works for chess but completely fails for Go. MCTS's core idea
is: don't search all branches uniformly, but concentrate
computational resources on the most promising paths.
Four Steps of MCTS
MCTS builds a search tree by iteratively executing four steps:
1. Selection: Starting from the root node, use a
strategy (like UCT) to select child nodes until reaching a leaf
node.
2. Expansion: If the leaf node isn't a terminal
state, randomly select an unexplored action and add a new child
node.
3. Simulation: From the new node, simulate to game
end using a fast policy (like random play or a simple heuristic),
obtaining a reward $z$.
4. Backpropagation: Propagate the reward $z$ backward along the path,
updating the statistics (visit count $N(s,a)$ and
accumulated value $W(s,a)$) for all nodes
on the path.
These four steps repeat continuously, with the search tree gradually
growing toward promising directions. Finally, at the root node, select
the action with the most visits as the actual decision.
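To make the four-step loop concrete, here is a minimal, self-contained sketch on a toy Nim-style counting game (take 1-3 stones, whoever takes the last stone wins) - a hypothetical example for illustration, not the Gomoku implementation later in this chapter:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state      # stones remaining in the Nim-like toy game
        self.parent = parent
        self.action = action
        self.children = []
        self.untried = [a for a in (1, 2, 3) if a <= state]
        self.N = 0              # visit count
        self.W = 0.0            # accumulated value (for the player who moved here)

    def uct_child(self, c=1.4):
        # Step 1 (Selection): child maximizing average value + exploration bonus
        return max(self.children,
                   key=lambda ch: ch.W / ch.N
                   + c * math.sqrt(math.log(self.N) / ch.N))

def rollout(state):
    """Step 3 (Simulation): random play; returns +1 if the player to move
    at `state` eventually takes the last stone, else -1."""
    sign = 1
    while state > 0:
        state -= random.randint(1, min(3, state))
        if state == 0:
            return sign
        sign = -sign
    return sign

def mcts(root_state, iters=3000):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Step 1: Selection - descend while fully expanded and non-terminal
        while not node.untried and node.children:
            node = node.uct_child()
        # Step 2: Expansion - add one unexplored action
        if node.untried:
            a = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.state - a, parent=node, action=a)
            node.children.append(child)
            node = child
        # Step 3: Simulation - value for the player who moved into this node
        if node.state == 0:
            v = 1.0                   # the move into this node took the last stone
        else:
            v = -rollout(node.state)  # rollout is from the opponent's view
        # Step 4: Backpropagation - flip perspective at each level
        while node is not None:
            node.N += 1
            node.W += v
            v = -v
            node = node.parent
    # Final decision: the most-visited root action
    return max(root.children, key=lambda ch: ch.N).action
```

With a few thousand iterations, root visit counts concentrate on the winning move: from 5 stones, taking 1 leaves the opponent a losing position of 4.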
UCT
Algorithm: Mathematical Balance of Exploration and Exploitation
The key to the selection step is balancing
exploration and exploitation. The UCT
(Upper Confidence bounds applied to Trees) algorithm uses the following
formula:

$$\mathrm{UCT}(s,a) = \frac{W(s,a)}{N(s,a)} + c \sqrt{\frac{\ln N(s)}{N(s,a)}}$$

where:
- $W(s,a)$: accumulated value from selecting action $a$ in state $s$
- $N(s,a)$: visit count of action $a$ in state $s$
- $N(s)$: visit count of state $s$
- $c$: exploration constant (typically $\sqrt{2}$)
The first term $\frac{W(s,a)}{N(s,a)}$ is the average
value, representing exploitation - selecting known good
actions.
The second term $c\sqrt{\ln N(s)/N(s,a)}$ is the confidence upper bound,
representing exploration - encouraging trying
less-visited actions.
This formula has deep mathematical roots - it comes from the UCB1
algorithm for the Multi-Armed Bandit problem. When
action $a$'s visit count $N(s,a)$ is small, the second term is large,
and the algorithm tends to explore it; when $a$ has been visited many
times, the second term decreases, and the algorithm relies more on the
average value.
Key property: UCT guarantees asymptotic
optimality - as the simulation count approaches infinity, UCT's
action selection probability converges to the optimal action.
Mathematically, this is proven through Hoeffding's inequality:

$$P\left( \left| \hat{Q}(s,a) - Q^*(s,a) \right| \ge \epsilon \right) \le 2 \exp\left( -2 N(s,a)\, \epsilon^2 \right)$$

This inequality shows that as $N(s,a)$ increases, the estimate
$\hat{Q}(s,a)$ converges to the true value $Q^*(s,a)$ at an exponential
rate.
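The formula translates directly into a small helper. This is a minimal sketch assuming the common default $c = \sqrt{2}$, with unvisited actions treated as infinitely attractive:

```python
import math

def uct_score(W, n, N_parent, c=math.sqrt(2)):
    """UCT(s,a) = W/n + c * sqrt(ln N(s) / n); unvisited actions come first."""
    if n == 0:
        return float("inf")
    return W / n + c * math.sqrt(math.log(N_parent) / n)

# Parent visited 100 times: a well-explored strong action vs a barely tried one
good = uct_score(W=60.0, n=80, N_parent=100)  # mean 0.75, small bonus
rare = uct_score(W=1.0, n=3, N_parent=100)    # mean ~0.33, large bonus
```

Despite its lower average value, the rarely tried action scores higher: the exploration bonus dominates until it has been sampled enough times.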
MCTS Advantages and
Limitations
Advantages:
- No domain knowledge needed: Basic MCTS only requires the game rules (a simulator), no heuristic evaluation function
- Anytime algorithm: Can stop at any time and return the current best action; longer computation yields better results
- Adaptive: Automatically invests more computational resources in promising branches

Limitations:
- Computationally expensive: Requires many simulations for reliable estimates
- Random policy inefficient: In complex games, random simulation rarely reaches meaningful states
- Cannot directly handle continuous action spaces: UCT assumes actions are discrete and finite
AlphaGo's breakthrough precisely solved the latter two problems -
replacing random simulation with neural networks and providing prior
knowledge to guide search.
AlphaGo's Three-Stage Training
Stage 1: Supervised Learning Policy Network
- Using 160,000 human expert games, train a 13-layer convolutional neural network $p_\sigma$ to predict human moves
- Input: $19 \times 19 \times 48$ board features (current stones, historical stones, legality, etc.)
- Output: 361-dimensional action probability distribution (19x19 board positions)
- Accuracy: 57.0% agreement with human experts on the test set

Stage 2: Reinforcement Learning Policy Network
- Initialize with the supervised network
- Optimize through self-play using the policy gradient $\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t$, where $z_t$ is the game outcome (win/loss)
- Opponents are randomly sampled from a pool of historical policy versions, preventing overfitting
- Compared to the supervised policy, the RL policy $p_\rho$ wins 80% of games

Stage 3: Value Network
- Train a network $v_\theta$ to predict the win rate at state $s$ under policy $p$: $v^p(s) = \mathbb{E}\left[ z_t \mid s_t = s,\ a_{t \ldots T} \sim p \right]$
- Data generation: self-play with the RL policy generates 30 million (state, outcome) pairs
- Key trick: sample only one state per game, avoiding overfitting from highly correlated states in the same game
- Test set MSE: 0.226 (a significant reduction from the random baseline of 0.5)
AlphaGo's MCTS Integration
AlphaGo's MCTS differs from the traditional four steps:

1. Selection: Uses a modified UCT (PUCT) formula:

$$a_t = \arg\max_a \left( Q(s,a) + u(s,a) \right), \qquad u(s,a) = c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$$

Here $P(s,a)$ is the supervised policy network's prior probability. Note:
- The prior $P(s,a)$ guides search toward human expert-approved directions
- When $N(s,a) = 0$, $u(s,a)$ is large, encouraging exploration of high-prior actions
- As $N(s,a)$ increases, $u(s,a)$ decreases, and the algorithm relies more on the value $Q(s,a)$

2. Expansion: Upon reaching a leaf node, initialize all actions' priors with the policy network: $P(s,a) = p_\sigma(a \mid s)$

3. Evaluation: No longer performs random simulation, but uses a weighted average of the value network $v_\theta(s_L)$ and the fast rollout policy $p_\pi$ (a small linear softmax):

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

where $z_L$ is the result of simulating to a terminal state with $p_\pi$, and $\lambda = 0.5$.

4. Backpropagation: Update node statistics on the path:

$$N(s,a) \leftarrow N(s,a) + 1, \qquad W(s,a) \leftarrow W(s,a) + V(s_L), \qquad Q(s,a) = \frac{W(s,a)}{N(s,a)}$$

Final decision: At the root node, the action selection probability is proportional to the visit count:

$$\pi(a \mid s_0) \propto N(s_0, a)^{1/\tau}$$

where the temperature parameter $\tau$ controls exploration:
- The first 30 moves use $\tau = 1$ (proportional selection, maintains exploration)
- After that $\tau \to 0$ (select the action with the most visits)
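The modified selection score and the temperature-based final decision can be sketched in a few lines; the constant $c_{\text{puct}} = 5$ here is an illustrative choice, not the paper's exact setting:

```python
import math
import random

def puct(Q, N, P, N_total, c_puct=5.0):
    """AlphaGo-style selection score Q(s,a) + u(s,a), where
    u = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    return Q + c_puct * P * math.sqrt(N_total) / (1 + N)

def choose_move(visit_counts, tau):
    """Sample a root action with probability proportional to N^(1/tau);
    tau = 0 means greedy (pick the most-visited action)."""
    if tau == 0:
        return max(range(len(visit_counts)), key=lambda a: visit_counts[a])
    weights = [n ** (1.0 / tau) for n in visit_counts]
    return random.choices(range(len(visit_counts)), weights=weights)[0]

# A never-visited, high-prior move can outscore a well-explored one:
fresh = puct(Q=0.0, N=0, P=0.9, N_total=100)   # large exploration bonus
known = puct(Q=0.6, N=90, P=0.1, N_total=100)  # mostly its Q value
```

As $N(s,a)$ grows, the $1/(1+N)$ factor shrinks the bonus and the score is dominated by $Q(s,a)$, which is exactly the prior-to-value handoff described above.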
AlphaGo's Network
Architecture Details
Policy Networks $p_\sigma$ and $p_\rho$:
- Input: $19 \times 19 \times 48$ feature planes
  - 8 historical time steps of own/opponent stones (16 planes)
  - 1 color feature (black/white)
  - Legality, captured stones, ladder features, etc.
- Convolutional layers: 13 layers, each with 256 $3 \times 3$ filters, ReLU activation
- Output: $19 \times 19$ softmax distribution

Value Network $v_\theta$:
- Architecture similar to the policy network, with the final layer changed to a single neuron outputting a value in $[-1, 1]$
- Input features include rotation data augmentation

Fast Rollout Policy $p_\pi$:
- Linear softmax using only local $3 \times 3$ pattern features
- Each step takes only 2 microseconds (the policy network needs 3 milliseconds)
- Used to accelerate simulation in the evaluation phase
AlphaGo's Performance
Against Lee Sedol
AlphaGo used 1920 CPUs and 280 GPUs, averaging 40 seconds per move,
performing approximately 100,000 MCTS simulations. Key moves like move
37 (Game 2) and move 78 (Game 4) demonstrated AlphaGo's creativity -
these placements initially seemed unreasonable but were actually
globally optimal deep planning results. Lee Sedol's "divine move" at
move 78 in Game 4 also proved the unique value of human intuition;
AlphaGo fell into confusion for 19 moves afterward, indicating its
insufficient generalization ability for certain rare patterns.
AlphaGo Zero: Self-Play
from Scratch
Three Major Breakthroughs
AlphaGo Zero, published in Nature in October 2017, had three
fundamental improvements over AlphaGo:
1. Completely discards human game records: Uses only
black/white stones and game rules, learning from a completely random
policy through self-play.
2. Single network: Merges the policy and
value networks into one two-headed network $(\mathbf{p}, v) = f_\theta(s)$, where $\mathbf{p}$ is the action probability vector and $v \in [-1, 1]$ is the win-rate estimate.
This design is based on the intuition that both policy and value
depend on understanding the position, so shared representation can
accelerate learning.
3. No fast rollout: Relies entirely on the value
head $v_\theta(s)$ to evaluate leaf nodes; no random simulation is
performed. This makes evaluation more accurate
and simplifies the algorithm.
Training Process
AlphaGo Zero's training is a self-play closed loop:
Step 1: Self-play
- Using the current best network, generate self-play data through MCTS
- At each step, MCTS runs 1600 simulations, computing the improved policy $\pi_t(a) \propto N(s_t, a)^{1/\tau}$
- Save $(s_t, \pi_t, z_t)$ triples: state, improved policy, final outcome

Step 2: Train network
- Sample mini-batches from the replay buffer and optimize the loss function:

$$l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2$$

The first term is the value loss (MSE), the second term is the policy loss (cross-entropy), and the third term is L2 regularization.
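The loss is easy to check on toy numbers. A minimal sketch in plain Python with hypothetical values (no deep-learning framework needed):

```python
import math

def alphazero_loss(z, v, pi, p, theta, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2, on plain Python lists."""
    value_loss = (z - v) ** 2
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p))
    l2 = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2

# Hypothetical numbers: the MCTS target pi is sharper than the network's p
loss = alphazero_loss(z=1.0, v=0.8,
                      pi=[0.7, 0.2, 0.1],   # MCTS visit-count policy
                      p=[0.5, 0.3, 0.2],    # network policy output
                      theta=[0.1, -0.2])    # stand-in for network weights
```

Here the policy cross-entropy dominates; as the network's $\mathbf{p}$ approaches the MCTS target $\boldsymbol{\pi}$ and $v$ approaches $z$, the loss falls toward the entropy floor of $\boldsymbol{\pi}$.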
Step 3: Evaluation and update
- Let the new network play 400 games against the current best network
- If the new network wins more than 55%, it becomes the new best network
AlphaGo Zero's MCTS
The selection step's UCT formula becomes:

$$a = \arg\max_a \left( Q(s,a) + c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right)$$

where the prior $P(s,a) = p_\theta(a \mid s)$ comes from the network output, not a supervised-learning prior.

The evaluation step directly uses $V(s_L) = v_\theta(s_L)$ - no rollout mixing.
Why Does Self-Play Work?
Self-play's success relies on Curriculum Learning:
- Early stage: Both sides are random policies; games end quickly, giving a strong gradient signal
- Middle stage: The policy gradually strengthens, and so does the opponent, always maintaining a moderate challenge
- Late stage: Both sides approach the perfect policy, exploring subtle tactical details
This is more efficient than learning from fixed human game records because:
- Human game records cover a limited state space (mainly opening patterns)
- Humans have biases and errors, and an AI might learn these flaws
- Self-play can continuously explore unknown positions
Mathematically, self-play resembles Fictitious
Self-Play: each iteration computes an (approximate) best response to the
current policy, $\pi_{t+1} \approx \mathrm{BR}(\pi_t)$. In two-player
zero-sum games, this iteration converges to a Nash equilibrium. AlphaGo
Zero's MCTS computes a "local best response" at each step, then distills
it into the network through supervised learning.
Performance Comparison
AlphaGo Zero with 3 days training (4.9 million self-play games)
defeated AlphaGo Lee version 100-0; with 40 days training, defeated
AlphaGo Master version (which had 60 consecutive wins against top human
players online) 89-11. More amazingly, AlphaGo Zero's Elo rating growth
curve is almost linear - from 0 to 5000 points took only 72 hours, while
humans need years to progress from beginner to professional.
AlphaZero:
Generalization to Chess and Shogi
From Specialized to General
AlphaZero, demonstrated in the 2018 Science paper, showed the same
algorithm's dominance across three board games (Go, chess, shogi). The
only modifications were:
1. Game rule encoding: Each game uses a different state representation
- Go: $19 \times 19 \times 17$ planes
- Chess: $8 \times 8 \times 119$ planes (including piece types, move history, castling rights, etc.)
- Shogi: $9 \times 9 \times 362$ planes (including captured pieces in hand)

2. Hyperparameter tuning:
- Chess and shogi MCTS simulations reduced to 800 (due to the smaller branching factor)
- The exploration constant $c_{\mathrm{puct}}$ optimized for each game

3. Network architecture: Unified to 20 residual blocks, each with 256 filters, totaling approximately 46 million parameters.
Training Details and Results
Go: 700k steps (equivalent to 21 million self-play
games), Elo 5018, surpassing AlphaGo Zero
Chess: 300k steps (44 million games), surpassed
Stockfish (traditional strongest engine) after 4 hours, reached Elo 4450
after 9 hours
Shogi: 100k steps (4 million games), surpassed Elmo
(Japanese champion program) after 2 hours, post-training Elo 4650
In 100 games against Stockfish, AlphaZero won 28, drew 72, and lost
none. More importantly, AlphaZero's playing style is more "human-like" -
willing to sacrifice material for positional advantage, while
traditional engines rely more on material calculation. Chess.com
grandmasters described AlphaZero as "a player from the future".
Why Can It Generalize?
AlphaZero's generalization ability comes from:
- Universal search framework: MCTS doesn't depend on game-specific heuristics
- Powerful representation learning: Residual networks automatically learn features, with no hand-crafted design needed
- Self-play robustness: Not limited by human biases; explores a broader strategy space
This paradigm's success shows that: search +
learning is closer to general intelligence than pure
memorization (like opening books) or pure brute-force computation.
MuZero: Planning Without
Rules
Core Challenge
AlphaZero relies on game rules to perform MCTS simulations - in
chess, it knows "knight moves in L-shape", "en passant capture", etc.
But many real-world problems (like robot control, video games) have
unknown or hard-to-program rules. MuZero's breakthrough is:
learning an implicit model of the environment and planning in latent
space.
Three Learned Functions
MuZero learns three functions:

1. Representation function $h_\theta$: Maps the observation history to a hidden state: $s^0 = h_\theta(o_1, \ldots, o_t)$

2. Dynamics function $g_\theta$: Predicts the next hidden state and immediate reward: $(r^k, s^k) = g_\theta(s^{k-1}, a^k)$. Note: $s^k$ is not the real environment state, but a learned abstract representation. $g_\theta$ doesn't need to predict the actual next observation - it only needs to be sufficient for planning.

3. Prediction function $f_\theta$: Outputs policy and value: $(\mathbf{p}^k, v^k) = f_\theta(s^k)$

All three functions share parameters $\theta$ and are trained jointly.
MuZero's MCTS
At the root node, MuZero computes the initial hidden state $s^0 = h_\theta(o_1, \ldots, o_t)$ and then executes MCTS:

Selection: The same PUCT formula as AlphaZero.

Expansion: Upon reaching a leaf node and expanding a new action $a$, use the dynamics model to generate a virtual next state: $(r, s') = g_\theta(s, a)$. Then initialize the new node with the prediction function: $(\mathbf{p}, v) = f_\theta(s')$.

Evaluation: Directly use $v$; during backpropagation, accumulate the rewards along the path into a discounted return:

$$G = r_1 + \gamma r_2 + \cdots + \gamma^{l-1} r_l + \gamma^{l} v$$

Key difference: the entire search tree unfolds in latent space and needs no real environment simulator. Both $g_\theta$ and $f_\theta$ are learned neural networks.
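A toy illustration of searching in latent space: the three functions below are hypothetical hand-written stand-ins (simple arithmetic, not trained networks), meant only to show that a multi-step path unrolls without ever calling an environment simulator:

```python
# Hypothetical stand-ins for MuZero's three functions (hand-written linear
# maps, not trained networks), showing a search path unrolled in latent space.
def h(observation):          # representation: observation -> hidden state s^0
    return [x * 0.5 for x in observation]

def g(state, action):        # dynamics: (s, a) -> (reward, next hidden state)
    reward = state[0] * 0.1 + action
    return reward, [x + action for x in state]

def f(state):                # prediction: s -> (policy, value)
    total = sum(state)
    policy = [x / total for x in state]   # placeholder normalization
    return policy, total / len(state)

# Unroll a 3-action path entirely in latent space - no environment calls.
s = h([1.0, 2.0])
rewards = []
for a in [0, 1, 1]:
    r, s = g(s, a)
    rewards.append(r)
policy, value = f(s)
```

In real MuZero, every node of the search tree holds such a hidden state, and selection/expansion/backup operate on these states exactly as AlphaZero operates on real board positions.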
Training Objective
MuZero's loss function includes three terms:
- Policy loss: cross-entropy between the MCTS improved policy $\pi_{t+k}$ and the network prediction $\mathbf{p}^k$
- Value loss: error between the true return $z_{t+k}$ and the prediction $v^k$
- Reward loss: error between the true immediate reward $u_{t+k}$ and the prediction $r^k$

$$l_t(\theta) = \sum_{k=0}^{K} \left[ l^p(\pi_{t+k}, \mathbf{p}^k) + l^v(z_{t+k}, v^k) + l^r(u_{t+k}, r^k) \right] + c \lVert \theta \rVert^2$$

where $K$ is the number of unroll steps (typically 5 for Atari games). Note:
- Only at $k = 0$ does $s^k$ come from a real observation
- For $k > 0$, $s^k$ comes from recursive prediction by the dynamics model
- The real $u_{t+k}$, $z_{t+k}$, $\pi_{t+k}$ come from actual experience in the replay buffer
This design cleverly combines Model-Based and Model-Free: - Uses
learned model during planning (Model-Based) - Supervised by real
experience during training (Model-Free data efficiency)
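The unrolled loss can be sketched on toy numbers in plain Python. Note one simplification: the paper omits the reward term at $k = 0$, while this sketch applies all three terms at every step:

```python
import math

def muzero_loss(targets, preds, theta=(), c=1e-4):
    """Sum over unroll steps k of: cross-entropy(pi, p) + (z - v)^2 + (u - r)^2,
    plus L2 regularization. (Simplified: the paper skips the reward term
    at k = 0; here it is applied uniformly for brevity.)"""
    total = 0.0
    for (pi, z, u), (p, v, r) in zip(targets, preds):
        total += -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
        total += (z - v) ** 2
        total += (u - r) ** 2
    return total + c * sum(w * w for w in theta)

# One unroll step with hypothetical numbers: (pi, z, u) targets vs (p, v, r)
loss = muzero_loss(
    targets=[([1.0, 0.0], 1.0, 0.0)],
    preds=[([0.9, 0.1], 0.8, 0.0)],
)
```

Each additional unroll step simply appends another (target, prediction) pair, so the gradient flows back through the recursive applications of $g_\theta$.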
Reanalysis Loss
MuZero introduces a key trick: Reanalysis -
reanalyzing old experience with the new network:
- The replay buffer stores $(o_t, a_t, u_t, z_t)$: observation, action, reward, return
- When training the new network, rerun MCTS with it on stored observations, obtaining an improved policy $\pi'_t$
- This increases old data's value as the network improves, similar to Dyna-2's idea
MuZero's Performance
In 57 Atari games, MuZero achieves AlphaZero-level performance
without needing game rules. More importantly, MuZero surpasses
Model-Free Agent57 in some highly stochastic games (like Ms. Pac-Man)
because planning helps maintain consistency in long-term strategy.
In Go and chess, MuZero matches AlphaZero - proving that implicit
models sufficiently express key game structures, even without predicting
pixel-level next frames.
Complete
Code Implementation: MCTS + Neural Network for Gomoku
The following code implements a simplified AlphaZero for Gomoku (five-in-a-row). It covers:
- MCTS's four steps
- A policy-value dual-head network
- Self-play data generation
- The training loop
```python
# ============ AlphaZero Trainer ============
class AlphaZeroTrainer:
    def __init__(self, board_size=15, num_simulations=100):
        self.env = GomokuEnv(board_size)
        self.model = PolicyValueNet(board_size)
        self.mcts = MCTS(self.model, num_simulations=num_simulations)
        self.optimizer = optim.Adam(self.model.parameters(),
                                    lr=1e-3, weight_decay=1e-4)
        self.replay_buffer = deque(maxlen=10000)
        self.batch_size = 256

    def self_play(self, num_games=10, temperature=1.0):
        """Self-play to generate training data"""
        for _ in range(num_games):
            states, policies, current_players = [], [], []
            self.env.reset()
            step = 0
            while not self.env.done:
                state = self.env.get_state()
                policy = self.mcts.search(self.env)
                states.append(state)
                policies.append(policy)
                current_players.append(self.env.current_player)
                # Temperature sampling for the first 30 moves, then greedy
                if step < 30:
                    action_probs = policy ** (1.0 / temperature)
                    action_probs /= action_probs.sum()
                    action_idx = np.random.choice(len(policy), p=action_probs)
                else:
                    action_idx = np.argmax(policy)
                action = (action_idx // self.env.size,
                          action_idx % self.env.size)
                self.env.step(action)
                step += 1
            # Save data with the value from each mover's perspective
            winner = self.env.winner
            for state, policy, player in zip(states, policies,
                                             current_players):
                value = winner * player
                self.replay_buffer.append((state, policy, value))

    def train_step(self):
        """Train one step"""
        if len(self.replay_buffer) < self.batch_size:
            return None
        batch = random.sample(self.replay_buffer, self.batch_size)
        states, target_policies, target_values = zip(*batch)
        states = torch.FloatTensor(np.array(states))
        target_policies = torch.FloatTensor(np.array(target_policies))
        target_values = torch.FloatTensor(np.array(target_values)).unsqueeze(1)
        # Forward pass
        log_probs, values = self.model(states)
        # Loss: cross-entropy for the policy head, MSE for the value head
        policy_loss = -torch.mean(torch.sum(target_policies * log_probs, dim=1))
        value_loss = torch.mean((target_values - values) ** 2)
        loss = policy_loss + value_loss
        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return {
            'loss': loss.item(),
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
        }

    def train(self, num_iterations=100, games_per_iter=10,
              train_steps_per_iter=100):
        """Complete training loop"""
        for i in range(num_iterations):
            print(f"\n=== Iteration {i+1}/{num_iterations} ===")
            # Self-play
            print("Self-playing...")
            self.self_play(num_games=games_per_iter)
            # Train
            print("Training network...")
            losses = []
            for _ in range(train_steps_per_iter):
                loss_dict = self.train_step()
                if loss_dict:
                    losses.append(loss_dict['loss'])
            if losses:
                print(f"Average loss: {np.mean(losses):.4f}")
            print(f"Buffer size: {len(self.replay_buffer)}")
```
```python
# ============ Play Against AI ============
def play_against_ai(model, board_size=15):
    """Play against the trained AI"""
    env = GomokuEnv(board_size)
    mcts = MCTS(model, num_simulations=400)
    print("You are black (X), AI is white (O)")
    print("Enter coordinates like: 7 7")

    def print_board():
        for i in range(board_size):
            row = []
            for j in range(board_size):
                if env.board[i, j] == 1:
                    row.append('X')
                elif env.board[i, j] == -1:
                    row.append('O')
                else:
                    row.append('.')
            print(' '.join(row))

    while not env.done:
        print("\nCurrent board:")
        print_board()
        if env.current_player == 1:
            # Human turn
            while True:
                try:
                    r, c = map(int, input("Your move: ").split())
                    if (r, c) in env.legal_actions():
                        break
                    print("Illegal position!")
                except ValueError:
                    print("Input format: row col")
            action = (r, c)
        else:
            # AI turn
            print("AI thinking...")
            policy = mcts.search(env)
            action_idx = np.argmax(policy)
            action = (action_idx // board_size, action_idx % board_size)
            print(f"AI moves: {action}")
        env.step(action)

    # Game over
    print("\nFinal board:")
    print_board()
    if env.winner == 1:
        print("\nYou won!")
    elif env.winner == -1:
        print("\nAI won!")
    else:
        print("\nDraw!")
```
```python
# ============ Main Program ============
if __name__ == "__main__":
    # Train
    trainer = AlphaZeroTrainer(board_size=9, num_simulations=100)
    trainer.train(num_iterations=50, games_per_iter=5,
                  train_steps_per_iter=50)

    # Save model
    torch.save(trainer.model.state_dict(), "alphazero_gomoku.pth")

    # Play against AI
    # model = PolicyValueNet(board_size=9)
    # model.load_state_dict(torch.load("alphazero_gomoku.pth"))
    # model.eval()
    # play_against_ai(model, board_size=9)
```
Code Explanation
Environment part:
- GomokuEnv implements Gomoku (five-in-a-row), including state representation, legal actions, and win detection
- get_state() returns a 3-channel tensor: current player's stones, opponent's stones, color indicator

Network part:
- PolicyValueNet is a dual-head network: the policy head outputs $\log \mathbf{p}$, the value head outputs $v \in [-1, 1]$
- Uses 3 convolutional layers to extract features, then separate fully connected layers for policy and value

MCTS part:
- MCTSNode stores the prior, visit count, and value sum
- select_child() implements the UCT formula
- expand() adds child nodes and initializes priors
- backup() backpropagates updates along the path

Training part:
- self_play() generates $(s, \pi, z)$ data; the first 30 moves use temperature $\tau = 1$ sampling, then greedy
- train_step() optimizes the policy loss $-\boldsymbol{\pi}^{\top} \log \mathbf{p}$ and the value loss $(z - v)^2$

Running example:
- On a $9 \times 9$ board with 100 simulations, the AI reaches a decent level after 50 training rounds
- On a $15 \times 15$ board with 400 simulations, training for hundreds of rounds can approach amateur level
Deep Q&A:
Understanding MCTS and AlphaGo Series
Q1: Where does the $\sqrt{\ln N(s) / N(s,a)}$ term in the UCT formula come from?

Intuition: It comes from the Confidence
Interval in statistics. Suppose action $a$'s true value is $Q^*$, and we
observe $N$ samples with average value $\hat{Q}$. According to the
central limit theorem, the standard error of the sample mean is
$O(1/\sqrt{N})$. To guarantee not missing good actions with high
confidence, we add an upper confidence bound.

Mathematical derivation: UCT is based on Hoeffding's inequality:

$$P\left( \hat{Q} - Q^* \ge \epsilon \right) \le \exp\left( -2 N \epsilon^2 \right)$$

Let the right side equal $\delta$ (i.e., confidence $1 - \delta$):

$$\exp\left( -2 N \epsilon^2 \right) = \delta \quad \Rightarrow \quad \epsilon = \sqrt{\frac{\ln(1/\delta)}{2N}}$$

Choosing $\delta$ to shrink with the parent's visit count $N(s)$ gives the confidence upper bound:

$$\hat{Q} + \sqrt{\frac{\ln N(s)}{2N}}$$

This is exactly UCT's form (ignoring constant factors).

Physical meaning:
- When $N$ is small, the bonus is large, and the algorithm "optimistically" assumes this action might be good
- When $N(s)$ increases while $N(s,a)$ doesn't, the bonus increases, and the algorithm is "curiosity"-driven to explore
- When $N$ is large, the bonus is small, and the algorithm "confidently" relies on empirical estimates
Q2: Why did
AlphaGo need fast rollout policy?
Historical reason: When AlphaGo was developed
(2015-2016), value network training wasn't mature enough and was
prone to overfitting to the training distribution. The fast rollout
provided an independent evaluation signal, like a "voting" mechanism:
- $v_\theta(s_L)$: deep learning's global judgment (slow but accurate)
- $p_\pi$'s rollout: Monte Carlo sampling's local verification (fast but noisy)

The mixed estimate $V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$ reduced the variance of a single estimate.
Why doesn't AlphaGo Zero need it? By 2017, residual
networks and batch normalization had matured, greatly improving the
value network's generalization ability. More importantly, AlphaGo
Zero's self-play data covered a broader state space, while AlphaGo's data
mainly came from (biased) human game records. Experiments showed pure
value-network evaluation was accurate enough; rollouts actually
introduced noise.
Numerical comparison: Each of AlphaGo's MCTS simulations
took ~5 ms (including a 2 ms rollout); AlphaGo Zero's pure value
evaluation needed only 3 ms, and brought a 600 Elo improvement.
Q3: Can self-play fall
into local optima?
Theoretical concern: In game theory, self-play is
similar to Fictitious Self-Play: $\pi_{t+1} = \mathrm{BR}(\pi_t)$, where
$\mathrm{BR}$ is the best response. This iteration
doesn't guarantee convergence to a Nash equilibrium in
general games and may cycle (like rock-paper-scissors).
Why does AlphaGo Zero succeed?
1. Specificity of two-player zero-sum perfect-information games: This class of games has a unique Nash equilibrium value obtainable by minimax. Fictitious self-play has convergence guarantees in these games.
2. Exploration mechanism: MCTS's UCT formula forces exploring all actions, preventing the policy from prematurely collapsing onto a single move.
3. Random initialization: The early random policy leads to diverse games, laying a broad foundation.
4. Opponent pool: Though AlphaGo Zero mainly plays against the current best network, mini-batch sampling during training includes data from different periods, acting as an implicit opponent pool.

Experimental evidence: The paper's ablation study shows
that if exploration is removed (setting $c_{\mathrm{puct}} = 0$),
training stalls. But as long as $c_{\mathrm{puct}} > 0$, policy diversity
is sufficient to support continuous improvement.
Q4: Was
AlphaZero's victory over Stockfish fair?
Controversial points:
- Stockfish ran on single-core CPU; AlphaZero used 4 TPUs (equivalent to thousands of cores of computing power)
- Stockfish used fixed-depth alpha-beta search (~60 million nodes/second); AlphaZero searched only 80k nodes/second, but evaluated each node with a neural network
- Stockfish had opening books and endgame tablebases; AlphaZero relied entirely on real-time computation

DeepMind's defense:
- Match time was fixed (1 minute per side), so computational resources weren't the main factor
- Though Stockfish searched nodes quickly, most of it was ineffective exploration; AlphaZero's search was deeper (average 60 plies vs 30)
- Follow-up tests against multi-core Stockfish still showed AlphaZero's advantage
Deeper issue: This reflects the difference between two
AI paradigms:
- Symbolic AI: Stockfish uses hand-crafted heuristics (like "control the center", "king safety") and brute-force search
- Connectionist AI: AlphaZero uses learned implicit knowledge and highly selective search
AlphaZero's playing style is closer to human masters - willing to
sacrifice material for long-term positional advantage, while traditional
engines are more "materialistic".
Q5: What exactly does MuZero's hidden state $s^k$ represent?

Key insight: $s^k$ doesn't need to represent the complete
environment state, only an abstract representation sufficient for
planning. For example, in Atari's Ms. Pac-Man:
- The complete state includes pixels, enemy positions, score, level information, etc.
- But planning only needs high-level information like "which direction to go without being caught"
- $s^k$ might only encode "the direction of safe zones" and "the rough distribution of dots", ignoring irrelevant details (like background patterns)
Mathematically: Define the real environment's transition
function as $\mathcal{T}$ and its reward function as $\mathcal{R}$.
MuZero's learned $(g_\theta, f_\theta)$ only needs to be equivalent in
the planning sense: the policy and value produced by planning with
$g_\theta$ match those under the true dynamics - i.e., a search tree
expanded with $g_\theta$ produces the same optimal
action as a tree expanded with the true $\mathcal{T}, \mathcal{R}$.
Why not directly predict the next frame?
- Pixel-level prediction must model irrelevant visual details (like drifting clouds), wasting capacity
- Latent-space transitions can be more abstract, similar to human "mental simulation" - we don't need to imagine realistic images to plan, only functional causal relationships
Experimental verification: The paper shows that $s^k$'s dimension
(like 512) is much smaller than the observation (like $96 \times 96 \times 3$
images), but sufficient to support 50-step unrolled planning,
proving the abstraction's effectiveness.
Q6:
What's the fundamental difference between MCTS and Model-Free RL (like
PPO)?
Planning vs Learning:
- MCTS: At each decision step, it executes search in real time, computing a locally optimal action for that state. This is planning - using a model to look ahead.
- PPO and other Model-Free methods: Train a policy network $\pi_\phi$ through massive experience, then directly query the network during execution. This is learning - memorizing past experience.
Computational overhead:
- MCTS: Each decision needs hundreds to thousands of simulations - computationally expensive, but no training needed
- PPO: The training phase needs millions of environment interactions, but execution needs only one forward pass - extremely fast

Generalization ability:
- MCTS: As long as there's a model, it can immediately adapt to new states (like mid-game variations)
- PPO: Can only generalize to states near the training distribution; may fail in completely new situations
AlphaZero's fusion combines the advantages of both:
- During training, MCTS generates a high-quality policy $\pi$ (planning)
- Supervised learning distills $\pi$ into the network (learning)
- During execution, the network accelerates MCTS search (synergy)
This "System 1 (fast intuition) + System 2 (slow reasoning)"
dual-system architecture coincides with human cognitive science
models.
Q7:
Why does AlphaGo's value network need to sample only one state per
game?
Overfitting risk: Different states in the same game are
highly correlated - the values of the first 10 moves all depend on the
same final outcome. If all states are sampled:
- The same game contributes hundreds of samples to the training set
- These samples aren't independent and identically distributed (i.i.d.), violating the supervised learning assumption
- The network will memorize specific game patterns rather than learning general value estimation

Mathematically: Suppose a game has $T$ steps; sampling all of them
produces $T$ samples with identical labels $z$. The usual
gradient-variance estimate

$$\mathrm{Var}(\hat{g}) \approx \frac{\sigma^2}{N}$$

assumes independent samples. With correlated samples, the effective
sample size is far smaller than $N$, so the true variance is much larger
than this formula suggests. The optimizer becomes overconfident and
stops exploring too early.
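This effect is easy to demonstrate numerically: two batches of 100 win/loss labels, one i.i.d. and one built from 10 games contributing 10 identical labels each, have very different variances of the batch mean:

```python
import random
import statistics

random.seed(0)

def batch_mean(correlated):
    """Mean of 100 win/loss labels: i.i.d., or 10 games x 10 repeated labels."""
    if correlated:
        # 10 games, each contributing 10 states with the SAME outcome label
        labels = [z for _ in range(10)
                  for z in [random.choice([-1, 1])] * 10]
    else:
        labels = [random.choice([-1, 1]) for _ in range(100)]
    return sum(labels) / len(labels)

iid_var = statistics.pvariance([batch_mean(False) for _ in range(2000)])
corr_var = statistics.pvariance([batch_mean(True) for _ in range(2000)])
# Nominally both are "N = 100" batches, but the correlated one behaves
# like N = 10: the variance of its mean is roughly 10x larger.
```

This is exactly why sampling one state per game, despite discarding data, yields far better test-set generalization.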
Paper experiment: Three sampling strategies were compared:
1. Sample all states per game: value network training MSE = 0.01, test MSE = 0.37 (severe overfitting)
2. Sample one state per game: training MSE = 0.19, test MSE = 0.23 (good generalization)
3. Sample a fixed proportion per game (like 10%): results between the two

Strategy 2 was finally chosen, using 30 million games to generate 30
million independent samples.
Q8: How many parameters does AlphaZero's residual network have? How is it trained?

Architecture details:
- 20 residual blocks, each containing:
  - $3 \times 3$ convolution, 256 filters
  - Batch normalization
  - ReLU activation
  - Skip connection
- Total parameters: approximately 46 million

Training configuration:
- Batch size: 4096 (using 64 TPUs in parallel)
- Optimizer: SGD with momentum 0.9
- Learning rate: initial 0.2, decayed to 0.02 after 100k steps, then 0.002
- Weight decay: $10^{-4}$
- Training time: Go 72 hours, chess 9 hours, shogi 13 hours

Key tricks:
1. Data augmentation: Go/shogi use 8-fold symmetry (rotation + flip), chess uses left-right flip
2. Temperature annealing: the first 30 steps use $\tau = 1$, after that $\tau \to 0$
3. Dirichlet noise: add noise to the root node's prior, encouraging exploration:

$$P(s,a) = (1 - \varepsilon)\, p_a + \varepsilon\, \eta_a, \qquad \eta \sim \mathrm{Dir}(\alpha)$$

where $\varepsilon = 0.25$ and $\alpha$ is game-dependent (0.03 for Go).
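The root-noise trick is only a few lines of code. This sketch samples the Dirichlet vector via normalized Gamma draws (the standard construction), with the Go setting $\alpha = 0.03$, $\varepsilon = 0.25$:

```python
import random

def add_dirichlet_noise(priors, alpha=0.03, eps=0.25, rng=None):
    """Root-node exploration noise: P'(a) = (1 - eps) * p_a + eps * eta_a,
    with eta ~ Dirichlet(alpha). alpha = 0.03 is the Go setting."""
    rng = rng or random.Random()
    # A Dirichlet sample is a normalized vector of Gamma(alpha, 1) draws
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    noise = [g / total for g in gammas]
    return [(1 - eps) * p + eps * n for p, n in zip(priors, noise)]

noisy = add_dirichlet_noise([0.5, 0.3, 0.2], rng=random.Random(0))
```

Because both the priors and the noise sum to 1, the mixed vector remains a valid distribution, and every action keeps at least $(1 - \varepsilon)$ of its original prior.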
Q9: How does MuZero avoid model error accumulation?

Challenge: In Model-Based RL, model prediction error
compounds with the planning horizon - roughly on the order of
$(1 + \epsilon)^k$ after $k$ unroll steps, where $\epsilon$ is the
single-step error. This makes long-term planning unreliable.
MuZero's solutions:
1. Latent-space modeling: $s^k$ is an abstract representation that doesn't need to predict pixel-level details, reducing the single-step error
2. Short-horizon unrolling: training only unrolls $K = 5$ steps, not dozens, limiting error accumulation
3. Value guidance: the model doesn't need to perfectly predict the future, only to keep the value estimate $v^k$ accurate. Even if $g_\theta$'s prediction is biased, as long as $v^k$ corrects the bias, planning remains effective
4. End-to-end training: the loss function directly optimizes policy and value, not reconstruction accuracy, so the model automatically learns to ignore details that don't affect decisions
Mathematically: Traditional Model-Based RL optimizes a
reconstruction error such as $\lVert \hat{s}_{t+1} - s_{t+1} \rVert^2$,
while MuZero optimizes decision quality: the losses on the predicted
policy, value, and reward. The latter allows the model to "cheat" in
places that don't affect decisions, avoiding wasted capacity.
Q10: Can MCTS
be used for continuous action spaces?
Challenge: UCT assumes finite action set, can
enumerate all child nodes. In continuous space (like robot joint
angles), actions are infinite, cannot expand.
Possible extensions:
1. Progressive Widening: Gradually increase the number of expanded actions - whenever a node has been visited $n$ times, allow on the order of $n^\alpha$ children (sampled from a policy distribution)
2. Local optimization: Use gradient-based optimization at leaf nodes to find the best action, rather than enumerating
3. Discretization: Divide the continuous space into a grid, at the cost of discretization error
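Progressive widening (option 1) reduces to a simple budget rule; the constants $C$ and $\alpha$ below are illustrative, and `sample_action` stands in for a draw from a continuous policy:

```python
import math

def max_children(visits, C=1.0, alpha=0.5):
    """Progressive widening: allow about C * N(s)**alpha children."""
    return max(1, math.floor(C * visits ** alpha))

def maybe_expand(children, visits, sample_action):
    """Expand a newly sampled action only when the visit budget allows it;
    `sample_action` is a hypothetical draw from a continuous policy."""
    if len(children) < max_children(visits):
        children.append(sample_action())
    return children

# Over 100 visits to one node, the child count grows like sqrt(N):
children = []
for n in range(1, 101):
    maybe_expand(children, n, lambda: n / 100.0)  # stand-in action sample
```

The node's branching factor thus grows sublinearly with its visit count, so deeply searched nodes get finer action coverage while shallow ones stay cheap.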
AlphaGo series' choice: All versions currently only
apply to discrete actions (Go, chess, Atari). DeepMind's robotics
projects (like dexterous hand manipulation) use Model-Free DDPG/SAC, not
MCTS.
Future direction: Combining MCTS with continuous
control is an open problem, may require new tree structures or search
strategies.
AlphaZero:
Silver et al. (2018). "A general reinforcement learning algorithm that
masters chess, shogi, and Go through self-play". Science. https://arxiv.org/abs/1712.01815
MuZero:
Schrittwieser et al. (2020). "Mastering Atari, Go, Chess and Shogi by
Planning with a Learned Model". Nature. https://arxiv.org/abs/1911.08265
EfficientZero:
Ye et al. (2021). "Mastering Atari Games with Limited Data".
NeurIPS. https://arxiv.org/abs/2111.00210
Improves MuZero's sample efficiency by 100x
Gumbel AlphaZero:
Danihelka et al. (2022). "Policy Improvement by Planning with Gumbel".
ICLR. https://arxiv.org/abs/2106.06613
Improves MCTS exploration using Gumbel sampling
Sampled MuZero:
Hubert et al. (2021). "Learning and Planning in Complex Action Spaces".
ICML. https://arxiv.org/abs/2104.06303
Extends MuZero to large discrete action spaces
The evolution from MCTS to the AlphaGo series demonstrates how AI
achieved superiority at the pinnacle of human intelligence - Go. This
success wasn't the victory of a single technology, but the perfect
fusion of search, learning, and planning:
MCTS provides mathematical balance of exploration
and exploitation, focusing limited computational resources on the most
promising paths.
Deep learning through policy and value networks
compresses millennia of human Go wisdom (AlphaGo) or pure self-play
(AlphaGo Zero) into neural network weights, providing high-quality
priors and evaluation for search.
Self-play constructs a continuously improving closed
loop - policy strengthens, opponent also strengthens, always maintaining
moderate challenge, ultimately breaking through the ceiling of human
game records.
AlphaZero's generalization proves this paradigm's
universality - the same algorithm achieves superhuman level in Go,
chess, and shogi, showing "search + learning" potential as foundation
for general intelligence.
MuZero further breaks free from rule limitations,
achieving planning through learning implicit models, advancing toward
the ultimate goal of "planning the world without understanding the
world".
Future directions include: - Extension to continuous
spaces: Combining MCTS with robot control - Transfer
learning: Can models trained on one game quickly adapt to new
games? - Multi-agent games: Extending from two-player
zero-sum to multi-party cooperation-competition - Open-ended
environments: How to plan in tasks without clear win/loss
definitions?
AlphaGo's victory isn't an endpoint, but the starting point of
combining reinforcement learning with search. It tells us:
intelligence is not only memory, but also planning; not only
imitation, but also creation. The evolution from AlphaGo to
MuZero is the best interpretation of this philosophy.
Post title: Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
Post author: Chen Kai
Create time: 2024-09-06 14:45:00
Post link: https://www.chenk.top/reinforcement-learning-8-alphago-and-mcts/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.