Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
Chen Kai BOSS

In March 2016, when AlphaGo defeated world Go champion Lee Sedol 4-1, the world was stunned - Go had long been considered the ultimate fortress for AI, with a search space of roughly $10^{170}$ legal positions (far exceeding the $\sim 10^{80}$ atoms in the observable universe), making brute-force search completely ineffective. AlphaGo's success wasn't the victory of a single algorithm, but the fusion of Monte Carlo Tree Search (MCTS), deep learning, and reinforcement learning. Even more astonishing, AlphaGo Zero, developed 18 months later, completely discarded human game records and learned purely through self-play from scratch, surpassing millennia of human Go wisdom in just 3 days. Subsequently, AlphaZero generalized this paradigm to chess and shogi, while MuZero went further by breaking free from game rules, learning to plan without knowing the environment dynamics. This series of breakthroughs not only transformed the landscape of game-playing AI but also provided a new paradigm for combining reinforcement learning with search - how to plan efficiently in vast state spaces, how to let AI discover knowledge autonomously, and how to plan forward without a model. This chapter starts from MCTS's mathematical foundations, progressively delving into the core designs of AlphaGo, AlphaGo Zero, AlphaZero, and MuZero, with a complete code implementation to help you truly understand the essence of these algorithms.

Monte Carlo Tree Search: Balancing Search and Sampling

Why Do We Need MCTS?

In Go, each position has an average of about 250 legal moves and games average 150 moves, resulting in a complete search tree with roughly $250^{150} \approx 10^{360}$ branches. Chess has a branching factor of about 35 and averages 80 moves, giving a search space of "only" $35^{80} \approx 10^{123}$ - still astronomical, but much smaller than Go's. Traditional alpha-beta pruning works for chess but fails completely for Go. MCTS's core idea is: don't search all branches uniformly, but concentrate computational resources on the most promising paths.

Four Steps of MCTS

MCTS builds a search tree by iteratively executing four steps:

1. Selection: Starting from the root node, use a strategy (like UCT) to select child nodes until reaching a leaf node.

2. Expansion: If the leaf node isn't a terminal state, randomly select an unexplored action and add a new child node.

3. Simulation: From the new node, simulate to the end of the game using a fast policy (random or a simple heuristic), obtaining a reward $z$.

4. Backpropagation: Propagate the reward $z$ backward along the path, updating the statistics (visit count $N(s,a)$ and accumulated value $W(s,a)$) of every node on the path.

These four steps repeat continuously, with the search tree gradually growing toward promising directions. Finally, at the root node, select the action with the most visits as the actual decision.

UCT Algorithm: Mathematical Balance of Exploration and Exploitation

The key to the selection step is balancing exploration and exploitation. The UCT (Upper Confidence bounds applied to Trees) algorithm selects the action maximizing:

$$\text{UCT}(s,a) = \frac{Q(s,a)}{N(s,a)} + c\sqrt{\frac{\ln N(s)}{N(s,a)}}$$

where:
  • $Q(s,a)$: accumulated value of selecting action $a$ in state $s$
  • $N(s,a)$: visit count of selecting action $a$ in state $s$
  • $N(s)$: visit count of state $s$
  • $c$: exploration constant (typically $\sqrt{2}$)

The first term $Q(s,a)/N(s,a)$ is the average value, representing exploitation - selecting known good actions.

The second term $c\sqrt{\ln N(s)/N(s,a)}$ is the confidence upper bound, representing exploration - encouraging trying less-visited actions.

This formula has deep mathematical roots - it comes from the UCB1 algorithm for the Multi-Armed Bandit problem. When action $a$'s visit count $N(s,a)$ is small, the second term is large and the algorithm tends to explore it; when $a$ has been visited many times, the second term decreases and the algorithm relies more on the average value.

Key property: UCT guarantees asymptotic optimality - as the simulation count approaches infinity, UCT's action selection converges to the optimal action. Mathematically, this is proven through Hoeffding's inequality:

$$P\left(\left|\bar{X}_n - \mu\right| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}$$

This inequality shows that as the sample count $n$ increases, the estimate $\bar{X}_n$ converges to the true value $\mu$ at an exponential rate.
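To make the trade-off concrete, here is a minimal sketch of the UCT score (the function name `uct_score` and the toy numbers are our own illustration, not from any paper). Two actions with the same average value get different scores depending on how often they have been tried:

```python
import math

def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Average value (exploitation) + confidence bound (exploration)."""
    if visits == 0:
        return float('inf')  # unvisited actions are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Two actions, both with average value 0.5, under a parent visited 100 times:
rarely_tried = uct_score(total_value=1.0, visits=2, parent_visits=100)
well_tried = uct_score(total_value=25.0, visits=50, parent_visits=100)
# The rarely-tried action gets a much larger exploration bonus,
# so the search will prefer it despite identical empirical means.
```

As the rarely-tried action accumulates visits, its bonus shrinks and the score converges to the plain average value.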

MCTS Advantages and Limitations

Advantages:
  • No domain knowledge needed: basic MCTS only requires the game rules (a simulator), no heuristic evaluation function
  • Anytime algorithm: it can stop at any time and return the current best action; longer computation yields better results
  • Adaptive: it automatically invests more computational resources in promising branches

Limitations:
  • Computationally expensive: reliable estimates require many simulations
  • Random policy is inefficient: in complex games, random simulation rarely reaches meaningful states
  • Cannot directly handle continuous action spaces: UCT assumes actions are discrete and finite

AlphaGo's breakthrough precisely solved the latter two problems - replacing random simulation with neural networks and providing prior knowledge to guide search.

AlphaGo's Three-Stage Training

AlphaGo's training consists of three stages:

Stage 1: Supervised Learning Policy Network $p_\sigma$
  • Using 160,000 human expert games, train a 13-layer convolutional neural network to predict human moves
  • Input: $19 \times 19 \times 48$ board feature planes (current stones, historical stones, legality, etc.)
  • Output: 361-dimensional action probability distribution (the 19x19 board positions)
  • Accuracy: 57.0% consistency with human experts on the test set

Stage 2: Reinforcement Learning Policy Network $p_\rho$
  • Initialize with the supervised network
  • Optimize through self-play using the policy gradient $\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} z_t$, where $z_t$ is the game outcome (win/loss)
  • Opponents are randomly sampled from a pool of historical policy versions, preventing overfitting
  • Compared to the supervised policy, the RL policy wins 80% of games

Stage 3: Value Network $v_\theta$
  • Train a network to predict the win rate at state $s$ under the RL policy: $v_\theta(s) \approx v^{p_\rho}(s)$
  • Data generation: self-play with the RL policy generates 30 million (state, outcome) pairs
  • Key trick: sample only one state per game, avoiding overfitting to the highly correlated states within a single game
  • Test set MSE: 0.226 (a significant reduction from the random baseline of 0.5)

AlphaGo's MCTS Integration

AlphaGo's MCTS differs from traditional four steps:

1. Selection: Uses a modified UCT formula:

$$a_t = \arg\max_a \left( Q(s,a) + u(s,a) \right), \qquad u(s,a) = c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$$

Here $P(s,a)$ is the supervised policy network's prior probability. Note:
  • The prior $P(s,a)$ guides search toward directions human experts approve of
  • When $N(s,a) = 0$, $u(s,a)$ is large, encouraging exploration of high-prior actions
  • As $N(s,a)$ increases, $u(s,a)$ decreases and the algorithm relies more on the value $Q(s,a)$

2. Expansion: Upon reaching a leaf node, initialize every action's prior with the policy network: $P(s,a) = p_\sigma(a \mid s)$

3. Evaluation: No longer performs random simulation; instead uses a weighted average of the value network $v_\theta$ and the fast rollout policy $p_\pi$ (a small linear softmax):

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

where $z_L$ is the result of simulating to a terminal state with $p_\pi$, and $\lambda = 0.5$.

4. Backpropagation: Update the statistics of each node on the path:

$$N(s,a) \leftarrow N(s,a) + 1, \qquad W(s,a) \leftarrow W(s,a) + V(s_L), \qquad Q(s,a) = \frac{W(s,a)}{N(s,a)}$$
Final decision: At the root node, the action selection probability is proportional to the visit count:

$$\pi(a \mid s_0) \propto N(s_0, a)^{1/\tau}$$

where the temperature parameter $\tau$ controls exploration:
  • The first 30 moves use $\tau = 1$ (proportional selection, maintains exploration)
  • After that $\tau \to 0$ (select the action with the most visits)
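The visit-count-to-policy conversion above can be sketched in a few lines of Python (the helper name `visits_to_policy` is our own, not from the papers):

```python
import numpy as np

def visits_to_policy(visit_counts, tau=1.0):
    """pi(a) proportional to N(s0, a)^(1/tau); tau -> 0 approaches argmax."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:
        # Greedy limit: all probability mass on the most-visited action
        policy = np.zeros_like(counts)
        policy[np.argmax(counts)] = 1.0
        return policy
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

counts = [10, 30, 60]
pi_explore = visits_to_policy(counts, tau=1.0)  # proportional to visit counts
pi_greedy = visits_to_policy(counts, tau=0)     # all mass on the most-visited action
```

With `tau=1` the policy is exactly the normalized visit counts; lowering `tau` sharpens the distribution toward the most-visited action.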

AlphaGo's Network Architecture Details

Policy Networks $p_\sigma$ and $p_\rho$:
  • Input: $19 \times 19 \times 48$ feature planes: 8 historical time steps of own/opponent stones (16 planes), 1 color feature (black/white), plus legality, captured stones, ladder features, etc.
  • Convolutional layers: 13 layers, each with 256 $3 \times 3$ filters, ReLU activation
  • Output: $19 \times 19$ softmax distribution

Value Network $v_\theta$:
  • Architecture similar to the policy network, with the final layer changed to a single tanh neuron outputting a scalar value in $[-1, 1]$
  • Input features include rotation data augmentation

Fast Rollout Policy $p_\pi$:
  • A linear softmax using only local pattern features
  • Each move takes only 2 microseconds (the policy network needs 3 milliseconds)
  • Used to accelerate simulation in the evaluation phase

AlphaGo's Performance Against Lee Sedol

AlphaGo used 1920 CPUs and 280 GPUs, averaging 40 seconds per move and performing approximately 100,000 MCTS simulations. Key moves like move 37 in Game 2 demonstrated AlphaGo's creativity - placements that initially seemed unreasonable but were actually the result of globally optimal deep planning. Lee Sedol's "divine move" at move 78 in Game 4, in turn, proved the unique value of human intuition: AlphaGo fell into confusion for 19 moves afterward, revealing insufficient generalization to certain rare patterns.

AlphaGo Zero: Self-Play from Scratch

Three Major Breakthroughs

AlphaGo Zero, published in Nature in October 2017, had three fundamental improvements over AlphaGo:

1. Completely discards human game records: Uses only black/white stones and game rules, learning from a completely random policy through self-play.

2. Single network: Merges the policy and value networks into one two-headed network $f_\theta(s) = (\mathbf{p}, v)$:
  • $\mathbf{p}$ is the action probability vector
  • $v$ is the win-rate estimate

This design is based on the intuition that both policy and value depend on understanding the position, so shared representation can accelerate learning.

3. No fast rollout: Relies entirely on the network's value output $v$ to evaluate leaf nodes; no random simulation is performed. This makes evaluation more accurate and simplifies the algorithm.

Training Process

AlphaGo Zero's training is a self-play closed loop:

Step 1: Self-play
  • Using the current best network, generate self-play data through MCTS
  • At each move, MCTS runs 1600 simulations and computes the improved policy $\pi(a \mid s) \propto N(s,a)^{1/\tau}$
  • Save $(s_t, \pi_t, z_t)$ triples: state, improved policy, final outcome

Step 2: Train network
  • Sample mini-batches from the replay buffer and optimize the loss function:

$$\ell = (z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p} + c\|\theta\|^2$$

The first term is the value loss (MSE), the second the policy loss (cross-entropy), the third L2 regularization.

Step 3: Evaluation and update
  • Let the new network $f_{\theta'}$ play 400 games against the current best network
  • If the new network wins more than 55%, it becomes the new best network
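The three-term loss of Step 2 can be sketched in PyTorch as follows (a minimal illustration; the name `alphazero_loss` and the toy batch are our own, and in practice the L2 term is usually handled by the optimizer's `weight_decay`, as in the full implementation later in this chapter):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(log_p, v, target_pi, target_z, model=None, c=1e-4):
    """(z - v)^2  -  pi^T log p  +  c * ||theta||^2."""
    value_loss = F.mse_loss(v.squeeze(-1), target_z)
    policy_loss = -(target_pi * log_p).sum(dim=1).mean()
    l2 = 0.0
    if model is not None:  # normally delegated to weight_decay instead
        l2 = c * sum((p ** 2).sum() for p in model.parameters())
    return value_loss + policy_loss + l2

# Toy batch: 2 positions, 3 actions
log_p = torch.log(torch.tensor([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]))
v = torch.tensor([[0.1], [-0.4]])
target_pi = torch.tensor([[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]])
target_z = torch.tensor([1.0, -1.0])
loss = alphazero_loss(log_p, v, target_pi, target_z)
```

Note that the policy loss is a cross-entropy against a *distribution* (the MCTS visit distribution), not a one-hot label - this is what distills the search improvement back into the network.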

AlphaGo Zero's MCTS

The selection step's UCT formula becomes:

$$a = \arg\max_a \left( Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right)$$

where $P(s,a)$ comes from the network's policy output, not a supervised-learning prior.

The evaluation step directly uses the network value $v$; there is no rollout mixing.

Why Does Self-Play Work?

Self-play's success relies on an implicit curriculum (Curriculum Learning):
  • Early stage: both sides are random policies; games end quickly, giving a strong gradient signal
  • Middle stage: the policy gradually strengthens, but so does the opponent, always maintaining a moderate challenge
  • Late stage: both sides approach perfect play, exploring subtle tactical details

This is more efficient than learning from fixed human game records because:
  • Human game records cover a limited part of the state space (mainly opening patterns)
  • Humans have biases and errors, which the AI might learn
  • Self-play can continuously explore unknown positions

Mathematically, self-play resembles Fictitious Self-Play: each iteration computes a best response to the previous policies, $\pi_{t+1} = \text{BR}(\bar{\pi}_t)$. In two-player zero-sum games, this iteration converges to a Nash equilibrium. AlphaGo Zero's MCTS computes a "local best response" at each step, then distills it into the network through supervised learning.

Performance Comparison

AlphaGo Zero with 3 days training (4.9 million self-play games) defeated AlphaGo Lee version 100-0; with 40 days training, defeated AlphaGo Master version (which had 60 consecutive wins against top human players online) 89-11. More amazingly, AlphaGo Zero's Elo rating growth curve is almost linear - from 0 to 5000 points took only 72 hours, while humans need years to progress from beginner to professional.

AlphaZero: Generalization to Chess and Shogi

From Specialized to General

AlphaZero, demonstrated in the 2018 Science paper, showed the same algorithm's dominance across three board games (Go, chess, shogi). The only modifications were:

1. Game rule encoding: Each game uses a different state representation:
  • Go: $19 \times 19 \times 17$
  • Chess: $8 \times 8 \times 119$ (piece types, move history, castling rights, etc.)
  • Shogi: $9 \times 9 \times 362$ (including captured pieces in hand)

2. Hyperparameter tuning:
  • Chess and shogi MCTS simulations reduced to 800 (due to the smaller branching factor)
  • The exploration constant $c_{\text{puct}}$ optimized per game

3. Network architecture: Unified to 20 residual blocks, each with 256 filters, totaling approximately 46 million parameters.

Training Details and Results

  • Go: 700k steps (equivalent to 21 million self-play games), Elo 5018, surpassing AlphaGo Zero
  • Chess: 300k steps (44 million games), surpassed Stockfish (traditional strongest engine) after 4 hours, reached Elo 4450 after 9 hours
  • Shogi: 100k steps (4 million games), surpassed Elmo (Japanese champion program) after 2 hours, post-training Elo 4650

In 100 games against Stockfish, AlphaZero won 28, drew 72, and lost none. More importantly, AlphaZero's playing style is more "human-like" - willing to sacrifice material for positional advantage, while traditional engines rely more on material calculation. Chess.com grandmasters described AlphaZero as "a player from the future".

Why Can It Generalize?

AlphaZero's generalization ability comes from:
  • A universal search framework: MCTS doesn't depend on game-specific heuristics
  • Powerful representation learning: residual networks learn features automatically, with no hand-crafted design
  • Self-play robustness: not limited by human biases, it explores a broader strategy space

This paradigm's success shows that: search + learning is closer to general intelligence than pure memorization (like opening books) or pure brute-force computation.

MuZero: Planning Without Rules

Core Challenge

AlphaZero relies on game rules to perform MCTS simulations - in chess, it knows "knight moves in an L-shape", "en passant capture", etc. But many real-world problems (like robot control or video games) have unknown or hard-to-program rules. MuZero's breakthrough is learning an implicit model and planning in latent space.

Three Learned Functions

MuZero learns three functions:

1. Representation function $h_\theta$: Maps the observation history to a hidden state, $s^0 = h_\theta(o_1, \ldots, o_t)$.

2. Dynamics function $g_\theta$: Predicts the next hidden state and immediate reward, $(s^k, r^k) = g_\theta(s^{k-1}, a^k)$. Note: $s^k$ is not the real environment state, but a learned abstract representation; $g_\theta$ doesn't need to predict the actual next observation, only to be sufficient for planning.

3. Prediction function $f_\theta$: Outputs policy and value, $(\mathbf{p}^k, v^k) = f_\theta(s^k)$. All three functions share parameters $\theta$ and are trained jointly.
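The division of labor among the three functions can be sketched as tiny PyTorch modules (shapes, layer sizes, and class names here are our own toy choices, not DeepMind's architecture):

```python
import torch
import torch.nn as nn

LATENT, N_ACTIONS = 32, 4  # toy sizes for illustration

class Representation(nn.Module):  # h_theta(o) -> s^0
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, LATENT))
    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):  # g_theta(s, a) -> (s', r)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT + N_ACTIONS, 64), nn.ReLU())
        self.state_head = nn.Linear(64, LATENT)
        self.reward_head = nn.Linear(64, 1)
    def forward(self, state, action_onehot):
        x = self.net(torch.cat([state, action_onehot], dim=-1))
        return self.state_head(x), self.reward_head(x)

class Prediction(nn.Module):  # f_theta(s) -> (p, v)
    def __init__(self):
        super().__init__()
        self.policy_head = nn.Linear(LATENT, N_ACTIONS)
        self.value_head = nn.Linear(LATENT, 1)
    def forward(self, state):
        return (torch.log_softmax(self.policy_head(state), dim=-1),
                torch.tanh(self.value_head(state)))

# One imagined step, entirely in latent space -- no environment simulator:
h, g, f = Representation(obs_dim=8), Dynamics(), Prediction()
s0 = h(torch.randn(1, 8))
a = torch.nn.functional.one_hot(torch.tensor([2]), N_ACTIONS).float()
s1, r1 = g(s0, a)  # dynamics rolls the latent forward and predicts reward
p1, v1 = f(s1)     # prediction reads policy and value off the latent
```

The key point the sketch makes explicit: `g` consumes and produces only latent states, so a whole search tree can be unrolled without ever calling the real environment.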

MuZero's MCTS

At root node, MuZero executes MCTS:

Selection: The same UCT formula as AlphaZero:

$$a = \arg\max_a \left( Q(s,a) + c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right)$$

Expansion: Upon reaching a leaf node, expand a new action $a$ and use the dynamics model to generate the virtual next state: $(s', r) = g_\theta(s, a)$. Then initialize the new node with the prediction function: $(\mathbf{p}, v) = f_\theta(s')$.

Evaluation: Directly use $v$; during backpropagation, accumulate the rewards along the path into a bootstrapped return:

$$G = \sum_{i=0}^{n-1} \gamma^i r_i + \gamma^n v$$

Key difference: the entire search tree unfolds in latent space and needs no real environment simulator. Both $g_\theta$ and $f_\theta$ are learned neural networks.

Training Objective

MuZero's loss function includes three terms:
  • $\ell^p(\pi_{t+k}, \mathbf{p}^k_t)$: policy loss, the cross-entropy between the MCTS improved policy $\pi$ and the network prediction $\mathbf{p}$
  • $\ell^v(z_{t+k}, v^k_t)$: value loss, the MSE between the true return $z$ and the prediction $v$
  • $\ell^r(u_{t+k}, r^k_t)$: reward loss, the MSE between the true immediate reward $u$ and the prediction $r$

summed over $k = 0, \ldots, K$, where $K$ is the number of unroll steps (typically 5 for Atari games). Note:
  • Only at $k = 0$ does $s^0$ come from a real observation
  • For $k > 0$, $s^k$ comes from recursive prediction by the dynamics model
  • The true $u_{t+k}$ and $z_{t+k}$ come from actual experience in the replay buffer
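The unrolled objective can be sketched in plain Python (the function `muzero_loss`, the simple `cross_entropy` helper, and the scalar stand-in "networks" below are our own toy constructions; a real implementation uses neural networks and gradient descent):

```python
import math

def cross_entropy(target, log_probs):
    return -sum(t * lp for t, lp in zip(target, log_probs))

def muzero_loss(obs, actions, targets, h, g, f, K=2):
    """Sum of policy, value, and reward losses over a K-step latent unroll."""
    pis, zs, us = targets  # MCTS policies, returns, observed rewards
    s = h(obs)             # k = 0: latent state from the real observation
    loss = 0.0
    for k in range(K + 1):
        log_p, v = f(s)
        loss += cross_entropy(pis[k], log_p) + (zs[k] - v) ** 2
        if k < K:
            s, r = g(s, actions[k])  # k > 0: latent rolled forward by the model
            loss += (us[k] - r) ** 2
    return loss

# Scalar stand-ins for the three networks, just to exercise the loop:
h = lambda o: o * 0.5
g = lambda s, a: (s + a, 0.1 * s)  # (next latent, predicted reward)
f = lambda s: ([math.log(0.5), math.log(0.5)], math.tanh(s))

pis = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
zs = [0.0, 0.0, 0.0]
us = [0.0, 0.0]
loss = muzero_loss(obs=1.0, actions=[0.0, 0.0], targets=(pis, zs, us), h=h, g=g, f=f, K=2)
```

The structure mirrors the text: the observation enters only at $k = 0$ through `h`, while every later step is supervised against real targets but computed from the model's own recursive predictions.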

This design cleverly combines Model-Based and Model-Free:
  • Uses the learned model during planning (Model-Based)
  • Is supervised by real experience during training (Model-Free data efficiency)

Reanalysis Loss

MuZero introduces a key trick, Reanalysis: reanalyze old experience with the new network.
  • The replay buffer stores $(o_t, a_t, u_t, z_t)$: observation, action, reward, return
  • When training the new network, rerun MCTS on $o_t$ to obtain an improved policy $\pi_t'$
  • This increases the value of old data as the network improves, similar to Dyna-2's idea
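A toy sketch of the idea (the function name `reanalyze` and the dict-based buffer entries are our own stand-ins; `run_mcts` represents a full MCTS call with the latest network):

```python
def reanalyze(buffer, run_mcts):
    """Refresh stored policy targets with the current (stronger) search.
    Observations, actions, rewards, and outcomes stay as recorded."""
    for entry in buffer:
        entry["pi"] = run_mcts(entry["obs"])  # fresher improved-policy target
    return buffer

buffer = [{"obs": 0, "pi": [0.9, 0.1], "z": 1.0},
          {"obs": 1, "pi": [0.5, 0.5], "z": -1.0}]
# Pretend the new network's search now strongly prefers action 1 everywhere:
buffer = reanalyze(buffer, run_mcts=lambda obs: [0.2, 0.8])
```

Only the policy targets are recomputed; the observed rewards and final outcomes are ground truth and never change.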

MuZero's Performance

In 57 Atari games, MuZero achieves AlphaZero-level performance without needing game rules. More importantly, MuZero surpasses Model-Free Agent57 in some highly stochastic games (like Ms. Pac-Man) because planning helps maintain consistency in long-term strategy.

In Go and chess, MuZero matches AlphaZero - proving that implicit models sufficiently express key game structures, even without predicting pixel-level next frames.

Complete Code Implementation: MCTS + Neural Network for Gomoku

The following code implements a simplified AlphaZero for Gomoku (five in a row). It includes:
  • MCTS's four steps
  • A policy-value dual-head network
  • Self-play data generation
  • The training loop

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

# ============ Gomoku Environment ============
class GomokuEnv:
    def __init__(self, board_size=15):
        self.size = board_size
        self.reset()

    def reset(self):
        self.board = np.zeros((self.size, self.size), dtype=np.int8)
        self.current_player = 1  # 1=black, -1=white
        self.done = False
        self.winner = 0
        return self.get_state()

    def get_state(self):
        """Returns 3-channel state: [current player stones, opponent stones, color indicator]"""
        state = np.zeros((3, self.size, self.size), dtype=np.float32)
        state[0] = (self.board == self.current_player).astype(np.float32)
        state[1] = (self.board == -self.current_player).astype(np.float32)
        state[2] = np.full((self.size, self.size), self.current_player)
        return state

    def legal_actions(self):
        """Returns list of legal action coordinates"""
        return list(zip(*np.where(self.board == 0)))

    def step(self, action):
        """Execute action (row, col)"""
        if self.board[action] != 0:
            raise ValueError(f"Illegal move: {action}")

        self.board[action] = self.current_player

        # Check win
        if self._check_win(action):
            self.done = True
            self.winner = self.current_player
            return self.get_state(), self.winner, True

        # Check draw
        if len(self.legal_actions()) == 0:
            self.done = True
            return self.get_state(), 0, True

        self.current_player *= -1
        return self.get_state(), 0, False

    def _check_win(self, last_move):
        """Check if last_move forms five in a row"""
        directions = [(0, 1), (1, 0), (1, 1), (1, -1)]  # horizontal, vertical, two diagonals
        player = self.board[last_move]

        for dr, dc in directions:
            count = 1  # including last_move itself
            # Positive direction
            r, c = last_move[0] + dr, last_move[1] + dc
            while 0 <= r < self.size and 0 <= c < self.size and self.board[r, c] == player:
                count += 1
                r += dr
                c += dc
            # Negative direction
            r, c = last_move[0] - dr, last_move[1] - dc
            while 0 <= r < self.size and 0 <= c < self.size and self.board[r, c] == player:
                count += 1
                r -= dr
                c -= dc

            if count >= 5:
                return True
        return False

# ============ Policy-Value Network ============
class PolicyValueNet(nn.Module):
    def __init__(self, board_size=15, num_channels=128):
        super().__init__()
        self.size = board_size

        # Shared representation
        self.conv1 = nn.Conv2d(3, num_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(num_channels, num_channels, 3, padding=1)
        self.conv3 = nn.Conv2d(num_channels, num_channels, 3, padding=1)

        # Policy head
        self.policy_conv = nn.Conv2d(num_channels, 2, 1)
        self.policy_fc = nn.Linear(2 * board_size * board_size, board_size * board_size)

        # Value head
        self.value_conv = nn.Conv2d(num_channels, 1, 1)
        self.value_fc1 = nn.Linear(board_size * board_size, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, x):
        # Shared layers
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))

        # Policy head
        p = torch.relu(self.policy_conv(x))
        p = p.view(-1, 2 * self.size * self.size)
        p = torch.log_softmax(self.policy_fc(p), dim=1)

        # Value head
        v = torch.relu(self.value_conv(x))
        v = v.view(-1, self.size * self.size)
        v = torch.relu(self.value_fc1(v))
        v = torch.tanh(self.value_fc2(v))

        return p, v

# ============ MCTS Node ============
class MCTSNode:
    def __init__(self, prior, parent=None):
        self.prior = prior  # P(s,a)
        self.parent = parent
        self.children = {}  # action -> MCTSNode

        self.visit_count = 0
        self.value_sum = 0.0

    def value(self):
        if self.visit_count == 0:
            return 0
        return self.value_sum / self.visit_count

    def is_expanded(self):
        return len(self.children) > 0

    def select_child(self, c_puct=1.0):
        """UCT selection"""
        best_score = -float('inf')
        best_action = None
        best_child = None

        for action, child in self.children.items():
            # Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
            u = c_puct * child.prior * np.sqrt(self.visit_count) / (1 + child.visit_count)
            score = child.value() + u

            if score > best_score:
                best_score = score
                best_action = action
                best_child = child

        return best_action, best_child

    def expand(self, actions, priors):
        """Expand node"""
        for action, prior in zip(actions, priors):
            if action not in self.children:
                self.children[action] = MCTSNode(prior, parent=self)

    def backup(self, value):
        """Update this node's statistics. The search loop walks the path
        and flips the sign at each ply (opponent's perspective); updating
        only this node avoids double-counting the ancestors."""
        self.visit_count += 1
        self.value_sum += value

# ============ MCTS ============
class MCTS:
    def __init__(self, model, c_puct=1.0, num_simulations=400):
        self.model = model
        self.c_puct = c_puct
        self.num_simulations = num_simulations

    @torch.no_grad()
    def search(self, env):
        """Execute MCTS, return improved policy"""
        root = MCTSNode(prior=0)

        # Initialize root node
        state = env.get_state()
        legal_actions = env.legal_actions()
        log_probs, value = self.model(torch.FloatTensor(state).unsqueeze(0))
        probs = torch.exp(log_probs).squeeze(0).numpy()

        # Only consider legal actions
        action_probs = []
        for action in legal_actions:
            idx = action[0] * env.size + action[1]
            action_probs.append(probs[idx])
        action_probs = np.array(action_probs)
        action_probs /= action_probs.sum()

        root.expand(legal_actions, action_probs)

        # Simulate
        for _ in range(self.num_simulations):
            node = root
            env_copy = self._copy_env(env)
            search_path = [node]

            # 1. Selection
            while node.is_expanded() and not env_copy.done:
                action, node = node.select_child(self.c_puct)
                search_path.append(node)
                env_copy.step(action)

            # 2&3. Expansion and evaluation
            if not env_copy.done:
                state = env_copy.get_state()
                legal_actions = env_copy.legal_actions()
                log_probs, value = self.model(torch.FloatTensor(state).unsqueeze(0))
                probs = torch.exp(log_probs).squeeze(0).numpy()

                action_probs = []
                for action in legal_actions:
                    idx = action[0] * env_copy.size + action[1]
                    action_probs.append(probs[idx])
                action_probs = np.array(action_probs)
                action_probs /= action_probs.sum()

                node.expand(legal_actions, action_probs)
                value = value.item()  # from the perspective of the player to move
            else:
                # Terminal state: the previous mover just won (or it's a draw),
                # so from the player to move it is a loss or a draw
                value = -1.0 if env_copy.winner != 0 else 0.0

            # 4. Backpropagation: flip the sign at each ply going up the path
            for node in reversed(search_path):
                value = -value
                node.backup(value)

        # Return improved policy
        visit_counts = np.zeros(env.size * env.size)
        for action, child in root.children.items():
            idx = action[0] * env.size + action[1]
            visit_counts[idx] = child.visit_count

        return visit_counts / visit_counts.sum()

    def _copy_env(self, env):
        """Deep copy environment"""
        new_env = GomokuEnv(env.size)
        new_env.board = env.board.copy()
        new_env.current_player = env.current_player
        new_env.done = env.done
        new_env.winner = env.winner
        return new_env

# ============ AlphaZero Trainer ============
class AlphaZeroTrainer:
    def __init__(self, board_size=15, num_simulations=100):
        self.env = GomokuEnv(board_size)
        self.model = PolicyValueNet(board_size)
        self.mcts = MCTS(self.model, num_simulations=num_simulations)
        self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3, weight_decay=1e-4)

        self.replay_buffer = deque(maxlen=10000)
        self.batch_size = 256

    def self_play(self, num_games=10, temperature=1.0):
        """Self-play to generate data"""
        for _ in range(num_games):
            states, policies, current_players = [], [], []
            self.env.reset()

            step = 0
            while not self.env.done:
                state = self.env.get_state()
                policy = self.mcts.search(self.env)

                states.append(state)
                policies.append(policy)
                current_players.append(self.env.current_player)

                # Temperature sampling
                if step < 30:
                    action_probs = policy ** (1.0 / temperature)
                    action_probs /= action_probs.sum()
                    action_idx = np.random.choice(len(policy), p=action_probs)
                else:
                    action_idx = np.argmax(policy)

                action = (action_idx // self.env.size, action_idx % self.env.size)
                self.env.step(action)
                step += 1

            # Save data
            winner = self.env.winner
            for state, policy, player in zip(states, policies, current_players):
                # Value from current player's perspective
                value = winner * player
                self.replay_buffer.append((state, policy, value))

    def train_step(self):
        """Train one step"""
        if len(self.replay_buffer) < self.batch_size:
            return None

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, target_policies, target_values = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        target_policies = torch.FloatTensor(np.array(target_policies))
        target_values = torch.FloatTensor(np.array(target_values)).unsqueeze(1)

        # Forward pass
        log_probs, values = self.model(states)

        # Loss
        policy_loss = -torch.mean(torch.sum(target_policies * log_probs, dim=1))
        value_loss = torch.mean((target_values - values) ** 2)
        loss = policy_loss + value_loss

        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return {
            'loss': loss.item(),
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item()
        }

    def train(self, num_iterations=100, games_per_iter=10, train_steps_per_iter=100):
        """Complete training loop"""
        for i in range(num_iterations):
            print(f"\n=== Iteration {i+1}/{num_iterations} ===")

            # Self-play
            print("Self-playing...")
            self.self_play(num_games=games_per_iter)

            # Train
            print("Training network...")
            losses = []
            for _ in range(train_steps_per_iter):
                loss_dict = self.train_step()
                if loss_dict:
                    losses.append(loss_dict['loss'])

            if losses:
                print(f"Average loss: {np.mean(losses):.4f}")
            print(f"Buffer size: {len(self.replay_buffer)}")

# ============ Play Against AI ============
def play_against_ai(model, board_size=15):
    """Play against trained AI"""
    env = GomokuEnv(board_size)
    mcts = MCTS(model, num_simulations=400)

    print("You are black (X), AI is white (O)")
    print("Enter coordinates like: 7 7")

    while not env.done:
        # Display board
        print("\nCurrent board:")
        for i in range(board_size):
            row = []
            for j in range(board_size):
                if env.board[i, j] == 1:
                    row.append('X')
                elif env.board[i, j] == -1:
                    row.append('O')
                else:
                    row.append('.')
            print(' '.join(row))

        if env.current_player == 1:
            # Human turn
            while True:
                try:
                    r, c = map(int, input("Your move: ").split())
                    if (r, c) in env.legal_actions():
                        break
                    else:
                        print("Illegal position!")
                except ValueError:
                    print("Input format: row col")
            action = (r, c)
        else:
            # AI turn
            print("AI thinking...")
            policy = mcts.search(env)
            action_idx = np.argmax(policy)
            action = (action_idx // board_size, action_idx % board_size)
            print(f"AI moves: {action}")

        env.step(action)

    # Game over
    print("\nFinal board:")
    for i in range(board_size):
        row = []
        for j in range(board_size):
            if env.board[i, j] == 1:
                row.append('X')
            elif env.board[i, j] == -1:
                row.append('O')
            else:
                row.append('.')
        print(' '.join(row))

    if env.winner == 1:
        print("\nYou won!")
    elif env.winner == -1:
        print("\nAI won!")
    else:
        print("\nDraw!")

# ============ Main Program ============
if __name__ == "__main__":
    # Train
    trainer = AlphaZeroTrainer(board_size=9, num_simulations=100)
    trainer.train(num_iterations=50, games_per_iter=5, train_steps_per_iter=50)

    # Save model
    torch.save(trainer.model.state_dict(), "alphazero_gomoku.pth")

    # Play against AI
    # model = PolicyValueNet(board_size=9)
    # model.load_state_dict(torch.load("alphazero_gomoku.pth"))
    # model.eval()
    # play_against_ai(model, board_size=9)

Code Explanation

Environment part:
  • GomokuEnv implements Gomoku (five in a row), including state representation, legal actions, and win detection
  • get_state() returns a 3-channel tensor: current player's stones, opponent's stones, color indicator

Network part:
  • PolicyValueNet is a dual-head network: the policy head outputs log action probabilities $\log \mathbf{p}$, the value head outputs $v \in [-1, 1]$
  • Uses 3 convolutional layers to extract features, then separate fully connected layers for policy and value

MCTS part:
  • MCTSNode stores the prior, visit count, and value sum
  • select_child() implements the UCT formula
  • expand() adds child nodes and initializes their priors
  • backup() updates node statistics, applied back along the search path with alternating signs

Training part:
  • self_play() generates $(s, \pi, z)$ data; the first 30 moves use temperature $\tau = 1$ sampling, then greedy selection
  • train_step() optimizes the policy loss $-\boldsymbol{\pi}^\top \log \mathbf{p}$ and the value loss $(z - v)^2$

Running example:
  • On a 9x9 board with 100 simulations, the AI reaches a decent level after 50 training iterations
  • On a 15x15 board with 400 simulations, training for hundreds of iterations can approach amateur level

Deep Q&A: Understanding MCTS and AlphaGo Series

Q1: Where does the $\sqrt{\ln N / n}$ term in the UCT formula come from?

Intuition: It comes from confidence intervals in statistics. Suppose action $a$'s true value is $\mu$ and we have observed $n$ samples with average $\bar{X}_n$. By the central limit theorem, the standard error of the sample mean is proportional to $1/\sqrt{n}$. To avoid missing good actions with high confidence, we add an upper bound of this order to the empirical mean.

Mathematical derivation: UCT is based on Hoeffding's inequality:

$$P\left(\bar{X}_n \ge \mu + \epsilon\right) \le e^{-2n\epsilon^2}$$

Set the right side equal to $N^{-4}$ (i.e., confidence $1 - N^{-4}$, where $N$ is the total number of plays):

$$e^{-2n\epsilon^2} = N^{-4}$$

Solving:

$$\epsilon = \sqrt{\frac{2\ln N}{n}}$$

Therefore the confidence upper bound is $\bar{X}_n + \sqrt{2\ln N / n}$. This is exactly UCT's form (up to constant factors).

Physical meaning:
  • When $n$ is small, $\sqrt{\ln N / n}$ is large: the algorithm "optimistically" assumes this action might be good
  • When $N$ increases while $n$ doesn't, the bound grows: the algorithm is driven by "curiosity" to explore
  • When $n$ is large, the bound is small: the algorithm "confidently" relies on the empirical estimate

Q2: Why did AlphaGo need fast rollout policy?

Historical reason: When AlphaGo was developed (2015-2016), value network training wasn't mature enough and was prone to overfitting to the training distribution. The fast rollout provided an independent evaluation signal, like a "voting" mechanism:
  • $v_\theta$: deep learning's global judgment (slow but accurate)
  • $p_\pi$'s rollout: Monte Carlo sampling's local verification (fast but noisy)

The mixture $V(s_L) = (1-\lambda) v_\theta(s_L) + \lambda z_L$ reduced the variance of either estimate alone.

Why doesn't AlphaGo Zero need it? By 2017, residual networks and batch normalization had matured, greatly improving the value network's generalization. More importantly, AlphaGo Zero's self-play data covered a broader state space, while AlphaGo's data mainly came from (biased) human game records. Experiments showed pure value-network evaluation was accurate enough; the rollout actually introduced noise.

Numerical comparison: each AlphaGo MCTS simulation took ~5 ms (including a 2 ms rollout); AlphaGo Zero's pure value evaluation needed only 3 ms, with a 600 Elo improvement.

Q3: Can self-play fall into local optima?

Theoretical concern: In game theory, self-play is similar to Fictitious Self-Play, $\pi_{t+1} = \text{BR}(\bar{\pi}_t)$, where $\text{BR}$ is the best-response operator. In general games this iteration doesn't guarantee convergence to a Nash equilibrium and may cycle (as in rock-paper-scissors).

Why does AlphaGo Zero succeed?
1. The specificity of two-player zero-sum perfect-information games: this class of games has a unique equilibrium value obtainable by minimax, and fictitious self-play has convergence guarantees here.
2. Exploration mechanism: MCTS's UCT formula forces exploring all actions, avoiding premature policy collapse onto a single move.
3. Random initialization: the early random policy produces diverse games, laying a broad foundation.
4. Opponent pool: although AlphaGo Zero mainly plays against the current best network, mini-batch sampling during training includes data from different periods, acting as an implicit opponent pool.

Experimental evidence: The paper's ablation study shows that if exploration is removed (setting $c_{\mathrm{puct}} = 0$), training stalls; but as long as $c_{\mathrm{puct}} > 0$, policy diversity suffices to support continuous improvement.

Q4: Was AlphaZero's victory over Stockfish fair?

Controversial points:
- Stockfish ran on a single-core CPU, while AlphaZero used 4 TPUs (equivalent to thousands of cores of computing power)
- Stockfish used fixed-depth alpha-beta search (~60 million nodes/second); AlphaZero searched only 80k nodes/second, but evaluated each node with a neural network
- Stockfish had opening books and endgame tablebases; AlphaZero relied entirely on real-time computation

DeepMind's defense:
- Match time was fixed (1 minute per side), so raw computational resources weren't the main factor
- Although Stockfish searched nodes quickly, most of that was ineffective exploration; AlphaZero's effective search depth was deeper (average 60 plies vs 30)
- In follow-up tests against multi-core Stockfish, AlphaZero still maintained its advantage

Deeper issue: This reflects the difference between two AI paradigms:
- Symbolic AI: Stockfish uses hand-crafted heuristics (like "control the center", "king safety") plus brute-force search
- Connectionist AI: AlphaZero uses learned implicit knowledge and highly selective search

AlphaZero's playing style is closer to human masters - willing to sacrifice material for long-term positional advantage, while traditional engines are more "materialistic".

Q5: What exactly does MuZero's hidden state $s^k$ represent?

Key insight: $s^k$ doesn't need to represent the complete environment state, only an abstract representation sufficient for planning. For example, in Atari's Ms. Pac-Man:
- The complete state includes pixels, enemy positions, score, level information, etc.
- But planning only needs high-level information like "which direction to go without being caught"
- $s^k$ might only encode "the direction of safe zones" and "the rough distribution of dots", ignoring irrelevant details (like background patterns)

Mathematically: Define the real environment's transition function as $T(s,a)$ and reward function as $R(s,a)$. MuZero's learned dynamics $g_\theta$ satisfies

$$\pi\big(g_\theta(s, a)\big) \approx \pi^*\big(T(s, a)\big), \qquad v\big(g_\theta(s, a)\big) \approx v^*\big(T(s, a)\big),$$

i.e., the policy and value computed in the latent model are equivalent in the planning sense to the true policy and value: a search tree expanded with $g_\theta$ produces the same optimal action as a tree expanded with the true $T$.

Why not directly predict the next frame?
- Pixel-level prediction must model irrelevant visual details (like drifting clouds), wasting capacity
- Latent-space transitions can be more abstract, similar to human "mental simulation": we don't need to imagine realistic images to plan, only functional causal relationships

Experimental verification: The paper shows that $s^k$'s dimension (like 512D) is much smaller than the raw pixel observation, yet sufficient to support 50-step unrolled planning, proving the abstraction's effectiveness.

Q6: What's the fundamental difference between MCTS and Model-Free RL (like PPO)?

Planning vs Learning:
- MCTS: at each decision step, it executes search in real time, computing the locally optimal action for that state. This is planning: using a model to look ahead.
- PPO and other Model-Free methods: they train a policy network $\pi_\theta$ on massive experience and directly query the network during execution. This is learning: memorizing past experience.

Computational overhead:
- MCTS: each decision needs hundreds to thousands of simulations, computationally expensive, but no training is required
- PPO: the training phase needs millions of environment interactions, but execution is a single forward pass, extremely fast

Generalization ability:
- MCTS: as long as a model is available, it can immediately adapt to new states (like mid-game variations)
- PPO: it only generalizes to states near the training distribution and may fail in completely new situations

AlphaZero's fusion combines the advantages of both:
- During training, MCTS generates a high-quality search policy $\pi_{\mathrm{MCTS}}$ (planning)
- Supervised learning distills $\pi_{\mathrm{MCTS}}$ into the network (learning)
- During execution, the network accelerates MCTS search (synergy)

This "System 1 (fast intuition) + System 2 (slow reasoning)" dual-system architecture coincides with human cognitive science models.

Q7: Why does AlphaGo's value network sample only one state per game?

Overfitting risk: Different states in the same game are highly correlated; the values of the first 10 moves all depend on the same final outcome. Sampling them all would mean:
- The same game contributes hundreds of samples to the training set
- These samples aren't independent and identically distributed (i.i.d.), violating the supervised-learning assumption
- The network memorizes specific game patterns rather than learning general value estimation

Mathematically: Suppose a game has $T$ steps; sampling them all produces $T$ samples $(s_1, z), \ldots, (s_T, z)$ with identical labels. The i.i.d. variance formula $\mathrm{Var}(\hat{g}) = \sigma^2/(nT)$ then falsely understates the gradient noise, because the samples aren't independent: the effective sample count is closer to the number of games $n$ than to $nT$. The optimizer becomes overconfident and stops exploring too early.

Paper experiment: Three sampling strategies were compared:
1. Sample all states per game: value network training-set MSE = 0.01, test-set MSE = 0.37 (severe overfitting)
2. Sample one state per game: training-set MSE = 0.19, test-set MSE = 0.23 (good generalization)
3. Sample a fixed proportion per game (like 10%): in between

Finally chose strategy 2, using 30 million games to generate 30 million independent samples.
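Strategy 2 can be sketched as follows; the game representation here (a list of states plus a ±1 outcome) is a simplification for illustration:

```python
import random

def sample_training_set(games, rng=random.Random(0)):
    """One (state, outcome) pair per game, so labels are (nearly) independent.
    Each game is a (states, outcome) tuple with outcome in {+1, -1}."""
    data = []
    for states, outcome in games:
        s = rng.choice(states)      # a single random position from this game
        data.append((s, outcome))
    return data

# 1000 toy games of 150 positions each -> exactly 1000 training samples
games = [(["g%d_move%d" % (g, t) for t in range(150)], random.choice([1, -1]))
         for g in range(1000)]
data = sample_training_set(games)
print(len(data))  # -> 1000
```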

Q8: How many parameters does AlphaZero's residual network have, and how is it trained?

Architecture details:
- 20 residual blocks, each containing:
  - $3 \times 3$ convolution, 256 filters
  - Batch normalization
  - ReLU activation
  - Skip connection
- Total parameters: approximately 46 million

Training configuration:
- Batch size: 4096 (using 64 TPUs in parallel)
- Optimizer: SGD with momentum 0.9
- Learning rate: initial 0.2, decayed to 0.02 and then 0.002, stepping every 100k steps
- Weight decay: $10^{-4}$
- Training time: Go 72 hours, chess 9 hours, shogi 13 hours

Key tricks:
1. Data augmentation: AlphaGo Zero exploited Go's 8-fold symmetry (rotations + reflections); AlphaZero dropped augmentation to stay game-agnostic, since chess and shogi lack such symmetries
2. Temperature annealing: the first 30 moves use $\tau = 1$ (sample moves in proportion to visit counts); afterwards $\tau \to 0$ (play the most-visited move)

  3. Dirichlet noise: add noise to the root node's prior to encourage exploration: $P(s,a) = (1-\varepsilon) p_a + \varepsilon \eta_a$, where $\eta \sim \mathrm{Dir}(\alpha)$ with $\varepsilon = 0.25$ and $\alpha = 0.03$ for Go
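The Dirichlet-noise trick can be sketched in a few lines, using the Go settings reported in the paper ($\varepsilon = 0.25$, $\alpha = 0.03$); the function name is mine, and the toy uniform prior is illustrative:

```python
import numpy as np

def add_root_noise(priors, eps=0.25, alpha=0.03, rng=np.random.default_rng(0)):
    """Mix Dirichlet noise into the root prior: P = (1 - eps) * p + eps * eta.
    eps = 0.25 and alpha = 0.03 are the Go settings from the AlphaGo Zero paper."""
    eta = rng.dirichlet([alpha] * len(priors))
    return (1 - eps) * np.asarray(priors) + eps * eta

priors = np.full(8, 1 / 8)   # toy uniform prior over 8 candidate moves
noisy = add_root_noise(priors)
print(noisy.sum())           # still a probability distribution (sums to 1)
```

Because $\alpha$ is small, the noise concentrates on a few random moves, occasionally forcing the search to try actions the prior rates poorly.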

Q9: How does MuZero avoid model error accumulation?

Challenge: In Model-Based RL, model prediction error compounds with planning steps, roughly like $O\big((1+\epsilon)^K\big)$, where $\epsilon$ is the single-step error and $K$ is the number of unroll steps. This makes long-term planning unreliable.

MuZero's solution:

1. Latent-space modeling: $s^k$ is an abstract representation that doesn't predict pixel-level details, reducing the single-step error
2. Short-horizon unrolling: training unrolls only 5 steps, not dozens, limiting error accumulation
3. Value guidance: the model doesn't need to perfectly predict the future, only to keep the value estimate $v^k$ accurate. Even if the dynamics $g_\theta$'s prediction is biased, as long as the prediction head $f_\theta$ corrects the bias, planning remains effective
4. End-to-end training: the loss directly optimizes policy and value, not reconstruction accuracy, so the model automatically learns to ignore details that don't affect decisions

Mathematically: Traditional Model-Based RL optimizes reconstruction error, $\min_\theta \lVert \hat{s}_{t+1} - s_{t+1} \rVert^2$, while MuZero optimizes decision quality, $\min_\theta \sum_{k=0}^{K} \big( \ell^p(\pi_{t+k}, p^k) + \ell^v(z_{t+k}, v^k) + \ell^r(u_{t+k}, r^k) \big)$. The latter allows the model to "cheat" wherever it doesn't affect $\pi$ and $v$, avoiding wasted capacity.
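A toy sketch of the unrolled objective, assuming user-supplied representation/dynamics/prediction functions `h`, `g`, `f` and squared-error terms in place of the paper's cross-entropy losses; note that nothing here penalizes reconstructing the observation:

```python
def muzero_loss(obs, actions, target_policies, target_values, target_rewards,
                h, g, f, K=5):
    """K-step unrolled MuZero-style loss (sketch). h encodes the observation,
    g is the latent dynamics (reward, next state), f predicts (policy, value).
    Only decision quality is penalized, never pixel reconstruction."""
    s = h(obs)                       # s^0 = h(o)
    loss = 0.0
    for k in range(K + 1):
        p, v = f(s)                  # predict policy and value from latent state
        loss += (v - target_values[k]) ** 2
        loss += sum((pi - ti) ** 2 for pi, ti in zip(p, target_policies[k]))
        if k < K:
            r, s = g(s, actions[k])  # latent transition, no frame prediction
            loss += (r - target_rewards[k]) ** 2
    return loss

# toy stubs: the latent state is a single float, policy is over 2 actions
h = lambda o: float(o)
g = lambda s, a: (0.0, s + a)        # (reward, next latent state)
f = lambda s: ([0.5, 0.5], s)        # (policy, value)
loss_val = muzero_loss(1.0, [0, 1, 0, 1, 0],
                       [[0.5, 0.5]] * 6, [0.0] * 6, [0.0] * 5, h, g, f)
print(loss_val)  # -> 28.0
```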

Q10: Can MCTS be used for continuous action spaces?

Challenge: UCT assumes a finite action set whose child nodes can all be enumerated. In a continuous space (like robot joint angles), actions are infinite and cannot be exhaustively expanded.

Possible extensions:
1. Progressive Widening: gradually increase the number of expanded actions; whenever a node's visit count $n$ grows, allow up to $O(n^\alpha)$ children (sampled from the policy distribution)
2. Local optimization: use gradient-based optimization at leaf nodes to find the best action rather than enumerating
3. Discretization: divide the continuous space into a grid, at the cost of discretization error
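Extension 1 can be sketched as follows; the widening rule $\lceil c \cdot n^\alpha \rceil$ and its constants are a common formulation in the literature, not from the AlphaGo papers:

```python
import math

def max_children(visits, c=1.0, alpha=0.5):
    """Progressive widening: a node visited n times may expand at most
    ceil(c * n^alpha) actions, so branching grows sublinearly with visits
    (c and alpha are tunable; 0 < alpha < 1 is typical)."""
    return max(1, math.ceil(c * visits ** alpha))

def maybe_expand(node_visits, children, sample_action):
    """Expand a new child only when the widening budget allows it;
    sample_action draws a fresh continuous action, e.g. from a policy."""
    if len(children) < max_children(node_visits):
        children.append(sample_action())
    return children

# budget grows as the square root of the visit count
print([max_children(n) for n in (1, 4, 16, 64, 256)])  # -> [1, 2, 4, 8, 16]
```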

AlphaGo series' choice: All versions currently only apply to discrete actions (Go, chess, Atari). DeepMind's robotics projects (like dexterous hand manipulation) use Model-Free DDPG/SAC, not MCTS.

Future direction: Combining MCTS with continuous control is an open problem, may require new tree structures or search strategies.

Core Papers

  1. MCTS Foundations:
Kocsis & Szepesvári (2006). "Bandit based Monte-Carlo Planning". ECML.
    https://arxiv.org/abs/cs/0703062

  2. AlphaGo:
    Silver et al. (2016). "Mastering the game of Go with deep neural networks and tree search". Nature.
    https://www.nature.com/articles/nature16961

  3. AlphaGo Zero:
    Silver et al. (2017). "Mastering the game of Go without human knowledge". Nature.
    https://www.nature.com/articles/nature24270

  4. AlphaZero:
    Silver et al. (2018). "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". Science.
    https://arxiv.org/abs/1712.01815

  5. MuZero:
    Schrittwieser et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model". Nature.
    https://arxiv.org/abs/1911.08265

  6. EfficientZero:
    Ye et al. (2021). "Mastering Atari Games with Limited Data". NeurIPS.
    https://arxiv.org/abs/2111.00210
    Improves MuZero's sample efficiency by 100x

  7. Gumbel AlphaZero:
    Danihelka et al. (2022). "Policy Improvement by Planning with Gumbel". ICLR.
    https://arxiv.org/abs/2106.06613
    Improves MCTS exploration using Gumbel sampling

  8. Sampled MuZero:
    Hubert et al. (2021). "Learning and Planning in Complex Action Spaces". ICML.
    https://arxiv.org/abs/2104.06303
    Extends MuZero to large discrete action spaces

Open Source Implementations

Summary and Outlook

The evolution from MCTS to the AlphaGo series demonstrates how AI achieved superiority at the pinnacle of human intelligence - Go. This success wasn't the victory of a single technology, but the perfect fusion of search, learning, and planning:

MCTS provides mathematical balance of exploration and exploitation, focusing limited computational resources on the most promising paths.

Deep learning through policy and value networks compresses millennia of human Go wisdom (AlphaGo) or pure self-play (AlphaGo Zero) into neural network weights, providing high-quality priors and evaluation for search.

Self-play constructs a continuously improving closed loop - policy strengthens, opponent also strengthens, always maintaining moderate challenge, ultimately breaking through the ceiling of human game records.

AlphaZero's generalization proves this paradigm's universality - the same algorithm achieves superhuman level in Go, chess, and shogi, showing "search + learning" potential as foundation for general intelligence.

MuZero further breaks free from rule limitations, achieving planning through learning implicit models, advancing toward the ultimate goal of "planning the world without understanding the world".

Future directions include: - Extension to continuous spaces: Combining MCTS with robot control - Transfer learning: Can models trained on one game quickly adapt to new games? - Multi-agent games: Extending from two-player zero-sum to multi-party cooperation-competition - Open-ended environments: How to plan in tasks without clear win/loss definitions?

AlphaGo's victory isn't an endpoint, but the starting point of combining reinforcement learning with search. It tells us: intelligence is not only memory, but also planning; not only imitation, but also creation. The evolution from AlphaGo to MuZero is the best interpretation of this philosophy.

  • Post title: Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search
  • Post author: Chen Kai
  • Create time: 2024-09-06 14:45:00
  • Post link: https://www.chenk.top/reinforcement-learning-8-alphago-and-mcts/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.