Reinforcement Learning (1): Fundamentals and Core Concepts
Chen Kai

How should an intelligent agent learn optimal behavior in an environment? When AlphaGo defeats world champions on the Go board, when robots learn to walk and grasp objects, when recommendation systems continuously optimize suggestions based on user feedback — all of these share a common mathematical framework: reinforcement learning.

Reinforcement learning differs from supervised learning's "train with labeled data" approach and unsupervised learning's "discover inherent structure" paradigm. It addresses a problem closer to the real world: learning optimal decision-making strategies for long-term rewards through trial-and-error and reward feedback, without explicit correct answers.

The Intuition: Learning to Ride a Bicycle

Learning to ride a bicycle is a classic reinforcement learning process. No one hands you a "Standard Cycling Manual" detailing exactly how to adjust the handlebars and body center of gravity at each moment (no supervised learning labels). You must experiment yourself: lean slightly left, the bike veers left; adjust your body right, it veers right. Falling is negative feedback; maintaining balance is positive feedback.

After hundreds of trials, your brain learns a policy: based on the current bike tilt angle, speed, and road conditions, you decide in real-time how to adjust your body and handlebars. This policy isn't acquired through memorization but progressively optimized through the cycle of interacting with the environment, receiving feedback, and adjusting behavior.

The mathematical framework of reinforcement learning precisely describes this process.

Markov Decision Process: The Mathematical Foundation

The core of reinforcement learning is the Markov Decision Process (MDP). An MDP is a five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

Basic Components

State Space $\mathcal{S}$:

All possible states of the environment. In cycling, the state might include bike tilt angle, speed, distance to obstacles, etc. States can be discrete (like board positions) or continuous (like robot joint angles).

Action Space $\mathcal{A}$:

All actions the agent can execute. Actions can be discrete (like moving up, down, left, right) or continuous (like applying a specific torque).

Transition Probability $P(s'|s, a)$:

The probability of transitioning to state $s'$ after executing action $a$ in state $s$. The transition probability completely defines the environment's dynamics, and it satisfies the normalization condition: $\sum_{s'} P(s'|s, a) = 1$ for every $(s, a)$.

Reward Function $R(s, a, s')$:

The immediate reward received after transitioning from state $s$ to $s'$ by executing action $a$. Rewards are the agent's only learning signal. Often simplified to $R(s, a)$ or $R(s)$ in many cases.

Discount Factor $\gamma \in [0, 1]$:

Measures the importance of future rewards. $\gamma = 0$ means caring only about immediate rewards, while $\gamma \to 1$ emphasizes long-term returns.

The Markov Property

The core assumption of MDPs is the Markov property: the future depends only on the current state, independent of history:

$$P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0) = P(S_{t+1} \mid S_t, A_t)$$

This assumption appears strong but is practically useful. If the current state contains all relevant information (such as concatenating recent frames as the state), the Markov property can be satisfied.

Policy: Mapping from States to Actions

A policy $\pi$ is the agent's behavioral guideline, defining how to select actions in each state:

Deterministic Policy: $a = \pi(s)$ directly provides the action.

Stochastic Policy: $\pi(a|s)$ provides a probability distribution over actions. Stochastic policies are more useful for exploration and for handling partially observable environments.

Return and Value Functions

The agent's goal is to maximize cumulative return. The return from time step $t$ is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The role of the discount factor:

  • Mathematically: Ensures bounded return in infinite-horizon tasks (when $|R_t| \le R_{\max}$, $|G_t| \le \frac{R_{\max}}{1-\gamma}$)
  • Intuitively: Distant rewards have higher uncertainty and should be discounted
  • Practically: Prevents agents from excessively delaying reward acquisition
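The discounted return is easy to compute directly, which makes the role of $\gamma$ concrete. A minimal sketch (the reward sequence here is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1}, accumulated backward as G <- r + gamma * G."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [1.0, 1.0, 1.0, 10.0]  # hypothetical reward sequence ending in a big payoff
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9*1 + 0.81*1 + 0.729*10 = 10.0
print(discounted_return(rewards, gamma=0.0))  # 1.0: only the immediate reward counts
```

With $\gamma = 0$ the distant payoff vanishes from the return entirely, which is exactly the "myopic" behavior described above.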

State Value Function:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

The expected cumulative return starting from state $s$ and following policy $\pi$: it measures "how good it is to be in state $s$" (when following policy $\pi$).

Action Value Function:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

The expected cumulative return from executing action $a$ in state $s$, then following policy $\pi$: it measures "how good it is to take action $a$ in state $s$".

Relationship between them:

$$V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s, a)$$

Bellman Equations: Recursive Structure of Value Functions

Value functions satisfy recursive relationships called Bellman equations. These are the cornerstone of reinforcement learning theory.

Bellman Expectation Equations (for any policy $\pi$):

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$$

Derivation Logic:

Current state value = expected immediate reward + expected discounted value of next state:

$$V^\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s\right]$$

Expanding the expectation over actions and next states yields the equation above. The analogous equation for the action value function:

$$Q^\pi(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a'|s')\, Q^\pi(s', a')\right]$$

Bellman Optimality Equations (for the optimal policy):

Define the optimal value functions: $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.

Optimal Bellman equation:

$$V^*(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Numerical Example:

Consider a simple 2-state MDP:

  • States: $\{s_1, s_2\}$
  • Actions: $\{a_1, a_2\}$ (available in each state)
  • Transition probabilities and rewards:

$(s, a)$ | $s'$ | $P(s'|s, a)$ | $R(s, a, s')$
$(s_1, a_1)$ | $s_1$ | 0.5 | 5
$(s_1, a_1)$ | $s_2$ | 0.5 | 10
$(s_1, a_2)$ | $s_1$ | 0.9 | 0
$(s_1, a_2)$ | $s_2$ | 0.1 | 15
$(s_2, a_1)$ | $s_1$ | 0.7 | 2
$(s_2, a_1)$ | $s_2$ | 0.3 | 8
$(s_2, a_2)$ | $s_2$ | 1.0 | 3

  • Discount factor: $\gamma = 0.9$
  • Policy: $\pi(s_1) = \pi(s_2) = a_1$ (always choose $a_1$)

Computing $V^\pi$: the Bellman expectation equation gives a system of two linear equations in $V^\pi(s_1)$ and $V^\pi(s_2)$:

$$V^\pi(s_1) = 0.5\,[5 + 0.9\,V^\pi(s_1)] + 0.5\,[10 + 0.9\,V^\pi(s_2)]$$
$$V^\pi(s_2) = 0.7\,[2 + 0.9\,V^\pi(s_1)] + 0.3\,[8 + 0.9\,V^\pi(s_2)]$$

Rearranging:

$$0.55\,V^\pi(s_1) - 0.45\,V^\pi(s_2) = 7.5$$
$$-0.63\,V^\pi(s_1) + 0.73\,V^\pi(s_2) = 3.8$$

Solving yields $V^\pi(s_1) \approx 60.9$ and $V^\pi(s_2) \approx 57.8$. This demonstrates how value functions satisfy the recursive relationships defined by the Bellman equations.
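The same linear system can be checked numerically, since for a fixed policy the Bellman expectation equation is just $V = R_\pi + \gamma P_\pi V$ in matrix form. A sketch (transition numbers as read from the example above, with $\gamma = 0.9$):

```python
import numpy as np

# Under the fixed policy pi(s) = a1:
#   s1 -a1-> s1 (p=0.5, r=5),  s1 -a1-> s2 (p=0.5, r=10)
#   s2 -a1-> s1 (p=0.7, r=2),  s2 -a1-> s2 (p=0.3, r=8)
gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.7, 0.3]])            # P[s, s'] under a1
R = np.array([0.5 * 5 + 0.5 * 10,     # expected one-step reward in s1
              0.7 * 2 + 0.3 * 8])     # expected one-step reward in s2

# Bellman expectation equation in matrix form: V = R + gamma * P V
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V.round(1))  # [60.9 57.8]
```

`np.linalg.solve` inverts the system $(I - \gamma P_\pi)V = R_\pi$ exactly, which is feasible whenever the state space is small.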

Dynamic Programming: Optimal Control with Known Models

When the environment model (transition probability $P$ and reward function $R$) is known, we can use dynamic programming to solve for the optimal policy.

Policy Evaluation: Computing Value Function for a Given Policy

Given a policy $\pi$, policy evaluation computes $V^\pi$.

Iterative algorithm:

  1. Initialize $V_0(s) = 0$ (for all $s$)

  2. Repeat until convergence, for all $s$:

$$V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

This is the iterative form of the Bellman expectation equation. Convergence is guaranteed by the Banach fixed-point theorem (the Bellman operator is a contraction mapping).
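For a small MDP with a fixed policy, this iteration is a few lines of NumPy. A sketch (the transition matrix and expected rewards are illustrative):

```python
import numpy as np

# Iterative policy evaluation: V_{k+1} = R_pi + gamma * P_pi V_k
gamma, theta = 0.9, 1e-10
P_pi = np.array([[0.5, 0.5],
                 [0.7, 0.3]])   # P_pi[s, s']: transition matrix under the policy
R_pi = np.array([7.5, 3.8])     # expected one-step rewards under the policy

V = np.zeros(2)                          # step 1: V_0 = 0
while True:                              # step 2: repeat until convergence
    V_new = R_pi + gamma * P_pi @ V      # Bellman expectation backup
    if np.max(np.abs(V_new - V)) < theta:
        break
    V = V_new
print(V_new.round(1))  # matches the exact solution of V = R_pi + gamma * P_pi V
```

The loop is guaranteed to terminate because each backup shrinks the distance to the fixed point by at least a factor of $\gamma$.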

Policy Improvement: Improving Policy Based on Value Function

Given $V^\pi$, construct a greedy policy:

$$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right] = \arg\max_a Q^\pi(s, a)$$

Policy Improvement Theorem: The greedy policy $\pi'$ is at least as good as $\pi$: $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$.

Proof:

For any state $s$, by definition of the greedy choice:

$$Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s, a) \ge Q^\pi(s, \pi(s)) = V^\pi(s)$$

If we choose $\pi'(s)$ on the first step and then follow $\pi$, the return is at least as good as always following $\pi$. If we use the greedy policy at every step, the return only improves further, giving $V^{\pi'}(s) \ge V^\pi(s)$.

Policy Iteration Algorithm

Alternates between policy evaluation and policy improvement:

  1. Initialize a random policy $\pi_0$

  2. Policy Evaluation: Solve for $V^{\pi_k}$

  3. Policy Improvement: $\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^{\pi_k}(s')\right]$

  4. If $\pi_{k+1} = \pi_k$, stop; otherwise return to step 2

Policy iteration is guaranteed to converge to the optimal policy.
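The code example later in this article implements value iteration; for comparison, tabular policy iteration fits in a short function. A sketch under assumed array layouts (`P[s, a, s']` for transitions, `R[s, a]` for expected rewards; the example numbers are those of the two-state MDP above):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration with exact policy evaluation."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)            # 1. initial policy
    while True:
        # 2. Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(n_states), pi]         # (n_states, n_states)
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 3. Policy improvement: greedy one-step lookahead
        Q = R + gamma * P @ V                     # (n_states, n_actions)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):            # 4. policy stable -> done
            return V, pi
        pi = pi_new

# Two-state MDP from the text (numbers as assumed there)
P = np.array([[[0.5, 0.5], [0.9, 0.1]],
              [[0.7, 0.3], [0.0, 1.0]]])          # P[s, a, s']
R = np.array([[7.5, 1.5],
              [3.8, 3.0]])                        # expected R[s, a]
V, pi = policy_iteration(P, R, gamma=0.9)
print(pi)  # [0 0] -> a1 is optimal in both states
```

Exact evaluation via `np.linalg.solve` replaces the inner iterative sweep, which is practical when the state space is small.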

Value Iteration Algorithm

Directly iterates the optimal Bellman equation:

  1. Initialize $V_0(s) = 0$

  2. Repeat until convergence, for all $s$:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

Value iteration is essentially a simplified version of policy iteration: each iteration implicitly performs one sweep of policy evaluation and an immediate policy improvement.

Convergence:

Define the Bellman optimality operator:

$$(TV)(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

We can prove $T$ is a $\gamma$-contraction mapping:

$$\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$

Therefore, value iteration converges exponentially to the unique fixed point $V^*$.

Code Implementation: GridWorld Environment

Let's implement a simple grid world where an agent moves from start to goal, with walls blocking certain positions.

```python
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List

class GridWorld:
    """
    Simple Grid World Environment

    Parameters:
        grid_size: (height, width), grid dimensions
        start: (row, col), starting position
        goal: (row, col), goal position
        obstacles: List[(row, col)], obstacle positions
    """
    def __init__(self, grid_size=(5, 5), start=(0, 0), goal=(4, 4), obstacles=None):
        self.height, self.width = grid_size
        self.start = start
        self.goal = goal
        self.obstacles = set(obstacles or [])  # avoid a mutable default argument

        # Action space: 0=up, 1=down, 2=left, 3=right
        self.actions = [0, 1, 2, 3]
        self.action_names = ['↑', '↓', '←', '→']

        # Action-to-state-change mapping
        self.action_deltas = {
            0: (-1, 0),  # up
            1: (1, 0),   # down
            2: (0, -1),  # left
            3: (0, 1)    # right
        }

        self.reset()

    def reset(self) -> Tuple[int, int]:
        """Reset environment, return starting position"""
        self.state = self.start
        return self.state

    def step(self, action: int) -> Tuple[Tuple[int, int], float, bool]:
        """
        Execute action

        Returns:
            next_state: next state
            reward: immediate reward
            done: whether goal is reached
        """
        dr, dc = self.action_deltas[action]
        next_r = self.state[0] + dr
        next_c = self.state[1] + dc

        # Check boundaries and obstacles (invalid moves leave the state unchanged)
        if (0 <= next_r < self.height and 0 <= next_c < self.width and
                (next_r, next_c) not in self.obstacles):
            self.state = (next_r, next_c)

        # Rewards: +10 for goal, -1 otherwise
        if self.state == self.goal:
            return self.state, 10.0, True
        else:
            return self.state, -1.0, False

    def get_transition_prob(self, state, action):
        """
        Get transition probability (deterministic environment)

        Returns:
            List[(next_state, prob, reward)]
        """
        # Save current state
        old_state = self.state
        self.state = state

        # Execute action
        next_state, reward, done = self.step(action)

        # Restore state
        self.state = old_state

        return [(next_state, 1.0, reward)]


def value_iteration(env: GridWorld, gamma=0.9, theta=1e-6, max_iter=1000):
    """
    Value Iteration Algorithm

    Parameters:
        env: environment object
        gamma: discount factor
        theta: convergence threshold
        max_iter: maximum iterations

    Returns:
        V: state value function, shape=(height, width)
        policy: optimal policy, shape=(height, width)
    """
    height, width = env.height, env.width
    V = np.zeros((height, width))

    for iter_count in range(max_iter):
        delta = 0

        # Iterate over all states
        for r in range(height):
            for c in range(width):
                state = (r, c)

                # Skip goal and obstacles
                if state == env.goal or state in env.obstacles:
                    continue

                v = V[r, c]

                # Compute Q-values for all actions
                q_values = []
                for action in env.actions:
                    q = 0
                    transitions = env.get_transition_prob(state, action)
                    for next_state, prob, reward in transitions:
                        next_r, next_c = next_state
                        q += prob * (reward + gamma * V[next_r, next_c])
                    q_values.append(q)

                # Bellman optimality update
                V[r, c] = max(q_values)
                delta = max(delta, abs(v - V[r, c]))

        # Check convergence
        if delta < theta:
            print(f"Value iteration converged after {iter_count + 1} iterations")
            break

    # Extract optimal policy
    policy = np.zeros((height, width), dtype=int)
    for r in range(height):
        for c in range(width):
            state = (r, c)
            if state == env.goal or state in env.obstacles:
                continue

            q_values = []
            for action in env.actions:
                q = 0
                transitions = env.get_transition_prob(state, action)
                for next_state, prob, reward in transitions:
                    next_r, next_c = next_state
                    q += prob * (reward + gamma * V[next_r, next_c])
                q_values.append(q)

            policy[r, c] = np.argmax(q_values)

    return V, policy


def visualize_policy(env: GridWorld, V, policy):
    """Visualize value function and policy"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Value function heatmap
    im1 = ax1.imshow(V, cmap='coolwarm')
    ax1.set_title('State Value Function V(s)')
    plt.colorbar(im1, ax=ax1)

    # Annotate values
    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                ax1.text(c, r, 'G', ha='center', va='center', color='white', fontsize=16, fontweight='bold')
            elif (r, c) in env.obstacles:
                ax1.text(c, r, 'X', ha='center', va='center', color='black', fontsize=16)
            else:
                ax1.text(c, r, f'{V[r, c]:.1f}', ha='center', va='center')

    # Policy arrows
    ax2.imshow(np.zeros((env.height, env.width)), cmap='gray', alpha=0.3)
    ax2.set_title('Optimal Policy π*(s)')

    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                ax2.text(c, r, 'G', ha='center', va='center', color='green', fontsize=16, fontweight='bold')
            elif (r, c) in env.obstacles:
                ax2.text(c, r, 'X', ha='center', va='center', color='red', fontsize=16)
            else:
                action = policy[r, c]
                ax2.text(c, r, env.action_names[action], ha='center', va='center', fontsize=20)

    plt.tight_layout()
    plt.savefig('gridworld_value_iteration.png', dpi=150)
    plt.show()


# Run example
if __name__ == "__main__":
    # Create 5x5 grid world with obstacles
    env = GridWorld(
        grid_size=(5, 5),
        start=(0, 0),
        goal=(4, 4),
        obstacles=[(1, 1), (2, 2), (3, 1)]
    )

    # Run value iteration
    V, policy = value_iteration(env, gamma=0.9, theta=1e-6)

    # Visualize results
    visualize_policy(env, V, policy)

    print("\nOptimal Value Function:")
    print(V)
    print("\nOptimal Policy:")
    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                print('G', end=' ')
            elif (r, c) in env.obstacles:
                print('X', end=' ')
            else:
                print(env.action_names[policy[r, c]], end=' ')
        print()
```

Code Explanation:

  1. The `GridWorld` class implements the environment, providing state transitions and rewards
  2. `value_iteration` implements the value iteration algorithm, sweeping until the value function converges ($\Delta < \theta$)
  3. The optimal policy is extracted by greedily choosing $\arg\max_a Q(s, a)$ in each state
  4. The visualization displays the value function heatmap and the policy direction arrows

Monte Carlo Methods: Model-Free Learning

Dynamic programming requires a complete environment model, but in practice, models are often unknown or difficult to construct. Monte Carlo (MC) methods estimate value functions by sampling trajectories, without requiring environment models.

Core Idea

The MC method's basic approach: the value function is an expected return, which can be approximated by a sample average:

$$V^\pi(s) \approx \frac{1}{N} \sum_{i=1}^{N} G^{(i)}(s)$$

where $G^{(i)}(s)$ is the return observed in the $i$-th trajectory starting from state $s$.

Steps:

  1. The agent interacts with the environment following policy $\pi$, generating complete trajectories (episodes): $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T$
  2. For each state $s$ in the trajectory, compute the return $G_t$ starting from that state
  3. Estimate $V(s)$ as the average of the returns observed from $s$

First-Visit MC vs Every-Visit MC

First-Visit MC:

For each trajectory, record the return only on the first visit to state $s$.

Algorithm:

  1. Initialize $V(s) = 0$ and $\text{Returns}(s) = [\,]$ (empty list)
  2. For each trajectory:
    • Compute the return $G_t$ at each timestep
    • For each state $s$ (first occurrence only): append $G_t$ to $\text{Returns}(s)$, then set $V(s) = \text{mean}(\text{Returns}(s))$

Every-Visit MC:

For each trajectory, record returns on every visit to state $s$.

The difference lies in handling multiple visits to the same state within one trajectory. First-Visit MC is more commonly used, and its unbiasedness is easier to prove.

Convergence of MC Policy Evaluation

Theorem: The First-Visit MC estimate converges to the true value function: $V_N(s) \to V^\pi(s)$ as $N(s) \to \infty$, where $N(s)$ is the number of trajectories visiting state $s$.

Proof: The returns $G^{(i)}$ observed on first visits to $s$ are independent and identically distributed random variables with expectation $V^\pi(s)$. By the law of large numbers, the sample mean converges to the expectation.

MC Control: Policy Improvement

MC methods can also be used for control problems (finding optimal policies). The key is estimating $Q^\pi(s, a)$ rather than $V^\pi(s)$, because policy improvement requires knowing the value of each action.

MC Control Algorithm (based on exploring starts):

  1. Initialize policy $\pi$ and $Q(s, a)$
  2. Policy evaluation: For each $(s, a)$ pair in the trajectory, update $Q(s, a)$ as the average of observed returns
  3. Policy improvement: For each $s$, update $\pi(s) = \arg\max_a Q(s, a)$

Exploring Starts Assumption:

Every state-action pair has nonzero probability of being the starting point. This ensures all state-action pairs can be visited.

ε-Greedy Policy: Balancing Exploration and Exploitation

Exploring Starts is difficult to implement in practice. A more practical approach is an $\epsilon$-greedy policy: with probability $1 - \epsilon$, choose the current best action (exploit); with probability $\epsilon$, choose an action uniformly at random (explore).
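A minimal implementation of this selection rule (a sketch; the function name is illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

q = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy(q, epsilon=0.0))  # 1: epsilon=0 always picks the greedy action
```

The later Sarsa and Q-Learning code in this article writes this logic inline; factoring it out like this is purely a readability choice.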

On-Policy vs Off-Policy:

  • On-Policy: Evaluate and improve the currently executing policy (like $\epsilon$-greedy)
  • Off-Policy: Evaluate one policy (target policy) while executing another (behavior policy)

MC control is typically on-policy.

Incremental Update

Storing all returns and computing averages has high memory overhead. The incremental update formula:

$$V(s) \leftarrow V(s) + \alpha \left(G - V(s)\right)$$

where $\alpha$ is the learning rate. With $\alpha = 1/N$ this reproduces the sample mean exactly; with constant $\alpha$ it becomes a weighted average that gives higher weight to recent samples.
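The equivalence with the plain sample mean is easy to verify (the returns here are hypothetical):

```python
returns = [5.0, 7.0, 3.0, 9.0]   # hypothetical observed returns for one state

V, n = 0.0, 0
for G in returns:
    n += 1
    V += (1.0 / n) * (G - V)     # incremental update with alpha = 1/N
print(V)  # equals sum(returns) / len(returns) up to float rounding
```

Only the running estimate and a counter need to be stored, instead of the whole list of returns.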

Code Implementation: MC Policy Evaluation

```python
import numpy as np
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes=10000, gamma=0.9):
    """
    Monte Carlo Policy Evaluation (First-Visit)

    Parameters:
        env: environment object
        policy: policy function, policy(state) -> action
        num_episodes: number of trajectories to sample
        gamma: discount factor

    Returns:
        V: state value function dictionary
    """
    # Store return lists for each state
    returns = defaultdict(list)
    V = defaultdict(float)

    for episode in range(num_episodes):
        # Generate trajectory
        trajectory = []
        state = env.reset()
        done = False

        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Compute the return at each timestep (backward pass)
        G = 0.0
        returns_at_t = [0.0] * len(trajectory)
        for t in range(len(trajectory) - 1, -1, -1):
            G = trajectory[t][2] + gamma * G
            returns_at_t[t] = G

        # First-Visit: record each state's return at its first occurrence only
        visited_states = set()
        for t, (state, action, reward) in enumerate(trajectory):
            if state not in visited_states:
                visited_states.add(state)
                returns[state].append(returns_at_t[t])
                V[state] = np.mean(returns[state])

        if (episode + 1) % 1000 == 0:
            print(f"Completed {episode + 1} episodes")

    return V


def mc_control(env, num_episodes=10000, gamma=0.9, epsilon=0.1):
    """
    Monte Carlo Control (On-Policy, ε-greedy)

    Returns:
        Q: action-value function dictionary
        policy: learned policy
    """
    Q = defaultdict(lambda: np.zeros(len(env.actions)))
    returns = defaultdict(list)

    for episode in range(num_episodes):
        # Generate trajectory (using ε-greedy policy)
        trajectory = []
        state = env.reset()
        done = False

        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.choice(env.actions)
            else:
                action = np.argmax(Q[state])

            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Compute the return at each timestep (backward pass)
        G = 0.0
        returns_at_t = [0.0] * len(trajectory)
        for t in range(len(trajectory) - 1, -1, -1):
            G = trajectory[t][2] + gamma * G
            returns_at_t[t] = G

        # First-Visit: update each (state, action) pair at its first occurrence only
        visited_sa = set()
        for t, (state, action, reward) in enumerate(trajectory):
            if (state, action) not in visited_sa:
                visited_sa.add((state, action))
                returns[(state, action)].append(returns_at_t[t])
                Q[state][action] = np.mean(returns[(state, action)])

        if (episode + 1) % 1000 == 0:
            print(f"Completed {episode + 1} episodes")

    # Extract greedy policy
    policy = lambda s: np.argmax(Q[s])

    return Q, policy
```

Pros and Cons of MC Methods:

Pros:
  • Model-free
  • Can learn from actual experience or from simulation
  • Can focus on states of interest
  • Simple and intuitive

Cons:
  • Requires complete trajectories (must wait until the episode ends)
  • High variance (the return is a sum of many random rewards)
  • Inefficient in continuing tasks or long episodes

Temporal Difference Learning: Combining MC and DP

Temporal Difference (TD) learning is the core of reinforcement learning, combining MC's model-free nature with DP's bootstrapping.

TD(0): The Simplest TD Method

TD(0) update rule:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

TD Error:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

The TD error measures the discrepancy between the one-step observation ($R_{t+1} + \gamma V(S_{t+1})$) and the current estimate ($V(S_t)$).

Comparison with MC:

Method | Update Target | Characteristics
MC | $G_t$ (actual return) | Unbiased, high variance, requires complete trajectory
TD(0) | $R_{t+1} + \gamma V(S_{t+1})$ (bootstrapped estimate) | Initially biased, low variance, single-step update

TD methods can update after every step without waiting for episode completion. This makes TD methods applicable to continuing tasks (tasks without explicit termination states).

Convergence of TD(0)

Theorem (Sutton, 1988): For any policy $\pi$, if the following conditions are met:

  1. The learning rate satisfies the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
  2. All states are visited infinitely often

then TD(0) converges almost surely to $V^\pi$.

Common learning rates: $\alpha_t = 1/t$, or a fixed small value (like 0.1).
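The code section later in this article covers Sarsa and Q-Learning; tabular TD(0) prediction itself is only a few lines. A sketch, assuming an environment with the `reset()`/`step()` interface used by the GridWorld code above:

```python
from collections import defaultdict

def td0_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped one-step target; a terminal successor contributes 0
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

Note that the update happens inside the step loop: no complete trajectory is needed, which is exactly what makes TD applicable to continuing tasks.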

Sarsa: On-Policy TD Control

Sarsa (State-Action-Reward-State-Action) is the TD version for Q-functions:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

The algorithm's name comes from the quintuple needed for the update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

Sarsa Algorithm:

  1. Initialize $Q(s, a)$ arbitrarily
  2. For each episode: initialize $S$; choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
  3. Repeat until $S$ is terminal:
    • Execute $A$, observe $R$ and $S'$
    • Choose $A'$ from $S'$ using the policy derived from $Q$
    • Update: $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma Q(S', A') - Q(S, A)]$
    • $S \leftarrow S'$, $A \leftarrow A'$

Sarsa is an on-policy method: it evaluates and improves the currently executing policy (including its $\epsilon$-greedy exploration).

Q-Learning: Off-Policy TD Control

Q-Learning is one of the most important breakthroughs in TD methods (Watkins, 1989).

Q-Learning Update Rule:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$

The difference from Sarsa: Sarsa uses the actually executed next action $A_{t+1}$, while Q-Learning uses the optimal action $\arg\max_a Q(S_{t+1}, a)$.

Q-Learning Algorithm:

  1. Initialize $Q(s, a)$ arbitrarily
  2. For each episode: initialize $S$
  3. Repeat until $S$ is terminal:
    • Choose action $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
    • Execute $A$, observe $R$ and $S'$
    • Update: $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
    • $S \leftarrow S'$

Q-Learning is an off-policy method: the behavior policy (like $\epsilon$-greedy) is used for exploration, while the target policy (the greedy policy) is used for evaluation.

Convergence of Q-Learning:

Theorem (Watkins & Dayan, 1992): If:

  1. All state-action pairs are visited infinitely
  2. Learning rate satisfies Robbins-Monro conditions

Then Q-Learning converges to the optimal Q-function $Q^*$, independent of the behavior policy.

This is Q-Learning's huge advantage: it can learn the optimal policy from data generated by any policy.

Sarsa vs Q-Learning: Intuitive Comparison

Consider a "cliff walking" environment:

S: start, G: goal, C: cliff (reward = -100)

```
S . . . . . . G
C C C C C C C C
```

The agent moves from start to goal; falling off the cliff returns it to start with large negative reward.

Sarsa (On-Policy):

Since it uses an $\epsilon$-greedy policy, Sarsa considers the possibility of falling off the cliff during exploration. It learns a policy that stays far from the cliff edge, taking a safer but longer path.

Q-Learning (Off-Policy):

Q-Learning evaluates the greedy policy, which assumes no random exploration. It learns a policy that walks along the cliff edge for the shortest path, but during execution (due to $\epsilon$-greedy exploration), it might fall off the cliff.

This illustrates:
  • Sarsa is more conservative (it considers exploration risk)
  • Q-Learning is more aggressive (it evaluates the optimal policy)

TD(λ) and Eligibility Traces

TD(0) only uses the one-step reward for bootstrapping. TD($\lambda$) is a more general framework that combines multi-step returns.

$n$-step return:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

TD($\lambda$) update:

Uses a weighted average of all $n$-step returns:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $\lambda \in [0, 1]$ controls the weight allocation.

Eligibility Trace:

The eligibility trace is an efficient implementation that maintains a trace $e_t(s)$ for each state:

$$e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbb{1}[S_t = s]$$

Update rule (for all $s$):

$$V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s)$$

The eligibility trace records a state's "credit": recently visited states and frequently visited states have higher traces, receiving more credit assignment.

Special cases:

  • $\lambda = 0$: TD(0)
  • $\lambda = 1$: Equivalent to MC (uses the complete return)
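The weights $(1-\lambda)\lambda^{n-1}$ placed on the $n$-step returns form a geometric series summing to 1, so $G_t^\lambda$ is a proper weighted average. A two-line check:

```python
lam = 0.9
# Weight on the n-step return G_t^(n) is (1 - lambda) * lambda^(n - 1)
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 200)]
print(sum(weights))  # 1 - lam**199, i.e. 1 up to roughly 1e-9
```

Larger $\lambda$ shifts weight toward longer-horizon returns, which is why $\lambda = 1$ recovers the full Monte Carlo return.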

Code Implementation: Sarsa and Q-Learning

```python
import numpy as np

def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    Sarsa Algorithm (On-Policy TD Control)

    Parameters:
        env: environment object
        num_episodes: number of training episodes
        alpha: learning rate
        gamma: discount factor
        epsilon: exploration rate

    Returns:
        Q: learned Q-function
    """
    # Initialize Q-table
    Q = {}
    for r in range(env.height):
        for c in range(env.width):
            Q[(r, c)] = np.zeros(len(env.actions))

    for episode in range(num_episodes):
        state = env.reset()

        # Choose initial action (ε-greedy)
        if np.random.random() < epsilon:
            action = np.random.choice(env.actions)
        else:
            action = np.argmax(Q[state])

        done = False

        while not done:
            # Execute action
            next_state, reward, done = env.step(action)

            # Choose next action (ε-greedy)
            if np.random.random() < epsilon:
                next_action = np.random.choice(env.actions)
            else:
                next_action = np.argmax(Q[next_state])

            # Sarsa update: use the actually executed next_action
            td_target = reward + gamma * Q[next_state][next_action]
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error

            # Transition
            state = next_state
            action = next_action

        if (episode + 1) % 100 == 0:
            print(f"Sarsa completed episode {episode + 1}")

    return Q


def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    Q-Learning Algorithm (Off-Policy TD Control)

    Returns:
        Q: learned Q-function
    """
    # Initialize Q-table
    Q = {}
    for r in range(env.height):
        for c in range(env.width):
            Q[(r, c)] = np.zeros(len(env.actions))

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            # Choose action (ε-greedy)
            if np.random.random() < epsilon:
                action = np.random.choice(env.actions)
            else:
                action = np.argmax(Q[state])

            # Execute action
            next_state, reward, done = env.step(action)

            # Q-Learning update: use max_a Q(s', a)
            td_target = reward + gamma * np.max(Q[next_state])
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error

            # Transition
            state = next_state

        if (episode + 1) % 100 == 0:
            print(f"Q-Learning completed episode {episode + 1}")

    return Q


def sarsa_lambda(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1, lambda_=0.9):
    """
    Sarsa(λ) Algorithm (with eligibility traces)

    Parameters:
        lambda_: eligibility trace decay factor
    """
    Q = {}
    for r in range(env.height):
        for c in range(env.width):
            Q[(r, c)] = np.zeros(len(env.actions))

    for episode in range(num_episodes):
        # Initialize eligibility traces
        E = {}
        for r in range(env.height):
            for c in range(env.width):
                E[(r, c)] = np.zeros(len(env.actions))

        state = env.reset()

        # Choose initial action
        if np.random.random() < epsilon:
            action = np.random.choice(env.actions)
        else:
            action = np.argmax(Q[state])

        done = False

        while not done:
            # Execute action
            next_state, reward, done = env.step(action)

            # Choose next action
            if np.random.random() < epsilon:
                next_action = np.random.choice(env.actions)
            else:
                next_action = np.argmax(Q[next_state])

            # Compute TD error
            td_error = reward + gamma * Q[next_state][next_action] - Q[state][action]

            # Update eligibility trace (current state-action pair)
            E[state][action] += 1

            # Update all state-action pairs, then decay their traces
            for s in E:
                for a in range(len(env.actions)):
                    Q[s][a] += alpha * td_error * E[s][a]
                    E[s][a] *= gamma * lambda_

            # Transition
            state = next_state
            action = next_action

        if (episode + 1) % 100 == 0:
            print(f"Sarsa(λ) completed episode {episode + 1}")

    return Q


# Comparison experiment
if __name__ == "__main__":
    # Assumes the GridWorld class above is saved as gridworld.py
    from gridworld import GridWorld

    env = GridWorld(grid_size=(5, 5), start=(0, 0), goal=(4, 4), obstacles=[(2, 2)])

    print("Training Sarsa...")
    Q_sarsa = sarsa(env, num_episodes=1000)

    print("\nTraining Q-Learning...")
    Q_qlearn = q_learning(env, num_episodes=1000)

    print("\nTraining Sarsa(λ)...")
    Q_sarsa_lambda = sarsa_lambda(env, num_episodes=1000, lambda_=0.9)

    # Extract policies and compare
    policy_sarsa = {s: np.argmax(Q_sarsa[s]) for s in Q_sarsa}
    policy_qlearn = {s: np.argmax(Q_qlearn[s]) for s in Q_qlearn}

    print("\nPolicy comparison complete")
```

Key Code Points:

  1. Sarsa updates using $Q(S', A')$, where $A'$ is the action actually executed next
  2. Q-Learning updates using $\max_a Q(S', a)$, taking the maximum over actions in the next state
  3. Sarsa($\lambda$) maintains eligibility traces $E$, recording the "credit" of each state-action pair
  4. All three methods use $\epsilon$-greedy exploration

❓ Q&A: Common Questions on RL Fundamentals

Q1: Why do we need the discount factor? Can't we directly maximize undiscounted return?

Mathematical Necessity:

In infinite-horizon tasks, the undiscounted return $\sum_{t=0}^{\infty} r_t$ may diverge to infinity. The discount factor ensures a bounded return:

$$|G_t| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}$$

Economic Interpretation:

The discount factor reflects "time value": a reward today is worth more than the same reward tomorrow. In reinforcement learning, this manifests as:

  • Rewards obtained earlier have more value (reduced uncertainty)
  • Prevents agents from indefinitely delaying reward acquisition

Practical Impact:

$\gamma$ Value | Behavioral Characteristics | Suitable Scenarios
0.0 – 0.5 | Myopic, focuses on near-term rewards | Financial trading, quick decisions
0.9 – 0.95 | Balanced, commonly used default | Most tasks
0.99 – 0.999 | Farsighted, values long-term gains | Go, strategic planning
1.0 | No discounting (finite-step tasks only) | Games with clear termination

Selection Guidelines:

  • If the task has a clear termination state (episodic), $\gamma$ can be close to 1
  • If the task is continuing, $\gamma$ should be < 1
  • Tune $\gamma$ via cross-validation

Q2: Why is the Bellman equation a "contraction mapping"? What does this mean for convergence?

Contraction Mapping Definition:

An operator $T$ is a contraction mapping if there exists $\gamma \in [0, 1)$ such that for any functions $V_1, V_2$:

$$\|TV_1 - TV_2\| \le \gamma \|V_1 - V_2\|$$

where $\|V\| = \max_s |V(s)|$ is the sup norm.

Proof of Bellman Operator Contraction:

Define the Bellman optimality operator:

$$(TV)(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

Proof: For any state $s$, using $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$:

$$|(TV_1)(s) - (TV_2)(s)| \le \max_a \sum_{s'} P(s'|s, a)\, \gamma\, |V_1(s') - V_2(s')| \le \gamma \|V_1 - V_2\|_\infty$$

Banach Fixed-Point Theorem:

A contraction mapping has a unique fixed point, and iteration from any initial value converges to that fixed point. Convergence rate:

$$\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$$

Exponential convergence! This means value iteration and policy evaluation converge quickly.
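The $\gamma^k$ rate can be observed numerically. A sketch using a two-state policy-evaluation operator (the linear case $TV = R + \gamma P V$; the numbers are illustrative):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.7, 0.3]])            # row-stochastic transition matrix
R = np.array([7.5, 3.8])

# Exact fixed point of T V = R + gamma * P V
V_star = np.linalg.solve(np.eye(2) - gamma * P, R)

V = np.zeros(2)
errors = []
for _ in range(30):
    V = R + gamma * P @ V             # one application of the operator
    errors.append(np.max(np.abs(V - V_star)))

# Each application shrinks the sup-norm error by at least a factor of gamma
ratios = [errors[k + 1] / errors[k] for k in range(10)]
print(max(ratios))  # never exceeds gamma = 0.9
```

The error curve is a straight line on a log scale with slope $\log \gamma$, which is the practical meaning of "exponential convergence".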

Practical Significance:

  • A smaller $\gamma$ converges faster (but policy quality may decrease)
  • Can use $\|V_{k+1} - V_k\|_\infty < \theta$ as a stopping criterion
  • Even with poor initial values, convergence to optimal value is guaranteed

Q3: Dynamic programming requires complete models, but models are often unknown in practice. Does this mean DP has no practical value?

Cases Where Models Are Available:

  1. Known game rules: Board games (Go, Chess), video games
  2. Simulation environments: Robot simulators (MuJoCo, Gazebo), autonomous driving simulation
  3. Learned models: Using supervised learning to learn the transition function $\hat{P}(s'|s, a)$ and reward $\hat{R}(s, a)$ from data

Model-Based RL:

Combines model learning with planning:

  1. Collect data $\mathcal{D} = \{(s, a, r, s')\}$ from real interactions
  2. Learn a model $\hat{P}(s'|s, a)$ and $\hat{R}(s, a)$ from $\mathcal{D}$
  3. Use DP to plan on the learned model

This is called the Dyna architecture (Sutton, 1990).

DP's Theoretical Value:

Even if DP algorithms aren't directly used, the theoretical framework (Bellman equations, policy improvement theorem) is foundational for all RL methods:

  • TD methods are essentially sampled versions of DP
  • Policy gradient methods' convergence relies on the policy improvement theorem
  • Target networks in deep RL originate from DP ideas

Practical Recommendations:

  • If model is known or easily modeled, prioritize DP (most efficient)
  • If model is unknown but can be sampled, use MC or TD
  • If samples are expensive, consider model-based methods

Q4: Which is better, Monte Carlo methods or TD methods?

Both have advantages and disadvantages depending on task characteristics:

Monte Carlo Methods:

✅ Advantages:
  • Unbiased estimates (the sample mean of returns converges to the true value)
  • Don't rely on the Markov property
  • Simple and intuitive

❌ Disadvantages:
  • High variance (the return is a sum of a long sequence of random rewards)
  • Require complete trajectories (can't be used for continuing tasks)
  • Slow learning (especially for long trajectories)

TD Methods:

✅ Advantages:
  • Low variance (depends only on one-step randomness)
  • Single-step updates (applicable to continuing tasks)
  • Fast learning (fully utilizes bootstrapping)

❌ Disadvantages:
  • Biased (initially based on inaccurate estimates)
  • Depend on the Markov property
  • Sensitive to initial values

Rules of Thumb:

Task Characteristics | Recommended Method | Reason
Short episodes | MC | Variance is not large; unbiased estimation matters more
Long/continuing tasks | TD | Must use single-step updates
High randomness | TD | MC variance too large
Function approximation | TD | MC's high variance amplifies approximation error
Offline data | MC | Fixed trajectories, can't bootstrap

Combined Use:

TD($\lambda$) interpolates between TD(0) and MC through the parameter $\lambda \in [0, 1]$:

  • $\lambda = 0$: TD(0)
  • $\lambda \to 1$: Close to MC
  • Intermediate $\lambda$: Balances bias and variance

In practice, $\lambda \approx 0.9$ typically works best.


Q5: Q-Learning guarantees convergence to optimal policy, so why do we need Sarsa?

Q-Learning indeed has stronger theoretical guarantees (it converges to the optimal action-value function $q_*$), but Sarsa is more practical in certain scenarios:

Safety Considerations:

Sarsa is an on-policy method that evaluates the actually executed policy (including exploration). In safety-sensitive tasks (like robot control, medical decisions), Sarsa learns more conservative policies.

Example (cliff walking):

Start → → → → → → → Goal
Cliff Cliff Cliff Cliff Cliff Cliff Cliff Cliff

  • Q-Learning: Learns the shortest path along the cliff edge (it evaluates the greedy policy), but frequently falls off the cliff during execution (due to ε-greedy exploration)
  • Sarsa: Learns a safe path farther from the cliff (it accounts for exploration risk)
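The only difference between the two algorithms is the bootstrap target, which this sketch isolates (the helper names and the dict-based Q-table are illustrative assumptions):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def q_learning_target(Q, r, s2, actions, gamma=0.99):
    # Off-policy: bootstrap from the BEST next action (ignores exploration)
    return r + gamma * max(Q.get((s2, a), 0.0) for a in actions)

def sarsa_target(Q, r, s2, a2, gamma=0.99):
    # On-policy: bootstrap from the action ACTUALLY taken next (includes exploration)
    return r + gamma * Q.get((s2, a2), 0.0)
```

When the exploratory next action is bad (e.g. stepping toward the cliff), the Sarsa target reflects that risk, while the Q-Learning target does not; that is exactly why Sarsa learns the safer path.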

Impact of Exploration Strategy:

Q-Learning's convergence requires every state-action pair to be visited infinitely often, but doesn't require a specific exploration strategy. Sarsa's performance depends directly on the quality of the exploration strategy.

Convergence Speed:

Theoretically Q-Learning converges faster (it directly targets the optimal policy), but in practice:

  • If exploration strategy differs greatly from optimal policy, Q-Learning's sample efficiency may be lower
  • Sarsa converges more stably in some tasks

Practical Recommendations:

  • Try Q-Learning first: More general, stronger theory
  • Switch to Sarsa if:
    • Training is unstable
    • Task has safety requirements
    • Exploration noise is large

Q6: Why do eligibility traces accelerate credit assignment?

The credit assignment problem: In long trajectories, how do early actions receive "credit" for distant rewards?

TD(0)'s Limitation:

TD(0) only propagates value information one step back per update:

$$V(s_t) \leftarrow V(s_t) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

If a reward is obtained at time $T$, only $V(s_{T-1})$ is directly updated; earlier states need multiple episodes to benefit indirectly.

Eligibility Trace Mechanism:

The eligibility trace $e_t(s)$ records each state's "responsibility" for the current reward:

$$e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbb{1}(s_t = s)$$

Update rule (applied to every state at every step):

$$V(s) \leftarrow V(s) + \alpha\,\delta_t\, e_t(s), \quad \text{where } \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

Intuition:

  • Frequency heuristic: Frequently visited states have higher traces (cumulative effect)
  • Recency heuristic: Recently visited states have higher traces (decay)
  • Global update: Each TD error $\delta_t$ updates all states at once (weighted by their eligibility traces)

Example:

Suppose the trajectory is $s_1 \to s_2 \to s_3 \to s_4$, with reward $r = 1$ obtained at $s_4$ (all other rewards zero).

TD(0):

  • Only $V(s_3)$ is directly updated
  • $V(s_1), V(s_2)$ are affected only after multiple episodes

Sarsa($\lambda$) (assume $\lambda = 0.9$, $\gamma = 1$):

At $s_4$ the TD error is $\delta = 1$ (assuming all initial values are $0$). All states are updated simultaneously, weighted by their eligibility traces:

$$V(s_3) \mathrel{+}= \alpha\delta \cdot 1, \quad V(s_2) \mathrel{+}= \alpha\delta \cdot 0.9, \quad V(s_1) \mathrel{+}= \alpha\delta \cdot 0.81$$

The reward signal propagates through the entire trajectory in a single step!
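The mechanism can be sketched for the state-value case (the action-value Sarsa($\lambda$) version is analogous). The `episode` format of `(state, reward, next_state)` transitions and the function name are assumptions for illustration:

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Online TD(lambda) with accumulating eligibility traces.

    episode: list of (state, reward, next_state) transitions;
    a terminal transition has next_state = None.
    """
    e = {}                                        # eligibility trace per state
    for s, r, s2 in episode:
        v_next = V.get(s2, 0.0) if s2 is not None else 0.0
        delta = r + gamma * v_next - V.get(s, 0.0)    # one-step TD error
        e[s] = e.get(s, 0.0) + 1.0                    # bump trace of current state
        for x in e:                                   # update ALL traced states
            V[x] = V.get(x, 0.0) + alpha * delta * e[x]
            e[x] *= gamma * lam                       # decay all traces
    return V
```

After a single episode ending in a reward, every visited state has received credit, decayed geometrically with its distance from the reward.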

Practical Effect:

  • In long-trajectory tasks, Sarsa($\lambda$) is several to dozens of times faster than Sarsa(0)
  • $\lambda \approx 0.9$ typically works best
  • Cost: Need to maintain eligibility trace for each state (memory overhead)

Q7: How to choose appropriate exploration strategy? What alternatives exist to ε-greedy?

Problems with ε-greedy:

  • Exploration is uniformly random (doesn't exploit prior knowledge)
  • With a fixed $\epsilon$, late-stage training still wastes steps on ineffective exploration
  • All non-greedy actions are explored with equal probability (even obviously bad ones)

Improved Approaches:

1. Decaying ε-greedy:

Reduce the exploration rate over time: $\epsilon_t = \max(\epsilon_{\min},\ \epsilon_0 \cdot d^{\,t})$ for some decay factor $d < 1$. Common settings: $\epsilon_0 = 1.0$, $\epsilon_{\min} = 0.01$.

2. Boltzmann Exploration (Softmax):

Choose actions according to relative Q-value magnitudes:

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a)/\tau\right)}{\sum_{a'} \exp\left(Q(s, a')/\tau\right)}$$

Temperature parameter $\tau$:

  • $\tau \to 0$: Close to greedy
  • $\tau \to \infty$: Close to uniform random

Advantage: Actions with higher Q-values have higher selection probability.

3. UCB (Upper Confidence Bound):

Uncertainty-based exploration:

$$a = \arg\max_a \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]$$

where $N(s, a)$ is the number of times action $a$ has been selected in state $s$, and $N(s) = \sum_a N(s, a)$.

Intuition: Explore uncertain actions (low visit count).
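Both alternatives fit in a few lines. A minimal sketch under assumed names (`softmax_action`, `ucb_action`, dict-based `Q` and visit counts `N`); the max-subtraction in the softmax is a standard numerical-stability trick:

```python
import math
import random

def softmax_action(Q, s, actions, tau=1.0):
    """Boltzmann exploration: sample actions with probability ∝ exp(Q/tau)."""
    qmax = max(Q.get((s, a), 0.0) for a in actions)   # subtract max for stability
    prefs = [math.exp((Q.get((s, a), 0.0) - qmax) / tau) for a in actions]
    return random.choices(actions, weights=prefs)[0]

def ucb_action(Q, s, actions, N, c=1.0):
    """UCB: prefer actions with high value OR high uncertainty (low count)."""
    total = sum(N.get((s, a), 0) for a in actions)
    def score(a):
        n = N.get((s, a), 0)
        if n == 0:
            return float('inf')                       # try unvisited actions first
        return Q.get((s, a), 0.0) + c * math.sqrt(math.log(total) / n)
    return max(actions, key=score)
```

Note how UCB is deterministic given the counts, while Boltzmann exploration remains stochastic but biased toward high-value actions.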

4. Noisy Networks (NoisyNet):

Add learnable noise to network parameters (detailed in Chapter 6).

Practical Recommendations:

| Scenario | Recommended Method | Reason |
| --- | --- | --- |
| Simple tasks | ε-greedy + decay | Simple and effective |
| Continuous action space | Parameter noise (e.g. OU noise) | ε-greedy not applicable |
| Deep RL | NoisyNet | Exploration couples with policy parameters |
| Expensive samples | UCB | Optimal exploration efficiency |
| Multi-armed bandits | Thompson sampling | Bayesian optimal |

Q8: How much does TD methods' "bias" impact practice?

TD methods' bias comes from bootstrapping: using the current estimate $V(s_{t+1})$ in place of the true value $v_\pi(s_{t+1})$.

Bias Analysis:

Assume the true value is $v_\pi(s_{t+1})$ and the current estimate is $V(s_{t+1})$, with TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. The bias of the TD target is

$$\gamma \left[ V(s_{t+1}) - v_\pi(s_{t+1}) \right]$$

Initially this bias may be large, but as $V$ approaches $v_\pi$, the bias vanishes.

Variance Comparison:

MC variance (for a $T$-step trajectory with per-reward variance $\sigma^2$): the return sums $T$ random rewards, so $\mathrm{Var} \approx T\sigma^2$. TD variance: only one random reward enters the target, so $\mathrm{Var} \approx \sigma^2$. TD variance is roughly $T$ times lower than MC!

Practical Impact:

In most tasks, low variance is more important than unbiasedness:

  • Deep RL: Function approximators (neural networks) amplify variance
  • Long episodes: MC variance grows linearly with episode length
  • Sample efficiency: TD learns more from same number of samples

Empirical Conclusion:

  • TD methods typically converge faster than MC (especially TD($\lambda$))
  • In high-variance tasks, TD's advantage is more pronounced
  • MC still has value in short episodes and offline data scenarios

Q9: How to set learning rate? What are the pros and cons of fixed vs adaptive learning rates?

The learning rate $\alpha$ controls the step size of each update and is crucial for both convergence and stability.

Theoretical Requirements (Robbins-Monro Conditions):

Sufficient conditions (Robbins-Monro) for guaranteed convergence:

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$

Examples satisfying the conditions:

  • $\alpha_t = 1/t$: Classic decay
  • $\alpha_t = 1/N(s)$: Based on visit count

Fixed Learning Rate ($\alpha_t = \alpha$):

Pros: - Simple - Can adapt to non-stationary environments (recent samples get higher weight)

Cons: - Doesn't satisfy the Robbins-Monro conditions (no theoretical guarantee of convergence to the optimum) - Will eventually oscillate around the optimal solution

Common values are on the order of $0.01$–$0.1$, determined by cross-validation.
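The two Robbins-Monro conditions can be checked numerically for the classic $1/t$ schedule; this small sketch (function names are illustrative) contrasts it with a fixed rate:

```python
def alpha_classic(t):
    """1/t decay: satisfies both Robbins-Monro conditions."""
    return 1.0 / t

def alpha_fixed(t, alpha=0.1):
    """Fixed rate: the sum diverges, but so does the sum of squares -> no guarantee."""
    return alpha

# Partial sums for the 1/t schedule over the first 100,000 steps:
s1 = sum(alpha_classic(t) for t in range(1, 100_001))       # grows without bound (~ln T)
s2 = sum(alpha_classic(t) ** 2 for t in range(1, 100_001))  # converges (toward pi^2/6)
```

The first series keeps growing (so the algorithm can move arbitrarily far from a bad initialization), while the second stays bounded (so the accumulated noise stays finite).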

Adaptive Learning Rates:

1. Visit-Count-Based: $\alpha_t(s) = \frac{1}{N(s)}$, where $N(s)$ is the visit count for state $s$.

Pros: - Satisfies Robbins-Monro conditions - Frequently visited states automatically have lower learning rate

Cons: - Slow learning in late stage - Requires maintaining visit counts

2. AdaGrad Style:

Borrows from deep-learning adaptive optimizers: $\alpha_t = \frac{\eta}{\sqrt{\sum_{k \le t} g_k^2} + \epsilon}$, where $g_k$ is the gradient (or TD error) at step $k$.

3. Adam Style:

Uses first and second moment estimates of the gradient: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, with step size proportional to $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.

Practical Recommendations:

| Scenario | Recommended Method | Parameters |
| --- | --- | --- |
| Simple tabular RL | Fixed learning rate | — |
| Deep RL | Adam optimizer | — |
| Online learning | Fixed learning rate | — |
| Offline learning | Decaying learning rate | Initial 0.1, decay to 0.001 |
| Non-stationary env | Fixed learning rate | Maintains adaptability |

Tuning Tips:

  • Start with larger learning rate (like 0.1), observe training curve
  • If unstable/divergent, reduce learning rate
  • If converging too slowly, increase learning rate
  • Use learning rate schedule (like linear decay, exponential decay)

Q10: The Markov property assumption is often violated in real problems. How to handle partial observability?

Partially Observable Markov Decision Process (POMDP):

The agent cannot directly observe the state $s_t$; it only receives an observation $o_t$, with the observation-state relationship determined by the observation function $O(o \mid s)$. Examples:

  • Robot navigation: sensor readings (observation) ≠ true location (state)
  • Medical diagnosis: symptoms (observation) ≠ disease (state)
  • Financial trading: market data (observation) ≠ market state (state)

Solutions:

1. State Augmentation:

Concatenate historical observations into an extended state: $\tilde{s}_t = (o_t, o_{t-1}, \ldots, o_{t-k+1})$. Frame stacking is commonly used in deep RL:

```python
# Atari games: stack the most recent 4 frames
state = np.concatenate([frame_t, frame_t_1, frame_t_2, frame_t_3], axis=0)
```

Pros: Simple, suitable for short-term memory Cons: State space dimensionality explosion
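A small helper class makes the idea concrete. This is a minimal sketch (the class name `FrameStacker` is an assumption; in practice environment wrappers such as Gym's `FrameStack` do this for you):

```python
from collections import deque

import numpy as np

class FrameStacker:
    """Keep the k most recent observations and expose them as one stacked state."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)        # old frames fall off automatically

    def reset(self, frame):
        # At episode start, fill the buffer by repeating the first frame
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames, axis=0)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)
```

The stacked array (shape `(k, H, W)` for grayscale frames) is what gets fed to the Q-network, letting it infer velocities and other short-term dynamics.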

2. Recurrent Neural Networks (RNN):

Use an LSTM/GRU to process the observation sequence; the hidden state $h_t = f(h_{t-1}, o_t)$ serves as the state representation. Pros: Can handle long-term dependencies Cons: Training is difficult (vanishing/exploding gradients)

3. Belief State:

Maintain a probability distribution over states (the belief): $b_t(s) = P(s_t = s \mid o_{1:t}, a_{1:t-1})$. Use a Bayesian update:

$$b_{t+1}(s') \propto O(o_{t+1} \mid s') \sum_s P(s' \mid s, a_t)\, b_t(s)$$

Then solve the MDP in belief space.

Pros: Theoretically complete Cons: High computational complexity (infeasible for continuous state spaces)
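For a discrete POMDP the Bayesian update is a couple of matrix operations. A minimal sketch under assumed conventions (`T[s, s']` is the transition matrix for the chosen action, `O[s', o]` the observation matrix, `belief_update` an illustrative name):

```python
import numpy as np

def belief_update(b, T, O, obs):
    """One Bayesian belief update for a discrete POMDP.

    b: belief over states, shape (S,); T: transition matrix T[s, s'] for the
    action taken; O: observation matrix O[s', o]; obs: index of the observation.
    """
    predicted = T.T @ b                # predict: sum_s T[s, s'] * b[s]
    unnorm = O[:, obs] * predicted     # correct: weight by observation likelihood
    return unnorm / unnorm.sum()       # normalize back to a distribution
```

Each update is $O(S^2)$, which is why exact belief tracking becomes infeasible for large or continuous state spaces.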

4. Transformer (Modern Approach):

Use a self-attention mechanism over the observation sequence $(o_1, \ldots, o_t)$. Pros: Captures long-distance dependencies; parallel training Cons: High computational cost

Practical Recommendations:

| Scenario | Recommended Method | Example |
| --- | --- | --- |
| Short-term partial observability | Frame stacking | Atari games (stack 4 frames) |
| Long-term memory needs | LSTM/GRU | Dialogue systems, strategy games |
| High uncertainty | Belief state | Robot localization (SLAM) |
| Large-scale sequences | Transformer | Decision Transformer |

Diagnostic Method:

If you suspect Markov property is violated, check:

  • Does the same observation sometimes require different actions to be optimal?
  • Does reward signal appear very delayed?
  • Does environment have hidden states (like opponent intent)?

If yes, consider state augmentation or RNN.

🎓 Summary: Key Points of Reinforcement Learning

MDP Framework:

The core of reinforcement learning is the Markov Decision Process: an agent interacts with the environment through a policy $\pi$, aiming to maximize the expected cumulative return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$.

Bellman Equations (Memory Formula):

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma\, v_\pi(s') \right]$$

Methodology Triangle:

          Dynamic Programming (Known Model)
                 /               \
                /                 \
  Monte Carlo                   Temporal Difference
  (Model-Free, High Variance)   (Model-Free, Low Variance)

Practical Checklist:

Core Intuition (Memory Mnemonic):

Value = Immediate Reward + Discounted Future Value

Policy Improvement = Greedily Choose High-Value Actions

Learning = Use Experience to Correct Estimates (MC/TD)
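The three mnemonics combine into value iteration. A toy tabular sketch under assumed array conventions (`P[a, s, s']` transition probabilities, `R[s, a]` rewards; the function name is illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=200):
    """Tiny value iteration on a tabular MDP.

    P: (A, S, S) transition tensor, R: (S, A) reward table.
    Returns the value function and the greedy policy.
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        # Value = Immediate Reward + Discounted Future Value
        Q = R + gamma * np.einsum('aij,j->ia', P, V)   # Q[s, a]
        # Policy Improvement = Greedily Choose High-Value Actions
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```

Learning methods (MC/TD) replace the exact expectation over `P` with sampled experience, which is the third mnemonic.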

Next Chapter Preview:

Chapter 1 established the theoretical foundation of reinforcement learning, but all methods are limited to small-scale tabular problems (finite states and actions). When state spaces are huge (like Atari game pixel spaces) or continuous, tabular methods cannot scale.

The next chapter introduces how to use deep neural networks to approximate Q-functions, giving birth to the groundbreaking work of deep reinforcement learning — DQN (Deep Q-Network). We will explore in depth:

  • Deep Q-Network architecture
  • Role of experience replay and target networks
  • DQN variants: Double DQN, Dueling DQN, Prioritized Experience Replay
  • Rainbow: The ensemble approach
  • Complete Atari game implementation

📚 References

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  2. Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
  3. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
  4. Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.
  5. Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  6. Silver, D. (2015). UCL Course on Reinforcement Learning. https://www.davidsilver.uk/teaching/
  7. Levine, S. (2023). CS 285: Deep Reinforcement Learning. UC Berkeley.
  8. Dayan, P. (1992). The convergence of TD($\lambda$) for general $\lambda$. Machine Learning, 8(3-4), 341-362.
  9. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD Thesis, Cambridge University.
  10. Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report, Cambridge University.

This is the first article in the Reinforcement Learning series, establishing the theoretical foundation for the entire series. Subsequent chapters will delve into deep reinforcement learning, policy optimization, exploration strategies, multi-agent learning, and other frontier topics.

  • Post title: Reinforcement Learning (1): Fundamentals and Core Concepts
  • Post author: Chen Kai
  • Create time: 2024-08-02 09:30:00
  • Post link: https://www.chenk.top/reinforcement-learning-1-fundamentals-and-core-concepts/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.