How should an intelligent agent learn optimal behavior in an
environment? When AlphaGo defeats world champions on the Go board, when
robots learn to walk and grasp objects, when recommendation systems
continuously optimize suggestions based on user feedback — all of these
share a common mathematical framework: reinforcement learning.
Reinforcement learning differs from supervised learning's "train with
labeled data" approach and unsupervised learning's "discover inherent
structure" paradigm. It addresses a problem closer to the real world:
learning optimal decision-making strategies for long-term
rewards through trial-and-error and reward feedback, without explicit
correct answers.
The Intuition: Learning to Ride a Bicycle
Learning to ride a bicycle is a classic reinforcement learning
process. No one hands you a "Standard Cycling Manual" detailing exactly
how to adjust the handlebars and body center of gravity at each moment
(no supervised learning labels). You must experiment yourself: lean
slightly left, the bike veers left; adjust your body right, it veers
right. Falling is negative feedback; maintaining balance is positive
feedback.
After hundreds of trials, your brain learns a policy: based on the
current bike tilt angle, speed, and road conditions, you decide in
real-time how to adjust your body and handlebars. This policy isn't
acquired through memorization but progressively optimized through the
cycle of interacting with the environment, receiving feedback,
and adjusting behavior.
The mathematical framework of reinforcement learning precisely
describes this process.
Markov Decision Process: The Mathematical Foundation
The core of reinforcement learning is the Markov Decision Process (MDP). An MDP is a five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
Basic Components
State Space:
All possible states of the environment. In cycling, the state might
include bike tilt angle, speed, distance to obstacles, etc. States can
be discrete (like board positions) or continuous (like robot joint
angles).
Action Space:
All actions the agent can execute. Actions can be discrete (like
moving up, down, left, right) or continuous (like applying a specific
torque).
Transition Probability:
The probability $P(s'|s,a)$ of transitioning to state $s'$ after executing action $a$ in state $s$. The transition probability completely defines the environment's dynamics. It satisfies the probability normalization condition: $\sum_{s'} P(s'|s,a) = 1$.
Reward Function:
The immediate reward $R(s,a,s')$ received after transitioning from state $s$ to $s'$ by executing action $a$. Rewards are the agent's only learning signal. In many cases this is simplified to $R(s,a)$ or $R(s)$.
Discount Factor:
The discount factor $\gamma \in [0, 1]$ measures the importance of future rewards. $\gamma = 0$ means caring only about immediate rewards, while $\gamma \to 1$ emphasizes long-term returns.
The Markov Property
The core assumption of MDPs is the Markov property: the future depends only on the current state, independent of history:
$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$$
This assumption appears strong but is practically useful. If the current state contains all relevant information (such as concatenating recent frames as the state), the Markov property can be satisfied.
Policy: Mapping from States to Actions
A policy $\pi$ is the agent's behavioral guideline, defining how to select actions in each state:
Deterministic Policy: $a = \pi(s)$ directly provides the action.
Stochastic Policy: $\pi(a|s) = P(A_t = a \mid S_t = s)$ provides a probability distribution over actions. Stochastic policies are more useful for exploration and handling partially observable environments.
Return and Value Functions
The agent's goal is to maximize cumulative return. The return from time step $t$ is defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The role of the discount factor:
- Mathematically: Ensures bounded return in infinite-horizon tasks (when $|R_t| \le R_{\max}$, we have $|G_t| \le \frac{R_{\max}}{1-\gamma}$)
- Intuitively: Distant rewards have higher uncertainty and should be discounted
- Practically: Prevents agents from excessively delaying reward acquisition
State Value Function:
The expected cumulative return starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
$V^\pi(s)$ measures "how good it is to be in state $s$" (following policy $\pi$).
Action Value Function:
The expected cumulative return from executing action $a$ in state $s$, then following policy $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
$Q^\pi(s, a)$ measures "how good it is to take action $a$ in state $s$".
Relationship between them:
$$V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s, a)$$
Bellman Equations: Recursive Structure of Value Functions
Value functions satisfy recursive relationships called Bellman
equations. These are the cornerstone of reinforcement learning
theory.
Bellman Expectation Equations (for any policy $\pi$):
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s')\, Q^\pi(s',a') \right]$$
Derivation Logic:
Current state value = expected immediate reward + expected discounted value of next state.
Define optimal value functions:
$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)$$
Optimal Bellman equations:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$Q^*(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]$$
Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s,a)$
Numerical Example:
Consider a simple 2-state MDP:
States: $\mathcal{S} = \{s_1, s_2\}$
Actions: $\mathcal{A} = \{a_1, a_2\}$ (available in each state)
Transition probabilities and rewards:

| $s$ | $a$ | $s'$ | $P(s'|s,a)$ | $R(s,a,s')$ |
|---|---|---|---|---|
| $s_1$ | $a_1$ | $s_1$ | 0.5 | 5 |
| $s_1$ | $a_1$ | $s_2$ | 0.5 | 10 |
| $s_1$ | $a_2$ | $s_1$ | 0.9 | 0 |
| $s_1$ | $a_2$ | $s_2$ | 0.1 | 15 |
| $s_2$ | $a_1$ | $s_1$ | 0.7 | 2 |
| $s_2$ | $a_1$ | $s_2$ | 0.3 | 8 |
| $s_2$ | $a_2$ | $s_2$ | 1.0 | 3 |

Discount factor: $\gamma = 0.9$
Policy: $\pi(s) = a_1$ (always choose $a_1$)
Computing $V^\pi$:
$$V^\pi(s_1) = 0.5\,(5 + 0.9\,V^\pi(s_1)) + 0.5\,(10 + 0.9\,V^\pi(s_2))$$
$$V^\pi(s_2) = 0.7\,(2 + 0.9\,V^\pi(s_1)) + 0.3\,(8 + 0.9\,V^\pi(s_2))$$
This is a system of two linear equations in $V^\pi(s_1)$ and $V^\pi(s_2)$. Rearranging:
$$0.55\,V^\pi(s_1) - 0.45\,V^\pi(s_2) = 7.5$$
$$-0.63\,V^\pi(s_1) + 0.73\,V^\pi(s_2) = 3.8$$
Solving yields $V^\pi(s_1) \approx 60.9$ and $V^\pi(s_2) \approx 57.8$. This demonstrates how value functions satisfy the recursive relationships defined by the Bellman equations.
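The linear system above can be checked numerically. A minimal sketch using NumPy, assuming the transition table as read above (rows in the order $s_1 a_1$, $s_1 a_2$, $s_2 a_1$, $s_2 a_2$):

```python
import numpy as np

# Bellman expectation equation for pi(s) = a1, written as (I - gamma * P_pi) V = r_pi
gamma = 0.9
P_pi = np.array([[0.5, 0.5],    # from s1 under a1: P(s1), P(s2)
                 [0.7, 0.3]])   # from s2 under a1: P(s1), P(s2)
# Expected immediate rewards under a1
r_pi = np.array([0.5 * 5 + 0.5 * 10,   # from s1: 7.5
                 0.7 * 2 + 0.3 * 8])   # from s2: 3.8
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)  # V(s1), V(s2)
```

Solving the linear system directly is feasible for small MDPs; iterative policy evaluation (below) scales to large state spaces.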
Dynamic Programming: Optimal Control with Known Models
When the environment model (transition probability $P(s'|s,a)$ and reward function $R$) is known, we can use dynamic programming to solve for the optimal policy.
Policy Evaluation: Computing the Value Function for a Given Policy
Given policy $\pi$, policy evaluation computes $V^\pi$.
Iterative algorithm:
1. Initialize $V_0(s) = 0$ (for all $s$)
2. Repeat until convergence, for all $s$:
$$V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V_k(s') \right]$$
This is the iterative form of the Bellman expectation equation. Convergence is guaranteed by the Banach fixed-point theorem (the Bellman operator is a contraction mapping).
Policy Improvement: Improving the Policy Based on the Value Function
Given $V^\pi$, construct a greedy policy:
$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
Policy Improvement Theorem: The greedy policy $\pi'$ is at least as good as $\pi$:
$$V^{\pi'}(s) \ge V^\pi(s) \quad \text{for all } s$$
Proof:
For any state $s$, by definition of the greedy policy and the Q-function:
$$Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s,a) \ge \sum_a \pi(a|s)\, Q^\pi(s,a) = V^\pi(s)$$
If we choose $\pi'(s)$ on the first step and then follow $\pi$, the return is at least as good as always following $\pi$. If we use the greedy policy at every step, the return will be even better.
Policy Iteration Algorithm
Alternates between policy evaluation and policy improvement:
1. Initialize a random policy $\pi_0$
2. Policy Evaluation: Solve for $V^{\pi_k}$
3. Policy Improvement:
$$\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^{\pi_k}(s') \right]$$
4. If $\pi_{k+1} = \pi_k$, stop; otherwise return to step 2
Policy iteration is guaranteed to converge to the optimal policy.
Value Iteration Algorithm
Directly iterates the optimal Bellman equation:
1. Initialize $V_0(s) = 0$
2. Repeat until convergence, for all $s$:
$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V_k(s') \right]$$
Value iteration is essentially a simplified version of policy iteration: each iteration implicitly performs one policy evaluation sweep and one improvement.
Convergence:
Define the Bellman operator:
$$(\mathcal{T}V)(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right]$$
We can prove $\mathcal{T}$ is a $\gamma$-contraction mapping:
$$\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
Therefore, value iteration converges exponentially to the unique fixed point $V^*$.
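The contraction property can be observed empirically. A small sketch on a randomly generated MDP (the sizes and the seed are arbitrary illustrative choices): the sup-norm distance to the fixed point shrinks by at least a factor of $\gamma$ on every sweep.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random transition tensor P[a, s, s'] (rows normalized) and rewards R[a, s]
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def bellman(V):
    # (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
    return np.max(R + gamma * P @ V, axis=0)

# Run long enough to reach the fixed point V* (to machine precision)
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = bellman(V_star)

# Measure the per-sweep contraction of the error from a fresh start
V = np.zeros(n_states)
errors = []
for _ in range(20):
    V = bellman(V)
    errors.append(np.max(np.abs(V - V_star)))
ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
```

Every ratio stays at or below $\gamma = 0.9$, matching the bound $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$.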
Code Implementation: GridWorld Environment
Let's implement a simple grid world where an agent moves from start
to goal, with walls blocking certain positions.
```python
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List


class GridWorld:
    """
    Simple Grid World Environment

    Parameters:
        grid_size: (height, width), grid dimensions
        start: (row, col), starting position
        goal: (row, col), goal position
        obstacles: List[(row, col)], obstacle positions
    """
    def __init__(self, grid_size=(5, 5), start=(0, 0), goal=(4, 4), obstacles=[]):
        self.height, self.width = grid_size
        self.start = start
        self.goal = goal
        self.obstacles = set(obstacles)

        # Action space: 0=up, 1=down, 2=left, 3=right
        self.actions = [0, 1, 2, 3]
        self.action_names = ['↑', '↓', '←', '→']

        # Action-to-state-change mapping
        self.action_deltas = {
            0: (-1, 0),  # up
            1: (1, 0),   # down
            2: (0, -1),  # left
            3: (0, 1)    # right
        }

        self.reset()

    def reset(self) -> Tuple[int, int]:
        """Reset environment, return starting position"""
        self.state = self.start
        return self.state

    def step(self, action: int) -> Tuple[Tuple[int, int], float, bool]:
        """
        Execute action

        Returns:
            next_state: next state
            reward: immediate reward
            done: whether goal is reached
        """
        dr, dc = self.action_deltas[action]
        next_r = self.state[0] + dr
        next_c = self.state[1] + dc

        # Check boundaries and obstacles; blocked moves leave the state unchanged
        if (0 <= next_r < self.height and 0 <= next_c < self.width
                and (next_r, next_c) not in self.obstacles):
            self.state = (next_r, next_c)

        # Rewards: +10 for goal, -1 otherwise
        if self.state == self.goal:
            return self.state, 10.0, True
        else:
            return self.state, -1.0, False

    def get_transition_prob(self, state, action):
        """
        Get transition probability (deterministic environment)

        Returns:
            List[(next_state, prob, reward)]
        """
        # Save current state, simulate one step, then restore
        old_state = self.state
        self.state = state
        next_state, reward, done = self.step(action)
        self.state = old_state
        return [(next_state, 1.0, reward)]
```
```python
def value_iteration(env: GridWorld, gamma=0.9, theta=1e-6, max_iter=1000):
    """
    Value Iteration Algorithm

    Parameters:
        env: environment object
        gamma: discount factor
        theta: convergence threshold
        max_iter: maximum iterations

    Returns:
        V: state value function, shape=(height, width)
        policy: optimal policy, shape=(height, width)
    """
    height, width = env.height, env.width
    V = np.zeros((height, width))

    for iter_count in range(max_iter):
        delta = 0
        # Iterate over all states
        for r in range(height):
            for c in range(width):
                state = (r, c)
                # Skip goal and obstacles
                if state == env.goal or state in env.obstacles:
                    continue

                v = V[r, c]
                # Compute Q-values for all actions
                q_values = []
                for action in env.actions:
                    q = 0
                    for next_state, prob, reward in env.get_transition_prob(state, action):
                        next_r, next_c = next_state
                        q += prob * (reward + gamma * V[next_r, next_c])
                    q_values.append(q)

                # Bellman optimality update
                V[r, c] = max(q_values)
                delta = max(delta, abs(v - V[r, c]))

        # Check convergence
        if delta < theta:
            print(f"Value iteration converged after {iter_count + 1} iterations")
            break

    # Extract optimal policy
    policy = np.zeros((height, width), dtype=int)
    for r in range(height):
        for c in range(width):
            state = (r, c)
            if state == env.goal or state in env.obstacles:
                continue
            q_values = []
            for action in env.actions:
                q = 0
                for next_state, prob, reward in env.get_transition_prob(state, action):
                    next_r, next_c = next_state
                    q += prob * (reward + gamma * V[next_r, next_c])
                q_values.append(q)
            policy[r, c] = np.argmax(q_values)

    return V, policy
```
```python
def visualize_policy(env: GridWorld, V, policy):
    """Visualize value function and policy"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Value function heatmap
    im1 = ax1.imshow(V, cmap='coolwarm')
    ax1.set_title('State Value Function V(s)')
    plt.colorbar(im1, ax=ax1)

    # Annotate values
    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                ax1.text(c, r, 'G', ha='center', va='center',
                         color='white', fontsize=16, fontweight='bold')
            elif (r, c) in env.obstacles:
                ax1.text(c, r, 'X', ha='center', va='center',
                         color='black', fontsize=16)
            else:
                ax1.text(c, r, f'{V[r, c]:.1f}', ha='center', va='center')

    # Policy arrows
    ax2.imshow(np.zeros((env.height, env.width)), cmap='gray', alpha=0.3)
    ax2.set_title('Optimal Policy π*(s)')
    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                ax2.text(c, r, 'G', ha='center', va='center',
                         color='green', fontsize=16, fontweight='bold')
            elif (r, c) in env.obstacles:
                ax2.text(c, r, 'X', ha='center', va='center',
                         color='red', fontsize=16)
            else:
                action = policy[r, c]
                ax2.text(c, r, env.action_names[action],
                         ha='center', va='center', fontsize=20)

    plt.tight_layout()
    plt.savefig('gridworld_value_iteration.png', dpi=150)
    plt.show()
```
```python
# Run example
if __name__ == "__main__":
    # Create 5x5 grid world with obstacles
    env = GridWorld(
        grid_size=(5, 5),
        start=(0, 0),
        goal=(4, 4),
        obstacles=[(1, 1), (2, 2), (3, 1)]
    )

    # Run value iteration
    V, policy = value_iteration(env, gamma=0.9, theta=1e-6)

    # Visualize results
    visualize_policy(env, V, policy)

    print("\nOptimal Value Function:")
    print(V)

    print("\nOptimal Policy:")
    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                print('G', end=' ')
            elif (r, c) in env.obstacles:
                print('X', end=' ')
            else:
                print(env.action_names[policy[r, c]], end=' ')
        print()
```
Code Explanation:
1. The GridWorld class implements the environment, providing state transitions and rewards
2. value_iteration implements the value iteration algorithm until the value function converges ($\delta < \theta$)
3. The optimal policy is extracted by greedily choosing $\arg\max_a Q(s,a)$
4. Visualization displays the value function heatmap and policy direction arrows
Monte Carlo Methods: Model-Free Learning
Dynamic programming requires a complete environment model, but in
practice, models are often unknown or difficult to construct. Monte
Carlo (MC) methods estimate value functions by sampling trajectories,
without requiring environment models.
Core Idea
The MC method's basic approach: the value function is an expected return, which can be approximated by a sample average:
$$V^\pi(s) \approx \frac{1}{N} \sum_{i=1}^{N} G^{(i)}(s)$$
where $G^{(i)}(s)$ is the return observed in the $i$-th trajectory starting from state $s$.
Steps:
1. The agent interacts with the environment following policy $\pi$, generating complete trajectories (episodes): $s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_T$
2. For each state $s$ in the trajectory, compute the return $G_t$ starting from that state
3. Update $V(s)$ as the average of the returns observed from $s$
First-Visit MC vs Every-Visit MC
First-Visit MC:
For each trajectory, record the return only on the first visit to state $s$.
Algorithm:
1. Initialize $V(s) = 0$ and $Returns(s)$ = empty list
2. For each trajectory:
   - Compute the return $G_t$ at each timestep
   - For each state $s$ (only its first occurrence): append $G_t$ to $Returns(s)$, then set $V(s) = \text{mean}(Returns(s))$
Every-Visit MC:
For each trajectory, record returns on every visit to state $s$.
The difference lies in handling multiple visits to the same state within one trajectory. First-Visit MC is more commonly used and its unbiasedness is easier to prove.
Convergence of MC Policy Evaluation
Theorem: The First-Visit MC method converges to the true value function:
$$V_N(s) \to V^\pi(s) \quad \text{as } N(s) \to \infty$$
where $N(s)$ is the number of trajectories visiting state $s$.
Proof: The returns $G^{(i)}(s)$ observed on first visits to $s$ are independent and identically distributed random variables with expectation $V^\pi(s)$. By the law of large numbers, the sample mean converges to the expectation.
MC Control: Policy Improvement
MC methods can also be used for control problems (finding optimal policies). The key is estimating $Q^\pi(s,a)$ rather than $V^\pi(s)$, because policy improvement requires knowing the value of each action.
MC Control Algorithm (based on exploring starts):
1. Initialize policy $\pi$ and $Q(s,a)$
2. Policy evaluation: For each $(s,a)$ pair in the trajectory, update $Q(s,a)$ as the average of observed returns
3. Policy improvement: For each $s$, update $\pi(s) = \arg\max_a Q(s,a)$
Exploring Starts Assumption: Every state-action pair has nonzero probability of being the starting point. This ensures all state-action pairs can be visited.
ε-Greedy Policy: Balancing Exploration and Exploitation
Exploring starts are difficult to implement in practice. A more practical approach is an $\varepsilon$-greedy policy: with probability $1 - \varepsilon$, choose the current best action (exploit); with probability $\varepsilon$, choose an action uniformly at random (explore).
On-Policy vs Off-Policy:
- On-Policy: Evaluate and improve the currently executing policy (like $\varepsilon$-greedy)
- Off-Policy: Evaluate one policy (the target policy) while executing another (the behavior policy)
MC control is typically on-policy.
Incremental Update
Storing all returns and computing averages has high memory overhead. The incremental update formula:
$$V(s) \leftarrow V(s) + \alpha \left( G - V(s) \right)$$
where $\alpha$ is the learning rate. When $\alpha$ is constant, this is an exponentially weighted average with higher weight for recent samples.
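With $\alpha_n = 1/n$ the incremental rule reproduces the exact sample average, which is what lets it replace the stored return lists. A quick sketch (the sample values are arbitrary):

```python
returns = [4.0, 8.0, 5.0, 3.0]  # illustrative observed returns for one state

# Batch average over all stored returns
batch_mean = sum(returns) / len(returns)

# Incremental update: V <- V + (1/n) * (G - V), no list needed
V = 0.0
for n, G in enumerate(returns, start=1):
    V += (1.0 / n) * (G - V)
```

Both computations give the same mean; the incremental form needs O(1) memory per state.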
```python
import numpy as np
from collections import defaultdict
```
```python
def mc_policy_evaluation(env, policy, num_episodes=10000, gamma=0.9):
    """
    Monte Carlo Policy Evaluation (First-Visit)

    Parameters:
        env: environment object
        policy: policy function, policy(state) -> action
        num_episodes: number of trajectories to sample
        gamma: discount factor

    Returns:
        V: state value function dictionary
    """
    # Store return lists for each state
    returns = defaultdict(list)
    V = defaultdict(float)

    for episode in range(num_episodes):
        # Generate trajectory
        trajectory = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Compute returns backward from the end; overwriting means the
        # surviving entry is the return from each state's FIRST visit
        G = 0
        first_returns = {}
        for t in range(len(trajectory) - 1, -1, -1):
            state, action, reward = trajectory[t]
            G = reward + gamma * G
            first_returns[state] = G

        for state, G in first_returns.items():
            returns[state].append(G)
            V[state] = np.mean(returns[state])

        if (episode + 1) % 1000 == 0:
            print(f"Completed {episode + 1} episodes")

    return V
```
```python
def mc_control(env, num_episodes=10000, gamma=0.9, epsilon=0.1):
    """
    Monte Carlo Control (On-Policy, ε-greedy)

    Returns:
        Q: action-value function dictionary
        policy: learned policy
    """
    Q = defaultdict(lambda: np.zeros(len(env.actions)))
    returns = defaultdict(list)

    for episode in range(num_episodes):
        # Generate trajectory (using ε-greedy policy)
        trajectory = []
        state = env.reset()
        done = False
        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.choice(env.actions)
            else:
                action = np.argmax(Q[state])
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Compute returns backward; keep each pair's first-visit return
        G = 0
        first_returns = {}
        for t in range(len(trajectory) - 1, -1, -1):
            state, action, reward = trajectory[t]
            G = reward + gamma * G
            first_returns[(state, action)] = G

        for (state, action), G in first_returns.items():
            returns[(state, action)].append(G)
            Q[state][action] = np.mean(returns[(state, action)])

        if (episode + 1) % 1000 == 0:
            print(f"Completed {episode + 1} episodes")

    # Extract greedy policy
    policy = lambda s: np.argmax(Q[s])
    return Q, policy
```
Pros and Cons of MC Methods:
Pros:
- Model-free
- Can learn from actual experience or simulation
- Can focus on states of interest
- Simple and intuitive
Cons:
- Requires complete trajectories (must wait until the episode ends)
- High variance (the return is a sum of many random rewards)
- Inefficient in continuing tasks or long episodes
Temporal Difference Learning: Combining MC and DP
Temporal Difference (TD) learning is the core of reinforcement
learning, combining MC's model-free nature with DP's bootstrapping.
TD(0): The Simplest TD Method
TD(0) update rule:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
TD Error:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
The TD error measures the discrepancy between the "actual observation" ($R_{t+1} + \gamma V(S_{t+1})$) and the "current estimate" ($V(S_t)$).
Comparison with MC:

| Method | Update Target | Characteristics |
|---|---|---|
| MC | $G_t$ (actual return) | Unbiased, high variance, requires complete trajectory |
| TD(0) | $R_{t+1} + \gamma V(S_{t+1})$ (bootstrapped) | Biased, low variance, updates every step |

TD methods can update after every step without waiting for episode completion. This makes TD methods applicable to continuing tasks (tasks without explicit termination states).
Convergence of TD(0)
Theorem (Sutton, 1988): For any policy $\pi$, if the following conditions are met:
1. The learning rate satisfies the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
2. All states are visited infinitely often
Then TD(0) converges almost surely to $V^\pi$.
Common learning rates: $\alpha_t = 1/t$ or fixed small values (like 0.1).
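A minimal TD(0) sketch on a hand-built deterministic chain (the states, rewards, and constant $\alpha = 0.1$ are illustrative choices): $s_0 \to s_1 \to$ terminal with rewards 1 and 2, $\gamma = 0.9$, so the true values are $V(s_1) = 2$ and $V(s_0) = 1 + 0.9 \cdot 2 = 2.8$.

```python
gamma, alpha = 0.9, 0.1
V = [0.0, 0.0, 0.0]  # V[2] is the terminal state, fixed at 0

# Deterministic transitions: (state, reward, next_state)
episode = [(0, 1.0, 1), (1, 2.0, 2)]

for _ in range(500):
    for s, r, s_next in episode:
        td_error = r + gamma * V[s_next] - V[s]  # delta_t
        V[s] += alpha * td_error                 # update after EVERY step
```

The estimates approach the true values geometrically, without ever storing a full return.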
Sarsa: On-Policy TD Control
Sarsa (State-Action-Reward-State-Action) is the TD method for Q-functions:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
The algorithm's name comes from the quintuple needed for the update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
Sarsa is an on-policy method: it evaluates and improves the currently executing policy (like $\varepsilon$-greedy).
Q-Learning: Off-Policy TD Control
Q-Learning is one of the most important breakthroughs in TD methods
(Watkins, 1989).
Q-Learning Update Rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
The difference from Sarsa: Sarsa uses the actually executed next action $A_{t+1}$, while Q-Learning uses the optimal action $\max_a Q(S_{t+1}, a)$.
Q-Learning Algorithm:
1. Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}$
2. For each episode, initialize $S$ and repeat for each step:
   - Choose action $A$ from $S$ using a policy derived from $Q$ (e.g., $\varepsilon$-greedy)
   - Execute $A$, observe $R$, $S'$
   - Update: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$
   - $S \leftarrow S'$; until $S$ is terminal
Q-Learning is an off-policy method: the behavior policy (like $\varepsilon$-greedy) is used for exploration, while the target policy (the greedy policy) is used for evaluation.
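The loop above can be sketched compactly on a hand-built corridor (three non-terminal states, moving right past the last state reaches a goal worth +10, every other step costs −1; the environment and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha, eps = 3, 0.9, 0.5, 0.3
Q = np.zeros((n_states, 2))  # actions: 0 = left, 1 = right

for _ in range(500):
    s = 0
    while True:
        # ε-greedy behavior policy
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = s + 1 if a == 1 else max(s - 1, 0)
        if s_next == n_states:  # goal reached beyond the last state
            Q[s, a] += alpha * (10.0 - Q[s, a])
            break
        # Off-policy target: greedy max over next actions
        Q[s, a] += alpha * (-1.0 + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

greedy = Q.argmax(axis=1)  # learned greedy policy
```

Despite the noisy $\varepsilon$-greedy behavior, the greedy policy extracted from $Q$ is "always move right", the optimal policy.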
Convergence of Q-Learning:
Theorem (Watkins & Dayan, 1992): If:
1. All state-action pairs are visited infinitely often
2. The learning rate satisfies the Robbins-Monro conditions
Then Q-Learning converges to the optimal Q-function $Q^*$, independent of the behavior policy.
This is Q-Learning's huge advantage: it can learn the optimal policy from data generated by any policy.
Sarsa vs Q-Learning: Intuitive Comparison
Consider a "cliff walking" environment:

S . . . . . . G
C C C C C C C C

S: start, G: goal, C: cliff (reward = -100)
The agent moves from start to goal; falling off the cliff returns it to the start with a large negative reward.
Sarsa (On-Policy):
Since it uses an $\varepsilon$-greedy policy, Sarsa accounts for the possibility of falling off the cliff during exploration. It learns a policy that stays away from the cliff edge, taking a safer but longer path.
Q-Learning (Off-Policy):
Q-Learning evaluates the greedy policy, assuming no random exploration. It learns a policy that walks along the cliff edge for the shortest path, but during execution (due to $\varepsilon$-greedy exploration), it might fall off the cliff.
This illustrates:
- Sarsa is more conservative (it accounts for exploration risk)
- Q-Learning is more aggressive (it evaluates the optimal policy)
TD(λ) and Eligibility Traces
TD(0) only uses the one-step reward for bootstrapping. TD($\lambda$) is a more general framework that combines multi-step returns.
$n$-step return:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$
TD($\lambda$) update:
Uses a weighted average of all $n$-step returns:
$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
where $\lambda \in [0, 1]$ controls the weight allocation.
Eligibility Trace:
The eligibility trace is an efficient implementation that maintains a trace $E_t(s)$ for each state:
$$E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbb{1}[S_t = s]$$
Update rule:
$$V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s) \quad \text{for all } s$$
The eligibility trace records a state's "credit": recently visited states and frequently visited states have higher traces, receiving more credit assignment.
Special cases:
- $\lambda = 0$: TD(0)
- $\lambda = 1$: Equivalent to MC (using the complete return)
```python
def sarsa_lambda(env, num_episodes=1000, alpha=0.1, gamma=0.9,
                 epsilon=0.1, lambda_=0.9):
    """
    Sarsa(λ) Algorithm (with eligibility traces)

    Parameters:
        lambda_: eligibility trace decay factor
    """
    Q = {}
    for r in range(env.height):
        for c in range(env.width):
            Q[(r, c)] = np.zeros(len(env.actions))

    for episode in range(num_episodes):
        # Initialize eligibility traces
        E = {}
        for r in range(env.height):
            for c in range(env.width):
                E[(r, c)] = np.zeros(len(env.actions))

        state = env.reset()
        # Choose initial action (ε-greedy)
        if np.random.random() < epsilon:
            action = np.random.choice(env.actions)
        else:
            action = np.argmax(Q[state])

        done = False
        while not done:
            # Execute action
            next_state, reward, done = env.step(action)

            # Choose next action (ε-greedy)
            if np.random.random() < epsilon:
                next_action = np.random.choice(env.actions)
            else:
                next_action = np.argmax(Q[next_state])

            # Compute TD error
            td_error = reward + gamma * Q[next_state][next_action] - Q[state][action]

            # Bump the trace of the current state-action pair
            E[state][action] += 1

            # Update ALL state-action pairs, then decay their traces
            for s in E:
                for a in range(len(env.actions)):
                    Q[s][a] += alpha * td_error * E[s][a]
                    E[s][a] *= gamma * lambda_

            # Transition
            state = next_state
            action = next_action

        if (episode + 1) % 100 == 0:
            print(f"Sarsa(λ) completed episode {episode + 1}")

    return Q
```
```python
# Comparison experiment
if __name__ == "__main__":
    from gridworld import GridWorld

    env = GridWorld(grid_size=(5, 5), start=(0, 0), goal=(4, 4),
                    obstacles=[(2, 2)])

    # sarsa() and q_learning() follow the update rules described above
    print("Training Sarsa...")
    Q_sarsa = sarsa(env, num_episodes=1000)

    print("\nTraining Q-Learning...")
    Q_qlearn = q_learning(env, num_episodes=1000)

    print("\nTraining Sarsa(λ)...")
    Q_sarsa_lambda = sarsa_lambda(env, num_episodes=1000, lambda_=0.9)

    # Extract policies and compare
    policy_sarsa = {s: np.argmax(Q_sarsa[s]) for s in Q_sarsa}
    policy_qlearn = {s: np.argmax(Q_qlearn[s]) for s in Q_qlearn}
    print("\nPolicy comparison complete")
```
Key Code Points:
1. Sarsa updates using $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the actually executed action
2. Q-Learning updates using $\max_a Q(S_{t+1}, a)$, taking the maximum over next actions
3. Sarsa($\lambda$) maintains eligibility traces $E(s, a)$, recording "credit" for each state-action pair
4. All three methods use $\varepsilon$-greedy for exploration
❓ Q&A: Common Questions on RL Fundamentals
Q1: Why do we need the discount factor? Can't we directly maximize undiscounted return?
Mathematical Necessity:
In infinite-horizon tasks, the undiscounted return $\sum_t R_t$ may diverge to infinity. The discount factor ensures a bounded return: $|G_t| \le \frac{R_{\max}}{1 - \gamma}$.
Economic Interpretation:
The discount factor reflects "time value": $1 today is worth more than $1 tomorrow. In reinforcement learning, this manifests as:
- Rewards obtained earlier have more value (reduced uncertainty)
- Agents are prevented from indefinitely delaying reward acquisition
Practical Impact:

| $\gamma$ Value | Behavioral Characteristics | Suitable Scenarios |
|---|---|---|
| 0.0 - 0.5 | Myopic, focuses on near-term rewards | Financial trading, quick decisions |
| 0.9 - 0.95 | Balanced, commonly used default | Most tasks |
| 0.99 - 0.999 | Farsighted, values long-term gains | Go, strategic planning |
| 1.0 | No discounting (finite-step tasks only) | Games with clear termination |
Selection Guidelines:
- If the task has a clear termination state (episodic), $\gamma$ can be close to 1
- If the task is continuing, $\gamma$ should be < 1
- Tune $\gamma$ via cross-validation
Q2: Why is the Bellman equation a "contraction mapping"? What does this mean for convergence?
Contraction Mapping Definition:
An operator $\mathcal{T}$ is a contraction mapping if there exists $\gamma \in [0, 1)$ such that for any functions $V_1, V_2$:
$$\|\mathcal{T}V_1 - \mathcal{T}V_2\| \le \gamma \|V_1 - V_2\|$$
Proof of Bellman Operator Contraction:
Define the Bellman operator:
$$(\mathcal{T}V)(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right]$$
For any two value functions $V_1, V_2$ and any state $s$:
$$|(\mathcal{T}V_1)(s) - (\mathcal{T}V_2)(s)| \le \max_a \sum_{s'} P(s'|s,a)\, \gamma\, |V_1(s') - V_2(s')| \le \gamma \|V_1 - V_2\|_\infty$$
Banach Fixed-Point Theorem:
A contraction mapping has a unique fixed point, and iteration from any initial value converges to that fixed point: $V_{k+1} = \mathcal{T}V_k \to V^*$.
Convergence rate:
$$\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$$
Exponential convergence! This means value iteration and policy evaluation converge quickly.
Practical Significance:
- Smaller $\gamma$ converges faster (but policy quality may decrease)
- Can use $\|V_{k+1} - V_k\|_\infty < \theta$ as a stopping criterion
- Even with poor initial values, convergence to the optimal value is guaranteed
Q3: Dynamic programming requires complete models, but models are often unknown in practice. Does this mean DP has no practical value?
Cases Where Models Are Available:
Known game rules: Board games (Go, Chess), video
games
Learned models: Use supervised learning to fit the transition function $\hat{P}(s'|s,a)$ and reward $\hat{R}(s,a)$ from data
Model-Based RL:
Combines model learning with planning:
1. Collect data from real interactions: $\mathcal{D} = \{(s, a, r, s')\}$
2. Learn a model $\hat{P}(s'|s,a)$ and $\hat{R}(s,a)$ from $\mathcal{D}$
3. Use DP to plan on the learned model
This is called the Dyna architecture (Sutton, 1990).
DP's Theoretical Value:
Even if DP algorithms aren't directly used, the theoretical framework
(Bellman equations, policy improvement theorem) is foundational for all
RL methods:
TD methods are essentially sampled versions of DP
Policy gradient methods' convergence relies on the policy
improvement theorem
Target networks in deep RL originate from DP ideas
Practical Recommendations:
If model is known or easily modeled, prioritize DP (most
efficient)
If model is unknown but can be sampled, use MC or TD
If samples are expensive, consider model-based methods
Q4: Which is better, Monte Carlo methods or TD methods?
Both have advantages and disadvantages depending on task
characteristics:
Monte Carlo Methods:
✅ Advantages:
- Unbiased estimates (the sample mean of returns converges to the true value)
- Don't rely on the Markov property
- Simple and intuitive
❌ Disadvantages:
- High variance (the return is a sum of a long sequence of random rewards)
- Require complete trajectories (can't be used for continuing tasks)
- Slow learning (especially for long trajectories)
TD Methods:
✅ Advantages:
- Low variance (depends only on one-step randomness)
- Single-step updates (applicable to continuing tasks)
- Fast learning (fully utilizes bootstrapping)
❌ Disadvantages:
- Biased (initially based on inaccurate estimates)
- Depend on the Markov property
- Sensitive to initial values
Rules of Thumb:

| Task Characteristics | Recommended Method | Reason |
|---|---|---|
| Short episodes | MC | Variance not large, unbiased estimation more important |
| Long/continuing tasks | TD | Must use single-step updates |
| High randomness | TD | MC variance too large |
| Function approximation | TD | MC's high variance amplifies approximation error |
| Offline data | MC | Fixed trajectories, can't bootstrap |
Combined Use:
TD($\lambda$) interpolates between MC and TD through the parameter $\lambda$:
- $\lambda = 0$: TD(0)
- $\lambda = 1$: Close to MC
- $0 < \lambda < 1$: Balances bias and variance
In practice, $\lambda \approx 0.9$ typically works best.
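The $\lambda$-return's weights $(1-\lambda)\lambda^{n-1}$ form a geometric distribution over $n$-step returns, so they sum to 1, and their mean fixes the effective lookahead depth at $1/(1-\lambda)$ steps. A quick numeric check, truncating the infinite sum (the truncation length is arbitrary):

```python
lam = 0.9
# Weight on each n-step return in the lambda-return, n = 1..199
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 200)]
total = sum(weights)

# Mean lookahead depth implied by lambda: should be 1 / (1 - lambda) = 10 steps
mean_depth = sum(n * w for n, w in enumerate(weights, start=1)) / total
```

This makes the interpolation concrete: $\lambda = 0.9$ behaves like an average over roughly ten-step returns.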
Q5: Q-Learning guarantees convergence to the optimal policy, so why do we need Sarsa?
Q-Learning indeed has stronger theoretical guarantees (it converges to $Q^*$), but Sarsa is more practical in certain scenarios:
Safety Considerations:
Sarsa is an on-policy method that evaluates the actually executed policy (including exploration). In safety-sensitive tasks (like robot control, medical decisions), Sarsa learns more conservative policies.
- Q-Learning: Learns the shortest path along the cliff edge (it evaluates the greedy policy), but frequently falls off the cliff during execution (due to $\varepsilon$-greedy exploration)
- Sarsa: Learns a safe path far from the cliff (it accounts for exploration risk)
Impact of Exploration Strategy:
Q-Learning's convergence requires all state-action pairs to be
visited infinitely but doesn't require a specific exploration strategy.
Sarsa's performance directly depends on exploration strategy
quality.
Convergence Speed:
Theoretically Q-Learning converges faster (directly optimizes optimal
policy), but in practice:
If exploration strategy differs greatly from optimal policy,
Q-Learning's sample efficiency may be lower
Sarsa converges more stably in some tasks
Practical Recommendations:
Try Q-Learning first: More general, stronger
theory
Switch to Sarsa if:
Training is unstable
Task has safety requirements
Exploration noise is large
Q6: Why do eligibility traces accelerate credit assignment?
The credit assignment problem: In long trajectories, how do early
actions receive "credit" for distant rewards?
TD(0)'s Limitation:
TD(0) only backpropagates information one step per update. If a reward is obtained at time $T$, only $V(S_{T-1})$ is directly updated; earlier states need multiple later episodes to indirectly benefit.
Eligibility Trace Mechanism:
The eligibility trace $E_t(s)$ records state $s$'s "responsibility" for the current reward:
$$E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbb{1}[S_t = s]$$
Update rule:
$$V(s) \leftarrow V(s) + \alpha \, \delta_t \, E_t(s) \quad \text{for all } s$$
Intuition:
Frequency heuristic: Frequently visited states have
higher traces (cumulative effect)
Recency heuristic: Recently visited states have
higher traces (decay)
Global update: Each TD error $\delta_t$ updates all states (weighted by their eligibility traces)
Example:
Suppose the trajectory is $s_1 \to s_2 \to s_3 \to s_4$, with a reward $r$ obtained on arriving at $s_4$.
TD(0):
- Only $V(s_3)$ is directly updated
- $V(s_1)$ and $V(s_2)$ are affected only after multiple later episodes
Sarsa($\lambda$) with $\lambda > 0$:
At the final step the TD error $\delta = r + \gamma V(s_4) - V(s_3)$ becomes nonzero, and all states in the trajectory are updated simultaneously, each weighted by its trace $E(s_1), E(s_2), E(s_3)$.
The reward signal propagates through the entire trajectory in one step!
Practical Effect:
- In long-trajectory tasks, Sarsa($\lambda$) can be several to dozens of times faster than Sarsa(0)
- $\lambda \approx 0.9$ typically works best
- Cost: an eligibility trace must be maintained for each state (memory overhead)
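The one-episode effect can be made concrete with a minimal sketch (hand-picked $\alpha$, $\gamma$, $\lambda$; a three-step trajectory with a single final reward of 1): after one episode, the trace update has moved every state's value, while TD(0) would have moved only the last one.

```python
alpha, gamma, lam = 0.5, 0.9, 0.8

V = [0.0, 0.0, 0.0]   # values of s0, s1, s2
E = [0.0, 0.0, 0.0]   # eligibility traces
trajectory = [(0, 0.0), (1, 0.0), (2, 1.0)]  # (state, reward on leaving it)

for s, r in trajectory:
    v_next = V[s + 1] if s + 1 < len(V) else 0.0  # terminal value = 0
    delta = r + gamma * v_next - V[s]             # TD error
    E[s] += 1.0                                   # accumulating trace
    for i in range(len(V)):                       # update ALL states
        V[i] += alpha * delta * E[i]
        E[i] *= gamma * lam                       # decay traces
```

Only the final step produces a nonzero TD error, yet all three values move: $V(s_2)$ by $\alpha\delta$, and the earlier states by the same error scaled by their decayed traces.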
Q7: How to choose an appropriate exploration strategy? What alternatives exist to ε-greedy?
Problems with ε-greedy:
- Exploration is uniformly random (doesn't utilize prior knowledge)
- With fixed $\varepsilon$, late-stage training still does ineffective exploration
- All actions are explored with equal probability (even obviously bad actions)
Improved Approaches:
1. Decaying ε-greedy:
Reduce the exploration rate over time:
$$\varepsilon_t = \max(\varepsilon_{\min}, \, \varepsilon_0 \cdot \text{decay}^t)$$
Common settings decay $\varepsilon$ from a large initial value toward a small floor.
2. Boltzmann Exploration (Softmax):
Choose actions based on relative Q-value magnitudes:
$$\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}$$
The temperature parameter $\tau$ controls randomness:
- $\tau \to 0$: Close to greedy
- $\tau \to \infty$: Close to uniform random
Advantage: Actions with higher Q-values have higher selection probability.
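A small sketch of the softmax formula above (the Q-values and temperatures are arbitrary illustrative numbers):

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    # Subtract the max before exponentiating, for numerical stability
    z = (np.asarray(q_values) - np.max(q_values)) / tau
    e = np.exp(z)
    return e / e.sum()

q = [1.0, 2.0, 3.0]
p_hot = boltzmann_probs(q, tau=5.0)    # high temperature: near uniform
p_cold = boltzmann_probs(q, tau=0.1)   # low temperature: near greedy
```

At high temperature the probabilities are nearly flat; at low temperature nearly all mass sits on the best action, while worse actions still keep ordered, nonzero probability.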
3. UCB (Upper Confidence Bound):
Uncertainty-based exploration:
$$a = \arg\max_a \left[ Q(s,a) + c \sqrt{\frac{\ln t}{N(s,a)}} \right]$$
where $N(s,a)$ is the number of times action $a$ has been selected in state $s$.
Q8: How should the learning rate be chosen?
Necessary and sufficient conditions for guaranteed convergence (the Robbins-Monro conditions):
$$\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty$$
Examples satisfying the conditions:
- $\alpha_t = 1/t$: Classic decay
- $\alpha_t = 1/N(s)$: Based on visit count
Fixed Learning Rate:
Pros:
- Simple
- Can adapt to non-stationary environments (higher weight for recent samples)
Cons:
- Doesn't satisfy the Robbins-Monro conditions (theoretically doesn't guarantee convergence to the optimum)
- Will eventually oscillate around the optimal solution
Common values: fixed small constants like 0.1, determined by cross-validation.
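The trade-off between $\alpha_t = 1/t$ and a fixed $\alpha$ shows up clearly on a non-stationary target. A sketch with a synthetic reward stream (values chosen purely for illustration) whose mean jumps from 0 to 10 halfway through:

```python
rewards = [0.0] * 100 + [10.0] * 100  # mean shifts halfway through

v_decay, v_fixed = 0.0, 0.0
for t, r in enumerate(rewards, start=1):
    v_decay += (1.0 / t) * (r - v_decay)  # Robbins-Monro: exact running mean
    v_fixed += 0.1 * (r - v_fixed)        # fixed alpha: weights recent samples
```

The decaying rate ends at the overall mean (5.0), blind to the shift; the fixed rate tracks the new mean of 10, which is why fixed rates are preferred in non-stationary environments.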
Adaptive Learning Rates:
1. Visit-Count-Based:
$$\alpha_t(s) = \frac{1}{N_t(s)}$$
where $N_t(s)$ is the visit count for state $s$.
Pros:
- Satisfies the Robbins-Monro conditions
- Frequently visited states automatically get a lower learning rate
Cons:
- Slow learning in late stages
- Requires maintaining visit counts
2. AdaGrad Style:
Borrows from deep learning's adaptive optimizers:
$$\alpha_t = \frac{\alpha_0}{\sqrt{\sum_{i=1}^{t} g_i^2} + \epsilon}$$
where $g_i$ is the gradient (or TD error).
3. Adam Style:
Uses first and second moment estimates of past updates to rescale the step size.
Practical Recommendations:

| Scenario | Recommended Method | Parameters |
|---|---|---|
| Simple tabular RL | Fixed learning rate | — |
| Deep RL | Adam optimizer | — |
| Online learning | Fixed learning rate | — |
| Offline learning | Decaying learning rate | Initial 0.1, decay to 0.001 |
| Non-stationary env | Fixed learning rate | Maintain adaptability |
Tuning Tips:
- Start with a larger learning rate (like 0.1) and observe the training curve
- If training is unstable or divergent, reduce the learning rate
- If convergence is too slow, increase the learning rate
- Use a learning rate schedule (like linear decay or exponential decay)
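The decay schedules mentioned in the tips can be written as small helper functions (names and constants here are illustrative):

```python
def linear_decay(t, total, lr0=0.1, lr_min=0.001):
    """Linearly anneal from lr0 to lr_min over `total` steps."""
    frac = min(t / total, 1.0)
    return lr0 + frac * (lr_min - lr0)

def exp_decay(t, lr0=0.1, rate=0.999):
    """Multiply the initial rate by a constant factor each step."""
    return lr0 * rate ** t

print(linear_decay(0, 10000), linear_decay(10000, 10000))
```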
Q10:
The Markov property assumption is often violated in real problems. How
to handle partial observability?
Partially Observable Markov Decision Process
(POMDP):
The agent cannot directly observe the state $s_t$; it only receives an observation $o_t$, with the observation-state relationship determined by the observation function:

$$O(o \mid s) = P(o_t = o \mid s_t = s)$$

Examples:
- Robot navigation: sensor readings (observation) ≠ true location (state)
- Medical diagnosis: symptoms (observation) ≠ disease (state)
- Financial trading: market data (observation) ≠ market state (state)
Solutions:
1. State Augmentation:
Concatenate recent observations into an extended state:

$$\tilde{s}_t = (o_t, o_{t-1}, \ldots, o_{t-k+1})$$

Frame stacking is commonly used in deep RL:

```python
# Atari games: concatenate the most recent 4 frames into one state
state = np.concatenate(
    [frame_t, frame_t_1, frame_t_2, frame_t_3], axis=0
)
```

Pros: simple, suitable for short-term memory. Cons: state-space
dimensionality grows with the stack size.
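A reusable frame stacker can be sketched with a `deque`. This is a hypothetical helper class, not a specific library's API:

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the most recent k observations and expose them as one array,
    a common way to restore approximate Markovness (as in Atari agents)."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At episode start, fill the buffer by repeating the first frame.
        for _ in range(self.k):
            self.frames.append(frame)
        return self.state()

    def step(self, frame):
        self.frames.append(frame)  # oldest frame drops out automatically
        return self.state()

    def state(self):
        return np.stack(list(self.frames), axis=0)  # shape (k, *frame.shape)

stack = FrameStack(k=4)
s = stack.reset(np.zeros((84, 84)))
s = stack.step(np.ones((84, 84)))
print(s.shape)  # (4, 84, 84)
```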
2. Recurrent Neural Networks (RNN):
Use an LSTM/GRU to process the observation sequence, with the hidden state $h_t$ serving as the state representation:

$$h_t = f(h_{t-1}, o_t)$$

Pros: can handle long-term dependencies. Cons: training is difficult (vanishing/exploding gradients).
3. Belief State:
Maintain a probability distribution over states (the belief):

$$b_t(s) = P(s_t = s \mid o_{1:t}, a_{1:t-1})$$

updated via Bayes' rule:

$$b_{t+1}(s') \propto O(o_{t+1} \mid s') \sum_s P(s' \mid s, a_t)\, b_t(s)$$

and solve the MDP in belief space.
Pros: theoretically complete. Cons: high computational complexity
(infeasible for continuous state spaces).
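The Bayesian belief update can be implemented directly for a discrete POMDP. The transition and observation matrices below describe a hypothetical two-state toy problem with a sensor that is right 80% of the time:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update for a discrete POMDP.
    b: current belief over states, T[a][s, s']: transition probabilities,
    O[a][s', o]: observation probabilities (all toy arrays, not a real API)."""
    predicted = b @ T[a]                  # sum_s P(s'|s,a) b(s)
    updated = predicted * O[a][:, o]      # weight by P(o|s',a)
    return updated / updated.sum()        # normalize to a distribution

# Toy 2-state problem: sticky transitions, noisy sensor (80% accurate).
T = {0: np.array([[0.9, 0.1], [0.1, 0.9]])}
O = {0: np.array([[0.8, 0.2], [0.2, 0.8]])}

b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=0, T=T, O=O)
print(b)  # observing o=0 shifts the belief toward state 0
```

Starting from a uniform belief, a single observation of `o=0` yields a posterior of 0.8 on state 0, exactly the sensor's accuracy, as Bayes' rule predicts for a symmetric prior.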
4. Transformer (Modern Approach):
Use the self-attention mechanism to process the observation sequence $(o_1, \ldots, o_t)$ directly.
Pros: captures long-distance dependencies, trains in parallel. Cons: high
computational cost.
Practical Recommendations:

| Scenario | Recommended Method | Example |
|---|---|---|
| Short-term partial observability | Frame stacking | Atari games (stack 4 frames) |
| Long-term memory needs | LSTM/GRU | Dialogue systems, strategy games |
| High uncertainty | Belief state | Robot localization (SLAM) |
| Large-scale sequences | Transformer | Decision Transformer |
Diagnostic Method:
If you suspect the Markov property is violated, check:
- Would the optimal decision differ under the same observation?
- Does the reward signal appear heavily delayed?
- Does the environment have hidden state (like opponent intent)?
If yes, consider state augmentation or an RNN.
🎓 Summary: Key
Points of Reinforcement Learning
MDP Framework:
The core of reinforcement learning is the Markov Decision
Process, in which the agent interacts with the environment through a policy $\pi(a \mid s)$, aiming to maximize the expected cumulative return

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}.$$
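The discounted return can be computed with a single backward pass over the rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... , accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```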
Bellman Equations (Memory Formula):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \right]$$

$$V^{*}(s) = \max_a \mathbb{E}\left[ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \right]$$
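As a numerical sanity check of the Bellman optimality backup, value iteration on a tiny hand-made MDP (all transition and reward numbers hypothetical) converges to a fixed point of $V(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V(s')]$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP:
# P[a][s, s'] are transition probabilities, R[a][s] are rewards.
P = {0: np.array([[1.0, 0.0], [0.3, 0.7]]),
     1: np.array([[0.2, 0.8], [0.0, 1.0]])}
R = {0: np.array([0.0, 1.0]), 1: np.array([1.0, 2.0])}
gamma = 0.9

V = np.zeros(2)
for _ in range(500):  # repeated Bellman optimality backups
    V = np.max([R[a] + gamma * P[a] @ V for a in (0, 1)], axis=0)
print(V)
```

State 1 keeps reward 2 forever under action 1, so its value solves $V = 2 + 0.9V$, i.e. $V = 20$, which the iteration recovers.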
Methodology Triangle:

```
              Dynamic Programming (Known Model)
                    /                  \
                   /                    \
   Monte Carlo                          Temporal Difference
   (Model-Free, High Variance)          (Model-Free, Low Variance)
```
Practical Checklist:
Core Intuition (Memory Mnemonic):
Value = Immediate Reward + Discounted Future
Value
Learning = Use Experience to Correct Estimates
(MC/TD)
Next Chapter Preview:
Chapter 1 established the theoretical foundation of reinforcement
learning, but all methods are limited to small-scale tabular problems
(finite states and actions). When state spaces are huge (like Atari game
pixel spaces) or continuous, tabular methods cannot scale.
The next chapter introduces how to use deep neural networks to
approximate Q-functions, which gave birth to the groundbreaking work of
deep reinforcement learning, DQN (Deep Q-Network), and explores it in
depth.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement
Learning: An Introduction (2nd ed.). MIT Press.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine
Learning, 8(3-4), 279-292.
Sutton, R. S. (1988). Learning to predict by the methods of temporal
differences. Machine Learning, 3(1), 9-44.
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and
semi-MDPs: A framework for temporal abstraction in reinforcement
learning. Artificial Intelligence, 112(1-2), 181-211.
Bellman, R. (1957). Dynamic Programming. Princeton
University Press.
Silver, D. (2015). UCL Course on Reinforcement Learning.
https://www.davidsilver.uk/teaching/
Levine, S. (2023). CS 285: Deep Reinforcement Learning. UC
Berkeley.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning,
8(3-4), 341-362.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards.
PhD Thesis, Cambridge University.
Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning
using connectionist systems. Technical Report, Cambridge
University.
This is the first article in the Reinforcement Learning series,
establishing the theoretical foundation for the entire series.
Subsequent chapters will delve into deep reinforcement learning, policy
optimization, exploration strategies, multi-agent learning, and other
frontier topics.
Post title: Reinforcement Learning (1): Fundamentals and Core Concepts
Post author: Chen Kai
Create time: 2024-08-02 09:30:00
Post link: https://www.chenk.top/reinforcement-learning-1-fundamentals-and-core-concepts/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA unless otherwise stated.