Reinforcement Learning (1): Fundamentals and Core Concepts
Chen Kai

How should an intelligent agent learn optimal behavior in an environment? When AlphaGo defeats world champions on the Go board, when robots learn to walk and grasp objects, when recommendation systems continuously optimize suggestions based on user feedback — all of these share a common mathematical framework: reinforcement learning.

Reinforcement learning differs from supervised learning's "train with labeled data" approach and unsupervised learning's "discover inherent structure" paradigm. It addresses a problem closer to the real world: learning optimal decision-making strategies for long-term rewards through trial-and-error and reward feedback, without explicit correct answers.

The Intuition: Learning to Ride a Bicycle

Learning to ride a bicycle is a classic reinforcement learning process. No one hands you a "Standard Cycling Manual" detailing exactly how to adjust the handlebars and body center of gravity at each moment (no supervised learning labels). You must experiment yourself: lean slightly left, the bike veers left; adjust your body right, it veers right. Falling is negative feedback; maintaining balance is positive feedback.

After hundreds of trials, your brain learns a policy: based on the current bike tilt angle, speed, and road conditions, you decide in real-time how to adjust your body and handlebars. This policy isn't acquired through memorization but progressively optimized through the cycle of interacting with the environment, receiving feedback, and adjusting behavior.

The mathematical framework of reinforcement learning precisely describes this process.

Markov Decision Process: The Mathematical Foundation

The core of reinforcement learning is the Markov Decision Process (MDP). An MDP is a five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

Basic Components

State Space $\mathcal{S}$:

All possible states of the environment. In cycling, the state might include bike tilt angle, speed, distance to obstacles, etc. States can be discrete (like board positions) or continuous (like robot joint angles).

Action Space $\mathcal{A}$:

All actions the agent can execute. Actions can be discrete (like moving up, down, left, right) or continuous (like applying a specific torque).

Transition Probability $P(s'|s, a)$:

The probability of transitioning to state $s'$ after executing action $a$ in state $s$. The transition probability completely defines the environment's dynamics, and it satisfies the normalization condition: $\sum_{s'} P(s'|s, a) = 1$ for every $(s, a)$.

Reward Function $R(s, a, s')$:

The immediate reward received after transitioning from state $s$ to $s'$ by executing action $a$. Rewards are the agent's only learning signal. Often simplified to $R(s, a)$ or $R(s)$ in many cases.

Discount Factor $\gamma \in [0, 1]$:

Measures the importance of future rewards. $\gamma = 0$ means caring only about immediate rewards, while $\gamma \to 1$ emphasizes long-term returns.

The Markov Property

The core assumption of MDPs is the Markov property: the future depends only on the current state, independent of history:

$$P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0) = P(S_{t+1} \mid S_t, A_t)$$

This assumption appears strong but is practically useful. If the current state contains all relevant information (such as concatenating recent frames as the state), the Markov property can be satisfied.

Policy: Mapping from States to Actions

A policy $\pi$ is the agent's behavioral guideline, defining how to select actions in each state:

Deterministic Policy: $a = \pi(s)$ directly provides the action.

Stochastic Policy: $\pi(a|s)$ provides a probability distribution over actions. Stochastic policies are more useful for exploration and for handling partially observable environments.

Return and Value Functions

The agent's goal is to maximize cumulative return. The return from time step $t$ is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The role of the discount factor:

  • Mathematically: Ensures bounded return in infinite-horizon tasks (when $|R_t| \le R_{\max}$, $|G_t| \le \frac{R_{\max}}{1-\gamma}$)
  • Intuitively: Distant rewards have higher uncertainty and should be discounted
  • Practically: Prevents agents from excessively delaying reward acquisition
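The discounted return is easy to compute directly, which makes the role of $\gamma$ concrete. A minimal sketch (the reward sequence here is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1}, accumulated backward as G <- r + gamma * G."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [1.0, 1.0, 1.0, 10.0]  # hypothetical reward sequence ending in a big payoff
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9*1 + 0.81*1 + 0.729*10 = 10.0
print(discounted_return(rewards, gamma=0.0))  # 1.0: only the immediate reward counts
```

With $\gamma = 0$ the distant payoff vanishes from the return entirely, which is exactly the "myopic" behavior described above.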

State Value Function:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

The expected cumulative return starting from state $s$ and following policy $\pi$: it measures "how good it is to be in state $s$" (when following policy $\pi$).

Action Value Function:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

The expected cumulative return from executing action $a$ in state $s$, then following policy $\pi$: it measures "how good it is to take action $a$ in state $s$".

Relationship between them:

$$V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s, a)$$

Bellman Equations: Recursive Structure of Value Functions

Value functions satisfy recursive relationships called Bellman equations. These are the cornerstone of reinforcement learning theory.

Bellman Expectation Equations (for any policy $\pi$):

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$$

Derivation Logic:

Current state value = expected immediate reward + expected discounted value of next state:

$$V^\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s\right]$$

Expanding the expectation over actions and next states yields the equation above. The analogous equation for the action value function:

$$Q^\pi(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a'|s')\, Q^\pi(s', a')\right]$$

Bellman Optimality Equations (for the optimal policy):

Define the optimal value functions: $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.

Optimal Bellman equation:

$$V^*(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^*(s')\right]$$

Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Numerical Example:

Consider a simple 2-state MDP:

  • States: $\{s_1, s_2\}$
  • Actions: $\{a_1, a_2\}$ (available in each state)
  • Transition probabilities and rewards:

$(s, a)$ | $s'$ | $P(s'|s, a)$ | $R(s, a, s')$
$(s_1, a_1)$ | $s_1$ | 0.5 | 5
$(s_1, a_1)$ | $s_2$ | 0.5 | 10
$(s_1, a_2)$ | $s_1$ | 0.9 | 0
$(s_1, a_2)$ | $s_2$ | 0.1 | 15
$(s_2, a_1)$ | $s_1$ | 0.7 | 2
$(s_2, a_1)$ | $s_2$ | 0.3 | 8
$(s_2, a_2)$ | $s_2$ | 1.0 | 3

  • Discount factor: $\gamma = 0.9$
  • Policy: $\pi(s_1) = \pi(s_2) = a_1$ (always choose $a_1$)

Computing $V^\pi$: the Bellman expectation equation gives a system of two linear equations in $V^\pi(s_1)$ and $V^\pi(s_2)$:

$$V^\pi(s_1) = 0.5\,[5 + 0.9\,V^\pi(s_1)] + 0.5\,[10 + 0.9\,V^\pi(s_2)]$$
$$V^\pi(s_2) = 0.7\,[2 + 0.9\,V^\pi(s_1)] + 0.3\,[8 + 0.9\,V^\pi(s_2)]$$

Rearranging:

$$0.55\,V^\pi(s_1) - 0.45\,V^\pi(s_2) = 7.5$$
$$-0.63\,V^\pi(s_1) + 0.73\,V^\pi(s_2) = 3.8$$

Solving yields $V^\pi(s_1) \approx 60.9$ and $V^\pi(s_2) \approx 57.8$. This demonstrates how value functions satisfy the recursive relationships defined by the Bellman equations.
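The same linear system can be checked numerically, since for a fixed policy the Bellman expectation equation is just $V = R_\pi + \gamma P_\pi V$ in matrix form. A sketch (transition numbers as read from the example above, with $\gamma = 0.9$):

```python
import numpy as np

# Under the fixed policy pi(s) = a1:
#   s1 -a1-> s1 (p=0.5, r=5),  s1 -a1-> s2 (p=0.5, r=10)
#   s2 -a1-> s1 (p=0.7, r=2),  s2 -a1-> s2 (p=0.3, r=8)
gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.7, 0.3]])            # P[s, s'] under a1
R = np.array([0.5 * 5 + 0.5 * 10,     # expected one-step reward in s1
              0.7 * 2 + 0.3 * 8])     # expected one-step reward in s2

# Bellman expectation equation in matrix form: V = R + gamma * P V
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V.round(1))  # [60.9 57.8]
```

`np.linalg.solve` inverts the system $(I - \gamma P_\pi)V = R_\pi$ exactly, which is feasible whenever the state space is small.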

Dynamic Programming: Optimal Control with Known Models

When the environment model (transition probability $P$ and reward function $R$) is known, we can use dynamic programming to solve for the optimal policy.

Policy Evaluation: Computing Value Function for a Given Policy

Given a policy $\pi$, policy evaluation computes $V^\pi$.

Iterative algorithm:

  1. Initialize $V_0(s) = 0$ (for all $s$)

  2. Repeat until convergence, for all $s$:

$$V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

This is the iterative form of the Bellman expectation equation. Convergence is guaranteed by the Banach fixed-point theorem (the Bellman operator is a contraction mapping).
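For a small MDP with a fixed policy, this iteration is a few lines of NumPy. A sketch (the transition matrix and expected rewards are illustrative):

```python
import numpy as np

# Iterative policy evaluation: V_{k+1} = R_pi + gamma * P_pi V_k
gamma, theta = 0.9, 1e-10
P_pi = np.array([[0.5, 0.5],
                 [0.7, 0.3]])   # P_pi[s, s']: transition matrix under the policy
R_pi = np.array([7.5, 3.8])     # expected one-step rewards under the policy

V = np.zeros(2)                          # step 1: V_0 = 0
while True:                              # step 2: repeat until convergence
    V_new = R_pi + gamma * P_pi @ V      # Bellman expectation backup
    if np.max(np.abs(V_new - V)) < theta:
        break
    V = V_new
print(V_new.round(1))  # matches the exact solution of V = R_pi + gamma * P_pi V
```

The loop is guaranteed to terminate because each backup shrinks the distance to the fixed point by at least a factor of $\gamma$.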

Policy Improvement: Improving Policy Based on Value Function

Given $V^\pi$, construct a greedy policy:

$$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right] = \arg\max_a Q^\pi(s, a)$$

Policy Improvement Theorem: The greedy policy $\pi'$ is at least as good as $\pi$: $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$.

Proof:

For any state $s$, by definition of the greedy choice:

$$Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s, a) \ge Q^\pi(s, \pi(s)) = V^\pi(s)$$

If we choose $\pi'(s)$ on the first step and then follow $\pi$, the return is at least as good as always following $\pi$. If we use the greedy policy at every step, the return only improves further, giving $V^{\pi'}(s) \ge V^\pi(s)$.

Policy Iteration Algorithm

Alternates between policy evaluation and policy improvement:

  1. Initialize a random policy $\pi_0$

  2. Policy Evaluation: Solve for $V^{\pi_k}$

  3. Policy Improvement: $\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^{\pi_k}(s')\right]$

  4. If $\pi_{k+1} = \pi_k$, stop; otherwise return to step 2

Policy iteration is guaranteed to converge to the optimal policy.
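The code example later in this article implements value iteration; for comparison, tabular policy iteration fits in a short function. A sketch under assumed array layouts (`P[s, a, s']` for transitions, `R[s, a]` for expected rewards; the example numbers are those of the two-state MDP above):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration with exact policy evaluation."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)            # 1. initial policy
    while True:
        # 2. Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(n_states), pi]         # (n_states, n_states)
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 3. Policy improvement: greedy one-step lookahead
        Q = R + gamma * P @ V                     # (n_states, n_actions)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):            # 4. policy stable -> done
            return V, pi
        pi = pi_new

# Two-state MDP from the text (numbers as assumed there)
P = np.array([[[0.5, 0.5], [0.9, 0.1]],
              [[0.7, 0.3], [0.0, 1.0]]])          # P[s, a, s']
R = np.array([[7.5, 1.5],
              [3.8, 3.0]])                        # expected R[s, a]
V, pi = policy_iteration(P, R, gamma=0.9)
print(pi)  # [0 0] -> a1 is optimal in both states
```

Exact evaluation via `np.linalg.solve` replaces the inner iterative sweep, which is practical when the state space is small.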

Value Iteration Algorithm

Directly iterates the optimal Bellman equation:

  1. Initialize $V_0(s) = 0$

  2. Repeat until convergence, for all $s$:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

Value iteration is essentially a simplified version of policy iteration: each iteration implicitly performs one sweep of policy evaluation and an immediate policy improvement.

Convergence:

Define the Bellman optimality operator:

$$(TV)(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

We can prove $T$ is a $\gamma$-contraction mapping:

$$\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$

Therefore, value iteration converges exponentially to the unique fixed point $V^*$.

Code Implementation: GridWorld Environment

Let's implement a simple grid world where an agent moves from start to goal, with walls blocking certain positions.

```python
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List

class GridWorld:
    """
    Simple Grid World Environment

    Parameters:
        grid_size: (height, width), grid dimensions
        start: (row, col), starting position
        goal: (row, col), goal position
        obstacles: List[(row, col)], obstacle positions
    """
    def __init__(self, grid_size=(5, 5), start=(0, 0), goal=(4, 4), obstacles=None):
        self.height, self.width = grid_size
        self.start = start
        self.goal = goal
        self.obstacles = set(obstacles or [])  # avoid a mutable default argument

        # Action space: 0=up, 1=down, 2=left, 3=right
        self.actions = [0, 1, 2, 3]
        self.action_names = ['↑', '↓', '←', '→']

        # Action-to-state-change mapping
        self.action_deltas = {
            0: (-1, 0),  # up
            1: (1, 0),   # down
            2: (0, -1),  # left
            3: (0, 1)    # right
        }

        self.reset()

    def reset(self) -> Tuple[int, int]:
        """Reset environment, return starting position"""
        self.state = self.start
        return self.state

    def step(self, action: int) -> Tuple[Tuple[int, int], float, bool]:
        """
        Execute action

        Returns:
            next_state: next state
            reward: immediate reward
            done: whether goal is reached
        """
        dr, dc = self.action_deltas[action]
        next_r = self.state[0] + dr
        next_c = self.state[1] + dc

        # Check boundaries and obstacles (invalid moves leave the state unchanged)
        if (0 <= next_r < self.height and 0 <= next_c < self.width and
                (next_r, next_c) not in self.obstacles):
            self.state = (next_r, next_c)

        # Rewards: +10 for goal, -1 otherwise
        if self.state == self.goal:
            return self.state, 10.0, True
        else:
            return self.state, -1.0, False

    def get_transition_prob(self, state, action):
        """
        Get transition probability (deterministic environment)

        Returns:
            List[(next_state, prob, reward)]
        """
        # Save current state
        old_state = self.state
        self.state = state

        # Execute action
        next_state, reward, done = self.step(action)

        # Restore state
        self.state = old_state

        return [(next_state, 1.0, reward)]


def value_iteration(env: GridWorld, gamma=0.9, theta=1e-6, max_iter=1000):
    """
    Value Iteration Algorithm

    Parameters:
        env: environment object
        gamma: discount factor
        theta: convergence threshold
        max_iter: maximum iterations

    Returns:
        V: state value function, shape=(height, width)
        policy: optimal policy, shape=(height, width)
    """
    height, width = env.height, env.width
    V = np.zeros((height, width))

    for iter_count in range(max_iter):
        delta = 0

        # Iterate over all states
        for r in range(height):
            for c in range(width):
                state = (r, c)

                # Skip goal and obstacles
                if state == env.goal or state in env.obstacles:
                    continue

                v = V[r, c]

                # Compute Q-values for all actions
                q_values = []
                for action in env.actions:
                    q = 0
                    transitions = env.get_transition_prob(state, action)
                    for next_state, prob, reward in transitions:
                        next_r, next_c = next_state
                        q += prob * (reward + gamma * V[next_r, next_c])
                    q_values.append(q)

                # Bellman optimality update
                V[r, c] = max(q_values)
                delta = max(delta, abs(v - V[r, c]))

        # Check convergence
        if delta < theta:
            print(f"Value iteration converged after {iter_count + 1} iterations")
            break

    # Extract optimal policy
    policy = np.zeros((height, width), dtype=int)
    for r in range(height):
        for c in range(width):
            state = (r, c)
            if state == env.goal or state in env.obstacles:
                continue

            q_values = []
            for action in env.actions:
                q = 0
                transitions = env.get_transition_prob(state, action)
                for next_state, prob, reward in transitions:
                    next_r, next_c = next_state
                    q += prob * (reward + gamma * V[next_r, next_c])
                q_values.append(q)

            policy[r, c] = np.argmax(q_values)

    return V, policy


def visualize_policy(env: GridWorld, V, policy):
    """Visualize value function and policy"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    # Value function heatmap
    im1 = ax1.imshow(V, cmap='coolwarm')
    ax1.set_title('State Value Function V(s)')
    plt.colorbar(im1, ax=ax1)

    # Annotate values
    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                ax1.text(c, r, 'G', ha='center', va='center', color='white', fontsize=16, fontweight='bold')
            elif (r, c) in env.obstacles:
                ax1.text(c, r, 'X', ha='center', va='center', color='black', fontsize=16)
            else:
                ax1.text(c, r, f'{V[r, c]:.1f}', ha='center', va='center')

    # Policy arrows
    ax2.imshow(np.zeros((env.height, env.width)), cmap='gray', alpha=0.3)
    ax2.set_title('Optimal Policy π*(s)')

    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                ax2.text(c, r, 'G', ha='center', va='center', color='green', fontsize=16, fontweight='bold')
            elif (r, c) in env.obstacles:
                ax2.text(c, r, 'X', ha='center', va='center', color='red', fontsize=16)
            else:
                action = policy[r, c]
                ax2.text(c, r, env.action_names[action], ha='center', va='center', fontsize=20)

    plt.tight_layout()
    plt.savefig('gridworld_value_iteration.png', dpi=150)
    plt.show()


# Run example
if __name__ == "__main__":
    # Create 5x5 grid world with obstacles
    env = GridWorld(
        grid_size=(5, 5),
        start=(0, 0),
        goal=(4, 4),
        obstacles=[(1, 1), (2, 2), (3, 1)]
    )

    # Run value iteration
    V, policy = value_iteration(env, gamma=0.9, theta=1e-6)

    # Visualize results
    visualize_policy(env, V, policy)

    print("\nOptimal Value Function:")
    print(V)
    print("\nOptimal Policy:")
    for r in range(env.height):
        for c in range(env.width):
            if (r, c) == env.goal:
                print('G', end=' ')
            elif (r, c) in env.obstacles:
                print('X', end=' ')
            else:
                print(env.action_names[policy[r, c]], end=' ')
        print()
```

Code Explanation:

  1. The `GridWorld` class implements the environment, providing state transitions and rewards
  2. `value_iteration` implements the value iteration algorithm, sweeping until the value function converges ($\Delta < \theta$)
  3. The optimal policy is extracted by greedily choosing $\arg\max_a Q(s, a)$ in each state
  4. The visualization displays the value function heatmap and the policy direction arrows

Monte Carlo Methods: Model-Free Learning

Dynamic programming requires a complete environment model, but in practice, models are often unknown or difficult to construct. Monte Carlo (MC) methods estimate value functions by sampling trajectories, without requiring environment models.

Core Idea

The MC method's basic approach: the value function is an expected return, which can be approximated by a sample average:

$$V^\pi(s) \approx \frac{1}{N} \sum_{i=1}^{N} G^{(i)}(s)$$

where $G^{(i)}(s)$ is the return observed in the $i$-th trajectory starting from state $s$.

Steps:

  1. The agent interacts with the environment following policy $\pi$, generating complete trajectories (episodes): $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T$
  2. For each state $s$ in the trajectory, compute the return $G_t$ starting from that state
  3. Estimate $V(s)$ as the average of the returns observed from $s$

First-Visit MC vs Every-Visit MC

First-Visit MC:

For each trajectory, record the return only on the first visit to state $s$.

Algorithm:

  1. Initialize $V(s) = 0$ and $\text{Returns}(s) = [\,]$ (empty list)
  2. For each trajectory:
    • Compute the return $G_t$ at each timestep
    • For each state $s$ (first occurrence only): append $G_t$ to $\text{Returns}(s)$, then set $V(s) = \text{mean}(\text{Returns}(s))$

Every-Visit MC:

For each trajectory, record returns on every visit to state $s$.

The difference lies in handling multiple visits to the same state within one trajectory. First-Visit MC is more commonly used, and its unbiasedness is easier to prove.

Convergence of MC Policy Evaluation

Theorem: The First-Visit MC estimate converges to the true value function: $V_N(s) \to V^\pi(s)$ as $N(s) \to \infty$, where $N(s)$ is the number of trajectories visiting state $s$.

Proof: The returns $G^{(i)}$ observed on first visits to $s$ are independent and identically distributed random variables with expectation $V^\pi(s)$. By the law of large numbers, the sample mean converges to the expectation.

MC Control: Policy Improvement

MC methods can also be used for control problems (finding optimal policies). The key is estimating $Q^\pi(s, a)$ rather than $V^\pi(s)$, because policy improvement requires knowing the value of each action.

MC Control Algorithm (based on exploring starts):

  1. Initialize policy $\pi$ and $Q(s, a)$
  2. Policy evaluation: For each $(s, a)$ pair in the trajectory, update $Q(s, a)$ as the average of observed returns
  3. Policy improvement: For each $s$, update $\pi(s) = \arg\max_a Q(s, a)$

Exploring Starts Assumption:

Every state-action pair has nonzero probability of being the starting point. This ensures all state-action pairs can be visited.

ε-Greedy Policy: Balancing Exploration and Exploitation

Exploring Starts is difficult to implement in practice. A more practical approach is an $\epsilon$-greedy policy: with probability $1 - \epsilon$, choose the current best action (exploit); with probability $\epsilon$, choose an action uniformly at random (explore).
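A minimal implementation of this selection rule (a sketch; the function name is illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

q = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy(q, epsilon=0.0))  # 1: epsilon=0 always picks the greedy action
```

The later Sarsa and Q-Learning code in this article writes this logic inline; factoring it out like this is purely a readability choice.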

On-Policy vs Off-Policy:

  • On-Policy: Evaluate and improve the currently executing policy (like $\epsilon$-greedy)
  • Off-Policy: Evaluate one policy (target policy) while executing another (behavior policy)

MC control is typically on-policy.

Incremental Update

Storing all returns and computing averages has high memory overhead. The incremental update formula:

$$V(s) \leftarrow V(s) + \alpha \left(G - V(s)\right)$$

where $\alpha$ is the learning rate. With $\alpha = 1/N$ this reproduces the sample mean exactly; with constant $\alpha$ it becomes a weighted average that gives higher weight to recent samples.
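The equivalence with the plain sample mean is easy to verify (the returns here are hypothetical):

```python
returns = [5.0, 7.0, 3.0, 9.0]   # hypothetical observed returns for one state

V, n = 0.0, 0
for G in returns:
    n += 1
    V += (1.0 / n) * (G - V)     # incremental update with alpha = 1/N
print(V)  # equals sum(returns) / len(returns) up to float rounding
```

Only the running estimate and a counter need to be stored, instead of the whole list of returns.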

Code Implementation: MC Policy Evaluation

```python
import numpy as np
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes=10000, gamma=0.9):
    """
    Monte Carlo Policy Evaluation (First-Visit)

    Parameters:
        env: environment object
        policy: policy function, policy(state) -> action
        num_episodes: number of trajectories to sample
        gamma: discount factor

    Returns:
        V: state value function dictionary
    """
    # Store return lists for each state
    returns = defaultdict(list)
    V = defaultdict(float)

    for episode in range(num_episodes):
        # Generate trajectory
        trajectory = []
        state = env.reset()
        done = False

        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Compute the return at each timestep (backward pass)
        G = 0.0
        returns_at_t = [0.0] * len(trajectory)
        for t in range(len(trajectory) - 1, -1, -1):
            G = trajectory[t][2] + gamma * G
            returns_at_t[t] = G

        # First-Visit: record each state's return at its first occurrence only
        visited_states = set()
        for t, (state, action, reward) in enumerate(trajectory):
            if state not in visited_states:
                visited_states.add(state)
                returns[state].append(returns_at_t[t])
                V[state] = np.mean(returns[state])

        if (episode + 1) % 1000 == 0:
            print(f"Completed {episode + 1} episodes")

    return V


def mc_control(env, num_episodes=10000, gamma=0.9, epsilon=0.1):
    """
    Monte Carlo Control (On-Policy, ε-greedy)

    Returns:
        Q: action-value function dictionary
        policy: learned policy
    """
    Q = defaultdict(lambda: np.zeros(len(env.actions)))
    returns = defaultdict(list)

    for episode in range(num_episodes):
        # Generate trajectory (using ε-greedy policy)
        trajectory = []
        state = env.reset()
        done = False

        while not done:
            # ε-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.choice(env.actions)
            else:
                action = np.argmax(Q[state])

            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state

        # Compute the return at each timestep (backward pass)
        G = 0.0
        returns_at_t = [0.0] * len(trajectory)
        for t in range(len(trajectory) - 1, -1, -1):
            G = trajectory[t][2] + gamma * G
            returns_at_t[t] = G

        # First-Visit: update each (state, action) pair at its first occurrence only
        visited_sa = set()
        for t, (state, action, reward) in enumerate(trajectory):
            if (state, action) not in visited_sa:
                visited_sa.add((state, action))
                returns[(state, action)].append(returns_at_t[t])
                Q[state][action] = np.mean(returns[(state, action)])

        if (episode + 1) % 1000 == 0:
            print(f"Completed {episode + 1} episodes")

    # Extract greedy policy
    policy = lambda s: np.argmax(Q[s])

    return Q, policy
```

Pros and Cons of MC Methods:

Pros:
  • Model-free
  • Can learn from actual experience or from simulation
  • Can focus on states of interest
  • Simple and intuitive

Cons:
  • Requires complete trajectories (must wait until the episode ends)
  • High variance (the return is a sum of many random rewards)
  • Inefficient in continuing tasks or long episodes

Temporal Difference Learning: Combining MC and DP

Temporal Difference (TD) learning is the core of reinforcement learning, combining MC's model-free nature with DP's bootstrapping.

TD(0): The Simplest TD Method

TD(0) update rule:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

TD Error:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

The TD error measures the discrepancy between the one-step observation ($R_{t+1} + \gamma V(S_{t+1})$) and the current estimate ($V(S_t)$).

Comparison with MC:

Method | Update Target | Characteristics
MC | $G_t$ (actual return) | Unbiased, high variance, requires complete trajectory
TD(0) | $R_{t+1} + \gamma V(S_{t+1})$ (bootstrapped estimate) | Initially biased, low variance, single-step update

TD methods can update after every step without waiting for episode completion. This makes TD methods applicable to continuing tasks (tasks without explicit termination states).

Convergence of TD(0)

Theorem (Sutton, 1988): For any policy $\pi$, if the following conditions are met:

  1. The learning rate satisfies the Robbins-Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$
  2. All states are visited infinitely often

then TD(0) converges almost surely to $V^\pi$.

Common learning rates: $\alpha_t = 1/t$, or a fixed small value (like 0.1).
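The code section later in this article covers Sarsa and Q-Learning; tabular TD(0) prediction itself is only a few lines. A sketch, assuming an environment with the `reset()`/`step()` interface used by the GridWorld code above:

```python
from collections import defaultdict

def td0_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped one-step target; a terminal successor contributes 0
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

Note that the update happens inside the step loop: no complete trajectory is needed, which is exactly what makes TD applicable to continuing tasks.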

Sarsa: On-Policy TD Control

Sarsa (State-Action-Reward-State-Action) is the TD version for Q-functions:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

The algorithm's name comes from the quintuple needed for the update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

Sarsa Algorithm:

  1. Initialize $Q(s, a)$ arbitrarily
  2. For each episode: initialize $S$; choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
  3. Repeat until $S$ is terminal:
    • Execute $A$, observe $R$ and $S'$
    • Choose $A'$ from $S'$ using the policy derived from $Q$
    • Update: $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma Q(S', A') - Q(S, A)]$
    • $S \leftarrow S'$, $A \leftarrow A'$

Sarsa is an on-policy method: it evaluates and improves the currently executing policy (including its $\epsilon$-greedy exploration).

Q-Learning: Off-Policy TD Control

Q-Learning is one of the most important breakthroughs in TD methods (Watkins, 1989).

Q-Learning Update Rule:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$

The difference from Sarsa: Sarsa uses the actually executed next action $A_{t+1}$, while Q-Learning uses the optimal action $\arg\max_a Q(S_{t+1}, a)$.

Q-Learning Algorithm:

  1. Initialize $Q(s, a)$ arbitrarily
  2. For each episode: initialize $S$
  3. Repeat until $S$ is terminal:
    • Choose action $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
    • Execute $A$, observe $R$ and $S'$
    • Update: $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
    • $S \leftarrow S'$

Q-Learning is an off-policy method: the behavior policy (like $\epsilon$-greedy) is used for exploration, while the target policy (the greedy policy) is used for evaluation.

Convergence of Q-Learning:

Theorem (Watkins & Dayan, 1992): If:

  1. All state-action pairs are visited infinitely
  2. Learning rate satisfies Robbins-Monro conditions

Then Q-Learning converges to the optimal Q-function $Q^*$, independent of the behavior policy.

This is Q-Learning's huge advantage: it can learn the optimal policy from data generated by any policy.

Sarsa vs Q-Learning: Intuitive Comparison

Consider a "cliff walking" environment:

S: start, G: goal, C: cliff (reward = -100)

```
S . . . . . . G
C C C C C C C C
```

The agent moves from start to goal; falling off the cliff returns it to start with large negative reward.

Sarsa (On-Policy):

Since it uses an $\epsilon$-greedy policy, Sarsa considers the possibility of falling off the cliff during exploration. It learns a policy that stays far from the cliff edge, taking a safer but longer path.

Q-Learning (Off-Policy):

Q-Learning evaluates the greedy policy, which assumes no random exploration. It learns a policy that walks along the cliff edge for the shortest path, but during execution (due to $\epsilon$-greedy exploration), it might fall off the cliff.

This illustrates:
  • Sarsa is more conservative (it considers exploration risk)
  • Q-Learning is more aggressive (it evaluates the optimal policy)

TD(λ) and Eligibility Traces

TD(0) only uses the one-step reward for bootstrapping. TD($\lambda$) is a more general framework that combines multi-step returns.

$n$-step return:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

TD($\lambda$) update:

Uses a weighted average of all $n$-step returns:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $\lambda \in [0, 1]$ controls the weight allocation.

Eligibility Trace:

The eligibility trace is an efficient implementation that maintains a trace $e_t(s)$ for each state:

$$e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbb{1}[S_t = s]$$

Update rule (for all $s$):

$$V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s)$$

The eligibility trace records a state's "credit": recently visited states and frequently visited states have higher traces, receiving more credit assignment.

Special cases:

  • $\lambda = 0$: TD(0)
  • $\lambda = 1$: Equivalent to MC (uses the complete return)
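The weights $(1-\lambda)\lambda^{n-1}$ placed on the $n$-step returns form a geometric series summing to 1, so $G_t^\lambda$ is a proper weighted average. A two-line check:

```python
lam = 0.9
# Weight on the n-step return G_t^(n) is (1 - lambda) * lambda^(n - 1)
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 200)]
print(sum(weights))  # 1 - lam**199, i.e. 1 up to roughly 1e-9
```

Larger $\lambda$ shifts weight toward longer-horizon returns, which is why $\lambda = 1$ recovers the full Monte Carlo return.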

Code Implementation: Sarsa and Q-Learning

```python
import numpy as np

def sarsa(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    Sarsa Algorithm (On-Policy TD Control)

    Parameters:
        env: environment object
        num_episodes: number of training episodes
        alpha: learning rate
        gamma: discount factor
        epsilon: exploration rate

    Returns:
        Q: learned Q-function
    """
    # Initialize Q-table
    Q = {}
    for r in range(env.height):
        for c in range(env.width):
            Q[(r, c)] = np.zeros(len(env.actions))

    for episode in range(num_episodes):
        state = env.reset()

        # Choose initial action (ε-greedy)
        if np.random.random() < epsilon:
            action = np.random.choice(env.actions)
        else:
            action = np.argmax(Q[state])

        done = False

        while not done:
            # Execute action
            next_state, reward, done = env.step(action)

            # Choose next action (ε-greedy)
            if np.random.random() < epsilon:
                next_action = np.random.choice(env.actions)
            else:
                next_action = np.argmax(Q[next_state])

            # Sarsa update: use the actually executed next_action
            td_target = reward + gamma * Q[next_state][next_action]
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error

            # Transition
            state = next_state
            action = next_action

        if (episode + 1) % 100 == 0:
            print(f"Sarsa completed episode {episode + 1}")

    return Q


def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    Q-Learning Algorithm (Off-Policy TD Control)

    Returns:
        Q: learned Q-function
    """
    # Initialize Q-table
    Q = {}
    for r in range(env.height):
        for c in range(env.width):
            Q[(r, c)] = np.zeros(len(env.actions))

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            # Choose action (ε-greedy)
            if np.random.random() < epsilon:
                action = np.random.choice(env.actions)
            else:
                action = np.argmax(Q[state])

            # Execute action
            next_state, reward, done = env.step(action)

            # Q-Learning update: use max_a Q(s', a)
            td_target = reward + gamma * np.max(Q[next_state])
            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error

            # Transition
            state = next_state

        if (episode + 1) % 100 == 0:
            print(f"Q-Learning completed episode {episode + 1}")

    return Q


def sarsa_lambda(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1, lambda_=0.9):
    """
    Sarsa(λ) Algorithm (with eligibility traces)

    Parameters:
        lambda_: eligibility trace decay factor
    """
    Q = {}
    for r in range(env.height):
        for c in range(env.width):
            Q[(r, c)] = np.zeros(len(env.actions))

    for episode in range(num_episodes):
        # Initialize eligibility traces
        E = {}
        for r in range(env.height):
            for c in range(env.width):
                E[(r, c)] = np.zeros(len(env.actions))

        state = env.reset()

        # Choose initial action
        if np.random.random() < epsilon:
            action = np.random.choice(env.actions)
        else:
            action = np.argmax(Q[state])

        done = False

        while not done:
            # Execute action
            next_state, reward, done = env.step(action)

            # Choose next action
            if np.random.random() < epsilon:
                next_action = np.random.choice(env.actions)
            else:
                next_action = np.argmax(Q[next_state])

            # Compute TD error
            td_error = reward + gamma * Q[next_state][next_action] - Q[state][action]

            # Update eligibility trace (current state-action pair)
            E[state][action] += 1

            # Update all state-action pairs, then decay their traces
            for s in E:
                for a in range(len(env.actions)):
                    Q[s][a] += alpha * td_error * E[s][a]
                    E[s][a] *= gamma * lambda_

            # Transition
            state = next_state
            action = next_action

        if (episode + 1) % 100 == 0:
            print(f"Sarsa(λ) completed episode {episode + 1}")

    return Q


# Comparison experiment
if __name__ == "__main__":
    # Assumes the GridWorld class above is saved as gridworld.py
    from gridworld import GridWorld

    env = GridWorld(grid_size=(5, 5), start=(0, 0), goal=(4, 4), obstacles=[(2, 2)])

    print("Training Sarsa...")
    Q_sarsa = sarsa(env, num_episodes=1000)

    print("\nTraining Q-Learning...")
    Q_qlearn = q_learning(env, num_episodes=1000)

    print("\nTraining Sarsa(λ)...")
    Q_sarsa_lambda = sarsa_lambda(env, num_episodes=1000, lambda_=0.9)

    # Extract policies and compare
    policy_sarsa = {s: np.argmax(Q_sarsa[s]) for s in Q_sarsa}
    policy_qlearn = {s: np.argmax(Q_qlearn[s]) for s in Q_qlearn}

    print("\nPolicy comparison complete")
```

Key Code Points:

  1. Sarsa updates using $Q(S', A')$, where $A'$ is the action actually executed next
  2. Q-Learning updates using $\max_a Q(S', a)$, taking the maximum over actions in the next state
  3. Sarsa($\lambda$) maintains eligibility traces $E$, recording the "credit" of each state-action pair
  4. All three methods use $\epsilon$-greedy exploration

❓ Q&A: Common Questions on RL Fundamentals

Q1: Why do we need the discount factor? Can't we directly maximize undiscounted return?

Mathematical Necessity:

In infinite-horizon tasks, the undiscounted return $\sum_{t=0}^{\infty} r_t$ may diverge to infinity. The discount factor ensures a bounded return:

$$|G_t| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1 - \gamma}$$

Economic Interpretation:

The discount factor reflects "time value": a reward today is worth more than the same reward tomorrow. In reinforcement learning, this manifests as:

  • Rewards obtained earlier have more value (reduced uncertainty)
  • Prevents agents from indefinitely delaying reward acquisition

Practical Impact:

$\gamma$ Value | Behavioral Characteristics | Suitable Scenarios
0.0 – 0.5 | Myopic, focuses on near-term rewards | Financial trading, quick decisions
0.9 – 0.95 | Balanced, commonly used default | Most tasks
0.99 – 0.999 | Farsighted, values long-term gains | Go, strategic planning
1.0 | No discounting (finite-step tasks only) | Games with clear termination

Selection Guidelines:

  • If the task has a clear termination state (episodic), $\gamma$ can be close to 1
  • If the task is continuing, $\gamma$ should be < 1
  • Tune $\gamma$ via cross-validation

Q2: Why is the Bellman equation a "contraction mapping"? What does this mean for convergence?

Contraction Mapping Definition:

An operator $T$ is a contraction mapping if there exists $\gamma \in [0, 1)$ such that for any functions $V_1, V_2$:

$$\|TV_1 - TV_2\| \le \gamma \|V_1 - V_2\|$$

where $\|V\| = \max_s |V(s)|$ is the sup norm.

Proof of Bellman Operator Contraction:

Define the Bellman optimality operator:

$$(TV)(s) = \max_a \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

Proof: For any state $s$, using $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$:

$$|(TV_1)(s) - (TV_2)(s)| \le \max_a \sum_{s'} P(s'|s, a)\, \gamma\, |V_1(s') - V_2(s')| \le \gamma \|V_1 - V_2\|_\infty$$

Banach Fixed-Point Theorem:

A contraction mapping has a unique fixed point, and iteration from any initial value converges to that fixed point. Convergence rate:

$$\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$$

Exponential convergence! This means value iteration and policy evaluation converge quickly.
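The $\gamma^k$ rate can be observed numerically. A sketch using a two-state policy-evaluation operator (the linear case $TV = R + \gamma P V$; the numbers are illustrative):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.7, 0.3]])            # row-stochastic transition matrix
R = np.array([7.5, 3.8])

# Exact fixed point of T V = R + gamma * P V
V_star = np.linalg.solve(np.eye(2) - gamma * P, R)

V = np.zeros(2)
errors = []
for _ in range(30):
    V = R + gamma * P @ V             # one application of the operator
    errors.append(np.max(np.abs(V - V_star)))

# Each application shrinks the sup-norm error by at least a factor of gamma
ratios = [errors[k + 1] / errors[k] for k in range(10)]
print(max(ratios))  # never exceeds gamma = 0.9
```

The error curve is a straight line on a log scale with slope $\log \gamma$, which is the practical meaning of "exponential convergence".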

Practical Significance:

  • A smaller $\gamma$ converges faster (but policy quality may decrease)
  • Can use $\|V_{k+1} - V_k\|_\infty < \theta$ as a stopping criterion
  • Even with poor initial values, convergence to optimal value is guaranteed

Q3: Dynamic programming requires complete models, but models are often unknown in practice. Does this mean DP has no practical value?

Cases Where Models Are Available:

  1. Known game rules: Board games (Go, Chess), video games
  2. Simulation environments: Robot simulators (MuJoCo, Gazebo), autonomous driving simulation
  3. Learned models: Using supervised learning to learn the transition function $\hat{P}(s'|s, a)$ and reward $\hat{R}(s, a)$ from data

Model-Based RL:

Combines model learning with planning:

  1. Collect data $\mathcal{D} = \{(s, a, r, s')\}$ from real interactions
  2. Learn a model $\hat{P}(s'|s, a)$ and $\hat{R}(s, a)$ from $\mathcal{D}$
  3. Use DP to plan on the learned model

This is called the Dyna architecture (Sutton, 1990).

DP's Theoretical Value:

Even if DP algorithms aren't directly used, the theoretical framework (Bellman equations, policy improvement theorem) is foundational for all RL methods:

  • TD methods are essentially sampled versions of DP
  • Policy gradient methods' convergence relies on the policy improvement theorem
  • Target networks in deep RL originate from DP ideas

Practical Recommendations:

  • If model is known or easily modeled, prioritize DP (most efficient)
  • If model is unknown but can be sampled, use MC or TD
  • If samples are expensive, consider model-based methods

Q4: Which is better, Monte Carlo methods or TD methods?

Both have advantages and disadvantages depending on task characteristics:

Monte Carlo Methods:

✅ Advantages:
  • Unbiased estimates (the sample mean of returns converges to the true value)
  • Don't rely on the Markov property
  • Simple and intuitive

❌ Disadvantages:
  • High variance (the return is a sum of a long sequence of random rewards)
  • Require complete trajectories (can't be used for continuing tasks)
  • Slow learning (especially for long trajectories)

TD Methods:

✅ Advantages:
  • Low variance (depends only on one-step randomness)
  • Single-step updates (applicable to continuing tasks)
  • Fast learning (fully utilizes bootstrapping)

❌ Disadvantages:
  • Biased (initially based on inaccurate estimates)
  • Depend on the Markov property
  • Sensitive to initial values

Rules of Thumb:

Task Characteristics | Recommended Method | Reason
Short episodes | MC | Variance is not large; unbiased estimation matters more
Long/continuing tasks | TD | Must use single-step updates
High randomness | TD | MC variance too large
Function approximation | TD | MC's high variance amplifies approximation error
Offline data | MC | Fixed trajectories, can't bootstrap

Combined Use:

TD($\lambda$) interpolates between TD(0) and MC through the parameter $\lambda \in [0, 1]$:

  • $\lambda = 0$: TD(0)
  • $\lambda \to 1$: Close to MC
  • Intermediate $\lambda$: Balances bias and variance

In practice, $\lambda \approx 0.9$ typically works best.


Q5: Q-Learning guarantees convergence to optimal policy, so why do we need Sarsa?

Q-Learning indeed has stronger theoretical guarantees (it converges to the optimal action-value function $q_*$), but Sarsa is more practical in certain scenarios:

Safety Considerations:

Sarsa is an on-policy method that evaluates the actually executed policy (including exploration). In safety-sensitive tasks (like robot control, medical decisions), Sarsa learns more conservative policies.

Example (cliff walking):

Start → → → → → → → Goal
Cliff Cliff Cliff Cliff Cliff Cliff Cliff Cliff

  • Q-Learning: Learns the shortest path along the cliff edge (it evaluates the greedy policy), but frequently falls off the cliff during execution (due to ε-greedy exploration)
  • Sarsa: Learns a safe path farther from the cliff (it accounts for exploration risk)
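The only difference between the two algorithms is the bootstrap target, which this sketch isolates (the helper names and the dict-based Q-table are illustrative assumptions):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def q_learning_target(Q, r, s2, actions, gamma=0.99):
    # Off-policy: bootstrap from the BEST next action (ignores exploration)
    return r + gamma * max(Q.get((s2, a), 0.0) for a in actions)

def sarsa_target(Q, r, s2, a2, gamma=0.99):
    # On-policy: bootstrap from the action ACTUALLY taken next (includes exploration)
    return r + gamma * Q.get((s2, a2), 0.0)
```

When the exploratory next action is bad (e.g. stepping toward the cliff), the Sarsa target reflects that risk, while the Q-Learning target does not; that is exactly why Sarsa learns the safer path.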

Impact of Exploration Strategy:

Q-Learning's convergence requires every state-action pair to be visited infinitely often, but doesn't require a specific exploration strategy. Sarsa's performance depends directly on the quality of the exploration strategy.

Convergence Speed:

Theoretically Q-Learning converges faster (it directly targets the optimal policy), but in practice:

  • If exploration strategy differs greatly from optimal policy, Q-Learning's sample efficiency may be lower
  • Sarsa converges more stably in some tasks

Practical Recommendations:

  • Try Q-Learning first: More general, stronger theory
  • Switch to Sarsa if:
    • Training is unstable
    • Task has safety requirements
    • Exploration noise is large

Q6: Why do eligibility traces accelerate credit assignment?

The credit assignment problem: In long trajectories, how do early actions receive "credit" for distant rewards?

TD(0)'s Limitation:

TD(0) only propagates value information one step back per update:

$$V(s_t) \leftarrow V(s_t) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

If a reward is obtained at time $T$, only $V(s_{T-1})$ is directly updated; earlier states need multiple episodes to benefit indirectly.

Eligibility Trace Mechanism:

The eligibility trace $e_t(s)$ records each state's "responsibility" for the current reward:

$$e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbb{1}(s_t = s)$$

Update rule (applied to every state at every step):

$$V(s) \leftarrow V(s) + \alpha\,\delta_t\, e_t(s), \quad \text{where } \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

Intuition:

  • Frequency heuristic: Frequently visited states have higher traces (cumulative effect)
  • Recency heuristic: Recently visited states have higher traces (decay)
  • Global update: Each TD error $\delta_t$ updates all states at once (weighted by their eligibility traces)

Example:

Suppose the trajectory is $s_1 \to s_2 \to s_3 \to s_4$, with reward $r = 1$ obtained at $s_4$ (all other rewards zero).

TD(0):

  • Only $V(s_3)$ is directly updated
  • $V(s_1), V(s_2)$ are affected only after multiple episodes

Sarsa($\lambda$) (assume $\lambda = 0.9$, $\gamma = 1$):

At $s_4$ the TD error is $\delta = 1$ (assuming all initial values are $0$). All states are updated simultaneously, weighted by their eligibility traces:

$$V(s_3) \mathrel{+}= \alpha\delta \cdot 1, \quad V(s_2) \mathrel{+}= \alpha\delta \cdot 0.9, \quad V(s_1) \mathrel{+}= \alpha\delta \cdot 0.81$$

The reward signal propagates through the entire trajectory in a single step!
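The mechanism can be sketched for the state-value case (the action-value Sarsa($\lambda$) version is analogous). The `episode` format of `(state, reward, next_state)` transitions and the function name are assumptions for illustration:

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Online TD(lambda) with accumulating eligibility traces.

    episode: list of (state, reward, next_state) transitions;
    a terminal transition has next_state = None.
    """
    e = {}                                        # eligibility trace per state
    for s, r, s2 in episode:
        v_next = V.get(s2, 0.0) if s2 is not None else 0.0
        delta = r + gamma * v_next - V.get(s, 0.0)    # one-step TD error
        e[s] = e.get(s, 0.0) + 1.0                    # bump trace of current state
        for x in e:                                   # update ALL traced states
            V[x] = V.get(x, 0.0) + alpha * delta * e[x]
            e[x] *= gamma * lam                       # decay all traces
    return V
```

After a single episode ending in a reward, every visited state has received credit, decayed geometrically with its distance from the reward.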

Practical Effect:

  • In long-trajectory tasks, Sarsa($\lambda$) is several to dozens of times faster than Sarsa(0)
  • $\lambda \approx 0.9$ typically works best
  • Cost: Need to maintain eligibility trace for each state (memory overhead)

Q7: How to choose appropriate exploration strategy? What alternatives exist to ε-greedy?

Problems with ε-greedy:

  • Exploration is uniformly random (doesn't exploit prior knowledge)
  • With a fixed $\epsilon$, late-stage training still wastes steps on ineffective exploration
  • All non-greedy actions are explored with equal probability (even obviously bad ones)

Improved Approaches:

1. Decaying ε-greedy:

Reduce the exploration rate over time: $\epsilon_t = \max(\epsilon_{\min},\ \epsilon_0 \cdot d^{\,t})$ for some decay factor $d < 1$. Common settings: $\epsilon_0 = 1.0$, $\epsilon_{\min} = 0.01$.

2. Boltzmann Exploration (Softmax):

Choose actions according to relative Q-value magnitudes:

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a)/\tau\right)}{\sum_{a'} \exp\left(Q(s, a')/\tau\right)}$$

Temperature parameter $\tau$:

  • $\tau \to 0$: Close to greedy
  • $\tau \to \infty$: Close to uniform random

Advantage: Actions with higher Q-values have higher selection probability.

3. UCB (Upper Confidence Bound):

Uncertainty-based exploration:

$$a = \arg\max_a \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]$$

where $N(s, a)$ is the number of times action $a$ has been selected in state $s$, and $N(s) = \sum_a N(s, a)$.

Intuition: Explore uncertain actions (low visit count).
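Both alternatives fit in a few lines. A minimal sketch under assumed names (`softmax_action`, `ucb_action`, dict-based `Q` and visit counts `N`); the max-subtraction in the softmax is a standard numerical-stability trick:

```python
import math
import random

def softmax_action(Q, s, actions, tau=1.0):
    """Boltzmann exploration: sample actions with probability ∝ exp(Q/tau)."""
    qmax = max(Q.get((s, a), 0.0) for a in actions)   # subtract max for stability
    prefs = [math.exp((Q.get((s, a), 0.0) - qmax) / tau) for a in actions]
    return random.choices(actions, weights=prefs)[0]

def ucb_action(Q, s, actions, N, c=1.0):
    """UCB: prefer actions with high value OR high uncertainty (low count)."""
    total = sum(N.get((s, a), 0) for a in actions)
    def score(a):
        n = N.get((s, a), 0)
        if n == 0:
            return float('inf')                       # try unvisited actions first
        return Q.get((s, a), 0.0) + c * math.sqrt(math.log(total) / n)
    return max(actions, key=score)
```

Note how UCB is deterministic given the counts, while Boltzmann exploration remains stochastic but biased toward high-value actions.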

4. Noisy Networks (NoisyNet):

Add learnable noise to network parameters (detailed in Chapter 6).

Practical Recommendations:

| Scenario | Recommended Method | Reason |
| --- | --- | --- |
| Simple tasks | ε-greedy + decay | Simple and effective |
| Continuous action space | Parameter noise (e.g. OU noise) | ε-greedy not applicable |
| Deep RL | NoisyNet | Exploration couples with policy parameters |
| Expensive samples | UCB | Optimal exploration efficiency |
| Multi-armed bandits | Thompson sampling | Bayesian optimal |

Q8: How much does TD methods' "bias" impact practice?

TD methods' bias comes from bootstrapping: using the current estimate $V(s_{t+1})$ in place of the true value $v_\pi(s_{t+1})$.

Bias Analysis:

Assume the true value is $v_\pi(s_{t+1})$ and the current estimate is $V(s_{t+1})$, with TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. The bias of the TD target is

$$\gamma \left[ V(s_{t+1}) - v_\pi(s_{t+1}) \right]$$

Initially this bias may be large, but as $V$ approaches $v_\pi$, the bias vanishes.

Variance Comparison:

MC variance (for a $T$-step trajectory with per-reward variance $\sigma^2$): the return sums $T$ random rewards, so $\mathrm{Var} \approx T\sigma^2$. TD variance: only one random reward enters the target, so $\mathrm{Var} \approx \sigma^2$. TD variance is roughly $T$ times lower than MC!

Practical Impact:

In most tasks, low variance is more important than unbiasedness:

  • Deep RL: Function approximators (neural networks) amplify variance
  • Long episodes: MC variance grows linearly with episode length
  • Sample efficiency: TD learns more from same number of samples

Empirical Conclusion:

  • TD methods typically converge faster than MC (especially TD($\lambda$))
  • In high-variance tasks, TD's advantage is more pronounced
  • MC still has value in short episodes and offline data scenarios

Q9: How to set learning rate? What are the pros and cons of fixed vs adaptive learning rates?

The learning rate $\alpha$ controls the step size of each update and is crucial for both convergence and stability.

Theoretical Requirements (Robbins-Monro Conditions):

Sufficient conditions (Robbins-Monro) for guaranteed convergence:

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$

Examples satisfying the conditions:

  • $\alpha_t = 1/t$: Classic decay
  • $\alpha_t = 1/N(s)$: Based on visit count

Fixed Learning Rate ($\alpha_t = \alpha$):

Pros: - Simple - Can adapt to non-stationary environments (recent samples get higher weight)

Cons: - Doesn't satisfy the Robbins-Monro conditions (no theoretical guarantee of convergence to the optimum) - Will eventually oscillate around the optimal solution

Common values are on the order of $0.01$–$0.1$, determined by cross-validation.
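The two Robbins-Monro conditions can be checked numerically for the classic $1/t$ schedule; this small sketch (function names are illustrative) contrasts it with a fixed rate:

```python
def alpha_classic(t):
    """1/t decay: satisfies both Robbins-Monro conditions."""
    return 1.0 / t

def alpha_fixed(t, alpha=0.1):
    """Fixed rate: the sum diverges, but so does the sum of squares -> no guarantee."""
    return alpha

# Partial sums for the 1/t schedule over the first 100,000 steps:
s1 = sum(alpha_classic(t) for t in range(1, 100_001))       # grows without bound (~ln T)
s2 = sum(alpha_classic(t) ** 2 for t in range(1, 100_001))  # converges (toward pi^2/6)
```

The first series keeps growing (so the algorithm can move arbitrarily far from a bad initialization), while the second stays bounded (so the accumulated noise stays finite).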

Adaptive Learning Rates:

1. Visit-Count-Based: $\alpha_t(s) = \frac{1}{N(s)}$, where $N(s)$ is the visit count for state $s$.

Pros: - Satisfies Robbins-Monro conditions - Frequently visited states automatically have lower learning rate

Cons: - Slow learning in late stage - Requires maintaining visit counts

2. AdaGrad Style:

Borrows from deep-learning adaptive optimizers: $\alpha_t = \frac{\eta}{\sqrt{\sum_{k \le t} g_k^2} + \epsilon}$, where $g_k$ is the gradient (or TD error) at step $k$.

3. Adam Style:

Uses first and second moment estimates of the gradient: $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, with step size proportional to $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.

Practical Recommendations:

| Scenario | Recommended Method | Parameters |
| --- | --- | --- |
| Simple tabular RL | Fixed learning rate | — |
| Deep RL | Adam optimizer | — |
| Online learning | Fixed learning rate | — |
| Offline learning | Decaying learning rate | Initial 0.1, decay to 0.001 |
| Non-stationary env | Fixed learning rate | Maintains adaptability |

Tuning Tips:

  • Start with larger learning rate (like 0.1), observe training curve
  • If unstable/divergent, reduce learning rate
  • If converging too slowly, increase learning rate
  • Use learning rate schedule (like linear decay, exponential decay)

Q10: The Markov property assumption is often violated in real problems. How to handle partial observability?

Partially Observable Markov Decision Process (POMDP):

The agent cannot directly observe the state $s_t$; it only receives an observation $o_t$, with the observation-state relationship determined by the observation function $O(o \mid s)$. Examples:

  • Robot navigation: sensor readings (observation) ≠ true location (state)
  • Medical diagnosis: symptoms (observation) ≠ disease (state)
  • Financial trading: market data (observation) ≠ market state (state)

Solutions:

1. State Augmentation:

Concatenate historical observations into an extended state: $\tilde{s}_t = (o_t, o_{t-1}, \ldots, o_{t-k+1})$. Frame stacking is commonly used in deep RL:

```python
# Atari games: stack the most recent 4 frames
state = np.concatenate([frame_t, frame_t_1, frame_t_2, frame_t_3], axis=0)
```

Pros: Simple, suitable for short-term memory Cons: State space dimensionality explosion
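A small helper class makes the idea concrete. This is a minimal sketch (the class name `FrameStacker` is an assumption; in practice environment wrappers such as Gym's `FrameStack` do this for you):

```python
from collections import deque

import numpy as np

class FrameStacker:
    """Keep the k most recent observations and expose them as one stacked state."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)        # old frames fall off automatically

    def reset(self, frame):
        # At episode start, fill the buffer by repeating the first frame
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames, axis=0)

    def step(self, frame):
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)
```

The stacked array (shape `(k, H, W)` for grayscale frames) is what gets fed to the Q-network, letting it infer velocities and other short-term dynamics.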

2. Recurrent Neural Networks (RNN):

Use an LSTM/GRU to process the observation sequence; the hidden state $h_t = f(h_{t-1}, o_t)$ serves as the state representation. Pros: Can handle long-term dependencies Cons: Training is difficult (vanishing/exploding gradients)

3. Belief State:

Maintain a probability distribution over states (the belief): $b_t(s) = P(s_t = s \mid o_{1:t}, a_{1:t-1})$. Use a Bayesian update:

$$b_{t+1}(s') \propto O(o_{t+1} \mid s') \sum_s P(s' \mid s, a_t)\, b_t(s)$$

Then solve the MDP in belief space.

Pros: Theoretically complete Cons: High computational complexity (infeasible for continuous state spaces)
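For a discrete POMDP the Bayesian update is a couple of matrix operations. A minimal sketch under assumed conventions (`T[s, s']` is the transition matrix for the chosen action, `O[s', o]` the observation matrix, `belief_update` an illustrative name):

```python
import numpy as np

def belief_update(b, T, O, obs):
    """One Bayesian belief update for a discrete POMDP.

    b: belief over states, shape (S,); T: transition matrix T[s, s'] for the
    action taken; O: observation matrix O[s', o]; obs: index of the observation.
    """
    predicted = T.T @ b                # predict: sum_s T[s, s'] * b[s]
    unnorm = O[:, obs] * predicted     # correct: weight by observation likelihood
    return unnorm / unnorm.sum()       # normalize back to a distribution
```

Each update is $O(S^2)$, which is why exact belief tracking becomes infeasible for large or continuous state spaces.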

4. Transformer (Modern Approach):

Use a self-attention mechanism over the observation sequence $(o_1, \ldots, o_t)$. Pros: Captures long-distance dependencies; parallel training Cons: High computational cost

Practical Recommendations:

| Scenario | Recommended Method | Example |
| --- | --- | --- |
| Short-term partial observability | Frame stacking | Atari games (stack 4 frames) |
| Long-term memory needs | LSTM/GRU | Dialogue systems, strategy games |
| High uncertainty | Belief state | Robot localization (SLAM) |
| Large-scale sequences | Transformer | Decision Transformer |

Diagnostic Method:

If you suspect Markov property is violated, check:

  • Does the same observation sometimes require different actions to be optimal?
  • Does reward signal appear very delayed?
  • Does environment have hidden states (like opponent intent)?

If yes, consider state augmentation or RNN.

🎓 Summary: Key Points of Reinforcement Learning

MDP Framework:

The core of reinforcement learning is the Markov Decision Process: an agent interacts with the environment through a policy $\pi$, aiming to maximize the expected cumulative return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$.

Bellman Equations (Memory Formula):

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma\, v_\pi(s') \right]$$

Methodology Triangle:

          Dynamic Programming (Known Model)
                 /               \
                /                 \
  Monte Carlo                   Temporal Difference
  (Model-Free, High Variance)   (Model-Free, Low Variance)

Practical Checklist:

Core Intuition (Memory Mnemonic):

Value = Immediate Reward + Discounted Future Value

Policy Improvement = Greedily Choose High-Value Actions

Learning = Use Experience to Correct Estimates (MC/TD)
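The three mnemonics combine into value iteration. A toy tabular sketch under assumed array conventions (`P[a, s, s']` transition probabilities, `R[s, a]` rewards; the function name is illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=200):
    """Tiny value iteration on a tabular MDP.

    P: (A, S, S) transition tensor, R: (S, A) reward table.
    Returns the value function and the greedy policy.
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        # Value = Immediate Reward + Discounted Future Value
        Q = R + gamma * np.einsum('aij,j->ia', P, V)   # Q[s, a]
        # Policy Improvement = Greedily Choose High-Value Actions
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```

Learning methods (MC/TD) replace the exact expectation over `P` with sampled experience, which is the third mnemonic.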

Next Chapter Preview:

Chapter 1 established the theoretical foundation of reinforcement learning, but all methods are limited to small-scale tabular problems (finite states and actions). When state spaces are huge (like Atari game pixel spaces) or continuous, tabular methods cannot scale.

The next chapter introduces how to use deep neural networks to approximate Q-functions, giving birth to the groundbreaking work of deep reinforcement learning — DQN (Deep Q-Network). We will explore in depth:

  • Deep Q-Network architecture
  • Role of experience replay and target networks
  • DQN variants: Double DQN, Dueling DQN, Prioritized Experience Replay
  • Rainbow: The ensemble approach
  • Complete Atari game implementation

📚 References

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  2. Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
  3. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
  4. Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.
  5. Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  6. Silver, D. (2015). UCL Course on Reinforcement Learning. https://www.davidsilver.uk/teaching/
  7. Levine, S. (2023). CS 285: Deep Reinforcement Learning. UC Berkeley.
  8. Dayan, P. (1992). The convergence of TD($\lambda$) for general $\lambda$. Machine Learning, 8(3-4), 341-362.
  9. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD Thesis, Cambridge University.
  10. Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report, Cambridge University.

This is the first article in the Reinforcement Learning series, establishing the theoretical foundation for the entire series. Subsequent chapters will delve into deep reinforcement learning, policy optimization, exploration strategies, multi-agent learning, and other frontier topics.

  • Post title: Reinforcement Learning (1): Fundamentals and Core Concepts
  • Post author: Chen Kai
  • Create time: 2024-08-02 09:30:00
  • Post link: https://www.chenk.top/reinforcement-learning-1-fundamentals-and-core-concepts/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.