From board games to Atari video games, value function methods have been a cornerstone of reinforcement learning. Q-Learning learns to select optimal actions by iteratively updating state-action values, but faces the curse of dimensionality when dealing with high-dimensional state spaces (like an 84x84 pixel game screen). DeepMind's Deep Q-Network (DQN), proposed in 2013, broke through this barrier by using neural networks as function approximators, combined with two key innovations: experience replay and target networks. This enabled computers to achieve superhuman performance on multiple Atari games for the first time. This breakthrough not only accelerated the development of deep reinforcement learning but also spawned a series of improvements like Double DQN, Dueling DQN, and Prioritized Experience Replay, culminating in the Rainbow algorithm. This chapter starts from the mathematical foundations of Q-Learning, progressively deconstructs DQN's core mechanisms, and analyzes the design motivations and implementation details of various variants.
## Q-Learning Foundations: From Dynamic Programming to Temporal Difference

### Bellman Optimality Equation and Q-Values

In Chapter 1, we introduced the state-value function $V^\pi(s)$, which measures how good a state is under policy $\pi$. Unlike $V$, the action-value function $Q^\pi(s,a)$ evaluates state-action pairs directly, which lets us select actions without knowing the environment's dynamics. The optimal Q-function satisfies the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s, a \,\right]$$

### Q-Learning Algorithm: Incremental Updates

Q-Learning is an off-policy temporal difference (TD) algorithm, proposed by Watkins in 1989. Its update rule is:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Why is it called off-policy? Because the update formula uses $\max_{a'} Q(s',a')$ — the value under the greedy target policy — regardless of which action the behavior policy (e.g., $\epsilon$-greedy) actually takes next.
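A single application of this update rule can be exercised on a toy table (a minimal numpy sketch; the state/action indices are arbitrary):

```python
import numpy as np

# One tabular Q-Learning update on a toy 4-state, 2-action table.
Q = np.zeros((4, 2))
alpha, gamma = 0.5, 0.9

s, a, r, s_next = 0, 1, -1.0, 2          # an observed transition (s, a, r, s')
td_target = r + gamma * Q[s_next].max()  # r + gamma * max_a' Q(s', a')
td_error = td_target - Q[s, a]
Q[s, a] += alpha * td_error              # Q(s, a) moves a fraction alpha toward the target
```

With all Q-values initialized to 0, the updated entry becomes $0.5 \times (-1) = -0.5$; every other entry is untouched.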
### Convergence Guarantees: Watkins & Dayan (1992)

Q-Learning's convergence was proven by Watkins and Dayan in 1992, requiring the following conditions:

- Tabular representation: both state and action spaces are finite, with Q-values stored in a table
- All state-action pairs visited infinitely often: each $(s,a)$ must be updated infinitely many times
- The learning rate satisfies the Robbins–Monro conditions: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$

Intuitively, the first Robbins–Monro condition means the sum of learning rates is large enough to overcome any initial error; the second ensures learning rates decay fast enough for the algorithm to eventually stabilize near the optimal value. Typical choices are $\alpha_t = 1/t$ or, more generally, $\alpha_t = 1/t^p$ with $p \in (0.5, 1]$ (in practice a small constant rate is often used, even though it violates the second condition).

Under these conditions, Q-Learning converges to the optimal Q-function $Q^*$ with probability 1.
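The Robbins–Monro conditions can be checked numerically for $\alpha_t = 1/t$ — the first partial sum keeps growing (like $\ln T$), the second is bounded by $\pi^2/6$:

```python
import math

# Partial sums for alpha_t = 1/t over the first 100,000 steps.
T = 100_000
s1 = sum(1.0 / t for t in range(1, T + 1))        # diverges: ~ln(T) + 0.577
s2 = sum(1.0 / t ** 2 for t in range(1, T + 1))   # converges: -> pi^2 / 6

print(s1)  # ~12.09, and still growing without bound as T increases
print(s2)  # ~1.6449, bounded above by pi^2/6
```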
### Cliff Walking Example: Q-Learning Intuition

Let's build intuition with a classic example. Cliff Walking is a grid world: the agent starts from the bottom-left corner and needs to reach the goal at the bottom-right corner, but there's a row of cliff cells along the bottom — falling off gives a -100 reward and returns the agent to the start. Each step gives -1 reward.

In this environment, the optimal path is to walk along the cliff edge (the shortest path), but it's easy to fall off during exploration. What happens with Q-Learning updates?

Initially, all Q-values are 0. When the agent first falls off the cliff, the transition $(s, a_{\text{down}}, -100, s_{\text{start}})$ drives $Q(s, a_{\text{down}})$ sharply negative. Through the $\max_{a'} Q(s',a')$ term, this low value then propagates backward to the states leading into the cliff.

This example demonstrates two properties of Q-Learning:

1. Propagation of negative rewards: low Q-values in dangerous areas propagate backward, forming "forbidden zones"
2. Advantage of off-policy learning: even if the behavior policy frequently falls off during exploration, the learned greedy target policy can still be the optimal cliff-edge path (in contrast, on-policy SARSA learns the safer but longer detour)
Complete Python implementation (a compact reconstruction of the full script — the environment and training loop, without the plotting code):

```python
import numpy as np

# Cliff Walking: 4x12 grid; bottom row between start and goal is the cliff.
ROWS, COLS, N_ACTIONS = 4, 12, 4           # actions: 0=up, 1=right, 2=down, 3=left
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:            # fell off the cliff
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

Q = np.zeros((ROWS, COLS, N_ACTIONS))
alpha, gamma, epsilon = 0.5, 1.0, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    state, done = START, False
    episode_reward = 0.0
    while not done:
        # epsilon-greedy behavior policy
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        episode_reward += reward
        # off-policy TD update: target uses max over next actions
        td_target = reward + gamma * (0.0 if done else Q[next_state].max())
        Q[state][action] += alpha * (td_target - Q[state][action])
        state = next_state
```
This code demonstrates the complete Q-Learning flow: initialize Q-table, interact with environment, compute TD error, update Q-values. After running, you'll find the learning curve has large oscillations early on (frequent cliff falls), but as training progresses, the agent gradually learns to avoid the cliff, eventually stabilizing around -13 reward (optimal path length).
## Necessity and Challenges of Function Approximation

### Curse of Dimensionality: Why Tables Aren't Enough

Cliff Walking has only 48 states, so storing Q-values in a table is no problem. But consider Atari games:

- Breakout: the input is an 84x84x4 grayscale frame stack (4 frames of history); with 256 intensity levels per pixel, the state space has roughly $256^{84 \times 84 \times 4} \approx 10^{67{,}970}$ states
- Go: a 19x19 board where each intersection has 3 states (black/white/empty), giving a state space of $3^{361} \approx 10^{172}$
Even with the most advanced storage technology, we cannot store a Q-value for every state. More seriously, in such huge spaces, the agent will almost never encounter the exact same state twice — meaning each state is visited only once, violating Q-Learning's convergence conditions.
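These astronomical counts are easy to verify with a back-of-the-envelope logarithm:

```python
import math

# Number of decimal digits in 256^(84*84*4) and 3^361.
pixels = 84 * 84 * 4                       # 28,224 pixels per stacked observation
atari_digits = pixels * math.log10(256)    # digits of 256^pixels
go_digits = 361 * math.log10(3)            # digits of 3^361

print(round(atari_digits))  # ~67,970 decimal digits
print(round(go_digits))     # ~172 decimal digits
```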
The solution is function approximation: use a parameterized function $Q(s,a;\theta)$ to approximate $Q^*$, where the number of parameters $\theta$ is vastly smaller than the number of states. Similar states share parameters, so what is learned in one state generalizes to states never seen before.
### Deadly Triad: Triple Threat to Stability

However, function approximation brings serious stability issues. Sutton and Barto summarized the Deadly Triad in "Reinforcement Learning: An Introduction":

- Bootstrapping: updating estimates with estimates — Q-Learning's update target $r + \gamma \max_{a'} Q(s',a')$ itself depends on the current Q estimates
- Function approximation: replacing tables with parameterized functions — updating the Q-value of one state also changes the Q-values of other "similar" states
- Off-policy learning: the learned (target) policy differs from the behavior policy — the mismatch between the two introduces distributional shift

Combining all three can cause training to diverge. Let's analyze why mathematically.
### The Mathematics of Divergence

Consider linear function approximation: $Q(s,a;\theta) = \theta^\top \phi(s,a)$, where $\phi(s,a)$ is a feature vector. The TD update moves $\theta$ toward a target that is itself a function of $\theta$, so there is no fixed objective being minimized.

Worse, function approximation introduces generalization: updating one state's value changes others. A divergence cycle can form:

- Updating $Q(s_1, a)$ increases it → due to generalization, $Q(s_2, a)$ also increases
- But $s_2$'s true Q-value is low → the next update decreases $Q(s_2, a)$
- Generalization affects $Q(s_1, a)$ again, forming a cycle

Baird constructed a counterexample in 1995 showing that even simple linear function approximation + Q-Learning can diverge. In his "star counterexample", 7 states connected in a star structure with 6 features, standard off-policy updates cause the parameters $\theta$ to diverge to infinity.
### Early Attempts: Neural Fitted Q-Iteration (NFQ)

Before DQN, Riedmiller proposed Neural Fitted Q-Iteration (NFQ) in 2005. The idea:

1. Collect a batch of experiences $(s, a, r, s')$
2. Compute targets $y = r + \gamma \max_{a'} Q(s', a')$
3. Train the network with the supervised loss $\mathcal{L} = (y - Q(s,a))^2$
4. Repeat steps 1-3

NFQ worked on simple tasks but remained unstable in high-dimensional environments like Atari. Reasons:

- Sample correlation: consecutively collected experiences are highly correlated, violating the i.i.d. assumption behind supervised training
- Moving targets: the targets $y$ are recomputed from the very network being trained, so they shift with every update

DQN's two innovations address exactly these two problems.
## Core Innovations of DQN

### Experience Replay: Breaking Correlations
Experience replay draws from supervised learning's "data shuffling". In supervised learning, training sequentially (e.g., training all cat images first, then all dog images) causes models to overfit to data order, leading to catastrophic forgetting. The solution is randomly shuffling data at each epoch start.
DQN applies this to reinforcement learning:
1. Store experiences: maintain a replay buffer $\mathcal{D}$ with capacity $N$ (e.g., 1 million)
2. Add experiences: after each environment interaction yielding $(s, a, r, s', d)$, store it in $\mathcal{D}$
3. Sample for training: each update samples a mini-batch uniformly from $\mathcal{D}$ (e.g., 32 transitions)
4. Overwrite old data: when $\mathcal{D}$ is full, new experiences overwrite the oldest (FIFO queue)
This has three major benefits:
#### Benefit 1: Break Temporal Correlations

Consecutive experiences $(s_t, a_t, \dots)$ and $(s_{t+1}, a_{t+1}, \dots)$ are strongly correlated, so gradients computed on them point in similar directions, which destabilizes stochastic gradient descent. From an information-theory perspective, the mutual information between consecutive samples $I(x_t; x_{t+1})$ is large; sampling uniformly from a large buffer makes the selected samples nearly independent, so mini-batches come much closer to the i.i.d. assumption.
#### Benefit 2: Improve Sample Efficiency

In on-policy methods, each experience can only be used once (discarded after use). Experience replay allows us to reuse the same experience multiple times — as long as it's still in the buffer, it can be sampled. This is especially important for sample-expensive environments (like robot control).

If each experience is sampled an average of $k$ times before eviction, the effective amount of training data grows by roughly a factor of $k$ compared to single-use, on-policy training.
#### Benefit 3: Smooth Distribution Changes

When policies update, the new policy collects data from a different distribution. But since the buffer stores data from many past policies, the current training distribution is a mixture of multiple policies' distributions, which changes smoothly. This avoids "sharp turns" — one policy update causing a dramatic distribution change that makes the next update's target completely different.

Mathematically, if the buffer holds data from the last $K$ policies $\pi_{t-K+1}, \dots, \pi_t$, the training distribution is a mixture of their state-action distributions; a single policy update swaps out only one component of the mixture, so the overall distribution shifts gradually.
#### Implementation Details: ReplayBuffer

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """FIFO experience replay buffer; deque evicts the oldest automatically."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the current contents.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```
This implementation uses deque (double-ended queue) to
automatically handle capacity limits — new elements push out oldest.
Sampling uses random.sample to ensure uniformity.
### Target Network: Stabilizing Moving Targets

Even with experience replay, DQN still faces the "moving target" problem: updating the parameters $\theta$ changes both the prediction $Q(s,a;\theta)$ and the TD target $r + \gamma \max_{a'} Q(s',a';\theta)$ — the network is chasing a target that moves with its every step.

Specifically, DQN maintains two networks:

- Online network: parameters $\theta$, used for action selection and computing current Q-values
- Target network: parameters $\theta^-$, used only for computing TD targets

The update target becomes:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

Every $C$ steps (e.g., 10,000), the target network is synchronized: $\theta^- \leftarrow \theta$. Between synchronizations, the TD targets stay fixed, turning each interval into a well-posed regression problem.
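The two-network bookkeeping can be sketched with plain numpy dictionaries standing in for network parameters (illustrative only; the subtraction is a stand-in for real gradient steps):

```python
import numpy as np

# Hard target-network update: the target holds a frozen copy of the
# online parameters, synchronized only every C gradient steps.
rng = np.random.default_rng(0)
online = {"w": rng.normal(size=(4, 2))}
target = {k: v.copy() for k, v in online.items()}   # theta_minus <- theta

for step in range(5):        # stand-in for 5 gradient steps
    online["w"] -= 0.1       # the online network moves...

drift = float(np.abs(online["w"] - target["w"]).max())  # ...the target stays put

target = {k: v.copy() for k, v in online.items()}       # sync: theta_minus <- theta
```

Between syncs, `target` is untouched (here it drifts 0.5 away from `online`), which is exactly what keeps the TD targets stationary.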
#### Theoretical Analysis: Why Target Networks Work

Target networks can be understood through a bias-variance tradeoff. Consider the variance of the TD targets: with a single shared network, every gradient step perturbs the targets of all subsequent updates, so the regression target fluctuates continuously. Freezing $\theta^-$ for $C$ steps removes this step-to-step fluctuation (lower target variance), at the cost of measuring the Bellman error against slightly stale values $Q(\cdot;\theta^-)$ rather than the current $Q(\cdot;\theta)$ (extra bias). The staleness bias shrinks as $C$ decreases, while the variance reduction grows as $C$ increases — choosing $C$ trades off the two effects.
## Complete DQN Algorithm Analysis

### Pseudocode and Flow

Now we can write the complete DQN algorithm:

**Algorithm: Deep Q-Network (DQN)**

Input: environment, replay buffer capacity $N$, batch size $B$, discount $\gamma$, target update period $C$, exploration schedule $\epsilon_t$

1. Initialize replay buffer $\mathcal{D}$, online network $\theta$, and target network $\theta^- \leftarrow \theta$
2. for episode $= 1, 2, \dots$ do
3. &nbsp;&nbsp;Initialize state $s_0$
4. &nbsp;&nbsp;for $t = 0, 1, 2, \dots$ do
5. &nbsp;&nbsp;&nbsp;&nbsp;With probability $\epsilon$ choose a random action $a_t$; otherwise $a_t = \arg\max_a Q(s_t, a; \theta)$
6. &nbsp;&nbsp;&nbsp;&nbsp;Execute $a_t$, observe $r_t$, $s_{t+1}$, and the done flag $d_t$
7. &nbsp;&nbsp;&nbsp;&nbsp;Store $(s_t, a_t, r_t, s_{t+1}, d_t)$ in $\mathcal{D}$
8. &nbsp;&nbsp;&nbsp;&nbsp;If $\mathcal{D}$ is sufficiently full, sample a mini-batch $\{(s_i, a_i, r_i, s'_i, d_i)\}$ of size $B$ from $\mathcal{D}$
9. &nbsp;&nbsp;&nbsp;&nbsp;Compute targets $y_i = r_i + \gamma (1 - d_i) \max_{a'} Q(s'_i, a'; \theta^-)$
10. &nbsp;&nbsp;&nbsp;&nbsp;Take a gradient step on $\mathcal{L}(\theta) = \frac{1}{B} \sum_i \left(y_i - Q(s_i, a_i; \theta)\right)^2$
11. &nbsp;&nbsp;&nbsp;&nbsp;Every $C$ steps, set $\theta^- \leftarrow \theta$
12. &nbsp;&nbsp;&nbsp;&nbsp;If $d_t$, break
13. &nbsp;&nbsp;end for
14. end for

Key points:

- The factor $(1 - d_i)$ in line 9 ensures terminal-state targets are just $r_i$ (no future returns)
- Training only starts once the buffer holds enough transitions (prefilling)
- Exploration decay: $\epsilon$ starts near 1.0 to encourage exploration, then gradually decays to a small value (e.g., 0.01-0.1), favoring exploitation
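The exploration-decay schedule from the last key point, as a quick sketch with typical values:

```python
# Multiplicative epsilon decay clamped at epsilon_end.
eps_start, eps_end, decay = 1.0, 0.01, 0.995
eps = eps_start
history = []
for episode in range(1000):
    history.append(eps)
    eps = max(eps_end, eps * decay)   # never decays below the floor

print(history[0], eps)  # 1.0 -> 0.01 (the floor is reached after ~920 episodes)
```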
### Loss Function and Gradient Computation

DQN's loss function is the mean squared TD error:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \right)^2 \right]$$

This has the same form as a supervised regression loss. The crucial difference is that the regression target is produced by (a frozen copy of) the network itself, and the gradient flows only through $Q(s,a;\theta)$, never through the target term.
### Training Tricks

#### Gradient Clipping

Atari environments can produce large TD errors (e.g., Breakout brick scores accumulate to hundreds), risking gradient explosion. A standard remedy is clipping the gradient norm, e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)`. (The original DQN paper instead clipped the TD error to $[-1, 1]$, which is equivalent to using the Huber loss.)
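What global-norm clipping does can be shown in a few lines of numpy (a sketch of the same rescaling that `clip_grad_norm_` performs):

```python
import math
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm <= max_norm."""
    total = math.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0]), np.array([4.0])]   # global norm = sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)

print(norm)     # 5.0
print(clipped)  # [0.6], [0.8] -> joint norm is now exactly 1
```

Note that the direction of the combined gradient is preserved; only its magnitude is capped.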
#### Learning Rate Scheduling

DQN typically uses a fixed learning rate (e.g., 2.5e-4 with RMSProp or Adam) rather than a decay schedule: because the data distribution itself keeps shifting during training, aggressively decaying the learning rate can freeze learning before the policy has converged.
#### Reward Clipping and Normalization

Atari games have vastly different reward scales (Pong is -1/0/+1, Breakout can reach hundreds). The DQN paper clips all positive rewards to +1 and all negative rewards to -1:

$$r \leftarrow \operatorname{clip}(r, -1, 1)$$

This lets a single set of hyperparameters work across games, at the cost of discarding reward-magnitude information.
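In code this is a one-liner:

```python
import numpy as np

# Reward clipping as in the DQN paper: every reward squashed into [-1, 1].
rewards = np.array([-7.0, -0.5, 0.0, 3.0, 120.0])
clipped = np.clip(rewards, -1.0, 1.0)
print(clipped)  # [-1.  -0.5  0.   1.   1. ]
```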
### Complete Atari DQN Implementation

Below is a condensed sketch of a DQN for Atari Pong (the full version runs roughly 350 lines; environment wrappers, the replay buffer from above, and the outer training loop are omitted here):

```python
import random

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Nature-DQN convolutional network for stacked 84x84x4 inputs."""

    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.net(x / 255.0)          # normalize uint8 frames to [0, 1]

def select_action(online, state, epsilon, n_actions):
    """Epsilon-greedy action selection for a single (4, 84, 84) state tensor."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(online(state.unsqueeze(0)).argmax())

def train_step(online, target, optimizer, batch, gamma=0.99):
    """One gradient step on a sampled mini-batch of tensors."""
    states, actions, rewards, next_states, dones = batch  # actions: int64, dones: float
    q = online(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target(next_states).max(dim=1).values
        y = rewards + gamma * (1 - dones) * next_q        # TD target from frozen network
    loss = nn.functional.smooth_l1_loss(q, y)             # Huber loss ~ clipped TD error
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(online.parameters(), 10)
    optimizer.step()
    return loss.item()
```
This code covers the key DQN components — the convolutional network, TD targets from a target network, epsilon-greedy exploration, and gradient clipping — and pairs with the ReplayBuffer above for the full training loop. After training for about 200-300 episodes on Pong, the agent typically reaches near-optimal performance (average reward close to +21, i.e., winning nearly every point).
## DQN Variants: From Double to Rainbow

### Double DQN: Addressing Q-Value Overestimation

#### Problem: The Positive Bias of max

The $\max$ operator in the TD target systematically overestimates. Let the true Q-values be $q(s',a')$ and the estimates be $\hat{Q}(s',a') = q(s',a') + \epsilon_{a'}$ with zero-mean noise $\epsilon_{a'}$. Then

$$\mathbb{E}\left[\max_{a'} \hat{Q}(s',a')\right] \;\ge\; \max_{a'} \mathbb{E}\left[\hat{Q}(s',a')\right] \;=\; \max_{a'} q(s',a'),$$

so the maximum of noisy estimates is biased upward: the same network both selects the action (picking the one whose noise happens to be most favorable) and evaluates it (with that same favorable noise).
#### Solution: Decouple Selection and Evaluation

Double DQN (van Hasselt et al., 2016) uses the online network to select actions and the target network to evaluate them. The update target becomes:

$$y = r + \gamma\, Q\!\left(s',\, \arg\max_{a'} Q(s', a'; \theta);\; \theta^-\right)$$

Mathematically, if the two networks' estimation errors are (approximately) independent, then the action chosen because of the online network's favorable noise is evaluated by the target network, whose noise is uncorrelated with the selection — so the systematic upward bias largely cancels.
#### Numerical Example

Consider a state $s'$ with two actions whose true Q-values are both 10, while the online network estimates them as 9 and 12:

- DQN: taking the $\max$ over the same noisy estimates gives 12 — an overestimate by 2
- Double DQN: the online network selects action 2 (its estimate Q = 12), but the target network evaluates that action with its own, independent error, yielding e.g. 9.5 — much closer to the true value of 10
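This bias is easy to reproduce with a Monte-Carlo sketch (all true values zero, estimates corrupted by unit Gaussian noise — the numbers here are illustrative, not from the paper):

```python
import numpy as np

# Max-operator bias: 10 actions all truly worth 0.
rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000

single, double = [], []
for _ in range(n_trials):
    noisy_a = rng.normal(0.0, 1.0, n_actions)   # one noisy estimate set
    noisy_b = rng.normal(0.0, 1.0, n_actions)   # an independent one
    single.append(noisy_a.max())                # DQN: select AND evaluate with noisy_a
    double.append(noisy_b[noisy_a.argmax()])    # Double: select with a, evaluate with b

print(np.mean(single))  # ~1.54: biased high even though every true value is 0
print(np.mean(double))  # ~0.0: independent evaluation removes the bias
```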
#### Implementation

Only the target computation needs to change (a sketch, assuming `online_net`/`target_net` tensors as in the code above):

```python
# Original DQN target: the target network both selects and evaluates
next_q = target_net(next_states).max(dim=1).values

# Double DQN target: the online network selects, the target network evaluates
next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
next_q = target_net(next_states).gather(1, next_actions).squeeze(1)

y = rewards + gamma * (1 - dones) * next_q
```
Experiments show Double DQN reduces overestimation in most Atari games, improving stability and final performance.
### Dueling DQN: Separating State Value and Advantage

#### Motivation: Not All States Need to Care About Actions

Consider two scenarios in Atari games:

1. Urgent situations (like the ball about to be missed in Pong): action choice is critical
2. Calm moments (like the ball far from the paddle): values are similar regardless of action

Traditional DQN uses a single network head to output all $|\mathcal{A}|$ Q-values, so even in calm states every Q-value must be learned separately. Intuitively, the Q-value decomposes as $Q(s,a) = V(s) + A(s,a)$: the value of being in the state, plus the advantage of each particular action.
#### Identifiability Problem

Directly outputting $V$ and $A$ and summing them is unidentifiable: adding a constant to $V(s)$ and subtracting it from every $A(s,a)$ leaves $Q$ unchanged, so the two streams are not uniquely determined. Dueling DQN resolves this by subtracting the mean advantage:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a')$$

This forces the advantages to have zero mean, pinning the value stream to the mean Q-value.
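Both the ambiguity and its resolution can be checked numerically (toy values, invented for illustration):

```python
import numpy as np

# Naive Q = V + A is unidentifiable: shifting mass between the two
# streams leaves Q unchanged. Mean-subtraction pins V down.
V = 5.0
A = np.array([2.0, -1.0, -1.0])
c = 3.0

naive = V + A
naive_shifted = (V + c) + (A - c)          # same Q, different (V, A) split
print(np.allclose(naive, naive_shifted))   # True -> the split is ambiguous

Q = V + (A - A.mean())                     # dueling aggregation
print(Q.mean())                            # 5.0 == V: V equals the mean Q-value
```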
#### Network Architecture

Dueling DQN's convolutional layers are the same as DQN's, but the fully connected layers split into two streams (a sketch, reusing the conv trunk dimensions from the DQN above):

```python
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Shared conv trunk, then separate value and advantage streams."""

    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.value = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x):
        h = self.features(x / 255.0)
        v = self.value(h)                    # V(s):    shape (B, 1)
        a = self.advantage(h)                # A(s, a): shape (B, n_actions)
        # Mean-subtraction resolves the V/A identifiability problem.
        return v + a - a.mean(dim=1, keepdim=True)
```
#### Why It Works

The dueling architecture's advantage is more efficient value-function learning. In many states, action choice has little impact on value — there, essentially only $V(s)$ needs to be learned, and because the value stream is shared across all actions, every sampled transition updates it. In standard DQN, the same state-value information would have to be learned separately for each action's output.
Experiments show Dueling DQN has significant improvements in tasks requiring long-term planning (like Seaquest, Enduro), because in these tasks differences in state value matter more.
### Prioritized Experience Replay: Importance Sampling

#### Problem: Not All Experiences Are Equally Important

Uniform random sampling assumes all experiences are equally important, but intuitively:

- High TD-error experiences are more "surprising" and contain more information
- Low TD-error experiences are already well learned; repeated training on them has diminishing returns

Prioritized Experience Replay (PER, Schaul et al., 2016) allocates sampling probability based on TD error magnitude.

#### Priority Definition

Each experience's priority is defined from its TD error $\delta_i$:

$$p_i = |\delta_i| + \epsilon, \qquad P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

where $\epsilon$ is a small constant preventing zero probabilities, and $\alpha \in [0,1]$ interpolates between uniform sampling ($\alpha = 0$) and fully greedy prioritization ($\alpha = 1$).
#### Importance Sampling Weights

Prioritized sampling changes the data distribution, introducing bias. To correct it, each sample's loss is weighted by an importance-sampling weight:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}, \qquad w_i \leftarrow \frac{w_i}{\max_j w_j}$$

where $N$ is the buffer size and $\beta$ anneals from its initial value (e.g., 0.4) to 1 over training, fully correcting the bias by the end.
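Both formulas together, on a 4-transition toy buffer (TD-error values invented for illustration):

```python
import numpy as np

# PER sampling probabilities and importance-sampling weights.
td_errors = np.array([0.1, 2.0, 0.5, 4.0])
eps, alpha, beta = 0.01, 0.6, 0.4

p = (np.abs(td_errors) + eps) ** alpha   # priorities p_i
probs = p / p.sum()                      # P(i)

N = len(td_errors)
w = (N * probs) ** (-beta)               # IS correction weights
w = w / w.max()                          # normalize by the max for stability

print(probs.argmax())  # 3: the largest TD error is sampled most often
print(w.argmax())      # 0: the rarest sample gets the largest weight
```

Note the two arrays pull in opposite directions by construction: over-sampled transitions are down-weighted in the loss.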
#### Implementation

PER must sample proportionally to priority; a SumTree data structure supports $O(\log N)$ sampling and priority updates. A simplified array-based sketch (fine for small buffers, $O(N)$ per sample):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional PER sketch; production code uses a SumTree for O(log N)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.pos = [], 0
        self.priorities = np.zeros(capacity)

    def push(self, transition):
        # New samples get the current max priority so they are seen at least once.
        max_p = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.buffer)] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights = weights / weights.max()       # importance-sampling correction
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        self.priorities[idx] = np.abs(td_errors) + eps
```
PER significantly improves sample efficiency but has higher implementation complexity and slightly increased training time.
### Other Rainbow Components

Noisy Networks: replace $\epsilon$-greedy exploration with learnable parametric noise in the linear layers, $y = (\mu_w + \sigma_w \odot \varepsilon_w)\, x + (\mu_b + \sigma_b \odot \varepsilon_b)$; the network itself learns how much exploration noise each weight needs.

Multi-step Learning: replace the 1-step TD target with the n-step return:

$$y = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a'; \theta^-)$$
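The n-step target is a direct translation of this formula (toy numbers; the bootstrap value stands in for $\max_{a'} Q(s_{t+n}, a'; \theta^-)$):

```python
# 3-step return: sum three discounted rewards, then bootstrap.
gamma, n = 0.99, 3
rewards = [1.0, 0.0, 2.0]     # r_t, r_{t+1}, r_{t+2}
bootstrap = 5.0               # assumed max_a' Q(s_{t+3}, a'; theta_minus)

y = sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * bootstrap
print(y)  # 1 + 0 + 0.9801*2 + 0.970299*5 = 7.811695
```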
Distributional RL (C51): Don't learn Q-value expectation, learn entire distribution. Discretize value range into 51 bins, train distribution prediction with cross-entropy loss. This captures risk (variance) information, beneficial for exploration.
Rainbow: Combines all 6 techniques (Double DQN, Dueling, PER, Noisy Nets, Multi-step, C51), achieving state-of-the-art performance on Atari at the time. Ablation studies show each technique contributes, and combined effect exceeds simple addition (synergy).
## Theoretical Analysis

### Theoretical Guarantees of Experience Replay

From an optimization perspective, experience replay samples from the buffer's experience distribution $\mu$ rather than from the current policy's state-action distribution, so the Bellman error being minimized is weighted by $\mu$ — an off-policy mismatch.

The good news: Munos et al. (2016) showed that if the replay data satisfies certain coverage conditions and the target network updates slowly enough, DQN-style algorithms can still converge to an approximately optimal solution with high probability, with an error bound depending on how well $\mu$ covers the states visited by near-optimal policies.
## Practical Tips and Debugging

### Hyperparameter Selection

| Hyperparameter | Typical Value | Description |
|---|---|---|
| `buffer_size` | 100k-1M | Larger is better, but memory-limited; Atari uses 1M |
| `batch_size` | 32-128 | Too small is unstable, too large slows training; 32 is common |
| `learning_rate` | 1e-4 to 1e-3 | Adam typically uses 2.5e-4 |
| `gamma` | 0.99 | Discount factor; 0.99 for long-horizon, 0.9 for short-horizon tasks |
| `epsilon_start` | 1.0 | Initial exploration rate |
| `epsilon_end` | 0.01-0.1 | Final exploration rate; 0.01 for Atari |
| `epsilon_decay` | 0.995 | Per-episode decay factor (or use linear decay) |
| `target_update_freq` | 1k-10k | Target network update period; 10k is common |
Tuning advice:

1. Start with a smaller `buffer_size` and larger `learning_rate` to quickly verify code correctness
2. After confirming that loss decreases and reward increases, switch to the full hyperparameters for a long training run
3. If training is unstable (loss oscillates, reward collapses), reduce `learning_rate` or increase `target_update_freq`
### Training Curve Analysis

Normal curve: reward gradually rises; loss first rises, then falls.

- Early stage (0-100 episodes): reward near the random level; loss rises (Q-values grow from 0, so TD error increases)
- Middle stage (100-500 episodes): reward improves rapidly; loss peaks then falls (Q-values stabilize)
- Late stage (500+ episodes): reward converges near optimal; loss stabilizes at a low level

Anomalies:

- Loss explosion: loss suddenly jumps to 1e3 or 1e6. Cause: gradient explosion or Q-value divergence. Fix: check gradient clipping, reduce the learning rate, check whether rewards are unnormalized.
- Reward collapse: reward rises, then suddenly drops and doesn't recover. Cause: catastrophic forgetting or a local optimum. Fix: increase `buffer_size`, reduce the learning rate, check whether the target network updates too frequently.
- Reward stagnation: reward stays at the random level for a long time. Cause: insufficient exploration or inadequate network capacity. Fix: increase `epsilon` or slow down `epsilon_decay`, check the network architecture.
### Debugging Techniques

#### 1. Monitor Q-Value Distribution

Print Q-value statistics during training (sketch; `online_net` and a fixed held-out batch `sample_states` are assumed):

```python
with torch.no_grad():
    q_values = online_net(sample_states)
print(f"Q mean={q_values.mean():.2f}  "
      f"max={q_values.max():.2f}  min={q_values.min():.2f}")
```
Normally, Q-values should gradually increase and stabilize. If Q-values continuously explode (e.g., exceed 1000), there's a problem.
#### 2. Check TD Error

Print the TD error distribution (continuing the `target_q`/`current_q` tensors from the training step):

```python
td_errors = (target_q - current_q).abs()
print(f"TD error mean={td_errors.mean():.3f}  max={td_errors.max():.3f}")
```
TD error should gradually decrease. If it stays high for long, the network isn't learning useful patterns.
#### 3. Visualize Action Distribution

Count how often the agent selects each action (sketch; `select_action` is whatever greedy/epsilon-greedy policy your agent exposes):

```python
action_counts = np.zeros(n_actions)
for state in evaluation_states:
    action_counts[select_action(state)] += 1
print(action_counts / action_counts.sum())
```
If an action is never selected, the network may be stuck in a local optimum.
#### 4. Test on Simplified Environments

When debugging complex environments (like Atari), first verify code correctness on a simple environment (like CartPole):

```python
env = gym.make('CartPole-v1')
# Swap the CNN for a small MLP (e.g., two 128-unit hidden layers);
# a correct DQN should approach the max return of 500 within a few hundred episodes.
```
If simple environments don't work, there's a bug in the code.
## Deep Q&A

### Q1: Why does DQN use off-policy instead of on-policy learning?

A: Off-policy's core advantage is data utilization efficiency. On-policy methods (like SARSA, A3C) require training data to come from the current policy — after each policy update, old data becomes obsolete. Off-policy methods (like Q-Learning, DQN) can use data collected by any policy, as long as it explored the relevant state-action pairs.

DQN's experience replay leverages this: data in the buffer comes from multiple past policies ($\pi_{t-K}, \dots, \pi_{t-1}$), yet all of it can still be used to train the current Q-function.
The cost is off-policy methods are harder to converge (due to distribution mismatch between data and target policy), requiring additional stabilization techniques (like target networks). But in sample-expensive scenarios (like robotics, Atari), this trade-off is worthwhile.
### Q2: How should the replay buffer size be chosen?

A: Buffer size $N$ involves several tradeoffs:

- Coverage: larger $N$ means the buffer covers a broader state distribution, reducing overfitting to recent data
- Freshness: smaller $N$ means buffer data is more "fresh" — closer to the current policy, reducing off-policy mismatch
- Memory limits: storing $N$ transitions requires $O(N)$ memory; a single 84x84x4 Atari observation alone is ~28KB

Typical choices:

- Atari: 1M (~28GB of memory if frames are stored naively)
- Simple environments (like CartPole): 10k-100k
- Continuous control (like MuJoCo): 100k-1M

Rule of thumb: the buffer should hold at least 100 complete episodes of data. If it's too small (say, only 10 episodes), the buffer gets overwritten frequently and sample diversity is insufficient.
Experimental suggestion: Start with smaller buffer (like 10k) for fast iteration, confirm code correctness, then use large buffer for training.
### Q3: How serious is the overestimation problem that Double DQN addresses?
A: In many Atari games, DQN's Q-value overestimation is very significant. van Hasselt et al. showed in their paper an example:
- Pong game: True Q-value range approximately -21 to +21 (game score)
- DQN estimates: Q-values gradually inflate to +50 or even +100
- Double DQN estimates: Stable around -21 to +21
Overestimation severity depends on environment randomness and action-space size:

- High randomness (like noisy rewards) → large estimation errors → severe overestimation
- Many actions (like Atari's full 18-action set) → larger $\mathbb{E}[\max_a \epsilon_a]$ → larger upward bias
In some games (like Seaquest), overestimation causes agents to choose suboptimal policies — because some "seemingly good" actions (overestimated Q-values) are actually poor. Double DQN can reduce overestimation bias by 50%-90% through decoupling selection and evaluation.
But there are exceptions: in deterministic environments (like chess), overestimation is less problematic, Double DQN's improvement is limited.
### Q4: When does Dueling DQN work best?

A: Dueling DQN improves most in scenarios where actions make little difference in most states. Specifically:

1. Long-term planning tasks (like Seaquest, Enduro): most time is spent "cruising", and only a few critical moments need precise actions — the state value $V(s)$ matters more than the advantage $A(s,a)$.
2. High action redundancy (some Atari actions are nearly equivalent): for example, "left" and "left+shoot" have the same effect when no enemies are present. Dueling automatically captures this redundancy and focuses its capacity on learning $V(s)$.
3. Sparse-reward environments: in most states all actions have similar Q-values (due to sparse rewards), so $A(s,a)$ is close to 0 and $V(s)$ carries the main information.

Counterexample: in environments with large action-value differences everywhere (like fighting games where every action is critical), Dueling's advantage is less obvious, because the advantage function $A(s,a)$ has to do most of the work anyway.
### Q5: Why isn't Rainbow a simple 1+1=2?

A: Rainbow's performance improvement isn't a simple sum of individual techniques — there is synergy. Ablation studies show, for example:

- Double DQN alone: ~30% improvement
- Dueling DQN alone: ~20% improvement
- Both combined: ~60% improvement (greater than 30% + 20%)

Reasons for synergy: Double DQN reduces overestimation → Dueling's $V(s)$ and $A(s,a)$ estimates become more accurate → PER's TD-error-based priorities are computed from better estimates, so the most informative samples really are prioritized.

But there are negative interactions too: PER's importance-sampling weights $w_i$ rescale the per-sample loss, which interacts with the effective learning rate and with the exploration noise learned by Noisy Networks — so the combined system's hyperparameters need retuning rather than reusing each component's defaults.
### Q6: Why can't DQN handle continuous action spaces?

A: DQN's core operations are $\max_a Q(s,a)$ (for the TD target) and $\arg\max_a Q(s,a)$ (for action selection), both of which enumerate all actions. With a small discrete action set this is one cheap forward pass. In continuous action spaces (like robot joint angles $a \in \mathbb{R}^n$), the max becomes a nontrivial optimization problem that would have to be solved at every single step.

Solutions:

- Discretization: divide the continuous action space into a grid, but the curse of dimensionality causes combinatorial explosion (10 bins per dimension over 6 joints is already $10^6$ actions)
- Actor-Critic methods: use a separate actor network $\mu(s;\phi)$ that directly outputs a continuous action, with a critic $Q(s,a;\theta)$ evaluating it — the approach taken by DDPG and its successors
### Q7: How do you judge whether DQN training has converged?

A: Multiple indicators can be combined:

- Reward stability: in a deterministic test environment (no exploration), the average reward over 100 consecutive episodes no longer increases, with variance under 5%.
- Q-value stability: monitor the mean Q-value $\bar{Q}$ on a fixed batch of held-out states; if $|\Delta\bar{Q}| < \epsilon$ for, say, 1000 consecutive steps, the Q-function has settled.
- Policy stability: the greedy action in a fixed set of probe states stops changing between checkpoints.
- TD-error stability: the average TD error on held-out transitions drops to a low, flat plateau.
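The reward-stability criterion can be turned into a small utility (a sketch; the window size and thresholds are arbitrary choices, not from the text above):

```python
import numpy as np

def has_converged(rewards, window=100, rel_std=0.05):
    """Converged = no recent improvement AND low relative variance."""
    if len(rewards) < 2 * window:
        return False
    recent = np.asarray(rewards[-window:], dtype=float)
    prev = np.asarray(rewards[-2 * window:-window], dtype=float)
    no_improvement = recent.mean() <= prev.mean() + 0.01 * abs(prev.mean())
    stable = recent.std() < rel_std * (abs(recent.mean()) + 1e-8)
    return bool(no_improvement and stable)

print(has_converged([21.0] * 300))        # True: flat, low-variance plateau
print(has_converged(list(range(300))))    # False: still improving
```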
In practice, Atari games typically need 10M-50M frames (about 200-1000 episodes) to converge. If reward still random walks after 1000 episodes, there's a problem.
### Q8: DQN vs PPO — when to use which?
A: DQN and PPO each have suitable scenarios:
| Feature | DQN | PPO |
|---|---|---|
| Action Space | Discrete (small scale) | Discrete + Continuous |
| Sample Efficiency | High (experience replay) | Low (on-policy) |
| Stability | Needs tuning (target network etc.) | Out-of-box |
| Parallelization | Single environment | Multi-environment |
| Suitable Tasks | Atari, discrete control | Robotics, continuous control |
Selection advice:

- Discrete actions + expensive samples (like robot grasping at 5 minutes per try) → DQN
- Continuous actions + abundant sampling (like simulator environments) → PPO
- Need fast prototyping (minimal tuning) → PPO
- Pursuing maximum performance (willing to fine-tune) → Rainbow DQN
Interesting phenomenon: On Atari, DQN's final performance is usually higher than PPO, but PPO's training curve is smoother. This reflects off-policy methods' "high risk high reward" characteristic.
### Q9: How do you handle high-dimensional image input (Atari)?

A: Atari's raw observation is a 210x160x3 RGB image. Feeding it in directly causes problems:

- Too high dimensionality: 210x160x3 = 100,800 dimensions, so network parameters explode
- Redundant information: most pixels are background (black), so information density is low
- Temporal ambiguity: a single frame can't capture velocity (like the ball's direction of motion)
DQN's preprocessing pipeline:

Step 1: Grayscale

```python
gray = np.dot(rgb[..., :3], [0.299, 0.587, 0.114])  # luminance-weighted grayscale
```

Step 2: Crop irrelevant regions

```python
cropped = gray[34:194, :]  # remove Pong's top scoreboard
```

Step 3: Downsample

```python
resized = cv2.resize(cropped, (84, 84))  # resize to 84x84
```

Step 4: Frame stacking

```python
state = np.stack(frame_history[-4:], axis=0)  # the 4 most recent frames
```

Frame stacking solves the velocity problem — the network can infer motion direction by comparing adjacent frames.

Step 5: Normalization

```python
state = state / 255.0  # scale to [0, 1]
```
After these processes, input reduces from 100,800 to 84x84x4 = 28,224 dimensions, information density greatly improved.
### Q10: How good is DQN's sample efficiency?

A: Sample efficiency measures how many samples are needed to reach a target performance. DQN sits in the middle of the spectrum.

Compared to tabular methods:

- Tabular Q-Learning (CartPole): ~5,000 steps
- DQN (CartPole): ~50,000 steps

DQN needs roughly 10x the samples because neural networks need more data to fit the Q-function.

Compared to on-policy methods:

- DQN (Atari Pong): ~2M frames
- A3C (on-policy): ~10M frames

Experience replay buys roughly a 5x sample-efficiency advantage by reusing transitions.

Compared to model-based methods:

- DQN (simulator): ~100k steps
- Dyna-Q (model-based): ~10k steps

Model-based methods get away with fewer real interactions by learning an environment model and simulating inside it.

Absolute numbers: on Atari, DQN typically needs 10M-50M frames (about 40-200 hours of game time) to reach human level. This is unacceptable for real environments like robotics — so in practice, DQN is mostly used in simulators or combined with sim-to-real techniques.

Directions for improving sample efficiency:

- Better network architectures (like ResNet, Transformer)
- Auxiliary tasks (like self-supervised learning, contrastive learning)
- Transfer learning (pretraining + fine-tuning)
## References

Core papers on DQN and its variants (chronological order):

1. Watkins, C. J., & Dayan, P. (1992). Q-learning. *Machine Learning*, 8(3-4), 279-292. — Q-Learning convergence proof.
2. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. *NIPS Deep Learning Workshop*. arXiv:1312.5602. — First DQN paper, introducing experience replay and target networks.
3. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533. — Complete DQN, achieving human-level performance on 49 Atari games.
4. van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. *AAAI*. arXiv:1509.06461. — Double DQN, addressing Q-value overestimation.
5. Wang, Z., Schaul, T., Hessel, M., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. *ICML*. arXiv:1511.06581. — Dueling DQN, separating state value and advantage.
6. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. *ICLR*. arXiv:1511.05952. — PER, prioritized sampling based on TD error.
7. Fortunato, M., Azar, M. G., Piot, B., et al. (2018). Noisy Networks for Exploration. *ICLR*. arXiv:1706.10295. — Noisy Networks, learnable exploration noise.
8. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. *ICML*. arXiv:1707.06887. — C51, learning value distributions instead of expectations.
9. Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. *AAAI*. arXiv:1710.02298. — Rainbow, integrating six DQN improvements.
10. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. — Classic RL textbook, with detailed treatment of Q-Learning and the Deadly Triad.
From tabular Q-Learning to DQN's deep learning, from single algorithm to Rainbow's technical integration, value-based methods have undergone tremendous evolution over the past thirty years. DQN's two innovations — experience replay and target networks — not only solved deep Q-network stability issues but also provided design paradigms for subsequent Deep RL algorithms. Variants like Double DQN, Dueling DQN, and PER each have their focus, addressing overestimation, learning efficiency, and sample efficiency issues. Rainbow's success demonstrates that carefully combined technical stacks can produce synergistic effects exceeding simple addition.
However, DQN's limitations are also obvious: only applicable to discrete action spaces, struggling with continuous control tasks. The next chapter will turn to Policy Gradient methods and Actor-Critic architectures, exploring how to directly optimize policies and naturally handle continuous action spaces — leading us into the world of modern algorithms like DDPG, TD3, and SAC.
- Post title: Reinforcement Learning (2): Q-Learning and Deep Q-Networks (DQN)
- Post author: Chen Kai
- Create time: 2024-08-09 14:00:00
- Post link: https://www.chenk.top/reinforcement-learning-2-q-learning-and-dqn/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.