Learning rate (LR) is the knob that most often decides whether training converges, crawls, or blows up. This post builds an actionable mental model — from the simplest quadratic loss to modern large-scale training recipes — so you can choose schedules (warmup/cosine/WSD), debug instability, and tune LR systematically. We cover the math (why "too big explodes, too small stalls"), practical workflows (LR range test, schedule selection), failure mode diagnosis, recent research (schedule-free, power scheduler, warmup theory), and a troubleshooting checklist for common issues.
The One-Sentence Definition
Learning rate controls how far you move in the direction suggested by the gradient each step.
A typical update is:

$$\theta_{t+1} = \theta_t - \eta \,\nabla L(\theta_t)$$

where $\eta$ is the learning rate.
Core trade-off:
large → fast but unstable; small → stable but slow (or stuck)
Minimal Math: Why "Too Big Explodes, Too Small Stalls"
1D quadratic: the simplest intuition
Consider a 1D quadratic $L(\theta) = \tfrac{a}{2}\theta^2$ with curvature $a > 0$. Gradient descent gives

$$\theta_{t+1} = \theta_t - \eta a \theta_t = (1 - \eta a)\,\theta_t$$

Intuition: If $|1 - \eta a| < 1$ (i.e., $\eta < 2/a$), the iterates shrink toward the optimum; if $\eta > 2/a$, every step overshoots and $|\theta_t|$ grows geometrically: the loss explodes. A tiny $\eta$ still converges, just painfully slowly.
Multi-dimensional case: most curved direction controls stability
In higher dimensions with Hessian $H$ (eigenvalues $\lambda_1 \ge \dots \ge \lambda_d > 0$), the same argument applies along each eigendirection, so gradient descent is stable only if $\eta < 2/\lambda_{\max}$.
Key insight: The steepest direction (largest eigenvalue) determines the stability boundary.
Analogy: You're walking in a valley. Most directions are gentle slopes, but one direction is a cliff edge. Your step size must be small enough to not fall off that cliff.
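The cliff-edge picture is easy to check numerically. Below is a minimal sketch (pure Python, illustrative values): gradient descent on a 2D quadratic whose curvatures are 1 and 100, so the stability boundary sits at $\eta = 2/100 = 0.02$.

```python
# Gradient descent on L(x) = 0.5*(a1*x1^2 + a2*x2^2); the Hessian eigenvalues
# are a1 and a2, so the stability boundary is lr = 2 / max(a1, a2) = 0.02 here.
def gd_final_distance(lr, steps=500, curvatures=(1.0, 100.0), x0=(1.0, 1.0)):
    x = list(x0)
    for _ in range(steps):
        x = [xi - lr * ai * xi for xi, ai in zip(x, curvatures)]
    return sum(xi * xi for xi in x) ** 0.5

print(gd_final_distance(0.019))  # just below the boundary: converges
print(gd_final_distance(0.021))  # just above: the steep direction blows up
print(gd_final_distance(1e-4))   # far too small: barely moves along the gentle direction
```

Note that the gentle direction (curvature 1) is what limits speed: at a tiny LR the steep direction converges but the flat one barely moves, which is exactly the "too small stalls" regime.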
Why schedules help
In real networks, curvature and gradient noise change over training. A schedule typically provides:
- Warmup: stabilizes early training and enables larger peak LR
- Stable phase: efficient progress at a good LR
- Decay / cooldown: reduces noise and refines the final solution
Common choices:
- Warmup + cosine decay
- Warmup – stable – decay (WSD)
Why Batch Size Affects Learning Rate
Gradient noise model
Mini-batch gradient can be viewed as the true gradient plus zero-mean noise:

$$\hat{g}_B = \nabla L(\theta) + \epsilon, \qquad \mathrm{Var}(\epsilon) \propto \frac{1}{B}$$

where $B$ is the batch size: averaging more examples cancels more noise.
Two effects:
- Good: Noise helps escape sharp local minima
- Bad: Noise makes large step sizes unstable
Linear scaling rule (empirical):
- If you multiply the batch size by $k$, multiply the LR by $k$
- But add warmup to stabilize early training
Why warmup? Early training is chaotic (parameters uninitialized, high curvature). Warmup lets the model enter a "good region" before cranking up the LR.
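The $1/B$ variance scaling behind the linear rule can be checked directly. A minimal sketch with synthetic per-example gradients (true gradient 1.0 plus unit Gaussian noise; the values are illustrative, not from a real model):

```python
import random

def minibatch_grad_variance(batch_size, num_trials=20000, seed=0):
    """Variance of the mean of `batch_size` noisy per-example gradients."""
    rng = random.Random(seed)
    true_grad = 1.0
    means = []
    for _ in range(num_trials):
        g = sum(true_grad + rng.gauss(0, 1) for _ in range(batch_size)) / batch_size
        means.append(g)
    mu = sum(means) / num_trials
    return sum((m - mu) ** 2 for m in means) / num_trials

v8, v64 = minibatch_grad_variance(8), minibatch_grad_variance(64)
# Growing the batch 8x should shrink the gradient-noise variance by roughly 8x:
print(round(v8 / v64))
```

Less noise per step is what lets you take proportionally larger steps, which is the intuition behind scaling LR with batch size.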
Momentum: The Hidden LR Amplifier
SGD + Momentum:

$$v_{t+1} = \mu v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

where $\mu$ is the momentum coefficient (typically 0.9).

Intuition: You're pushing a shopping cart downhill:
- Each push adds to the speed the cart already has
- Repeated pushes in the same direction build up a large velocity
Critical insight: Under a roughly constant gradient the velocity converges to $\nabla L/(1-\mu)$, so momentum amplifies the effective step size by a factor of $1/(1-\mu)$ (10× at $\mu = 0.9$); you often need a smaller LR with momentum than without.
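The amplification can be verified in a few lines: under a constant gradient $g$, the heavy-ball velocity $v \leftarrow \mu v + g$ converges to $g/(1-\mu)$, so the steady-state step is $\eta\,g/(1-\mu)$. A minimal sketch:

```python
def steady_state_velocity(mu, grad=1.0, steps=200):
    """Heavy-ball velocity v <- mu*v + grad under a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = mu * v + grad
    return v

print(steady_state_velocity(0.9))   # ~10: the effective step is ~10x the raw gradient
print(steady_state_velocity(0.0))   # 1.0: plain SGD, no amplification
```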
Adaptive Optimizers: Per-Parameter Learning Rates
Adam core formula:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

$$\theta_{t+1} = \theta_t - \eta \,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments. Key insight: The effective LR is roughly $\eta/\sqrt{\hat{v}_t}$ per parameter, so parameters with large or noisy gradients automatically take smaller steps.
Analogy: Same car on different roads. Bumpy roads (high gradient variance) automatically reduce effective LR to prevent skidding.
Why Adam still needs warmup
Even with adaptive scaling, early training has:
- Unstable statistics ($m_t$, $v_t$ not yet converged)
- High preconditioned sharpness (effective curvature after Adam's scaling)
Solution: Warmup from small LR to target LR over 1-5% of steps.
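Adam's per-parameter scaling can be seen with a single parameter: for a near-constant gradient $g$, $\hat{m} \approx g$ and $\sqrt{\hat{v}} \approx |g|$, so the step magnitude is roughly $\eta$ regardless of the gradient's scale. A minimal single-parameter sketch (not the `torch.optim.Adam` implementation):

```python
def adam_step_sizes(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Return |update| per step for one parameter under Adam."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        steps.append(abs(lr * m_hat / (v_hat ** 0.5 + eps)))
    return steps

small = adam_step_sizes([0.01] * 100)
large = adam_step_sizes([100.0] * 100)
# Constant gradients of very different scales give nearly identical step sizes (~lr):
print(round(small[-1], 6), round(large[-1], 6))
```

This scale invariance is exactly why a single global $\eta$ like 3e-4 transfers across so many architectures with Adam.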
LR Schedules: From Old-School to Modern LLMs
Constant LR
Pros: Simple.
Cons: Either too slow early or too noisy late (can't have both).
Step decay
Pros: Easy to implement.
Cons: Abrupt changes can cause loss spikes.
Cosine decay (most popular in deep learning)
Intuition: Slow decay early (exploration), fast decay late (convergence).
Typical setup: Warmup to the peak LR, then cosine-decay to a small floor (often 10% of peak, or 0) over the remaining steps.
WSD (Warmup – Stable – Decay): modern large-model default
Structure:
- Warmup: ramp up to the peak LR
- Stable: hold at the peak LR for most of training
- Decay/cooldown: linearly (or otherwise) drop toward the minimum LR in the final 10-20% of steps
Why popular?
- More resumable (you can extend training without redesigning schedule)
- Cooldown phase often shows a sharp loss drop (model finally "fine-tunes")
Practical Workflow: From "It Runs" to "It Works"
Step 1: Identify the failure mode
Training "fails" in 3 ways:
- Immediate divergence: Loss → NaN/inf within a few steps
- High oscillation: Loss bounces around, no consistent descent
- Stuck plateau: Loss barely moves, validation accuracy flat
Diagnosis:
- (1) and (2): LR too high, insufficient warmup, or missing gradient clipping
- (3): LR too small, schedule decays too fast, or batch too noisy
Step 2: Run an LR range test
Classic method (fast.ai style):
- Start with very small LR (e.g., 1e-7)
- Exponentially increase LR each step (e.g., multiply by 1.1)
- Stop when loss explodes
- Plot LR vs loss curve
Interpretation:
- Loss decreases → LR is safe
- Loss starts increasing → approaching stability boundary
- Pick 0.3-1× the "edge" as your peak LR
Code example (PyTorch)

A minimal sketch (the 1.1× growth factor and the "loss > 4× best" stopping rule are common conventions, not a fixed API):

```python
import math

def lr_range_test(model, loader, loss_fn, optimizer, lr_min=1e-7, lr_max=10, num_steps=200):
    """Exponentially ramp the LR each step; record (lr, loss) until the loss explodes."""
    mult = (lr_max / lr_min) ** (1.0 / num_steps)
    lr, best_loss, history = lr_min, float("inf"), []
    data_iter = iter(loader)
    for _ in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, y = next(data_iter)
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        best_loss = min(best_loss, loss.item())
        if math.isnan(loss.item()) or loss.item() > 4 * best_loss:
            break  # past the stability boundary
        lr *= mult
    return history  # plot lr vs loss to find the "edge"
```
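The "pick 0.3-1× the edge" rule can also be automated once you have the (lr, loss) pairs from a range test. A sketch (the helper name `suggest_peak_lr` and the 4× explosion threshold are my own conventions, not a fast.ai API):

```python
def suggest_peak_lr(history, backoff=0.3):
    """history: list of (lr, loss) pairs with lr increasing; returns a peak-LR suggestion."""
    best = float("inf")
    clean = []
    for lr, loss in history:
        if loss != loss or loss > 4 * best:  # NaN or explosion: stop at the boundary
            break
        best = min(best, loss)
        clean.append((lr, loss))
    edge_lr = min(clean, key=lambda p: p[1])[0]  # LR where the loss bottomed out
    return backoff * edge_lr

demo = [(1e-5, 2.0), (1e-4, 1.5), (1e-3, 1.0), (1e-2, 0.8), (1e-1, 3.5)]
print(suggest_peak_lr(demo))  # 0.3 x the "edge" found at lr = 1e-2
```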
Step 3: Choose a schedule
For mid-size models (< 1B params, < 1 week training):
- Default: Warmup + cosine (simple, robust)
For LLM pretraining (long runs):
- Default: WSD (more flexible for resuming/extending)
Warmup duration:
- Small models: 1-2% of steps
- LLMs / large batch: 5-10% of steps
Cooldown duration (WSD only):
- Typically 10-20% of total steps
- Should see a "loss elbow" when cooldown starts
Step 4: Tune the "3-way coupling"
LR, batch size, and weight decay are highly coupled:
| Issue | Don't only adjust LR | Try also |
|---|---|---|
| Training unstable | ❌ Lower LR blindly | ✅ Add gradient clipping, increase warmup, add weight decay |
| Loss stuck high | ❌ Raise LR blindly | ✅ Increase batch size (reduce noise), check for bugs |
| Overfitting | ❌ Lower LR only | ✅ Increase weight decay, add dropout, use data augmentation |
Analogy: LR is the gas pedal, batch size is road friction, weight decay is a gentle brake. Adjusting only one rarely fixes the problem.
Troubleshooting Checklist
Problem 1: Loss immediately explodes (NaN/inf)
Priority order:
- Lower peak LR by 10× (e.g., 3e-4 → 3e-5)
- Increase warmup (0 → 5% steps)
- Add gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`
- Check mixed precision: ensure loss scaling is enabled (`torch.cuda.amp.GradScaler`)
- Increase weight decay (especially for LLMs)
Problem 2: Loss decreases very slowly
Common causes:
- LR too small
- Schedule decays too aggressively
- Batch too small (high gradient noise)
- Data/label issue (not LR-related)
Solutions:
- Run LR range test to find safe upper bound
- Use WSD with longer stable phase
- Increase batch size (if memory allows)
- Check data pipeline (labels correct? augmentation too strong?)
Problem 3: Loss oscillates wildly
Common causes:
- LR too high
- Momentum too high
- Weight decay mismatched with normalization
Solutions:
- Lower peak LR
- Reduce momentum (e.g., 0.9 → 0.8)
- Add gradient clipping
- Check optimizer-norm interaction (e.g., AdamW + LayerNorm is safe, but SGD + BatchNorm can be tricky)
Problem 4: Validation loss diverges from training loss
Cause: Overfitting (not directly LR-related, but LR affects regularization)
Solutions:
- Increase weight decay
- Lower peak LR slightly (slower training can reduce overfitting)
- Add dropout, label smoothing, or data augmentation
- Early stopping
Recent Research: What's New in LR Tuning?
2023: Learning-rate-free optimization (D-Adaptation)
Idea: Automatically estimate the LR scale without manual tuning.
How: Theory-driven method that estimates "distance to optimum" during training.
When to use: Prototyping, grid search reduction.
Reference: Learning Rate-Free Learning by D-Adaptation (Meta, 2023)
2024: Schedule-Free AdamW
Problem: Most schedules require knowing the total number of steps in advance.
Solution: Combine scheduling + iterate averaging to achieve schedule-like performance without explicit decay.
Benefit: Can extend training mid-run without redesigning the schedule.
Reference: Schedule-Free AdamW (arXiv:2405.15682)
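The iterate-averaging idea can be sketched in isolation (this is Polyak-style averaging of SGD iterates, not the full Schedule-Free AdamW algorithm): run constant-LR SGD on a noisy quadratic and compare the last iterate against the running average of all iterates.

```python
import random

def sgd_last_vs_avg(lr=0.1, steps=2000, noise=1.0, seed=0):
    """Constant-LR SGD on L(x) = 0.5*x^2 with noisy gradients; returns (|last|, |avg|)."""
    rng = random.Random(seed)
    x, x_sum = 5.0, 0.0
    for _ in range(steps):
        g = x + rng.gauss(0, noise)  # noisy gradient of 0.5*x^2
        x -= lr * g
        x_sum += x
    return abs(x), abs(x_sum / steps)

results = [sgd_last_vs_avg(seed=s) for s in range(20)]
mean_last = sum(r[0] for r in results) / 20
mean_avg = sum(r[1] for r in results) / 20
# The averaged iterate sits much closer to the optimum than the noisy last iterate:
print(mean_avg < mean_last)
```

The last iterate keeps bouncing in a noise ball whose radius is set by the constant LR; the average cancels that noise, which is roughly the effect an explicit decay phase would otherwise provide.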
2024: Why Warmup Helps (New Theory)
Old view: Warmup helps Adam's statistics stabilize.
New view: Warmup lets the model enter a region where it can tolerate larger LR (lower effective sharpness).
Implication: Warmup is not just about "waiting for statistics"— it's about shaping the optimization landscape.
Reference: Why Warmup the Learning Rate? (arXiv:2406.09405)
2024: Power Scheduler (Batch/Token Agnostic)
Problem: Optimal LR changes when you change batch size or training tokens.
Solution: Use power-law relationships between LR, batch size, and tokens to design transferable schedules.
Benefit: Less retuning when scaling up/down.
Reference: Power Scheduler (arXiv:2408.13359)
2024-2025: Small Models Reproduce LLM Instabilities
Insight: Many "LLM bugs" (e.g., loss spikes) can be reproduced in small models by using higher LR.
Benefit: Debug instabilities at 1/100 the cost.
Reference: Small-scale proxies for large-scale Transformer instabilities (arXiv:2309.14322)
Comparison: Cosine vs WSD vs Schedule-Free
| Schedule | Pros | Cons | Best For |
|---|---|---|---|
| Cosine | Simple, smooth, well-tested | Requires knowing total steps in advance | Fixed-length runs |
| WSD | Resumable, clear phases | Need to choose cooldown timing | Long/resumable training |
| Schedule-Free | No total-step count needed | Newer, less battle-tested | Research, variable budgets |
Rule of thumb:
- Fixed training budget → Cosine
- May resume/extend → WSD
- Prototyping / unknown budget → Schedule-Free
Code: Implementing Warmup + Cosine and WSD
Warmup + Cosine
A minimal sketch (the function name is illustrative): linear warmup to `lr_max`, then cosine decay to `lr_min`.

```python
import math

def lr_warmup_cosine(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```
Warmup + Stable + Decay (WSD)
```python
def lr_wsd(step, total_steps, warmup_steps, cooldown_steps, lr_max, lr_min=0.0):
    """Linear warmup, hold at lr_max, then linear cooldown to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return lr_max
    progress = (step - cooldown_start) / max(1, cooldown_steps)
    return lr_max + (lr_min - lr_max) * min(1.0, progress)
```
Usage in training loop
A sketch of wiring a schedule function into a training loop (assuming a classification loss; adapt the loss to your task):

```python
import torch

def train_one_epoch(model, loader, optimizer, step_offset, total_steps, schedule_fn, device="cuda"):
    model.train()
    step = step_offset
    for x, y in loader:
        lr = schedule_fn(step)  # e.g., lambda s: lr_wsd(s, total_steps, ...)
        for group in optimizer.param_groups:
            group["lr"] = lr
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        step += 1
    return step  # pass back as step_offset for the next epoch
```
One-Page Cheat Sheet
AdamW Default Recipe
- Schedule: Warmup + cosine or warmup + WSD
- Warmup: 1-5% of total steps
- Cooldown (WSD): 10-20% of total steps
- Gradient clipping: `max_norm=1.0` (especially for LLMs)
- Weight decay: 0.01-0.1 (tune jointly with LR)
3 Key Metrics to Monitor
- Training stability: Watch for loss spikes / gradient norm explosions
- Effective step size: Is loss decreasing steadily or stuck?
- LR sensitivity: Small LR changes → big result changes = you're near instability
When to Use Cosine vs WSD
- Fixed training length, no resume: Cosine (simple, robust)
- May extend training / multiple budgets: WSD (resumable)
- Want minimal tuning: Schedule-Free (research, prototyping)
Summary: LR in 5 Steps
- Run LR range test → Find stability boundary
- Pick peak LR = 0.3-1× the "edge"
- Add warmup (1-5% steps, longer for LLMs)
- Choose schedule (Cosine for fixed, WSD for resumable)
- Couple with batch/weight decay (don't tune LR in isolation)
Key hyperparameters:
- Peak LR (3e-4 typical for AdamW)
- Warmup fraction (0.01-0.05)
- Cooldown fraction (0.1-0.2 for WSD)
Common pitfalls:
- Immediate divergence → Lower LR, add warmup, clip gradients
- Slow training → Raise LR (run range test), extend stable phase
- High oscillation → Lower LR or momentum, add clipping
Post metadata:
- Post title: Learning Rate: From Basics to Large-Scale Training (2026 Complete Guide)
- Post author: Chen Kai
- Create time: 2024-08-20 00:00:00
- Post link: https://www.chenk.top/en/learning-rate-guide/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.