Learning Rate: From Basics to Large-Scale Training (2026 Complete Guide)
Chen Kai

Learning rate (LR) is the knob that most often decides whether training converges, crawls, or blows up. This post builds an actionable mental model — from the simplest quadratic loss to modern large-scale training recipes — so you can choose schedules (warmup/cosine/WSD), debug instability, and tune LR systematically. We cover the math (why "too big explodes, too small stalls"), practical workflows (LR range test, schedule selection), failure mode diagnosis, recent research (schedule-free, power scheduler, warmup theory), and a troubleshooting checklist for common issues.

The One-Sentence Definition

Learning rate controls how far you move in the direction suggested by the gradient each step.

A typical update is:

$$\theta_{t+1} = \theta_t - \eta \, g_t$$

where $\eta$ is the learning rate and $g_t$ is often a mini-batch (stochastic) gradient.

Core trade-off:

Large $\eta$ → fast but unstable; small $\eta$ → stable but slow (or stuck)


Minimal Math: Why "Too Big Explodes, Too Small Stalls"

1D quadratic: the simplest intuition

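Consider a 1D quadratic with curvature $a > 0$:

$$L(\theta) = \frac{a}{2}\,\theta^2$$

Gradient descent gives:

$$\theta_{t+1} = \theta_t - \eta a\,\theta_t = (1 - \eta a)\,\theta_t$$

Stability requires:

$$|1 - \eta a| < 1 \iff 0 < \eta < \frac{2}{a}$$

Intuition: If $\eta$ is too large, you overshoot the valley and bounce back with increasing amplitude.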
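
To make the threshold concrete, here is a minimal sketch that runs gradient descent on the quadratic $L(\theta) = \frac{a}{2}\theta^2$ with three step sizes. The function name `gd_quadratic` and the values (`a = 2`, so the stability threshold is `2 / a = 1`) are illustrative assumptions, not part of any library.

```python
def gd_quadratic(a, lr, theta0=1.0, steps=20):
    """Run gradient descent on L(theta) = 0.5 * a * theta**2; return the final |theta|."""
    theta = theta0
    for _ in range(steps):
        grad = a * theta            # dL/dtheta
        theta = theta - lr * grad   # same as theta <- (1 - lr * a) * theta
    return abs(theta)

a = 2.0  # curvature; the stability threshold is 2 / a = 1.0
small = gd_quadratic(a, lr=0.01)    # stable but slow: factor 0.98 per step
good = gd_quadratic(a, lr=0.4)      # stable and fast: factor 0.2 per step
too_big = gd_quadratic(a, lr=1.1)   # unstable: factor -1.2 per step, |theta| grows
```

After 20 steps the small step size has barely moved toward zero, the moderate one has essentially converged, and the one above the threshold has grown well past its starting value.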

Multi-dimensional case: most curved direction controls stability

In higher dimensions with Hessian $H$:

$$\eta < \frac{2}{\lambda_{\max}(H)}$$

Key insight: The most sharply curved direction (largest Hessian eigenvalue) determines the stability boundary.

Analogy: You're walking in a valley. Most directions are gentle slopes, but one direction is a cliff edge. Your step size must be small enough to not fall off that cliff.
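
A short derivation of why the largest eigenvalue sets the limit, assuming the loss is locally quadratic around a minimum $\theta^*$ with a symmetric positive-definite Hessian $H$. The eigenvalues $\lambda_i$ and eigenvectors $u_i$ are notation introduced here for the sketch.

```latex
\begin{aligned}
L(\theta) &= \tfrac{1}{2}(\theta - \theta^*)^\top H\,(\theta - \theta^*), \\
\theta_{t+1} - \theta^* &= (I - \eta H)\,(\theta_t - \theta^*), \\
u_i^\top(\theta_{t+1} - \theta^*) &= (1 - \eta\lambda_i)\; u_i^\top(\theta_t - \theta^*), \\
|1 - \eta\lambda_i| < 1 \ \ \text{for all } i &\iff 0 < \eta < \frac{2}{\lambda_{\max}(H)}.
\end{aligned}
```

Each eigen-direction behaves like its own 1D quadratic. With eigenvalues 1 and 100, for example, the step size must stay below 0.02, so the gentle direction shrinks by at most about 2% per step. That gap is why badly conditioned problems train slowly.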

Why schedules help

In real networks, curvature and gradient noise change over training. A schedule typically provides:

  • Warmup: stabilizes early training and enables larger peak LR
  • Stable phase: efficient progress at a good LR
  • Decay / cooldown: reduces noise and refines the final solution

Common choices:

  • Warmup + cosine decay
  • Warmup – stable – decay (WSD)

Why Batch Size Affects Learning Rate

Gradient noise model

Mini-batch gradient can be viewed as:

$$g_t = \nabla L(\theta_t) + \xi_t$$

where $\xi_t$ is noise (its variance decreases with batch size).

Two effects:

  • Good: Noise helps escape sharp local minima
  • Bad: Noise makes large step sizes unstable
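
The claim that noise shrinks with batch size can be checked with a toy simulation. This sketch assumes each per-example gradient is the true gradient plus independent Gaussian noise, which is a simplification. The function name and constants are hypothetical.

```python
import random
import statistics

def minibatch_grad_std(batch_size, trials=2000, true_grad=1.0, noise_std=4.0, seed=0):
    """Std of a mini-batch gradient estimate under an assumed toy noise model:
    each per-example gradient is true_grad plus Gaussian noise."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        batch = [true_grad + rng.gauss(0.0, noise_std) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)  # mini-batch average
    return statistics.pstdev(estimates)

std_b4 = minibatch_grad_std(4)     # theory: noise_std / sqrt(4)  = 2.0
std_b64 = minibatch_grad_std(64)   # theory: noise_std / sqrt(64) = 0.5
```

Under this model the standard deviation falls as $1/\sqrt{B}$, so a 16× larger batch gives a 4× cleaner gradient estimate.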

Linear scaling rule (empirical):

  • If you increase batch size by $k\times$, multiply LR by $k$
  • But add warmup to stabilize early training
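
The rule above can be sketched in a few lines. The helper names `scaled_lr` and `warmup_lr` and the example numbers are assumptions for illustration. The rule itself is empirical and tends to break down at very large batch sizes.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: LR grows in proportion to batch size."""
    return base_lr * new_batch / base_batch

def warmup_lr(step, warmup_steps, peak_lr):
    """Linear ramp from peak_lr / warmup_steps up to peak_lr, then hold."""
    return peak_lr * min(1.0, (step + 1) / max(1, warmup_steps))

peak = scaled_lr(base_lr=0.1, base_batch=256, new_batch=1024)  # 4x batch -> 4x LR
first = warmup_lr(0, warmup_steps=100, peak_lr=peak)           # first warmup step
last = warmup_lr(99, warmup_steps=100, peak_lr=peak)           # end of warmup
```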

Why warmup? Early training is chaotic (parameters are still at their random initialization, curvature is high). Warmup lets the model enter a "good region" before cranking up the LR.


Momentum: The Hidden LR Amplifier

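SGD + Momentum

$$v_{t+1} = \mu\, v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}$$

where $\mu \in [0, 1)$ (typically 0.9).

Intuition: You're pushing a shopping cart downhill:

  • $g_t$: current slope
  • $v_t$: cart's velocity (has inertia)
  • $\mu$: how much the cart "remembers" past momentum
  • $\eta$: conversion from velocity to displacement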

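Critical insight: Momentum amplifies the effective step size. Under a steady gradient the velocity builds up to about 1/(1 − 0.9) = 10× the raw gradient at the typical coefficient of 0.9, so you often need a smaller LR with momentum than without.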
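
A quick way to see the amplification is to apply a constant gradient and compare how far the parameter travels with and without momentum. This is a sketch of the heavy-ball form `v <- mu * v + grad`; the function name and constants are illustrative assumptions.

```python
def momentum_displacement(lr, mu, grad=1.0, steps=200):
    """Distance moved under a constant gradient with heavy-ball momentum:
    v <- mu * v + grad;  theta <- theta - lr * v."""
    v, theta = 0.0, 0.0
    for _ in range(steps):
        v = mu * v + grad
        theta -= lr * v
    return -theta, v

dist_plain, v_plain = momentum_displacement(lr=0.01, mu=0.0)  # plain SGD
dist_mom, v_mom = momentum_displacement(lr=0.01, mu=0.9)      # SGD + momentum
```

The velocity saturates at `grad / (1 - mu)`, so over a long run the momentum version covers close to 10× the distance at the same LR.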


Adaptive Optimizers: Per-Parameter Learning Rates

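Adam core formula

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

where $\hat m_t$ and $\hat v_t$ are the bias-corrected moment estimates.

Key insight: The effective LR is roughly $\eta / (\sqrt{\hat v_t} + \epsilon)$, so each parameter gets a different step size based on its gradient history.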

Analogy: Same car on different roads. Bumpy roads (high gradient variance) automatically reduce effective LR to prevent skidding.
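
The per-parameter rescaling can be illustrated with a scalar version of the standard Adam update. This sketch feeds a constant gradient of two very different magnitudes and records the update size. The function name is hypothetical and the hyperparameters are the commonly used defaults.

```python
import math

def adam_last_update(grad_scale, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=10):
    """Apply Adam to one scalar parameter with a constant gradient
    and return the size of the last update."""
    m = v = 0.0
    theta = 0.0
    update = 0.0
    for t in range(1, steps + 1):
        g = grad_scale
        m = beta1 * m + (1 - beta1) * g         # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g     # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
        theta -= update
    return update

tiny_grad = adam_last_update(0.01)   # gradient magnitude 1e-2
huge_grad = adam_last_update(100.0)  # gradient magnitude 1e2
```

Both runs move by about `lr` per step even though the gradients differ by four orders of magnitude. Normalizing by the gradient's own scale is what makes the step size per-parameter.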

Why Adam still needs warmup

Even with adaptive scaling, early training has:

  • Unstable statistics (Adam's moment estimates $m_t$, $v_t$ have not yet converged)
  • High preconditioned sharpness (effective curvature after Adam's scaling)

Solution: Warmup from small LR to target LR over 1-5% of steps.


LR Schedules: From Old-School to Modern LLMs

Constant LR

Pros: Simple.

Cons: Either too slow early or too noisy late (can't have both).

Step decay

Pros: Easy to implement.

Cons: Abrupt changes can cause loss spikes.
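
Step decay is easy to write as a pure function of the step counter. A minimal sketch follows; the name `lr_step_decay`, the drop interval, and the factor `gamma` are illustrative assumptions.

```python
def lr_step_decay(step, lr_max, drop_every, gamma=0.1):
    """Multiply the LR by gamma every drop_every steps."""
    return lr_max * gamma ** (step // drop_every)

start = lr_step_decay(0, lr_max=0.1, drop_every=30)         # before any drop
after_one = lr_step_decay(30, lr_max=0.1, drop_every=30)    # after the first drop
after_two = lr_step_decay(65, lr_max=0.1, drop_every=30)    # after the second drop
```

Each drop here changes the LR by 10× in a single step, which is the abrupt change the "Cons" line refers to.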

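Cosine decay

$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

Intuition: The LR falls slowly at first (exploration), fastest in mid-run, and flattens out near $\eta_{\min}$ at the end (convergence).

Typical setup: Warmup to $\eta_{\max}$, then cosine decay to $\eta_{\min}$.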

WSD (Warmup – Stable – Decay): modern large-model default

Structure:

  • Warmup: ramp up to $\eta_{\max}$
  • Stable: hold at $\eta_{\max}$ for most of training
  • Decay/cooldown: linearly (or otherwise) drop to $\eta_{\min}$ in the final 10-20%

Why popular?

  • More resumable (you can extend training without redesigning schedule)
  • Cooldown phase often shows a sharp loss drop (model finally "fine-tunes")

Practical Workflow: From "It Runs" to "It Works"

Step 1: Identify the failure mode

Training "fails" in 3 ways:

  1. Immediate divergence: Loss → NaN/inf within a few steps
  2. High oscillation: Loss bounces around, no consistent descent
  3. Stuck plateau: Loss barely moves, validation accuracy flat

Diagnosis:

  • (1), (2): LR too high, insufficient warmup, or missing gradient clipping
  • (3): LR too small, schedule decays too fast, or batch too noisy
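
The three failure modes can be told apart roughly from a logged loss history. The sketch below is a heuristic only. The function name, the window size, and the thresholds (1% improvement, 10% spread) are assumptions chosen for illustration and would need tuning on real runs.

```python
import math

def diagnose(losses, window=50):
    """Heuristic label for a loss history:
    'divergence', 'oscillation', 'plateau', or 'ok'."""
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "divergence"
    recent = losses[-window:]
    if len(recent) < 4:
        return "ok"  # not enough data to judge
    half = len(recent) // 2
    first_mean = sum(recent[:half]) / half
    second_mean = sum(recent[half:]) / (len(recent) - half)
    # Relative improvement between the two halves of the window
    improvement = (first_mean - second_mean) / max(abs(first_mean), 1e-12)
    mean = sum(recent) / len(recent)
    spread = (sum((x - mean) ** 2 for x in recent) / len(recent)) ** 0.5
    if improvement < 0.01 and spread > 0.1 * abs(mean):
        return "oscillation"  # no progress and large swings
    if improvement < 0.01:
        return "plateau"      # no progress, little movement
    return "ok"

label_nan = diagnose([2.0, 1.5, float("nan")])
label_flat = diagnose([1.0] * 100)
label_bounce = diagnose([1.0 if i % 2 == 0 else 2.0 for i in range(100)])
label_ok = diagnose([1.0 / (1 + 0.1 * i) for i in range(100)])
```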

Step 2: Run an LR range test

Classic method (fast.ai style):

  • Start with very small LR (e.g., 1e-7)
  • Exponentially increase LR each step (e.g., multiply by 1.1)
  • Stop when loss explodes
  • Plot LR vs loss curve

Interpretation:

  • Loss decreases → LR is safe
  • Loss starts increasing → approaching stability boundary
  • Pick 0.3-1× the "edge" as your peak LR

Code example (PyTorch)

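import math

def lr_range_test(model, loader, loss_fn, optimizer, lr_min=1e-7, lr_max=10, num_steps=200):
    """Sweep the LR exponentially from lr_min to lr_max, recording the loss at each step.

    Note: this updates the model's weights, so run it on a copy or reload afterwards.
    """
    model.train()
    mult = (lr_max / lr_min) ** (1 / (num_steps - 1))  # per-step growth factor
    lr = lr_min
    for g in optimizer.param_groups:
        g["lr"] = lr

    losses, lrs = [], []
    it = iter(loader)
    for step in range(num_steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)  # restart the loader if it runs out
            x, y = next(it)

        optimizer.zero_grad(set_to_none=True)
        pred = model(x)
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        lrs.append(lr)

        # Stop once the loss is NaN/inf; later points carry no information
        if not math.isfinite(losses[-1]):
            break

        lr *= mult
        for g in optimizer.param_groups:
            g["lr"] = lr

    return lrs, losses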
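
Once a range test has produced paired lists of LRs and losses, the "edge" can be picked programmatically. This is a heuristic sketch: the function name `suggest_peak_lr`, the blow-up factor of 1.5, and the safety factor of 0.3 are assumptions. In practice the losses are usually smoothed first because raw mini-batch losses are noisy.

```python
def suggest_peak_lr(lrs, losses, safety=0.3, blowup=1.5):
    """Find the 'edge' (the last LR before the loss exceeds blowup * best loss so far)
    and return (edge_lr, safety * edge_lr)."""
    best = float("inf")
    edge = lrs[0]
    for lr, loss in zip(lrs, losses):
        # loss != loss is True only for NaN
        if loss != loss or loss > blowup * best:
            break
        best = min(best, loss)
        edge = lr
    return edge, safety * edge

# Hypothetical range-test output: loss improves, then explodes at LR = 1.0
example_lrs = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
example_losses = [2.3, 2.0, 1.2, 0.9, 5.0]
edge, peak = suggest_peak_lr(example_lrs, example_losses)
```

Here the edge is the LR just before the blow-up, and the suggested peak sits at 0.3× that value, the conservative end of the 0.3-1× range.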

Step 3: Choose a schedule

For mid-size models (< 1B params, < 1 week training):

  • Default: Warmup + cosine (simple, robust)

For LLMs (pretraining, long runs):

  • Default: WSD (more flexible for resuming/extending)

Warmup duration:

  • Small models: 1-2% of steps
  • LLMs / large batch: 5-10% of steps

Cooldown duration (WSD only):

  • Typically 10-20% of total steps
  • Should see a "loss elbow" when cooldown starts
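
Turning these fractions into concrete step counts is simple arithmetic. A small sketch, assuming a hypothetical helper `wsd_phase_steps` and default fractions picked from the ranges above:

```python
def wsd_phase_steps(total_steps, warmup_frac=0.02, cooldown_frac=0.15):
    """Split total_steps into (warmup, stable, cooldown) step counts."""
    warmup = max(1, round(total_steps * warmup_frac))
    cooldown = max(1, round(total_steps * cooldown_frac))
    stable = total_steps - warmup - cooldown
    if stable < 0:
        raise ValueError("warmup + cooldown exceed total_steps")
    return warmup, stable, cooldown

warmup, stable, cooldown = wsd_phase_steps(100_000)
```

For a 100,000-step run this gives 2,000 warmup steps, 83,000 stable steps, and 15,000 cooldown steps.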

Step 4: Tune the "3-way coupling"

LR, batch size, and weight decay are highly coupled:

Issue | Don't only adjust LR | Try also
Training unstable | ❌ Lower LR blindly | ✅ Add gradient clipping, increase warmup, add weight decay
Loss stuck high | ❌ Raise LR blindly | ✅ Increase batch size (reduce noise), check for bugs
Overfitting | ❌ Lower LR only | ✅ Increase weight decay, add dropout, use data augmentation

Analogy: LR is the gas pedal, batch size is road friction, weight decay is a gentle brake. Adjusting only one rarely fixes the problem.


Troubleshooting Checklist

Problem 1: Loss immediately explodes (NaN/inf)

Priority order:

  1. Lower peak LR by 10× (e.g., 3e-4 → 3e-5)
  2. Increase warmup (0 → 5% steps)
  3. Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  4. Check mixed precision: with fp16, ensure loss scaling is enabled (torch.cuda.amp.GradScaler, or torch.amp.GradScaler in recent PyTorch releases)
  5. Increase weight decay (especially for LLMs)
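
To show what global-norm clipping does to an oversized gradient, here is a pure-Python sketch of the idea behind `clip_grad_norm_`. It is an illustration on a flat list of numbers, not PyTorch's implementation, and the function name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a list of gradient values so their joint L2 norm is at most max_norm.
    Returns (clipped_grads, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Shrink only when the norm exceeds max_norm; never scale up
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])   # norm 5.0, gets scaled down
untouched, _ = clip_by_global_norm([0.3, 0.4])           # norm 0.5, left as-is
```

Clipping rescales the whole gradient vector, so its direction is preserved and only the step length is capped. Logging the pre-clip norm is also a cheap way to spot spikes before they turn into NaNs.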

Problem 2: Loss decreases very slowly

Common causes:

  • LR too small
  • Schedule decays too aggressively
  • Batch too small (high gradient noise)
  • Data/label issue (not LR-related)

Solutions:

  1. Run LR range test to find safe upper bound
  2. Use WSD with longer stable phase
  3. Increase batch size (if memory allows)
  4. Check data pipeline (labels correct? augmentation too strong?)

Problem 3: Loss oscillates wildly

Common causes:

  • LR too high
  • Momentum too high
  • Weight decay mismatched with normalization

Solutions:

  1. Lower peak LR
  2. Reduce the momentum coefficient
  3. Add gradient clipping
  4. Check optimizer-norm interaction (e.g., AdamW + LayerNorm is safe, but SGD + BatchNorm can be tricky)

Problem 4: Validation loss diverges from training loss

Cause: Overfitting (not directly LR-related, but LR affects regularization)

Solutions:

  1. Increase weight decay
  2. Lower peak LR slightly (slower training can reduce overfitting)
  3. Add dropout, label smoothing, or data augmentation
  4. Early stopping

Recent Research: What's New in LR Tuning?

2023: Learning-rate-free optimization (D-Adaptation)

Idea: Automatically estimate the LR scale without manual tuning.

How: Theory-driven method that estimates "distance to optimum" during training.

When to use: Prototyping, grid search reduction.

Reference: Learning Rate-Free Learning by D-Adaptation (Meta, 2023)

2024: Schedule-Free AdamW

Problem: Most schedules require knowing the total number of steps $T$ upfront (e.g., cosine half-period).

Solution: Combine scheduling + iterate averaging to achieve schedule-like performance without explicit decay.

Benefit: Can extend training mid-run without redesigning the schedule.

Reference: Schedule-Free AdamW (arXiv:2405.15682)

2024: Why Warmup Helps (New Theory)

Old view: Warmup helps Adam's statistics stabilize.

New view: Warmup lets the model enter a region where it can tolerate larger LR (lower effective sharpness).

Implication: Warmup is not just about "waiting for statistics"; it moves the model into a region of the loss landscape that tolerates larger steps.

Reference: Why Warmup the Learning Rate? (arXiv:2406.09405)

2024: Power Scheduler (Batch/Token Agnostic)

Problem: Optimal LR changes when you change batch size or training tokens.

Solution: Use power-law relationships between LR, batch size, and tokens to design transferable schedules.

Benefit: Less retuning when scaling up/down.

Reference: Power Scheduler (arXiv:2408.13359)

2023: Small Models Reproduce LLM Instabilities

Insight: Many "LLM bugs" (e.g., loss spikes) can be reproduced in small models by using higher LR.

Benefit: Debug instabilities at 1/100 the cost.

Reference: Small-scale proxies for large-scale Transformer instabilities (arXiv:2309.14322)


Comparison: Cosine vs WSD vs Schedule-Free

Schedule | Pros | Cons | Best For
Cosine | Simple, smooth, well-tested | Requires knowing $T$ upfront | Fixed-length runs
WSD | Resumable, clear phases | Need to choose cooldown timing | Long/resumable training
Schedule-Free | No $T$ needed, minimal tuning | Newer, less battle-tested | Research, variable budgets

Rule of thumb:

  • Fixed training budget → Cosine
  • May resume/extend → WSD
  • Prototyping / unknown budget → Schedule-Free

Code: Implementing Warmup + Cosine and WSD

Warmup + Cosine

import math

def lr_warmup_cosine(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    if step < warmup_steps:
        # Linear warmup
        return lr_max * (step + 1) / max(1, warmup_steps)

    # Cosine decay
    T = max(1, total_steps - warmup_steps)
    t = min(step - warmup_steps, T)  # clamp so the LR stays at lr_min past total_steps
    cos = 0.5 * (1.0 + math.cos(math.pi * t / T))
    return lr_min + (lr_max - lr_min) * cos

Warmup + Stable + Decay (WSD)

def lr_wsd(step, total_steps, warmup_steps, cooldown_steps, lr_max, lr_min=0.0):
    # Warmup: linear ramp to lr_max
    if step < warmup_steps:
        return lr_max * (step + 1) / max(1, warmup_steps)

    # Stable: hold lr_max
    stable_end = total_steps - cooldown_steps
    if step < stable_end:
        return lr_max

    # Cooldown (linear decay to lr_min)
    t = step - stable_end
    T = max(1, cooldown_steps)
    frac = min(1.0, (t + 1) / T)
    return lr_max + (lr_min - lr_max) * frac

Usage in training loop

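import torch

def train_one_epoch(model, loader, optimizer, step_offset, total_steps, schedule_fn, device="cuda"):
    # schedule_fn takes (step, total_steps); bind the other schedule arguments first, e.g.
    # lambda s, T: lr_wsd(s, T, warmup_steps=1000, cooldown_steps=10000, lr_max=3e-4)
    # (the numbers in that example are placeholders)
    model.train()
    step = step_offset
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        # Set the LR for this step on every parameter group
        lr = schedule_fn(step, total_steps)
        for g in optimizer.param_groups:
            g["lr"] = lr

        optimizer.zero_grad(set_to_none=True)
        pred = model(x)
        loss = torch.nn.functional.cross_entropy(pred, y)
        loss.backward()

        # Optional: gradient clipping (common for LLMs)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        step += 1

    # Return the global step so the next epoch continues the schedule
    return step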

One-Page Cheat Sheet

AdamW Default Recipe

  • Schedule: Warmup + cosine, or WSD
  • Warmup: 1-5% of total steps
  • Cooldown (WSD): 10-20% of total steps
  • Gradient clipping: max_norm=1.0 (LLMs)
  • Weight decay: 0.01-0.1 (tune with LR)

3 Key Metrics to Monitor

  1. Training stability: Watch for loss spikes / gradient norm explosions
  2. Effective step size: Is loss decreasing steadily or stuck?
  3. LR sensitivity: Small LR changes → big result changes = you're near instability

When to Use Cosine vs WSD

  • Fixed training length, no resume: Cosine (simple, robust)
  • May extend training / multiple budgets: WSD (resumable)
  • Want minimal tuning: Schedule-Free (research, prototyping)

Summary: LR in 5 Steps

  1. Run LR range test → Find stability boundary
  2. Pick peak LR = 0.3-1× the "edge"
  3. Add warmup (1-5% steps, longer for LLMs)
  4. Choose schedule (Cosine for fixed, WSD for resumable)
  5. Couple with batch/weight decay (don't tune LR in isolation)

Key hyperparameters:

  • Peak LR (3e-4 typical for AdamW)
  • Warmup fraction (0.01-0.05)
  • Cooldown fraction (0.1-0.2 for WSD)

Common pitfalls:

  • Immediate divergence → Lower LR, add warmup, clip gradients
  • Slow training → Raise LR (run range test), extend stable phase
  • High oscillation → Lower LR or momentum, add clipping

Post information:

  • Post author: Chen Kai
  • Create time: 2024-08-20 00:00:00
  • Post link: https://www.chenk.top/en/learning-rate-guide/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless otherwise stated.