Pre-training and fine-tuning have become one of the most successful transfer learning paradigms in modern deep learning. The emergence of BERT in 2018 fundamentally transformed the NLP research landscape, and pre-trained models have achieved tremendous success in computer vision, speech, and multimodal domains. But why does pre-training work? How should we adjust learning rates during fine-tuning? Which layers should be frozen? These questions involve deep theoretical considerations and engineering trade-offs.
This article derives the mathematical foundations of pre-training from first principles, analyzes the loss functions of contrastive learning and masked language models, explains various fine-tuning strategies in detail, and provides a complete industrial-grade BERT fine-tuning implementation with gradient accumulation, mixed-precision training, and learning rate scheduling. We'll see that pre-training essentially learns a powerful prior distribution, while fine-tuning performs Bayesian updates with limited labeled data.
Motivation for Pre-training: Why Pre-train?
From Data Scarcity to Knowledge Transfer
Deep learning models typically require massive amounts of labeled data to achieve good performance. However, in real-world applications, labeled data is often scarce and expensive:
- Medical Imaging Diagnosis: Requires expert radiologist annotations, with costs reaching $100-500 per CT scan
- Legal Text Classification: Requires professional lawyer review, extremely slow annotation speed
- Low-Resource Language Translation: Lack of parallel corpora, difficult to annotate
Yet unlabeled data is extremely abundant - there are terabytes of text, images, and videos on the internet. The core idea of pre-training is to leverage large-scale unlabeled data to learn universal representations, then fine-tune on specific tasks with limited labeled data.
Mathematical Perspective on Pre-training: Bayesian Priors
From a Bayesian perspective, pre-training learns a strong prior distribution. Let $\mathcal{D}_{\text{pre}}$ denote the large unlabeled pre-training corpus and $\mathcal{D}_{\text{task}}$ the small labeled downstream dataset. Then:
- Pre-training: Learn the prior $p(\theta \mid \mathcal{D}_{\text{pre}})$
- Fine-tuning: Perform the Bayesian update $p(\theta \mid \mathcal{D}_{\text{task}}, \mathcal{D}_{\text{pre}}) \propto p(\mathcal{D}_{\text{task}} \mid \theta)\, p(\theta \mid \mathcal{D}_{\text{pre}})$
This explains why pre-training works: when task data is scarce, a strong prior significantly improves posterior estimation quality.
Information-Theoretic Perspective: Feature Reuse
From an information-theoretic perspective, pre-training learns the common structure in data. Let the input space be $\mathcal{X}$ and suppose a family of downstream tasks $T_1, \dots, T_K$ all depend on a shared representation $Z = f_\theta(X)$. By learning features that retain high mutual information $I(X; Z)$ across diverse inputs, pre-training captures structure that each individual task can reuse.
Intuitive Example: Low-level features (edges, textures) and mid-level features (object parts) learned from ImageNet pre-training are useful for many vision tasks. Syntactic and semantic knowledge learned from large-scale text corpus pre-training helps various NLP tasks.
Pre-training vs Training from Scratch: Convergence Speed and Generalization
Experiments show pre-training not only improves final performance but also accelerates convergence. Two reasons:
- Better Initialization: Pre-trained parameters are in low-loss regions of the loss landscape, requiring only local adjustments during fine-tuning
- Regularization Effect: The prior introduced by pre-training constrains the parameter space, preventing overfitting
Formally, let the pre-trained parameters be $\theta_{\text{pre}}$. Fine-tuning solves $\min_\theta \mathcal{L}_{\text{task}}(\theta)$ initialized at $\theta = \theta_{\text{pre}}$; this initialization acts as an implicit regularizer, keeping $\theta$ in a low-loss basin close to $\theta_{\text{pre}}$ (comparable to an explicit penalty $\lambda \|\theta - \theta_{\text{pre}}\|_2^2$).
Self-Supervised Learning: Constructing Pre-training Tasks
The key to pre-training is designing self-supervised learning (SSL) tasks that automatically generate supervisory signals from unlabeled data.
Contrastive Learning
The core idea of contrastive learning is: representations of similar samples should be close, while representations of dissimilar samples should be far apart.
SimCLR Framework
SimCLR is one of the most successful contrastive learning methods in computer vision. Given a batch of $N$ images, each image is transformed by two random augmentations, yielding $2N$ views. For a positive pair $(z_i, z_j)$ (two views of the same image), the loss is
$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$
where $\mathrm{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ is cosine similarity and $\tau$ is a temperature hyperparameter.
Key Intuition: the numerator pulls the positive pair together, while the denominator pushes $z_i$ away from every other view in the batch (the negatives).
Theoretical Foundation of InfoNCE Loss
SimCLR's loss is an instance of the InfoNCE loss. It can be proved that minimizing InfoNCE maximizes a lower bound on mutual information: if positive pairs $(z_i, z_j)$ are drawn from the joint distribution and negatives from the marginals, then $I(z_i; z_j) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$, where $N$ is the number of samples in the contrastive comparison.
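To make the loss concrete, here is a minimal NumPy sketch of the NT-Xent (InfoNCE) objective; the function name and the pairing convention (rows $2k$ and $2k+1$ are the two views of image $k$) are illustrative choices, not SimCLR's reference code.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent / InfoNCE loss over 2N embeddings, where rows 2k and 2k+1
    are the two augmented views of image k (an illustrative convention)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine-similarity space
    sim = z @ z.T / tau                                # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = len(z)
    pos = np.arange(n) ^ 1                             # index of each row's positive partner
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

Because the positive term also appears in the denominator's sum, the loss is always positive, and it shrinks as positive pairs align while negatives spread out.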
MoCo: Momentum Contrastive Learning
SimCLR requires large batch sizes (typically 4096-8192) to provide enough negative samples. MoCo avoids this by maintaining a queue of encoded keys as negatives and a key encoder updated by momentum: $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$ with $m$ close to 1 (e.g., 0.999), which keeps the keys in the queue consistent with each other.
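The two MoCo ingredients can be sketched in plain Python, treating encoder parameters as flat lists; the names `momentum_update` and `NegativeQueue` are illustrative, not MoCo's actual API.

```python
from collections import deque

def momentum_update(query_params, key_params, m=0.999):
    """Key encoder tracks the query encoder as an exponential moving average."""
    return [m * k + (1 - m) * q for q, k in zip(query_params, key_params)]

class NegativeQueue:
    """Fixed-size FIFO queue of encoded keys reused as negatives."""
    def __init__(self, size):
        self.keys = deque(maxlen=size)  # oldest keys are evicted automatically

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)
```

The queue decouples the number of negatives from the batch size: each step contributes a small batch of keys, while the loss compares against the whole queue.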
Masked Language Model
Masked language modeling is the mainstream method for NLP pre-training, first proposed by BERT.
BERT's MLM Task
Given an input sequence $x = (x_1, \dots, x_n)$, randomly select a set of positions $\mathcal{M}$ (15% of tokens) and train the model to recover them: $\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \tilde{x})$, where $\tilde{x}$ is the corrupted sequence.
Details of the 15% Masking Strategy:
- 80% probability: replace with $[\text{MASK}]$
- 10% probability: replace with a random token
- 10% probability: keep the original token
This alleviates the distribution shift between pre-training and fine-tuning (since there is no $[\text{MASK}]$ token in downstream inputs).
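The 80/10/10 scheme above can be sketched in a few lines of plain Python, with tokens as strings and a toy vocabulary; the helper name `mask_tokens` and the vocabulary are our illustrative choices.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_ratio=0.15, rng=None):
    """Corrupt ~mask_ratio of tokens BERT-style; labels mark the originals."""
    rng = rng or random.Random(0)
    out, labels = list(tokens), [None] * len(tokens)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_mask):
        labels[i] = tokens[i]          # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            out[i] = MASK              # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(VOCAB) # 10%: replace with a random token
        # else: 10% keep the token unchanged (but still predict it)
    return out, labels
```

Note that the loss is computed only at positions where `labels` is set, including the 10% that were left unchanged.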
Autoregressive Decomposition of MLM
Although MLM is non-autoregressive (all masked positions are predicted in parallel), its loss decomposes as a sum over the masked positions: $\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \tilde{x})$. Compared with the autoregressive factorization $\sum_i \log p(x_i \mid x_{<i})$, MLM predicts each masked token independently given the bidirectional context, trading an exact factorization for access to both left and right context.
Mathematical Analysis of Masking Strategy
Why choose 15% masking ratio? Too few (e.g., 5%) provides weak learning signal; too many (e.g., 50%) lacks context information. Information-theoretic analysis:
Let the masking ratio be $\rho$ and the sequence length $n$. The learning signal per sequence grows with the number of predicted tokens $\rho n$, while the conditioning context shrinks as $(1 - \rho) n$; $\rho = 0.15$ is an empirical compromise between the two.
Next Sentence Prediction (NSP)
BERT also introduces the NSP task: given two sentences $A$ and $B$, predict whether $B$ actually follows $A$ in the corpus (positive pairs are consecutive sentences; negatives pair $A$ with a random sentence).
However, subsequent research (RoBERTa) showed NSP has insignificant or even harmful effects. The reason is NSP is too easy: the model might just learn topic discrimination rather than inter-sentence relationships.
Sentence Order Prediction (SOP)
ALBERT proposes using SOP to replace NSP: given two consecutive sentences, determine if their order is correct. This is harder than NSP and requires understanding fine-grained inter-sentence relationships.
Fine-tuning Strategies: Efficient Adaptation to Downstream Tasks
Pre-trained models typically have hundreds of millions of parameters. How to efficiently adapt them to downstream tasks is a key question.
Full Fine-Tuning
The most straightforward method is to fine-tune all parameters. Let the pre-trained parameters be $\theta_{\text{pre}}$; initialize $\theta \leftarrow \theta_{\text{pre}}$ and minimize the downstream objective $\mathcal{L}_{\text{task}}(\theta)$ over all parameters.
Learning Rate Adjustment: Discriminative Fine-tuning
During full fine-tuning, different layers should use different learning rates. Intuition:
- Bottom layers (e.g., the embedding layer) learn universal features and should be adjusted slightly (small learning rate)
- Top layers (e.g., the classification head) are task-specific and should be adjusted significantly (large learning rate)
ULMFiT proposes discriminative fine-tuning: for a model with $L$ layers, derive each layer's learning rate from the one above it as $\eta^{l-1} = \eta^{l} / 2.6$, so rates decrease geometrically from top to bottom.
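The geometric scheme can be computed directly; this is a sketch where the decay factor 2.6 follows ULMFiT, while the base rate and function name are our illustrative choices.

```python
def discriminative_lrs(num_layers, top_lr=2e-5, decay=2.6):
    """Per-layer learning rates: the top layer gets top_lr, and each lower
    layer gets the rate above it divided by `decay` (eta^{l-1} = eta^l / 2.6)."""
    return [top_lr / decay ** (num_layers - 1 - l) for l in range(num_layers)]
```

With 12 layers and a top rate of $2 \times 10^{-5}$, the embedding-adjacent layer ends up training roughly four orders of magnitude slower than the head.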
Learning Rate Scheduling: Warmup and Cosine Decay
Common learning rate scheduling strategy for fine-tuning pre-trained models:
Warmup: Linearly increase the learning rate for the first $T_w$ steps: $\eta_t = \eta_{\max} \cdot t / T_w$.
Cosine decay: Then decay with a cosine schedule: $\eta_t = \eta_{\max} \cdot \frac{1}{2}\big(1 + \cos\frac{\pi (t - T_w)}{T - T_w}\big)$ for $T_w \leq t \leq T$.
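The two phases combine into a single schedule function; a minimal sketch (function and argument names are ours):

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

In PyTorch the same shape is usually installed via `torch.optim.lr_scheduler.LambdaLR` with this function as the multiplier.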
Warmup intuition: In early fine-tuning, gradient variance is large (model hasn't adapted to new task yet), small learning rate stabilizes training.
Layer Freezing
For tasks with limited data, freezing some layers can prevent overfitting.
Choosing Freezing Strategy
Three common strategies:
- Freeze bottom layers: Freeze embeddings and first few Transformer layers, only fine-tune top layers
- Freeze top layers: Freeze top layers, only fine-tune bottom layers (less common)
- Gradual unfreezing: Freeze all layers first, gradually unfreeze (from top to bottom)
ULMFiT uses gradual unfreezing: first fine-tune top layer, after convergence unfreeze second-to-last layer, and so on. This gradually adapts to the task while avoiding catastrophic forgetting.
Mathematical Explanation of Freezing: Regularization Perspective
Freezing some parameters is equivalent to applying an infinitely strong $\ell_2$ penalty toward their pre-trained values: in the objective $\mathcal{L}_{\text{task}}(\theta) + \lambda \|\theta - \theta_{\text{pre}}\|_2^2$, the frozen coordinates correspond to $\lambda \to \infty$, so they cannot move away from $\theta_{\text{pre}}$.
Adapter: Parameter-Efficient Fine-tuning
Full fine-tuning requires storing a complete model copy for each task. Adapters insert small modules into pre-trained models and only fine-tune these modules, significantly reducing parameters.
Adapter Architecture
Adapter is a bottleneck structure inserted into each Transformer layer: $\text{Adapter}(h) = h + W_{\text{up}}\, \sigma(W_{\text{down}} h)$, where $W_{\text{down}} \in \mathbb{R}^{r \times d}$ projects down to a bottleneck dimension $r \ll d$, $\sigma$ is a nonlinearity, and $W_{\text{up}} \in \mathbb{R}^{d \times r}$ projects back up; the residual connection lets the adapter start near the identity.
Parameter count is $2rd$ per adapter, a small fraction of the full layer's parameters.
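The bottleneck is a few lines of NumPy; this is a sketch with small toy dimensions, and the choice of ReLU as the nonlinearity is ours (adapter implementations vary here).

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project to r dims, apply a nonlinearity
    (ReLU here), up-project back to d dims, and add a residual connection."""
    return h + W_up @ np.maximum(0.0, W_down @ h)

d, r = 16, 4                                   # hidden size and bottleneck size
rng = np.random.default_rng(0)
W_down = rng.normal(size=(r, d)) * 0.01        # trainable down-projection
W_up = np.zeros((d, r))                        # zero-init: adapter starts as identity
h = rng.normal(size=d)
```

Initializing $W_{\text{up}}$ at (or near) zero makes the adapted network reproduce the pre-trained network exactly at the start of fine-tuning.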
Adapter Theory: Low-Rank Updates
Adapters essentially perform low-rank updates to pre-trained models. Let the pre-trained weight be $W_0$; the adapter perturbs the layer's function through matrices of rank at most $r$, so fine-tuning searches only a low-dimensional subspace around the pre-trained solution.
LoRA: Low-Rank Adaptation
LoRA (Low-Rank Adaptation) further simplifies Adapters by directly performing a low-rank decomposition of the weight update: $W = W_0 + BA$, where $W_0 \in \mathbb{R}^{d \times k}$ stays frozen and only $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$ are trained.
LoRA advantages:
- Parameter efficient: Only $A$ and $B$ ($r(d + k)$ values per weight matrix) need to be stored per task
- No inference overhead: After training, $BA$ can be merged into $W_0$
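The merge property can be checked numerically; this NumPy sketch omits LoRA's $\alpha/r$ scaling for brevity, and uses a random $B$ purely to make the check non-trivial (LoRA itself initializes $B$ to zero).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 2
W0 = rng.normal(size=(d, k))   # frozen pre-trained weight
A = rng.normal(size=(r, k))    # trainable low-rank factor
B = rng.normal(size=(d, r))    # trainable low-rank factor (zero-init in real LoRA)

def lora_forward(x, W0, A, B):
    """Apply the low-rank update on the fly during training."""
    return W0 @ x + B @ (A @ x)

# "No inference overhead": after training, merge the update into the weight.
W_merged = W0 + B @ A
x = rng.normal(size=k)
```

Because $W_0 x + B(Ax) = (W_0 + BA)x$, serving cost after merging is identical to the original model's.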
BERT Pre-training and Fine-tuning
BERT Architecture Review
BERT (Bidirectional Encoder Representations from Transformers) is a multi-layer bidirectional Transformer encoder. Given an input sequence $([\text{CLS}], x_1, \dots, x_n, [\text{SEP}])$, it produces a contextual representation $h_i$ for every token, with $h_{[\text{CLS}]}$ serving as a sequence-level summary.
BERT Pre-training Tasks
BERT uses two pre-training tasks:
- Masked Language Model (MLM): Randomly mask 15% of tokens and predict the original tokens
- Next Sentence Prediction (NSP): Determine if two sentences are consecutive
Total loss: $\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$
BERT Fine-tuning Paradigm
During fine-tuning, BERT can adapt to various NLP tasks:
Text Classification
Add a classification head on the $[\text{CLS}]$ representation: $p(y \mid x) = \mathrm{softmax}(W h_{[\text{CLS}]} + b)$
Sequence Labeling (e.g., NER)
Predict a label for each token: $p(y_i \mid x) = \mathrm{softmax}(W h_i + b)$
Question Answering (e.g., SQuAD)
Predict start and end positions of the answer span: $p_{\text{start}}(i) = \mathrm{softmax}_i(w_s^\top h_i)$ and $p_{\text{end}}(i) = \mathrm{softmax}_i(w_e^\top h_i)$
GPT Pre-training and Fine-tuning
GPT (Generative Pre-trained Transformer) uses autoregressive language modeling for pre-training: $\mathcal{L}_{\text{LM}} = -\sum_i \log p_\theta(x_i \mid x_{<i})$. During fine-tuning, a task-specific head is added on top of the final hidden states.
Complete Implementation: BERT Fine-tuning for Text Classification
Below is a complete BERT fine-tuning implementation with industrial-grade techniques including gradient accumulation, mixed-precision training, and learning rate scheduling.
```python
import torch
```
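A condensed, self-contained sketch of such a trainer is given below (assuming PyTorch; the `FineTuner` class, its argument names, and the per-child-module learning-rate grouping are illustrative simplifications, though `_create_discriminative_optimizer` and `gradient_accumulation_steps` match the names discussed in the explanation that follows).

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

class FineTuner:
    """Minimal fine-tuning loop combining discriminative learning rates,
    gradient accumulation, and (optionally) mixed-precision training."""

    def __init__(self, model, top_lr=2e-5, lr_decay=2.6,
                 gradient_accumulation_steps=4, use_amp=None):
        self.model = model
        self.gradient_accumulation_steps = gradient_accumulation_steps
        # AMP only helps on GPU; fall back to plain FP32 on CPU.
        self.use_amp = torch.cuda.is_available() if use_amp is None else use_amp
        self.scaler = GradScaler(enabled=self.use_amp)
        self.optimizer = self._create_discriminative_optimizer(top_lr, lr_decay)
        self.loss_fn = nn.CrossEntropyLoss()

    def _create_discriminative_optimizer(self, top_lr, lr_decay):
        # One parameter group per child module; each lower module's learning
        # rate is lr_decay times smaller than the one above it (ULMFiT-style).
        children = list(self.model.children())
        groups = []
        for i, child in enumerate(children):
            params = list(child.parameters())
            if params:  # skip parameter-free modules such as activations
                groups.append({"params": params,
                               "lr": top_lr / lr_decay ** (len(children) - 1 - i)})
        return torch.optim.AdamW(groups)

    def train_epoch(self, batches):
        self.model.train()
        self.optimizer.zero_grad()
        for step, (x, y) in enumerate(batches):
            with autocast(enabled=self.use_amp):        # FP16 forward pass on GPU
                loss = self.loss_fn(self.model(x), y)
            # Divide so accumulated gradients average over the micro-batches.
            loss = loss / self.gradient_accumulation_steps
            self.scaler.scale(loss).backward()          # scaled backward (AMP)
            if (step + 1) % self.gradient_accumulation_steps == 0:
                self.scaler.step(self.optimizer)        # unscale, then optimizer step
                self.scaler.update()                    # adapt the loss scale
                self.optimizer.zero_grad()
```

The same loop works unchanged for a Hugging Face BERT classifier if `model(x)` is replaced by the appropriate forward call returning logits.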
Code Explanation
Discriminative Learning Rates
The _create_discriminative_optimizer method implements
different learning rates for different layers:
Gradient Accumulation
When GPU memory is insufficient, gradient accumulation simulates large batch sizes:
```python
loss = loss / self.gradient_accumulation_steps
```
Parameters are updated once every `gradient_accumulation_steps` steps, which is equivalent to enlarging the batch size by a factor of `gradient_accumulation_steps`.
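This equivalence can be verified numerically: for a loss that is a mean over samples, dividing each micro-batch loss by the number of accumulation steps and summing the gradients reproduces the full-batch gradient (exactly so when micro-batches have equal size). A small NumPy check with a toy least-squares loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch of 8 samples
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad(Xb, yb, w):
    """Gradient of the mean squared error L(w) = mean((Xb @ w - yb)^2)."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                     # large-batch gradient
K = 4                                    # number of accumulation micro-batches
accum = sum(grad(X[i::K], y[i::K], w) / K for i in range(K))
```

Here each micro-batch gradient is divided by `K`, mirroring the `loss / gradient_accumulation_steps` line above.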
Mixed Precision Training
Uses torch.cuda.amp for mixed precision training,
significantly reducing GPU memory usage and training time:
```python
with autocast():
```
Deep Q&A
Q1: Why does pre-training typically outperform training from scratch?
Theoretical Explanation:
1. Data Efficiency: Pre-training leverages large-scale unlabeled data, learning common structures in the data
2. Regularization: Pre-trained parameters serve as a prior, constraining the parameter space and preventing overfitting
3. Optimization Landscape: Pre-trained parameters lie in low-loss regions of the loss surface, making convergence easier during fine-tuning
Experimental Evidence:
- BERT outperforms from-scratch models on 8 out of 9 GLUE benchmark tasks
- ImageNet pre-training improves COCO object detection by 10+ mAP
Q2: Why does contrastive learning need negative samples?
Contrastive learning aims to learn a representation space where similar samples are close and dissimilar samples are far apart. Negative samples provide repulsive force, preventing all samples from collapsing to a single point (model collapse).
Mathematically, SimCLR's loss can be decomposed into an alignment term and a uniformity term:
$$\ell_{i,j} = -\mathrm{sim}(z_i, z_j)/\tau + \log \sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)$$
Without negative samples, the second term degenerates to a constant, and the model easily collapses.
Q3: Why does BERT use bidirectional encoding while GPT uses unidirectional encoding?
BERT: Bidirectional encoding can leverage contextual information, suitable for understanding tasks (classification, NER, QA)
GPT: Unidirectional encoding aligns with autoregressive generation, suitable for generation tasks (text generation, dialogue)
Experiments show: for understanding tasks, bidirectional > unidirectional; for generation tasks, unidirectional is more natural.
Q4: Why is warmup needed during fine-tuning?
In early fine-tuning, model parameters haven't adapted to the new task yet, and gradient variance is large. Using a large learning rate directly can lead to:
1. Gradient explosion: Some samples have very large gradients, destroying pre-trained knowledge
2. Parameter oscillation: The optimization trajectory oscillates violently, making convergence difficult
Warmup gradually increases the learning rate, allowing a smooth transition to the new task. Mathematically, warmup acts as an adaptive effective learning rate $\eta_t = \eta_{\max} \cdot \min(t / T_w, 1)$ that stays small exactly when gradient estimates are noisiest.
Q5: How to choose fine-tuning learning rate?
Rule of thumb: fine-tuning learning rate should be 1-2 orders of magnitude smaller than pre-training.
- Pre-training learning rate: on the order of $10^{-4}$ (BERT uses $1 \times 10^{-4}$)
- Fine-tuning learning rate: on the order of $10^{-5}$ (BERT typically uses $2 \times 10^{-5}$ to $5 \times 10^{-5}$)
Reason: Pre-trained parameters are already close to optimal, fine-tuning only needs minor adjustments. Too large learning rate destroys pre-trained knowledge.
In practice, use learning rate finder: start from small learning rate, gradually increase, observe loss curve, select learning rate where loss decreases fastest.
Q6: Which layers to freeze for best results?
Depends on similarity between task and pre-training data:
| Similarity | Data Amount | Recommended Strategy |
|---|---|---|
| High | Few | Freeze bottom layers, fine-tune top layers |
| High | Many | Full fine-tuning |
| Low | Few | Freeze middle layers, fine-tune bottom and top layers |
| Low | Many | Full fine-tuning + discriminative learning rates |
Intuition: Bottom layers learn universal features (edges, textures, syntax), top layers learn task-specific features. High-similarity tasks reuse bottom features, low-similarity tasks need to adjust bottom features.
Q7: How to determine if model is overfitting?
Overfitting signals:
1. Training loss decreases but validation loss increases (the most obvious signal)
2. Training accuracy is very high but validation accuracy stagnates
3. Model predictions on training samples are overconfident (output probabilities close to 0 or 1)
Solutions:
1. Increase regularization: Raise dropout and weight decay
2. Early stopping: Stop training when validation loss is lowest
3. Data augmentation: Increase the diversity of training samples
4. Reduce model capacity: Use smaller models or freeze more layers
Q8: How does mixed precision training ensure accuracy isn't lost?
Mixed precision training uses FP16 for storage and computation, but uses FP32 for critical steps:
- Loss scaling: Multiply loss by a large number (e.g., 1024), preventing FP16 underflow
- Master weights: Optimizer maintains FP32 weight copies
- Dynamic loss scaling: Automatically adjusts scaling factor, avoiding overflow
Mathematically, FP16's representable range is roughly $[6 \times 10^{-8},\, 6.5 \times 10^{4}]$ (versus about $[10^{-38}, 10^{38}]$ for FP32), so unscaled gradients below $\sim 6 \times 10^{-8}$ silently underflow to zero.
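These mechanics can be verified directly with NumPy's `float16`: a gradient below the FP16 subnormal range vanishes, but scaling it into range before the cast (and unscaling in FP32 afterward) preserves it.

```python
import numpy as np

g = 1e-8                            # a small gradient below FP16's subnormal range
assert np.float16(g) == 0.0         # underflows to zero when stored in FP16

scale = 1024.0                      # loss scaling shifts values into FP16 range
scaled = np.float16(g * scale)      # now representable (as a subnormal)
assert scaled != 0.0

recovered = float(scaled) / scale   # unscale in FP32 before the optimizer step
```

Dynamic loss scaling automates the choice of `scale`: it grows the factor while no overflows occur and shrinks it when they do.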
Q9: How much data is needed for pre-training to be effective?
No unified answer, but some empirical rules:
- NLP: At least hundreds of MB of text (e.g., Wikipedia dump ~4GB)
- CV: At least millions of images (e.g., ImageNet 1.2M images)
The key isn't data quantity but data diversity: 10M images of a single category are worth less than 1M images covering diverse categories.
Experiments show: when pre-training data increases by 10x, downstream task performance improves by about 2-5 percentage points (diminishing returns).
Q10: How to evaluate pre-trained model quality?
Three evaluation methods:
- Downstream task performance: Fine-tune on multiple tasks, compute average performance (e.g., GLUE benchmark)
- Representation quality: Evaluate if learned representations are meaningful (e.g., linear probing, nearest neighbor retrieval)
- Pre-training loss: Lower loss indicates better model (but not absolute)
Most reliable is downstream task performance, but costly. Linear probing is a fast evaluation method: freeze pre-trained model, only train a linear classifier. If accuracy is high, representation quality is good.
Q11: How to handle distribution shift between pre-training and fine-tuning?
Distribution shift is a common problem in pre-training. For example, BERT sees the $[\text{MASK}]$ token throughout pre-training, but downstream inputs never contain it.
Solutions:
- BERT's masking strategy: 10% probability replace with random token, 10% probability keep unchanged, alleviates distribution shift
- Domain-adaptive pre-training: Continue pre-training on target domain data
- Gradual unfreezing: Gradually unfreeze layers, allowing model to gradually adapt to new distribution
Theoretically, importance weighting can correct distribution shift: $\mathcal{L} = \mathbb{E}_{x \sim p_{\text{pre}}}\!\left[\frac{p_{\text{task}}(x)}{p_{\text{pre}}(x)}\, \ell(x; \theta)\right]$, reweighting source samples toward the target distribution (in practice the density ratio must itself be estimated).
Q12: How to allocate computational cost between pre-training and fine-tuning?
Typically pre-training accounts for over 90% of computational cost. For example, BERT-large pre-training requires:
- Hardware: 64 TPU v3 (equivalent to 512 V100 GPUs)
- Time: 4 days
- Cost: About $10,000
While fine-tuning only requires:
- Hardware: A single V100 GPU
- Time: A few hours
- Cost: About $10
Therefore, pre-train once, fine-tune many times is the most economical strategy. Large companies (like Google, OpenAI) pre-train general models and open-source them for community use.
Related Papers
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al., NAACL 2019
https://arxiv.org/abs/1810.04805

Improving Language Understanding by Generative Pre-Training (GPT)
Radford et al., OpenAI Technical Report 2018
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)
Chen et al., ICML 2020
https://arxiv.org/abs/2002.05709

Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)
He et al., CVPR 2020
https://arxiv.org/abs/1911.05722

Universal Language Model Fine-tuning for Text Classification (ULMFiT)
Howard and Ruder, ACL 2018
https://arxiv.org/abs/1801.06146

RoBERTa: A Robustly Optimized BERT Pretraining Approach
Liu et al., arXiv 2019
https://arxiv.org/abs/1907.11692

Parameter-Efficient Transfer Learning for NLP (Adapter)
Houlsby et al., ICML 2019
https://arxiv.org/abs/1902.00751

LoRA: Low-Rank Adaptation of Large Language Models
Hu et al., ICLR 2022
https://arxiv.org/abs/2106.09685

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Lan et al., ICLR 2020
https://arxiv.org/abs/1909.11942

Representation Learning with Contrastive Predictive Coding
van den Oord et al., arXiv 2018
https://arxiv.org/abs/1807.03748

Understanding the Difficulty of Training Deep Feedforward Neural Networks
Glorot and Bengio, AISTATS 2010
http://proceedings.mlr.press/v9/glorot10a.html

Scaling Laws for Neural Language Models
Kaplan et al., arXiv 2020
https://arxiv.org/abs/2001.08361
Summary
Pre-training and fine-tuning represent the most successful paradigm in transfer learning. This article derives the mathematical foundations from first principles - the Bayesian perspective (learning prior distributions) and information-theoretic perspective (learning common structures), analyzes the mathematical basis of contrastive learning (SimCLR, MoCo) and masked language models (BERT MLM) in detail.
For fine-tuning strategies, we discussed full fine-tuning, discriminative learning rates, layer freezing, and Adapters, providing theoretical explanations from regularization and low-rank update perspectives. Finally, we provided a complete BERT fine-tuning implementation with industrial-grade techniques including gradient accumulation, mixed-precision training, and learning rate scheduling.
Pre-training isn't a silver bullet - its effectiveness depends on the similarity between pre-training data and downstream tasks. In the next chapter, we'll delve into domain adaptation methods, addressing the problem of distribution mismatch between pre-training and downstream tasks.
- Post title: Transfer Learning (2): Pre-training and Fine-tuning Techniques
- Post author: Chen Kai
- Create time: 2024-11-09 14:30:00
- Post link: https://www.chenk.top/transfer-learning-2-pre-training-and-fine-tuning/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.