Transfer Learning (2): Pre-training and Fine-tuning Techniques
Chen Kai

Pre-training and fine-tuning have become one of the most successful transfer learning paradigms in modern deep learning. The emergence of BERT in 2018 fundamentally transformed the NLP research landscape, and pre-trained models have achieved tremendous success in computer vision, speech, and multimodal domains. But why does pre-training work? How should we adjust learning rates during fine-tuning? Which layers should be frozen? These questions involve deep theoretical considerations and engineering trade-offs.

This article derives the mathematical foundations of pre-training from first principles, analyzes the loss functions of contrastive learning and masked language models, explains various fine-tuning strategies in detail, and provides a complete industrial-grade BERT fine-tuning implementation with gradient accumulation, mixed-precision training, and learning rate scheduling. We'll see that pre-training essentially learns a powerful prior distribution, while fine-tuning performs Bayesian updates with limited labeled data.

Motivation for Pre-training: Why Pre-train?

From Data Scarcity to Knowledge Transfer

Deep learning models typically require massive amounts of labeled data to achieve good performance. However, in real-world applications, labeled data is often scarce and expensive:

  • Medical Imaging Diagnosis: Requires expert radiologist annotations, with costs reaching $100-500 per CT scan
  • Legal Text Classification: Requires professional lawyer review, extremely slow annotation speed
  • Low-Resource Language Translation: Lack of parallel corpora, difficult to annotate

Yet unlabeled data is extremely abundant - there are terabytes of text, images, and videos on the internet. The core idea of pre-training is to leverage large-scale unlabeled data to learn universal representations, then fine-tune on specific tasks with limited labeled data.

Mathematical Perspective on Pre-training: Bayesian Priors

From a Bayesian perspective, pre-training learns a strong prior distribution. Let $\theta$ be the model parameters, $\mathcal{D}_{\text{pre}}$ the pre-training data, and $\mathcal{D}_{\text{task}}$ the task data. Standard training directly maximizes the posterior given only the task data:

$$p(\theta \mid \mathcal{D}_{\text{task}}) \propto p(\mathcal{D}_{\text{task}} \mid \theta)\, p(\theta)$$

While pre-training + fine-tuning follows two steps:

  1. Pre-training: Learn the prior $p(\theta \mid \mathcal{D}_{\text{pre}}) \propto p(\mathcal{D}_{\text{pre}} \mid \theta)\, p(\theta)$
  2. Fine-tuning: Bayesian update $p(\theta \mid \mathcal{D}_{\text{task}}, \mathcal{D}_{\text{pre}}) \propto p(\mathcal{D}_{\text{task}} \mid \theta)\, p(\theta \mid \mathcal{D}_{\text{pre}})$

This explains why pre-training works: when task data is scarce, a strong prior significantly improves the quality of the posterior estimate.

Information-Theoretic Perspective: Feature Reuse

From an information-theoretic perspective, pre-training learns the common structure in data. Let the input space be $\mathcal{X}$, and the label spaces of different tasks be $\mathcal{Y}_1, \dots, \mathcal{Y}_K$. The feature extractor $f: \mathcal{X} \to \mathcal{Z}$ learned during pre-training satisfies:

$$I(f(X);\, Y_k) \text{ is large for each task } k = 1, \dots, K$$

where $I(\cdot\,;\cdot)$ is mutual information. In other words, the representation $Z = f(X)$ learned during pre-training preserves information that is useful for multiple downstream tasks.

Intuitive Example: Low-level features (edges, textures) and mid-level features (object parts) learned from ImageNet pre-training are useful for many vision tasks. Syntactic and semantic knowledge learned from large-scale text corpus pre-training helps various NLP tasks.

Pre-training vs Training from Scratch: Convergence Speed and Generalization

Experiments show pre-training not only improves final performance but also accelerates convergence. Two reasons:

  1. Better Initialization: Pre-trained parameters are in low-loss regions of the loss landscape, requiring only local adjustments during fine-tuning
  2. Regularization Effect: The prior introduced by pre-training constrains the parameter space, preventing overfitting

Formally, let the pre-trained parameters be $\theta_0$ and the fine-tuning loss be $\mathcal{L}(\theta)$. A second-order Taylor expansion around $\theta_0$ gives:

$$\mathcal{L}(\theta) \approx \mathcal{L}(\theta_0) + \nabla \mathcal{L}(\theta_0)^\top (\theta - \theta_0) + \tfrac{1}{2}\, (\theta - \theta_0)^\top H\, (\theta - \theta_0)$$

If $\theta_0$ is already close to an optimum, then $\|\nabla \mathcal{L}(\theta_0)\|$ is small, leading to faster convergence.

Self-Supervised Learning: Constructing Pre-training Tasks

The key to pre-training is designing self-supervised learning (SSL) tasks that automatically generate supervisory signals from unlabeled data.

Contrastive Learning

The core idea of contrastive learning is: representations of similar samples should be close, while representations of dissimilar samples should be far apart.

SimCLR Framework

SimCLR is one of the most successful contrastive learning methods in computer vision. Given a batch of $N$ images, apply two random data augmentations to each image to get $(\tilde{x}_i, \tilde{x}_j)$ as positive pairs. Let the encoder be $f(\cdot)$ and the projection head be $g(\cdot)$; the loss function is:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $z_i = g(f(\tilde{x}_i))$ is the projected representation, $\mathrm{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ is cosine similarity, and $\tau$ is the temperature parameter.

Key intuition:
  • The numerator $\exp(\mathrm{sim}(z_i, z_j)/\tau)$ rewards high similarity for positive pairs
  • The denominator is the normalization term including all negative samples
  • The temperature $\tau$ controls distribution smoothness: a small $\tau$ is sensitive to hard negatives
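The loss above can be sketched directly in PyTorch. Below is a minimal batched NT-Xent implementation (the function name nt_xent_loss is mine): row $i$ of the two views forms a positive pair, and every other row in the concatenated batch acts as a negative.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR) loss for a batch of positive pairs.

    z1, z2: (N, d) projections of two augmented views; row i of z1 and
    row i of z2 are a positive pair, all other rows serve as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine sims / tau
    sim.fill_diagonal_(float('-inf'))                   # the 1[k != i] indicator
    # the positive for row i is row i+N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Identical views yield a much lower loss than random ones, matching the intuition that the loss pulls positives together.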

Theoretical Foundation of InfoNCE Loss

SimCLR's loss is an instance of the InfoNCE loss. It can be proved that minimizing InfoNCE is equivalent to maximizing a lower bound on mutual information. Let positive pairs $(x, x^+)$ come from the joint distribution $p(x, x^+)$ and negative samples from the marginal distribution $p(x)$:

$$I(x;\, x^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$

The proof uses Jensen's inequality and importance sampling. This shows contrastive learning implicitly maximizes mutual information between positive pairs.

MoCo: Momentum Contrastive Learning

SimCLR requires large batch sizes (typically 4096-8192) to have enough negative samples. MoCo solves this by maintaining a momentum-updated key encoder together with a queue of negatives:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$

where $\theta_q$ are the query encoder parameters, $\theta_k$ are the key encoder parameters, and $m$ (typically 0.999) is the momentum coefficient. The queue size can reach 65536, providing abundant negative samples.
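The momentum update is a one-line exponential moving average over parameters. A minimal sketch (the helper name is mine; queue management is omitted):

```python
import torch


@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """MoCo-style EMA update: theta_k <- m * theta_k + (1 - m) * theta_q.

    encoder_q and encoder_k must have identical architectures so that
    their parameters line up one-to-one.
    """
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)  # in-place, no gradient tracking
```

Because the key encoder changes slowly, keys stored in the queue stay consistent with the current encoder, which is what makes the large queue usable.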

Masked Language Model

Masked language modeling is the mainstream method for NLP pre-training, popularized by BERT.

BERT's MLM Task

Given an input sequence $x = (x_1, \dots, x_n)$, randomly mask 15% of tokens (replace with the special [MASK] token). Let the masked position set be $\mathcal{M}$; the model must predict the masked tokens:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p(x_i \mid x_{\setminus \mathcal{M}})$$

where $x_{\setminus \mathcal{M}}$ represents all tokens except the masked positions.

Details of the 15% masking strategy:
  • 80% probability: replace with [MASK]
  • 10% probability: replace with a random token
  • 10% probability: keep unchanged

This alleviates the distribution shift between pre-training and fine-tuning (since there's no [MASK] token during fine-tuning).
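The 80/10/10 split can be sketched as follows. This is a simplified version (it ignores special tokens like [CLS] and [SEP], which a real implementation must exclude from masking); the function name and the -100 ignore-index convention follow common PyTorch practice.

```python
import torch


def mlm_mask(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Apply BERT's 15% masking with the 80/10/10 split.

    Returns (masked_ids, labels); labels are -100 at positions that are
    not predicted, so nn.CrossEntropyLoss(ignore_index=-100) skips them.
    """
    labels = input_ids.clone()
    # choose ~15% of positions as prediction targets
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100

    masked = input_ids.clone()
    r = torch.rand(input_ids.shape)
    # 80% of selected positions: replace with [MASK]
    masked[selected & (r < 0.8)] = mask_token_id
    # 10% of selected positions: replace with a random token
    random_ids = torch.randint(vocab_size, input_ids.shape)
    use_random = selected & (r >= 0.8) & (r < 0.9)
    masked[use_random] = random_ids[use_random]
    # remaining 10%: keep the original token unchanged
    return masked, labels
```

Note that even "kept" tokens still appear in the labels, so the model must predict them too; this is what forces it to maintain good representations for unmasked input.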

Autoregressive Decomposition of MLM

Although MLM is non-autoregressive (all masked positions are predicted in parallel), its loss can be decomposed autoregressively. Let $x_{m_1}, \dots, x_{m_k}$ be the masked tokens arranged in some order:

$$p(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}}) = \prod_{t=1}^{k} p(x_{m_t} \mid x_{\setminus \mathcal{M}}, x_{m_1}, \dots, x_{m_{t-1}})$$

However, BERT's MLM assumes independence between masked tokens:

$$p(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}}) \approx \prod_{i \in \mathcal{M}} p(x_i \mid x_{\setminus \mathcal{M}})$$

This independence assumption ignores dependencies between masked tokens. XLNet addresses this through Permutation Language Modeling.

Mathematical Analysis of Masking Strategy

Why choose a 15% masking ratio? Too few masked tokens (e.g., 5%) provide a weak learning signal; too many (e.g., 50%) leave too little context. An information-theoretic view:

Let the masking ratio be $\rho$; the relevant quantity is the conditional entropy $H(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}})$. When $\rho$ is too small, this entropy is small (prediction is easy); when $\rho$ is too large, the context $x_{\setminus \mathcal{M}}$ doesn't carry enough information to predict the masked tokens. Experiments show 15% is a good balance.

Next Sentence Prediction (NSP)

BERT also introduces the NSP task: given two sentences $A$ and $B$, determine whether $B$ is the next sentence after $A$. The loss function is:

$$\mathcal{L}_{\text{NSP}} = -\log p(y \mid h_{\texttt{[CLS]}})$$

where $y \in \{\text{IsNext}, \text{NotNext}\}$, and $h_{\texttt{[CLS]}}$ is the representation of the special [CLS] token.

However, subsequent research (RoBERTa) showed NSP has insignificant or even harmful effects. The reason is NSP is too easy: the model might just learn topic discrimination rather than inter-sentence relationships.

Sentence Order Prediction (SOP)

ALBERT proposes using SOP to replace NSP: given two consecutive sentences, determine if their order is correct. This is harder than NSP and requires understanding fine-grained inter-sentence relationships.

Fine-tuning Strategies: Efficient Adaptation to Downstream Tasks

Pre-trained models typically have hundreds of millions of parameters. How to efficiently adapt them to downstream tasks is a key question.

Full Fine-Tuning

The most straightforward method is to fine-tune all parameters. Let the pre-trained parameters be $\theta_{\text{pre}}$ and the downstream task loss be $\mathcal{L}_{\text{task}}$; fine-tuning optimizes:

$$\min_{\theta}\; \mathcal{L}_{\text{task}}(\theta) + \lambda\, \|\theta - \theta_{\text{pre}}\|_2^2$$

where the second term is a regularizer preventing too much deviation from the pre-trained parameters. This corresponds to a simplified version of elastic weight consolidation (EWC).

Learning Rate Adjustment: Discriminative Fine-tuning

During full fine-tuning, different layers should use different learning rates. Intuition:
  • Bottom layers (e.g., the embedding layer) learn universal features and should be adjusted slightly (small learning rate)
  • Top layers (e.g., the classification head) are task-specific and should be adjusted significantly (large learning rate)

ULMFiT proposes discriminative fine-tuning: for a model with $L$ layers, the learning rate of layer $l$ is:

$$\eta^{(l)} = \frac{\eta^{(L)}}{\xi^{\,L-l}}$$

where $\eta^{(L)}$ is the top-layer learning rate and $\xi$ is the decay factor (typically 2.6). This makes the bottom-layer learning rate $\xi^{L-1}$ times smaller than the top layer's.
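As a quick sanity check on the formula, a small helper (the name discriminative_lrs is mine) that enumerates the per-layer rates:

```python
def discriminative_lrs(num_layers, top_lr=2e-5, decay=2.6):
    """Learning rate for layer l (1-indexed): top_lr / decay**(num_layers - l)."""
    return [top_lr / decay ** (num_layers - l) for l in range(1, num_layers + 1)]
```

For a 12-layer model with top_lr=2e-5, this yields rates growing geometrically from the embedding side up to the classification head.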

Learning Rate Scheduling: Warmup and Cosine Decay

A common learning rate schedule for fine-tuning pre-trained models:

  1. Warmup: Linearly increase the learning rate for the first $T_{\text{warmup}}$ steps: $\eta_t = \eta_{\max} \cdot t / T_{\text{warmup}}$

  2. Cosine decay: Then decay with a cosine schedule: $\eta_t = \eta_{\max} \cdot \frac{1}{2}\left(1 + \cos\frac{\pi\,(t - T_{\text{warmup}})}{T - T_{\text{warmup}}}\right)$

Warmup intuition: In early fine-tuning, gradient variance is large (the model hasn't adapted to the new task yet), and a small learning rate stabilizes training.
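The two-phase schedule can be written as a single function of the step index. A minimal sketch (function and argument names are mine):

```python
import math


def lr_at_step(step, total_steps, warmup_steps, lr_max):
    """Linear warmup to lr_max over warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)              # warmup phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine phase
```

In PyTorch this plugs into torch.optim.lr_scheduler.LambdaLR as a multiplier: pass lambda step: lr_at_step(step, T, W, 1.0) and set the optimizer's base lr to $\eta_{\max}$.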

Layer Freezing

For tasks with limited data, freezing some layers can prevent overfitting.

Choosing Freezing Strategy

Three common strategies:

  1. Freeze bottom layers: Freeze embeddings and first few Transformer layers, only fine-tune top layers
  2. Freeze top layers: Freeze top layers, only fine-tune bottom layers (less common)
  3. Gradual unfreezing: Freeze all layers first, gradually unfreeze (from top to bottom)

ULMFiT uses gradual unfreezing: first fine-tune top layer, after convergence unfreeze second-to-last layer, and so on. This gradually adapts to the task while avoiding catastrophic forgetting.

Mathematical Explanation of Freezing: Regularization Perspective

Freezing some parameters is equivalent to applying infinite $L_2$ regularization to them. Let $\theta_F$ denote the frozen subset:

$$\min_{\theta}\; \mathcal{L}(\theta) \quad \text{s.t.} \quad \theta_F = \theta_{\text{pre},F}$$

This is an optimization problem with equality constraints. Taking the penalty limit (or using Lagrange multipliers), it's equivalent to:

$$\lim_{\lambda \to \infty}\; \min_{\theta}\; \mathcal{L}(\theta) + \lambda\, \|\theta_F - \theta_{\text{pre},F}\|_2^2$$

Thus freezing is an extreme form of regularization.
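In PyTorch, this hard constraint is implemented by disabling gradients. A minimal helper (the freeze name and the commented BERT usage are illustrative):

```python
def freeze(*modules):
    """Disable gradients for every parameter of the given modules.

    Frozen parameters receive no updates, which is the infinite-penalty
    (hard-constraint) case described above.
    """
    for module in modules:
        for p in module.parameters():
            p.requires_grad = False

# Hypothetical usage with a Hugging Face BERT classifier:
# freeze(model.bert.embeddings, *model.bert.encoder.layer[:4])
```

Optimizers skip parameters with requires_grad=False (or they can be filtered out when constructing the optimizer), so frozen layers also cost nothing in optimizer state.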

Adapter: Parameter-Efficient Fine-tuning

Full fine-tuning requires storing a complete model copy for each task. Adapters insert small modules into pre-trained models and only fine-tune these modules, significantly reducing parameters.

Adapter Architecture

An Adapter is a bottleneck structure inserted into each Transformer layer:

$$h' = h + W_{\text{up}}\, \sigma(W_{\text{down}}\, h)$$

where $h$ is the Transformer layer output, $W_{\text{down}} \in \mathbb{R}^{m \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times m}$, and $m$ is the bottleneck dimension (typically $m \ll d$, e.g., $m = 64$, $d = 768$).

The parameter count is $O(md)$ per layer, far less than the Transformer layer's $O(d^2)$ (self-attention + FFN).
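A minimal Adapter module with these shapes. Zero-initializing the up-projection (a common practice, assumed here) makes the module start as an identity mapping, so inserting it doesn't perturb the pre-trained model at step zero:

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: h' = h + W_up * sigma(W_down * h)."""

    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # W_down: d -> m
        self.up = nn.Linear(bottleneck, d_model)    # W_up:   m -> d
        nn.init.zeros_(self.up.weight)              # start as an identity
        nn.init.zeros_(self.up.bias)                # (residual branch outputs 0)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))
```

During fine-tuning, only the adapters (and usually layer norms plus the task head) are trained; the Transformer weights stay frozen.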

Adapter Theory: Low-Rank Updates

Adapters essentially perform low-rank updates to pre-trained models. Let the pre-trained weight be $W_{\text{pre}}$ and the fine-tuned weight be $W$; Adapters assume:

$$W = W_{\text{pre}} + \Delta W, \qquad \mathrm{rank}(\Delta W) \leq m$$

i.e., $\Delta W$ is a rank-$m$ (low-rank) matrix. The underlying assumption: task adaptation only needs to move in a low-dimensional subspace of parameter space.

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) further simplifies Adapters by directly applying a low-rank decomposition to weight matrices:

$$W = W_0 + BA$$

where $W_0 \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. During training, freeze $W_0$ and only update $A$ and $B$.

LoRA advantages:
  • Parameter efficient: Only $A$ and $B$ need to be stored (parameter count $r(d + k)$)
  • No inference overhead: $BA$ can be merged into $W_0$, so there's no extra computation at inference time
  • Easy task switching: Tasks can be swapped quickly (just replace $A$ and $B$)
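A minimal LoRA wrapper around an existing nn.Linear (a simplified sketch, not the official implementation). Following the LoRA paper's convention, $B$ is zero-initialized so the wrapped layer starts exactly at $W_0$; the alpha/r scaling factor is also from the paper:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """y = W0 x + (alpha / r) * B A x, with W0 frozen; only A, B train."""

    def __init__(self, base, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze W0 (and its bias)
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init: starts at W0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())
```

After training, $BA$ can be folded into the base weight (base.weight += scaling * B @ A) so inference pays no extra cost.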

BERT Pre-training and Fine-tuning

BERT Architecture Review

BERT (Bidirectional Encoder Representations from Transformers) is a multi-layer bidirectional Transformer encoder. Given an input sequence $x = (x_1, \dots, x_n)$, BERT learns contextual representations through stacked self-attention layers:

$$H^{(l)} = \mathrm{TransformerLayer}^{(l)}(H^{(l-1)}), \qquad l = 1, \dots, L$$

Each Transformer layer contains multi-head self-attention and a feedforward network.

BERT Pre-training Tasks

BERT uses two pre-training tasks:

  1. Masked Language Model (MLM): Randomly mask 15% of tokens and predict
  2. Next Sentence Prediction (NSP): Determine if two sentences are consecutive

Total loss: $\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$

BERT Fine-tuning Paradigm

During fine-tuning, BERT can adapt to various NLP tasks:

Text Classification

Add a [CLS] token at the beginning of the input and use its representation $h_{\texttt{[CLS]}}$ for classification:

$$p(y \mid x) = \mathrm{softmax}(W h_{\texttt{[CLS]}} + b)$$

The loss function is cross-entropy: $\mathcal{L} = -\log p(y^* \mid x)$.

Sequence Labeling (e.g., NER)

Predict a label for each token:

$$p(y_i \mid x) = \mathrm{softmax}(W h_i + b), \qquad i = 1, \dots, n$$

Question Answering (e.g., SQuAD)

Predict the start and end positions of the answer span:

$$p_{\text{start}}(i) = \frac{\exp(s^\top h_i)}{\sum_j \exp(s^\top h_j)}, \qquad p_{\text{end}}(i) = \frac{\exp(e^\top h_i)}{\sum_j \exp(e^\top h_j)}$$

where $s$ and $e$ are learned start and end vectors.

GPT Pre-training and Fine-tuning

GPT (Generative Pre-trained Transformer) uses autoregressive language modeling for pre-training:

$$\mathcal{L}_{\text{LM}} = -\sum_{i=1}^{n} \log p(x_i \mid x_1, \dots, x_{i-1})$$

During fine-tuning, GPT appends task-specific tokens to the input and uses the last token's representation for prediction.
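In code, the autoregressive loss amounts to shifting logits and labels by one position. A minimal sketch (the function name is mine):

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, input_ids):
    """Autoregressive LM loss: the prediction at position t targets token t+1.

    logits: (B, T, V) model outputs; input_ids: (B, T) token ids.
    """
    shift_logits = logits[:, :-1, :]  # predictions made at positions 0..T-2
    shift_labels = input_ids[:, 1:]   # targets are the next tokens 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, logits.size(-1)),
        shift_labels.reshape(-1),
    )
```

With uniform logits the loss equals $\log V$, the entropy of a uniform distribution over the vocabulary, which is a handy sanity check.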

Complete Implementation: BERT Fine-tuning for Text Classification

Below is a complete BERT fine-tuning implementation with industrial-grade techniques including gradient accumulation, mixed-precision training, and learning rate scheduling.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm
from sklearn.metrics import accuracy_score, f1_score


class BERTClassifier(nn.Module):
    """BERT text classifier"""

    def __init__(self, bert_model_name='bert-base-uncased', num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # BERT encoding
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation
        pooled_output = outputs.pooler_output  # (batch_size, hidden_size)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits


class TextDataset(Dataset):
    """Text classification dataset"""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        # Tokenization
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }


class BERTFineTuner:
    """BERT fine-tuning trainer"""

    def __init__(
        self,
        model,
        train_dataloader,
        val_dataloader,
        num_epochs=3,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        gradient_accumulation_steps=1,
        max_grad_norm=1.0,
        device='cuda',
        use_amp=True,
        discriminative_lr=False,
        lr_decay=2.6
    ):
        self.model = model.to(device)
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.num_epochs = num_epochs
        self.device = device
        self.use_amp = use_amp
        self.gradient_accumulation_steps = gradient_accumulation_steps
        self.max_grad_norm = max_grad_norm

        # Calculate total optimizer steps
        self.total_steps = len(train_dataloader) * num_epochs // gradient_accumulation_steps
        self.warmup_steps = int(self.total_steps * warmup_ratio)

        # Discriminative learning rates (different layers get different learning rates)
        if discriminative_lr:
            self.optimizer = self._create_discriminative_optimizer(learning_rate, lr_decay)
        else:
            self.optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)

        # Learning rate scheduler (warmup + linear decay)
        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=self.warmup_steps,
            num_training_steps=self.total_steps
        )

        # Mixed precision training
        self.scaler = GradScaler() if use_amp else None

        # Loss function
        self.criterion = nn.CrossEntropyLoss()

        # Training history
        self.train_losses = []
        self.val_losses = []
        self.val_accuracies = []

    def _create_discriminative_optimizer(self, lr, decay):
        """Create a discriminative optimizer: different layers use different learning rates"""
        # Number of BERT layers
        num_layers = len(self.model.bert.encoder.layer)

        # Group parameters
        param_groups = []

        # Embedding layer (lowest learning rate)
        param_groups.append({
            'params': self.model.bert.embeddings.parameters(),
            'lr': lr / (decay ** num_layers)
        })

        # Each Transformer layer
        for i in range(num_layers):
            param_groups.append({
                'params': self.model.bert.encoder.layer[i].parameters(),
                'lr': lr / (decay ** (num_layers - i - 1))
            })

        # Pooler and classifier (highest learning rate)
        param_groups.append({
            'params': list(self.model.bert.pooler.parameters()) +
                      list(self.model.classifier.parameters()),
            'lr': lr
        })

        return AdamW(param_groups, eps=1e-8)

    def train_epoch(self):
        """Train one epoch"""
        self.model.train()
        total_loss = 0

        progress_bar = tqdm(self.train_dataloader, desc='Training')

        for step, batch in enumerate(progress_bar):
            input_ids = batch['input_ids'].to(self.device)
            attention_mask = batch['attention_mask'].to(self.device)
            labels = batch['label'].to(self.device)

            # Mixed precision training
            if self.use_amp:
                with autocast():
                    logits = self.model(input_ids, attention_mask)
                    loss = self.criterion(logits, labels)
                    loss = loss / self.gradient_accumulation_steps

                # Backward
                self.scaler.scale(loss).backward()
            else:
                logits = self.model(input_ids, attention_mask)
                loss = self.criterion(logits, labels)
                loss = loss / self.gradient_accumulation_steps
                loss.backward()

            # Gradient accumulation
            if (step + 1) % self.gradient_accumulation_steps == 0:
                if self.use_amp:
                    # Gradient clipping
                    self.scaler.unscale_(self.optimizer)
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)

                    # Optimizer step
                    self.scaler.step(self.optimizer)
                    self.scaler.update()
                else:
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
                    self.optimizer.step()

                self.scheduler.step()
                self.optimizer.zero_grad()

            total_loss += loss.item() * self.gradient_accumulation_steps
            progress_bar.set_postfix({'loss': loss.item() * self.gradient_accumulation_steps})

        avg_loss = total_loss / len(self.train_dataloader)
        return avg_loss

    def evaluate(self):
        """Evaluate the model"""
        self.model.eval()
        total_loss = 0
        all_preds = []
        all_labels = []

        with torch.no_grad():
            for batch in tqdm(self.val_dataloader, desc='Evaluating'):
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['label'].to(self.device)

                logits = self.model(input_ids, attention_mask)
                loss = self.criterion(logits, labels)

                total_loss += loss.item()

                preds = torch.argmax(logits, dim=1).cpu().numpy()
                all_preds.extend(preds)
                all_labels.extend(labels.cpu().numpy())

        avg_loss = total_loss / len(self.val_dataloader)
        accuracy = accuracy_score(all_labels, all_preds)
        f1 = f1_score(all_labels, all_preds, average='weighted')

        return avg_loss, accuracy, f1

    def train(self):
        """Complete training workflow"""
        print(f"Total steps: {self.total_steps}")
        print(f"Warmup steps: {self.warmup_steps}")
        print(f"Gradient accumulation steps: {self.gradient_accumulation_steps}")

        best_val_loss = float('inf')

        for epoch in range(self.num_epochs):
            print(f"\nEpoch {epoch + 1}/{self.num_epochs}")

            # Train
            train_loss = self.train_epoch()
            self.train_losses.append(train_loss)

            # Evaluate
            val_loss, val_acc, val_f1 = self.evaluate()
            self.val_losses.append(val_loss)
            self.val_accuracies.append(val_acc)

            print(f"Train Loss: {train_loss:.4f}")
            print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}, Val F1: {val_f1:.4f}")

            # Save the best model
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save(self.model.state_dict(), 'best_model.pt')
                print("Saved best model!")

        return self.train_losses, self.val_losses, self.val_accuracies


# Usage example
def main():
    # Hyperparameters
    BERT_MODEL = 'bert-base-uncased'
    NUM_CLASSES = 2
    MAX_LENGTH = 128
    BATCH_SIZE = 16
    NUM_EPOCHS = 3
    LEARNING_RATE = 2e-5
    GRADIENT_ACCUMULATION_STEPS = 2

    # Simulated data
    train_texts = ["This is great!" * 10, "This is terrible!" * 10] * 500
    train_labels = [1, 0] * 500
    val_texts = ["This is great!" * 10, "This is terrible!" * 10] * 100
    val_labels = [1, 0] * 100

    # Tokenizer
    tokenizer = BertTokenizer.from_pretrained(BERT_MODEL)

    # Datasets
    train_dataset = TextDataset(train_texts, train_labels, tokenizer, MAX_LENGTH)
    val_dataset = TextDataset(val_texts, val_labels, tokenizer, MAX_LENGTH)

    # Dataloaders
    train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

    # Model
    model = BERTClassifier(BERT_MODEL, NUM_CLASSES)

    # Trainer
    trainer = BERTFineTuner(
        model=model,
        train_dataloader=train_dataloader,
        val_dataloader=val_dataloader,
        num_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        discriminative_lr=True,   # Use discriminative learning rates
        lr_decay=2.6,
        use_amp=True              # Use mixed precision training
    )

    # Train
    train_losses, val_losses, val_accuracies = trainer.train()


if __name__ == '__main__':
    main()

Code Explanation

Discriminative Learning Rates

The _create_discriminative_optimizer method assigns different learning rates to different layers: the embedding layer uses $\eta / \xi^{L}$ and the classifier uses the full $\eta$, where $L$ is the number of Transformer layers and $\xi$ is the decay factor (2.6 by default).

Gradient Accumulation

When GPU memory is insufficient, gradient accumulation simulates large batch sizes:

loss = loss / self.gradient_accumulation_steps
loss.backward()

if (step + 1) % self.gradient_accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()

Parameters are updated once every gradient_accumulation_steps steps, which is equivalent to enlarging the batch size by a factor of gradient_accumulation_steps.

Mixed Precision Training

Uses torch.cuda.amp for mixed precision training, significantly reducing GPU memory usage and training time:

with autocast():
    logits = self.model(input_ids, attention_mask)
    loss = self.criterion(logits, labels)

self.scaler.scale(loss).backward()
self.scaler.step(self.optimizer)
self.scaler.update()

Deep Q&A

Q1: Why does pre-training typically outperform training from scratch?

Theoretical explanation:
  1. Data efficiency: Pre-training leverages large-scale unlabeled data, learning common structures in the data
  2. Regularization: Pre-trained parameters serve as a prior, constraining the parameter space and preventing overfitting
  3. Optimization landscape: Pre-trained parameters lie in low-loss regions of the loss surface, making convergence easier during fine-tuning

Experimental evidence:
  • BERT outperforms from-scratch models on 8 out of 9 GLUE benchmark tasks
  • ImageNet pre-training improves COCO object detection by 10+ mAP

Q2: Why does contrastive learning need negative samples?

Contrastive learning aims to learn a representation space where similar samples are close and dissimilar samples are far apart. Negative samples provide repulsive force, preventing all samples from collapsing to a single point (model collapse).

Mathematically, SimCLR's loss can be decomposed as:
  • The first term $-\mathrm{sim}(z_i, z_j)/\tau$ pulls positive pairs closer
  • The second term $\log \sum_{k} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)$ includes the negative samples, pushing negative pairs apart

Without negative samples, the second term degenerates to a constant, and the model easily collapses.

Q3: Why does BERT use bidirectional encoding while GPT uses unidirectional encoding?

BERT: Bidirectional encoding can leverage contextual information, suitable for understanding tasks (classification, NER, QA)

GPT: Unidirectional encoding aligns with autoregressive generation, suitable for generation tasks (text generation, dialogue)

Experiments show: for understanding tasks, bidirectional > unidirectional; for generation tasks, unidirectional is more natural.

Q4: Why is warmup needed during fine-tuning?

In early fine-tuning, model parameters haven't adapted to the new task yet, and gradient variance is large. Jumping straight to a large learning rate can lead to:
  1. Gradient explosion: Some samples produce very large gradients, destroying pre-trained knowledge
  2. Parameter oscillation: The optimization trajectory oscillates violently and struggles to converge

Warmup gradually increases the learning rate, allowing a smooth transition to the new task. Mathematically, the warmup phase uses an effective learning rate that adapts to training progress: $\eta_t = \eta_{\max} \cdot \min(t / T_{\text{warmup}},\, 1)$.

Q5: How to choose fine-tuning learning rate?

Rule of thumb: fine-tuning learning rate should be 1-2 orders of magnitude smaller than pre-training.

  • Pre-training learning rate: around $10^{-4}$ to $10^{-3}$
  • Fine-tuning learning rate: around $2 \times 10^{-5}$ to $5 \times 10^{-5}$

Reason: Pre-trained parameters are already close to optimal, so fine-tuning only needs minor adjustments. Too large a learning rate destroys pre-trained knowledge.

In practice, use learning rate finder: start from small learning rate, gradually increase, observe loss curve, select learning rate where loss decreases fastest.

Q6: Which layers to freeze for best results?

Depends on similarity between task and pre-training data:

| Similarity | Data Amount | Recommended Strategy |
| --- | --- | --- |
| High | Few | Freeze bottom layers, fine-tune top layers |
| High | Many | Full fine-tuning |
| Low | Few | Freeze middle layers, fine-tune bottom and top layers |
| Low | Many | Full fine-tuning + discriminative learning rates |

Intuition: Bottom layers learn universal features (edges, textures, syntax), top layers learn task-specific features. High-similarity tasks reuse bottom features, low-similarity tasks need to adjust bottom features.

Q7: How to determine if model is overfitting?

Overfitting signals:
  1. Training loss decreases but validation loss increases (the most obvious signal)
  2. Training accuracy is very high but validation accuracy stagnates
  3. Predictions on training samples are very confident (output probabilities close to 0 or 1)

Solutions:
  1. Increase regularization: Increase dropout, weight decay
  2. Early stopping: Stop training when validation loss is lowest
  3. Data augmentation: Increase the diversity of training samples
  4. Reduce model capacity: Use smaller models or freeze more layers

Q8: How does mixed precision training ensure accuracy isn't lost?

Mixed precision training uses FP16 for storage and computation, but uses FP32 for critical steps:

  1. Loss scaling: Multiply loss by a large number (e.g., 1024), preventing FP16 underflow
  2. Master weights: Optimizer maintains FP32 weight copies
  3. Dynamic loss scaling: Automatically adjusts scaling factor, avoiding overflow

Mathematically, FP16's normal range is roughly $[6.1 \times 10^{-5},\, 65504]$ (subnormals extend it down to about $6 \times 10^{-8}$), while many gradients fall in the $10^{-8}$ to $10^{-4}$ range. Using FP16 directly causes small gradients to underflow to zero. Loss scaling (e.g., by $2^{10}$) magnifies gradients into FP16's representable range.
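The underflow problem is easy to demonstrate with NumPy's float16 (the factor 1024 is an illustrative scaling value):

```python
import numpy as np

grad = 1e-8                       # a typical tiny gradient value
print(np.float16(grad))           # underflows to 0.0 in FP16
scaled = np.float16(grad * 1024)  # loss scaling multiplies first, e.g. by 2^10
print(scaled)                     # now representable (as an FP16 subnormal)
# After the backward pass, the optimizer unscales in FP32, recovering ~1e-8.
```

This is exactly why GradScaler scales the loss before backward() and unscales the gradients before the optimizer step.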

Q9: How much data is needed for pre-training to be effective?

No unified answer, but some empirical rules:

  • NLP: At least hundreds of MB of text (e.g., Wikipedia dump ~4GB)
  • CV: At least millions of images (e.g., ImageNet 1.2M images)

The key isn't data quantity but data diversity. 10M images of the same category is worse than 1M images covering diverse categories.

Experiments show: when pre-training data increases by 10x, downstream task performance improves by about 2-5 percentage points (diminishing returns).

Q10: How to evaluate pre-trained model quality?

Three evaluation methods:

  1. Downstream task performance: Fine-tune on multiple tasks, compute average performance (e.g., GLUE benchmark)
  2. Representation quality: Evaluate if learned representations are meaningful (e.g., linear probing, nearest neighbor retrieval)
  3. Pre-training loss: Lower loss indicates better model (but not absolute)

Most reliable is downstream task performance, but costly. Linear probing is a fast evaluation method: freeze pre-trained model, only train a linear classifier. If accuracy is high, representation quality is good.

Q11: How to handle distribution shift between pre-training and fine-tuning?

Distribution shift is a common problem in pre-training. For example, BERT sees the [MASK] token during pre-training but not during fine-tuning.

Solutions:

  1. BERT's masking strategy: replace with a random token 10% of the time and keep the token unchanged 10% of the time, which alleviates the shift
  2. Domain-adaptive pre-training: Continue pre-training on target domain data
  3. Gradual unfreezing: Gradually unfreeze layers, allowing model to gradually adapt to new distribution

Theoretically, importance weighting can correct distribution shift:

$$w(x) = \frac{p_{\text{task}}(x)}{p_{\text{pre}}(x)}$$

But in practice, directly estimating this density ratio is very difficult.

Q12: How to allocate computational cost between pre-training and fine-tuning?

Typically pre-training accounts for over 90% of computational cost. For example, BERT-large pre-training requires:

  • Hardware: 64 TPU v3 (equivalent to 512 V100 GPUs)
  • Time: 4 days
  • Cost: About $10,000

While fine-tuning only requires:

  • Hardware: A single V100 GPU
  • Time: A few hours
  • Cost: About $10

Therefore, pre-train once, fine-tune many times is the most economical strategy. Large companies (like Google, OpenAI) pre-train general models and open-source them for community use.

References

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Devlin et al., NAACL 2019
    https://arxiv.org/abs/1810.04805

  2. Improving Language Understanding by Generative Pre-Training (GPT)
    Radford et al., OpenAI Technical Report 2018
    https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

  3. A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)
    Chen et al., ICML 2020
    https://arxiv.org/abs/2002.05709

  4. Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)
    He et al., CVPR 2020
    https://arxiv.org/abs/1911.05722

  5. Universal Language Model Fine-tuning for Text Classification (ULMFiT)
    Howard and Ruder, ACL 2018
    https://arxiv.org/abs/1801.06146

  6. RoBERTa: A Robustly Optimized BERT Pretraining Approach
    Liu et al., arXiv 2019
    https://arxiv.org/abs/1907.11692

  7. Parameter-Efficient Transfer Learning for NLP (Adapter)
    Houlsby et al., ICML 2019
    https://arxiv.org/abs/1902.00751

  8. LoRA: Low-Rank Adaptation of Large Language Models
    Hu et al., ICLR 2022
    https://arxiv.org/abs/2106.09685

  9. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
    Lan et al., ICLR 2020
    https://arxiv.org/abs/1909.11942

  10. Representation Learning with Contrastive Predictive Coding
    van den Oord et al., arXiv 2018
    https://arxiv.org/abs/1807.03748

  11. Understanding the Difficulty of Training Deep Feedforward Neural Networks
    Glorot and Bengio, AISTATS 2010
    http://proceedings.mlr.press/v9/glorot10a.html

  12. Scaling Laws for Neural Language Models
    Kaplan et al., arXiv 2020
    https://arxiv.org/abs/2001.08361

Summary

Pre-training and fine-tuning represent the most successful paradigm in transfer learning. This article derived its mathematical foundations from first principles, covering the Bayesian perspective (learning prior distributions) and the information-theoretic perspective (learning common structures), and analyzed the mathematics of contrastive learning (SimCLR, MoCo) and masked language models (BERT MLM) in detail.

For fine-tuning strategies, we discussed full fine-tuning, discriminative learning rates, layer freezing, and Adapters, providing theoretical explanations from regularization and low-rank update perspectives. Finally, we provided a complete BERT fine-tuning implementation with industrial-grade techniques including gradient accumulation, mixed-precision training, and learning rate scheduling.

Pre-training isn't a silver bullet - its effectiveness depends on the similarity between pre-training data and downstream tasks. In the next chapter, we'll delve into domain adaptation methods, addressing the problem of distribution mismatch between pre-training and downstream tasks.

  • Post title: Transfer Learning (2): Pre-training and Fine-tuning Techniques
  • Post author: Chen Kai
  • Create time: 2024-11-09 14:30:00
  • Post link: https://www.chenk.top/transfer-learning-2-pre-training-and-fine-tuning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.