Transfer Learning (2): Pre-training and Fine-tuning Techniques
Chen Kai

Pre-training and fine-tuning have become one of the most successful transfer learning paradigms in modern deep learning. The emergence of BERT in 2018 fundamentally transformed the NLP research landscape, and pre-trained models have achieved tremendous success in computer vision, speech, and multimodal domains. But why does pre-training work? How should we adjust learning rates during fine-tuning? Which layers should be frozen? These questions involve deep theoretical considerations and engineering trade-offs.

This article derives the mathematical foundations of pre-training from first principles, analyzes the loss functions of contrastive learning and masked language models, explains various fine-tuning strategies in detail, and provides a complete industrial-grade BERT fine-tuning implementation with gradient accumulation, mixed-precision training, and learning rate scheduling. We'll see that pre-training essentially learns a powerful prior distribution, while fine-tuning performs Bayesian updates with limited labeled data.

Motivation for Pre-training: Why Pre-train?

From Data Scarcity to Knowledge Transfer

Deep learning models typically require massive amounts of labeled data to achieve good performance. However, in real-world applications, labeled data is often scarce and expensive:

  • Medical Imaging Diagnosis: Requires expert radiologist annotations, with costs reaching $100-500 per CT scan
  • Legal Text Classification: Requires professional lawyer review, extremely slow annotation speed
  • Low-Resource Language Translation: Lack of parallel corpora, difficult to annotate

Yet unlabeled data is extremely abundant - there are terabytes of text, images, and videos on the internet. The core idea of pre-training is to leverage large-scale unlabeled data to learn universal representations, then fine-tune on specific tasks with limited labeled data.

Mathematical Perspective on Pre-training: Bayesian Priors

From a Bayesian perspective, pre-training learns a strong prior distribution. Let $\theta$ be the model parameters, $\mathcal{D}_{\text{pre}}$ the pre-training data, and $\mathcal{D}_{\text{task}}$ the task data. Standard training directly maximizes the posterior given only the task data:

$$p(\theta \mid \mathcal{D}_{\text{task}}) \propto p(\mathcal{D}_{\text{task}} \mid \theta)\, p(\theta)$$

While pre-training + fine-tuning follows two steps:

  1. Pre-training: Learn the prior $p(\theta \mid \mathcal{D}_{\text{pre}}) \propto p(\mathcal{D}_{\text{pre}} \mid \theta)\, p(\theta)$
  2. Fine-tuning: Bayesian update $p(\theta \mid \mathcal{D}_{\text{task}}, \mathcal{D}_{\text{pre}}) \propto p(\mathcal{D}_{\text{task}} \mid \theta)\, p(\theta \mid \mathcal{D}_{\text{pre}})$

This explains why pre-training works: when task data is scarce, a strong prior significantly improves the quality of the posterior estimate.

Information-Theoretic Perspective: Feature Reuse

From an information-theoretic perspective, pre-training learns the common structure in data. Let the input space be $\mathcal{X}$, and the label spaces of different tasks be $\mathcal{Y}_1, \dots, \mathcal{Y}_K$. The feature extractor $f: \mathcal{X} \to \mathcal{Z}$ learned during pre-training satisfies:

$$I(f(X);\, Y_k) \text{ is large for each task } k = 1, \dots, K$$

where $I(\cdot\,;\cdot)$ is mutual information. In other words, the representation $Z = f(X)$ learned during pre-training preserves information that is useful for multiple downstream tasks.

Intuitive Example: Low-level features (edges, textures) and mid-level features (object parts) learned from ImageNet pre-training are useful for many vision tasks. Syntactic and semantic knowledge learned from large-scale text corpus pre-training helps various NLP tasks.

Pre-training vs Training from Scratch: Convergence Speed and Generalization

Experiments show pre-training not only improves final performance but also accelerates convergence. Two reasons:

  1. Better Initialization: Pre-trained parameters are in low-loss regions of the loss landscape, requiring only local adjustments during fine-tuning
  2. Regularization Effect: The prior introduced by pre-training constrains the parameter space, preventing overfitting

Formally, let the pre-trained parameters be $\theta_0$ and the fine-tuning loss be $\mathcal{L}(\theta)$. A second-order Taylor expansion around $\theta_0$ gives:

$$\mathcal{L}(\theta) \approx \mathcal{L}(\theta_0) + \nabla \mathcal{L}(\theta_0)^\top (\theta - \theta_0) + \tfrac{1}{2}\, (\theta - \theta_0)^\top H\, (\theta - \theta_0)$$

If $\theta_0$ is already close to an optimum, then $\|\nabla \mathcal{L}(\theta_0)\|$ is small, leading to faster convergence.

Self-Supervised Learning: Constructing Pre-training Tasks

The key to pre-training is designing self-supervised learning (SSL) tasks that automatically generate supervisory signals from unlabeled data.

Contrastive Learning

The core idea of contrastive learning is: representations of similar samples should be close, while representations of dissimilar samples should be far apart.

SimCLR Framework

SimCLR is one of the most successful contrastive learning methods in computer vision. Given a batch of $N$ images, apply two random data augmentations to each image to get $(\tilde{x}_i, \tilde{x}_j)$ as positive pairs. Let the encoder be $f(\cdot)$ and the projection head be $g(\cdot)$; the loss function is:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $z_i = g(f(\tilde{x}_i))$ is the projected representation, $\mathrm{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ is cosine similarity, and $\tau$ is the temperature parameter.

Key intuition:
  • The numerator $\exp(\mathrm{sim}(z_i, z_j)/\tau)$ rewards high similarity for positive pairs
  • The denominator is the normalization term including all negative samples
  • The temperature $\tau$ controls distribution smoothness: a small $\tau$ is sensitive to hard negatives
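The loss above can be sketched directly in PyTorch. Below is a minimal batched NT-Xent implementation (the function name nt_xent_loss is mine): row $i$ of the two views forms a positive pair, and every other row in the concatenated batch acts as a negative.

```python
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR) loss for a batch of positive pairs.

    z1, z2: (N, d) projections of two augmented views; row i of z1 and
    row i of z2 are a positive pair, all other rows serve as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine sims / tau
    sim.fill_diagonal_(float('-inf'))                   # the 1[k != i] indicator
    # the positive for row i is row i+N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Identical views yield a much lower loss than random ones, matching the intuition that the loss pulls positives together.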

Theoretical Foundation of InfoNCE Loss

SimCLR's loss is an instance of the InfoNCE loss. It can be proved that minimizing InfoNCE is equivalent to maximizing a lower bound on mutual information. Let positive pairs $(x, x^+)$ come from the joint distribution $p(x, x^+)$ and negative samples from the marginal distribution $p(x)$:

$$I(x;\, x^+) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$$

The proof uses Jensen's inequality and importance sampling. This shows contrastive learning implicitly maximizes mutual information between positive pairs.

MoCo: Momentum Contrastive Learning

SimCLR requires large batch sizes (typically 4096-8192) to have enough negative samples. MoCo solves this by maintaining a momentum-updated key encoder together with a queue of negatives:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$

where $\theta_q$ are the query encoder parameters, $\theta_k$ are the key encoder parameters, and $m$ (typically 0.999) is the momentum coefficient. The queue size can reach 65536, providing abundant negative samples.
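The momentum update is a one-line exponential moving average over parameters. A minimal sketch (the helper name is mine; queue management is omitted):

```python
import torch


@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """MoCo-style EMA update: theta_k <- m * theta_k + (1 - m) * theta_q.

    encoder_q and encoder_k must have identical architectures so that
    their parameters line up one-to-one.
    """
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)  # in-place, no gradient tracking
```

Because the key encoder changes slowly, keys stored in the queue stay consistent with the current encoder, which is what makes the large queue usable.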

Masked Language Model

Masked language modeling is the mainstream method for NLP pre-training, popularized by BERT.

BERT's MLM Task

Given an input sequence $x = (x_1, \dots, x_n)$, randomly mask 15% of tokens (replace with the special [MASK] token). Let the masked position set be $\mathcal{M}$; the model must predict the masked tokens:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p(x_i \mid x_{\setminus \mathcal{M}})$$

where $x_{\setminus \mathcal{M}}$ represents all tokens except the masked positions.

Details of the 15% masking strategy:
  • 80% probability: replace with [MASK]
  • 10% probability: replace with a random token
  • 10% probability: keep unchanged

This alleviates the distribution shift between pre-training and fine-tuning (since there's no [MASK] token during fine-tuning).
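The 80/10/10 split can be sketched as follows. This is a simplified version (it ignores special tokens like [CLS] and [SEP], which a real implementation must exclude from masking); the function name and the -100 ignore-index convention follow common PyTorch practice.

```python
import torch


def mlm_mask(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Apply BERT's 15% masking with the 80/10/10 split.

    Returns (masked_ids, labels); labels are -100 at positions that are
    not predicted, so nn.CrossEntropyLoss(ignore_index=-100) skips them.
    """
    labels = input_ids.clone()
    # choose ~15% of positions as prediction targets
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100

    masked = input_ids.clone()
    r = torch.rand(input_ids.shape)
    # 80% of selected positions: replace with [MASK]
    masked[selected & (r < 0.8)] = mask_token_id
    # 10% of selected positions: replace with a random token
    random_ids = torch.randint(vocab_size, input_ids.shape)
    use_random = selected & (r >= 0.8) & (r < 0.9)
    masked[use_random] = random_ids[use_random]
    # remaining 10%: keep the original token unchanged
    return masked, labels
```

Note that even "kept" tokens still appear in the labels, so the model must predict them too; this is what forces it to maintain good representations for unmasked input.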

Autoregressive Decomposition of MLM

Although MLM is non-autoregressive (all masked positions are predicted in parallel), its loss can be decomposed autoregressively. Let $x_{m_1}, \dots, x_{m_k}$ be the masked tokens arranged in some order:

$$p(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}}) = \prod_{t=1}^{k} p(x_{m_t} \mid x_{\setminus \mathcal{M}}, x_{m_1}, \dots, x_{m_{t-1}})$$

However, BERT's MLM assumes independence between masked tokens:

$$p(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}}) \approx \prod_{i \in \mathcal{M}} p(x_i \mid x_{\setminus \mathcal{M}})$$

This independence assumption ignores dependencies between masked tokens. XLNet addresses this through Permutation Language Modeling.

Mathematical Analysis of Masking Strategy

Why choose a 15% masking ratio? Too few masked tokens (e.g., 5%) provide a weak learning signal; too many (e.g., 50%) leave too little context. An information-theoretic view:

Let the masking ratio be $\rho$; the relevant quantity is the conditional entropy $H(x_{\mathcal{M}} \mid x_{\setminus \mathcal{M}})$. When $\rho$ is too small, this entropy is small (prediction is easy); when $\rho$ is too large, the context $x_{\setminus \mathcal{M}}$ doesn't carry enough information to predict the masked tokens. Experiments show 15% is a good balance.

Next Sentence Prediction (NSP)

BERT also introduces the NSP task: given two sentences $A$ and $B$, determine whether $B$ is the next sentence after $A$. The loss function is:

$$\mathcal{L}_{\text{NSP}} = -\log p(y \mid h_{\texttt{[CLS]}})$$

where $y \in \{\text{IsNext}, \text{NotNext}\}$, and $h_{\texttt{[CLS]}}$ is the representation of the special [CLS] token.

However, subsequent research (RoBERTa) showed NSP has insignificant or even harmful effects. The reason is NSP is too easy: the model might just learn topic discrimination rather than inter-sentence relationships.

Sentence Order Prediction (SOP)

ALBERT proposes using SOP to replace NSP: given two consecutive sentences, determine if their order is correct. This is harder than NSP and requires understanding fine-grained inter-sentence relationships.

Fine-tuning Strategies: Efficient Adaptation to Downstream Tasks

Pre-trained models typically have hundreds of millions of parameters. How to efficiently adapt them to downstream tasks is a key question.

Full Fine-Tuning

The most straightforward method is to fine-tune all parameters. Let the pre-trained parameters be $\theta_{\text{pre}}$ and the downstream task loss be $\mathcal{L}_{\text{task}}$; fine-tuning optimizes:

$$\min_{\theta}\; \mathcal{L}_{\text{task}}(\theta) + \lambda\, \|\theta - \theta_{\text{pre}}\|_2^2$$

where the second term is a regularizer preventing too much deviation from the pre-trained parameters. This corresponds to a simplified version of elastic weight consolidation (EWC).

Learning Rate Adjustment: Discriminative Fine-tuning

During full fine-tuning, different layers should use different learning rates. Intuition:
  • Bottom layers (e.g., the embedding layer) learn universal features and should be adjusted slightly (small learning rate)
  • Top layers (e.g., the classification head) are task-specific and should be adjusted significantly (large learning rate)

ULMFiT proposes discriminative fine-tuning: for a model with $L$ layers, the learning rate of layer $l$ is:

$$\eta^{(l)} = \frac{\eta^{(L)}}{\xi^{\,L-l}}$$

where $\eta^{(L)}$ is the top-layer learning rate and $\xi$ is the decay factor (typically 2.6). This makes the bottom-layer learning rate $\xi^{L-1}$ times smaller than the top layer's.
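As a quick sanity check on the formula, a small helper (the name discriminative_lrs is mine) that enumerates the per-layer rates:

```python
def discriminative_lrs(num_layers, top_lr=2e-5, decay=2.6):
    """Learning rate for layer l (1-indexed): top_lr / decay**(num_layers - l)."""
    return [top_lr / decay ** (num_layers - l) for l in range(1, num_layers + 1)]
```

For a 12-layer model with top_lr=2e-5, this yields rates growing geometrically from the embedding side up to the classification head.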

Learning Rate Scheduling: Warmup and Cosine Decay

A common learning rate schedule for fine-tuning pre-trained models:

  1. Warmup: Linearly increase the learning rate for the first $T_{\text{warmup}}$ steps: $\eta_t = \eta_{\max} \cdot t / T_{\text{warmup}}$

  2. Cosine decay: Then decay with a cosine schedule: $\eta_t = \eta_{\max} \cdot \frac{1}{2}\left(1 + \cos\frac{\pi\,(t - T_{\text{warmup}})}{T - T_{\text{warmup}}}\right)$

Warmup intuition: In early fine-tuning, gradient variance is large (the model hasn't adapted to the new task yet), and a small learning rate stabilizes training.
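The two-phase schedule can be written as a single function of the step index. A minimal sketch (function and argument names are mine):

```python
import math


def lr_at_step(step, total_steps, warmup_steps, lr_max):
    """Linear warmup to lr_max over warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)              # warmup phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine phase
```

In PyTorch this plugs into torch.optim.lr_scheduler.LambdaLR as a multiplier: pass lambda step: lr_at_step(step, T, W, 1.0) and set the optimizer's base lr to $\eta_{\max}$.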

Layer Freezing

For tasks with limited data, freezing some layers can prevent overfitting.

Choosing Freezing Strategy

Three common strategies:

  1. Freeze bottom layers: Freeze embeddings and first few Transformer layers, only fine-tune top layers
  2. Freeze top layers: Freeze top layers, only fine-tune bottom layers (less common)
  3. Gradual unfreezing: Freeze all layers first, gradually unfreeze (from top to bottom)

ULMFiT uses gradual unfreezing: first fine-tune top layer, after convergence unfreeze second-to-last layer, and so on. This gradually adapts to the task while avoiding catastrophic forgetting.

Mathematical Explanation of Freezing: Regularization Perspective

Freezing some parameters is equivalent to applying infinite $L_2$ regularization to them. Let $\theta_F$ denote the frozen subset:

$$\min_{\theta}\; \mathcal{L}(\theta) \quad \text{s.t.} \quad \theta_F = \theta_{\text{pre},F}$$

This is an optimization problem with equality constraints. Taking the penalty limit (or using Lagrange multipliers), it's equivalent to:

$$\lim_{\lambda \to \infty}\; \min_{\theta}\; \mathcal{L}(\theta) + \lambda\, \|\theta_F - \theta_{\text{pre},F}\|_2^2$$

Thus freezing is an extreme form of regularization.
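In PyTorch, this hard constraint is implemented by disabling gradients. A minimal helper (the freeze name and the commented BERT usage are illustrative):

```python
def freeze(*modules):
    """Disable gradients for every parameter of the given modules.

    Frozen parameters receive no updates, which is the infinite-penalty
    (hard-constraint) case described above.
    """
    for module in modules:
        for p in module.parameters():
            p.requires_grad = False

# Hypothetical usage with a Hugging Face BERT classifier:
# freeze(model.bert.embeddings, *model.bert.encoder.layer[:4])
```

Optimizers skip parameters with requires_grad=False (or they can be filtered out when constructing the optimizer), so frozen layers also cost nothing in optimizer state.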

Adapter: Parameter-Efficient Fine-tuning

Full fine-tuning requires storing a complete model copy for each task. Adapters insert small modules into pre-trained models and only fine-tune these modules, significantly reducing parameters.

Adapter Architecture

An Adapter is a bottleneck structure inserted into each Transformer layer:

$$h' = h + W_{\text{up}}\, \sigma(W_{\text{down}}\, h)$$

where $h$ is the Transformer layer output, $W_{\text{down}} \in \mathbb{R}^{m \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times m}$, and $m$ is the bottleneck dimension (typically $m \ll d$, e.g., $m = 64$, $d = 768$).

The parameter count is $O(md)$ per layer, far less than the Transformer layer's $O(d^2)$ (self-attention + FFN).
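A minimal Adapter module with these shapes. Zero-initializing the up-projection (a common practice, assumed here) makes the module start as an identity mapping, so inserting it doesn't perturb the pre-trained model at step zero:

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: h' = h + W_up * sigma(W_down * h)."""

    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # W_down: d -> m
        self.up = nn.Linear(bottleneck, d_model)    # W_up:   m -> d
        nn.init.zeros_(self.up.weight)              # start as an identity
        nn.init.zeros_(self.up.bias)                # (residual branch outputs 0)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))
```

During fine-tuning, only the adapters (and usually layer norms plus the task head) are trained; the Transformer weights stay frozen.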

Adapter Theory: Low-Rank Updates

Adapters essentially perform low-rank updates to pre-trained models. Let the pre-trained weight be $W_{\text{pre}}$ and the fine-tuned weight be $W$; Adapters assume:

$$W = W_{\text{pre}} + \Delta W, \qquad \mathrm{rank}(\Delta W) \leq m$$

i.e., $\Delta W$ is a rank-$m$ (low-rank) matrix. The underlying assumption: task adaptation only needs to move in a low-dimensional subspace of parameter space.

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) further simplifies Adapters by directly applying a low-rank decomposition to weight matrices:

$$W = W_0 + BA$$

where $W_0 \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. During training, freeze $W_0$ and only update $A$ and $B$.

LoRA advantages:
  • Parameter efficient: Only $A$ and $B$ need to be stored (parameter count $r(d + k)$)
  • No inference overhead: $BA$ can be merged into $W_0$, so there's no extra computation at inference time
  • Easy task switching: Tasks can be swapped quickly (just replace $A$ and $B$)
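A minimal LoRA wrapper around an existing nn.Linear (a simplified sketch, not the official implementation). Following the LoRA paper's convention, $B$ is zero-initialized so the wrapped layer starts exactly at $W_0$; the alpha/r scaling factor is also from the paper:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """y = W0 x + (alpha / r) * B A x, with W0 frozen; only A, B train."""

    def __init__(self, base, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze W0 (and its bias)
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init: starts at W0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())
```

After training, $BA$ can be folded into the base weight (base.weight += scaling * B @ A) so inference pays no extra cost.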

BERT Pre-training and Fine-tuning

BERT Architecture Review

BERT (Bidirectional Encoder Representations from Transformers) is a multi-layer bidirectional Transformer encoder. Given an input sequence $x = (x_1, \dots, x_n)$, BERT learns contextual representations through stacked self-attention layers:

$$H^{(l)} = \mathrm{TransformerLayer}^{(l)}(H^{(l-1)}), \qquad l = 1, \dots, L$$

Each Transformer layer contains multi-head self-attention and a feedforward network.

BERT Pre-training Tasks

BERT uses two pre-training tasks:

  1. Masked Language Model (MLM): Randomly mask 15% of tokens and predict
  2. Next Sentence Prediction (NSP): Determine if two sentences are consecutive

Total loss: $\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$

BERT Fine-tuning Paradigm

During fine-tuning, BERT can adapt to various NLP tasks:

Text Classification

Add a [CLS] token at the beginning of the input and use its representation $h_{\texttt{[CLS]}}$ for classification:

$$p(y \mid x) = \mathrm{softmax}(W h_{\texttt{[CLS]}} + b)$$

The loss function is cross-entropy: $\mathcal{L} = -\log p(y^* \mid x)$.

Sequence Labeling (e.g., NER)

Predict a label for each token:

$$p(y_i \mid x) = \mathrm{softmax}(W h_i + b), \qquad i = 1, \dots, n$$

Question Answering (e.g., SQuAD)

Predict the start and end positions of the answer span:

$$p_{\text{start}}(i) = \frac{\exp(s^\top h_i)}{\sum_j \exp(s^\top h_j)}, \qquad p_{\text{end}}(i) = \frac{\exp(e^\top h_i)}{\sum_j \exp(e^\top h_j)}$$

where $s$ and $e$ are learned start and end vectors.

GPT Pre-training and Fine-tuning

GPT (Generative Pre-trained Transformer) uses autoregressive language modeling for pre-training:

$$\mathcal{L}_{\text{LM}} = -\sum_{i=1}^{n} \log p(x_i \mid x_1, \dots, x_{i-1})$$

During fine-tuning, GPT appends task-specific tokens to the input and uses the last token's representation for prediction.
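In code, the autoregressive loss amounts to shifting logits and labels by one position. A minimal sketch (the function name is mine):

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, input_ids):
    """Autoregressive LM loss: the prediction at position t targets token t+1.

    logits: (B, T, V) model outputs; input_ids: (B, T) token ids.
    """
    shift_logits = logits[:, :-1, :]  # predictions made at positions 0..T-2
    shift_labels = input_ids[:, 1:]   # targets are the next tokens 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, logits.size(-1)),
        shift_labels.reshape(-1),
    )
```

With uniform logits the loss equals $\log V$, the entropy of a uniform distribution over the vocabulary, which is a handy sanity check.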

Complete Implementation: BERT Fine-tuning for Text Classification

Below is a complete BERT fine-tuning implementation with industrial-grade techniques including gradient accumulation, mixed-precision training, and learning rate scheduling.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm
from sklearn.metrics import accuracy_score, f1_score


class BERTClassifier(nn.Module):
    """BERT text classifier"""

    def __init__(self, bert_model_name='bert-base-uncased', num_classes=2, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # BERT encoding
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token representation
        pooled_output = outputs.pooler_output  # (batch_size, hidden_size)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits


class TextDataset(Dataset):
    """Text classification dataset"""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        # Tokenization
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }


class BERTFineTuner:
    """BERT fine-tuning trainer"""

    def __init__(
        self,
        model,
        train_dataloader,
        val_dataloader,
        num_epochs=3,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        gradient_accumulation_steps=1,
        max_grad_norm=1.0,
        device='cuda',
        use_amp=True,
        discriminative_lr=False,
        lr_decay=2.6
    ):
        self.model = model.to(device)
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.num_epochs = num_epochs
        self.device = device
        self.use_amp = use_amp
        self.gradient_accumulation_steps = gradient_accumulation_steps
        self.max_grad_norm = max_grad_norm

        # Calculate total optimizer steps
        self.total_steps = len(train_dataloader) * num_epochs // gradient_accumulation_steps
        self.warmup_steps = int(self.total_steps * warmup_ratio)

        # Discriminative learning rates (different layers get different learning rates)
        if discriminative_lr:
            self.optimizer = self._create_discriminative_optimizer(learning_rate, lr_decay)
        else:
            self.optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)

        # Learning rate scheduler (warmup + linear decay)
        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=self.warmup_steps,
            num_training_steps=self.total_steps
        )

        # Mixed precision training
        self.scaler = GradScaler() if use_amp else None

        # Loss function
        self.criterion = nn.CrossEntropyLoss()

        # Training history
        self.train_losses = []
        self.val_losses = []
        self.val_accuracies = []

    def _create_discriminative_optimizer(self, lr, decay):
        """Create a discriminative optimizer: different layers use different learning rates"""
        # Number of BERT layers
        num_layers = len(self.model.bert.encoder.layer)

        # Group parameters
        param_groups = []

        # Embedding layer (lowest learning rate)
        param_groups.append({
            'params': self.model.bert.embeddings.parameters(),
            'lr': lr / (decay ** num_layers)
        })

        # Each Transformer layer
        for i in range(num_layers):
            param_groups.append({
                'params': self.model.bert.encoder.layer[i].parameters(),
                'lr': lr / (decay ** (num_layers - i - 1))
            })

        # Pooler and classifier (highest learning rate)
        param_groups.append({
            'params': list(self.model.bert.pooler.parameters()) +
                      list(self.model.classifier.parameters()),
            'lr': lr
        })

        return AdamW(param_groups, eps=1e-8)

    def train_epoch(self):
        """Train one epoch"""
        self.model.train()
        total_loss = 0

        progress_bar = tqdm(self.train_dataloader, desc='Training')

        for step, batch in enumerate(progress_bar):
            input_ids = batch['input_ids'].to(self.device)
            attention_mask = batch['attention_mask'].to(self.device)
            labels = batch['label'].to(self.device)

            # Mixed precision training
            if self.use_amp:
                with autocast():
                    logits = self.model(input_ids, attention_mask)
                    loss = self.criterion(logits, labels)
                    loss = loss / self.gradient_accumulation_steps

                # Backward
                self.scaler.scale(loss).backward()
            else:
                logits = self.model(input_ids, attention_mask)
                loss = self.criterion(logits, labels)
                loss = loss / self.gradient_accumulation_steps
                loss.backward()

            # Gradient accumulation
            if (step + 1) % self.gradient_accumulation_steps == 0:
                if self.use_amp:
                    # Gradient clipping
                    self.scaler.unscale_(self.optimizer)
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)

                    # Optimizer step
                    self.scaler.step(self.optimizer)
                    self.scaler.update()
                else:
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
                    self.optimizer.step()

                self.scheduler.step()
                self.optimizer.zero_grad()

            total_loss += loss.item() * self.gradient_accumulation_steps
            progress_bar.set_postfix({'loss': loss.item() * self.gradient_accumulation_steps})

        avg_loss = total_loss / len(self.train_dataloader)
        return avg_loss

    def evaluate(self):
        """Evaluate the model"""
        self.model.eval()
        total_loss = 0
        all_preds = []
        all_labels = []

        with torch.no_grad():
            for batch in tqdm(self.val_dataloader, desc='Evaluating'):
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['label'].to(self.device)

                logits = self.model(input_ids, attention_mask)
                loss = self.criterion(logits, labels)

                total_loss += loss.item()

                preds = torch.argmax(logits, dim=1).cpu().numpy()
                all_preds.extend(preds)
                all_labels.extend(labels.cpu().numpy())

        avg_loss = total_loss / len(self.val_dataloader)
        accuracy = accuracy_score(all_labels, all_preds)
        f1 = f1_score(all_labels, all_preds, average='weighted')

        return avg_loss, accuracy, f1

    def train(self):
        """Complete training workflow"""
        print(f"Total steps: {self.total_steps}")
        print(f"Warmup steps: {self.warmup_steps}")
        print(f"Gradient accumulation steps: {self.gradient_accumulation_steps}")

        best_val_loss = float('inf')

        for epoch in range(self.num_epochs):
            print(f"\nEpoch {epoch + 1}/{self.num_epochs}")

            # Train
            train_loss = self.train_epoch()
            self.train_losses.append(train_loss)

            # Evaluate
            val_loss, val_acc, val_f1 = self.evaluate()
            self.val_losses.append(val_loss)
            self.val_accuracies.append(val_acc)

            print(f"Train Loss: {train_loss:.4f}")
            print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}, Val F1: {val_f1:.4f}")

            # Save the best model
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                torch.save(self.model.state_dict(), 'best_model.pt')
                print("Saved best model!")

        return self.train_losses, self.val_losses, self.val_accuracies


# Usage example
def main():
    # Hyperparameters
    BERT_MODEL = 'bert-base-uncased'
    NUM_CLASSES = 2
    MAX_LENGTH = 128
    BATCH_SIZE = 16
    NUM_EPOCHS = 3
    LEARNING_RATE = 2e-5
    GRADIENT_ACCUMULATION_STEPS = 2

    # Simulated data
    train_texts = ["This is great!" * 10, "This is terrible!" * 10] * 500
    train_labels = [1, 0] * 500
    val_texts = ["This is great!" * 10, "This is terrible!" * 10] * 100
    val_labels = [1, 0] * 100

    # Tokenizer
    tokenizer = BertTokenizer.from_pretrained(BERT_MODEL)

    # Datasets
    train_dataset = TextDataset(train_texts, train_labels, tokenizer, MAX_LENGTH)
    val_dataset = TextDataset(val_texts, val_labels, tokenizer, MAX_LENGTH)

    # Dataloaders
    train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

    # Model
    model = BERTClassifier(BERT_MODEL, NUM_CLASSES)

    # Trainer
    trainer = BERTFineTuner(
        model=model,
        train_dataloader=train_dataloader,
        val_dataloader=val_dataloader,
        num_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        discriminative_lr=True,   # Use discriminative learning rates
        lr_decay=2.6,
        use_amp=True              # Use mixed precision training
    )

    # Train
    train_losses, val_losses, val_accuracies = trainer.train()


if __name__ == '__main__':
    main()

Code Explanation

Discriminative Learning Rates

The _create_discriminative_optimizer method assigns different learning rates to different layers: the embedding layer uses $\eta / \xi^{L}$ and the classifier uses the full $\eta$, where $L$ is the number of Transformer layers and $\xi$ is the decay factor (2.6 by default).

Gradient Accumulation

When GPU memory is insufficient, gradient accumulation simulates large batch sizes:

loss = loss / self.gradient_accumulation_steps
loss.backward()

if (step + 1) % self.gradient_accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()

Parameters are updated once every gradient_accumulation_steps steps, which is equivalent to enlarging the batch size by a factor of gradient_accumulation_steps.

Mixed Precision Training

Uses torch.cuda.amp for mixed precision training, significantly reducing GPU memory usage and training time:

with autocast():
    logits = self.model(input_ids, attention_mask)
    loss = self.criterion(logits, labels)

self.scaler.scale(loss).backward()
self.scaler.step(self.optimizer)
self.scaler.update()

Deep Q&A

Q1: Why does pre-training typically outperform training from scratch?

Theoretical explanation:
  1. Data efficiency: Pre-training leverages large-scale unlabeled data, learning common structures in the data
  2. Regularization: Pre-trained parameters serve as a prior, constraining the parameter space and preventing overfitting
  3. Optimization landscape: Pre-trained parameters lie in low-loss regions of the loss surface, making convergence easier during fine-tuning

Experimental evidence:
  • BERT outperforms from-scratch models on 8 out of 9 GLUE benchmark tasks
  • ImageNet pre-training improves COCO object detection by 10+ mAP

Q2: Why does contrastive learning need negative samples?

Contrastive learning aims to learn a representation space where similar samples are close and dissimilar samples are far apart. Negative samples provide repulsive force, preventing all samples from collapsing to a single point (model collapse).

Mathematically, SimCLR's loss can be decomposed as:
  • The first term $-\mathrm{sim}(z_i, z_j)/\tau$ pulls positive pairs closer
  • The second term $\log \sum_{k} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)$ includes the negative samples, pushing negative pairs apart

Without negative samples, the second term degenerates to a constant, and the model easily collapses.

Q3: Why does BERT use bidirectional encoding while GPT uses unidirectional encoding?

BERT: Bidirectional encoding can leverage contextual information, suitable for understanding tasks (classification, NER, QA)

GPT: Unidirectional encoding aligns with autoregressive generation, suitable for generation tasks (text generation, dialogue)

Experiments show: for understanding tasks, bidirectional > unidirectional; for generation tasks, unidirectional is more natural.

Q4: Why is warmup needed during fine-tuning?

In early fine-tuning, model parameters haven't adapted to the new task yet, and gradient variance is large. Jumping straight to a large learning rate can lead to:
  1. Gradient explosion: Some samples produce very large gradients, destroying pre-trained knowledge
  2. Parameter oscillation: The optimization trajectory oscillates violently and struggles to converge

Warmup gradually increases the learning rate, allowing a smooth transition to the new task. Mathematically, the warmup phase uses an effective learning rate that adapts to training progress: $\eta_t = \eta_{\max} \cdot \min(t / T_{\text{warmup}},\, 1)$.

Q5: How to choose fine-tuning learning rate?

Rule of thumb: fine-tuning learning rate should be 1-2 orders of magnitude smaller than pre-training.

  • Pre-training learning rate: around $10^{-4}$ to $10^{-3}$
  • Fine-tuning learning rate: around $2 \times 10^{-5}$ to $5 \times 10^{-5}$

Reason: Pre-trained parameters are already close to optimal, so fine-tuning only needs minor adjustments. Too large a learning rate destroys pre-trained knowledge.

In practice, use learning rate finder: start from small learning rate, gradually increase, observe loss curve, select learning rate where loss decreases fastest.

Q6: Which layers to freeze for best results?

Depends on similarity between task and pre-training data:

| Similarity | Data Amount | Recommended Strategy |
| --- | --- | --- |
| High | Few | Freeze bottom layers, fine-tune top layers |
| High | Many | Full fine-tuning |
| Low | Few | Freeze middle layers, fine-tune bottom and top layers |
| Low | Many | Full fine-tuning + discriminative learning rates |

Intuition: Bottom layers learn universal features (edges, textures, syntax), top layers learn task-specific features. High-similarity tasks reuse bottom features, low-similarity tasks need to adjust bottom features.

Q7: How to determine if model is overfitting?

Overfitting signals:
  1. Training loss decreases but validation loss increases (the most obvious signal)
  2. Training accuracy is very high but validation accuracy stagnates
  3. Predictions on training samples are very confident (output probabilities close to 0 or 1)

Solutions:
  1. Increase regularization: Increase dropout, weight decay
  2. Early stopping: Stop training when validation loss is lowest
  3. Data augmentation: Increase the diversity of training samples
  4. Reduce model capacity: Use smaller models or freeze more layers

Q8: How does mixed precision training ensure accuracy isn't lost?

Mixed precision training uses FP16 for storage and computation, but uses FP32 for critical steps:

  1. Loss scaling: Multiply loss by a large number (e.g., 1024), preventing FP16 underflow
  2. Master weights: Optimizer maintains FP32 weight copies
  3. Dynamic loss scaling: Automatically adjusts scaling factor, avoiding overflow

Mathematically, FP16's normal range is roughly $[6.1 \times 10^{-5},\, 65504]$ (subnormals extend it down to about $6 \times 10^{-8}$), while many gradients fall in the $10^{-8}$ to $10^{-4}$ range. Using FP16 directly causes small gradients to underflow to zero. Loss scaling (e.g., by $2^{10}$) magnifies gradients into FP16's representable range.
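The underflow problem is easy to demonstrate with NumPy's float16 (the factor 1024 is an illustrative scaling value):

```python
import numpy as np

grad = 1e-8                       # a typical tiny gradient value
print(np.float16(grad))           # underflows to 0.0 in FP16
scaled = np.float16(grad * 1024)  # loss scaling multiplies first, e.g. by 2^10
print(scaled)                     # now representable (as an FP16 subnormal)
# After the backward pass, the optimizer unscales in FP32, recovering ~1e-8.
```

This is exactly why GradScaler scales the loss before backward() and unscales the gradients before the optimizer step.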

Q9: How much data is needed for pre-training to be effective?

No unified answer, but some empirical rules:

  • NLP: At least hundreds of MB of text (e.g., Wikipedia dump ~4GB)
  • CV: At least millions of images (e.g., ImageNet 1.2M images)

The key isn't data quantity but data diversity. 10M images of the same category is worse than 1M images covering diverse categories.

Experiments show: when pre-training data increases by 10x, downstream task performance improves by about 2-5 percentage points (diminishing returns).

Q10: How to evaluate pre-trained model quality?

Three evaluation methods:

  1. Downstream task performance: Fine-tune on multiple tasks, compute average performance (e.g., GLUE benchmark)
  2. Representation quality: Evaluate if learned representations are meaningful (e.g., linear probing, nearest neighbor retrieval)
  3. Pre-training loss: Lower loss indicates better model (but not absolute)

Most reliable is downstream task performance, but costly. Linear probing is a fast evaluation method: freeze pre-trained model, only train a linear classifier. If accuracy is high, representation quality is good.

Q11: How to handle distribution shift between pre-training and fine-tuning?

Distribution shift is a common problem in pre-training. For example, BERT sees the [MASK] token during pre-training but not during fine-tuning.

Solutions:

  1. BERT's masking strategy: replace with a random token 10% of the time and keep the token unchanged 10% of the time, which alleviates the shift
  2. Domain-adaptive pre-training: Continue pre-training on target domain data
  3. Gradual unfreezing: Gradually unfreeze layers, allowing model to gradually adapt to new distribution

Theoretically, importance weighting can correct distribution shift:

$$w(x) = \frac{p_{\text{task}}(x)}{p_{\text{pre}}(x)}$$

But in practice, directly estimating this density ratio is very difficult.

Q12: How to allocate computational cost between pre-training and fine-tuning?

Typically pre-training accounts for over 90% of computational cost. For example, BERT-large pre-training requires:

  • Hardware: 64 TPU v3 (equivalent to 512 V100 GPUs)
  • Time: 4 days
  • Cost: About $10,000

While fine-tuning only requires:

  • Hardware: A single V100 GPU
  • Time: A few hours
  • Cost: About $10

Therefore, pre-train once, fine-tune many times is the most economical strategy. Large companies (like Google, OpenAI) pre-train general models and open-source them for community use.

References

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Devlin et al., NAACL 2019
    https://arxiv.org/abs/1810.04805

  2. Improving Language Understanding by Generative Pre-Training (GPT)
    Radford et al., OpenAI Technical Report 2018
    https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

  3. A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)
    Chen et al., ICML 2020
    https://arxiv.org/abs/2002.05709

  4. Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)
    He et al., CVPR 2020
    https://arxiv.org/abs/1911.05722

  5. Universal Language Model Fine-tuning for Text Classification (ULMFiT)
    Howard and Ruder, ACL 2018
    https://arxiv.org/abs/1801.06146

  6. RoBERTa: A Robustly Optimized BERT Pretraining Approach
    Liu et al., arXiv 2019
    https://arxiv.org/abs/1907.11692

  7. Parameter-Efficient Transfer Learning for NLP (Adapter)
    Houlsby et al., ICML 2019
    https://arxiv.org/abs/1902.00751

  8. LoRA: Low-Rank Adaptation of Large Language Models
    Hu et al., ICLR 2022
    https://arxiv.org/abs/2106.09685

  9. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
    Lan et al., ICLR 2020
    https://arxiv.org/abs/1909.11942

  10. Representation Learning with Contrastive Predictive Coding
    van den Oord et al., arXiv 2018
    https://arxiv.org/abs/1807.03748

  11. Understanding the Difficulty of Training Deep Feedforward Neural Networks
    Glorot and Bengio, AISTATS 2010
    http://proceedings.mlr.press/v9/glorot10a.html

  12. Scaling Laws for Neural Language Models
    Kaplan et al., arXiv 2020
    https://arxiv.org/abs/2001.08361

Summary

Pre-training and fine-tuning represent the most successful paradigm in transfer learning. This article derived its mathematical foundations from first principles, covering the Bayesian perspective (learning prior distributions) and the information-theoretic perspective (learning common structures), and analyzed the mathematics of contrastive learning (SimCLR, MoCo) and masked language models (BERT MLM) in detail.

For fine-tuning strategies, we discussed full fine-tuning, discriminative learning rates, layer freezing, and Adapters, providing theoretical explanations from regularization and low-rank update perspectives. Finally, we provided a complete BERT fine-tuning implementation with industrial-grade techniques including gradient accumulation, mixed-precision training, and learning rate scheduling.

Pre-training isn't a silver bullet - its effectiveness depends on the similarity between pre-training data and downstream tasks. In the next chapter, we'll delve into domain adaptation methods, addressing the problem of distribution mismatch between pre-training and downstream tasks.

  • Post title: Transfer Learning (2): Pre-training and Fine-tuning Techniques
  • Post author: Chen Kai
  • Create time: 2024-11-09 14:30:00
  • Post link: https://www.chenk.top/transfer-learning-2-pre-training-and-fine-tuning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.