Why can a model trained on ImageNet quickly achieve usable performance on medical imaging? Why can BERT learn text classification from just hundreds of samples after pretraining? The essence of these phenomena is transfer learning — enabling models to transfer existing knowledge to new problems rather than starting from scratch every time.
In the deep learning era, transfer learning has become standard practice rather than an option. This article systematically covers the mathematical formalization, core concepts, taxonomy, feasibility analysis, and negative transfer issues, along with a complete 200+ line implementation of feature transfer with MMD domain adaptation.
Why We Need Transfer Learning
The Dilemma of Training From Scratch
Suppose we want to train a medical image diagnosis model. The traditional supervised learning paradigm requires:
- Massive labeled data: Deep neural networks typically need tens of thousands to millions of labeled samples to achieve good generalization
- Enormous computational resources: Training large models from random initialization requires hundreds to thousands of GPU hours
- Difficulty reusing domain knowledge: Even similar tasks (e.g., X-ray classification vs. CT image classification) require independent model training
However, real-world scenarios often face:
- Data scarcity: Some rare diseases have only a few hundred cases
- Expensive annotation: Medical image annotation requires professional doctors at extremely high costs
- Time urgency: Rapid model deployment needed during new disease outbreaks
These contradictions gave birth to transfer learning: Can we leverage models trained on large-scale data to quickly adapt to new tasks with scarce data?
The Intuition Behind Transfer Learning
Human learning naturally possesses transfer capability:
- People who can ride bicycles learn motorcycles faster
- Programmers who know Python don't start from zero when learning Java
- People who have seen cats can recognize "this is a feline" when first encountering a lion
This ability stems from shared underlying knowledge structures. Similarly, low-level features in deep neural networks (edges, textures) are highly reusable across different visual tasks, and high-level features (semantic concepts) also exhibit certain similarities. Transfer learning exploits this similarity.
Core Idea of Transfer Learning
Given a source domain $\mathcal{D}_S$ with source task $\mathcal{T}_S$, and a target domain $\mathcal{D}_T$ with target task $\mathcal{T}_T$:

Leverage knowledge from $\mathcal{D}_S$ and $\mathcal{T}_S$ to improve the learning of the target predictive function $f_T(\cdot)$ in $\mathcal{D}_T$.

Key assumption: There exists some correlation between source and target domains (though they are not required to be identical), making knowledge transfer possible.
Formal Definitions of Core Concepts
Domain
A domain is defined as a tuple:

$$\mathcal{D} = \{\mathcal{X}, P(X)\}$$

where $\mathcal{X}$ is the feature space and $P(X)$ is the marginal probability distribution over instances $X \in \mathcal{X}$.

Example:
- Source domain: Natural images (ImageNet), feature space is the RGB pixel space $\mathcal{X} = \mathbb{R}^{224 \times 224 \times 3}$, with the marginal distribution of natural photographs
- Target domain: Medical images (e.g., X-rays), with the same feature space but a different marginal distribution
Task
A task is defined as:

$$\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$$

where $\mathcal{Y}$ is the label space and $f(\cdot)$ is the predictive function learned from data. For supervised learning, the task also includes the conditional probability distribution $P(Y|X)$, since $f(x)$ can be viewed as an estimate of $P(y|x)$.

Example:
- Task 1: ImageNet 1000-class classification, $\mathcal{Y} = \{1, 2, \dots, 1000\}$
- Task 2: Medical image diagnosis, with a much smaller label space (e.g., $\mathcal{Y} = \{\text{normal}, \text{abnormal}\}$)
Source Domain and Target Domain
Transfer learning setup:

- Source domain: $\mathcal{D}_S = \{\mathcal{X}_S, P_S(X)\}$, with source task $\mathcal{T}_S = \{\mathcal{Y}_S, f_S(\cdot)\}$
- Target domain: $\mathcal{D}_T = \{\mathcal{X}_T, P_T(X)\}$, with target task $\mathcal{T}_T = \{\mathcal{Y}_T, f_T(\cdot)\}$

Key differences (any of which may hold):
- $\mathcal{X}_S \neq \mathcal{X}_T$ (different feature spaces)
- $P_S(X) \neq P_T(X)$ (different marginal distributions)
- $\mathcal{Y}_S \neq \mathcal{Y}_T$ (different label spaces)
- $P_S(Y|X) \neq P_T(Y|X)$ (different conditional distributions)

Transfer learning does not require source and target domains to be identical; this is precisely its value.
Mathematical Definition of Transfer Learning
According to the seminal survey by Pan and Yang (2010)1, transfer learning is formally defined as:
Given a source domain $\mathcal{D}_S$ and learning task $\mathcal{T}_S$, and a target domain $\mathcal{D}_T$ and learning task $\mathcal{T}_T$, transfer learning aims to improve the learning of the target predictive function $f_T(\cdot)$ in $\mathcal{D}_T$ using the knowledge in $\mathcal{D}_S$ and $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
Taxonomy of Transfer Learning
Based on the availability of labeled data in source and target domains, transfer learning is categorized into three types2:
Inductive Transfer Learning
Definition: Source and target tasks are different ($\mathcal{T}_S \neq \mathcal{T}_T$), and at least some labeled data is available in the target domain.

Mathematical description:
- Source domain: labeled data $\{(x_i^S, y_i^S)\}_{i=1}^{n_S}$
- Target domain: labeled data $\{(x_j^T, y_j^T)\}_{j=1}^{n_T}$, typically with $n_T \ll n_S$
Application scenarios:
- ImageNet pretrained model → Medical image classification (different tasks)
- General language model → Sentiment analysis (different tasks)
Transductive Transfer Learning
Definition: Source and target tasks are the same ($\mathcal{T}_S = \mathcal{T}_T$) but the domains differ ($\mathcal{D}_S \neq \mathcal{D}_T$); labels exist only in the source domain.

Mathematical description:
- Source domain: abundant labeled data $\{(x_i^S, y_i^S)\}_{i=1}^{n_S}$
- Target domain: only unlabeled data $\{x_j^T\}_{j=1}^{n_T}$
Application scenarios:
- Synthetic data → Real data (e.g., GTA V game scenes → Real street views)
- Sentiment analysis across product categories: Book reviews → Electronics reviews
Unsupervised Transfer Learning
Definition: Both source and target domains lack labeled data; transfer focuses on intrinsic data structure.
Mathematical description:
- Both source and target domains have only unlabeled data $\{x_i^S\}$ and $\{x_j^T\}$
- The goal is to transfer intrinsic structure (clusters, manifolds, representations) rather than label predictors
Typical methods:
1. Self-supervised learning: Learn general features through proxy tasks (rotation prediction, contrastive learning)
2. Deep clustering: Transfer clustering structures across domains
Application scenarios:
- Word vector transfer in NLP (Word2Vec trained on general corpora, applied to specific domains)
- Self-supervised pretraining for images (MoCo, SimCLR)
Core Assumptions of Transfer Learning
Relatedness Assumption
Transfer learning is predicated on some correlation existing between source and target domains. This can be formalized as requiring the domains to be close under some divergence measure: $d(\mathcal{D}_S, \mathcal{D}_T) \leq \epsilon$.

Common similarity measures:
1. Feature space similarity: Degree of overlap between $\mathcal{X}_S$ and $\mathcal{X}_T$
2. Distribution divergence: KL divergence, Maximum Mean Discrepancy (MMD), Wasserstein distance
3. Task relatedness: Semantic similarity of the label spaces $\mathcal{Y}_S$ and $\mathcal{Y}_T$
Shared Representation Assumption
There exists a shared feature representation $\phi: \mathcal{X} \to \mathcal{Z}$ such that the transformed source and target distributions are close, $P_S(\phi(X)) \approx P_T(\phi(X))$, while $\phi$ remains discriminative for both tasks.
The Problem of Negative Transfer
Definition of Negative Transfer
When source domain knowledge not only fails to improve but actually harms target domain performance, it's called negative transfer3:

$$\epsilon_T(f_{S \to T}) > \epsilon_T(f_T)$$

where $f_{S \to T}$ is the model trained with transfer and $f_T$ is a model trained on the target domain alone.
Causes of Negative Transfer
1. Excessive Domain Divergence
If the distribution difference between source and target domains exceeds a threshold, $d(\mathcal{D}_S, \mathcal{D}_T) > \delta$, the transferred features mislead rather than help the target learner.
Example: Transferring a natural image-trained model to hand-drawn sketches — due to completely different textures and colors, pretrained features may be entirely ineffective.
2. Task Conflict
Source and target tasks have inherent conflicts. Let the optimal solution for the source task be $\theta_S^*$ and for the target task $\theta_T^*$; when these optima are far apart in parameter space, initializing from $\theta_S^*$ can steer optimization toward a poor solution for the target task.
Example: Transferring an English sentiment analysis model to Chinese, where Chinese negation expressions and irony differ vastly from English.
3. Overfitting to Source Domain
The model overfits to the source domain, learning noise patterns specific to the source rather than common knowledge. Put formally, it fits the source-specific conditional $P_S(Y|X)$, including its idiosyncrasies, rather than structure shared with $P_T(Y|X)$, so its source-domain advantage fails to carry over.
Avoiding Negative Transfer
- Measure domain similarity: Use MMD, A-distance, or other metrics to estimate transfer feasibility
- Selective transfer: Only transfer low-level general features; retrain high-level task-specific layers
- Regularization constraints: Add regularization during target domain fine-tuning to limit parameter deviation from source domain
- Ensemble methods: Combine predictions from scratch training and transfer learning to reduce single-strategy risks
Quantitative Analysis of Transfer Feasibility
Ben-David Bound
Ben-David et al.4 theoretically analyzed domain adaptation generalization bounds. Let $h \in \mathcal{H}$ be a hypothesis with source error $\epsilon_S(h)$ and target error $\epsilon_T(h)$. Then

$$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda$$

where $d_{\mathcal{H}\Delta\mathcal{H}}$ is the $\mathcal{H}\Delta\mathcal{H}$-divergence between the two domains and $\lambda = \min_{h \in \mathcal{H}} [\epsilon_S(h) + \epsilon_T(h)]$ is the error of the ideal joint hypothesis.

Interpretation: Target domain error is controlled by three terms:
1. Source domain error (reducible through training)
2. Domain divergence (reducible through domain adaptation)
3. Task relatedness $\lambda$ (determined by the nature of the problem, unchangeable)

If $\lambda$ is large, i.e., no single hypothesis performs well on both domains, then even perfect domain alignment cannot make transfer succeed.
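To make the bound concrete, here is a tiny numeric sketch; the error and divergence values are hypothetical, chosen only to show how the three terms combine:

```python
def bendavid_upper_bound(source_error, hdh_divergence, ideal_joint_error):
    """Upper bound on target error: eps_S(h) + 0.5 * d_HdH + lambda."""
    return source_error + 0.5 * hdh_divergence + ideal_joint_error

# Hypothetical numbers: well-trained source model (5% error),
# moderate domain gap (d = 0.2), closely related tasks (lambda = 0.03).
print(bendavid_upper_bound(0.05, 0.2, 0.03))  # ≈ 0.18
```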
Maximum Mean Discrepancy (MMD)
MMD is a common metric for measuring distributional differences5:

$$\text{MMD}^2(P, Q) = \left\| \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)] \right\|_{\mathcal{H}}^2$$

where $\phi$ maps samples into a reproducing kernel Hilbert space $\mathcal{H}$. With a kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle$, the empirical estimate is

$$\widehat{\text{MMD}}^2 = \frac{1}{n^2}\sum_{i,j} k(x_i, x_j) + \frac{1}{m^2}\sum_{i,j} k(y_i, y_j) - \frac{2}{nm}\sum_{i,j} k(x_i, y_j)$$

In practice, MMD can serve as a domain adaptation loss added to the task loss:

$$\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda \cdot \text{MMD}^2\big(\phi(X_S), \phi(X_T)\big)$$
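As an illustration, a minimal NumPy implementation of the biased empirical MMD² with an RBF kernel; the bandwidth `gamma` and the toy Gaussian samples are arbitrary choices for the demo:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Pairwise RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * sq)

def mmd2(xs, xt, gamma=1.0):
    """Biased empirical MMD^2 between samples xs ~ P and xt ~ Q."""
    n, m = len(xs), len(xt)
    return (rbf_kernel(xs, xs, gamma).sum() / n**2
            + rbf_kernel(xt, xt, gamma).sum() / m**2
            - 2 * rbf_kernel(xs, xt, gamma).sum() / (n * m))

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
print(same, shifted)  # the shifted pair yields a much larger MMD^2
```

In a deep network, the same quantity would be computed on the feature extractor's outputs and added to the classification loss, weighted by $\lambda$.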
Transfer Learning vs. Related Concepts
Transfer Learning vs. Multi-Task Learning
| Dimension | Transfer Learning | Multi-Task Learning |
|---|---|---|
| Goal | Optimize target domain performance | Optimize all tasks simultaneously |
| Training | Sequential (source then target) | Parallel (simultaneous) |
| Data distribution | Source and target can differ | Typically assumes related tasks |
| Typical use | Pretrain-finetune | Multi-head network with shared encoder |
Transfer Learning vs. Meta-Learning
| Dimension | Transfer Learning | Meta-Learning |
|---|---|---|
| Learning goal | Transfer specific knowledge | Learn how to learn |
| Training data | Single or few source domains | Many diverse tasks |
| Adaptation speed | Requires some fine-tuning | Fast adaptation (few-shot) |
| Theoretical framework | Statistical learning | Optimization theory |
Transfer Learning vs. Domain Generalization
| Dimension | Transfer Learning | Domain Generalization |
|---|---|---|
| Test-time info | Access to target domain data | Target domain unknown |
| Methodology | Domain adaptation (use target data) | Learn domain-invariant features |
| Challenge | Domain alignment | Generalize to unseen domains |
Complete Implementation: Feature Transfer Example
Below is a complete example demonstrating the basic transfer learning workflow: domain adaptation on the Office-31 dataset, transferring an Amazon domain model to the Webcam domain.
Experimental Setup
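Below is a heavily condensed sketch of such a workflow, not the original listing: synthetic source/target data with a rotation + translation shift, a nearest-centroid classifier trained on the source, and simple first-moment feature alignment standing in for the gradient-based MMD loss. All function names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_domain(n_per_class, rotation_deg=0.0, shift=(0.0, 0.0)):
    """Two Gaussian classes in 2-D; the target domain is a rotated and
    translated copy of the source distribution."""
    t = np.deg2rad(rotation_deg)
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    X = np.vstack([rng.normal([0.0, 0.0], 0.5, (n_per_class, 2)),
                   rng.normal([2.0, 2.0], 0.5, (n_per_class, 2))])
    y = np.repeat([0, 1], n_per_class)
    return X @ R.T + np.asarray(shift), y

def fit_centroids(X, y):
    """Nearest-centroid "classifier": one prototype per class."""
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def accuracy(centroids, X, y):
    pred = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
    return float((pred == y).mean())

Xs, ys = make_domain(200)
Xt, yt = make_domain(200, rotation_deg=10.0, shift=(5.0, 5.0))

centroids = fit_centroids(Xs, ys)
acc_raw = accuracy(centroids, Xt, yt)

# First-moment alignment: shift target features to the source mean
# (label-free, like the MMD term, but matching means only).
Xt_aligned = Xt - Xt.mean(axis=0) + Xs.mean(axis=0)
acc_aligned = accuracy(centroids, Xt_aligned, yt)
print(f"raw: {acc_raw:.2f}  aligned: {acc_aligned:.2f}")
```

The full implementation replaces the mean matching with an MMD loss minimized jointly with the source classification loss, then fine-tunes on a few labeled target samples.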
Code Explanation
Core components:
1. Data generation: Simulate distribution shift between source and target (rotation + translation)
2. MMD computation: Calculate domain distance using an RBF kernel
3. Two-stage training:
   - Stage 1: Source classification + MMD alignment
   - Stage 2: Target domain fine-tuning
4. Visualization: t-SNE feature space, performance comparison, confusion matrices
Key parameters:
- `lambda_mmd=0.5`: Controls domain adaptation strength
- `epochs=100`: Pretraining epochs
- Fine-tuning epochs: 50 (using few labeled target samples)
Frequently Asked Questions
Q1: Is transfer learning always better than training from scratch?
Not necessarily. Transfer learning effectiveness depends on:
1. Domain relatedness: The more similar source and target domains, the better the results
2. Data quantity: With extremely few target samples (<100), transfer learning has clear advantages; with abundant data (>100K), training from scratch may be better
3. Task relatedness: Excessive task differences lead to negative transfer
Rule of thumb: Consider transfer learning when target domain data < 10% of source domain data.
Q2: How to select a source domain?
Selection criteria:
1. Domain similarity: Choose ImageNet for vision, BERT/GPT for NLP
2. Data scale: Larger source data is better (million-scale+)
3. Task relatedness: Classification transfers to classification, detection to detection
Visualization tools: Use t-SNE to compare source and target feature distributions; MMD < 0.1 typically indicates transferability.
Q3: Which layers of a pretrained model should be frozen?
General guidelines:
- CV models: Freeze the first 3-4 layers (edge and texture features), fine-tune later layers
- NLP models: Usually fine-tune all layers but with a reduced learning rate (1/10 of the source domain lr)
- Small-data scenarios: Only fine-tune the last 1-2 layers to avoid overfitting
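The CV guideline above can be sketched framework-agnostically. The `freeze_plan` helper and the ResNet-style stage names below are illustrative assumptions; in PyTorch the resulting plan would be applied with `param.requires_grad_(flag)`:

```python
def freeze_plan(param_names, n_frozen_stages,
                stage_prefixes=("conv1", "bn1", "layer1", "layer2", "layer3", "layer4")):
    """Map each parameter name to a trainable flag, freezing the first
    n_frozen_stages stages (ResNet-style naming assumed)."""
    frozen = set(stage_prefixes[:n_frozen_stages])
    return {name: name.split(".")[0] not in frozen for name in param_names}

names = ["conv1.weight", "bn1.weight", "layer1.0.conv1.weight",
         "layer4.1.conv2.weight", "fc.weight", "fc.bias"]
# Freeze the stem + layer1 (early, general features); train the rest.
plan = freeze_plan(names, 3)
```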
Experimental strategy: Progressive Unfreezing — gradually unfreeze more layers starting from the top.
Q4: How to detect negative transfer?
Detection methods:
1. Baseline comparison: Transfer learning accuracy < training-from-scratch accuracy
2. Loss curves: Loss increases instead of decreasing during fine-tuning
3. Domain distance: MMD or A-distance too large (> 0.5)
Remedies:
- Only transfer shallow features
- Increase target domain data weight
- Use adversarial domain adaptation
Q5: How does transfer learning handle different label spaces?
Three strategies:
1. Zero-shot transfer: Use semantic embeddings (e.g., Word2Vec) to map labels into a shared space
2. Partial transfer: Only transfer shared classes, ignore source-specific classes
3. Open-set transfer: Introduce an "unknown" class to identify new target classes
Example: ImageNet (1000 classes) → Medical imaging (5 classes): keep the 4096-dimensional penultimate features and replace the final softmax layer.
Q6: How to evaluate transfer learning effectiveness?
Evaluation metrics:
1. Accuracy improvement: $\Delta = \text{Acc}_{\text{transfer}} - \text{Acc}_{\text{scratch}}$
2. Sample efficiency: Ratio of target samples needed to achieve the same accuracy
3. Training efficiency: Convergence speed (transfer typically 10x faster)
4. A-distance: $d_A = 2(1 - 2\epsilon)$, where $\epsilon$ is the domain classifier error rate
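The A-distance is trivial to compute once a classifier has been trained to distinguish source from target samples; a minimal sketch:

```python
def a_distance(domain_classifier_error):
    """Proxy A-distance d_A = 2 * (1 - 2 * epsilon), where epsilon is the
    error of a source-vs-target domain classifier."""
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)

print(a_distance(0.5))   # 0.0  -> domains indistinguishable (easy transfer)
print(a_distance(0.05))  # 1.8  -> domains easily separated (risky transfer)
```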
Q7: What's the difference between transfer learning and data augmentation?
| Dimension | Transfer Learning | Data Augmentation |
|---|---|---|
| Knowledge source | External source domain | Current dataset |
| Method | Model initialization/feature alignment | Sample transformation |
| Applicable scenarios | Data scarcity | Improve generalization |
| Computational cost | Requires pretraining | Real-time generation |
Both can be combined: First use transfer learning for good initialization, then use data augmentation for improved robustness.
Q8: How to implement cross-modal transfer (e.g., image → text)?
Key techniques:
1. Shared embedding space: Map images and text into the same vector space (CLIP)
2. Contrastive learning: Maximize similarity of matched pairs, minimize similarity of unmatched pairs
3. Generative models: Use VAE/GAN to learn cross-modal mappings
Loss function (CLIP-style contrastive objective over a batch of $N$ matched image-text pairs):

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\text{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\text{sim}(v_i, t_j)/\tau\big)}$$

where $v_i$ and $t_i$ are the image and text embeddings, $\text{sim}$ is cosine similarity, and $\tau$ is a temperature parameter.
Q9: How to transfer learning to small devices (e.g., phones)?
Model compression + transfer learning:
1. Knowledge distillation: Large model (teacher) → Small model (student)
2. Pruning: Remove redundant parameters
3. Quantization: FP32 → INT8
4. Lightweight architectures: MobileNet, EfficientNet
Workflow: Large model pretraining → Distill to small model → Target domain fine-tuning
Q10: What theoretical guarantees exist for transfer learning?
The Ben-David bound provides generalization guarantees:

$$\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda$$
Theoretical insight: Successful transfer requires good source training + small domain gap + high task relatedness.
Q11: How to address catastrophic forgetting?
When a model forgets source knowledge during target fine-tuning, this is called catastrophic forgetting. Solutions:
1. Elastic Weight Consolidation (EWC): Add a regularization term protecting important parameters
2. Progressive fine-tuning: Gradually unfreeze layers, preserving low-level features
3. Memory replay: Mix a small amount of source data into training
4. Knowledge distillation: Keep the pretrained model as a teacher
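The EWC idea can be sketched as a quadratic penalty weighted by (diagonal) Fisher information; the parameter and Fisher values below are hypothetical:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC regularizer: penalize moving parameters that were important
    (high Fisher information) for the source task."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -2.0])  # parameters after source training
fisher = np.array([10.0, 0.1])      # first parameter is "important"

# Moving the important parameter is penalized ~100x more
# than moving the unimportant one by the same amount.
print(ewc_penalty(np.array([1.5, -2.0]), theta_star, fisher))  # 1.25
print(ewc_penalty(np.array([1.0, -1.5]), theta_star, fisher))  # 0.0125
```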
Q12: How to do semi-supervised transfer learning?
Leveraging unlabeled data:
1. Pseudo-labeling: Use the source model to label unlabeled target data
2. Consistency regularization: Augmented samples should have consistent predictions
3. Self-training: Iterative self-improvement on increasingly confident pseudo-labels
Loss function:

$$\mathcal{L} = \mathcal{L}_{\text{sup}}(X_L, Y_L) + \lambda \cdot \mathcal{L}_{\text{unsup}}(X_U, \hat{Y}_U)$$

where $(X_L, Y_L)$ are the labeled samples, $X_U$ the unlabeled target samples, and $\hat{Y}_U$ their pseudo-labels or consistency targets.
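Pseudo-labeling is typically combined with a confidence threshold so that only reliable predictions enter the unsupervised term; a minimal sketch, where the probability matrix and threshold are illustrative:

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    """Keep only unlabeled samples whose max predicted probability passes
    the confidence threshold; return their indices and hard labels."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

probs = np.array([[0.97, 0.03],   # confident class 0 -> kept
                  [0.60, 0.40],   # low confidence   -> dropped
                  [0.05, 0.95]])  # confident class 1 -> kept
idx, labels = pseudo_label(probs)
print(idx, labels)  # [0 2] [0 1]
```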
Summary
This article systematically introduced the fundamentals and core concepts of transfer learning:
- Motivation: Addressing data scarcity and expensive training
- Core concepts: Formal definitions of domain, task, source/target domains
- Taxonomy: Inductive, transductive, and unsupervised transfer
- Negative transfer: Causes, detection, and avoidance
- Theoretical analysis: Ben-David bound, MMD, and other feasibility criteria
- Practical code: Complete feature transfer + MMD domain adaptation implementation
Transfer learning is not a silver bullet, but in scenarios with data scarcity, computational constraints, and rapid deployment needs, it's one of the most effective technical approaches. The next chapter will delve into pretraining and fine-tuning techniques, covering classic paradigms from ImageNet to BERT.
References
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.↩︎
Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(1), 1-40.↩︎
Rosenstein, M. T., Marx, Z., Kaelbling, L. P., & Dietterich, T. G. (2005). To transfer or not to transfer. NIPS Workshop on Transfer Learning.↩︎
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1), 151-175.↩︎
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13, 723-773.↩︎
- Post title:Transfer Learning (1): Fundamentals and Core Concepts
- Post author:Chen Kai
- Create time:2024-11-03 09:00:00
- Post link:https://www.chenk.top/transfer-learning-1-fundamentals-and-core-concepts/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.