Few-shot learning represents one of the most challenging problems in machine learning. Humans can rapidly learn new concepts from minimal examples - recognizing new species after seeing just a few images, or understanding new linguistic patterns from a handful of instances. Traditional deep learning models, however, require massive amounts of labeled data to train effectively and perform poorly in data-scarce scenarios.
The goal of few-shot learning is to learn classifiers from only a few examples per class (typically 1-10 samples). This requires models with powerful generalization and transfer capabilities - the ability to learn "how to learn" from known classes and quickly adapt to novel classes. This article derives the mathematical foundations of metric learning and meta-learning from first principles, explains classic methods like Siamese networks, Prototypical networks, and MAML in detail, and provides a complete Prototypical network implementation.
The Few-Shot Learning Challenge
Problem Definition
Few-shot learning typically adopts an N-way K-shot setting:
- N-way: classify among $N$ classes
- K-shot: only $K$ labeled training samples are available per class
For example, 5-way 1-shot means classifying among 5 classes with only 1 training sample per class.
Formally, we define:
- Support Set $S = \{(x_i, y_i)\}_{i=1}^{N \times K}$: the few labeled samples of the novel classes
- Query Set $Q = \{x_j\}_{j=1}^{M}$: unlabeled samples drawn from the same $N$ classes

The goal is to train a model that, conditioned on the support set $S$, correctly classifies the samples in the query set $Q$.
Why Is It Difficult?
- Data scarcity: $K$ samples per class are far from sufficient to learn a complex classifier
- Overfitting risk: models easily memorize specific support-set samples rather than learning generalizable features
- Inter-class similarity: Novel classes may be very similar to known classes, making discrimination difficult
Failure of Traditional Methods
Standard empirical risk minimization (ERM) fits parameters by minimizing the average training loss:

$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i)$$

With only $K$ samples per class, the empirical risk is a very poor estimate of the true risk, so directly applying ERM to the support set overfits severely.
Core Ideas of Few-Shot Learning
To learn from few samples requires leveraging prior knowledge. Few-shot learning's core approach:
- Learn priors from known classes: Train on numerous base classes
- Rapidly adapt to novel classes: Use learned priors to quickly adapt on novel classes
This is equivalent to learning a meta-learner: "Learning to Learn".
Metric Learning: Similarity-Based Classification
Metric learning's idea is to learn an embedding space where same-class samples are close and different-class samples are distant. During classification, query samples are compared with support set samples by distance, selecting the nearest class.
Siamese Networks: Twin Networks
Siamese networks are among the earliest metric learning methods, learning embedding spaces through contrastive loss.
Architecture
Siamese networks contain two weight-shared encoders $f_\theta$: a pair of inputs $(x_1, x_2)$ is mapped to embeddings $(f_\theta(x_1), f_\theta(x_2))$, and the distance between the embeddings measures the similarity of the pair.
Contrastive Loss
The contrastive loss for a pair $(x_1, x_2)$ with pair label $y$ ($y = 1$ for same class, $y = 0$ otherwise) is defined as:

$$\mathcal{L} = y \, d^2 + (1 - y) \max(0, m - d)^2$$

where $d = \|f_\theta(x_1) - f_\theta(x_2)\|_2$ is the embedding distance and $m > 0$ is a margin: same-class pairs are pulled together, while different-class pairs are pushed at least $m$ apart.
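This loss can be sketched in a few lines of PyTorch; the function name and default margin below are illustrative choices, not fixed by the method:

```python
import torch
import torch.nn.functional as F

# Sketch of the contrastive loss over a batch of embedding pairs.
def contrastive_loss(z1, z2, same, margin=1.0):
    """z1, z2: (B, D) embedding batches; same: (B,) with 1.0 for same-class pairs."""
    d = F.pairwise_distance(z1, z2)
    return (same * d.pow(2) + (1.0 - same) * F.relu(margin - d).pow(2)).mean()

z1, z2 = torch.randn(4, 16), torch.randn(4, 16)
same = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = contrastive_loss(z1, z2, same)
```

Identical same-class pairs contribute (essentially) zero loss, while far-apart different-class pairs beyond the margin also contribute zero, which is what keeps the embedding from collapsing.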
Few-Shot Classification
Given support set $S$, a query sample $x_q$ is assigned the label of its nearest support sample in embedding space:

$$\hat{y}_q = y_{i^*}, \quad i^* = \arg\min_i \|f_\theta(x_q) - f_\theta(x_i)\|_2$$
Prototypical Networks
Prototypical networks improve metric learning by learning class prototypes for classification.
Class Prototypes
Given class $c$ with support samples $S_c$, the prototype is the mean of their embeddings:

$$\mathbf{c}_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i)$$
Intuition: The prototype is the class's "center" in embedding space, representing typical features of that class.
Distance Metric
Prototypical networks use squared Euclidean distance between query embeddings and prototypes:

$$d(x_q, \mathbf{c}_c) = \|f_\theta(x_q) - \mathbf{c}_c\|_2^2$$
Classification and Loss
Classification probability is computed via a softmax over negative distances:

$$p(y = c \mid x_q) = \frac{\exp(-d(x_q, \mathbf{c}_c))}{\sum_{c'} \exp(-d(x_q, \mathbf{c}_{c'}))}$$

Training minimizes the negative log-probability of the true class, averaged over the query samples.
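The rule is simple enough to check numerically; the prototypes and query below are toy values:

```python
import torch

# Tiny numeric sketch: softmax over negative squared distances to prototypes.
prototypes = torch.tensor([[0.0, 0.0],
                           [4.0, 0.0]])        # two class prototypes in 2-D
query = torch.tensor([[1.0, 0.0]])             # closer to the first prototype
d2 = torch.cdist(query, prototypes, p=2) ** 2  # squared Euclidean distances
probs = torch.softmax(-d2, dim=1)
pred = probs.argmax(dim=1)                     # class 0, the nearer prototype
```

Because the squared distances are 1 and 9, nearly all probability mass lands on the nearer class.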
Theory of Prototypical Networks
Prototypical networks can be viewed as implementing nearest-centroid classification in a learned embedding space. With squared Euclidean distance, the resulting classifier is linear in the embedding.
Theorem: In embedding space, the decision boundary of Prototypical networks between any two classes is a hyperplane, i.e. the classifier is linear.
Proof: A query sample with embedding $z = f_\theta(x_q)$ is assigned to class $c$ rather than $c'$ when $\|z - \mathbf{c}_c\|^2 < \|z - \mathbf{c}_{c'}\|^2$. Expanding both sides, the $\|z\|^2$ terms cancel, leaving $2(\mathbf{c}_c - \mathbf{c}_{c'})^\top z > \|\mathbf{c}_c\|^2 - \|\mathbf{c}_{c'}\|^2$, which is linear in $z$; the decision boundary is therefore a hyperplane.
Matching Networks
Matching networks introduce attention mechanisms and memory augmentation to further improve few-shot learning performance.
Attention Kernel
Matching networks use an attention kernel to compute the similarity between a query sample and each support sample, typically a softmax over cosine similarities:

$$a(x_q, x_i) = \frac{\exp(c(f(x_q), g(x_i)))}{\sum_{j=1}^{|S|} \exp(c(f(x_q), g(x_j)))}$$

where $c(\cdot, \cdot)$ is cosine similarity and $f$, $g$ are the query and support embedding functions.
Prediction
The predicted class distribution for a query sample is an attention-weighted sum of support labels:

$$\hat{y}_q = \sum_{i=1}^{|S|} a(x_q, x_i) \, y_i$$

where $y_i$ is the one-hot label of support sample $x_i$.
Intuition: Support samples with higher similarity to the query contribute more to prediction.
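The attention-weighted prediction can be sketched as follows; the embeddings are random stand-ins for the learned encoders $f$ and $g$:

```python
import torch
import torch.nn.functional as F

# Sketch of Matching-network-style prediction: cosine-softmax attention over
# support samples, then a weighted sum of their one-hot labels.
n_support, n_classes, dim = 10, 5, 32
support_emb = torch.randn(n_support, dim)
support_labels = torch.arange(n_support) % n_classes   # 2 samples per class
query_emb = torch.randn(3, dim)

# Attention kernel a(x_q, x_i): softmax over cosine similarities
cos = F.cosine_similarity(query_emb.unsqueeze(1), support_emb.unsqueeze(0), dim=-1)
attn = torch.softmax(cos, dim=1)                       # (3, n_support)
one_hot = F.one_hot(support_labels, n_classes).float()
pred_dist = attn @ one_hot                             # weighted sum of labels
pred = pred_dist.argmax(dim=1)
```

Since each attention row sums to 1 and the labels are one-hot, each row of `pred_dist` is a valid probability distribution over the classes.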
Full Context Embeddings
Matching networks use a bidirectional LSTM to encode the support set, so that each sample's embedding $g(x_i, S)$ contains contextual information from the entire support set rather than being computed in isolation.
Relation Networks
Relation networks don't use fixed distance metrics (like Euclidean distance), but instead learn a metric function.
Architecture
Relation networks contain two modules:
- Embedding module $f_\varphi$: maps samples to the embedding space
- Relation module $g_\phi$: learns a similarity score between embeddings

Given query sample $x_q$ and support sample $x_i$, their embeddings are concatenated and passed to the relation module, which outputs a relation score:

$$r_{q,i} = g_\phi\big([f_\varphi(x_q), f_\varphi(x_i)]\big) \in [0, 1]$$
Loss Function
Relation networks treat similarity prediction as regression and use MSE loss, pushing the relation score toward 1 for same-class pairs and 0 otherwise:

$$\mathcal{L} = \sum_{q,i} \big(r_{q,i} - \mathbf{1}[y_q = y_i]\big)^2$$
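A minimal sketch of the relation module and its loss (layer sizes and names are illustrative; real relation networks use convolutional embeddings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Relation-module sketch: concatenated embeddings are mapped to a similarity
# score in [0, 1] and regressed against the match indicator with MSE.
dim = 32
relation = nn.Sequential(
    nn.Linear(2 * dim, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid())

q_emb = torch.randn(6, dim)                  # query embeddings
s_emb = torch.randn(6, dim)                  # paired support embeddings
match = torch.randint(0, 2, (6, 1)).float()  # 1 if same class, else 0

score = relation(torch.cat([q_emb, s_emb], dim=1))
loss = F.mse_loss(score, match)
```

The final sigmoid is what keeps the learned "distance" bounded in $[0, 1]$ so it can be regressed against the 0/1 match indicator.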
Why Learn the Metric?
Fixed distances (like Euclidean) assume the embedding space is isotropic, but different dimensions may have different importance. Learning the metric allows adaptive adjustment of distance computation.
Meta-Learning: Learning to Learn
Meta-learning's core idea is: learn across multiple tasks how to rapidly adapt to new tasks.
Formalization of Meta-Learning
Given a distribution of tasks $p(\mathcal{T})$, each task $\mathcal{T}_i$ comes with its own support set $S_i$ and query set $Q_i$.

The goal of meta-learning is to learn meta-parameters $\theta$ such that, after adapting on a task's support set, the model performs well on that task's query set:

$$\theta^* = \arg\min_\theta \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \left[ \mathcal{L}_{\mathcal{T}}^{Q}\big(\text{Adapt}(\theta, S_{\mathcal{T}})\big) \right]$$
MAML: Model-Agnostic Meta-Learning
Model-Agnostic Meta-Learning (MAML) is the most classic meta-learning algorithm, learning good initialization parameters so models can rapidly adapt to new tasks.
MAML Algorithm
Given task distribution $p(\mathcal{T})$, MAML alternates two loops:

- Inner loop (adaptation): for each sampled task $\mathcal{T}_i$, take one or a few gradient steps on the support set: $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}^{S}(\theta)$
- Outer loop (meta-update): update the initialization using the query loss at the adapted parameters: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}^{Q}(\theta_i')$
MAML Gradient Computation
MAML's key difficulty is that the outer gradient differentiates through the inner update, producing second-order terms:

$$\nabla_\theta \mathcal{L}_{\mathcal{T}_i}^{Q}(\theta_i') = \left(I - \alpha \nabla_\theta^2 \mathcal{L}_{\mathcal{T}_i}^{S}(\theta)\right) \nabla_{\theta'} \mathcal{L}_{\mathcal{T}_i}^{Q}(\theta_i')$$

Computational Complexity: materializing the full Hessian would require $O(|\theta|^2)$ memory, so frameworks compute Hessian-vector products implicitly, roughly doubling the cost of backpropagation; FOMAML drops the second-order term entirely.
MAML Intuition
MAML learns an initialization from which a few gradient steps reach good parameters for any task in the distribution; it does not seek parameters that are good for every task simultaneously.
Analogy: the initialization is like a base camp placed so that many different summits are each only a short climb away.
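The two loops can be sketched on a toy family of 1-D linear-regression tasks; the task family, step sizes, and all names below are illustrative. Passing `create_graph=True` to `torch.autograd.grad` is what keeps the inner-step graph alive so the outer loss can differentiate through it (the second-order term):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# MAML sketch on toy tasks y = a * x, each with a task-specific slope a.
def sample_task():
    a = torch.randn(1)                               # task-specific slope
    xs, xq = torch.randn(5, 1), torch.randn(5, 1)
    return (xs, a * xs), (xq, a * xq)                # (support, query)

model = nn.Linear(1, 1, bias=False)
alpha, beta = 0.05, 0.01                             # inner / outer step sizes
meta_opt = torch.optim.SGD(model.parameters(), lr=beta)

for step in range(200):
    meta_opt.zero_grad()
    for _ in range(4):                               # meta-batch of tasks
        (xs, ys), (xq, yq) = sample_task()
        w = model.weight
        # Inner loop: one adaptation step on the support set.
        loss_s = F.mse_loss(xs @ w.t(), ys)
        (grad,) = torch.autograd.grad(loss_s, w, create_graph=True)
        w_adapted = w - alpha * grad
        # Outer loop: query loss at the adapted parameters; backward()
        # accumulates second-order gradients into model.weight.grad.
        loss_q = F.mse_loss(xq @ w_adapted.t(), yq)
        loss_q.backward()
    meta_opt.step()
```

Replacing `create_graph=True` with `create_graph=False` (and detaching `grad`) turns this exact loop into FOMAML.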
Reptile: First-Order Meta-Learning
Reptile is a simplified version of MAML that uses only first-order gradients, making computation more efficient.
Reptile Algorithm
For each meta-iteration, Reptile:

1. Samples a task $\mathcal{T}_i$
2. Runs $k$ steps of plain SGD on $\mathcal{T}_i$ starting from $\theta$, obtaining $\theta_i'$
3. Updates the meta-parameters: $\theta \leftarrow \theta + \epsilon (\theta_i' - \theta)$

Intuition: Reptile moves meta-parameters toward each task's adapted parameters. After many iterations, $\theta$ converges to a point that is close, in expectation, to the solutions of all tasks.
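The update is simple enough to sketch on a toy family of 1-D regression tasks (all names and hyperparameters below are illustrative):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Reptile sketch on toy tasks y = a * x with a task-specific slope a.
def sample_task():
    a = torch.randn(1)
    x = torch.randn(10, 1)
    return x, a * x

model = nn.Linear(1, 1)
epsilon, inner_lr, k = 0.1, 0.05, 5

for step in range(100):
    x, y = sample_task()
    fast = copy.deepcopy(model)                  # clone current meta-parameters
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(k):                           # k steps of plain SGD on the task
        opt.zero_grad()
        F.mse_loss(fast(x), y).backward()
        opt.step()
    with torch.no_grad():                        # move theta toward adapted theta'
        for p, q in zip(model.parameters(), fast.parameters()):
            p += epsilon * (q - p)
```

Note that no gradient ever flows through the inner loop: the meta-update is a plain interpolation, which is why Reptile is first-order.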
Reptile vs MAML
| Method | Gradient Order | Computational Complexity | Performance |
|---|---|---|---|
| MAML | Second-order | High (requires Hessian) | Optimal |
| FOMAML | First-order (approximation) | Medium | Close to MAML |
| Reptile | First-order | Low | Slightly below MAML |
Reptile performs similarly to FOMAML in practice but with simpler implementation.
Theory of Meta-Learning
Meta-learning can be understood from a Bayesian perspective. Let each task's parameters $\theta_i$ be drawn from a shared prior $p(\theta \mid \Theta)$ governed by meta-parameters $\Theta$. Meta-training estimates $\Theta$, i.e. the prior, from many tasks; adapting to a new task is then posterior inference given its small support set. A good shared prior is what makes a few samples sufficient.
Episode Training: Simulating Few-Shot Scenarios
Few-shot learning training adopts episodic training, where each episode simulates a few-shot task.
Episode Sampling
Each episode contains:
1. Randomly sample $N$ classes from the base classes
2. For each sampled class, sample $K$ examples to form the support set $S$
3. Sample additional examples of the same classes to form the query set $Q$

Formally, an episode is a pair $E = (S, Q)$, constructed to mimic an $N$-way $K$-shot task exactly as it will appear at test time.
Episode Training Workflow
```python
for epoch in range(num_epochs):
    # sample an episode, compute the query loss given the support set, update
    ...
```
Intuition Behind Episode Training
Episode training exposes the model to few-shot scenarios during training, forcing it to learn how to generalize from few samples. This is a form of curriculum learning: training difficulty matches testing difficulty.
Complete Implementation: Prototypical Networks
Below is a complete Prototypical network implementation including episode sampling, distance computation, and support/query set partitioning.
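Since the full listing matters here, below is a self-contained sketch covering all three pieces. It trains on synthetic Gaussian blobs rather than a real image dataset, and the encoder architecture and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodeSampler:
    """Samples N-way K-shot episodes (support + query) from labeled tensors."""
    def __init__(self, data, labels, n_way, k_shot, q_query):
        self.data, self.labels = data, labels
        self.n_way, self.k_shot, self.q_query = n_way, k_shot, q_query
        self.classes = labels.unique()

    def sample_episode(self):
        chosen = self.classes[torch.randperm(len(self.classes))[:self.n_way]]
        sx, sy, qx, qy = [], [], [], []
        for new_label, c in enumerate(chosen):       # relabel classes 0..N-1
            idx = (self.labels == c).nonzero(as_tuple=True)[0]
            idx = idx[torch.randperm(len(idx))[:self.k_shot + self.q_query]]
            sx.append(self.data[idx[:self.k_shot]])
            qx.append(self.data[idx[self.k_shot:]])
            sy += [new_label] * self.k_shot
            qy += [new_label] * self.q_query
        return (torch.cat(sx), torch.tensor(sy), torch.cat(qx), torch.tensor(qy))

def compute_prototypes(embeddings, labels, n_way):
    """Mean embedding per (relabeled) class."""
    return torch.stack([embeddings[labels == c].mean(0) for c in range(n_way)])

class ProtoNet(nn.Module):
    def __init__(self, in_dim=64, emb_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def loss(self, sx, sy, qx, qy, n_way):
        prototypes = compute_prototypes(self.encoder(sx), sy, n_way)
        distances = torch.cdist(self.encoder(qx), prototypes, p=2)
        log_p = F.log_softmax(-distances ** 2, dim=1)  # softmax over -d^2
        acc = (log_p.argmax(dim=1) == qy).float().mean()
        return F.nll_loss(log_p, qy), acc

# Synthetic stand-in data: 20 base classes, 50 samples each.
torch.manual_seed(0)
labels = torch.arange(20).repeat_interleave(50)
data = torch.randn(1000, 64) + labels.float().unsqueeze(1)

model = ProtoNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sampler = EpisodeSampler(data, labels, n_way=5, k_shot=1, q_query=5)

for episode in range(200):
    sx, sy, qx, qy = sampler.sample_episode()
    loss, acc = model.loss(sx, sy, qx, qy, n_way=5)
    opt.zero_grad(); loss.backward(); opt.step()
```

For real benchmarks the linear encoder would be replaced by a convolutional backbone, but the episode logic is unchanged.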
Code Breakdown
Episode Sampling
EpisodeSampler implements the core sampling logic for
few-shot learning:
```python
def sample_episode(self):
    ...
```
Prototype Computation
compute_prototypes computes the prototype (mean) for
each class:
```python
for c in range(n_way):
    ...
```
Distance Computation
Using torch.cdist for efficient Euclidean distance
computation:
```python
distances = torch.cdist(query_embeddings, prototypes, p=2)
```
Advanced Extensions
Transductive Prototypical Networks
Standard Prototypical networks use only support set to compute prototypes. Transductive Prototypical Networks leverage query set information through semi-supervised learning.
Soft k-Means
Iteratively refine prototypes using query predictions:
1. Initialize prototypes from the support set
2. Compute query predictions $P(y = c \mid x_q)$ under the current prototypes
3. Update each prototype as the weighted mean of its support samples and the soft-assigned query samples
4. Repeat steps 2-3 until convergence
This is equivalent to applying soft k-means clustering in embedding space.
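The refinement loop can be sketched as follows, assuming precomputed embeddings (random stand-ins here); support labels stay hard while query assignments stay soft:

```python
import torch
import torch.nn.functional as F

# Sketch of soft k-means prototype refinement in embedding space.
def refine_prototypes(prototypes, s_emb, s_onehot, q_emb, n_iters=5):
    for _ in range(n_iters):
        d2 = torch.cdist(q_emb, prototypes) ** 2
        q_soft = torch.softmax(-d2, dim=1)            # soft query assignments
        assign = torch.cat([s_onehot, q_soft])        # (n_s + n_q, C)
        emb = torch.cat([s_emb, q_emb])               # (n_s + n_q, D)
        prototypes = (assign.t() @ emb) / assign.sum(0, keepdim=True).t()
    return prototypes

C, D = 5, 8
s_labels = torch.arange(10) % C                       # 2 support samples per class
refined = refine_prototypes(torch.randn(C, D), torch.randn(10, D),
                            F.one_hot(s_labels, C).float(), torch.randn(15, D))
```

Each iteration is one soft k-means step: an E-step (soft assignments) followed by an M-step (weighted means).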
Task-Dependent Adaptive Metric (TADAM)
TADAM conditions the metric on task context, making it adaptive to different task characteristics.
Task Embedding
Compute a task representation from the support set, for example by pooling the class prototypes: $\tau = \frac{1}{N} \sum_c \mathbf{c}_c$.
Task-Conditioned Feature Extraction
Modulate the feature extractor using the task embedding via Feature-wise Linear Modulation (FiLM), which scales and shifts intermediate features: $h \leftarrow \gamma(\tau) \odot h + \beta(\tau)$.
This allows the network to adapt its features to task-specific characteristics.
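A minimal FiLM layer sketch (dimensions are illustrative; in TADAM this modulates convolutional feature maps rather than flat vectors):

```python
import torch
import torch.nn as nn

# FiLM sketch: the task embedding predicts a per-channel scale and shift
# that modulate intermediate features.
class FiLM(nn.Module):
    def __init__(self, task_dim, n_channels):
        super().__init__()
        self.gamma = nn.Linear(task_dim, n_channels)
        self.beta = nn.Linear(task_dim, n_channels)

    def forward(self, features, task_emb):
        # features: (batch, channels); task_emb: (task_dim,)
        return self.gamma(task_emb) * features + self.beta(task_emb)

film = FiLM(task_dim=16, n_channels=32)
out = film(torch.randn(8, 32), torch.randn(16))   # same shape as the features
```

Because only $\gamma$ and $\beta$ depend on the task, the backbone weights are shared across tasks while the features themselves become task-conditioned.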
Meta-Learning with Latent Embedding Optimization (LEO)
LEO learns in a lower-dimensional latent space for better generalization.
Architecture
Encoder: maps input data to a low-dimensional latent code $z$
Relation Network: models dependencies between the support samples when producing the latent code
Decoder: generates task-specific classifier parameters from the latent code, $\theta = g(z)$
Classifier: applies the decoded parameters to classify query samples
Training
In latent space, perform gradient-based adaptation on support set, then evaluate on query set. This reduces overfitting by constraining adaptation to a low-dimensional space.
Comprehensive Q&A
Q1: How does Few-Shot Learning differ from Transfer Learning?
Connection: Both leverage existing knowledge for new tasks.
Differences:
| Dimension | Transfer Learning | Few-Shot Learning |
|---|---|---|
| Data Volume | Target task has substantial labeled data | Target task has minimal labeled data (1-10 samples) |
| Adaptation Method | Fine-tune pre-trained model | Rapid adaptation via metrics or meta-learning |
| Training Paradigm | Standard supervised learning | Episodic training |
Few-shot learning can be viewed as an extreme case of transfer learning where target task data is extremely scarce.
Q2: Why do Prototypical Networks use mean as prototype? Is there theoretical support?
Theoretical Support: Under Gaussian distribution assumptions, class prototypes are optimal Bayesian classifiers.
Proof sketch: Assume each class $c$ is Gaussian, $p(x \mid y = c) = \mathcal{N}(\mu_c, \sigma^2 I)$, with shared isotropic covariance and equal priors. The Bayes-optimal rule assigns $x$ to the class maximizing $p(x \mid y = c)$, which reduces to choosing the class with the nearest mean in Euclidean distance. Since the prototype is the empirical estimate of $\mu_c$ from the support set, nearest-prototype classification approximates the Bayes classifier under these assumptions.
Q3: Why does MAML require second-order gradients? Can it be avoided?
MAML requires second-order gradients because it differentiates through the inner-loop update: the adapted parameters $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}^{S}(\theta)$ themselves depend on $\theta$, so $\nabla_\theta \mathcal{L}^{Q}(\theta_i')$ contains the Hessian $\nabla_\theta^2 \mathcal{L}^{S}(\theta)$.
Avoidance Methods:
- FOMAML: Ignore second-order terms, use only first-order gradients
- Reptile: Directly move toward adapted parameters, no second-order gradients needed
Experiments show FOMAML and Reptile perform similarly to MAML but with much higher computational efficiency.
Q4: What's the fundamental difference between episode training and standard training?
Standard Training: Each batch contains samples from multiple classes; model learns discriminative boundaries for all classes.
Episode Training: Each episode contains only $N$ classes with $K$ support samples each; the model must classify query samples using nothing but that episode's support set.
Fundamental Difference:
- Standard training learns task-specific knowledge (which features distinguish which classes)
- Episode training learns meta-knowledge (how to rapidly learn new tasks)

Analogy:
- Standard training is like learning specific subjects (learning math, learning physics)
- Episode training is like learning how to learn (learning methodologies)
Q5: Why does Few-Shot Learning require many base classes?
Although target tasks (novel classes) have few samples, learning "how to learn" requires training on many tasks.
Data Requirements:
- Number of base classes: typically dozens to hundreds
- Samples per base class: typically hundreds
Intuition: Just as humans can learn new concepts from few examples because of accumulated life experience, few-shot learning models need to learn this capability on many base classes.
Experimental Evidence:
- Omniglot: 1200+ base classes
- miniImageNet: 64 base classes
- tieredImageNet: 351 base classes
More base classes lead to better few-shot learning performance.
Q6: Can Prototypical Networks be used for regression tasks?
Yes, but modifications are needed. In classification, prototypes are discrete (one per class); in regression, a continuous prototype space is needed.
Method 1: Kernel Regression
View the support embeddings as kernel centers and predict the weighted average of support targets:

$$\hat{y}_q = \sum_i \frac{k(f_\theta(x_q), f_\theta(x_i))}{\sum_j k(f_\theta(x_q), f_\theta(x_j))} \, y_i$$

where $k$ is a kernel such as an RBF.
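This can be sketched with an RBF kernel; the embeddings are random stand-ins and `bandwidth` is an illustrative hyperparameter:

```python
import torch

# Sketch of few-shot kernel regression: normalized RBF weights over the
# support embeddings, applied to the support targets.
def kernel_regress(q_emb, s_emb, s_targets, bandwidth=1.0):
    d2 = torch.cdist(q_emb, s_emb) ** 2
    w = torch.softmax(-d2 / (2 * bandwidth ** 2), dim=1)  # rows sum to 1
    return w @ s_targets

s_emb, s_targets = torch.randn(10, 8), torch.randn(10, 1)
pred = kernel_regress(torch.randn(3, 8), s_emb, s_targets)  # (3, 1)
```

Because the weights are a convex combination, every prediction lies between the minimum and maximum support target.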
Method 2: Conditional Neural Processes (CNP)
Learn a distribution over functions conditioned on the support set: a CNP encodes the support pairs $(x_i, y_i)$ into a context representation and directly outputs a predictive distribution $p(y_q \mid x_q, S)$, with no gradient-based adaptation at test time.
Q7: How to choose a Few-Shot Learning method?
A rough decision guide (rules of thumb rather than hard constraints):

- Many base classes and a classification task: Prototypical networks are a simple, strong baseline
- Need a learned, task-adaptive similarity: Relation networks or TADAM
- Need to adapt the whole model, or tasks beyond classification (regression, RL): MAML or Reptile
- Large domain shift between base and novel classes: pre-training plus fine-tuning, or cross-domain few-shot methods
Q8: What are the challenges of Few-Shot Learning in real applications?
- Domain Shift: base classes and novel classes have different distributions
  - Solution: domain adaptation + few-shot learning (Cross-Domain Few-Shot Learning)
- Class Imbalance: novel classes may have different sample counts
  - Solution: weighted loss, resampling
- Label Noise: annotation errors among few samples have a large impact
  - Solution: robust loss functions, denoising methods
- Computational Efficiency: episode training is slower than standard training
  - Solution: pre-training + limited episode fine-tuning
- Generalization: the model may overfit the base classes
  - Solution: increase base class diversity, regularization
Q9: How do Prototypical Networks differ from k-NN?
Prototypical networks can be viewed as k-NN with learned embedding space.
| Method | Distance Metric | Embedding Space | Prototype |
|---|---|---|---|
| k-NN | Fixed (Euclidean, cosine) | Original feature space | Each sample |
| Prototypical | Learned | Learned embedding space | Class mean |
Key Differences:
1. Embedding learning: Prototypical networks learn the embedding function $f_\theta$ end-to-end through episodic training, whereas k-NN operates in a fixed feature space
2. Aggregation: Prototypical networks keep one prototype (the class mean) per class, while k-NN stores and compares against every individual sample

Experiments: In the same embedding space, Prototypical networks only slightly outperform k-NN; the main advantage comes from learning the embedding.
Q10: Why is MAML's initialization important?
MAML learns initialization located in a flat region of the loss surface, enabling:
- Rapid Adaptation: Gradient descent in any direction can quickly reduce loss
- Strong Generalization: Flat regions correspond to better generalization (Sharp Minima vs Flat Minima)
Mathematically, expanding the MAML objective to second order shows that it minimizes the task loss plus a term aligning inner- and outer-loop gradients:

$$\mathcal{L}_{\text{MAML}}(\theta) \approx \mathcal{L}(\theta) - \alpha \, \nabla \mathcal{L}^{S}(\theta)^\top \nabla \mathcal{L}^{Q}(\theta) + O(\alpha^2)$$

i.e. it rewards initializations where support-set and query-set gradients point in similar directions, which is exactly what makes a few adaptation steps effective.
Q11: Can Few-Shot Learning be used in Reinforcement Learning?
Yes! Few-Shot Reinforcement Learning is an active research area.
Challenges:
1. Even lower sample efficiency (adaptation requires environment interaction)
2. Sparse rewards
3. Exploration-exploitation tradeoff

Methods:
1. MAML for RL: meta-learn policies across multiple tasks
2. Meta-RL with context: learn task representations and condition the policy on them
3. Model-based meta-RL: learn dynamics models and plan

Applications:
- Robots rapidly adapting to new tasks
- Game AI quickly learning new games
- Recommendation systems adapting to new users
Q12: How to evaluate Few-Shot Learning models?
Standard evaluation protocol:
Data Split:
- Base classes: meta-training
- Validation classes: hyperparameter selection
- Novel classes: final testing
Evaluation Metrics:
- Accuracy (primary)
- 95% confidence interval (report uncertainty)
- Per-class accuracy (check class imbalance)
Evaluation Steps:
```
for episode in test_episodes:
    sample N-way K-shot task from novel classes
    compute accuracy on query set
report: mean ± 95% confidence interval
```

Standard Benchmarks:
- Omniglot: 20-way 1-shot, 20-way 5-shot
- miniImageNet: 5-way 1-shot, 5-way 5-shot
- tieredImageNet: 5-way 1-shot, 5-way 5-shot
Note: Must report confidence intervals because few-shot learning has high variance.
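The reporting step can be sketched as follows; the per-episode accuracies here are simulated stand-ins for real query-set accuracies:

```python
import torch

# Sketch of the standard reporting protocol: mean accuracy with a 95%
# normal-approximation confidence interval over test episodes.
torch.manual_seed(0)
accs = 0.6 + 0.1 * torch.randn(600)              # one accuracy per episode
mean = accs.mean()
ci95 = 1.96 * accs.std() / len(accs) ** 0.5      # 1.96 * standard error
print(f"accuracy: {mean.item():.4f} ± {ci95.item():.4f}")
```

With 600 episodes the interval is already narrow, which is why benchmark papers typically evaluate on several hundred to a few thousand episodes.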
Q13: How does gradient-based meta-learning relate to pre-training?
Both learn transferable representations, but with different mechanisms:
Pre-training:
- Learns a fixed feature extractor on a large dataset
- Transfers via fine-tuning all or part of the parameters
- Adaptation: standard gradient descent

MAML:
- Learns an initialization optimized for rapid adaptation
- Transfers via a few gradient steps from that initialization
- Adaptation: few-step gradient descent from the learned initialization
Connection: Both can be viewed as learning good priors in Bayesian framework. Pre-training learns features (prior on function space), MAML learns initialization (prior on parameter space).
Q14: What is the relationship between Few-Shot Learning and Zero-Shot Learning?
Few-Shot Learning: learns novel classes from $K \geq 1$ labeled examples per class.

Zero-Shot Learning: learns novel classes from zero labeled examples, relying instead on semantic side information (attributes, word embeddings, text descriptions).
Unified View - Meta-Learning Spectrum:
- Zero-shot: no labeled examples, only semantic information
- One-shot: 1 labeled example per class
- Few-shot: 2-10 labeled examples per class
- Standard learning: many labeled examples
Zero-shot can be viewed as extreme few-shot where "support set" is semantic descriptions rather than labeled examples.
Q15: How to handle distribution shift between base and novel classes?
Problem: Base classes (e.g., cats, dogs) and novel classes (e.g., birds) may have very different distributions, hurting transfer.
Solutions:
- Domain-Adversarial Meta-Learning
- Add domain discriminator to learn domain-invariant features
- Minimize domain classification loss while maximizing task performance
- Feature-wise Transformation
- Learn affine transformations to align base and novel class features
- Use task embedding to predict transformation parameters
- Self-Supervised Pre-training
- Pre-train on large unlabeled dataset covering both base and novel class distributions
- Helps learn more general features
- Data Augmentation
- Augment base classes to simulate novel class characteristics
- Mixup, CutMix, domain randomization
- Cross-Domain Few-Shot Learning Benchmarks
- Train on miniImageNet, test on CUB birds
- Evaluate robustness to domain shift
Related Papers
- Siamese Neural Networks for One-shot Image Recognition. Koch et al., ICML Deep Learning Workshop 2015. https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf
- Prototypical Networks for Few-shot Learning. Snell et al., NeurIPS 2017. https://arxiv.org/abs/1703.05175
- Matching Networks for One Shot Learning. Vinyals et al., NeurIPS 2016. https://arxiv.org/abs/1606.04080
- Learning to Compare: Relation Network for Few-Shot Learning. Sung et al., CVPR 2018. https://arxiv.org/abs/1711.06025
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML). Finn et al., ICML 2017. https://arxiv.org/abs/1703.03400
- On First-Order Meta-Learning Algorithms (Reptile). Nichol et al., arXiv 2018. https://arxiv.org/abs/1803.02999
- A Closer Look at Few-shot Classification. Chen et al., ICLR 2019. https://arxiv.org/abs/1904.04232
- Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples. Triantafillou et al., ICLR 2020. https://arxiv.org/abs/1903.03096
- Learning to Learn with Conditional Class Dependencies. Bertinetto et al., ICLR 2019. https://arxiv.org/abs/1806.03961
- TADAM: Task dependent adaptive metric for improved few-shot learning. Oreshkin et al., NeurIPS 2018. https://arxiv.org/abs/1805.10123
- Meta-Learning with Differentiable Convex Optimization. Lee et al., CVPR 2019. https://arxiv.org/abs/1904.03758
- Generalizing from a Few Examples: A Survey on Few-Shot Learning. Wang et al., ACM Computing Surveys 2020. https://arxiv.org/abs/1904.05046
- Latent Embedding Optimization for Few-Shot Learning (LEO). Rusu et al., ICLR 2019. https://arxiv.org/abs/1807.05960
- Transductive Propagation Network for Few-shot Learning. Liu et al., arXiv 2019. https://arxiv.org/abs/1805.10002
Summary
Few-shot learning addresses one of deep learning's biggest bottlenecks: data scarcity. This article derived the mathematical foundations of metric learning (Siamese, Prototypical, Matching, Relation Networks) and meta-learning (MAML, Reptile) from first principles, providing detailed analysis of their architectures, loss functions, and optimization methods.
We saw that few-shot learning's core is leveraging prior knowledge: metric learning makes metrics transferable by learning embedding spaces, while meta-learning makes adaptation rapid by learning initialization or optimizers. Episode training is crucial - it exposes models to few-shot scenarios during training, teaching them "how to learn".
The complete Prototypical network implementation demonstrates core techniques including episode sampling, prototype computation, and distance metrics. Next chapter we'll explore knowledge distillation, studying how to transfer knowledge from large models to small models.
- Post title: Transfer Learning (4): Few-Shot Learning
- Post author: Chen Kai
- Create time: 2024-11-21 15:45:00
- Post link: https://www.chenk.top/transfer-learning-4-few-shot-learning/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless otherwise stated.