Zero-Shot Learning (ZSL) is a machine learning paradigm for recognizing classes never seen during training. Humans possess powerful zero-shot learning abilities — even without seeing a zebra before, we can recognize it through descriptions like "looks like a horse but with black and white stripes." Lampert et al.'s pioneering 2009 paper "Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer" introduced this capability to computer vision, launching zero-shot learning research. Zero-shot learning has important applications in long-tail distributions, rapid new class adaptation, and low-resource scenarios, but also faces many challenges like semantic gaps, domain shift, and hubness problems.
This article derives the mathematical foundations of zero-shot learning from first principles, explains construction of attribute representations and semantic embedding spaces, details compatibility function design and optimization, deeply analyzes principles of traditional discriminative ZSL and modern generative ZSL (f-CLSWGAN, f-VAEGAN, etc.), introduces bias calibration methods for generalized zero-shot learning (GZSL), and provides complete code implementations (including attribute learning, visual-semantic mapping, conditional generative models, etc.). We'll see that zero-shot learning essentially learns a cross-modal mapping from visual space to semantic space, bridging seen and unseen classes through auxiliary information (attributes, word embeddings, etc.).
Motivation for Zero-Shot Learning
From Closed-World to Open-World: Long-Tail Distribution Challenge
Traditional supervised learning assumes training and test sets come from the same class set — the Closed-World Assumption. But the real world is Open-World:
- ImageNet has 1000 classes, but reality has millions of object types
- Animal Recognition: Biologists discovered ~1 million animal species, training sets cover very few
- Medical Diagnosis: Rare disease samples are scarce but still need recognition
A more severe problem is Long-Tail Distribution: A few classes have many samples (head), many classes have few samples (tail).
Example (iNaturalist dataset):
- Top 10% of classes account for 60% of total samples
- Bottom 50% of classes account for only 5% of total samples
Adequately annotating tail classes is extremely costly. Zero-shot learning provides a solution: leverage semantic descriptions of classes (like attributes, text descriptions, knowledge graphs) to recognize them without labeled images.
Formal Definition of Zero-Shot Learning
Notation:
- Seen Classes: $\mathcal{Y}^s = \{y_1, \dots, y_S\}$, with labeled training data $\mathcal{D}^s = \{(x_i, y_i)\}_{i=1}^{N}$, $y_i \in \mathcal{Y}^s$
- Unseen Classes: $\mathcal{Y}^u = \{y_{S+1}, \dots, y_{S+U}\}$, with no training images, and $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$

Auxiliary Information: Each class $y \in \mathcal{Y}^s \cup \mathcal{Y}^u$ has a semantic vector $a(y) \in \mathbb{R}^{d_s}$ (attributes, word embeddings, etc.).

Zero-Shot Learning Task:
- Training Phase: Given seen class data $\mathcal{D}^s$ and the semantic vectors $\{a(y)\}$, learn a model
- Test Phase: Classify samples from unseen classes, i.e., learn $f: \mathcal{X} \to \mathcal{Y}^u$

This is Conventional Zero-Shot Learning. A more realistic variant is Generalized Zero-Shot Learning (GZSL): at test time, samples may come from either class set, so the classifier must handle $f: \mathcal{X} \to \mathcal{Y}^s \cup \mathcal{Y}^u$.
Mathematical Perspective on ZSL: Knowledge Transfer
Zero-shot learning's core is knowledge transfer: how to transfer knowledge learned from seen classes to unseen classes?
Key Assumption: Classes are related through a shared semantic space. Let:
- $\theta(x) \in \mathbb{R}^{d_v}$ be the visual feature of image $x$ (e.g., extracted by a CNN)
- $a(y) \in \mathbb{R}^{d_s}$ be the semantic vector of class $y$

Zero-shot learning assumes the existence of a compatibility function $F: \mathbb{R}^{d_v} \times \mathbb{R}^{d_s} \to \mathbb{R}$, with the prediction rule:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}^u} F(\theta(x), a(y))$$
Intuition: Compatibility function measures match between visual features and semantic descriptions. Learn this function on seen classes, then generalize to unseen classes.
Attribute Representation: Describing Class Semantics
Attributes are the most commonly used semantic representation form in zero-shot learning.
Attribute Definition and Construction
Attributes are high-level semantic features describing classes, like:
- Color: Black, white, brown
- Shape: Round, elongated
- Texture: Furry, smooth, striped
- Parts: Has wings, has tail, has four legs
Each class is represented by an attribute vector:

$$a(y) = [a_1(y), a_2(y), \dots, a_M(y)] \in \{0, 1\}^M \text{ (binary) or } [0, 1]^M \text{ (continuous)}$$

Example (Animals with Attributes dataset, 50 animal classes, 85 attributes):
- Zebra: stripes = 1, black = 1, white = 1, hooves = 1, flies = 0, ...
Attribute Construction Methods:
- Manual Annotation: Experts annotate attributes for each class
  - Pros: Accurate, interpretable
  - Cons: High cost, subjective
- Crowdsourced Annotation: Collect via platforms like Amazon Mechanical Turk
  - Pros: Relatively low cost, broad coverage
  - Cons: High annotation noise
- Automatic Extraction: Extract attributes from text descriptions (like Wikipedia)
  - Pros: Low cost, scalable
  - Cons: May be incomplete, noisy
Attribute Learning: Predicting Attributes from Images
Given a training set $\{(x_i, a(y_i))\}_{i=1}^{N}$, attribute learning trains a predictor $\hat{a}: \mathcal{X} \to [0, 1]^M$ mapping images to attribute scores.

Loss Function (multi-label classification, binary cross-entropy per attribute):

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} \left[ a_m(y_i) \log \hat{a}_m(x_i) + \left(1 - a_m(y_i)\right) \log \left(1 - \hat{a}_m(x_i)\right) \right]$$

Network Structure:
- Backbone: ResNet, VGG, etc. extract visual features
- Head: Fully connected layers with sigmoid outputs, one per attribute
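As a concrete sketch of this attribute head (the 2048-d input and 85 attributes mirror ResNet features and the AwA attribute count, but all dimensions here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttributePredictor(nn.Module):
    """Multi-label attribute head on top of (frozen) backbone features."""
    def __init__(self, feat_dim=2048, num_attrs=85, hidden=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_attrs),  # one logit per attribute
        )

    def forward(self, x):
        return self.head(x)

model = AttributePredictor()
criterion = nn.BCEWithLogitsLoss()              # the multi-label BCE loss above
feats = torch.randn(4, 2048)                    # a batch of backbone features
targets = torch.randint(0, 2, (4, 85)).float()  # binary attribute labels
loss = criterion(model(feats), targets)
loss.backward()                                 # ready for any optimizer step
```

`BCEWithLogitsLoss` folds the sigmoid into the loss for numerical stability, so the head emits raw logits.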
Direct Attribute Prediction (DAP)
Lampert et al. proposed Direct Attribute Prediction (DAP) in 2009, one of the earliest zero-shot learning methods.
Two-Stage Process:
1. Attribute Prediction: For input $x$, predict attribute probabilities $p(a_m \mid x)$ with $M$ independent attribute classifiers
2. Nearest Neighbor Classification: Select the class whose attribute vector best matches the predictions:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}^u} \prod_{m=1}^{M} p\left(a_m = a_m(y) \mid x\right)$$

(Lampert et al. additionally normalize each factor by the attribute prior $p(a_m)$.)
Intuition: If an image's predicted attributes are "striped, four-legged, wingless", the closest class is "zebra".
Pros:
- Strong interpretability: Can see which attributes led to the classification decision
- Modular: Attribute classifiers can be trained and debugged independently
Cons:
- Error accumulation: Attribute prediction errors directly cause classification errors
- Independence assumption: Ignores correlations between attributes (e.g., "has wings" and "can fly" are highly correlated)
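Despite these caveats, DAP's second stage is simple to implement. A minimal sketch on a hypothetical 3-attribute toy problem, using nearest-prototype matching by Euclidean distance (Lampert et al. actually use the probabilistic score above):

```python
import numpy as np

# Toy class-attribute matrix (hypothetical): [striped, four_legs, wings]
class_attrs = np.array([
    [1, 1, 0],   # zebra
    [0, 1, 0],   # horse
    [0, 0, 1],   # eagle
])
class_names = ["zebra", "horse", "eagle"]

def dap_classify(pred_attrs, prototypes):
    """Stage 2 of DAP: pick the class whose attribute vector is closest
    to the predicted attributes."""
    dists = np.linalg.norm(prototypes - pred_attrs, axis=1)
    return int(np.argmin(dists))

# Predicted attribute probabilities for one image: "striped, four-legged".
pred = np.array([0.9, 0.8, 0.1])
print(class_names[dap_classify(pred, class_attrs)])  # -> "zebra"
```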
Semantic Embedding Space: Beyond Attributes
Attributes require manual design, limiting scalability. Semantic Embeddings automatically learn semantic representations from class names or descriptions.
Word Embeddings: Word2Vec and GloVe
Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are two popular word embedding methods.
Word2Vec (Skip-Gram model): Given a center word $w_t$, maximize the log-probability of its context words:

$$\max \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

GloVe: Minimize a weighted least squares loss over co-occurrence counts $X_{ij}$:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Application to ZSL: Map class names (like "zebra") to the word embedding space and use $a(y) = \text{word2vec}(y)$ as the semantic vector.
Problem: Word embeddings capture linguistic similarity, not necessarily visual similarity. "dog" and "cat" are visually similar, but word embeddings may not be close.
Class Prototypes: Extracting from Text Descriptions
For each class, obtain text descriptions from Wikipedia, encyclopedias etc., then extract feature vectors as class prototypes.
Method 1: TF-IDF: Represent each class by the TF-IDF vector of its description:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N_{\text{docs}}}{\text{df}(t)}$$

Method 2: BERT Embeddings:

Use a pre-trained BERT model to encode text descriptions, e.g., by pooling token embeddings: $a(y) = \text{pool}(\text{BERT}(\text{description}(y)))$.
Advantages:
- Automated: No manual attribute annotation needed
- Rich information: Text descriptions contain more details

Challenges:
- Text quality: Descriptions may be inaccurate or incomplete
- Cross-modal gap: Text and visual feature distributions differ greatly
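A minimal, self-contained TF-IDF prototype builder (whitespace tokenization, smoothed IDF, and the one-line toy descriptions are simplifying assumptions; a real pipeline would run a proper vectorizer over full Wikipedia articles):

```python
import numpy as np

def tfidf_prototypes(docs):
    """Build one TF-IDF class prototype per text description."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.lower().split():
            tf[r, index[w]] += 1
    tf /= tf.sum(axis=1, keepdims=True)              # term frequency
    df = (tf > 0).sum(axis=0)                        # document frequency
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0   # smoothed IDF
    protos = tf * idf
    # L2-normalize so cosine similarity reduces to a dot product
    return protos / np.linalg.norm(protos, axis=1, keepdims=True), vocab

# Hypothetical one-line "Wikipedia" descriptions for two classes.
docs = [
    "striped horse like animal with black and white stripes",
    "large grey animal with a trunk and tusks",
]
protos, vocab = tfidf_prototypes(docs)
```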
Compatibility Functions: Connecting Visual and Semantic
The compatibility function $F(x, y)$ scores how well a visual feature $\theta(x)$ matches a semantic vector $a(y)$; classification picks the class with the highest score.
Linear Compatibility Function
The simplest form is a bilinear function:

$$F(x, y; W) = \theta(x)^\top W a(y)$$

where $W \in \mathbb{R}^{d_v \times d_s}$ is a learned matrix.

Training: On seen classes, maximize the compatibility of the correct class with a structured hinge (ranking) loss:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \ne y_i} \max\left( 0,\; \Delta + F(x_i, y; W) - F(x_i, y_i; W) \right)$$

where $\Delta$ is the margin.
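The bilinear score and ranking loss can be sketched in PyTorch as follows (dimensions and the margin value are illustrative; this is an ALE/SJE-style simplification, not any one paper's exact recipe):

```python
import torch
import torch.nn as nn

class BilinearCompatibility(nn.Module):
    """F(x, y) = theta(x)^T W a(y), computed for all classes at once."""
    def __init__(self, d_v=2048, d_s=85):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(d_v, d_s))

    def forward(self, feats, class_embeds):
        # (B, d_v) @ (d_v, d_s) @ (d_s, C) -> (B, C) compatibility scores
        return feats @ self.W @ class_embeds.t()

def ranking_loss(scores, labels, margin=1.0):
    """Structured hinge loss: the correct class must outscore every
    other class by a margin."""
    correct = scores.gather(1, labels.unsqueeze(1))         # (B, 1)
    hinge = (margin + scores - correct).clamp(min=0)        # (B, C)
    mask = torch.ones_like(hinge).scatter(1, labels.unsqueeze(1), 0.0)
    return (hinge * mask).sum(dim=1).mean()                 # drop the y = y_n term

model = BilinearCompatibility()
feats = torch.randn(8, 2048)
class_embeds = torch.randn(10, 85)   # semantic vectors of 10 seen classes
labels = torch.randint(0, 10, (8,))
loss = ranking_loss(model(feats, class_embeds), labels)
loss.backward()
```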
Deep Compatibility Functions
Use neural networks to learn non-linear compatibility:
Example Architecture:

```text
v (d_v dim) -> FC(512) -> ReLU -> z_v (256 dim)
a (d_s dim) -> FC(512) -> ReLU -> z_a (256 dim)
F = z_v^T z_a (inner product)
```
This allows learning more complex visual-semantic relationships.
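A direct translation of this two-tower architecture (the 512/256 sizes follow the diagram; the input dimensions are placeholders):

```python
import torch
import torch.nn as nn

class DeepCompatibility(nn.Module):
    """Two-tower compatibility network: each modality is projected into a
    shared 256-d space and scored by inner product."""
    def __init__(self, d_v=2048, d_s=85):
        super().__init__()
        self.vis = nn.Sequential(nn.Linear(d_v, 512), nn.ReLU(), nn.Linear(512, 256))
        self.sem = nn.Sequential(nn.Linear(d_s, 512), nn.ReLU(), nn.Linear(512, 256))

    def forward(self, v, a):
        z_v, z_a = self.vis(v), self.sem(a)   # (B, 256), (C, 256)
        return z_v @ z_a.t()                  # (B, C) compatibility scores

model = DeepCompatibility()
scores = model(torch.randn(4, 2048), torch.randn(10, 85))
```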
Generative Zero-Shot Learning
Traditional discriminative ZSL learns mapping from visual to semantic space. Generative ZSL takes opposite approach: generate visual features from semantic descriptions, then use generated features to train classifiers.
f-CLSWGAN: Feature-Generating GAN
Xian et al. proposed f-CLSWGAN in 2018, using conditional GAN to generate visual features.
Architecture:
Generator $G(z, a)$:
- Input: Noise $z \sim \mathcal{N}(0, I)$ and semantic description $a(y)$
- Output: Fake visual feature $\tilde{x} = G(z, a(y))$

Discriminator $D(x, a)$:
- Distinguishes real visual features from fake ones, conditioned on $a(y)$

Classifier:
- Classifies visual features to classes; its loss keeps generated features discriminative

Loss Functions (WGAN loss plus a classification regularizer):

$$\min_G \max_D \; \mathcal{L}_{\text{WGAN}} + \beta \, \mathcal{L}_{\text{CLS}}, \qquad \mathcal{L}_{\text{CLS}} = -\mathbb{E}_{\tilde{x}} \left[ \log p(y \mid \tilde{x}) \right]$$
Training Process:
- Train on seen classes: Generate features for seen classes
- Test on unseen classes: Generate synthetic training data for unseen classes using semantic descriptions
- Train classifier on both real (seen) and synthetic (unseen) features
- Classify test samples
Advantages:
- Converts ZSL to standard supervised learning
- Can leverage powerful classification models
- Handles GZSL naturally (mix real and synthetic data)
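A stripped-down sketch of the f-CLSWGAN components (the gradient penalty, training loop, pretrained classifier, and all hyperparameter values are omitted or arbitrary here; see the paper for the full recipe):

```python
import torch
import torch.nn as nn

# Toy sizes; real f-CLSWGAN uses 2048-d ResNet features and
# dataset-specific attribute dimensions.
d_z, d_s, d_v, n_cls = 64, 85, 2048, 10

G = nn.Sequential(nn.Linear(d_z + d_s, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, d_v))
D = nn.Sequential(nn.Linear(d_v + d_s, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, 1))
C = nn.Linear(d_v, n_cls)  # stand-in for the pretrained softmax classifier

def generate(attrs):
    """Sample fake visual features conditioned on class semantics."""
    z = torch.randn(attrs.size(0), d_z)
    return G(torch.cat([z, attrs], dim=1))

attrs = torch.randn(8, d_s)             # semantic vectors of sampled classes
labels = torch.randint(0, n_cls, (8,))
real = torch.randn(8, d_v)              # real CNN features for those classes

fake = generate(attrs)
# Critic loss (WGAN; the gradient penalty term is omitted in this sketch)
d_loss = D(torch.cat([fake.detach(), attrs], 1)).mean() \
       - D(torch.cat([real, attrs], 1)).mean()
# Generator loss: fool the critic AND keep features class-discriminative
cls_loss = nn.functional.cross_entropy(C(fake), labels)
g_loss = -D(torch.cat([fake, attrs], 1)).mean() + 0.01 * cls_loss  # beta weighting
```

After training, `generate` is called with unseen-class attribute vectors to synthesize a labeled training set for a standard classifier.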
Generalized Zero-Shot Learning (GZSL)
Conventional ZSL assumes test samples only from unseen classes. Generalized ZSL is more realistic: test samples may come from both seen and unseen classes.
GZSL Challenge: Bias Toward Seen Classes
Main challenge: Models trained on seen classes have strong bias toward them. Even if unseen class features are correct, model still predicts seen classes.
Experimental Observation (AWA2 dataset):
- Conventional ZSL accuracy: 65%
- GZSL accuracy on unseen classes: 15%
- GZSL accuracy on seen classes: 85%
Model severely biased toward seen classes!
Calibration Methods
1. Temperature Scaling:

Adjust prediction confidence via a temperature parameter $T$:

$$p(y \mid x) = \frac{\exp\left(F(x, y) / T\right)}{\sum_{y'} \exp\left(F(x, y') / T\right)}$$

2. Bias Calibration (Calibrated Stacking):

Subtract a calibration term from seen-class compatibility scores:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}^s \cup \mathcal{Y}^u} \left( F(x, y) - \gamma \, \mathbb{1}\left[y \in \mathcal{Y}^s\right] \right)$$

where $\gamma \ge 0$ is tuned on a validation set.
3. Separate Classifiers:
Train two classifiers:
- Classifier 1: Discriminate seen vs. unseen
- Classifier 2: If unseen, classify within unseen classes; if seen, classify within seen classes
This is a gating mechanism.
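Calibrated stacking and the standard GZSL harmonic-mean metric fit in a few lines (the $\gamma$ value below is an arbitrary placeholder; in practice it is tuned on validation data):

```python
import torch

def calibrated_stacking(scores, seen_mask, gamma=0.7):
    """Subtract a constant from seen-class scores before the argmax."""
    return scores - gamma * seen_mask.float()

def harmonic_mean(acc_seen, acc_unseen):
    """Standard GZSL metric: H = 2 * S * U / (S + U)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Classes 0 and 1 are seen, class 2 unseen; the biased model scores
# the seen classes highest even for this unseen-class sample.
scores = torch.tensor([[2.0, 1.8, 1.5]])
seen_mask = torch.tensor([1, 1, 0])
pred = calibrated_stacking(scores, seen_mask).argmax(dim=1)  # now class 2 wins
```

With the 85%/15% seen/unseen accuracies quoted above, the harmonic mean is only 0.255, which is exactly why GZSL benchmarks report H rather than overall accuracy.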
Complete Code Implementation
Below is a zero-shot learning implementation covering attribute learning, compatibility functions, and generative models.

```python
import torch
```
Comprehensive Q&A
Q1: When should I use zero-shot learning?
A: Zero-shot learning is suitable when:
- New classes emerge frequently, with no time/resources for annotation
- The distribution is long-tailed and tail classes have very few samples
- You need to recognize rare or novel classes
- Good semantic descriptions (attributes, text) are available

It is not suitable when:
- All classes have sufficient training data
- Classes lack clear semantic descriptions
- Visual appearance differs greatly from the semantic description
Q2: Attributes vs word embeddings - which is better?
A: Trade-offs:
Attributes:
- Pros: Interpretable, capture discriminative features, work well for fine-grained tasks
- Cons: Require manual design, expensive, domain-specific

Word Embeddings:
- Pros: Automatic, scalable, leverage large text corpora
- Cons: Capture linguistic rather than visual similarity, may not be discriminative
Recommendation: Use attributes when available and task is fine-grained; use word embeddings for broader domains or when attributes unavailable.
Q3: How to handle the hubness problem?
A: Hubness: In high-dimensional space, some points become "hubs" that are nearest neighbors to many other points, causing prediction bias.
Solutions:
- Dimensionality Reduction: Use PCA or autoencoders to reduce feature dimensions
- Hubness-Aware Scoring: Weight compatibility scores by point density
- Locally Adaptive Metrics: Use different distance metrics for different regions
- Reverse Nearest Neighbors: Consider reverse nearest neighbor relationships
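The hubness-aware scoring idea can be sketched in a few lines, loosely following a one-sided variant of the CSLS correction from the cross-lingual embedding literature (the similarity matrix below is a contrived toy where class 1 acts as a hub):

```python
import numpy as np

def hubness_aware_scores(sim, k=2):
    """Subtract from each class's column its mean top-k similarity over all
    queries, penalizing classes that are near neighbors of many queries."""
    topk = -np.sort(-sim, axis=0)[:k]   # per-class top-k similarities
    r = topk.mean(axis=0)               # "hubbiness" of each class
    return sim - r[None, :]

# Rows are queries, columns are class prototypes; class 1 is moderately
# similar to every query (a hub), class 0 only to query 0.
sim = np.array([[0.80, 0.85],
                [0.30, 0.85],
                [0.20, 0.85]])
adj = hubness_aware_scores(sim, k=2)
# Plain argmax sends query 0 to the hub; the corrected score does not.
```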
Q4: Why does GZSL perform poorly?
A: Main reasons:
- Bias Toward Seen Classes: Model trained only on seen classes, strongly biased toward them
- Domain Shift: Visual features of seen and unseen classes may have different distributions
- Semantic Gap: Semantic descriptions may not capture all visual information
Solutions:
- Calibration methods (temperature scaling, bias terms)
- Generative models (synthesize unseen class features)
- Transductive learning (leverage test set structure)
Q5: Can zero-shot learning be combined with few-shot learning?
A: Yes! This is called Few-Shot Zero-Shot Learning or Low-Shot Learning:
- Zero-shot provides initial knowledge via semantic descriptions
- Few-shot refines with limited labeled examples
- Combination achieves better performance than either alone
Method: First use zero-shot to generate pseudo-labels, then use few-shot samples to calibrate.
Related Papers
Classic Papers
- Lampert, C. H. et al., "Learning to detect unseen object
classes by between-class attribute transfer", CVPR 2009
- First systematic study of zero-shot learning
- Proposed attribute-based recognition
- Socher, R. et al., "Zero-Shot Learning Through Cross-Modal
Transfer", NeurIPS 2013
- Used word embeddings for zero-shot learning
- Learned visual-semantic mappings
- arXiv:1301.3666
Generative Models
- Xian, Y. et al., "Feature Generating Networks for Zero-Shot
Learning", CVPR 2018
- Proposed f-CLSWGAN
- Generate visual features from semantic descriptions
- arXiv:1712.00981
- Schonfeld, E. et al., "Generalized Zero- and Few-Shot
Learning via Aligned Variational Autoencoders", CVPR 2019
- Proposed CADA-VAE: aligned VAEs for zero- and few-shot learning
- Aligned visual and semantic spaces
- arXiv:1812.01784
Generalized ZSL
- Chao, W.-L. et al., "An Empirical Study and Analysis of
Generalized Zero-Shot Learning for Object Recognition in the Wild", ECCV
2016
- Comprehensive study of GZSL
- Analyzed bias problem
- arXiv:1605.04253
- Xian, Y. et al., "Zero-Shot Learning - A Comprehensive
Evaluation of the Good, the Bad and the Ugly", TPAMI 2019
- Large-scale benchmark and evaluation
- Systematically compared methods
- arXiv:1707.00600
Recent Advances
- Chen, S. et al., "FREE: Feature Refinement for Generalized
Zero-Shot Learning", ICCV 2021
- Feature refinement for better generalization
- Addressed domain shift problem
- arXiv:2107.13807
- Naeem, M. F. et al., "Learning Graph Embeddings for
Compositional Zero-shot Learning", CVPR 2021
- Compositional zero-shot learning
- Used knowledge graphs
- arXiv:2102.01987
Summary
Zero-shot learning enables recognizing unseen classes through semantic descriptions, addressing long-tail distribution and open-world recognition challenges. This article derived zero-shot learning's mathematical foundations from first principles, analyzed attribute representations and semantic embedding spaces in detail, explained compatibility function design, deeply analyzed discriminative and generative ZSL principles, introduced GZSL bias calibration methods, and provided complete implementations.
We saw that zero-shot learning's essence is learning cross-modal mapping from visual to semantic space, bridging seen and unseen classes via auxiliary information. From traditional attribute-based methods to modern generative models, from conventional ZSL to generalized ZSL, zero-shot learning techniques continue evolving. While challenges like semantic gaps, domain shift, and hubness problems remain, zero-shot learning has become an indispensable tool for handling novel classes and long-tail distributions in real-world applications.
Next chapter we'll explore multimodal transfer learning, investigating how to learn unified representations across different modalities and transfer knowledge between them.
- Post title:Transfer Learning (7): Zero-Shot Learning
- Post author:Chen Kai
- Create time:2025-11-15 00:00:00
- Post link:https://www.chenk.top/transfer-learning-7-zero-shot-learning/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.