Transfer Learning (7): Zero-Shot Learning
Chen Kai

Zero-Shot Learning (ZSL) is a machine learning paradigm capable of recognizing classes never seen during training. Humans possess powerful zero-shot abilities: even without ever seeing a zebra, we can recognize one from a description like "looks like a horse but with black and white stripes." Lampert et al.'s pioneering 2009 paper "Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer" introduced this capability to computer vision, launching zero-shot learning research. Zero-shot learning has important applications in long-tail distributions, rapid adaptation to new classes, and low-resource scenarios, but it also faces challenges such as semantic gaps, domain shift, and the hubness problem.

This article derives the mathematical foundations of zero-shot learning from first principles, explains construction of attribute representations and semantic embedding spaces, details compatibility function design and optimization, deeply analyzes principles of traditional discriminative ZSL and modern generative ZSL (f-CLSWGAN, f-VAEGAN, etc.), introduces bias calibration methods for generalized zero-shot learning (GZSL), and provides complete code implementations (including attribute learning, visual-semantic mapping, conditional generative models, etc.). We'll see that zero-shot learning essentially learns a cross-modal mapping from visual space to semantic space, bridging seen and unseen classes through auxiliary information (attributes, word embeddings, etc.).

Motivation for Zero-Shot Learning

From Closed-World to Open-World: Long-Tail Distribution Challenge

Traditional supervised learning assumes training and test sets come from the same class set — the Closed-World Assumption. But the real world is Open-World:

  • ImageNet has 1000 classes, but reality has millions of object types
  • Animal Recognition: Biologists have described roughly 1 million animal species; training sets cover only a tiny fraction
  • Medical Diagnosis: Rare disease samples are scarce but still need recognition

A more severe problem is Long-Tail Distribution: A few classes have many samples (head), many classes have few samples (tail).

Example (iNaturalist dataset):

  • The top 10% of classes account for 60% of total samples
  • The bottom 50% of classes account for only 5% of total samples

Adequately annotating tail classes is extremely costly. Zero-shot learning provides a solution: leverage semantic descriptions of classes (like attributes, text descriptions, knowledge graphs) to recognize them without labeled images.

Formal Definition of Zero-Shot Learning

Notation:

  • Seen Classes: $\mathcal{C}^s = \{c_1^s, \ldots, c_{N_s}^s\}$, which have labeled data during training
  • Unseen Classes: $\mathcal{C}^u = \{c_1^u, \ldots, c_{N_u}^u\}$, which have no labeled data during training
  • Constraint: $\mathcal{C}^s \cap \mathcal{C}^u = \emptyset$ (seen and unseen classes don't overlap)

Auxiliary Information: Each class $c$ has a semantic description $a_c$, like:

  • Attribute vectors: $a_c \in \{0,1\}^M$ (encodes attributes such as "furry", "winged", etc.)
  • Word embeddings: word vectors of class names from Word2Vec, GloVe, etc.
  • Class prototypes: feature vectors extracted from text descriptions

Zero-Shot Learning Task:

  • Training Phase: Given seen-class data $\mathcal{D}^s = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is an image and $y_i \in \mathcal{C}^s$ is its label, together with semantic descriptions for all classes $\{a_c\}_{c \in \mathcal{C}^s \cup \mathcal{C}^u}$
  • Test Phase: For an input $x$, predict $\hat{y} \in \mathcal{C}^u$ (classify only among unseen classes)

This is Conventional Zero-Shot Learning. A more realistic variant is Generalized Zero-Shot Learning (GZSL): at test time $\hat{y} \in \mathcal{C}^s \cup \mathcal{C}^u$ (classify among both seen and unseen classes).
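The difference between the two settings is just the candidate label set at prediction time. A minimal sketch (toy compatibility scores and hypothetical class indices, not a trained model):

```python
import numpy as np

# Hypothetical setup: 3 seen classes (indices 0-2), 2 unseen classes (indices 3-4).
seen_classes = [0, 1, 2]
unseen_classes = [3, 4]

# Toy compatibility scores F(phi(x), a_c) for one test image, one score per class.
scores = np.array([2.1, 0.3, 1.5, 1.9, 0.8])

# Conventional ZSL: argmax restricted to unseen classes only.
zsl_pred = unseen_classes[int(np.argmax(scores[unseen_classes]))]

# Generalized ZSL: argmax over the union of seen and unseen classes.
gzsl_pred = int(np.argmax(scores))

print(zsl_pred)   # 3 (best unseen class)
print(gzsl_pred)  # 0 (a seen class wins -- the seen-class bias GZSL must handle)
```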

Mathematical Perspective on ZSL: Knowledge Transfer

Zero-shot learning's core is knowledge transfer: how to transfer knowledge learned from seen classes to unseen classes?

Key Assumption: Classes are related through a semantic space. Let:

  • $\phi: \mathcal{X} \to \mathbb{R}^{d_v}$ be the visual feature extractor
  • $\psi$ be the semantic embedding that maps classes to semantic vectors, $a_c = \psi(c) \in \mathbb{R}^{d_s}$

Zero-shot learning assumes the existence of a compatibility function $F: \mathbb{R}^{d_v} \times \mathbb{R}^{d_s} \to \mathbb{R}$ such that $F(\phi(x), a_y)$ is large when image $x$ belongs to class $y$. During prediction, for an input $x$, select the class with highest compatibility:

$$\hat{y} = \arg\max_{c \in \mathcal{C}^u} F(\phi(x), a_c)$$

Intuition: Compatibility function measures match between visual features and semantic descriptions. Learn this function on seen classes, then generalize to unseen classes.

Attribute Representation: Describing Class Semantics

Attributes are the most commonly used semantic representation form in zero-shot learning.

Attribute Definition and Construction

Attributes are high-level semantic features describing classes, like:

  • Color: black, white, brown
  • Shape: round, elongated
  • Texture: furry, smooth, striped
  • Parts: has wings, has tail, has four legs

Each class $c$ is represented by an attribute vector $a_c \in \{0,1\}^M$ (binary attributes) or $a_c \in \mathbb{R}^M$ (continuous attributes), where $M$ is the number of attributes.

Example (Animals with Attributes dataset, 50 animal classes, 85 attributes):

  • Zebra: $a_{\text{zebra}} = (1, 0, 1, \ldots)$ (striped=1, winged=0, four legs=1, ...)
  • Penguin: $a_{\text{penguin}} = (0, 1, 0, \ldots)$ (striped=0, winged=1, four legs=0, ...)

Attribute Construction Methods:

  1. Manual Annotation: Experts annotate attributes for each class
    • Pros: Accurate, interpretable
    • Cons: High cost, subjective
  2. Crowdsourced Annotation: Collect via platforms like Amazon Mechanical Turk
    • Pros: Relatively low cost, broad coverage
    • Cons: High annotation noise
  3. Automatic Extraction: Extract attributes from text descriptions (like Wikipedia)
    • Pros: Low cost, scalable
    • Cons: May be incomplete, noisy

Attribute Learning: Predicting Attributes from Images

Given a training set $\{(x_i, a_i)\}_{i=1}^{n}$, where $x_i$ is an image and $a_i \in \{0,1\}^M$ are its attribute labels, learn attribute classifiers $f_1, \ldots, f_M$, where $f_m(x)$ predicts the probability of the $m$-th attribute.

Loss Function (multi-label classification):

$$\mathcal{L} = \sum_{i=1}^{n} \sum_{m=1}^{M} \text{BCE}(f_m(x_i), a_{i,m})$$

where BCE is the binary cross-entropy:

$$\text{BCE}(p, a) = -a \log p - (1 - a) \log(1 - p)$$

Network Structure:

  • Backbone: ResNet, VGG, etc. extract visual features $\phi(x)$
  • Attribute Heads: For each attribute $m$, use an FC layer + sigmoid: $f_m(x) = \sigma(w_m^\top \phi(x) + b_m)$
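The attribute heads reduce to multi-label classification. A minimal PyTorch sketch (random tensors stand in for backbone features; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
M, d = 85, 2048                      # 85 attributes, 2048-dim backbone features

# One linear head over all attributes: f_m(x) = sigmoid(w_m^T phi(x) + b_m)
attr_head = nn.Linear(d, M)
criterion = nn.BCEWithLogitsLoss()   # numerically stable sigmoid + BCE

features = torch.randn(32, d)        # toy phi(x) for a batch of 32 images
attr_labels = torch.randint(0, 2, (32, M)).float()  # binary attribute labels

logits = attr_head(features)
loss = criterion(logits, attr_labels)
loss.backward()                      # gradients flow into the attribute head

probs = torch.sigmoid(logits)        # per-attribute probabilities in [0, 1]
print(probs.shape)                   # torch.Size([32, 85])
```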

Direct Attribute Prediction (DAP)

Lampert et al. proposed Direct Attribute Prediction (DAP) in 2009, one of the earliest zero-shot learning methods.

Two-Stage Process:

  1. Attribute Prediction: For an input $x$, predict the attribute vector $\hat{a} = (f_1(x), \ldots, f_M(x))$

  2. Nearest Neighbor Classification: Select the class whose attribute vector is closest to the prediction:

$$\hat{y} = \arg\min_{c \in \mathcal{C}^u} \|\hat{a} - a_c\|_2$$

Intuition: If an image's predicted attributes are "striped, four-legged, wingless", the closest class is "zebra".
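The two-stage DAP decision can be sketched in a few lines (toy attribute vectors for three hypothetical unseen classes):

```python
import numpy as np

# Toy unseen-class attribute vectors: (striped, winged, four_legs)
class_attrs = {
    "zebra":   np.array([1.0, 0.0, 1.0]),
    "penguin": np.array([0.0, 1.0, 0.0]),
    "whale":   np.array([0.0, 0.0, 0.0]),
}

# Stage 1 output: predicted attribute probabilities for one image,
# i.e. "striped, four-legged, wingless"
pred_attrs = np.array([0.9, 0.1, 0.8])

# Stage 2: nearest neighbor in attribute space
pred_class = min(class_attrs, key=lambda c: np.linalg.norm(pred_attrs - class_attrs[c]))
print(pred_class)  # zebra
```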

Pros:

  • Strong interpretability: can see which attributes led to the classification decision
  • Modular: attribute classifiers can be trained and debugged independently

Cons:

  • Error accumulation: attribute prediction errors directly cause classification errors
  • Independence assumption: ignores correlations between attributes (e.g., "has wings" and "can fly" are highly correlated)

Semantic Embedding Space: Beyond Attributes

Attributes require manual design, limiting scalability. Semantic Embeddings automatically learn semantic representations from class names or descriptions.

Word Embeddings: Word2Vec and GloVe

Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are two popular word embedding methods.

Word2Vec (Skip-Gram model): Given a center word $w_t$, predict a context word $w_{t+j}$:

$$p(w_{t+j} \mid w_t) = \frac{\exp(v_{w_{t+j}}^\top v_{w_t})}{\sum_{w \in V} \exp(v_w^\top v_{w_t})}$$

where $v_w$ is the embedding vector of word $w$.

GloVe: Minimize a weighted least-squares loss:

$$J = \sum_{i,j} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, and $f$ is a weighting function.

Application to ZSL: Map class names (like "zebra") to the word embedding space $\mathbb{R}^{d_s}$ (e.g., 300 dimensions). Similar classes are close in the embedding space, like "zebra" and "horse".

Problem: Word embeddings capture linguistic similarity, not necessarily visual similarity. "dog" and "cat" are visually similar, but word embeddings may not be close.
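Once classes live in an embedding space, prediction reduces to nearest-neighbor search under cosine similarity. A sketch with toy 4-dimensional vectors (hypothetical values, not real Word2Vec output; real systems use 300-dimensional vectors):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy class-name embeddings (hypothetical values)
class_embs = {
    "horse":   np.array([0.9, 0.1, 0.3, 0.2]),
    "zebra":   np.array([0.8, 0.2, 0.4, 0.1]),
    "penguin": np.array([0.1, 0.9, 0.1, 0.7]),
}

# A visual feature mapped into the embedding space by a learned projection
projected = np.array([0.75, 0.25, 0.40, 0.12])

# Predict the class whose embedding is most similar to the projected feature
pred = max(class_embs, key=lambda c: cosine(projected, class_embs[c]))
print(pred)  # zebra
```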

Class Prototypes: Extracting from Text Descriptions

For each class $c$, obtain text descriptions from Wikipedia, encyclopedias, etc., then extract feature vectors as class prototypes.

Method 1: TF-IDF: Represent each class's description as a TF-IDF vector:

$$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ is the frequency of term $t$ in description $d$, $\text{df}(t)$ is the number of descriptions containing $t$, and $N$ is the total number of descriptions.
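The TF-IDF prototype idea can be sketched self-contained (toy one-line "descriptions" and a whitespace tokenizer; real systems use full Wikipedia articles, a proper tokenizer, and stop-word removal):

```python
import math
from collections import Counter

# Toy class descriptions standing in for encyclopedia text
descriptions = {
    "zebra":   "striped horse like animal with four legs",
    "penguin": "flightless bird with wings that swims",
}

docs = {c: d.split() for c, d in descriptions.items()}
vocab = sorted({t for toks in docs.values() for t in toks})
N = len(docs)
df = {t: sum(t in toks for toks in docs.values()) for t in vocab}

def tfidf_vector(tokens):
    """One TF-IDF weight per vocabulary term (smoothed idf keeps shared terms nonzero)."""
    tf = Counter(tokens)
    return [tf[t] / len(tokens) * (math.log((1 + N) / (1 + df[t])) + 1) for t in vocab]

# Class prototype = TF-IDF vector of its description
prototypes = {c: tfidf_vector(toks) for c, toks in docs.items()}
print(len(vocab), len(prototypes["zebra"]))
```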

Method 2: BERT Embeddings:

Use a pre-trained BERT model to encode the text description: $a_c = \text{BERT}(\text{description}_c)$. Take the [CLS] token output or average pooling as the class representation.

Advantages:

  • Automated: no manual attribute annotation needed
  • Rich information: text descriptions contain more details

Challenges:

  • Text quality: descriptions may be inaccurate or incomplete
  • Cross-modal gap: text and visual feature distributions differ greatly

Compatibility Functions: Connecting Visual and Semantic

The compatibility function $F(v, a)$ measures the match between visual features $v = \phi(x)$ and a semantic description $a$.

Linear Compatibility Function

The simplest form is the bilinear function:

$$F(v, a) = v^\top W a$$

where $W \in \mathbb{R}^{d_v \times d_s}$ is a learnable weight matrix.

Training: On seen classes, maximize the compatibility of the correct class:

$$\mathcal{L} = -\sum_{i=1}^{n} \log \frac{\exp(F(v_i, a_{y_i}))}{\sum_{c \in \mathcal{C}^s} \exp(F(v_i, a_c))}$$

This is a softmax cross-entropy loss, similar to standard classification.

Deep Compatibility Functions

Use neural networks to learn a non-linear compatibility:

$$F(v, a) = g_v(v)^\top g_a(a)$$

where $g_v$ and $g_a$ are neural networks (like MLPs) that project visual features and semantic vectors into a shared embedding space.

Example Architecture:

v (d_v dim) -> FC(512) -> ReLU -> z_v (256 dim)
a (d_s dim) -> FC(512) -> ReLU -> z_a (256 dim)
F = z_v^T z_a (inner product)

This allows learning more complex visual-semantic relationships.

Generative Zero-Shot Learning

Traditional discriminative ZSL learns a mapping from the visual to the semantic space. Generative ZSL takes the opposite approach: generate visual features from semantic descriptions, then use the generated features to train classifiers.

f-CLSWGAN: Feature-Generating GAN

Xian et al. proposed f-CLSWGAN in 2018, using conditional GAN to generate visual features.

Architecture:

  1. Generator $G$:
    • Input: noise $z \sim \mathcal{N}(0, I)$ and a semantic description $a_c$
    • Output: a fake visual feature $\tilde{v} = G(z, a_c)$

  2. Discriminator $D$: distinguishes real visual features from generated ones

  3. Classifier: classifies (real or generated) visual features into classes

Loss Functions:

$$\min_G \max_D \; \mathcal{L}_{\text{WGAN}} + \beta \, \mathcal{L}_{\text{CLS}}$$

where the classification loss encourages generated features to be class-discriminative:

$$\mathcal{L}_{\text{CLS}} = -\mathbb{E}_{\tilde{v} \sim p_G} \left[ \log p(y \mid \tilde{v}) \right]$$

Training Process:

  1. Train on seen classes: Generate features for seen classes
  2. Test on unseen classes: Generate synthetic training data for unseen classes using semantic descriptions
  3. Train classifier on both real (seen) and synthetic (unseen) features
  4. Classify test samples

Advantages:

  • Converts ZSL into standard supervised learning
  • Can leverage powerful classification models
  • Handles GZSL naturally (mix real and synthetic data)
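Step 3 of the training process above, mixing real seen-class features with synthetic unseen-class features to train an ordinary classifier, might look like this (random tensors stand in for CNN features and generator outputs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, n_seen, n_unseen = 2048, 40, 10
num_classes = n_seen + n_unseen

# Stand-ins: real features for seen classes, generator outputs for unseen classes
real_feats = torch.randn(400, feat_dim)
real_labels = torch.randint(0, n_seen, (400,))
synth_feats = torch.randn(100, feat_dim)                 # would come from G(z, a_c)
synth_labels = torch.randint(n_seen, num_classes, (100,))

feats = torch.cat([real_feats, synth_feats])
labels = torch.cat([real_labels, synth_labels])

# An ordinary softmax classifier over ALL classes -- GZSL becomes supervised learning
clf = nn.Linear(feat_dim, num_classes)
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(clf(feats), labels)
    loss.backward()
    opt.step()

preds = clf(feats).argmax(dim=1)
print(preds.shape)  # torch.Size([500])
```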

Generalized Zero-Shot Learning (GZSL)

Conventional ZSL assumes test samples only from unseen classes. Generalized ZSL is more realistic: test samples may come from both seen and unseen classes.

GZSL Challenge: Bias Toward Seen Classes

Main challenge: models trained on seen classes have a strong bias toward them. Even when unseen-class features are informative, the model still tends to predict seen classes.

Experimental Observation (AWA2 dataset):

  • Conventional ZSL accuracy: 65%
  • GZSL accuracy on unseen classes: 15%
  • GZSL accuracy on seen classes: 85%

The model is severely biased toward seen classes!
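These numbers are why GZSL results are usually reported with the harmonic mean H of seen and unseen accuracy (the protocol of Xian et al.'s benchmark cited later in this article); H collapses when either side collapses:

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Standard GZSL metric: H = 2 * S * U / (S + U)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# With the figures above: high seen accuracy cannot compensate for unseen collapse
print(round(harmonic_mean(0.85, 0.15), 3))  # 0.255 -- far below the 0.65 conventional ZSL score
```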

Calibration Methods

1. Temperature Scaling:

Adjust prediction confidence via a temperature parameter:

$$p(c \mid x) = \frac{\exp(F(v, a_c) / T_c)}{\sum_{c'} \exp(F(v, a_{c'}) / T_{c'})}$$

Set $T_c > 1$ for seen classes (higher temperature) to reduce seen-class confidence.

2. Bias Calibration:

Add a calibration term to the compatibility scores:

$$\hat{y} = \arg\max_{c \in \mathcal{C}^s \cup \mathcal{C}^u} \left( F(v, a_c) - \gamma \, \mathbb{1}[c \in \mathcal{C}^s] \right)$$

where $\gamma > 0$ is a calibration factor that penalizes seen classes.
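A tiny numeric sketch of this calibrated-stacking idea (toy scores): subtracting a constant gamma from seen-class scores only can flip a biased prediction to an unseen class:

```python
import numpy as np

n_seen, n_unseen = 3, 2
scores = np.array([2.1, 0.3, 1.5, 1.9, 0.8])  # seen classes first, then unseen
is_seen = np.arange(len(scores)) < n_seen

print(int(np.argmax(scores)))                  # 0 -- a seen class wins

gamma = 0.5
calibrated = scores - gamma * is_seen          # penalize seen classes only
print(int(np.argmax(calibrated)))              # 3 -- an unseen class now wins
```

In practice gamma is tuned on a validation split to balance seen and unseen accuracy.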

3. Separate Classifiers:

Train two classifiers:

  • Classifier 1: discriminates seen vs. unseen
  • Classifier 2: if unseen, classify within unseen classes; if seen, classify within seen classes

This is a gating mechanism.

Complete Code Implementation

Below is a complete zero-shot learning implementation, including attribute learning, compatibility functions, and a generative model.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from typing import Tuple, Dict

# ============== Attribute Learning ==============

class AttributeClassifier(nn.Module):
    """Multi-label attribute classifier"""
    def __init__(self, feature_dim: int, num_attributes: int):
        super().__init__()
        self.fc1 = nn.Linear(feature_dim, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.fc3 = nn.Linear(512, num_attributes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = torch.sigmoid(self.fc3(x))
        return x


# ============== Compatibility Functions ==============

class BilinearCompatibility(nn.Module):
    """Bilinear compatibility function F(v, a) = v^T W a"""
    def __init__(self, visual_dim: int, semantic_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(visual_dim, semantic_dim))

    def forward(self, v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        """
        v: (batch_size, visual_dim)
        a: (num_classes, semantic_dim)
        Returns: (batch_size, num_classes)
        """
        return torch.matmul(torch.matmul(v, self.W), a.t())


class DeepCompatibility(nn.Module):
    """Deep neural compatibility function"""
    def __init__(self, visual_dim: int, semantic_dim: int, hidden_dim: int = 512):
        super().__init__()
        # Visual encoder
        self.visual_encoder = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, 256)
        )
        # Semantic encoder
        self.semantic_encoder = nn.Sequential(
            nn.Linear(semantic_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, 256)
        )

    def forward(self, v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        """
        v: (batch_size, visual_dim)
        a: (num_classes, semantic_dim)
        Returns: (batch_size, num_classes)
        """
        v_emb = self.visual_encoder(v)    # (batch_size, 256)
        a_emb = self.semantic_encoder(a)  # (num_classes, 256)
        # Cosine similarity
        v_norm = F.normalize(v_emb, p=2, dim=1)
        a_norm = F.normalize(a_emb, p=2, dim=1)
        scores = torch.matmul(v_norm, a_norm.t())
        return scores * 10.0  # Scale factor so the softmax is not too flat


# ============== Zero-Shot Classifier ==============

class ZeroShotClassifier:
    """Zero-shot learning classifier"""
    def __init__(
        self,
        visual_dim: int,
        semantic_dim: int,
        seen_class_attributes: torch.Tensor,
        unseen_class_attributes: torch.Tensor,
        compatibility_type: str = 'deep',
        device: str = 'cuda'
    ):
        self.device = device
        self.seen_attrs = seen_class_attributes.to(device)
        self.unseen_attrs = unseen_class_attributes.to(device)

        if compatibility_type == 'bilinear':
            self.compat_fn = BilinearCompatibility(visual_dim, semantic_dim).to(device)
        else:
            self.compat_fn = DeepCompatibility(visual_dim, semantic_dim).to(device)

        self.optimizer = optim.Adam(self.compat_fn.parameters(), lr=1e-3)

    def train_step(self, features: torch.Tensor, labels: torch.Tensor) -> float:
        """Train on seen classes"""
        self.compat_fn.train()

        # Compute compatibility scores
        scores = self.compat_fn(features, self.seen_attrs)

        # Cross-entropy loss
        loss = F.cross_entropy(scores, labels)

        # Backprop
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()

    @torch.no_grad()
    def predict_unseen(self, features: torch.Tensor) -> torch.Tensor:
        """Predict on unseen classes"""
        self.compat_fn.eval()
        scores = self.compat_fn(features, self.unseen_attrs)
        preds = torch.argmax(scores, dim=1)
        return preds

    @torch.no_grad()
    def predict_gzsl(self, features: torch.Tensor, calibration: float = 0.0) -> Tuple[torch.Tensor, torch.Tensor]:
        """Generalized ZSL prediction with calibrated stacking"""
        self.compat_fn.eval()

        # Scores for seen and unseen classes
        seen_scores = self.compat_fn(features, self.seen_attrs)
        unseen_scores = self.compat_fn(features, self.unseen_attrs)

        # Apply calibration (reduce seen class scores)
        seen_scores = seen_scores - calibration

        # Concatenate scores
        all_scores = torch.cat([seen_scores, unseen_scores], dim=1)
        preds = torch.argmax(all_scores, dim=1)

        # Determine if seen or unseen
        is_seen = preds < self.seen_attrs.size(0)

        return preds, is_seen


# ============== Feature Generating GAN ==============

class Generator(nn.Module):
    """Feature generator conditioned on semantic attributes"""
    def __init__(self, noise_dim: int, semantic_dim: int, feature_dim: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(noise_dim + semantic_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, feature_dim),
            nn.ReLU()
        )

    def forward(self, noise: torch.Tensor, attributes: torch.Tensor) -> torch.Tensor:
        x = torch.cat([noise, attributes], dim=1)
        return self.fc(x)


class Discriminator(nn.Module):
    """Discriminator for real vs fake features"""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feature_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, 1),
            nn.Sigmoid()
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.fc(features)


class FeatureGeneratingGAN:
    """Simplified f-CLSWGAN-style feature generator.

    Note: for brevity this uses the standard (BCE) GAN loss; the original
    f-CLSWGAN uses WGAN-GP plus a classification loss on generated features.
    """
    def __init__(
        self,
        noise_dim: int,
        semantic_dim: int,
        feature_dim: int,
        device: str = 'cuda'
    ):
        self.device = device
        self.noise_dim = noise_dim

        self.generator = Generator(noise_dim, semantic_dim, feature_dim).to(device)
        self.discriminator = Discriminator(feature_dim).to(device)

        self.g_optimizer = optim.Adam(self.generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
        self.d_optimizer = optim.Adam(self.discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))

    def train_step(self, real_features: torch.Tensor, attributes: torch.Tensor) -> Dict[str, float]:
        """Train generator and discriminator"""
        batch_size = real_features.size(0)

        # Train Discriminator
        self.d_optimizer.zero_grad()

        # Real samples
        real_validity = self.discriminator(real_features)
        d_real_loss = F.binary_cross_entropy(real_validity, torch.ones_like(real_validity))

        # Fake samples
        noise = torch.randn(batch_size, self.noise_dim).to(self.device)
        fake_features = self.generator(noise, attributes)
        fake_validity = self.discriminator(fake_features.detach())
        d_fake_loss = F.binary_cross_entropy(fake_validity, torch.zeros_like(fake_validity))

        d_loss = d_real_loss + d_fake_loss
        d_loss.backward()
        self.d_optimizer.step()

        # Train Generator
        self.g_optimizer.zero_grad()

        noise = torch.randn(batch_size, self.noise_dim).to(self.device)
        fake_features = self.generator(noise, attributes)
        fake_validity = self.discriminator(fake_features)
        g_loss = F.binary_cross_entropy(fake_validity, torch.ones_like(fake_validity))

        g_loss.backward()
        self.g_optimizer.step()

        return {'d_loss': d_loss.item(), 'g_loss': g_loss.item()}

    @torch.no_grad()
    def generate_features(self, attributes: torch.Tensor, num_samples: int) -> torch.Tensor:
        """Generate synthetic features for given attributes"""
        self.generator.eval()

        # Repeat attributes num_samples times
        attrs = attributes.repeat(num_samples, 1)

        # Generate features
        noise = torch.randn(attrs.size(0), self.noise_dim).to(self.device)
        features = self.generator(noise, attrs)

        return features


# ============== Usage Example ==============

def main():
    # Hyperparameters
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    visual_dim = 2048
    semantic_dim = 85
    num_seen_classes = 40
    num_unseen_classes = 10

    # Create dummy data
    seen_attributes = torch.randn(num_seen_classes, semantic_dim)
    unseen_attributes = torch.randn(num_unseen_classes, semantic_dim)

    # ========== Experiment 1: Discriminative ZSL ==========
    print("\n" + "=" * 60)
    print("Experiment 1: Discriminative Zero-Shot Learning")
    print("=" * 60)

    zsl_classifier = ZeroShotClassifier(
        visual_dim=visual_dim,
        semantic_dim=semantic_dim,
        seen_class_attributes=seen_attributes,
        unseen_class_attributes=unseen_attributes,
        compatibility_type='deep',
        device=device
    )

    # Training on seen classes
    for epoch in range(10):
        # Dummy training data
        train_features = torch.randn(100, visual_dim).to(device)
        train_labels = torch.randint(0, num_seen_classes, (100,)).to(device)

        loss = zsl_classifier.train_step(train_features, train_labels)
        print(f'Epoch {epoch}: Loss = {loss:.4f}')

    # Testing on unseen classes
    test_features = torch.randn(50, visual_dim).to(device)
    preds = zsl_classifier.predict_unseen(test_features)
    print(f'Predictions: {preds[:10]}')

    # Generalized ZSL
    preds_gzsl, is_seen = zsl_classifier.predict_gzsl(test_features, calibration=2.0)
    print(f'GZSL Predictions: {preds_gzsl[:10]}')
    print(f'Is Seen: {is_seen[:10]}')

    # ========== Experiment 2: Generative ZSL ==========
    print("\n" + "=" * 60)
    print("Experiment 2: Generative Zero-Shot Learning (f-CLSWGAN)")
    print("=" * 60)

    gan = FeatureGeneratingGAN(
        noise_dim=100,
        semantic_dim=semantic_dim,
        feature_dim=visual_dim,
        device=device
    )

    # Training GAN on seen classes
    for epoch in range(20):
        # Dummy training data; keep indices on CPU to index the CPU attribute tensor
        real_features = torch.randn(64, visual_dim).to(device)
        class_idx = torch.randint(0, num_seen_classes, (64,))
        attributes = seen_attributes[class_idx].to(device)

        losses = gan.train_step(real_features, attributes)
        print(f'Epoch {epoch}: D Loss = {losses["d_loss"]:.4f}, G Loss = {losses["g_loss"]:.4f}')

    # Generate synthetic features for unseen classes
    print("\nGenerating synthetic features for unseen classes...")
    for class_idx in range(num_unseen_classes):
        attr = unseen_attributes[class_idx:class_idx + 1].to(device)
        synthetic_features = gan.generate_features(attr, num_samples=100)
        print(f'Class {class_idx}: Generated {synthetic_features.size(0)} features')


if __name__ == '__main__':
    main()

Comprehensive Q&A

Q1: When should I use zero-shot learning?

A: Zero-shot learning is suitable when:

  • New classes emerge frequently, with no time or resources for annotation
  • The distribution is long-tailed and tail classes have very few samples
  • Rare or novel classes must be recognized
  • Good semantic descriptions (attributes, text) are available

It is not suitable when:

  • All classes have sufficient training data
  • Classes lack clear semantic descriptions
  • Visual appearance differs greatly from the semantic description

Q2: Attributes vs word embeddings - which is better?

A: Trade-offs:

Attributes:

  • Pros: interpretable, capture discriminative features, work well for fine-grained tasks
  • Cons: require manual design, expensive, domain-specific

Word Embeddings:

  • Pros: automatic, scalable, leverage large text corpora
  • Cons: capture linguistic rather than visual similarity, may not be discriminative

Recommendation: Use attributes when available and task is fine-grained; use word embeddings for broader domains or when attributes unavailable.

Q3: How to handle the hubness problem?

A: Hubness: In high-dimensional space, some points become "hubs" that are nearest neighbors to many other points, causing prediction bias.

Solutions:

  1. Dimensionality Reduction: Use PCA or autoencoders to reduce feature dimensions
  2. Hubness-Aware Scoring: Weight compatibility scores by point density
  3. Locally Adaptive Metrics: Use different distance metrics for different regions
  4. Reverse Nearest Neighbors: Consider reverse nearest neighbor relationships

Q4: Why does GZSL perform poorly?

A: Main reasons:

  1. Bias Toward Seen Classes: Model trained only on seen classes, strongly biased toward them
  2. Domain Shift: Visual features of seen and unseen classes may have different distributions
  3. Semantic Gap: Semantic descriptions may not capture all visual information

Solutions:

  • Calibration methods (temperature scaling, bias terms)
  • Generative models (synthesize unseen-class features)
  • Transductive learning (leverage test-set structure)

Q5: Can zero-shot learning be combined with few-shot learning?

A: Yes! This is called Few-Shot Zero-Shot Learning or Low-Shot Learning:

  • Zero-shot provides initial knowledge via semantic descriptions
  • Few-shot refines with limited labeled examples
  • Combination achieves better performance than either alone

Method: First use zero-shot to generate pseudo-labels, then use few-shot samples to calibrate.

Classic Papers

  1. Lampert, C. H. et al., "Learning to detect unseen object classes by between-class attribute transfer", CVPR 2009
    • First systematic study of zero-shot learning
    • Proposed attribute-based recognition
    • IEEE
  2. Socher, R. et al., "Zero-Shot Learning Through Cross-Modal Transfer", NeurIPS 2013
    • Used word embeddings for zero-shot learning
    • Learned visual-semantic mappings
    • arXiv:1301.3666

Generative Models

  1. Xian, Y. et al., "Feature Generating Networks for Zero-Shot Learning", CVPR 2018
    • Proposed f-CLSWGAN
    • Generate visual features from semantic descriptions
    • arXiv:1712.00981
  2. Schonfeld, E. et al., "Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders", CVPR 2019
    • Proposed CADA-VAE, cross- and distribution-aligned VAEs for zero-shot learning
    • Aligned visual and semantic spaces
    • arXiv:1812.01784

Generalized ZSL

  1. Chao, W.-L. et al., "An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild", ECCV 2016
  2. Xian, Y. et al., "Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly", TPAMI 2019
    • Large-scale benchmark and evaluation
    • Systematically compared methods
    • arXiv:1707.00600

Recent Advances

  1. Chen, S. et al., "FREE: Feature Refinement for Generalized Zero-Shot Learning", ICCV 2021
    • Feature refinement for better generalization
    • Addressed domain shift problem
    • arXiv:2107.13807
  2. Naeem, M. F. et al., "Learning Graph Embeddings for Compositional Zero-shot Learning", CVPR 2021

Summary

Zero-shot learning enables recognizing unseen classes through semantic descriptions, addressing long-tail distribution and open-world recognition challenges. This article derived zero-shot learning's mathematical foundations from first principles, analyzed attribute representations and semantic embedding spaces in detail, explained compatibility function design, deeply analyzed discriminative and generative ZSL principles, introduced GZSL bias calibration methods, and provided complete implementations.

We saw that zero-shot learning's essence is learning cross-modal mapping from visual to semantic space, bridging seen and unseen classes via auxiliary information. From traditional attribute-based methods to modern generative models, from conventional ZSL to generalized ZSL, zero-shot learning techniques continue evolving. While challenges like semantic gaps, domain shift, and hubness problems remain, zero-shot learning has become an indispensable tool for handling novel classes and long-tail distributions in real-world applications.

Next chapter we'll explore multimodal transfer learning, investigating how to learn unified representations across different modalities and transfer knowledge between them.

  • Post title: Transfer Learning (7): Zero-Shot Learning
  • Post author: Chen Kai
  • Create time: 2025-11-15 00:00:00
  • Post link: https://www.chenk.top/transfer-learning-7-zero-shot-learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.