Zero-Shot Learning (ZSL) is a machine learning paradigm for recognizing classes never seen during training. Humans possess powerful zero-shot learning abilities — even without seeing a zebra before, we can recognize it through descriptions like "looks like a horse but with black and white stripes." Lampert et al.'s pioneering 2009 paper "Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer" introduced this capability to computer vision, launching zero-shot learning research. Zero-shot learning has important applications in long-tail distributions, rapid new class adaptation, and low-resource scenarios, but also faces many challenges like semantic gaps, domain shift, and hubness problems.
This article derives the mathematical foundations of zero-shot learning from first principles, explains construction of attribute representations and semantic embedding spaces, details compatibility function design and optimization, deeply analyzes principles of traditional discriminative ZSL and modern generative ZSL (f-CLSWGAN, f-VAEGAN, etc.), introduces bias calibration methods for generalized zero-shot learning (GZSL), and provides complete code implementations (including attribute learning, visual-semantic mapping, conditional generative models, etc.). We'll see that zero-shot learning essentially learns a cross-modal mapping from visual space to semantic space, bridging seen and unseen classes through auxiliary information (attributes, word embeddings, etc.).
Motivation for Zero-Shot Learning
From Closed-World to Open-World: Long-Tail Distribution Challenge
Traditional supervised learning assumes training and test sets come from the same class set — the Closed-World Assumption. But the real world is Open-World:
- ImageNet has 1000 classes, but reality has millions of object types
- Animal Recognition: Biologists discovered ~1 million animal species, training sets cover very few
- Medical Diagnosis: Rare disease samples are scarce but still need recognition
A more severe problem is Long-Tail Distribution: A few classes have many samples (head), many classes have few samples (tail).
Example (iNaturalist dataset):
- Top 10% of classes account for 60% of total samples
- Bottom 50% of classes account for only 5% of total samples
Adequately annotating tail classes is extremely costly. Zero-shot learning provides a solution: leverage semantic descriptions of classes (like attributes, text descriptions, knowledge graphs) to recognize them without labeled images.
Formal Definition of Zero-Shot Learning
Notation:
- Seen Classes: $\mathcal{Y}^s = \{y_1, \dots, y_S\}$, with labeled training data $\mathcal{D}^s = \{(x_i, y_i)\}_{i=1}^{N}$, $y_i \in \mathcal{Y}^s$
- Unseen Classes: $\mathcal{Y}^u = \{y_{S+1}, \dots, y_{S+U}\}$, with no training images, and $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$

Auxiliary Information: Each class $y \in \mathcal{Y}^s \cup \mathcal{Y}^u$ has a semantic vector $a(y) \in \mathbb{R}^{d_s}$ (attributes, word embeddings, etc.).

Zero-Shot Learning Task:
- Training Phase: Given seen class data $\mathcal{D}^s$ and the semantic vectors $\{a(y)\}$, learn a model
- Test Phase: Classify samples from unseen classes, i.e., learn $f: \mathcal{X} \to \mathcal{Y}^u$

This is Conventional Zero-Shot Learning. A more realistic variant is Generalized Zero-Shot Learning (GZSL): at test time, samples may come from either class set, so the classifier must handle $f: \mathcal{X} \to \mathcal{Y}^s \cup \mathcal{Y}^u$.
Mathematical Perspective on ZSL: Knowledge Transfer
Zero-shot learning's core is knowledge transfer: how to transfer knowledge learned from seen classes to unseen classes?
Key Assumption: Classes are related through a shared semantic space. Let:
- $\theta(x) \in \mathbb{R}^{d_v}$ be the visual feature of image $x$ (e.g., extracted by a CNN)
- $a(y) \in \mathbb{R}^{d_s}$ be the semantic vector of class $y$

Zero-shot learning assumes the existence of a compatibility function $F: \mathbb{R}^{d_v} \times \mathbb{R}^{d_s} \to \mathbb{R}$, with the prediction rule:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}^u} F(\theta(x), a(y))$$
Intuition: Compatibility function measures match between visual features and semantic descriptions. Learn this function on seen classes, then generalize to unseen classes.
Attribute Representation: Describing Class Semantics
Attributes are the most commonly used semantic representation form in zero-shot learning.
Attribute Definition and Construction
Attributes are high-level semantic features describing classes, like:
- Color: Black, white, brown
- Shape: Round, elongated
- Texture: Furry, smooth, striped
- Parts: Has wings, has tail, has four legs
Each class is represented by an attribute vector:

$$a(y) = [a_1(y), a_2(y), \dots, a_M(y)] \in \{0, 1\}^M \text{ (binary) or } [0, 1]^M \text{ (continuous)}$$

Example (Animals with Attributes dataset, 50 animal classes, 85 attributes):
- Zebra: stripes = 1, black = 1, white = 1, hooves = 1, flies = 0, ...
Attribute Construction Methods:
- Manual Annotation: Experts annotate attributes for each class
  - Pros: Accurate, interpretable
  - Cons: High cost, subjective
- Crowdsourced Annotation: Collect via platforms like Amazon Mechanical Turk
  - Pros: Relatively low cost, broad coverage
  - Cons: High annotation noise
- Automatic Extraction: Extract attributes from text descriptions (like Wikipedia)
  - Pros: Low cost, scalable
  - Cons: May be incomplete, noisy
Attribute Learning: Predicting Attributes from Images
Given a training set $\{(x_i, a(y_i))\}_{i=1}^{N}$, attribute learning trains a predictor $\hat{a}: \mathcal{X} \to [0, 1]^M$ mapping images to attribute scores.

Loss Function (multi-label classification, binary cross-entropy per attribute):

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} \left[ a_m(y_i) \log \hat{a}_m(x_i) + \left(1 - a_m(y_i)\right) \log \left(1 - \hat{a}_m(x_i)\right) \right]$$

Network Structure:
- Backbone: ResNet, VGG, etc. extract visual features
- Head: Fully connected layers with sigmoid outputs, one per attribute
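As a concrete sketch of this attribute head (the 2048-d input and 85 attributes mirror ResNet features and the AwA attribute count, but all dimensions here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttributePredictor(nn.Module):
    """Multi-label attribute head on top of (frozen) backbone features."""
    def __init__(self, feat_dim=2048, num_attrs=85, hidden=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_attrs),  # one logit per attribute
        )

    def forward(self, x):
        return self.head(x)

model = AttributePredictor()
criterion = nn.BCEWithLogitsLoss()              # the multi-label BCE loss above
feats = torch.randn(4, 2048)                    # a batch of backbone features
targets = torch.randint(0, 2, (4, 85)).float()  # binary attribute labels
loss = criterion(model(feats), targets)
loss.backward()                                 # ready for any optimizer step
```

`BCEWithLogitsLoss` folds the sigmoid into the loss for numerical stability, so the head emits raw logits.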
Direct Attribute Prediction (DAP)
Lampert et al. proposed Direct Attribute Prediction (DAP) in 2009, one of the earliest zero-shot learning methods.
Two-Stage Process:
1. Attribute Prediction: For input $x$, predict attribute probabilities $p(a_m \mid x)$ with $M$ independent attribute classifiers
2. Nearest Neighbor Classification: Select the class whose attribute vector best matches the predictions:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}^u} \prod_{m=1}^{M} p\left(a_m = a_m(y) \mid x\right)$$

(Lampert et al. additionally normalize each factor by the attribute prior $p(a_m)$.)
Intuition: If an image's predicted attributes are "striped, four-legged, wingless", the closest class is "zebra".
Pros:
- Strong interpretability: Can see which attributes led to the classification decision
- Modular: Attribute classifiers can be trained and debugged independently
Cons:
- Error accumulation: Attribute prediction errors directly cause classification errors
- Independence assumption: Ignores correlations between attributes (e.g., "has wings" and "can fly" are highly correlated)
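Despite these caveats, DAP's second stage is simple to implement. A minimal sketch on a hypothetical 3-attribute toy problem, using nearest-prototype matching by Euclidean distance (Lampert et al. actually use the probabilistic score above):

```python
import numpy as np

# Toy class-attribute matrix (hypothetical): [striped, four_legs, wings]
class_attrs = np.array([
    [1, 1, 0],   # zebra
    [0, 1, 0],   # horse
    [0, 0, 1],   # eagle
])
class_names = ["zebra", "horse", "eagle"]

def dap_classify(pred_attrs, prototypes):
    """Stage 2 of DAP: pick the class whose attribute vector is closest
    to the predicted attributes."""
    dists = np.linalg.norm(prototypes - pred_attrs, axis=1)
    return int(np.argmin(dists))

# Predicted attribute probabilities for one image: "striped, four-legged".
pred = np.array([0.9, 0.8, 0.1])
print(class_names[dap_classify(pred, class_attrs)])  # -> "zebra"
```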
Semantic Embedding Space: Beyond Attributes
Attributes require manual design, limiting scalability. Semantic Embeddings automatically learn semantic representations from class names or descriptions.
Word Embeddings: Word2Vec and GloVe
Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are two popular word embedding methods.
Word2Vec (Skip-Gram model): Given a center word $w_t$, maximize the log-probability of its context words:

$$\max \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

GloVe: Minimize a weighted least squares loss over co-occurrence counts $X_{ij}$:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Application to ZSL: Map class names (like "zebra") to the word embedding space and use $a(y) = \text{word2vec}(y)$ as the semantic vector.
Problem: Word embeddings capture linguistic similarity, not necessarily visual similarity. "dog" and "cat" are visually similar, but word embeddings may not be close.
Class Prototypes: Extracting from Text Descriptions
For each class, obtain text descriptions from Wikipedia, encyclopedias etc., then extract feature vectors as class prototypes.
Method 1: TF-IDF: Represent each class by the TF-IDF vector of its description:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N_{\text{docs}}}{\text{df}(t)}$$

Method 2: BERT Embeddings:

Use a pre-trained BERT model to encode text descriptions, e.g., by pooling token embeddings: $a(y) = \text{pool}(\text{BERT}(\text{description}(y)))$.
Advantages:
- Automated: No manual attribute annotation needed
- Rich information: Text descriptions contain more details

Challenges:
- Text quality: Descriptions may be inaccurate or incomplete
- Cross-modal gap: Text and visual feature distributions differ greatly
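A minimal, self-contained TF-IDF prototype builder (whitespace tokenization, smoothed IDF, and the one-line toy descriptions are simplifying assumptions; a real pipeline would run a proper vectorizer over full Wikipedia articles):

```python
import numpy as np

def tfidf_prototypes(docs):
    """Build one TF-IDF class prototype per text description."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for r, d in enumerate(docs):
        for w in d.lower().split():
            tf[r, index[w]] += 1
    tf /= tf.sum(axis=1, keepdims=True)              # term frequency
    df = (tf > 0).sum(axis=0)                        # document frequency
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0   # smoothed IDF
    protos = tf * idf
    # L2-normalize so cosine similarity reduces to a dot product
    return protos / np.linalg.norm(protos, axis=1, keepdims=True), vocab

# Hypothetical one-line "Wikipedia" descriptions for two classes.
docs = [
    "striped horse like animal with black and white stripes",
    "large grey animal with a trunk and tusks",
]
protos, vocab = tfidf_prototypes(docs)
```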
Compatibility Functions: Connecting Visual and Semantic
The compatibility function $F(x, y)$ scores how well a visual feature $\theta(x)$ matches a semantic vector $a(y)$; classification picks the class with the highest score.
Linear Compatibility Function
The simplest form is a bilinear function:

$$F(x, y; W) = \theta(x)^\top W a(y)$$

where $W \in \mathbb{R}^{d_v \times d_s}$ is a learned matrix.

Training: On seen classes, maximize the compatibility of the correct class with a structured hinge (ranking) loss:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \ne y_i} \max\left( 0,\; \Delta + F(x_i, y; W) - F(x_i, y_i; W) \right)$$

where $\Delta$ is the margin.
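The bilinear score and ranking loss can be sketched in PyTorch as follows (dimensions and the margin value are illustrative; this is an ALE/SJE-style simplification, not any one paper's exact recipe):

```python
import torch
import torch.nn as nn

class BilinearCompatibility(nn.Module):
    """F(x, y) = theta(x)^T W a(y), computed for all classes at once."""
    def __init__(self, d_v=2048, d_s=85):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(d_v, d_s))

    def forward(self, feats, class_embeds):
        # (B, d_v) @ (d_v, d_s) @ (d_s, C) -> (B, C) compatibility scores
        return feats @ self.W @ class_embeds.t()

def ranking_loss(scores, labels, margin=1.0):
    """Structured hinge loss: the correct class must outscore every
    other class by a margin."""
    correct = scores.gather(1, labels.unsqueeze(1))         # (B, 1)
    hinge = (margin + scores - correct).clamp(min=0)        # (B, C)
    mask = torch.ones_like(hinge).scatter(1, labels.unsqueeze(1), 0.0)
    return (hinge * mask).sum(dim=1).mean()                 # drop the y = y_n term

model = BilinearCompatibility()
feats = torch.randn(8, 2048)
class_embeds = torch.randn(10, 85)   # semantic vectors of 10 seen classes
labels = torch.randint(0, 10, (8,))
loss = ranking_loss(model(feats, class_embeds), labels)
loss.backward()
```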
Deep Compatibility Functions
Use neural networks to learn non-linear compatibility:
Example Architecture:

```text
v (d_v dim) -> FC(512) -> ReLU -> z_v (256 dim)
a (d_s dim) -> FC(512) -> ReLU -> z_a (256 dim)
F = z_v^T z_a (inner product)
```
This allows learning more complex visual-semantic relationships.
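A direct translation of this two-tower architecture (the 512/256 sizes follow the diagram; the input dimensions are placeholders):

```python
import torch
import torch.nn as nn

class DeepCompatibility(nn.Module):
    """Two-tower compatibility network: each modality is projected into a
    shared 256-d space and scored by inner product."""
    def __init__(self, d_v=2048, d_s=85):
        super().__init__()
        self.vis = nn.Sequential(nn.Linear(d_v, 512), nn.ReLU(), nn.Linear(512, 256))
        self.sem = nn.Sequential(nn.Linear(d_s, 512), nn.ReLU(), nn.Linear(512, 256))

    def forward(self, v, a):
        z_v, z_a = self.vis(v), self.sem(a)   # (B, 256), (C, 256)
        return z_v @ z_a.t()                  # (B, C) compatibility scores

model = DeepCompatibility()
scores = model(torch.randn(4, 2048), torch.randn(10, 85))
```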
Generative Zero-Shot Learning
Traditional discriminative ZSL learns mapping from visual to semantic space. Generative ZSL takes opposite approach: generate visual features from semantic descriptions, then use generated features to train classifiers.
f-CLSWGAN: Feature-Generating GAN
Xian et al. proposed f-CLSWGAN in 2018, using conditional GAN to generate visual features.
Architecture:
Generator $G(z, a)$:
- Input: Noise $z \sim \mathcal{N}(0, I)$ and semantic description $a(y)$
- Output: Fake visual feature $\tilde{x} = G(z, a(y))$

Discriminator $D(x, a)$:
- Distinguishes real visual features from fake ones, conditioned on $a(y)$

Classifier:
- Classifies visual features to classes; its loss keeps generated features discriminative

Loss Functions (WGAN loss plus a classification regularizer):

$$\min_G \max_D \; \mathcal{L}_{\text{WGAN}} + \beta \, \mathcal{L}_{\text{CLS}}, \qquad \mathcal{L}_{\text{CLS}} = -\mathbb{E}_{\tilde{x}} \left[ \log p(y \mid \tilde{x}) \right]$$
Training Process:
- Train on seen classes: Generate features for seen classes
- Test on unseen classes: Generate synthetic training data for unseen classes using semantic descriptions
- Train classifier on both real (seen) and synthetic (unseen) features
- Classify test samples
Advantages:
- Converts ZSL to standard supervised learning
- Can leverage powerful classification models
- Handles GZSL naturally (mix real and synthetic data)
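A stripped-down sketch of the f-CLSWGAN components (the gradient penalty, training loop, pretrained classifier, and all hyperparameter values are omitted or arbitrary here; see the paper for the full recipe):

```python
import torch
import torch.nn as nn

# Toy sizes; real f-CLSWGAN uses 2048-d ResNet features and
# dataset-specific attribute dimensions.
d_z, d_s, d_v, n_cls = 64, 85, 2048, 10

G = nn.Sequential(nn.Linear(d_z + d_s, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, d_v))
D = nn.Sequential(nn.Linear(d_v + d_s, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, 1))
C = nn.Linear(d_v, n_cls)  # stand-in for the pretrained softmax classifier

def generate(attrs):
    """Sample fake visual features conditioned on class semantics."""
    z = torch.randn(attrs.size(0), d_z)
    return G(torch.cat([z, attrs], dim=1))

attrs = torch.randn(8, d_s)             # semantic vectors of sampled classes
labels = torch.randint(0, n_cls, (8,))
real = torch.randn(8, d_v)              # real CNN features for those classes

fake = generate(attrs)
# Critic loss (WGAN; the gradient penalty term is omitted in this sketch)
d_loss = D(torch.cat([fake.detach(), attrs], 1)).mean() \
       - D(torch.cat([real, attrs], 1)).mean()
# Generator loss: fool the critic AND keep features class-discriminative
cls_loss = nn.functional.cross_entropy(C(fake), labels)
g_loss = -D(torch.cat([fake, attrs], 1)).mean() + 0.01 * cls_loss  # beta weighting
```

After training, `generate` is called with unseen-class attribute vectors to synthesize a labeled training set for a standard classifier.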
Generalized Zero-Shot Learning (GZSL)
Conventional ZSL assumes test samples only from unseen classes. Generalized ZSL is more realistic: test samples may come from both seen and unseen classes.
GZSL Challenge: Bias Toward Seen Classes
Main challenge: Models trained on seen classes have strong bias toward them. Even if unseen class features are correct, model still predicts seen classes.
Experimental Observation (AWA2 dataset):
- Conventional ZSL accuracy: 65%
- GZSL accuracy on unseen classes: 15%
- GZSL accuracy on seen classes: 85%
Model severely biased toward seen classes!
Calibration Methods
1. Temperature Scaling:

Adjust prediction confidence via a temperature parameter $T$:

$$p(y \mid x) = \frac{\exp\left(F(x, y) / T\right)}{\sum_{y'} \exp\left(F(x, y') / T\right)}$$

2. Bias Calibration (Calibrated Stacking):

Subtract a calibration term from seen-class compatibility scores:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}^s \cup \mathcal{Y}^u} \left( F(x, y) - \gamma \, \mathbb{1}\left[y \in \mathcal{Y}^s\right] \right)$$

where $\gamma \ge 0$ is tuned on a validation set.
3. Separate Classifiers:
Train two classifiers:
- Classifier 1: Discriminate seen vs. unseen
- Classifier 2: If unseen, classify within unseen classes; if seen, classify within seen classes
This is a gating mechanism.
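Calibrated stacking and the standard GZSL harmonic-mean metric fit in a few lines (the $\gamma$ value below is an arbitrary placeholder; in practice it is tuned on validation data):

```python
import torch

def calibrated_stacking(scores, seen_mask, gamma=0.7):
    """Subtract a constant from seen-class scores before the argmax."""
    return scores - gamma * seen_mask.float()

def harmonic_mean(acc_seen, acc_unseen):
    """Standard GZSL metric: H = 2 * S * U / (S + U)."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Classes 0 and 1 are seen, class 2 unseen; the biased model scores
# the seen classes highest even for this unseen-class sample.
scores = torch.tensor([[2.0, 1.8, 1.5]])
seen_mask = torch.tensor([1, 1, 0])
pred = calibrated_stacking(scores, seen_mask).argmax(dim=1)  # now class 2 wins
```

With the 85%/15% seen/unseen accuracies quoted above, the harmonic mean is only 0.255, which is exactly why GZSL benchmarks report H rather than overall accuracy.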
Complete Code Implementation
Below is a zero-shot learning implementation covering attribute learning, compatibility functions, and generative models.

```python
import torch
```
Comprehensive Q&A
Q1: When should I use zero-shot learning?
A: Zero-shot learning is suitable when:
- New classes emerge frequently, with no time/resources for annotation
- The distribution is long-tailed and tail classes have very few samples
- You need to recognize rare or novel classes
- Good semantic descriptions (attributes, text) are available

It is not suitable when:
- All classes have sufficient training data
- Classes lack clear semantic descriptions
- Visual appearance differs greatly from the semantic description
Q2: Attributes vs word embeddings - which is better?
A: Trade-offs:
Attributes:
- Pros: Interpretable, capture discriminative features, work well for fine-grained tasks
- Cons: Require manual design, expensive, domain-specific

Word Embeddings:
- Pros: Automatic, scalable, leverage large text corpora
- Cons: Capture linguistic rather than visual similarity, may not be discriminative
Recommendation: Use attributes when available and task is fine-grained; use word embeddings for broader domains or when attributes unavailable.
Q3: How to handle the hubness problem?
A: Hubness: In high-dimensional space, some points become "hubs" that are nearest neighbors to many other points, causing prediction bias.
Solutions:
- Dimensionality Reduction: Use PCA or autoencoders to reduce feature dimensions
- Hubness-Aware Scoring: Weight compatibility scores by point density
- Locally Adaptive Metrics: Use different distance metrics for different regions
- Reverse Nearest Neighbors: Consider reverse nearest neighbor relationships
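The hubness-aware scoring idea can be sketched in a few lines, loosely following a one-sided variant of the CSLS correction from the cross-lingual embedding literature (the similarity matrix below is a contrived toy where class 1 acts as a hub):

```python
import numpy as np

def hubness_aware_scores(sim, k=2):
    """Subtract from each class's column its mean top-k similarity over all
    queries, penalizing classes that are near neighbors of many queries."""
    topk = -np.sort(-sim, axis=0)[:k]   # per-class top-k similarities
    r = topk.mean(axis=0)               # "hubbiness" of each class
    return sim - r[None, :]

# Rows are queries, columns are class prototypes; class 1 is moderately
# similar to every query (a hub), class 0 only to query 0.
sim = np.array([[0.80, 0.85],
                [0.30, 0.85],
                [0.20, 0.85]])
adj = hubness_aware_scores(sim, k=2)
# Plain argmax sends query 0 to the hub; the corrected score does not.
```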
Q4: Why does GZSL perform poorly?
A: Main reasons:
- Bias Toward Seen Classes: Model trained only on seen classes, strongly biased toward them
- Domain Shift: Visual features of seen and unseen classes may have different distributions
- Semantic Gap: Semantic descriptions may not capture all visual information
Solutions:
- Calibration methods (temperature scaling, bias terms)
- Generative models (synthesize unseen class features)
- Transductive learning (leverage test set structure)
Q5: Can zero-shot learning be combined with few-shot learning?
A: Yes! This is called Few-Shot Zero-Shot Learning or Low-Shot Learning:
- Zero-shot provides initial knowledge via semantic descriptions
- Few-shot refines with limited labeled examples
- Combination achieves better performance than either alone
Method: First use zero-shot to generate pseudo-labels, then use few-shot samples to calibrate.
Related Papers
Classic Papers
- Lampert, C. H. et al., "Learning to detect unseen object
classes by between-class attribute transfer", CVPR 2009
- First systematic study of zero-shot learning
- Proposed attribute-based recognition
- Socher, R. et al., "Zero-Shot Learning Through Cross-Modal
Transfer", NeurIPS 2013
- Used word embeddings for zero-shot learning
- Learned visual-semantic mappings
- arXiv:1301.3666
Generative Models
- Xian, Y. et al., "Feature Generating Networks for Zero-Shot
Learning", CVPR 2018
- Proposed f-CLSWGAN
- Generate visual features from semantic descriptions
- arXiv:1712.00981
- Schonfeld, E. et al., "Generalized Zero- and Few-Shot
Learning via Aligned Variational Autoencoders", CVPR 2019
- Proposed CADA-VAE: aligned VAEs for zero- and few-shot learning
- Aligned visual and semantic spaces
- arXiv:1812.01784
Generalized ZSL
- Chao, W.-L. et al., "An Empirical Study and Analysis of
Generalized Zero-Shot Learning for Object Recognition in the Wild", ECCV
2016
- Comprehensive study of GZSL
- Analyzed bias problem
- arXiv:1605.04253
- Xian, Y. et al., "Zero-Shot Learning - A Comprehensive
Evaluation of the Good, the Bad and the Ugly", TPAMI 2019
- Large-scale benchmark and evaluation
- Systematically compared methods
- arXiv:1707.00600
Recent Advances
- Chen, S. et al., "FREE: Feature Refinement for Generalized
Zero-Shot Learning", ICCV 2021
- Feature refinement for better generalization
- Addressed domain shift problem
- arXiv:2107.13807
- Naeem, M. F. et al., "Learning Graph Embeddings for
Compositional Zero-shot Learning", CVPR 2021
- Compositional zero-shot learning
- Used knowledge graphs
- arXiv:2102.01987
Summary
Zero-shot learning enables recognizing unseen classes through semantic descriptions, addressing long-tail distribution and open-world recognition challenges. This article derived zero-shot learning's mathematical foundations from first principles, analyzed attribute representations and semantic embedding spaces in detail, explained compatibility function design, deeply analyzed discriminative and generative ZSL principles, introduced GZSL bias calibration methods, and provided complete implementations.
We saw that zero-shot learning's essence is learning cross-modal mapping from visual to semantic space, bridging seen and unseen classes via auxiliary information. From traditional attribute-based methods to modern generative models, from conventional ZSL to generalized ZSL, zero-shot learning techniques continue evolving. While challenges like semantic gaps, domain shift, and hubness problems remain, zero-shot learning has become an indispensable tool for handling novel classes and long-tail distributions in real-world applications.
Next chapter we'll explore multimodal transfer learning, investigating how to learn unified representations across different modalities and transfer knowledge between them.
- Post title:Transfer Learning (7): Zero-Shot Learning
- Post author:Chen Kai
- Create time:2025-11-15 00:00:00
- Post link:https://www.chenk.top/transfer-learning-7-zero-shot-learning/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.