Why can CLIP achieve zero-shot image classification using natural language descriptions? Why can DALL-E generate images from text? The core of these breakthroughs is multimodal transfer learning — enabling models to understand and associate information across different modalities (vision, language, audio, etc.).
Multimodal transfer is not just a fusion of technologies, but a key to cognitive intelligence. Starting from the mathematical principles of contrastive learning, this article systematically explains vision-language pretraining models like CLIP and ALIGN, deeply explores cross-modal alignment, fusion strategies, and downstream task applications, providing complete code for implementing multimodal models from scratch.
Motivation and Challenges of Multimodal Learning
Why Multimodal?
Limitations of single-modal learning:
- Incomplete information: Images alone cannot explain "why"; text alone cannot convey "what it looks like"
- Poor generalization: Pure vision models struggle with conceptual queries (e.g., "find all dangerous scenes")
- Low data efficiency: Image annotation is expensive, while text descriptions (like image-text pairs on web pages) naturally exist at massive scale
Advantages of multimodal approaches:
- Complementarity: Different modalities provide complementary information (e.g., spatial relations from images + causal explanations from text)
- Robustness: When one modality is missing or noisy, others can compensate
- Zero-shot generalization: Through language descriptions, models can recognize categories unseen during training
Core question: How can models learn correspondences between different modalities?
Challenges in Multimodal Transfer
1. Modality Heterogeneity
Vision and language are fundamentally different in representation space:
- Vision: Continuous, high-dimensional, locally correlated (pixel-level)
- Language: Discrete, symbolic, globally dependent (syntactic structure)
Mathematical description: visual input lives in a continuous space $I \in \mathbb{R}^{H \times W \times 3}$, while text is a discrete token sequence $T \in \mathcal{V}^L$ over a vocabulary $\mathcal{V}$; the two cannot be compared directly and must be mapped into a shared embedding space.
2. Semantic Gap
Same concepts have different expressions across modalities:
- "Cat" in images is a pixel pattern
- "Cat" in text is a symbol sequence
- Need to learn cross-modal semantic alignment
3. Data Alignment
Training data has different alignment granularities:
- Weak alignment: Image-text pairs (like web page images and captions), but text may only describe partial content
- Strong alignment: Fine-grained annotation (like region-phrase correspondences), but annotation cost is extremely high
4. Modality Fusion Strategy
When and how to fuse information from different modalities:
- Early fusion: Concatenate features at input layer
- Late fusion: Extract features separately then fuse
- Deep fusion: Interact at multiple network layers
Contrastive Learning: Foundation of Multimodal Pretraining
Mathematical Principles of Contrastive Learning
Core idea of contrastive learning: Pull positive pairs closer, push negative pairs apart.
Given a batch of image-text pairs $\{(I_i, T_i)\}_{i=1}^N$, encode them into L2-normalized embeddings $v_i = f_v(I_i)$ and $t_i = f_t(T_i)$. The InfoNCE loss in the image-to-text direction is

$$\mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(v_i^\top t_i / \tau)}{\sum_{j=1}^{N} \exp(v_i^\top t_j / \tau)}$$

where $\tau$ is a temperature parameter. A symmetric loss $\mathcal{L}_{T \to I}$ swaps the roles of images and texts, and the total loss is $\mathcal{L} = (\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}) / 2$.
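The symmetric InfoNCE loss used by CLIP can be sketched in NumPy (a minimal illustration, not an optimized implementation):

```python
import numpy as np

def info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    v, t: (N, d) L2-normalized image/text embeddings; row i of v
    matches row i of t (positive pair), all other rows are negatives.
    """
    sim = v @ t.T / tau  # (N, N) similarity logits

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    n = len(v)
    diag = (np.arange(n), np.arange(n))
    loss_i2t = -log_softmax(sim, axis=1)[diag].mean()  # image -> text
    loss_t2i = -log_softmax(sim, axis=0)[diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Perfectly aligned pairs give a much lower loss than random pairings.
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 32)); v /= np.linalg.norm(v, axis=1, keepdims=True)
t_matched = v.copy()  # identical embeddings: ideal alignment
t_random = rng.normal(size=(8, 32))
t_random /= np.linalg.norm(t_random, axis=1, keepdims=True)
print(info_nce(v, t_matched) < info_nce(v, t_random))  # True
```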
Why Does Contrastive Learning Work?
Understanding from the mutual-information-maximization perspective: contrastive learning maximizes a lower bound on the mutual information $I(v; t)$ between vision and text encodings:

$$I(v; t) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$$

The contrastive loss achieves this by:
1. Maximizing the similarity of each positive pair (the numerator)
2. Normalizing against the $N - 1$ negative pairs (the denominator)

Minimizing the loss therefore tightens the bound, thus maximizing mutual information.
Role of Temperature Parameter
The temperature $\tau$ controls the "sharpness" of the similarity distribution:
- Small $\tau$ (e.g., 0.01): sharp distribution, focuses only on the most similar samples, may lead to overfitting
- Large $\tau$ (e.g., 1.0): smooth distribution, considers all samples, learning may be insufficient

As $\tau \to 0$, softmax degenerates to argmax (selects only the maximum value). In practice, CLIP makes $\tau$ a learnable parameter, initialized to 0.07 and clipped during training to keep optimization stable.
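The effect of the temperature is easy to see directly (a small NumPy illustration; the similarity values are made up):

```python
import numpy as np

def softmax(x, tau):
    """Softmax over similarity scores x with temperature tau."""
    z = x / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

sims = np.array([0.9, 0.7, 0.2, 0.1])  # cosine similarities to four candidates

sharp = softmax(sims, tau=0.01)   # small tau: nearly one-hot on the best match
smooth = softmax(sims, tau=1.0)   # large tau: spreads mass over all candidates

print(sharp.round(3))
print(smooth.round(3))
print(sharp[0] > smooth[0])  # True: smaller tau concentrates on the top match
```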
CLIP: Connecting Text and Images
Core Idea of CLIP
CLIP (Contrastive Language-Image Pre-training)1 design philosophy:
Don't predict specific categories; learn correspondences between images and text.
Traditional approach: Image → Fixed categories (like ImageNet's 1000 classes)
CLIP approach: Image ↔︎ Arbitrary text descriptions
Advantages of this design:
1. Data scale: Can leverage 400 million image-text pairs from the internet, far exceeding manually annotated datasets
2. Zero-shot generalization: Recognize unseen categories through text descriptions
3. Task flexibility: Same model can do classification, retrieval, generation, etc.
CLIP Architecture
CLIP consists of two encoders:
- Image encoder $f_v$: a ResNet or Vision Transformer (ViT); outputs a fixed-dimensional image embedding $v \in \mathbb{R}^d$
- Text encoder $f_t$: a Transformer; outputs a text embedding $t \in \mathbb{R}^d$ in the same dimension as the image embedding
Training process:
1. A batch contains $N$ image-text pairs
2. Both encoders produce L2-normalized embeddings $v_i$, $t_i$
3. Compute the $N \times N$ similarity matrix $s_{ij} = v_i^\top t_j / \tau$; diagonal entries are positive pairs, the rest are negatives

Loss function: the symmetric InfoNCE loss

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)$$
CLIP's Zero-Shot Classification
Given an image $I$ and $K$ candidate categories:
1. Convert category names to text descriptions:
   - Simple version: category name → "a photo of a {class}"
   - Complex version: ensemble multiple templates (like "a photo of a {class}", "a picture of a {class}")
2. Encode the image and all text descriptions: $v = f_v(I)$, $t_k = f_t(T_k)$
3. Compute the probability of each category: $p(k \mid I) = \mathrm{softmax}_k(v^\top t_k / \tau)$
4. Select the category with the highest probability
Advantage of this approach: No training on target dataset needed, only category names required.
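This procedure can be sketched as follows. Note that `fake_encode_text` is a toy stand-in for a real CLIP text encoder (it just derives a deterministic random vector from the string), so only the pipeline, not the embeddings, is meaningful:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, class_names, encode_text, tau=0.07,
                       templates=("a photo of a {}", "a picture of a {}")):
    """CLIP-style zero-shot classification with prompt-template ensembling.

    Each class embedding is the L2-normalized average of its prompt
    embeddings; the image is assigned to the most similar class.
    """
    class_embs = []
    for name in class_names:
        prompts = [tpl.format(name) for tpl in templates]
        embs = l2norm(np.stack([encode_text(p) for p in prompts]))
        class_embs.append(l2norm(embs.mean(axis=0)))
    class_embs = np.stack(class_embs)                  # (K, d)
    logits = class_embs @ l2norm(image_emb) / tau      # (K,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return class_names[int(probs.argmax())], probs

def fake_encode_text(s, d=64):
    """Hypothetical encoder: a fixed random vector per string (demo only)."""
    seed = int.from_bytes(s.encode(), "little") % (2**32)
    return np.random.default_rng(seed).normal(size=d)

# Build the "image" embedding from the cat prompt so the match is clear.
image_emb = fake_encode_text("a photo of a cat")
label, probs = zero_shot_classify(image_emb, ["cat", "dog", "car"],
                                  fake_encode_text)
print(label)  # "cat": the image embedding was derived from the cat prompt
```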
CLIP vs. Traditional Methods
| Dimension | Traditional Supervised Learning | CLIP |
|---|---|---|
| Training data | Fixed category labels (like ImageNet) | Image-text pairs (like web pages) |
| Data scale | Millions | Billions |
| Generalization | Limited to training categories | Zero-shot recognition of new categories |
| Annotation cost | High (manual annotation needed) | Low (naturally exists) |
| Task adaptation | Requires fine-tuning | Zero-shot or few-shot |
ALIGN: Larger-Scale Alignment
ALIGN's Improvements
ALIGN (A Large-scale ImaGe and Noisy-text embedding)2 is Google's scaled-up counterpart to CLIP, with these core differences:
- Data scale: 1.8 billion image-text pairs (4.5x CLIP)
- Noisy data: Directly uses web-scraped data without filtering noise
- Simplified architecture: Uses EfficientNet as image encoder
Noise Robustness
ALIGN proved an important finding: Contrastive learning is naturally robust to noisy labels.
Reason analysis:
Suppose the true matching pair is $(I_i, T_i)$ but the web text $T_i$ is noisy: it describes the image only partially, or not at all. In large-batch contrastive learning:
- Each noisy pair contributes only one gradient term among $N$ samples, so a single mislabeled positive is diluted
- The many correct pairs dominate the aggregate gradient direction

Mathematical representation: let the noise ratio be $\rho$. The expected gradient decomposes into $(1 - \rho)$ times the clean-data gradient plus a noise term whose directions are roughly random and largely cancel as the batch grows.
Experiments show: Even with 30% noise, ALIGN performance drops less than 5%.
Cross-Modal Alignment Methods
Levels of Alignment
Cross-modal alignment can occur at different granularities:
- Global alignment: Entire image ↔︎ Entire sentence (CLIP/ALIGN)
- Region alignment: Image regions ↔︎ Phrases (Visual Genome)
- Pixel alignment: Pixels ↔︎ Words (dense alignment)
Deep Alignment: OSCAR
OSCAR (Object-Semantics Aligned Pre-training)3 proposes an object label-based alignment strategy:
Core idea: Introduce object labels as "anchors" connecting vision and language.
Input representation: a triple (word tokens $w$, object tags $q$, region features $v$), where the object tags are text labels produced by an object detector.
Pretraining tasks:
1. Masked Language Modeling (MLM): Predict masked words
2. Masked Region Modeling (MRM): Predict masked image regions
3. Object label classification: Predict object categories of regions
Advantage: Object labels provide explicit semantic alignment signals, accelerating convergence.
Design of Alignment Losses
Besides contrastive loss, other alignment losses include:
1. Triplet Loss

$$\mathcal{L}_{\text{triplet}} = \max\left(0,\; \alpha - s(v, t^+) + s(v, t^-)\right)$$

where $t^+$ is the matching text, $t^-$ is a non-matching text, $s(\cdot,\cdot)$ is the similarity, and $\alpha$ is the margin.
2. Cycle Consistency Loss

Used for joint training of image captioning ($C$: image → text) and image generation ($G$: text → image):

$$\mathcal{L}_{\text{cycle}} = \left\| I - G(C(I)) \right\| + \left\| T - C(G(T)) \right\|$$
3. Knowledge Distillation Alignment

Use pretrained single-modal models as teachers, e.g. by matching the student's features to the teacher's:

$$\mathcal{L}_{\text{KD}} = \left\| f_{\text{student}}(x) - f_{\text{teacher}}(x) \right\|^2$$
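The triplet loss above is the simplest of these to make concrete; a minimal NumPy sketch on cosine similarities:

```python
import numpy as np

def triplet_loss(v, t_pos, t_neg, margin=0.2):
    """Margin-based triplet loss on cosine similarities.

    Pushes the matching text t_pos to be at least `margin` more similar
    to the image embedding v than the non-matching text t_neg.
    """
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, margin - cos(v, t_pos) + cos(v, t_neg))

v = np.array([1.0, 0.0])
t_pos = np.array([0.9, 0.1])  # nearly aligned with v
t_neg = np.array([0.0, 1.0])  # orthogonal to v
print(triplet_loss(v, t_pos, t_neg))  # 0.0: already separated beyond the margin
print(triplet_loss(v, t_neg, t_pos))  # positive: pairs swapped, loss kicks in
```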
Multimodal Fusion Strategies
When to Fuse
1. Early Fusion
Concatenate features from different modalities at the input layer:

$$h = f([x_v; x_t])$$

Cons: Cannot leverage pretrained models, fragile to modality absence
2. Late Fusion
Extract features separately, then fuse at the decision level:

$$y = g\big(f_v(x_v),\; f_t(x_t)\big)$$

Cons: Insufficient cross-modal interaction
3. Deep Fusion
Interact at multiple network levels:

$$h_v^{(l+1)},\, h_t^{(l+1)} = \mathrm{Interact}\big(h_v^{(l)},\, h_t^{(l)}\big), \quad l = 1, \dots, L$$

Cons: High computational complexity
Attention-Based Fusion
Cross-Attention
Visual features attend to text features:

$$\tilde{h}_v = \mathrm{Attention}\big(Q = h_v W_Q,\; K = h_t W_K,\; V = h_t W_V\big)$$
Co-Attention
Vision and text mutually attend to each other, in both directions:

$$\tilde{h}_v = \mathrm{Attention}(h_v, h_t, h_t), \qquad \tilde{h}_t = \mathrm{Attention}(h_t, h_v, h_v)$$
Self-Attention on Concatenation
Apply self-attention after concatenating vision and text features (single-stream Transformer style):

$$h = \mathrm{SelfAttention}([h_v; h_t])$$
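Cross-attention is the core primitive behind all three variants. A single-head NumPy sketch (the learned projections $W_Q$, $W_K$, $W_V$ are omitted for clarity):

```python
import numpy as np

def cross_attention(h_v, h_t):
    """Single-head cross-attention: vision queries attend to text keys/values.

    h_v: (Nv, d) visual features (queries)
    h_t: (Nt, d) text features (keys and values)
    Returns (Nv, d): each visual feature becomes a text-informed mixture.
    """
    d_k = h_v.shape[-1]
    scores = h_v @ h_t.T / np.sqrt(d_k)        # (Nv, Nt) scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)   # softmax over text positions
    return attn @ h_t

rng = np.random.default_rng(0)
h_v = rng.normal(size=(4, 16))  # 4 image regions
h_t = rng.normal(size=(6, 16))  # 6 text tokens
out = cross_attention(h_v, h_t)
print(out.shape)  # (4, 16)
```

Co-attention simply applies this in both directions; single-stream fusion instead runs self-attention over the concatenated `(Nv + Nt, d)` sequence.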
Downstream Task Applications
Image Captioning
Task definition: Given an image $I$, generate a text description $T = (w_1, \dots, w_L)$.
Encoder-Decoder Architecture
Encoder: extract image features $v = f_v(I)$. Decoder: autoregressively generate the caption, $p(T \mid I) = \prod_t p(w_t \mid w_{<t}, v)$.
Reinforcement Learning Optimization
Since metrics like BLEU and CIDEr are non-differentiable, use the policy gradient:

$$\nabla_\theta J = \mathbb{E}_{T \sim p_\theta}\left[(r(T) - b)\, \nabla_\theta \log p_\theta(T \mid I)\right]$$

where $r(T)$ is the metric reward and $b$ is a baseline that reduces variance (self-critical training uses the reward of the greedy-decoded caption).
Visual Question Answering (VQA)
Task definition: Given an image $I$ and a natural-language question $Q$, output an answer $A$.
Classification-Based VQA
Treat VQA as multi-class classification over a candidate answer set (typically the few thousand most frequent answers):

$$p(a \mid I, Q) = \mathrm{softmax}\big(W\,[f_v(I); f_t(Q)]\big)$$
Generation-Based VQA
Treat VQA as conditional text generation:

$$p(A \mid I, Q) = \prod_t p(a_t \mid a_{<t}, I, Q)$$
Attention Mechanism
Question-guided visual attention: weight image regions $v_i$ by their relevance to the question embedding $q$:

$$\alpha_i = \mathrm{softmax}_i\big(w^\top \tanh(W_v v_i + W_q q)\big), \qquad \hat{v} = \sum_i \alpha_i v_i$$
Image-Text Retrieval
Task definition: Given text, retrieve relevant images (or vice versa).
Similarity-Based Ranking
Compute the similarity between the query text $t$ and every candidate image $v_i$ (e.g., cosine similarity $s_i = \cos(t, v_i)$), then rank by $s_i$.
Metric Learning Optimization
Triplet loss with margin $\alpha$:

$$\mathcal{L} = \max\left(0,\; \alpha - s(v, t^+) + s(v, t^-)\right)$$
Hard Negative Mining
Select the negative sample with the highest similarity in the batch:

$$t_i^- = \arg\max_{j \ne i} s(v_i, t_j)$$
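Given the in-batch similarity matrix, hard negative mining is a one-line mask-and-argmax; a NumPy sketch:

```python
import numpy as np

def hardest_negatives(sim):
    """For each image i, pick the non-matching text with highest similarity.

    sim: (N, N) similarity matrix where sim[i, i] is the positive pair.
    Returns an index array neg with neg[i] = argmax_{j != i} sim[i, j].
    """
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude the positive pair
    return masked.argmax(axis=1)

sim = np.array([[0.9, 0.6, 0.1],
                [0.2, 0.8, 0.7],
                [0.5, 0.3, 0.9]])
print(hardest_negatives(sim))  # [1 2 0]
```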
Complete Implementation: Building CLIP Model from Scratch
Below implements a simplified CLIP including image encoder, text encoder, contrastive training, and zero-shot classification.
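A compact PyTorch sketch of those components follows. This is an illustrative skeleton under stated simplifications, not the article's original listing: a tiny CNN stands in for ResNet50, and a one-layer Transformer stands in for the full text encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Stand-in for ResNet50: a small CNN plus a projection to the shared space."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, embed_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        h = self.conv(x).flatten(1)             # (B, 32)
        return F.normalize(self.proj(h), dim=-1)  # L2-normalized embedding

class TextEncoder(nn.Module):
    """Stand-in for CLIP's text Transformer: embedding + 1 layer + mean pool."""
    def __init__(self, vocab_size=1000, embed_dim=64, width=64, max_len=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, width)
        self.pos = nn.Parameter(torch.zeros(1, max_len, width))  # positions
        layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.proj = nn.Linear(width, embed_dim)

    def forward(self, tokens):                  # tokens: (B, L)
        h = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        h = self.encoder(h).mean(dim=1)         # mean-pool over tokens
        return F.normalize(self.proj(h), dim=-1)

class MiniCLIP(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.image_encoder = ImageEncoder(embed_dim)
        self.text_encoder = TextEncoder(embed_dim=embed_dim)
        # learnable temperature, stored on a log scale as in CLIP: ln(1/0.07)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, tokens):
        v = self.image_encoder(images)
        t = self.text_encoder(tokens)
        return self.logit_scale.exp() * v @ t.T  # (B, B) similarity logits

def clip_loss(logits):
    """Symmetric InfoNCE: diagonal entries are the positive pairs."""
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# One training step on random data to show the plumbing.
model = MiniCLIP()
images = torch.randn(8, 3, 32, 32)
tokens = torch.randint(0, 1000, (8, 16))
loss = clip_loss(model(images, tokens))
loss.backward()
print(float(loss))  # a positive scalar; decreases as pairs align
```

Zero-shot classification then reuses `model.text_encoder` on prompt templates and picks the class whose text embedding is most similar to the image embedding.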
Code Explanation
Core components:
- Image encoder: ResNet50 feature extraction + projection layer
- Text encoder: Transformer + positional encoding + projection layer
- Contrastive loss: Bidirectional InfoNCE loss
- Zero-shot classification: Compute similarity between image and all class texts
Training workflow:
- In-batch contrastive learning: $N$ image-text pairs produce an $N \times N$ similarity matrix
- Diagonal elements are positive pairs, off-diagonal elements are negative pairs
- Optimize both image → text and text → image directions simultaneously
Key techniques:
- L2 normalization: Ensures stable similarity computation
- Learnable temperature parameter: Automatically adjusts softmax distribution
- Large-batch training: More negative samples, better contrastive effect
Advanced Topics
Multimodal Transformers
ViLBERT (Vision-and-Language BERT)
ViLBERT4 proposes a dual-stream Transformer architecture:
- Vision stream: Processes image region features
- Language stream: Processes text tokens
- Cross-modal connections: Interact through Co-Attention layers
Architecture:
Text-to-Image Generation
DALL-E
DALL-E uses an autoregressive Transformer for image generation:

1. VQ-VAE encoding: discretize the image into a sequence of visual tokens $z = (z_1, \dots, z_M)$ drawn from a learned codebook
2. Concatenate inputs: form a single sequence $[t_1, \dots, t_L, z_1, \dots, z_M]$ of text tokens followed by image tokens
3. Autoregressive generation: predict the image token-by-token, $p(z_m \mid z_{<m}, T)$

Loss function: next-token cross-entropy over the concatenated text and image token sequence.
Diffusion Models + CLIP
Stable Diffusion and similar models use the CLIP text encoder to condition the denoiser: the noise predictor $\epsilon_\theta(x_t, t, c)$ receives the text embedding $c = f_t(T)$ and is trained with

$$\mathcal{L} = \mathbb{E}_{x, \epsilon, t}\left[\, \| \epsilon - \epsilon_\theta(x_t, t, c) \|^2 \,\right]$$
Cross-Lingual Multimodal
mCLIP (multilingual CLIP) extends CLIP to multiple languages:
- Uses multilingual text encoders (like mBERT, XLM-R)
- Trains on multilingual image-text pairs
- Achieves cross-lingual zero-shot transfer
Advantages:
- Low-resource languages can leverage high-resource language knowledge
- A single model supports 100+ languages
Frequently Asked Questions
Q1: Where does CLIP's zero-shot ability come from?
Zero-shot ability stems from three key factors:
- Massive data: 400 million image-text pairs cover extremely broad concepts
- Natural language supervision: Text descriptions naturally contain rich semantic information
- Contrastive learning: Learns correspondences between images and text, not fixed categories
Formal understanding: traditional classifiers learn $p(y \mid I)$ over a fixed label set $\{1, \dots, K\}$, while CLIP learns a similarity function $s(I, T)$ over open-ended text, so any concept that can be described in language becomes a valid candidate class.
Q2: Why doesn't CLIP need labeled data?
CLIP uses weak supervision rather than traditional labels:
- Traditional labels: Image → Discrete category labels (requires manual work)
- CLIP labels: Image ↔︎ Text description (naturally exists on internet)
The correspondence between image-text pairs is itself the supervision signal, no additional annotation needed.
Q3: How do multimodal models handle modality absence?
Three strategies:
- Modality completion: Use generative models to fill in missing modalities
- Robust training: Randomly drop modalities during training, forcing model to learn single-modal reasoning
- Ensemble methods: Train single-modal and multimodal models, select based on available modalities at test time
Loss function example (modality dropout): train with the multimodal loss plus single-modality losses, e.g. $\mathcal{L} = \mathcal{L}_{v,t} + \lambda(\mathcal{L}_{v} + \mathcal{L}_{t})$, so the model remains usable when one input is missing.
Q4: Why is batch size important in contrastive learning?
Batch size determines number of negative samples:
- Batch size $N$: each sample has $N - 1$ in-batch negative samples
- More negative samples → more accurate gradient estimation → better contrastive effect
Experiments show: CLIP works best with batch size 32768, but computational cost is extremely high.
Solutions:
- Gradient accumulation: accumulate gradients over multiple small batches
- MoCo queue: maintain a negative sample queue, decoupling batch size from negative sample count
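The MoCo-queue idea can be sketched as follows (PyTorch assumed; real MoCo also feeds the queued keys through a momentum encoder, which is omitted here):

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO queue of past text embeddings used as extra negatives,
    decoupling the number of negatives from the batch size."""
    def __init__(self, dim, size=1024):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    def enqueue(self, emb):                      # emb: (B, dim), normalized
        b = emb.size(0)
        idx = (self.ptr + torch.arange(b)) % self.queue.size(0)
        self.queue[idx] = emb.detach()           # no gradient into stored keys
        self.ptr = int((self.ptr + b) % self.queue.size(0))

def loss_with_queue(v, t, queue, tau=0.07):
    """Image->text InfoNCE where negatives = in-batch texts + queued texts."""
    all_t = torch.cat([t, queue.queue], dim=0)   # (B + Q, dim)
    logits = v @ all_t.T / tau                   # positives on the diagonal
    labels = torch.arange(v.size(0))
    return F.cross_entropy(logits, labels)

v = F.normalize(torch.randn(4, 32), dim=-1)
t = F.normalize(torch.randn(4, 32), dim=-1)
q = NegativeQueue(dim=32, size=64)
loss = loss_with_queue(v, t, q)
q.enqueue(t)  # current texts become negatives for future batches
print(float(loss) > 0)  # True
```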
Q5: How to evaluate multimodal models?
Common evaluation tasks:
- Zero-shot classification: ImageNet, CIFAR-100, etc.
- Image-text retrieval: Recall@K metrics
- Image captioning: BLEU, CIDEr, SPICE
- VQA: Accuracy
Cross-task consistency is also important: Good multimodal representations should perform well across multiple tasks.
Q6: Where does CLIP perform poorly?
CLIP's limitations:
- Fine-grained classification: Difficulty distinguishing similar categories (like different dog breeds)
- Counting and spatial relations: Weak understanding of "three cats" or "cat on the left"
- Abstract concepts: Contrastive learning excels at concrete objects, not abstract concepts
- Rare concepts: Poor performance on concepts rare in pretraining data
Reason: Contrastive learning tends to learn coarse-grained, high-frequency visual-linguistic correspondences.
Q7: How to optimize computational efficiency of multimodal models?
Optimization strategies:
- Distillation: Distill large model to small model
- Pruning: Remove redundant attention heads
- Quantization: FP16 or INT8 inference
- Caching: Precompute image features, encode text in real-time
Example: CLIP's image encoding can be done offline, retrieval only needs to encode text query.
Q8: How to fine-tune CLIP on your own data?
Fine-tuning strategies:
- Freeze encoders, train classification head: Suitable for small data
- Low learning rate full fine-tuning: Suitable for medium data
- Parameter-efficient fine-tuning like LoRA: Suitable for large models
Notes:
- Keep the pretrained temperature parameter $\tau$ (or continue learning it from its pretrained value) rather than resetting it
- Use a small learning rate to avoid destroying the pretrained image-text alignment
Q9: How much data is needed for multimodal pretraining?
Empirical rules:
- Millions: Can learn basic visual-linguistic correspondence
- Tens of millions: Achieve usable zero-shot ability
- Billions: Match or exceed supervised learning
CLIP uses 400 million pairs, ALIGN uses 1.8 billion pairs.
But small data also has value: Domain-specific data (like medical imaging + reports) can continue fine-tuning on pretrained basis.
Q10: How to address bias in multimodal models?
Multimodal models inherit biases from training data:
- Gender bias: E.g., "nurse" often associated with female images
- Racial bias: Certain professions or scenes associated with specific races
- Cultural bias: Western culture dominates, other cultures underrepresented
Mitigation methods:
- Data balancing: increase the proportion of minority-group data
- Debiasing regularization: add fairness constraints to the loss function
- Post-processing: adjust the prediction distribution to reduce bias
Q11: What's the difference between CLIP and DALL-E?
| Dimension | CLIP | DALL-E |
|---|---|---|
| Task | Image understanding (classification, retrieval) | Image generation |
| Training method | Contrastive learning | Autoregressive generation |
| Input | Image or text | Text |
| Output | Embedding vectors | Images |
| Reversibility | Bidirectional (image ↔︎ text) | Unidirectional (text → image) |
DALL-E 2 and Stable Diffusion both use CLIP as text encoder.
Q12: Future directions of multimodal transfer?
Frontier trends:
- Unified models: Single model handles all modalities (vision, language, audio, video)
- Few-shot learning: More efficient multimodal adaptation
- Interpretability: Understanding how models associate different modalities
- Interactive learning: Human-AI collaborative annotation and learning
- Multimodal reasoning: Beyond simple correspondence, achieving logical reasoning
Representative works: GPT-4V (vision), Gemini (multimodal unified), Flamingo (few-shot).
Summary
This article comprehensively introduced core techniques of multimodal transfer learning:
- Contrastive learning: Learning cross-modal correspondences through InfoNCE loss
- CLIP/ALIGN: Large-scale vision-language pretraining models and their zero-shot capabilities
- Cross-modal alignment: From global to local, weak to strong supervision alignment methods
- Fusion strategies: Early, late, deep fusion and attention mechanisms
- Downstream applications: Technical details of image captioning, VQA, image-text retrieval
- Complete implementation: building a CLIP-style model from scratch, from encoders to contrastive training
Multimodal transfer learning is reshaping the boundaries of AI applications, from search engines to content creation and from education to healthcare. The next chapter will explore parameter-efficient fine-tuning techniques, examining how methods like LoRA and Adapter achieve efficient transfer without modifying pretrained models.
References
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. ICML.↩︎
Jia, C., Yang, Y., Xia, Y., et al. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. ICML.↩︎
Li, X., Yin, X., Li, C., et al. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. ECCV.↩︎
Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS.↩︎
- Post title:Transfer Learning (8): Multimodal Transfer
- Post author:Chen Kai
- Create time:2024-12-15 16:15:00
- Post link:https://www.chenk.top/transfer-learning-8-multimodal-transfer/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stated otherwise.