Transfer Learning (8): Multimodal Transfer
Chen Kai

Why can CLIP achieve zero-shot image classification using natural language descriptions? Why can DALL-E generate images from text? The core of these breakthroughs is multimodal transfer learning — enabling models to understand and associate information across different modalities (vision, language, audio, etc.).

Multimodal transfer is not just a fusion of technologies, but a key to cognitive intelligence. Starting from the mathematical principles of contrastive learning, this article systematically explains vision-language pretraining models like CLIP and ALIGN, deeply explores cross-modal alignment, fusion strategies, and downstream task applications, providing complete code for implementing multimodal models from scratch.

Motivation and Challenges of Multimodal Learning

Why Multimodal?

Limitations of single-modal learning:

  1. Incomplete information: Images alone cannot explain "why"; text alone cannot convey "what it looks like"
  2. Poor generalization: Pure vision models struggle with conceptual queries (e.g., "find all dangerous scenes")
  3. Low data efficiency: Image annotation is expensive, while text descriptions (like image-text pairs on web pages) naturally exist at massive scale

Advantages of multimodal approaches:

  • Complementarity: Different modalities provide complementary information (e.g., spatial relations from images + causal explanations from text)
  • Robustness: When one modality is missing or noisy, others can compensate
  • Zero-shot generalization: Through language descriptions, models can recognize categories unseen during training

Core question: How can models learn correspondences between different modalities?

Challenges in Multimodal Transfer

1. Modality Heterogeneity

Vision and language are fundamentally different in representation space:

  • Vision: Continuous, high-dimensional, locally correlated (pixel-level)
  • Language: Discrete, symbolic, globally dependent (syntactic structure)

Mathematical description: visual features $x_v \in \mathbb{R}^{H \times W \times 3}$ are continuous tensors, while text features $x_t \in \{1, \dots, V\}^L$ are word-index sequences; direct comparison is meaningless.

2. Semantic Gap

Same concepts have different expressions across modalities:

  • "Cat" in images is a pixel pattern
  • "Cat" in text is a symbol sequence
  • Need to learn cross-modal semantic alignment

3. Data Alignment

Training data has different alignment granularities:

  • Weak alignment: Image-text pairs (like web page images and captions), but text may only describe partial content
  • Strong alignment: Fine-grained annotation (like region-phrase correspondences), but annotation cost is extremely high

4. Modality Fusion Strategy

When and how to fuse information from different modalities:

  • Early fusion: Concatenate features at input layer
  • Late fusion: Extract features separately then fuse
  • Deep fusion: Interact at multiple network layers

Contrastive Learning: Foundation of Multimodal Pretraining

Mathematical Principles of Contrastive Learning

Core idea of contrastive learning: Pull positive pairs closer, push negative pairs apart.

Given image-text pairs $\{(I_i, T_i)\}_{i=1}^N$ with batch size $N$, define the contrastive loss (InfoNCE):

$$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, t_j)/\tau)}$$

where:

  • $\mathrm{sim}(v, t) = \frac{v^\top t}{\|v\|\,\|t\|}$ is cosine similarity
  • $\tau$ is the temperature parameter (controls distribution smoothness)
  • $(v_i, t_i)$ are positive pairs; $(v_i, t_j)$ with $j \neq i$ are negative pairs
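The loss above can be checked numerically. The sketch below computes the symmetric InfoNCE on random L2-normalized embeddings (matched pairs sit on the diagonal); it mirrors the formula directly rather than any particular library's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized embeddings.

    v, t: (N, D) image and text embeddings; pair i matches pair i.
    """
    logits = v @ t.t() / tau            # (N, N) cosine similarities / temperature
    labels = torch.arange(v.size(0))    # positives sit on the diagonal
    loss_i2t = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
v = F.normalize(torch.randn(8, 32), dim=-1)
t = F.normalize(torch.randn(8, 32), dim=-1)
print(info_nce(v, t).item())  # random pairs: loss on the order of log(N)
```

With perfectly aligned pairs (`t = v`) the loss collapses toward zero, since the diagonal dominates the softmax.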

Why Does Contrastive Learning Work?

Understanding from mutual information maximization perspective:

Contrastive learning is equivalent to maximizing the mutual information between vision and text encodings:

$$I(V; T) = H(V) - H(V \mid T)$$

where $H(V)$ is the entropy of visual features and $H(V \mid T)$ is the conditional entropy given text.

The contrastive loss achieves this by:

  1. Maximizing $H(V)$: negative pairs push apart, maintaining diversity in the feature space
  2. Minimizing $H(V \mid T)$: positive pairs pull together, reducing uncertainty given text

Together these maximize the mutual information $I(V; T)$.

The Role of the Temperature Parameter

Temperature $\tau$ controls the "sharpness" of the similarity distribution:

  • Small $\tau$ (e.g., 0.01): Sharp distribution that focuses only on the most similar samples; may lead to overfitting
  • Large $\tau$ (e.g., 1.0): Smooth distribution that weighs all samples; learning may be insufficient

As $\tau \to 0$, the softmax degenerates to argmax (selects only the maximum value).

In practice, $\tau$ is typically set to 0.07 (the empirical value from the CLIP paper).
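A quick way to see the effect of $\tau$ is to push the same similarity scores through a temperature-scaled softmax; the similarity values below are made up for illustration.

```python
import torch

def softmax_with_temperature(sims, tau):
    """Softmax over similarity scores, sharpened or smoothed by temperature tau."""
    return torch.softmax(torch.tensor(sims) / tau, dim=-1)

sims = [0.9, 0.8, 0.1]  # cosine similarities of one image to three candidate texts
for tau in (0.01, 0.07, 1.0):
    probs = softmax_with_temperature(sims, tau)
    # Small tau -> near one-hot on the best match; large tau -> near uniform
    print(f"tau={tau}: {[round(p, 4) for p in probs.tolist()]}")
```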

CLIP: Connecting Text and Images

Core Idea of CLIP

CLIP (Contrastive Language-Image Pre-training) [1] design philosophy:

Don't predict specific categories; learn correspondences between images and text.

Traditional approach: Image → Fixed categories (like ImageNet's 1000 classes)
CLIP approach: Image ↔︎ Arbitrary text descriptions

Advantages of this design:

  1. Data scale: Can leverage 400 million image-text pairs from the internet, far exceeding manually annotated datasets
  2. Zero-shot generalization: Recognize unseen categories through text descriptions
  3. Task flexibility: The same model can do classification, retrieval, generation, etc.

CLIP Architecture

CLIP consists of two encoders:

  1. Image encoder:
    • Can be a ResNet or Vision Transformer (ViT)
    • Outputs fixed-dimensional image embeddings
  2. Text encoder:
    • Uses a Transformer
    • Outputs text embeddings in the same dimension as the image embeddings

Training process:

  1. A batch contains $N$ image-text pairs
  2. Compute the $N \times N$ similarity matrix $S_{ij} = \mathrm{sim}(v_i, t_j)$
  3. Diagonal elements $S_{ii}$ are positive pairs; off-diagonal elements are negative pairs
  4. Optimize the contrastive loss in both the image → text and text → image directions simultaneously

Loss function:

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)$$

where $\mathcal{L}_{I \to T}$ and $\mathcal{L}_{T \to I}$ are the InfoNCE losses computed along the rows and columns of $S$, respectively.

CLIP's Zero-Shot Classification

Given an image $I$ and $K$ candidate categories, CLIP's zero-shot classification workflow:

  1. Convert category names to text descriptions:
    • Simple version: Category name → "a photo of a {class}"
    • Complex version: Ensemble multiple templates (like "a photo of a {class}", "a picture of a {class}")
  2. Encode the image and all text descriptions: $v = f_v(I)$, $t_k = f_t(T_k)$
  3. Compute class probabilities: $p_k = \mathrm{softmax}_k\left(\mathrm{sim}(v, t_k)/\tau\right)$
  4. Select the category with the highest probability

Advantage of this approach: No training on target dataset needed, only category names required.
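Template ensembling from step 1 can be sketched as follows. `encode_text` here is a hypothetical stand-in that returns a deterministic pseudo-embedding, not real CLIP features; the point is the averaging and re-normalization of per-template class embeddings.

```python
import torch
import torch.nn.functional as F

TEMPLATES = ["a photo of a {}", "a picture of a {}", "an image of a {}"]

def encode_text(prompt, dim=64):
    """Hypothetical text encoder: deterministic pseudo-embedding per prompt."""
    g = torch.Generator().manual_seed(hash(prompt) % (2**31))
    return F.normalize(torch.randn(dim, generator=g), dim=-1)

def class_embedding(class_name):
    """Ensemble: encode every template, average, then re-normalize."""
    embs = torch.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    return F.normalize(embs.mean(dim=0), dim=-1)

def zero_shot_probs(image_emb, class_names, tau=0.07):
    """Softmax over image/class-embedding similarities (step 3 of the workflow)."""
    text_embs = torch.stack([class_embedding(c) for c in class_names])
    return torch.softmax(image_emb @ text_embs.t() / tau, dim=-1)

img = F.normalize(torch.randn(64), dim=-1)
print(zero_shot_probs(img, ["cat", "dog", "car"]))
```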

CLIP vs. Traditional Methods

Dimension | Traditional Supervised Learning | CLIP
--- | --- | ---
Training data | Fixed category labels (like ImageNet) | Image-text pairs (from web pages)
Data scale | Millions | Hundreds of millions to billions
Generalization | Limited to training categories | Zero-shot recognition of new categories
Annotation cost | High (manual annotation needed) | Low (naturally exists)
Task adaptation | Requires fine-tuning | Zero-shot or few-shot

ALIGN: Larger-Scale Alignment

ALIGN's Improvements

ALIGN (A Large-scale ImaGe and Noisy-text embedding) [2] is Google's improved version of CLIP, with core differences:

  1. Data scale: 1.8 billion image-text pairs (4.5x CLIP)
  2. Noisy data: Directly uses web-scraped data without filtering noise
  3. Simplified architecture: Uses EfficientNet as image encoder

Noise Robustness

ALIGN proved an important finding: Contrastive learning is naturally robust to noisy labels.

Reason analysis:

Suppose the true matching pair is $(I_i, T_i)$, and a noisy label instead pairs $I_i$ with $T_j$, where $T_j$ does not match $I_i$.

In large-batch contrastive learning:

  • $(I_i, T_j)$ is pulled together as a positive pair, but $T_j$ has low similarity with $I_i$, so its gradient is small
  • Other true matching pairs $(I_k, T_k)$ provide the correct signal and dominate the optimization direction

Mathematical representation: let the noise ratio be $\epsilon$; the expected gradient is

$$\mathbb{E}[\nabla \mathcal{L}] = (1 - \epsilon)\,\nabla \mathcal{L}_{\text{clean}} + \epsilon\,\nabla \mathcal{L}_{\text{noisy}}$$

When $\epsilon$ is small and the batch is large, $\nabla \mathcal{L}_{\text{clean}}$ dominates and the noise is averaged out.

Experiments show: Even with 30% noise, ALIGN performance drops less than 5%.

Cross-Modal Alignment Methods

Levels of Alignment

Cross-modal alignment can occur at different granularities:

  1. Global alignment: Entire image ↔︎ Entire sentence (CLIP/ALIGN)
  2. Region alignment: Image regions ↔︎ Phrases (Visual Genome)
  3. Pixel alignment: Pixels ↔︎ Words (dense alignment)

Deep Alignment: OSCAR

OSCAR (Object-Semantics Aligned Pre-training) [3] proposes an object-label-based alignment strategy:

Core idea: Introduce object labels as "anchors" connecting vision and language.

Input representation:

$$x = (w_1, \dots, w_T;\; q_1, \dots, q_K;\; v_1, \dots, v_K)$$

where:

  • $w_1, \dots, w_T$ are text words
  • $v_1, \dots, v_K$ are image region features
  • $q_1, \dots, q_K$ are object labels (like "dog", "car")

Pretraining tasks:

  1. Masked Language Modeling (MLM): Predict masked words
  2. Masked Region Modeling (MRM): Predict masked image regions
  3. Object label classification: Predict object categories of regions

Advantage: Object labels provide explicit semantic alignment signals, accelerating convergence.

Design of Alignment Losses

Besides contrastive loss, other alignment losses include:

1. Triplet Loss

$$\mathcal{L}_{\text{triplet}} = \max\left(0,\; \alpha - \mathrm{sim}(v, t^+) + \mathrm{sim}(v, t^-)\right)$$

where $t^+$ is the matching text, $t^-$ is a non-matching text, and $\alpha$ is the margin.

2. Cycle Consistency Loss

Used for joint training of image captioning and image generation:

$$\mathcal{L}_{\text{cycle}} = \left\| G(C(I)) - I \right\|^2$$

where $C$ is the image captioning model and $G$ is the text-to-image generation model.

3. Knowledge Distillation Alignment

Use pretrained single-modal models as teachers, e.g. by matching the student's features to the teachers' features:

$$\mathcal{L}_{\text{KD}} = \left\| f_v(I) - f_v^{\text{teacher}}(I) \right\|^2 + \left\| f_t(T) - f_t^{\text{teacher}}(T) \right\|^2$$

Multimodal Fusion Strategies

When to Fuse

1. Early Fusion

Concatenate features from different modalities at the input layer:

$$z = f([x_v; x_t])$$

Pros: Simple, full interaction
Cons: Cannot leverage pretrained models, fragile to modality absence

2. Late Fusion

Extract features separately, then fuse:

$$z = g(f_v(x_v), f_t(x_t))$$

Pros: Can use pretrained encoders, flexible
Cons: Insufficient interaction

3. Deep Fusion

Interact at multiple levels:

$$z_v^{(l+1)},\, z_t^{(l+1)} = \mathrm{Interact}\left(z_v^{(l)}, z_t^{(l)}\right)$$

Pros: Full interaction, flexible modeling
Cons: High computational complexity
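The contrast between early and late fusion comes down to where the concatenation happens. The schematic modules below use placeholder dimensions and are not taken from any cited model.

```python
import torch
import torch.nn as nn

D_V, D_T, D = 16, 8, 32  # placeholder feature dimensions (assumptions)

class EarlyFusion(nn.Module):
    """Concatenate raw modality features at the input, then one joint network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_V + D_T, D), nn.ReLU(), nn.Linear(D, D))

    def forward(self, x_v, x_t):
        return self.net(torch.cat([x_v, x_t], dim=-1))

class LateFusion(nn.Module):
    """Separate encoder per modality; fuse only the final embeddings."""
    def __init__(self):
        super().__init__()
        self.f_v = nn.Linear(D_V, D)   # could be a pretrained vision encoder
        self.f_t = nn.Linear(D_T, D)   # could be a pretrained text encoder
        self.head = nn.Linear(2 * D, D)

    def forward(self, x_v, x_t):
        return self.head(torch.cat([self.f_v(x_v), self.f_t(x_t)], dim=-1))

x_v, x_t = torch.randn(4, D_V), torch.randn(4, D_T)
print(EarlyFusion()(x_v, x_t).shape, LateFusion()(x_v, x_t).shape)
```

Late fusion's per-modality encoders are exactly what lets CLIP reuse pretrained backbones and tolerate a missing modality at retrieval time.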

Attention-Based Fusion

Cross-Attention

Visual features attending to text features:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

where $Q = W_Q h_v$, $K = W_K h_t$, $V = W_V h_t$.

Co-Attention

Vision and text mutually attend to each other:

$$h_v' = \mathrm{Attn}(h_v, h_t, h_t), \qquad h_t' = \mathrm{Attn}(h_t, h_v, h_v)$$

Self-Attention on Concatenation

Apply self-attention after concatenating vision and text features (Transformer style):

$$h = \mathrm{SelfAttn}([h_v; h_t])$$

This is the single-stream design used by BERT-like models (e.g., VisualBERT, UNITER); dual-stream models such as ViLBERT and LXMERT instead rely on co-attention.
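Cross-attention and co-attention are each a single call to PyTorch's `nn.MultiheadAttention` with different query/key/value roles. The toy sketch below shares one module across both directions for brevity; real models use separate per-direction, per-layer modules.

```python
import torch
import torch.nn as nn

D, HEADS = 32, 4  # toy dimensions
attn = nn.MultiheadAttention(embed_dim=D, num_heads=HEADS, batch_first=True)

h_v = torch.randn(2, 49, D)  # e.g. 49 image patch features
h_t = torch.randn(2, 12, D)  # e.g. 12 text token features

# Cross-attention: vision queries attend over text keys/values
v_attends_t, _ = attn(query=h_v, key=h_t, value=h_t)

# Co-attention adds the symmetric direction: text queries over vision
t_attends_v, _ = attn(query=h_t, key=h_v, value=h_v)

# The output always keeps the query's sequence length
print(v_attends_t.shape, t_attends_v.shape)
```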

Downstream Task Applications

Image Captioning

Task definition: Given an image $I$, generate descriptive text $T = (w_1, \dots, w_L)$.

Encoder-Decoder Architecture

Encoder: extract image features

$$V = \{v_1, \dots, v_n\} = f_{\text{enc}}(I)$$

Decoder: autoregressive text generation

$$p(w_t \mid w_{<t}, I) = \mathrm{softmax}(W h_t)$$

where $h_t$ is the decoder hidden state at time $t$:

$$h_t = \mathrm{LSTM}\left(h_{t-1}, [e(w_{t-1}); c_t]\right)$$

The context vector $c_t$ is computed by attention:

$$c_t = \sum_i \alpha_{ti} v_i$$

where $\alpha_{ti}$ is the attention score between $h_{t-1}$ and region feature $v_i$.

Reinforcement Learning Optimization

Since metrics like BLEU are non-differentiable, use the policy gradient:

$$\nabla_\theta \mathcal{L} = -\,\mathbb{E}_{w^s \sim p_\theta}\left[ r(w^s)\, \nabla_\theta \log p_\theta(w^s) \right]$$

where $r(w^s)$ is the reward for the generated sequence $w^s$ (e.g., its CIDEr score).

Visual Question Answering (VQA)

Task definition: Given an image $I$ and a question $Q$, predict the answer $a$.

Classification-Based VQA

Treat VQA as multi-class classification over a candidate answer set of size $K$:

$$p(a \mid I, Q) = \mathrm{softmax}\left(W\,[f_v(I); f_q(Q)]\right)$$

Generation-Based VQA

Treat VQA as conditional text generation:

$$p(a \mid I, Q) = \prod_t p\left(a_t \mid a_{<t}, I, Q\right)$$

Attention Mechanism

Question-guided visual attention:

$$\alpha_i = \mathrm{softmax}_i\left(f(v_i, q)\right), \qquad \hat{v} = \sum_i \alpha_i v_i$$

Final prediction:

$$p(a \mid I, Q) = \mathrm{softmax}\left(W\,[\hat{v}; q]\right)$$

Image-Text Retrieval

Task definition: Given text, retrieve relevant images (or vice versa).

Similarity-Based Ranking

Compute the similarity between the query text $t$ and all candidate images $\{v_i\}$:

$$s_i = \mathrm{sim}(t, v_i)$$

Rank by descending $s_i$ and take the Top-K.

Metric Learning Optimization

Triplet loss:

$$\mathcal{L} = \max\left(0,\; \alpha - \mathrm{sim}(t, v^+) + \mathrm{sim}(t, v^-)\right)$$

where $v^+$ is the matching image and $v^-$ a non-matching image.

Hard Negative Mining

Select the negative sample with the highest similarity in the batch:

$$v^- = \arg\max_{v_j \neq v^+} \mathrm{sim}(t, v_j)$$

This accelerates convergence and improves performance.
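The triplet loss and hard-negative mining combine naturally into one in-batch loss. This is a minimal sketch assuming L2-normalized embeddings where pair i matches pair i; the 1e9 mask constant is an implementation convenience, not part of the formula.

```python
import torch
import torch.nn.functional as F

def triplet_loss_hard(text_emb, img_emb, margin=0.2):
    """In-batch triplet loss with hardest-negative mining.

    text_emb, img_emb: (N, D), L2-normalized; pair i matches pair i.
    """
    sims = text_emb @ img_emb.t()                     # (N, N) cosine similarities
    pos = sims.diag()                                 # similarity of matching pairs
    neg_sims = sims - torch.eye(sims.size(0)) * 1e9   # mask out the positives
    hard_neg = neg_sims.max(dim=1).values             # hardest negative per query
    return F.relu(margin - pos + hard_neg).mean()

torch.manual_seed(0)
t = F.normalize(torch.randn(8, 32), dim=-1)
v = F.normalize(torch.randn(8, 32), dim=-1)
print(triplet_loss_hard(t, v).item())
```

When the embeddings are well aligned (positives near similarity 1, negatives well below it), the hinge is inactive and the loss goes to zero.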

Complete Implementation: Building CLIP Model from Scratch

Below implements a simplified CLIP including image encoder, text encoder, contrastive training, and zero-shot classification.

"""
CLIP Implementation from Scratch: Contrastive Vision-Language Pretraining
Includes: Image/text encoders, contrastive loss, zero-shot classification, visualization
"""

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import torchvision.models as models
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import os
from typing import List, Tuple, Dict

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# ============================================================================
# Image Encoder: Using ResNet50
# ============================================================================

class ImageEncoder(nn.Module):
    """
    Image encoder: ResNet50 + projection head
    """
    def __init__(self, embed_dim=512, pretrained=True):
        super().__init__()
        # Load pretrained ResNet50 (new-style torchvision weights argument)
        weights = models.ResNet50_Weights.DEFAULT if pretrained else None
        resnet = models.resnet50(weights=weights)
        # Remove final FC layer
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

        # Projection head: 2048 → embed_dim
        self.projection = nn.Sequential(
            nn.Linear(2048, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, images):
        """
        Args:
            images: (batch_size, 3, H, W)
        Returns:
            embeddings: (batch_size, embed_dim)
        """
        features = self.backbone(images)  # (B, 2048, 1, 1)
        features = features.view(features.size(0), -1)  # (B, 2048)
        embeddings = self.projection(features)  # (B, embed_dim)
        # L2 normalization
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        return embeddings

# ============================================================================
# Text Encoder: Transformer
# ============================================================================

class TextEncoder(nn.Module):
    """
    Text encoder: Transformer + projection head
    """
    def __init__(self, vocab_size=10000, embed_dim=512, max_len=77,
                 num_heads=8, num_layers=6):
        super().__init__()
        self.embed_dim = embed_dim
        self.max_len = max_len

        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        # Position encoding
        self.position_embedding = nn.Parameter(torch.randn(max_len, embed_dim))

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=2048,
            dropout=0.1,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Projection head
        self.projection = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, text_tokens):
        """
        Args:
            text_tokens: (batch_size, seq_len)
        Returns:
            embeddings: (batch_size, embed_dim)
        """
        batch_size, seq_len = text_tokens.shape

        # Token embedding + position encoding
        token_embed = self.token_embedding(text_tokens)  # (B, L, D)
        position_embed = self.position_embedding[:seq_len, :]  # (L, D)
        x = token_embed + position_embed.unsqueeze(0)  # (B, L, D)

        # Transformer encoding
        x = self.transformer(x)  # (B, L, D)

        # Take the first-position ([CLS]-style) representation
        cls_embed = x[:, 0, :]  # (B, D)

        # Projection
        embeddings = self.projection(cls_embed)  # (B, D)
        # L2 normalization
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        return embeddings

# ============================================================================
# CLIP Model
# ============================================================================

class CLIP(nn.Module):
    """
    CLIP model: Image encoder + Text encoder
    """
    def __init__(self, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.image_encoder = ImageEncoder(embed_dim=embed_dim)
        self.text_encoder = TextEncoder(vocab_size=vocab_size, embed_dim=embed_dim)

        # Learnable temperature parameter
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    def forward(self, images, text_tokens):
        """
        Args:
            images: (batch_size, 3, H, W)
            text_tokens: (batch_size, seq_len)
        Returns:
            logits_per_image: (batch_size, batch_size)
            logits_per_text: (batch_size, batch_size)
        """
        # Encoding
        image_embeddings = self.image_encoder(images)  # (B, D)
        text_embeddings = self.text_encoder(text_tokens)  # (B, D)

        # Compute similarity matrix
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_embeddings @ text_embeddings.t()  # (B, B)
        logits_per_text = logits_per_image.t()  # (B, B)

        return logits_per_image, logits_per_text

# ============================================================================
# Contrastive Loss
# ============================================================================

def contrastive_loss(logits_per_image, logits_per_text):
    """
    Contrastive loss: InfoNCE
    """
    batch_size = logits_per_image.shape[0]
    labels = torch.arange(batch_size, device=logits_per_image.device)

    # Image → text direction loss
    loss_i2t = nn.CrossEntropyLoss()(logits_per_image, labels)
    # Text → image direction loss
    loss_t2i = nn.CrossEntropyLoss()(logits_per_text, labels)

    # Total loss
    loss = (loss_i2t + loss_t2i) / 2
    return loss

# ============================================================================
# Synthetic Dataset
# ============================================================================

class SyntheticImageTextDataset(Dataset):
    """
    Synthetic image-text pair dataset
    """
    def __init__(self, num_samples=1000, image_size=224, vocab_size=10000, seq_len=77):
        self.num_samples = num_samples
        self.image_size = image_size
        self.vocab_size = vocab_size
        self.seq_len = seq_len

        # Generate synthetic data
        self.images = []
        self.texts = []

        for i in range(num_samples):
            # Generate random image (simulated)
            img = np.random.randn(3, image_size, image_size).astype(np.float32)
            self.images.append(img)

            # Generate random text (simulated)
            text = np.random.randint(1, vocab_size, size=seq_len)
            self.texts.append(text)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.FloatTensor(self.images[idx])
        text = torch.LongTensor(self.texts[idx])
        return image, text

# ============================================================================
# Training Function
# ============================================================================

def train_clip(model, dataloader, optimizer, device, num_epochs=10):
    """
    Train CLIP model
    """
    model.train()
    losses = []

    for epoch in range(num_epochs):
        epoch_loss = 0
        for batch_idx, (images, text_tokens) in enumerate(dataloader):
            images = images.to(device)
            text_tokens = text_tokens.to(device)

            # Forward pass
            logits_per_image, logits_per_text = model(images, text_tokens)

            # Compute loss
            loss = contrastive_loss(logits_per_image, logits_per_text)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

            if (batch_idx + 1) % 10 == 0:
                print(f"Epoch [{epoch+1}/{num_epochs}], Batch [{batch_idx+1}/{len(dataloader)}], Loss: {loss.item():.4f}")

        avg_loss = epoch_loss / len(dataloader)
        losses.append(avg_loss)
        print(f"Epoch [{epoch+1}/{num_epochs}] Average Loss: {avg_loss:.4f}")

    return losses

# ============================================================================
# Zero-Shot Classification
# ============================================================================

def zero_shot_classification(model, image, text_labels, device):
    """
    Zero-shot classification: given an image and candidate text labels
    Args:
        model: CLIP model
        image: (3, H, W)
        text_labels: List of (seq_len,) token sequences, one per class
        device: Device
    Returns:
        probs: (num_classes,) Probability for each class
    """
    model.eval()
    with torch.no_grad():
        # Encode image
        image = image.unsqueeze(0).to(device)  # (1, 3, H, W)
        image_embedding = model.image_encoder(image)  # (1, D)

        # Encode all text labels
        text_embeddings = []
        for text_tokens in text_labels:
            text_tokens = text_tokens.unsqueeze(0).to(device)  # (1, L)
            text_embedding = model.text_encoder(text_tokens)  # (1, D)
            text_embeddings.append(text_embedding)

        text_embeddings = torch.cat(text_embeddings, dim=0)  # (K, D)

        # Compute similarity
        logit_scale = model.logit_scale.exp()
        logits = logit_scale * image_embedding @ text_embeddings.t()  # (1, K)

        # Softmax to get probabilities
        probs = torch.softmax(logits, dim=-1).squeeze(0)  # (K,)

    return probs.cpu().numpy()

# ============================================================================
# Visualization
# ============================================================================

def visualize_similarity_matrix(model, dataloader, device, num_samples=16):
    """
    Visualize the image-text similarity matrix
    """
    model.eval()

    # Get one batch
    images, text_tokens = next(iter(dataloader))
    images = images[:num_samples].to(device)
    text_tokens = text_tokens[:num_samples].to(device)

    with torch.no_grad():
        logits_per_image, _ = model(images, text_tokens)
        similarity_matrix = logits_per_image.cpu().numpy()

    # Plot heatmap
    fig, ax = plt.subplots(figsize=(10, 8))
    im = ax.imshow(similarity_matrix, cmap='viridis', aspect='auto')

    ax.set_xticks(np.arange(num_samples))
    ax.set_yticks(np.arange(num_samples))
    ax.set_xticklabels([f'Text {i}' for i in range(num_samples)])
    ax.set_yticklabels([f'Image {i}' for i in range(num_samples)])

    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

    # Add colorbar
    cbar = plt.colorbar(im, ax=ax)
    cbar.set_label('Similarity Score', rotation=270, labelpad=20)

    # Add value annotations
    for i in range(num_samples):
        for j in range(num_samples):
            ax.text(j, i, f'{similarity_matrix[i, j]:.2f}',
                    ha="center", va="center", color="white", fontsize=8)

    ax.set_title("Image-Text Similarity Matrix")
    plt.tight_layout()
    plt.savefig('similarity_matrix.png', dpi=150, bbox_inches='tight')
    plt.close()
    print("Similarity matrix saved to similarity_matrix.png")

def plot_training_curve(losses):
    """
    Plot training curve
    """
    plt.figure(figsize=(10, 6))
    plt.plot(losses, marker='o', linewidth=2, markersize=6)
    plt.xlabel('Epoch', fontsize=12)
    plt.ylabel('Loss', fontsize=12)
    plt.title('CLIP Training Loss', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('training_curve.png', dpi=150, bbox_inches='tight')
    plt.close()
    print("Training curve saved to training_curve.png")

# ============================================================================
# Main Function
# ============================================================================

def main():
    # Hyperparameters
    embed_dim = 512
    vocab_size = 10000
    batch_size = 32
    num_epochs = 20
    learning_rate = 1e-4

    # Device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Create dataset and DataLoader
    print("\nCreating synthetic dataset...")
    dataset = SyntheticImageTextDataset(num_samples=1000, vocab_size=vocab_size)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Create model
    print("Initializing CLIP model...")
    model = CLIP(embed_dim=embed_dim, vocab_size=vocab_size).to(device)

    # Calculate parameter count
    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total trainable parameters: {num_params:,}")

    # Optimizer
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Training
    print("\nStarting training...")
    losses = train_clip(model, dataloader, optimizer, device, num_epochs=num_epochs)

    # Plot training curve
    plot_training_curve(losses)

    # Visualize similarity matrix
    print("\nVisualizing similarity matrix...")
    visualize_similarity_matrix(model, dataloader, device)

    # Zero-shot classification example
    print("\nZero-shot classification example:")
    # Create test image and text labels
    test_image = torch.randn(3, 224, 224)
    text_labels = [
        torch.randint(1, vocab_size, (77,)) for _ in range(5)
    ]

    probs = zero_shot_classification(model, test_image, text_labels, device)
    for i, prob in enumerate(probs):
        print(f"Class {i}: {prob:.4f}")

    print("\n" + "="*60)
    print("Training completed!")
    print("="*60)

if __name__ == "__main__":
    main()

Code Explanation

Core components:

  1. Image encoder: ResNet50 feature extraction + projection layer
  2. Text encoder: Transformer + positional encoding + projection layer
  3. Contrastive loss: Bidirectional InfoNCE loss
  4. Zero-shot classification: Compute similarity between image and all class texts

Training workflow:

  1. In-batch contrastive learning: $N$ image-text pairs produce an $N \times N$ similarity matrix
  2. Diagonal elements are positive pairs, off-diagonal elements are negative pairs
  3. Optimize both image → text and text → image directions simultaneously

Key techniques:

  • L2 normalization: Ensures stable similarity computation
  • Learnable temperature parameter: Automatically adjusts softmax distribution
  • Large-batch training: More negative samples, better contrastive effect

Advanced Topics

Multimodal Transformers

ViLBERT (Vision-and-Language BERT)

ViLBERT [4] proposes a dual-stream Transformer architecture:

  • Vision stream: Processes image region features
  • Language stream: Processes text tokens
  • Cross-modal connections: Interact through Co-Attention layers

Architecture: at each layer, the two streams exchange information through co-attention:

$$h_v^{(l+1)},\, h_t^{(l+1)} = \mathrm{CoAttn}\left(h_v^{(l)}, h_t^{(l)}\right)$$

Pretraining tasks:

  1. Masked Language Modeling (MLM)
  2. Masked Region Modeling (MRM)
  3. Image-Text Matching (ITM)

Text-to-Image Generation

DALL-E

DALL-E uses autoregressive Transformer for image generation:

  1. VQ-VAE encoding: Discretize images into token sequences $z = (z_1, \dots, z_M)$

  2. Concatenate inputs: $[T; z] = (w_1, \dots, w_L, z_1, \dots, z_M)$

  3. Autoregressive generation: Predict image tokens one by one

Loss function:

$$\mathcal{L} = -\sum_{m} \log p\left(z_m \mid z_{<m}, T\right)$$

Diffusion Models + CLIP

Stable Diffusion and similar models use the CLIP text encoder as the condition:

$$\mathcal{L} = \mathbb{E}_{x, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta\left(x_t, t, \tau_\theta(y)\right) \right\|^2 \right]$$

where $y$ is the text description and $\tau_\theta(y)$ is its CLIP text embedding.

Cross-Lingual Multimodal

mCLIP (multilingual CLIP) extends CLIP to multiple languages:

  • Uses multilingual text encoders (like mBERT, XLM-R)
  • Trains on multilingual image-text pairs
  • Achieves cross-lingual zero-shot transfer

Advantages:

  • Low-resource languages can leverage high-resource language knowledge
  • A single model supports 100+ languages

Frequently Asked Questions

Q1: Where does CLIP's zero-shot ability come from?

Zero-shot ability stems from three key factors:

  1. Massive data: 400 million image-text pairs cover extremely broad concepts
  2. Natural language supervision: Text descriptions naturally contain rich semantic information
  3. Contrastive learning: Learns correspondences between images and text, not fixed categories

Formal understanding: Traditional classifiers learn $p(y \mid x)$ where $y \in \{1, \dots, K\}$ is fixed; CLIP learns $\mathrm{sim}(f_v(x), f_t(t))$ where $t$ can be any text description.

Q2: Why doesn't CLIP need labeled data?

CLIP uses weak supervision rather than traditional labels:

  • Traditional labels: Image → Discrete category labels (requires manual work)
  • CLIP labels: Image ↔︎ Text description (naturally exists on internet)

The correspondence between image-text pairs is itself the supervision signal, no additional annotation needed.

Q3: How do multimodal models handle modality absence?

Three strategies:

  1. Modality completion: Use generative models to fill in missing modalities
  2. Robust training: Randomly drop modalities during training, forcing model to learn single-modal reasoning
  3. Ensemble methods: Train single-modal and multimodal models, select based on available modalities at test time

Loss function example, combining multimodal and single-modal terms:

$$\mathcal{L} = \mathcal{L}_{\text{multi}} + \lambda\left(\mathcal{L}_{\text{vision-only}} + \mathcal{L}_{\text{text-only}}\right)$$

Q4: Why is batch size important in contrastive learning?

Batch size determines number of negative samples:

  • Batch size $N$: Each sample has $N - 1$ negative samples
  • More negative samples → More accurate gradient estimation → Better contrastive effect

Experiments show: CLIP works best with batch size 32768, but computational cost is extremely high.

Solutions:

  • Gradient accumulation: Accumulate gradients over multiple small batches
  • MoCo queue: Maintain a negative-sample queue, decoupling batch size from the number of negatives
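The MoCo-style queue can be sketched in a few lines: a fixed-size FIFO buffer of past embeddings serves as the negative set, so the negative count (the queue size) is independent of the batch size. This sketch omits MoCo's momentum encoder for brevity; sizes are placeholders.

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """Fixed-size FIFO queue of past embeddings (MoCo-style): the number of
    negatives equals the queue size, not the current batch size."""
    def __init__(self, dim=32, size=256):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    def enqueue(self, emb):
        """Overwrite the oldest entries with the latest batch's embeddings."""
        n = emb.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.size(0)
        self.queue[idx] = emb.detach()
        self.ptr = (self.ptr + n) % self.queue.size(0)

    def logits(self, query, positive, tau=0.07):
        """Positive similarity in column 0, then similarities to all queued negatives."""
        pos = (query * positive).sum(dim=-1, keepdim=True)  # (N, 1)
        neg = query @ self.queue.t()                        # (N, size)
        return torch.cat([pos, neg], dim=1) / tau           # target label is 0

q = NegativeQueue()
query = F.normalize(torch.randn(4, 32), dim=-1)
pos = F.normalize(torch.randn(4, 32), dim=-1)
logits = q.logits(query, pos)
loss = F.cross_entropy(logits, torch.zeros(4, dtype=torch.long))
q.enqueue(pos)  # current positives become future negatives
print(logits.shape, loss.item())  # logits: (4, 257) = 1 positive + 256 negatives
```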

Q5: How to evaluate multimodal models?

Common evaluation tasks:

  1. Zero-shot classification: ImageNet, CIFAR-100, etc.
  2. Image-text retrieval: Recall@K metrics
  3. Image captioning: BLEU, CIDEr, SPICE
  4. VQA: Accuracy

Cross-task consistency is also important: Good multimodal representations should perform well across multiple tasks.

Q6: Where does CLIP perform poorly?

CLIP's limitations:

  1. Fine-grained classification: Difficulty distinguishing similar categories (like different dog breeds)
  2. Counting and spatial relations: Weak understanding of "three cats" or "cat on the left"
  3. Abstract concepts: Contrastive learning excels at concrete objects, not abstract concepts
  4. Rare concepts: Poor performance on concepts rare in pretraining data

Reason: Contrastive learning tends to learn coarse-grained, high-frequency visual-linguistic correspondences.

Q7: How to optimize computational efficiency of multimodal models?

Optimization strategies:

  1. Distillation: Distill large model to small model
  2. Pruning: Remove redundant attention heads
  3. Quantization: FP16 or INT8 inference
  4. Caching: Precompute image features, encode text in real-time

Example: CLIP's image encoding can be done offline; retrieval then only needs to encode the text query.
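The caching pattern looks like this in practice: the image side is a precomputed matrix, and each query costs one text encoding plus one matrix multiply. The embeddings below are random stand-ins for real encoder outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Offline: encode the whole image collection once and store the matrix.
image_features = F.normalize(torch.randn(10_000, 64), dim=-1)  # precomputed cache

def search(text_emb, top_k=5):
    """Online: one text encoding + one matrix multiply over the cached features."""
    scores = image_features @ text_emb            # (10000,) cosine similarities
    return scores.topk(top_k).indices.tolist()    # indices of best-matching images

query = F.normalize(torch.randn(64), dim=-1)      # stand-in for a text embedding
print(search(query))
```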

Q8: How to fine-tune CLIP on your own data?

Fine-tuning strategies:

  1. Freeze encoders, train classification head: Suitable for small data
  2. Low learning rate full fine-tuning: Suitable for medium data
  3. Parameter-efficient fine-tuning like LoRA: Suitable for large models

Notes:

  • Keep the temperature parameter $\tau$ (the learnable logit scale) unchanged
  • Use the contrastive loss rather than plain cross-entropy
  • Data augmentation is equally important for multimodal models

Q9: How much data is needed for multimodal pretraining?

Empirical rules:

  • Millions: Can learn basic visual-linguistic correspondence
  • Tens of millions: Achieve usable zero-shot ability
  • Billions: Match or exceed supervised learning

CLIP uses 400 million pairs, ALIGN uses 1.8 billion pairs.

But small data also has value: Domain-specific data (like medical imaging + reports) can continue fine-tuning on pretrained basis.

Q10: How to address bias in multimodal models?

Multimodal models inherit biases from training data:

  1. Gender bias: E.g., "nurse" often associated with female images
  2. Racial bias: Certain professions or scenes associated with specific races
  3. Cultural bias: Western culture dominates, other cultures underrepresented

Mitigation methods:

  • Data balancing: Increase the proportion of minority-group data
  • Debiasing regularization: Add fairness constraints to the loss function
  • Post-processing: Adjust the prediction distribution to reduce bias

Q11: What's the difference between CLIP and DALL-E?

Dimension | CLIP | DALL-E
--- | --- | ---
Task | Image understanding (classification, retrieval) | Image generation
Training method | Contrastive learning | Autoregressive generation
Input | Image or text | Text
Output | Embedding vectors | Images
Directionality | Bidirectional (image ↔ text) | Unidirectional (text → image)

DALL-E 2 and Stable Diffusion both use CLIP as text encoder.

Q12: Future directions of multimodal transfer?

Frontier trends:

  1. Unified models: Single model handles all modalities (vision, language, audio, video)
  2. Few-shot learning: More efficient multimodal adaptation
  3. Interpretability: Understanding how models associate different modalities
  4. Interactive learning: Human-AI collaborative annotation and learning
  5. Multimodal reasoning: Beyond simple correspondence, achieving logical reasoning

Representative works: GPT-4V (vision), Gemini (multimodal unified), Flamingo (few-shot).

Summary

This article comprehensively introduced core techniques of multimodal transfer learning:

  1. Contrastive learning: Learning cross-modal correspondences through InfoNCE loss
  2. CLIP/ALIGN: Large-scale vision-language pretraining models and their zero-shot capabilities
  3. Cross-modal alignment: From global to local, weak to strong supervision alignment methods
  4. Fusion strategies: Early, late, deep fusion and attention mechanisms
  5. Downstream applications: Technical details of image captioning, VQA, image-text retrieval
  6. Complete implementation: 400+ lines of code building a CLIP model from scratch

Multimodal transfer learning is reshaping AI application boundaries, from search engines to content creation, from education to healthcare, everywhere. The next chapter will explore parameter-efficient fine-tuning techniques, examining how methods like LoRA and Adapter achieve efficient transfer without modifying pretrained models.

References


  1. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. ICML.

  2. Jia, C., Yang, Y., Xia, Y., et al. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. ICML.

  3. Li, X., Yin, X., Li, C., et al. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. ECCV.

  4. Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS.

  • Post title:Transfer Learning (8): Multimodal Transfer
  • Post author:Chen Kai
  • Create time:2024-12-15 16:15:00
  • Post link:https://www.chenk.top/transfer-learning-8-multimodal-transfer/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.