Why can CLIP achieve zero-shot image classification using natural language descriptions? Why can DALL-E generate images from text? The core of these breakthroughs is multimodal transfer learning — enabling models to understand and associate information across different modalities (vision, language, audio, etc.).
Multimodal transfer is not just a fusion of technologies, but a key to cognitive intelligence. Starting from the mathematical principles of contrastive learning, this article systematically explains vision-language pretraining models like CLIP and ALIGN, deeply explores cross-modal alignment, fusion strategies, and downstream task applications, providing complete code for implementing multimodal models from scratch.
Motivation and Challenges of Multimodal Learning
Why Multimodal?
Limitations of single-modal learning:
- Incomplete information: Images alone cannot explain "why"; text alone cannot convey "what it looks like"
- Poor generalization: Pure vision models struggle with conceptual queries (e.g., "find all dangerous scenes")
- Low data efficiency: Image annotation is expensive, while text descriptions (like image-text pairs on web pages) naturally exist at massive scale
Advantages of multimodal approaches:
- Complementarity: Different modalities provide complementary information (e.g., spatial relations from images + causal explanations from text)
- Robustness: When one modality is missing or noisy, others can compensate
- Zero-shot generalization: Through language descriptions, models can recognize categories unseen during training
Core question: How can models learn correspondences between different modalities?
Challenges in Multimodal Transfer
1. Modality Heterogeneity
Vision and language are fundamentally different in representation space:
- Vision: Continuous, high-dimensional, locally correlated (pixel-level)
- Language: Discrete, symbolic, globally dependent (syntactic structure)
Mathematical description: visual input lives in a continuous space $I \in \mathbb{R}^{H \times W \times 3}$, while text is a discrete token sequence $T \in \mathcal{V}^L$ over a vocabulary $\mathcal{V}$; the two cannot be compared directly and must be mapped into a shared embedding space.
2. Semantic Gap
Same concepts have different expressions across modalities:
- "Cat" in images is a pixel pattern
- "Cat" in text is a symbol sequence
- Need to learn cross-modal semantic alignment
3. Data Alignment
Training data has different alignment granularities:
- Weak alignment: Image-text pairs (like web page images and captions), but text may only describe partial content
- Strong alignment: Fine-grained annotation (like region-phrase correspondences), but annotation cost is extremely high
4. Modality Fusion Strategy
When and how to fuse information from different modalities:
- Early fusion: Concatenate features at input layer
- Late fusion: Extract features separately then fuse
- Deep fusion: Interact at multiple network layers
Contrastive Learning: Foundation of Multimodal Pretraining
Mathematical Principles of Contrastive Learning
Core idea of contrastive learning: Pull positive pairs closer, push negative pairs apart.
Given a batch of image-text pairs $\{(I_i, T_i)\}_{i=1}^N$, encode them into L2-normalized embeddings $v_i = f_v(I_i)$ and $t_i = f_t(T_i)$. The InfoNCE loss in the image-to-text direction is

$$\mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(v_i^\top t_i / \tau)}{\sum_{j=1}^{N} \exp(v_i^\top t_j / \tau)}$$

where $\tau$ is a temperature parameter. A symmetric loss $\mathcal{L}_{T \to I}$ swaps the roles of images and texts, and the total loss is $\mathcal{L} = (\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}) / 2$.
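The symmetric InfoNCE loss used by CLIP can be sketched in NumPy (a minimal illustration, not an optimized implementation):

```python
import numpy as np

def info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    v, t: (N, d) L2-normalized image/text embeddings; row i of v
    matches row i of t (positive pair), all other rows are negatives.
    """
    sim = v @ t.T / tau  # (N, N) similarity logits

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    n = len(v)
    diag = (np.arange(n), np.arange(n))
    loss_i2t = -log_softmax(sim, axis=1)[diag].mean()  # image -> text
    loss_t2i = -log_softmax(sim, axis=0)[diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Perfectly aligned pairs give a much lower loss than random pairings.
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 32)); v /= np.linalg.norm(v, axis=1, keepdims=True)
t_matched = v.copy()  # identical embeddings: ideal alignment
t_random = rng.normal(size=(8, 32))
t_random /= np.linalg.norm(t_random, axis=1, keepdims=True)
print(info_nce(v, t_matched) < info_nce(v, t_random))  # True
```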
Why Does Contrastive Learning Work?
Understanding from the mutual-information-maximization perspective: contrastive learning maximizes a lower bound on the mutual information $I(v; t)$ between vision and text encodings:

$$I(v; t) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$$

The contrastive loss achieves this by:
1. Maximizing the similarity of each positive pair (the numerator)
2. Normalizing against the $N - 1$ negative pairs (the denominator)

Minimizing the loss therefore tightens the bound, thus maximizing mutual information.
Role of Temperature Parameter
The temperature $\tau$ controls the "sharpness" of the similarity distribution:
- Small $\tau$ (e.g., 0.01): sharp distribution, focuses only on the most similar samples, may lead to overfitting
- Large $\tau$ (e.g., 1.0): smooth distribution, considers all samples, learning may be insufficient

As $\tau \to 0$, softmax degenerates to argmax (selects only the maximum value). In practice, CLIP makes $\tau$ a learnable parameter, initialized to 0.07 and clipped during training to keep optimization stable.
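The effect of the temperature is easy to see directly (a small NumPy illustration; the similarity values are made up):

```python
import numpy as np

def softmax(x, tau):
    """Softmax over similarity scores x with temperature tau."""
    z = x / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

sims = np.array([0.9, 0.7, 0.2, 0.1])  # cosine similarities to four candidates

sharp = softmax(sims, tau=0.01)   # small tau: nearly one-hot on the best match
smooth = softmax(sims, tau=1.0)   # large tau: spreads mass over all candidates

print(sharp.round(3))
print(smooth.round(3))
print(sharp[0] > smooth[0])  # True: smaller tau concentrates on the top match
```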
CLIP: Connecting Text and Images
Core Idea of CLIP
CLIP (Contrastive Language-Image Pre-training)1 design philosophy:
Don't predict specific categories; learn correspondences between images and text.
Traditional approach: Image → Fixed categories (like ImageNet's 1000 classes)
CLIP approach: Image ↔︎ Arbitrary text descriptions
Advantages of this design:
1. Data scale: Can leverage 400 million image-text pairs from the internet, far exceeding manually annotated datasets
2. Zero-shot generalization: Recognize unseen categories through text descriptions
3. Task flexibility: Same model can do classification, retrieval, generation, etc.
CLIP Architecture
CLIP consists of two encoders:
- Image encoder $f_v$: a ResNet or Vision Transformer (ViT); outputs a fixed-dimensional image embedding $v \in \mathbb{R}^d$
- Text encoder $f_t$: a Transformer; outputs a text embedding $t \in \mathbb{R}^d$ in the same dimension as the image embedding
Training process:
1. A batch contains $N$ image-text pairs
2. Both encoders produce L2-normalized embeddings $v_i$, $t_i$
3. Compute the $N \times N$ similarity matrix $s_{ij} = v_i^\top t_j / \tau$; diagonal entries are positive pairs, the rest are negatives

Loss function: the symmetric InfoNCE loss

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \to T} + \mathcal{L}_{T \to I}\right)$$
CLIP's Zero-Shot Classification
Given an image $I$ and $K$ candidate categories:
1. Convert category names to text descriptions:
   - Simple version: category name → "a photo of a {class}"
   - Complex version: ensemble multiple templates (like "a photo of a {class}", "a picture of a {class}")
2. Encode the image and all text descriptions: $v = f_v(I)$, $t_k = f_t(T_k)$
3. Compute the probability of each category: $p(k \mid I) = \mathrm{softmax}_k(v^\top t_k / \tau)$
4. Select the category with the highest probability
Advantage of this approach: No training on target dataset needed, only category names required.
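This procedure can be sketched as follows. Note that `fake_encode_text` is a toy stand-in for a real CLIP text encoder (it just derives a deterministic random vector from the string), so only the pipeline, not the embeddings, is meaningful:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, class_names, encode_text, tau=0.07,
                       templates=("a photo of a {}", "a picture of a {}")):
    """CLIP-style zero-shot classification with prompt-template ensembling.

    Each class embedding is the L2-normalized average of its prompt
    embeddings; the image is assigned to the most similar class.
    """
    class_embs = []
    for name in class_names:
        prompts = [tpl.format(name) for tpl in templates]
        embs = l2norm(np.stack([encode_text(p) for p in prompts]))
        class_embs.append(l2norm(embs.mean(axis=0)))
    class_embs = np.stack(class_embs)                  # (K, d)
    logits = class_embs @ l2norm(image_emb) / tau      # (K,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return class_names[int(probs.argmax())], probs

def fake_encode_text(s, d=64):
    """Hypothetical encoder: a fixed random vector per string (demo only)."""
    seed = int.from_bytes(s.encode(), "little") % (2**32)
    return np.random.default_rng(seed).normal(size=d)

# Build the "image" embedding from the cat prompt so the match is clear.
image_emb = fake_encode_text("a photo of a cat")
label, probs = zero_shot_classify(image_emb, ["cat", "dog", "car"],
                                  fake_encode_text)
print(label)  # "cat": the image embedding was derived from the cat prompt
```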
CLIP vs. Traditional Methods
| Dimension | Traditional Supervised Learning | CLIP |
|---|---|---|
| Training data | Fixed category labels (like ImageNet) | Image-text pairs (like web pages) |
| Data scale | Millions | Billions |
| Generalization | Limited to training categories | Zero-shot recognition of new categories |
| Annotation cost | High (manual annotation needed) | Low (naturally exists) |
| Task adaptation | Requires fine-tuning | Zero-shot or few-shot |
ALIGN: Larger-Scale Alignment
ALIGN's Improvements
ALIGN (A Large-scale ImaGe and Noisy-text embedding)2 is Google's scaled-up counterpart to CLIP, with these core differences:
- Data scale: 1.8 billion image-text pairs (4.5x CLIP)
- Noisy data: Directly uses web-scraped data without filtering noise
- Simplified architecture: Uses EfficientNet as image encoder
Noise Robustness
ALIGN proved an important finding: Contrastive learning is naturally robust to noisy labels.
Reason analysis:
Suppose the true matching pair is $(I_i, T_i)$ but the web text $T_i$ is noisy: it describes the image only partially, or not at all. In large-batch contrastive learning:
- Each noisy pair contributes only one gradient term among $N$ samples, so a single mislabeled positive is diluted
- The many correct pairs dominate the aggregate gradient direction

Mathematical representation: let the noise ratio be $\rho$. The expected gradient decomposes into $(1 - \rho)$ times the clean-data gradient plus a noise term whose directions are roughly random and largely cancel as the batch grows.
Experiments show: Even with 30% noise, ALIGN performance drops less than 5%.
Cross-Modal Alignment Methods
Levels of Alignment
Cross-modal alignment can occur at different granularities:
- Global alignment: Entire image ↔︎ Entire sentence (CLIP/ALIGN)
- Region alignment: Image regions ↔︎ Phrases (Visual Genome)
- Pixel alignment: Pixels ↔︎ Words (dense alignment)
Deep Alignment: OSCAR
OSCAR (Object-Semantics Aligned Pre-training)3 proposes an object label-based alignment strategy:
Core idea: Introduce object labels as "anchors" connecting vision and language.
Input representation: a triple (word tokens $w$, object tags $q$, region features $v$), where the object tags are text labels produced by an object detector.
Pretraining tasks:
1. Masked Language Modeling (MLM): Predict masked words
2. Masked Region Modeling (MRM): Predict masked image regions
3. Object label classification: Predict object categories of regions
Advantage: Object labels provide explicit semantic alignment signals, accelerating convergence.
Design of Alignment Losses
Besides contrastive loss, other alignment losses include:
1. Triplet Loss

$$\mathcal{L}_{\text{triplet}} = \max\left(0,\; \alpha - s(v, t^+) + s(v, t^-)\right)$$

where $t^+$ is the matching text, $t^-$ is a non-matching text, $s(\cdot,\cdot)$ is the similarity, and $\alpha$ is the margin.
2. Cycle Consistency Loss

Used for joint training of image captioning ($C$: image → text) and image generation ($G$: text → image):

$$\mathcal{L}_{\text{cycle}} = \left\| I - G(C(I)) \right\| + \left\| T - C(G(T)) \right\|$$
3. Knowledge Distillation Alignment

Use pretrained single-modal models as teachers, e.g. by matching the student's features to the teacher's:

$$\mathcal{L}_{\text{KD}} = \left\| f_{\text{student}}(x) - f_{\text{teacher}}(x) \right\|^2$$
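The triplet loss above is the simplest of these to make concrete; a minimal NumPy sketch on cosine similarities:

```python
import numpy as np

def triplet_loss(v, t_pos, t_neg, margin=0.2):
    """Margin-based triplet loss on cosine similarities.

    Pushes the matching text t_pos to be at least `margin` more similar
    to the image embedding v than the non-matching text t_neg.
    """
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, margin - cos(v, t_pos) + cos(v, t_neg))

v = np.array([1.0, 0.0])
t_pos = np.array([0.9, 0.1])  # nearly aligned with v
t_neg = np.array([0.0, 1.0])  # orthogonal to v
print(triplet_loss(v, t_pos, t_neg))  # 0.0: already separated beyond the margin
print(triplet_loss(v, t_neg, t_pos))  # positive: pairs swapped, loss kicks in
```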
Multimodal Fusion Strategies
When to Fuse
1. Early Fusion
Concatenate features from different modalities at the input layer:

$$h = f([x_v; x_t])$$

Cons: Cannot leverage pretrained models, fragile to modality absence
2. Late Fusion
Extract features separately, then fuse at the decision level:

$$y = g\big(f_v(x_v),\; f_t(x_t)\big)$$

Cons: Insufficient cross-modal interaction
3. Deep Fusion
Interact at multiple network levels:

$$h_v^{(l+1)},\, h_t^{(l+1)} = \mathrm{Interact}\big(h_v^{(l)},\, h_t^{(l)}\big), \quad l = 1, \dots, L$$

Cons: High computational complexity
Attention-Based Fusion
Cross-Attention
Visual features attend to text features:

$$\tilde{h}_v = \mathrm{Attention}\big(Q = h_v W_Q,\; K = h_t W_K,\; V = h_t W_V\big)$$
Co-Attention
Vision and text mutually attend to each other, in both directions:

$$\tilde{h}_v = \mathrm{Attention}(h_v, h_t, h_t), \qquad \tilde{h}_t = \mathrm{Attention}(h_t, h_v, h_v)$$
Self-Attention on Concatenation
Apply self-attention after concatenating vision and text features (single-stream Transformer style):

$$h = \mathrm{SelfAttention}([h_v; h_t])$$
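Cross-attention is the core primitive behind all three variants. A single-head NumPy sketch (the learned projections $W_Q$, $W_K$, $W_V$ are omitted for clarity):

```python
import numpy as np

def cross_attention(h_v, h_t):
    """Single-head cross-attention: vision queries attend to text keys/values.

    h_v: (Nv, d) visual features (queries)
    h_t: (Nt, d) text features (keys and values)
    Returns (Nv, d): each visual feature becomes a text-informed mixture.
    """
    d_k = h_v.shape[-1]
    scores = h_v @ h_t.T / np.sqrt(d_k)        # (Nv, Nt) scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)   # softmax over text positions
    return attn @ h_t

rng = np.random.default_rng(0)
h_v = rng.normal(size=(4, 16))  # 4 image regions
h_t = rng.normal(size=(6, 16))  # 6 text tokens
out = cross_attention(h_v, h_t)
print(out.shape)  # (4, 16)
```

Co-attention simply applies this in both directions; single-stream fusion instead runs self-attention over the concatenated `(Nv + Nt, d)` sequence.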
Downstream Task Applications
Image Captioning
Task definition: Given an image $I$, generate a text description $T = (w_1, \dots, w_L)$.
Encoder-Decoder Architecture
Encoder: extract image features $v = f_v(I)$. Decoder: autoregressively generate the caption, $p(T \mid I) = \prod_t p(w_t \mid w_{<t}, v)$.
Reinforcement Learning Optimization
Since metrics like BLEU and CIDEr are non-differentiable, use the policy gradient:

$$\nabla_\theta J = \mathbb{E}_{T \sim p_\theta}\left[(r(T) - b)\, \nabla_\theta \log p_\theta(T \mid I)\right]$$

where $r(T)$ is the metric reward and $b$ is a baseline that reduces variance (self-critical training uses the reward of the greedy-decoded caption).
Visual Question Answering (VQA)
Task definition: Given an image $I$ and a natural-language question $Q$, output an answer $A$.
Classification-Based VQA
Treat VQA as multi-class classification over a candidate answer set (typically the few thousand most frequent answers):

$$p(a \mid I, Q) = \mathrm{softmax}\big(W\,[f_v(I); f_t(Q)]\big)$$
Generation-Based VQA
Treat VQA as conditional text generation:

$$p(A \mid I, Q) = \prod_t p(a_t \mid a_{<t}, I, Q)$$
Attention Mechanism
Question-guided visual attention: weight image regions $v_i$ by their relevance to the question embedding $q$:

$$\alpha_i = \mathrm{softmax}_i\big(w^\top \tanh(W_v v_i + W_q q)\big), \qquad \hat{v} = \sum_i \alpha_i v_i$$
Image-Text Retrieval
Task definition: Given text, retrieve relevant images (or vice versa).
Similarity-Based Ranking
Compute the similarity between the query text $t$ and every candidate image $v_i$ (e.g., cosine similarity $s_i = \cos(t, v_i)$), then rank by $s_i$.
Metric Learning Optimization
Triplet loss with margin $\alpha$:

$$\mathcal{L} = \max\left(0,\; \alpha - s(v, t^+) + s(v, t^-)\right)$$
Hard Negative Mining
Select the negative sample with the highest similarity in the batch:

$$t_i^- = \arg\max_{j \ne i} s(v_i, t_j)$$
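Given the in-batch similarity matrix, hard negative mining is a one-line mask-and-argmax; a NumPy sketch:

```python
import numpy as np

def hardest_negatives(sim):
    """For each image i, pick the non-matching text with highest similarity.

    sim: (N, N) similarity matrix where sim[i, i] is the positive pair.
    Returns an index array neg with neg[i] = argmax_{j != i} sim[i, j].
    """
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude the positive pair
    return masked.argmax(axis=1)

sim = np.array([[0.9, 0.6, 0.1],
                [0.2, 0.8, 0.7],
                [0.5, 0.3, 0.9]])
print(hardest_negatives(sim))  # [1 2 0]
```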
Complete Implementation: Building CLIP Model from Scratch
Below implements a simplified CLIP including image encoder, text encoder, contrastive training, and zero-shot classification.
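A compact PyTorch sketch of those components follows. This is an illustrative skeleton under stated simplifications, not the article's original listing: a tiny CNN stands in for ResNet50, and a one-layer Transformer stands in for the full text encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Stand-in for ResNet50: a small CNN plus a projection to the shared space."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, embed_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        h = self.conv(x).flatten(1)             # (B, 32)
        return F.normalize(self.proj(h), dim=-1)  # L2-normalized embedding

class TextEncoder(nn.Module):
    """Stand-in for CLIP's text Transformer: embedding + 1 layer + mean pool."""
    def __init__(self, vocab_size=1000, embed_dim=64, width=64, max_len=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, width)
        self.pos = nn.Parameter(torch.zeros(1, max_len, width))  # positions
        layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.proj = nn.Linear(width, embed_dim)

    def forward(self, tokens):                  # tokens: (B, L)
        h = self.embed(tokens) + self.pos[:, : tokens.size(1)]
        h = self.encoder(h).mean(dim=1)         # mean-pool over tokens
        return F.normalize(self.proj(h), dim=-1)

class MiniCLIP(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.image_encoder = ImageEncoder(embed_dim)
        self.text_encoder = TextEncoder(embed_dim=embed_dim)
        # learnable temperature, stored on a log scale as in CLIP: ln(1/0.07)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, tokens):
        v = self.image_encoder(images)
        t = self.text_encoder(tokens)
        return self.logit_scale.exp() * v @ t.T  # (B, B) similarity logits

def clip_loss(logits):
    """Symmetric InfoNCE: diagonal entries are the positive pairs."""
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# One training step on random data to show the plumbing.
model = MiniCLIP()
images = torch.randn(8, 3, 32, 32)
tokens = torch.randint(0, 1000, (8, 16))
loss = clip_loss(model(images, tokens))
loss.backward()
print(float(loss))  # a positive scalar; decreases as pairs align
```

Zero-shot classification then reuses `model.text_encoder` on prompt templates and picks the class whose text embedding is most similar to the image embedding.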
Code Explanation
Core components:
- Image encoder: ResNet50 feature extraction + projection layer
- Text encoder: Transformer + positional encoding + projection layer
- Contrastive loss: Bidirectional InfoNCE loss
- Zero-shot classification: Compute similarity between image and all class texts
Training workflow:
- In-batch contrastive learning: $N$ image-text pairs produce an $N \times N$ similarity matrix
- Diagonal elements are positive pairs, off-diagonal elements are negative pairs
- Optimize both image → text and text → image directions simultaneously
Key techniques:
- L2 normalization: Ensures stable similarity computation
- Learnable temperature parameter: Automatically adjusts softmax distribution
- Large-batch training: More negative samples, better contrastive effect
Advanced Topics
Multimodal Transformers
ViLBERT (Vision-and-Language BERT)
ViLBERT4 proposes a dual-stream Transformer architecture:
- Vision stream: Processes image region features
- Language stream: Processes text tokens
- Cross-modal connections: Interact through Co-Attention layers
Architecture:
Text-to-Image Generation
DALL-E
DALL-E uses an autoregressive Transformer for image generation:

1. VQ-VAE encoding: discretize the image into a sequence of visual tokens $z = (z_1, \dots, z_M)$ drawn from a learned codebook
2. Concatenate inputs: form a single sequence $[t_1, \dots, t_L, z_1, \dots, z_M]$ of text tokens followed by image tokens
3. Autoregressive generation: predict the image token-by-token, $p(z_m \mid z_{<m}, T)$

Loss function: next-token cross-entropy over the concatenated text and image token sequence.
Diffusion Models + CLIP
Stable Diffusion and similar models use the CLIP text encoder to condition the denoiser: the noise predictor $\epsilon_\theta(x_t, t, c)$ receives the text embedding $c = f_t(T)$ and is trained with

$$\mathcal{L} = \mathbb{E}_{x, \epsilon, t}\left[\, \| \epsilon - \epsilon_\theta(x_t, t, c) \|^2 \,\right]$$
Cross-Lingual Multimodal
mCLIP (multilingual CLIP) extends CLIP to multiple languages:
- Uses multilingual text encoders (like mBERT, XLM-R)
- Trains on multilingual image-text pairs
- Achieves cross-lingual zero-shot transfer
Advantages:
- Low-resource languages can leverage high-resource language knowledge
- A single model supports 100+ languages
Frequently Asked Questions
Q1: Where does CLIP's zero-shot ability come from?
Zero-shot ability stems from three key factors:
- Massive data: 400 million image-text pairs cover extremely broad concepts
- Natural language supervision: Text descriptions naturally contain rich semantic information
- Contrastive learning: Learns correspondences between images and text, not fixed categories
Formal understanding: traditional classifiers learn $p(y \mid I)$ over a fixed label set $\{1, \dots, K\}$, while CLIP learns a similarity function $s(I, T)$ over open-ended text, so any concept that can be described in language becomes a valid candidate class.
Q2: Why doesn't CLIP need labeled data?
CLIP uses weak supervision rather than traditional labels:
- Traditional labels: Image → Discrete category labels (requires manual work)
- CLIP labels: Image ↔︎ Text description (naturally exists on internet)
The correspondence between image-text pairs is itself the supervision signal, no additional annotation needed.
Q3: How do multimodal models handle modality absence?
Three strategies:
- Modality completion: Use generative models to fill in missing modalities
- Robust training: Randomly drop modalities during training, forcing model to learn single-modal reasoning
- Ensemble methods: Train single-modal and multimodal models, select based on available modalities at test time
Loss function example (modality dropout): train with the multimodal loss plus single-modality losses, e.g. $\mathcal{L} = \mathcal{L}_{v,t} + \lambda(\mathcal{L}_{v} + \mathcal{L}_{t})$, so the model remains usable when one input is missing.
Q4: Why is batch size important in contrastive learning?
Batch size determines number of negative samples:
- Batch size $N$: each sample has $N - 1$ in-batch negative samples
- More negative samples → more accurate gradient estimation → better contrastive effect
Experiments show: CLIP works best with batch size 32768, but computational cost is extremely high.
Solutions:
- Gradient accumulation: accumulate gradients over multiple small batches
- MoCo queue: maintain a negative sample queue, decoupling batch size from negative sample count
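The MoCo-queue idea can be sketched as follows (PyTorch assumed; real MoCo also feeds the queued keys through a momentum encoder, which is omitted here):

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO queue of past text embeddings used as extra negatives,
    decoupling the number of negatives from the batch size."""
    def __init__(self, dim, size=1024):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    def enqueue(self, emb):                      # emb: (B, dim), normalized
        b = emb.size(0)
        idx = (self.ptr + torch.arange(b)) % self.queue.size(0)
        self.queue[idx] = emb.detach()           # no gradient into stored keys
        self.ptr = int((self.ptr + b) % self.queue.size(0))

def loss_with_queue(v, t, queue, tau=0.07):
    """Image->text InfoNCE where negatives = in-batch texts + queued texts."""
    all_t = torch.cat([t, queue.queue], dim=0)   # (B + Q, dim)
    logits = v @ all_t.T / tau                   # positives on the diagonal
    labels = torch.arange(v.size(0))
    return F.cross_entropy(logits, labels)

v = F.normalize(torch.randn(4, 32), dim=-1)
t = F.normalize(torch.randn(4, 32), dim=-1)
q = NegativeQueue(dim=32, size=64)
loss = loss_with_queue(v, t, q)
q.enqueue(t)  # current texts become negatives for future batches
print(float(loss) > 0)  # True
```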
Q5: How to evaluate multimodal models?
Common evaluation tasks:
- Zero-shot classification: ImageNet, CIFAR-100, etc.
- Image-text retrieval: Recall@K metrics
- Image captioning: BLEU, CIDEr, SPICE
- VQA: Accuracy
Cross-task consistency is also important: Good multimodal representations should perform well across multiple tasks.
Q6: Where does CLIP perform poorly?
CLIP's limitations:
- Fine-grained classification: Difficulty distinguishing similar categories (like different dog breeds)
- Counting and spatial relations: Weak understanding of "three cats" or "cat on the left"
- Abstract concepts: Contrastive learning excels at concrete objects, not abstract concepts
- Rare concepts: Poor performance on concepts rare in pretraining data
Reason: Contrastive learning tends to learn coarse-grained, high-frequency visual-linguistic correspondences.
Q7: How to optimize computational efficiency of multimodal models?
Optimization strategies:
- Distillation: Distill large model to small model
- Pruning: Remove redundant attention heads
- Quantization: FP16 or INT8 inference
- Caching: Precompute image features, encode text in real-time
Example: CLIP's image encoding can be done offline, retrieval only needs to encode text query.
Q8: How to fine-tune CLIP on your own data?
Fine-tuning strategies:
- Freeze encoders, train classification head: Suitable for small data
- Low learning rate full fine-tuning: Suitable for medium data
- Parameter-efficient fine-tuning like LoRA: Suitable for large models
Notes:
- Keep the pretrained temperature parameter $\tau$ (or continue learning it from its pretrained value) rather than resetting it
- Use a small learning rate to avoid destroying the pretrained image-text alignment
Q9: How much data is needed for multimodal pretraining?
Empirical rules:
- Millions: Can learn basic visual-linguistic correspondence
- Tens of millions: Achieve usable zero-shot ability
- Billions: Match or exceed supervised learning
CLIP uses 400 million pairs, ALIGN uses 1.8 billion pairs.
But small data also has value: Domain-specific data (like medical imaging + reports) can continue fine-tuning on pretrained basis.
Q10: How to address bias in multimodal models?
Multimodal models inherit biases from training data:
- Gender bias: E.g., "nurse" often associated with female images
- Racial bias: Certain professions or scenes associated with specific races
- Cultural bias: Western culture dominates, other cultures underrepresented
Mitigation methods:
- Data balancing: increase the proportion of minority-group data
- Debiasing regularization: add fairness constraints to the loss function
- Post-processing: adjust the prediction distribution to reduce bias
Q11: What's the difference between CLIP and DALL-E?
| Dimension | CLIP | DALL-E |
|---|---|---|
| Task | Image understanding (classification, retrieval) | Image generation |
| Training method | Contrastive learning | Autoregressive generation |
| Input | Image or text | Text |
| Output | Embedding vectors | Images |
| Reversibility | Bidirectional (image ↔︎ text) | Unidirectional (text → image) |
DALL-E 2 and Stable Diffusion both use CLIP as text encoder.
Q12: Future directions of multimodal transfer?
Frontier trends:
- Unified models: Single model handles all modalities (vision, language, audio, video)
- Few-shot learning: More efficient multimodal adaptation
- Interpretability: Understanding how models associate different modalities
- Interactive learning: Human-AI collaborative annotation and learning
- Multimodal reasoning: Beyond simple correspondence, achieving logical reasoning
Representative works: GPT-4V (vision), Gemini (multimodal unified), Flamingo (few-shot).
Summary
This article comprehensively introduced core techniques of multimodal transfer learning:
- Contrastive learning: Learning cross-modal correspondences through InfoNCE loss
- CLIP/ALIGN: Large-scale vision-language pretraining models and their zero-shot capabilities
- Cross-modal alignment: From global to local, weak to strong supervision alignment methods
- Fusion strategies: Early, late, deep fusion and attention mechanisms
- Downstream applications: Technical details of image captioning, VQA, image-text retrieval
- Complete implementation: building a CLIP-style model from scratch, from encoders to contrastive training
Multimodal transfer learning is reshaping the boundaries of AI applications, from search engines to content creation and from education to healthcare. The next chapter will explore parameter-efficient fine-tuning techniques, examining how methods like LoRA and Adapter achieve efficient transfer without modifying pretrained models.
References
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. ICML.↩︎
Jia, C., Yang, Y., Xia, Y., et al. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. ICML.↩︎
Li, X., Yin, X., Li, C., et al. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. ECCV.↩︎
Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS.↩︎
- Post title:Transfer Learning (8): Multimodal Transfer
- Post author:Chen Kai
- Create time:2024-12-15 16:15:00
- Post link:https://www.chenk.top/transfer-learning-8-multimodal-transfer/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stated otherwise.