Humans perceive the world multimodally: we see images, hear sounds,
read text, and these information streams fuse in the brain to form
unified understanding. However, traditional NLP models can only process
text, limiting AI's ability to understand the real world.
Multimodal Large Language Models (MLLMs) attempt to break this
limitation, enabling AI to understand images, audio, video, and text
simultaneously, like humans. But multimodal fusion is far from trivial:
different modalities have vastly different data distributions. How do we
align them into a unified representation space? How do we design
efficient cross-modal attention mechanisms? And how do we pretrain
multimodal models on large-scale data?
From CLIP's contrastive learning achieving vision-language alignment,
to BLIP-2's Q-Former enabling parameter-efficient multimodal
pretraining, to GPT-4V demonstrating general visual understanding
capabilities, multimodal technology is rapidly evolving. Audio-text
models like Whisper achieve near-human-level speech recognition, while
video understanding models can analyze complex temporal information.
These technologies not only achieve breakthroughs in academic research
but also demonstrate enormous potential in practical applications — from
intelligent customer service to content creation, from medical diagnosis
to autonomous driving.
This article dives deep into core technologies of multimodal large
language models: from mathematical principles of vision-language
alignment to data strategies for multimodal pretraining, from
implementation details of image captioning and visual question answering
to architectural designs of cutting-edge models like GPT-4V, from
audio-text alignment to video temporal modeling. Each technique includes
runnable code examples, helping readers not only understand principles
but also implement them.
Vision-Language Model Fundamentals
CLIP: Contrastive Learning for Vision-Language Alignment
CLIP (Contrastive Language-Image Pre-training) is a vision-language
model proposed by OpenAI in 2021. Its core innovation lies in using
large-scale contrastive learning to achieve unified representation of
images and text. CLIP was trained on 400 million image-text pairs,
demonstrating powerful zero-shot capabilities.
Core Idea:
CLIP's core assumption is that matching image-text pairs are
semantically related and should therefore lie close together in vector
space, while non-matching pairs should be far apart. Through contrastive
learning, CLIP learns to map images and text into the same
high-dimensional space, pulling semantically similar image and text
vectors together.
Architecture Design:
Image Encoder: Vision Transformer (ViT) or ResNet,
encoding images into fixed-dimensional vectors
Text Encoder: Transformer, encoding text into
vectors of the same dimension
Contrastive Loss: InfoNCE Loss, maximizing
similarity of matching pairs while minimizing similarity of non-matching
pairs
Mathematical Principle:
Given a batch of $N$ image-text pairs, CLIP first computes the
similarity matrix between all images and texts:

$$s_{ij} = \frac{I_i \cdot T_j}{\|I_i\|\,\|T_j\|}$$

where $I_i$ is the embedding of the $i$-th image and $T_j$ is the
embedding of the $j$-th text. Diagonal elements $s_{ii}$ are the
similarities of matching pairs; off-diagonal elements are the
similarities of non-matching pairs.

The contrastive loss function contains two symmetric terms:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right]$$

The first term is the image-to-text contrastive loss, the second the
text-to-image contrastive loss. The temperature parameter $\tau$
controls the sharpness of the distribution and is typically set to 0.07.
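This symmetric InfoNCE loss is only a few lines of PyTorch. The sketch below uses random toy embeddings in place of real encoder outputs:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    # L2-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity matrix: logits[i, j] = sim(image_i, text_j) / temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(image_emb))  # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text term
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image term
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 4 pairs of 8-dimensional embeddings
torch.manual_seed(0)
img = torch.randn(4, 8)
txt = torch.randn(4, 8)
loss = clip_contrastive_loss(img, txt)
```

Note that `F.cross_entropy` over each row (and each column, via the transpose) implements exactly the two softmax terms in the loss above.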
Why It Works:
CLIP's success lies in combining large-scale data with contrastive
learning. Training on massive image-text pairs, the model learns rich
vision-language correspondences. Contrastive learning avoids expensive
manual annotation costs, requiring only image-text pairs for training.
This enables CLIP to handle unseen tasks during training, demonstrating
powerful zero-shot capabilities.
Zero-shot Image Classification: Classify images
without training
Image Retrieval: Retrieve relevant images based on
text descriptions
Image Generation Guidance: Provide text-image
alignment for generative models
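Zero-shot classification reduces to comparing one image embedding against the text embeddings of prompts like "a photo of a {label}". The toy tensors below stand in for real CLIP encoder outputs:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Return one probability per class by softmaxing the cosine
    similarities between the image and each class-prompt embedding."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ image_emb / temperature  # one score per class
    return sims.softmax(dim=-1)

# Toy embeddings standing in for real CLIP encoder outputs
torch.manual_seed(0)
cat_emb = torch.randn(8)
class_embs = torch.stack([
    cat_emb + 0.1 * torch.randn(8),  # text embedding near the image ("cat")
    torch.randn(8),                  # unrelated class ("dog")
    torch.randn(8),                  # unrelated class ("car")
])
probs = zero_shot_classify(cat_emb, class_embs)
# The class whose prompt embedding is closest to the image wins
```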
BLIP: Unified Vision-Language Understanding and Generation
BLIP (Bootstrapping Language-Image Pre-training) is a unified
vision-language model proposed by Salesforce that can simultaneously
perform understanding and generation tasks.
Architecture Features:
BLIP uses a multi-task learning framework with three modules:
Unimodal Encoders: Encode images and text
separately
Image-Text Cross-Attention Encoder: Fuse multimodal
information
Image-Text Decoder: Generate text descriptions
Pretraining Tasks:
Image-Text Contrastive Learning (ITC): Align image
and text representations
Image-Text Matching (ITM): Determine if image-text
pairs match
Image-Grounded Text Generation (ITG): Generate
image descriptions
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load BLIP captioning model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Generate a caption (the image path is illustrative)
image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
BLIP-2: Q-Former as a Bridge Between Frozen Models
BLIP-2 is an upgraded version of BLIP whose core innovation is
introducing Q-Former (Querying Transformer) as a "bridge" between the
image encoder and the language model. This design lets BLIP-2 freeze
both the pretrained image encoder and the LLM and train only about 188M
parameters in Q-Former, dramatically reducing training costs.
Core Innovation:
Frozen Pretrained Models: Image encoders (e.g.,
ViT) and language models (e.g., OPT, LLaMA) remain frozen, parameters
not updated. This avoids degrading pretrained model performance while
significantly reducing trainable parameters.
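Freezing in this style takes only a few lines of PyTorch. The linear layers below are stand-ins for a real ViT and LLM:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze all parameters so the optimizer never updates them."""
    for p in module.parameters():
        p.requires_grad_(False)
    module.eval()  # also fix dropout / normalization statistics
    return module

# Stand-ins for a frozen image encoder and a trainable bridge module
image_encoder = freeze(nn.Linear(768, 768))
bridge = nn.Linear(768, 768)  # the only trainable part, as with Q-Former

trainable = sum(p.numel() for p in bridge.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in image_encoder.parameters() if p.requires_grad)
```

When building the optimizer, only `bridge.parameters()` would be passed in; the frozen encoder contributes features but no gradients.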
Two-Stage Pretraining Strategy:
Stage 1: Vision-Language Representation Learning:
Q-Former learns to extract most relevant visual features from frozen
image encoder. Through three tasks — image-text contrastive learning,
image-text matching, and image-conditioned language modeling — Q-Former
learns to convert image features into formats understandable by language
models.
Stage 2: Vision-to-Language Generation Learning:
Q-Former learns to align with frozen LLM, using extracted visual
features as "soft prompts" input to LLM, enabling LLM to generate text
based on visual information.
Q-Former Architecture Explained:
Q-Former contains a set of learnable query embeddings, typically 32
in number. These query vectors work through the following
mechanisms:
Cross-Attention: Query vectors interact with image
features through cross-attention, extracting relevant information from
images
Self-Attention: Query vectors learn relationships
between queries through self-attention, forming global understanding of
images
Feedforward Network: Non-linear transformation of
query vectors, enhancing representation capability
The advantage of this design is that the number of query vectors is far
smaller than the number of image patches (e.g., 32 vs. 256),
dramatically reducing the amount of visual features the language model
must process and improving efficiency.
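The three mechanisms above can be sketched as a single PyTorch module. Dimensions, layer order, and initialization here are illustrative, not the exact BLIP-2 configuration:

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One Q-Former-style layer: learnable queries attend to each other
    (self-attention), then to image features (cross-attention), then pass
    through a feedforward network."""
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, image_feats):  # image_feats: (B, num_patches, dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        # Queries exchange information with each other
        q = q + self.self_attn(self.n1(q), self.n1(q), self.n1(q))[0]
        # Queries extract relevant information from image features
        q = q + self.cross_attn(self.n2(q), image_feats, image_feats)[0]
        # Non-linear transformation of the query representations
        return q + self.ffn(self.n3(q))  # (B, num_queries, dim)

block = QFormerBlock()
patches = torch.randn(2, 256, 256)  # 2 images, 256 patches each
out = block(patches)
```

Whatever the number of input patches, the output is always 32 query vectors per image, which is what makes the downstream LLM's input length fixed and small.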
```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)
```
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Use CLIP similarity to filter noisy image-text pairs
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_image_text_pair(img_path, text, threshold=0.25):
    """Keep a pair only if CLIP's cosine similarity between the image and
    the text exceeds a threshold (0.25 here is illustrative)."""
    image = Image.open(img_path).convert("RGB")
    inputs = clip_processor(text=[text], images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.t()).item() > threshold

# Batch filtering (image_text_pairs: a list of (path, caption) tuples)
filtered_pairs = [(img_path, text) for img_path, text in image_text_pairs
                  if filter_image_text_pair(img_path, text)]
```
Pretraining Objectives
Multi-Task Learning:
Simultaneously optimize multiple objectives to learn richer
representations:
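The three BLIP-style objectives (ITC, ITM, ITG) can be sketched with toy tensors; shapes and weights here are illustrative:

```python
import torch
import torch.nn.functional as F

def itc_loss(img, txt, temperature=0.07):
    """Image-text contrastive: matched pairs lie on the diagonal."""
    logits = (F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()
              / temperature)
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def itm_loss(match_logits, labels):
    """Image-text matching: binary classification per pair."""
    return F.binary_cross_entropy_with_logits(match_logits, labels.float())

def itg_loss(token_logits, token_ids):
    """Image-grounded text generation: token-level cross-entropy."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           token_ids.view(-1))

# Toy batch: 4 pairs, 8-dim embeddings, 5-token captions, 100-word vocab
torch.manual_seed(0)
img, txt = torch.randn(4, 8), torch.randn(4, 8)
match_logits, labels = torch.randn(4), torch.tensor([1, 0, 1, 0])
token_logits, token_ids = torch.randn(4, 5, 100), torch.randint(0, 100, (4, 5))

# Equal weights here; in practice the weighting is a tuning choice
total = (itc_loss(img, txt) + itm_loss(match_logits, labels)
         + itg_loss(token_logits, token_ids))
```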
```python
# Usage example (VQASystem: a VQA wrapper assumed defined earlier)
vqa = VQASystem()
answer = vqa.answer_question("image.jpg", "What color is the car?")
print(f"Answer: {answer}")

# Batch question answering
questions = [
    "What is in the image?",
    "How many people are there?",
    "What is the weather like?",
]
answers = vqa.batch_answer("image.jpg", questions)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```
GPT-4V and Multimodal LLMs
GPT-4V Architecture
GPT-4V (GPT-4 Vision) is OpenAI's multimodal large language model
capable of understanding images and generating text responses.
Core Capabilities:
Image Understanding: Recognize objects, scenes,
text, charts, etc.
Multi-turn Dialogue: Support mixed image and text
inputs
Complex Reasoning: Capable of visual reasoning and
logical analysis
```python
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.conversation import conv_templates, SeparatorStyle
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from PIL import Image
import torch
```
```python
import whisper

class WhisperASR:
    def __init__(self, model_size="base"):
        """model_size: tiny, base, small, medium, large"""
        self.model = whisper.load_model(model_size)

    def transcribe(self, audio_path, language=None, task="transcribe"):
        """task: "transcribe" or "translate" (translate to English)"""
        result = self.model.transcribe(
            audio_path,
            language=language,
            task=task,
            verbose=False,
        )
        return result

    def transcribe_with_timestamps(self, audio_path):
        """Transcription with word-level timestamps"""
        result = self.model.transcribe(
            audio_path,
            word_timestamps=True,
            verbose=False,
        )
        return result

    def batch_transcribe(self, audio_paths):
        """Batch transcription"""
        return [self.transcribe(p) for p in audio_paths]
```
```python
# Usage example
asr = WhisperASR(model_size="base")
result = asr.transcribe("audio.mp3", language="zh")
print(f"Transcription: {result['text']}")

# With timestamps
result_with_ts = asr.transcribe_with_timestamps("audio.mp3")
for segment in result_with_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
```
```python
import torch
import torch.nn as nn
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Usage example (MultimodalRetrievalSystem: a CLIP-based dual-encoder
# index assumed defined earlier)
retrieval_system = MultimodalRetrievalSystem()

# Add images and texts
retrieval_system.add_image("image1.jpg")
retrieval_system.add_image("image2.jpg")
retrieval_system.add_text("a cat sitting on a mat")
retrieval_system.add_text("a dog playing in the park")

# Text query -> images
results = retrieval_system.search_by_text("a cute animal", top_k=3)
for r in results:
    print(f"Image: {r['image_path']}, Similarity: {r['similarity']:.3f}")

# Image query -> texts
results = retrieval_system.search_by_image("query.jpg", top_k=3)
for r in results:
    print(f"Text: {r['text']}, Similarity: {r['similarity']:.3f}")
```
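For reference, a minimal self-contained sketch of such a dual-encoder retrieval index is below. The method names follow the usage above; the toy hash-based encoders stand in for CLIP's image and text encoders, and the constructor takes them as arguments so the sketch runs without model downloads:

```python
import torch
import torch.nn.functional as F

class MultimodalRetrievalSystem:
    """Minimal dual-encoder retrieval index. In practice `encode_image`
    and `encode_text` would wrap CLIP; here they are injected callables."""
    def __init__(self, encode_image, encode_text):
        self.encode_image, self.encode_text = encode_image, encode_text
        self.images, self.texts = [], []  # lists of (name, unit embedding)

    def add_image(self, path):
        self.images.append((path, F.normalize(self.encode_image(path), dim=-1)))

    def add_text(self, text):
        self.texts.append((text, F.normalize(self.encode_text(text), dim=-1)))

    def _search(self, query_emb, items, key, top_k):
        # Cosine similarity = dot product of unit vectors
        query_emb = F.normalize(query_emb, dim=-1)
        scored = [{key: name, "similarity": float(query_emb @ emb)}
                  for name, emb in items]
        return sorted(scored, key=lambda r: -r["similarity"])[:top_k]

    def search_by_text(self, query, top_k=3):
        return self._search(self.encode_text(query), self.images,
                            "image_path", top_k)

    def search_by_image(self, path, top_k=3):
        return self._search(self.encode_image(path), self.texts,
                            "text", top_k)

# Toy deterministic encoders standing in for CLIP
def toy_encode(x):
    g = torch.Generator().manual_seed(hash(x) % (2 ** 31))
    return torch.randn(8, generator=g)

system = MultimodalRetrievalSystem(toy_encode, toy_encode)
system.add_image("image1.jpg")
system.add_image("image2.jpg")
results = system.search_by_text("a cute animal", top_k=1)
```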
❓ Q&A: Common Questions About Multimodal Large Language Models
Q1: What are the main differences between CLIP and
BLIP?
A: CLIP focuses on image-text alignment, achieving zero-shot
capabilities through contrastive learning, but cannot generate text.
BLIP is a unified model that can both understand (VQA, image-text
matching) and generate (image captioning), achieved through multi-task
learning.
Q2: Why does BLIP-2 only need to train very few
parameters?
A: BLIP-2 freezes the image encoder and language model, only training
Q-Former (~188M parameters). Q-Former acts as a bridge, learning to
extract visual features from the frozen image encoder and converting
them into a format that the language model can understand.
Q3: How does Whisper handle speech in different
languages?
A: Whisper was trained on multilingual data and can automatically
detect languages. You can specify the language via the
language parameter or let the model auto-detect. The model
supports 99 languages, including Chinese, English, Japanese, etc.
Q4: How do multimodal models handle long videos?
A: Typically two strategies: 1) Uniformly sample key frames (e.g., 1
frame per second); 2) Use sliding windows to segment videos and then
fuse results. For very long videos, video summarization techniques can
be used first to compress information.
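The first strategy, uniform key-frame sampling under a frame budget, is a few lines of index arithmetic (the function name and defaults are illustrative):

```python
def sample_frame_indices(total_frames, fps, every_sec=1.0, max_frames=None):
    """Pick one frame index per `every_sec` seconds of video, optionally
    thinning uniformly to stay within a frame budget."""
    step = max(1, round(fps * every_sec))
    indices = list(range(0, total_frames, step))
    if max_frames is not None and len(indices) > max_frames:
        # Video too long: thin out uniformly to the budget
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

# 10-second clip at 30 fps, sampled at 1 frame per second
idx = sample_frame_indices(total_frames=300, fps=30)
# → [0, 30, 60, ..., 270]
```

The selected frames would then be decoded and passed to the image encoder; the sliding-window strategy applies the same idea per window before fusing results.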
Q5: What's the difference between GPT-4V and open-source
multimodal LLMs like LLaVA?
A: GPT-4V is a closed-source model with powerful performance but
requires API calls and higher costs. Open-source models like LLaVA can
be deployed locally but may have slightly inferior performance. The
choice depends on specific needs: choose GPT-4V for performance, choose
open-source models for customization or cost control.
Q6: How to evaluate multimodal model
performance?
A: Different tasks have different metrics: image captioning uses
BLEU, METEOR, CIDEr; VQA uses accuracy; image-text retrieval uses
Recall@K. Human evaluation can also be conducted to check the accuracy
and fluency of generated content.
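Of these metrics, Recall@K is straightforward to compute from a query-candidate similarity matrix, assuming (as in a matched test set) that the correct candidate for query $i$ sits at index $i$:

```python
import torch

def recall_at_k(similarity, k):
    """similarity[i, j] = score between query i and candidate j; the
    correct candidate for query i is candidate i (the diagonal)."""
    topk = similarity.topk(k, dim=1).indices               # (queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()            # hit within top-k?
    return hits.mean().item()

# Toy 4x4 similarity matrix: queries 0 and 1 rank their match first,
# queries 2 and 3 only within the top 2
sim = torch.tensor([[0.9, 0.1, 0.2, 0.0],
                    [0.2, 0.8, 0.1, 0.3],
                    [0.7, 0.1, 0.3, 0.2],
                    [0.1, 0.6, 0.2, 0.4]])
r1 = recall_at_k(sim, k=1)  # 2 of 4 queries hit → 0.5
r2 = recall_at_k(sim, k=2)  # all 4 queries hit → 1.0
```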
Q7: How much data is needed for multimodal
pretraining?
A: Large-scale pretraining typically requires tens of millions to
billions of image-text pairs. CLIP used 400 million pairs, BLIP used 129
million pairs. For specific domains, domain data can be used for
fine-tuning, requiring less data (tens of thousands to hundreds of
thousands).
Q8: How to solve memory issues with multimodal
models?
A: Can adopt the following strategies: 1) Use quantization
(4-bit/8-bit); 2) Use gradient checkpointing; 3) Use parameter-efficient
fine-tuning (LoRA); 4) Use model parallelism; 5) Use smaller model
variants.
Q9: What types of inputs can multimodal models
process?
A: Commonly supported: images (JPG, PNG), text. Partially supported:
audio (Whisper), video (Video-ChatGPT). Future trends include supporting
more modalities, such as 3D models, point clouds, etc.
Q10: How to build a production-grade multimodal
application?
A: Key steps: 1) Choose appropriate models (balance performance and
cost); 2) Implement efficient inference services (batching, caching); 3)
Add monitoring and logging; 4) Implement error handling and fallback
strategies; 5) Use containerized deployment (Docker); 6) Implement load
balancing and auto-scaling.
Post title: NLP (11): Multimodal Large Language Models
Post author: Chen Kai
Create time: 2024-04-04 10:30:00
Post link: https://www.chenk.top/en/nlp-multimodal-nlp/
Copyright notice: All articles in this blog are licensed under CC BY-NC-SA unless stated otherwise.