Humans perceive the world multimodally: we see images, hear sounds,
read text, and these information streams fuse in the brain to form
unified understanding. However, traditional NLP models can only process
text, limiting AI's ability to understand the real world.
Multimodal Large Language Models (MLLMs) attempt to break this
limitation, enabling AI to understand images, audio, video, and text
simultaneously, like humans. But multimodal fusion is far from trivial:
different modalities have vastly different data distributions. How do we
align them into a unified representation space? How do we design
efficient cross-modal attention mechanisms? And how do we pretrain
multimodal models on large-scale data?
From CLIP's contrastive learning achieving vision-language alignment,
to BLIP-2's Q-Former enabling parameter-efficient multimodal
pretraining, to GPT-4V demonstrating general visual understanding
capabilities, multimodal technology is rapidly evolving. Audio-text
models like Whisper achieve near-human-level speech recognition, while
video understanding models can analyze complex temporal information.
These technologies not only achieve breakthroughs in academic research
but also demonstrate enormous potential in practical applications — from
intelligent customer service to content creation, from medical diagnosis
to autonomous driving.
This article dives deep into core technologies of multimodal large
language models: from mathematical principles of vision-language
alignment to data strategies for multimodal pretraining, from
implementation details of image captioning and visual question answering
to architectural designs of cutting-edge models like GPT-4V, from
audio-text alignment to video temporal modeling. Each technique includes
runnable code examples, helping readers not only understand principles
but also implement them.
Vision-Language Model Fundamentals
CLIP: Contrastive Learning for Vision-Language Alignment
CLIP (Contrastive Language-Image Pre-training) is a vision-language
model proposed by OpenAI in 2021. Its core innovation lies in using
large-scale contrastive learning to achieve unified representation of
images and text. CLIP was trained on 400 million image-text pairs,
demonstrating powerful zero-shot capabilities.
Core Idea:
CLIP's core assumption is that matching image-text pairs are
semantically related and should therefore lie close together in vector
space, while non-matching pairs should be far apart. Through contrastive
learning, CLIP learns to map images and text into the same
high-dimensional space, pulling semantically similar image and text
vectors together.
Architecture Design:
Image Encoder: Vision Transformer (ViT) or ResNet,
encoding images into fixed-dimensional vectors
Text Encoder: Transformer, encoding text into
vectors of the same dimension
Contrastive Loss: InfoNCE Loss, maximizing
similarity of matching pairs while minimizing similarity of non-matching
pairs
Mathematical Principle:
Given a batch of $N$ image-text pairs, CLIP first computes the
similarity matrix between all images and texts:

$$s_{ij} = \frac{I_i \cdot T_j}{\|I_i\|\,\|T_j\|}$$

where $I_i$ is the embedding of the $i$-th image and $T_j$ is the
embedding of the $j$-th text. Diagonal elements $s_{ii}$ are the
similarities of matching pairs; off-diagonal elements are the
similarities of non-matching pairs.

The contrastive loss function contains two symmetric terms:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right]$$

The first term is the image-to-text contrastive loss, the second the
text-to-image contrastive loss. The temperature parameter $\tau$
controls the sharpness of the distribution and is typically set to 0.07.
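This symmetric InfoNCE loss is only a few lines of PyTorch. The sketch below uses random toy embeddings in place of real encoder outputs:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings."""
    # L2-normalize so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity matrix: logits[i, j] = sim(image_i, text_j) / temperature
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(image_emb))  # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text term
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image term
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 4 pairs of 8-dimensional embeddings
torch.manual_seed(0)
img = torch.randn(4, 8)
txt = torch.randn(4, 8)
loss = clip_contrastive_loss(img, txt)
```

Note that `F.cross_entropy` over each row (and each column, via the transpose) implements exactly the two softmax terms in the loss above.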
Why It Works:
CLIP's success lies in combining large-scale data with contrastive
learning. Training on massive image-text pairs, the model learns rich
vision-language correspondences. Contrastive learning avoids expensive
manual annotation costs, requiring only image-text pairs for training.
This enables CLIP to handle unseen tasks during training, demonstrating
powerful zero-shot capabilities.
Zero-shot Image Classification: Classify images
without training
Image Retrieval: Retrieve relevant images based on
text descriptions
Image Generation Guidance: Provide text-image
alignment for generative models
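Zero-shot classification reduces to comparing one image embedding against the text embeddings of prompts like "a photo of a {label}". The toy tensors below stand in for real CLIP encoder outputs:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Return one probability per class by softmaxing the cosine
    similarities between the image and each class-prompt embedding."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ image_emb / temperature  # one score per class
    return sims.softmax(dim=-1)

# Toy embeddings standing in for real CLIP encoder outputs
torch.manual_seed(0)
cat_emb = torch.randn(8)
class_embs = torch.stack([
    cat_emb + 0.1 * torch.randn(8),  # text embedding near the image ("cat")
    torch.randn(8),                  # unrelated class ("dog")
    torch.randn(8),                  # unrelated class ("car")
])
probs = zero_shot_classify(cat_emb, class_embs)
# The class whose prompt embedding is closest to the image wins
```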
BLIP: Unified Vision-Language Understanding and Generation
BLIP (Bootstrapping Language-Image Pre-training) is a unified
vision-language model proposed by Salesforce that can simultaneously
perform understanding and generation tasks.
Architecture Features:
BLIP uses a multi-task learning framework with three modules:
Unimodal Encoders: Encode images and text
separately
Image-Text Cross-Attention Encoder: Fuse multimodal
information
Image-Text Decoder: Generate text descriptions
Pretraining Tasks:
Image-Text Contrastive Learning (ITC): Align image
and text representations
Image-Text Matching (ITM): Determine if image-text
pairs match
Image-Grounded Text Generation (ITG): Generate
image descriptions
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load BLIP captioning model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Generate a caption (the image path is illustrative)
image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
BLIP-2: Q-Former as a Bridge Between Frozen Models
BLIP-2 is an upgraded version of BLIP whose core innovation is
introducing Q-Former (Querying Transformer) as a "bridge" between the
image encoder and the language model. This design lets BLIP-2 freeze
both the pretrained image encoder and the LLM and train only about 188M
parameters in Q-Former, dramatically reducing training costs.
Core Innovation:
Frozen Pretrained Models: Image encoders (e.g.,
ViT) and language models (e.g., OPT, LLaMA) remain frozen, parameters
not updated. This avoids degrading pretrained model performance while
significantly reducing trainable parameters.
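Freezing in this style takes only a few lines of PyTorch. The linear layers below are stand-ins for a real ViT and LLM:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze all parameters so the optimizer never updates them."""
    for p in module.parameters():
        p.requires_grad_(False)
    module.eval()  # also fix dropout / normalization statistics
    return module

# Stand-ins for a frozen image encoder and a trainable bridge module
image_encoder = freeze(nn.Linear(768, 768))
bridge = nn.Linear(768, 768)  # the only trainable part, as with Q-Former

trainable = sum(p.numel() for p in bridge.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in image_encoder.parameters() if p.requires_grad)
```

When building the optimizer, only `bridge.parameters()` would be passed in; the frozen encoder contributes features but no gradients.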
Two-Stage Pretraining Strategy:
Stage 1: Vision-Language Representation Learning:
Q-Former learns to extract most relevant visual features from frozen
image encoder. Through three tasks — image-text contrastive learning,
image-text matching, and image-conditioned language modeling — Q-Former
learns to convert image features into formats understandable by language
models.
Stage 2: Vision-to-Language Generation Learning:
Q-Former learns to align with frozen LLM, using extracted visual
features as "soft prompts" input to LLM, enabling LLM to generate text
based on visual information.
Q-Former Architecture Explained:
Q-Former contains a set of learnable query embeddings, typically 32
in number. These query vectors work through the following
mechanisms:
Cross-Attention: Query vectors interact with image
features through cross-attention, extracting relevant information from
images
Self-Attention: Query vectors learn relationships
between queries through self-attention, forming global understanding of
images
Feedforward Network: Non-linear transformation of
query vectors, enhancing representation capability
The advantage of this design is that the number of query vectors is far
smaller than the number of image patches (e.g., 32 vs. 256),
dramatically reducing the amount of visual features the language model
must process and improving efficiency.
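The three mechanisms above can be sketched as a single PyTorch module. Dimensions, layer order, and initialization here are illustrative, not the exact BLIP-2 configuration:

```python
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """One Q-Former-style layer: learnable queries attend to each other
    (self-attention), then to image features (cross-attention), then pass
    through a feedforward network."""
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, image_feats):  # image_feats: (B, num_patches, dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        # Queries exchange information with each other
        q = q + self.self_attn(self.n1(q), self.n1(q), self.n1(q))[0]
        # Queries extract relevant information from image features
        q = q + self.cross_attn(self.n2(q), image_feats, image_feats)[0]
        # Non-linear transformation of the query representations
        return q + self.ffn(self.n3(q))  # (B, num_queries, dim)

block = QFormerBlock()
patches = torch.randn(2, 256, 256)  # 2 images, 256 patches each
out = block(patches)
```

Whatever the number of input patches, the output is always 32 query vectors per image, which is what makes the downstream LLM's input length fixed and small.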
```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)
```
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Use CLIP similarity to filter noisy image-text pairs
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_image_text_pair(img_path, text, threshold=0.25):
    """Keep a pair only if CLIP's cosine similarity between the image and
    the text exceeds a threshold (0.25 here is illustrative)."""
    image = Image.open(img_path).convert("RGB")
    inputs = clip_processor(text=[text], images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.t()).item() > threshold

# Batch filtering (image_text_pairs: a list of (path, caption) tuples)
filtered_pairs = [(img_path, text) for img_path, text in image_text_pairs
                  if filter_image_text_pair(img_path, text)]
```
Pretraining Objectives
Multi-Task Learning:
Simultaneously optimize multiple objectives to learn richer
representations:
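The three BLIP-style objectives (ITC, ITM, ITG) can be sketched with toy tensors; shapes and weights here are illustrative:

```python
import torch
import torch.nn.functional as F

def itc_loss(img, txt, temperature=0.07):
    """Image-text contrastive: matched pairs lie on the diagonal."""
    logits = (F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t()
              / temperature)
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def itm_loss(match_logits, labels):
    """Image-text matching: binary classification per pair."""
    return F.binary_cross_entropy_with_logits(match_logits, labels.float())

def itg_loss(token_logits, token_ids):
    """Image-grounded text generation: token-level cross-entropy."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           token_ids.view(-1))

# Toy batch: 4 pairs, 8-dim embeddings, 5-token captions, 100-word vocab
torch.manual_seed(0)
img, txt = torch.randn(4, 8), torch.randn(4, 8)
match_logits, labels = torch.randn(4), torch.tensor([1, 0, 1, 0])
token_logits, token_ids = torch.randn(4, 5, 100), torch.randint(0, 100, (4, 5))

# Equal weights here; in practice the weighting is a tuning choice
total = (itc_loss(img, txt) + itm_loss(match_logits, labels)
         + itg_loss(token_logits, token_ids))
```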
```python
# Usage example (VQASystem: a VQA wrapper assumed defined earlier)
vqa = VQASystem()
answer = vqa.answer_question("image.jpg", "What color is the car?")
print(f"Answer: {answer}")

# Batch question answering
questions = [
    "What is in the image?",
    "How many people are there?",
    "What is the weather like?",
]
answers = vqa.batch_answer("image.jpg", questions)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
```
GPT-4V and Multimodal LLMs
GPT-4V Architecture
GPT-4V (GPT-4 Vision) is OpenAI's multimodal large language model
capable of understanding images and generating text responses.
Core Capabilities:
Image Understanding: Recognize objects, scenes,
text, charts, etc.
Multi-turn Dialogue: Support mixed image and text
inputs
Complex Reasoning: Capable of visual reasoning and
logical analysis
```python
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.conversation import conv_templates, SeparatorStyle
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from PIL import Image
import torch
```
```python
import whisper

class WhisperASR:
    def __init__(self, model_size="base"):
        """model_size: tiny, base, small, medium, large"""
        self.model = whisper.load_model(model_size)

    def transcribe(self, audio_path, language=None, task="transcribe"):
        """task: "transcribe" or "translate" (translate to English)"""
        result = self.model.transcribe(
            audio_path,
            language=language,
            task=task,
            verbose=False,
        )
        return result

    def transcribe_with_timestamps(self, audio_path):
        """Transcription with word-level timestamps"""
        result = self.model.transcribe(
            audio_path,
            word_timestamps=True,
            verbose=False,
        )
        return result

    def batch_transcribe(self, audio_paths):
        """Batch transcription"""
        return [self.transcribe(p) for p in audio_paths]
```
```python
# Usage example
asr = WhisperASR(model_size="base")
result = asr.transcribe("audio.mp3", language="zh")
print(f"Transcription: {result['text']}")

# With timestamps
result_with_ts = asr.transcribe_with_timestamps("audio.mp3")
for segment in result_with_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
```
```python
import torch
import torch.nn as nn
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Usage example (MultimodalRetrievalSystem: a CLIP-based dual-encoder
# index assumed defined earlier)
retrieval_system = MultimodalRetrievalSystem()

# Add images and texts
retrieval_system.add_image("image1.jpg")
retrieval_system.add_image("image2.jpg")
retrieval_system.add_text("a cat sitting on a mat")
retrieval_system.add_text("a dog playing in the park")

# Text query -> images
results = retrieval_system.search_by_text("a cute animal", top_k=3)
for r in results:
    print(f"Image: {r['image_path']}, Similarity: {r['similarity']:.3f}")

# Image query -> texts
results = retrieval_system.search_by_image("query.jpg", top_k=3)
for r in results:
    print(f"Text: {r['text']}, Similarity: {r['similarity']:.3f}")
```
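For reference, a minimal self-contained sketch of such a dual-encoder retrieval index is below. The method names follow the usage above; the toy hash-based encoders stand in for CLIP's image and text encoders, and the constructor takes them as arguments so the sketch runs without model downloads:

```python
import torch
import torch.nn.functional as F

class MultimodalRetrievalSystem:
    """Minimal dual-encoder retrieval index. In practice `encode_image`
    and `encode_text` would wrap CLIP; here they are injected callables."""
    def __init__(self, encode_image, encode_text):
        self.encode_image, self.encode_text = encode_image, encode_text
        self.images, self.texts = [], []  # lists of (name, unit embedding)

    def add_image(self, path):
        self.images.append((path, F.normalize(self.encode_image(path), dim=-1)))

    def add_text(self, text):
        self.texts.append((text, F.normalize(self.encode_text(text), dim=-1)))

    def _search(self, query_emb, items, key, top_k):
        # Cosine similarity = dot product of unit vectors
        query_emb = F.normalize(query_emb, dim=-1)
        scored = [{key: name, "similarity": float(query_emb @ emb)}
                  for name, emb in items]
        return sorted(scored, key=lambda r: -r["similarity"])[:top_k]

    def search_by_text(self, query, top_k=3):
        return self._search(self.encode_text(query), self.images,
                            "image_path", top_k)

    def search_by_image(self, path, top_k=3):
        return self._search(self.encode_image(path), self.texts,
                            "text", top_k)

# Toy deterministic encoders standing in for CLIP
def toy_encode(x):
    g = torch.Generator().manual_seed(hash(x) % (2 ** 31))
    return torch.randn(8, generator=g)

system = MultimodalRetrievalSystem(toy_encode, toy_encode)
system.add_image("image1.jpg")
system.add_image("image2.jpg")
results = system.search_by_text("a cute animal", top_k=1)
```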
❓ Q&A: Common Questions About Multimodal Large Language Models
Q1: What are the main differences between CLIP and
BLIP?
A: CLIP focuses on image-text alignment, achieving zero-shot
capabilities through contrastive learning, but cannot generate text.
BLIP is a unified model that can both understand (VQA, image-text
matching) and generate (image captioning), achieved through multi-task
learning.
Q2: Why does BLIP-2 only need to train very few
parameters?
A: BLIP-2 freezes the image encoder and language model, only training
Q-Former (~188M parameters). Q-Former acts as a bridge, learning to
extract visual features from the frozen image encoder and converting
them into a format that the language model can understand.
Q3: How does Whisper handle speech in different
languages?
A: Whisper was trained on multilingual data and can automatically
detect languages. You can specify the language via the
language parameter or let the model auto-detect. The model
supports 99 languages, including Chinese, English, Japanese, etc.
Q4: How do multimodal models handle long videos?
A: Typically two strategies: 1) Uniformly sample key frames (e.g., 1
frame per second); 2) Use sliding windows to segment videos and then
fuse results. For very long videos, video summarization techniques can
be used first to compress information.
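The first strategy, uniform key-frame sampling under a frame budget, is a few lines of index arithmetic (the function name and defaults are illustrative):

```python
def sample_frame_indices(total_frames, fps, every_sec=1.0, max_frames=None):
    """Pick one frame index per `every_sec` seconds of video, optionally
    thinning uniformly to stay within a frame budget."""
    step = max(1, round(fps * every_sec))
    indices = list(range(0, total_frames, step))
    if max_frames is not None and len(indices) > max_frames:
        # Video too long: thin out uniformly to the budget
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

# 10-second clip at 30 fps, sampled at 1 frame per second
idx = sample_frame_indices(total_frames=300, fps=30)
# → [0, 30, 60, ..., 270]
```

The selected frames would then be decoded and passed to the image encoder; the sliding-window strategy applies the same idea per window before fusing results.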
Q5: What's the difference between GPT-4V and open-source
multimodal LLMs like LLaVA?
A: GPT-4V is a closed-source model with powerful performance but
requires API calls and higher costs. Open-source models like LLaVA can
be deployed locally but may have slightly inferior performance. The
choice depends on specific needs: choose GPT-4V for performance, choose
open-source models for customization or cost control.
Q6: How to evaluate multimodal model
performance?
A: Different tasks have different metrics: image captioning uses
BLEU, METEOR, CIDEr; VQA uses accuracy; image-text retrieval uses
Recall@K. Human evaluation can also be conducted to check the accuracy
and fluency of generated content.
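Of these metrics, Recall@K is straightforward to compute from a query-candidate similarity matrix, assuming (as in a matched test set) that the correct candidate for query $i$ sits at index $i$:

```python
import torch

def recall_at_k(similarity, k):
    """similarity[i, j] = score between query i and candidate j; the
    correct candidate for query i is candidate i (the diagonal)."""
    topk = similarity.topk(k, dim=1).indices               # (queries, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()            # hit within top-k?
    return hits.mean().item()

# Toy 4x4 similarity matrix: queries 0 and 1 rank their match first,
# queries 2 and 3 only within the top 2
sim = torch.tensor([[0.9, 0.1, 0.2, 0.0],
                    [0.2, 0.8, 0.1, 0.3],
                    [0.7, 0.1, 0.3, 0.2],
                    [0.1, 0.6, 0.2, 0.4]])
r1 = recall_at_k(sim, k=1)  # 2 of 4 queries hit → 0.5
r2 = recall_at_k(sim, k=2)  # all 4 queries hit → 1.0
```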
Q7: How much data is needed for multimodal
pretraining?
A: Large-scale pretraining typically requires tens of millions to
billions of image-text pairs. CLIP used 400 million pairs, BLIP used 129
million pairs. For specific domains, domain data can be used for
fine-tuning, requiring less data (tens of thousands to hundreds of
thousands).
Q8: How to solve memory issues with multimodal
models?
A: Can adopt the following strategies: 1) Use quantization
(4-bit/8-bit); 2) Use gradient checkpointing; 3) Use parameter-efficient
fine-tuning (LoRA); 4) Use model parallelism; 5) Use smaller model
variants.
Q9: What types of inputs can multimodal models
process?
A: Commonly supported: images (JPG, PNG), text. Partially supported:
audio (Whisper), video (Video-ChatGPT). Future trends include supporting
more modalities, such as 3D models, point clouds, etc.
Q10: How to build a production-grade multimodal
application?
A: Key steps: 1) Choose appropriate models (balance performance and
cost); 2) Implement efficient inference services (batching, caching); 3)
Add monitoring and logging; 4) Implement error handling and fallback
strategies; 5) Use containerized deployment (Docker); 6) Implement load
balancing and auto-scaling.
Post title: NLP (11): Multimodal Large Language Models
Post author: Chen Kai
Create time: 2024-04-04 10:30:00
Post link: https://www.chenk.top/en/nlp-multimodal-nlp/
Copyright notice: All articles in this blog are licensed under CC BY-NC-SA unless stated otherwise.