人类感知世界的方式是多模态的:我们看到图像、听到声音、阅读文本,这些信息在大脑中融合形成统一的理解。但传统的
NLP 模型只能处理文本,这限制了 AI 理解真实世界的能力。
多模态大模型( MLLM)试图打破这一限制,让 AI
能够像人类一样同时理解图像、音频、视频和文本。但多模态融合并非易事:不同模态的数据分布差异巨大,如何将它们对齐到统一的表示空间?如何设计高效的跨模态注意力机制?如何在大规模数据上预训练多模态模型?
从 CLIP 的对比学习实现视觉-语言对齐,到 BLIP-2 的 Q-Former
实现参数高效的多模态预训练,再到 GPT-4V
展现的通用视觉理解能力,多模态技术正在快速演进。音频-文本模型如 Whisper
实现了接近人类水平的语音识别,视频理解模型则能分析复杂的时序信息。这些技术不仅在学术研究中取得突破,更在实际应用中展现出巨大潜力——从智能客服到内容创作,从医疗诊断到自动驾驶。
本文深入解析多模态大模型的核心技术:从视觉-语言对齐的数学原理到多模态预训练的数据策略,从图像描述和视觉问答的实现细节到
GPT-4V
等前沿模型的架构设计,从音频-文本对齐到视频时序建模。每个技术点都配有可运行的代码示例,帮助读者不仅理解原理,更能动手实践。
视觉-语言模型基础
CLIP:对比学习的视觉-语言对齐
CLIP( Contrastive Language-Image Pre-training)是 OpenAI 在 2021
年提出的视觉-语言模型,其核心创新在于使用大规模对比学习实现图像和文本的统一表示空间。
CLIP 在 4 亿图像-文本对上训练,展现了强大的零样本能力。
核心思想 :
CLIP
的核心假设是:匹配的图像-文本对在语义上应该相似,因此在向量空间中应该距离较近;不匹配的对应该距离较远。通过对比学习,
CLIP
学习将图像和文本映射到同一个高维空间,使得语义相似的图像和文本向量接近。
架构设计 :
图像编码器 : Vision Transformer (ViT) 或
ResNet,将图像编码为固定维度的向量
文本编码器 :
Transformer,将文本编码为相同维度的向量
对比损失 : InfoNCE
Loss,最大化匹配对的相似度,最小化不匹配对的相似度
数学原理 :
给定一个 batch 中的 $N$ 个图像-文本对,CLIP 首先计算所有图像和文本之间的相似度矩阵:

$$S_{ij} = I_i \cdot T_j$$

其中 $I_i$ 是第 $i$ 个图像的嵌入,$T_j$ 是第 $j$ 个文本的嵌入(均已 L2 归一化)。对角线元素 $S_{ii}$ 是匹配对的相似度,非对角线元素是不匹配对的相似度。
对比损失函数包含两个对称项:

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{i \to t} + \mathcal{L}_{t \to i}\right), \quad \mathcal{L}_{i \to t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ij}/\tau)}, \quad \mathcal{L}_{t \to i} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ji}/\tau)}$$

第一项是从图像到文本的对比损失,第二项是从文本到图像的对比损失。温度参数 $\tau$ 控制分布的尖锐程度,通常设置为 0.07。
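下面给出这一对称 InfoNCE 损失的一个最小化 PyTorch 示意实现(假设 image_embeds 与 text_embeds 已做 L2 归一化;函数名与随机示例数据均为演示自拟):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """对称 InfoNCE:image_embeds、text_embeds 形状均为 (N, d),且已 L2 归一化"""
    logits = image_embeds @ text_embeds.T / temperature            # 相似度矩阵 S / tau
    targets = torch.arange(logits.size(0), device=logits.device)   # 对角线为匹配对
    loss_i2t = F.cross_entropy(logits, targets)     # 图像 -> 文本
    loss_t2i = F.cross_entropy(logits.T, targets)   # 文本 -> 图像
    return (loss_i2t + loss_t2i) / 2

# 随机向量仅作演示
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_contrastive_loss(img, txt))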
为什么有效 :
CLIP
的成功在于大规模数据和对比学习的结合。通过在海量图像-文本对上训练,模型学习到了丰富的视觉-语言对应关系。对比学习避免了需要人工标注的昂贵成本,只需要图像-文本对即可训练。这使得
CLIP 能够处理训练时未见过的任务,展现出强大的零样本能力。
实现示例 :
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# 加载预训练的 CLIP 模型和处理器
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 一张图像和多条候选文本
image = Image.open("cat.jpg")
texts = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# 取出图像和文本嵌入(已归一化),计算相似度并做 softmax
image_embeds = outputs.image_embeds
text_embeds = outputs.text_embeds
similarities = (image_embeds @ text_embeds.T).softmax(dim=-1)
print(f"最匹配的文本索引: {similarities.argmax().item()}")
应用场景 :
零样本图像分类 :无需训练即可对图像进行分类
图像检索 :根据文本描述检索相关图像
图像生成引导 :为生成模型提供文本-图像对齐能力
BLIP:统一视觉-语言理解与生成
BLIP( Bootstrapping Language-Image Pre-training)是 Salesforce
提出的统一视觉-语言模型,能够同时完成理解和生成任务。
架构特点 :
BLIP 使用多任务学习框架,包含三个模块:
单模态编码器 :分别编码图像和文本
图像-文本交叉注意力编码器 :融合多模态信息
图像-文本解码器 :生成文本描述
预训练任务 :
图像-文本对比学习 ( ITC):对齐图像和文本表示
图像-文本匹配 ( ITM):判断图像-文本对是否匹配
图像条件语言建模 ( LM):生成图像描述
实现示例 :
from transformers import BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering
from PIL import Image

# 图像描述生成
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_length=50)
caption = processor.decode(out[0], skip_special_tokens=True)
print(f"生成的描述: {caption}")

# 视觉问答(VQA)
qa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
question = "What is in the image?"
inputs = processor(image, question, return_tensors="pt")
out = qa_model.generate(**inputs)
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"答案: {answer}")
BLIP-2:参数高效的多模态预训练
BLIP-2 是 BLIP 的升级版,其核心创新在于引入了 Q-Former( Query
Transformer)作为图像编码器和语言模型之间的"桥梁"。这个设计使得 BLIP-2
能够冻结预训练的图像编码器和 LLM,只训练约 188M 参数的
Q-Former,大大降低了训练成本。
核心创新 :
冻结预训练模型 :图像编码器(如
ViT)和语言模型(如 OPT 、
LLaMA)保持冻结,不更新参数。这避免了破坏预训练模型的性能,同时大幅减少可训练参数。
两阶段预训练策略 :
阶段一:视觉-语言表示学习 : Q-Former
学习从冻结的图像编码器提取最相关的视觉特征。通过图像-文本对比学习、图像-文本匹配和图像条件语言建模三个任务,
Q-Former 学会将图像特征转换为语言模型能理解的格式。
阶段二:视觉到语言生成学习 : Q-Former 学习与冻结的
LLM 对齐,将提取的视觉特征作为"软提示"( soft prompts)输入 LLM,让 LLM
能够基于视觉信息生成文本。
Q-Former 架构详解 :
Q-Former 包含一组可学习的查询向量( learnable query
embeddings),数量通常为 32 。这些查询向量通过以下机制工作:
交叉注意力 :查询向量通过交叉注意力与图像特征交互,从图像中提取相关信息
自注意力 :查询向量之间通过自注意力学习查询之间的关系,形成对图像的全局理解
前馈网络 :对查询向量进行非线性变换,增强表示能力
这种设计的优势在于:查询向量数量远少于图像 patch 数量(如 32 vs
256),大大减少了需要处理的特征维度,提高了效率。
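为了直观理解上述机制,下面给出一个极简的 Q-Former 块示意(维度、层数与变量名均为假设,仅用于说明查询向量如何通过自注意力、交叉注意力与前馈网络和图像特征交互,并非 BLIP-2 的官方实现):

import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """极简 Q-Former 块:自注意力 + 对图像特征的交叉注意力 + 前馈网络(示意用)"""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, image_feats):
        # 查询向量之间的自注意力
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q)[0]
        # 查询向量对冻结图像特征的交叉注意力
        q = self.norm2(queries)
        queries = queries + self.cross_attn(q, image_feats, image_feats)[0]
        # 前馈网络
        return queries + self.ffn(self.norm3(queries))

# 32 个可学习查询向量 vs 256 个图像 patch 特征
queries = nn.Parameter(torch.randn(1, 32, 768))
image_feats = torch.randn(1, 256, 768)
out = QFormerBlock()(queries, image_feats)
print(out.shape)  # torch.Size([1, 32, 768]):输出长度始终等于查询数,与 patch 数无关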
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

image = Image.open("image.jpg").convert("RGB")
prompt = "Question: What is this? Answer:"

# 输入需要放到与模型相同的设备和精度上
inputs = processor(image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=50)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
优势 :
参数效率 :只需训练 Q-Former(约 188M
参数),而图像编码器和 LLM 保持冻结
灵活性 :可以轻松适配不同的 LLM(如 OPT 、 Flan-T5
、 LLaMA)
性能 :在多个视觉-语言任务上达到 SOTA
多模态预训练策略
数据构建
多模态预训练需要大规模图像-文本对数据:
常见数据集 :
LAION :数十亿级别的图像-文本对
CC( Common Crawl) :从网页爬取的图像-文本对
COCO :高质量标注的图像描述数据集
Visual Genome :包含详细视觉关系的图像数据集
数据清洗策略 :
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_image_text_pair(image_path, text, threshold=0.25):
    """过滤低质量的图像-文本对:用 CLIP 相似度作为质量分数"""
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(text=[text], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = clip_model(**inputs)
    similarity = (outputs.image_embeds @ outputs.text_embeds.T).item()
    return similarity >= threshold

# image_text_pairs 为待清洗的 (图像路径, 文本) 列表,此处仅为示例
image_text_pairs = [("img_001.jpg", "a dog running on the beach")]
filtered_pairs = []
for img_path, text in image_text_pairs:
    if filter_image_text_pair(img_path, text):
        filtered_pairs.append((img_path, text))
预训练目标
多任务学习 :
同时优化多个目标,让模型学习更丰富的表示:
import torch
import torch.nn as nn

class MultimodalPretrainingLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature
        self.itc_loss = nn.CrossEntropyLoss()
        self.itm_loss = nn.CrossEntropyLoss()
        self.lm_loss = nn.CrossEntropyLoss()

    def forward(self, image_embeds, text_embeds, itm_logits, itm_labels, lm_logits, lm_labels):
        # 图像-文本对比损失(ITC):batch 内对角线为正样本
        logits = image_embeds @ text_embeds.T / self.temperature
        itc_targets = torch.arange(logits.size(0), device=logits.device)
        itc_loss = (self.itc_loss(logits, itc_targets) + self.itc_loss(logits.T, itc_targets)) / 2

        # 图像-文本匹配损失(ITM):二分类,itm_labels 取 0/1
        itm_loss = self.itm_loss(itm_logits, itm_labels)

        # 图像条件语言建模损失(LM):lm_labels 为目标 token id
        lm_loss = self.lm_loss(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))

        total_loss = itc_loss + itm_loss + lm_loss
        return total_loss, {
            "itc_loss": itc_loss.item(),
            "itm_loss": itm_loss.item(),
            "lm_loss": lm_loss.item()
        }
课程学习 :
从简单到复杂逐步训练:
阶段一 :图像-文本对齐( ITC)
阶段二 :图像-文本匹配( ITM)
阶段三 :图像条件生成( ITG)
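下面用一个简单的阶段配置示意如何按课程逐步启用各个损失项(阶段划分、轮数与权重均为示意值,需结合上文的 MultimodalPretrainingLoss 使用):

# 各阶段启用的损失及其权重(示意)
curriculum = [
    {"stage": "阶段一: ITC", "epochs": 5, "weights": {"itc": 1.0, "itm": 0.0, "lm": 0.0}},
    {"stage": "阶段二: +ITM", "epochs": 5, "weights": {"itc": 1.0, "itm": 1.0, "lm": 0.0}},
    {"stage": "阶段三: +ITG", "epochs": 10, "weights": {"itc": 1.0, "itm": 1.0, "lm": 1.0}},
]

def combine_losses(loss_dict, weights):
    """按当前阶段的权重组合各项损失,loss_dict 形如 {"itc": ..., "itm": ..., "lm": ...}"""
    return sum(weights[name] * value for name, value in loss_dict.items())

for stage in curriculum:
    # 每个阶段内按正常流程训练 stage["epochs"] 轮,只是损失权重不同
    print(stage["stage"], "->", stage["weights"])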
图像描述与视觉问答
图像描述生成
图像描述( Image Captioning)是多模态理解的基础任务。
评估指标 :
BLEU :基于 n-gram 重叠
METEOR :考虑同义词和词序
CIDEr :专门为图像描述设计的指标
SPICE :基于场景图的语义相似度
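以 BLEU 为例,可以用 nltk 对生成描述与参考描述做一次简单评估(CIDEr、SPICE 等通常用 pycocoevalcap 工具包计算,这里仅演示 BLEU;示例句子为自拟数据):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# 多条人工参考描述与一条模型生成的描述(已分词)
references = [["a", "cat", "sitting", "on", "a", "mat"],
              ["a", "cat", "is", "on", "the", "mat"]]
candidate = ["a", "cat", "sitting", "on", "the", "mat"]

smooth = SmoothingFunction().method1
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.3f}")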
实现示例 :
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch

class ImageCaptioner:
    def __init__(self, model_name="Salesforce/blip-image-captioning-large"):
        self.processor = BlipProcessor.from_pretrained(model_name)
        self.model = BlipForConditionalGeneration.from_pretrained(model_name)
        self.model.eval()

    def generate_caption(self, image_path, max_length=50, num_beams=3):
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(image, return_tensors="pt")
        with torch.no_grad():
            out = self.model.generate(
                **inputs,
                max_length=max_length,
                num_beams=num_beams,
                early_stopping=True
            )
        caption = self.processor.decode(out[0], skip_special_tokens=True)
        return caption

    def generate_diverse_captions(self, image_path, num_return_sequences=3):
        """生成多样化的描述"""
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(image, return_tensors="pt")
        with torch.no_grad():
            out = self.model.generate(
                **inputs,
                max_length=50,
                num_beams=5,
                num_return_sequences=num_return_sequences,
                do_sample=True,
                temperature=0.8
            )
        captions = [self.processor.decode(o, skip_special_tokens=True) for o in out]
        return captions

captioner = ImageCaptioner()
caption = captioner.generate_caption("image.jpg")
print(f"描述: {caption}")

diverse_captions = captioner.generate_diverse_captions("image.jpg")
for i, cap in enumerate(diverse_captions, 1):
    print(f"描述 {i}: {cap}")
视觉问答( VQA)
视觉问答要求模型理解图像内容并回答自然语言问题。
数据集 :
VQA v2 :包含 200K+ 图像, 1.1M+ 问题
GQA :场景图增强的视觉问答
TextVQA :包含文本的视觉问答
实现示例 :
from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image
import torch

class VQASystem:
    def __init__(self, model_name="Salesforce/blip-vqa-base"):
        self.processor = BlipProcessor.from_pretrained(model_name)
        self.model = BlipForQuestionAnswering.from_pretrained(model_name)
        self.model.eval()

    def answer_question(self, image_path, question):
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(image, question, return_tensors="pt")
        with torch.no_grad():
            out = self.model.generate(**inputs, max_length=50)
        answer = self.processor.decode(out[0], skip_special_tokens=True)
        return answer

    def batch_answer(self, image_path, questions):
        """批量回答问题"""
        image = Image.open(image_path).convert("RGB")
        answers = []
        for question in questions:
            inputs = self.processor(image, question, return_tensors="pt")
            with torch.no_grad():
                out = self.model.generate(**inputs, max_length=50)
            answer = self.processor.decode(out[0], skip_special_tokens=True)
            answers.append(answer)
        return answers

vqa = VQASystem()
answer = vqa.answer_question("image.jpg", "What color is the car?")
print(f"答案: {answer}")

questions = [
    "What is in the image?",
    "How many people are there?",
    "What is the weather like?"
]
answers = vqa.batch_answer("image.jpg", questions)
for q, a in zip(questions, answers):
    print(f"Q: {q}\nA: {a}\n")
GPT-4V 与多模态 LLM
GPT-4V 架构
GPT-4V( GPT-4 Vision)是 OpenAI
的多模态大语言模型,能够理解图像并生成文本响应。
核心能力 :
图像理解 :识别物体、场景、文字、图表等
多轮对话 :支持图像和文本的混合输入
复杂推理 :能够进行视觉推理和逻辑分析
使用示例 ( API 调用):
import base64
from openai import OpenAI

client = OpenAI()

def gpt4v_chat(image_path, text_prompt):
    """使用 GPT-4V 进行多模态对话"""
    # 图像需先做 base64 编码,再以 data URL 形式传入
    with open(image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )
    return response.choices[0].message.content

response = gpt4v_chat("chart.png", "分析这张图表的主要趋势")
print(response)
开源多模态 LLM
LLaVA( Large Language and Vision Assistant) :
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.constants import IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from PIL import Image
import torch

def load_llava_model(model_path="liuhaotian/llava-v1.5-7b"):
    disable_torch_init()
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, None, model_name
    )
    return tokenizer, model, image_processor

def llava_chat(image_path, question, tokenizer, model, image_processor):
    """使用 LLaVA 进行视觉问答"""
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()

    # 预处理图像
    image = Image.open(image_path).convert("RGB")
    image_tensor = image_processor.preprocess(image, return_tensors="pt")["pixel_values"][0]

    # 构造带 <image> 占位符的对话 prompt
    conv.append_message(conv.roles[0], f"<image>\n{question}")
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    # 把 <image> 占位符替换为特殊的图像 token id
    input_ids = tokenizer_image_token(
        prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).cuda()

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor.unsqueeze(0).half().cuda(),
            do_sample=True,
            temperature=0.2,
            top_p=0.7,
            num_beams=1,
            max_new_tokens=512
        )

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    return outputs
MiniGPT-4 :
# 说明:MiniGPT-4 官方仓库通过配置文件与 Chat 类加载模型,
# 下面的 load_pretrained_minigpt4 只是示意性的封装函数(假设已自行实现),展示整体调用流程。
from PIL import Image

def load_minigpt4():
    # 假设的加载函数:传入 Vicuna 权重路径和 MiniGPT-4 预训练检查点
    model = load_pretrained_minigpt4(
        llama_model="path/to/vicuna",
        pretrained_ckpt="path/to/pretrained_minigpt4.pth"
    )
    return model

def minigpt4_chat(image_path, question, model):
    """使用 MiniGPT-4 进行对话"""
    image = Image.open(image_path).convert("RGB")
    response = model.generate(image, question, max_new_tokens=300)
    return response
音频-文本模型
Whisper:大规模语音识别
Whisper 是 OpenAI
开发的多语言语音识别模型,支持多种语言的语音转文本。
特点 :
多语言支持 :支持 99 种语言
鲁棒性 :对背景噪声、口音、方言有良好的适应性
零样本能力 :无需微调即可处理新语言
实现示例 :
import whisper

class WhisperASR:
    def __init__(self, model_size="base"):
        """
        model_size: tiny, base, small, medium, large
        """
        self.model = whisper.load_model(model_size)

    def transcribe(self, audio_path, language=None, task="transcribe"):
        """
        task: "transcribe"(转录)或 "translate"(翻译为英文)
        """
        result = self.model.transcribe(
            audio_path,
            language=language,
            task=task,
            verbose=False
        )
        return result

    def transcribe_with_timestamps(self, audio_path):
        """带时间戳的转录"""
        result = self.model.transcribe(
            audio_path,
            word_timestamps=True,
            verbose=False
        )
        return result

    def batch_transcribe(self, audio_paths):
        """批量转录"""
        results = []
        for audio_path in audio_paths:
            result = self.transcribe(audio_path)
            results.append(result)
        return results

asr = WhisperASR(model_size="base")
result = asr.transcribe("audio.mp3", language="zh")
print(f"转录结果: {result['text']}")

result_with_ts = asr.transcribe_with_timestamps("audio.mp3")
for segment in result_with_ts["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
高级功能 :
def transcribe_with_vad(audio_path, model):
    """使用语音活动检测(VAD)进行分段转录。

    注意:detect_speech_segments 是假设的 VAD 函数(可用 webrtcvad、silero-vad 等实现),
    此处仅示意整体流程。
    """
    from scipy.io import wavfile

    sample_rate, audio = wavfile.read(audio_path)

    # 假设返回 [(start_sec, end_sec), ...] 形式的语音段
    segments = detect_speech_segments(audio, sample_rate)

    results = []
    for start, end in segments:
        segment_audio = audio[int(start * sample_rate):int(end * sample_rate)]
        temp_path = "temp_segment.wav"
        wavfile.write(temp_path, sample_rate, segment_audio)

        result = model.transcribe(temp_path)
        results.append({
            "start": start,
            "end": end,
            "text": result["text"]
        })
    return results
音频-文本对齐
Wav2Vec2 + BERT :
from transformers import Wav2Vec2Processor, Wav2Vec2Model
from transformers import AutoTokenizer, AutoModel
import torch
import torchaudio

class AudioTextAlignment:
    def __init__(self):
        self.audio_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base")
        self.audio_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.text_model = AutoModel.from_pretrained("bert-base-uncased")

    def align(self, audio_path, text):
        """对齐音频和文本"""
        waveform, sample_rate = torchaudio.load(audio_path)
        # wav2vec2-base 期望 16kHz 采样率
        if sample_rate != 16000:
            waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
            sample_rate = 16000

        inputs = self.audio_processor(
            waveform.squeeze().numpy(),
            sampling_rate=sample_rate,
            return_tensors="pt"
        )
        with torch.no_grad():
            audio_features = self.audio_model(**inputs).last_hidden_state

        text_inputs = self.text_tokenizer(text, return_tensors="pt", padding=True)
        with torch.no_grad():
            text_features = self.text_model(**text_inputs).last_hidden_state

        alignment = self.compute_alignment(audio_features, text_features)
        return alignment

    def compute_alignment(self, audio_features, text_features):
        """一种简化的对齐方式:计算音频帧与文本 token 的余弦相似度矩阵"""
        audio_norm = torch.nn.functional.normalize(audio_features, dim=-1)
        text_norm = torch.nn.functional.normalize(text_features, dim=-1)
        return audio_norm @ text_norm.transpose(1, 2)  # (batch, 音频帧数, token 数)
视频理解
视频编码
Video-ChatGPT :
# 注意:Video-ChatGPT(MBZUAI)并未集成到 transformers 库中,
# 需要使用其官方仓库提供的加载与推理接口;以下 processor/model 接口仅为示意。
class VideoUnderstanding:
    def __init__(self, model_name="MBZUAI/Video-ChatGPT"):
        # 假设存在封装好的 processor 与 model(实际接口以官方仓库为准)
        self.processor = VideoChatGPTProcessor.from_pretrained(model_name)
        self.model = VideoChatGPTModel.from_pretrained(model_name)

    def understand_video(self, video_path, question):
        """理解视频内容并回答问题"""
        video_frames = self.load_video_frames(video_path)
        inputs = self.processor(
            text=question,
            videos=video_frames,
            return_tensors="pt"
        )
        outputs = self.model.generate(**inputs)
        answer = self.processor.decode(outputs[0], skip_special_tokens=True)
        return answer

    def load_video_frames(self, video_path, num_frames=8):
        """从视频中均匀采样若干帧"""
        import cv2

        cap = cv2.VideoCapture(video_path)
        frames = []
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frame_indices = [int(i * total_frames / num_frames) for i in range(num_frames)]

        for idx in frame_indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                # OpenCV 读出的是 BGR,一般需要转为 RGB 再送入模型
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return frames
时序建模
Video-LLaMA :
# Video-LLaMA 的整体流程示意(伪代码):视频编码器 -> 投影层 -> LLM
class VideoLLaMA:
    def __init__(self):
        # 以下加载函数仅为示意,实际以官方仓库的实现为准
        self.video_encoder = self.load_video_encoder()  # 视频编码器(含时序建模)
        self.llm = self.load_llm()                      # 冻结的大语言模型
        self.projector = self.load_projector()          # 把视觉特征投影到 LLM 的嵌入空间

    def process_video(self, video_path, question):
        """处理视频并生成回答"""
        video_features = self.video_encoder(video_path)           # 提取带时序信息的视频特征
        projected_features = self.projector(video_features)       # 对齐到 LLM 的输入空间
        prompt = self.build_prompt(question, projected_features)  # 构造多模态 prompt
        answer = self.llm.generate(prompt)
        return answer
实战:构建多模态应用
多模态检索系统
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class MultimodalRetrievalSystem:
    def __init__(self, model_name="openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model.eval()
        self.image_embeddings = []
        self.text_embeddings = []
        self.image_paths = []
        self.texts = []

    def add_image(self, image_path):
        """添加图像到索引"""
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt")
        with torch.no_grad():
            image_embed = self.model.get_image_features(**inputs)
            image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
        self.image_embeddings.append(image_embed.cpu().numpy())
        self.image_paths.append(image_path)

    def add_text(self, text):
        """添加文本到索引"""
        inputs = self.processor(text=[text], return_tensors="pt", padding=True)
        with torch.no_grad():
            text_embed = self.model.get_text_features(**inputs)
            text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)
        self.text_embeddings.append(text_embed.cpu().numpy())
        self.texts.append(text)

    def search_by_text(self, query_text, top_k=5):
        """根据文本查询图像"""
        inputs = self.processor(text=[query_text], return_tensors="pt", padding=True)
        with torch.no_grad():
            query_embed = self.model.get_text_features(**inputs)
            query_embed = query_embed / query_embed.norm(dim=-1, keepdim=True)

        if len(self.image_embeddings) == 0:
            return []

        image_embeds = np.vstack(self.image_embeddings)
        similarities = cosine_similarity(query_embed.cpu().numpy(), image_embeds)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = [
            {
                "image_path": self.image_paths[i],
                "similarity": float(similarities[i])
            }
            for i in top_indices
        ]
        return results

    def search_by_image(self, query_image_path, top_k=5):
        """根据图像查询文本"""
        image = Image.open(query_image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt")
        with torch.no_grad():
            query_embed = self.model.get_image_features(**inputs)
            query_embed = query_embed / query_embed.norm(dim=-1, keepdim=True)

        if len(self.text_embeddings) == 0:
            return []

        text_embeds = np.vstack(self.text_embeddings)
        similarities = cosine_similarity(query_embed.cpu().numpy(), text_embeds)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = [
            {
                "text": self.texts[i],
                "similarity": float(similarities[i])
            }
            for i in top_indices
        ]
        return results

retrieval_system = MultimodalRetrievalSystem()
retrieval_system.add_image("image1.jpg")
retrieval_system.add_image("image2.jpg")
retrieval_system.add_text("a cat sitting on a mat")
retrieval_system.add_text("a dog playing in the park")

results = retrieval_system.search_by_text("a cute animal", top_k=3)
for r in results:
    print(f"图像: {r['image_path']}, 相似度: {r['similarity']:.3f}")

results = retrieval_system.search_by_image("query.jpg", top_k=3)
for r in results:
    print(f"文本: {r['text']}, 相似度: {r['similarity']:.3f}")
多模态对话系统
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
)
from PIL import Image
import torch

class MultimodalChatbot:
    def __init__(self):
        self.caption_processor = BlipProcessor.from_pretrained(
            "Salesforce/blip-image-captioning-base"
        )
        self.caption_model = BlipForConditionalGeneration.from_pretrained(
            "Salesforce/blip-image-captioning-base"
        )
        self.vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
        self.vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
        self.conversation_history = []

    def process_message(self, message, image_path=None):
        """处理用户消息(可能包含图像)"""
        if image_path:
            image = Image.open(image_path).convert("RGB")
            # 简单的意图判断:像问题就走 VQA,否则生成图像描述
            if "?" in message or "what" in message.lower() or "how" in message.lower():
                inputs = self.vqa_processor(image, message, return_tensors="pt")
                with torch.no_grad():
                    out = self.vqa_model.generate(**inputs, max_length=50)
                response = self.vqa_processor.decode(out[0], skip_special_tokens=True)
            else:
                inputs = self.caption_processor(image, return_tensors="pt")
                with torch.no_grad():
                    out = self.caption_model.generate(**inputs, max_length=50)
                response = self.caption_processor.decode(out[0], skip_special_tokens=True)
        else:
            response = self.text_chat(message)

        self.conversation_history.append({
            "user": message,
            "image": image_path,
            "assistant": response
        })
        return response

    def text_chat(self, message):
        """纯文本对话(简化示例)"""
        return "I can help you with image-related questions. Please provide an image."

chatbot = MultimodalChatbot()
response = chatbot.process_message("What is in this image?", "image.jpg")
print(response)
部署多模态服务
FastAPI 部署 :
from fastapi import FastAPI, File, UploadFile, Form
from fastapi.responses import JSONResponse
from PIL import Image
import io
import torch

app = FastAPI()

caption_processor = None
caption_model = None
vqa_processor = None
vqa_model = None

@app.on_event("startup")
async def load_models():
    # 服务启动时加载模型,供各接口复用
    global caption_processor, caption_model, vqa_processor, vqa_model
    from transformers import (
        BlipProcessor,
        BlipForConditionalGeneration,
        BlipForQuestionAnswering,
    )
    caption_processor = BlipProcessor.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    )
    caption_model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    )
    vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

@app.post("/caption")
async def generate_caption(file: UploadFile = File(...)):
    """生成图像描述"""
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    inputs = caption_processor(image, return_tensors="pt")
    with torch.no_grad():
        out = caption_model.generate(**inputs, max_length=50)
    caption = caption_processor.decode(out[0], skip_special_tokens=True)

    return JSONResponse({"caption": caption})

@app.post("/vqa")
async def answer_question(
    file: UploadFile = File(...),
    question: str = Form(...)
):
    """视觉问答"""
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    inputs = vqa_processor(image, question, return_tensors="pt")
    with torch.no_grad():
        out = vqa_model.generate(**inputs, max_length=50)
    answer = vqa_processor.decode(out[0], skip_special_tokens=True)

    return JSONResponse({"answer": answer})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Docker 部署 :
FROM python:3.9-slim

WORKDIR /app

RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
RUN pip install transformers pillow fastapi uvicorn python-multipart

COPY app.py /app/

# 构建镜像时预下载模型权重,避免首次请求时的冷启动
RUN python -c "from transformers import BlipProcessor; BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-base')"

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
❓ Q&A: 多模态大模型常见问题
Q1: CLIP 和 BLIP 的主要区别是什么?
A: CLIP
专注于图像-文本对齐,通过对比学习实现零样本能力,但不能生成文本。 BLIP
是统一模型,既能理解( VQA
、图像-文本匹配)也能生成(图像描述),通过多任务学习实现。
Q2: 为什么 BLIP-2 只需要训练很少的参数?
A: BLIP-2 冻结了图像编码器和语言模型,只训练 Q-Former(约 188M
参数)。 Q-Former
作为桥梁,学习从冻结的图像编码器提取视觉特征,并将其转换为语言模型能理解的格式。
Q3: Whisper 如何处理不同语言的语音?
A: Whisper 在训练时使用了多语言数据,模型能够自动检测语言。可以通过
language 参数指定语言,也可以让模型自动检测。模型支持 99
种语言,包括中文、英文、日文等。
Q4: 多模态模型如何处理长视频?
A: 通常采用两种策略: 1) 均匀采样关键帧(如每 1 秒采样 1 帧); 2)
使用滑动窗口将视频分段处理,然后融合结果。对于超长视频,可以先使用视频摘要技术压缩信息。
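下面是滑动窗口分段策略的一个简化示意(窗口长度与步长为假设值,每个片段可分别送入上文的视频理解模型,再融合各片段的结果):

def sliding_window_clips(total_seconds, window=30, stride=20):
    """把长视频按固定窗口切成重叠片段,返回 (start, end) 秒的列表(示意)"""
    clips = []
    start = 0
    while start < total_seconds:
        clips.append((start, min(start + window, total_seconds)))
        start += stride
    return clips

print(sliding_window_clips(100))  # [(0, 30), (20, 50), (40, 70), (60, 90), (80, 100)]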
Q5: GPT-4V 和开源多模态 LLM(如 LLaVA)的区别?
A: GPT-4V 是闭源模型,性能强大但需要 API 调用,成本较高。 LLaVA
等开源模型可以本地部署,但性能可能略逊。选择取决于具体需求:追求性能选
GPT-4V,需要定制化或成本控制选开源模型。
Q6: 如何评估多模态模型的性能?
A: 不同任务有不同指标:图像描述用 BLEU 、 METEOR 、 CIDEr; VQA
用准确率;图像-文本检索用 Recall@K
。还可以进行人工评估,检查生成内容的准确性和流畅性。
Q7: 多模态预训练需要多少数据?
A: 大规模预训练通常需要数千万到数十亿的图像-文本对。 CLIP 使用了 4
亿对, BLIP 使用了 1.29
亿对。对于特定领域,可以使用领域数据微调,数据量可以更少(几万到几十万)。
Q8: 如何解决多模态模型的内存占用问题?
A: 可以采用以下策略: 1) 使用量化( 4-bit/8-bit); 2)
使用梯度检查点; 3) 使用参数高效微调( LoRA); 4) 使用模型并行; 5)
使用更小的模型变体。
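例如,可以用 transformers 的 BitsAndBytesConfig 以 4-bit 量化加载前文的 BLIP-2 模型(需要额外安装 bitsandbytes,以下仅为示意):

import torch
from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 权重量化为 4-bit
    bnb_4bit_compute_dtype=torch.float16   # 计算时使用 fp16
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=bnb_config,
    device_map="auto"
)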
Q9: 多模态模型可以处理哪些类型的输入?
A: 常见支持:图像( JPG 、 PNG)、文本。部分模型支持:音频(
Whisper)、视频( Video-ChatGPT)。未来趋势是支持更多模态,如 3D
模型、点云等。
Q10: 如何构建一个生产级的多模态应用?
A: 关键步骤: 1) 选择合适的模型(平衡性能和成本); 2)
实现高效的推理服务(批处理、缓存); 3) 添加监控和日志; 4)
实现错误处理和降级策略; 5) 使用容器化部署( Docker); 6)
实现负载均衡和自动扩缩容。