As large language models continue to grow in size, the cost of full
fine-tuning has become increasingly prohibitive. Fine-tuning a model
with billions of parameters requires updating all parameters, which not
only demands massive computational resources but can also lead to
catastrophic forgetting. To address these challenges,
Parameter-Efficient Fine-Tuning (PEFT) techniques have emerged.
PEFT techniques achieve performance close to full fine-tuning by
updating only a small fraction of model parameters. Methods like LoRA
(Low-Rank Adaptation), QLoRA, Adapter, and Prefix-Tuning are
representative examples. These approaches not only dramatically reduce
computational costs but also make it possible to fine-tune large models
on consumer-grade hardware.
This article delves into the differences between full fine-tuning and
frozen fine-tuning, provides detailed explanations of PEFT techniques
including LoRA, QLoRA, Adapter, Prefix-Tuning, and P-Tuning v2,
introduces alignment techniques like Instruction Tuning and RLHF
(Reinforcement Learning from Human Feedback), and demonstrates how to
fine-tune large models using the HuggingFace PEFT library through
practical examples.
Full Fine-tuning vs Frozen Fine-tuning
Full Fine-tuning
Full fine-tuning refers to updating all parameters of a pre-trained
model. This is the most straightforward approach but also the most
expensive.
```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Full fine-tuning example
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set all parameters as trainable
for param in model.parameters():
    param.requires_grad = True

# Training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(num_epochs):
    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
Advantages:

- Theoretically achieves the best performance
- The model can fully adapt to the target task

Disadvantages:

- Requires massive computational resources (GPU memory, training time)
- Prone to overfitting
- May cause catastrophic forgetting
- Each task requires saving a complete model copy
Frozen Fine-tuning
Frozen fine-tuning freezes most parameters of the pre-trained model
and only trains partial layers (typically the top layers).
```python
# Frozen fine-tuning example
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Only train the top layers (e.g., the last 2 layers)
for param in model.transformer.h[-2:].parameters():
    param.requires_grad = True

# Or only train a classification head
classifier = nn.Linear(model.config.n_embd, num_labels)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-4)
```
Advantages:

- Far lower memory and compute requirements than full fine-tuning
- Faster training
- Reduced risk of catastrophic forgetting, since most weights are untouched

Disadvantages:

- Performance may be inferior to full fine-tuning
- Requires careful selection of which layers to unfreeze
- Lower flexibility
Parameter Efficiency Comparison

Assuming a model has N parameters:

- Full fine-tuning: all N parameters need to be updated
- Frozen fine-tuning: only a subset of parameters is updated (e.g., the last few layers)
- PEFT methods: typically only a tiny fraction of parameters is updated (e.g., LoRA updates only 0.1%-1% of parameters)
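To make the gap concrete, here is a back-of-envelope count for a hypothetical 3-layer MLP (the layer sizes are made up purely for illustration):

```python
# Hypothetical layer shapes (in_features, out_features), chosen only for illustration
layer_shapes = [(100, 100), (100, 100), (100, 10)]

def n_params(shapes):
    # Each Linear layer has in*out weights plus out biases
    return sum(i * o + o for i, o in shapes)

full = n_params(layer_shapes)         # full fine-tuning updates every parameter
frozen = n_params(layer_shapes[-1:])  # frozen fine-tuning: only the last layer

print(full, frozen)  # 21210 1010
```

Even in this toy setting, freezing all but the last layer cuts the trainable parameters by more than 95%.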
LoRA: Low-Rank Adaptation
LoRA (Low-Rank Adaptation) is one of the most popular PEFT methods
today. Its core idea is: instead of directly updating the
original weight matrix, learn a low-rank decomposed incremental
update.
LoRA Principle
For a pre-trained weight matrix W0 ∈ R^(d×k), LoRA does not update W0 directly; instead it learns two low-rank matrices B ∈ R^(d×r) and A ∈ R^(r×k), where r << min(d, k).

During forward propagation, the effective weight is W = W0 + BA, where ΔW = BA is the low-rank update.
Parameter Efficiency:

- Original matrix parameters: d × k
- LoRA parameters: r × (d + k)
- Parameter reduction ratio: r(d + k) / (d × k). When r << min(d, k), the parameter count is dramatically reduced.

For example, with d = k = 1024 and r = 8:

- Original parameters: 1,048,576
- LoRA parameters: 16,384
- Reduction: approximately 98.4%
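These numbers follow directly from the formulas above:

```python
# Parameter count for one d x k weight matrix with LoRA rank r
d, k, r = 1024, 1024, 8

original = d * k        # W0: frozen, d*k parameters
lora = r * (d + k)      # B (d x r) plus A (r x k)
reduction = 100 * (1 - lora / original)

print(original, lora, round(reduction, 1))  # 1048576 16384 98.4
```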
LoRA Implementation
Problem Context: Full fine-tuning requires updating all parameters, which is expensive. LoRA dramatically reduces the number of trainable parameters by learning low-rank weight increments.

Solution Approach: Instead of directly updating the original weight matrix W0, learn two low-rank matrices B and A such that ΔW = BA. During forward propagation, compute h = W0·x + (α/r)·BA·x, where α is a scaling factor.
Design Considerations:

- Matrix A is initialized with small random values and B is initialized to zero, ensuring the LoRA update ΔW = BA is zero at the start of training
- The rank r controls the low-rank dimension and is typically chosen between 4 and 32
- Alpha controls the strength of the LoRA update and is typically set as a multiple of the rank
Design Trade-offs:

- ✅ Pros: dramatically fewer trainable parameters (typically <1%), the update can be merged into the original weights, inference speed is unchanged
- ⚠️ Note: rank selection matters; too small a rank may lack expressiveness, while too large a rank loses the parameter-efficiency advantage
Common Questions:

- Q: Why initialize B to zero? A: It ensures the LoRA output is zero initially, so the pretrained model's behavior is unaffected
- Q: How to choose the rank? A: Typically start with 8 or 16 and adjust based on task complexity. Simple tasks can use 4; complex tasks may need 32 or 64
- Q: How to set alpha? A: Typically a multiple of the rank; the rule of thumb alpha = 2 × rank usually works well
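The usage example below wraps attention projections in a LoRALinear module that the article does not define. A minimal sketch of such a module, following the initialization and scaling rules described above (the class name and constructor signature are assumptions inferred from the usage), might look like:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (sketch)."""
    def __init__(self, linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # freeze W0
        self.scaling = alpha / rank
        # A: small random init, B: zero init, so BA = 0 at the start
        self.lora_A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.linear(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Because B starts at zero, the wrapped layer initially computes exactly the same output as the frozen base layer.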
Usage Example:
```python
# Apply LoRA to attention layers
class AttentionWithLoRA(nn.Module):
    def __init__(self, attention_layer, rank=8, alpha=16):
        super().__init__()
        self.attention = attention_layer
        # Apply LoRA only to Q, K, V projections
        self.q_proj_lora = LoRALinear(attention_layer.q_proj, rank, alpha)
        self.k_proj_lora = LoRALinear(attention_layer.k_proj, rank, alpha)
        self.v_proj_lora = LoRALinear(attention_layer.v_proj, rank, alpha)

    def forward(self, x):
        # Use LoRA versions of the projection layers
        q = self.q_proj_lora(x)
        k = self.k_proj_lora(x)
        v = self.v_proj_lora(x)
        # ... attention computation ...
```
Advantages of LoRA

- Parameter Efficient: only updates a small number of parameters (typically <1%)
- Modular: LoRA adapters can easily be added or removed
- Multi-task: different LoRA adapters can be trained for different tasks
- Performance Close to Full Fine-tuning: achieves 90%+ of full fine-tuning performance on most tasks
LoRA Hyperparameter Selection
- rank (r): the rank of the low-rank decomposition, typically 4, 8, 16, or 32. A larger rank means stronger expressiveness but more parameters.
- alpha: the scaling factor, typically set as a multiple of the rank (e.g., rank=8, alpha=16). A larger alpha means a greater influence of the LoRA updates.

Rule of thumb: alpha = 2 × rank usually works well.
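With the HuggingFace PEFT library, these hyperparameters map directly onto LoraConfig. A setup sketch (the target module name "c_attn" is GPT-2's fused QKV projection; the right names vary by model architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank decomposition
    lora_alpha=16,              # scaling factor (alpha = 2 * rank)
    target_modules=["c_attn"],  # model-specific; GPT-2's attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
```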
QLoRA: Quantized LoRA
QLoRA (Quantized LoRA) combines quantization with LoRA, further
reducing memory requirements.
QLoRA Principle
Core innovations of QLoRA:
- 4-bit Quantization: quantize model weights to 4-bit precision
- NF4 Quantization: use the NormalFloat4 quantization format
- Double Quantization: quantize the quantization constants themselves
- Paged Optimizer: use a paged AdamW optimizer
Memory Savings:

- FP16 full fine-tuning: 2 bytes per parameter
- QLoRA: approximately 0.5 bytes per parameter (4-bit) plus the LoRA parameters

For a 7B model, the weights alone require:

- FP16: approximately 14 GB
- QLoRA: approximately 3-4 GB
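These figures follow from simple bytes-per-parameter arithmetic (decimal GB, weights only; optimizer state and activations come on top):

```python
# Back-of-envelope weight-memory estimate for a 7B-parameter model
n_params = 7e9

fp16_gb = n_params * 2 / 1e9     # FP16: 2 bytes per parameter
qlora_gb = n_params * 0.5 / 1e9  # 4-bit: ~0.5 bytes per parameter

print(fp16_gb, qlora_gb)  # 14.0 3.5
```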
```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer
```
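Using these imports, a typical QLoRA loading sketch looks like the following (the model name and hyperparameters are placeholders, not the article's original values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, per the QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
```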
Adapter

Adapters insert small bottleneck modules inside each Transformer layer; only the adapter parameters are trained.

```python
class TransformerLayerWithAdapter(nn.Module):
    """Transformer layer with Adapter"""
    def __init__(self, transformer_layer, adapter_size=64):
        super().__init__()
        self.transformer_layer = transformer_layer
        self.adapter = Adapter(
            transformer_layer.self_attn.embed_dim,
            adapter_size
        )

    def forward(self, x, **kwargs):
        # Original Transformer layer
        x = self.transformer_layer(x, **kwargs)[0]
        # Add Adapter
        x = self.adapter(x)
        return (x,)
```
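The Adapter class used above is not defined in the article; a minimal bottleneck sketch (names and initialization are assumptions) is:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden_size, adapter_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_size)
        self.up = nn.Linear(adapter_size, hidden_size)
        self.act = nn.GELU()
        # Zero-init of the up-projection keeps the adapter at identity initially
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # Residual connection around the bottleneck
        return x + self.up(self.act(self.down(x)))
```

The residual connection plus zero-initialized up-projection means an untrained adapter passes its input through unchanged, mirroring LoRA's zero-init of B.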
Adapter vs LoRA

| Feature | Adapter | LoRA |
| --- | --- | --- |
| Insertion Position | Inside Transformer layer | Beside weight matrix |
| Parameter Count | Medium (~0.5% per Adapter) | Less (typically <1%) |
| Inference Speed | Slightly slower (extra forward computation) | Faster (can merge into weights) |
| Flexibility | Medium | High (easy to combine) |
Prefix-Tuning
Prefix-Tuning adapts to tasks by adding learnable continuous prefixes
to the input sequence.
Prefix-Tuning Principle
For an input sequence x = [x1, ..., xn], Prefix-Tuning prepends m learnable prefix vectors p = [p1, ..., pm], forming [p1, ..., pm, x1, ..., xn]. The prefix vectors are trainable parameters, while the original model parameters remain frozen.
Attention Computation:

In the attention mechanism, the prefixes participate in the key and value computation: Attention(Q, K', V'), where K' = [P_K; K] and V' = [P_V; V] include the prefix parts (P_K and P_V are the learned prefix keys and values).
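A single-head sketch of this mechanism (all names and the prefix length are illustrative assumptions) shows how the learned prefixes are concatenated onto K and V while the projections stay frozen in practice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    """Single-head attention with learnable prefix keys/values (sketch)."""
    def __init__(self, d_model, prefix_len=10):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Trainable prefixes: the only new parameters added by Prefix-Tuning
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, x):
        B = x.size(0)
        q = self.q_proj(x)
        # K' = [P_K; K], V' = [P_V; V]
        k = torch.cat([self.prefix_k.expand(B, -1, -1), self.k_proj(x)], dim=1)
        v = torch.cat([self.prefix_v.expand(B, -1, -1), self.v_proj(x)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return attn @ v
```

The output sequence length is unchanged; only the keys and values the queries attend over are extended by the prefix.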
Instruction Tuning

Instruction tuning is a key technique for making models follow instructions. By fine-tuning on instruction-response pairs, models learn to understand and execute a wide variety of instructions.
Instruction Data Format
```python
instruction_data = [
    {
        "instruction": "Explain what machine learning is",
        "input": "",
        "output": "Machine learning is a branch of artificial intelligence that enables computers to learn from data..."
    },
    {
        "instruction": "Translate the following English to Chinese",
        "input": "Hello, how are you?",
        "output": "你好,你好吗?"
    },
    {
        "instruction": "Summarize the following article",
        "input": "[Article content]",
        "output": "[Summary]"
    }
]
```
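Examples in this format are typically rendered into a single training string before tokenization. A simple template in the style popularized by Alpaca (the exact markers are an assumption, not the article's original format) could be:

```python
def format_example(ex):
    # Alpaca-style template, assumed for illustration
    if ex["input"]:
        prompt = (f"### Instruction:\n{ex['instruction']}\n\n"
                  f"### Input:\n{ex['input']}\n\n### Response:\n")
    else:
        prompt = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n"
    return prompt + ex["output"]

example = {
    "instruction": "Translate the following English to Chinese",
    "input": "Hello, how are you?",
    "output": "你好,你好吗?",
}
text = format_example(example)
```

Examples with an empty "input" field simply omit the Input section, so one template handles both shapes of data.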