Transfer Learning (9): Parameter-Efficient Fine-Tuning
Chen Kai

How do you fine-tune GPT-3 with 175 billion parameters on a single GPU? When you need to customize models for 100 different tasks, how do you avoid storing 100 complete copies? Parameter-Efficient Fine-Tuning (PEFT) provides the answer: update only a small fraction of model parameters to achieve comparable results to full fine-tuning.

This article systematically explains the design philosophy and implementation details of mainstream PEFT methods including LoRA, Adapter, and Prefix-Tuning, starting from the mathematical principles of low-rank adaptation. We analyze trade-offs between parameter efficiency, computational cost, and performance, and provide complete code (200+ lines) for implementing LoRA from scratch.

Motivation for Parameter-Efficient Fine-Tuning

The Dilemma of Full Fine-Tuning

Traditional transfer learning adopts full fine-tuning:

$$\theta^* = \arg\min_{\theta} \mathcal{L}(\mathcal{D}_{\text{task}}; \theta)$$

where $\theta$ includes all model parameters.

Problems:

  1. Memory explosion: Fine-tuning GPT-3 (175B parameters) requires ~700 GB of memory for FP32 weights alone, before gradients and optimizer states
  2. Storage cost: Storing a complete model copy for each task requires 70TB for 100 tasks
  3. Computational inefficiency: Even when fine-tuning only the last few layers, the entire network must be forward propagated
  4. Catastrophic forgetting: Large parameter updates easily damage pre-trained knowledge

Core Idea of Parameter-Efficient Fine-Tuning

Assumption: Pre-trained models have learned general representations; task adaptation requires adjusting only a small number of parameters.

Formalized as:

$$\theta = \theta_0 + \Delta\theta$$

where $\Delta\theta$ is the task-specific parameter increment satisfying $\|\Delta\theta\|_0 \ll \|\theta_0\|_0$. The PEFT goal: optimize only $\Delta\theta$, freeze $\theta_0$.

Definition of Parameter Efficiency

Parameter efficiency is defined via the ratio of trainable parameters:

$$\text{efficiency} = 1 - \frac{|\theta_{\text{trainable}}|}{|\theta_{\text{total}}|}$$

Efficiency of typical PEFT methods:

| Method | Trainable Parameters | Efficiency |
| --- | --- | --- |
| Full Fine-Tuning | 100% | 0% |
| BitFit | ~0.1% | 99.9% |
| Adapter | ~0.5-2% | 98-99.5% |
| LoRA | ~0.1-1% | 99-99.9% |
| Prefix-Tuning | ~0.1% | 99.9% |

LoRA: Low-Rank Adaptation

Mathematical Principles of LoRA

LoRA (Low-Rank Adaptation) [1] rests on one core insight:

Assumption: Updates $\Delta W$ to pre-trained weight matrices have a low-rank structure.

Formalized as:

$$W = W_0 + \Delta W = W_0 + BA$$

where:

- $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pre-trained weight
- $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ are trainable low-rank factors
- $r \ll \min(d, k)$ is the rank (typical values: 1-64)

Parameter comparison:

- Original matrix: $d \times k$ parameters
- LoRA increment: $r(d + k)$ parameters
- Parameter ratio: $\frac{r(d+k)}{dk}$

Example: With $d = k = 4096$ and $r = 8$, the parameter ratio is $\frac{8 \cdot 8192}{4096^2} \approx 0.39\%$.
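The ratio above can be computed directly; a tiny helper (function name is illustrative):

```python
def lora_param_ratio(d: int, k: int, r: int) -> float:
    """Ratio of LoRA trainable parameters r*(d+k) to the full matrix's d*k."""
    return r * (d + k) / (d * k)

# For a 4096x4096 projection with rank 8: the increment has r*(d+k) = 65,536
# parameters versus d*k = 16,777,216 for the full matrix.
ratio = lora_param_ratio(4096, 4096, 8)
print(f"{ratio:.2%}")  # 0.39%
```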

Why Does the Low-Rank Assumption Hold?

Intrinsic Dimensionality Theory

Aghajanyan et al. [2] showed that neural network learning occurs in a low-dimensional subspace.

Let the model parameters be $\theta \in \mathbb{R}^D$. There exists a low-dimensional projection $P \in \mathbb{R}^{D \times d}$ ($d \ll D$) such that:

$$\theta = \theta_0 + P\,\theta_d$$

Optimization can then be performed in the $d$-dimensional space rather than the $D$-dimensional space.

Empirical Verification

Perform singular value decomposition on a pre-trained weight matrix:

$$W_0 = U \Sigma V^\top$$

Observing the singular value distribution: the first few singular values are much larger than the rest, indicating the weight matrix is close to low-rank.
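A quick numerical illustration of this intuition on a synthetic matrix — a rank-$r$ product plus small noise stands in for a real weight matrix (an actual verification would decompose pre-trained weights):

```python
import torch

torch.manual_seed(0)
d, k, r = 512, 512, 8

# Synthetic stand-in: an exactly rank-r matrix perturbed by small noise.
low_rank = torch.randn(d, r) @ torch.randn(r, k)
W = low_rank + 0.01 * torch.randn(d, k)

# Singular values in descending order; the top r should dominate the spectrum.
S = torch.linalg.svdvals(W)
top, rest = S[:r].sum(), S[r:].sum()
print(f"top-{r} singular values carry {top / (top + rest):.1%} of the spectrum")
```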

LoRA Implementation Details

Initialization Strategy

- $A$ uses Gaussian initialization: $A \sim \mathcal{N}(0, \sigma^2)$
- $B$ is initialized to zero: $B = 0$

This ensures $\Delta W = BA = 0$ at the start of training, so the model behaves identically to the pre-trained model.

Scaling Factor

To control the update magnitude, introduce a scaling factor:

$$\Delta W = \frac{\alpha}{r} BA$$

where $\alpha$ is a hyperparameter (typical value: $\alpha = r$, i.e., a scaling factor of 1).

Application Locations

In Transformers, LoRA is typically applied to:

  1. Query and Value projections: $W_q$, $W_v$ (recommended)
  2. All linear layers: $W_q$, $W_k$, $W_v$, $W_o$, and the FFN (best performance)
  3. Only the Value projection: $W_v$ (most lightweight)

Forward Propagation

$$h = W_0 x + \frac{\alpha}{r} B(Ax)$$

Computation order: compute $Ax \in \mathbb{R}^r$ first, then $B(Ax)$, avoiding explicit construction of $\Delta W = BA$ (saves memory).

Merging at Inference

After training, LoRA weights can be merged into the original weights:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$$

This adds no computational overhead at inference; the merged model is architecturally identical to a fully fine-tuned one.

Advantages and Limitations of LoRA

Advantages:

  1. Memory-friendly: Only gradients and optimizer states for $A$ and $B$ must be stored, reducing that memory to roughly $\frac{r(d+k)}{dk}$ of the original
  2. Modular: The $(A, B)$ pairs for different tasks can be stored and switched independently
  3. No inference latency: After merging, completely equivalent to full fine-tuning
  4. Training acceleration: Fewer parameters mean faster gradient computation

Limitations:

  1. Rank selection: $r$ too small limits performance; $r$ too large loses the efficiency advantage
  2. Not applicable to all layers: Limited effect on embedding or output layers
  3. Insufficient theoretical guarantees: Low-rank assumption may not hold for some tasks

Adapter: Bottleneck Architecture

Adapter Design

Adapter [3] inserts small bottleneck modules into each Transformer layer:

$$\text{Adapter}(h) = h + W_{\text{up}}\, \sigma(W_{\text{down}}\, h)$$

where:

- $h \in \mathbb{R}^d$ is the input feature
- $W_{\text{down}} \in \mathbb{R}^{m \times d}$ is the down-projection (dimension reduction)
- $W_{\text{up}} \in \mathbb{R}^{d \times m}$ is the up-projection (dimension restoration)
- $\sigma$ is a nonlinear activation (e.g., ReLU or GELU)
- $m \ll d$ is the bottleneck dimension (typical value: 64)

Parameter count: $2md$ per Adapter module (assuming biases are negligible).
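The bottleneck design takes only a few lines of PyTorch. A minimal sketch (class name is illustrative; the zero-initialized up-projection follows the common practice of starting the adapter as an identity map):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: h + W_up * GELU(W_down * h)."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapter starts as the identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

adapter = Adapter(d_model=256, bottleneck=64)
x = torch.randn(2, 10, 256)
assert torch.equal(adapter(x), x)  # identity at initialization
# Parameter count: 2 * m * d weights plus m + d biases.
print(sum(p.numel() for p in adapter.parameters()))  # 33088
```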

Adapter Insertion Locations

In Transformer Blocks, Adapters are typically inserted at two positions:

  1. After Multi-Head Attention:

    h = h + Attention(h)
    h = h + Adapter(LayerNorm(h))
    h = h + FFN(LayerNorm(h))

  2. After Feed-Forward Network:

    h = h + Attention(LayerNorm(h))
    h = h + FFN(LayerNorm(h))
    h = h + Adapter(LayerNorm(h))

Dual-insertion version (serial Adapter):

h = h + Adapter₁(Attention(h))
h = h + Adapter₂(FFN(h))

Parallel Adapter

To reduce inference latency, He et al. [4] proposed the parallel Adapter:

$$h' = h + \mathrm{FFN}(\mathrm{LN}(h)) + \mathrm{Adapter}(\mathrm{LN}(h))$$

The Adapter computes in parallel with the FFN, avoiding the serial dependency.

Adapter vs LoRA

| Dimension | Adapter | LoRA |
| --- | --- | --- |
| Parameter location | New inserted module | Modifies existing weights |
| Inference latency | Yes (serial) | No (can merge) |
| Training stability | High | Moderate |
| Implementation complexity | Low | Moderate |
| Use cases | Encoder models (BERT) | Generative models (GPT) |

Prefix-Tuning: Soft Prompt Optimization

Core Idea of Prefix-Tuning

Prefix-Tuning [5] doesn't modify model parameters; instead, it prepends trainable "virtual tokens" to the input sequence.

Formalized as:

$$P = \{p_1, p_2, \ldots, p_m\} \in \mathbb{R}^{m \times d}$$

where $m$ is the prefix length (typical values: 10-100) and $d$ is the hidden dimension.

Forward propagation:

$$h = \mathrm{Transformer}([P; X])$$

Only $P$ is trainable; all model parameters are frozen.

Prefix Parameterization

Direct Optimization (Unstable)

Directly optimizing $P$ easily leads to training instability.

MLP Reparameterization

Use an MLP to map low-dimensional vectors to the high-dimensional prefix:

$$P = \mathrm{MLP}(P')$$

where $P' \in \mathbb{R}^{m \times d'}$ with $d' \ll d$.

Optimize $P'$ and the MLP during training; at inference, keep only the computed $P$.
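A minimal sketch of this reparameterization (class name and dimensions are illustrative; real implementations produce per-layer Key/Value prefixes rather than a single matrix):

```python
import torch
import torch.nn as nn

class ReparamPrefix(nn.Module):
    """Prefix P = MLP(P'): optimize a small P' plus an MLP during training."""
    def __init__(self, prefix_len: int = 10, d_small: int = 64, d_model: int = 512):
        super().__init__()
        self.p_small = nn.Parameter(torch.randn(prefix_len, d_small))
        self.mlp = nn.Sequential(
            nn.Linear(d_small, d_model),
            nn.Tanh(),
            nn.Linear(d_model, d_model),
        )

    def forward(self) -> torch.Tensor:
        return self.mlp(self.p_small)  # (prefix_len, d_model)

prefix = ReparamPrefix()
P = prefix()           # recomputed each training step
print(P.shape)         # torch.Size([10, 512])
P_cached = P.detach()  # at inference, discard P' and the MLP; keep only P
```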

Prefix-Tuning vs Prompt-Tuning

| Aspect | Prefix-Tuning | Prompt-Tuning |
| --- | --- | --- |
| Insertion location | Every layer | Input layer only |
| Parameters | $O(L \cdot m \cdot d)$ (all $L$ layers) | $O(m \cdot d)$ |
| Performance | Better | Moderate |
| Applicable models | Encoder+Decoder | Decoder only |

P-Tuning v2

P-Tuning v2 [6] extends Prefix-Tuning by adding prefixes to the Key and Value of every layer:

$$K^{(l)} = [P_K^{(l)}; K^{(l)}], \qquad V^{(l)} = [P_V^{(l)}; V^{(l)}]$$

Each layer $l$ has independent prefixes $P_K^{(l)}, P_V^{(l)}$, significantly improving performance.

Prompt-Tuning: Pure Soft Prompts

Simplified Design of Prompt-Tuning

Prompt-Tuning [7] further simplifies the approach by adding soft prompts only at the input layer:

$$X' = [P; X] = [p_1, \ldots, p_m, x_1, \ldots, x_n]$$

Trainable parameters: $P \in \mathbb{R}^{m \times d}$, only $m \times d$ parameters.
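The concatenation above is a one-liner in PyTorch. A minimal sketch (class name is illustrative):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt-Tuning: prepend m trainable embeddings to the input embeddings."""
    def __init__(self, m: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(m, d_model) * 0.02)

    def forward(self, x_embed: torch.Tensor) -> torch.Tensor:
        # x_embed: (batch, seq_len, d_model) -> (batch, m + seq_len, d_model)
        batch = x_embed.size(0)
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, x_embed], dim=1)

sp = SoftPrompt(m=20, d_model=512)
x = torch.randn(4, 32, 512)
print(sp(x).shape)                              # torch.Size([4, 52, 512])
print(sum(p.numel() for p in sp.parameters()))  # m * d = 10240
```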

Initialization Strategies

  1. Random initialization: $p_i \sim \mathcal{N}(0, \sigma^2)$

  2. Word embedding initialization: Select embeddings of relevant words from vocabulary

  3. Class label initialization: Use embeddings of class names

Experiments show: For large models (>10B parameters), initialization strategy has little impact; small models are sensitive to initialization.

Effect of Length

Relationship between prompt length $m$ and performance:

  • Small models (<1B): Larger $m$ is better; typically on the order of $m \geq 100$
  • Large models (>10B): A small $m$ (around 20 or fewer) achieves good results

Reason: Large models have strong expressive power, few prompts are sufficient to guide behavior.

Theoretical Explanation of Prompt-Tuning

From an optimization perspective, Prompt-Tuning is equivalent to finding optimal perturbations in input space:

$$\min_{P}\; \mathcal{L}\big(f_{\theta_0}([P; X]),\, y\big)$$

This is input-space optimization, not parameter-space optimization.

BitFit: Bias-Only Fine-Tuning

BitFit's Minimalism

BitFit [8] proposed an extremely simplified PEFT method: fine-tune only the bias terms.

In Transformers, all linear layers have a bias term:

$$h = Wx + b$$

BitFit freezes $W$ and optimizes only $b$.

Parameter count: Assuming each layer has $O(d)$ bias parameters (the Query, Key, Value, and Output projections each contribute a bias in $\mathbb{R}^d$), an $L$-layer model has $O(L \cdot d)$ bias parameters, accounting for ~0.1% of the total.
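Bias-only freezing is easy to apply to any PyTorch model; a minimal sketch (helper name is illustrative):

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """Freeze everything except parameters whose name ends in 'bias' (BitFit)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
apply_bitfit(model)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```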

Why Is Bias-Only Effective?

Special Nature of Bias

Bias can be understood as a task-specific global offset:

$$h = Wx + (b_0 + \Delta b)$$

i.e., each layer's output is shifted by a learned, task-specific amount $\Delta b$.

Empirical Evidence

Experiments show: BitFit approaches full fine-tuning performance in few-shot scenarios (especially for large models).

Reason: Pre-trained model weights already encode general knowledge, bias adjustment is sufficient to adapt to new tasks.

Limitations of BitFit

  1. Poor for small models: For models <1B parameters, BitFit is significantly weaker than other PEFT methods
  2. Limited for complex tasks: Tasks requiring significant feature representation changes (e.g., domain transfer), BitFit is inadequate
  3. Cannot utilize low-rank structure: Bias is a vector, cannot leverage low-rank assumptions like LoRA

(IA)³: Activation Scaling

(IA)³ Design

(IA)³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) [9] adapts to tasks by scaling activations:

$$h' = \ell \odot h$$

where $\odot$ is element-wise multiplication and $\ell \in \mathbb{R}^d$ is a trainable scaling vector (initialized to all ones).

In Transformers, scaling is applied at three locations:

  1. Attention's Keys and Values: $K' = \ell_K \odot K$, $\; V' = \ell_V \odot V$

  2. FFN's intermediate activation: $h' = \ell_{ff} \odot \sigma(W_1 x)$

Parameter count: $2d + d_{ff}$ per layer ($\ell_K, \ell_V \in \mathbb{R}^d$, $\ell_{ff} \in \mathbb{R}^{d_{ff}}$), so $L(2d + d_{ff})$ for an $L$-layer model, accounting for ~0.01% of parameters.

Advantages of (IA)³

  1. Ultimate efficiency: Parameter count is an order of magnitude less than LoRA
  2. No inference latency: Scaling operation has almost no overhead
  3. Numerical stability: Initialized to 1, smooth training process

Intuition of Scaling

Scaling can be understood as feature selection:

- $\ell_i > 1$: Amplify the $i$-th feature, enhancing its importance
- $\ell_i < 1$: Suppress the $i$-th feature, reducing its influence
- $\ell_i \approx 0$: Approximately remove that dimension

By learning scaling patterns, the model can adjust relative importance of features for different tasks.

Complete Code Implementation: LoRA from Scratch

Below is a complete LoRA module implementation including LoRA replacement for linear layers, training, inference, and weight merging.

"""
LoRA from Scratch: Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
Includes: LoRA Layer, LoRA Model, Training, Inference, Weight Merging
"""

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional, List

# Set random seed
torch.manual_seed(42)
np.random.seed(42)

# ============================================================================
# LoRA Layer Implementation
# ============================================================================

class LoRALayer(nn.Module):
    """
    LoRA layer: W' = W_0 + (α/r) * BA
    """
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 4,
        alpha: float = 1.0,
        dropout: float = 0.0
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha

        # Pre-trained weight (frozen)
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False

        # Bias (frozen)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.bias.requires_grad = False

        # LoRA low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Dropout
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()

        # Initialization: Kaiming for A, zeros for B
        nn.init.kaiming_uniform_(self.lora_A, a=np.sqrt(5))
        nn.init.zeros_(self.lora_B)

        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: h = W_0 x + (α/r) BA x
        """
        # Original linear transformation (frozen)
        result = nn.functional.linear(x, self.weight, self.bias)

        # LoRA increment: apply A first, then B, avoiding explicit BA construction
        lora_out = (self.dropout(x) @ self.lora_A.t()) @ self.lora_B.t()
        lora_out = lora_out * self.scaling

        return result + lora_out

    def merge_weights(self):
        """
        Merge LoRA weights into the original weights: W_merged = W_0 + (α/r) BA
        """
        if self.rank > 0:
            delta_W = self.lora_B @ self.lora_A * self.scaling
            self.weight.data += delta_W
            # Zero out the LoRA matrices so forward() adds nothing extra
            self.lora_A.data.zero_()
            self.lora_B.data.zero_()

    def extra_repr(self) -> str:
        return (f'in_features={self.in_features}, out_features={self.out_features}, '
                f'rank={self.rank}, alpha={self.alpha}')

# ============================================================================
# Apply LoRA to Model
# ============================================================================

def apply_lora_to_linear(model: nn.Module, rank: int = 4, alpha: float = 1.0,
                         target_modules: Optional[List[str]] = None):
    """
    Replace nn.Linear modules in the model with LoRALayer.
    Args:
        model: Target model
        rank: LoRA rank
        alpha: Scaling factor
        target_modules: List of module names to replace (e.g., ['query', 'value'])
    """
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            # Check whether this layer is among the target modules
            if target_modules is None or any(target in name for target in target_modules):
                # Create the LoRA layer
                lora_layer = LoRALayer(
                    in_features=module.in_features,
                    out_features=module.out_features,
                    rank=rank,
                    alpha=alpha
                )
                # Copy the pre-trained weights
                lora_layer.weight.data = module.weight.data.clone()
                if module.bias is not None:
                    lora_layer.bias.data = module.bias.data.clone()

                # Replace the module in place
                setattr(model, name, lora_layer)
                print(f"Applied LoRA to {name}: {module.in_features} -> {module.out_features}, rank={rank}")
        else:
            # Recurse into submodules
            apply_lora_to_linear(module, rank, alpha, target_modules)

def count_parameters(model: nn.Module, trainable_only: bool = False) -> int:
    """
    Count model parameters.
    """
    if trainable_only:
        return sum(p.numel() for p in model.parameters() if p.requires_grad)
    return sum(p.numel() for p in model.parameters())

# ============================================================================
# Example Model: Simple Transformer Block
# ============================================================================

class MultiHeadAttention(nn.Module):
    """
    Simplified multi-head attention
    """
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # QKV projections
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, seq_len, d_model = x.shape

        # QKV projections
        Q = self.query(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.key(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.value(x).view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Attention scores
        scores = Q @ K.transpose(-2, -1) / np.sqrt(self.head_dim)
        attn = torch.softmax(scores, dim=-1)

        # Weighted sum
        out = attn @ V
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

        return self.out(out)

class FeedForward(nn.Module):
    """
    Feed-forward network
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.activation = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.activation(self.fc1(x)))

class TransformerBlock(nn.Module):
    """
    Simplified Transformer block
    """
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention
        attn_out = self.attention(self.norm1(x))
        x = x + self.dropout(attn_out)

        # Feed-forward
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_out)

        return x

class SimpleTransformer(nn.Module):
    """
    Simple Transformer model (for demonstrating LoRA)
    """
    def __init__(self, d_model: int = 256, num_heads: int = 4, num_layers: int = 4,
                 d_ff: int = 1024, vocab_size: int = 10000, num_classes: int = 10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)
        ])
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Embedding
        x = self.embedding(x)  # (B, L) -> (B, L, D)

        # Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Average pooling
        x = x.mean(dim=1)  # (B, L, D) -> (B, D)

        # Classification
        return self.classifier(x)

# ============================================================================
# Synthetic Dataset
# ============================================================================

class SyntheticTextDataset(Dataset):
    """
    Synthetic text classification dataset
    """
    def __init__(self, num_samples: int = 1000, seq_len: int = 32,
                 vocab_size: int = 10000, num_classes: int = 10):
        self.num_samples = num_samples
        self.seq_len = seq_len

        # Generate random data
        self.data = torch.randint(1, vocab_size, (num_samples, seq_len))
        self.labels = torch.randint(0, num_classes, (num_samples,))

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# ============================================================================
# Training Function
# ============================================================================

def train_model(model, dataloader, optimizer, criterion, device, num_epochs=10):
    """
    Train the model
    """
    model.train()
    losses = []
    accuracies = []

    for epoch in range(num_epochs):
        epoch_loss = 0
        epoch_correct = 0
        epoch_total = 0

        for batch_idx, (inputs, labels) in enumerate(dataloader):
            inputs = inputs.to(device)
            labels = labels.to(device)

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Statistics
            epoch_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            epoch_correct += (predicted == labels).sum().item()
            epoch_total += labels.size(0)

            if (batch_idx + 1) % 10 == 0:
                print(f"Epoch [{epoch+1}/{num_epochs}], Batch [{batch_idx+1}/{len(dataloader)}], Loss: {loss.item():.4f}")

        avg_loss = epoch_loss / len(dataloader)
        accuracy = 100 * epoch_correct / epoch_total
        losses.append(avg_loss)
        accuracies.append(accuracy)

        print(f"Epoch [{epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%")

    return losses, accuracies

# ============================================================================
# Visualization
# ============================================================================

def plot_training_curves(losses_baseline, accuracies_baseline,
                         losses_lora, accuracies_lora):
    """
    Plot training curve comparison
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Loss curves
    axes[0].plot(losses_baseline, marker='o', label='Full Fine-Tuning', linewidth=2)
    axes[0].plot(losses_lora, marker='s', label='LoRA', linewidth=2)
    axes[0].set_xlabel('Epoch', fontsize=12)
    axes[0].set_ylabel('Loss', fontsize=12)
    axes[0].set_title('Training Loss Comparison', fontsize=14, fontweight='bold')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Accuracy curves
    axes[1].plot(accuracies_baseline, marker='o', label='Full Fine-Tuning', linewidth=2)
    axes[1].plot(accuracies_lora, marker='s', label='LoRA', linewidth=2)
    axes[1].set_xlabel('Epoch', fontsize=12)
    axes[1].set_ylabel('Accuracy (%)', fontsize=12)
    axes[1].set_title('Training Accuracy Comparison', fontsize=14, fontweight='bold')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('lora_training_comparison.png', dpi=150, bbox_inches='tight')
    plt.close()
    print("Training curves saved to lora_training_comparison.png")

def visualize_lora_matrices(model):
    """
    Visualize the singular value distribution of LoRA matrices
    """
    lora_layers = [m for m in model.modules() if isinstance(m, LoRALayer)]

    if not lora_layers:
        print("No LoRA layers found")
        return

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes = axes.flatten()

    for idx, layer in enumerate(lora_layers[:4]):
        # Compute the singular values of BA
        BA = layer.lora_B @ layer.lora_A
        S = torch.linalg.svdvals(BA.detach().cpu())

        axes[idx].bar(range(len(S)), S.numpy())
        axes[idx].set_xlabel('Singular Value Index', fontsize=10)
        axes[idx].set_ylabel('Magnitude', fontsize=10)
        axes[idx].set_title(f'LoRA Layer {idx+1}: Singular Values of BA', fontsize=12)
        axes[idx].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('lora_singular_values.png', dpi=150, bbox_inches='tight')
    plt.close()
    print("Singular value plots saved to lora_singular_values.png")

# ============================================================================
# Main Function
# ============================================================================

def main():
    # Hyperparameters
    d_model = 256
    num_heads = 4
    num_layers = 4
    d_ff = 1024
    vocab_size = 10000
    num_classes = 10
    batch_size = 32
    num_epochs = 20
    learning_rate = 1e-3
    lora_rank = 8
    lora_alpha = 16

    # Device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Create dataset
    print("\nCreating dataset...")
    dataset = SyntheticTextDataset(num_samples=1000, vocab_size=vocab_size, num_classes=num_classes)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # ========================================================================
    # Method 1: Full Fine-Tuning (baseline)
    # ========================================================================
    print("\n" + "="*60)
    print("Method 1: Full Fine-Tuning (Baseline)")
    print("="*60)

    model_baseline = SimpleTransformer(
        d_model=d_model, num_heads=num_heads, num_layers=num_layers,
        d_ff=d_ff, vocab_size=vocab_size, num_classes=num_classes
    ).to(device)

    total_params_baseline = count_parameters(model_baseline)
    trainable_params_baseline = count_parameters(model_baseline, trainable_only=True)
    print(f"Total parameters: {total_params_baseline:,}")
    print(f"Trainable parameters: {trainable_params_baseline:,} ({100*trainable_params_baseline/total_params_baseline:.2f}%)")

    optimizer_baseline = optim.Adam(model_baseline.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    losses_baseline, accuracies_baseline = train_model(
        model_baseline, dataloader, optimizer_baseline, criterion, device, num_epochs
    )

    # ========================================================================
    # Method 2: LoRA Fine-Tuning
    # ========================================================================
    print("\n" + "="*60)
    print("Method 2: LoRA Fine-Tuning")
    print("="*60)

    model_lora = SimpleTransformer(
        d_model=d_model, num_heads=num_heads, num_layers=num_layers,
        d_ff=d_ff, vocab_size=vocab_size, num_classes=num_classes
    ).to(device)

    # Freeze all base parameters; only the LoRA matrices (added below) and the
    # task-specific classifier head remain trainable
    for param in model_lora.parameters():
        param.requires_grad = False
    for param in model_lora.classifier.parameters():
        param.requires_grad = True

    # Apply LoRA to the Query and Value projections
    apply_lora_to_linear(model_lora, rank=lora_rank, alpha=lora_alpha,
                         target_modules=['query', 'value'])

    total_params_lora = count_parameters(model_lora)
    trainable_params_lora = count_parameters(model_lora, trainable_only=True)
    print(f"\nTotal parameters: {total_params_lora:,}")
    print(f"Trainable parameters: {trainable_params_lora:,} ({100*trainable_params_lora/total_params_lora:.2f}%)")
    print(f"Parameter reduction: {100*(trainable_params_baseline - trainable_params_lora)/trainable_params_baseline:.2f}%")

    optimizer_lora = optim.Adam(
        [p for p in model_lora.parameters() if p.requires_grad],
        lr=learning_rate
    )

    losses_lora, accuracies_lora = train_model(
        model_lora, dataloader, optimizer_lora, criterion, device, num_epochs
    )

    # ========================================================================
    # Results Comparison
    # ========================================================================
    print("\n" + "="*60)
    print("Results Comparison")
    print("="*60)
    print(f"Full Fine-Tuning - Final Loss: {losses_baseline[-1]:.4f}, Final Accuracy: {accuracies_baseline[-1]:.2f}%")
    print(f"LoRA Fine-Tuning - Final Loss: {losses_lora[-1]:.4f}, Final Accuracy: {accuracies_lora[-1]:.2f}%")
    print(f"Performance gap: {accuracies_lora[-1] - accuracies_baseline[-1]:.2f}%")

    # Plot training curves
    plot_training_curves(losses_baseline, accuracies_baseline, losses_lora, accuracies_lora)

    # Visualize LoRA matrices
    visualize_lora_matrices(model_lora)

    # ========================================================================
    # Weight Merging Test
    # ========================================================================
    print("\n" + "="*60)
    print("Weight Merging Test")
    print("="*60)

    # Test on one sample
    test_input = torch.randint(1, vocab_size, (1, 32)).to(device)

    # Output before merging
    model_lora.eval()
    with torch.no_grad():
        output_before = model_lora(test_input)

    # Merge weights
    for module in model_lora.modules():
        if isinstance(module, LoRALayer):
            module.merge_weights()

    # Output after merging
    with torch.no_grad():
        output_after = model_lora(test_input)

    # Verify output consistency
    diff = torch.abs(output_before - output_after).max().item()
    print(f"Max difference between outputs before and after merging: {diff:.8f}")
    print("Weights successfully merged!" if diff < 1e-5 else "Warning: Outputs differ!")

    print("\n" + "="*60)
    print("Experiment completed!")
    print("="*60)

if __name__ == "__main__":
    main()

Code Explanation

Core Components:

  1. LoRALayer: Implements low-rank decomposition

  2. apply_lora_to_linear: Automatically replaces Linear layers in model

  3. Weight merging: Merges LoRA weights into original weights after training, no inference overhead

Experimental Design:

  1. Method 1: Full fine-tuning (baseline)
  2. Method 2: LoRA fine-tuning (rank=8)
  3. Compare parameter count, training curves, final performance

Key Details:

  • Initialization: $A$ uses Kaiming initialization, $B$ is all zeros
  • Computation order: $B(Ax)$ rather than $(BA)x$, avoiding explicit construction of $BA$
  • Weight merging: no additional overhead at inference

Method Comparison and Selection Guide

Performance Comparison

Experimental results on GLUE benchmark (RoBERTa-base, ~125M parameters):

| Method | Trainable Parameters | Average Score | Relative to Full FT |
| --- | --- | --- | --- |
| Full Fine-Tuning | 100% | 84.8 | 100% |
| BitFit | 0.1% | 82.3 | 97.1% |
| Adapter | 0.5% | 84.2 | 99.3% |
| Prefix-Tuning | 0.1% | 83.9 | 99.0% |
| LoRA (r=8) | 0.2% | 84.6 | 99.8% |
| (IA)³ | 0.01% | 83.5 | 98.5% |

Conclusion: LoRA achieves the best balance between parameter efficiency and performance.

Applicable Scenarios

LoRA suitable for:

  • Generative models (GPT, T5)
  • Large-scale models (>1B parameters)
  • Frequent task switching needed
  • Memory constrained

Adapter suitable for:

  • Encoder models (BERT, RoBERTa)
  • High training stability required
  • Inference latency insensitive
  • Implementation simplicity prioritized

Prefix-Tuning suitable for:

  • Generation tasks (summarization, translation)
  • Few-shot learning
  • Combined with prompt engineering
  • Variable input length

Prompt-Tuning suitable for:

  • Very large models (>10B parameters)
  • Zero-shot/few-shot scenarios
  • Flexible input format
  • Frequent task switching

BitFit suitable for:

  • Quick prototyping with large models
  • Ultimate parameter efficiency needs
  • Simple tasks
  • Extremely limited computational resources

(IA)³ suitable for:

  • Few-shot scenarios
  • Feature importance adjustment
  • Quick adaptation
  • Combined with other methods

Combination Strategies

Multiple PEFT methods can be combined:

  1. LoRA + Adapter: LoRA for attention, Adapter for FFN
  2. Prefix-Tuning + LoRA: Prefix adjusts input, LoRA adjusts weights
  3. BitFit + LoRA: Full fine-tune bias, low-rank fine-tune weights

Theoretical Analysis and Future Directions

Theoretical Foundations of Low-Rank Assumption

Neural Tangent Kernel Theory

In the infinite-width limit, neural network training dynamics are described by the Neural Tangent Kernel (NTK):

$$\Theta(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)$$

NTK theory suggests that, under certain initializations, weight updates $\Delta\theta$ concentrate in low-rank subspaces.

Information Bottleneck

From an information-theory perspective, effective feature representations should minimize redundancy:

$$\min\; I(X; Z) - \beta\, I(Z; Y)$$

Low-rank structure is one manifestation of this information compression.

Future Research Directions

  1. Adaptive rank selection: Automatically determine the optimal rank $r$ based on the task
  2. Structured low-rank: Further compression using tensor decomposition (Tucker, CP)
  3. Dynamic PEFT: Dynamically adjust parameter efficiency during training
  4. Hardware-friendly design: Optimize PEFT implementation for specific hardware (TPU, NPU)
  5. Multi-task PEFT: Share partial LoRA parameters, learn task correlations

Frequently Asked Questions

Q1: How to choose LoRA rank?

Empirical rules:

  • Small models (<1B): $r = 16$-$64$
  • Medium models (1B-10B): $r = 8$-$16$
  • Large models (>10B): $r = 4$-$8$

Principles:

  • High task complexity → larger $r$
  • Sufficient data → can use a larger $r$
  • Memory constrained → reduce $r$

In practice, start with $r = 8$ for testing, then adjust based on performance.

Q2: Which layers should LoRA be applied to?

Priority (high to low):

  1. Query and Value: Affects attention mechanism, most significant effect
  2. All attention projections (QKVO): Best performance, slightly more parameters
  3. FFN layers: Use in combination with attention
  4. Value only: Most lightweight, suitable for extreme resource constraints

Recommendation: Try Query+Value first, extend to all layers if performance is insufficient.

Q3: Performance gap between LoRA and full fine-tuning?

Experiments show:

  • Large models (>10B): Gap <1%
  • Medium models (1B-10B): Gap 1-3%
  • Small models (<1B): Gap may be >5%

Reason: Large models have low intrinsic dimensionality, low-rank assumption holds better.

Q4: How to set learning rate for LoRA training?

Empirical values:

  • LoRA parameters: $1 \times 10^{-4}$ to $1 \times 10^{-3}$
  • Usually 1-2 orders of magnitude higher than full fine-tuning learning rate

Reason: LoRA parameters initialized from zero, need larger learning rate for fast learning.

Q5: How to manage LoRA parameters in multi-task scenarios?

Strategies:

  1. Independent storage: One (A, B) pair per task, dynamically loaded at inference
  2. Shared base: Share A across tasks with task-specific B (or vice versa)
  3. Mixture of experts: Multiple LoRA modules, route based on input

Example: 100 tasks, each LoRA ~10MB, ~1GB total (vs. 100 × 700GB ≈ 70TB for full fine-tuning).
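The arithmetic behind this example can be checked directly (the sizes are the illustrative ones from the text):

```python
# Back-of-envelope check of the multi-task storage example above.
tasks = 100
lora_mb = 10      # one LoRA checkpoint (illustrative)
full_gb = 700     # one full FP32 copy of a ~175B model (illustrative)

lora_total_gb = tasks * lora_mb / 1024
full_total_tb = tasks * full_gb / 1024
print(f"LoRA: {lora_total_gb:.2f} GB vs full fine-tuning: {full_total_tb:.1f} TB")
```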

Q6: Does LoRA cause catastrophic forgetting?

Compared to full fine-tuning, LoRA significantly mitigates catastrophic forgetting:

  • Reason: Pre-trained weights W₀ are completely frozen, so they cannot be damaged
  • The increment ΔW = BA encodes only task-specific knowledge

Experiments: LoRA outperforms full fine-tuning in continual learning scenarios.

Q7: What is LoRA's inference speed?

  • Before merging: Slightly slower (~5%), due to the additional computation of the B(Ax) branch
  • After merging: Identical to full fine-tuning, zero overhead

Recommendation: Merge weights at deployment to maintain inference efficiency.
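The zero-overhead claim follows from W₀x + s·B(Ax) = (W₀ + s·BA)x, where s = α/r. A plain-Python sketch with tiny made-up matrices (r = 1) checking that both paths agree:

```python
# Sketch: merging LoRA weights is mathematically exact. With tiny 2x2
# toy matrices, check (W0 + s*B@A) @ x == W0 @ x + s * B @ (A @ x).
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

W0 = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weight
B  = [[0.5], [1.0]]             # (d_out, r), r = 1
A  = [[2.0, -1.0]]              # (r, d_in)
scale = 2.0                     # alpha / r
x  = [1.0, 1.0]

# Unmerged: two branches at inference time.
y_unmerged = [w + scale * b
              for w, b in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Merged: fold scale*B@A into W0 once, then a single matvec.
BA = matmul(B, A)
W = [[w + scale * ba for w, ba in zip(wr, br)] for wr, br in zip(W0, BA)]
y_merged = matvec(W, x)
```

Since the merge is exact, the only cost is a one-time weight update at deployment; the price is losing the ability to hot-swap adapters without un-merging.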

Q8: Which is better, Adapter or LoRA?

Depends on scenario:

| Dimension | Adapter | LoRA |
| --- | --- | --- |
| Model type | BERT-like encoders | GPT-like generators |
| Training stability | Stable | Needs tuning |
| Inference latency | Extra latency | None (after merge) |
| Implementation complexity | Simple | Moderate |
| Parameter efficiency | Moderate | High |

Practice: Try LoRA first, consider Adapter if it doesn't work.

Q9: Can PEFT methods be combined with quantization?

Yes! Common combinations:

  1. QLoRA: 4-bit quantization + LoRA, fine-tune 65B model on single GPU
  2. Quantized Adapter: Quantize base model, only Adapter uses FP16
  3. Mixed precision PEFT: LoRA uses FP32, others use INT8

QLoRA effect: Memory requirement reduced 4x, performance drop <2%.
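A rough sanity check of that claim for weight storage alone (hypothetical 65B model; the estimate ignores quantization constants, optimizer states, and activations, which also contribute in practice):

```python
# Rough check of the ~4x weight-memory reduction from 16-bit to 4-bit.
params = 65e9                  # hypothetical 65B-parameter model
fp16_gb = params * 2 / 1e9     # 2 bytes per weight
nf4_gb = params * 0.5 / 1e9    # 4 bits = 0.5 bytes per weight
print(f"FP16: {fp16_gb:.0f} GB, 4-bit: {nf4_gb:.1f} GB, "
      f"ratio: {fp16_gb / nf4_gb:.0f}x")
```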

Q10: Why does Prefix-Tuning need reparameterization?

Problems with directly optimizing the prefix matrix P:

  1. Training instability: Large gradient variance
  2. Slow convergence: Difficult optimization in high-dimensional space
  3. Overfitting: Parameters directly exposed to loss function

Benefits of reparameterization (P = MLP(P′)):

  • The MLP provides a regularization effect
  • The low-dimensional P′ is easier to optimize
  • Improved training stability

Q11: How effective are PEFT methods on CV tasks?

Not as effective as in NLP:

  • Reason: Vision models have higher intrinsic dimensionality, low-rank assumption not as strong
  • Improvement: Use larger rank(e.g.,)

Recent progress: Convpass, SSF and other methods designed for CV PEFT, approaching full fine-tuning performance.

Q12: How to debug PEFT training convergence issues?

Diagnostic steps:

  1. Check gradients: Are LoRA parameter gradients normal?

```python
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is not None:
        print(f"{name}: grad_norm={param.grad.norm().item():.6f}")
```

  2. Increase learning rate: LoRA needs higher lr than full fine-tuning

  3. Check initialization: B should be zero, A should be random

  4. Increase rank: An r that is too small may lack expressive power

  5. Remove Dropout: In some cases LoRA is sensitive to Dropout

Summary

This article comprehensively introduced parameter-efficient fine-tuning techniques:

  1. LoRA: Mathematical principles of low-rank decomposition and complete implementation
  2. Adapter: Bottleneck architecture design and application
  3. Prefix-Tuning: Soft prompt optimization and reparameterization
  4. Prompt-Tuning: Pure soft prompt minimalist design
  5. BitFit: Bias-only fine-tuning for ultimate efficiency
  6. (IA)³: Innovative activation scaling method
  7. Method comparison: Comprehensive analysis of performance, efficiency, and applicable scenarios
  8. Complete code: 200+ lines of production-level code implementing LoRA from scratch

PEFT technology transforms large model fine-tuning from a "luxury" to an "everyday tool", enabling fine-tuning of tens-of-billions parameter models on a single GPU. In the next chapter, we will explore continual learning and see how models can continuously learn new tasks without forgetting old knowledge.

References


  1. Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-rank adaptation of large language models. ICLR.

  2. Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2020). Intrinsic dimensionality explains the effectiveness of language model fine-tuning. ACL.

  3. Houlsby, N., Giurgiu, A., Jastrzebski, S., et al. (2019). Parameter-efficient transfer learning for NLP. ICML.

  4. He, J., Zhou, C., Ma, X., et al. (2021). Towards a unified view of parameter-efficient transfer learning. ICLR.

  5. Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. ACL.

  6. Liu, X., Ji, K., Fu, Y., et al. (2022). P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. ACL.

  7. Lester, B., Al-Rfou, R., & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. EMNLP.

  8. Zaken, E. B., Ravfogel, S., & Goldberg, Y. (2021). BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. ACL.

  9. Liu, H., Tam, D., Muqeeth, M., et al. (2022). Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. NeurIPS.

  • Post title: Transfer Learning (9): Parameter-Efficient Fine-Tuning
  • Post author: Chen Kai
  • Create time: 2024-12-21 09:15:00
  • Post link: https://www.chenk.top/transfer-learning-9-parameter-efficient-fine-tuning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.