NLP (8): Model Fine-tuning and PEFT
Chen Kai

As large language models continue to grow in size, the cost of full fine-tuning has become increasingly prohibitive. Fine-tuning a model with billions of parameters requires updating all parameters, which not only demands massive computational resources but can also lead to catastrophic forgetting. To address these challenges, Parameter-Efficient Fine-Tuning (PEFT) techniques have emerged.

PEFT techniques achieve performance close to full fine-tuning by updating only a small fraction of model parameters. Methods like LoRA (Low-Rank Adaptation), QLoRA, Adapter, and Prefix-Tuning are representative examples. These approaches not only dramatically reduce computational costs but also make it possible to fine-tune large models on consumer-grade hardware.

This article delves into the differences between full fine-tuning and frozen fine-tuning, provides detailed explanations of PEFT techniques including LoRA, QLoRA, Adapter, Prefix-Tuning, and P-Tuning v2, introduces alignment techniques like Instruction Tuning and RLHF (Reinforcement Learning from Human Feedback), and demonstrates how to fine-tune large models using the HuggingFace PEFT library through practical examples.

Full Fine-tuning vs Frozen Fine-tuning

Full Fine-tuning

Full fine-tuning refers to updating all parameters of a pre-trained model. This is the most straightforward approach but also the most expensive.

Process:

  1. Load pre-trained model weights
  2. Train on target task data
  3. Update all parameters using backpropagation
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Full fine-tuning example
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set all parameters as trainable
for param in model.parameters():
    param.requires_grad = True

# Training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(num_epochs):
    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Advantages:

  • Theoretically achieves the best performance
  • The model can fully adapt to the target task

Disadvantages:

  • Requires massive computational resources (GPU memory, training time)
  • Prone to overfitting
  • May cause catastrophic forgetting
  • Each task requires saving a complete model copy

Frozen Fine-tuning

Frozen fine-tuning freezes most parameters of the pre-trained model and only trains partial layers (typically the top layers).

# Frozen fine-tuning example
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Only train top layers (e.g., last 2 layers)
for param in model.transformer.h[-2:].parameters():
    param.requires_grad = True

# Or only train a classification head
classifier = nn.Linear(model.config.n_embd, num_labels)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-4)

Advantages:

  • Dramatically reduces trainable parameters
  • Lowers computational cost
  • Preserves pre-trained knowledge

Disadvantages:

  • Performance may be inferior to full fine-tuning
  • Requires careful selection of which layers to unfreeze
  • Lower flexibility

Parameter Efficiency Comparison

Assuming a model has N parameters:

  • Full fine-tuning: needs to update all N parameters
  • Frozen fine-tuning: only updates a small subset of parameters (e.g., the last few layers)
  • PEFT methods: typically update far fewer parameters (e.g., LoRA updates only 0.1%-1% of parameters)
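To make these ratios concrete, here is a back-of-the-envelope comparison for a hypothetical 7B-parameter model. The frozen-layer and PEFT fractions below are illustrative assumptions, not fixed properties of any method:

```python
# Rough trainable-parameter comparison for a hypothetical 7B model.
# The 5% and 0.5% fractions are illustrative assumptions.
total_params = 7_000_000_000

full_ft = total_params                  # full fine-tuning updates everything
frozen_ft = int(total_params * 0.05)    # e.g., only the last few layers (~5%)
peft = int(total_params * 0.005)        # e.g., LoRA at ~0.5%

print(f"Full fine-tuning:   {full_ft:,} trainable params")
print(f"Frozen fine-tuning: {frozen_ft:,} trainable params")
print(f"PEFT (LoRA):        {peft:,} trainable params")
```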

LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) is one of the most popular PEFT methods today. Its core idea is: instead of directly updating the original weight matrix, learn a low-rank decomposed incremental update.

LoRA Principle

For a pre-trained weight matrix W₀ ∈ R^(d×k), LoRA doesn't directly update W₀, but instead learns two low-rank matrices B ∈ R^(d×r) and A ∈ R^(r×k), where r ≪ min(d, k).

During forward propagation, the actual weight used is W = W₀ + ΔW, where ΔW = BA is the low-rank update.

Parameter Efficiency:

  • Original matrix parameters: d × k
  • LoRA parameters: r × (d + k)
  • Parameter reduction ratio: r(d + k) / (d × k). When r ≪ min(d, k), parameters are dramatically reduced. For example, with d = k = 1024 and r = 8:
  • Original parameters: 1,048,576
  • LoRA parameters: 16,384
  • Reduction ratio: approximately 98.4%
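The arithmetic above can be checked directly:

```python
# Verify the LoRA parameter arithmetic for d = k = 1024, r = 8
d, k, r = 1024, 1024, 8

original = d * k              # parameters in the full weight matrix
lora = r * (d + k)            # parameters in B (d x r) plus A (r x k)
reduction = 1 - lora / original

print(original)               # 1048576
print(lora)                   # 16384
print(f"{reduction:.1%}")     # 98.4%
```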

LoRA Implementation

Problem Context: Full fine-tuning requires updating all parameters, which is expensive. LoRA dramatically reduces the number of trainable parameters by learning low-rank decomposed weight increments.

Solution Approach: Instead of directly updating the original weight matrix W₀, learn two low-rank matrices B and A such that ΔW = BA. During forward propagation, compute h = W₀x + (α/r)BAx, where α is a scaling factor.

Design Considerations:

  • Matrix A is initialized with small random values and B is initialized to zero, ensuring the LoRA update is zero initially
  • The rank r controls the low-rank dimension, typically chosen between 4 and 32
  • Alpha (α) controls the strength of LoRA updates, typically set as a multiple of the rank

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALayer(nn.Module):
    """
    LoRA layer implementation: Low-Rank Adaptation

    Problem: How to efficiently fine-tune large models?
    Solution: Learn low-rank decomposed weight increments instead of directly updating original weights

    Principle: W = W_0 + ΔW = W_0 + BA
    where B ∈ R^(out_features × r), A ∈ R^(r × in_features)
    Parameters: r × (out_features + in_features) << out_features × in_features
    """
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        """
        Args:
            in_features: Input feature dimension
            out_features: Output feature dimension
            rank: Low-rank rank r, controls LoRA capacity (typically 4-32)
            alpha: Scaling factor, controls LoRA update strength (typically a multiple of rank)
        """
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # Low-rank matrices A and B
        # A: [r, in_features], initialized with small random values
        # (scaling the Gaussian init by 0.02 keeps the initial updates small)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)

        # B: [out_features, r], initialized to zero
        # Ensures the LoRA output is zero initially, so the original model behavior is unaffected
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Scaling factor: alpha / rank
        # Controls LoRA update strength: a larger alpha means greater LoRA influence
        self.scaling = alpha / rank

    def forward(self, x, original_weight):
        """
        Forward pass: Compute W_0 x + (alpha/r) B A x

        Args:
            x: Input tensor, shape: [batch_size, ..., in_features]
            original_weight: Original weight matrix W_0, shape: [out_features, in_features]

        Returns:
            Output tensor, shape: [batch_size, ..., out_features]
        """
        # Compute LoRA update: BAx
        # Step 1: x @ A^T -> [batch_size, ..., r]
        # F.linear(x, W) computes x @ W^T, so lora_A ([r, in_features]) is passed as-is
        lora_output = F.linear(x, self.lora_A)  # x @ A^T

        # Step 2: (x @ A^T) @ B^T -> [batch_size, ..., out_features]
        # lora_B is [out_features, r]
        lora_output = F.linear(lora_output, self.lora_B)  # (x @ A^T) @ B^T

        # Apply the scaling factor
        # When alpha=rank, scaling=1; when alpha=2*rank, scaling=2
        lora_output = lora_output * self.scaling

        # Original output: W_0 x
        original_output = F.linear(x, original_weight)

        # Final output: original output + LoRA update
        return original_output + lora_output

class LoRALinear(nn.Module):
    """
    Linear layer wrapper with LoRA

    Problem: How to apply LoRA to existing linear layers?
    Solution: Wrap the original linear layer and apply LoRA updates during the forward pass
    """
    def __init__(self, linear_layer, rank=8, alpha=16):
        """
        Args:
            linear_layer: Original nn.Linear layer
            rank: LoRA rank
            alpha: LoRA scaling factor
        """
        super().__init__()
        self.linear = linear_layer  # Original linear layer (frozen)
        self.lora = LoRALayer(
            linear_layer.in_features,
            linear_layer.out_features,
            rank=rank,
            alpha=alpha
        )

    def forward(self, x):
        """
        Forward pass: Apply LoRA updates

        Args:
            x: Input tensor

        Returns:
            Output tensor = W_0 x + LoRA_update
        """
        return self.lora(x, self.linear.weight)

Key Points:

  • Low-rank decomposition: decompose the d × k weight update into B (d × r) and A (r × k) matrices, reducing parameters from d × k to r(d + k)
  • Initialization strategy: A is randomly initialized, B is zero-initialized, ensuring LoRA doesn't affect the model initially
  • Scaling factor: α/r controls LoRA update strength; a larger α means greater LoRA influence

Design Trade-offs:

  • ✅ Pros: dramatically reduced parameters (typically <1%), can be merged into the original weights, inference speed unchanged
  • ⚠️ Note: rank selection is important; too small may lack expressiveness, too large loses the parameter-efficiency advantage
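The "can merge into original weights" property can be checked numerically: after training, W₀ + (α/r)·BA is computed once and used as a plain weight matrix with no extra inference cost. A minimal NumPy sketch (shapes and random values are illustrative; B is random here to simulate a trained adapter, whereas it would be zero before training):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 4, 2, 4
scaling = alpha / r

W0 = rng.standard_normal((d, k))   # frozen pretrained weight
A = rng.standard_normal((r, k))    # LoRA A
B = rng.standard_normal((d, r))    # LoRA B (zero before training; random here to mimic "after training")
x = rng.standard_normal(k)

# Unmerged: original path plus the LoRA branch
y_unmerged = W0 @ x + scaling * (B @ (A @ x))

# Merged: fold the update into a single weight matrix
W_merged = W0 + scaling * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_unmerged, y_merged)  # identical outputs, zero inference overhead
```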

Common Questions:

  • Q: Why initialize B to zero? A: It ensures the LoRA output is zero initially, so the pretrained model's behavior is unaffected
  • Q: How to choose rank? A: Typically start with 8 or 16 and adjust based on task complexity. Simple tasks can use 4; complex tasks may need 32 or 64
  • Q: How to set alpha? A: Typically as a multiple of the rank, e.g., alpha = 2 × rank. Rule of thumb: alpha = 2 × rank usually works well

Usage Example:

# Apply LoRA to attention layers
class AttentionWithLoRA(nn.Module):
    def __init__(self, attention_layer, rank=8, alpha=16):
        super().__init__()
        self.attention = attention_layer
        # Apply LoRA only to Q, K, V projections
        self.q_proj_lora = LoRALinear(attention_layer.q_proj, rank, alpha)
        self.k_proj_lora = LoRALinear(attention_layer.k_proj, rank, alpha)
        self.v_proj_lora = LoRALinear(attention_layer.v_proj, rank, alpha)

    def forward(self, x):
        # Use LoRA versions of the projection layers
        q = self.q_proj_lora(x)
        k = self.k_proj_lora(x)
        v = self.v_proj_lora(x)
        # ... attention computation ...

Advantages of LoRA

  1. Parameter Efficient: Only updates a small number of parameters (typically <1%)
  2. Modular: Can easily add or remove LoRA adapters
  3. Multi-task: Can train different LoRA adapters for different tasks
  4. Performance Close to Full Fine-tuning: Achieves 90%+ performance on most tasks

LoRA Hyperparameter Selection

  • rank (r): Rank of low-rank decomposition, typically choose 4, 8, 16, 32. Larger rank means stronger expressiveness but more parameters.
  • alpha: Scaling factor, typically set as a multiple of rank (e.g., rank=8, alpha=16). Larger alpha means greater influence of LoRA updates.

Rule of thumb: alpha = 2 * rank usually works well.

QLoRA: Quantized LoRA

QLoRA (Quantized LoRA) combines quantization with LoRA, further reducing memory requirements.

QLoRA Principle

Core innovations of QLoRA:

  1. 4-bit Quantization: Quantize model weights to 4-bit
  2. NF4 Quantization: Use NormalFloat4 quantization format
  3. Double Quantization: Quantize quantization constants again
  4. Paged Optimizer: Use paged AdamW optimizer

Memory Savings:

  • FP16 full fine-tuning: 2 bytes per parameter
  • QLoRA: Approximately 0.5 bytes per parameter (4-bit) + LoRA parameters

For a 7B model:

  • FP16 full fine-tuning: approximately 14 GB (weights alone)
  • QLoRA: approximately 3-4 GB
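Those figures follow directly from the per-parameter byte counts above. A quick estimate (weights only, in decimal GB; gradients, optimizer states, and activations add more on top):

```python
# Back-of-the-envelope weight memory for a 7B-parameter model
# (weights only; gradients, optimizer states, and activations add more)
params = 7_000_000_000

fp16_gb = params * 2 / 1e9     # 2 bytes per parameter
qlora_gb = params * 0.5 / 1e9  # ~0.5 bytes per parameter at 4-bit

print(f"FP16 weights:  {fp16_gb} GB")   # 14.0 GB
print(f"4-bit weights: {qlora_gb} GB")  # 3.5 GB
```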

QLoRA Implementation (using PEFT)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,                        # alpha
    target_modules=["q_proj", "v_proj"],  # Target modules
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Training (same as above)

Adapter Technology

Adapters are small trainable modules inserted into Transformer layers.

Adapter Architecture

Add two small feedforward networks (Adapters) to each Transformer layer:

  1. Down-projection: reduce the hidden dimension d to a smaller bottleneck dimension m: h_down = W_down h

  2. Up-projection: restore the dimension to the original hidden dimension d: h_up = W_up f(h_down), where f is a non-linearity (e.g., ReLU) and the adapter output is added back through a residual connection: h' = h + h_up

Adapter Implementation

class Adapter(nn.Module):
    """Adapter module"""
    def __init__(self, hidden_size, adapter_size=64):
        super().__init__()
        self.adapter_size = adapter_size

        # Down-projection
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        # Up-projection
        self.up_proj = nn.Linear(adapter_size, hidden_size)

        # Initialize: up_proj initialized to zero to ensure the Adapter doesn't affect output initially
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x):
        # Residual connection
        return x + self.up_proj(F.relu(self.down_proj(x)))

class TransformerLayerWithAdapter(nn.Module):
    """Transformer layer with Adapter"""
    def __init__(self, transformer_layer, adapter_size=64):
        super().__init__()
        self.transformer_layer = transformer_layer
        self.adapter = Adapter(
            transformer_layer.self_attn.embed_dim,
            adapter_size
        )

    def forward(self, x, **kwargs):
        # Original Transformer layer
        x = self.transformer_layer(x, **kwargs)[0]
        # Add Adapter
        x = self.adapter(x)
        return (x,)

Adapter vs LoRA

Feature              Adapter                                  LoRA
Insertion Position   Inside Transformer layer                 Beside weight matrix
Parameter Count      Medium (~0.5% per Adapter)               Less (typically <1%)
Inference Speed      Slightly slower (extra forward pass)     Faster (can merge into weights)
Flexibility          Medium                                   High (easy to combine)

Prefix-Tuning

Prefix-Tuning adapts to tasks by adding learnable continuous prefixes to the input sequence.

Prefix-Tuning Principle

For an input sequence x = [x₁, ..., xₙ], Prefix-Tuning prepends learnable prefixes p = [p₁, ..., pₘ], forming [p₁, ..., pₘ, x₁, ..., xₙ]. These prefix vectors pᵢ are trainable parameters, while the original model parameters remain frozen.

Attention Computation:

In the attention mechanism, the prefixes participate in the key and value computation: Attention(Q, K', V'), where K' and V' include the prefix parts: K' = [P_K; K] and V' = [P_V; V].
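The concatenation above is just a prepend along the sequence axis; a shape-level sketch with NumPy (all dimensions are illustrative):

```python
import numpy as np

# Illustrative shapes: m prefix tokens prepended to the keys/values
batch, heads, n, m, head_dim = 2, 4, 10, 5, 16

K = np.zeros((batch, heads, n, head_dim))    # keys from the input sequence
V = np.zeros((batch, heads, n, head_dim))
P_K = np.zeros((batch, heads, m, head_dim))  # learned prefix keys
P_V = np.zeros((batch, heads, m, head_dim))  # learned prefix values

K_prime = np.concatenate([P_K, K], axis=2)   # K' = [P_K; K]
V_prime = np.concatenate([P_V, V], axis=2)   # V' = [P_V; V]

print(K_prime.shape)  # (2, 4, 15, 16): attention now spans m + n positions
```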

Prefix-Tuning Implementation

class PrefixTuning(nn.Module):
    """Prefix-Tuning implementation"""
    def __init__(self, config, num_prefix_tokens=10):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads

        # Prefix parameters
        self.prefix_embeddings = nn.Parameter(
            torch.randn(num_prefix_tokens, self.hidden_size)
        )

        # Projections that generate the prefix keys and values
        self.prefix_key = nn.Linear(self.hidden_size, self.hidden_size)
        self.prefix_value = nn.Linear(self.hidden_size, self.hidden_size)

    def get_prefix_kv(self):
        """Get the key and value of the prefix"""
        prefix_k = self.prefix_key(self.prefix_embeddings)
        prefix_v = self.prefix_value(self.prefix_embeddings)

        # Reshape to multi-head format
        batch_size = 1  # Can broadcast
        prefix_k = prefix_k.view(
            batch_size, self.num_prefix_tokens, self.num_heads, self.head_dim
        ).transpose(1, 2)
        prefix_v = prefix_v.view(
            batch_size, self.num_prefix_tokens, self.num_heads, self.head_dim
        ).transpose(1, 2)

        return prefix_k, prefix_v

P-Tuning v2

P-Tuning v2 is an improved version of Prefix-Tuning, with main improvements:

  1. Apply to All Layers: Add prefixes not only at input layer but at all Transformer layers
  2. Remove Reparameterization: Directly optimize prefix parameters without MLP reparameterization
  3. Multi-task Learning: Support multi-task prefixes

P-Tuning v2 Implementation

class PTuningV2(nn.Module):
    """P-Tuning v2 implementation"""
    def __init__(self, config, num_layers, num_prefix_tokens=20):
        super().__init__()
        self.num_layers = num_layers
        self.num_prefix_tokens = num_prefix_tokens
        self.hidden_size = config.hidden_size

        # Create a prefix for each layer
        # Note: nn.ParameterList (not nn.ModuleList) is required to register nn.Parameter objects
        self.prefix_embeddings = nn.ParameterList([
            nn.Parameter(torch.randn(num_prefix_tokens, self.hidden_size))
            for _ in range(num_layers)
        ])

    def get_layer_prefix(self, layer_idx):
        """Get the prefix for the specified layer"""
        return self.prefix_embeddings[layer_idx]

Instruction Tuning

Instruction tuning is a key technique for making models follow instructions. By fine-tuning on instruction-response pairs, models learn to understand and execute various instructions.

Instruction Data Format

instruction_data = [
    {
        "instruction": "Explain what machine learning is",
        "input": "",
        "output": "Machine learning is a branch of artificial intelligence that enables computers to learn from data..."
    },
    {
        "instruction": "Translate the following English to Chinese",
        "input": "Hello, how are you?",
        "output": "你好,你好吗?"
    },
    {
        "instruction": "Summarize the following article",
        "input": "[Article content]",
        "output": "[Summary]"
    }
]

Instruction Tuning Implementation

from transformers import Trainer, TrainingArguments

def format_instruction(example):
    """Format instruction data"""
    if example['input']:
        prompt = f"Instruction: {example['instruction']}\nInput: {example['input']}\nOutput: "
    else:
        prompt = f"Instruction: {example['instruction']}\nOutput: "

    return {
        "text": prompt + example['output']
    }

# Prepare data
dataset = dataset.map(format_instruction)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=500,
)

# Use LoRA for instruction tuning
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

RLHF and Alignment Techniques

RLHF (Reinforcement Learning from Human Feedback) is an important technique for aligning models with human values.

RLHF Process

RLHF typically consists of three stages:

  1. Supervised Fine-tuning (SFT): Fine-tune base model on instruction data
  2. Reward Model Training: Train a reward model to evaluate output quality
  3. Reinforcement Learning Optimization: Optimize policy model using algorithms like PPO
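The reward-model stage (step 2) trains on human preference pairs with a pairwise ranking loss, -log σ(r_chosen - r_rejected), which is small when the chosen response clearly outscores the rejected one. A quick numeric check (reward values are illustrative):

```python
import math

def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_c - r_r)): small when chosen clearly outscores rejected."""
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# The loss shrinks as the chosen response's reward margin grows,
# and blows up when the ordering is wrong
print(round(pairwise_loss(2.0, 0.0), 4))  # small loss: correct, confident ranking
print(round(pairwise_loss(0.0, 2.0), 4))  # large loss: wrong ordering
```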

Reward Model Training

class RewardModel(nn.Module):
    """Reward model"""
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        # Use the hidden state of the last token
        last_hidden_state = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden_state)
        return reward

# Train reward model
def train_reward_model(model, chosen_data, rejected_data):
    """Train the reward model with a pairwise ranking loss"""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for chosen, rejected in zip(chosen_data, rejected_data):
        # Compute rewards
        reward_chosen = model(chosen['input_ids'], chosen['attention_mask'])
        reward_rejected = model(rejected['input_ids'], rejected['attention_mask'])

        # Loss: the chosen reward should be greater than the rejected one
        # (.mean() reduces the per-example losses to a scalar for backward())
        loss = -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

PPO Optimization

from trl import PPOTrainer, PPOConfig

# PPO configuration
ppo_config = PPOConfig(
    model_name="gpt2",
    learning_rate=1e-5,
    batch_size=4,
    mini_batch_size=2,
    gradient_accumulation_steps=4,
)

# PPO Trainer (rewards are computed externally and passed to step())
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,  # Reference model (frozen)
    tokenizer=tokenizer,
)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Generate response
        response = model.generate(**batch)

        # Compute reward
        rewards = reward_model(response)

        # PPO update
        ppo_trainer.step(
            batch['input_ids'],  # queries
            response,            # responses
            rewards              # rewards
        )

Practical Guide: Fine-tuning Large Models with PEFT

Complete Example: Fine-tuning LLaMA with LoRA

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import torch

# 1. Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Load model (optional: use quantization)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 3. Prepare model (if using quantization)
# model = prepare_model_for_kbit_training(model)

# 4. Configure LoRA
lora_config = LoraConfig(
    r=16,            # rank
    lora_alpha=32,   # alpha
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Target modules
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# 5. Apply LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
# Output example:
# trainable params: 4,194,304 || all params: 6,738,415,616 || trainable%: 0.06

# 6. Prepare data
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# 7. Training arguments
training_args = TrainingArguments(
    output_dir="./llama-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
)

# 8. Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal language modeling
)

# 9. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
)

# 10. Train
trainer.train()

# 11. Save model
model.save_pretrained("./llama-lora-final")

Fine-tuning with QLoRA

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Training (same as above)

Multi-task LoRA

# Train different LoRA adapters for different tasks
task1_lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
)

task2_lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj"],
    task_type="CAUSAL_LM"
)

# Train task 1
model_task1 = get_peft_model(base_model, task1_lora)
# ... training ...

# Train task 2 (can load different adapters)
model_task2 = get_peft_model(base_model, task2_lora)
# ... training ...

# Switch adapters during inference
model.set_adapter("task1_adapter")
output1 = model.generate(...)

model.set_adapter("task2_adapter")
output2 = model.generate(...)

❓ Q&A: Common Questions About Model Fine-tuning and PEFT

Q1: When should I use full fine-tuning, and when should I use PEFT?

  • Full fine-tuning: When you have sufficient computational resources, large amounts of high-quality data, and need best performance
  • PEFT: When computational resources are limited, data volume is moderate, need rapid iteration or multi-task adaptation

Q2: How to choose LoRA rank?

  • Small/Simple tasks: rank=4 or 8
  • Medium tasks: rank=16 or 32
  • Complex tasks: rank=32 or 64

Recommendation: Start with rank=16, adjust based on results.

Q3: What's the difference between QLoRA and LoRA?

  • LoRA: Applied on FP16/BF16 models
  • QLoRA: Applied on 4-bit quantized models, lower memory requirements

QLoRA is suitable for memory-constrained scenarios.

Q4: Which modules should I apply LoRA to?

Typically choose the projection matrices of the attention layers:

  • q_proj, k_proj, v_proj, o_proj (QKV and output attention projections)
  • gate_proj, up_proj, down_proj (MLP, optional)

Recommendation: At least include q_proj and v_proj.

Q5: Does PEFT affect inference speed?

  • LoRA: Can merge into original weights, inference speed unchanged
  • Adapter: Requires extra computation, inference slightly slower
  • Prefix-Tuning: Needs to process extra tokens, inference slightly slower

Q6: How to choose PEFT method?

Method           Use Case
LoRA             General scenarios, balances performance and efficiency
QLoRA            Memory-constrained settings, large models
Adapter          Need for modular design
Prefix-Tuning    Generation tasks, need to steer generation

Q7: How much data is needed for instruction tuning?

  • Minimum: 100-1000 high-quality instructions
  • Recommended: 1000-10000 instructions
  • Optimal: 10000+ diverse instructions

Quality matters more than quantity.

Q8: Is RLHF necessary?

No. RLHF is mainly used when you:

  • Need to align with human values
  • Need to control output style
  • Need to reduce harmful content

For most tasks, instruction tuning is sufficient.

Q9: How to evaluate fine-tuning effectiveness?

  • Task metrics: Accuracy, F1, BLEU, etc.
  • Generation quality: Human evaluation, GPT-4 evaluation
  • Alignment: Ability to follow instructions
  • Efficiency: Parameter count, inference speed

Q10: Future directions of PEFT?

  • More efficient parameter utilization: Achieve better results with fewer parameters
  • Automated method selection: Automatically select optimal PEFT configuration
  • Multi-modal extension: Apply to vision, speech, and other modalities
  • Combined methods: Combine multiple PEFT techniques
  • Post title: NLP (8): Model Fine-tuning and PEFT
  • Post author: Chen Kai
  • Create time: 2024-03-15 16:15:00
  • Post link: https://www.chenk.top/en/nlp-fine-tuning-peft/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.