As large language models continue to grow in size, the cost of full
fine-tuning has become increasingly prohibitive. Fine-tuning a model
with billions of parameters requires updating all parameters, which not
only demands massive computational resources but can also lead to
catastrophic forgetting. To address these challenges,
Parameter-Efficient Fine-Tuning (PEFT) techniques have emerged.
PEFT techniques achieve performance close to full fine-tuning by
updating only a small fraction of model parameters. Methods like LoRA
(Low-Rank Adaptation), QLoRA, Adapter, and Prefix-Tuning are
representative examples. These approaches not only dramatically reduce
computational costs but also make it possible to fine-tune large models
on consumer-grade hardware.
This article delves into the differences between full fine-tuning and
frozen fine-tuning, provides detailed explanations of PEFT techniques
including LoRA, QLoRA, Adapter, Prefix-Tuning, and P-Tuning v2,
introduces alignment techniques like Instruction Tuning and RLHF
(Reinforcement Learning from Human Feedback), and demonstrates how to
fine-tune large models using the HuggingFace PEFT library through
practical examples.
Full Fine-tuning vs Frozen Fine-tuning
Full Fine-tuning
Full fine-tuning refers to updating all parameters of a pre-trained
model. This is the most straightforward approach but also the most
expensive.
```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Full fine-tuning example
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set all parameters as trainable
for param in model.parameters():
    param.requires_grad = True

# Training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(num_epochs):
    for batch in dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
Advantages:

- Theoretically achieves the best performance
- The model can fully adapt to the target task

Disadvantages:

- Requires massive computational resources (GPU memory, training time)
- Prone to overfitting
- May cause catastrophic forgetting
- Each task requires saving a complete model copy
Frozen Fine-tuning
Frozen fine-tuning freezes most parameters of the pre-trained model
and only trains partial layers (typically the top layers).
```python
# Frozen fine-tuning example
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Only train the top layers (e.g., the last 2 layers)
for param in model.transformer.h[-2:].parameters():
    param.requires_grad = True

# Or only train a classification head
classifier = nn.Linear(model.config.n_embd, num_labels)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-4)
```
Advantages:

- Far lower memory and compute requirements than full fine-tuning
- Faster training
- Reduced risk of catastrophic forgetting, since most weights are untouched

Disadvantages:

- Performance may be inferior to full fine-tuning
- Requires careful selection of which layers to unfreeze
- Lower flexibility
Parameter Efficiency Comparison

Assuming a model has N parameters:

- Full fine-tuning: all N parameters need to be updated
- Frozen fine-tuning: only a subset of parameters is updated (e.g., the last few layers)
- PEFT methods: typically only a tiny fraction of parameters is updated (e.g., LoRA updates only 0.1%-1% of parameters)
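To make the gap concrete, here is a back-of-envelope count for a hypothetical 3-layer MLP (the layer sizes are made up purely for illustration):

```python
# Hypothetical layer shapes (in_features, out_features), chosen only for illustration
layer_shapes = [(100, 100), (100, 100), (100, 10)]

def n_params(shapes):
    # Each Linear layer has in*out weights plus out biases
    return sum(i * o + o for i, o in shapes)

full = n_params(layer_shapes)         # full fine-tuning updates every parameter
frozen = n_params(layer_shapes[-1:])  # frozen fine-tuning: only the last layer

print(full, frozen)  # 21210 1010
```

Even in this toy setting, freezing all but the last layer cuts the trainable parameters by more than 95%.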
LoRA: Low-Rank Adaptation
LoRA (Low-Rank Adaptation) is one of the most popular PEFT methods
today. Its core idea is: instead of directly updating the
original weight matrix, learn a low-rank decomposed incremental
update.
LoRA Principle
For a pre-trained weight matrix W0 ∈ R^(d×k), LoRA does not update W0 directly; instead it learns two low-rank matrices B ∈ R^(d×r) and A ∈ R^(r×k), where r << min(d, k).

During forward propagation, the effective weight is W = W0 + BA, where ΔW = BA is the low-rank update.
Parameter Efficiency:

- Original matrix parameters: d × k
- LoRA parameters: r × (d + k)
- Parameter reduction ratio: r(d + k) / (d × k). When r << min(d, k), the parameter count is dramatically reduced.

For example, with d = k = 1024 and r = 8:

- Original parameters: 1,048,576
- LoRA parameters: 16,384
- Reduction: approximately 98.4%
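These numbers follow directly from the formulas above:

```python
# Parameter count for one d x k weight matrix with LoRA rank r
d, k, r = 1024, 1024, 8

original = d * k        # W0: frozen, d*k parameters
lora = r * (d + k)      # B (d x r) plus A (r x k)
reduction = 100 * (1 - lora / original)

print(original, lora, round(reduction, 1))  # 1048576 16384 98.4
```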
LoRA Implementation
Problem Context: Full fine-tuning requires updating all parameters, which is expensive. LoRA dramatically reduces the number of trainable parameters by learning low-rank weight increments.

Solution Approach: Instead of directly updating the original weight matrix W0, learn two low-rank matrices B and A such that ΔW = BA. During forward propagation, compute h = W0·x + (α/r)·BA·x, where α is a scaling factor.
Design Considerations:

- Matrix A is initialized with small random values and B is initialized to zero, ensuring the LoRA update ΔW = BA is zero at the start of training
- The rank r controls the low-rank dimension and is typically chosen between 4 and 32
- Alpha controls the strength of the LoRA update and is typically set as a multiple of the rank
Design Trade-offs:

- ✅ Pros: dramatically fewer trainable parameters (typically <1%), the update can be merged into the original weights, inference speed is unchanged
- ⚠️ Note: rank selection matters; too small a rank may lack expressiveness, while too large a rank loses the parameter-efficiency advantage
Common Questions:

- Q: Why initialize B to zero? A: It ensures the LoRA output is zero initially, so the pretrained model's behavior is unaffected
- Q: How to choose the rank? A: Typically start with 8 or 16 and adjust based on task complexity. Simple tasks can use 4; complex tasks may need 32 or 64
- Q: How to set alpha? A: Typically a multiple of the rank; the rule of thumb alpha = 2 × rank usually works well
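The usage example below wraps attention projections in a LoRALinear module that the article does not define. A minimal sketch of such a module, following the initialization and scaling rules described above (the class name and constructor signature are assumptions inferred from the usage), might look like:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (sketch)."""
    def __init__(self, linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # freeze W0
        self.scaling = alpha / rank
        # A: small random init, B: zero init, so BA = 0 at the start
        self.lora_A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.linear(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

Because B starts at zero, the wrapped layer initially computes exactly the same output as the frozen base layer.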
Usage Example:
```python
# Apply LoRA to attention layers
class AttentionWithLoRA(nn.Module):
    def __init__(self, attention_layer, rank=8, alpha=16):
        super().__init__()
        self.attention = attention_layer
        # Apply LoRA only to Q, K, V projections
        self.q_proj_lora = LoRALinear(attention_layer.q_proj, rank, alpha)
        self.k_proj_lora = LoRALinear(attention_layer.k_proj, rank, alpha)
        self.v_proj_lora = LoRALinear(attention_layer.v_proj, rank, alpha)

    def forward(self, x):
        # Use LoRA versions of the projection layers
        q = self.q_proj_lora(x)
        k = self.k_proj_lora(x)
        v = self.v_proj_lora(x)
        # ... attention computation ...
```
Advantages of LoRA

- Parameter Efficient: only updates a small number of parameters (typically <1%)
- Modular: LoRA adapters can easily be added or removed
- Multi-task: different LoRA adapters can be trained for different tasks
- Performance Close to Full Fine-tuning: achieves 90%+ of full fine-tuning performance on most tasks
LoRA Hyperparameter Selection
- rank (r): the rank of the low-rank decomposition, typically 4, 8, 16, or 32. A larger rank means stronger expressiveness but more parameters.
- alpha: the scaling factor, typically set as a multiple of the rank (e.g., rank=8, alpha=16). A larger alpha means a greater influence of the LoRA updates.

Rule of thumb: alpha = 2 × rank usually works well.
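With the HuggingFace PEFT library, these hyperparameters map directly onto LoraConfig. A setup sketch (the target module name "c_attn" is GPT-2's fused QKV projection; the right names vary by model architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank decomposition
    lora_alpha=16,              # scaling factor (alpha = 2 * rank)
    target_modules=["c_attn"],  # model-specific; GPT-2's attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
```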
QLoRA: Quantized LoRA
QLoRA (Quantized LoRA) combines quantization with LoRA, further
reducing memory requirements.
QLoRA Principle
Core innovations of QLoRA:
- 4-bit Quantization: quantize model weights to 4-bit precision
- NF4 Quantization: use the NormalFloat4 quantization format
- Double Quantization: quantize the quantization constants themselves
- Paged Optimizer: use a paged AdamW optimizer
Memory Savings:

- FP16 full fine-tuning: 2 bytes per parameter
- QLoRA: approximately 0.5 bytes per parameter (4-bit) plus the LoRA parameters

For a 7B model, the weights alone require:

- FP16: approximately 14 GB
- QLoRA: approximately 3-4 GB
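These figures follow from simple bytes-per-parameter arithmetic (decimal GB, weights only; optimizer state and activations come on top):

```python
# Back-of-envelope weight-memory estimate for a 7B-parameter model
n_params = 7e9

fp16_gb = n_params * 2 / 1e9     # FP16: 2 bytes per parameter
qlora_gb = n_params * 0.5 / 1e9  # 4-bit: ~0.5 bytes per parameter

print(fp16_gb, qlora_gb)  # 14.0 3.5
```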
```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer
```
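Using these imports, a typical QLoRA loading sketch looks like the following (the model name and hyperparameters are placeholders, not the article's original values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, per the QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
```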
Adapter

Adapters insert small bottleneck modules inside each Transformer layer; only the adapter parameters are trained.

```python
class TransformerLayerWithAdapter(nn.Module):
    """Transformer layer with Adapter"""
    def __init__(self, transformer_layer, adapter_size=64):
        super().__init__()
        self.transformer_layer = transformer_layer
        self.adapter = Adapter(
            transformer_layer.self_attn.embed_dim,
            adapter_size
        )

    def forward(self, x, **kwargs):
        # Original Transformer layer
        x = self.transformer_layer(x, **kwargs)[0]
        # Add Adapter
        x = self.adapter(x)
        return (x,)
```
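The Adapter class used above is not defined in the article; a minimal bottleneck sketch (names and initialization are assumptions) is:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden_size, adapter_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_size)
        self.up = nn.Linear(adapter_size, hidden_size)
        self.act = nn.GELU()
        # Zero-init of the up-projection keeps the adapter at identity initially
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # Residual connection around the bottleneck
        return x + self.up(self.act(self.down(x)))
```

The residual connection plus zero-initialized up-projection means an untrained adapter passes its input through unchanged, mirroring LoRA's zero-init of B.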
Adapter vs LoRA

| Feature | Adapter | LoRA |
| --- | --- | --- |
| Insertion Position | Inside Transformer layer | Beside weight matrix |
| Parameter Count | Medium (~0.5% per Adapter) | Less (typically <1%) |
| Inference Speed | Slightly slower (extra forward computation) | Faster (can merge into weights) |
| Flexibility | Medium | High (easy to combine) |
Prefix-Tuning
Prefix-Tuning adapts to tasks by adding learnable continuous prefixes
to the input sequence.
Prefix-Tuning Principle
For an input sequence x = [x1, ..., xn], Prefix-Tuning prepends m learnable prefix vectors p = [p1, ..., pm], forming [p1, ..., pm, x1, ..., xn]. The prefix vectors are trainable parameters, while the original model parameters remain frozen.
Attention Computation:

In the attention mechanism, the prefixes participate in the key and value computation: Attention(Q, K', V'), where K' = [P_K; K] and V' = [P_V; V] include the prefix parts (P_K and P_V are the learned prefix keys and values).
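A single-head sketch of this mechanism (all names and the prefix length are illustrative assumptions) shows how the learned prefixes are concatenated onto K and V while the projections stay frozen in practice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    """Single-head attention with learnable prefix keys/values (sketch)."""
    def __init__(self, d_model, prefix_len=10):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Trainable prefixes: the only new parameters added by Prefix-Tuning
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, x):
        B = x.size(0)
        q = self.q_proj(x)
        # K' = [P_K; K], V' = [P_V; V]
        k = torch.cat([self.prefix_k.expand(B, -1, -1), self.k_proj(x)], dim=1)
        v = torch.cat([self.prefix_v.expand(B, -1, -1), self.v_proj(x)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return attn @ v
```

The output sequence length is unchanged; only the keys and values the queries attend over are extended by the prefix.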
Instruction Tuning

Instruction tuning is a key technique for making models follow instructions. By fine-tuning on instruction-response pairs, models learn to understand and execute a wide variety of instructions.
Instruction Data Format
```python
instruction_data = [
    {
        "instruction": "Explain what machine learning is",
        "input": "",
        "output": "Machine learning is a branch of artificial intelligence that enables computers to learn from data..."
    },
    {
        "instruction": "Translate the following English to Chinese",
        "input": "Hello, how are you?",
        "output": "你好,你好吗?"
    },
    {
        "instruction": "Summarize the following article",
        "input": "[Article content]",
        "output": "[Summary]"
    }
]
```
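Examples in this format are typically rendered into a single training string before tokenization. A simple template in the style popularized by Alpaca (the exact markers are an assumption, not the article's original format) could be:

```python
def format_example(ex):
    # Alpaca-style template, assumed for illustration
    if ex["input"]:
        prompt = (f"### Instruction:\n{ex['instruction']}\n\n"
                  f"### Input:\n{ex['input']}\n\n### Response:\n")
    else:
        prompt = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n"
    return prompt + ex["output"]

example = {
    "instruction": "Translate the following English to Chinese",
    "input": "Hello, how are you?",
    "output": "你好,你好吗?",
}
text = format_example(example)
```

Examples with an empty "input" field simply omit the Input section, so one template handles both shapes of data.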