Reinforcement Learning (12): RLHF and Large Language Model Applications
Chen Kai

The breakthrough progress of Large Language Models (LLMs)— from GPT-3 to ChatGPT, from Claude to Gemini — stems not only from model scaling and pretraining data growth, but crucially from the introduction of Reinforcement Learning from Human Feedback (RLHF). While pretrained language models can generate fluent text, they often produce harmful content, misinformation, or responses misaligned with user intent. RLHF collects human preference data on model outputs, trains reward models to capture human values, then uses reinforcement learning (PPO) to fine-tune models toward more helpful, honest, and harmless content. InstructGPT systematized the RLHF pipeline, ChatGPT brought it to mainstream awareness, while DPO (Direct Preference Optimization) and RLAIF (RL from AI Feedback) simplified training complexity and data collection costs. Beyond language, reinforcement learning plays a core role in embodied intelligence (robotics, autonomous driving)— from sim-to-real policy transfer to offline-to-online fine-tuning, RL is shaping the next generation of general agents. This chapter systematically examines RLHF's technical details, DPO's theoretical innovations, RLAIF's practical approaches, and RL applications in multimodal and embodied intelligence, with complete code to help you implement a simplified RLHF pipeline.

RLHF: From Pretraining to Human Alignment

Why Do We Need RLHF?

Limitations of Pretrained Language Models: - Misaligned Objectives: Maximizing next-token prediction likelihood doesn't guarantee useful or safe outputs - Distribution Bias: Pretraining data includes internet text (filled with misinformation, bias, harmful content), which models may learn - Lack of Instruction Understanding: GPT-3 struggles zero-shot with instructions like "please summarize this article"

Value of Human Feedback: - Captures complex, implicit human preferences (e.g., "helpful," "polite," "avoid bias") - More flexible than manual rules, more efficient than supervised learning (only requires comparing two outputs, not generating perfect answers)

RLHF's Goal: - Align model outputs with human values - Maximize human ratings (reward) rather than likelihood

RLHF's Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Starting from a pretrained model (e.g., GPT-3), fine-tune on high-quality demonstration data:

- Collect human-labeled (prompt, desired response) pairs $(x, y)$
- Fine-tune with the standard cross-entropy loss:

$$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x, y) \sim D}\left[\sum_{t} \log \pi_\theta(y_t \mid x, y_{<t})\right]$$

Purpose: - Provide "formatted" output initialization for model (e.g., dialogue format, instruction following) - Reduce exploration difficulty in RL training

Data Scale: InstructGPT used ~13k demonstrations (high-quality responses written by labelers).
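The SFT objective is ordinary next-token cross-entropy over the demonstration data. A minimal PyTorch sketch, with random logits standing in for a real language model:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: position t predicts token t+1.

    logits: (batch, seq_len, vocab_size), input_ids: (batch, seq_len).
    """
    shift_logits = logits[:, :-1, :]   # drop the last position
    shift_labels = input_ids[:, 1:]    # drop the first token
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy check: random "model" outputs over a 100-token vocabulary
torch.manual_seed(0)
logits = torch.randn(2, 10, 100)
input_ids = torch.randint(0, 100, (2, 10))
loss = sft_loss(logits, input_ids)
```

In a real SFT run the logits come from the pretrained model's forward pass and the loss is usually masked so that only response tokens (not prompt tokens) contribute.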

Stage 2: Reward Model Training

Train a reward model $r_\phi(x, y)$ to predict the human rating of outputs:

- Collect comparison data: for the same prompt $x$, humans compare multiple model-generated outputs and label preferences (e.g., $y_w \succ y_l$)
- Model preferences with the Bradley-Terry model: $P(y_w \succ y_l \mid x) = \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$
- Loss function:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

where $y_w$ is the preferred output (winner), $y_l$ is the non-preferred output (loser), and $\sigma$ is the sigmoid function.

Architecture: Typically based on SFT model, remove last layer, add linear layer to output scalar reward.

Data Scale: InstructGPT used ~33k comparisons (4-9 outputs per prompt, pairwise comparisons).
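The Bradley-Terry model turns two scalar rewards into a preference probability via the sigmoid of their difference. A quick numeric illustration:

```python
import math

def bt_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry: P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

# Equal rewards -> 50/50; a 1-point reward gap -> ~73% preference
p_equal = bt_prob(1.0, 1.0)   # 0.5
p_gap = bt_prob(2.0, 1.0)     # ~0.731
```

Note that only reward *differences* matter: adding a constant to every reward leaves all preference probabilities unchanged, which is why the reward model's absolute scale is arbitrary.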

Stage 3: PPO Fine-Tuning (Policy Optimization)

Use reinforcement learning (PPO) to optimize the policy $\pi_\theta$, maximizing the reward model score.

Objective Function:

$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right]$$

  • First term: Reward model score, encourages model to generate high-scoring outputs
  • Second term: KL divergence regularization, prevents model from deviating too far from SFT initialization (avoids "reward hacking"— generating outputs that score high with reward model but appear garbage to humans)

PPO Algorithm:

- Sample prompts $x$, generate responses $y \sim \pi_\theta(\cdot \mid x)$
- Compute reward $r_\phi(x, y)$
- Update the policy with the PPO clipped objective:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage function.

Training Details: - Each iteration samples batch of prompts, generates responses, computes rewards, updates policy - Simultaneously applies supervised loss on SFT data (prevents forgetting) - Iterates thousands of steps until reward saturates
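The clipped surrogate at the heart of the PPO update fits in a few lines of PyTorch. In this sketch, `logp_new`/`logp_old` are per-token log-probabilities under the current and behavior policies (the names are chosen here for illustration):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)],
    where r = exp(logp_new - logp_old) is the importance ratio."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# When the policies agree (ratio = 1), the loss reduces to -mean(advantage)
adv = torch.tensor([1.0, -0.5, 2.0])
logp = torch.tensor([-1.0, -2.0, -0.5])
loss_same = ppo_clip_loss(logp, logp, adv)
```

The clamp caps how much a single update can exploit a large advantage estimate, which is exactly the "limit policy update magnitude" property discussed later in Q4.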

InstructGPT: Systematic RLHF Practice

InstructGPT's Training Pipeline

OpenAI published InstructGPT paper in 2022, systematizing RLHF pipeline:

1. Data Collection: - SFT data: 13k prompts + human-labeled responses - Comparison data: 33k prompts, 4-9 model outputs per prompt, humans label preference rankings - Prompt sources: Real requests from API users (privacy-removed) + diverse prompts written by labelers

2. Model Scales: - Based on GPT-3's 1.3B, 6B, 175B parameter models - Train all sizes in both SFT and RL stages, compare effectiveness

3. Reward Model: - 6B parameter model performs best (more stable than 175B parameter reward model) - Input: prompt + response, output: scalar reward - Training: optimize Bradley-Terry loss on comparison data

4. PPO Fine-Tuning: - Initialization: start from the SFT model - KL coefficient $\beta$ (balances reward and KL penalty) - Training: iterate on 256k prompts - Mixed loss: RL loss + SFT loss (prevents forgetting)

InstructGPT's Key Findings

1. Model Scale vs Data Quality: - 1.3B parameter InstructGPT (RLHF-trained) outperforms 175B parameter GPT-3 (pretrained only) in human evaluation - Shows alignment training more important than scale

2. Generalization Ability: - On held-out prompts, InstructGPT performs well (unseen task types) - Reward model generalizes to new prompt distributions

3. Alignment Tax: - After RLHF training, model performance slightly drops on some NLP benchmarks (e.g., SQuAD) - But actual user experience significantly improves

4. Labeler Consistency: - Different labelers show high preference consistency (>70%) - But greater divergence on subjective tasks (e.g., creative writing)

InstructGPT's Limitations

1. Reward Model Limitations: - Reward model can be "hacked" (producing high-scoring but meaningless outputs) - Example: generating extremely long but repetitive text (reward model may score high due to length)

2. Preference Data Bias: - Labeler preferences may reflect group biases - Reward model inherits these biases

3. Computational Cost: - RLHF training expensive (requires online sampling + multiple forward passes) - PPO updates unstable (requires careful hyperparameter tuning)

ChatGPT: Large-Scale RLHF Application

ChatGPT's Technical Evolution

ChatGPT (released November 2022): - Based on GPT-3.5 (improved GPT-3) - Complete RLHF pipeline (SFT → reward model → PPO) - Dialogue optimization: multi-turn conversation ability, context understanding

GPT-4 (released March 2023): - Multimodal input (text + images) - Stronger reasoning ability, fewer hallucinations - More complex RLHF: multi-objective optimization (helpful, honest, harmless)

ChatGPT's Training Details (Inferred)

OpenAI hasn't released complete details, but from papers and public information:

1. SFT Data: - Hundreds of thousands of dialogue samples (human-labeled) - Covers diverse tasks: Q&A, creative writing, code generation, translation, etc.

2. Reward Model: - Multiple reward models (separately modeling "helpful," "honest," "harmless") - Weighted combination, e.g. $r = w_1 r_{\text{helpful}} + w_2 r_{\text{honest}} + w_3 r_{\text{harmless}}$

3. PPO Fine-Tuning: - Online data collection: continuously sample from API user requests, collect feedback - Iterative training: periodically retrain reward model and policy

4. Safety Layer: - Content moderation model: filters harmful outputs - Rule-based system: hard constraints (e.g., refusing illegal requests)

ChatGPT's Impact

1. User Experience Improvement: - Fluent dialogue, accurate instruction understanding - Refuses inappropriate requests (e.g., "teach me to make bombs")

2. New Challenges: - Jailbreaking: Users design prompts to bypass safety restrictions - Bias: Model outputs may still contain gender, racial bias - Hallucinations: Model sometimes generates plausible-sounding but factually incorrect content

3. Driving RLHF Research: - ChatGPT's success sparked academic interest in RLHF - Open-source alternatives: LLaMA+RLHF, Alpaca, Vicuna, etc.

DPO: Direct Preference Optimization

Traditional RLHF's Problems

Complexity: - Requires three-stage training (SFT → reward model → PPO) - PPO training unstable, requires careful hyperparameter tuning

Computational Cost: - RL stage requires online sampling (generating large amounts of text) - Each update requires multiple forward passes (computing rewards, advantages, etc.)

Reward Model Error Propagation: - Reward model errors affect RL training - Reward hacking: policy learns to exploit reward model's loopholes

DPO's Core Idea

Insight: The optimal policy of the RLHF objective has a closed-form solution.

Under the standard RLHF objective

$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\left[r(x, y)\right] - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right]$$

the optimal policy is

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$$

where $Z(x)$ is the normalization constant (partition function).

Solving this for the reward function gives

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry model, $Z(x)$ cancels in the pairwise difference:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

DPO Loss: Directly optimize the policy, with no explicit reward model needed:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
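The DPO loss is a logistic loss on the difference of policy-vs-reference log-ratios, so it fits in a few lines. A sketch assuming sequence-level log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).

    Each argument is the total log-probability of a response under the
    trainable policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    margins = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margins).mean()

# If the policy already favors the winner relative to the reference,
# the loss drops below log(2) (the value at a zero margin)
loss = dpo_loss(
    torch.tensor([-1.0]), torch.tensor([-3.0]),   # policy log-probs (w, l)
    torch.tensor([-2.0]), torch.tensor([-2.0]),   # reference log-probs (w, l)
)
```

Because the loss depends only on log-prob differences, no sampling and no separate reward network are needed; a single forward pass per response under policy and reference suffices.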

DPO's Advantages

Simple: - Only requires one training stage (skips reward model and RL) - Loss function is standard cross-entropy, optimize with gradient descent

Stable: - No complex PPO sampling and updates - No reward model error propagation

Efficient: - Low computational cost (no online sampling needed) - Fast training (direct supervised learning)

DPO's Experimental Results

Paper Experiments (Rafailov et al., 2023): - Tasks: sentiment control, summarization, dialogue - Data: TL;DR (summarization), Anthropic HH (dialogue) - Results: DPO performance matches or exceeds PPO-based RLHF

Subsequent Improvements: - ODPO (Offset DPO): considers preference strength (not all preferences equally important) - IPO (Identity Preference Optimization): improves DPO's theoretical foundation

DPO's Limitations

Implicit Reward Modeling: - DPO implicitly learns reward, but cannot explicitly view reward values - Difficult to debug (why did model choose this output?)

Sensitive to Data Quality: - Requires high-quality preference pairs - Noisy labels have greater impact (because directly optimizing policy)

Generalization Ability: - Underperforms RLHF on some tasks (especially requiring complex reasoning)

RLAIF: Replacing Human Feedback with AI Feedback

Human Feedback Bottleneck

High Cost: - Labeler time cost (InstructGPT used 40 full-time labelers over months) - Quality control cost (requires training, quality checks, consistency checks)

Poor Scalability: - Human annotation slow (each comparison takes tens of seconds) - Difficult to continuously collect new data

Bias Accumulation: - Labeler population may not represent user population - Subjective tasks (e.g., creative writing) difficult to obtain consistent preferences

RLAIF's Core Idea

Use AI Models to Generate Preference Labels: - Given a prompt $x$ and two outputs $y_1, y_2$ - Use a pretrained LLM (e.g., GPT-4, PaLM) to evaluate which is better - Prompt template:

Given the following question and two responses, which response is better?
Question: {x}
Response A: {y_1}
Response B: {y_2}
Answer: (A or B)

Training Pipeline: - Use AI to generate preference data - Rest of pipeline same as RLHF (train reward model, PPO fine-tuning) - Or use DPO to directly optimize
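The AI-labeling step can be wrapped in a small helper. `llm_judge` below is a hypothetical callable standing in for an API call to the evaluator model using the template above; it returns 'A' or 'B':

```python
def build_ai_preferences(prompts, responses_a, responses_b, llm_judge):
    """Produce (prompt, chosen, rejected) triples from AI verdicts.

    llm_judge(prompt, resp_a, resp_b) -> 'A' or 'B' is assumed to wrap an
    LLM call that fills in the comparison prompt template.
    """
    data = []
    for x, y1, y2 in zip(prompts, responses_a, responses_b):
        verdict = llm_judge(x, y1, y2)
        chosen, rejected = (y1, y2) if verdict == 'A' else (y2, y1)
        data.append({'prompt': x, 'chosen': chosen, 'rejected': rejected})
    return data

# Dummy judge for testing: prefer the longer response
dummy_judge = lambda x, a, b: 'A' if len(a) >= len(b) else 'B'
prefs = build_ai_preferences(
    ['Q1'], ['a long detailed answer'], ['short'], dummy_judge
)
```

In practice the judge call is also randomized over the (A, B) ordering, since LLM evaluators are known to have position bias.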

RLAIF Variants

1. Constitutional AI (Anthropic, 2022): - Use predefined rules (constitution) to guide AI evaluation - Rule examples: "output should be honest, helpful, harmless" - AI evaluation references these rules

2. Self-Critique: - Model generates output, then self-evaluates and improves - Iteration: generate → evaluate → revise → generate

3. Direct-RLAIF: - Skip reward model, directly use AI scoring as reward - During each RL sampling, call AI model online for scoring

RLAIF's Experimental Results

Paper (Lee et al., 2023): - Tasks: summarization, dialogue, harmlessness - Comparison: RLAIF vs RLHF - Results: RLAIF performance approaches RLHF (even exceeds on some tasks)

Key Findings: - AI feedback is highly consistent with human preferences (>85% agreement) - Cost reduced by 10x+ (no human annotation needed) - Strong scalability (large amounts of data collected quickly)

RLAIF's Limitations

AI Evaluation Bias: - AI models may inherit pretraining data biases - Evaluation may be overly conservative or overly aggressive

Circular Dependency: - Using AI A to train AI B may lead to error accumulation - "Model collapse": performance degrades after multiple generations of training

Difficulty Capturing Subtle Preferences: - Some human preferences hard to express via prompts (e.g., aesthetics, emotional nuance)

Complete Code Implementation: Simplified RLHF

Below is a simplified RLHF pipeline, including: - Synthetic data generation (simulated prompts and responses) - Reward model training (based on preference pairs) - RL fine-tuning (simplified version, using REINFORCE in place of full PPO)

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader


# ============ Data Generation ============
class SyntheticDataset:
    """Synthetic RLHF data"""
    def __init__(self, num_prompts=1000):
        self.prompts = [f"Prompt {i}: Tell me about topic {i % 10}." for i in range(num_prompts)]

    def generate_responses(self, prompt, model, tokenizer, num_responses=4):
        """Generate multiple responses for a prompt"""
        inputs = tokenizer(prompt, return_tensors='pt')
        outputs_list = []

        for _ in range(num_responses):
            output = model.generate(
                inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=50,
                do_sample=True,
                top_p=0.9,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id
            )
            response = tokenizer.decode(output[0], skip_special_tokens=True)
            outputs_list.append(response)

        return outputs_list

    def create_comparison_data(self, model, tokenizer, num_comparisons=500):
        """Create comparison data (simulating human preferences)"""
        comparisons = []

        for i in range(num_comparisons):
            prompt = self.prompts[i]
            responses = self.generate_responses(prompt, model, tokenizer, num_responses=2)

            # Simulate preference: longer response usually better (simplified assumption)
            y1, y2 = responses
            if len(y1) > len(y2):
                y_w, y_l = y1, y2
            else:
                y_w, y_l = y2, y1

            comparisons.append({
                'prompt': prompt,
                'chosen': y_w,
                'rejected': y_l
            })

        return comparisons


# ============ Reward Model ============
class RewardModel(nn.Module):
    """GPT-2-based reward model"""
    def __init__(self, model_name='gpt2'):
        super().__init__()
        self.transformer = GPT2LMHeadModel.from_pretrained(model_name)
        self.value_head = nn.Linear(self.transformer.config.n_embd, 1)

    def forward(self, input_ids, attention_mask=None):
        # Get last-layer hidden states from the base transformer (bypassing the LM head)
        outputs = self.transformer.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        hidden_states = outputs.last_hidden_state

        # Take the last token's representation
        last_hidden = hidden_states[:, -1, :]

        # Output scalar reward
        reward = self.value_head(last_hidden)
        return reward.squeeze(-1)


class ComparisonDataset(Dataset):
    """Preference-pair dataset"""
    def __init__(self, comparisons, tokenizer, max_length=128):
        self.comparisons = comparisons
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.comparisons)

    def __getitem__(self, idx):
        item = self.comparisons[idx]

        chosen_text = item['prompt'] + item['chosen']
        rejected_text = item['prompt'] + item['rejected']

        chosen_enc = self.tokenizer(
            chosen_text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        rejected_enc = self.tokenizer(
            rejected_text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'chosen_input_ids': chosen_enc['input_ids'].squeeze(0),
            'chosen_attention_mask': chosen_enc['attention_mask'].squeeze(0),
            'rejected_input_ids': rejected_enc['input_ids'].squeeze(0),
            'rejected_attention_mask': rejected_enc['attention_mask'].squeeze(0)
        }


def train_reward_model(reward_model, dataloader, num_epochs=3, lr=1e-5):
    """Train the reward model with the Bradley-Terry loss"""
    optimizer = optim.Adam(reward_model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        total_loss = 0
        for batch in dataloader:
            # Compute chosen and rejected rewards
            r_chosen = reward_model(
                batch['chosen_input_ids'],
                batch['chosen_attention_mask']
            )
            r_rejected = reward_model(
                batch['rejected_input_ids'],
                batch['rejected_attention_mask']
            )

            # Bradley-Terry loss: -log sigmoid(r_w - r_l), numerically stable form
            loss = -F.logsigmoid(r_chosen - r_rejected).mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}")

    return reward_model


# ============ RL Fine-Tuning (Simplified) ============
class SimpleRLHF:
    """Simplified RLHF trainer (REINFORCE in place of full PPO)"""
    def __init__(self, policy_model, reward_model, ref_model, tokenizer, beta=0.01):
        self.policy = policy_model
        self.reward_model = reward_model
        self.ref_model = ref_model
        self.tokenizer = tokenizer
        self.beta = beta  # KL penalty coefficient

        self.optimizer = optim.Adam(policy_model.parameters(), lr=1e-6)

    def compute_reward(self, prompt, response):
        """Compute reward: RM score - beta * KL penalty"""
        text = prompt + response
        inputs = self.tokenizer(text, return_tensors='pt', max_length=128, truncation=True)

        with torch.no_grad():
            # RM score
            rm_score = self.reward_model(inputs['input_ids'], inputs['attention_mask'])

            # KL penalty: KL(policy || reference) over the token distributions
            policy_logits = self.policy(inputs['input_ids']).logits
            ref_logits = self.ref_model(inputs['input_ids']).logits
            kl_div = F.kl_div(
                F.log_softmax(policy_logits, dim=-1),
                F.softmax(ref_logits, dim=-1),
                reduction='batchmean'
            )

        reward = rm_score - self.beta * kl_div
        return reward.item()

    def train_step(self, prompts):
        """One training step (simplified REINFORCE)"""
        total_loss = 0

        for prompt in prompts:
            # Generate response
            inputs = self.tokenizer(prompt, return_tensors='pt')
            output = self.policy.generate(
                inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_length=50,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )

            response_ids = output[0]
            response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

            # Compute reward (a scalar, treated as a constant for the gradient)
            reward = self.compute_reward(prompt, response)

            # REINFORCE: loss = -mean_t log pi(a_t) * reward, using the
            # log-probabilities of the tokens that were actually sampled
            logits = self.policy(response_ids.unsqueeze(0)).logits
            log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
            token_log_probs = log_probs.gather(
                -1, response_ids[1:].unsqueeze(0).unsqueeze(-1)
            ).squeeze(-1)
            loss = -token_log_probs.mean() * reward

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()

        return total_loss / len(prompts)


# ============ Main Training Pipeline ============
def main():
    print("Initializing models and tokenizer...")
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token

    # Stage 1: SFT (skipped here; we use pretrained GPT-2 directly)
    policy_model = GPT2LMHeadModel.from_pretrained('gpt2')
    ref_model = GPT2LMHeadModel.from_pretrained('gpt2')  # Reference model (frozen)
    ref_model.eval()

    # Stage 2: Reward model training
    print("\nGenerating comparison data...")
    dataset_gen = SyntheticDataset(num_prompts=1000)
    comparisons = dataset_gen.create_comparison_data(policy_model, tokenizer, num_comparisons=500)

    print("Training reward model...")
    reward_model = RewardModel('gpt2')
    comparison_dataset = ComparisonDataset(comparisons, tokenizer)
    dataloader = DataLoader(comparison_dataset, batch_size=8, shuffle=True)

    reward_model = train_reward_model(reward_model, dataloader, num_epochs=3)

    # Stage 3: RL fine-tuning (simplified)
    print("\nRLHF fine-tuning...")
    rlhf_trainer = SimpleRLHF(policy_model, reward_model, ref_model, tokenizer)

    train_prompts = dataset_gen.prompts[:100]
    for step in range(10):
        loss = rlhf_trainer.train_step(train_prompts[:10])
        print(f"Step {step+1}/10, Loss: {loss:.4f}")

    # Test
    print("\nTesting generation:")
    test_prompt = "Tell me about artificial intelligence."
    inputs = tokenizer(test_prompt, return_tensors='pt')
    output = policy_model.generate(inputs['input_ids'], max_length=50)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

    print("\nTraining complete!")


if __name__ == "__main__":
    # Note: simplified demonstration; full RLHF requires a more complex implementation
    print("Simplified RLHF example")
    print("Warning: this code is for educational demonstration only, not for production")
    # main()  # Uncomment to run (requires GPU and sufficient memory)

Code Analysis

Data Generation: - SyntheticDataset: synthesizes prompts and responses - create_comparison_data: generates 2 responses per prompt, simulates preference (simplified as longer is better)

Reward Model: - RewardModel: based on GPT-2, adds scalar output head - train_reward_model: trains with Bradley-Terry loss

RLHF Training: - SimpleRLHF: simplified trainer - compute_reward: RM score - KL penalty - train_step: generates response, computes reward, updates policy with REINFORCE

Note: - Complete RLHF requires more complex implementation (GAE, PPO clipping, multi-GPU training, etc.) - This code is for educational demonstration only

RL Applications in Embodied Intelligence

Robot Learning: From Simulation to Reality

Sim-to-Real Transfer: - Train policies in simulators (e.g., MuJoCo, PyBullet) - Transfer to real robots (domain randomization, domain adaptation)

Challenges: - Real-world dynamics complex (friction, contact, sensor noise) - Reality gap between simulator and real world

Success Cases: - OpenAI's Dactyl: trained robot hand to solve Rubik's cube with RL (trained in simulation, transferred to real) - Boston Dynamics: quadruped robot locomotion control (combining RL and traditional control)

Offline RL for Robotics

Data Sources: - Human demonstrations (teleoperation) - Random policy exploration - Historical task data

Algorithms: - CQL, IQL, Decision Transformer (see Chapter 10)

Advantages: - No expensive online exploration needed - Utilizes existing data

Applications: - Robomimic: learns robot manipulation from demonstration data - D4RL for Manipulation: offline datasets support robot grasping, pushing, etc.

RL in Autonomous Driving

End-to-End Learning: - Input: sensor data (cameras, radar) - Output: steering, throttle, brake - Use RL to optimize trajectories (maximize safety, comfort, efficiency)

Model-Based RL: - Learn environment model (predict other vehicle behaviors) - Plan in model (MCTS, MPC+RL)

Challenges: - Safety: exploration may be dangerous (requires offline RL or high-fidelity simulation) - Generalization: training environment vs actual road differences

Company Applications: - Waymo: combines RL and imitation learning - Tesla: end-to-end learning (though details not public)

Multimodal RL: Vision-Language-Action

Task: Given language instruction, execute robot task - Input: "pick up red cup" - Output: robot action sequence

Architecture: - Vision encoder: extracts scene features - Language encoder: understands instructions - Policy network: instruction-conditioned policy

Training: - Data: language-vision-action triples - Objective: maximize task success rate

Frontier Work: - CLIPort: uses CLIP embeddings to bridge language and vision - RT-1, RT-2 (Google): large-scale robotics Transformer, language-conditioned RL

In-Depth Q&A

Q1: Why Is RLHF More Effective Than SFT?

SFT's Limitations: - Learns to "mimic" demonstration data, but demonstration data limited (e.g., InstructGPT only has 13k samples) - Cannot generalize to unseen prompt types - Difficult to capture "implicit" preferences (e.g., "polite," "avoid verbosity")

RLHF's Advantages: - Reward model can learn from large amounts of comparison data (33k comparisons > 13k demonstrations) - Comparison data easier to annotate (judge which is better, rather than generate perfect response) - RL optimization directly targets human preferences, not likelihood

Experimental Verification: InstructGPT paper shows RLHF-trained 1.3B model outperforms SFT-trained 175B model.

Q2: Why Can DPO Bypass Reward Model?

Mathematical Insight: DPO observes that RLHF's optimal policy has a closed-form solution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$$

Solving for the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry model, the preference probability becomes:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

Key: This formula depends only on the policy $\pi_\theta$ and the reference policy $\pi_{\text{ref}}$ ($Z(x)$ cancels in the difference); no explicit reward model is needed!

Directly optimizing $\pi_\theta$ to maximize the preference log-likelihood is therefore equivalent to RLHF.

Q3: Will RLAIF Lead to "Model Collapse"?

Model Collapse: - Using AI-generated data to train AI, quality degrades after multiple generations - Reason: AI-generated data distribution bias accumulates

RLAIF's Risk: - Use AI A (e.g., GPT-4) to label data, train AI B - If B approaches A, then use B to label data to train C... may collapse

Mitigation Strategies: 1. Mix Human Data: RLAIF + some human annotation 2. Diverse AI Evaluators: voting from multiple models 3. Regular Calibration: recalibrate with human data periodically 4. Task Diversity: avoid overfitting on single distribution

Experimental Evidence: Current RLAIF papers (1-2 generations training) haven't observed obvious collapse, but long-term effects unknown.

Q4: Why Is PPO the Preferred Algorithm for RLHF?

RL Algorithm Comparison:

DQN/Q-learning: - Suitable for discrete actions - LLM action space is vocabulary (tens of thousands of dimensions), Q-function difficult to represent

A3C/A2C: - Policy gradient + value function - Training unstable (high variance)

PPO: - Clipped objective limits policy update magnitude - Reduces catastrophic updates (avoids sudden policy deterioration) - Easy to implement, robust hyperparameters

RLHF-Specific Challenges: - Huge LM action space (select one token per step) - Sparse rewards (only given at sequence end) - Need stable training (avoid forgetting SFT initialization)

PPO's clipping and KL penalty naturally suit these needs.

Q5: How Does RLHF Handle Multiple Objectives (Helpful, Honest, Harmless)?

Naive Approach: weighted combination of rewards, $r = w_1 r_{\text{helpful}} + w_2 r_{\text{honest}} + w_3 r_{\text{harmless}}$

Challenges: - Weights $w_i$ are difficult to tune (how to balance them?) - Objectives may conflict (e.g., honest vs harmless: "user asks how to commit suicide"— an honest answer vs a refusal)

Improvement Methods:

1. Multiple Reward Models: - Train 3 independent reward models - Use Pareto optimization in RL stage (multi-objective RL)

2. Constitutional AI: - Use rule constraints (e.g., "must refuse harmful requests") - Reward model only models "helpful" and "honest"

3. Human Feedback Specifies Weights: - Let users choose preferences (e.g., "I prioritize safety more") - Adjust the weights $w_i$ based on user preferences

Q6: Why Is Offline RL Important in Robotics?

Online RL Difficulties: - Safety: robot exploration may damage hardware or cause danger - Time Cost: real robot interaction slow (e.g., one grasp takes seconds), collecting millions of samples infeasible - Data Waste: abundant human demonstration data exists, but online RL starts from scratch

Offline RL Advantages: - Utilizes demonstration data, historical task data - Safe (no online exploration needed) - Efficient (parallel training)

Challenges: - Data distribution shift (demonstration data vs optimal policy) - Real robot dynamics complex (simulation data difficult to transfer)

Practical Approach: - Offline pretraining (CQL, IQL) - Online fine-tuning (small amount of safe exploration) - Combine models (learn dynamics model, plan in model)

Q7: How to Evaluate RLHF-Trained Models?

Automatic Metrics: - Reward Model Score: the RM's preference-prediction accuracy on held-out data - KL Divergence: $\mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]$, measures policy deviation from the reference model - Perplexity: language modeling loss on held-out text

Human Evaluation: - Win Rate: humans compare model output vs baseline, calculate "win rate" - Absolute Rating: Likert scale (1-5 points) evaluating helpful, honest, harmless - Task Success Rate: for specific tasks (e.g., code generation), run code to check correctness

NLP Benchmarks: - MMLU (multitask language understanding) - HumanEval (code generation) - TruthfulQA (truthfulness) - But RLHF may perform worse on benchmarks (alignment tax), while actual user experience improves

A/B Testing: - Deploy two versions (RLHF vs baseline), collect user feedback - ChatGPT's success largely based on actual user satisfaction

Q8: What Is "Reward Hacking" in RLHF Training?

Definition: Policy learns to exploit reward model's loopholes, producing high-reward but actually low-quality outputs.

Examples: - Length Hacking: reward model may prefer long text, policy generates extremely long but repetitive/meaningless outputs - Format Hacking: reward model prefers specific format (e.g., lists), policy overuses lists - Sycophancy: policy learns to "please" reward model, generating plausible-sounding but actually incorrect content

Reasons: - Reward model is imperfect proxy, not fully equivalent to human preferences - RL over-optimizes proxy objective

Mitigation Methods: - KL Penalty: limit policy deviation from the SFT initialization (add $\beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]$ to the RLHF objective) - Reward Model Regularization: add regularization terms when training the RM (e.g., length normalization) - Red Teaming: probe the reward model with adversarial examples, find loopholes and fix them - Iterative Updates: periodically retrain the RM with new human feedback
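In PPO-based implementations the KL penalty is typically applied per token, with the scalar RM score added only at the final token. A sketch of that reward shaping (the tensor shapes here are assumptions for illustration):

```python
import torch

def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.02):
    """Per-token reward r_t = -beta * (log pi - log pi_ref), plus the
    reward-model score at the last token of the response.

    rm_score: (batch,); logp_policy, logp_ref: (batch, seq_len) token log-probs.
    """
    rewards = -beta * (logp_policy - logp_ref)
    rewards[:, -1] += rm_score
    return rewards

r = shaped_rewards(
    torch.tensor([1.0]),                  # RM score for the whole sequence
    torch.tensor([[-2.0, -1.0, -3.0]]),   # policy token log-probs
    torch.tensor([[-2.0, -2.0, -2.0]]),   # reference token log-probs
    beta=0.1,
)
```

Any token where the policy drifts above the reference probability is taxed immediately, which makes length hacking and other drastic distribution shifts costly even before the RM score arrives at the end.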

Q9: How Does Constitutional AI Differ from RLHF?

Constitutional AI (CAI): - Proposed by Anthropic, uses predefined rules (constitution) to guide training - Process: 1. Model generates output 2. Evaluate with rules (e.g., "is it harmful?") 3. Model self-corrects (generates improved version) 4. Train with improved version

Difference from RLHF:

RLHF: - Humans label preference data - Reward model implicitly learns human values

CAI: - Humans define explicit rules - AI evaluates whether rules are followed

Advantages: - Interpretable: rules explicit, easy to review - Controllable: directly modify rules to change behavior - Scalable: no extensive human annotation needed

Limitations: - Rules difficult to exhaust (how to define "polite"?) - Rules may conflict (e.g., honest vs harmless)

In Practice: CAI and RLHF often combined (CAI defines hard constraints, RLHF optimizes soft preferences).

Q10: Future Directions for RL Beyond LLMs?

1. Multimodal RLHF: - Not just text, but images, video, audio - Reward models evaluate multimodal outputs (e.g., "is this video helpful?")

2. Online RLHF: - Continuously learn from user interactions - User upvotes/downvotes as real-time feedback - Challenges: distribution shift, privacy

3. Personalized RLHF: - Each user has different preferences - Train user-specific reward models - Meta-learning to generalize across users

4. RL for Reasoning: - LLM reasoning ability still limited (e.g., math, logic) - Use RL to optimize reasoning process (like AlphaGo's MCTS+RL) - Algorithms: Process Reward Model (PRM), STaR

5. RL for Embodied Intelligence: - LLM as high-level planner (generates subgoals) - RL trains low-level executor (robot actions) - Joint training of language-vision-action

6. Safe Alignment: - Beyond "helpful, honest, harmless," research long-term safety - AI alignment theory (e.g., CIRL, IRL) - Mechanism design (making AI objectives naturally align with human objectives)

Q11: How Do RT-1 and RT-2 (Google's Robotics Transformers) Work?

RT-1 (Robotics Transformer 1, 2022): - Input: images + language instructions - Output: robot actions (discretized joint angles, gripper states) - Architecture: vision encoder (EfficientNet extracts image features), language encoder (Universal Sentence Encoder processes instructions), Transformer decoder processes the token sequence - Training: data from 13 robots, 130k demonstrations (700+ tasks); loss is Behavior Cloning (BC), plus a small amount of online RL fine-tuning

RT-2 (2023): - Improvement: initialized from a pretrained Vision-Language Model (VLM) - Backbone: PaLI-X (a large vision-language model) - Training: 1. Pretrain the VLM on web image-text data 2. Fine-tune on robot data (co-fine-tuning: language tasks + robot tasks) - Effect: significantly improved generalization (zero-shot reasoning on new tasks)

Key Innovations: - Large-scale data (RT-1: 130k, RT-2: combines web data) - Multi-task learning (one model handles 700+ tasks) - Language conditioning (natural language instruction control)

RL's Role: - Mainly uses BC (imitation learning) - RL used for online fine-tuning (improves task success rate)
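The "discretized" actions mentioned above are what let a Transformer emit robot commands as tokens: each continuous action dimension is binned (RT-1/RT-2 use 256 bins per dimension). A minimal sketch of that tokenization, with illustrative function names:

```python
def discretize(value, low, high, bins=256):
    """Map a continuous action dimension to a token id in [0, bins-1],
    RT-1/RT-2-style action tokenization (256 bins per dimension)."""
    value = max(low, min(high, value))                # clip to the valid range
    idx = int((value - low) / (high - low) * bins)
    return min(idx, bins - 1)                         # top edge -> last bin

def undiscretize(idx, low, high, bins=256):
    """Recover the bin-center value for a token id."""
    return low + (idx + 0.5) * (high - low) / bins

# e.g. a joint angle in [-1, 1]
token = discretize(0.5, -1.0, 1.0)        # -> 192
approx = undiscretize(token, -1.0, 1.0)   # -> 0.50390625
```

The quantization error (here about 0.004) is bounded by half the bin width, which is why a few hundred bins per dimension suffice for manipulation tasks.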

Q12: How High Is RLHF's Computational Cost?

Training Stage Cost Estimation (using InstructGPT 175B as example):

SFT: - Data: 13k samples - Computation: roughly the cost of fine-tuning GPT-3 on 13k samples (hours on a single multi-GPU machine)

Reward Model Training: - Data: 33k comparisons - Model: 6B parameters (smaller than policy) - Computation: hours

PPO Fine-Tuning: - Data: 256k prompts - Per iteration: generation of 256k responses (dozens of tokens each), reward computation via 256k forward passes (RM + policy), then multiple PPO gradient steps (each requiring advantage estimation, clipping, etc.) - Total computation: roughly equivalent to training on millions of samples for days (multi-machine, multi-GPU)

Comparison: - Pretraining GPT-3: approximately 10^23 FLOPs (thousands of GPU-months) - RLHF (SFT+RM+PPO): approximately 10^21 FLOPs (tens of GPU-months) - RLHF approximately 1-10% of pretraining cost
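As a sanity check on the ratio quoted above, the arithmetic is just a division of the two order-of-magnitude estimates (these are the rough figures from the text, not measured values):

```python
# order-of-magnitude figures quoted above (rough estimates, not measurements)
pretrain_flops = 1e23   # GPT-3 pretraining
rlhf_flops = 1e21       # SFT + RM + PPO combined

ratio = rlhf_flops / pretrain_flops
print(f"RLHF compute is ~{ratio:.0%} of pretraining")  # ~1%
```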

DPO's Cost: - Skips RM and RL, direct supervised learning - Approximately equal to SFT cost (hours-days) - 1-2 orders of magnitude lower than RLHF
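DPO's cost advantage comes from its loss being a plain supervised objective over preference pairs. A minimal sketch of that loss for a single pair, assuming summed sequence log-probabilities under the policy and the frozen reference model (the input values below are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities:
    -log sigmoid(beta * ((log pi_w - log ref_w) - (log pi_l - log ref_l)))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# the policy already prefers the chosen answer more than the reference does,
# so the loss is below log(2) (the value at indifference)
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

No sampling, no reward model, no value network: each update only needs four log-probabilities per pair, which is why DPO's cost is close to ordinary supervised fine-tuning.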

Core Papers

RLHF:

  1. InstructGPT:
    Ouyang et al. (2022). "Training language models to follow instructions with human feedback". NeurIPS.
    https://arxiv.org/abs/2203.02155

  2. ChatGPT Technical Report:
    OpenAI (2022). Blog post.
    https://openai.com/blog/chatgpt

  3. RLHF Survey:
    Wang et al. (2024). "A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More". arXiv.
    https://arxiv.org/abs/2407.16216

DPO:

  1. DPO:
    Rafailov et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". NeurIPS.
    https://arxiv.org/abs/2305.18290

  2. DPO Survey:
    (2024). "A Survey of Direct Preference Optimization". arXiv.

RLAIF:

  1. RLAIF:
    Lee et al. (2023). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback". arXiv.
    https://arxiv.org/abs/2309.00267

  2. Constitutional AI:
    Bai et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv.
    https://arxiv.org/abs/2212.08073

Embodied Intelligence:

  1. RT-1:
    Brohan et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale". arXiv.
    https://arxiv.org/abs/2212.06817

  2. RT-2:
    Brohan et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control". arXiv.
    https://arxiv.org/abs/2307.15818

  3. Dactyl:
    OpenAI (2018). "Learning Dexterous In-Hand Manipulation". arXiv.
    https://arxiv.org/abs/1808.00177

Code Libraries

Summary

Reinforcement learning has evolved from game AI to language model alignment to embodied intelligence, demonstrating its core position in shaping general AI.

RLHF infuses human values into large language models: - Through three-stage pipeline (SFT → reward model → PPO), making models generate more helpful, honest, harmless content - InstructGPT and ChatGPT proved RLHF's effectiveness, driving large-scale LLM applications

DPO simplified RLHF's training complexity: - Directly optimizes policy from preference data, bypassing reward model and RL sampling - Maintains performance while reducing computational cost, enabling RLHF democratization

RLAIF replaces human annotation with AI feedback: - Reduces data collection cost 10x+, improves scalability - Constitutional AI and other methods combine rules with AI feedback, enhancing controllability

RL in Embodied Intelligence: - From offline demonstration data to online fine-tuning, RL helps robots learn complex operations - Multimodal learning (language-vision-action) opens new chapter for general agents

In the future, reinforcement learning will deeply integrate with large-scale pretraining, multimodal learning, and causal reasoning — from conversational assistants to autonomous driving, from research assistants to home robots, RL is defining the new paradigm for AI-human interaction. The reinforcement learning series concludes here, but RL's journey has just begun.

  • Post title: Reinforcement Learning (12): RLHF and Large Language Model Applications
  • Post author: Chen Kai
  • Create time: 2024-10-04 15:00:00
  • Post link: https://www.chenk.top/reinforcement-learning-12-rlhf-and-llm-applications/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.