The breakthrough progress of Large Language Models (LLMs)— from GPT-3 to ChatGPT, from Claude to Gemini — stems not only from model scaling and pretraining data growth, but crucially from the introduction of Reinforcement Learning from Human Feedback (RLHF). While pretrained language models can generate fluent text, they often produce harmful content, misinformation, or responses misaligned with user intent. RLHF collects human preference data on model outputs, trains reward models to capture human values, then uses reinforcement learning (PPO) to fine-tune models toward more helpful, honest, and harmless content. InstructGPT systematized the RLHF pipeline, ChatGPT brought it to mainstream awareness, while DPO (Direct Preference Optimization) and RLAIF (RL from AI Feedback) simplified training complexity and data collection costs. Beyond language, reinforcement learning plays a core role in embodied intelligence (robotics, autonomous driving)— from sim-to-real policy transfer to offline-to-online fine-tuning, RL is shaping the next generation of general agents. This chapter systematically examines RLHF's technical details, DPO's theoretical innovations, RLAIF's practical approaches, and RL applications in multimodal and embodied intelligence, with complete code to help you implement a simplified RLHF pipeline.
RLHF: From Pretraining to Human Alignment
Why Do We Need RLHF?
Limitations of Pretrained Language Models: - Misaligned objectives: pretraining maximizes next-token prediction likelihood, not human preference - As a result, fluent models can still produce harmful content, misinformation, or responses misaligned with user intent
Value of Human Feedback: - Captures complex, implicit human preferences (e.g., "helpful," "polite," "avoid bias") - More flexible than manual rules, more efficient than supervised learning (only requires comparing two outputs, not generating perfect answers)
RLHF's Goal: - Align model outputs with human values - Maximize human ratings (reward) rather than likelihood
RLHF's Three-Stage Pipeline
Stage 1: Supervised Fine-Tuning (SFT)
Starting from a pretrained model (e.g., GPT-3), fine-tune on high-quality demonstration data: - Collect human-labeled (prompt, desired response) pairs - Fine-tune with standard cross-entropy loss: $\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y)}\big[\sum_t \log \pi_\theta(y_t \mid x, y_{<t})\big]$
Purpose: - Provide "formatted" output initialization for model (e.g., dialogue format, instruction following) - Reduce exploration difficulty in RL training
Data Scale: InstructGPT used ~13k demonstrations (high-quality responses written by labelers).
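The SFT objective above is plain next-token cross-entropy on the demonstration responses. A minimal sketch, with toy tensors standing in for real tokenizer/model outputs; masking the loss to response tokens only is the common convention:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, response_mask):
    """Next-token cross-entropy, applied to response tokens only.

    logits:        (B, T, V) model outputs
    target_ids:    (B, T)    tokens the model should predict
    response_mask: (B, T)    1.0 on response tokens, 0.0 on prompt tokens
                   (the loss is usually masked out on the prompt)
    """
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return -(tok_logp * response_mask).sum() / response_mask.sum()

# toy tensors standing in for real model outputs: batch 2, length 5, vocab 10
logits = torch.randn(2, 5, 10)
targets = torch.randint(0, 10, (2, 5))
mask = torch.tensor([[0, 0, 1, 1, 1], [0, 1, 1, 1, 1]]).float()
loss = sft_loss(logits, targets, mask)
```

The masked mean is exactly `F.cross_entropy` evaluated on the response positions.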
Stage 2: Reward Model Training
Train a reward model $r_\phi(x, y)$ to predict human preferences:

Data: humans compare pairs of responses $(y_w, y_l)$ to the same prompt, with $y_w$ preferred.

Loss (Bradley-Terry): $\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$

Architecture: Typically based on the SFT model, with the last (unembedding) layer removed and a linear head added to output a scalar reward.
Data Scale: InstructGPT used ~33k comparisons (4-9 outputs per prompt, pairwise comparisons).
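The Bradley-Terry loss is only a few lines of code; a minimal sketch with toy scalar rewards rather than a real reward-model forward pass:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of P(chosen > rejected) = sigmoid(r_c - r_r)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# toy scalar rewards for three comparison pairs
r_c = torch.tensor([2.0, 1.5, 0.3])
r_r = torch.tensor([0.5, 1.0, 0.4])
loss = bradley_terry_loss(r_c, r_r)   # shrinks as the margin r_c - r_r grows
```

At zero margin the loss equals log 2 (a coin flip); it decreases monotonically as the reward model separates the preferred response from the rejected one.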
Stage 3: PPO Fine-Tuning (Policy Optimization)
Use reinforcement learning (PPO) to optimize the policy $\pi_\theta$ against the reward model.

Objective Function:

$\max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)}\Big[r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}\Big]$

- First term $r_\phi(x, y)$: Reward model score, encourages model to generate high-scoring outputs - Second term $\beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$: KL divergence regularization, prevents model from deviating too far from SFT initialization (avoids "reward hacking": generating outputs that score high with the reward model but appear garbage to humans)

PPO Algorithm: - Sample prompts from the training set, generate responses with the current policy - Score each full response with the KL-shaped reward - Update the policy with PPO's clipped surrogate objective
Training Details: - Each iteration samples batch of prompts, generates responses, computes rewards, updates policy - Simultaneously applies supervised loss on SFT data (prevents forgetting) - Iterates thousands of steps until reward saturates
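The two core pieces of this stage can be sketched as follows, assuming per-sequence log-probabilities and a precomputed advantage; a real implementation adds GAE, a value head, and per-token KL accounting:

```python
import torch

def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Sequence-level reward: RM score minus beta * KL estimate vs the SFT policy."""
    return rm_score - beta * (logp_policy - logp_sft)

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO's clipped surrogate on (per-sequence) log-probabilities."""
    ratio = torch.exp(logp_new - logp_old)          # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()    # pessimistic bound

# toy numbers: the policy doubled the probability of a good (advantage=1) sample;
# the ratio 2.0 is clipped to 1.2, which limits the update magnitude
loss = ppo_clip_loss(torch.log(torch.tensor([2.0])),
                     torch.tensor([0.0]),
                     torch.tensor([1.0]))
```

The clipping is what makes PPO tolerate multiple gradient steps on the same batch of sampled responses.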
InstructGPT: Systematic RLHF Practice
InstructGPT's Training Pipeline
OpenAI published InstructGPT paper in 2022, systematizing RLHF pipeline:
1. Data Collection: - SFT data: 13k prompts + human-labeled responses - Comparison data: 33k prompts, 4-9 model outputs per prompt, humans label preference rankings - Prompt sources: Real requests from API users (privacy-removed) + diverse prompts written by labelers
2. Model Scales: - Based on GPT-3's 1.3B, 6B, 175B parameter models - Train all sizes in both SFT and RL stages, compare effectiveness
3. Reward Model: - 6B parameter model performs best (more stable than 175B parameter reward model) - Input: prompt + response, output: scalar reward - Training: optimize Bradley-Terry loss on comparison data
4. PPO Fine-Tuning: - Initialization: start from the SFT model - KL coefficient: a KL penalty against the SFT policy keeps updates conservative - Mixes pretraining gradients into the objective (the "PPO-ptx" variant) to limit regressions on public NLP tasks
InstructGPT's Key Findings
1. Model Scale vs Data Quality: - 1.3B parameter InstructGPT (RLHF-trained) outperforms 175B parameter GPT-3 (pretrained only) in human evaluation - Shows alignment training more important than scale
2. Generalization Ability: - On held-out prompts, InstructGPT performs well (unseen task types) - Reward model generalizes to new prompt distributions
3. Alignment Tax: - After RLHF training, model performance slightly drops on some NLP benchmarks (e.g., SQuAD) - But actual user experience significantly improves
4. Labeler Consistency: - Different labelers show high preference consistency (>70%) - But greater divergence on subjective tasks (e.g., creative writing)
InstructGPT's Limitations
1. Reward Model Limitations: - Reward model can be "hacked" (producing high-scoring but meaningless outputs) - Example: generating extremely long but repetitive text (reward model may score high due to length)
2. Preference Data Bias: - Labeler preferences may reflect group biases - Reward model inherits these biases
3. Computational Cost: - RLHF training expensive (requires online sampling + multiple forward passes) - PPO updates unstable (requires careful hyperparameter tuning)
ChatGPT: Large-Scale RLHF Application
ChatGPT's Technical Evolution
ChatGPT (released November 2022): - Based on GPT-3.5 (improved GPT-3) - Complete RLHF pipeline (SFT → reward model → PPO) - Dialogue optimization: multi-turn conversation ability, context understanding
GPT-4 (released March 2023): - Multimodal input (text + images) - Stronger reasoning ability, fewer hallucinations - More complex RLHF: multi-objective optimization (helpful, honest, harmless)
ChatGPT's Training Details (Inferred)
OpenAI hasn't released complete details, but from papers and public information:
1. SFT Data: - Hundreds of thousands of dialogue samples (human-labeled) - Covers diverse tasks: Q&A, creative writing, code generation, translation, etc.
2. Reward Model: - Multiple reward models (separately modeling "helpful," "honest," "harmless") - Weighted combination: $r(x, y) = \sum_i w_i\, r_i(x, y)$, with the weights balancing the three objectives
3. Safety Layer: - Content moderation model: filters harmful outputs - Rule-based system: hard constraints (e.g., refusing illegal requests)
ChatGPT's Impact
1. User Experience Improvement: - Fluent dialogue, accurate instruction understanding - Refuses inappropriate requests (e.g., "teach me to make bombs")
2. New Challenges: - Jailbreaking: Users design prompts to bypass safety restrictions - Bias: Model outputs may still contain gender, racial bias - Hallucinations: Model sometimes generates plausible-sounding but factually incorrect content
3. Driving RLHF Research: - ChatGPT's success sparked academic interest in RLHF - Open-source alternatives: LLaMA+RLHF, Alpaca, Vicuna, etc.
DPO: Direct Preference Optimization
Traditional RLHF's Problems
Complexity: - Requires three-stage training (SFT → reward model → PPO) - PPO training unstable, requires careful hyperparameter tuning
Computational Cost: - RL stage requires online sampling (generating large amounts of text) - Each update requires multiple forward passes (computing rewards, advantages, etc.)
Reward Model Error Propagation: - Reward model errors affect RL training - Reward hacking: policy learns to exploit reward model's loopholes
DPO's Core Idea
Insight: RLHF's optimal policy has a closed-form solution.

Under the standard RLHF objective $\max_\pi \mathbb{E}[r(x, y)] - \beta\, D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})$, the optimum is

$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\big(\tfrac{1}{\beta} r(x, y)\big)$

Inversely solve for the reward function:

$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$

DPO Loss: Substitute this reward into the Bradley-Terry preference model; the intractable $\log Z(x)$ cancels in the pairwise difference, yielding a loss that directly optimizes the policy:

$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$
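The DPO loss translates almost directly into code. A minimal sketch taking precomputed sequence log-probabilities under the trained policy and the frozen reference:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss from sequence log-probs of chosen (w) and rejected (l)
    responses under the trained policy (pi) and the frozen reference (ref)."""
    r_w = beta * (logp_pi_w - logp_ref_w)   # implicit reward of chosen
    r_l = beta * (logp_pi_l - logp_ref_l)   # implicit reward of rejected
    return -F.logsigmoid(r_w - r_l).mean()

z = torch.zeros(3)
chance = dpo_loss(z, z, z, z)               # policy == reference: loss = log 2
```

Note this is ordinary supervised learning: one forward pass per response under each of the two models, no sampling loop, no reward model.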
DPO's Advantages
Simple: - Only requires one training stage (skips reward model and RL) - Loss function is standard cross-entropy, optimize with gradient descent
Stable: - No complex PPO sampling and updates - No reward model error propagation
Efficient: - Low computational cost (no online sampling needed) - Fast training (direct supervised learning)
DPO's Experimental Results
Paper Experiments (Rafailov et al., 2023): - Tasks: sentiment control, summarization, dialogue - Data: TL;DR (summarization), Anthropic HH (dialogue) - Results: DPO performance matches or exceeds PPO-based RLHF
Subsequent Improvements: - ODPO (Offset DPO): considers preference strength (not all preferences equally important) - IPO (Identity Preference Optimization): improves DPO's theoretical foundation
DPO's Limitations
Implicit Reward Modeling: - DPO implicitly learns reward, but cannot explicitly view reward values - Difficult to debug (why did model choose this output?)
Sensitive to Data Quality: - Requires high-quality preference pairs; noisy or inconsistent labels feed directly into the policy, with no intermediate reward model to smooth them out
Generalization Ability: - Underperforms RLHF on some tasks (especially requiring complex reasoning)
RLAIF: Replacing Human Feedback with AI Feedback
Human Feedback Bottleneck
High Cost: - Labeler time cost (InstructGPT used 40 full-time labelers over months) - Quality control cost (requires training, quality checks, consistency checks)
Poor Scalability: - Human annotation slow (each comparison takes tens of seconds) - Difficult to continuously collect new data
Bias Accumulation: - Labeler population may not represent user population - Subjective tasks (e.g., creative writing) difficult to obtain consistent preferences
RLAIF's Core Idea
Use AI Models to Generate Preference Labels: - Given a prompt $x$ and two responses $y_1, y_2$, ask a strong LLM which is better, using an evaluation prompt such as:

```
Given the following question and two responses, which response is better?
Question: {x}
Response A: {y_1}
Response B: {y_2}
Answer: (A or B)
```
Training Pipeline: - Use the AI judge to generate preference data - Train a reward model on the AI labels (or use AI scores as rewards directly) - Run RL (PPO) exactly as in standard RLHF
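The labeling loop might be sketched as follows; `sample_fn` and `judge_fn` are hypothetical stand-ins (the toy judge here simply prefers longer answers, whereas in practice the judge would wrap a strong LLM queried with the comparison prompt above):

```python
import random

random.seed(0)

def collect_ai_preferences(prompts, sample_fn, judge_fn):
    """For each prompt, sample two responses and ask an AI judge which is better.

    sample_fn(prompt) -> response text (the policy being trained)
    judge_fn(prompt, a, b) -> "A" or "B" (in practice, a strong LLM; here a toy)
    """
    data = []
    for x in prompts:
        y1, y2 = sample_fn(x), sample_fn(x)
        verdict = judge_fn(x, y1, y2)
        chosen, rejected = (y1, y2) if verdict == "A" else (y2, y1)
        data.append({"prompt": x, "chosen": chosen, "rejected": rejected})
    return data

# toy stand-ins: the "policy" appends a random number of '!', the "judge"
# prefers the longer response (a deliberately hackable heuristic)
prefs = collect_ai_preferences(
    ["q1", "q2", "q3"],
    sample_fn=lambda x: x + "!" * random.randint(1, 5),
    judge_fn=lambda x, a, b: "A" if len(a) >= len(b) else "B",
)
```

The resulting `prefs` records plug into the same reward-model training (or DPO) step as human-labeled comparisons.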
RLAIF Variants
1. Constitutional AI (Anthropic, 2022): - Use predefined rules (constitution) to guide AI evaluation - Rule examples: "output should be honest, helpful, harmless" - AI evaluation references these rules
2. Self-Critique: - Model generates output, then self-evaluates and improves - Iteration: generate → evaluate → revise → generate
3. Direct-RLAIF: - Skip reward model, directly use AI scoring as reward - During each RL sampling, call AI model online for scoring
RLAIF's Experimental Results
Paper (Lee et al., 2023): - Tasks: summarization, dialogue, harmlessness - Comparison: RLAIF vs RLHF - Results: RLAIF performance approaches RLHF (even exceeds on some tasks)
Key Findings: - AI feedback high consistency (>85% aligns with human preferences) - Cost reduction 10x+ (no human annotation needed) - Strong scalability (quickly collect large amounts of data)
RLAIF's Limitations
AI Evaluation Bias: - AI models may inherit pretraining data biases - Evaluation may be overly conservative or overly aggressive
Circular Dependency: - Using AI A to train AI B may lead to error accumulation - "Model collapse": performance degrades after multiple generations of training
Difficulty Capturing Subtle Preferences: - Some human preferences hard to express via prompts (e.g., aesthetics, emotional nuance)
Complete Code Implementation: Simplified RLHF
Below implements simplified RLHF pipeline, including: - Synthetic data generation (simulate prompts and responses) - Reward model training (based on preference pairs) - PPO fine-tuning (simplified version, using REINFORCE+baseline)
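The listing below is a compact, runnable sketch of that pipeline, keeping the component names referenced in the code analysis that follows (`create_comparison_data`, `RewardModel`, `train_reward_model`, `SimpleRLHF`, `compute_reward`, `train_step`). As an assumption for self-containedness, a tiny GRU language model (`TinyLM`) stands in for GPT-2 and prompts/responses are random token IDs; treat it as an educational approximation, not a production implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, MAX_LEN, PAD = 50, 12, 0

class TinyLM(nn.Module):
    """Tiny GRU language model standing in for GPT-2."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)                          # (B, T, VOCAB) logits

    def sample(self, prompt, n_new):
        """Autoregressive sampling; returns sequences and per-token log-probs."""
        x, logps = prompt.clone(), []
        for _ in range(n_new):
            dist = torch.distributions.Categorical(logits=self(x)[:, -1])
            tok = dist.sample()
            logps.append(dist.log_prob(tok))
            x = torch.cat([x, tok[:, None]], dim=1)
        return x, torch.stack(logps, dim=1)          # (B, P+n), (B, n)

class RewardModel(nn.Module):
    """Sequence encoder with a scalar reward head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        _, h = self.rnn(self.emb(x))
        return self.head(h[-1]).squeeze(-1)          # (B,) scalar rewards

def create_comparison_data(n=256):
    """Synthetic preference pairs; 'chosen' is the longer (unpadded) response."""
    chosen = torch.randint(5, VOCAB, (n, MAX_LEN))
    rejected = torch.randint(5, VOCAB, (n, MAX_LEN))
    rejected[:, MAX_LEN // 2:] = PAD                 # rejected responses are shorter
    return chosen, rejected

def train_reward_model(rm, chosen, rejected, epochs=50):
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()  # Bradley-Terry
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

class SimpleRLHF:
    """REINFORCE-with-baseline trainer plus KL penalty (a PPO stand-in)."""
    def __init__(self, policy, ref, rm, beta=0.1, lr=1e-3):
        self.policy, self.ref, self.rm, self.beta = policy, ref, rm, beta
        self.opt = torch.optim.Adam(policy.parameters(), lr=lr)

    def compute_reward(self, seqs, logps_pi, logps_ref):
        """RM score minus a per-sequence KL estimate against the frozen reference."""
        with torch.no_grad():
            r = self.rm(seqs)
        return r - self.beta * (logps_pi - logps_ref).sum(dim=1)

    def train_step(self, prompts, n_new=8):
        seqs, logps = self.policy.sample(prompts, n_new)
        with torch.no_grad():                        # reference log-probs of samples
            pred = self.ref(seqs)[:, prompts.size(1) - 1:-1]
            gen = seqs[:, prompts.size(1):]
            ref_logps = torch.distributions.Categorical(logits=pred).log_prob(gen)
        reward = self.compute_reward(seqs, logps.detach(), ref_logps)
        advantage = reward - reward.mean()           # mean baseline reduces variance
        loss = -(advantage * logps.sum(dim=1)).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        return reward.mean().item()

policy, ref = TinyLM(), TinyLM()
ref.load_state_dict(policy.state_dict())             # frozen reference = "SFT" init
rm = RewardModel()
chosen, rejected = create_comparison_data()
rm_loss = train_reward_model(rm, chosen, rejected)
trainer = SimpleRLHF(policy, ref, rm)
prompts = torch.randint(5, VOCAB, (16, 4))
for _ in range(5):
    mean_r = trainer.train_step(prompts)
print(f"reward-model loss {rm_loss:.3f}, mean shaped reward {mean_r:.3f}")
```

The REINFORCE-plus-baseline update replaces full PPO (no clipping, no GAE), matching the simplification stated above.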
Code Analysis
Data Generation: - SyntheticDataset: synthesizes prompts and responses - create_comparison_data: generates 2 responses per prompt, simulates preference (simplified as "longer is better")

Reward Model: - RewardModel: based on GPT-2, adds a scalar output head - train_reward_model: trains with the Bradley-Terry loss

RLHF Training: - SimpleRLHF: simplified trainer - compute_reward: RM score minus KL penalty - train_step: generates a response, computes the reward, updates the policy with REINFORCE
Note: - Complete RLHF requires more complex implementation (GAE, PPO clipping, multi-GPU training, etc.) - This code is for educational demonstration only
RL Applications in Embodied Intelligence
Robot Learning: From Simulation to Reality
Sim-to-Real Transfer: - Train policies in simulators (e.g., MuJoCo, PyBullet) - Transfer to real robots (domain randomization, domain adaptation)
Challenges: - Real-world dynamics complex (friction, contact, sensor noise) - Reality gap between simulator and real world
Success Cases: - OpenAI's Dactyl: trained robot hand to solve Rubik's cube with RL (trained in simulation, transferred to real) - Boston Dynamics: quadruped robot locomotion control (combining RL and traditional control)
Offline RL for Robotics
Data Sources: - Human demonstrations (teleoperation) - Random policy exploration - Historical task data
Algorithms: - CQL, IQL, Decision Transformer (see Chapter 10)
Advantages: - No expensive online exploration needed - Utilizes existing data
Applications: - Robomimic: learns robot manipulation from demonstration data - D4RL for Manipulation: offline datasets support robot grasping, pushing, etc.
RL in Autonomous Driving
End-to-End Learning: - Input: sensor data (cameras, radar) - Output: steering, throttle, brake - Use RL to optimize trajectories (maximize safety, comfort, efficiency)
Model-Based RL: - Learn environment model (predict other vehicle behaviors) - Plan in model (MCTS, MPC+RL)
Challenges: - Safety: exploration may be dangerous (requires offline RL or high-fidelity simulation) - Generalization: training environment vs actual road differences
Company Applications: - Waymo: combines RL and imitation learning - Tesla: end-to-end learning (though details not public)
Multimodal RL: Vision-Language-Action
Task: Given language instruction, execute robot task - Input: "pick up red cup" - Output: robot action sequence
Architecture: - Vision encoder: extracts scene features - Language encoder: understands instructions - Policy network: conditional policy $\pi(a \mid \text{vision}, \text{instruction})$
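The fusion pattern above can be illustrated with a toy language-conditioned policy; the linear layers are stand-ins for real encoders (systems like CLIPort or RT-1 use CLIP/EfficientNet image features and sentence encoders instead):

```python
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    """Toy vision-language-action policy: fuse image and instruction
    features, then output a distribution over discrete actions."""
    def __init__(self, img_dim=64, txt_dim=32, n_actions=6):
        super().__init__()
        self.vision = nn.Linear(img_dim, 128)    # stand-in vision encoder
        self.language = nn.Linear(txt_dim, 128)  # stand-in language encoder
        self.policy = nn.Sequential(             # conditional policy head
            nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.vision(img_feat), self.language(txt_feat)], dim=-1)
        return torch.softmax(self.policy(z), dim=-1)  # action probabilities

pi = ConditionalPolicy()
probs = pi(torch.randn(4, 64), torch.randn(4, 32))    # batch of 4 observations
```

Concatenation is the simplest conditioning scheme; FiLM layers or cross-attention are common upgrades in real VLA models.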
Frontier Work: - CLIPort: uses CLIP embeddings to bridge language and vision - RT-1, RT-2 (Google): large-scale robotics Transformer, language-conditioned RL
In-Depth Q&A
Q1: Why Is RLHF More Effective Than SFT?
SFT's Limitations: - Learns to "mimic" demonstration data, but demonstration data limited (e.g., InstructGPT only has 13k samples) - Cannot generalize to unseen prompt types - Difficult to capture "implicit" preferences (e.g., "polite," "avoid verbosity")
RLHF's Advantages: - Reward model can learn from large amounts of comparison data (33k comparisons > 13k demonstrations) - Comparison data easier to annotate (judge which is better, rather than generate perfect response) - RL optimization directly targets human preferences, not likelihood
Experimental Verification: InstructGPT paper shows RLHF-trained 1.3B model outperforms SFT-trained 175B model.
Q2: Why Can DPO Bypass Reward Model?
Mathematical Insight: DPO observes that RLHF's optimal policy has the closed-form solution $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \exp\big(\tfrac{1}{\beta} r(x, y)\big)$, which can be inverted to $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$.

Key: after substituting this reward into the Bradley-Terry model, the partition term $\log Z(x)$ cancels, so the preference likelihood depends only on the policy and the frozen reference. One can therefore directly optimize the policy on preference pairs; the reward model exists only implicitly inside the policy.
Q3: Will RLAIF Lead to "Model Collapse"?
Model Collapse: - Using AI-generated data to train AI, quality degrades after multiple generations - Reason: AI-generated data distribution bias accumulates
RLAIF's Risk: - Use AI A (e.g., GPT-4) to label data, train AI B - If B approaches A, then use B to label data to train C... may collapse
Mitigation Strategies: 1. Mix Human Data: RLAIF + some human annotation 2. Diverse AI Evaluators: voting from multiple models 3. Regular Calibration: recalibrate with human data periodically 4. Task Diversity: avoid overfitting on single distribution
Experimental Evidence: Current RLAIF papers (1-2 generations training) haven't observed obvious collapse, but long-term effects unknown.
Q4: Why Is PPO the Preferred Algorithm for RLHF?
RL Algorithm Comparison:
DQN/Q-learning: - Suitable for discrete actions - LLM action space is vocabulary (tens of thousands of dimensions), Q-function difficult to represent
A3C/A2C: - Policy gradient + value function - Training unstable (high variance)
PPO: - Clipped objective limits policy update magnitude - Reduces catastrophic updates (avoids sudden policy deterioration) - Easy to implement, robust hyperparameters
RLHF-Specific Challenges: - Huge LM action space (select one token per step) - Sparse rewards (only given at sequence end) - Need stable training (avoid forgetting SFT initialization)
PPO's clipping and KL penalty naturally suit these needs.
Q5: How Does RLHF Handle Multiple Objectives (Helpful, Honest, Harmless)?
Naive Approach: weighted combination of rewards, $r = w_1 r_{\text{helpful}} + w_2 r_{\text{honest}} + w_3 r_{\text{harmless}}$

Challenges: - Weights $w_i$ are hard to set, and the right trade-off varies by task and user - The objectives can conflict (e.g., honest vs harmless), so a single scalar hides the tension
Improvement Methods:
1. Multiple Reward Models: - Train 3 independent reward models - Use Pareto optimization in RL stage (multi-objective RL)
2. Constitutional AI: - Use rule constraints (e.g., "must refuse harmful requests") - Reward model only models "helpful" and "honest"
3. Human Feedback Specifies Weights: - Let users choose preferences (e.g., "I prioritize safety more") - Adjust the reward weights $w_i$ accordingly
Q6: Why Is Offline RL Important in Robotics?
Online RL Difficulties: - Safety: robot exploration may damage hardware or cause danger - Time Cost: real robot interaction slow (e.g., one grasp takes seconds), collecting millions of samples infeasible - Data Waste: abundant human demonstration data exists, but online RL starts from scratch
Offline RL Advantages: - Utilizes demonstration data, historical task data - Safe (no online exploration needed) - Efficient (parallel training)
Challenges: - Data distribution shift (demonstration data vs optimal policy) - Real robot dynamics complex (simulation data difficult to transfer)
Practical Approach: - Offline pretraining (CQL, IQL) - Online fine-tuning (small amount of safe exploration) - Combine models (learn dynamics model, plan in model)
Q7: How to Evaluate RLHF-Trained Models?
Automatic Metrics: - Reward Model Score: the RM's preference prediction accuracy on held-out data - KL Divergence: $D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$, measuring how far the policy has drifted from its SFT initialization
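The KL drift metric can be computed directly from policy and SFT logits over the same token positions; a small sketch:

```python
import torch
import torch.nn.functional as F

def mean_token_kl(policy_logits, sft_logits):
    """Average per-token KL(policy || SFT), a drift metric during/after RLHF.

    Both inputs: (batch, seq_len, vocab) logits for the same token positions.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(sft_logits, dim=-1)
    kl = (logp.exp() * (logp - logq)).sum(dim=-1)   # (batch, seq_len)
    return kl.mean()

policy_out = torch.randn(2, 5, 10)                  # toy logits
drift = mean_token_kl(policy_out, torch.randn(2, 5, 10))
```

A rising KL with flat or falling human ratings is a classic early warning of reward hacking.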
Human Evaluation: - Win Rate: humans compare model output vs baseline, calculate "win rate" - Absolute Rating: Likert scale (1-5 points) evaluating helpful, honest, harmless - Task Success Rate: for specific tasks (e.g., code generation), run code to check correctness
NLP Benchmarks: - MMLU (multitask language understanding) - HumanEval (code generation) - TruthfulQA (truthfulness) - But RLHF may perform worse on benchmarks (alignment tax), while actual user experience improves
A/B Testing: - Deploy two versions (RLHF vs baseline), collect user feedback - ChatGPT's success largely based on actual user satisfaction
Q8: What Is "Reward Hacking" in RLHF Training?
Definition: Policy learns to exploit reward model's loopholes, producing high-reward but actually low-quality outputs.
Examples: - Length Hacking: reward model may prefer long text, policy generates extremely long but repetitive/meaningless outputs - Format Hacking: reward model prefers specific format (e.g., lists), policy overuses lists - Sycophancy: policy learns to "please" reward model, generating plausible-sounding but actually incorrect content
Reasons: - Reward model is imperfect proxy, not fully equivalent to human preferences - RL over-optimizes proxy objective
Mitigation Methods: - KL Penalty: limit policy deviation from the SFT initialization (add a $-\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}$ term to the reward) - Refresh the reward model periodically with new human comparisons that target discovered exploits - Spot-check the highest-reward samples with human review
Q9: How Does Constitutional AI Differ from RLHF?
Constitutional AI (CAI): - Proposed by Anthropic, uses predefined rules (constitution) to guide training - Process: 1. Model generates output 2. Evaluate with rules (e.g., "is it harmful?") 3. Model self-corrects (generates improved version) 4. Train with improved version
Difference from RLHF:
RLHF: - Humans label preference data - Reward model implicitly learns human values
CAI: - Humans define explicit rules - AI evaluates whether rules are followed
Advantages: - Interpretable: rules explicit, easy to review - Controllable: directly modify rules to change behavior - Scalable: no extensive human annotation needed
Limitations: - Rules difficult to exhaust (how to define "polite"?) - Rules may conflict (e.g., honest vs harmless)
In Practice: CAI and RLHF often combined (CAI defines hard constraints, RLHF optimizes soft preferences).
Q10: Future Directions for RL Beyond LLMs?
1. Multimodal RLHF: - Not just text, but images, video, audio - Reward models evaluate multimodal outputs (e.g., "is this video helpful?")
2. Online RLHF: - Continuously learn from user interactions - User upvotes/downvotes as real-time feedback - Challenges: distribution shift, privacy
3. Personalized RLHF: - Each user has different preferences - Train user-specific reward models - Meta-learning to generalize across users
4. RL for Reasoning: - LLM reasoning ability still limited (e.g., math, logic) - Use RL to optimize reasoning process (like AlphaGo's MCTS+RL) - Algorithms: Process Reward Model (PRM), STaR
5. RL for Embodied Intelligence: - LLM as high-level planner (generates subgoals) - RL trains low-level executor (robot actions) - Joint training of language-vision-action
6. Safe Alignment: - Beyond "helpful, honest, harmless," research long-term safety - AI alignment theory (e.g., CIRL, IRL) - Mechanism design (making AI objectives naturally align with human objectives)
Q11: How Do RT-1 and RT-2 (Google's Robotics Transformers) Work?
RT-1 (Robotics Transformer 1, 2022): - Input: images + language instructions - Output: robot actions (discretized joint angles, grasp states) - Architecture: - Vision encoder: EfficientNet extracts image features - Language encoder: Universal Sentence Encoder processes instructions - Transformer: a decoder over the resulting token sequence outputs discretized action tokens
RT-2 (2023): - Improvement: initialized with pretrained VLM (Vision-Language Model) - Backbone: PaLI-X (vision-language large model) - Training: 1. Pretrain VLM on web image-text data 2. Fine-tune on robot data (co-fine-tuning: language tasks + robot tasks) - Effect: significantly improved generalization (zero-shot reasoning on new tasks)
Key Innovations: - Large-scale data (RT-1: ~130k robot demonstrations; RT-2: additionally leverages web-scale image-text data) - Multi-task learning (one model handles 700+ tasks) - Language conditioning (natural language instruction control)
RL's Role: - Mainly uses BC (imitation learning) - RL used for online fine-tuning (improves task success rate)
Q12: How High Is RLHF's Computational Cost?
Training Stage Cost Estimation (using InstructGPT 175B as example):
SFT: - Data: 13k samples - Computation: approximately fine-tuning GPT-3 on 13k samples (hours, single machine multi-GPU)
Reward Model Training: - Data: 33k comparisons - Model: 6B parameters (smaller than policy) - Computation: hours
PPO Fine-Tuning: - Data: 256k prompts - Each iteration: - Generation: 256k responses (dozens of tokens each) - Compute rewards: 256k forward passes (RM + policy) - PPO updates: multiple gradient steps (each requires computing advantages, clipping, etc.) - Total computation: approximately training on millions of samples for days (multi-machine multi-GPU)
Comparison: - Pretraining GPT-3: approximately 10^23 FLOPs (thousands of GPU-months) - RLHF (SFT+RM+PPO): approximately 10^21 FLOPs (tens of GPU-months) - RLHF approximately 1-10% of pretraining cost
DPO's Cost: - Skips RM and RL, direct supervised learning - Approximately equal to SFT cost (hours-days) - 1-2 orders of magnitude lower than RLHF
Related Papers and Resources
Core Papers
RLHF:
InstructGPT:
Ouyang et al. (2022). "Training language models to follow instructions with human feedback". NeurIPS.
https://arxiv.org/abs/2203.02155

ChatGPT Technical Report:
OpenAI (2022). Blog post.
https://openai.com/blog/chatgpt

RLHF Survey:
Wang et al. (2024). "A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More". arXiv.
https://arxiv.org/abs/2407.16216
DPO:
DPO:
Rafailov et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". NeurIPS.
https://arxiv.org/abs/2305.18290

DPO Survey:
(2024). "A Survey of Direct Preference Optimization". arXiv.
RLAIF:
RLAIF:
Lee et al. (2023). "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback". arXiv.
https://arxiv.org/abs/2309.00267

Constitutional AI:
Bai et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv.
https://arxiv.org/abs/2212.08073
Embodied Intelligence:
RT-1:
Brohan et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale". arXiv.
https://arxiv.org/abs/2212.06817

RT-2:
Brohan et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control". arXiv.
https://arxiv.org/abs/2307.15818

Dactyl:
OpenAI (2018). "Learning Dexterous In-Hand Manipulation". arXiv.
https://arxiv.org/abs/1808.00177
Code Libraries
TRL (Transformer Reinforcement Learning):
https://github.com/huggingface/trl
HuggingFace's RLHF library, supports PPO, DPO

OpenAI Baselines:
https://github.com/openai/baselines
Includes PPO implementation

Anthropic's RLHF:
https://github.com/anthropics/hh-rlhf
Helpful & Harmless dataset

DeepSpeed-Chat:
https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat
Efficient RLHF training
Summary
Reinforcement learning has evolved from game AI to language model alignment to embodied intelligence, demonstrating its core position in shaping general AI.
RLHF infuses human values into large language models: - Through three-stage pipeline (SFT → reward model → PPO), making models generate more helpful, honest, harmless content - InstructGPT and ChatGPT proved RLHF's effectiveness, driving large-scale LLM applications
DPO simplified RLHF's training complexity: - Directly optimizes policy from preference data, bypassing reward model and RL sampling - Maintains performance while reducing computational cost, enabling RLHF democratization
RLAIF replaces human annotation with AI feedback: - Reduces data collection cost 10x+, improves scalability - Constitutional AI and other methods combine rules with AI feedback, enhancing controllability
RL in Embodied Intelligence: - From offline demonstration data to online fine-tuning, RL helps robots learn complex operations - Multimodal learning (language-vision-action) opens new chapter for general agents
In the future, reinforcement learning will deeply integrate with large-scale pretraining, multimodal learning, and causal reasoning — from conversational assistants to autonomous driving, from research assistants to home robots, RL is defining the new paradigm for AI-human interaction. The reinforcement learning series concludes here, but RL's journey has just begun.
- Post title: Reinforcement Learning (12): RLHF and Large Language Model Applications
- Post author: Chen Kai
- Create time: 2024-10-04 15:00:00
- Post link: https://www.chenk.top/reinforcement-learning-12-rlhf-and-llm-applications/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.