If BERT opened the golden age of understanding-based NLP, then the GPT series represents the pinnacle of generative NLP. From GPT-1 in 2018 to GPT-4 in 2023, OpenAI has demonstrated through continuously scaling model size and optimizing training strategies that autoregressive language models can serve as the foundation for artificial general intelligence. GPT's success lies not only in its powerful text generation capabilities but also in demonstrating the magical power of In-Context Learning: models can learn new tasks with just a few examples without updating parameters.
GPT's core is autoregressive language modeling: given previous tokens, predict the next token. This seemingly simple objective, combined with Transformer decoder architecture and large-scale data training, produces astonishing emergent capabilities. Understanding GPT is not just key to understanding modern large language models — it's the starting point for exploring AI general intelligence.
This article provides an in-depth exploration of the GPT series evolution, principles of autoregressive language modeling, various decoding strategies, in-context learning mechanisms, and how to evaluate generation quality. We'll also build a dialogue system through practical code, demonstrating GPT's powerful capabilities in real applications.
Evolution of the GPT Series
The development of the GPT series can be seen as a continuous "scaling" process: larger models, more data, longer training times, ultimately producing qualitative leaps.
GPT-1: First Pretraining with Transformer Decoder
In June 2018, OpenAI released GPT-1 (Generative Pre-trained Transformer), the first language model to use Transformer architecture for large-scale pretraining.
Architecture Features: - Based on Transformer decoder (12 layers) - Unidirectional attention mechanism, can only see left context - Parameters: 117M - Pretraining data: BooksCorpus (approximately 7,000 books, 4.5GB)
Pretraining Objective: maximize the likelihood of each token given all previous tokens:

$$\mathcal{L}(\theta) = \sum_{t=1}^{n} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$
GPT-2: Exploring Zero-Shot Learning
In February 2019, OpenAI released GPT-2, an important turning point. GPT-2's core idea: language models should be able to perform any language task without task-specific architectural modifications.
Key Improvements: - Larger models: GPT-2 Small (117M) to GPT-2 XL (1.5B) - Larger data: WebText (8 million web pages, 40GB) - Zero-shot learning: Execute tasks directly through prompts without finetuning
GPT-2 demonstrated language models' zero-shot learning capability: just by providing task descriptions and examples, the model can understand and execute tasks. For example:
```
Translate to French:
The dog ran in the park.
```
GPT-2 would generate "Le chien a couru dans le parc." even though it was never trained on translation data.
GPT-3: Emergent Capabilities from Scale
In May 2020, OpenAI released GPT-3, the first truly "large language model". GPT-3's parameter count reached 175B (175 billion), over 100 times larger than GPT-2's largest version.
Key Features: - Few-shot Learning: Can learn new tasks with just a few examples - In-Context Learning: Can adapt to tasks without gradient updates, only through context - Instruction Following: Can understand and follow natural language instructions
GPT-3 demonstrated emergent capabilities from scale: - After parameter count reaches a certain threshold, models suddenly gain previously absent capabilities - These capabilities aren't explicitly programmed but naturally emerge from data
GPT-3 Scale Comparison:
| Model | Parameters | Training Data | Main Capabilities |
|---|---|---|---|
| GPT-1 | 117M | 4.5GB | Finetune for tasks |
| GPT-2 | 1.5B | 40GB | Zero-shot learning |
| GPT-3 | 175B | 570GB | Few-shot learning, code generation |
GPT-4: Multimodality and Instruction Optimization
In March 2023, OpenAI released GPT-4. Although specific architectural details weren't disclosed, known key features include:
- Multimodal Capabilities: Can process text and image inputs
- Stronger Instruction Following: Optimized through reinforcement learning from human feedback (RLHF)
- Longer Context: Supports longer input sequences
- Better Safety: Reduced harmful outputs
GPT-4 represents the current state-of-the-art for large language models, achieving human-level or near-human-level performance on multiple benchmarks.
Principles of Autoregressive Language Modeling
GPT's core is autoregressive language modeling. Understanding this principle is key to understanding GPT.
Definition of Autoregressive
Autoregressive models assume each element in a sequence depends on the elements before it:

$$P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \ldots, x_{t-1})$$
Transformer Decoder Architecture
GPT uses Transformer decoder architecture. The key difference from encoders is masked self-attention.
Role of Masked Self-Attention: prevents the model at position $t$ from seeing information at positions after $t$, so each prediction depends only on previous tokens.

Mathematical expression of masked self-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \quad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$
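As a concrete illustration, this masking scheme can be implemented in a few lines of PyTorch. This is a standalone sketch (the function name and tensor shapes are illustrative, not any library's API):

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)
    seq_len = q.shape[-2]
    # positions j > i get -inf so softmax assigns them zero attention weight
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the first position can only attend to itself, its output is exactly its own value vector, which is an easy sanity check for the mask.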
Positional Encoding
GPT uses learnable positional encodings, the same as BERT. Positional encodings are added to token embeddings:

$$h_0 = XW_e + W_p$$

where $X$ is the matrix of input tokens, $W_e$ is the token embedding matrix, and $W_p$ is the positional embedding matrix.
Forward Pass Process
Given input sequence $x = (x_1, \ldots, x_n)$, the forward pass proceeds as follows:

1. Embedding Layer: Convert tokens to vectors and add positional embeddings
2. Transformer Blocks (repeated $N$ times):
   - Masked self-attention
   - Residual connection and layer normalization
   - Feed-forward network
   - Residual connection and layer normalization
3. Output Layer: Predict the probability distribution of the next token:

$$P(x_{t+1} \mid x_{\le t}) = \text{softmax}(h_t W_o)$$

where $h_t$ is the hidden state at position $t$ and $W_o$ is the output weight matrix.
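The forward pass above can be sketched as a toy PyTorch module. This is an illustrative miniature (the class name, layer sizes, and the use of `nn.TransformerEncoder` with a causal mask are assumptions made for the sketch), not GPT's actual implementation:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Toy decoder-only language model following the forward pass described above."""

    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learnable positions, as in GPT
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_len = x.shape[1]
        pos = torch.arange(seq_len, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)          # embedding layer
        # causal mask makes the encoder stack behave like a decoder
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(h, mask=mask)                    # masked Transformer blocks
        return self.out(h)                               # next-token logits
```

A useful property to verify: changing the last input token must not change the logits at earlier positions, which is exactly what the causal mask guarantees.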
Training Objective
GPT's training objective is to maximize the likelihood of the training corpus:

$$\mathcal{L}(\theta) = \sum_{t=1}^{n} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$
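In practice this objective is implemented as a cross-entropy loss with inputs shifted by one position relative to targets. A minimal sketch (the function name and tensor shapes are illustrative assumptions, not a specific model's API):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, tokens):
    """Average negative log-likelihood of each token given all previous tokens.

    logits: (batch, seq, vocab) model outputs; tokens: (batch, seq) input ids.
    """
    shift_logits = logits[:, :-1, :]   # predictions made at positions 0..n-2
    shift_labels = tokens[:, 1:]       # targets are the *next* tokens, 1..n-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.shape[-1]),
        shift_labels.reshape(-1),
    )
```

Minimizing this loss is equivalent to maximizing the log-likelihood above, and exponentiating it yields the model's perplexity.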
Decoding Strategies
GPT's text generation process is autoregressive decoding: generate one token at a time, then use it as input to continue generating the next token. Different decoding strategies produce different generation effects.
Greedy Decoding
The simplest strategy is to always choose the token with the highest probability:

$$x_t = \arg\max_{x} P(x \mid x_{<t})$$
Advantages: Simple and fast, deterministic output
Disadvantages: Prone to repetition and monotonous text
```python
import torch

def greedy_decode(model, tokenizer, prompt, max_length=100):
    """Generate text by always picking the highest-probability next token."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits
            # argmax over the vocabulary at the last position
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
Beam Search
Beam Search maintains the $k$ highest-scoring candidate sequences (beams) at each step.

Algorithm:
1. Initialize: beam contains the start token (or prompt)
2. Expand: For each beam, generate all possible next tokens and compute scores
3. Select: Keep the top $k$ highest-scoring sequences
4. Repeat until the end token is generated or maximum length is reached

Score Calculation: the cumulative log-probability, normalized by a length penalty $\alpha$ so that short sequences are not unfairly favored:

$$\text{score}(y) = \frac{1}{|y|^{\alpha}} \sum_{t=1}^{|y|} \log P(y_t \mid y_{<t})$$
```python
import torch
import torch.nn.functional as F

def beam_search(model, tokenizer, prompt, beam_size=5, max_length=100, length_penalty=0.6):
    """Keep the beam_size highest-scoring partial sequences at each step."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    beams = [(input_ids, 0.0)]  # (sequence, cumulative log-probability)
    model.eval()

    def norm_score(seq, score):
        # length normalization to avoid favoring short sequences
        return score / (seq.shape[1] ** length_penalty)

    with torch.no_grad():
        for _ in range(max_length):
            candidates = []
            for seq, score in beams:
                if seq[0, -1].item() == tokenizer.eos_token_id:
                    candidates.append((seq, score))  # finished beam, keep as-is
                    continue
                log_probs = F.log_softmax(model(seq).logits[:, -1, :], dim=-1)
                top_lp, top_idx = log_probs.topk(beam_size, dim=-1)
                for lp, idx in zip(top_lp[0], top_idx[0]):
                    new_seq = torch.cat([seq, idx.view(1, 1)], dim=-1)
                    candidates.append((new_seq, score + lp.item()))
            beams = sorted(candidates, key=lambda c: norm_score(*c), reverse=True)[:beam_size]
    best_seq, _ = max(beams, key=lambda c: norm_score(*c))
    return tokenizer.decode(best_seq[0], skip_special_tokens=True)
```
Advantages: Usually generates better text than greedy decoding
Disadvantages: High computational cost, may produce overly conservative text
Top-k Sampling
Top-k sampling restricts candidate tokens to the $k$ highest-probability tokens.

Algorithm:
1. Compute the probability distribution over all tokens
2. Select the top $k$ tokens by probability
3. Renormalize probabilities over this set
4. Sample the next token from the renormalized distribution
```python
import torch
import torch.nn.functional as F

def top_k_decode(model, tokenizer, prompt, k=50, max_length=100, temperature=1.0):
    """Sample each next token from the k most probable candidates."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits[:, -1, :] / temperature
            top_logits, top_idx = logits.topk(k, dim=-1)
            probs = F.softmax(top_logits, dim=-1)  # renormalize over the top-k set
            next_token = top_idx.gather(-1, torch.multinomial(probs, 1))
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
Advantages: Balances diversity and quality
Disadvantages: A fixed $k$ doesn't adapt to the shape of the distribution: it can be too permissive when the distribution is peaked and too restrictive when it is flat
Top-p (Nucleus) Sampling
Top-p sampling (also called Nucleus Sampling) is an improved version of Top-k that dynamically selects the smallest token set whose cumulative probability reaches $p$.

Algorithm:
1. Sort all tokens by probability from high to low
2. Select the smallest token set whose cumulative probability reaches $p$
3. Renormalize probabilities over this set and sample from it
```python
import torch
import torch.nn.functional as F

def top_p_decode(model, tokenizer, prompt, p=0.9, max_length=100, temperature=1.0):
    """Sample from the smallest token set whose cumulative probability reaches p."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits[:, -1, :] / temperature
            sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
            probs = F.softmax(sorted_logits, dim=-1)
            cumulative = probs.cumsum(dim=-1)
            # mask tokens outside the nucleus (the top token is always kept)
            mask = cumulative - probs > p
            sorted_logits[mask] = float("-inf")
            probs = F.softmax(sorted_logits, dim=-1)
            next_token = sorted_idx.gather(-1, torch.multinomial(probs, 1))
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
Advantages: - Adaptive: Dynamically adjusts candidate count based on probability distribution - More flexible: Automatically adapts in different contexts
Disadvantages: Slightly more complex computation
Temperature Parameter
The temperature parameter $T$ controls sampling randomness by rescaling the logits before the softmax:

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

- $T < 1$: Sharper distribution, more deterministic output (more conservative)
- $T = 1$: Standard softmax
- $T > 1$: Smoother distribution, more diverse output (more creative)
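A quick standalone illustration of how temperature reshapes a toy logit vector (the numbers are arbitrary, chosen only to show the effect):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

# Lower T concentrates mass on the top token; higher T flattens the distribution.
```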
Decoding Strategy Comparison
| Strategy | Diversity | Quality | Speed | Use Cases |
|---|---|---|---|---|
| Greedy | Low | Medium | Fast | Deterministic tasks |
| Beam Search | Low | High | Slow | Quality priority |
| Top-k | Medium | Medium-High | Medium | Balanced scenarios |
| Top-p | Medium-High | Medium-High | Medium | Creative generation |
Zero-shot, Few-shot, and In-Context Learning
One of the GPT series' most astonishing capabilities is In-Context Learning: models can learn new tasks with just a few examples without updating parameters.
Zero-shot Learning
Zero-shot learning means the model performs tasks without task-specific training data, relying only on task descriptions or prompts.
Example:

```
Translate the following English to Chinese:
The cat sat on the mat.
```
The model needs to understand the "translation" task and execute it directly.
Few-shot Learning
Few-shot learning means the model learns task patterns through a few examples (usually 1-10), then applies them to new samples.
Example:

```
Translate the following English to Chinese:
Example 1: The cat sat on the mat. → 猫坐在垫子上。
Example 2: The dog ran in the park. → 狗在公园里跑。
Now translate: The bird flew in the sky.
```
The model learns translation patterns through examples, then applies them to new sentences.
Mechanisms of In-Context Learning
The mechanisms of in-context learning aren't fully understood, but there are several possible explanations:
1. Pattern Matching - Models have seen many similar task formats during pretraining - Identify task types through pattern matching and apply corresponding patterns
2. Implicit Gradient Updates - Some research suggests Transformer attention mechanisms can implicitly perform gradient-update-like operations - By attending to examples, models "adjust" internal representations
3. Meta-Learning - Models learn how to quickly adapt to new tasks during pretraining - This is an implicit meta-learning capability
Prompt Engineering
Prompt engineering is the technique of optimizing model inputs to obtain better outputs:
1. Task Description - Clearly describe task objectives - Use natural language to specify expected output format
2. Example Selection - Choose representative examples - Examples should cover task diversity
3. Format Consistency - Maintain input format consistency - Use clear separators
4. Chain-of-Thought - For complex tasks, guide models to reason step by step - Include reasoning processes in examples
```python
def few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, then the query."""
    lines = [task_description, ""]
    for i, (source, target) in enumerate(examples, 1):
        lines.append(f"Example {i}: {source} → {target}")
    lines.append(f"Now: {query}")
    return "\n".join(lines)
```
Generation Quality Evaluation
Evaluating the quality of generated text is a complex problem, typically requiring multiple metrics.
BLEU Score
BLEU (Bilingual Evaluation Understudy) was originally used for machine translation evaluation, computed by comparing n-gram overlap between generated and reference texts.
Calculation Process:
1. Compute precision for different n-grams (1-gram to 4-gram)
2. Apply a length penalty (Brevity Penalty)
3. Compute the geometric mean
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# smoothing avoids zero scores when higher-order n-grams have no matches
smoothie = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smoothie)
print(f"BLEU: {score:.4f}")
```
Advantages: Fast, objective
Disadvantages: Only considers exact matches, doesn't consider semantic similarity
ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is mainly used for summarization evaluation, focusing on recall.
ROUGE-N: Computes recall for n-grams
ROUGE-L: Based on longest common subsequence (LCS)
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",  # reference
    "the cat is on the mat",   # candidate
)
for name, result in scores.items():
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F={result.fmeasure:.3f}")
```
Perplexity
Perplexity measures a model's prediction uncertainty on test data; lower is better:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N}\log P(x_t \mid x_{<t})\right)$$
```python
import torch

def calculate_perplexity(model, tokenizer, text):
    """Perplexity = exp of the average negative log-likelihood per token."""
    input_ids = tokenizer.encode(text, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        # passing labels=input_ids makes the model return the LM cross-entropy loss
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()
```
Human Evaluation
Human evaluation remains the most reliable method, typically evaluating: - Fluency: Whether text is natural and fluent - Relevance: Whether it's relevant to input - Accuracy: Whether information is correct - Creativity: Whether it's novel
Metric Selection
Different tasks suit different metrics:
| Task | Recommended Metrics |
|---|---|
| Machine Translation | BLEU, METEOR |
| Text Summarization | ROUGE, BLEU |
| Dialogue Systems | BLEU, Human Evaluation |
| Creative Writing | Human Evaluation, Diversity Metrics |
Practice: Building a Dialogue System
Below demonstrates how to build a simple dialogue system using GPT models:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def chat(user_input, max_new_tokens=50):
    """Single-turn reply: feed the user message and sample a continuation."""
    input_ids = tokenizer.encode(user_input, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    # decode only the newly generated tokens, not the prompt
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)

print(chat("Hello, how are you today?"))
```
Improvement: Multi-turn Dialogue Context
To support multi-turn dialogue, the system needs to maintain conversation history:
```python
class MultiTurnChatBot:
    """Keep conversation history and truncate it to fit the context window."""

    def __init__(self, model, tokenizer, max_history_tokens=512):
        self.model = model
        self.tokenizer = tokenizer
        self.max_history_tokens = max_history_tokens
        self.history = []

    def chat(self, user_input, max_new_tokens=50):
        self.history.append(f"User: {user_input}")
        prompt = "\n".join(self.history) + "\nBot:"
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        # drop the oldest tokens when history exceeds the budget
        if input_ids.shape[1] > self.max_history_tokens:
            input_ids = input_ids[:, -self.max_history_tokens:]
        output_ids = self.model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        reply = self.tokenizer.decode(
            output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
        ).split("\n")[0].strip()
        self.history.append(f"Bot: {reply}")
        return reply
```
Using HuggingFace Pipeline
HuggingFace provides a simpler interface:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_new_tokens=50, do_sample=True, top_p=0.9)
print(result[0]["generated_text"])
```
GPT's Limitations
Despite GPT's tremendous success, it also has some limitations:
1. Hallucination Problem - Models may generate seemingly reasonable but actually incorrect information - Lack fact-checking mechanisms
2. Context Length Limitations - Early GPT models had limited context windows (e.g., GPT-3's 2048 tokens) - Cannot handle very long documents
3. High Computational Cost - Large models require massive computational resources - Inference speed may be slow
4. Training Data Bias - Models may learn biases from training data - Requires careful data filtering and model alignment
5. Limited Controllability - Difficult to precisely control generated content - May generate harmful or inappropriate content
Summary
The GPT series represents the pinnacle of generative language models. Through autoregressive language modeling and Transformer architecture, it achieves powerful text generation capabilities. From GPT-1 to GPT-4, continuously scaling model size has brought emergent capabilities, especially in-context learning, enabling models to adapt to new tasks without updating parameters.
GPT's core contributions: 1. Autoregressive Language Modeling: Simple yet powerful pretraining objective 2. In-Context Learning: Zero-shot and few-shot learning capabilities 3. Generality: The same model can handle multiple tasks
Understanding GPT is not just key to understanding modern large language models — it's the starting point for exploring AI general intelligence. As model sizes continue to grow and training strategies keep optimizing, we can expect even more powerful generative AI systems.
❓ Q&A: GPT Common Questions
Q1: What are the main differences between GPT and BERT?
A: Main differences: - Architecture: GPT is decoder (unidirectional), BERT is encoder (bidirectional) - Pretraining Tasks: GPT uses language modeling, BERT uses MLM + NSP - Suitable Tasks: GPT excels at generation tasks, BERT excels at understanding tasks - Context Utilization: GPT can only see forward context, BERT can see both forward and backward context
Q2: Why does GPT use masked self-attention?
A: Masked self-attention ensures consistency between training and inference: - During training, model can only see previous tokens to predict current token - During inference, model can only see already generated tokens - Without masking, model could "see the future" during training, causing training-inference inconsistency
Q3: How does GPT achieve zero-shot learning?
A: GPT's zero-shot learning capability comes from: 1. Large-scale Pretraining: Trained on massive diverse data, seen various task formats 2. Pattern Matching: Identify task descriptions and formats, match corresponding generation patterns 3. Context Understanding: Transformer architecture can understand long-range dependencies, capture task patterns
Q4: What's the difference between Top-k and Top-p sampling?
A: Main differences: - Top-k: Fixed selection of top k tokens by probability - Top-p: Dynamically selects token set whose cumulative probability reaches p - Top-p is more flexible: Automatically adjusts candidate count in different contexts - Top-k is simpler: More intuitive to implement and understand
Q5: How to choose appropriate decoding strategies?
A: Selection recommendations: - Deterministic tasks (e.g., code completion): Greedy or Beam Search - Creative tasks (e.g., story generation): Top-p sampling, temperature > 1 - Balanced scenarios: Top-k or Top-p, temperature ≈ 0.7-0.9 - Quality priority: Beam Search, beam_size = 3-5 - Speed priority: Greedy or Top-k (small k)
Q6: How does GPT's in-context learning capability emerge?
A: In-context learning capability may come from: 1. Pretraining Data Diversity: Seen many task examples and formats 2. Transformer Attention Mechanism: Can attend to relevant examples and extract patterns 3. Implicit Meta-Learning: Learned how to quickly adapt during pretraining 4. Scale Effects: Emerges after model scale reaches certain threshold
Q7: How to evaluate GPT-generated text quality?
A: Evaluation methods: 1. Automatic Metrics: BLEU, ROUGE, Perplexity 2. Human Evaluation: Fluency, relevance, accuracy 3. Task-Specific Metrics: Choose appropriate metrics based on specific tasks 4. Combined Evaluation: Comprehensive judgment combining multiple metrics
Q8: Why do GPT models hallucinate?
A: Reasons for hallucination: 1. Training Data Noise: Pretraining data contains incorrect information 2. Probabilistic Generation: Sampling process may select low-probability but incorrect tokens 3. Lack of Fact-Checking: Models don't have explicit fact-verification mechanisms 4. Overfitting Patterns: May generate content that fits language patterns but not facts
Q9: How to reduce harmful content generated by GPT?
A: Methods to reduce harmful content: 1. Data Filtering: Carefully filter and clean training data 2. RLHF: Optimize models using reinforcement learning from human feedback 3. Safety Prompts: Add safety constraints in prompts 4. Post-Processing Filtering: Filter and check generated content 5. Model Alignment: Finetune models to follow safety guidelines
Q10: What are GPT's future development directions?
A: Possible development directions: 1. Larger Scale: Continue increasing model and training data size 2. Multimodality: Integrate text, images, audio, and other modalities 3. Longer Context: Support processing longer input sequences 4. Better Controllability: Precisely control generated content and style 5. More Efficient: Reduce computational costs, improve inference speed 6. Safer: Reduce bias and harmful content 7. Specialization: Optimize models for specific domains
- Post title: NLP (6): GPT and Generative Language Models
- Post author: Chen Kai
- Create time: 2024-03-03 14:00:00
- Post link: https://www.chenk.top/en/nlp-gpt-generative-models/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.