If BERT opened the golden age of understanding-based NLP, then the GPT series represents the pinnacle of generative NLP. From GPT-1 in 2018 to GPT-4 in 2023, OpenAI has demonstrated through continuously scaling model size and optimizing training strategies that autoregressive language models can serve as the foundation for artificial general intelligence. GPT's success lies not only in its powerful text generation capabilities but also in demonstrating the magical power of In-Context Learning: models can learn new tasks with just a few examples without updating parameters.
GPT's core is autoregressive language modeling: given previous tokens, predict the next token. This seemingly simple objective, combined with Transformer decoder architecture and large-scale data training, produces astonishing emergent capabilities. Understanding GPT is not just key to understanding modern large language models — it's the starting point for exploring AI general intelligence.
This article provides an in-depth exploration of the GPT series evolution, principles of autoregressive language modeling, various decoding strategies, in-context learning mechanisms, and how to evaluate generation quality. We'll also build a dialogue system through practical code, demonstrating GPT's powerful capabilities in real applications.
Evolution of the GPT Series
The development of the GPT series can be seen as a continuous "scaling" process: larger models, more data, longer training times, ultimately producing qualitative leaps.
GPT-1: First Pretraining with Transformer Decoder
In June 2018, OpenAI released GPT-1 (Generative Pre-trained Transformer), the first language model to use Transformer architecture for large-scale pretraining.
Architecture Features: - Based on Transformer decoder (12 layers) - Unidirectional attention mechanism, can only see left context - Parameters: 117M - Pretraining data: BooksCorpus (approximately 7,000 books, 4.5GB)
Pretraining Objective: maximize the likelihood of each token given all previous tokens:

$$\mathcal{L}(\theta) = \sum_{t=1}^{n} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$
GPT-2: Exploring Zero-Shot Learning
In February 2019, OpenAI released GPT-2, an important turning point. GPT-2's core idea: language models should be able to perform any language task without task-specific architectural modifications.
Key Improvements: - Larger models: GPT-2 Small (117M) to GPT-2 XL (1.5B) - Larger data: WebText (8 million web pages, 40GB) - Zero-shot learning: Execute tasks directly through prompts without finetuning
GPT-2 demonstrated language models' zero-shot learning capability: just by providing task descriptions and examples, the model can understand and execute tasks. For example:
```
Translate to French:
The dog ran in the park.
```
GPT-2 would generate "Le chien a couru dans le parc." even though it was never trained on translation data.
GPT-3: Emergent Capabilities from Scale
In May 2020, OpenAI released GPT-3, the first truly "large language model". GPT-3's parameter count reached 175B (175 billion), over 100 times larger than GPT-2's largest version.
Key Features: - Few-shot Learning: Can learn new tasks with just a few examples - In-Context Learning: Can adapt to tasks without gradient updates, only through context - Instruction Following: Can understand and follow natural language instructions
GPT-3 demonstrated emergent capabilities from scale: - After parameter count reaches a certain threshold, models suddenly gain previously absent capabilities - These capabilities aren't explicitly programmed but naturally emerge from data
GPT-3 Scale Comparison:
| Model | Parameters | Training Data | Main Capabilities |
|---|---|---|---|
| GPT-1 | 117M | 4.5GB | Finetune for tasks |
| GPT-2 | 1.5B | 40GB | Zero-shot learning |
| GPT-3 | 175B | 570GB | Few-shot learning, code generation |
GPT-4: Multimodality and Instruction Optimization
In March 2023, OpenAI released GPT-4. Although specific architectural details weren't disclosed, known key features include:
- Multimodal Capabilities: Can process text and image inputs
- Stronger Instruction Following: Optimized through reinforcement learning from human feedback (RLHF)
- Longer Context: Supports longer input sequences
- Better Safety: Reduced harmful outputs
GPT-4 represents the current state-of-the-art for large language models, achieving human-level or near-human-level performance on multiple benchmarks.
Principles of Autoregressive Language Modeling
GPT's core is autoregressive language modeling. Understanding this principle is key to understanding GPT.
Definition of Autoregressive
Autoregressive models assume each element in a sequence depends on the elements before it:

$$P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \ldots, x_{t-1})$$
Transformer Decoder Architecture
GPT uses Transformer decoder architecture. The key difference from encoders is masked self-attention.
Role of Masked Self-Attention: prevents the model at position $t$ from seeing information at positions after $t$, so each prediction depends only on previous tokens.

Mathematical expression of masked self-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \quad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$
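As a concrete illustration, this masking scheme can be implemented in a few lines of PyTorch. This is a standalone sketch (the function name and tensor shapes are illustrative, not any library's API):

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)
    seq_len = q.shape[-2]
    # positions j > i get -inf so softmax assigns them zero attention weight
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Because the first position can only attend to itself, its output is exactly its own value vector, which is an easy sanity check for the mask.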
Positional Encoding
GPT uses learnable positional encodings, the same as BERT. Positional encodings are added to token embeddings:

$$h_0 = XW_e + W_p$$

where $X$ is the matrix of input tokens, $W_e$ is the token embedding matrix, and $W_p$ is the positional embedding matrix.
Forward Pass Process
Given input sequence $x = (x_1, \ldots, x_n)$, the forward pass proceeds as follows:

1. Embedding Layer: Convert tokens to vectors and add positional embeddings
2. Transformer Blocks (repeated $N$ times):
   - Masked self-attention
   - Residual connection and layer normalization
   - Feed-forward network
   - Residual connection and layer normalization
3. Output Layer: Predict the probability distribution of the next token:

$$P(x_{t+1} \mid x_{\le t}) = \text{softmax}(h_t W_o)$$

where $h_t$ is the hidden state at position $t$ and $W_o$ is the output weight matrix.
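The forward pass above can be sketched as a toy PyTorch module. This is an illustrative miniature (the class name, layer sizes, and the use of `nn.TransformerEncoder` with a causal mask are assumptions made for the sketch), not GPT's actual implementation:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Toy decoder-only language model following the forward pass described above."""

    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learnable positions, as in GPT
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_len = x.shape[1]
        pos = torch.arange(seq_len, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)          # embedding layer
        # causal mask makes the encoder stack behave like a decoder
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(h, mask=mask)                    # masked Transformer blocks
        return self.out(h)                               # next-token logits
```

A useful property to verify: changing the last input token must not change the logits at earlier positions, which is exactly what the causal mask guarantees.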
Training Objective
GPT's training objective is to maximize the likelihood of the training corpus:

$$\mathcal{L}(\theta) = \sum_{t=1}^{n} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$
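In practice this objective is implemented as a cross-entropy loss with inputs shifted by one position relative to targets. A minimal sketch (the function name and tensor shapes are illustrative assumptions, not a specific model's API):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, tokens):
    """Average negative log-likelihood of each token given all previous tokens.

    logits: (batch, seq, vocab) model outputs; tokens: (batch, seq) input ids.
    """
    shift_logits = logits[:, :-1, :]   # predictions made at positions 0..n-2
    shift_labels = tokens[:, 1:]       # targets are the *next* tokens, 1..n-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.shape[-1]),
        shift_labels.reshape(-1),
    )
```

Minimizing this loss is equivalent to maximizing the log-likelihood above, and exponentiating it yields the model's perplexity.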
Decoding Strategies
GPT's text generation process is autoregressive decoding: generate one token at a time, then use it as input to continue generating the next token. Different decoding strategies produce different generation effects.
Greedy Decoding
The simplest strategy is to always choose the token with the highest probability:

$$x_t = \arg\max_{x} P(x \mid x_{<t})$$
Advantages: Simple and fast, deterministic output
Disadvantages: Prone to repetition and monotonous text
```python
import torch

def greedy_decode(model, tokenizer, prompt, max_length=100):
    """Generate text by always picking the highest-probability next token."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits
            # argmax over the vocabulary at the last position
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
Beam Search
Beam Search maintains the $k$ highest-scoring candidate sequences (beams) at each step.

Algorithm:
1. Initialize: beam contains the start token (or prompt)
2. Expand: For each beam, generate all possible next tokens and compute scores
3. Select: Keep the top $k$ highest-scoring sequences
4. Repeat until the end token is generated or maximum length is reached

Score Calculation: the cumulative log-probability, normalized by a length penalty $\alpha$ so that short sequences are not unfairly favored:

$$\text{score}(y) = \frac{1}{|y|^{\alpha}} \sum_{t=1}^{|y|} \log P(y_t \mid y_{<t})$$
```python
import torch
import torch.nn.functional as F

def beam_search(model, tokenizer, prompt, beam_size=5, max_length=100, length_penalty=0.6):
    """Keep the beam_size highest-scoring partial sequences at each step."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    beams = [(input_ids, 0.0)]  # (sequence, cumulative log-probability)
    model.eval()

    def norm_score(seq, score):
        # length normalization to avoid favoring short sequences
        return score / (seq.shape[1] ** length_penalty)

    with torch.no_grad():
        for _ in range(max_length):
            candidates = []
            for seq, score in beams:
                if seq[0, -1].item() == tokenizer.eos_token_id:
                    candidates.append((seq, score))  # finished beam, keep as-is
                    continue
                log_probs = F.log_softmax(model(seq).logits[:, -1, :], dim=-1)
                top_lp, top_idx = log_probs.topk(beam_size, dim=-1)
                for lp, idx in zip(top_lp[0], top_idx[0]):
                    new_seq = torch.cat([seq, idx.view(1, 1)], dim=-1)
                    candidates.append((new_seq, score + lp.item()))
            beams = sorted(candidates, key=lambda c: norm_score(*c), reverse=True)[:beam_size]
    best_seq, _ = max(beams, key=lambda c: norm_score(*c))
    return tokenizer.decode(best_seq[0], skip_special_tokens=True)
```
Advantages: Usually generates better text than greedy decoding
Disadvantages: High computational cost, may produce overly conservative text
Top-k Sampling
Top-k sampling restricts candidate tokens to the $k$ highest-probability tokens.

Algorithm:
1. Compute the probability distribution over all tokens
2. Select the top $k$ tokens by probability
3. Renormalize probabilities over this set
4. Sample the next token from the renormalized distribution
```python
import torch
import torch.nn.functional as F

def top_k_decode(model, tokenizer, prompt, k=50, max_length=100, temperature=1.0):
    """Sample each next token from the k most probable candidates."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits[:, -1, :] / temperature
            top_logits, top_idx = logits.topk(k, dim=-1)
            probs = F.softmax(top_logits, dim=-1)  # renormalize over the top-k set
            next_token = top_idx.gather(-1, torch.multinomial(probs, 1))
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
Advantages: Balances diversity and quality
Disadvantages: A fixed $k$ doesn't adapt to the shape of the distribution: it can be too permissive when the distribution is peaked and too restrictive when it is flat
Top-p (Nucleus) Sampling
Top-p sampling (also called Nucleus Sampling) is an improved version of Top-k that dynamically selects the smallest token set whose cumulative probability reaches $p$.

Algorithm:
1. Sort all tokens by probability from high to low
2. Select the smallest token set whose cumulative probability reaches $p$
3. Renormalize probabilities over this set and sample from it
```python
import torch
import torch.nn.functional as F

def top_p_decode(model, tokenizer, prompt, p=0.9, max_length=100, temperature=1.0):
    """Sample from the smallest token set whose cumulative probability reaches p."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits[:, -1, :] / temperature
            sorted_logits, sorted_idx = logits.sort(dim=-1, descending=True)
            probs = F.softmax(sorted_logits, dim=-1)
            cumulative = probs.cumsum(dim=-1)
            # mask tokens outside the nucleus (the top token is always kept)
            mask = cumulative - probs > p
            sorted_logits[mask] = float("-inf")
            probs = F.softmax(sorted_logits, dim=-1)
            next_token = sorted_idx.gather(-1, torch.multinomial(probs, 1))
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```
Advantages: - Adaptive: Dynamically adjusts candidate count based on probability distribution - More flexible: Automatically adapts in different contexts
Disadvantages: Slightly more complex computation
Temperature Parameter
The temperature parameter $T$ controls sampling randomness by rescaling the logits before the softmax:

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

- $T < 1$: Sharper distribution, more deterministic output (more conservative)
- $T = 1$: Standard softmax
- $T > 1$: Smoother distribution, more diverse output (more creative)
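A quick standalone illustration of how temperature reshapes a toy logit vector (the numbers are arbitrary, chosen only to show the effect):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

# Lower T concentrates mass on the top token; higher T flattens the distribution.
```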
Decoding Strategy Comparison
| Strategy | Diversity | Quality | Speed | Use Cases |
|---|---|---|---|---|
| Greedy | Low | Medium | Fast | Deterministic tasks |
| Beam Search | Low | High | Slow | Quality priority |
| Top-k | Medium | Medium-High | Medium | Balanced scenarios |
| Top-p | Medium-High | Medium-High | Medium | Creative generation |
Zero-shot, Few-shot, and In-Context Learning
One of the GPT series' most astonishing capabilities is In-Context Learning: models can learn new tasks with just a few examples without updating parameters.
Zero-shot Learning
Zero-shot learning means the model performs tasks without task-specific training data, relying only on task descriptions or prompts.
Example:

```
Translate the following English to Chinese:
The cat sat on the mat.
```
The model needs to understand the "translation" task and execute it directly.
Few-shot Learning
Few-shot learning means the model learns task patterns through a few examples (usually 1-10), then applies them to new samples.
Example:

```
Translate the following English to Chinese:
Example 1: The cat sat on the mat. → 猫坐在垫子上。
Example 2: The dog ran in the park. → 狗在公园里跑。
Now translate: The bird flew in the sky.
```
The model learns translation patterns through examples, then applies them to new sentences.
Mechanisms of In-Context Learning
The mechanisms of in-context learning aren't fully understood, but there are several possible explanations:
1. Pattern Matching - Models have seen many similar task formats during pretraining - Identify task types through pattern matching and apply corresponding patterns
2. Implicit Gradient Updates - Some research suggests Transformer attention mechanisms can implicitly perform gradient-update-like operations - By attending to examples, models "adjust" internal representations
3. Meta-Learning - Models learn how to quickly adapt to new tasks during pretraining - This is an implicit meta-learning capability
Prompt Engineering
Prompt engineering is the technique of optimizing model inputs to obtain better outputs:
1. Task Description - Clearly describe task objectives - Use natural language to specify expected output format
2. Example Selection - Choose representative examples - Examples should cover task diversity
3. Format Consistency - Maintain input format consistency - Use clear separators
4. Chain-of-Thought - For complex tasks, guide models to reason step by step - Include reasoning processes in examples
```python
def few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, then the query."""
    lines = [task_description, ""]
    for i, (source, target) in enumerate(examples, 1):
        lines.append(f"Example {i}: {source} → {target}")
    lines.append(f"Now: {query}")
    return "\n".join(lines)
```
Generation Quality Evaluation
Evaluating the quality of generated text is a complex problem, typically requiring multiple metrics.
BLEU Score
BLEU (Bilingual Evaluation Understudy) was originally used for machine translation evaluation, computed by comparing n-gram overlap between generated and reference texts.
Calculation Process:
1. Compute precision for different n-grams (1-gram to 4-gram)
2. Apply a length penalty (Brevity Penalty)
3. Compute the geometric mean
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# smoothing avoids zero scores when higher-order n-grams have no matches
smoothie = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smoothie)
print(f"BLEU: {score:.4f}")
```
Advantages: Fast, objective
Disadvantages: Only considers exact matches, doesn't consider semantic similarity
ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is mainly used for summarization evaluation, focusing on recall.
ROUGE-N: Computes recall for n-grams
ROUGE-L: Based on longest common subsequence (LCS)
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the cat sat on the mat",  # reference
    "the cat is on the mat",   # candidate
)
for name, result in scores.items():
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F={result.fmeasure:.3f}")
```
Perplexity
Perplexity measures a model's prediction uncertainty on test data; lower is better:

$$\text{PPL} = \exp\left(-\frac{1}{N}\sum_{t=1}^{N}\log P(x_t \mid x_{<t})\right)$$
```python
import torch

def calculate_perplexity(model, tokenizer, text):
    """Perplexity = exp of the average negative log-likelihood per token."""
    input_ids = tokenizer.encode(text, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        # passing labels=input_ids makes the model return the LM cross-entropy loss
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()
```
Human Evaluation
Human evaluation remains the most reliable method, typically evaluating: - Fluency: Whether text is natural and fluent - Relevance: Whether it's relevant to input - Accuracy: Whether information is correct - Creativity: Whether it's novel
Metric Selection
Different tasks suit different metrics:
| Task | Recommended Metrics |
|---|---|
| Machine Translation | BLEU, METEOR |
| Text Summarization | ROUGE, BLEU |
| Dialogue Systems | BLEU, Human Evaluation |
| Creative Writing | Human Evaluation, Diversity Metrics |
Practice: Building a Dialogue System
Below demonstrates how to build a simple dialogue system using GPT models:
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def chat(user_input, max_new_tokens=50):
    """Single-turn reply: feed the user message and sample a continuation."""
    input_ids = tokenizer.encode(user_input, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    # decode only the newly generated tokens, not the prompt
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)

print(chat("Hello, how are you today?"))
```
Improvement: Multi-turn Dialogue Context
To support multi-turn dialogue, the system needs to maintain conversation history:
```python
class MultiTurnChatBot:
    """Keep conversation history and truncate it to fit the context window."""

    def __init__(self, model, tokenizer, max_history_tokens=512):
        self.model = model
        self.tokenizer = tokenizer
        self.max_history_tokens = max_history_tokens
        self.history = []

    def chat(self, user_input, max_new_tokens=50):
        self.history.append(f"User: {user_input}")
        prompt = "\n".join(self.history) + "\nBot:"
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
        # drop the oldest tokens when history exceeds the budget
        if input_ids.shape[1] > self.max_history_tokens:
            input_ids = input_ids[:, -self.max_history_tokens:]
        output_ids = self.model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        reply = self.tokenizer.decode(
            output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
        ).split("\n")[0].strip()
        self.history.append(f"Bot: {reply}")
        return reply
```
Using HuggingFace Pipeline
HuggingFace provides a simpler interface:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_new_tokens=50, do_sample=True, top_p=0.9)
print(result[0]["generated_text"])
```
GPT's Limitations
Despite GPT's tremendous success, it also has some limitations:
1. Hallucination Problem - Models may generate seemingly reasonable but actually incorrect information - Lack fact-checking mechanisms
2. Context Length Limitations - Early GPT models had limited context windows (e.g., GPT-3's 2048 tokens) - Cannot handle very long documents
3. High Computational Cost - Large models require massive computational resources - Inference speed may be slow
4. Training Data Bias - Models may learn biases from training data - Requires careful data filtering and model alignment
5. Limited Controllability - Difficult to precisely control generated content - May generate harmful or inappropriate content
Summary
The GPT series represents the pinnacle of generative language models. Through autoregressive language modeling and Transformer architecture, it achieves powerful text generation capabilities. From GPT-1 to GPT-4, continuously scaling model size has brought emergent capabilities, especially in-context learning, enabling models to adapt to new tasks without updating parameters.
GPT's core contributions: 1. Autoregressive Language Modeling: Simple yet powerful pretraining objective 2. In-Context Learning: Zero-shot and few-shot learning capabilities 3. Generality: The same model can handle multiple tasks
Understanding GPT is not just key to understanding modern large language models — it's the starting point for exploring AI general intelligence. As model sizes continue to grow and training strategies keep optimizing, we can expect even more powerful generative AI systems.
❓ Q&A: GPT Common Questions
Q1: What are the main differences between GPT and BERT?
A: Main differences: - Architecture: GPT is decoder (unidirectional), BERT is encoder (bidirectional) - Pretraining Tasks: GPT uses language modeling, BERT uses MLM + NSP - Suitable Tasks: GPT excels at generation tasks, BERT excels at understanding tasks - Context Utilization: GPT can only see forward context, BERT can see both forward and backward context
Q2: Why does GPT use masked self-attention?
A: Masked self-attention ensures consistency between training and inference: - During training, model can only see previous tokens to predict current token - During inference, model can only see already generated tokens - Without masking, model could "see the future" during training, causing training-inference inconsistency
Q3: How does GPT achieve zero-shot learning?
A: GPT's zero-shot learning capability comes from: 1. Large-scale Pretraining: Trained on massive diverse data, seen various task formats 2. Pattern Matching: Identify task descriptions and formats, match corresponding generation patterns 3. Context Understanding: Transformer architecture can understand long-range dependencies, capture task patterns
Q4: What's the difference between Top-k and Top-p sampling?
A: Main differences: - Top-k: Fixed selection of top k tokens by probability - Top-p: Dynamically selects token set whose cumulative probability reaches p - Top-p is more flexible: Automatically adjusts candidate count in different contexts - Top-k is simpler: More intuitive to implement and understand
Q5: How to choose appropriate decoding strategies?
A: Selection recommendations: - Deterministic tasks (e.g., code completion): Greedy or Beam Search - Creative tasks (e.g., story generation): Top-p sampling, temperature > 1 - Balanced scenarios: Top-k or Top-p, temperature ≈ 0.7-0.9 - Quality priority: Beam Search, beam_size = 3-5 - Speed priority: Greedy or Top-k (small k)
Q6: How does GPT's in-context learning capability emerge?
A: In-context learning capability may come from: 1. Pretraining Data Diversity: Seen many task examples and formats 2. Transformer Attention Mechanism: Can attend to relevant examples and extract patterns 3. Implicit Meta-Learning: Learned how to quickly adapt during pretraining 4. Scale Effects: Emerges after model scale reaches certain threshold
Q7: How to evaluate GPT-generated text quality?
A: Evaluation methods: 1. Automatic Metrics: BLEU, ROUGE, Perplexity 2. Human Evaluation: Fluency, relevance, accuracy 3. Task-Specific Metrics: Choose appropriate metrics based on specific tasks 4. Combined Evaluation: Comprehensive judgment combining multiple metrics
Q8: Why do GPT models hallucinate?
A: Reasons for hallucination: 1. Training Data Noise: Pretraining data contains incorrect information 2. Probabilistic Generation: Sampling process may select low-probability but incorrect tokens 3. Lack of Fact-Checking: Models don't have explicit fact-verification mechanisms 4. Overfitting Patterns: May generate content that fits language patterns but not facts
Q9: How to reduce harmful content generated by GPT?
A: Methods to reduce harmful content: 1. Data Filtering: Carefully filter and clean training data 2. RLHF: Optimize models using reinforcement learning from human feedback 3. Safety Prompts: Add safety constraints in prompts 4. Post-Processing Filtering: Filter and check generated content 5. Model Alignment: Finetune models to follow safety guidelines
Q10: What are GPT's future development directions?
A: Possible development directions: 1. Larger Scale: Continue increasing model and training data size 2. Multimodality: Integrate text, images, audio, and other modalities 3. Longer Context: Support processing longer input sequences 4. Better Controllability: Precisely control generated content and style 5. More Efficient: Reduce computational costs, improve inference speed 6. Safer: Reduce bias and harmful content 7. Specialization: Optimize models for specific domains
- Post title: NLP (6): GPT and Generative Language Models
- Post author: Chen Kai
- Create time: 2024-03-03 14:00:00
- Post link: https://www.chenk.top/en/nlp-gpt-generative-models/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.