NLP (9): Deep Dive into LLM Architecture
Chen Kai

ChatGPT's emergence has made Large Language Models (LLMs) the focal point of AI, but understanding how they work is far from straightforward. Why can GPT generate fluent text while BERT excels at understanding tasks? Why do some models handle tens of thousands of tokens while others degrade beyond 2048 tokens? These differences stem from fundamental architectural choices.

Architectural choices define a model's capabilities: Encoder-only architectures understand context through bidirectional attention but cannot autoregressively generate; Decoder-only architectures excel at generation but only see unidirectional information; Encoder-Decoder architectures balance both but at higher computational cost. Long-context techniques (ALiBi, RoPE, Flash Attention) break sequence length limits through different position encodings and attention optimizations. MoE architectures achieve trillion-parameter scale through sparse activation, while quantization and KV Cache techniques enable large models to run on consumer hardware.

This article dives deep into these core technologies: from architectural trade-offs to long-context implementation details, from MoE routing mechanisms to quantization error control, from KV Cache memory optimization to inference service engineering. Each technique includes runnable code examples and performance analysis, helping readers not only understand principles but also implement them.

LLM Architecture Choices: Encoder-only vs Decoder-only vs Encoder-Decoder

The architectural choice of large language models is a key factor determining their capabilities and application scenarios. The three mainstream architectures each have their advantages and disadvantages, and understanding their differences is crucial for selecting the appropriate model.

Encoder-only Architecture

Encoder-only architecture uses only the encoder part of the Transformer, with BERT being a typical representative. This architecture typically uses Masked Language Modeling (MLM) tasks during pre-training.

Characteristics: - Bidirectional context understanding: Can see both preceding and following information in the input sequence - Suitable for understanding tasks: Text classification, named entity recognition, sentiment analysis, etc. - Not suitable for generation tasks: Cannot perform autoregressive generation

Mathematical Representation:

For input sequence x = (x_1, ..., x_n), an Encoder-only model computes h_i = Encoder(x_1, ..., x_n)_i, where h_i is the contextual representation at position i, computed with full bidirectional attention over the sequence.

from transformers import AutoModel, AutoTokenizer
import torch

# Encoder-only model example (BERT)
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass: all positions computed simultaneously
outputs = model(**inputs)
# outputs.last_hidden_state shape: [batch_size, seq_len, hidden_size]
# Each position's representation contains bidirectional context information

Application Scenarios: - Text classification - Named Entity Recognition (NER) - Sentiment analysis - Text similarity computation - Question answering systems (requiring context understanding)

Decoder-only Architecture

Decoder-only architecture uses only the decoder part of the Transformer, with the GPT series being typical representatives. This architecture uses causal masking to ensure each position can only see previous information.

Characteristics: - Autoregressive generation: Generates tokens one by one, with each token depending on all previous tokens - Unidirectional context: Can only see information before the current position - Suitable for generation tasks: Text generation, dialogue systems, code generation, etc.

Mathematical Representation:

For input sequence x = (x_1, ..., x_n), a Decoder-only model factorizes generation as P(x_t | x_1, ..., x_{t-1}) = softmax(W h_t), where h_t is the hidden state at position t, computed under a causal mask so it attends only to positions ≤ t.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Decoder-only model example (GPT-2)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt")

# Autoregressive generation
outputs = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Application Scenarios: - Text generation - Dialogue systems - Code generation - Text completion - Creative writing

Encoder-Decoder Architecture

Encoder-Decoder architecture uses both encoder and decoder, with T5 and BART being typical representatives. The encoder processes input, and the decoder generates output.

Characteristics: - Bidirectional understanding + autoregressive generation: Encoder understands input bidirectionally, decoder generates output unidirectionally - Suitable for sequence-to-sequence tasks: Translation, summarization, question answering, etc. - Higher computational cost: Requires maintaining both encoder and decoder

Mathematical Representation:

For input sequence x and target sequence y = (y_1, ..., y_m), the model computes P(y_t | y_{<t}, x) = softmax(W h_t), with h_t = Decoder(y_{<t}, Encoder(x)): the decoder attends to the encoder output through cross-attention at every step.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Encoder-Decoder model example (T5)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Text summarization task
text = "The quick brown fox jumps over the lazy dog. " * 3
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

outputs = model.generate(
    **inputs,
    max_length=50,
    num_beams=4,
    early_stopping=True
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)

Application Scenarios: - Machine translation - Text summarization - Question answering systems - Dialogue systems (requiring context understanding) - Text rewriting

Architecture Selection Guide

| Architecture Type | Advantages | Disadvantages | Typical Applications |
| --- | --- | --- | --- |
| Encoder-only | Bidirectional understanding, strong comprehension | Cannot generate; requires an additional task head | Classification, NER, similarity |
| Decoder-only | Strong generation capability, simple architecture | Only unidirectional understanding | Text generation, dialogue |
| Encoder-Decoder | Understanding + generation, flexible | High computational cost, large parameter count | Translation, summarization, Q&A |

Long-Context Handling Techniques

The attention mechanism of traditional Transformers has O(n²) complexity, where n is the sequence length. As sequence length increases, computational and memory costs grow dramatically. To address this, researchers have proposed various long-context handling techniques.

ALiBi (Attention with Linear Biases)

Traditional position encodings (like sinusoidal position encodings) fix a maximum length during training, causing performance to degrade sharply beyond this length. ALiBi solves this with an elegant insight: instead of adding position information at the embedding layer, directly penalize long-distance attention connections during attention computation.

Core Idea:

ALiBi embodies the intuition that "the farther apart, the less attention should be paid." It achieves this by adding a negative bias proportional to relative distance to the attention scores: softmax(qK^T / sqrt(d) + m·B), where m is the slope for each attention head and B is a linear bias matrix based on relative positions. For positions i and j, the bias value is -m·|i - j|, meaning the farther apart two positions are, the more their attention score is reduced.

Slope Design:

ALiBi assigns different slopes to each attention head, typically using a geometric sequence: with n heads, the i-th head has slope 2^(-8i/n). This design lets different heads attend at different scales: small-slope heads are penalized little by distance and capture global patterns, while large-slope heads focus on local patterns.

import torch
import torch.nn as nn
import math

class ALiBiAttention(nn.Module):
    def __init__(self, d_model, n_heads, max_len=512):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # ALiBi slopes: each head has a different slope
        # Typically 2^(-8/n_heads * i), where i is the head index
        slopes = []
        for i in range(n_heads):
            slope = 2 ** (-8 / n_heads * (i + 1))
            slopes.append(slope)
        self.register_buffer('slopes', torch.tensor(slopes))

        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        Q = self.q_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        K = self.k_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        V = self.v_proj(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)

        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Add ALiBi bias
        # bias shape: [n_heads, seq_len, seq_len]
        bias = self._get_alibi_bias(seq_len, x.device)
        scores = scores + bias.unsqueeze(0)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_output = torch.matmul(attn_weights, V)

        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, seq_len, self.d_model
        )

        return self.out_proj(attn_output)

    def _get_alibi_bias(self, seq_len, device):
        # Relative position bias matrix: for positions i and j, bias is -m * |i - j|
        positions = torch.arange(seq_len, device=device).float()
        relative_positions = positions.unsqueeze(0) - positions.unsqueeze(1)

        # Apply a different slope for each head
        bias = -self.slopes.unsqueeze(1).unsqueeze(2) * relative_positions.abs()
        return bias

Advantages: - No position encoding needed, simplifies model architecture - Can extrapolate to longer sequences - High training and inference efficiency

Applications: - BLOOM model uses ALiBi - Suitable for scenarios requiring long-text processing

RoPE (Rotary Position Embedding)

RoPE is the position encoding method adopted by mainstream models like LLaMA. Its core idea is encoding position information as rotation operations. Compared to absolute position encodings, RoPE has better extrapolation capabilities because it encodes relative positional relationships.

Mathematical Principle:

RoPE encodes positions as rotations in the complex domain. For a vector of dimension d, it's divided into d/2 pairs, with each pair (x_{2i}, x_{2i+1}) treated as a complex number. This complex number is then multiplied by a rotation factor e^(i·m·θ_i), where m is the position and θ_i = 10000^(-2i/d) is the frequency.

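The rotation above can be implemented with a few tensor operations. Below is a minimal sketch that pairs consecutive dimensions (production layouts, such as LLaMA's half-split pairing, differ in detail but are mathematically equivalent):

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary position embedding to x: [seq_len, d] (d even).

    Each consecutive pair (x_{2i}, x_{2i+1}) is rotated by angle m * theta_i,
    where m is the token position and theta_i = base^(-2i/d).
    """
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)   # [d/2] frequencies
    angles = positions.float().unsqueeze(1) * theta        # [seq_len, d/2]
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[:, 0::2], x[:, 1::2]                        # even / odd dims
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                     # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: <rope(q, m), rope(k, n)> depends only on n - m
q = torch.randn(1, 8)
k = torch.randn(1, 8)
dot_a = (rope_rotate(q, torch.tensor([3])) * rope_rotate(k, torch.tensor([1]))).sum()
dot_b = (rope_rotate(q, torch.tensor([7])) * rope_rotate(k, torch.tensor([5]))).sum()
# dot_a equals dot_b because both position offsets are 2
```

The last four lines verify the key property: attention scores between rotated queries and keys depend only on the relative offset, not the absolute positions.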
For query vector q at position m and key vector k at position n, the rotated vectors are q' = R_m q and k' = R_n k, where R_m is a block-diagonal rotation matrix built from the frequencies θ_i = 10000^(-2i/d). The inner product ⟨R_m q, R_n k⟩ = ⟨q, R_{n-m} k⟩ depends only on the relative position n - m, which is what gives RoPE its relative-position behavior and extrapolation capability.

Model Quantization

Quantization stores weights (and optionally activations) in low-bit integers, cutting memory and bandwidth requirements.

INT8 Quantization

Symmetric quantization maps [-max|x|, max|x|] to the signed integer range with a single scale factor. Asymmetric quantization maps [min, max] using a scale and a zero-point z, better utilizing the quantization range but slightly more complex. In practice, weights typically use symmetric quantization, while activations use asymmetric quantization.

import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_tensor(x, num_bits=8):
    """Symmetric quantization"""
    # Compute scale factor
    scale = x.abs().max() / (2 ** (num_bits - 1) - 1)

    # Quantize
    q = torch.round(x / scale).clamp(-2**(num_bits-1), 2**(num_bits-1)-1)

    # Dequantize
    x_dequant = q * scale

    return q, scale, x_dequant

# Example: quantize a linear layer
class QuantizedLinear(nn.Module):
    def __init__(self, linear_layer):
        super().__init__()
        self.weight = linear_layer.weight.data
        self.bias = linear_layer.bias.data if linear_layer.bias is not None else None

        # Quantize weights
        self.weight_q, self.weight_scale, _ = quantize_tensor(self.weight)

    def forward(self, x):
        # Dequantize weights
        weight_dequant = self.weight_q * self.weight_scale

        # Compute output
        output = F.linear(x, weight_dequant, self.bias)
        return output

INT4 Quantization

INT4 quantization further reduces precision to 4 bits, cutting model size by about 4x relative to FP16 (8x relative to FP32), but may cause greater accuracy loss.
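To contain that error, INT4 is usually applied per group: a block of, say, 64-128 weights shares one scale. A sketch of group-wise symmetric INT4 quantization (the group size and layout here are illustrative):

```python
import torch

def quantize_int4_grouped(w, group_size=64):
    """Group-wise symmetric INT4: each group of weights shares one scale,
    values are mapped to integers in [-8, 7]."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)                    # [n_groups, group_size]
    scales = groups.abs().max(dim=1, keepdim=True).values / 7
    scales = scales.clamp(min=1e-8)                       # avoid division by zero
    q = torch.round(groups / scales).clamp(-8, 7)
    dequant = (q * scales).reshape(orig_shape)
    return q, scales, dequant

w = torch.randn(128, 128)
q, scales, w_hat = quantize_int4_grouped(w)
rel_err = (w - w_hat).norm() / w.norm()
print(f"relative error: {rel_err:.3f}")
```

Smaller groups mean each scale tracks its weights more tightly (lower error) but add per-group metadata overhead; 64-128 is a common compromise.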

GPTQ (GPT Quantization)

GPTQ is a post-training quantization method that minimizes quantization error through layer-wise optimization.

Principle:

GPTQ quantizes each layer independently, using the Hessian matrix to guide the quantization process:

  1. Compute the Hessian matrix H of the layer (estimated from calibration inputs, H ≈ 2XXᵀ)
  2. Quantize weights in order of importance
  3. Update unquantized weights to compensate for quantization error
import torch

def gptq_quantize_layer(weight, num_bits=4):
    """
    GPTQ quantization (simplified)
    weight: [out_features, in_features]
    """
    out_features, in_features = weight.shape

    # Hessian matrix, simplified to the identity here; real GPTQ estimates
    # H ~ 2 X X^T from calibration inputs, which is what drives the
    # cross-column error compensation below (a no-op with the identity)
    H = torch.eye(in_features, device=weight.device)

    # Initialize quantized weights
    weight_q = weight.clone()
    quantized = torch.zeros(in_features, dtype=torch.bool, device=weight.device)

    # Quantize column by column
    for col_idx in range(in_features):
        if quantized[col_idx]:
            continue

        # Quantize this column
        w_col = weight_q[:, col_idx]
        scale = w_col.abs().max() / (2 ** (num_bits - 1) - 1)
        w_col_q = torch.round(w_col / scale).clamp(-2**(num_bits-1), 2**(num_bits-1)-1)
        w_col_dequant = w_col_q * scale

        # Quantization error for this column
        error = w_col - w_col_dequant

        # Update unquantized columns to compensate for the error
        for j in range(in_features):
            if not quantized[j] and j != col_idx:
                weight_q[:, j] -= error * H[col_idx, j] / H[col_idx, col_idx]

        # Save quantization result
        weight_q[:, col_idx] = w_col_dequant
        quantized[col_idx] = True

    return weight_q

AWQ (Activation-aware Weight Quantization)

AWQ is an activation-aware quantization method that maintains model performance by protecting important weight channels.

Principle:

AWQ's premise is that channels differ in importance, so important channels should keep higher precision:

  1. Analyze activation value importance
  2. Identify important channels (typically 1%)
  3. Keep important channels in FP16, quantize others to INT4
import torch

def awq_quantize(weight, activation, num_bits=4, preserve_ratio=0.01):
    """
    AWQ quantization (simplified)
    weight: [out_features, in_features]
    activation: [batch_size, in_features] for importance analysis
    """
    out_features, in_features = weight.shape

    # Compute importance of each channel (mean absolute activation)
    channel_importance = activation.abs().mean(dim=0)  # [in_features]

    # Select important channels
    num_preserve = max(1, int(in_features * preserve_ratio))
    _, important_indices = torch.topk(channel_importance, num_preserve)
    important = set(important_indices.tolist())

    # Initialize quantized weights
    weight_q = weight.clone()

    # Quantize non-important channels; important channels stay in full precision
    for col_idx in range(in_features):
        if col_idx not in important:
            w_col = weight[:, col_idx]
            scale = w_col.abs().max() / (2 ** (num_bits - 1) - 1)
            w_col_q = torch.round(w_col / scale).clamp(-2**(num_bits-1), 2**(num_bits-1)-1)
            weight_q[:, col_idx] = w_col_q * scale

    return weight_q, important_indices

KV Cache Optimization

In autoregressive generation, each new token requires recomputing Keys and Values for all previous tokens. KV Cache avoids redundant computation by caching these intermediate results.

KV Cache Principle

Computation without Cache:

When generating the t-th token, every step must recompute the attention inputs for the whole prefix: K_{1:t} = X_{1:t} W_K and V_{1:t} = X_{1:t} W_V.

Computation with Cache: Only k_t = x_t W_K and v_t = x_t W_V for the new token are computed, then concatenated with the cached K_{1:t-1} and V_{1:t-1}.

import torch

class KVCache:
    """KV Cache implementation"""
    def __init__(self, batch_size, n_heads, head_dim, max_len=2048):
        self.batch_size = batch_size
        self.n_heads = n_heads
        self.head_dim = head_dim
        self.max_len = max_len

        # Initialize cache
        self.k_cache = torch.zeros(batch_size, n_heads, max_len, head_dim)
        self.v_cache = torch.zeros(batch_size, n_heads, max_len, head_dim)
        self.cache_len = 0

    def update(self, k, v, start_pos=0):
        """
        Update cache
        k, v: [batch_size, n_heads, seq_len, head_dim]
        """
        seq_len = k.shape[2]
        end_pos = start_pos + seq_len
        self.k_cache[:, :, start_pos:end_pos] = k
        self.v_cache[:, :, start_pos:end_pos] = v
        self.cache_len = max(self.cache_len, end_pos)

    def get(self, start_pos=0, end_pos=None):
        """Get cached K/V"""
        if end_pos is None:
            end_pos = self.cache_len
        return (
            self.k_cache[:, :, start_pos:end_pos],
            self.v_cache[:, :, start_pos:end_pos]
        )

# Generation example with KV Cache, using the Hugging Face cache interface
# (the KVCache class above illustrates what past_key_values stores internally)
def generate_with_kv_cache(model, tokenizer, prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt")
    generated_ids = inputs["input_ids"]

    past_key_values = None
    with torch.no_grad():
        for _ in range(max_new_tokens):
            if past_key_values is None:
                # First pass: process the whole prompt and fill the cache
                outputs = model(generated_ids, use_cache=True)
            else:
                # Later passes: feed only the newest token, reuse the cache
                outputs = model(generated_ids[:, -1:],
                                past_key_values=past_key_values,
                                use_cache=True)
            past_key_values = outputs.past_key_values

            next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated_ids = torch.cat([generated_ids, next_token], dim=-1)

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

KV Cache Optimization Strategies

  1. Chunked storage: Store Cache in chunks, supporting dynamic expansion
  2. Compression: Compress historical KV (e.g., using low precision)
  3. Sliding window: Only keep KV for the most recent w tokens
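The sliding-window strategy can be sketched as a fixed-size buffer (a simplified illustration; practical schemes such as StreamingLLM additionally keep a few initial "attention sink" tokens):

```python
import torch

class SlidingWindowKVCache:
    """Keep K/V only for the most recent `window` tokens."""
    def __init__(self, n_heads, head_dim, window=1024):
        self.window = window
        self.k = torch.zeros(1, n_heads, 0, head_dim)
        self.v = torch.zeros(1, n_heads, 0, head_dim)

    def append(self, k_new, v_new):
        # k_new, v_new: [1, n_heads, t, head_dim]; drop entries beyond the window
        self.k = torch.cat([self.k, k_new], dim=2)[:, :, -self.window:]
        self.v = torch.cat([self.v, v_new], dim=2)[:, :, -self.window:]
        return self.k, self.v

cache = SlidingWindowKVCache(n_heads=4, head_dim=8, window=16)
for _ in range(40):                       # feed 40 tokens through a window of 16
    k, v = cache.append(torch.randn(1, 4, 1, 8), torch.randn(1, 4, 1, 8))
print(k.shape)                            # the cache never exceeds the window
```

Memory thus stays constant regardless of how long generation runs, at the cost of the model losing direct access to tokens older than the window.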

Inference Optimization Techniques

Batching

Batching merges multiple requests for processing, improving GPU utilization.

def batch_generate(model, tokenizer, prompts, batch_size=8):
    """Batch generation"""
    # Decoder-only tokenizers (e.g. GPT-2) often define no pad token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    results = []
    for i in range(0, len(prompts), batch_size):
        batch_prompts = prompts[i:i+batch_size]

        # Tokenize
        inputs = tokenizer(
            batch_prompts,
            return_tensors="pt",
            padding=True,
            truncation=True
        )

        # Generate
        outputs = model.generate(**inputs, max_length=100)

        # Decode
        batch_results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        results.extend(batch_results)

    return results

Continuous Batching

Continuous batching allows dynamically adding and removing requests, improving throughput.
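The idea can be sketched as a scheduler loop that refills GPU slots the moment a sequence finishes, instead of waiting for the whole batch to drain (a toy step-count simulation, not a real serving engine):

```python
from collections import deque

def continuous_batching_sim(request_lengths, max_batch=4):
    """Simulate token-level scheduling: each step decodes one token for every
    active request; finished slots are refilled immediately from the queue."""
    queue = deque(enumerate(request_lengths))
    active = {}                                    # request id -> tokens left
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch:   # refill free slots
            rid, length = queue.popleft()
            active[rid] = length
        steps += 1                                 # one decode step for the batch
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]                    # slot freed mid-batch
    return steps

# 1 long request + 7 short ones: short requests finish and free slots early
steps = continuous_batching_sim([100, 5, 5, 5, 5, 5, 5, 5])
print(steps)
```

Here continuous batching finishes in 100 steps, bounded by the longest request; static batching of the same requests in groups of four would need 105 steps (100 + 5), and the gap widens as length variance grows.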

Quantized Inference

Use quantized models for inference, reducing memory and computational requirements.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config
)

Model Parallelism

Distribute models across multiple GPUs, supporting larger models.

import torch.nn as nn

# Model parallelism example
class ParallelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1024, 2048).to('cuda:0')
        self.layer2 = nn.Linear(2048, 1024).to('cuda:1')

    def forward(self, x):
        x = self.layer1(x.to('cuda:0'))
        x = self.layer2(x.to('cuda:1'))
        return x

Practical: Deploying and Optimizing LLMs

Deploying with vLLM

vLLM is a high-performance LLM inference and serving framework.

# Install: pip install vllm

from vllm import LLM, SamplingParams

# Load model
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=100
)

# Generate
prompts = [
    "The future of AI is",
    "Machine learning is"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")

Optimizing with TensorRT-LLM

TensorRT-LLM is NVIDIA's LLM inference optimization framework.

# TensorRT-LLM optimization process
# 1. Convert model
# 2. Build TensorRT engine
# 3. Deploy inference

# Example command (requires TensorRT-LLM environment)
# trtllm-build --checkpoint_dir ./checkpoints \
# --output_dir ./engines \
# --gemm_plugin float16

Performance Monitoring

import time
import torch

def benchmark_model(model, tokenizer, prompt, num_runs=10):
    """Performance benchmark"""
    inputs = tokenizer(prompt, return_tensors="pt")

    # Warmup
    for _ in range(3):
        _ = model.generate(**inputs, max_new_tokens=50)

    # Synchronize GPU
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    # Test
    start_time = time.time()
    for _ in range(num_runs):
        outputs = model.generate(**inputs, max_new_tokens=50)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    end_time = time.time()

    avg_time = (end_time - start_time) / num_runs
    tokens_per_second = 50 / avg_time  # 50 newly generated tokens per run

    print(f"Average generation time: {avg_time:.3f}s")
    print(f"Tokens per second: {tokens_per_second:.2f}")

    return avg_time, tokens_per_second

❓ Q&A: Common Questions on LLM Architecture

Q1: How to choose between Encoder-only, Decoder-only, and Encoder-Decoder architectures?

A: The choice depends on task type: - Encoder-only: Suitable for understanding tasks (classification, NER, similarity), requires bidirectional context - Decoder-only: Suitable for generation tasks (text generation, dialogue), simple architecture, high training efficiency - Encoder-Decoder: Suitable for sequence-to-sequence tasks (translation, summarization), requires understanding input and generating output

Q2: Which is better, RoPE or ALiBi?

A: Each has advantages: - RoPE: Relative position encoding, strong generalization, adopted by mainstream models like LLaMA - ALiBi: No position encoding needed, strong extrapolation, used by BLOOM - Choice depends on specific needs: If processing ultra-long sequences, ALiBi may be better; if better position understanding is needed, RoPE may be more suitable

Q3: How much performance improvement can Flash Attention bring?

A: Flash Attention's main advantages are in memory and long sequences: - Memory: Reduced from O(n²) to O(n), can handle 4-8x longer sequences - Speed: Usually 2-4x speedup on long sequences (>2048 tokens) - Short sequences: Little improvement, may even be slightly slower (due to chunking overhead)

Q4: How does MoE architecture achieve load balancing?

A: Load balancing is a key challenge: 1. Routing strategy: Use Top-k routing to ensure each input activates a fixed number of experts 2. Load balancing loss: Add load balancing term to loss function, encouraging uniform distribution 3. Auxiliary loss: Monitor expert usage frequency, penalize imbalance 4. Dynamic routing: Dynamically adjust routing strategy based on load
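Strategies 1 and 2 can be sketched as a top-k router with the auxiliary load-balancing loss used in Switch Transformer-style MoE (a minimal illustration; real implementations add capacity factors and expert-parallel dispatch):

```python
import torch
import torch.nn.functional as F

def topk_route(x, router_weight, k=2):
    """x: [n_tokens, d]; router_weight: [n_experts, d].
    Returns expert indices, gate weights, and a load-balancing loss."""
    logits = x @ router_weight.t()                   # [n_tokens, n_experts]
    probs = F.softmax(logits, dim=-1)
    gates, indices = probs.topk(k, dim=-1)           # top-k experts per token
    gates = gates / gates.sum(dim=-1, keepdim=True)  # renormalize gate weights

    # Load-balancing loss (Switch-style): n_experts * sum(f_e * p_e), where
    # f_e = fraction of tokens whose top-1 expert is e,
    # p_e = mean router probability assigned to expert e
    n_experts = router_weight.shape[0]
    top1 = indices[:, 0]
    f = torch.bincount(top1, minlength=n_experts).float() / x.shape[0]
    p = probs.mean(dim=0)
    aux_loss = n_experts * (f * p).sum()
    return indices, gates, aux_loss

x = torch.randn(32, 16)
w = torch.randn(8, 16)
idx, g, loss = topk_route(x, w, k=2)
print(idx.shape, g.shape, float(loss))   # per-token expert choices and aux loss
```

Minimizing the auxiliary term (which reaches its minimum of 1 under a perfectly uniform assignment) pushes the router toward spreading tokens evenly across experts.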

Q5: How much accuracy is lost with INT4 quantization?

A: Accuracy loss depends on: - Model size: Large models (>7B) usually have smaller loss (<2%) - Quantization method: Advanced methods like GPTQ/AWQ have smaller loss - Task type: Generation tasks are usually more sensitive than understanding tasks - Activation quantization: Quantizing only weights has smaller loss, quantizing activations simultaneously has larger loss

Q6: How much computation can KV Cache save?

A: KV Cache is crucial in autoregressive generation: - Computation savings: Avoids recomputing K/V for the prefix, reducing per-step attention cost from O(n²) to O(n) (n is the sequence length) - Actual effect: When generating 100 tokens, can save about 99% of the K/V computation - Memory overhead: The cached K and V must be stored, and this memory grows linearly with sequence length
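The memory side is easy to estimate. A back-of-the-envelope calculation for an assumed LLaMA-2-7B-like configuration (32 layers, 32 heads, head dim 128, FP16 cache):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_el=2):
    # Factor of 2 for K and V, stored per layer, per head, per position
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_el

mem = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)
print(f"{mem / 2**30:.1f} GiB per sequence")   # 2.0 GiB at 4k context
```

At 2 GiB per 4k-token sequence, the cache quickly dominates memory under concurrent serving, which is why grouped-query attention and cache quantization are popular mitigations.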

Q7: How to choose quantization method (GPTQ vs AWQ)?

A: - GPTQ: Post-training quantization, suitable for general scenarios, fast quantization speed - AWQ: Activation-aware, usually higher accuracy, but requires calibration data - Recommendation: If pursuing highest accuracy, choose AWQ; if fast quantization is needed, choose GPTQ

Q8: How do MoE models select experts during inference?

A: Expert selection during inference: 1. Top-k routing: Select the k experts with the highest gating scores (usually k = 2) 2. Deterministic routing: Use argmax to select a single expert (faster but may have slightly lower accuracy) 3. Load-aware routing: Consider expert load, avoid overloading certain experts

Q9: Applicable scenarios for long-context handling techniques?

A: - ALiBi: Suitable for scenarios requiring extrapolation to ultra-long sequences (e.g., long document processing) - RoPE: Suitable for scenarios requiring precise position understanding (e.g., code generation) - Flash Attention: Should be used in all scenarios requiring long sequence processing - Sparse Attention: Suitable for scenarios with low accuracy requirements but need to process ultra-long sequences

Q10: How to optimize LLM inference latency?

A: Multiple approaches: 1. Quantization: Use INT8/INT4 quantization to reduce computation 2. KV Cache: Must use, avoid redundant computation 3. Batching: Merge requests to improve GPU utilization 4. Model parallelism: Distribute large models across multiple GPUs 5. Compilation optimization: Use TensorRT, ONNX Runtime, etc. 6. Hardware acceleration: Use dedicated AI chips (e.g., H100)


This article delves deep into various aspects of large language model architecture, from basic architectural choices to advanced optimization techniques. Understanding these technologies is crucial for building efficient and scalable LLM applications. In practice, it's necessary to select appropriate architectures and technology combinations based on specific needs, finding a balance between performance and cost.

  • Post title:NLP (9): Deep Dive into LLM Architecture
  • Post author:Chen Kai
  • Create time:2024-03-21 09:15:00
  • Post link:https://www.chenk.top/en/nlp-llm-architecture-deep-dive/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.