ChatGPT's emergence has made Large Language Models (LLMs) the focal point of AI, but understanding how they work is far from straightforward. Why can GPT generate fluent text while BERT excels at understanding tasks? Why do some models handle tens of thousands of tokens while others degrade beyond 2048 tokens? These differences stem from fundamental architectural choices.
Architectural choices define a model's capabilities: Encoder-only architectures understand context through bidirectional attention but cannot autoregressively generate; Decoder-only architectures excel at generation but only see unidirectional information; Encoder-Decoder architectures balance both but at higher computational cost. Long-context techniques (ALiBi, RoPE, Flash Attention) break sequence length limits through different position encodings and attention optimizations. MoE architectures achieve trillion-parameter scale through sparse activation, while quantization and KV Cache techniques enable large models to run on consumer hardware.
This article dives deep into these core technologies: from architectural trade-offs to long-context implementation details, from MoE routing mechanisms to quantization error control, from KV Cache memory optimization to inference service engineering. Each technique includes runnable code examples and performance analysis, helping readers not only understand principles but also implement them.
LLM Architecture Choices: Encoder-only vs Decoder-only vs Encoder-Decoder
The architectural choice of large language models is a key factor determining their capabilities and application scenarios. The three mainstream architectures each have their advantages and disadvantages, and understanding their differences is crucial for selecting the appropriate model.
Encoder-only Architecture
Encoder-only architecture uses only the encoder part of the Transformer, with BERT being a typical representative. This architecture typically uses Masked Language Modeling (MLM) tasks during pre-training.
Characteristics:

- Bidirectional context understanding: can see both preceding and following information in the input sequence
- Suitable for understanding tasks: text classification, named entity recognition, sentiment analysis, etc.
- Not suitable for generation tasks: cannot perform autoregressive generation
Mathematical Representation:
For input sequence $X = (x_1, \dots, x_n)$, the encoder computes hidden states $H = \text{Encoder}(X)$, where every $h_i$ attends to all positions. During MLM pre-training, masked tokens are predicted from the full bidirectional context:

$$P(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$$
```python
from transformers import AutoModel, AutoTokenizer

# Encoder-only model example (BERT)
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
outputs = model(**inputs)

# Bidirectional contextual embeddings: one vector per input token
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```
Application Scenarios:

- Text classification
- Named Entity Recognition (NER)
- Sentiment analysis
- Text similarity computation
- Question answering systems (requiring context understanding)
Decoder-only Architecture
Decoder-only architecture uses only the decoder part of the Transformer, with the GPT series being typical representatives. This architecture uses causal masking to ensure each position can only see previous information.
Characteristics:

- Autoregressive generation: generates tokens one by one, with each token depending on all previous tokens
- Unidirectional context: can only see information before the current position
- Suitable for generation tasks: text generation, dialogue systems, code generation, etc.
Mathematical Representation:
For input sequence $X = (x_1, \dots, x_n)$, the model factorizes the joint probability autoregressively:

$$P(X) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})$$

The causal mask guarantees that position $t$ attends only to positions $\le t$.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Decoder-only model example (GPT-2)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Application Scenarios:

- Text generation
- Dialogue systems
- Code generation
- Text completion
- Creative writing
Encoder-Decoder Architecture
Encoder-Decoder architecture uses both encoder and decoder, with T5 and BART being typical representatives. The encoder processes input, and the decoder generates output.
Characteristics:

- Bidirectional understanding + autoregressive generation: the encoder understands input bidirectionally, the decoder generates output unidirectionally
- Suitable for sequence-to-sequence tasks: translation, summarization, question answering, etc.
- Higher computational cost: requires maintaining both encoder and decoder
Mathematical Representation:
For input sequence $X = (x_1, \dots, x_n)$ and target sequence $Y = (y_1, \dots, y_m)$, the encoder first computes hidden states $H = \text{Encoder}(X)$; the decoder then generates autoregressively, attending to $H$ through cross-attention:

$$P(Y \mid X) = \prod_{t=1}^{m} P(y_t \mid y_1, \dots, y_{t-1}, H)$$

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Encoder-Decoder model example (T5)
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Text summarization task
text = "The quick brown fox jumps over the lazy dog. " * 3
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

outputs = model.generate(
    **inputs,
    max_length=50,
    num_beams=4,
    early_stopping=True
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
```
Application Scenarios:

- Machine translation
- Text summarization
- Question answering systems
- Dialogue systems (requiring context understanding)
- Text rewriting
Architecture Selection Guide
| Architecture Type | Advantages | Disadvantages | Typical Applications |
|---|---|---|---|
| Encoder-only | Bidirectional understanding, strong comprehension | Cannot generate, requires additional task head | Classification, NER, similarity |
| Decoder-only | Strong generation capability, simple architecture | Only unidirectional understanding | Text generation, dialogue |
| Encoder-Decoder | Understanding + generation, flexible | High computational cost, large parameter count | Translation, summarization, Q&A |
Long-Context Handling Techniques
The attention mechanism of traditional Transformers has $O(n^2)$ time and memory complexity in the sequence length $n$, and position encodings trained at a fixed maximum length generalize poorly beyond it. The techniques below attack these two bottlenecks.
ALiBi (Attention with Linear Biases)
Traditional position encodings (like sinusoidal position encodings) fix a maximum length during training, causing performance to degrade sharply beyond this length. ALiBi solves this with an elegant insight: instead of adding position information at the embedding layer, directly penalize long-distance attention connections during attention computation.
Core Idea:
ALiBi embodies the intuition that "the farther apart, the less attention should be paid." It achieves this by adding a negative bias proportional to relative distance to attention scores:

$$\text{softmax}\left(q_i K^\top + m \cdot [-(i-1), \dots, -2, -1, 0]\right)$$

where $m$ is a head-specific slope.

Slope Design:

ALiBi assigns a different slope to each attention head, using a geometric sequence: with $n$ heads, head $h$ uses $m_h = 2^{-8h/n}$ (for 8 heads, the slopes are $\frac{1}{2}, \frac{1}{4}, \dots, \frac{1}{256}$), so some heads focus locally while others keep a longer effective range.
```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build the ALiBi bias tensor that is added to attention scores."""
    # Head-specific slopes: geometric sequence 2^(-8h/n), h = 1..n
    slopes = torch.tensor([2 ** (-8 * h / num_heads) for h in range(1, num_heads + 1)])
    # Relative distance j - i (non-positive for past positions under the causal mask)
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]          # (seq, seq)
    bias = slopes[:, None, None] * distance[None]   # (heads, seq, seq)
    return bias  # add to q @ k^T before softmax (together with the causal mask)

print(alibi_bias(8, 4).shape)  # torch.Size([8, 4, 4])
```
Advantages:

- No position encoding needed, simplifying the model architecture
- Can extrapolate to longer sequences
- High training and inference efficiency
Applications:

- The BLOOM model uses ALiBi
- Suitable for scenarios requiring long-text processing
RoPE (Rotary Position Embedding)
RoPE is the position encoding method adopted by mainstream models like LLaMA. Its core idea is encoding position information as rotation operations. Compared to absolute position encodings, RoPE has better extrapolation capabilities because it encodes relative positional relationships.
Mathematical Principle:
RoPE encodes positions as rotations in the complex plane. For a vector of dimension $d$, each component pair $(x_{2i}, x_{2i+1})$ is treated as a complex number and rotated by the angle $m\theta_i$, where $m$ is the position and $\theta_i = 10000^{-2i/d}$.

For query vector $q$ at position $m$ and key vector $k$ at position $n$, the rotated inner product satisfies

$$\langle R_m q, R_n k \rangle = \langle q, R_{n-m} k \rangle,$$

so attention scores depend only on the relative offset $n - m$, which is the source of RoPE's extrapolation ability.
```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * theta    # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    # Rotate each (x1, x2) pair by its position-dependent angle
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)
print(rotary_embed(q).shape)  # torch.Size([16, 64])
```
INT4 Quantization
INT4 quantization further reduces precision to 4 bits, reducing model size by 8x, but may cause greater accuracy loss.
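The core idea can be sketched independently of any library (a minimal illustration with a single per-tensor scale; real INT4 schemes use per-group scales and packed 4-bit storage):

```python
def quantize_int4(weights):
    """Symmetric per-tensor INT4 quantization: 16 integer levels in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.07, -0.91]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most half a quantization step
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2)  # True
```

The error bound of half a quantization step is exactly why larger weights (a larger scale) mean a coarser grid, motivating the channel-aware methods below.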
GPTQ (GPT Quantization)
GPTQ is a post-training quantization method that minimizes quantization error through layer-wise optimization.
Principle:
GPTQ quantizes each layer independently, using the Hessian of the layer's reconstruction error to guide the quantization process:

- Compute the Hessian matrix $H = 2XX^\top$ from calibration inputs $X$
- Quantize weights column by column, in order of importance
- Update the not-yet-quantized weights to compensate for the quantization error just introduced
```python
import torch

def gptq_quantize_layer(weight: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Greatly simplified GPTQ-style sketch: quantize columns one at a time and
    fold each column's quantization error back into the remaining columns.
    (Real GPTQ weights this update by the inverse Hessian from calibration data;
    the uniform spread below is a crude stand-in.)"""
    W = weight.clone()
    n_levels = 2 ** (num_bits - 1) - 1                 # 7 for INT4
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        col = W[:, j]
        scale = col.abs().max() / n_levels + 1e-8
        q = (col / scale).round().clamp(-n_levels - 1, n_levels)
        Q[:, j] = q * scale                            # dequantized column
        err = col - Q[:, j]
        if j + 1 < W.shape[1]:
            # Error compensation over the not-yet-quantized columns
            W[:, j + 1:] += err[:, None] / (W.shape[1] - j - 1)
    return Q
```
AWQ (Activation-aware Weight Quantization)
AWQ is an activation-aware quantization method that maintains model performance by protecting important weight channels.
Principle:
AWQ observes that channels differ in importance, and that the important ones deserve higher precision:
- Analyze activation value importance
- Identify important channels (typically 1%)
- Keep important channels in FP16, quantize others to INT4
```python
import torch

def awq_quantize(weight, activation, num_bits=4, preserve_ratio=0.01):
    """Simplified AWQ-style sketch: channels with the largest average activation
    magnitude stay in FP16; the rest are quantized to INT4."""
    # Importance of each input channel, measured from calibration activations
    importance = activation.abs().mean(dim=0)              # (in_features,)
    k = max(1, int(preserve_ratio * importance.numel()))
    keep = importance.topk(k).indices                      # protected channels
    n_levels = 2 ** (num_bits - 1) - 1
    W = weight.clone()
    mask = torch.ones(W.shape[1], dtype=torch.bool)
    mask[keep] = False                                     # True = quantize
    scale = W[:, mask].abs().max() / n_levels + 1e-8
    W[:, mask] = (W[:, mask] / scale).round().clamp(-n_levels - 1, n_levels) * scale
    return W
```
KV Cache Optimization
In autoregressive generation, each new token requires recomputing Keys and Values for all previous tokens. KV Cache avoids redundant computation by caching these intermediate results.
KV Cache Principle
Computation without Cache:

When generating the $t$-th token, the model recomputes Keys and Values for all $t-1$ previous tokens, so producing $n$ tokens costs $O(n^2)$ K/V projections in total.

Computation with Cache:

Each step computes only the new token's $k_t$ and $v_t$ and appends them to the cache, reducing the total K/V projection cost to $O(n)$; attention itself still reads the full cached prefix.
```python
import torch

class KVCache:
    """Per-layer key/value cache for autoregressive decoding."""
    def __init__(self):
        self.k = None  # (batch, heads, seq, head_dim)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Append the new token's K/V along the sequence dimension
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

cache = KVCache()
for _ in range(4):  # simulate 4 decoding steps
    k, v = cache.update(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))
print(k.shape)  # torch.Size([1, 8, 4, 64])
```
KV Cache Optimization Strategies
- Chunked storage: store the cache in fixed-size blocks, supporting dynamic expansion (the idea behind vLLM's PagedAttention)
- Compression: compress historical KV (e.g., store it in lower precision)
- Sliding window: only keep KV for the most recent $w$ tokens, bounding memory at the cost of discarding distant context
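The sliding-window strategy can be sketched with a plain bounded buffer (a simplified illustration; `window` is an assumed parameter and the string entries stand in for real K/V tensors):

```python
from collections import deque

class SlidingWindowKVCache:
    """Keep K/V only for the most recent `window` tokens."""
    def __init__(self, window: int):
        self.window = window
        self.keys = deque(maxlen=window)    # oldest entries are evicted automatically
        self.values = deque(maxlen=window)

    def update(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        return list(self.keys), list(self.values)

cache = SlidingWindowKVCache(window=3)
for t in range(5):                          # 5 decoding steps, window of 3
    ks, vs = cache.update(f"k{t}", f"v{t}")
print(ks)  # ['k2', 'k3', 'k4']
```

Memory is now bounded by the window size instead of growing with the full sequence.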
Inference Optimization Techniques
Batching
Batching merges multiple requests for processing, improving GPU utilization.
```python
def batch_generate(model, tokenizer, prompts, batch_size=8, max_new_tokens=50):
    """Generate completions for a list of prompts in fixed-size batches."""
    # Decoder-only tokenizers often lack a pad token; reuse EOS for padding
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```
Continuous Batching
Continuous batching allows dynamically adding and removing requests, improving throughput.
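The scheduling idea can be sketched with a toy simulation (hypothetical, heavily simplified; real schedulers such as vLLM's are far more involved): a finished sequence frees its slot immediately, and a waiting request takes it in the very next step, so no slot idles on padding.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy simulation: each request is (id, tokens_to_generate).
    Returns the order in which requests finish."""
    waiting = deque(requests)
    running = {}          # id -> remaining tokens
    finished = []
    while waiting or running:
        # Admit new requests into free slots (the "continuous" part)
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decoding step for every running sequence
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot freed immediately
                finished.append(rid)
    return finished

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 1)]))
# ['c', 'a', 'e', 'd', 'b']
```

Note that "e" finishes before "d" even though it arrived later: it was admitted the moment "c" freed a slot, instead of waiting for the whole batch to drain as in static batching.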
Quantized Inference
Use quantized models for inference, reducing memory and computational requirements.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a model in 4-bit NF4 via bitsandbytes (requires a CUDA GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",          # any causal LM checkpoint works here
    quantization_config=bnb_config,
    device_map="auto",
)
```
Model Parallelism
Distribute models across multiple GPUs, supporting larger models.
```python
import torch.nn as nn

class PipelineParallelModel(nn.Module):
    """Naive pipeline parallelism: first half of the layers on GPU 0,
    second half on GPU 1; activations cross the device boundary."""
    def __init__(self, layers):
        super().__init__()
        half = len(layers) // 2
        self.stage0 = nn.Sequential(*layers[:half]).to("cuda:0")
        self.stage1 = nn.Sequential(*layers[half:]).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))
```

In practice, `device_map="auto"` in `from_pretrained` achieves the same layer-to-GPU placement automatically via the accelerate library.
Practical: Deploying and Optimizing LLMs
Deploying with vLLM
vLLM is a high-performance LLM inference and serving framework.
```python
# Install: pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # KV managed via PagedAttention
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

prompts = ["The future of AI is", "Explain KV Cache in one sentence:"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Optimizing with TensorRT-LLM
TensorRT-LLM is NVIDIA's LLM inference optimization framework.
```
# TensorRT-LLM optimization process (high-level outline; exact commands
# vary by version — consult NVIDIA's TensorRT-LLM documentation):
# 1. Convert the Hugging Face checkpoint into TensorRT-LLM's checkpoint format
# 2. Build an optimized engine for the target GPU (fused kernels, optional FP8/INT8)
# 3. Serve the built engine, e.g. through Triton Inference Server
```
Performance Monitoring
```python
import time

def measure_generation(model, tokenizer, prompt, max_new_tokens=100):
    """Measure end-to-end latency and throughput of one generation call."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    print(f"latency: {elapsed:.2f}s, throughput: {new_tokens / elapsed:.1f} tokens/s")
```
❓ Q&A: Common Questions on LLM Architecture
Q1: How to choose between Encoder-only, Decoder-only, and Encoder-Decoder architectures?
A: The choice depends on task type:

- Encoder-only: suitable for understanding tasks (classification, NER, similarity) that require bidirectional context
- Decoder-only: suitable for generation tasks (text generation, dialogue); simple architecture, high training efficiency
- Encoder-Decoder: suitable for sequence-to-sequence tasks (translation, summarization) that must understand an input and generate an output
Q2: Which is better, RoPE or ALiBi?
A: Each has advantages:

- RoPE: relative position encoding with strong generalization, adopted by mainstream models like LLaMA
- ALiBi: no position embedding needed, strong length extrapolation, used by BLOOM
- The choice depends on specific needs: for extrapolating to ultra-long sequences, ALiBi may be better; when precise position understanding matters, RoPE may be more suitable
Q3: How much performance improvement can Flash Attention bring?
A: Flash Attention's main advantages show up in memory use and long sequences:

- Memory: activation memory for attention drops from $O(n^2)$ to $O(n)$, because the full attention matrix is never materialized
- Speed: typically 2–4× faster on long sequences, thanks to tiling and kernel fusion that keep data in fast on-chip SRAM
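A back-of-the-envelope calculation makes the memory point concrete (illustrative numbers):

```python
def attention_matrix_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Memory to materialize the full attention matrix (FP16) for one sequence."""
    return seq_len * seq_len * num_heads * bytes_per_elem

# Standard attention at a 32k context with 32 heads:
print(attention_matrix_bytes(32_768, 32) / 2**30, "GiB")  # 64.0 GiB
```

Flash Attention never allocates this matrix; it streams tiles of it through SRAM, so the activation footprint stays linear in sequence length.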
Q4: How does MoE architecture achieve load balancing?
A: Load balancing is a key challenge:

1. Routing strategy: use Top-k routing so each input activates a fixed number of experts
2. Load balancing loss: add a load-balancing term to the loss function, encouraging uniform expert utilization
3. Auxiliary loss: monitor expert usage frequency and penalize imbalance
4. Dynamic routing: adjust the routing strategy based on observed load
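As a concrete (simplified) example, the auxiliary loss used in Switch Transformers multiplies, for each expert, the fraction of tokens routed to it by the mean router probability it received, summed and scaled by the expert count; it is minimized exactly when routing is uniform:

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """Switch-Transformer-style auxiliary loss (pure-Python sketch).
    router_probs: per-token list of per-expert probabilities
    expert_assignments: per-token index of the chosen (top-1) expert"""
    n_tokens = len(router_probs)
    loss = 0.0
    for e in range(num_experts):
        # f_e: fraction of tokens dispatched to expert e
        f_e = sum(1 for a in expert_assignments if a == e) / n_tokens
        # p_e: mean router probability mass assigned to expert e
        p_e = sum(p[e] for p in router_probs) / n_tokens
        loss += f_e * p_e
    return num_experts * loss

# Perfectly balanced routing over 2 experts yields the minimum value 1.0
probs = [[0.5, 0.5], [0.5, 0.5]]
print(load_balancing_loss(probs, [0, 1], num_experts=2))  # 1.0
```

Skewed routing drives the product $f_e \cdot p_e$ up for the overloaded expert, so the loss penalizes exactly the imbalance described above.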
Q5: How much accuracy is lost with INT4 quantization?
A: Accuracy loss depends on:

- Model size: large models (>7B) usually lose less (<2%)
- Quantization method: advanced methods like GPTQ/AWQ lose less
- Task type: generation tasks are usually more sensitive than understanding tasks
- Activation quantization: quantizing only weights loses less; quantizing activations as well loses more
Q6: How much computation can KV Cache save?
A: KV Cache is crucial in autoregressive generation:

- Computation savings: it avoids recomputing K/V for the whole prefix at every step, cutting the total K/V projection work from $O(n^2)$ to $O(n)$ over an $n$-token generation
- The trade-off: cache memory grows linearly with sequence length, batch size, and layer count, which is why the cache optimizations above (paging, compression, sliding windows) matter
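The linear memory growth is easy to quantify (illustrative numbers, roughly the shape of a 7B LLaMA-style model):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per position (FP16)."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

# 32 layers, 32 heads, head_dim 128, 4096-token context, FP16:
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30, "GiB")  # 2.0 GiB
```

Doubling the context or the batch size doubles this figure, which is what makes paging and sliding-window caches attractive at long contexts.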
Q7: How to choose quantization method (GPTQ vs AWQ)?
A:

- GPTQ: post-training quantization, suitable for general scenarios, fast to run
- AWQ: activation-aware, usually higher accuracy, but requires calibration data
- Recommendation: choose AWQ for the highest accuracy; choose GPTQ when quantization speed matters
Q8: How do MoE models select experts during inference?
A: Expert selection during inference:

1. Top-k routing: the gating network scores every expert and selects the $k$ with the highest scores (typically $k = 1$ or $2$)
2. Sparse computation: only the selected experts run, so compute stays far below what the full parameter count suggests
3. Weighted combination: the selected experts' outputs are combined using the (renormalized) gate weights
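A minimal top-k routing step looks like this (pure-Python sketch; the expert networks are reduced to simple functions for illustration):

```python
import math

def top_k_route(gate_logits, experts, x, k=2):
    """Route input x to the top-k experts and mix their outputs by gate weight."""
    # Softmax over the expert logits
    exps = [math.exp(g) for g in gate_logits]
    probs = [e / sum(exps) for e in exps]
    # Pick the k highest-probability experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the selected gates and mix the expert outputs
    total = sum(probs[i] for i in top)
    return sum(probs[i] / total * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3, lambda x: 0.5 * x]
out = top_k_route([0.1, 2.0, -1.0, 0.3], experts, x=4.0, k=2)
print(out)  # only experts 1 and 3 ran; the other two cost nothing
```

With 4 experts and $k=2$, half the expert compute is skipped on every token; at trillion-parameter scale this sparsity is what keeps inference affordable.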
Q9: Applicable scenarios for long-context handling techniques?
A:

- ALiBi: best when extrapolating to ultra-long sequences (e.g., long-document processing)
- RoPE: best when precise position understanding matters (e.g., code generation)
- Flash Attention: worth using in essentially every long-sequence scenario
- Sparse Attention: suitable when some accuracy can be traded for the ability to process ultra-long sequences
Q10: How to optimize LLM inference latency?
A: Multiple approaches:

1. Quantization: use INT8/INT4 to reduce computation and memory traffic
2. KV Cache: essential; avoids redundant computation
3. Batching: merge requests to improve GPU utilization
4. Model parallelism: distribute large models across multiple GPUs
5. Compilation optimization: use TensorRT, ONNX Runtime, etc.
6. Hardware acceleration: use dedicated AI accelerators (e.g., H100)
This article delves deep into various aspects of large language model architecture, from basic architectural choices to advanced optimization techniques. Understanding these technologies is crucial for building efficient and scalable LLM applications. In practice, it's necessary to select appropriate architectures and technology combinations based on specific needs, finding a balance between performance and cost.
- Post title: NLP (9): Deep Dive into LLM Architecture
- Post author: Chen Kai
- Create time: 2024-03-21 09:15:00
- Post link: https://www.chenk.top/en/nlp-llm-architecture-deep-dive/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.