The Transformer architecture revolutionized natural language
processing by introducing a mechanism that allows models to focus on
relevant parts of the input when processing each element. Unlike
recurrent networks that process sequences step-by-step, Transformers use
attention to capture dependencies regardless of distance, making them
both more powerful and more parallelizable. This article explores the
evolution from basic sequence-to-sequence models to the full Transformer
architecture, diving deep into attention mechanisms, multi-head
attention, positional encoding, and providing complete PyTorch
implementations that you can run and modify.
Why
Sequence-to-Sequence Models Needed Attention
Traditional sequence-to-sequence (seq2seq) models, introduced around
2014, used an encoder-decoder architecture with recurrent neural
networks. The encoder processes the input sequence and compresses all
information into a fixed-size context vector, which the decoder then
uses to generate the output sequence.
The Bottleneck Problem
Consider translating a long sentence: "The cat that chased the mouse
that ate the cheese was very tired and needed to rest." A vanilla
seq2seq model must compress this entire sentence into a single
fixed-dimensional vector before decoding begins. This creates several
problems:
Information Loss: Long sentences contain far more
information than can be captured in a fixed-size vector. As sequence
length increases, the final hidden state struggles to retain details
from early tokens.
Gradient Flow Issues: Even with LSTM or GRU cells,
gradients must flow through many timesteps. The encoder's early states
have limited influence on the decoder's later outputs.
Uniform Weighting: When generating each output
token, the decoder has equal (or diminishing) access to all input tokens
through the context vector. It cannot dynamically focus on relevant
parts of the input.
For example, when translating "the cat" to French ("le chat"), the
decoder should focus heavily on those specific words, not on "the
cheese" that appears later in the sentence. The fixed context vector
provides no mechanism for this selective focus.
The Context Vector
Becomes a Cognitive Load
Think of the context vector as a person trying to memorize an entire
paragraph and then recite it from memory. As the paragraph grows longer,
details get fuzzy. Attention mechanisms solve this by allowing the
decoder to "look back" at the original input at each generation step,
similar to how a human translator might repeatedly reference the source
text.
Birth of Attention:
Bahdanau Mechanism
In 2015, Bahdanau et al. introduced the first attention mechanism for
neural machine translation. The core insight was elegant: instead of
relying solely on a fixed context vector, the decoder should be able to
compute a weighted combination of all encoder hidden states at each
decoding step.
Architecture Overview
The Bahdanau attention mechanism consists of three key
components:
Encoder States: The encoder produces a sequence of
hidden states $h_1, h_2, \ldots, h_T$ for the input sequence of length $T$.
Alignment Scoring: At each decoder timestep $t$, we compute alignment scores between
the decoder state $s_{t-1}$ and
each encoder state $h_i$. These scores
indicate how well the decoder's current position "aligns" with each
input position.
Context Vector Generation: The alignment scores are
normalized into attention weights, which are used to compute a weighted
sum of encoder states, producing a context vector specific to the
current decoding step.
Mathematical Formulation
Let $s_t$ denote the decoder's hidden state at time $t$ and $h_1, \ldots, h_T$ denote the encoder hidden states.
Step 1: Compute Alignment Scores
$$e_{t,i} = a(s_{t-1}, h_i)$$
The alignment function $a$ is typically a small
feedforward network that takes the previous decoder state $s_{t-1}$ and encoder state $h_i$ as input. Bahdanau used a
one-hidden-layer network:
$$e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$$
Here, $W_a$, $U_a$, and $v_a$ are learnable parameters. The $\tanh$ activation introduces non-linearity,
and the final linear projection by $v_a^\top$ produces a scalar score.
Step 2: Normalize to Attention Weights
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$$
This softmax operation ensures that the attention
weights $\alpha_{t,i}$ sum to 1 across
all input positions. Positions with higher alignment scores receive
higher weights.
Step 3: Compute Context Vector
$$c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$$
The context vector $c_t$ is a weighted
average of all encoder states, where the weights reflect how much
attention the decoder should pay to each input position.
Step 4: Update Decoder State
The context vector $c_t$ is
concatenated with the embedded previous output token and fed into the
decoder RNN:
$$s_t = f(s_{t-1}, [\mathrm{emb}(y_{t-1}); c_t])$$
where $f$ is the RNN cell
(typically LSTM or GRU) and $\mathrm{emb}(y_{t-1})$ is the embedding of the previously
generated token.
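The four steps above can be sketched as a small PyTorch module (a minimal illustration, not Bahdanau's full training setup; the dimensions and class name are chosen for the example):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h_i)."""

    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dec_dim); enc_states: (batch, T, enc_dim)
        scores = self.v_a(torch.tanh(
            self.W_a(s_prev).unsqueeze(1) + self.U_a(enc_states)))   # (batch, T, 1)
        weights = torch.softmax(scores.squeeze(-1), dim=-1)          # (batch, T)
        # context vector: attention-weighted sum of encoder states
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)  # (batch, enc_dim)
        return context, weights

attn = BahdanauAttention(dec_dim=16, enc_dim=32, attn_dim=8)
context, weights = attn(torch.randn(2, 16), torch.randn(2, 5, 32))
print(context.shape, weights.shape)  # torch.Size([2, 32]) torch.Size([2, 5])
```

The context vector would then be concatenated with the previous token embedding and fed to the decoder RNN, as in Step 4.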
Visualization of Attention
Weights
Attention weights can be visualized as a heatmap where rows represent
decoder timesteps and columns represent encoder timesteps. High values
indicate strong attention. For the translation "the cat" → "le chat",
the alignment might look like:
         the    cat    was    tired
le       0.70   0.20   0.05   0.05
chat     0.10   0.80   0.05   0.05
était    0.05   0.05   0.70   0.20
This shows that when generating "le", the model attends strongly to
"the"; when generating "chat", it focuses on "cat"; and when generating
"était", it looks at "was" and "tired".
Luong Attention:
Simplification and Variants
Shortly after Bahdanau, Luong et al. (2015) proposed alternative
attention mechanisms that simplified some aspects while introducing new
scoring functions.
Key Differences from
Bahdanau
Decoder State Usage: Luong attention uses the
current decoder state $s_t$ (after the
RNN update) rather than the previous state $s_{t-1}$ when computing attention. This
means attention is calculated after processing the input, not
before.
Scoring Functions: Luong proposed three alternatives:
Dot: $\mathrm{score}(s_t, h_i) = s_t^\top h_i$
General: $\mathrm{score}(s_t, h_i) = s_t^\top W_a h_i$
Concat: $\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$
The dot product is the simplest and fastest but requires
that $s_t$ and $h_i$ have the same dimensionality. The
general form adds a learnable matrix to handle dimension mismatches. The
concat version is similar to Bahdanau's approach.
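As a quick sketch, the three scoring functions map directly to tensor operations (shapes and parameter names here are illustrative; a real model would learn $W_a$, $W_c$, and $v_a$):

```python
import torch

torch.manual_seed(0)
d = 8
s_t = torch.randn(d)      # current decoder state
H = torch.randn(5, d)     # 5 encoder states

# dot: simplest, requires matching dimensions
dot_scores = H @ s_t                                   # (5,)

# general: learnable bilinear form s_t^T W_a h_i
W_a = torch.randn(d, d)
general_scores = H @ (W_a.T @ s_t)                     # (5,)

# concat (Bahdanau-style): v_a^T tanh(W_c [s_t; h_i])
W_c = torch.randn(d, 2 * d)
v_a = torch.randn(d)
concat_in = torch.cat([s_t.expand(5, d), H], dim=-1)   # (5, 2d)
concat_scores = torch.tanh(concat_in @ W_c.T) @ v_a    # (5,)

# any score vector is normalized the same way
weights = torch.softmax(dot_scores, dim=-1)
print(weights.sum())
```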
Local vs Global Attention
Luong also introduced the concept of local attention, where the model
only attends to a small window of encoder states around a predicted
position, rather than all states. This reduces computational cost for
very long sequences.
For local attention, the model predicts an alignment position $p_t$ and computes attention weights only
for positions in the window $[p_t - D, p_t + D]$,
where $D$ is the window size.
From RNN to
Self-Attention: The Paradigm Shift
While Bahdanau and Luong attention improved seq2seq models, they
still relied on RNNs for encoding and decoding. The Transformer,
introduced in the 2017 paper "Attention Is All You Need" by Vaswani et
al., took a radical step: eliminate recurrence entirely and rely solely
on attention mechanisms.
Self-Attention Intuition
Self-attention allows each position in a sequence to attend to all
positions in the same sequence. Unlike encoder-decoder attention (where
decoder attends to encoder), self-attention operates within a single
sequence.
Imagine reading the sentence: "The animal didn't cross the street
because it was too tired." When processing "it", self-attention would
assign high weight to "animal" (since "it" refers to the animal),
helping the model understand the reference.
Query, Key, Value: The
Attention Trinity
Self-attention introduces three concepts: queries (Q), keys (K), and
values (V). These are derived from the input through learned linear
transformations.
Query: Represents the "question" being asked by the
current position. "What should I attend to?"
Key: Represents the "relevance" of each position.
"How relevant am I to the query?"
Value: Represents the actual information to retrieve
from each position. "What information do I contain?"
For each position $i$ in the input
sequence, we compute:
$$q_i = W^Q x_i, \quad k_i = W^K x_i, \quad v_i = W^V x_i$$
where $x_i$ is the input embedding at
position $i$, and $W^Q$, $W^K$, $W^V$ are learned weight matrices.
Scaled Dot-Product Attention
The core attention operation computes how much each position should
attend to every other position using queries and keys, then retrieves
information using values.
Step 1: Compute Attention Scores
$$e_{ij} = q_i^\top k_j$$
This dot product measures the compatibility between
query $q_i$ and key $k_j$. High scores indicate that
position $i$ should attend strongly to
position $j$.
Step 2: Scale the Scores
$$e_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}$$
where $d_k$ is the dimension of the key vectors.
Scaling prevents the dot products from growing too large, which would
push the softmax into regions with extremely small gradients.
Why Scale by $\sqrt{d_k}$? When the key
dimension is large, dot products tend to grow in magnitude. For example,
if keys and queries have independent components with unit variance, their dot
product variance scales with $d_k$.
Dividing by $\sqrt{d_k}$ normalizes
this variance, keeping the softmax input in a reasonable range.
Step 3: Apply Softmax
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$
where $n$ is the sequence length. This produces
attention weights that sum to 1.
Step 4: Compute Weighted Sum of Values
$$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j$$
The output for position $i$ is a weighted combination of all value
vectors, where the weights are determined by the query-key
compatibility.
Matrix Form for Efficient
Computation
In practice, we compute attention for all positions simultaneously
using matrix operations. Stacking all queries, keys, and values into
matrices $Q$, $K$, $V$ (where each row is a query/key/value
vector), the attention operation becomes:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Here, $QK^\top$ is an $n \times n$
matrix of all pairwise query-key dot products, the softmax is applied
row-wise, and the result is multiplied by $V$ to produce the output.
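The matrix form translates almost line-for-line into PyTorch (a minimal sketch without batch or head dimensions; `mask` is an optional boolean tensor marking disallowed positions):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) pairwise scores
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # row-wise softmax
    return weights @ V, weights

Q, K, V = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # torch.Size([4, 8])
print(w.sum(dim=-1))   # each row of attention weights sums to 1
```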
Multi-Head
Attention: Learning Different Perspectives
Single attention heads compute one set of attention weights,
capturing one "view" of the relationships in the sequence. Multi-head
attention runs multiple attention operations in parallel, each with
different learned projections, allowing the model to attend to different
aspects simultaneously.
Motivation
Consider the sentence "The bank by the river has low interest rates."
A single attention mechanism might struggle to simultaneously capture: -
Grammatical relationships (subject-verb agreement between "bank" and
"has") - Semantic relationships (the financial meaning of "bank" vs.
geographical "river") - Positional relationships (nearby vs. distant
tokens)
Multiple heads can specialize in different types of
relationships.
Mathematical Formulation
Given input $X$, we compute $h$ different attention outputs in
parallel:
$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the learned projection matrices for head $i$.
Typically, $d_k = d_v = d_{\text{model}}/h$, so each head operates in a lower-dimensional subspace.
After computing all heads, we concatenate them and apply a final
linear projection:
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is the output
projection matrix.
Example:
8-Head Attention with $d_{\text{model}} = 512$
If we use 8 heads with a model dimension of 512:
Each head has $d_k = d_v = 512/8 = 64$ dimensions
Each head learns its own $W_i^Q$, $W_i^K$, $W_i^V$ matrices of size $512 \times 64$
The concatenated output has dimension $8 \times 64 = 512$
The final projection $W^O$ is $512 \times 512$
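A shape walkthrough of the 8-head example, packing all heads' projections into single linear layers as is common in practice (an illustrative sketch, not a full module):

```python
import torch
import torch.nn as nn

d_model, h = 512, 8
d_k = d_model // h                  # 64 dimensions per head

# In practice the eight per-head matrices W_i^Q (each 512x64) are packed
# into one 512x512 projection; same for keys, values, and the output.
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)
W_O = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(1, 10, d_model)     # (batch, seq_len, d_model)

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
    return t.view(1, 10, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_Q(X)), split_heads(W_K(X)), split_heads(W_V(X))
weights = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # (1, 8, 10, 10)
heads = weights @ V                                      # (1, 8, 10, 64)
concat = heads.transpose(1, 2).reshape(1, 10, d_model)   # concatenate heads
out = W_O(concat)                                        # final projection
print(out.shape)  # torch.Size([1, 10, 512])
```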
Masked Multi-Head Attention
In the decoder, we need to prevent positions from attending to future
positions (to maintain the autoregressive property during training).
This is achieved by masking:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$
where $M$ is a mask
matrix with $M_{ij} = 0$ if $j \le i$ (allowed positions) and $M_{ij} = -\infty$ if $j > i$ (forbidden future positions).
The $-\infty$ values ensure that after
softmax, those positions receive zero weight.
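The causal mask can be built with `torch.triu` (a minimal sketch):

```python
import torch

n = 5
# upper-triangular boolean mask: True above the diagonal marks future positions
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

scores = torch.randn(n, n)
scores = scores.masked_fill(mask, float("-inf"))  # -inf -> zero weight after softmax
weights = torch.softmax(scores, dim=-1)
print(weights[0])  # position 0 can only attend to itself: tensor([1., 0., 0., 0., 0.])
```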
Positional
Encoding: Injecting Sequence Order
Self-attention operates on sets, not sequences — it's
permutation-invariant. Without additional information, the model cannot
distinguish between "cat eats fish" and "fish eats cat". Positional
encodings solve this by adding position-dependent signals to the input
embeddings.
Sinusoidal Positional
Encoding
The original Transformer paper used fixed sinusoidal functions:
$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
where $pos$ is
the position index and $i$ is the
dimension index.
Why This Design?
Unique Encoding: Each position gets a unique
encoding vector
Relative Position: The encoding allows the model
to learn to attend by relative positions, since $PE_{pos+k}$ can be expressed as a
linear function of $PE_{pos}$
Extrapolation: The model can potentially
generalize to sequence lengths longer than seen during training
The different frequencies ($1/10000^{2i/d_{\text{model}}}$) create a
spectrum: low dimensions oscillate rapidly (capturing fine-grained
position), high dimensions oscillate slowly (capturing coarse
position).
Learned Positional
Embeddings
An alternative approach is to treat positional encodings as learnable
parameters:
$$PE = \mathrm{Embedding}(pos)$$
where $\mathrm{Embedding}$ is a standard embedding
layer. This is simpler and often performs comparably to sinusoidal
encodings, but cannot naturally extrapolate to longer sequences.
Modern models like BERT use learned positional embeddings, as do
GPT-2 and GPT-3 (with careful initialization). Some
recent approaches (such as T5's relative position biases and ALiBi) use relative
positional encodings or attention biases instead.
Adding Positional Encoding
to Input
Positional encodings are added (not concatenated) to the input
embeddings:
$$X = E + PE$$
where $E$ are the token embeddings and $PE$ are the positional encodings. Both have
dimension $d_{\text{model}}$.
The Complete Transformer
Architecture
Now we assemble all components into the full Transformer
architecture, consisting of an encoder stack and a decoder stack.
Encoder Architecture
Each encoder layer consists of two sub-layers:
1. Multi-Head Self-Attention
$$Z = \mathrm{MultiHead}(X, X, X)$$
The input $X$ serves as
queries, keys, and values (self-attention).
2. Position-Wise Feed-Forward Network
$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
This is a two-layer fully connected network applied
independently to each position. Typically, the inner dimension is $d_{ff} = 4 d_{\text{model}}$.
Each sub-layer uses residual connections and layer
normalization:
$$\mathrm{output} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
The complete encoder
layer:
$$Z = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X, X, X))$$
$$\mathrm{out} = \mathrm{LayerNorm}(Z + \mathrm{FFN}(Z))$$
The encoder typically stacks 6 such layers (though
BERT uses 12 or 24, and GPT-3 uses 96).
Decoder Architecture
Each decoder layer has three sub-layers:
1. Masked Multi-Head Self-Attention
$$Z_1 = \mathrm{MaskedMultiHead}(Y, Y, Y)$$
where $Y$ is the decoder
input (shifted target sequence during training). Masking ensures
causality.
2. Cross-Attention (Encoder-Decoder Attention)
$$Z_2 = \mathrm{MultiHead}(Z_1, E_{\text{out}}, E_{\text{out}})$$
The decoder attends to the encoder's output.
Queries come from the decoder ($Z_1$),
keys and values come from the encoder output ($E_{\text{out}}$).
3. Position-Wise Feed-Forward Network
Same as in the encoder.
The complete decoder layer:
$$Z_1 = \mathrm{LayerNorm}(Y + \mathrm{MaskedMultiHead}(Y, Y, Y))$$
$$Z_2 = \mathrm{LayerNorm}(Z_1 + \mathrm{MultiHead}(Z_1, E_{\text{out}}, E_{\text{out}}))$$
$$\mathrm{out} = \mathrm{LayerNorm}(Z_2 + \mathrm{FFN}(Z_2))$$
The decoder also stacks 6 layers.
Input and Output Processing
Encoder Input:
$$X_{\text{enc}} = \mathrm{Embed}(\text{source tokens}) + PE$$
Decoder Input (during training):
$$X_{\text{dec}} = \mathrm{Embed}(\text{shifted target tokens}) + PE$$
The target is shifted right by one
position (starting with a special start-of-sequence token).
Final Output Layer:
$$P = \mathrm{softmax}(Z W_{\text{vocab}} + b)$$
A linear projection maps the decoder output to
vocabulary size, followed by softmax for a probability distribution over
tokens.
Layer
Normalization and Residual Connections: Stabilizing Deep Networks
Training deep networks is challenging due to gradient flow issues and
internal covariate shift. The Transformer addresses these with residual
connections and layer normalization.
Residual Connections
Introduced by ResNet, residual connections add the input of a
sub-layer to its output:
$$y = x + \mathrm{Sublayer}(x)$$
This creates "shortcut paths" for gradients
to flow directly through, alleviating vanishing gradients in deep
networks. Even if $\mathrm{Sublayer}(x)$ learns poorly, the
identity mapping allows information to pass through unchanged.
Layer Normalization
Layer normalization standardizes the inputs across features for each
sample:
$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where:
- $\mu$ is the mean across features
- $\sigma^2$ is the variance across features
- $\gamma$ and $\beta$ are learnable scale and shift parameters
- $\epsilon$ is a small constant for numerical stability
Unlike batch normalization (which normalizes across the batch
dimension), layer normalization operates independently on each sample.
This makes it more suitable for sequence models where batch elements may
have different lengths.
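The formula can be checked against PyTorch's built-in `nn.LayerNorm` (a quick sketch; note the biased variance, matching the layer-norm definition):

```python
import torch

x = torch.randn(2, 5, 16)  # (batch, seq, features)

# manual layer norm over the feature dimension
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased (population) variance
manual = (x - mu) / torch.sqrt(var + 1e-5)

# built-in version without the learnable gamma/beta, for comparison
ln = torch.nn.LayerNorm(16, elementwise_affine=False, eps=1e-5)
print(torch.allclose(manual, ln(x), atol=1e-5))  # True
```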
Post-Layer Norm vs Pre-Layer
Norm
The original Transformer used post-layer norm:
$$x_{\text{out}} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
More recent work suggests pre-layer norm
improves training stability:
$$x_{\text{out}} = x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))$$
Pre-layer norm is used
in GPT-2, GPT-3, and many modern Transformers because it reduces
sensitivity to learning rate and initialization.
PyTorch Implementation from
Scratch
Let's implement a complete Transformer model in PyTorch. This
implementation includes all components: positional encoding, multi-head
attention, encoder, decoder, and the full model.
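As a compact sketch of one post-LN encoder layer combining the pieces defined earlier (illustrative only; a production implementation would add attention masks, key-padding handling, and weight initialization):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-LN Transformer encoder layer: multi-head self-attention + FFN."""

    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.qkv = nn.Linear(d_model, 3 * d_model)   # packed Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)      # output projection W^O
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def self_attention(self, x):
        B, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, h, n, d_k) for per-head attention
        q, k, v = (t.view(B, n, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        w = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        return self.proj((w @ v).transpose(1, 2).reshape(B, n, d))

    def forward(self, x):
        x = self.norm1(x + self.drop(self.self_attention(x)))  # sub-layer 1
        return self.norm2(x + self.drop(self.ffn(x)))          # sub-layer 2

layer = EncoderLayer().eval()  # eval() disables dropout for a deterministic run
out = layer(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Stacking six such layers (with an embedding layer plus positional encoding in front) gives the full encoder; the decoder layer adds causal masking and cross-attention.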
While implementing from scratch provides deep understanding,
production systems typically use HuggingFace's transformers
library, which offers pre-trained models and optimized
implementations.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load pre-trained T5 model (based on the Transformer architecture)
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example: Translation task
text = "translate English to French: The cat is sleeping on the mat."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Example data
source_texts = [
    "translate English to Spanish: Hello, how are you?",
    "translate English to Spanish: The weather is nice today.",
    # ... more examples
]
target_texts = [
    "Hola, ¿cómo estás?",
    "El clima es agradable hoy.",
    # ... corresponding translations
]

# Initialize tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Example: Sentiment classification
text = "This movie was absolutely fantastic! I loved every minute of it."
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# Text generation
prompt = "Once upon a time in a land far away,"
inputs = tokenizer(prompt, return_tensors='pt')

# Generate text (do_sample=True is required for multiple sampled sequences)
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        num_return_sequences=3,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8
    )

# Decode and print
for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"\nGeneration {i+1}:\n{text}\n")
Attention
Visualization and Interpretation
Understanding what attention heads learn is crucial for model
interpretability. Let's implement attention visualization.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def visualize_attention(attention_weights, tokens_src, tokens_tgt, layer=0, head=0):
    """
    Visualize attention weights as a heatmap.

    Args:
        attention_weights: Attention weights from model
        tokens_src: Source tokens (list of strings)
        tokens_tgt: Target tokens (list of strings)
        layer: Which layer to visualize
        head: Which head to visualize
    """
    # Extract specific layer and head (batch index 0)
    # Shape: (seq_len_tgt, seq_len_src)
    attn = attention_weights[layer][0, head].detach().cpu().numpy()

    # Create figure
    fig, ax = plt.subplots(figsize=(10, 8))

    # Plot heatmap
    sns.heatmap(
        attn,
        xticklabels=tokens_src,
        yticklabels=tokens_tgt,
        cmap='Blues',
        ax=ax,
        cbar_kws={'label': 'Attention Weight'}
    )
    ax.set_xlabel('Source Tokens')
    ax.set_ylabel('Target Tokens')
    ax.set_title(f'Attention Weights - Layer {layer}, Head {head}')
    plt.tight_layout()
    plt.show()
# Example with a HuggingFace model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small", output_attentions=True)

text = "translate English to French: The cat sleeps."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=20,
        output_attentions=True,
        return_dict_in_generate=True
    )

# Access attention weights from the encoder
encoder_attentions = outputs.encoder_attentions

# Get tokens
src_tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
tgt_tokens = tokenizer.convert_ids_to_tokens(outputs.sequences[0])

# Visualize first layer, first head
if encoder_attentions:
    visualize_attention(
        encoder_attentions,
        src_tokens,
        src_tokens,  # Self-attention in the encoder
        layer=0,
        head=0
    )
Analyzing What Different
Heads Learn
Research has shown that different attention heads specialize in
different linguistic phenomena:
Syntactic Heads: Some heads learn to attend to
syntactic relationships like subject-verb agreement or dependency
parsing structures.
Positional Heads: Some heads focus on relative
positions, attending primarily to previous or next tokens.
Rare Word Heads: Some heads attend strongly to rare
or important content words, ignoring common function words.
Delimiter Heads: Some heads attend to punctuation
and sentence boundaries.
Questions and Answers
Q1:
Why does scaled dot-product attention scale by $\sqrt{d_k}$?
Answer: Without scaling, the dot product of two
random vectors grows with dimension. For large $d_k$, this pushes the softmax into regions
with extremely small gradients. Specifically, if $q$ and $k$ are independent random vectors with unit-variance
components, their dot product has variance $d_k$. Dividing by $\sqrt{d_k}$ normalizes this variance to 1,
keeping the softmax input in a reasonable range where gradients can flow
effectively.
Q2:
Can Transformers handle sequences longer than they were trained on?
Answer: It depends on the positional encoding
scheme. Sinusoidal encodings can theoretically extrapolate to longer
sequences because the encoding is a mathematical function. However,
performance often degrades because the model has never seen longer-range
dependencies during training. Learned positional embeddings cannot
extrapolate beyond their maximum trained length without modification.
Recent techniques like ALiBi (Attention with Linear Biases) and rotary
positional embeddings (RoPE) improve length extrapolation by encoding
relative rather than absolute positions.
Q3:
Why use multi-head attention instead of a single high-dimensional
head?
Answer: Multi-head attention allows the model to
attend to different representation subspaces simultaneously. A single
head might capture syntactic relationships, but miss semantic or
positional patterns. Multiple heads can specialize: one might focus on
local context, another on long-range dependencies, another on specific
syntactic relations. This is similar to how CNNs use multiple filters to
capture different visual features. Empirically, 8-16 heads with smaller
dimensions per head outperform a single head with the same total
parameters.
Q4:
What's the computational complexity of self-attention?
Answer: Self-attention has $O(n^2 \cdot d)$ complexity, where $n$ is the sequence length and $d$ is the model dimension. Computing $QK^\top$ is $O(n^2 d)$, and multiplying the attention weights by $V$ is also $O(n^2 d)$. For very long sequences (thousands of tokens), this
becomes a bottleneck. This motivated efficient Transformer variants like
Linformer, Performer, and Longformer that reduce complexity to $O(n)$ or $O(n \log n)$ through various
approximations like sparse attention patterns or kernel methods.
Q5:
Why do we add positional encodings instead of concatenating them?
Answer: Adding preserves the full model dimension
for both content and position information, allowing them to interact
through subsequent layers. Concatenation would split the dimension,
dedicating part solely to position and part solely to content, reducing
the capacity for each. Adding also allows the model to learn how to
combine positional and content information optimally through the learned
transformations in attention and feed-forward layers. Empirically,
addition works well and simplifies the architecture.
Q6:
What is the purpose of the feed-forward network in each layer?
Answer: The feed-forward network (FFN) processes
each position independently, adding non-linear transformations that
attention alone cannot provide. Attention is primarily a weighted
averaging operation (linear in the values), while the FFN with ReLU
activation introduces non-linearity. The FFN also expands the
dimensionality (typically to $4 d_{\text{model}}$) before projecting back down, creating a
bottleneck architecture that can learn complex position-wise
transformations. Research suggests that FFN layers store factual
knowledge, while attention layers handle information routing.
Q7:
How does the Transformer avoid vanishing/exploding gradients?
Answer: Three key mechanisms help: (1)
Residual connections provide direct gradient paths from
output to input, bypassing potential bottlenecks in attention and FFN
layers. (2) Layer normalization stabilizes activations,
preventing them from growing or shrinking uncontrollably across layers.
(3) Attention mechanism itself is less prone to
vanishing gradients than RNNs because gradients can flow directly
between any pair of positions without passing through many intermediate
timesteps.
Q8: Why does the
decoder use masked attention?
Answer: During training, the entire target sequence
is available, but we must prevent the decoder from "cheating" by looking
at future tokens. Masked (causal) attention ensures that position $i$ can only attend to positions $j \le i$, preserving the autoregressive
property. This makes training match inference conditions, where future
tokens are not yet generated. Without masking, the model would learn to
simply copy future tokens rather than genuinely predict them.
Q9:
Can attention weights be interpreted as "importance" or
"relevance"?
Answer: Partially, but with caveats. High attention
weights indicate that information from one position is being used when
processing another position. However, attention weights are not causal
explanations — they show correlation, not causation. Multiple heads may
contain redundant information, and high attention doesn't necessarily
mean "importance" in a semantic sense. Research has shown that attention
weights can be manipulated without changing model outputs, suggesting
they're only one component of model reasoning. For interpretability,
consider attention alongside gradient-based methods and probing
tasks.
Q10:
What are the main differences between BERT, GPT, and T5?
Answer:
BERT (Encoder-only): Bidirectional context, uses
masked language modeling (predicting randomly masked tokens). Best for
tasks requiring understanding of entire context: classification, named
entity recognition, question answering. Cannot generate sequences
naturally.
GPT (Decoder-only): Unidirectional
(left-to-right) context, uses causal language modeling (predicting next
token). Excels at text generation, continuation, and few-shot learning.
Can be adapted for understanding tasks but loses bidirectional
context.
T5 (Encoder-Decoder): Full Transformer with both
encoder and decoder. Frames all tasks as seq2seq (text-to-text).
Combines benefits of both: bidirectional encoding and autoregressive
decoding. More flexible but larger and slower than encoder-only or
decoder-only models.
The choice depends on the task: BERT for understanding, GPT for
generation, T5 for versatility.
Conclusion
The Transformer architecture revolutionized NLP by replacing
recurrence with attention, enabling parallel processing and better
long-range dependency modeling. Starting from the limitations of seq2seq
models, we explored how attention mechanisms evolved from Bahdanau's
alignment model to the Transformer's self-attention. Key innovations
include scaled dot-product attention, multi-head attention for multiple
representation subspaces, positional encodings for sequence order, and
residual connections with layer normalization for training
stability.
The full Transformer architecture, with its encoder-decoder
structure, has become the foundation for modern NLP. Variants like BERT
(encoder-only) and GPT (decoder-only) dominate tasks from classification
to generation. The PyTorch implementation provided here gives you a
complete, working model that you can extend and experiment with.
Meanwhile, HuggingFace's transformers library offers
production-ready implementations and pre-trained models for immediate
use.
Understanding attention mechanisms and Transformers is essential for
anyone working in modern NLP. These architectures continue to evolve —
with innovations in efficiency (sparse attention), length extrapolation
(better positional encodings), and scale (models with hundreds of
billions of parameters) — but the core principles remain. Whether you're
fine-tuning BERT for classification, using GPT for generation, or
building custom architectures, the concepts covered here form your
foundation.
Post title: NLP (4): Attention Mechanism and Transformer
Post author: Chen Kai
Create time: 2024-02-20 15:45:00
Post link: https://www.chenk.top/en/nlp-attention-transformer/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.