The Transformer architecture revolutionized natural language
processing by introducing a mechanism that allows models to focus on
relevant parts of the input when processing each element. Unlike
recurrent networks that process sequences step-by-step, Transformers use
attention to capture dependencies regardless of distance, making them
both more powerful and more parallelizable. This article explores the
evolution from basic sequence-to-sequence models to the full Transformer
architecture, diving deep into attention mechanisms, multi-head
attention, positional encoding, and providing complete PyTorch
implementations that you can run and modify.
Why
Sequence-to-Sequence Models Needed Attention
Traditional sequence-to-sequence (seq2seq) models, introduced around
2014, used an encoder-decoder architecture with recurrent neural
networks. The encoder processes the input sequence and compresses all
information into a fixed-size context vector, which the decoder then
uses to generate the output sequence.
The Bottleneck Problem
Consider translating a long sentence: "The cat that chased the mouse
that ate the cheese was very tired and needed to rest." A vanilla
seq2seq model must compress this entire sentence into a single
fixed-dimensional vector before decoding begins. This creates several
problems:
Information Loss: Long sentences contain far more
information than can be captured in a fixed-size vector. As sequence
length increases, the final hidden state struggles to retain details
from early tokens.
Gradient Flow Issues: Even with LSTM or GRU cells,
gradients must flow through many timesteps. The encoder's early states
have limited influence on the decoder's later outputs.
Uniform Weighting: When generating each output
token, the decoder has equal (or diminishing) access to all input tokens
through the context vector. It cannot dynamically focus on relevant
parts of the input.
For example, when translating "the cat" to French ("le chat"), the
decoder should focus heavily on those specific words, not on "the
cheese" that appears later in the sentence. The fixed context vector
provides no mechanism for this selective focus.
The Context Vector
Becomes a Cognitive Load
Think of the context vector as a person trying to memorize an entire
paragraph and then recite it from memory. As the paragraph grows longer,
details get fuzzy. Attention mechanisms solve this by allowing the
decoder to "look back" at the original input at each generation step,
similar to how a human translator might repeatedly reference the source
text.
Birth of Attention:
Bahdanau Mechanism
In 2015, Bahdanau et al. introduced the first attention mechanism for
neural machine translation. The core insight was elegant: instead of
relying solely on a fixed context vector, the decoder should be able to
compute a weighted combination of all encoder hidden states at each
decoding step.
Architecture Overview
The Bahdanau attention mechanism consists of three key
components:
Encoder States: The encoder produces a sequence of
hidden states $h_1, h_2, \ldots, h_T$ for the input sequence of length $T$.
Alignment Scoring: At each decoder timestep $t$, we compute alignment scores between
the decoder state $s_{t-1}$ and
each encoder state $h_i$. These scores
indicate how well the decoder's current position "aligns" with each
input position.
Context Vector Generation: The alignment scores are
normalized into attention weights, which are used to compute a weighted
sum of encoder states, producing a context vector specific to the
current decoding step.
Mathematical Formulation
Let $s_t$ denote the decoder's hidden state at time $t$ and $h_1, \ldots, h_T$ denote the encoder hidden states.
Step 1: Compute Alignment Scores
$$e_{t,i} = a(s_{t-1}, h_i)$$
The alignment function $a$ is typically a small
feedforward network that takes the previous decoder state $s_{t-1}$ and encoder state $h_i$ as input. Bahdanau used a
one-hidden-layer network:
$$e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$$
Here, $W_a$, $U_a$, and $v_a$ are learnable parameters. The $\tanh$ activation introduces non-linearity,
and the final linear projection by $v_a^\top$ produces a scalar score.
Step 2: Normalize to Attention Weights
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$$
This softmax operation ensures that the attention
weights $\alpha_{t,i}$ sum to 1 across
all input positions. Positions with higher alignment scores receive
higher weights.
Step 3: Compute Context Vector
$$c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$$
The context vector $c_t$ is a weighted
average of all encoder states, where the weights reflect how much
attention the decoder should pay to each input position.
Step 4: Update Decoder State
The context vector $c_t$ is
concatenated with the embedded previous output token and fed into the
decoder RNN:
$$s_t = f(s_{t-1}, [\mathrm{emb}(y_{t-1}); c_t])$$
where $f$ is the RNN cell
(typically LSTM or GRU) and $\mathrm{emb}(y_{t-1})$ is the embedding of the previously
generated token.
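The four steps above can be sketched as a small PyTorch module (a minimal illustration, not Bahdanau's full training setup; the dimensions and class name are chosen for the example):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: e_{t,i} = v_a^T tanh(W_a s_{t-1} + U_a h_i)."""

    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dec_dim); enc_states: (batch, T, enc_dim)
        scores = self.v_a(torch.tanh(
            self.W_a(s_prev).unsqueeze(1) + self.U_a(enc_states)))   # (batch, T, 1)
        weights = torch.softmax(scores.squeeze(-1), dim=-1)          # (batch, T)
        # context vector: attention-weighted sum of encoder states
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)  # (batch, enc_dim)
        return context, weights

attn = BahdanauAttention(dec_dim=16, enc_dim=32, attn_dim=8)
context, weights = attn(torch.randn(2, 16), torch.randn(2, 5, 32))
print(context.shape, weights.shape)  # torch.Size([2, 32]) torch.Size([2, 5])
```

The context vector would then be concatenated with the previous token embedding and fed to the decoder RNN, as in Step 4.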
Visualization of Attention
Weights
Attention weights can be visualized as a heatmap where rows represent
decoder timesteps and columns represent encoder timesteps. High values
indicate strong attention. For the translation "the cat" → "le chat",
the alignment might look like:
         the    cat    was    tired
le       0.70   0.20   0.05   0.05
chat     0.10   0.80   0.05   0.05
était    0.05   0.05   0.70   0.20
This shows that when generating "le", the model attends strongly to
"the"; when generating "chat", it focuses on "cat"; and when generating
"était", it looks at "was" and "tired".
Luong Attention:
Simplification and Variants
Shortly after Bahdanau, Luong et al. (2015) proposed alternative
attention mechanisms that simplified some aspects while introducing new
scoring functions.
Key Differences from
Bahdanau
Decoder State Usage: Luong attention uses the
current decoder state $s_t$ (after the
RNN update) rather than the previous state $s_{t-1}$ when computing attention. This
means attention is calculated after processing the input, not
before.
Scoring Functions: Luong proposed three alternatives:
Dot: $\mathrm{score}(s_t, h_i) = s_t^\top h_i$
General: $\mathrm{score}(s_t, h_i) = s_t^\top W_a h_i$
Concat: $\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$
The dot product is the simplest and fastest but requires
that $s_t$ and $h_i$ have the same dimensionality. The
general form adds a learnable matrix to handle dimension mismatches. The
concat version is similar to Bahdanau's approach.
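As a quick sketch, the three scoring functions map directly to tensor operations (shapes and parameter names here are illustrative; a real model would learn $W_a$, $W_c$, and $v_a$):

```python
import torch

torch.manual_seed(0)
d = 8
s_t = torch.randn(d)      # current decoder state
H = torch.randn(5, d)     # 5 encoder states

# dot: simplest, requires matching dimensions
dot_scores = H @ s_t                                   # (5,)

# general: learnable bilinear form s_t^T W_a h_i
W_a = torch.randn(d, d)
general_scores = H @ (W_a.T @ s_t)                     # (5,)

# concat (Bahdanau-style): v_a^T tanh(W_c [s_t; h_i])
W_c = torch.randn(d, 2 * d)
v_a = torch.randn(d)
concat_in = torch.cat([s_t.expand(5, d), H], dim=-1)   # (5, 2d)
concat_scores = torch.tanh(concat_in @ W_c.T) @ v_a    # (5,)

# any score vector is normalized the same way
weights = torch.softmax(dot_scores, dim=-1)
print(weights.sum())
```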
Local vs Global Attention
Luong also introduced the concept of local attention, where the model
only attends to a small window of encoder states around a predicted
position, rather than all states. This reduces computational cost for
very long sequences.
For local attention, the model predicts an alignment position $p_t$ and computes attention weights only
for positions in the window $[p_t - D, p_t + D]$,
where $D$ is the window size.
From RNN to
Self-Attention: The Paradigm Shift
While Bahdanau and Luong attention improved seq2seq models, they
still relied on RNNs for encoding and decoding. The Transformer,
introduced in the 2017 paper "Attention Is All You Need" by Vaswani et
al., took a radical step: eliminate recurrence entirely and rely solely
on attention mechanisms.
Self-Attention Intuition
Self-attention allows each position in a sequence to attend to all
positions in the same sequence. Unlike encoder-decoder attention (where
decoder attends to encoder), self-attention operates within a single
sequence.
Imagine reading the sentence: "The animal didn't cross the street
because it was too tired." When processing "it", self-attention would
assign high weight to "animal" (since "it" refers to the animal),
helping the model understand the reference.
Query, Key, Value: The
Attention Trinity
Self-attention introduces three concepts: queries (Q), keys (K), and
values (V). These are derived from the input through learned linear
transformations.
Query: Represents the "question" being asked by the
current position. "What should I attend to?"
Key: Represents the "relevance" of each position.
"How relevant am I to the query?"
Value: Represents the actual information to retrieve
from each position. "What information do I contain?"
For each position $i$ in the input
sequence, we compute:
$$q_i = W^Q x_i, \quad k_i = W^K x_i, \quad v_i = W^V x_i$$
where $x_i$ is the input embedding at
position $i$, and $W^Q$, $W^K$, $W^V$ are learned weight matrices.
Scaled Dot-Product Attention
The core attention operation computes how much each position should
attend to every other position using queries and keys, then retrieves
information using values.
Step 1: Compute Attention Scores
$$e_{ij} = q_i^\top k_j$$
This dot product measures the compatibility between
query $q_i$ and key $k_j$. High scores indicate that
position $i$ should attend strongly to
position $j$.
Step 2: Scale the Scores
$$e_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}$$
where $d_k$ is the dimension of the key vectors.
Scaling prevents the dot products from growing too large, which would
push the softmax into regions with extremely small gradients.
Why Scale by $\sqrt{d_k}$? When the key
dimension is large, dot products tend to grow in magnitude. For example,
if keys and queries have independent components with unit variance, their dot
product variance scales with $d_k$.
Dividing by $\sqrt{d_k}$ normalizes
this variance, keeping the softmax input in a reasonable range.
Step 3: Apply Softmax
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$
where $n$ is the sequence length. This produces
attention weights that sum to 1.
Step 4: Compute Weighted Sum of Values
$$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j$$
The output for position $i$ is a weighted combination of all value
vectors, where the weights are determined by the query-key
compatibility.
Matrix Form for Efficient
Computation
In practice, we compute attention for all positions simultaneously
using matrix operations. Stacking all queries, keys, and values into
matrices $Q$, $K$, $V$ (where each row is a query/key/value
vector), the attention operation becomes:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Here, $QK^\top$ is an $n \times n$
matrix of all pairwise query-key dot products, the softmax is applied
row-wise, and the result is multiplied by $V$ to produce the output.
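The matrix form translates almost line-for-line into PyTorch (a minimal sketch without batch or head dimensions; `mask` is an optional boolean tensor marking disallowed positions):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) pairwise scores
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # row-wise softmax
    return weights @ V, weights

Q, K, V = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # torch.Size([4, 8])
print(w.sum(dim=-1))   # each row of attention weights sums to 1
```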
Multi-Head
Attention: Learning Different Perspectives
Single attention heads compute one set of attention weights,
capturing one "view" of the relationships in the sequence. Multi-head
attention runs multiple attention operations in parallel, each with
different learned projections, allowing the model to attend to different
aspects simultaneously.
Motivation
Consider the sentence "The bank by the river has low interest rates."
A single attention mechanism might struggle to simultaneously capture: -
Grammatical relationships (subject-verb agreement between "bank" and
"has") - Semantic relationships (the financial meaning of "bank" vs.
geographical "river") - Positional relationships (nearby vs. distant
tokens)
Multiple heads can specialize in different types of
relationships.
Mathematical Formulation
Given input $X$, we compute $h$ different attention outputs in
parallel:
$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the learned projection matrices for head $i$.
Typically, $d_k = d_v = d_{\text{model}}/h$, so each head operates in a lower-dimensional subspace.
After computing all heads, we concatenate them and apply a final
linear projection:
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is the output
projection matrix.
Example:
8-Head Attention with $d_{\text{model}} = 512$
If we use 8 heads with a model dimension of 512:
Each head has $d_k = d_v = 512/8 = 64$ dimensions
Each head learns its own $W_i^Q$, $W_i^K$, $W_i^V$ matrices of size $512 \times 64$
The concatenated output has dimension $8 \times 64 = 512$
The final projection $W^O$ is $512 \times 512$
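A shape walkthrough of the 8-head example, packing all heads' projections into single linear layers as is common in practice (an illustrative sketch, not a full module):

```python
import torch
import torch.nn as nn

d_model, h = 512, 8
d_k = d_model // h                  # 64 dimensions per head

# In practice the eight per-head matrices W_i^Q (each 512x64) are packed
# into one 512x512 projection; same for keys, values, and the output.
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)
W_O = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(1, 10, d_model)     # (batch, seq_len, d_model)

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
    return t.view(1, 10, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_Q(X)), split_heads(W_K(X)), split_heads(W_V(X))
weights = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # (1, 8, 10, 10)
heads = weights @ V                                      # (1, 8, 10, 64)
concat = heads.transpose(1, 2).reshape(1, 10, d_model)   # concatenate heads
out = W_O(concat)                                        # final projection
print(out.shape)  # torch.Size([1, 10, 512])
```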
Masked Multi-Head Attention
In the decoder, we need to prevent positions from attending to future
positions (to maintain the autoregressive property during training).
This is achieved by masking:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$
where $M$ is a mask
matrix with $M_{ij} = 0$ if $j \le i$ (allowed positions) and $M_{ij} = -\infty$ if $j > i$ (forbidden future positions).
The $-\infty$ values ensure that after
softmax, those positions receive zero weight.
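The causal mask can be built with `torch.triu` (a minimal sketch):

```python
import torch

n = 5
# upper-triangular boolean mask: True above the diagonal marks future positions
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

scores = torch.randn(n, n)
scores = scores.masked_fill(mask, float("-inf"))  # -inf -> zero weight after softmax
weights = torch.softmax(scores, dim=-1)
print(weights[0])  # position 0 can only attend to itself: tensor([1., 0., 0., 0., 0.])
```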
Positional
Encoding: Injecting Sequence Order
Self-attention operates on sets, not sequences — it's
permutation-invariant. Without additional information, the model cannot
distinguish between "cat eats fish" and "fish eats cat". Positional
encodings solve this by adding position-dependent signals to the input
embeddings.
Sinusoidal Positional
Encoding
The original Transformer paper used fixed sinusoidal functions:
$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
where $pos$ is
the position index and $i$ is the
dimension index.
Why This Design?
Unique Encoding: Each position gets a unique
encoding vector
Relative Position: The encoding allows the model
to learn to attend by relative positions, since $PE_{pos+k}$ can be expressed as a
linear function of $PE_{pos}$
Extrapolation: The model can potentially
generalize to sequence lengths longer than seen during training
The different frequencies ($1/10000^{2i/d_{\text{model}}}$) create a
spectrum: low dimensions oscillate rapidly (capturing fine-grained
position), high dimensions oscillate slowly (capturing coarse
position).
Learned Positional
Embeddings
An alternative approach is to treat positional encodings as learnable
parameters:
$$PE = \mathrm{Embedding}(pos)$$
where $\mathrm{Embedding}$ is a standard embedding
layer. This is simpler and often performs comparably to sinusoidal
encodings, but cannot naturally extrapolate to longer sequences.
Modern models like BERT use learned positional embeddings, as do
GPT-2 and GPT-3 (with careful initialization). Some
recent approaches (such as T5's relative position biases and ALiBi) use relative
positional encodings or attention biases instead.
Adding Positional Encoding
to Input
Positional encodings are added (not concatenated) to the input
embeddings:
$$X = E + PE$$
where $E$ are the token embeddings and $PE$ are the positional encodings. Both have
dimension $d_{\text{model}}$.
The Complete Transformer
Architecture
Now we assemble all components into the full Transformer
architecture, consisting of an encoder stack and a decoder stack.
Encoder Architecture
Each encoder layer consists of two sub-layers:
1. Multi-Head Self-Attention
$$Z = \mathrm{MultiHead}(X, X, X)$$
The input $X$ serves as
queries, keys, and values (self-attention).
2. Position-Wise Feed-Forward Network
$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
This is a two-layer fully connected network applied
independently to each position. Typically, the inner dimension is $d_{ff} = 4 d_{\text{model}}$.
Each sub-layer uses residual connections and layer
normalization:
$$\mathrm{output} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
The complete encoder
layer:
$$Z = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X, X, X))$$
$$\mathrm{out} = \mathrm{LayerNorm}(Z + \mathrm{FFN}(Z))$$
The encoder typically stacks 6 such layers (though
BERT uses 12 or 24, and GPT-3 uses 96).
Decoder Architecture
Each decoder layer has three sub-layers:
1. Masked Multi-Head Self-Attention
$$Z_1 = \mathrm{MaskedMultiHead}(Y, Y, Y)$$
where $Y$ is the decoder
input (shifted target sequence during training). Masking ensures
causality.
2. Cross-Attention (Encoder-Decoder Attention)
$$Z_2 = \mathrm{MultiHead}(Z_1, E_{\text{out}}, E_{\text{out}})$$
The decoder attends to the encoder's output.
Queries come from the decoder ($Z_1$),
keys and values come from the encoder output ($E_{\text{out}}$).
3. Position-Wise Feed-Forward Network
Same as in the encoder.
The complete decoder layer:
$$Z_1 = \mathrm{LayerNorm}(Y + \mathrm{MaskedMultiHead}(Y, Y, Y))$$
$$Z_2 = \mathrm{LayerNorm}(Z_1 + \mathrm{MultiHead}(Z_1, E_{\text{out}}, E_{\text{out}}))$$
$$\mathrm{out} = \mathrm{LayerNorm}(Z_2 + \mathrm{FFN}(Z_2))$$
The decoder also stacks 6 layers.
Input and Output Processing
Encoder Input:
$$X_{\text{enc}} = \mathrm{Embed}(\text{source tokens}) + PE$$
Decoder Input (during training):
$$X_{\text{dec}} = \mathrm{Embed}(\text{shifted target tokens}) + PE$$
The target is shifted right by one
position (starting with a special start-of-sequence token).
Final Output Layer:
$$P = \mathrm{softmax}(Z W_{\text{vocab}} + b)$$
A linear projection maps the decoder output to
vocabulary size, followed by softmax for a probability distribution over
tokens.
Layer
Normalization and Residual Connections: Stabilizing Deep Networks
Training deep networks is challenging due to gradient flow issues and
internal covariate shift. The Transformer addresses these with residual
connections and layer normalization.
Residual Connections
Introduced by ResNet, residual connections add the input of a
sub-layer to its output:
$$y = x + \mathrm{Sublayer}(x)$$
This creates "shortcut paths" for gradients
to flow directly through, alleviating vanishing gradients in deep
networks. Even if $\mathrm{Sublayer}(x)$ learns poorly, the
identity mapping allows information to pass through unchanged.
Layer Normalization
Layer normalization standardizes the inputs across features for each
sample:
$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where:
- $\mu$ is the mean across features
- $\sigma^2$ is the variance across features
- $\gamma$ and $\beta$ are learnable scale and shift parameters
- $\epsilon$ is a small constant for numerical stability
Unlike batch normalization (which normalizes across the batch
dimension), layer normalization operates independently on each sample.
This makes it more suitable for sequence models where batch elements may
have different lengths.
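The formula can be checked against PyTorch's built-in `nn.LayerNorm` (a quick sketch; note the biased variance, matching the layer-norm definition):

```python
import torch

x = torch.randn(2, 5, 16)  # (batch, seq, features)

# manual layer norm over the feature dimension
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased (population) variance
manual = (x - mu) / torch.sqrt(var + 1e-5)

# built-in version without the learnable gamma/beta, for comparison
ln = torch.nn.LayerNorm(16, elementwise_affine=False, eps=1e-5)
print(torch.allclose(manual, ln(x), atol=1e-5))  # True
```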
Post-Layer Norm vs Pre-Layer
Norm
The original Transformer used post-layer norm:
$$x_{\text{out}} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
More recent work suggests pre-layer norm
improves training stability:
$$x_{\text{out}} = x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))$$
Pre-layer norm is used
in GPT-2, GPT-3, and many modern Transformers because it reduces
sensitivity to learning rate and initialization.
PyTorch Implementation from
Scratch
Let's implement a complete Transformer model in PyTorch. This
implementation includes all components: positional encoding, multi-head
attention, encoder, decoder, and the full model.
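As a compact sketch of one post-LN encoder layer combining the pieces defined earlier (illustrative only; a production implementation would add attention masks, key-padding handling, and weight initialization):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-LN Transformer encoder layer: multi-head self-attention + FFN."""

    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.qkv = nn.Linear(d_model, 3 * d_model)   # packed Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)      # output projection W^O
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def self_attention(self, x):
        B, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, h, n, d_k) for per-head attention
        q, k, v = (t.view(B, n, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        w = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        return self.proj((w @ v).transpose(1, 2).reshape(B, n, d))

    def forward(self, x):
        x = self.norm1(x + self.drop(self.self_attention(x)))  # sub-layer 1
        return self.norm2(x + self.drop(self.ffn(x)))          # sub-layer 2

layer = EncoderLayer().eval()  # eval() disables dropout for a deterministic run
out = layer(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Stacking six such layers (with an embedding layer plus positional encoding in front) gives the full encoder; the decoder layer adds causal masking and cross-attention.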
While implementing from scratch provides deep understanding,
production systems typically use HuggingFace's transformers
library, which offers pre-trained models and optimized
implementations.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load pre-trained T5 model (based on the Transformer architecture)
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example: Translation task
text = "translate English to French: The cat is sleeping on the mat."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Example data
source_texts = [
    "translate English to Spanish: Hello, how are you?",
    "translate English to Spanish: The weather is nice today.",
    # ... more examples
]
target_texts = [
    "Hola, ¿cómo estás?",
    "El clima es agradable hoy.",
    # ... corresponding translations
]

# Initialize tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Example: Sentiment classification
text = "This movie was absolutely fantastic! I loved every minute of it."
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# Text generation
prompt = "Once upon a time in a land far away,"
inputs = tokenizer(prompt, return_tensors='pt')

# Generate text (do_sample=True is required for multiple sampled sequences)
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        num_return_sequences=3,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8
    )

# Decode and print
for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"\nGeneration {i+1}:\n{text}\n")
Attention
Visualization and Interpretation
Understanding what attention heads learn is crucial for model
interpretability. Let's implement attention visualization.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def visualize_attention(attention_weights, tokens_src, tokens_tgt, layer=0, head=0):
    """
    Visualize attention weights as a heatmap.

    Args:
        attention_weights: Attention weights from model
        tokens_src: Source tokens (list of strings)
        tokens_tgt: Target tokens (list of strings)
        layer: Which layer to visualize
        head: Which head to visualize
    """
    # Extract specific layer and head (batch index 0)
    # Shape: (seq_len_tgt, seq_len_src)
    attn = attention_weights[layer][0, head].detach().cpu().numpy()

    # Create figure
    fig, ax = plt.subplots(figsize=(10, 8))

    # Plot heatmap
    sns.heatmap(
        attn,
        xticklabels=tokens_src,
        yticklabels=tokens_tgt,
        cmap='Blues',
        ax=ax,
        cbar_kws={'label': 'Attention Weight'}
    )
    ax.set_xlabel('Source Tokens')
    ax.set_ylabel('Target Tokens')
    ax.set_title(f'Attention Weights - Layer {layer}, Head {head}')
    plt.tight_layout()
    plt.show()
# Example with a HuggingFace model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small", output_attentions=True)

text = "translate English to French: The cat sleeps."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=20,
        output_attentions=True,
        return_dict_in_generate=True
    )

# Access attention weights from the encoder
encoder_attentions = outputs.encoder_attentions

# Get tokens
src_tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
tgt_tokens = tokenizer.convert_ids_to_tokens(outputs.sequences[0])

# Visualize first layer, first head
if encoder_attentions:
    visualize_attention(
        encoder_attentions,
        src_tokens,
        src_tokens,  # Self-attention in the encoder
        layer=0,
        head=0
    )
Analyzing What Different
Heads Learn
Research has shown that different attention heads specialize in
different linguistic phenomena:
Syntactic Heads: Some heads learn to attend to
syntactic relationships like subject-verb agreement or dependency
parsing structures.
Positional Heads: Some heads focus on relative
positions, attending primarily to previous or next tokens.
Rare Word Heads: Some heads attend strongly to rare
or important content words, ignoring common function words.
Delimiter Heads: Some heads attend to punctuation
and sentence boundaries.
Questions and Answers
Q1:
Why does scaled dot-product attention scale by $\sqrt{d_k}$?
Answer: Without scaling, the dot product of two
random vectors grows with dimension. For large $d_k$, this pushes the softmax into regions
with extremely small gradients. Specifically, if $q$ and $k$ are independent random vectors with unit-variance
components, their dot product has variance $d_k$. Dividing by $\sqrt{d_k}$ normalizes this variance to 1,
keeping the softmax input in a reasonable range where gradients can flow
effectively.
Q2:
Can Transformers handle sequences longer than they were trained on?
Answer: It depends on the positional encoding
scheme. Sinusoidal encodings can theoretically extrapolate to longer
sequences because the encoding is a mathematical function. However,
performance often degrades because the model has never seen longer-range
dependencies during training. Learned positional embeddings cannot
extrapolate beyond their maximum trained length without modification.
Recent techniques like ALiBi (Attention with Linear Biases) and rotary
positional embeddings (RoPE) improve length extrapolation by encoding
relative rather than absolute positions.
Q3:
Why use multi-head attention instead of a single high-dimensional
head?
Answer: Multi-head attention allows the model to
attend to different representation subspaces simultaneously. A single
head might capture syntactic relationships, but miss semantic or
positional patterns. Multiple heads can specialize: one might focus on
local context, another on long-range dependencies, another on specific
syntactic relations. This is similar to how CNNs use multiple filters to
capture different visual features. Empirically, 8-16 heads with smaller
dimensions per head outperform a single head with the same total
parameters.
Q4:
What's the computational complexity of self-attention?
Answer: Self-attention has $O(n^2 \cdot d)$ complexity, where $n$ is the sequence length and $d$ is the model dimension. Computing $QK^\top$ is $O(n^2 d)$, and multiplying the attention weights by $V$ is also $O(n^2 d)$. For very long sequences (thousands of tokens), this
becomes a bottleneck. This motivated efficient Transformer variants like
Linformer, Performer, and Longformer that reduce complexity to $O(n)$ or $O(n \log n)$ through various
approximations like sparse attention patterns or kernel methods.
Q5:
Why do we add positional encodings instead of concatenating them?
Answer: Adding preserves the full model dimension
for both content and position information, allowing them to interact
through subsequent layers. Concatenation would split the dimension,
dedicating part solely to position and part solely to content, reducing
the capacity for each. Adding also allows the model to learn how to
combine positional and content information optimally through the learned
transformations in attention and feed-forward layers. Empirically,
addition works well and simplifies the architecture.
Q6:
What is the purpose of the feed-forward network in each layer?
Answer: The feed-forward network (FFN) processes
each position independently, adding non-linear transformations that
attention alone cannot provide. Attention is primarily a weighted
averaging operation (linear in the values), while the FFN with ReLU
activation introduces non-linearity. The FFN also expands the
dimensionality (typically to $4 d_{\text{model}}$) before projecting back down, creating a
bottleneck architecture that can learn complex position-wise
transformations. Research suggests that FFN layers store factual
knowledge, while attention layers handle information routing.
Q7:
How does the Transformer avoid vanishing/exploding gradients?
Answer: Three key mechanisms help: (1)
Residual connections provide direct gradient paths from
output to input, bypassing potential bottlenecks in attention and FFN
layers. (2) Layer normalization stabilizes activations,
preventing them from growing or shrinking uncontrollably across layers.
(3) Attention mechanism itself is less prone to
vanishing gradients than RNNs because gradients can flow directly
between any pair of positions without passing through many intermediate
timesteps.
Q8: Why does the
decoder use masked attention?
Answer: During training, the entire target sequence
is available, but we must prevent the decoder from "cheating" by looking
at future tokens. Masked (causal) attention ensures that position $i$ can only attend to positions $j \le i$, preserving the autoregressive
property. This makes training match inference conditions, where future
tokens are not yet generated. Without masking, the model would learn to
simply copy future tokens rather than genuinely predict them.
Q9:
Can attention weights be interpreted as "importance" or
"relevance"?
Answer: Partially, but with caveats. High attention
weights indicate that information from one position is being used when
processing another position. However, attention weights are not causal
explanations — they show correlation, not causation. Multiple heads may
contain redundant information, and high attention doesn't necessarily
mean "importance" in a semantic sense. Research has shown that attention
weights can be manipulated without changing model outputs, suggesting
they're only one component of model reasoning. For interpretability,
consider attention alongside gradient-based methods and probing
tasks.
Q10:
What are the main differences between BERT, GPT, and T5?
Answer:
BERT (Encoder-only): Bidirectional context, uses
masked language modeling (predicting randomly masked tokens). Best for
tasks requiring understanding of entire context: classification, named
entity recognition, question answering. Cannot generate sequences
naturally.
GPT (Decoder-only): Unidirectional
(left-to-right) context, uses causal language modeling (predicting next
token). Excels at text generation, continuation, and few-shot learning.
Can be adapted for understanding tasks but loses bidirectional
context.
T5 (Encoder-Decoder): Full Transformer with both
encoder and decoder. Frames all tasks as seq2seq (text-to-text).
Combines benefits of both: bidirectional encoding and autoregressive
decoding. More flexible but larger and slower than encoder-only or
decoder-only models.
The choice depends on the task: BERT for understanding, GPT for
generation, T5 for versatility.
Conclusion
The Transformer architecture revolutionized NLP by replacing
recurrence with attention, enabling parallel processing and better
long-range dependency modeling. Starting from the limitations of seq2seq
models, we explored how attention mechanisms evolved from Bahdanau's
alignment model to the Transformer's self-attention. Key innovations
include scaled dot-product attention, multi-head attention for multiple
representation subspaces, positional encodings for sequence order, and
residual connections with layer normalization for training
stability.
The full Transformer architecture, with its encoder-decoder
structure, has become the foundation for modern NLP. Variants like BERT
(encoder-only) and GPT (decoder-only) dominate tasks from classification
to generation. The PyTorch implementation provided here gives you a
complete, working model that you can extend and experiment with.
Meanwhile, HuggingFace's transformers library offers
production-ready implementations and pre-trained models for immediate
use.
Understanding attention mechanisms and Transformers is essential for
anyone working in modern NLP. These architectures continue to evolve —
with innovations in efficiency (sparse attention), length extrapolation
(better positional encodings), and scale (models with hundreds of
billions of parameters) — but the core principles remain. Whether you're
fine-tuning BERT for classification, using GPT for generation, or
building custom architectures, the concepts covered here form your
foundation.
Post title: NLP (4): Attention Mechanism and Transformer
Post author: Chen Kai
Create time: 2024-02-20 15:45:00
Post link: https://www.chenk.top/en/nlp-attention-transformer/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.