Sequence data is everywhere in natural language processing — from
sentences and documents to conversations and time series. Unlike feedforward
networks that treat inputs as independent fixed-size vectors, Recurrent
Neural Networks (RNNs) maintain an internal state that evolves as they
process sequences step by step. This recurrent connection allows the
network to capture temporal dependencies and context, making RNNs a
natural choice for language modeling, machine translation, and text
generation. However, vanilla RNNs struggle with long-range dependencies
due to vanishing gradients. This challenge led to the development of
gated architectures like LSTM and GRU, which selectively control
information flow and maintain long-term memory. In this article, we'll
explore the core mechanics of RNN architectures, understand why gradient
issues arise during backpropagation through time, and dive into
practical implementations using PyTorch for text generation and
sequence-to-sequence tasks.
The Core Idea:
Recurrence and Parameter Sharing
Traditional feedforward neural networks process inputs in a single
forward pass, with no memory of previous inputs. For sequential data,
this is problematic. Consider the sentence "The cat sat on the mat." To
understand "mat," you need context from earlier words. RNNs address this
by introducing a recurrent connection that feeds the hidden state from
one time step to the next.
Recurrent Structure
At each time step $t$, an RNN receives input $x_t$ and the previous
hidden state $h_{t-1}$, then computes the new hidden state $h_t$ and
output $y_t$:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

Here, $W_{hh}$ is the recurrent weight matrix, $W_{xh}$ transforms the
input, and $W_{hy}$ produces the output. The activation function
$\tanh$ squashes values to $[-1, 1]$. The key insight is that the same
weight matrices $W_{hh}$, $W_{xh}$, and $W_{hy}$ are reused at every
time step. This parameter sharing means the model learns a single
transformation that applies across the entire sequence, regardless of
its length.
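The recurrence above can be sketched as a single step function applied repeatedly; a minimal illustration in PyTorch (tensor sizes here are arbitrary assumptions):

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 8, 16

# Shared parameters, reused at every time step
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    return torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Apply the same step over a length-5 sequence
h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):
    h = rnn_step(x_t, h)
print(h.shape)  # torch.Size([16])
```

Note that the loop reuses the same `W_hh`, `W_xh`, and `b_h` at every step: the parameter count is independent of sequence length.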
Why Parameter Sharing
Matters
Parameter sharing is crucial for several reasons:
Generalization: The model learns patterns that work
at any position in the sequence, not just specific positions.
Efficiency: The number of parameters doesn't grow
with sequence length. A feedforward network processing variable-length
sequences would need different weights for each position.
Translation invariance: Features learned at one
time step transfer to others, similar to how convolutional filters work
in CNNs.
Consider language modeling: the pattern "the cat" can appear at the
start, middle, or end of a sentence. Parameter sharing ensures the model
recognizes this pattern regardless of position.
Unrolling Through Time
To visualize computation, we "unroll" the RNN across time steps. For
a sequence of length $T$, the unrolled network looks like a feedforward
network with $T$ layers, where each layer shares the same weights:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \quad t = 1, \dots, T$$

This unrolled view helps us understand how gradients flow
backward through the network during training.
The Vanishing and
Exploding Gradient Problem
Training RNNs requires computing gradients with respect to parameters
at all time steps. This is done via Backpropagation Through Time (BPTT),
which unrolls the network and applies the chain rule. However, BPTT
suffers from a critical issue: vanishing or exploding gradients.
Backpropagation Through Time
(BPTT)
Given a loss $L = \sum_{t=1}^{T} L_t$ computed over a sequence, we need
$\frac{\partial L}{\partial W_{hh}}$, $\frac{\partial L}{\partial W_{xh}}$,
etc. Using the chain rule, the gradient of the loss with respect to
$h_t$ depends on gradients from future time steps:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial h_t}
+ \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\!\top}
\frac{\partial L}{\partial h_{t+1}}$$

The term $\frac{\partial h_{t+1}}{\partial h_t}$ involves the recurrent
weight matrix $W_{hh}$ and the derivative of $\tanh$:

$$\frac{\partial h_{t+1}}{\partial h_t}
= \mathrm{diag}\!\left(1 - h_{t+1}^{2}\right) W_{hh}$$

To propagate gradients from time step $T$ back to time step $t$, we
multiply these Jacobian matrices:

$$\frac{\partial h_T}{\partial h_t}
= \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}
= \prod_{k=t}^{T-1} \mathrm{diag}\!\left(1 - h_{k+1}^{2}\right) W_{hh}$$
Why Gradients Vanish
Since $|\tanh'(z)| \le 1$ for all $z$, the gradient of the activation
is at most 1. If the largest eigenvalue of $W_{hh}$ is less than 1,
repeated multiplication causes the gradient to shrink exponentially:

$$\left\| \frac{\partial h_T}{\partial h_t} \right\| \le \gamma^{T-t}$$

where $\gamma$ bounds the norm of each Jacobian factor. If $\gamma < 1$,
this product approaches zero as $T - t$ increases. The result: gradients
from distant time steps become negligibly small, and the model can't
learn long-range dependencies. For example, in "The cat, which was
sitting on the mat and purring loudly, was happy," the model struggles
to connect "cat" with "was happy" because gradients decay over the
intervening words.
Why Gradients Explode
Conversely, if the largest eigenvalue of $W_{hh}$ exceeds 1, gradients grow
exponentially. This causes numerical instability, producing NaN or Inf
values during training. Gradient clipping — capping gradients at a
threshold — is a common workaround, but doesn't solve the underlying
issue.
Empirical Evidence
In practice, vanilla RNNs struggle to learn dependencies beyond 10-20
time steps. Experiments on synthetic tasks like copying sequences or
remembering values show that RNNs quickly forget early inputs. This
limitation motivated the development of gated architectures.
Long Short-Term Memory (LSTM)
The LSTM, introduced by Hochreiter and Schmidhuber in 1997, addresses
vanishing gradients by replacing the simple recurrent unit with a more
complex cell that explicitly maintains long-term memory. The key
innovation is a cell state $c_t$ that
runs through the sequence with minimal transformations, allowing
gradients to flow more easily.
Architecture Overview
An LSTM unit consists of three gates — forget gate, input gate, and
output gate — and a cell state. At each time step:
Forget gate $f_t$ decides what information to discard
from the cell state.
Input gate $i_t$ determines what new information to
add.
Candidate cell state $\tilde{c}_t$ proposes new values.
Cell state update combines old and new
information into $c_t$.
Output gate $o_t$ controls what part of the cell state
to output.
Mathematical Formulation
Given input $x_t$ and previous hidden state $h_{t-1}$, the LSTM
computes:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

Here, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise
multiplication, and $[h_{t-1}, x_t]$ is the concatenation of hidden
state and input. Each gate uses sigmoid activation to produce values
in $(0, 1)$, acting as a soft switch.
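These equations map directly onto code. A minimal single-step LSTM cell (a sketch with arbitrary sizes, not the optimized `torch.nn.LSTM`):

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 8, 16
concat = hidden_size + input_size

# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (torch.randn(hidden_size, concat) * 0.1 for _ in range(4))
b_f = b_i = b_c = b_o = torch.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([h_prev, x_t])          # [h_{t-1}, x_t]
    f = torch.sigmoid(W_f @ z + b_f)      # forget gate
    i = torch.sigmoid(W_i @ z + b_i)      # input gate
    c_tilde = torch.tanh(W_c @ z + b_c)   # candidate cell state
    c = f * c_prev + i * c_tilde          # additive cell-state update
    o = torch.sigmoid(W_o @ z + b_o)      # output gate
    h = o * torch.tanh(c)                 # filtered cell state
    return h, c

h = c = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

Note how the cell state `c` is updated by addition and gating only; it never passes through a weight matrix, which is exactly the gradient "highway" discussed next.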
Gate Functions Explained
Forget Gate: Determines what fraction of the
previous cell state to retain. If $f_t = 0$, the cell forgets the past;
if $f_t = 1$, it preserves memory. For example, when encountering
a new subject in a sentence, the forget gate might reset information
about the previous subject.
Input Gate: Controls how much of the candidate cell
state $\tilde{c}_t$ to add. It allows
the model to selectively incorporate new information. If the input is
irrelevant, $i_t \approx 0$, and the
candidate is ignored.
Output Gate: Regulates which parts of the cell state
to expose as the hidden state.
This separation between cell state and hidden state allows the model to
maintain internal memory without immediately revealing it.
Why LSTMs Mitigate
Vanishing Gradients
The cell state $c_t$ has an additive update mechanism:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Unlike the multiplicative updates in vanilla RNNs, this addition allows
gradients to flow backward without repeated multiplication by $W_{hh}$.
When backpropagating through the cell state:

$$\frac{\partial c_t}{\partial c_{t-1}} = f_t$$

(ignoring the gates' indirect dependence on $c_{t-1}$ through
$h_{t-1}$). If $f_t \approx 1$,
gradients pass through unchanged. This creates a "highway" for
gradients, enabling the model to learn dependencies over hundreds of
time steps.
Practical Considerations
LSTMs have four times as many parameters as vanilla RNNs due to the
three gates and candidate state. Training is slower, but the ability to
capture long-term dependencies makes them far more effective. In
practice, LSTMs became the default for sequence modeling tasks until the
advent of Transformers.
Gated Recurrent Unit (GRU)
The GRU, proposed by Cho et al. in 2014, simplifies the LSTM
architecture by combining the forget and input gates into a single
update gate and merging the cell state and hidden state. GRUs have fewer
parameters and are faster to train, often performing comparably to
LSTMs.
Architecture
A GRU unit uses two gates:
Update gate $z_t$ controls how much of the previous
hidden state to keep.
Reset gate $r_t$ determines how much of the past to
forget when computing the candidate hidden state.
Mathematical Formulation
Given input $x_t$ and previous hidden state $h_{t-1}$, the GRU
computes:

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Gate Functions Explained
Reset Gate: When $r_t \approx 0$, the model ignores $h_{t-1}$ in
computing $\tilde{h}_t$, effectively starting fresh.
This is useful when the current input signals a new context.
Update Gate: Balances the previous hidden state and
candidate state. If $z_t \approx 0$,
the model keeps $h_{t-1}$; if $z_t \approx 1$, it adopts
$\tilde{h}_t$. The interpolation
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
smoothly transitions between past and present.
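As with the LSTM, the equations translate line for line into code. A single-step sketch (arbitrary sizes; biases omitted for brevity):

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 8, 16
concat = hidden_size + input_size

W_z = torch.randn(hidden_size, concat) * 0.1  # update gate
W_r = torch.randn(hidden_size, concat) * 0.1  # reset gate
W_h = torch.randn(hidden_size, concat) * 0.1  # candidate state

def gru_step(x_t, h_prev):
    z = torch.sigmoid(W_z @ torch.cat([h_prev, x_t]))
    r = torch.sigmoid(W_r @ torch.cat([h_prev, x_t]))
    # Reset gate scales the previous state inside the candidate
    h_tilde = torch.tanh(W_h @ torch.cat([r * h_prev, x_t]))
    # Update gate interpolates between past and present
    return (1 - z) * h_prev + z * h_tilde

h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):
    h = gru_step(x_t, h)
print(h.shape)
```

Only three weight matrices are needed instead of the LSTM's four, which is where the parameter savings come from.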
Comparison with LSTM
GRUs merge the cell state and hidden state, reducing parameters by
about 25%. They lack a separate output gate, meaning the entire hidden
state is always exposed. In practice, GRUs often match LSTM performance
on shorter sequences but may underperform on very long sequences where
the LSTM's separate cell state provides more flexibility. The choice
between GRU and LSTM is often task-dependent; GRUs are popular in
resource-constrained settings.
Bidirectional RNNs (Bi-RNN)
In many NLP tasks, future context is as important as past context.
For example, in sentiment analysis, the word "not" appearing after
"good" completely changes the meaning. Bidirectional RNNs process
sequences in both forward and backward directions, then combine the
hidden states.
Architecture
A Bi-RNN consists of two separate RNNs:
Forward RNN: Processes the sequence from $t = 1$ to $t = T$,
producing forward hidden states $\overrightarrow{h}_t$.
Backward RNN: Processes the sequence from $t = T$ to $t = 1$,
producing backward hidden states $\overleftarrow{h}_t$.
At each time step $t$, the final hidden state is the concatenation:

$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$
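In PyTorch, bidirectionality is a constructor flag, and the per-step output dimension doubles because the forward and backward states are concatenated. A small shape check (sizes are arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True,
              batch_first=True)
x = torch.randn(4, 10, 8)   # (batch, time, features)
out, (h_n, c_n) = rnn(x)
print(out.shape)            # (4, 10, 32): forward ++ backward at each step
print(h_n.shape)            # (2, 4, 16): one final state per direction
```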
Use Cases
Bi-RNNs are ideal for tasks where the entire sequence is available at
once:
Named Entity Recognition: Identifying entities
requires context from both sides.
Part-of-Speech Tagging: The syntactic role of a
word depends on surrounding words.
Machine Translation (Encoder): The encoder in
sequence-to-sequence models benefits from bidirectional context.
However, Bi-RNNs cannot be used for online tasks like real-time
language generation, since the backward pass requires access to future
tokens.
Deep Bidirectional RNNs
Stacking multiple bidirectional layers creates a deep Bi-RNN. Each
layer processes the concatenated outputs of the previous layer:

$$h_t^{(l)} = \mathrm{BiRNN}^{(l)}\!\left(h_t^{(l-1)}, h_{t-1}^{(l)}\right)$$

where $h_t^{(l)}$ is the hidden state at layer $l$. Deep Bi-RNNs learn
hierarchical representations, with lower layers capturing local patterns
and higher layers capturing long-range dependencies.
Stacked RNNs: Building Depth
Stacking multiple RNN layers on top of each other increases model
capacity and allows learning of hierarchical features. Each layer
processes the sequence using the hidden states from the layer below as
input.
Architecture
For a 2-layer stacked RNN:

$$h_t^{(1)} = \tanh\!\left(W^{(1)} h_{t-1}^{(1)} + U^{(1)} x_t + b^{(1)}\right)$$
$$h_t^{(2)} = \tanh\!\left(W^{(2)} h_{t-1}^{(2)} + U^{(2)} h_t^{(1)} + b^{(2)}\right)$$

The output at time $t$ is computed from the topmost layer:
$y_t = V h_t^{(2)} + b_y$.
Intuition
Lower layers learn low-level features (e.g., character patterns, word
boundaries), while higher layers learn abstract concepts (e.g., syntax,
semantics). This hierarchical structure mirrors the success of deep
convolutional networks in computer vision.
Practical Tips
Depth: 2-4 layers are common. Beyond 4 layers,
training becomes difficult without techniques like residual connections
or layer normalization.
Dropout: Apply dropout between layers to prevent
overfitting. Dropout between time steps (variational dropout) is more
effective than standard dropout.
Regularization: Gradient clipping and careful
initialization are crucial for deep RNNs.
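These tips combine directly in the `nn.LSTM` constructor; `dropout` is applied between stacked layers (not between time steps). A sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

# 3-layer stacked LSTM with dropout between layers
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=3,
              dropout=0.3, batch_first=True)
x = torch.randn(8, 20, 32)
out, (h_n, c_n) = rnn(x)
print(out.shape)   # (8, 20, 64): top layer's hidden state at each step
print(h_n.shape)   # (3, 8, 64): final hidden state of each layer
```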
Sequence-to-Sequence Models
(Seq2Seq)
Sequence-to-sequence models map an input sequence to an output
sequence of potentially different length. Introduced by Sutskever et al.
in 2014, Seq2Seq models revolutionized machine translation and other
transduction tasks.
Encoder-Decoder Architecture
A Seq2Seq model consists of two RNNs:
Encoder: Processes the input sequence $x_1, x_2, \dots, x_{T_x}$
and produces a context vector $c$, typically the final hidden
state $h_{T_x}$.
Decoder: Generates the output sequence $y_1, y_2, \dots, y_{T_y}$
conditioned on $c$ and previously generated tokens.
Encoder
The encoder is an RNN that reads the input sequence:

$$h_t = \mathrm{RNN}_{\mathrm{enc}}(x_t, h_{t-1})$$

The context vector is $c = h_{T_x}$, or a function of
all encoder hidden states, such as the mean or max.
Decoder
The decoder generates the output sequence one token at a time. At
each decoding step:

$$s_t = \mathrm{RNN}_{\mathrm{dec}}(y_{t-1}, s_{t-1}), \qquad
P(y_t \mid y_{<t}, c) = \mathrm{softmax}(W_o s_t)$$

with $s_0 = c$. During training, the decoder uses teacher forcing: the
true token $y_{t-1}$ is fed as input, even if the model predicted
incorrectly. During inference, the decoder uses its own predictions.
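A sketch of the training-time decoding loop with teacher forcing (the names `decoder_cell`, `out_proj`, and the tiny vocab are illustrative assumptions; a `GRUCell` stands in for the decoder RNN):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_size, hidden_size = 20, 8, 16
embed = nn.Embedding(vocab_size, emb_size)
decoder_cell = nn.GRUCell(emb_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()

context = torch.randn(1, hidden_size)   # stand-in for the encoder's final state c
target = torch.tensor([[3, 7, 5, 1]])   # ground-truth output tokens
sos = torch.tensor([0])                 # start-of-sequence token

s = context
prev = sos
loss = 0.0
for t in range(target.size(1)):
    s = decoder_cell(embed(prev), s)    # s_t = RNN_dec(y_{t-1}, s_{t-1})
    logits = out_proj(s)                # unnormalized P(y_t | ...)
    loss = loss + criterion(logits, target[:, t])
    prev = target[:, t]                 # teacher forcing: feed the TRUE token
print(float(loss))
```

At inference time the last line would instead be `prev = logits.argmax(dim=-1)`, feeding the model's own prediction back in.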
Limitations
The encoder compresses the entire input sequence into a fixed-size
vector. For long sequences, this
bottleneck loses information. The decoder must reconstruct the output
from this single vector, which is challenging. This limitation led to
the development of attention mechanisms.
Applications
Machine Translation: Translate sentences from one
language to another.
Summarization: Generate a short summary of a long
document.
Dialogue Systems: Produce responses to user
queries.
Code Generation: Convert natural language
descriptions to code.
A Preview of Attention
Mechanisms
Attention mechanisms address the bottleneck in Seq2Seq models by
allowing the decoder to dynamically focus on different parts of the
input sequence at each decoding step. Instead of relying on a single
context vector, the decoder computes a weighted sum of all encoder
hidden states.
Basic Idea
At each decoding step $t$, the attention mechanism computes a score
$e_{t,i}$ for each encoder hidden state $h_i$, indicating how
relevant it is to generating $y_t$. These scores are normalized
via softmax to produce attention weights:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}$$

The context vector for decoding step $t$ is:

$$c_t = \sum_{i} \alpha_{t,i} h_i$$
Scoring Functions
Common scoring functions include:
Dot product: $e_{t,i} = s_t^\top h_i$
Bilinear: $e_{t,i} = s_t^\top W h_i$
Additive (Bahdanau): $e_{t,i} = v^\top \tanh(W_1 s_t + W_2 h_i)$
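Dot-product attention takes only a few lines. A sketch with random tensors standing in for real encoder/decoder states:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T_x, hidden = 6, 16
enc_h = torch.randn(T_x, hidden)   # encoder hidden states h_1 .. h_Tx
s_t = torch.randn(hidden)          # current decoder state s_t

scores = enc_h @ s_t               # e_{t,i} = s_t^T h_i  (dot product)
alpha = F.softmax(scores, dim=0)   # attention weights, sum to 1
context = alpha @ enc_h            # c_t = sum_i alpha_{t,i} h_i
print(alpha.sum(), context.shape)
```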
Benefits
Attention allows the model to handle long sequences by avoiding the
fixed-size bottleneck. It also provides interpretability: attention
weights reveal which input tokens the model focuses on at each decoding
step. Attention mechanisms became a cornerstone of NLP, eventually
leading to Transformer models that rely entirely on attention.
Beyond Seq2Seq
Attention isn't limited to Seq2Seq. It's used in:
Self-Attention: Tokens attend to other tokens in
the same sequence (Transformers).
Hierarchical Attention: Multiple levels of
attention for documents with sentence and word structure.
Multi-Head Attention: Multiple attention mechanisms
run in parallel, capturing different relationships.
PyTorch Implementation:
Text Generation
We'll implement a character-level RNN for text generation using
PyTorch. The model learns to predict the next character given a
sequence of previous characters.
Dataset Preparation
We'll use a simple text dataset. For demonstration, we'll train on a
small corpus and generate text.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Sample text corpus
text = """
Deep learning is a subset of machine learning that uses neural networks
with many layers. These networks can learn hierarchical representations
of data, making them powerful for tasks like image recognition, natural
language processing, and speech recognition.
"""

# Create character mappings
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)
```
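The model and sampling code are not reproduced in this excerpt; a minimal sketch of a character-level LSTM with temperature-controlled sampling might look like this (the `CharRNN`/`sample` names and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, emb_size=32, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.fc(out), state

def sample(model, start_idx, length, temperature=1.0):
    model.eval()
    idx, state, out_ids = start_idx, None, [start_idx]
    with torch.no_grad():
        for _ in range(length):
            logits, state = model(torch.tensor([[idx]]), state)
            # Divide logits by temperature before softmax
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            idx = int(torch.multinomial(probs, 1))
            out_ids.append(idx)
    return out_ids

vocab_size = 40  # assumed; in practice taken from the corpus above
model = CharRNN(vocab_size)
ids = sample(model, start_idx=0, length=20, temperature=0.8)
print(len(ids))  # start token plus 20 sampled indices
```

Training (cross-entropy on next-character prediction) is omitted; the sketch shows only the architecture and how temperature enters the sampling step.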
The temperature parameter controls randomness: low temperature (0.5)
makes the model conservative, choosing high-probability characters; high
temperature (1.5) increases diversity but may produce nonsense. After
training, the model learns character patterns, word boundaries, and even
simple grammar. For longer texts and more layers, the model can generate
surprisingly coherent passages.
PyTorch
Implementation: Simple Translation
We'll implement a basic Seq2Seq model for English-to-French
translation using an LSTM encoder and decoder.
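The full model definitions aren't reproduced in this excerpt; a minimal sketch of what the encoder, decoder, and a greedy `translate` helper might look like (all names, token IDs, and hyperparameters here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size=32, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, x):
        _, state = self.lstm(self.embed(x))
        return state                      # (h_n, c_n): the context

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size=32, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, y, state):
        out, state = self.lstm(self.embed(y), state)
        return self.fc(out), state

SOS, EOS = 0, 1  # assumed special-token IDs

def translate(encoder, decoder, sentence, input_vocab, output_vocab,
              device, max_len=10):
    # Greedy decoding: feed the model's own prediction back in
    idx_to_word = {i: w for w, i in output_vocab.items()}
    x = torch.tensor([[input_vocab[w] for w in sentence.split()]],
                     device=device)
    state = encoder(x)                    # encoder context initializes decoder
    y = torch.tensor([[SOS]], device=device)
    words = []
    for _ in range(max_len):
        logits, state = decoder(y, state)
        tok = int(logits[0, -1].argmax())
        if tok == EOS:
            break
        words.append(idx_to_word.get(tok, "?"))
        y = torch.tensor([[tok]], device=device)
    return " ".join(words)
```

Untrained, this produces gibberish; the point is the wiring: the encoder's final `(h_n, c_n)` becomes the decoder's initial state, and decoding loops until `EOS` or `max_len`.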
```python
# Test translation
test_sentences = ["hello", "thank you", "good morning"]
for sentence in test_sentences:
    translation = translate(encoder, decoder, sentence,
                            input_vocab, output_vocab, device)
    print(f"{sentence} -> {translation}")
```
Notes
This is a minimal Seq2Seq implementation without attention. For real
translation tasks, you'd need:
Larger vocabulary and dataset
Attention mechanism
Beam search for decoding
Proper validation set
Handling unknown words with techniques like subword tokenization
(BPE)
Common Questions and Answers
1. Why
do we use tanh in RNN hidden states instead of ReLU?
Historically, $\tanh$ was preferred because it outputs values in
$[-1, 1]$, centering activations around zero and providing symmetric
gradients. This helps stabilize learning. ReLU can be used in RNNs, but
it may cause hidden states to grow unbounded. In practice, modern
variants like LSTMs and GRUs use a combination of $\tanh$ and sigmoid,
each serving different purposes: sigmoid for gates ($(0, 1)$ range for
soft switches) and $\tanh$ for cell state candidates ($[-1, 1]$ for
centered values).
2. How
does gradient clipping prevent exploding gradients?
Gradient clipping caps the norm of gradients during backpropagation.
If the gradient norm exceeds a threshold $\tau$, we rescale it:

$$g \leftarrow g \cdot \frac{\tau}{\|g\|}$$

This prevents parameters from making drastic updates that cause
numerical overflow.
However, clipping doesn't solve vanishing gradients — it's a workaround
for explosions. For vanishing gradients, architectural changes like LSTM
or GRU are necessary.
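In PyTorch, clipping is a single call placed between `backward()` and `optimizer.step()` (the tiny LSTM here is just a stand-in):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, _ = model(torch.randn(2, 5, 8))
out.sum().backward()

# Rescale all gradients in-place so their global norm is at most 1.0
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(float(total_norm))  # the norm measured before clipping
```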
3. What is
teacher forcing, and when should we use it?
Teacher forcing feeds the ground-truth token as input to the decoder
at each step during training, rather than the decoder's own prediction.
This accelerates training because the model learns faster with correct
inputs. However, at inference time, the model uses its own predictions,
creating a train-test mismatch. To mitigate this, use scheduled
sampling: gradually reduce teacher forcing ratio as training progresses,
forcing the model to learn to recover from its own mistakes.
4.
Can RNNs handle sequences of different lengths in a single batch?
Yes, but it requires padding and masking. Pad shorter sequences to
the length of the longest sequence in the batch using a special padding
token. During computation, mask the loss and attention for padded
positions so they don't contribute to gradients. PyTorch provides
`pack_padded_sequence` and `pad_packed_sequence` to efficiently handle
variable-length sequences.
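A minimal example with these utilities (two sequences of true lengths 5 and 3, padded to length 5):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 5, 4)           # batch of 2, padded to length 5
lengths = torch.tensor([5, 3])     # true lengths, sorted descending

packed = pack_padded_sequence(x, lengths, batch_first=True,
                              enforce_sorted=True)
packed_out, _ = rnn(packed)        # padded steps are skipped entirely
out, out_lens = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                   # (2, 5, 8); padded positions come back as zeros
print(out_lens)                    # tensor([5, 3])
```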
5. Why
do LSTMs have separate cell state and hidden state?
The cell state $c_t$ serves as long-term memory, passing information
across many time steps with minimal transformation. The hidden state
$h_t$ is a filtered version of the cell state, controlled by the output
gate. This separation allows the model to store raw information in
$c_t$ while selectively exposing relevant parts via $h_t$. It's
analogous to a computer's RAM (cell state) versus registers (hidden
state).
6. How do Bi-RNNs
differ from stacked RNNs?
Bi-RNNs process sequences in both forward and backward directions at
the same layer, capturing bidirectional context. Stacked RNNs add depth
by layering RNNs vertically, learning hierarchical features. You can
combine both: a 2-layer Bi-RNN has two bidirectional layers stacked on
top of each other, providing both depth and bidirectional context.
7. What
is the curse of long sequences in Seq2Seq models?
In vanilla Seq2Seq, the encoder compresses the entire input into a
fixed-size context vector. For long sequences, this bottleneck loses
critical information, causing the decoder to struggle. Attention
mechanisms solve this by allowing the decoder to access all encoder
hidden states, dynamically focusing on relevant parts of the input at
each decoding step.
8.
Why do we detach hidden states between batches during training?
Detaching hidden states prevents backpropagating gradients across
batch boundaries. If we didn't detach, gradients would flow through the
entire dataset, which is computationally infeasible and makes training
unstable. Detaching treats each batch as independent, though we still
pass hidden states forward to maintain sequence continuity within an
epoch.
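The pattern in a truncated-BPTT training loop looks like this (a sketch; the three "batches" are just random tensors):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
state = None
for batch in torch.randn(3, 2, 5, 4):      # 3 consecutive batches
    if state is not None:
        # Keep the values, cut the graph: no gradient flow across batches
        state = tuple(s.detach() for s in state)
    out, state = rnn(batch, state)
    out.sum().backward()                   # BPTT only within this batch
print(state[0].shape)                      # (1, 2, 8)
```

Without the `detach`, the second `backward()` would try to traverse the freed graph of the previous batch and raise an error.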
9. How does
temperature affect text generation?
Temperature $T$ scales the logits $z_i$ before applying softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Low temperature ($T < 1$) makes the distribution peaky, favoring
high-probability tokens (more conservative). High temperature
($T > 1$) flattens the distribution, increasing randomness (more
creative but less coherent). Setting $T = 1$ uses the model's raw
predictions.
10. Are RNNs
still relevant given Transformers' success?
Transformers dominate most NLP tasks due to parallelization and
better long-range dependency modeling. However, RNNs remain relevant in
resource-constrained settings (fewer parameters, lower memory), online
learning scenarios where sequences are processed incrementally, and
certain time-series tasks where sequential processing is natural.
Understanding RNNs also provides foundational knowledge for grasping
attention mechanisms and Transformer architectures.
Conclusion
Recurrent Neural Networks introduced the paradigm of sequential
processing with memory, enabling models to handle variable-length inputs
and capture temporal dependencies. While vanilla RNNs suffer from
vanishing gradients, gated architectures like LSTM and GRU overcome this
limitation by carefully controlling information flow. Bidirectional and
stacked RNNs extend these models to capture richer context and
hierarchical features. Sequence-to-sequence models enable transduction
tasks like machine translation, and attention mechanisms address the
bottleneck of fixed-size context vectors.
Despite the rise of Transformers, RNNs remain a cornerstone of deep
learning for sequences. The concepts of recurrence, hidden state, and
gradient flow through time underpin many modern architectures. By
mastering RNN fundamentals, you gain insight into how neural networks
process sequential data and the challenges involved in learning
long-term dependencies. Whether you're building language models,
translation systems, or time-series forecasters, RNNs provide a powerful
and intuitive framework for sequence modeling.
Post title:NLP (3): RNN and Sequence Modeling
Post author:Chen Kai
Create time:2024-02-14 10:15:00
Post link:https://www.chenk.top/en/nlp-rnn-sequence-modeling/
Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.