Sequence data is everywhere in natural language processing — from
sentences and documents to conversations and time series. Unlike feedforward
networks that treat inputs as independent fixed-size vectors, Recurrent
Neural Networks (RNNs) maintain an internal state that evolves as they
process sequences step by step. This recurrent connection allows the
network to capture temporal dependencies and context, making RNNs a
natural choice for language modeling, machine translation, and text
generation. However, vanilla RNNs struggle with long-range dependencies
due to vanishing gradients. This challenge led to the development of
gated architectures like LSTM and GRU, which selectively control
information flow and maintain long-term memory. In this article, we'll
explore the core mechanics of RNN architectures, understand why gradient
issues arise during backpropagation through time, and dive into
practical implementations using PyTorch for text generation and
sequence-to-sequence tasks.
The Core Idea:
Recurrence and Parameter Sharing
Traditional feedforward neural networks process inputs in a single
forward pass, with no memory of previous inputs. For sequential data,
this is problematic. Consider the sentence "The cat sat on the mat." To
understand "mat," you need context from earlier words. RNNs address this
by introducing a recurrent connection that feeds the hidden state from
one time step to the next.
Recurrent Structure
At each time step $t$, an RNN receives input $x_t$ and the previous
hidden state $h_{t-1}$, then computes the new hidden state $h_t$ and
output $y_t$:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

Here, $W_{hh}$ is the recurrent weight matrix, $W_{xh}$ transforms the
input, and $W_{hy}$ produces the output. The activation function
$\tanh$ squashes values to $[-1, 1]$. The key insight is that the same
weight matrices $W_{hh}$, $W_{xh}$, and $W_{hy}$ are reused at every
time step. This parameter sharing means the model learns a single
transformation that applies across the entire sequence, regardless of
its length.
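The recurrence above can be sketched as a single step function applied repeatedly; a minimal illustration in PyTorch (tensor sizes here are arbitrary assumptions):

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 8, 16

# Shared parameters, reused at every time step
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    return torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Apply the same step over a length-5 sequence
h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):
    h = rnn_step(x_t, h)
print(h.shape)  # torch.Size([16])
```

Note that the loop reuses the same `W_hh`, `W_xh`, and `b_h` at every step: the parameter count is independent of sequence length.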
Why Parameter Sharing
Matters
Parameter sharing is crucial for several reasons:
Generalization: The model learns patterns that work
at any position in the sequence, not just specific positions.
Efficiency: The number of parameters doesn't grow
with sequence length. A feedforward network processing variable-length
sequences would need different weights for each position.
Translation invariance: Features learned at one
time step transfer to others, similar to how convolutional filters work
in CNNs.
Consider language modeling: the pattern "the cat" can appear at the
start, middle, or end of a sentence. Parameter sharing ensures the model
recognizes this pattern regardless of position.
Unrolling Through Time
To visualize computation, we "unroll" the RNN across time steps. For
a sequence of length $T$, the unrolled network looks like a feedforward
network with $T$ layers, where each layer shares the same weights:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \quad t = 1, \dots, T$$

This unrolled view helps us understand how gradients flow
backward through the network during training.
The Vanishing and
Exploding Gradient Problem
Training RNNs requires computing gradients with respect to parameters
at all time steps. This is done via Backpropagation Through Time (BPTT),
which unrolls the network and applies the chain rule. However, BPTT
suffers from a critical issue: vanishing or exploding gradients.
Backpropagation Through Time
(BPTT)
Given a loss $L = \sum_{t=1}^{T} L_t$ computed over a sequence, we need
$\frac{\partial L}{\partial W_{hh}}$, $\frac{\partial L}{\partial W_{xh}}$,
etc. Using the chain rule, the gradient of the loss with respect to
$h_t$ depends on gradients from future time steps:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial h_t}
+ \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\!\top}
\frac{\partial L}{\partial h_{t+1}}$$

The term $\frac{\partial h_{t+1}}{\partial h_t}$ involves the recurrent
weight matrix $W_{hh}$ and the derivative of $\tanh$:

$$\frac{\partial h_{t+1}}{\partial h_t}
= \mathrm{diag}\!\left(1 - h_{t+1}^{2}\right) W_{hh}$$

To propagate gradients from time step $T$ back to time step $t$, we
multiply these Jacobian matrices:

$$\frac{\partial h_T}{\partial h_t}
= \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}
= \prod_{k=t}^{T-1} \mathrm{diag}\!\left(1 - h_{k+1}^{2}\right) W_{hh}$$
Why Gradients Vanish
Since $|\tanh'(z)| \le 1$ for all $z$, the gradient of the activation
is at most 1. If the largest eigenvalue of $W_{hh}$ is less than 1,
repeated multiplication causes the gradient to shrink exponentially:

$$\left\| \frac{\partial h_T}{\partial h_t} \right\| \le \gamma^{T-t}$$

where $\gamma$ bounds the norm of each Jacobian factor. If $\gamma < 1$,
this product approaches zero as $T - t$ increases. The result: gradients
from distant time steps become negligibly small, and the model can't
learn long-range dependencies. For example, in "The cat, which was
sitting on the mat and purring loudly, was happy," the model struggles
to connect "cat" with "was happy" because gradients decay over the
intervening words.
Why Gradients Explode
Conversely, if the largest eigenvalue of $W_{hh}$ exceeds 1, gradients grow
exponentially. This causes numerical instability, producing NaN or Inf
values during training. Gradient clipping — capping gradients at a
threshold — is a common workaround, but doesn't solve the underlying
issue.
Empirical Evidence
In practice, vanilla RNNs struggle to learn dependencies beyond 10-20
time steps. Experiments on synthetic tasks like copying sequences or
remembering values show that RNNs quickly forget early inputs. This
limitation motivated the development of gated architectures.
Long Short-Term Memory (LSTM)
The LSTM, introduced by Hochreiter and Schmidhuber in 1997, addresses
vanishing gradients by replacing the simple recurrent unit with a more
complex cell that explicitly maintains long-term memory. The key
innovation is a cell state $c_t$ that
runs through the sequence with minimal transformations, allowing
gradients to flow more easily.
Architecture Overview
An LSTM unit consists of three gates — forget gate, input gate, and
output gate — and a cell state. At each time step:
Forget gate $f_t$ decides what information to discard
from the cell state.
Input gate $i_t$ determines what new information to
add.
Candidate cell state $\tilde{c}_t$ proposes new values.
Cell state update combines old and new
information into $c_t$.
Output gate $o_t$ controls what part of the cell state
to output.
Mathematical Formulation
Given input $x_t$ and previous hidden state $h_{t-1}$, the LSTM
computes:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

Here, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise
multiplication, and $[h_{t-1}, x_t]$ is the concatenation of hidden
state and input. Each gate uses sigmoid activation to produce values
in $(0, 1)$, acting as a soft switch.
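These equations map directly onto code. A minimal single-step LSTM cell (a sketch with arbitrary sizes, not the optimized `torch.nn.LSTM`):

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 8, 16
concat = hidden_size + input_size

# One weight matrix per gate, acting on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (torch.randn(hidden_size, concat) * 0.1 for _ in range(4))
b_f = b_i = b_c = b_o = torch.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([h_prev, x_t])          # [h_{t-1}, x_t]
    f = torch.sigmoid(W_f @ z + b_f)      # forget gate
    i = torch.sigmoid(W_i @ z + b_i)      # input gate
    c_tilde = torch.tanh(W_c @ z + b_c)   # candidate cell state
    c = f * c_prev + i * c_tilde          # additive cell-state update
    o = torch.sigmoid(W_o @ z + b_o)      # output gate
    h = o * torch.tanh(c)                 # filtered cell state
    return h, c

h = c = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

Note how the cell state `c` is updated by addition and gating only; it never passes through a weight matrix, which is exactly the gradient "highway" discussed next.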
Gate Functions Explained
Forget Gate: Determines what fraction of the
previous cell state to retain. If $f_t = 0$, the cell forgets the past;
if $f_t = 1$, it preserves memory. For example, when encountering
a new subject in a sentence, the forget gate might reset information
about the previous subject.
Input Gate: Controls how much of the candidate cell
state $\tilde{c}_t$ to add. It allows
the model to selectively incorporate new information. If the input is
irrelevant, $i_t \approx 0$, and the
candidate is ignored.
Output Gate: Regulates which parts of the cell state
to expose as the hidden state.
This separation between cell state and hidden state allows the model to
maintain internal memory without immediately revealing it.
Why LSTMs Mitigate
Vanishing Gradients
The cell state $c_t$ has an additive update mechanism:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Unlike the multiplicative updates in vanilla RNNs, this addition allows
gradients to flow backward without repeated multiplication by $W_{hh}$.
When backpropagating through the cell state:

$$\frac{\partial c_t}{\partial c_{t-1}} = f_t$$

(ignoring the gates' indirect dependence on $c_{t-1}$ through
$h_{t-1}$). If $f_t \approx 1$,
gradients pass through unchanged. This creates a "highway" for
gradients, enabling the model to learn dependencies over hundreds of
time steps.
Practical Considerations
LSTMs have four times as many parameters as vanilla RNNs due to the
three gates and candidate state. Training is slower, but the ability to
capture long-term dependencies makes them far more effective. In
practice, LSTMs became the default for sequence modeling tasks until the
advent of Transformers.
Gated Recurrent Unit (GRU)
The GRU, proposed by Cho et al. in 2014, simplifies the LSTM
architecture by combining the forget and input gates into a single
update gate and merging the cell state and hidden state. GRUs have fewer
parameters and are faster to train, often performing comparably to
LSTMs.
Architecture
A GRU unit uses two gates:
Update gate $z_t$ controls how much of the previous
hidden state to keep.
Reset gate $r_t$ determines how much of the past to
forget when computing the candidate hidden state.
Mathematical Formulation
Given input $x_t$ and previous hidden state $h_{t-1}$, the GRU
computes:

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Gate Functions Explained
Reset Gate: When $r_t \approx 0$, the model ignores $h_{t-1}$ in
computing $\tilde{h}_t$, effectively starting fresh.
This is useful when the current input signals a new context.
Update Gate: Balances the previous hidden state and
candidate state. If $z_t \approx 0$,
the model keeps $h_{t-1}$; if $z_t \approx 1$, it adopts
$\tilde{h}_t$. The interpolation
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
smoothly transitions between past and present.
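As with the LSTM, the equations translate line for line into code. A single-step sketch (arbitrary sizes; biases omitted for brevity):

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 8, 16
concat = hidden_size + input_size

W_z = torch.randn(hidden_size, concat) * 0.1  # update gate
W_r = torch.randn(hidden_size, concat) * 0.1  # reset gate
W_h = torch.randn(hidden_size, concat) * 0.1  # candidate state

def gru_step(x_t, h_prev):
    z = torch.sigmoid(W_z @ torch.cat([h_prev, x_t]))
    r = torch.sigmoid(W_r @ torch.cat([h_prev, x_t]))
    # Reset gate scales the previous state inside the candidate
    h_tilde = torch.tanh(W_h @ torch.cat([r * h_prev, x_t]))
    # Update gate interpolates between past and present
    return (1 - z) * h_prev + z * h_tilde

h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):
    h = gru_step(x_t, h)
print(h.shape)
```

Only three weight matrices are needed instead of the LSTM's four, which is where the parameter savings come from.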
Comparison with LSTM
GRUs merge the cell state and hidden state, reducing parameters by
about 25%. They lack a separate output gate, meaning the entire hidden
state is always exposed. In practice, GRUs often match LSTM performance
on shorter sequences but may underperform on very long sequences where
the LSTM's separate cell state provides more flexibility. The choice
between GRU and LSTM is often task-dependent; GRUs are popular in
resource-constrained settings.
Bidirectional RNNs (Bi-RNN)
In many NLP tasks, future context is as important as past context.
For example, in sentiment analysis, the word "not" appearing after
"good" completely changes the meaning. Bidirectional RNNs process
sequences in both forward and backward directions, then combine the
hidden states.
Architecture
A Bi-RNN consists of two separate RNNs:
Forward RNN: Processes the sequence from $t = 1$ to $t = T$,
producing forward hidden states $\overrightarrow{h}_t$.
Backward RNN: Processes the sequence from $t = T$ to $t = 1$,
producing backward hidden states $\overleftarrow{h}_t$.
At each time step $t$, the final hidden state is the concatenation:

$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$
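In PyTorch, bidirectionality is a constructor flag, and the per-step output dimension doubles because the forward and backward states are concatenated. A small shape check (sizes are arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True,
              batch_first=True)
x = torch.randn(4, 10, 8)   # (batch, time, features)
out, (h_n, c_n) = rnn(x)
print(out.shape)            # (4, 10, 32): forward ++ backward at each step
print(h_n.shape)            # (2, 4, 16): one final state per direction
```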
Use Cases
Bi-RNNs are ideal for tasks where the entire sequence is available at
once:
Named Entity Recognition: Identifying entities
requires context from both sides.
Part-of-Speech Tagging: The syntactic role of a
word depends on surrounding words.
Machine Translation (Encoder): The encoder in
sequence-to-sequence models benefits from bidirectional context.
However, Bi-RNNs cannot be used for online tasks like real-time
language generation, since the backward pass requires access to future
tokens.
Deep Bidirectional RNNs
Stacking multiple bidirectional layers creates a deep Bi-RNN. Each
layer processes the concatenated outputs of the previous layer:

$$h_t^{(l)} = \mathrm{BiRNN}^{(l)}\!\left(h_t^{(l-1)}, h_{t-1}^{(l)}\right)$$

where $h_t^{(l)}$ is the hidden state at layer $l$. Deep Bi-RNNs learn
hierarchical representations, with lower layers capturing local patterns
and higher layers capturing long-range dependencies.
Stacked RNNs: Building Depth
Stacking multiple RNN layers on top of each other increases model
capacity and allows learning of hierarchical features. Each layer
processes the sequence using the hidden states from the layer below as
input.
Architecture
For a 2-layer stacked RNN:

$$h_t^{(1)} = \tanh\!\left(W^{(1)} h_{t-1}^{(1)} + U^{(1)} x_t + b^{(1)}\right)$$
$$h_t^{(2)} = \tanh\!\left(W^{(2)} h_{t-1}^{(2)} + U^{(2)} h_t^{(1)} + b^{(2)}\right)$$

The output at time $t$ is computed from the topmost layer:
$y_t = V h_t^{(2)} + b_y$.
Intuition
Lower layers learn low-level features (e.g., character patterns, word
boundaries), while higher layers learn abstract concepts (e.g., syntax,
semantics). This hierarchical structure mirrors the success of deep
convolutional networks in computer vision.
Practical Tips
Depth: 2-4 layers are common. Beyond 4 layers,
training becomes difficult without techniques like residual connections
or layer normalization.
Dropout: Apply dropout between layers to prevent
overfitting. Dropout between time steps (variational dropout) is more
effective than standard dropout.
Regularization: Gradient clipping and careful
initialization are crucial for deep RNNs.
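These tips combine directly in the `nn.LSTM` constructor; `dropout` is applied between stacked layers (not between time steps). A sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

# 3-layer stacked LSTM with dropout between layers
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=3,
              dropout=0.3, batch_first=True)
x = torch.randn(8, 20, 32)
out, (h_n, c_n) = rnn(x)
print(out.shape)   # (8, 20, 64): top layer's hidden state at each step
print(h_n.shape)   # (3, 8, 64): final hidden state of each layer
```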
Sequence-to-Sequence Models
(Seq2Seq)
Sequence-to-sequence models map an input sequence to an output
sequence of potentially different length. Introduced by Sutskever et al.
in 2014, Seq2Seq models revolutionized machine translation and other
transduction tasks.
Encoder-Decoder Architecture
A Seq2Seq model consists of two RNNs:
Encoder: Processes the input sequence $x_1, x_2, \dots, x_{T_x}$
and produces a context vector $c$, typically the final hidden
state $h_{T_x}$.
Decoder: Generates the output sequence $y_1, y_2, \dots, y_{T_y}$
conditioned on $c$ and previously generated tokens.
Encoder
The encoder is an RNN that reads the input sequence:

$$h_t = \mathrm{RNN}_{\mathrm{enc}}(x_t, h_{t-1})$$

The context vector is $c = h_{T_x}$, or a function of
all encoder hidden states, such as the mean or max.
Decoder
The decoder generates the output sequence one token at a time. At
each decoding step:

$$s_t = \mathrm{RNN}_{\mathrm{dec}}(y_{t-1}, s_{t-1}), \qquad
P(y_t \mid y_{<t}, c) = \mathrm{softmax}(W_o s_t)$$

with $s_0 = c$. During training, the decoder uses teacher forcing: the
true token $y_{t-1}$ is fed as input, even if the model predicted
incorrectly. During inference, the decoder uses its own predictions.
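A sketch of the training-time decoding loop with teacher forcing (the names `decoder_cell`, `out_proj`, and the tiny vocab are illustrative assumptions; a `GRUCell` stands in for the decoder RNN):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_size, hidden_size = 20, 8, 16
embed = nn.Embedding(vocab_size, emb_size)
decoder_cell = nn.GRUCell(emb_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()

context = torch.randn(1, hidden_size)   # stand-in for the encoder's final state c
target = torch.tensor([[3, 7, 5, 1]])   # ground-truth output tokens
sos = torch.tensor([0])                 # start-of-sequence token

s = context
prev = sos
loss = 0.0
for t in range(target.size(1)):
    s = decoder_cell(embed(prev), s)    # s_t = RNN_dec(y_{t-1}, s_{t-1})
    logits = out_proj(s)                # unnormalized P(y_t | ...)
    loss = loss + criterion(logits, target[:, t])
    prev = target[:, t]                 # teacher forcing: feed the TRUE token
print(float(loss))
```

At inference time the last line would instead be `prev = logits.argmax(dim=-1)`, feeding the model's own prediction back in.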
Limitations
The encoder compresses the entire input sequence into a fixed-size
vector. For long sequences, this
bottleneck loses information. The decoder must reconstruct the output
from this single vector, which is challenging. This limitation led to
the development of attention mechanisms.
Applications
Machine Translation: Translate sentences from one
language to another.
Summarization: Generate a short summary of a long
document.
Dialogue Systems: Produce responses to user
queries.
Code Generation: Convert natural language
descriptions to code.
A Preview of Attention
Mechanisms
Attention mechanisms address the bottleneck in Seq2Seq models by
allowing the decoder to dynamically focus on different parts of the
input sequence at each decoding step. Instead of relying on a single
context vector, the decoder computes a weighted sum of all encoder
hidden states.
Basic Idea
At each decoding step $t$, the attention mechanism computes a score
$e_{t,i}$ for each encoder hidden state $h_i$, indicating how
relevant it is to generating $y_t$. These scores are normalized
via softmax to produce attention weights:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}$$

The context vector for decoding step $t$ is:

$$c_t = \sum_{i} \alpha_{t,i} h_i$$
Scoring Functions
Common scoring functions include:
Dot product: $e_{t,i} = s_t^\top h_i$
Bilinear: $e_{t,i} = s_t^\top W h_i$
Additive (Bahdanau): $e_{t,i} = v^\top \tanh(W_1 s_t + W_2 h_i)$
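Dot-product attention takes only a few lines. A sketch with random tensors standing in for real encoder/decoder states:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T_x, hidden = 6, 16
enc_h = torch.randn(T_x, hidden)   # encoder hidden states h_1 .. h_Tx
s_t = torch.randn(hidden)          # current decoder state s_t

scores = enc_h @ s_t               # e_{t,i} = s_t^T h_i  (dot product)
alpha = F.softmax(scores, dim=0)   # attention weights, sum to 1
context = alpha @ enc_h            # c_t = sum_i alpha_{t,i} h_i
print(alpha.sum(), context.shape)
```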
Benefits
Attention allows the model to handle long sequences by avoiding the
fixed-size bottleneck. It also provides interpretability: attention
weights reveal which input tokens the model focuses on at each decoding
step. Attention mechanisms became a cornerstone of NLP, eventually
leading to Transformer models that rely entirely on attention.
Beyond Seq2Seq
Attention isn't limited to Seq2Seq. It's used in:
Self-Attention: Tokens attend to other tokens in
the same sequence (Transformers).
Hierarchical Attention: Multiple levels of
attention for documents with sentence and word structure.
Multi-Head Attention: Multiple attention mechanisms
run in parallel, capturing different relationships.
PyTorch Implementation:
Text Generation
We'll implement a character-level RNN for text generation using
PyTorch. The model learns to predict the next character given a
sequence of previous characters.
Dataset Preparation
We'll use a simple text dataset. For demonstration, we'll train on a
small corpus and generate text.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Sample text corpus
text = """
Deep learning is a subset of machine learning that uses neural networks
with many layers. These networks can learn hierarchical representations
of data, making them powerful for tasks like image recognition, natural
language processing, and speech recognition.
"""

# Create character mappings
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)
```
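The model and sampling code are not reproduced in this excerpt; a minimal sketch of a character-level LSTM with temperature-controlled sampling might look like this (the `CharRNN`/`sample` names and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, emb_size=32, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.fc(out), state

def sample(model, start_idx, length, temperature=1.0):
    model.eval()
    idx, state, out_ids = start_idx, None, [start_idx]
    with torch.no_grad():
        for _ in range(length):
            logits, state = model(torch.tensor([[idx]]), state)
            # Divide logits by temperature before softmax
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            idx = int(torch.multinomial(probs, 1))
            out_ids.append(idx)
    return out_ids

vocab_size = 40  # assumed; in practice taken from the corpus above
model = CharRNN(vocab_size)
ids = sample(model, start_idx=0, length=20, temperature=0.8)
print(len(ids))  # start token plus 20 sampled indices
```

Training (cross-entropy on next-character prediction) is omitted; the sketch shows only the architecture and how temperature enters the sampling step.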
The temperature parameter controls randomness: low temperature (0.5)
makes the model conservative, choosing high-probability characters; high
temperature (1.5) increases diversity but may produce nonsense. After
training, the model learns character patterns, word boundaries, and even
simple grammar. For longer texts and more layers, the model can generate
surprisingly coherent passages.
PyTorch
Implementation: Simple Translation
We'll implement a basic Seq2Seq model for English-to-French
translation using an LSTM encoder and decoder.
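The full model definitions aren't reproduced in this excerpt; a minimal sketch of what the encoder, decoder, and a greedy `translate` helper might look like (all names, token IDs, and hyperparameters here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size=32, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)

    def forward(self, x):
        _, state = self.lstm(self.embed(x))
        return state                      # (h_n, c_n): the context

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size=32, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, y, state):
        out, state = self.lstm(self.embed(y), state)
        return self.fc(out), state

SOS, EOS = 0, 1  # assumed special-token IDs

def translate(encoder, decoder, sentence, input_vocab, output_vocab,
              device, max_len=10):
    # Greedy decoding: feed the model's own prediction back in
    idx_to_word = {i: w for w, i in output_vocab.items()}
    x = torch.tensor([[input_vocab[w] for w in sentence.split()]],
                     device=device)
    state = encoder(x)                    # encoder context initializes decoder
    y = torch.tensor([[SOS]], device=device)
    words = []
    for _ in range(max_len):
        logits, state = decoder(y, state)
        tok = int(logits[0, -1].argmax())
        if tok == EOS:
            break
        words.append(idx_to_word.get(tok, "?"))
        y = torch.tensor([[tok]], device=device)
    return " ".join(words)
```

Untrained, this produces gibberish; the point is the wiring: the encoder's final `(h_n, c_n)` becomes the decoder's initial state, and decoding loops until `EOS` or `max_len`.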
```python
# Test translation
test_sentences = ["hello", "thank you", "good morning"]
for sentence in test_sentences:
    translation = translate(encoder, decoder, sentence,
                            input_vocab, output_vocab, device)
    print(f"{sentence} -> {translation}")
```
Notes
This is a minimal Seq2Seq implementation without attention. For real
translation tasks, you'd need:
Larger vocabulary and dataset
Attention mechanism
Beam search for decoding
Proper validation set
Handling unknown words with techniques like subword tokenization
(BPE)
Common Questions and Answers
1. Why
do we use tanh in RNN hidden states instead of ReLU?
Historically, $\tanh$ was preferred because it outputs values in
$[-1, 1]$, centering activations around zero and providing symmetric
gradients. This helps stabilize learning. ReLU can be used in RNNs, but
it may cause hidden states to grow unbounded. In practice, modern
variants like LSTMs and GRUs use a combination of $\tanh$ and sigmoid,
each serving different purposes: sigmoid for gates ($(0, 1)$ range for
soft switches) and $\tanh$ for cell state candidates ($[-1, 1]$ for
centered values).
2. How
does gradient clipping prevent exploding gradients?
Gradient clipping caps the norm of gradients during backpropagation.
If the gradient norm exceeds a threshold $\tau$, we rescale it:

$$g \leftarrow g \cdot \frac{\tau}{\|g\|}$$

This prevents parameters from making drastic updates that cause
numerical overflow.
However, clipping doesn't solve vanishing gradients — it's a workaround
for explosions. For vanishing gradients, architectural changes like LSTM
or GRU are necessary.
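In PyTorch, clipping is a single call placed between `backward()` and `optimizer.step()` (the tiny LSTM here is just a stand-in):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
out, _ = model(torch.randn(2, 5, 8))
out.sum().backward()

# Rescale all gradients in-place so their global norm is at most 1.0
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(float(total_norm))  # the norm measured before clipping
```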
3. What is
teacher forcing, and when should we use it?
Teacher forcing feeds the ground-truth token as input to the decoder
at each step during training, rather than the decoder's own prediction.
This accelerates training because the model learns faster with correct
inputs. However, at inference time, the model uses its own predictions,
creating a train-test mismatch. To mitigate this, use scheduled
sampling: gradually reduce teacher forcing ratio as training progresses,
forcing the model to learn to recover from its own mistakes.
4.
Can RNNs handle sequences of different lengths in a single batch?
Yes, but it requires padding and masking. Pad shorter sequences to
the length of the longest sequence in the batch using a special padding
token. During computation, mask the loss and attention for padded
positions so they don't contribute to gradients. PyTorch provides
`pack_padded_sequence` and `pad_packed_sequence` to efficiently handle
variable-length sequences.
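A minimal example with these utilities (two sequences of true lengths 5 and 3, padded to length 5):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

rnn = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 5, 4)           # batch of 2, padded to length 5
lengths = torch.tensor([5, 3])     # true lengths, sorted descending

packed = pack_padded_sequence(x, lengths, batch_first=True,
                              enforce_sorted=True)
packed_out, _ = rnn(packed)        # padded steps are skipped entirely
out, out_lens = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                   # (2, 5, 8); padded positions come back as zeros
print(out_lens)                    # tensor([5, 3])
```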
5. Why
do LSTMs have separate cell state and hidden state?
The cell state $c_t$ serves as long-term memory, passing information
across many time steps with minimal transformation. The hidden state
$h_t$ is a filtered version of the cell state, controlled by the output
gate. This separation allows the model to store raw information in
$c_t$ while selectively exposing relevant parts via $h_t$. It's
analogous to a computer's RAM (cell state) versus registers (hidden
state).
6. How do Bi-RNNs
differ from stacked RNNs?
Bi-RNNs process sequences in both forward and backward directions at
the same layer, capturing bidirectional context. Stacked RNNs add depth
by layering RNNs vertically, learning hierarchical features. You can
combine both: a 2-layer Bi-RNN has two bidirectional layers stacked on
top of each other, providing both depth and bidirectional context.
7. What
is the curse of long sequences in Seq2Seq models?
In vanilla Seq2Seq, the encoder compresses the entire input into a
fixed-size context vector. For long sequences, this bottleneck loses
critical information, causing the decoder to struggle. Attention
mechanisms solve this by allowing the decoder to access all encoder
hidden states, dynamically focusing on relevant parts of the input at
each decoding step.
8.
Why do we detach hidden states between batches during training?
Detaching hidden states prevents backpropagating gradients across
batch boundaries. If we didn't detach, gradients would flow through the
entire dataset, which is computationally infeasible and makes training
unstable. Detaching treats each batch as independent, though we still
pass hidden states forward to maintain sequence continuity within an
epoch.
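The pattern in a truncated-BPTT training loop looks like this (a sketch; the three "batches" are just random tensors):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
state = None
for batch in torch.randn(3, 2, 5, 4):      # 3 consecutive batches
    if state is not None:
        # Keep the values, cut the graph: no gradient flow across batches
        state = tuple(s.detach() for s in state)
    out, state = rnn(batch, state)
    out.sum().backward()                   # BPTT only within this batch
print(state[0].shape)                      # (1, 2, 8)
```

Without the `detach`, the second `backward()` would try to traverse the freed graph of the previous batch and raise an error.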
9. How does
temperature affect text generation?
Temperature $T$ scales the logits $z_i$ before applying softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Low temperature ($T < 1$) makes the distribution peaky, favoring
high-probability tokens (more conservative). High temperature
($T > 1$) flattens the distribution, increasing randomness (more
creative but less coherent). Setting $T = 1$ uses the model's raw
predictions.
10. Are RNNs
still relevant given Transformers' success?
Transformers dominate most NLP tasks due to parallelization and
better long-range dependency modeling. However, RNNs remain relevant in
resource-constrained settings (fewer parameters, lower memory), online
learning scenarios where sequences are processed incrementally, and
certain time-series tasks where sequential processing is natural.
Understanding RNNs also provides foundational knowledge for grasping
attention mechanisms and Transformer architectures.
Conclusion
Recurrent Neural Networks introduced the paradigm of sequential
processing with memory, enabling models to handle variable-length inputs
and capture temporal dependencies. While vanilla RNNs suffer from
vanishing gradients, gated architectures like LSTM and GRU overcome this
limitation by carefully controlling information flow. Bidirectional and
stacked RNNs extend these models to capture richer context and
hierarchical features. Sequence-to-sequence models enable transduction
tasks like machine translation, and attention mechanisms address the
bottleneck of fixed-size context vectors.
Despite the rise of Transformers, RNNs remain a cornerstone of deep
learning for sequences. The concepts of recurrence, hidden state, and
gradient flow through time underpin many modern architectures. By
mastering RNN fundamentals, you gain insight into how neural networks
process sequential data and the challenges involved in learning
long-term dependencies. Whether you're building language models,
translation systems, or time-series forecasters, RNNs provide a powerful
and intuitive framework for sequence modeling.
Post title:NLP (3): RNN and Sequence Modeling
Post author:Chen Kai
Create time:2024-02-14 10:15:00
Post link:https://www.chenk.top/en/nlp-rnn-sequence-modeling/
Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.