At its core, deep learning is just large-scale matrix computation.
Whether it's the simplest fully connected network or the complex
Transformer architecture, linear algebra is the mathematical foundation
that powers everything. Understanding this connection will help you
debug models, optimize performance, and design more efficient network
architectures.
Matrix Representation of Neural Networks
Starting with a Single Neuron
A neuron performs a remarkably simple operation: it receives multiple
inputs, computes a weighted sum, adds a bias, and passes the result
through an activation function. Mathematically:

$$y = f(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$$

This weighted sum is actually a vector inner product. If we write the
weights as a row vector $\mathbf{w}^\top$ and the inputs as a column
vector $\mathbf{x}$, then:

$$y = f(\mathbf{w}^\top \mathbf{x} + b)$$

One neuron = one inner product + one nonlinear transformation. That's
all there is to it.
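This can be checked directly as a dot product; a minimal sketch in PyTorch (the specific numbers and the ReLU activation are arbitrary choices for illustration):

```python
import torch

# A single neuron: inner product of weights and inputs, plus bias, then activation
x = torch.tensor([1.0, 2.0, 3.0])    # inputs
w = torch.tensor([0.5, -0.2, 0.1])   # weights
b = 0.3                              # bias

z = torch.dot(w, x) + b              # weighted sum: w^T x + b
y = torch.relu(z)                    # nonlinear activation (ReLU as an example)
print(y.item())                      # close to 0.7
```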
Matrix Form of a Layer
Now suppose we have $m$ neurons in a layer, each with its own weight
vector. Stacking these weight vectors as rows forms the weight matrix:

$$W = \begin{bmatrix} \mathbf{w}_1^\top \\ \mathbf{w}_2^\top \\ \vdots \\ \mathbf{w}_m^\top \end{bmatrix} \in \mathbb{R}^{m \times n}$$

The forward pass through this layer becomes:

$$\mathbf{y} = f(W\mathbf{x} + \mathbf{b})$$

where $\mathbf{y} \in \mathbb{R}^m$ is the layer's output,
$\mathbf{b} \in \mathbb{R}^m$ is the bias vector, and $f$ applies
element-wise to the vector.

Intuitive Understanding: The weight matrix $W$ maps the input from
$n$-dimensional space to $m$-dimensional space. This is a linear
transformation. The activation function $f$ introduces nonlinearity,
enabling the network to learn complex functions.
Batch Processing
In practice, we don't process samples one at a time. Given $N$ samples,
each an $n$-dimensional vector, we arrange them as rows in a matrix
$X \in \mathbb{R}^{N \times n}$. The batched forward pass becomes:

$$Y = f(XW^\top + \mathbf{1}\mathbf{b}^\top)$$

Here $\mathbf{1} \in \mathbb{R}^N$ is an all-ones vector, and
$\mathbf{1}\mathbf{b}^\top$ broadcasts the bias to every sample.
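The batched forward pass is one matrix multiply plus broadcasting; a small sketch (dimensions chosen arbitrarily):

```python
import torch

N, n, m = 4, 3, 2          # batch size, input dim, output dim
X = torch.randn(N, n)      # samples as rows
W = torch.randn(m, n)      # one weight row per neuron
b = torch.randn(m)

Y = torch.relu(X @ W.T + b)  # broadcasting adds b to every row
print(Y.shape)               # torch.Size([4, 2])
```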
A multi-layer neural network chains multiple transformations:

$$\mathbf{y} = f_L(W_L \, f_{L-1}(\cdots f_1(W_1 \mathbf{x} + \mathbf{b}_1) \cdots) + \mathbf{b}_L)$$

Without activation functions, this reduces to repeated matrix
multiplication $W_L W_{L-1} \cdots W_1 \mathbf{x}$, equivalent to a
single matrix. Without nonlinearity, deep networks are no more
expressive than shallow ones.
The Role of Nonlinearity: It breaks the closure of
matrix multiplication, enabling networks to approximate arbitrary
continuous functions (Universal Approximation Theorem).
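The collapse of stacked linear maps can be verified numerically; a sketch with two bias-free "layers":

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(5, 3)
W2 = torch.randn(2, 5)
x = torch.randn(3)

deep = W2 @ (W1 @ x)       # two "layers" without activation
shallow = (W2 @ W1) @ x    # one equivalent matrix
print(torch.allclose(deep, shallow, atol=1e-5))  # True
```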
Backpropagation in Matrix Form
Backpropagation is deep learning's core algorithm. From a linear
algebra perspective, it's the matrix version of the chain rule.
Single-Layer Gradients
Consider a layer $\mathbf{y} = f(W\mathbf{x} + \mathbf{b})$ with loss
function $L$. We need to compute $\partial L/\partial W$,
$\partial L/\partial \mathbf{b}$, and $\partial L/\partial \mathbf{x}$
(to pass to the previous layer).

Let $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ (pre-activation) and
$\mathbf{y} = f(\mathbf{z})$.

Step 1: Assume we already have $\boldsymbol{\delta}_y = \partial L/\partial \mathbf{y}$
(the gradient from the next layer).

Step 2: Backpropagate through the activation. If $f$ is element-wise:

$$\boldsymbol{\delta}_z = \boldsymbol{\delta}_y \odot f'(\mathbf{z})$$

where $\odot$ denotes element-wise multiplication and $f'$ is the
activation's derivative.

Step 3: Compute parameter gradients. This is the key matrix operation:

$$\frac{\partial L}{\partial W} = \boldsymbol{\delta}_z \mathbf{x}^\top, \qquad \frac{\partial L}{\partial \mathbf{b}} = \boldsymbol{\delta}_z$$

Step 4: Pass to the previous layer:

$$\frac{\partial L}{\partial \mathbf{x}} = W^\top \boldsymbol{\delta}_z$$

Intuitive Understanding: The appearance of $W^\top$ occurs because in
the forward pass, $W$ transforms $\mathbf{x}$ to $\mathbf{z}$. In
backpropagation, we need to traverse the "transposed path" backward.
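Steps 1-4 can be checked against autograd; a minimal sketch for a ReLU layer with a sum loss (chosen so that $\boldsymbol{\delta}_y$ is simply a vector of ones):

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x = torch.randn(4, requires_grad=True)

z = W @ x + b
y = torch.relu(z)
L = y.sum()              # toy loss
L.backward()

# Manual gradients following Steps 1-4
delta_y = torch.ones(3)              # dL/dy for a sum loss
delta_z = delta_y * (z > 0).float()  # Step 2: elementwise f'(z) for ReLU
grad_W = torch.outer(delta_z, x)     # Step 3: delta_z x^T
grad_b = delta_z                     # Step 3
grad_x = W.T @ delta_z               # Step 4: W^T delta_z

print(torch.allclose(grad_W, W.grad), torch.allclose(grad_x, x.grad))
```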
Batch Backpropagation
For batch data, the forward pass is $Z = XW^\top + \mathbf{1}\mathbf{b}^\top$, $Y = f(Z)$.

Let $\Delta_Y = \partial L/\partial Y$ be the loss gradient with respect
to the output, and $\Delta_Z = \Delta_Y \odot f'(Z)$ be the gradient
after backpropagating through the activation.

Weight gradients sum over all samples: $\partial L/\partial W = \Delta_Z^\top X$.
Bias gradients sum over the batch dimension: $\partial L/\partial \mathbf{b} = \Delta_Z^\top \mathbf{1}$.
Gradients to the previous layer: $\partial L/\partial X = \Delta_Z W$.

More generally, if $\mathbf{y} = g(\mathbf{x})$ where
$g: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix is defined as:

$$J_{ij} = \frac{\partial y_i}{\partial x_j}$$

The matrix form of the chain rule is the product of Jacobian matrices:

$$\frac{\partial L}{\partial \mathbf{x}} = J^\top \frac{\partial L}{\partial \mathbf{y}}$$

For a linear layer $\mathbf{z} = W\mathbf{x}$, the Jacobian is simply
$W$ itself. This explains why backpropagation uses $W^\top$.
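The claim that a linear layer's Jacobian is $W$ itself can be verified with torch.autograd.functional.jacobian:

```python
import torch
from torch.autograd.functional import jacobian

W = torch.randn(3, 5)

def linear(x):
    return W @ x

x = torch.randn(5)
J = jacobian(linear, x)        # numeric Jacobian of x -> Wx, shape (3, 5)
print(torch.allclose(J, W))    # True
```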
Matrix Form of Convolution
Convolutional Neural Networks (CNNs) are central to image processing.
While convolution appears different from matrix multiplication, it can
be converted to matrix multiplication.
One-Dimensional Convolution
The definition of 1D discrete convolution:In deep learning, we work with finite-length signals and
kernels:whereis a kernel of
lengthandis the input signal.
Matrix Form: 1D convolution can be expressed as
Toeplitz matrix multiplication. For inputand kernel, output length is.
Construct matrix:Then.
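A minimal sketch of the Toeplitz construction (the edge-detector kernel is an arbitrary example):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])  # n = 5
k = torch.tensor([1.0, 0.0, -1.0])           # K = 3, output length n - K + 1 = 3

# Each row of T is the kernel, shifted one position to the right
T = torch.zeros(3, 5)
for i in range(3):
    T[i, i:i + 3] = k

y = T @ x
print(y)   # tensor([-2., -2., -2.])
```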
Two-Dimensional Convolution

2D convolution for image processing:

$$Y[i, j] = \sum_{u=0}^{K_h - 1} \sum_{v=0}^{K_w - 1} K[u, v] \, X[i + u, j + v]$$
The im2col Trick: This is the most common
convolution implementation in deep learning frameworks. The core idea is
to convert convolution into matrix multiplication.
Steps:
Unfold Input: For each output position, unfold the corresponding input
patch (size $C_{in} \times K_h \times K_w$) into a column vector. All
these column vectors form the matrix
$X_{col} \in \mathbb{R}^{(C_{in} K_h K_w) \times (H_{out} W_{out})}$.

Unfold Kernel: Flatten each output channel's kernel into a row vector,
giving $W_{row} \in \mathbb{R}^{C_{out} \times (C_{in} K_h K_w)}$.

Matrix Multiplication: $Y = W_{row} X_{col}$

Reshape Output: Reshape the result to the output feature map shape
$(C_{out}, H_{out}, W_{out})$.
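These steps can be sketched with torch.nn.functional.unfold, which implements im2col; y1 and y2 below are the direct and im2col results compared by the difference check that follows:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)    # (N, C_in, H, W)
w = torch.randn(4, 3, 3, 3)    # (C_out, C_in, K_h, K_w)

# Direct convolution (stride 1, padding 1 keeps the 8x8 spatial size)
y1 = F.conv2d(x, w, padding=1)

# im2col: unfold patches into columns, flatten kernels into rows, then GEMM
X_col = F.unfold(x, kernel_size=3, padding=1)  # (1, 3*3*3, 8*8) = (1, 27, 64)
W_row = w.view(4, -1)                          # (4, 27)
y2 = (W_row @ X_col).view(1, 4, 8, 8)          # reshape to the feature map
```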
print(f"Difference: {(y1 - y2).abs().max().item():.6f}") # Should be close to 0
Why use im2col? Because GEMM (General Matrix
Multiply) has highly optimized implementations on both CPUs and GPUs
(e.g., cuBLAS). Although im2col increases memory usage (input is copied
multiple times), the speedup from efficient GEMM typically far outweighs
the memory overhead.
Transposed Convolution (Deconvolution)
Transposed convolution is used for upsampling, common in generative
models and semantic segmentation decoders.
Intuition: If forward convolution transforms an $H \times W$ image to
$H' \times W'$ (typically $H' < H$, $W' < W$), transposed convolution
does the reverse, mapping $H' \times W'$ back to $H \times W$.

Matrix Perspective: If forward convolution can be expressed as
$\mathbf{y} = C\mathbf{x}$ (where $C$ is a sparse matrix constructed
from the kernel), transposed convolution computes $C^\top \mathbf{y}$.

Note: Transposed convolution is not the inverse of convolution:
$C^\top C \neq I$ in general. It's only "reversed" in terms of matrix
dimensions.
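A shape-level sketch of this reversal (layer sizes are arbitrary; output_padding is needed here to recover the exact input size, and the values themselves are not recovered):

```python
import torch
import torch.nn as nn

down = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1,
                        output_padding=1)

x = torch.randn(1, 1, 8, 8)
h = down(x)   # (1, 1, 4, 4): downsampled
y = up(h)     # (1, 1, 8, 8): original shape restored, not original values
```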
Depthwise Separable Convolution
Standard convolution has $C_{in} \times C_{out} \times K \times K$
parameters. Depthwise separable convolution decomposes this into two
steps:

Depthwise Convolution: Each input channel is convolved independently,
without mixing channels. Parameters: $C_{in} \times K \times K$.

Pointwise Convolution: A $1 \times 1$ convolution that mixes channels.
Parameters: $C_{in} \times C_{out}$.

Total parameters: $C_{in} K^2 + C_{in} C_{out}$, much smaller than
standard convolution.
Matrix Perspective: This is a form of low-rank
factorization. The standard convolution weight tensor can be
approximately decomposed into the product of depthwise and pointwise
parts.
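The parameter counts printed in the comparison below can be reproduced directly; a sketch assuming 64 input channels, 128 output channels, and 3x3 kernels (the sizes implied by the printed numbers):

```python
import torch.nn as nn

# Standard 3x3 convolution: 64 -> 128 channels
standard_conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Depthwise (groups=C_in) followed by pointwise 1x1
depthwise_sep = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1),                       # pointwise
)
```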
print(f"Standard conv params: {sum(p.numel() for p in standard_conv.parameters())}") # 64 * 128 * 3 * 3 + 128 = 73856
print(f"Depthwise separable params: {sum(p.numel() for p in depthwise_sep.parameters())}") # 64 * 3 * 3 + 64 + 64 * 128 + 128 = 8960
Matrix Operations in Attention Mechanisms
Attention is one of the most important innovations in modern deep
learning. Its core is built entirely on matrix operations.
Intuition Behind Attention
Imagine searching for information in a library. You have a question
(Query), and the library has many books (Key-Value pairs). You first
compare your question with each book's keywords (Key) to find relevance,
then retrieve content (Value) weighted by that relevance.
What Attention Does: Given a query, compute its similarity with all
keys, then use these similarities as weights to compute a weighted sum
of the values.
Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Let's break down this formula:
Step 1: $QK^\top$ — Compute the similarity matrix

Assume $Q \in \mathbb{R}^{n \times d_k}$ ($n$ queries, each
$d_k$-dimensional) and $K \in \mathbb{R}^{m \times d_k}$ ($m$ keys).
$(QK^\top)_{ij}$ is the dot product between the $i$-th query and the
$j$-th key, representing their similarity.

Step 2: Divide by $\sqrt{d_k}$ — Scaling

Why scale? Assume the elements of $\mathbf{q}$ and $\mathbf{k}$ are
independent random variables with mean 0 and variance 1. Their dot
product $\mathbf{q}^\top \mathbf{k}$ has variance $d_k$. When $d_k$ is
large, dot products become large, causing softmax saturation (gradients
near zero). Dividing by $\sqrt{d_k}$ normalizes the variance to 1.
Step 3: softmax — Normalization
Apply softmax to each row (each query), obtaining attention weights.
Weights sum to 1, interpretable as a probability distribution.
Step 4: Multiply by $V$ — Weighted sum

Assume $V \in \mathbb{R}^{m \times d_v}$. The attention weight matrix is
$A \in \mathbb{R}^{n \times m}$, and the output is
$AV \in \mathbb{R}^{n \times d_v}$. Each output row is a weighted
average of all value vectors, with weights determined by attention.
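A minimal implementation of this formula, matching the scaled_dot_product_attention call in the usage example below (the optional mask argument anticipates the causal masking discussed later):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # similarity matrix
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V, weights
```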
# Example
batch, n_heads, seq_len, d_k = 2, 8, 10, 64
Q = torch.randn(batch, n_heads, seq_len, d_k)
K = torch.randn(batch, n_heads, seq_len, d_k)
V = torch.randn(batch, n_heads, seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")  # (2, 8, 10, 64)
print(f"Attention weights shape: {weights.shape}")  # (2, 8, 10, 10)
print(f"Row sum of weights: {weights[0, 0, 0].sum().item():.4f}")  # Should be 1
Multi-Head Attention
Single attention can only learn one "attention pattern." Multi-head
attention lets the model learn multiple different patterns
simultaneously:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Linear Algebra Perspective:
- $W_i^Q$, $W_i^K$, $W_i^V$ project the input into different subspaces
- Each head computes attention in its own subspace
- $W^O$ mixes all heads' outputs

Parameter Shapes:
- Input dimension $d_{model}$, number of heads $h$, per-head dimension $d_k = d_{model}/h$
- $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{model} \times d_k}$ (can be viewed as $h$ matrices of shape $d_{model} \times d_k$ concatenated into one $d_{model} \times d_{model}$ matrix)
- $W^O \in \mathbb{R}^{d_{model} \times d_{model}}$
Standard attention has $O(n^2 d)$ time and space complexity, where $n$
is the sequence length. This is because we need to compute the
$n \times n$ attention matrix.
For long sequences, this is a major bottleneck. Various efficient
attention variants (Sparse Attention, Linear Attention, FlashAttention,
etc.) aim to solve this problem.
FlashAttention is an algorithmic optimization that
speeds up standard attention through block-wise computation and reduced
GPU memory access, without changing the mathematical result.
Linear Algebra Interpretation of Transformers
Transformers are the foundational architecture for modern NLP and
multimodal AI. Let's fully interpret them from a linear algebra
perspective.
Transformer Encoder
One Encoder layer contains:
Multi-Head Self-Attention (covered above)
Feed-Forward Network (FFN): Two fully connected
layers + activation
Residual Connections
Layer Normalization
Feed-Forward Network:

$$\text{FFN}(\mathbf{x}) = W_2 \, \sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Typically $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and
$W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$, where $d_{ff} = 4 d_{model}$.
Intuition: FFN is position-wise — the same
transformation is applied independently to each position in the
sequence. It can be viewed as a two-layer mini-MLP that first expands
dimensions, introduces nonlinearity, then compresses back to the
original dimension.
Residual Connection:

$$\text{output} = \mathbf{x} + \text{Sublayer}(\mathbf{x})$$

Residual connections allow gradients to
"skip" complex transformations and flow directly backward, mitigating
vanishing gradients.
Transformer Decoder
The Decoder adds Cross-Attention compared to the
Encoder:
Query comes from the Decoder's previous layer output
Key and Value come from the Encoder's output
This allows the Decoder to "see" the Encoder-processed input.
Additionally, the Decoder's self-attention is causal: position $i$ can
only attend to positions $j \le i$. This is implemented via an
attention mask: masked positions are set to $-\infty$ before softmax,
making their weights zero.
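A minimal sketch of causal masking (sequence length 4, random scores):

```python
import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))   # lower-triangular: 1 = allowed
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)
# Row i now has nonzero weight only on positions j <= i;
# row 0 attends only to itself: weights[0] == [1, 0, 0, 0]
```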
Transformers lack recurrent structure, and self-attention is
position-agnostic (permutation equivariant). Positional encoding
provides position information to the model.
Sinusoidal Positional Encoding:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Linear Algebra Perspective: Each position is encoded as a
$d_{model}$-dimensional vector. Sine/cosine waves of different
frequencies allow the model to learn relative position relationships:
$PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.
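A minimal sketch of the encoding (computing the frequencies in log space is a standard numerical convenience):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    freq = torch.exp(torch.arange(0, d_model, 2).float()
                     * (-math.log(10000.0) / d_model))        # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)   # torch.Size([50, 16])
```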
Large Language Models (LLMs) have billions of parameters, making full
fine-tuning extremely expensive. LoRA (Low-Rank Adaptation) is an
efficient fine-tuning method based on low-rank matrix factorization.
Core Idea
Let the pre-trained model's weights be $W_0 \in \mathbb{R}^{d \times k}$.
Instead of directly modifying $W_0$, LoRA learns a low-rank update:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$,
and $r \ll \min(d, k)$.

Parameter Comparison:
- Full fine-tuning: $d \times k$ parameters
- LoRA: $r(d + k)$ parameters

For example, with $d = k = 4096$, full fine-tuning needs 16.77 million
parameters. LoRA with $r = 8$ only needs
$8 \times (4096 + 4096) = 65{,}536$ parameters — a 256x reduction.
Why Low-Rank Works
Research shows that weight changes during fine-tuning have low-rank
structure. That is, $\Delta W$'s effective rank is much smaller than
its dimensions. LoRA directly constrains $\Delta W$'s rank to at most
$r$, acting as regularization.
Intuition: Fine-tuning makes small adjustments in
the pre-trained model's feature space without restructuring the entire
space. Low-rank updates only adjust a low-dimensional subspace.
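A minimal sketch of such a wrapper; the class name LoRALinear matches the usage example below, and the alpha/r scaling and zero initialization of B (so training starts exactly at $W_0$) follow the LoRA paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W_0
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        # W_0 x + (alpha/r) * B A x, with only A and B trainable
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```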
# Usage example
base = nn.Linear(4096, 4096)
lora = LoRALinear(base, r=8)

print(f"Original params: {sum(p.numel() for p in base.parameters())}")  # 16781312
print(f"LoRA trainable params: {sum(p.numel() for p in lora.parameters() if p.requires_grad)}")  # 65536

x = torch.randn(2, 10, 4096)
y = lora(x)
print(f"Output shape: {y.shape}")  # (2, 10, 4096)
LoRA in Transformers
Typically, LoRA is applied to the attention layers' $W^Q$, $W^K$,
$W^V$, and $W^O$ projections. These are key positions for information
flow.
QLoRA and Other Variants
QLoRA: Combines quantization with LoRA. Base model
stored in 4-bit quantization, LoRA parameters in FP16/BF16, further
reducing memory.
DoRA: Decomposes weights into direction and
magnitude, adapting each with LoRA for better results.
AdaLoRA: Adaptively allocates the rank $r$ across layers, with more
important layers getting higher ranks.
Linear Algebra in Deep Learning Optimization
Weight Initialization
Good initialization is crucial for training. The goal is to prevent
signal explosion or vanishing during forward and backward passes.
Derivation Idea: Assume the input $x$ has variance $\sigma^2$. We want
the output $y$ to also have variance $\sigma^2$. This requires
$\mathrm{Var}(w) = 1/n_{in}$. Similarly, backpropagation requires
$\mathrm{Var}(w) = 1/n_{out}$. A compromise between the two gives
Xavier initialization: $\mathrm{Var}(w) = 2/(n_{in} + n_{out})$.

He Initialization (for ReLU):

$$\mathrm{Var}(w) = \frac{2}{n_{in}}$$

ReLU "kills" half the activations (negatives become 0), so the variance
is multiplied by 2 to compensate.
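The He scaling rule can be checked empirically; a sketch assuming a unit-variance input and measuring the post-ReLU second moment, which the $2/n_{in}$ factor is designed to keep near 1:

```python
import torch

torch.manual_seed(0)
n_in, n_out = 256, 256
x = torch.randn(5000, n_in)   # unit-variance input

W = torch.randn(n_out, n_in) * (2.0 / n_in) ** 0.5  # He initialization
y = torch.relu(x @ W.T)
print(y.pow(2).mean().item())  # roughly 1.0: signal scale preserved
```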
Gradient Problems and Singular Values
In deep networks, gradients must propagate through many layers. If
each layer's Jacobian has singular values greater than 1, gradients
explode; if less than 1, gradients vanish.
For a network $\mathbf{y} = f_L(f_{L-1}(\cdots f_1(\mathbf{x})))$, the
backpropagated gradient is:

$$\frac{\partial L}{\partial \mathbf{x}} = J_1^\top J_2^\top \cdots J_L^\top \frac{\partial L}{\partial \mathbf{y}}$$

The gradient norm is roughly the product of each layer's Jacobian
singular values.
Solutions:
Residual Connections: With $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$,
the Jacobian becomes $I + \partial F/\partial \mathbf{x}$, with
eigenvalues centered around 1

Normalization: Controls activation value range

Gradient Clipping: $\mathbf{g} \leftarrow \mathbf{g} \cdot \min\!\left(1, \frac{\tau}{\|\mathbf{g}\|}\right)$
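A toy sketch of the vanishing case: scale random Jacobians so their typical singular values sit below 1, then multiply their transposes through 50 layers (all sizes are arbitrary choices):

```python
import torch

torch.manual_seed(0)
depth, d = 50, 64
g = torch.randn(d)                       # gradient arriving at the output
for _ in range(depth):
    J = torch.randn(d, d) / (d ** 0.5) * 0.5  # typical singular values < 1
    g = J.T @ g                               # one layer of backpropagation
print(g.norm().item())                   # shrinks toward 0: vanishing gradient
```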
Exercises
Basic Exercises
1. Prove: For a linear layer $\mathbf{y} = W\mathbf{x}$, if $W$ is an
orthogonal matrix ($W^\top W = I$), then both the forward and backward
passes preserve vector norms.
2. Given an input $\mathbf{x}$ passing through a three-layer fully
connected network, write the shape of each layer's weight matrix and
compute the total parameter count.
3. Explain why in multi-head attention, each head's dimension is
typically $d_{model}/h$ (where $h$ is the number of heads) rather than
all heads using the full $d_{model}$.
4. On which dimensions do BatchNorm and LayerNorm compute the mean and
variance? For a convolutional feature map of shape $(N, C, H, W)$, what
are the shapes of their normalization statistics?
Advanced Exercises
5. Derive why dividing by $\sqrt{d_k}$ is necessary in scaled
dot-product attention. Assuming each element of $Q$ and $K$ is i.i.d.
from $\mathcal{N}(0, 1)$, compute the mean and variance of each element
of $QK^\top$.
6. For the im2col method:
- Given an input and kernel with stride=1 and padding=1, calculate the
shape of the im2col matrix
- Analyze the memory overhead of this method
7. Prove: If $W = BA$ (with $B \in \mathbb{R}^{d \times r}$ and
$A \in \mathbb{R}^{r \times k}$), then $\operatorname{rank}(W) \le r$.
8. Analyze the gradient flow in a ResNet residual block
$\mathbf{y} = \mathbf{x} + F(\mathbf{x})$. Prove that in
backpropagation, there exists a "shortcut" path where gradients can
flow directly to earlier layers without passing through $F$.
Programming Exercises
9. Implement a complete Transformer Encoder (with
multiple layers) and use it for a simple sequence classification
task.
10. Implement a simplified LoRA training pipeline: -
Load a small pre-trained model (e.g., a simple classifier) - Apply LoRA
to its linear layers - Fine-tune LoRA parameters on a new task - Compare
parameter counts and performance between full fine-tuning and LoRA
11. Implement and visualize the effects of different
normalization methods: - Generate a random batch - Apply BatchNorm,
LayerNorm, RMSNorm respectively - Visualize the feature distribution
changes before and after normalization
12. Analyze the computational complexity of a real
Transformer model (e.g., BERT-tiny or GPT-2-small): - Count the FLOPs
proportion of different components (attention, FFN, normalization, etc.)
- Analyze how sequence length affects computation - Plot computation vs.
sequence length curves
Thought Questions
13. Why do Transformers universally use LayerNorm
instead of BatchNorm? Analyze from perspectives of training stability,
variable sequence length, and parallel computation.
14. LoRA assumes fine-tuning weight changes are
low-rank. Under what circumstances might this assumption not hold? How
can you detect this?
15. If you want to migrate traditional 2D CNNs to
process 3D data (e.g., video or medical imaging), analyze how
computation would change from a matrix operations perspective.
Chapter Summary
Deep learning's core operations can all be described with linear
algebra:
Backpropagation: Matrix form of chain rule,
gradients flow through transposed matrices
Convolution: Can be converted to matrix
multiplication via im2col, leveraging efficient GEMM
Attention Mechanism: Query-key dot product computes
similarity, weighted sum of values
Transformer: Combination of attention + FFN +
residual + normalization
Normalization Layers: Standardize along different
dimensions, stabilizing training
LoRA: Low-rank matrix factorization for efficient
fine-tuning
Understanding these linear algebra foundations enables you to:
1. Better understand how models work
2. Efficiently implement and optimize models
3. Design new network architectures
4. Debug training issues (gradient vanishing/exploding, etc.)
References
Vaswani, A., et al. "Attention is All You Need." NeurIPS 2017.
Hu, E., et al. "LoRA: Low-Rank Adaptation of Large Language Models."
ICLR 2022.