Essence of Linear Algebra (16): Linear Algebra in Deep Learning
Chen Kai

At its core, deep learning is just large-scale matrix computation. Whether it's the simplest fully connected network or the complex Transformer architecture, linear algebra is the mathematical foundation that powers everything. Understanding this connection will help you debug models, optimize performance, and design more efficient network architectures.

Matrix Representation of Neural Networks

Starting with a Single Neuron

A neuron performs a remarkably simple operation: it receives multiple inputs, computes a weighted sum, adds a bias, and passes the result through an activation function. Mathematically:

$$h = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

This weighted sum is actually a vector inner product. If we write the weights as a row vector $\mathbf{w}^T = (w_1, \dots, w_n)$ and the inputs as a column vector $\mathbf{x} = (x_1, \dots, x_n)^T$, then:

$$h = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

One neuron = one inner product + one nonlinear transformation. That's all there is to it.
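This equivalence is easy to check numerically. The following minimal PyTorch sketch (dimensions chosen arbitrarily) computes one neuron's output both element-wise and as an inner product:

```python
import torch

torch.manual_seed(0)
w = torch.randn(5)          # weight vector
x = torch.randn(5)          # input vector
b = torch.tensor(0.5)       # bias

# Weighted sum as an inner product, then a nonlinearity
z = w @ x + b               # w^T x + b
h = torch.relu(z)           # sigma(z)

# Same value via the elementwise definition sum_i w_i x_i + b
z_manual = sum(w[i] * x[i] for i in range(5)) + b
assert torch.allclose(z, z_manual)
```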

Matrix Form of a Layer

Now suppose we have $m$ neurons in a layer, each with its own weight vector $\mathbf{w}_i \in \mathbb{R}^n$. Stacking these weight vectors as rows forms the weight matrix:

$$W = \begin{pmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \vdots \\ \mathbf{w}_m^T \end{pmatrix} \in \mathbb{R}^{m \times n}$$

The forward pass through this layer becomes:

$$\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$$

where $\mathbf{h} \in \mathbb{R}^m$ is the layer's output, $\mathbf{b} \in \mathbb{R}^m$ is the bias vector, and $\sigma$ applies element-wise to the vector.

Intuitive Understanding: The weight matrix $W$ maps the input from $n$-dimensional space to $m$-dimensional space. This is a linear transformation. The activation function $\sigma$ introduces nonlinearity, enabling the network to learn complex functions.

Batch Processing

In practice, we don't process samples one at a time. Given $N$ samples, each an $n$-dimensional vector, we arrange them as rows in a matrix $X \in \mathbb{R}^{N \times n}$. The batched forward pass becomes:

$$H = \sigma(XW^T + \mathbf{1}\mathbf{b}^T)$$

Here $\mathbf{1} \in \mathbb{R}^N$ is an all-ones vector, and $\mathbf{1}\mathbf{b}^T$ broadcasts the bias to every sample.

Why Batch Processing? GPUs excel at large-scale parallel computation. Single-sample operations can't fully utilize GPU parallelism. Batch processing increases the matrix multiplication size, maximizing hardware utilization.

import torch
import torch.nn as nn

# Define a simple fully connected layer
linear = nn.Linear(in_features=784, out_features=256)

# Single sample
x_single = torch.randn(784)
h_single = linear(x_single) # Output shape: (256,)

# Batch of samples
x_batch = torch.randn(32, 784) # 32 samples
h_batch = linear(x_batch) # Output shape: (32, 256)

print(f"Weight matrix shape: {linear.weight.shape}") # (256, 784)
print(f"Bias vector shape: {linear.bias.shape}") # (256,)

Multi-Layer Networks as Matrix Chains

A multi-layer neural network chains multiple transformations:

$$\mathbf{h} = \sigma(W_L \cdots \sigma(W_2 \, \sigma(W_1 \mathbf{x})))$$

Without activation functions, this reduces to repeated matrix multiplication:

$$W_L W_{L-1} \cdots W_1 \mathbf{x} = W_{\text{eff}} \, \mathbf{x}$$

which is equivalent to a single matrix $W_{\text{eff}}$. Without nonlinearity, deep networks are no more expressive than shallow ones.

The Role of Nonlinearity: It breaks the closure of matrix multiplication, enabling networks to approximate arbitrary continuous functions (Universal Approximation Theorem).
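The collapse of stacked linear maps is easy to see numerically. A small sketch (layer sizes chosen arbitrarily): two linear layers with no activation between them are exactly one matrix multiplication.

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(64, 128)   # first layer: R^128 -> R^64
W2 = torch.randn(32, 64)    # second layer: R^64 -> R^32
x = torch.randn(128)

# Two linear layers with no activation in between...
h = W2 @ (W1 @ x)

# ...are exactly one linear layer with W_eff = W2 W1
W_eff = W2 @ W1
h_collapsed = W_eff @ x

assert torch.allclose(h, h_collapsed, atol=1e-3)
```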

Backpropagation in Matrix Form

Backpropagation is deep learning's core algorithm. From a linear algebra perspective, it's the matrix version of the chain rule.

Single-Layer Gradients

Consider a layer $\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$ with loss function $L$. We need to compute $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial \mathbf{b}}$, and $\frac{\partial L}{\partial \mathbf{x}}$ (to pass to the previous layer).

Let $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ (the pre-activation) and $\mathbf{h} = \sigma(\mathbf{z})$.

Step 1: Assume we already have $\frac{\partial L}{\partial \mathbf{h}}$ (the gradient from the next layer).

Step 2: Backpropagate through the activation. If $\sigma$ is element-wise:

$$\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{h}} \odot \sigma'(\mathbf{z})$$

where $\odot$ denotes element-wise multiplication and $\sigma'$ is the activation's derivative.

Step 3: Compute parameter gradients. This is the key matrix operation:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{z}} \, \mathbf{x}^T, \qquad \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}}$$

Step 4: Pass to the previous layer:

$$\frac{\partial L}{\partial \mathbf{x}} = W^T \frac{\partial L}{\partial \mathbf{z}}$$

Intuitive Understanding: The appearance of $W^T$ occurs because in the forward pass, $W$ transforms $\mathbf{x}$ to $\mathbf{z}$. In backpropagation, we need to traverse the "transposed path" backward.

Batch Backpropagation

For batch data, the forward pass is $H = \sigma(Z)$ with $Z = XW^T + \mathbf{1}\mathbf{b}^T$, where the rows of $X \in \mathbb{R}^{N \times n}$ are samples.

Let $\frac{\partial L}{\partial H}$ be the loss gradient with respect to the output, and $\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial H} \odot \sigma'(Z)$ be the gradient after backpropagating through the activation.

Weight gradients sum over all samples:

$$\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial Z}\right)^T X$$

Bias gradients sum over the batch dimension:

$$\frac{\partial L}{\partial \mathbf{b}} = \sum_{i=1}^{N} \left(\frac{\partial L}{\partial Z}\right)_{i,:}$$

Gradients to the previous layer:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} \, W$$

import torch
import torch.nn.functional as F

# Manual implementation of forward and backward pass
class ManualLinear:
    def __init__(self, in_features, out_features):
        # Xavier initialization
        self.W = torch.randn(out_features, in_features) * (2 / (in_features + out_features)) ** 0.5
        self.b = torch.zeros(out_features)
        self.W.requires_grad = True
        self.b.requires_grad = True

    def forward(self, x):
        self.x = x  # Save for backprop
        self.z = x @ self.W.T + self.b
        self.h = F.relu(self.z)
        return self.h

    def backward(self, grad_h):
        # Backprop through ReLU
        grad_z = grad_h * (self.z > 0).float()

        # Parameter gradients
        grad_W = grad_z.T @ self.x  # (out, batch) @ (batch, in) = (out, in)
        grad_b = grad_z.sum(dim=0)

        # Gradient to previous layer
        grad_x = grad_z @ self.W  # (batch, out) @ (out, in) = (batch, in)

        # Store gradients
        self.W.grad = grad_W
        self.b.grad = grad_b

        return grad_x

# Verify
layer = ManualLinear(784, 256)
x = torch.randn(32, 784)
h = layer.forward(x)
grad_h = torch.randn_like(h)
grad_x = layer.backward(grad_h)

print(f"Input gradient shape: {grad_x.shape}")         # (32, 784)
print(f"Weight gradient shape: {layer.W.grad.shape}")  # (256, 784)

The Jacobian Matrix Perspective

More generally, if $\mathbf{y} = f(\mathbf{x})$ where $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix is defined as:

$$J_{ij} = \frac{\partial y_i}{\partial x_j}, \qquad J \in \mathbb{R}^{m \times n}$$

The matrix form of the chain rule is the product of Jacobian matrices:

$$\frac{\partial L}{\partial \mathbf{x}} = J^T \frac{\partial L}{\partial \mathbf{y}}$$

For a linear layer $\mathbf{y} = W\mathbf{x}$, the Jacobian is simply $W$ itself. This explains why backpropagation uses $W^T$.
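This claim can be checked directly with `torch.autograd.functional.jacobian` (a small sketch with arbitrary dimensions):

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
W = torch.randn(3, 4)
b = torch.randn(3)
x = torch.randn(4)

# Jacobian of the linear map y = Wx + b with respect to x
J = jacobian(lambda v: W @ v + b, x)

# For a linear layer, the Jacobian is the weight matrix itself
assert torch.allclose(J, W)
```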

Matrix Form of Convolution

Convolutional Neural Networks (CNNs) are central to image processing. While convolution appears different from matrix multiplication, it can be converted to matrix multiplication.

One-Dimensional Convolution

The definition of 1D discrete convolution:

$$(f * g)[n] = \sum_{m} f[m] \, g[n - m]$$

In deep learning, we work with finite-length signals and kernels (and, by convention, frameworks actually compute the cross-correlation form):

$$y[n] = \sum_{m=0}^{K-1} w[m] \, x[n + m]$$

where $w$ is a kernel of length $K$ and $x$ is the input signal.

Matrix Form: 1D convolution can be expressed as Toeplitz matrix multiplication. For input $\mathbf{x} \in \mathbb{R}^L$ and kernel $w \in \mathbb{R}^K$, the output length is $L - K + 1$ (stride 1, no padding).

Construct the matrix:

$$T = \begin{pmatrix} w_0 & w_1 & \cdots & w_{K-1} & 0 & \cdots & 0 \\ 0 & w_0 & w_1 & \cdots & w_{K-1} & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & \cdots & 0 & w_0 & w_1 & \cdots & w_{K-1} \end{pmatrix} \in \mathbb{R}^{(L-K+1) \times L}$$

Then $\mathbf{y} = T\mathbf{x}$.
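The Toeplitz construction can be verified against `F.conv1d`, which computes the same cross-correlation form (a small sketch with arbitrary sizes):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
L, K = 8, 3
x = torch.randn(L)
w = torch.randn(K)

# Build the (L-K+1) x L Toeplitz matrix: row i holds the kernel shifted by i
T = torch.zeros(L - K + 1, L)
for i in range(L - K + 1):
    T[i, i:i + K] = w

y_toeplitz = T @ x

# F.conv1d computes the same cross-correlation
y_conv = F.conv1d(x.view(1, 1, -1), w.view(1, 1, -1)).flatten()

assert torch.allclose(y_toeplitz, y_conv, atol=1e-5)
```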

Two-Dimensional Convolution

2D convolution for image processing:

$$Y[i, j] = \sum_{u=0}^{K_h - 1} \sum_{v=0}^{K_w - 1} W[u, v] \, X[i + u, j + v]$$

The im2col Trick: This is the most common convolution implementation in deep learning frameworks. The core idea is to convert convolution into matrix multiplication.

Steps:

  1. Unfold Input: For each output position, unfold the corresponding input patch (of size $C_{in} \times K_h \times K_w$) into a column vector. All these column vectors form the matrix $X_{\text{col}} \in \mathbb{R}^{(C_{in} K_h K_w) \times (H_{out} W_{out})}$.

  2. Unfold Kernel: Flatten each of the $C_{out}$ kernels into a row vector, forming $W_{\text{col}} \in \mathbb{R}^{C_{out} \times (C_{in} K_h K_w)}$.

  3. Matrix Multiplication: $Y_{\text{col}} = W_{\text{col}} X_{\text{col}}$

  4. Reshape Output: Reshape the result to the output feature map shape.

import torch
import torch.nn.functional as F

def im2col_naive(x, kernel_size, stride=1, padding=0):
    """
    Convert input image to column matrix
    x: (batch, channels, height, width)
    """
    batch, channels, h, w = x.shape
    kh, kw = kernel_size

    # Add padding
    if padding > 0:
        x = F.pad(x, [padding] * 4)

    _, _, h_pad, w_pad = x.shape
    out_h = (h_pad - kh) // stride + 1
    out_w = (w_pad - kw) // stride + 1

    # Extract all patches
    col = torch.zeros(batch, channels * kh * kw, out_h * out_w)

    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
            col[:, :, i*out_w+j] = patch.reshape(batch, -1)

    return col, out_h, out_w

def conv2d_via_im2col(x, weight, stride=1, padding=0):
    """
    Implement 2D convolution using im2col
    x: (batch, in_channels, h, w)
    weight: (out_channels, in_channels, kh, kw)
    """
    batch = x.shape[0]
    out_channels, in_channels, kh, kw = weight.shape

    # im2col
    col, out_h, out_w = im2col_naive(x, (kh, kw), stride, padding)

    # Flatten weights
    weight_col = weight.reshape(out_channels, -1)  # (out_channels, in_channels*kh*kw)

    # Matrix multiplication (broadcast over the batch dimension)
    out = weight_col @ col  # (batch, out_channels, out_h*out_w)

    # Reshape output
    out = out.reshape(batch, out_channels, out_h, out_w)

    return out

# Verify
x = torch.randn(2, 3, 8, 8)
weight = torch.randn(16, 3, 3, 3)

# Our implementation
y1 = conv2d_via_im2col(x, weight, padding=1)

# PyTorch
y2 = F.conv2d(x, weight, padding=1)

print(f"Difference: {(y1 - y2).abs().max().item():.6f}")  # Should be close to 0

Why use im2col? Because GEMM (General Matrix Multiply) has highly optimized implementations on both CPUs and GPUs (e.g., cuBLAS). Although im2col increases memory usage (input is copied multiple times), the speedup from efficient GEMM typically far outweighs the memory overhead.

Transposed Convolution (Deconvolution)

Transposed convolution is used for upsampling, common in generative models and semantic segmentation decoders.

Intuition: If forward convolution transforms an $n \times n$ image to $m \times m$ (typically $m < n$), transposed convolution maps $m \times m$ back to $n \times n$.

Matrix Perspective: If forward convolution can be expressed as $\mathbf{y} = C\mathbf{x}$ (where $C$ is a sparse matrix constructed from the kernel), transposed convolution computes $C^T \mathbf{y}$.

Note: Transposed convolution is not the inverse of convolution: $C^T C \neq I$ in general. It's only "reversed" in terms of matrix dimensions.
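Both points can be illustrated numerically (a small sketch; sizes chosen arbitrarily): transposed convolution restores the spatial shape without being an inverse, and it coincides with the vector-Jacobian product $C^T\mathbf{y}$ that autograd computes for convolution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, 3, 3)

# Forward convolution shrinks the spatial size: 8x8 -> 6x6
y = F.conv2d(x, w)
assert y.shape == (1, 1, 6, 6)

# Transposed convolution with the same kernel maps 6x6 back to 8x8...
x_up = F.conv_transpose2d(y, w)
assert x_up.shape == (1, 1, 8, 8)

# ...but it is not an inverse: the recovered values differ from x
assert not torch.allclose(x_up, x)

# Matrix view: for y = Cx, autograd's gradient with respect to x is C^T g,
# which is exactly a transposed convolution
x.requires_grad_(True)
(F.conv2d(x, w) * y.detach()).sum().backward()
assert torch.allclose(x.grad, F.conv_transpose2d(y.detach(), w), atol=1e-4)
```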

Depthwise Separable Convolution

Standard convolution has $C_{in} \cdot C_{out} \cdot K \cdot K$ parameters. Depthwise separable convolution decomposes this into two steps:

Depthwise Convolution: Each input channel is convolved independently, without mixing channels. Parameters: $C_{in} \cdot K \cdot K$.

Pointwise Convolution: A $1 \times 1$ convolution that mixes channels. Parameters: $C_{in} \cdot C_{out}$.

Total parameters: $C_{in}(K^2 + C_{out})$, much smaller than standard convolution's $C_{in} \cdot C_{out} \cdot K^2$.

Matrix Perspective: This is a form of low-rank factorization. The standard convolution weight tensor can be approximately decomposed into the product of depthwise and pointwise parts.

import torch
import torch.nn as nn

# Depthwise separable convolution
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # Depthwise: groups=in_channels makes each channel independent
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=padding, groups=in_channels)
        # Pointwise
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x

# Compare parameter counts
standard_conv = nn.Conv2d(64, 128, 3, padding=1)
depthwise_sep = DepthwiseSeparableConv(64, 128, 3, padding=1)

print(f"Standard conv params: {sum(p.numel() for p in standard_conv.parameters())}")
# 64 * 128 * 3 * 3 + 128 = 73856

print(f"Depthwise separable params: {sum(p.numel() for p in depthwise_sep.parameters())}")
# 64 * 3 * 3 + 64 + 64 * 128 + 128 = 8960

Matrix Operations in Attention Mechanisms

Attention is one of the most important innovations in modern deep learning. Its core is built entirely on matrix operations.

Intuition Behind Attention

Imagine searching for information in a library. You have a question (Query), and the library has many books (Key-Value pairs). You first compare your question with each book's keywords (Key) to find relevance, then retrieve content (Value) weighted by that relevance.

What Attention Does: Given a query, compute its similarity with all keys, then use these similarities as weights in a weighted sum of the values.

Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let's break down this formula:

Step 1: Compute the similarity matrix $QK^T$

Assume $Q \in \mathbb{R}^{n \times d_k}$ ($n$ queries, each $d_k$-dimensional) and $K \in \mathbb{R}^{m \times d_k}$ ($m$ keys). $(QK^T)_{ij}$ is the dot product between the $i$-th query and the $j$-th key, representing their similarity.

Step 2: Divide by $\sqrt{d_k}$ (scaling)

Why scale? Assume the elements of $Q$ and $K$ are independent random variables with mean 0 and variance 1. Their dot product then has variance $d_k$. When $d_k$ is large, dot products become large, causing softmax saturation (gradients near zero). Dividing by $\sqrt{d_k}$ normalizes the variance to 1.
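The variance argument is easy to check empirically (a small sketch; sample sizes arbitrary):

```python
import torch

torch.manual_seed(0)
for d_k in [16, 256]:
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)

    dots = (q * k).sum(dim=-1)       # raw query-key dot products
    scaled = dots / d_k ** 0.5       # after dividing by sqrt(d_k)

    # The raw variance grows like d_k; the scaled variance stays near 1
    print(f"d_k={d_k}: var(raw)={dots.var().item():.1f}, "
          f"var(scaled)={scaled.var().item():.2f}")
```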

Step 3: Apply softmax (normalization)

Apply softmax to each row (each query), obtaining attention weights. Each row's weights sum to 1, interpretable as a probability distribution.

Step 4: Multiply by $V$ (weighted sum)

Assume $V \in \mathbb{R}^{m \times d_v}$. The attention weight matrix is $A = \text{softmax}(QK^T / \sqrt{d_k}) \in \mathbb{R}^{n \times m}$, and the output is $AV \in \mathbb{R}^{n \times d_v}$.

Each output row is a weighted average of all value vectors, with weights determined by attention.

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention
    Q: (batch, n_heads, seq_len_q, d_k)
    K: (batch, n_heads, seq_len_k, d_k)
    V: (batch, n_heads, seq_len_k, d_v)
    mask: (batch, 1, 1, seq_len_k) or (batch, 1, seq_len_q, seq_len_k)
    """
    d_k = Q.size(-1)

    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: (batch, n_heads, seq_len_q, seq_len_k)

    # Apply mask (for causal attention in decoders)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # Softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)

    # Weighted sum
    output = torch.matmul(attn_weights, V)
    # output: (batch, n_heads, seq_len_q, d_v)

    return output, attn_weights

# Example
batch, n_heads, seq_len, d_k = 2, 8, 10, 64
Q = torch.randn(batch, n_heads, seq_len, d_k)
K = torch.randn(batch, n_heads, seq_len, d_k)
V = torch.randn(batch, n_heads, seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")              # (2, 8, 10, 64)
print(f"Attention weights shape: {weights.shape}")  # (2, 8, 10, 10)
print(f"Row sum of weights: {weights[0, 0, 0].sum().item():.4f}")  # Should be 1

Multi-Head Attention

Single attention can only learn one "attention pattern." Multi-head attention lets the model learn multiple different patterns simultaneously:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, \, KW_i^K, \, VW_i^V)$$

Linear Algebra Perspective:

  • $W_i^Q$, $W_i^K$, $W_i^V$ project the input into different subspaces
  • Each head computes attention in its own subspace
  • $W^O$ mixes all heads' outputs

Parameter Shapes:

  • Input dimension $d_{\text{model}}$, number of heads $h$, per-head dimension $d_k = d_{\text{model}} / h$
  • $W^Q, W^K, W^V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ (each can be viewed as $h$ matrices of shape $d_{\text{model}} \times d_k$ concatenated)
  • $W^O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0

        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Linear projection and split heads
        # (batch, seq_len, d_model) -> (batch, seq_len, n_heads, d_k) -> (batch, n_heads, seq_len, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Attention computation
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

        # Merge heads
        # (batch, n_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Output projection
        output = self.W_o(attn_output)

        return output, attn_weights

# Test
mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
output, weights = mha(x, x, x)  # Self-attention: Q=K=V=x
print(f"Output shape: {output.shape}")  # (2, 10, 512)

Computational Complexity of Attention

Standard attention has $O(n^2)$ time and space complexity, where $n$ is the sequence length. This is because we need to compute the $n \times n$ attention matrix.

For long sequences, this is a major bottleneck. Various efficient attention variants (Sparse Attention, Linear Attention, FlashAttention, etc.) aim to solve this problem.

FlashAttention is an algorithmic optimization that speeds up standard attention through block-wise computation and reduced GPU memory access, without changing the mathematical result.

Linear Algebra Interpretation of Transformers

Transformers are the foundational architecture for modern NLP and multimodal AI. Let's fully interpret them from a linear algebra perspective.

Transformer Encoder

One Encoder layer contains:

  1. Multi-Head Self-Attention (covered above)
  2. Feed-Forward Network (FFN): Two fully connected layers + activation
  3. Residual Connections
  4. Layer Normalization

Feed-Forward Network:

$$\text{FFN}(\mathbf{x}) = W_2 \, \text{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Typically $d_{ff} = 4 d_{\text{model}}$, where $W_1 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$.

Intuition: FFN is position-wise — the same transformation is applied independently to each position in the sequence. It can be viewed as a two-layer mini-MLP that first expands dimensions, introduces nonlinearity, then compresses back to the original dimension.

Residual Connection:

$$\mathbf{x}_{\text{out}} = \mathbf{x} + \text{Sublayer}(\mathbf{x})$$

Residual connections allow gradients to "skip" complex transformations and flow directly backward, mitigating vanishing gradients.

Transformer Decoder

The Decoder adds Cross-Attention compared to the Encoder:

  • Query comes from the Decoder's previous layer output
  • Key and Value come from the Encoder's output

This allows the Decoder to "see" the Encoder-processed input.

Additionally, the Decoder's self-attention is causal: position $i$ can only attend to positions $j \le i$. This is implemented via an attention mask:

$$\text{scores}_{ij} \leftarrow \begin{cases} \text{scores}_{ij} & j \le i \\ -\infty & j > i \end{cases}$$

Masked positions are set to $-\infty$ before softmax, making their weights zero.

def create_causal_mask(seq_len):
    """Create causal attention mask"""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, seq_len)

mask = create_causal_mask(5)
print(mask[0, 0])
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])

Positional Encoding

Transformers lack recurrent structure, and self-attention is position-agnostic (permutation equivariant). Positional encoding provides position information to the model.

Sinusoidal Positional Encoding:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Linear Algebra Perspective: Each position is encoded as a $d_{\text{model}}$-dimensional vector. Sine/cosine waves of different frequencies allow the model to learn relative position relationships: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1), :]

Complete Transformer Layer

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()

        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # FFN + residual
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))

        return x

BatchNorm and LayerNorm

Normalization layers are critical components in deep learning. They stabilize training by standardizing activation values.

Batch Normalization

Operation: For each feature $j$, compute the mean and variance over the batch dimension, then standardize:

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$$

where $\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}$ and $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{ij} - \mu_j)^2$.

After standardization, scale and shift with learnable parameters:

$$y_{ij} = \gamma_j \hat{x}_{ij} + \beta_j$$

Input Shape: For a convolutional layer with input $(N, C, H, W)$, BN is applied independently to each channel, with mean and variance computed over the $N \times H \times W$ elements.

Issues: BN depends on batch statistics. Small batches lead to poor estimates; inference requires running averages accumulated during training.

Layer Normalization

Operation: For each sample, compute the mean and variance over the feature dimension:

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$

where $\mu_i$ and $\sigma_i^2$ are computed over all features of the same sample.

Input Shape: For sequences of shape $(N, L, d)$, LN is applied independently to each position's $d$-dimensional vector.

Advantages: Doesn't depend on batch, suitable for small batches and sequence models (RNN, Transformer).

Matrix Perspective Difference:

  • BatchNorm: Standardizes each feature (column) along the batch dimension (rows)
  • LayerNorm: Standardizes each sample (row) along the feature dimension (columns)
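A small sketch making this concrete with plain tensor operations (dimensions arbitrary): the BN direction zeroes the column means, the LN direction zeroes the row means.

```python
import torch

torch.manual_seed(0)
X = torch.randn(32, 8) * 3 + 1  # (batch, features), deliberately shifted/scaled

# BatchNorm direction: standardize each feature (column) across the batch (rows)
bn = (X - X.mean(dim=0, keepdim=True)) / X.std(dim=0, keepdim=True, unbiased=False)

# LayerNorm direction: standardize each sample (row) across its features (columns)
ln = (X - X.mean(dim=1, keepdim=True)) / X.std(dim=1, keepdim=True, unbiased=False)

# BN zeroes the column means; LN zeroes the row means
assert torch.allclose(bn.mean(dim=0), torch.zeros(8), atol=1e-5)
assert torch.allclose(ln.mean(dim=1), torch.zeros(32), atol=1e-5)
```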

class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.eps = eps

        # Learnable parameters
        self.gamma = nn.Parameter(torch.ones(normalized_shape))
        self.beta = nn.Parameter(torch.zeros(normalized_shape))

    def forward(self, x):
        # x: (..., normalized_shape)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)

        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

RMSNorm

RMSNorm is a simplified version of LayerNorm that only uses the root mean square (RMS) for standardization, without subtracting the mean:

$$\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\text{RMS}(\mathbf{x})} \odot \mathbf{g}, \qquad \text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}$$

Advantages: Simpler computation, similar performance. Models like LLaMA use RMSNorm.

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

Parameter-Efficient Fine-Tuning: LoRA

Large Language Models (LLMs) have billions of parameters, making full fine-tuning extremely expensive. LoRA (Low-Rank Adaptation) is an efficient fine-tuning method based on low-rank matrix factorization.

Core Idea

Let the pre-trained model's weights be $W_0 \in \mathbb{R}^{d \times k}$. Instead of directly modifying $W_0$, LoRA learns a low-rank update:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

Parameter Comparison:

  • Full fine-tuning: $d \times k$ parameters
  • LoRA: $r(d + k)$ parameters

For example, with $d = k = 4096$, full fine-tuning needs 16.77 million parameters. LoRA with $r = 8$ only needs $8 \times (4096 + 4096) = 65{,}536$ parameters, a 256x reduction.

Why Low-Rank Works

Research shows that weight changes during fine-tuning have low-rank structure. That is, $\Delta W$'s effective rank is much smaller than its dimensions. LoRA directly constrains $\Delta W$'s rank to at most $r$, acting as regularization.

Intuition: Fine-tuning makes small adjustments in the pre-trained model's feature space without restructuring the entire space. Low-rank updates only adjust a low-dimensional subspace.
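The rank constraint itself is immediate to verify numerically (a small sketch with arbitrary $d$, $k$, $r$):

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 64, 8

B = torch.randn(d, r)
A = torch.randn(r, k)
delta_W = B @ A  # the LoRA update: a d x k matrix of rank at most r

# The numerical rank never exceeds r
rank = torch.linalg.matrix_rank(delta_W).item()
assert rank <= r

# Equivalently, all singular values beyond the r-th are numerically zero
s = torch.linalg.svdvals(delta_W)
assert s[r:].max() < 1e-4 * s[0]
```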

Implementation

class LoRALinear(nn.Module):
    def __init__(self, base_layer, r=8, alpha=16, dropout=0.1):
        """
        base_layer: Original nn.Linear layer
        r: LoRA rank
        alpha: Scaling factor
        """
        super().__init__()

        self.base_layer = base_layer
        self.r = r
        self.alpha = alpha

        in_features = base_layer.in_features
        out_features = base_layer.out_features

        # LoRA parameters
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

        # Freeze original weights
        for param in base_layer.parameters():
            param.requires_grad = False

    def forward(self, x):
        # Original output
        base_output = self.base_layer(x)

        # LoRA output
        # x: (..., in_features)
        # lora_A: (r, in_features)
        # lora_B: (out_features, r)
        lora_output = self.dropout(x) @ self.lora_A.T @ self.lora_B.T * self.scaling

        return base_output + lora_output

    def merge_weights(self):
        """Merge LoRA weights into original weights for inference"""
        self.base_layer.weight.data += (self.lora_B @ self.lora_A) * self.scaling

# Usage example
base = nn.Linear(4096, 4096)
lora = LoRALinear(base, r=8)

print(f"Original params: {sum(p.numel() for p in base.parameters())}")  # 16781312
print(f"LoRA trainable params: {sum(p.numel() for p in lora.parameters() if p.requires_grad)}")  # 65536

x = torch.randn(2, 10, 4096)
y = lora(x)
print(f"Output shape: {y.shape}")  # (2, 10, 4096)

LoRA in Transformers

Typically, LoRA is applied to the attention layers' $W^Q$, $W^K$, $W^V$, and $W^O$. These are key positions for information flow.

QLoRA and Other Variants

QLoRA: Combines quantization with LoRA. Base model stored in 4-bit quantization, LoRA parameters in FP16/BF16, further reducing memory.

DoRA: Decomposes weights into direction and magnitude, adapting each with LoRA for better results.

AdaLoRA: Adaptively allocates the rank $r$ across layers, with more important layers getting higher ranks.

Linear Algebra in Deep Learning Optimization

Weight Initialization

Good initialization is crucial for training. The goal is to prevent signal explosion or vanishing during forward and backward passes.

Xavier Initialization (for tanh/sigmoid):

$$W_{ij} \sim U\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \; \sqrt{\frac{6}{n_{in} + n_{out}}}\right]$$

Or the Gaussian version:

$$W_{ij} \sim \mathcal{N}\left(0, \; \frac{2}{n_{in} + n_{out}}\right)$$

Derivation Idea: Assume each input component has variance 1. We want each output component to also have variance 1, which requires $n_{in} \, \text{Var}(W_{ij}) = 1$. Similarly, backpropagation requires $n_{out} \, \text{Var}(W_{ij}) = 1$. Taking the average of both conditions gives $\text{Var}(W_{ij}) = \frac{2}{n_{in} + n_{out}}$.

He Initialization (for ReLU):

$$W_{ij} \sim \mathcal{N}\left(0, \; \frac{2}{n_{in}}\right)$$

ReLU "kills" half the activations (negatives become 0), so the variance is multiplied by 2 to compensate.
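A quick empirical check of the He recipe (a sketch with arbitrary width and depth): the activation scale stays bounded through many ReLU layers instead of exploding or vanishing.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1024, 512)  # batch of inputs with variance ~1

# Pass the signal through 20 ReLU layers initialized with He: Var(W) = 2/n_in
h = x
for _ in range(20):
    W = torch.randn(512, 512) * (2.0 / 512) ** 0.5
    h = torch.relu(h @ W.T)

# With He initialization, the activation scale stays bounded
print(f"activation std after 20 layers: {h.std():.3f}")
```

Replacing the factor `2.0` with `1.0` (plain Xavier-style scaling under ReLU) makes the signal shrink layer by layer instead.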

Gradient Problems and Singular Values

In deep networks, gradients must propagate through many layers. If each layer's Jacobian has singular values greater than 1, gradients explode; if less than 1, gradients vanish.

For a network $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$, the backpropagated gradient is:

$$\frac{\partial L}{\partial \mathbf{x}} = J_1^T J_2^T \cdots J_L^T \frac{\partial L}{\partial f_L}$$

The gradient norm is roughly bounded by the product of each layer's Jacobian singular values.

Solutions:

  1. Residual Connections: $f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$, so the Jacobian becomes $I + J_g$, with eigenvalues centered around 1
  2. Normalization: Controls the activation value range
  3. Gradient Clipping: $\mathbf{g} \leftarrow \mathbf{g} \cdot \min\left(1, \frac{\tau}{\|\mathbf{g}\|_2}\right)$, which caps the global gradient norm at a threshold $\tau$
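As an illustration of point 3, a minimal sketch using PyTorch's built-in `clip_grad_norm_` (model and loss chosen arbitrarily):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)

# A step with an artificially huge loss produces large gradients
x = torch.randn(4, 10)
loss = (model(x) * 1000).pow(2).mean()
loss.backward()

# clip_grad_norm_ rescales all gradients in place so that their global
# L2 norm is at most max_norm; it returns the norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"norm before: {norm_before:.1f}, after: {norm_after:.3f}")
```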

Exercises

Basic Exercises

1. Prove: For a linear layer $\mathbf{y} = W\mathbf{x}$, if $W$ is an orthogonal matrix ($W^T W = I$), then both the forward and backward passes preserve vector norms.

2. Given an input $\mathbf{x} \in \mathbb{R}^{n}$ passing through a three-layer fully connected network with layer widths $n \to m_1 \to m_2 \to m_3$: write the shape of each layer's weight matrix and compute the total parameter count (including biases).

3. Explain why in multi-head attention, each head's dimension is typically $d_{\text{model}} / h$ (where $h$ is the number of heads) rather than all heads using $d_{\text{model}}$.

4. On which dimensions do BatchNorm and LayerNorm compute mean and variance? For a convolutional feature map of shape $(N, C, H, W)$, what are the shapes of their normalization statistics?

Advanced Exercises

5. Derive why dividing by $\sqrt{d_k}$ is necessary in scaled dot-product attention. Assuming each element of $Q$ and $K$ is drawn i.i.d. from $\mathcal{N}(0, 1)$, compute the mean and variance of each element of $QK^T$.

6. For the im2col method, given an input of shape $(C_{in}, H, W)$ and a $K \times K$ kernel with stride 1 and padding 1: calculate the shape of the im2col matrix and analyze the memory overhead of this method.

7. Prove: If $W = BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, then $\text{rank}(W) \le r$.

8. Analyze the gradient flow in a ResNet residual block $\mathbf{y} = \mathbf{x} + F(\mathbf{x})$. Prove that in backpropagation, there exists a "shortcut" path where gradients can flow directly to earlier layers without passing through $F$.

Programming Exercises

9. Implement a complete Transformer Encoder (with multiple layers) and use it for a simple sequence classification task.

10. Implement a simplified LoRA training pipeline: load a small pre-trained model (e.g., a simple classifier), apply LoRA to its linear layers, fine-tune the LoRA parameters on a new task, and compare parameter counts and performance between full fine-tuning and LoRA.

11. Implement and visualize the effects of different normalization methods: generate a random batch, apply BatchNorm, LayerNorm, and RMSNorm respectively, and visualize the feature distributions before and after normalization.

12. Analyze the computational complexity of a real Transformer model (e.g., BERT-tiny or GPT-2-small): count the FLOPs share of each component (attention, FFN, normalization, etc.), analyze how sequence length affects computation, and plot computation vs. sequence length curves.

Thought Questions

13. Why do Transformers universally use LayerNorm instead of BatchNorm? Analyze from perspectives of training stability, variable sequence length, and parallel computation.

14. LoRA assumes fine-tuning weight changes are low-rank. Under what circumstances might this assumption not hold? How can you detect this?

15. If you want to migrate traditional 2D CNNs to process 3D data (e.g., video or medical imaging), analyze how computation would change from a matrix operations perspective.

Chapter Summary

Deep learning's core operations can all be described with linear algebra:

  • Fully Connected Layers: Matrix multiplication + nonlinear activation
  • Backpropagation: Matrix form of chain rule, gradients flow through transposed matrices
  • Convolution: Can be converted to matrix multiplication via im2col, leveraging efficient GEMM
  • Attention Mechanism: Query-key dot product computes similarity, weighted sum of values
  • Transformer: Combination of attention + FFN + residual + normalization
  • Normalization Layers: Standardize along different dimensions, stabilizing training
  • LoRA: Low-rank matrix factorization for efficient fine-tuning

Understanding these linear algebra foundations enables you to:

  1. Better understand how models work
  2. Efficiently implement and optimize models
  3. Design new network architectures
  4. Debug training issues (gradient vanishing/exploding, etc.)

References

  1. Vaswani, A., et al. "Attention is All You Need." NeurIPS 2017.
  2. Hu, E., et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
  3. Ba, J., Kiros, J., & Hinton, G. "Layer Normalization." arXiv 2016.
  4. Ioffe, S., & Szegedy, C. "Batch Normalization." ICML 2015.
  5. He, K., et al. "Deep Residual Learning for Image Recognition." CVPR 2016.
  6. Glorot, X., & Bengio, Y. "Understanding the difficulty of training deep feedforward neural networks." AISTATS 2010.

This is Chapter 16 of the 18-part "Essence of Linear Algebra" series.

  • Post title: Essence of Linear Algebra (16): Linear Algebra in Deep Learning
  • Post author: Chen Kai
  • Create time: 2019-03-22 14:30:00
  • Post link: https://www.chenk.top/chapter-16-linear-algebra-in-deep-learning/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated additionally.