When you adjust the shower water temperature, you're essentially doing the same thing as training a neural network — adjusting "parameters" (knob position) based on current "error" (water too cold or too hot). The only difference is that neural networks have millions of parameters, and the mathematical tool for adjusting them is matrix calculus.
Introduction: The Derivative Revolution from One to Many Dimensions
Remember the derivative you learned in high school? For a single-variable function, $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ tells you how fast $f$ changes as $x$ changes.
But when we face machine learning problems, things get complicated:
- Linear regression: minimize $\|X\mathbf{w} - \mathbf{y}\|^2$, where the unknown $\mathbf{w}$ is a vector or even a matrix
- Neural networks: minimize a loss $L(\theta)$, with millions of parameters in $\theta$
- Principal Component Analysis: maximize $\mathbf{w}^\top \Sigma \mathbf{w}$, subject to a constraint on $\mathbf{w}$

What these problems have in common is: the variable is no longer a single number, but a vector or matrix. We need a new mathematical language to describe "how one quantity changes when a matrix changes"— this is matrix calculus.
An Intuitive Analogy
Imagine you're on a mountain, holding a GPS that shows your altitude. In the one-dimensional case (you can only walk along one path), the derivative tells you "how much the altitude changes if you take one step forward." But on a two-dimensional mountain surface, you can walk in any direction — so what's the "slope" in each direction? What you need to know is no longer a single number, but a vector — it tells you "which direction to walk for the fastest descent, and what the descent rate is."
This vector is the gradient, the most fundamental and important concept in matrix calculus.
Scalar Derivatives with Respect to Vectors: The Gradient
Extending from One to Multiple Dimensions
Suppose you run a bubble tea shop, and profit $P$ depends on two variables: the drink price $p$ and the advertising investment $a$, so $P = P(p, a)$.
You want to know: if you slightly adjust the price or advertising investment, how will profit change?
Take partial derivatives with respect to each variable:
- $\partial P / \partial p$: how profit responds to a small price change
- $\partial P / \partial a$: how profit responds to a small change in advertising
"Packaging" these two partial derivatives into a vector gives us the gradient:
$$\nabla P = \begin{pmatrix} \partial P / \partial p \\ \partial P / \partial a \end{pmatrix}$$
Formal Definition
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of all partial derivatives:
$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)^\top$$
Three Geometric Meanings of the Gradient
The gradient is more than just a "collection of partial derivatives"— it has profound geometric meaning:
Directionality: The gradient points in the direction of fastest function increase. Back to the bubble tea shop example, the gradient direction tells you "how to simultaneously adjust price and advertising to maximize profit growth."
Magnitude: The norm of the gradient $\|\nabla f\|$ is the rate of increase in that steepest direction: a large norm means steep terrain; a small norm, nearly flat ground.
Orthogonality: The gradient is perpendicular to the function's level curves (level surfaces). This is like contour lines on a map — if you walk along a contour line, altitude stays constant; the gradient direction is perpendicular to contour lines, being the steepest "uphill" direction.
Example: Gradient of a Linear Function
Let $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} = \sum_i a_i x_i$. Then $\nabla f = \mathbf{a}$.
Intuitive Understanding: The linear function is an inclined plane with the same slope everywhere, so its gradient is the constant vector $\mathbf{a}$, independent of $\mathbf{x}$.
Example: Gradient of a Quadratic Function
Let $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$. Then $\nabla f = (A + A^\top)\mathbf{x}$, which simplifies to $2A\mathbf{x}$ when $A$ is symmetric.
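As a quick sanity check, the quadratic-form gradient $\nabla(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x}$ can be compared against central differences; a minimal numpy sketch (the random matrix and tolerances are illustrative):

```python
import numpy as np

# f(x) = x^T A x for a fixed (not necessarily symmetric) A
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda v: v @ A @ v
analytic = (A + A.T) @ x          # gradient formula for a general A

# central-difference estimate of each partial derivative
eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

assert np.allclose(analytic, numeric, atol=1e-6)
```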
Directional Derivative: Not Just Uphill and Downhill
On a mountain, you don't necessarily have to walk in the steepest direction. You can choose any direction — so what's the "slope" in that direction?
Definition
The directional derivative of function $f$ at $\mathbf{x}$ along a unit vector $\mathbf{u}$ is
$$D_{\mathbf{u}} f(\mathbf{x}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t\mathbf{u}) - f(\mathbf{x})}{t} = \nabla f(\mathbf{x})^\top \mathbf{u}$$
Geometric Intuition
The directional derivative formula $D_{\mathbf{u}} f = \nabla f^\top \mathbf{u} = \|\nabla f\| \cos\theta$, where $\theta$ is the angle between $\mathbf{u}$ and the gradient, covers three special cases:
- When $\theta = 0$ (walking along the gradient direction), $D_{\mathbf{u}} f = \|\nabla f\|$: fastest increase
- When $\theta = 90^\circ$ (walking along a level curve), $D_{\mathbf{u}} f = 0$: function value unchanged
- When $\theta = 180^\circ$ (walking opposite to the gradient), $D_{\mathbf{u}} f = -\|\nabla f\|$: fastest decrease
This is the theoretical foundation of gradient descent: to decrease the function value fastest, walk in the negative gradient direction.
Practical Application: Mountain Climbing Strategies
Suppose you're a blind mountain climber who can only sense the slope under your feet. Your strategies might be:
- Steepest ascent: always walk in the steepest uphill direction (the gradient direction)
- Contour strolling: walk horizontally, maintaining altitude (perpendicular to the gradient)
- Zigzag climbing: deviate slightly from the steepest direction so the climb is less tiring
In optimization algorithms, these strategies all have counterparts: gradient descent, constrained optimization, momentum-based gradient descent, etc.
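The steepest-descent counterpart is easy to demonstrate concretely; a minimal gradient-descent sketch on the bowl-shaped function $f(x, y) = x^2 + 4y^2$ (the step size and iteration count are illustrative):

```python
import numpy as np

def grad(p):                      # gradient of f(x, y) = x^2 + 4 y^2
    return np.array([2 * p[0], 8 * p[1]])

p = np.array([4.0, -2.0])         # starting point on the "mountain"
eta = 0.1                         # step size (learning rate)
for _ in range(200):
    p = p - eta * grad(p)         # walk in the negative gradient direction

assert np.linalg.norm(p) < 1e-6   # converges to the minimum at the origin
```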
Vector Derivatives with Respect to Vectors: The Jacobian Matrix
When the output is also a vector, things get more interesting.
Starting with a Life Example
Suppose you're cooking, and three seasoning amounts $x_1, x_2, x_3$ (say salt, sugar, and vinegar) determine three taste metrics $y_1, y_2, y_3$ (saltiness, sweetness, sourness).
Now the question is: if all seasoning amounts change slightly, how will the three taste metrics change?
We need a matrix to describe this, because there are $3 \times 3 = 9$ partial derivatives $\partial y_i / \partial x_j$ to keep track of.
Formal Definition
For a vector function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix collects all first-order partial derivatives:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
Dimensions: $J$ is an $m \times n$ matrix, with one row per output component and one column per input variable.
Geometric Meaning of the Jacobian
The Jacobian matrix gives the best linear approximation of the function near a point: $\mathbf{f}(\mathbf{x} + \Delta\mathbf{x}) \approx \mathbf{f}(\mathbf{x}) + J \Delta\mathbf{x}$.
Another Perspective: The Jacobian matrix describes how the function "deforms space." If you draw a small square in the input space, it becomes a small parallelogram after mapping through the function, and the Jacobian is exactly the linear map performing that deformation.
Classic Example: Polar Coordinate Transformation
The transformation from polar to Cartesian coordinates is $x = r\cos\theta$, $y = r\sin\theta$, with Jacobian
$$J = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}$$
Meaning of the Determinant: $\det J = r\cos^2\theta + r\sin^2\theta = r$ is the local area scaling factor. This is exactly why the area element in polar coordinates is $dA = r \, dr \, d\theta$.
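The area-scaling claim $\det J = r$ for the polar-coordinate Jacobian can be checked numerically; a short sketch (function name and sample point are illustrative):

```python
import numpy as np

def jacobian_polar(r, theta):
    # Jacobian of (x, y) = (r cos t, r sin t) with respect to (r, t)
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

r, theta = 2.5, 0.7
J = jacobian_polar(r, theta)
assert np.isclose(np.linalg.det(J), r)   # area scaling factor is exactly r
```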
The Hessian Matrix: Complete Description of Curvature
The gradient tells us the function's "slope," but it doesn't tell us whether the slope is changing. For that, we need second derivatives.
Review of the One-Dimensional Case
In single-variable calculus, the second derivative $f''(x)$ tells you whether the slope itself is increasing: $f'' > 0$ means the curve bends upward (a valley shape), $f'' < 0$ means it bends downward (a hill shape).
Multidimensional Case: The Hessian Matrix
For a multivariable function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix collects all second-order partial derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
Properties of the Hessian Matrix
Symmetry: If $f$ has continuous second partial derivatives, then $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$ (Schwarz's theorem), so $H$ is symmetric.
Curvature Description: The Hessian matrix describes the "bending degree" of the function surface. Imagine a bowl-shaped surface — the Hessian matrix tells you how "steep" the bowl is in various directions.
Second-Order Taylor Expansion
The Hessian matrix appears in the second-order Taylor expansion of a function:
$$f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \Delta\mathbf{x} + \frac{1}{2} \Delta\mathbf{x}^\top H \Delta\mathbf{x}$$
Classification of Critical Points
At a critical point (where $\nabla f = \mathbf{0}$), the definiteness of the Hessian determines the type of the point:
| Hessian Property | Critical Point Type | Intuitive Understanding |
|---|---|---|
| Positive definite | Local minimum | "Bowl bottom" in all directions |
| Negative definite | Local maximum | "Mountain top" in all directions |
| Indefinite | Saddle point | Goes up in some directions, down in others |
| Semi-definite | Needs higher-order analysis | "Flat" in some directions |
Example: Two-Dimensional Quadratic Function
Let $f(x, y) = x^2 + y^2$. Its Hessian is $H = \mathrm{diag}(2, 2)$, which is positive definite, so the origin is a local (in fact global) minimum: a bowl.
Now look at $f(x, y) = x^2 - y^2$. Its Hessian is $H = \mathrm{diag}(2, -2)$, which is indefinite, so the origin is a saddle point: uphill along $x$, downhill along $y$.
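The classification table can be applied mechanically by inspecting the signs of the Hessian's eigenvalues; a small sketch (the `classify` helper and tolerance are illustrative, not a standard API):

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a critical point from the symmetric Hessian's eigenvalues."""
    ev = np.linalg.eigvalsh(H)
    if np.all(ev > tol):
        return "local minimum"
    if np.all(ev < -tol):
        return "local maximum"
    if np.any(ev > tol) and np.any(ev < -tol):
        return "saddle point"
    return "needs higher-order analysis"

# f(x, y) = x^2 + y^2 -> H = diag(2, 2);  f(x, y) = x^2 - y^2 -> H = diag(2, -2)
assert classify(np.diag([2.0, 2.0])) == "local minimum"
assert classify(np.diag([2.0, -2.0])) == "saddle point"
```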
Scalar Derivatives with Respect to Matrices
In machine learning, parameters are often in matrix form. For example, the weight matrix $W$ of a neural network layer: how do we differentiate the loss with respect to an entire matrix?
Definition and Notation
For a scalar function $f(A)$ of a matrix $A \in \mathbb{R}^{m \times n}$, the derivative is the $m \times n$ matrix of entrywise partials:
$$\left( \frac{\partial f}{\partial A} \right)_{ij} = \frac{\partial f}{\partial A_{ij}}$$
Derivatives of the Trace Function
The matrix trace $\operatorname{tr}(A) = \sum_i A_{ii}$ appears constantly in matrix calculus, because many scalar objectives can be written as traces.
Basic Formulas:
$$\frac{\partial}{\partial A} \operatorname{tr}(AB) = B^\top, \qquad \frac{\partial}{\partial A} \operatorname{tr}(A^\top B) = B, \qquad \frac{\partial}{\partial A} \operatorname{tr}(A) = I$$
Proof Technique: Use the cyclic property of the trace, $\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)$, to move the matrix of interest to a convenient position.
Derivative of the Determinant
The derivative of the determinant with respect to the matrix has an elegant formula:
$$\frac{\partial \det(A)}{\partial A} = \det(A) \, A^{-\top}$$
Furthermore, for the log-determinant:
$$\frac{\partial \log \det(A)}{\partial A} = A^{-\top}$$
Derivative of the Inverse Matrix
When $A$ depends on a scalar parameter $t$, the derivative of the inverse is
$$\frac{\partial A^{-1}}{\partial t} = -A^{-1} \frac{\partial A}{\partial t} A^{-1}$$
Proof: Starting from $A A^{-1} = I$, differentiate both sides: $\frac{\partial A}{\partial t} A^{-1} + A \frac{\partial A^{-1}}{\partial t} = 0$, then solve for $\frac{\partial A^{-1}}{\partial t}$.
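The log-determinant identity $\partial \log\det(A) / \partial A = A^{-\top}$ can likewise be spot-checked entry by entry against finite differences; a sketch assuming a random positive definite $A$ (so the determinant stays positive):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)          # positive definite, so det(A) > 0

analytic = np.linalg.inv(A).T        # claimed value of d(log det A)/dA

# central-difference estimate of each entry's partial derivative
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(A)
        E[i, j] = eps
        numeric[i, j] = (np.log(np.linalg.det(A + E)) -
                         np.log(np.linalg.det(A - E))) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```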
The Chain Rule in Matrix Calculus
The chain rule is one of the most powerful tools in calculus. It tells us how to compute derivatives of composite functions.
Scalar Case Review
Let $y = f(u)$ and $u = g(x)$. Then $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$.
Vector Chain Rule
Let $y = f(\mathbf{u})$ be a scalar and $\mathbf{u} = g(\mathbf{x})$ a vector function. Then
$$\frac{\partial y}{\partial \mathbf{x}} = \left( \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \right)^\top \frac{\partial y}{\partial \mathbf{u}}$$
Dimension Analysis: with $\mathbf{x} \in \mathbb{R}^n$ and $\mathbf{u} \in \mathbb{R}^m$, the Jacobian $\partial \mathbf{u} / \partial \mathbf{x}$ is $m \times n$, its transpose is $n \times m$, and $\partial y / \partial \mathbf{u}$ is $m \times 1$, so the product is $n \times 1$, matching $\partial y / \partial \mathbf{x}$.
Vector-to-Vector Chain Rule
Let $\mathbf{z} = f(\mathbf{y})$ and $\mathbf{y} = g(\mathbf{x})$. The Jacobians simply multiply:
$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
An Intuitive Understanding
The essence of the chain rule is "propagation of small changes." Imagine pollutant spreading through a river:
- Upstream factory discharge changes by $\Delta x$ →
- midstream pollutant concentration changes by $\Delta y \approx \frac{\partial y}{\partial x} \Delta x$ →
- downstream ecological index changes by $\Delta z \approx \frac{\partial z}{\partial y} \Delta y$.
Finally: $\Delta z \approx \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \Delta x$. This is the chain rule! The "amplification factors" at each stage multiply to give the total amplification factor.
The Backpropagation Algorithm
Backpropagation is the core algorithm of deep learning — it's the efficient implementation of the chain rule on computation graphs.
Computation Graphs: Breaking Complex Functions into Simple Steps
Any complex mathematical expression can be broken down into combinations of basic operations. For example, $f(x, y) = (x + y) \cdot y$ decomposes into the two elementary steps $u = x + y$ and $f = u \cdot y$.
This decomposition forms a directed acyclic graph (DAG), called a computation graph.
Forward Pass and Backward Pass
Forward Pass: Starting from inputs, compute step by step through the computation graph to the output.
Backward Pass: Starting from output, compute gradients step by step backward through the computation graph.
The core idea of backpropagation is: each node only needs to know "the derivative of output with respect to itself" to compute "the derivative of output with respect to all its inputs".
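This node-local rule is easiest to see in a toy reverse-mode autodiff sketch: each node stores its parents together with the local derivative of itself with respect to each parent, and `backward` multiplies the upstream gradient by those local factors (a deliberately minimal design for illustration, not a framework API):

```python
class Value:
    """A scalar node in a computation graph with reverse-mode autodiff."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # (node, local_derivative) pairs

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self, upstream=1.0):
        # each node multiplies the upstream gradient by its local derivative
        self.grad += upstream
        for node, local in self._parents:
            node.backward(upstream * local)

# f(a, b) = (a + b) * b  ->  df/da = b, df/db = a + 2b
a, b = Value(2.0), Value(3.0)
f = (a + b) * b
f.backward()
assert a.grad == 3.0 and b.grad == 8.0
```

Note the `+=` in `backward`: when a variable (here $b$) feeds into several operations, the contributions from each path through the graph add up.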
Why is Backpropagation More Efficient Than Forward Mode?
For a function with $n$ inputs and $m$ outputs, forward-mode differentiation needs one pass per input ($n$ passes in total), while reverse mode needs one pass per output ($m$ passes).
Neural networks typically have millions of parameters ($n$ very large) but a single scalar loss ($m = 1$), so reverse mode computes all $n$ partial derivatives in one backward pass.
This is the magic of backpropagation — it reduces the cost of obtaining the full gradient from $O(n)$ passes to a single forward-plus-backward pass.
Backpropagation Through a Fully Connected Layer
Forward Pass: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$, then $\mathbf{a} = \sigma(\mathbf{z})$, where $\sigma$ is the activation function.
Backward Pass: Assuming we know $\frac{\partial L}{\partial \mathbf{a}}$ from the layers above:
Step 1: Through the activation function: $\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{a}} \odot \sigma'(\mathbf{z})$
Step 2: Derivative with respect to the weights: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{z}} \mathbf{x}^\top$
Step 3: Derivative with respect to the bias: $\frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}}$
Step 4: Derivative with respect to the input (passed to earlier layers): $\frac{\partial L}{\partial \mathbf{x}} = W^\top \frac{\partial L}{\partial \mathbf{z}}$
Common Activation Functions and Their Derivatives
ReLU (Rectified Linear Unit): $f(x) = \max(0, x)$, with derivative $f'(x) = 1$ for $x > 0$ and $0$ for $x < 0$.
Advantages: Simple computation, doesn't saturate (in positive region). Disadvantages: Zero gradient in negative region ("dying ReLU").
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$, with the convenient derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
Advantages: Output in $(0, 1)$, interpretable as a probability. Disadvantages: saturates at both ends, so gradients vanish for large $|x|$.
Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, with derivative $1 - \tanh^2(x)$.
Advantages: Output in $(-1, 1)$ and zero-centered, which often speeds up training. Disadvantages: still saturates at both ends.
Softmax and Cross-Entropy
The Softmax function converts any real-valued vector $\mathbf{z} \in \mathbb{R}^K$ into a probability distribution:
$$\hat{y}_i = \operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Cross-entropy loss measures the gap between the predicted and true distributions:
$$L = -\sum_{i=1}^{K} y_i \log \hat{y}_i$$
Important Simplification: The combination of Softmax + Cross-entropy has a very clean gradient:
$$\frac{\partial L}{\partial \mathbf{z}} = \hat{\mathbf{y}} - \mathbf{y}$$
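The clean gradient $\hat{\mathbf{y}} - \mathbf{y}$ is easy to confirm numerically; a short sketch (logits and label are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 1.0, 0.0])        # one-hot true label

loss = lambda v: -np.sum(y * np.log(softmax(v)))
analytic = softmax(z) - y            # claimed gradient dL/dz

eps = 1e-6
numeric = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-6)
```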
Basics of Convex Optimization
Why do we always pursue "convex" problems? Because for a convex problem, every local minimum is automatically a global minimum.
Convex Sets and Convex Functions
Convex Set: A set $C$ such that for any $\mathbf{x}, \mathbf{y} \in C$ and any $\theta \in [0, 1]$, the point $\theta \mathbf{x} + (1 - \theta)\mathbf{y}$ is also in $C$.
Intuitive Understanding: The line segment connecting any two points in a convex set lies entirely inside the set. A disk is convex; a crescent is not convex.
Convex Function: A function $f$ such that for all $\mathbf{x}, \mathbf{y}$ and $\theta \in [0, 1]$:
$$f(\theta \mathbf{x} + (1 - \theta)\mathbf{y}) \le \theta f(\mathbf{x}) + (1 - \theta) f(\mathbf{y})$$
Intuitive Understanding: The line segment connecting any two points on the function graph lies above the graph. A bowl is convex; a wave is not convex.
Equivalent Characterizations of Convex Functions
The following conditions are equivalent (for twice-differentiable functions):
1. The definition above (chords lie above the graph)
2. First-order condition: $f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x})$ for all $\mathbf{x}, \mathbf{y}$ (tangent planes lie below the graph)
3. Second-order condition: $\nabla^2 f(\mathbf{x}) \succeq 0$ everywhere (the Hessian is positive semi-definite)
If $\nabla^2 f \succ 0$ everywhere, $f$ is strictly convex (the converse need not hold: $f(x) = x^4$ is strictly convex but has $f''(0) = 0$).
Why is Convexity Important?
Theorem: Any local minimum of a convex function is a global minimum.
Proof Idea: Suppose $\mathbf{x}^*$ is a local minimum but some $\mathbf{y}$ satisfies $f(\mathbf{y}) < f(\mathbf{x}^*)$. By convexity, every point $\theta \mathbf{y} + (1 - \theta)\mathbf{x}^*$ on the segment has value at most $\theta f(\mathbf{y}) + (1 - \theta) f(\mathbf{x}^*) < f(\mathbf{x}^*)$ for $\theta \in (0, 1]$. Points arbitrarily close to $\mathbf{x}^*$ thus have strictly smaller value, contradicting local minimality.
This means for convex problems, any stationary point found by gradient descent is globally optimal!
Common Convex Functions
| Function | Convexity Condition |
|---|---|
| $\mathbf{a}^\top \mathbf{x} + b$ (affine) | Both convex and concave |
| $e^{ax}$ | Convex |
| $\mathbf{x}^\top A \mathbf{x}$ | Convex when $A \succeq 0$ |
| $-\log x$ | Convex |
| $\|\mathbf{x}\|_p$ for $p \ge 1$ | Convex |
| $\max(x_1, \ldots, x_n)$ | Convex |
KKT Conditions for Convex Optimization
For the constrained optimization problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \le 0, \quad h_j(\mathbf{x}) = 0$$
KKT Conditions (Karush-Kuhn-Tucker) are necessary conditions for optimality (and sufficient for convex problems):
1. Primal feasibility: $g_i(\mathbf{x}^*) \le 0$, $h_j(\mathbf{x}^*) = 0$
2. Dual feasibility: $\mu_i \ge 0$
3. Complementary slackness: $\mu_i g_i(\mathbf{x}^*) = 0$
4. Stationarity: $\nabla f(\mathbf{x}^*) + \sum_i \mu_i \nabla g_i(\mathbf{x}^*) + \sum_j \lambda_j \nabla h_j(\mathbf{x}^*) = \mathbf{0}$
Optimization Algorithms
Gradient Descent
Update Rule: $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)$, where $\eta > 0$ is the learning rate (step size).
Convergence: For convex functions, gradient descent converges to the global optimum. The convergence rate depends on the function's condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ of the Hessian: ill-conditioned (elongated) landscapes cause slow, zigzagging progress.
Learning Rate Selection: Too large will diverge, too small converges too slowly. Learning rate decay strategies are commonly used in practice.
Newton's Method
Update Rule: $\mathbf{x}_{t+1} = \mathbf{x}_t - H^{-1} \nabla f(\mathbf{x}_t)$, where $H = \nabla^2 f(\mathbf{x}_t)$ is the Hessian.
Intuitive Understanding: Newton's method approximates the original function with a quadratic function, then jumps to the extremum of the quadratic function in one step.
Advantages: Quadratic convergence (near the optimum, the error is roughly squared at each step), no need to choose a learning rate.
Disadvantages: Requires computing and inverting the $n \times n$ Hessian matrix, which costs $O(n^2)$ memory and $O(n^3)$ time per step; prohibitive when $n$ is in the millions.
Stochastic Gradient Descent (SGD)
When the objective function is a sum of losses over many samples, $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L_i(\theta)$, SGD replaces the full gradient with the gradient of a single randomly chosen sample (or a small mini-batch): $\theta_{t+1} = \theta_t - \eta \nabla L_i(\theta_t)$.
Advantages: High computational efficiency (only uses one sample per step), helps escape local minima.
Disadvantages: Noisy update direction, requires careful learning rate tuning.
Momentum
SGD's update direction can "jitter" a lot. Momentum smooths the update direction by accumulating historical gradients:
$$\mathbf{v}_{t+1} = \gamma \mathbf{v}_t + \eta \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \mathbf{v}_{t+1}$$
where $\gamma \in [0, 1)$ (typically around 0.9) controls how much history is retained.
Intuitive Understanding: Imagine a ball rolling down a hill. The ball has inertia and won't immediately change direction, but accumulates previous velocity. This way it can roll over small "bumps" and reach the bottom faster.
Adam Optimizer
Adam combines the advantages of momentum and adaptive learning rates:
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta \, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$$
Intuitive Understanding: $\mathbf{m}_t$ is a momentum-style moving average of the gradient, $\mathbf{v}_t$ tracks the squared gradient per coordinate, and dividing by $\sqrt{\hat{\mathbf{v}}_t}$ gives each parameter its own effective learning rate.
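The two moving averages take only a few lines to implement; a sketch of one Adam run on the bowl $f(x_1, x_2) = x_1^2 + 4x_2^2$, using the commonly cited default hyperparameters (the step size, iteration count, and tolerance are illustrative):

```python
import numpy as np

grad = lambda p: np.array([2 * p[0], 8 * p[1]])   # gradient of x1^2 + 4 x2^2

p = np.array([4.0, -2.0])
m = np.zeros(2)
v = np.zeros(2)
eta, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad(p)
    m = b1 * m + (1 - b1) * g            # first moment (momentum)
    v = b2 * v + (1 - b2) * g**2         # second moment (per-coordinate scale)
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    p = p - eta * m_hat / (np.sqrt(v_hat) + eps)

# ends up close to the minimum (a constant step size leaves a small residual)
assert np.linalg.norm(p) < 0.5
```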
Application Examples
Analytical Solution for Linear Regression
Objective Function: $L(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2$
Expansion: $L(\mathbf{w}) = \mathbf{w}^\top X^\top X \mathbf{w} - 2\mathbf{y}^\top X \mathbf{w} + \mathbf{y}^\top \mathbf{y}$
Gradient: $\nabla L = 2X^\top X \mathbf{w} - 2X^\top \mathbf{y}$
Optimal Solution (setting the gradient to zero): $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$
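The normal-equation solution $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ can be sanity-checked against numpy's least-squares solver; a sketch with synthetic data (the data shapes and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.standard_normal(50)   # noisy linear data

w_closed = np.linalg.solve(X.T @ X, X.T @ y)      # w* = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # library least squares

assert np.allclose(w_closed, w_lstsq)
```

In practice one solves the linear system rather than forming the explicit inverse, which is both cheaper and numerically safer.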
Ridge Regression: The Power of Regularization
When $X^\top X$ is singular or ill-conditioned, the plain least-squares solution is unstable or undefined. Ridge regression adds an L2 penalty:
Objective Function: $L(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$
Gradient: $\nabla L = 2X^\top X \mathbf{w} - 2X^\top \mathbf{y} + 2\lambda \mathbf{w}$
Optimal Solution: $\mathbf{w}^* = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$
Benefit: $X^\top X + \lambda I$ is always invertible for $\lambda > 0$, so the solution exists and is numerically stable.
Principal Component Analysis: Optimization Perspective
PCA can be formulated as the following optimization problem: find the unit direction of maximum variance,
$$\max_{\mathbf{w}} \ \mathbf{w}^\top \Sigma \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^\top \mathbf{w} = 1$$
Using Lagrange multipliers, $\mathcal{L} = \mathbf{w}^\top \Sigma \mathbf{w} - \lambda (\mathbf{w}^\top \mathbf{w} - 1)$; setting the gradient to zero gives $\Sigma \mathbf{w} = \lambda \mathbf{w}$, so the optimal $\mathbf{w}$ is the eigenvector of $\Sigma$ with the largest eigenvalue.
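The eigenvector condition $\Sigma \mathbf{w} = \lambda \mathbf{w}$ can be checked empirically: no random unit vector achieves more variance than the top eigenvector. A sketch (data generation is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])
Sigma = np.cov(data, rowvar=False)       # 3x3 sample covariance

evals, evecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
w = evecs[:, -1]                         # top eigenvector, unit norm

# the top eigenvector attains the maximum of the Rayleigh quotient
for _ in range(100):
    u = rng.standard_normal(3)
    u /= np.linalg.norm(u)
    assert u @ Sigma @ u <= w @ Sigma @ w + 1e-12
```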
Formula Quick Reference
Vector Derivatives
| Function | Derivative | Notes |
|---|---|---|
| $\mathbf{a}^\top \mathbf{x}$ | $\mathbf{a}$ | Linear |
| $\mathbf{x}^\top \mathbf{x}$ | $2\mathbf{x}$ | Squared norm |
| $\mathbf{x}^\top A \mathbf{x}$ | $(A + A^\top)\mathbf{x}$ | General quadratic form |
| $\|\mathbf{x}\|$ | $\mathbf{x} / \|\mathbf{x}\|$ | L2 norm |
Matrix Derivatives
| Function | Derivative | Notes |
|---|---|---|
| $\operatorname{tr}(AB)$ | $B^\top$ | Trace |
| $\operatorname{tr}(A^\top B)$ | $B$ | Trace |
| $\det(A)$ | $\det(A) \, A^{-\top}$ | Determinant |
| $\log \det(A)$ | $A^{-\top}$ | Log-determinant |
| $A^{-1}$ (w.r.t. scalar $t$) | $-A^{-1} \frac{\partial A}{\partial t} A^{-1}$ | Inverse matrix |
Exercises
Basic Problems
1. Compute the gradients of the following functions:
- (a)
3. Prove:
5. Prove that the Jacobian matrix of the Softmax function is
Advanced Problems
6. Prove:
7. Derive the complete backpropagation formulas for a two-layer neural network
8. Prove that Newton's method converges in one step for a quadratic function
9. Prove that any local minimum of a convex function is a global minimum.
10. Let
Application Problems
11. For Logistic regression objective function
12. For linear regression with L2 regularization:
Programming Problems
13. Implement a gradient checking function that compares analytical gradients with numerical gradients:
```python
def gradient_check(f, grad_f, x, epsilon=1e-5):
    ...
```
14. Implement from scratch a simple computation graph supporting automatic differentiation, with addition, multiplication, and ReLU operations.
15. Implement and compare the convergence trajectories of SGD, Momentum, and Adam on the quadratic function
16. Implement a two-layer neural network on the MNIST dataset with manually implemented backpropagation (without using automatic differentiation from deep learning frameworks).
Summary
Matrix calculus is the bridge connecting calculus and linear algebra, and is the mathematical foundation of machine learning and deep learning.
Key Points:
- Gradient is the derivative of a scalar function with respect to a vector, pointing in the direction of fastest function increase
- Jacobian matrix describes the linear approximation of a vector function
- Hessian matrix describes the curvature of a function, used to classify critical points
- Chain rule is the theoretical foundation of backpropagation
- Convex optimization guarantees that found extrema are globally optimal
After mastering these tools, you'll understand the core principles of modern deep learning frameworks and be able to design and analyze new optimization algorithms.
Preview of Next Chapter
"Sparse Matrices and Compressed Sensing"
- Mathematical principles of sparse representation
- Why L1 regularization promotes sparsity
- Compressed sensing theory
- RIP conditions and recovery guarantees
References
- Petersen & Pedersen, *The Matrix Cookbook*: comprehensive collection of matrix calculus formulas, essential reference
- Goodfellow et al., *Deep Learning*, Chapter 6: backpropagation algorithm in deep learning
- Boyd & Vandenberghe, *Convex Optimization*: classic textbook on convex optimization theory
- Nocedal & Wright, *Numerical Optimization*: authoritative reference on numerical optimization algorithms
This is Chapter 11 of the 18-part "Essence of Linear Algebra" series.
- Post title: Essence of Linear Algebra (11): Matrix Calculus and Optimization
- Post author: Chen Kai
- Create time: 2019-02-28 16:00:00
- Post link: https://www.chenk.top/chapter-11-matrix-calculus-and-optimization/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.