Essence of Linear Algebra (11): Matrix Calculus and Optimization
Chen Kai

When you adjust the shower water temperature, you're essentially doing the same thing as training a neural network — adjusting "parameters" (knob position) based on current "error" (water too cold or too hot). The only difference is that neural networks have millions of parameters, and the mathematical tool for adjusting them is matrix calculus.

Introduction: The Derivative Revolution from One to Many Dimensions

Remember the derivative from high school? It tells us the rate of change of a function at a point. This simple concept has driven the development of all modern science — from Newtonian mechanics to economic models.

But when we face machine learning problems, things get complicated:

  • Linear regression: $\min_w \|Xw - y\|^2$, where $w$ is a vector or even a matrix
  • Neural networks: $\min_\theta L(\theta)$, with millions of parameters
  • Principal Component Analysis: $\max_w w^T \Sigma w$, subject to the constraint $\|w\| = 1$

What these problems have in common is: the variable is no longer a single number, but a vector or matrix. We need a new mathematical language to describe "how one quantity changes when a vector or matrix changes" — this is matrix calculus.

An Intuitive Analogy

Imagine you're on a mountain, holding a GPS that shows your altitude. In the one-dimensional case (you can only walk along one path), the derivative tells you "how much the altitude changes if you take one step forward." But on a two-dimensional mountain surface, you can walk in any direction — so what's the "slope" in each direction? What you need is no longer a single number, but a vector — one that tells you "which direction to walk for the fastest descent, and what the descent rate is."

This vector is the gradient, the most fundamental and important concept in matrix calculus.

Scalar Derivatives with Respect to Vectors: The Gradient

Extending from One to Multiple Dimensions

Suppose you run a bubble tea shop, and profit $P$ depends on two factors: tea price $p$ and advertising investment $a$. That is, $P = f(p, a)$.

You want to know: if you slightly adjust the price or advertising investment, how will profit change?

Take partial derivatives with respect to each variable:

  • $\partial P / \partial p$: holding advertising fixed, how much profit changes when the price changes by 1 dollar
  • $\partial P / \partial a$: holding price fixed, how much profit changes when advertising changes by 1 dollar

"Packaging" these two partial derivatives into a vector gives us the gradient:

Formal Definition

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, its derivative with respect to the vector $x$ is called the gradient:

$$\nabla f(x) = \frac{\partial f}{\partial x} = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$$

Three Geometric Meanings of the Gradient

The gradient is more than just a "collection of partial derivatives"— it has profound geometric meaning:

Directionality: The gradient points in the direction of fastest function increase. Back to the bubble tea shop example, the gradient direction tells you "how to simultaneously adjust price and advertising to maximize profit growth."

Magnitude: The norm of the gradient $\|\nabla f\|$ is the maximum rate of increase (attained along the gradient direction). If the gradient is large, the function is "steep" at that point — small movements cause big changes.

Orthogonality: The gradient is perpendicular to the function's level curves (level surfaces). This is like contour lines on a map — if you walk along a contour line, altitude stays constant; the gradient direction is perpendicular to contour lines, being the steepest "uphill" direction.

Example: Gradient of a Linear Function

Let $f(x) = a^T x = \sum_i a_i x_i$. Partial derivative with respect to $x_i$: $\frac{\partial f}{\partial x_i} = a_i$. Therefore: $\nabla f = a$.

Intuitive Understanding: The linear function $f(x) = a^T x$ represents a hyperplane in space, and $a$ is exactly the normal direction of this plane, which is also the direction of fastest function increase.

Example: Gradient of a Quadratic Function

Let $f(x) = \|x\|^2 = \sum_i x_i^2$. Partial derivative with respect to $x_i$: $\frac{\partial f}{\partial x_i} = 2 x_i$. Therefore: $\nabla f = 2x$.

Intuitive Understanding: This function achieves its minimum value 0 at the origin, and the gradient $2x$ always points away from the origin (the direction of function increase). At the origin, the gradient is the zero vector — this is a characteristic of extremum points.
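Both gradients are easy to verify with finite differences. A quick sketch (the helper name here is illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

a = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 1.5, -1.0])

# Linear function a^T x: gradient should be a
assert np.allclose(numerical_gradient(lambda v: a @ v, x), a)

# Squared norm ||x||^2: gradient should be 2x
assert np.allclose(numerical_gradient(lambda v: v @ v, x), 2 * x)
```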

Directional Derivative: Not Just Uphill and Downhill

On a mountain, you don't necessarily have to walk in the steepest direction. You can choose any direction — so what's the "slope" in that direction?

Definition

The directional derivative of a function $f$ at point $x$ along direction $u$ (a unit vector) is:

$$D_u f(x) = \nabla f(x)^T u = \|\nabla f(x)\| \cos\theta$$

where $\theta$ is the angle between $u$ and $\nabla f(x)$.

Geometric Intuition

The directional derivative formula $D_u f = \|\nabla f\| \cos\theta$ tells us:

  • When $\theta = 0$ (walking along the gradient direction), $D_u f = \|\nabla f\|$: fastest increase
  • When $\theta = 90°$ (walking along a level curve), $D_u f = 0$: function value unchanged
  • When $\theta = 180°$ (walking opposite to the gradient), $D_u f = -\|\nabla f\|$: fastest decrease

This is the theoretical foundation of gradient descent: to decrease the function value fastest, walk in the negative gradient direction.
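For $f(x) = \|x\|^2$ (gradient $2x$, as above), the three cases are easy to check numerically (a small sketch):

```python
import numpy as np

def directional_derivative(grad, u):
    """D_u f = grad . u, where u must be a unit vector."""
    return grad @ u

x = np.array([3.0, 4.0])
grad = 2 * x                    # gradient of f(x) = ||x||^2 at x
norm = np.linalg.norm(grad)     # here: 10.0

u_up = grad / norm                               # theta = 0: along the gradient
u_level = np.array([-grad[1], grad[0]]) / norm   # theta = 90 degrees: along a level curve
u_down = -u_up                                   # theta = 180 degrees: opposite direction

assert np.isclose(directional_derivative(grad, u_up), norm)     # fastest increase
assert np.isclose(directional_derivative(grad, u_level), 0.0)   # unchanged
assert np.isclose(directional_derivative(grad, u_down), -norm)  # fastest decrease
```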

Practical Application: Mountain Climbing Strategies

Suppose you're a blind mountain climber who can only sense the slope under your feet. Your strategies might be:

  • Steepest ascent: always walk in the steepest uphill direction (the gradient direction)
  • Contour strolling: walk horizontally, maintaining altitude (perpendicular to the gradient)
  • Zigzag climbing: slightly deviate from the steepest direction, so it's not too tiring

In optimization algorithms, these strategies all have counterparts: gradient descent, constrained optimization, momentum-based gradient descent, etc.

Vector Derivatives with Respect to Vectors: The Jacobian Matrix

When the output is also a vector, things get more interesting.

Starting with a Life Example

Suppose you're cooking, and there are three seasoning amounts $x_1, x_2, x_3$ that jointly affect three dish metrics $y_1, y_2, y_3$: saltiness, sweetness, and spiciness.

Now the question is: if all seasoning amounts change slightly, how will the three taste metrics change?

We need a matrix to describe this, because there are $3 \times 3 = 9$ "input-to-output" influence relationships.

Formal Definition

For a vector function $f: \mathbb{R}^n \to \mathbb{R}^m$, its derivative with respect to the vector $x$ is the Jacobian matrix:

$$J = \frac{\partial f}{\partial x} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

Dimensions: $J$ is an $m \times n$ matrix. The element in row $i$, column $j$, $J_{ij} = \partial f_i / \partial x_j$, represents "the rate of change of the $i$-th output when the $j$-th input changes."

Geometric Meaning of the Jacobian

The Jacobian matrix describes the linear approximation of the function $f$ near $x$:

$$f(x + \Delta x) \approx f(x) + J \, \Delta x$$

In other words, the Jacobian matrix is the best coefficient matrix for "approximating a nonlinear function with a linear function."

Another Perspective: The Jacobian matrix describes how the function "deforms space." If you draw a small square in the input space, it becomes a small parallelogram after mapping through the function $f$. The Jacobian matrix describes this deformation.

Classic Example: Polar Coordinate Transformation

The transformation from polar to Cartesian coordinates is $x = r\cos\theta$, $y = r\sin\theta$, with Jacobian:

$$J = \frac{\partial (x, y)}{\partial (r, \theta)} = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}$$

Meaning of the Determinant:

$$\det J = r\cos^2\theta + r\sin^2\theta = r$$

This is exactly the famous Jacobian factor in polar coordinate integration! When changing from Cartesian coordinate integration to polar coordinate integration, the area element changes from $dx\,dy$ to $r\,dr\,d\theta$, and this $r$ is the Jacobian determinant.
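That $\det J = r$ regardless of the angle can be confirmed with a few lines (the function name is illustrative):

```python
import numpy as np

def polar_jacobian(r, theta):
    """Jacobian of the map (r, theta) -> (r cos(theta), r sin(theta))."""
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

# det J = r cos^2(theta) + r sin^2(theta) = r, independent of theta
for r in [0.5, 2.0, 7.0]:
    for theta in [0.0, 0.9, 2.5]:
        assert np.isclose(np.linalg.det(polar_jacobian(r, theta)), r)
```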

The Hessian Matrix: Complete Description of Curvature

The gradient tells us the function's "slope," but it doesn't tell us whether the slope is changing. For that, we need second derivatives.

Review of the One-Dimensional Case

In single-variable calculus, the second derivative $f''(x)$ tells us:

  • $f''(x) > 0$: the function graph is "concave up" (bowl-shaped); this point is a candidate for a local minimum
  • $f''(x) < 0$: the function graph is "concave down" (dome-shaped); this point is a candidate for a local maximum
  • $f''(x) = 0$: may be an inflection point; needs further analysis

Multidimensional Case: The Hessian Matrix

For a multivariable function $f: \mathbb{R}^n \to \mathbb{R}$, we need a matrix to describe all second partial derivatives:

$$H = \nabla^2 f = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$

Properties of the Hessian Matrix

Symmetry: If $f$'s second partial derivatives are continuous, then $H$ is symmetric (Schwarz's theorem). This means $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$.

Curvature Description: The Hessian matrix describes the "bending degree" of the function surface. Imagine a bowl-shaped surface — the Hessian matrix tells you how "steep" the bowl is in various directions.

Second-Order Taylor Expansion

The Hessian matrix appears in the second-order Taylor expansion of a function:

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T H(x) \, \Delta x$$

This expansion is the foundation for understanding optimization algorithms. The first term is the current function value, the second term is the linear approximation (gradient direction), and the third term is the quadratic correction (curvature effect).

Classification of Critical Points

At a critical point ($\nabla f(x^*) = 0$), the Taylor expansion simplifies to:

$$f(x^* + \Delta x) \approx f(x^*) + \frac{1}{2} \Delta x^T H \, \Delta x$$

The local behavior of the function is completely determined by the Hessian matrix:

| Hessian Property | Critical Point Type | Intuitive Understanding |
| --- | --- | --- |
| Positive definite ($H \succ 0$) | Local minimum | "Bowl bottom" in all directions |
| Negative definite ($H \prec 0$) | Local maximum | "Mountain top" in all directions |
| Indefinite | Saddle point | Goes up in some directions, down in others |
| Semi-definite | Needs higher-order analysis | "Flat" in some directions |

Example: Two-Dimensional Quadratic Function

Let $f(x, y) = x^2 + y^2$ and compute its gradient and Hessian matrix:

$$\nabla f = \begin{pmatrix} 2x \\ 2y \end{pmatrix}, \qquad H = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

The Hessian matrix $H$ is positive definite (all eigenvalues are 2), so $(0, 0)$ is a minimum point.

Now look at $g(x, y) = x^2 - y^2$ (saddle surface):

$$\nabla g = \begin{pmatrix} 2x \\ -2y \end{pmatrix}, \qquad H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}$$

The Hessian matrix eigenvalues are $2$ and $-2$ — indefinite, so $(0, 0)$ is a saddle point.
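The eigenvalue test in the table above can be turned into a small classifier (a sketch; the function name is illustrative):

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "needs higher-order analysis"

# f(x, y) = x^2 + y^2  ->  H = 2I, positive definite
assert classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])) == "local minimum"
# g(x, y) = x^2 - y^2  ->  eigenvalues 2 and -2, indefinite
assert classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])) == "saddle point"
```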

Scalar Derivatives with Respect to Matrices

In machine learning, parameters are often in matrix form — for example, the weight matrix $W$ in a neural network. We need to know "the derivative of the loss function with respect to the weight matrix."

Definition and Notation

For a scalar function $f(X)$, where $X$ is an $m \times n$ matrix, the derivative is defined as:

$$\left( \frac{\partial f}{\partial X} \right)_{ij} = \frac{\partial f}{\partial X_{ij}}$$

The result is a matrix with the same shape as $X$.

Derivatives of the Trace Function

The matrix trace $\operatorname{tr}(A) = \sum_i A_{ii}$ plays an important role in matrix calculus, because many scalar functions can be written in trace form.

Basic Formulas:

$$\frac{\partial}{\partial X} \operatorname{tr}(AX) = A^T, \qquad \frac{\partial}{\partial X} \operatorname{tr}(X^T A X) = (A + A^T) X$$

Proof Technique: Use the cyclic property of the trace, $\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)$, and $\operatorname{tr}(A) = \operatorname{tr}(A^T)$.

Derivative of the Determinant

The derivative of the determinant with respect to the matrix has an elegant formula:

$$\frac{\partial \det(X)}{\partial X} = \det(X) \, (X^{-1})^T = \operatorname{adj}(X)^T$$

where $\operatorname{adj}(X)$ is the adjugate matrix of $X$.

Furthermore, for the log-determinant:

$$\frac{\partial \log \det(X)}{\partial X} = (X^{-1})^T$$

This formula is very useful in statistical maximum likelihood estimation, especially when dealing with multivariate normal distributions.

Derivative of the Inverse Matrix

When $X$ is a function of a scalar $t$:

$$\frac{\partial X^{-1}}{\partial t} = -X^{-1} \frac{\partial X}{\partial t} X^{-1}$$

Proof: Starting from $X X^{-1} = I$, differentiate both sides with respect to $t$:

$$\frac{\partial X}{\partial t} X^{-1} + X \frac{\partial X^{-1}}{\partial t} = 0$$

Solve for $\frac{\partial X^{-1}}{\partial t}$ to get the result.

The Chain Rule in Matrix Calculus

The chain rule is one of the most powerful tools in calculus. It tells us how to compute derivatives of composite functions.

Scalar Case Review

Let $y = f(u)$ and $u = g(x)$; then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

This is the formula learned in introductory calculus. But when variables become vectors and matrices, things get a bit more complicated.

Vector Chain Rule

Let $u = g(x)$ (with $g: \mathbb{R}^n \to \mathbb{R}^m$) and $y = f(u)$ (with $f: \mathbb{R}^m \to \mathbb{R}$); then:

$$\frac{\partial y}{\partial x} = J_g^T \, \nabla_u y$$

where $J_g$ is the Jacobian matrix of $g$.

Dimension Analysis: $J_g^T$ is $n \times m$ and $\nabla_u y$ is $m \times 1$, so the product is $n \times 1$ — dimensions match!

Vector-to-Vector Chain Rule

Let $u = g(x)$ and $y = f(u)$, both vector-valued; then Jacobian matrices multiply:

$$J_{f \circ g}(x) = J_f(g(x)) \, J_g(x)$$

An Intuitive Understanding

The essence of the chain rule is "propagation of small changes." Imagine pollutant spreading through a river:

  1. Upstream factory discharge changes by $\Delta x$
  2. Midstream pollutant concentration changes by $\Delta u \approx \frac{\partial u}{\partial x} \Delta x$
  3. Downstream ecological index changes by $\Delta y \approx \frac{\partial y}{\partial u} \Delta u$

Finally:

$$\Delta y \approx \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x} \cdot \Delta x$$

This is the chain rule! The "amplification factors" at each stage multiply to give the total amplification factor.
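The "amplification factors multiply" claim can be checked numerically: pick two smooth stages (the particular functions below are made up for illustration), multiply their derivatives, and compare against a finite difference of the composition:

```python
import numpy as np

# Hypothetical smooth stages: u = g(x) (midstream), y = f(u) (downstream)
g = lambda x: x ** 2
f = lambda u: np.sin(u)

x0 = 1.3
du_dx = 2 * x0               # g'(x0)
dy_du = np.cos(g(x0))        # f'(u0) evaluated at u0 = g(x0)
dy_dx = dy_du * du_dx        # chain rule: amplification factors multiply

# Finite-difference check of the composed function f(g(x))
h = 1e-6
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)
assert np.isclose(dy_dx, numeric)
```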

The Backpropagation Algorithm

Backpropagation is the core algorithm of deep learning — it's the efficient implementation of the chain rule on computation graphs.

Computation Graphs: Breaking Complex Functions into Simple Steps

Any complex mathematical expression can be broken down into combinations of basic operations. For example, $f(x, y, z) = (x + y) \cdot z$ can be decomposed into:

1. $u = x + y$ (addition)
2. $f = u \cdot z$ (multiplication)

This decomposition forms a directed acyclic graph (DAG), called a computation graph.

Forward Pass and Backward Pass

Forward Pass: Starting from inputs, compute step by step through the computation graph to the output.

Backward Pass: Starting from output, compute gradients step by step backward through the computation graph.

The core idea of backpropagation is: each node only needs to know "the derivative of output with respect to itself" to compute "the derivative of output with respect to all its inputs".

Why is Backpropagation More Efficient Than Forward Mode?

For a function with $n$ inputs and $m$ outputs:

  • Forward mode requires $n$ passes (one pass per input variable)
  • Backward mode requires $m$ passes (one pass per output variable)

Neural networks typically have $n \gg m$ (millions of parameters, but the loss function is a scalar, so $m = 1$), so backpropagation only needs one pass to compute gradients for all parameters!

This is the magic of backpropagation — it reduces the number of passes from $O(n)$ to $O(1)$ (relative to the number of parameters $n$).

Backpropagation Through a Fully Connected Layer

Forward Pass:

$$z = Wx + b, \qquad a = \sigma(z)$$

where $\sigma$ is an element-wise activation function.

Backward Pass: Assume we know $\frac{\partial L}{\partial a}$ (coming from later layers).

Step 1: Through the activation function:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \odot \sigma'(z)$$

where $\odot$ is element-wise multiplication (the Hadamard product).

Step 2: Derivative with respect to the weights:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \, x^T$$

Step 3: Derivative with respect to the bias:

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}$$

Step 4: Derivative with respect to the input (passed to earlier layers):

$$\frac{\partial L}{\partial x} = W^T \frac{\partial L}{\partial z}$$
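Steps 1–4 can be bundled into a small forward/backward pair and verified against finite differences. A sketch with NumPy (sigmoid is chosen as $\sigma$ for concreteness; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_forward(W, b, x):
    z = W @ x + b
    return sigmoid(z), z

def fc_backward(W, x, z, dL_da):
    """Backward pass through a = sigmoid(Wx + b), given dL/da from later layers."""
    s = sigmoid(z)
    dL_dz = dL_da * s * (1 - s)    # Step 1: through the activation
    dL_dW = np.outer(dL_dz, x)     # Step 2: dL/dW = (dL/dz) x^T
    dL_db = dL_dz                  # Step 3
    dL_dx = W.T @ dL_dz            # Step 4: passed to earlier layers
    return dL_dW, dL_db, dL_dx

# Check dL/dW against a finite difference, for the toy loss L = sum(a)
rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)
a, z = fc_forward(W, b, x)
dL_dW, dL_db, dL_dx = fc_backward(W, x, z, np.ones(3))

h = 1e-6
W2 = W.copy()
W2[1, 2] += h
numeric = (fc_forward(W2, b, x)[0].sum() - a.sum()) / h
assert np.isclose(dL_dW[1, 2], numeric, atol=1e-4)
```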

Common Activation Functions and Their Derivatives

ReLU (Rectified Linear Unit):

$$\text{ReLU}(x) = \max(0, x), \qquad \text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

Advantages: Simple computation, doesn't saturate (in the positive region). Disadvantages: Zero gradient in the negative region ("dying ReLU").

Sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)(1 - \sigma(x))$$

Advantages: Output in $(0, 1)$, interpretable as a probability. Disadvantages: Easily saturates, causing vanishing gradients.

Tanh:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \qquad \tanh'(x) = 1 - \tanh^2(x)$$

Advantages: Output in $(-1, 1)$, zero-centered. Disadvantages: Still has the saturation problem.
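These derivative formulas are simple to implement and check against finite differences (a sketch; the function names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 101)
h = 1e-6
for fn, grad in [(sigmoid, sigmoid_grad), (np.tanh, tanh_grad)]:
    numeric = (fn(x + h) - fn(x - h)) / (2 * h)
    assert np.allclose(grad(x), numeric, atol=1e-6)

# ReLU: gradient is 0 for negative inputs, 1 for positive ones
assert relu_grad(np.array([-2.0, 3.0])).tolist() == [0.0, 1.0]
```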

Softmax and Cross-Entropy

The Softmax function converts any real-valued vector into a probability distribution:

$$p_i = \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Cross-entropy loss measures the gap between the predicted distribution $p$ and the true distribution $y$:

$$L = -\sum_i y_i \log p_i$$

Important Simplification: The combination of Softmax + Cross-entropy has a very clean gradient:

$$\frac{\partial L}{\partial z} = p - y$$

This clean form makes Softmax + Cross-entropy the standard choice for classification problems. It tells us: the gradient is just "predicted probability minus true probability" — intuitive and efficient.
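The clean $p - y$ gradient can be verified directly (a sketch with a one-hot label):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p, y):
    return -np.sum(y * np.log(p))

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])    # one-hot true label

p = softmax(z)
analytic = p - y                 # the claimed gradient dL/dz

# Finite-difference check of the composed loss L(softmax(z))
h = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    e = np.zeros_like(z)
    e[i] = h
    numeric[i] = (cross_entropy(softmax(z + e), y) - cross_entropy(softmax(z - e), y)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-5)
```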

Basics of Convex Optimization

Why do we always pursue "convex" problems? Because convex problems have only one extremum point, and it's the global optimum.

Convex Sets and Convex Functions

Convex Set: A set $C$ is convex if and only if for any $x, y \in C$ and $\lambda \in [0, 1]$, we have $\lambda x + (1 - \lambda) y \in C$.

Intuitive Understanding: The line segment connecting any two points in a convex set lies entirely inside the set. A disk is convex; a crescent is not convex.

Convex Function: A function $f$ is convex if and only if for any $x, y$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$$

Intuitive Understanding: The line segment connecting any two points on the function graph lies above the graph. A bowl is convex; a wave is not convex.

Equivalent Characterizations of Convex Functions

The following conditions are equivalent (for differentiable functions):

1. $f$ is a convex function
2. $f(y) \ge f(x) + \nabla f(x)^T (y - x)$ (the function lies above its tangent lines)
3. $\nabla^2 f(x) \succeq 0$ (the Hessian matrix is positive semi-definite)

For strictly convex functions, condition 3 becomes $\nabla^2 f(x) \succ 0$ (positive definite).

Why is Convexity Important?

Theorem: Any local minimum of a convex function is a global minimum.

Proof Idea: Assume $x^*$ is a local but not global minimum; then there exists $y$ such that $f(y) < f(x^*)$. By convexity, function values on the line segment between $x^*$ and $y$ are at most the weighted average of the endpoint values, so points arbitrarily close to $x^*$ along this segment have values below $f(x^*)$, contradicting that $x^*$ is a local minimum.

This means for convex problems, any extremum point found by gradient descent is globally optimal!

Common Convex Functions

| Function | Convexity Condition |
| --- | --- |
| $a^T x + b$ (affine function) | Both convex and concave |
| $\Vert x \Vert_p$ (norm, for $p \ge 1$) | Convex |
| $x^T A x$ (quadratic form) | Convex when $A \succeq 0$ |
| $e^{ax}$ | Convex |
| $-\log x$ ($x > 0$) | Convex |
| $x \log x$ ($x > 0$) | Convex |

KKT Conditions for Convex Optimization

For the constrained optimization problem:

$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \quad h_j(x) = 0$$

KKT Conditions (Karush-Kuhn-Tucker) are necessary conditions for optimality (and sufficient for convex problems):

  1. Primal feasibility: $g_i(x^*) \le 0$, $h_j(x^*) = 0$

  2. Dual feasibility: $\mu_i \ge 0$

  3. Complementary slackness: $\mu_i \, g_i(x^*) = 0$

  4. Stationarity: $\nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0$

Optimization Algorithms

Gradient Descent

Update Rule:

$$x_{k+1} = x_k - \eta \, \nabla f(x_k)$$

where $\eta$ is the learning rate.

Convergence: For convex functions, gradient descent converges to the global optimum. Convergence rate depends on the function's condition number; larger condition numbers mean slower convergence.

Learning Rate Selection: Too large will diverge, too small converges too slowly. Learning rate decay strategies are commonly used in practice.
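The update rule is a one-line loop. A minimal sketch on a convex quadratic (the function name is illustrative):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize the convex quadratic f(x, y) = x^2 + 10 y^2 (minimum at the origin)
grad = lambda v: np.array([2 * v[0], 20 * v[1]])
x_min = gradient_descent(grad, [5.0, 3.0], lr=0.05)
assert np.allclose(x_min, [0.0, 0.0], atol=1e-4)
```

Note the learning rate must respect the steepest curvature direction here (the $10 y^2$ term), which is why a conservative `lr=0.05` is used.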

Newton's Method

Update Rule:

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)$$

Intuitive Understanding: Newton's method approximates the original function with a quadratic function, then jumps to the extremum of the quadratic function in one step.

Advantages: Quadratic convergence (the error is roughly squared at each iteration), no need to choose a learning rate.

Disadvantages: Requires computing and inverting the Hessian matrix ($O(n^3)$ per step); may run toward saddle points or maxima for non-convex functions.
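For a quadratic, the second-order Taylor model is exact, so a single Newton step lands on the minimizer. A quick sketch:

```python
import numpy as np

# For f(x) = 1/2 x^T A x - b^T x: gradient = Ax - b, Hessian = A,
# so one Newton step from any x0 lands exactly on the minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite
b = np.array([1.0, -1.0])

x0 = np.array([10.0, -7.0])
grad = A @ x0 - b
x1 = x0 - np.linalg.solve(A, grad)       # Newton step: x - H^{-1} grad

assert np.allclose(x1, np.linalg.solve(A, b))   # the exact minimizer, in one step
```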

Stochastic Gradient Descent (SGD)

When the objective function is a sum of losses over many samples, $f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x)$:

Update Rule:

$$x_{k+1} = x_k - \eta \, \nabla f_{i_k}(x_k)$$

where $i_k$ is a randomly selected sample index.

Advantages: High computational efficiency (only uses one sample per step), helps escape local minima.

Disadvantages: Noisy update direction, requires careful learning rate tuning.

Momentum

SGD's update direction can "jitter" a lot. Momentum smooths the update direction by accumulating historical gradients:

$$v_{k+1} = \beta v_k - \eta \, \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1}$$

Intuitive Understanding: Imagine a ball rolling down a hill. The ball has inertia and won't immediately change direction, but accumulates previous velocity. This way it can roll over small "bumps" and reach the bottom faster.

Adam Optimizer

Adam combines the advantages of momentum and adaptive learning rates:

$$m_k = \beta_1 m_{k-1} + (1 - \beta_1) g_k, \qquad v_k = \beta_2 v_{k-1} + (1 - \beta_2) g_k^2$$

$$\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \qquad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}, \qquad x_{k+1} = x_k - \frac{\eta \, \hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}$$

Intuitive Understanding: $m_k$ is the exponential moving average of gradients (momentum), and $v_k$ is the exponential moving average of squared gradients (for adaptive learning rate adjustment). For parameters with large gradients, the effective learning rate becomes smaller; for parameters with small gradients, it becomes larger.
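The update equations translate almost line-for-line into code. A minimal sketch (hyperparameter values are the common defaults, not prescribed by this post):

```python
import numpy as np

def adam(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: momentum (m) plus per-parameter adaptive scaling (v)."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for k in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g           # EMA of gradients
        v = beta2 * v + (1 - beta2) * g ** 2      # EMA of squared gradients
        m_hat = m / (1 - beta1 ** k)              # bias correction
        v_hat = v / (1 - beta2 ** k)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Ill-conditioned quadratic f(x, y) = x^2 + 10 y^2; Adam homes in on the origin
grad = lambda w: np.array([2 * w[0], 20 * w[1]])
x_final = adam(grad, [5.0, 3.0])
assert np.linalg.norm(x_final) < 0.5
```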

Application Examples

Analytical Solution for Linear Regression

Objective Function:

$$L(w) = \|Xw - y\|^2$$

Expansion:

$$L(w) = w^T X^T X w - 2 y^T X w + y^T y$$

Gradient:

$$\nabla_w L = 2 X^T X w - 2 X^T y = 2 X^T (Xw - y)$$

Optimal Solution (setting the gradient to zero):

$$w^* = (X^T X)^{-1} X^T y$$

This is the famous Normal Equation.
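The normal equation is a direct linear solve. A sketch on synthetic noiseless data, where the recovered weights should match exactly:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless data, so the fit is exact

# Normal equation: w = (X^T X)^{-1} X^T y  (solve, rather than invert, X^T X)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w_hat, w_true)

# In practice, a least-squares solver avoids forming X^T X explicitly
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(w_lstsq, w_true)
```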

Ridge Regression: The Power of Regularization

When $X^T X$ is close to singular, the solution becomes very unstable. Ridge regression "stabilizes" the solution by adding a regularization term:

Objective Function:

$$L(w) = \|Xw - y\|^2 + \lambda \|w\|^2$$

Gradient:

$$\nabla_w L = 2 X^T (Xw - y) + 2 \lambda w$$

Optimal Solution:

$$w^* = (X^T X + \lambda I)^{-1} X^T y$$

Benefit: $X^T X + \lambda I$ is always invertible (a positive value is added to the diagonal), making the solution more stable.
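The stabilizing effect is visible with a deliberately degenerate design matrix (duplicated columns make $X^T X$ singular, yet the ridge solve still succeeds):

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Two identical columns: X^T X is singular, so the normal equation would fail
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

w = ridge_solution(X, y, lam=0.1)
assert np.all(np.isfinite(w))
# By symmetry, ridge splits the weight evenly across the duplicated columns
assert np.isclose(w[0], w[1])

# Larger lambda shrinks the solution toward zero
assert np.linalg.norm(ridge_solution(X, y, 100.0)) < np.linalg.norm(ridge_solution(X, y, 0.1))
```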

Principal Component Analysis: Optimization Perspective

PCA can be formulated as the following optimization problem:

$$\max_w \; w^T \Sigma w \quad \text{s.t.} \quad w^T w = 1$$

where $\Sigma$ is the data covariance matrix.

Using Lagrange multipliers:

$$\mathcal{L}(w, \lambda) = w^T \Sigma w - \lambda (w^T w - 1)$$

Differentiate with respect to $w$ and set to zero:

$$2 \Sigma w - 2 \lambda w = 0 \quad \Rightarrow \quad \Sigma w = \lambda w$$

This is exactly an eigenvalue problem! The optimal $w$ is an eigenvector of $\Sigma$, and the eigenvector with the largest eigenvalue $\lambda$ is the principal component direction.
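So the first principal component is just the top eigenvector of the covariance matrix. A sketch on data stretched along the $(1, 1)$ direction (the function name is illustrative):

```python
import numpy as np

def first_principal_component(X):
    """Top eigenvector of the covariance matrix of the (centered) data X."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    return eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

# 2-D data stretched along (1, 1)/sqrt(2), plus a little noise
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.outer(t, [1.0, 1.0]) / np.sqrt(2) + 0.01 * rng.normal(size=(200, 2))

w = first_principal_component(X)
expected = np.array([1.0, 1.0]) / np.sqrt(2)
assert np.isclose(abs(w @ expected), 1.0, atol=1e-3)   # aligned, up to sign
```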

Formula Quick Reference

Vector Derivatives

| Function | Derivative | Notes |
| --- | --- | --- |
| $a^T x$ | $a$ | Linear |
| $x^T x = \Vert x \Vert^2$ | $2x$ | Squared norm |
| $x^T A x$ | $(A + A^T) x$ | General quadratic form |
| $x^T A x$ | $2 A x$ | $A$ symmetric |
| $\Vert x \Vert$ | $x / \Vert x \Vert$ | L2 norm |

Matrix Derivatives

| Function | Derivative | Notes |
| --- | --- | --- |
| $\operatorname{tr}(AX)$ | $A^T$ | Trace |
| $\operatorname{tr}(X^T A X)$ | $(A + A^T) X$ | Trace |
| $\det(X)$ | $\det(X) (X^{-1})^T$ | Determinant |
| $\log \det(X)$ | $(X^{-1})^T$ | Log-determinant |
| $X^{-1}$ | $-X^{-1} \frac{\partial X}{\partial t} X^{-1}$ | Inverse matrix (derivative w.r.t. $t$) |

Exercises

Basic Problems

1. Compute the gradients of the following functions: (a), (b), (c)

2. Find all critical points of the given function and classify them (maximum, minimum, saddle point).

3. Prove:

4. Compute the gradient and Hessian matrix of $f(x) = x^T A x$ (assuming $A$ is symmetric).

5. Prove that the Jacobian matrix of the Softmax function is $\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j)$, where $\delta_{ij}$ is the Kronecker delta.

Advanced Problems

6. Prove the determinant derivative formula $\frac{\partial \det(X)}{\partial X} = \operatorname{adj}(X)^T$. Hint: Use the cofactor expansion of determinants and the adjugate matrix.

7. Derive the complete backpropagation formulas for a two-layer neural network.

8. Prove that Newton's method converges in one step for a quadratic function $f(x) = \frac{1}{2} x^T A x - b^T x$ with $A$ positive definite.

9. Prove that any local minimum of a convex function is a global minimum.

10. Let $f$ be $L$-smooth (i.e., $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$). Prove that gradient descent converges when the learning rate $\eta \le 1/L$.

Application Problems

11. For the logistic regression objective function:
   - (a) Compute the gradient
   - (b) Compute the Hessian matrix
   - (c) Prove that the objective is a convex function

12. For linear regression with L2 regularization:
   - (a) Derive the closed-form expression for the optimal solution
   - (b) Analyze the effect of the regularization parameter $\lambda$ on the solution
   - (c) Interpret regularization from a Bayesian perspective

Programming Problems

13. Implement a gradient checking function that compares analytical gradients with numerical gradients:

```python
def gradient_check(f, grad_f, x, epsilon=1e-5):
    """
    f: scalar function
    grad_f: gradient function
    x: evaluation point
    Returns: relative error between analytical and numerical gradients
    """
    # Your code here
    pass
```

14. Implement from scratch a simple computation graph supporting automatic differentiation, with addition, multiplication, and ReLU operations.

15. Implement and compare the convergence trajectories of SGD, Momentum, and Adam on a quadratic function. Visualize the optimization paths.

16. Implement a two-layer neural network on the MNIST dataset with manually implemented backpropagation (without using automatic differentiation from deep learning frameworks).

Summary

Matrix calculus is the bridge connecting calculus and linear algebra, and is the mathematical foundation of machine learning and deep learning.

Key Points:

  1. Gradient is the derivative of a scalar function with respect to a vector, pointing in the direction of fastest function increase
  2. Jacobian matrix describes the linear approximation of a vector function
  3. Hessian matrix describes the curvature of a function, used to classify critical points
  4. Chain rule is the theoretical foundation of backpropagation
  5. Convex optimization guarantees that found extrema are globally optimal

After mastering these tools, you'll understand the core principles of modern deep learning frameworks and be able to design and analyze new optimization algorithms.

Preview of Next Chapter

"Sparse Matrices and Compressed Sensing"

  • Mathematical principles of sparse representation
  • Why L1 regularization promotes sparsity
  • Compressed sensing theory
  • RIP conditions and recovery guarantees

References

  1. Petersen & Pedersen - The Matrix Cookbook
    • Comprehensive collection of matrix calculus formulas, essential reference
  2. Goodfellow et al. - Deep Learning, Chapter 6
    • Backpropagation algorithm in deep learning
  3. Boyd & Vandenberghe - Convex Optimization
    • Classic textbook on convex optimization theory
  4. Nocedal & Wright - Numerical Optimization
    • Authoritative reference on numerical optimization algorithms

This is Chapter 11 of the 18-part "Essence of Linear Algebra" series.

  • Post title:Essence of Linear Algebra (11): Matrix Calculus and Optimization
  • Post author:Chen Kai
  • Create time:2019-02-28 16:00:00
  • Post link:https://www.chenk.top/chapter-11-matrix-calculus-and-optimization/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stated otherwise.