In Google's PageRank algorithm, the web ranking problem was
transformed into a massive eigenvalue problem: finding the principal
eigenvector of a transition matrix. Behind this elegant mathematical
formulation lies the profound structure of linear algebra. Linear
algebra is not merely the language of machine learning — it is the key
to understanding the geometric structure of data.
Machine learning is fundamentally about finding optimal linear or
nonlinear transformations in high-dimensional spaces. From the simplest
linear regression (solving the normal equations $X^T X w = X^T y$), to complex
deep neural networks (chains of matrix multiplications), to principal
component analysis (eigenvalue decomposition), linear algebra is
everywhere. This chapter derives all the linear algebra tools needed for
machine learning from first principles.
Vector Spaces and
Linear Transformations
Axiomatic Definition of
Vector Spaces
Definition 1 (Vector Space): Let $V$ be a non-empty set and $\mathbb{F}$ be a field (typically $\mathbb{R}$ or $\mathbb{C}$). If two operations are defined on $V$:
Vector addition: $+ : V \times V \to V$
Scalar multiplication: $\cdot : \mathbb{F} \times V \to V$
satisfying the following 8 axioms, then $V$ is called a vector space over $\mathbb{F}$:
Addition axioms:
Commutativity: $u + v = v + u$
Associativity: $(u + v) + w = u + (v + w)$
Zero element: there exists $0 \in V$ such that $v + 0 = v$
Inverse element: for each $v \in V$, there exists $-v \in V$ such that $v + (-v) = 0$
Scalar multiplication axioms:
Distributivity 1: $a(u + v) = au + av$
Distributivity 2: $(a + b)v = av + bv$
Associativity: $a(bv) = (ab)v$
Identity: $1 \cdot v = v$
Common vector space examples:

| Space | Elements | Dimension | Application |
|---|---|---|---|
| $\mathbb{R}^n$ | $n$-dimensional real vectors | $n$ | ML feature vectors |
| $\mathbb{C}^n$ | $n$-dimensional complex vectors | $n$ | Quantum computing, signal processing |
| $\mathbb{R}^{m \times n}$ | $m \times n$ real matrices | $mn$ | Data matrices, weight matrices |
| $C[a, b]$ | Continuous functions on $[a, b]$ | $\infty$ | Function approximation |
| $P_n$ | Polynomials of degree $\le n$ | $n + 1$ | Polynomial regression |
Linear Independence and
Basis
Definition 2 (Linear Combination): A vector $v$ can be expressed as a linear combination of the vector set $\{v_1, \dots, v_k\}$ if there exist scalars $c_1, \dots, c_k$ such that:
$$v = c_1 v_1 + c_2 v_2 + \cdots + c_k v_k$$
Definition 3 (Linear Independence): A vector set $\{v_1, \dots, v_k\}$ is linearly independent if:
$$c_1 v_1 + c_2 v_2 + \cdots + c_k v_k = 0 \implies c_1 = c_2 = \cdots = c_k = 0$$
Otherwise, it is linearly dependent.
Theorem 1 (Equivalent Condition for Linear Dependence): A vector set $\{v_1, \dots, v_k\}$ is linearly dependent if and only if some vector can be expressed as a linear combination of the others.
Proof: ($\Rightarrow$) Suppose the set is linearly dependent; then there exist $c_1, \dots, c_k$ not all zero such that $c_1 v_1 + \cdots + c_k v_k = 0$. Without loss of generality, let $c_1 \neq 0$; then:
$$v_1 = -\frac{c_2}{c_1} v_2 - \cdots - \frac{c_k}{c_1} v_k$$
($\Leftarrow$) Suppose $v_1 = a_2 v_2 + \cdots + a_k v_k$; then $(-1) v_1 + a_2 v_2 + \cdots + a_k v_k = 0$, and the coefficients are not all zero, hence the set is linearly dependent. QED.
Definition 4 (Basis and Dimension): A basis of vector space $V$ is a linearly independent vector set $\{v_1, \dots, v_n\}$ such that any vector in $V$ can be uniquely expressed as a linear combination of these basis vectors. The number of basis vectors is called the dimension of the vector space, denoted $\dim V$.
Standard basis example:
Standard basis of $\mathbb{R}^3$: $e_1 = (1, 0, 0)^T$, $e_2 = (0, 1, 0)^T$, $e_3 = (0, 0, 1)^T$.
Any $v = (v_1, v_2, v_3)^T \in \mathbb{R}^3$ can be uniquely expressed as $v = v_1 e_1 + v_2 e_2 + v_3 e_3$.
Linear Transformations and
Matrices
Definition 5 (Linear Transformation): Let $V, W$ be two vector spaces. A mapping $T : V \to W$ is called a linear transformation (or linear map) if:
Additivity: $T(u + v) = T(u) + T(v)$
Homogeneity: $T(cv) = c\,T(v)$
Theorem 2 (Matrix Representation of Linear Transformations): Let $V$ and $W$ be $n$-dimensional and $m$-dimensional vector spaces respectively, with chosen bases $\{v_1, \dots, v_n\}$ and $\{w_1, \dots, w_m\}$. For a linear transformation $T : V \to W$, there exists a unique matrix $A \in \mathbb{R}^{m \times n}$ such that for any $v \in V$:
$$[T(v)]_W = A\,[v]_V$$
where $[v]_V$ and $[T(v)]_W$ are coordinate representations of vectors in their respective bases.
Proof:
Let $v = x_1 v_1 + \cdots + x_n v_n$, then:
$$T(v) = x_1 T(v_1) + \cdots + x_n T(v_n)$$
by linearity. Now, each $T(v_j)$ can be expressed as a linear combination of $W$'s basis:
$$T(v_j) = \sum_{i=1}^{m} a_{ij} w_i$$
Therefore:
$$T(v) = \sum_{j=1}^{n} x_j \sum_{i=1}^{m} a_{ij} w_i = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} a_{ij} x_j \right) w_i$$
This is exactly the $i$-th component of the matrix-vector multiplication $Ax$, where $A = (a_{ij})$. QED.
The $j$-th column of matrix $A$ is exactly the coordinates of $T(v_j)$ in $W$'s basis. This is the geometric meaning of matrix representation.
Four Fundamental Subspaces
For an $m \times n$ matrix $A$ with rank $r$, there are four fundamental subspaces that characterize its structure:
Definition 6 (Four Fundamental Subspaces):
Column Space: $\mathcal{C}(A) = \{Ax : x \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$
Dimension: $r = \operatorname{rank}(A)$
Geometric meaning: All vectors $Ax$ can reach
Null Space: $\mathcal{N}(A) = \{x \in \mathbb{R}^n : Ax = 0\}$
Dimension: $n - r$ (by the rank-nullity theorem)
Geometric meaning: All vectors mapped to zero by $A$
Row Space: $\mathcal{C}(A^T) = \{A^T y : y \in \mathbb{R}^m\} \subseteq \mathbb{R}^n$
Dimension: $r$
Geometric meaning: All vectors $A^T y$ can reach
Left Null Space: $\mathcal{N}(A^T) = \{y \in \mathbb{R}^m : A^T y = 0\}$
Dimension: $m - r$
Geometric meaning: Vectors orthogonal to the column space
Theorem 3 (Orthogonality of Fundamental Subspaces):
$\mathcal{N}(A) \perp \mathcal{C}(A^T)$ (null space orthogonal to row space)
$\mathcal{N}(A^T) \perp \mathcal{C}(A)$ (left null space orthogonal to column space)
Proof of property 1:
Let $x \in \mathcal{N}(A)$, i.e., $Ax = 0$. For any $u \in \mathcal{C}(A^T)$, there exists $y$ such that $u = A^T y$. Compute the inner product:
$$\langle x, u \rangle = x^T A^T y = (Ax)^T y = 0$$
Therefore $x \perp u$, i.e., $\mathcal{N}(A) \perp \mathcal{C}(A^T)$. QED.
Dimension relations of the fundamental subspaces:
$$\mathbb{R}^n = \mathcal{C}(A^T) \oplus \mathcal{N}(A), \qquad \mathbb{R}^m = \mathcal{C}(A) \oplus \mathcal{N}(A^T)$$
These two equations show that $\mathbb{R}^n$ and $\mathbb{R}^m$ are each decomposed into two
orthogonal subspaces.
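The four subspaces can be read off numerically from the SVD, with the left/right singular vectors splitting into bases for each pair. A minimal NumPy sketch (the matrix below is an arbitrary rank-2 example, not taken from the text):

```python
import numpy as np

# A 3x4 matrix of rank 2 (row 3 = row 1 + row 2): the four subspaces
# then have dimensions r=2, n-r=2, r=2, m-r=1.
A = np.array([[1., 2., 0., 1.],
              [2., 4., 1., 3.],
              [3., 6., 1., 4.]])
m, n = A.shape

U, s, Vt = np.linalg.svd(A)
tol = max(m, n) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))            # numerical rank

col_space  = U[:, :r]               # orthonormal basis of C(A)
left_null  = U[:, r:]               # orthonormal basis of N(A^T)
row_space  = Vt[:r, :].T            # orthonormal basis of C(A^T)
null_space = Vt[r:, :].T            # orthonormal basis of N(A)

print(r)                                      # 2
print(np.allclose(A @ null_space, 0))         # True: A maps N(A) to zero
print(np.allclose(left_null.T @ A, 0))        # True: N(A^T) ⟂ C(A)
```

The dimension relations $r + (n - r) = n$ and $r + (m - r) = m$ fall out of the column counts of these four bases.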
Inner Product Spaces and
Norms
Definition and
Properties of Inner Products
Definition 7 (Inner Product): Let $V$ be a real vector space. An inner product is a mapping $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:
Symmetry: $\langle u, v \rangle = \langle v, u \rangle$
Linearity: $\langle au + bw, v \rangle = a \langle u, v \rangle + b \langle w, v \rangle$
Positive definiteness: $\langle v, v \rangle \ge 0$, and $\langle v, v \rangle = 0 \iff v = 0$
Common inner products:
Standard inner product (Euclidean inner product): For $x, y \in \mathbb{R}^n$, $\langle x, y \rangle = x^T y = \sum_i x_i y_i$
Weighted inner product: Given a positive definite diagonal matrix $W$, $\langle x, y \rangle_W = x^T W y$
Function space inner product: For $f, g \in C[a, b]$, $\langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx$
Matrix Frobenius inner product: For $A, B \in \mathbb{R}^{m \times n}$, $\langle A, B \rangle_F = \operatorname{tr}(A^T B) = \sum_{i,j} a_{ij} b_{ij}$
Definition and Properties of
Norms
Definition 8 (Norm): A norm on vector space $V$ is a mapping $\|\cdot\| : V \to \mathbb{R}$ satisfying positive definiteness ($\|v\| \ge 0$, with equality iff $v = 0$), homogeneity ($\|cv\| = |c|\,\|v\|$), and the triangle inequality ($\|u + v\| \le \|u\| + \|v\|$).
Inner product induced norm: For inner product spaces, $\|v\| = \sqrt{\langle v, v \rangle}$.
Theorem 4 (Cauchy-Schwarz Inequality): For any $u, v$ in an inner product space,
$$|\langle u, v \rangle| \le \|u\| \, \|v\|$$
Equality holds if and only if $u$ and $v$ are linearly dependent.
Proof:
For any $t \in \mathbb{R}$, consider:
$$0 \le \|u + tv\|^2 = \|u\|^2 + 2t \langle u, v \rangle + t^2 \|v\|^2$$
This is a quadratic function in $t$, always non-negative. Therefore its discriminant satisfies:
$$4 \langle u, v \rangle^2 - 4 \|u\|^2 \|v\|^2 \le 0$$
i.e., $|\langle u, v \rangle| \le \|u\| \|v\|$. Equality holds if and only if there exists $t$ such that $u + tv = 0$, i.e., $u = -tv$ (linearly dependent). QED.
Corollaries of the Cauchy-Schwarz inequality:
Triangle inequality: $\|u + v\| \le \|u\| + \|v\|$
Proof:
$$\|u + v\|^2 = \|u\|^2 + 2\langle u, v \rangle + \|v\|^2 \le \|u\|^2 + 2\|u\|\|v\| + \|v\|^2 = (\|u\| + \|v\|)^2$$
Taking square roots gives the triangle inequality.
Cosine formula:
$$\cos\theta = \frac{\langle u, v \rangle}{\|u\| \, \|v\|}$$
where $\theta$ is the angle between $u$ and $v$. By the Cauchy-Schwarz inequality, $|\cos\theta| \le 1$.
The figure below visualizes the unit balls of different norms in $\mathbb{R}^2$: the $\ell_1$ norm unit ball is a diamond, the $\ell_2$ norm unit ball is a circle, and the $\ell_\infty$ norm unit ball is a square. These geometric shapes directly determine the behavior of corresponding regularization methods (Lasso, Ridge):
Matrix Norms
Common Matrix Norms
Definition 9 (Matrix Norm): A matrix norm is a norm defined on the matrix space $\mathbb{R}^{m \times n}$.
Common matrix norms:
Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\operatorname{tr}(A^T A)}$
Induced norm (operator norm): Induced by vector norms, $\|A\|_p = \max_{x \neq 0} \dfrac{\|Ax\|_p}{\|x\|_p}$
Submultiplicativity of induced norms: $\|AB\| \le \|A\| \, \|B\|$ (the Frobenius norm is not an induced norm, but it also satisfies submultiplicativity)
Compatibility with vector norms: $\|Ax\| \le \|A\| \, \|x\|$
Eigenvalues and Eigenvectors
Definition of Eigenvalue
Problem
Definition 10 (Eigenvalues and Eigenvectors): Let $A \in \mathbb{R}^{n \times n}$. If there exists a non-zero vector $v$ and a scalar $\lambda$ such that:
$$Av = \lambda v$$
then $\lambda$ is called an eigenvalue of $A$, and $v$ is the corresponding eigenvector.
Computing eigenvalues:
Eigenvalues satisfy the characteristic equation:
$$\det(A - \lambda I) = 0$$
This is an $n$-th degree polynomial equation in $\lambda$; the polynomial $p(\lambda) = \det(A - \lambda I)$ is called the characteristic polynomial.
Theorem 6 (Properties of Eigenvalues):
$\sum_i \lambda_i = \operatorname{tr}(A)$ (sum of eigenvalues equals trace)
$\prod_i \lambda_i = \det(A)$ (product of eigenvalues equals determinant)
$A$ is invertible $\iff$ all eigenvalues are non-zero
Eigenvalues of $A^k$ are $\lambda_i^k$
Similar matrices have the same eigenvalues
Proof of properties 1 and 2:
The leading coefficient of the characteristic polynomial is $(-1)^n$, the next coefficient is $(-1)^{n-1} \operatorname{tr}(A)$, and the constant term is $p(0) = \det(A)$.
By Vieta's formulas, the factored polynomial $p(\lambda) = (-1)^n (\lambda - \lambda_1) \cdots (\lambda - \lambda_n)$ expanded gives:
Coefficient of $\lambda^{n-1}$: $(-1)^{n-1} \sum_i \lambda_i$
Constant term: $\prod_i \lambda_i$
Comparing coefficients gives $\sum_i \lambda_i = \operatorname{tr}(A)$ and $\prod_i \lambda_i = \det(A)$. QED.
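Properties 1, 2 and 4 are easy to confirm numerically; a small sketch using an arbitrary symmetric example matrix:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

eigvals = np.linalg.eigvalsh(A)    # symmetric matrix: eigenvalues 1 and 3

# Sum of eigenvalues = trace, product = determinant
print(np.isclose(eigvals.sum(), np.trace(A)))         # True (1 + 3 = 4)
print(np.isclose(eigvals.prod(), np.linalg.det(A)))   # True (1 * 3 = 3)
# Eigenvalues of A^2 are the squares λ_i^2
print(np.allclose(np.linalg.eigvalsh(A @ A), eigvals**2))   # True (1, 9)
```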
Spectral Theorem for
Symmetric Matrices
Theorem 7 (Spectral Theorem for Symmetric Matrices):
Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix, then:
All eigenvalues of $A$ are real
Eigenvectors corresponding to different eigenvalues are orthogonal
$A$ can be orthogonally diagonalized, i.e., there exist an orthogonal matrix $Q$ and a diagonal matrix $\Lambda$ such that:
$$A = Q \Lambda Q^T$$
where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, and the columns of $Q$ are orthonormal eigenvectors of $A$.
Proof:
Step 1: Eigenvalues are real
Let $Av = \lambda v$, $v \neq 0$ (allowing complex $v$, $\lambda$). Compute:
$$\bar{v}^T A v = \lambda\, \bar{v}^T v, \qquad \bar{v}^T A v = (A\bar{v})^T v = \bar{\lambda}\, \bar{v}^T v$$
Therefore $(\lambda - \bar{\lambda})\, \bar{v}^T v = 0$; since $\bar{v}^T v = \|v\|^2 > 0$, we get $\lambda = \bar{\lambda}$, i.e., $\lambda \in \mathbb{R}$.
Step 2: Eigenvectors for different eigenvalues are orthogonal
Let $Av_1 = \lambda_1 v_1$, $Av_2 = \lambda_2 v_2$, where $\lambda_1 \neq \lambda_2$. Compute:
$$\lambda_1 v_2^T v_1 = v_2^T A v_1 = (A v_2)^T v_1 = \lambda_2 v_2^T v_1$$
Therefore $(\lambda_1 - \lambda_2)\, v_2^T v_1 = 0$. Since $\lambda_1 \neq \lambda_2$, we get $v_2^T v_1 = 0$.
Step 3: Orthogonal diagonalization
Since $A$ is an $n \times n$ real symmetric matrix, it has $n$ eigenvalues (counting multiplicities) in the complex field. By Step 1, these eigenvalues are all real. For repeated eigenvalues, we can select an orthogonal basis in their eigenspace (Gram-Schmidt orthogonalization).
Therefore we can find $n$ orthonormal eigenvectors $q_1, \dots, q_n$, forming an orthogonal matrix $Q$. Then:
$$AQ = Q\Lambda$$
Multiplying both sides by $Q^T$ on the right (using $Q Q^T = I$), we get $A = Q \Lambda Q^T$. QED.
Geometric meaning of spectral decomposition:
$$A = Q \Lambda Q^T = \sum_{i=1}^{n} \lambda_i\, q_i q_i^T$$
This shows that $A$ can be decomposed as a weighted sum of $n$ rank-1 matrices, where each matrix $q_i q_i^T$ is a projection along eigenvector $q_i$, with weight being eigenvalue $\lambda_i$.
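The rank-1 expansion can be verified directly in NumPy (the symmetric matrix below is an illustrative example):

```python
import numpy as np

# Symmetric matrix: A = Q Λ Q^T = Σ λ_i q_i q_i^T
A = np.array([[2., 1., 0.],
              [1., 2., 1.],
              [0., 1., 2.]])
lam, Q = np.linalg.eigh(A)       # eigh: eigendecomposition for symmetric matrices

# Rebuild A as a weighted sum of the rank-1 projections q_i q_i^T
A_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(3))

print(np.allclose(Q @ Q.T, np.eye(3)))   # True: Q is orthogonal
print(np.allclose(A_rebuilt, A))         # True: spectral decomposition holds
```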
Positive Definite Matrices
Definition 11 (Positive Definite Matrix): A symmetric matrix $A$ is called positive definite if for all non-zero vectors $x$,
$$x^T A x > 0$$
Similarly, we define positive semi-definite ($x^T A x \ge 0$), negative definite ($x^T A x < 0$), and negative semi-definite ($x^T A x \le 0$) matrices.
Theorem 8 (Equivalent Conditions for Positive Definiteness): The following are equivalent:
$A$ is positive definite
All eigenvalues of $A$ are positive
All leading principal minors of $A$ are positive
There exists an invertible matrix $B$ such that $A = B^T B$
Cholesky decomposition of $A$ exists: $A = L L^T$, where $L$ is lower triangular with positive diagonal
Proof (1 $\Leftrightarrow$ 2):
($\Rightarrow$) Let $Av = \lambda v$, $v \neq 0$. Then:
$$0 < v^T A v = \lambda\, v^T v = \lambda \|v\|^2$$
Therefore $\lambda > 0$.
($\Leftarrow$) By the spectral theorem, $A = Q \Lambda Q^T$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, $\lambda_i > 0$. For any $x \neq 0$, let $y = Q^T x$ (non-zero, since $Q$ is invertible):
$$x^T A x = y^T \Lambda y = \sum_i \lambda_i y_i^2 > 0$$
because at least one $y_i \neq 0$ and all $\lambda_i > 0$.
QED.
Properties of positive definite matrices:
Positive definite matrices are always invertible
Inverse of positive definite matrix is also positive definite
Sum of two positive definite matrices is positive definite
Diagonal elements of positive definite matrix are all positive
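The Cholesky condition in Theorem 8 gives a practical positive-definiteness test; a small sketch (the example matrices are arbitrary):

```python
import numpy as np

def is_positive_definite(A):
    """Test positive definiteness by attempting a Cholesky factorization."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A_pd  = np.array([[2., 1.], [1., 2.]])    # eigenvalues 1, 3 > 0
A_not = np.array([[1., 2.], [2., 1.]])    # eigenvalues -1, 3

print(is_positive_definite(A_pd))    # True
print(is_positive_definite(A_not))   # False
# Consistent with the eigenvalue criterion:
print(np.all(np.linalg.eigvalsh(A_pd) > 0))    # True
```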
The figure below illustrates the geometric meaning of eigenvalue
decomposition: matrix transforms
the unit circle into an ellipse, where the principal axes of the ellipse
correspond to eigenvectors, and the axis lengths correspond to
eigenvalues. The right panel shows that eigenvectors point in the
principal directions of data (the core idea behind PCA):
Eigenvalue Visualization
Singular Value Decomposition
(SVD)
SVD Theorem and Geometric
Meaning
Theorem 9 (Singular Value Decomposition): Let $A \in \mathbb{R}^{m \times n}$ with rank $r$. Then there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$, and a diagonal matrix $\Sigma \in \mathbb{R}^{m \times n}$, such that:
$$A = U \Sigma V^T$$
where the diagonal elements $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ of $\Sigma$ are called the singular values of $A$.
Full form: $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$
Compact form (economy SVD): $A = U_r \Sigma_r V_r^T$,
where $U_r \in \mathbb{R}^{m \times r}$, $\Sigma_r = \operatorname{diag}(\sigma_1, \dots, \sigma_r)$, $V_r \in \mathbb{R}^{n \times r}$.
Proof sketch:
Step 1: Inspiration from eigenvalue decomposition
Consider $A^T A \in \mathbb{R}^{n \times n}$ and $A A^T \in \mathbb{R}^{m \times m}$:
$A^T A$ is symmetric positive semi-definite
$A A^T$ is symmetric positive semi-definite
By the spectral theorem, they can be orthogonally diagonalized.
Step 2: Right singular vectors (from $A^T A$)
Eigenvalue decomposition of $A^T A$:
$$A^T A = V \Lambda V^T$$
where the columns $v_1, \dots, v_n$ of $V$ are orthonormal eigenvectors of $A^T A$, $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$, $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$.
Define $\sigma_i = \sqrt{\lambda_i}$ as the singular values (ordered decreasingly).
Step 3: Construction of left singular vectors
For $i \le r$ (where $\sigma_i > 0$), define:
$$u_i = \frac{1}{\sigma_i} A v_i$$
Verify orthogonality of the $u_i$:
$$u_i^T u_j = \frac{1}{\sigma_i \sigma_j} v_i^T A^T A v_j = \frac{\lambda_j}{\sigma_i \sigma_j} v_i^T v_j = \delta_{ij}$$
For $i > r$, extend $\{u_1, \dots, u_r\}$ to an orthonormal basis of $\mathbb{R}^m$ using Gram-Schmidt orthogonalization.
Step 4: Verify $A = U \Sigma V^T$
For $i \le r$: $A v_i = \sigma_i u_i$
For $i > r$: $A v_i = 0$
(because $A^T A v_i = 0$, i.e., $\|A v_i\|^2 = v_i^T A^T A v_i = 0$ implies $A v_i = 0$)
Therefore:
$$A V = U \Sigma$$
Multiplying both sides by $V^T$ on the right gives $A = U \Sigma V^T$.
QED.
Geometric meaning of SVD:
Any linear transformation $x \mapsto Ax$ can be decomposed into three steps:
Rotation (or reflection): $V^T$ rotates $\mathbb{R}^n$'s standard orthonormal basis to the eigenvector basis of $A^T A$
Scaling: $\Sigma$ scales along coordinate directions, with scaling factors being the singular values
Rotation (or reflection): $U$ rotates the result to $\mathbb{R}^m$'s standard orthonormal basis
Linear Algebra Core Concepts
Applications of SVD
1. Low-Rank Approximation
Theorem 10 (Eckart-Young Theorem): Let $A = U \Sigma V^T$ be the SVD of $A$, and $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ be the rank-$k$ truncation. Then $A_k$ is the best approximation to $A$ among all rank-$k$ matrices in both Frobenius norm and spectral norm:
$$\|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\sum_{i > k} \sigma_i^2}$$
Proof sketch (spectral norm):
Let $B$ be any rank-$k$ matrix. Then $\dim \mathcal{N}(B) = n - k$.
Consider $\mathcal{N}(B)$ and $\operatorname{span}\{v_1, \dots, v_{k+1}\}$: their dimensions sum to $(n - k) + (k + 1) = n + 1 > n$.
But they are both subspaces of $\mathbb{R}^n$, so there must exist a non-zero intersection. There exists $w$ with $\|w\| = 1$, $w \in \mathcal{N}(B) \cap \operatorname{span}\{v_1, \dots, v_{k+1}\}$.
Since $Bw = 0$, $w = \sum_{i=1}^{k+1} c_i v_i$ with $\sum_i c_i^2 = 1$. Then:
$$\|A - B\|_2^2 \ge \|(A - B)w\|^2 = \|Aw\|^2 = \sum_{i=1}^{k+1} \sigma_i^2 c_i^2 \ge \sigma_{k+1}^2$$
More refined arguments (using the minimax principle) prove the Frobenius-norm bound as well, and $A_k$ achieves both lower bounds.
Application: Principal Component Analysis (PCA)
Given a centered data matrix $X$ (each column has mean 0), perform SVD on $X$:
$$X = U \Sigma V^T$$
Take the first $k$ principal components:
$$X_k = U_k \Sigma_k V_k^T$$
This is the best rank-$k$ approximation of $X$, i.e., reducing data from $d$ dimensions to $k$ dimensions while preserving maximum variance.
The figure below visualizes the geometric meaning of SVD
decomposition: any matrix
transforms the unit circle into an ellipse through rotation and scaling,
where singular values
determine the length of each axis:
SVD Visualization
The following animation intuitively demonstrates the SVD low-rank
approximation process — as the number of retained singular values increases, the approximation quality
improves continuously, which is a practical manifestation of the
Eckart-Young theorem:
SVD Truncation Animation
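The Eckart-Young error formulas can also be checked numerically; a minimal sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# Eckart-Young: spectral error = σ_{k+1}, Frobenius error = sqrt(Σ_{i>k} σ_i²)
spec_err = np.linalg.norm(A - A_k, 2)
frob_err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(spec_err, s[k]))                        # True
print(np.isclose(frob_err, np.sqrt(np.sum(s[k:]**2))))   # True
```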
2. Pseudoinverse Matrix
Definition 12 (Moore-Penrose Pseudoinverse): The pseudoinverse $A^+$ of matrix $A$ is the unique matrix satisfying these four properties:
$$A A^+ A = A, \qquad A^+ A A^+ = A^+, \qquad (A A^+)^T = A A^+, \qquad (A^+ A)^T = A^+ A$$
Theorem 11 (Computing Pseudoinverse via SVD): Let $A = U \Sigma V^T$ be the SVD of $A$, then:
$$A^+ = V \Sigma^+ U^T$$
where $\Sigma^+ \in \mathbb{R}^{n \times m}$ is defined as:
$$\Sigma^+_{ii} = \begin{cases} 1/\sigma_i & i \le r \\ 0 & i > r \end{cases}$$
i.e., $\Sigma^+ = \operatorname{diag}(\sigma_1^{-1}, \dots, \sigma_r^{-1}, 0, \dots, 0)$.
Verification:
Verify that $A^+ = V \Sigma^+ U^T$ satisfies the four properties. Taking property 1 as an example:
$$A A^+ A = U \Sigma V^T V \Sigma^+ U^T U \Sigma V^T = U \Sigma \Sigma^+ \Sigma V^T = U \Sigma V^T = A$$
because $\Sigma \Sigma^+ \Sigma = \Sigma$ (the first $r$ diagonal elements of $\Sigma \Sigma^+$ are 1). The other properties are verified similarly.
Least squares solution:
For an inconsistent system $Ax = b$, the least squares solution is:
$$x^* = A^+ b$$
This is the minimum-norm solution among all solutions minimizing $\|Ax - b\|_2$.
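A sketch comparing the pseudoinverse solution with a standard least squares solver (the overdetermined system below is an arbitrary example):

```python
import numpy as np

# Overdetermined, inconsistent system Ax = b
A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([1., 2., 2.])

x_pinv = np.linalg.pinv(A) @ b                    # x* = A⁺ b (SVD-based)
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]    # reference LS solver

print(np.allclose(x_pinv, x_lstsq))               # True: same LS solution
# The residual is orthogonal to the column space: A^T (Ax - b) = 0
print(np.allclose(A.T @ (A @ x_pinv - b), 0))     # True
```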
3. Condition Number
Definition 13 (Condition Number): The condition number of matrix $A$ is defined as:
$$\kappa(A) = \|A\| \, \|A^{-1}\| \qquad \left( \kappa_2(A) = \frac{\sigma_{\max}}{\sigma_{\min}} \right)$$
The condition number measures the "ill-conditioning" of a matrix. Larger $\kappa(A)$ means the system $Ax = b$ is more sensitive to perturbations in $b$.
Theorem 12 (Perturbation Bound): Let $Ax = b$, $A(x + \delta x) = b + \delta b$, then:
$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A) \, \frac{\|\delta b\|}{\|b\|}$$
This shows the relative error is amplified by the condition number.
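The perturbation bound can be observed numerically; a sketch with an arbitrary small matrix and perturbation direction:

```python
import numpy as np

A = np.array([[1., 1/2],
              [1/2, 1/3]])        # a mildly ill-conditioned matrix
b = np.array([1., 1.])
x = np.linalg.solve(A, b)

kappa = np.linalg.cond(A)         # κ₂(A) = σ_max / σ_min

# Perturb b slightly and measure the relative-error amplification
db = 1e-8 * np.array([1., -1.])
dx = np.linalg.solve(A, b + db) - x

amplification = (np.linalg.norm(dx) / np.linalg.norm(x)) / \
                (np.linalg.norm(db) / np.linalg.norm(b))
print(amplification <= kappa * (1 + 1e-6))   # True: bounded by κ(A)
```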
Matrix Calculus
Matrix calculus is ubiquitous in machine learning optimization. This
section derives common matrix calculus formulas from first
principles.
Notation for Matrix Calculus
Scalar with respect to vector:
Let $f : \mathbb{R}^n \to \mathbb{R}$; the derivative of $f$ with respect to vector $x$ is defined as:
$$\frac{\partial f}{\partial x} = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^T$$
This is called the gradient vector, denoted $\nabla f$ or $\nabla_x f$.
Scalar with respect to matrix:
Let $f : \mathbb{R}^{m \times n} \to \mathbb{R}$; the derivative of $f$ with respect to matrix $X$ is defined as:
$$\left( \frac{\partial f}{\partial X} \right)_{ij} = \frac{\partial f}{\partial X_{ij}}$$
Vector with respect to vector:
Let $f : \mathbb{R}^n \to \mathbb{R}^m$; the Jacobian matrix is defined as:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}, \qquad J \in \mathbb{R}^{m \times n}$$
Note: Matrix calculus notation conventions may differ across the literature (numerator layout vs denominator layout). This chapter uses numerator layout, where the derivative shape matches the numerator (the function being differentiated) shape.
Basic Derivative Formulas
1. Linear Functions
Formula 1: $f(x) = a^T x$, then:
$$\frac{\partial (a^T x)}{\partial x} = a$$
Proof:
$$f(x) = \sum_{i=1}^{n} a_i x_i$$
Therefore:
$$\frac{\partial f}{\partial x_j} = a_j$$
i.e., $\nabla f = a$.
2. Quadratic Form
Formula 2: $f(x) = x^T A x$ (where $A \in \mathbb{R}^{n \times n}$), then:
$$\frac{\partial (x^T A x)}{\partial x} = (A + A^T)\, x$$
In particular, if $A$ is symmetric:
$$\frac{\partial (x^T A x)}{\partial x} = 2 A x$$
Proof:
$$f(x) = \sum_{i,j} A_{ij} x_i x_j$$
Differentiating with respect to $x_k$:
$$\frac{\partial f}{\partial x_k} = \sum_j A_{kj} x_j + \sum_i A_{ik} x_i = (Ax)_k + (A^T x)_k$$
Therefore $\nabla f = (A + A^T)x$. When $A = A^T$, we get $\nabla f = 2Ax$. QED.
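Formula 2 is easy to sanity-check against central finite differences (the matrix and evaluation point below are arbitrary):

```python
import numpy as np

A = np.array([[3., 1.],
              [2., 4.]])           # deliberately non-symmetric
x = np.array([1., 2.])

f = lambda v: v @ A @ v
grad_analytic = (A + A.T) @ x      # Formula 2: ∇(xᵀAx) = (A + Aᵀ)x

# Central finite differences as an independent check
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(grad_analytic, grad_numeric))   # True
```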
3. Matrix Multiplication
Formula 3: $f(x) = Ax$ (where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $f(x) \in \mathbb{R}^m$), then:
$$\frac{\partial (Ax)}{\partial x} = A$$
Proof:
$$f_i(x) = \sum_j A_{ij} x_j$$
Differentiating with respect to $x_k$:
$$\frac{\partial f_i}{\partial x_k} = A_{ik}$$
Therefore the Jacobian is $A$. QED.
4. Trace of Matrix
Formula 4: $f(X) = \operatorname{tr}(X^T A X)$ (where $X \in \mathbb{R}^{n \times m}$, $A \in \mathbb{R}^{n \times n}$), then:
$$\frac{\partial}{\partial X} \operatorname{tr}(X^T A X) = (A + A^T)\, X$$
In particular, if $A$ is symmetric:
$$\frac{\partial}{\partial X} \operatorname{tr}(X^T A X) = 2 A X$$
Proof:
Using the cyclic property of trace:
$$\operatorname{tr}(X^T A X) = \sum_{i,j,k} X_{ji} A_{jk} X_{ki}$$
Direct computation of the derivative with respect to $X_{pq}$:
$$\frac{\partial}{\partial X_{pq}} \sum_{i,j,k} A_{jk} X_{ji} X_{ki} = \sum_k A_{pk} X_{kq} + \sum_j A_{jp} X_{jq} = (AX)_{pq} + (A^T X)_{pq}$$
Therefore $\partial f / \partial X = (A + A^T) X$. QED.
5. Determinant and Inverse
Formula 5: $f(A) = \log \det A$ (where $A$ is invertible), then:
$$\frac{\partial \log \det A}{\partial A} = (A^{-1})^T = A^{-T}$$
Proof:
Using the chain rule and $d \log u = du / u$:
$$d \log \det A = \frac{d \det A}{\det A}$$
For the determinant differential, there is the formula (Jacobi's formula):
$$d \det A = \det(A)\, \operatorname{tr}(A^{-1}\, dA)$$
Therefore:
$$d \log \det A = \operatorname{tr}(A^{-1}\, dA) = \sum_{i,j} (A^{-1})_{ji}\, dA_{ij}$$
By the relation between differential and derivative, $df = \sum_{i,j} \frac{\partial f}{\partial A_{ij}}\, dA_{ij}$, we get:
$$\frac{\partial \log \det A}{\partial A_{ij}} = (A^{-1})_{ji}, \qquad \text{i.e.,} \quad \frac{\partial \log \det A}{\partial A} = A^{-T}$$
QED.
Formula 6: $F(A) = A^{-1}$, then:
$$\frac{\partial A^{-1}}{\partial A_{ij}} = -A^{-1} E_{ij} A^{-1}$$
where $E_{ij}$ is the matrix with 1 at the $(i, j)$ position and 0 elsewhere.
Chain Rule and
Backpropagation
Theorem 13 (Chain Rule for Scalar Functions): Let $g : \mathbb{R}^n \to \mathbb{R}^m$, $f : \mathbb{R}^m \to \mathbb{R}$; then the composite function $h(x) = f(g(x))$ has derivative:
$$\nabla_x h = J_g^T \, \nabla_u f, \qquad u = g(x)$$
where $J_g \in \mathbb{R}^{m \times n}$ is the Jacobian matrix of $g$.
Proof:
By the multivariate chain rule:
$$\frac{\partial h}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial f}{\partial u_i} \frac{\partial g_i}{\partial x_j}$$
In matrix form: $\nabla_x h = J_g^T \nabla_u f$.
QED.
Backpropagation example:
Consider one layer of a neural network:
$$z = Wx + b, \qquad a = \sigma(z)$$
where $W \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $\sigma$ is an activation function (element-wise).
Let the loss function be $L(a)$. To compute $\partial L / \partial W$:
Naively, one could differentiate the map $W \mapsto z$ directly. But this involves matrix-to-matrix differentiation, requiring more careful treatment. Actually:
Note that $z_i = \sum_j W_{ij} x_j + b_i$, so $\partial z_i / \partial W_{kl} = \delta_{ik}\, x_l$ (with Kronecker delta $\delta_{ik}$). Therefore:
$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial z_k} \frac{\partial z_k}{\partial W_{ij}} = \frac{\partial L}{\partial z_i}\, x_j = \delta_i\, x_j$$
where $\delta_i = \frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial a_i}\, \sigma'(z_i)$ is the "error signal" of the $i$-th neuron.
In matrix form:
$$\frac{\partial L}{\partial W} = \delta\, x^T$$
where $\delta = \frac{\partial L}{\partial a} \odot \sigma'(z)$.
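The formula $\partial L / \partial W = \delta x^T$ can be verified entry-wise by finite differences; a sketch assuming $\sigma = \tanh$ and the illustrative loss $L = \frac{1}{2}\|a\|^2$ (so $\partial L/\partial a = a$):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)

sigma = np.tanh
sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

def loss(W_):
    # Illustrative loss L = 0.5 * ||a||^2, so dL/da = a
    return 0.5 * np.sum(sigma(W_ @ x + b) ** 2)

# Backpropagation: δ = (dL/da) ⊙ σ'(z), then dL/dW = δ xᵀ
z = W @ x + b
a = sigma(z)
delta = a * sigma_prime(z)
grad_W = np.outer(delta, x)

# Central finite-difference check on one entry of W
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
grad_fd = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.isclose(grad_W[1, 2], grad_fd))   # True
```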
Numerical
Computation of Matrix Decompositions
QR Decomposition
Theorem 14 (QR Decomposition): Let $A \in \mathbb{R}^{m \times n}$ ($m \ge n$) be full column rank; then there exist a matrix $Q \in \mathbb{R}^{m \times n}$ with orthonormal columns and an upper triangular matrix $R \in \mathbb{R}^{n \times n}$ such that:
$$A = QR$$
Gram-Schmidt orthogonalization algorithm:
Given linearly independent vectors $a_1, \dots, a_n$, construct orthonormal vectors $q_1, \dots, q_n$:
$$\tilde{q}_k = a_k - \sum_{i=1}^{k-1} (q_i^T a_k)\, q_i, \qquad q_k = \frac{\tilde{q}_k}{\|\tilde{q}_k\|}$$
Then:
$$a_k = \sum_{i=1}^{k} r_{ik}\, q_i$$
In matrix form: $A = QR$, where $r_{ik} = q_i^T a_k$ ($i \le k$).
Applications of QR decomposition:
Solving least squares $\min_x \|Ax - b\|_2$: the solution is obtained from $Rx = Q^T b$ (using $A = QR$).
Computing eigenvalues: QR iteration algorithm. Set $A_0 = A$;
repeat: $A_k = Q_k R_k$ (QR factorize), $A_{k+1} = R_k Q_k$.
Under certain conditions, $A_k$
converges to an upper triangular matrix, with the eigenvalues on the diagonal.
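The QR iteration takes only a few lines; a sketch on a symmetric example, where convergence to a diagonal matrix of eigenvalues is well behaved:

```python
import numpy as np

# QR iteration: A_k = Q_k R_k, A_{k+1} = R_k Q_k.
# For this symmetric matrix the iterates converge to diag(eigenvalues).
A = np.array([[2., 1.],
              [1., 3.]])
Ak = A.copy()
for _ in range(50):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q

# Diagonal of the limit matches the eigenvalues of A
print(np.allclose(sorted(np.diag(Ak)), np.linalg.eigvalsh(A)))   # True
```

Each step $A_{k+1} = R_k Q_k = Q_k^T A_k Q_k$ is a similarity transform, so the eigenvalues are preserved throughout the iteration.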
Cholesky Decomposition
Theorem 15 (Cholesky Decomposition): Let $A \in \mathbb{R}^{n \times n}$ be positive definite; then there exists a unique lower triangular matrix $L$ (with positive diagonal) such that:
$$A = L L^T$$
Algorithm:
Recursively compute the elements of $L$:
$$L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}, \qquad L_{ij} = \frac{1}{L_{jj}} \left( A_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk} \right) \quad (i > j)$$
Computational complexity: $\approx \frac{1}{3} n^3$ flops (twice as fast as LU decomposition)
Applications:
Solving the linear system $Ax = b$:
Decompose $A = L L^T$
Solve $Ly = b$ (forward substitution)
Solve $L^T x = y$ (backward substitution)
Testing positive definiteness: If the Cholesky decomposition succeeds
(all $L_{jj} > 0$), then $A$ is positive definite.
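The solve-via-Cholesky recipe, sketched with NumPy (`np.linalg.solve` is used for the two triangular systems for brevity; a dedicated triangular solver would exploit the structure):

```python
import numpy as np

A = np.array([[4., 2., 0.],
              [2., 5., 1.],
              [0., 1., 3.]])      # symmetric positive definite
b = np.array([2., 1., 3.])

L = np.linalg.cholesky(A)         # A = L L^T, L lower triangular

y = np.linalg.solve(L, b)         # forward substitution:  L y = b
x = np.linalg.solve(L.T, y)       # backward substitution: L^T x = y

print(np.allclose(L @ L.T, A))    # True: factorization is exact
print(np.allclose(A @ x, b))      # True: x solves the system
```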
Code
Implementation: Matrix Decompositions and Applications
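As one illustration of the stability discussion, the sketch below compares the normal equations against QR on an ill-conditioned Vandermonde least squares problem (all data here is synthetic and illustrative):

```python
import numpy as np

# Ill-conditioned least squares: columns 1, t, ..., t^7 on [0, 1]
t = np.linspace(0, 1, 50)
A = np.vander(t, 8, increasing=True)
x_true = np.ones(8)
b = A @ x_true

# Normal equations square the condition number: κ(AᵀA) = κ(A)²
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# QR works on A directly, amplifying errors only by κ(A)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

err_normal = np.linalg.norm(x_normal - x_true)
err_qr = np.linalg.norm(x_qr - x_true)
print(err_qr <= err_normal)   # QR is (typically much) more accurate
```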
Since $\kappa(A^T A) = \kappa(A)^2$, the two approaches to least squares differ sharply:
Normal equations ($A^T A x = A^T b$): the relative error may be amplified by $\kappa(A)^2$ times
QR decomposition: the relative error may be amplified by only $\kappa(A)$ times
In finite precision arithmetic (e.g., double precision floating point, relative precision $\approx 10^{-16}$), with $\kappa(A) \approx 10^3$:
Using normal equations: lose about 6 significant digits; the result may have only 10 significant digits
Using QR decomposition: lose about 3 significant digits; the result has about 13 significant digits
Practical example:
Consider a matrix $A$ for which $A^T A$ contains terms of order $\epsilon^2$ with tiny $\epsilon$: in double precision the $\epsilon^2$ term gets rounded off, making $A^T A$ nearly singular and the normal equations impossible to solve accurately.
But QR decomposition works directly on $A$, doesn't involve $A^T A$, and is numerically stable.
Recommendations:
✅ For ill-conditioned problems with large $\kappa(A)$, use QR decomposition or SVD
✅ For tall-thin matrices with $m \gg n$, QR decomposition is more efficient
⚠️ Only use normal equations when $A$ has a small condition number and $n$ is not too large
Q3: How to understand matrix rank? Why is rank important?
Multiple equivalent definitions of rank:
Column space dimension: $\operatorname{rank}(A) = \dim \mathcal{C}(A)$
Row space dimension: $\operatorname{rank}(A) = \dim \mathcal{C}(A^T)$
Number of linearly independent columns: Size of a maximal linearly independent set of columns
Number of non-zero singular values: $\operatorname{rank}(A) = \#\{i : \sigma_i > 0\}$
Geometric meaning:
Rank is the "effective dimension" of matrix $A$ as a linear transformation:
$\operatorname{rank}(A) = r$ means $A$ maps $n$-dimensional space onto an $r$-dimensional subspace of $m$-dimensional space
Dimension "loss" $= n - r$ (the null space dimension)
Why is rank important?
Invertibility: $A \in \mathbb{R}^{n \times n}$ is invertible $\iff \operatorname{rank}(A) = n$
Equation solvability: $Ax = b$ has a solution $\iff \operatorname{rank}(A) = \operatorname{rank}([A \,|\, b])$; the solution is unique $\iff$ additionally $\operatorname{rank}(A) = n$
Degrees of freedom: The general solution of the homogeneous equation $Ax = 0$ has $n - r$ free parameters
Data dimensionality reduction: PCA projects data onto a rank-$k$ subspace, preserving maximum variance
Practical applications:

| Domain | Meaning of Rank |
|---|---|
| Machine Learning | Effective dimension of features |
| Recommender Systems | Number of latent factors in user-item interaction matrix |
| Image Compression | Low-rank approximation of image matrix |
| Control Theory | Controllability/observability of system |
Numerical rank:
Due to floating point errors, the numerical rank is defined as:
$$\operatorname{rank}_\epsilon(A) = \#\{i : \sigma_i > \epsilon\}$$
where $\epsilon$ is a threshold (e.g., $\epsilon = \max(m, n)\, \sigma_1\, \epsilon_{\text{machine}}$, the default used by NumPy's `matrix_rank`).
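A sketch of the numerical-rank computation (the matrix is an arbitrary example whose third column is exactly the sum of the first two, so the mathematical rank is 2):

```python
import numpy as np

A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [2., 1., 3.]])

s = np.linalg.svd(A, compute_uv=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]   # NumPy-style threshold
numerical_rank = int(np.sum(s > tol))

print(s[2] < tol)                        # True: smallest σ is numerically zero
print(numerical_rank)                    # 2
print(np.linalg.matrix_rank(A) == 2)     # True (same default threshold)
```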
Q4:
What's the relationship between inner products and norms? Why do we need
different norms?
Inner product induced norm:
For inner product spaces, the inner product naturally induces a norm:
$$\|v\| = \sqrt{\langle v, v \rangle}$$
For example, the standard inner product induces the $\ell_2$ norm.
Not all norms come from inner products: the $\ell_1$ norm $\|x\|_1 = \sum_i |x_i|$ and the $\ell_\infty$ norm $\|x\|_\infty = \max_i |x_i|$ cannot be induced by any inner product.
Testing if a norm comes from an inner product: Parallelogram law.
A norm $\|\cdot\|$ is induced by an inner product $\iff$ it satisfies the parallelogram law:
$$\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$$
The $\ell_2$ norm satisfies this law; $\ell_1$ and $\ell_\infty$ don't.
Applications of different norms:

| Norm | Geometric Meaning | Application |
|---|---|---|
| $\ell_1$ | Manhattan distance, diamond contours | Sparse regularization (Lasso) |
| $\ell_2$ | Euclidean distance, circular contours | Ridge regression, SVM |
| $\ell_\infty$ | Chebyshev distance, square contours | Robust optimization, minimax |
| $\ell_0$ | Number of non-zero elements (not a norm) | Feature selection, compressed sensing |

Why does $\ell_1$ produce sparse solutions? The $\ell_1$ norm's level sets have corners on the coordinate axes. When the constraint set (e.g., $\|x\|_1 \le t$) intersects the objective function's level sets, the intersection point tends to lie on the coordinate axes (some components zero), producing sparse solutions.
The function $\log \det A$ (over positive definite $A$) is concave, so it is used for maximization problems.
Testing positive definiteness:
Eigenvalues: All eigenvalues $\lambda_i > 0$
Sylvester's criterion: All leading principal minors $> 0$
Cholesky decomposition: Try the decomposition; success means positive definite
Numerical method: Compute the minimum eigenvalue and check $\lambda_{\min}(A) > 0$
Q6:
What's the difference between "numerator layout" and "denominator
layout" in matrix calculus?
Two layout conventions:

| Layout | Rule | Shape of $\partial y / \partial x$ | Common in |
|---|---|---|---|
| Numerator | Derivative shape matches numerator | Same as $y$'s shape | Statistics, ML |
| Denominator | Derivative shape matches denominator | Same as $x$'s shape (transposed) | Physics, engineering |

Example:
Let $f : \mathbb{R}^n \to \mathbb{R}^m$; the Jacobian matrix:
Numerator layout: $J \in \mathbb{R}^{m \times n}$, $J_{ij} = \partial f_i / \partial x_j$
Denominator layout: $J \in \mathbb{R}^{n \times m}$ (the transpose)
Scalar with respect to vector:
Let $f : \mathbb{R}^n \to \mathbb{R}$.
Numerator layout: $\nabla f \in \mathbb{R}^n$ (column vector)
Denominator layout: $(\nabla f)^T$ (row vector)
Comparison of important formulas:
Formula
Numerator Layout
Denominator Layout
Convention in this chapter:
This chapter uses numerator layout, because:
Consistent with natural vector/matrix shape
More common in ML literature
Gradient is column vector, convenient for gradient descent
Practical recommendations:
✅ When reading papers, first determine author's layout
convention
✅ Stay consistent in your own code/derivations
✅ Use automatic differentiation libraries (PyTorch, JAX) to avoid
manual derivation
Q7: How
to quickly determine if a matrix is invertible?
Theoretical criteria:
Matrix $A \in \mathbb{R}^{n \times n}$ is invertible $\iff$ any of these conditions holds:
1. $\det(A) \neq 0$
2. $\operatorname{rank}(A) = n$
3. $\mathcal{N}(A) = \{0\}$ (null space contains only the zero vector)
4. All eigenvalues non-zero
5. Row vectors (or column vectors) linearly independent
6. $A$ can be reduced to the identity matrix through row operations
Numerical criteria (practical computation):
Due to floating point errors, use:
Condition number:
$\kappa(A) \ll 1/\epsilon_{\text{machine}}$: numerically invertible
$\kappa(A) \gtrsim 1/\epsilon_{\text{machine}}$: numerically singular
Minimum singular value:
$\sigma_{\min} > \text{tol}$ (where tol is a small threshold): invertible
LU decomposition: Try LU decomposition; if a zero pivot (or an extremely small pivot) is encountered, the matrix is singular
Quick checks (without full computation):
```python
import numpy as np

def is_invertible_quick(A, tol=1e-10):
    """Quickly determine if a matrix is numerically invertible."""
    # Method 1: check the determinant (only reliable for small matrices)
    if A.shape[0] <= 10:
        return abs(np.linalg.det(A)) > tol
    # Method 2: check the condition number
    kappa = np.linalg.cond(A)
    return kappa < 1 / tol

def is_invertible_by_solve(A, tol=1e-10):
    """Method 3: try solving Ax = b and check the residual."""
    try:
        x = np.linalg.solve(A, np.ones(A.shape[0]))
        residual = np.linalg.norm(A @ x - np.ones(A.shape[0]))
        return residual < tol
    except np.linalg.LinAlgError:
        return False
```
Invertibility of special matrices:

| Matrix Type | Invertibility Condition |
|---|---|
| Diagonal | All diagonal elements non-zero |
| Triangular | All diagonal elements non-zero |
| Orthogonal | Always invertible ($Q^{-1} = Q^T$) |
| Symmetric positive definite | Always invertible |
| Idempotent ($A^2 = A$) | Only when $A = I$ |
Q8:
What's the relationship between SVD and PCA? Why does PCA use eigenvalue
decomposition of covariance matrix?
Two equivalent derivations of PCA:
Method 1: Maximum variance (eigenvalue decomposition of covariance matrix)
Given a centered data matrix $X \in \mathbb{R}^{n \times d}$ (each column has mean 0).
Objective: Find a direction $w$, $\|w\| = 1$, maximizing the data variance along this direction.
Variance: $\operatorname{Var}(Xw) = \frac{1}{n} \|Xw\|^2 = w^T S w$, where $S = \frac{1}{n} X^T X$ is the covariance matrix.
Optimization problem:
$$\max_{\|w\| = 1} w^T S w$$
By Lagrange multipliers, the optimal solution satisfies:
$$S w = \lambda w$$
i.e., $w$ is an eigenvector of $S$, $\lambda$ is an eigenvalue. Maximum variance corresponds to the largest eigenvalue.
Method 2: Minimum reconstruction error (SVD of data matrix)
Objective: Find a rank-$k$ matrix $X_k$ minimizing $\|X - X_k\|_F$.
By the Eckart-Young theorem, the optimal solution is the rank-$k$ SVD approximation of $X$:
$$X_k = U_k \Sigma_k V_k^T$$
where $v_1, \dots, v_k$ are the first $k$ right singular vectors, exactly the first $k$ eigenvectors of $X^T X$ (the covariance matrix times $n$).
Equivalence of the two methods:
$$X^T X = V \Sigma^T \Sigma V^T \implies S = V \left( \frac{\Sigma^T \Sigma}{n} \right) V^T$$
i.e.:
Right singular vectors of $X$ = eigenvectors of $X^T X$ (PCA principal directions)
$\lambda_i = \sigma_i^2 / n$
Neural network compression: A weight matrix $W \in \mathbb{R}^{m \times n}$ approximated by a rank-$k$ factorization $W \approx U_k \Sigma_k V_k^T$ reduces parameters from $mn$ to $k(m + n)$.
Fast matrix multiplication: If $W \approx U_k \Sigma_k V_k^T$,
then $Wx \approx U_k (\Sigma_k (V_k^T x))$, reducing computation from $O(mn)$ to $O(k(m + n))$.
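The PCA-SVD equivalence can be confirmed numerically; a sketch on synthetic centered data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with distinct variances along the three coordinate axes
X = rng.standard_normal((100, 3)) * np.array([2.0, 1.0, 0.1])
X = X - X.mean(axis=0)                   # center each column

# Method 1: eigendecomposition of the covariance matrix S = X^T X / n
S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order

# Method 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# λ_i = σ_i² / n
print(np.allclose(np.sort(s**2 / len(X)), eigvals))        # True
# Top principal direction = top covariance eigenvector (up to sign)
top_pc = Vt[0]
print(np.isclose(abs(top_pc @ eigvecs[:, -1]), 1.0))       # True
```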
Q10: What's the relationship between the Gram matrix ($G = A^T A$) and Gram-Schmidt orthogonalization?
Definition of Gram matrix:
Given a matrix $A \in \mathbb{R}^{m \times n}$ with columns $a_1, \dots, a_n$, the Gram matrix is defined as:
$$G = A^T A \in \mathbb{R}^{n \times n}$$
Elements: $G_{ij} = a_i^T a_j$,
i.e., the inner products between column vectors.
Properties of the Gram matrix:
Symmetric: $G^T = G$
Positive semi-definite: $x^T G x = \|Ax\|^2 \ge 0$
Rank: $\operatorname{rank}(G) = \operatorname{rank}(A)$
Invertibility: $G$ invertible $\iff$ $A$ has full column rank
Gram-Schmidt orthogonalization:
Objective: Transform $a_1, \dots, a_n$ into an orthonormal basis $q_1, \dots, q_n$.
Relationship:
Gram-Schmidt orthogonalization essentially "removes the projection of $a_k$ onto the subspace spanned by the first $k - 1$ vectors", equivalent to:
Compute projection: $p_k = \sum_{i=1}^{k-1} (q_i^T a_k)\, q_i$
Compute orthogonal component: $\tilde{q}_k = a_k - p_k$
Normalize: $q_k = \tilde{q}_k / \|\tilde{q}_k\|$
Cholesky decomposition of the Gram matrix:
If $A$ has full column rank, then $G = A^T A$ is positive definite and can be Cholesky decomposed:
$$G = L L^T$$
where $L$ is lower triangular. In fact $L = R^T$ from the QR factorization, and the diagonal entries $r_{kk} = \|\tilde{q}_k\|$ are the lengths of the intermediate (unnormalized) vectors in the Gram-Schmidt process.
Numerical stability:
Classical Gram-Schmidt (CGS) is numerically unstable: later vectors may lose orthogonality.
Modified Gram-Schmidt (MGS) is more stable: it orthogonalizes against one vector at a time.
Relationship between QR decomposition and Gram-Schmidt:
The QR decomposition $A = QR$ can be obtained through Gram-Schmidt:
Columns of $Q$ are the orthogonalized vectors $q_i$
$R$ is upper triangular, with elements $r_{ij} = q_i^T a_j$
Conversely, the Gram matrix can be expressed through QR:
$$G = A^T A = R^T Q^T Q R = R^T R$$
This is exactly the Cholesky decomposition of the Gram matrix (with $L = R^T$)!
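The identity $A^T A = R^T R$ and its match with the Cholesky factor can be checked directly (the example matrix is arbitrary, with full column rank):

```python
import numpy as np

A = np.array([[1., 2.],
              [0., 1.],
              [1., 0.]])          # full column rank

Q, R = np.linalg.qr(A)            # A = QR
G = A.T @ A                       # Gram matrix
L = np.linalg.cholesky(G)         # G = L L^T

print(np.allclose(G, R.T @ R))    # True: G = R^T R
# Cholesky factor equals R^T up to the signs NumPy chooses for Q's columns
print(np.allclose(np.abs(L), np.abs(R.T)))   # True
```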
🎓 Summary: Core Points
of Linear Algebra
Key formulas to remember:
Four fundamental subspaces: $\mathbb{R}^n = \mathcal{C}(A^T) \oplus \mathcal{N}(A)$, $\mathbb{R}^m = \mathcal{C}(A) \oplus \mathcal{N}(A^T)$
SVD decomposition: $A = U \Sigma V^T$
Spectral theorem (symmetric matrices): $A = Q \Lambda Q^T$
Matrix calculus: $\nabla_x (a^T x) = a$, $\nabla_x (x^T A x) = (A + A^T)\, x$
Memory mnemonic:
SVD is universal key (any matrix decomposable)
Symmetric matrices have real eigenvalues (spectral theorem
guarantees orthogonality)
Positive definite matrices optimize well (convex functions
have unique solutions)
QR decomposition is numerically stable (first choice for
least squares)
Practical checklist:
✏️ Exercises and Solutions
Exercise 1: Eigenvalues
and Eigenvectors
Problem: Let . Find all
eigenvalues and eigenvectors of ,
and verify that can be
diagonalized.
Solution:
Characteristic equation:
Solving: , .
For: , i.e., , giving .
For: , i.e., , giving .
Verification of diagonalization: , , then
.
We can verify ,
therefore can be orthogonally
diagonalized. This is a manifestation of the spectral theorem for
symmetric matrices.
Exercise 2: SVD and
Low-Rank Approximation
Problem: Let . Find the SVD decomposition of and give the best rank-1 approximation
matrix.
Solution:
First compute .
Eigenvalues of : , giving, .
Singular values: , .
Right singular vectors are
formed by the (normalized) eigenvectors of , and left singular vectors .
Best rank-1 approximation: . By the Eckart-Young theorem, the approximation error
is.
Exercise 3: Matrix Calculus
Problem: Let $f(x) = \frac{1}{2} x^T A x - b^T x$, where $A$ is an $n \times n$ symmetric matrix. Find $\nabla f$ and $\nabla^2 f$, and determine the minimum value of $f$ when $A$ is positive definite.
Solution:
Gradient: $\nabla f = A x - b$
(Since $A$ is symmetric, $\nabla(\frac{1}{2} x^T A x) = A x$.)
Hessian matrix: $\nabla^2 f = A$
When $A$ is positive definite, the Hessian is positive definite, so $f$ is strictly convex.
Setting $\nabla f = 0$: $x^* = A^{-1} b$
Minimum value: $f(x^*) = \frac{1}{2} b^T A^{-1} b - b^T A^{-1} b = -\frac{1}{2} b^T A^{-1} b$
Minimum value:
Exercise
4: Positive Definite Matrices and Cholesky Decomposition
Problem: Prove that a symmetric positive definite
matrix always has a unique
Cholesky decomposition ,
where is a lower triangular
matrix with positive diagonal elements. Verify for .
Solution:
Existence proof (by induction on matrix order $n$):
$n = 1$: $A = (a_{11})$, $a_{11} > 0$ (positive definite), take $L = (\sqrt{a_{11}})$.
Assume the result holds for order $n - 1$. Partition the $n$-th order positive definite matrix as:
$$A = \begin{pmatrix} a_{11} & b^T \\ b & A_{22} \end{pmatrix}$$
Let $L = \begin{pmatrix} \ell_{11} & 0 \\ \ell & L_{22} \end{pmatrix}$, where $\ell_{11} = \sqrt{a_{11}}$ (since diagonal elements of positive definite matrices are positive), $\ell = b / \ell_{11}$, and $L_{22} L_{22}^T = A_{22} - \frac{b b^T}{a_{11}}$.
By positive definiteness, $A_{22} - \frac{b b^T}{a_{11}}$ is still positive definite (the Schur complement), so by the induction hypothesis, $L_{22}$ exists.
Uniqueness: If $L_1 L_1^T = L_2 L_2^T$, then $L_2^{-1} L_1 = L_2^T L_1^{-T}$. The left side is lower triangular, the right side is upper triangular, so both equal a diagonal matrix $D$. Since both $L_1$ and $L_2$ have positive diagonal elements, $D = I$, i.e., $L_1 = L_2$.
Verification: .
, , .
Exercise 5:
Condition Number and Numerical Stability
Problem: For matrix ,
compute its 2-norm condition number, and explain why solving the
linear system using faces numerical stability issues.
Solution:
Eigenvalues of : .
Since is symmetric positive
definite, singular values equal eigenvalues, so:
The condition number is approximately $10^4$, indicating an ill-conditioned matrix. This means:
Small perturbations $\delta b$ in the right-hand side vector $b$ can cause changes in the solution amplified by a factor of $\kappa(A)$:
$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A)\, \frac{\|\delta b\|}{\|b\|}$$
In floating-point arithmetic, even without errors in $b$, rounding errors can be amplified by approximately $10^4$ times, resulting in a loss of about 4 significant digits in the solution
In practice, regularization should be considered (e.g., Tikhonov regularization: minimize $\|Ax - b\|^2 + \lambda \|x\|^2$) to improve numerical stability
📚 References
Strang, G. (2006). Linear Algebra and Its
Applications (4th ed.). Cengage Learning.
Golub, G. H., & Van Loan, C. F. (2013). Matrix
Computations (4th ed.). Johns Hopkins University
Press.
Horn, R. A., & Johnson, C. R. (2012). Matrix
Analysis (2nd ed.). Cambridge University Press.
Trefethen, L. N., & Bau III, D. (1997). Numerical
Linear Algebra. SIAM.
Axler, S. (2015). Linear Algebra Done Right (3rd
ed.). Springer.
Petersen, K. B., & Pedersen, M. S. (2012). The Matrix
Cookbook. Technical University of Denmark.
Eckart, C., & Young, G. (1936). The approximation of
one matrix by another of lower rank. Psychometrika,
1(3), 211-218.
Meyer, C. D. (2000). Matrix Analysis and Applied Linear
Algebra. SIAM.
Boyd, S., & Vandenberghe, L. (2004). Convex
Optimization. Cambridge University Press.
Magnus, J. R., & Neudecker, H. (2019). Matrix
Differential Calculus with Applications in Statistics and
Econometrics (3rd ed.). Wiley.
Next chapter preview: Chapter 3 will delve into
probability theory and statistical inference, including common
distributions, maximum likelihood estimation, Bayesian estimation, etc.,
laying the foundation for probabilistic models.
Post title: Machine Learning Mathematical Derivations (2): Linear Algebra and Matrix Theory
Post author: Chen Kai
Create time: 2021-08-31 14:00:00
Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-2-Linear-Algebra-and-Matrix-Theory/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.