We've completed a long journey through linear algebra. From vectors and matrices to eigenvalue decomposition, SVD, tensor analysis, and applications in machine learning and deep learning — each chapter has revealed the remarkable universality of this discipline. Now, let's turn our gaze to the frontiers: quantum computing, graph neural networks, and large language models, technologies that are changing the world. These fields may seem mysterious, but their core remains the familiar linear algebra we've studied.
Linear Algebra in Quantum Computing
From Classical Bits to Quantum Bits
A classical computer's basic unit is the bit, which can only be in one of two states: 0 or 1.

Quantum bits (qubits) are entirely different. They can exist in a superposition of the basis states $|0\rangle$ and $|1\rangle$: $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, where the complex amplitudes satisfy $|\alpha|^2 + |\beta|^2 = 1$.
Intuitive Understanding: Imagine a globe. A classical bit can only be at the north pole or the south pole, while a qubit can be at any point on the sphere's surface. This sphere is called the Bloch sphere, and it perfectly visualizes a qubit's state space: the north pole is $|0\rangle$ and the south pole is $|1\rangle$.
Why is superposition so powerful? With $n$ qubits, the state vector lives in a $2^n$-dimensional space, so a single quantum operation acts on all $2^n$ basis states at once.
Quantum Gates: Physical Realization of Unitary Matrices
In quantum computing, "operations" are implemented through quantum gates. Mathematically, quantum gates are unitary matrices: a matrix $U$ is unitary if $U^\dagger U = I$, which guarantees that it preserves the norm (total probability) of the state vector.
Most Important Single-Qubit Gates:
Hadamard Gate: Arguably the most frequently used gate in quantum computing, it creates superposition:

$$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad H|0\rangle = \frac{|0\rangle + |1\rangle}{\sqrt{2}}$$
Pauli Gates: Three fundamental operations, corresponding to rotations of the Bloch sphere:

$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$$
Rotation Gates: Parameterized rotations about the Bloch-sphere axes, such as $R_z(\theta) = e^{-i\theta Z/2}$; these are the trainable building blocks of variational circuits.
Two-Qubit Gate: CNOT
The Controlled-NOT (CNOT) gate is the archetypal two-qubit gate: it flips the target qubit exactly when the control qubit is $|1\rangle$:

$$\text{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

The famous Bell state (a maximally entangled state) can be prepared by a Hadamard followed by a CNOT:

$$|\Phi^+\rangle = \text{CNOT}\,(H \otimes I)|00\rangle = \frac{|00\rangle + |11\rangle}{\sqrt{2}}$$
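As a concrete check, this preparation can be simulated with a few lines of NumPy; the gate matrices below are the standard ones, and the circuit is just matrix multiplication:

```python
import numpy as np

# Single-qubit gates as 2x2 unitary matrices.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
I2 = np.eye(2)

# CNOT on two qubits (control = qubit 0, target = qubit 1).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

# Start in |00>, apply H to the first qubit, then CNOT.
ket00 = np.array([1.0, 0.0, 0.0, 0.0])
bell = CNOT @ np.kron(H, I2) @ ket00

print(bell)  # amplitudes of |00>, |01>, |10>, |11>
```

The output has amplitude $1/\sqrt{2}$ on $|00\rangle$ and $|11\rangle$ and zero elsewhere, exactly the Bell state.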
Linear Algebra Perspective on Quantum Algorithms
Deutsch-Jozsa Algorithm: Demonstrating Quantum Parallelism
Suppose there's a function $f: \{0,1\}^n \to \{0,1\}$ promised to be either constant or balanced (outputting 0 and 1 equally often). Classically, deciding which can require $2^{n-1}+1$ queries in the worst case; the Deutsch-Jozsa algorithm needs just one.
The core idea uses Hadamard transforms to create superposition of all inputs, then lets quantum interference amplify the correct answer and cancel wrong ones.
Grover's Search Algorithm
To search for a specific target among $N$ unsorted items, a classical algorithm needs $O(N)$ queries; Grover's algorithm needs only $O(\sqrt{N})$.
The core of Grover's algorithm iterates two operations:
1. Oracle Flip: mark the target state by flipping its phase
2. Diffusion Transform: reflect all amplitudes about their mean

In matrix language, the diffusion transform is $D = 2|s\rangle\langle s| - I$, where $|s\rangle$ is the uniform superposition state.
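The two reflections can be simulated directly as matrices on a small search space. A minimal NumPy sketch (the 3-qubit size and target index are arbitrary choices for illustration):

```python
import numpy as np

n = 3                      # qubits
N = 2 ** n                 # search space size
target = 5                 # marked item (arbitrary)

# Uniform superposition |s>.
s = np.full(N, 1 / np.sqrt(N))

# Oracle: flip the phase of the target state.
oracle = np.eye(N)
oracle[target, target] = -1

# Diffusion: reflect about the mean, D = 2|s><s| - I.
diffusion = 2 * np.outer(s, s) - np.eye(N)

state = s.copy()
iters = int(round(np.pi / 4 * np.sqrt(N)))   # ~O(sqrt(N)) iterations
for _ in range(iters):
    state = diffusion @ (oracle @ state)

probs = state ** 2
print(probs.argmax(), probs[target])  # target index, high probability
```

After only two iterations (for $N = 8$) the target's probability exceeds 0.9, while a classical search would need about $N/2$ queries on average.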
Shor's Algorithm and RSA's Crisis
Shor's algorithm can factor large integers in polynomial time, directly threatening RSA encryption (whose security relies on the difficulty of factoring large numbers). The key step is the Quantum Fourier Transform (QFT):

$$\text{QFT}\,|j\rangle = \frac{1}{\sqrt{N}}\sum_{k=0}^{N-1} e^{2\pi i jk / N}\,|k\rangle$$

which is exactly the DFT matrix, realized as a unitary operation on quantum amplitudes.
Quantum Machine Learning
The intersection of quantum computing and machine learning is an active research area. Key directions:
Variational Quantum Eigensolver (VQE): For finding molecular ground state energies, a core problem in chemistry and drug design. VQE uses parameterized quantum circuits, with classical optimizers adjusting parameters to minimize energy expectation.
Quantum Neural Networks: Building neural networks with parameterized quantum gates. Theoretically may achieve quantum speedup for certain tasks, but currently limited by quantum hardware noise.
Quantum Kernel Methods: Using quantum state space as feature space, potentially accessing kernel functions that are classically hard to compute.
Linear Algebra in Graph Neural Networks
Matrix Representations of Graphs
Graphs are fundamental data structures for describing relationships and connections. Social networks, molecular structures, transportation systems, the internet — all can be modeled as graphs.
A graph $G = (V, E)$ consists of a node set $V$ (with $|V| = n$) and an edge set $E$. Several matrices capture its structure:

Adjacency Matrix: $A \in \mathbb{R}^{n \times n}$ with $A_{ij} = 1$ if $(i, j) \in E$ and $0$ otherwise.

Degree Matrix: $D = \mathrm{diag}(d_1, \dots, d_n)$, where $d_i = \sum_j A_{ij}$ is node $i$'s degree.

Graph Laplacian Matrix: $L = D - A$.

Normalized Laplacian: $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$.
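These definitions translate directly into code. A minimal NumPy sketch on a small example graph (the 4-node cycle is an arbitrary choice):

```python
import numpy as np

# 4-node cycle graph (0-indexed): 0-1, 1-3, 3-2, 2-0.
edges = [(0, 1), (1, 3), (3, 2), (2, 0)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1            # adjacency matrix (symmetric)

D = np.diag(A.sum(axis=1))            # degree matrix
L = D - A                             # graph Laplacian

# Normalized Laplacian: L_sym = I - D^{-1/2} A D^{-1/2}
D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(D)))
L_sym = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt

print(np.linalg.eigvalsh(L))  # smallest eigenvalue is 0 (connected graph)
```

For this cycle the Laplacian spectrum is $\{0, 2, 2, 4\}$; the zero eigenvalue reflects connectivity, a fact used in the exercises below.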
Spectral Graph Theory: Fourier Analysis on Graphs
Fourier transform decomposes time signals into sine waves of different frequencies. How do we define "frequency" on graphs?
The answer comes from the Laplacian matrix's eigendecomposition.
Let $L = U \Lambda U^\top$ be the eigendecomposition of the Laplacian, with orthonormal eigenvectors $u_1, \dots, u_n$ and eigenvalues $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.

Graph Fourier Transform is defined as $\hat{x} = U^\top x$, with inverse transform $x = U \hat{x}$.

Eigenvalues play the role of frequencies: eigenvectors for small $\lambda$ vary smoothly across edges (low frequency), while eigenvectors for large $\lambda$ oscillate rapidly (high frequency).
Spectral Clustering leverages this idea: use the Laplacian's smallest non-zero eigenvectors to embed graph nodes, then do K-means clustering in the embedding space. Intuitively, nodes in the same community have similar values in low-frequency eigenvectors.
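A minimal sketch of this idea: for a 2-way split, the sign pattern of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue) already separates the communities, standing in for the K-means step. The toy two-triangle graph is an illustrative assumption:

```python
import numpy as np

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(1)) - A

# Fiedler vector: eigenvector of the second-smallest Laplacian eigenvalue.
vals, vecs = np.linalg.eigh(L)        # eigh returns ascending eigenvalues
fiedler = vecs[:, 1]

# The sign of the Fiedler vector gives a 2-way partition (a minimal
# stand-in for K-means on the low-frequency embedding).
labels = (fiedler > 0).astype(int)
print(labels)  # one triangle gets label 0, the other label 1
```

Nodes within the same triangle receive the same label, and the two triangles receive different labels, exactly the "similar values in low-frequency eigenvectors" intuition.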
Graph Convolutional Networks (GCN)
Traditional CNNs work well on regular grids (like images) but can't directly apply to graph-structured data. Graph Convolutional Networks solve this.
Spectral Convolution defines filtering in the graph Fourier domain: $g_\theta \star x = U\, g_\theta(\Lambda)\, U^\top x$, where the filter $g_\theta(\Lambda)$ acts on the eigenvalues.
ChebNet's Approximation: computing the full eigendecomposition costs $O(n^3)$, so ChebNet approximates the filter with Chebyshev polynomials of the rescaled Laplacian: $g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})$.

GCN Simplification: Take $K = 1$ with further simplifications, yielding the familiar layer $H' = \sigma(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H W)$, where $\tilde{A} = A + I$ adds self-loops and $\tilde{D}$ is its degree matrix.
Intuitive Understanding: Each GCN layer does three things:
1. Aggregate neighbor features (multiply by the normalized adjacency matrix)
2. Transform features (multiply by the weight matrix $W$)
3. Apply the nonlinearity $\sigma$

Stacking multiple GCN layers aggregates multi-hop neighbor information.
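The three steps can be sketched as a single NumPy function, using the standard GCN normalization with self-loops; the tiny path graph and identity weight matrix are arbitrary test inputs:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                       # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0)           # aggregate, transform, ReLU

# Tiny example: 3-node path graph, 2-d features, 2 output channels.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.eye(2)
out = gcn_layer(A, H, W)
print(out)
```

Each output row is a degree-normalized mix of the node's own features and its neighbors' features, which is why stacking layers widens the receptive field hop by hop.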
Message Passing Neural Networks
GCN can be viewed as a special case of the message passing framework. In this framework, each layer's update has three steps:
Message Computation: $m_{ij} = \mathrm{MSG}(h_i, h_j, e_{ij})$ for each edge $(i, j)$

Message Aggregation: $m_i = \mathrm{AGG}(\{m_{ij} : j \in \mathcal{N}(i)\})$

Node Update: $h_i' = \mathrm{UPD}(h_i, m_i)$
Different GNN variants correspond to different MSG, AGG, UPD function choices:
- GraphSAGE: Uses sampling and different aggregators (mean, max, LSTM)
- GAT (Graph Attention Network): Attention-weighted aggregation
- GIN (Graph Isomorphism Network): Theoretically most powerful message passing GNN
Graph Neural Network Applications
Molecular Property Prediction: Molecules can be viewed as graphs (atoms are nodes, chemical bonds are edges). GNNs can predict various molecular properties like solubility, toxicity, drug activity. AlphaFold for protein structure prediction also uses similar techniques.
Recommender Systems: User-item interactions can be modeled as bipartite graphs. GNNs can learn user and item embeddings for recommendations.
Knowledge Graphs: Entities are nodes, relations are edges. GNNs can do link prediction (predicting missing relations) and node classification (entity classification).
Physical Simulation: Particle systems can be modeled as dynamic graphs. Graph Network Simulators can learn complex physical dynamics.
Linear Algebra in the Era of Large Models
Mathematical Structure of Transformers
Transformers are the foundational architecture for modern large language models (GPT, BERT, LLaMA). Their core is self-attention, which is entirely linear algebra operations.
Given an input sequence embedding matrix $X \in \mathbb{R}^{n \times d}$, self-attention computes $Q = XW_Q$, $K = XW_K$, $V = XW_V$ and

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Breaking down this formula:

1. $QK^\top$ computes the inner product of every query with every key: a matrix of pairwise similarity scores.
2. Dividing by $\sqrt{d_k}$ prevents softmax saturation (vanishing gradients) when inner products grow large.
3. Softmax normalizes attention scores into a probability distribution, with each row summing to 1.
4. Multiplying by $V$ is a weighted summation: each position's output is a weighted combination of all positions' value vectors, with weights being the attention scores.
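The whole computation is a handful of matrix products. A minimal NumPy sketch (the sequence length, dimension, and random inputs are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                             # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

A quick sanity check of the "soft retrieval" picture: if all queries are zero, every key matches equally, and each output row is simply the mean of the value vectors.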
Geometric Intuition: Attention performs soft retrieval. Query is "what I'm looking for," Key is "what I have," Value is "what I can provide." Queries are matched against keys via inner products, and the output is aggregated from the most relevant values.
Multi-Head Attention projects the input into multiple subspaces, runs attention in each, and concatenates the results: $\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W_O$. Different heads can learn different relationship patterns.
Mathematics of Positional Encoding
Transformers have no recurrence or convolution, so how do they perceive position? The answer is positional encoding.
Sinusoidal Positional Encoding (original Transformer):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
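A direct NumPy implementation of this encoding (the sequence length and model dimension are arbitrary):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even indices: sine
    pe[:, 1::2] = np.cos(angle)                  # odd indices: cosine
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)  # (50, 16)
```

At position 0 every sine component is 0 and every cosine component is 1, and each dimension oscillates at a different geometric frequency, which is what lets the model infer relative offsets.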
Rotary Position Embedding (RoPE) is a more modern approach: it encodes position $m$ by rotating each two-dimensional slice of the query and key vectors through an angle $m\theta_i$, so that query-key inner products depend only on relative position.
Parameter-Efficient Fine-Tuning for Large Models
Large language models have billions or even trillions of parameters, making full fine-tuning extremely expensive. Linear algebra provides elegant solutions.
LoRA (Low-Rank Adaptation): assumes weight changes during fine-tuning are low-rank. Instead of directly updating $W \in \mathbb{R}^{d \times k}$, it learns $W' = W + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Trainable parameters drop from $dk$ to $r(d + k)$, a dramatic reduction when $r$ is small.
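A minimal NumPy sketch of a LoRA-adapted linear layer. The names `down`/`up` for the two low-rank factors are this sketch's own; the zero initialization of the up-projection follows the common convention that the adapter contributes nothing at the start of training:

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer (row-vector convention):
    y = x W + x (down @ up) * scale, with W frozen and only the two
    low-rank factors trained."""

    def __init__(self, W, r, alpha=1.0, seed=0):
        d, k = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                       # frozen base weight
        self.down = rng.standard_normal((d, r)) * 0.01   # d x r, random init
        self.up = np.zeros((r, k))                       # r x k, zero init,
        self.scale = alpha / r                           # so the update is 0

    def forward(self, x):
        return x @ self.W + (x @ self.down @ self.up) * self.scale

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
layer = LoRALinear(W, r=4)
x = rng.standard_normal((5, 64))
# With the up-projection at zero, LoRA output matches the frozen base layer.
print(np.allclose(layer.forward(x), x @ W))  # True
```

Here only $r(d + k) = 4 \times 96 = 384$ values are trainable instead of $64 \times 32 = 2048$, and the low-rank product never needs to be materialized as a full matrix.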
QLoRA combines quantization: base model stored in 4-bit quantization, only LoRA part is full precision. This makes fine-tuning 65B models on consumer GPUs possible.
KV Cache and Inference Optimization
During autoregressive generation, each new token needs attention over all previous tokens. A naive implementation reruns attention on the whole prefix at every step, costing $O(n^2)$ work per generated token.
KV Cache exploits the fact that previous tokens' Keys and Values don't change. We cache them, computing only the new token's $q$, $k$, $v$ at each step and concatenating the new $k$, $v$ with the cache. This reduces per-step attention computation to $O(n)$.
This is a classic space-time tradeoff. The KV cache holds $2 \times n_{\text{layers}} \times n \times d$ values (a key vector and a value vector per layer per token), so it grows linearly with sequence length and can dominate GPU memory at long context.
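A toy decode loop makes the bookkeeping concrete. This sketch caches keys and values across steps so each step only computes projections for the single new token; the dimensions and random weights are arbitrary:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
K_cache = np.empty((0, d))   # cached keys of all previous tokens
V_cache = np.empty((0, d))   # cached values of all previous tokens

def step(x, K_cache, V_cache):
    """One decode step: project only the NEW token, reuse the cache."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])          # append instead of recompute
    V_cache = np.vstack([V_cache, v])
    w = softmax(q @ K_cache.T / np.sqrt(d))    # O(n) per step, not O(n^2)
    return w @ V_cache, K_cache, V_cache

for t in range(5):
    x = rng.standard_normal((1, d))            # embedding of the new token
    out, K_cache, V_cache = step(x, K_cache, V_cache)

print(K_cache.shape)  # (5, 8): one cached key per generated token
```

The cache grows by one row per token, which is exactly the linear memory cost described above.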
Sparse Computation and Efficient Inference
Sparse Attention
Standard attention has $O(n^2)$ time and memory complexity in the sequence length $n$, which becomes prohibitive for long sequences.
Sparse Attention reduces complexity by computing only some position pairs' attention:
Local Attention: Each position only attends to a nearby window of size $w$. Complexity drops to $O(nw)$.
Dilated Attention: Attend to tokens at regular intervals (stride $k$), enlarging the receptive field without increasing cost.
Longformer/BigBird: Combines local attention, global attention (certain special tokens see all positions), and random attention.
Mathematically, sparse attention is equivalent to setting most elements of the attention score matrix to $-\infty$ before the softmax, so their weights become exactly zero.
Linear Attention and Kernel Approximation
Another approach uses kernel methods to approximate softmax attention: if $\mathrm{softmax}(QK^\top) \approx \phi(Q)\phi(K)^\top$ for some feature map $\phi$, then attention can be computed as $\phi(Q)\left(\phi(K)^\top V\right)$ in $O(n)$ time simply by changing the order of multiplication.
Performer uses random features to approximate the softmax kernel. Linear Transformer directly removes softmax, but may lose expressiveness.
Model Quantization
Quantization converts high-precision (FP32, FP16) weights and activations to low-precision (INT8, INT4) representations.
Linear Algebra Perspective: Quantization can be viewed as finding a discrete grid to approximate continuous values. Given original weights $w$, we seek a scale $s$ (and possibly a zero point $z$) such that $w \approx s \cdot q$ for integer $q$.

Symmetric Quantization: $q = \mathrm{round}(w/s)$ with $s = \max|w| / 127$ for INT8, so zero maps exactly to zero.

Asymmetric Quantization: $q = \mathrm{round}(w/s) + z$, where the zero point $z$ shifts the grid to fit value ranges not centered at zero.
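Symmetric INT8 quantization is a few lines of NumPy, and the round-trip reconstruction error is bounded by half a quantization step:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric INT8 quantization: q = round(w / s), s = max|w| / 127."""
    s = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s

def dequantize(q, s):
    return q.astype(np.float32) * s

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

err = np.abs(w - w_hat).max()
print(err <= s / 2 + 1e-6)  # error bounded by half a grid step
```

This is per-tensor quantization (one scale for the whole array); per-channel quantization, discussed next, simply computes one such scale per row or column.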
Per-Channel vs Per-Tensor Quantization: Different channels (or layers) may have very different value ranges. Per-channel quantization uses different scale factors for each channel, higher precision but more overhead.
GPTQ is a second-order quantization method: it uses the Hessian of the layer's reconstruction error (built from input statistics) to decide how to round each weight, minimizing the output error $\|WX - \hat{W}X\|_F^2$ rather than the plain weight-space error $\|W - \hat{W}\|$.
Model Pruning
Pruning removes unimportant weights (sets to zero), creating sparsity.
Unstructured Pruning: Any position's weights can be pruned. Can achieve high sparsity (90%+), but hardware acceleration is difficult.
Structured Pruning: Prune entire rows, columns, or convolution kernels. Easier to accelerate, but typically lower sparsity.
Importance Metrics: the simplest is magnitude, pruning weights with small $|w|$ first; more refined criteria use gradient or Hessian information to estimate each weight's effect on the loss.
Sparse Matrix Storage: CSR, CSC, COO formats efficiently store sparse matrices. Modern GPUs (like NVIDIA Ampere) have specialized sparse tensor cores supporting 2:4 sparsity (at most 2 non-zeros per 4 elements).
Frontier Research Directions in Linear Algebra
Tensor Networks and Quantum State Representation
Tensor Networks are compact representations of high-dimensional tensors. An order-$N$ tensor with dimension $d$ per index has $d^N$ entries, which quickly becomes intractable; tensor networks factor it into contractions of many small tensors.

Matrix Product States (MPS) are the simplest tensor networks: $T_{i_1 i_2 \cdots i_N} = A^{(1)}_{i_1} A^{(2)}_{i_2} \cdots A^{(N)}_{i_N}$, where each $A^{(k)}_{i_k}$ is a small matrix, reducing storage from exponential to linear in $N$.
MPS is used in quantum physics to represent 1D quantum system ground states. DMRG (Density Matrix Renormalization Group) is an MPS-based variational method.
More complex tensor networks include PEPS (2D), MERA (Multi-scale Entanglement Renormalization), etc. They have important applications in quantum many-body physics and quantum machine learning.
Randomized Numerical Linear Algebra
Randomized methods are transforming numerical linear algebra. Traditional algorithms are deterministic; randomized algorithms use probabilistic methods for faster speeds.
Randomized SVD: Instead of computing the full SVD, use random projections for an approximate rank-$k$ factorization:
1. Generate a random Gaussian matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$, with small oversampling $p$
2. Form $Y = A\Omega$ to sample the range of $A$
3. Compute an orthonormal basis $Q$ of $Y$ (e.g. by QR)
4. Compute the SVD of the small matrix $B = Q^\top A$ and lift it back: $A \approx (Q U_B)\,\Sigma\, V^\top$
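The procedure translates almost line-for-line into NumPy; the oversampling parameter $p = 5$ below is a typical default, not a requirement:

```python
import numpy as np

def randomized_svd(A, k, p=5, seed=0):
    """Approximate rank-k SVD via random projection (a sketch of the
    Halko-Martinsson-Tropp scheme)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))    # random test matrix
    Y = A @ Omega                              # sample the range of A
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis for range(Y)
    B = Q.T @ A                                # small (k+p) x n matrix
    U_b, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_b)[:, :k], S[:k], Vt[:k]

# Test on an exactly rank-10 matrix of size 200 x 100.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 100))
U, S, Vt = randomized_svd(A, k=10)
print(np.allclose(U * S @ Vt, A))  # exact on a truly rank-10 matrix
```

The expensive factorization happens only on the small $(k+p) \times n$ matrix $B$, which is why the method scales to matrices far too large for a full SVD.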
Johnson-Lindenstrauss Lemma is the theoretical foundation: high-dimensional points can be randomly projected to low dimensions while approximately preserving pairwise distances.
Implicit Neural Representations
Implicit Neural Representations (INR) use neural networks to represent continuous signals (images, 3D shapes, videos). Given coordinates, the network outputs the value at that point.
For example, NeRF (Neural Radiance Fields) uses an MLP to represent 3D scenes: given a 3D position and viewing direction $(x, y, z, \theta, \phi)$, the network outputs a color and a density $(\mathbf{c}, \sigma)$.

INR's core is learning a mapping from coordinates to values. Positional Encoding (Fourier features) helps networks learn high-frequency details: $\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \dots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$.
Neural PDE Solvers
Physics-Informed Neural Networks (PINN) use neural networks to solve partial differential equations. The network directly parameterizes the solution function $u_\theta(x, t)$, and training minimizes a loss combining the PDE residual with boundary and initial conditions.
Automatic differentiation lets us easily compute derivatives of any order. It relies on the chain rule, which in linear-algebra terms is just the multiplication of Jacobian matrices.
Neural ODE views a deep network as a continuous dynamical system: $\frac{dh}{dt} = f(h(t), t, \theta)$, with the forward pass computed by an ODE solver.
Complete Series Review and Knowledge Map
Three Perspectives on Linear Algebra
Three ways of viewing linear algebra run through the entire series:
Algebraic Perspective: Matrices are arrays of numbers, operations follow specific rules. This is the foundation for computation.
Geometric Perspective: Matrices are linear transformations, vectors are arrows in space. This is the source of intuition.
Abstract Perspective: Vector spaces are sets satisfying axioms, linear maps preserve structure. This is the key to generalization.
The three perspectives complement each other. Algebra tells us "how to compute," geometry tells us "what it means," abstraction tells us "why it works."
Core Concept Network
Key Takeaways from Each Chapter
| Chapter | Topic | Core Insight |
|---|---|---|
| 1 | Vectors | Vectors have magnitude and direction, also elements of function spaces |
| 2 | Vector Spaces | Eight axioms define spaces where linear combinations make sense |
| 3 | Linear Transformations | Matrices and linear transformations correspond one-to-one; basis choice is key |
| 4 | Determinants | Determinant is signed volume scaling factor, zero iff singular |
| 5 | Linear Systems | Solution space structure determined by four fundamental subspaces |
| 6 | Eigenvalues | Eigenvectors are invariant directions, eigenvalues are scaling factors |
| 7 | Orthogonality | Inner products provide length and angle, orthogonal bases are best |
| 8 | Symmetric Matrices | Real symmetric matrices are orthogonally diagonalizable, all real eigenvalues |
| 9 | SVD | Any matrix decomposes as rotation-scaling-rotation |
| 10 | Norms & Condition | Condition number measures problem sensitivity |
| 11 | Matrix Calculus | Gradient points in direction of fastest change, chain rule is fundamental |
| 12 | Sparsity | L1 regularization induces sparsity, compressed sensing breaks Shannon limit |
| 13 | Tensors | Tensors are multidimensional arrays, decomposition reveals hidden structure |
| 14 | Random Matrices | High-dimensional randomness has surprising regularity (Marchenko-Pastur, semicircle) |
| 15 | Machine Learning | PCA maximizes variance, kernel methods are implicit high-dim mappings |
| 16 | Deep Learning | Neural networks are layered matrix multiplication plus nonlinearity |
| 17 | Computer Vision | Camera is projection matrix, 3D reconstruction is inverse problem |
| 18 | Frontiers | Quantum gates are unitary matrices, graph convolution is Laplacian filtering |
Most Important Theorems
Dimension Theorem:
Rank-Nullity Theorem: $\operatorname{rank}(A) + \dim \ker(A) = n$ for any $A \in \mathbb{R}^{m \times n}$
Spectral Theorem: Real symmetric matrices are orthogonally diagonalizable
SVD Existence: Every matrix, square or not, has a singular value decomposition
Eckart-Young Theorem: Truncated SVD is optimal low-rank approximation
Johnson-Lindenstrauss Lemma: High-dim points embed in low-dim with low distortion
Learning Advice and Resource Recommendations
Building Intuition
Visualization: Use GeoGebra, Manim (the library 3Blue1Brown uses), or write your own code to visualize linear transformations. Seeing how matrices warp grids is more intuitive than any formula.
Small Examples First: When learning new concepts, manually compute a few 2x2 or 3x3 examples. Only through hands-on calculation does true understanding emerge.
Ask "Why": Don't settle for "this formula is this way." Ask: Why is the determinant defined this way? Why are eigenvalues related to trace and determinant? Why does SVD always exist?
Connect to Applications: For each concept, think about where it's useful. Eigendecomposition for PageRank, SVD for recommender systems, orthogonal matrices for computer graphics...
Advanced Learning Paths
For Deeper Mathematics: - Abstract algebra (groups, rings, fields, modules) - Functional analysis (infinite-dimensional vector spaces) - Algebraic geometry (geometry of polynomial equations)
For Applications Focus: - Numerical Linear Algebra (Trefethen & Bau) - Convex Optimization (Boyd & Vandenberghe) - Statistical Learning Theory (Hastie, Tibshirani, Friedman)
For Research: - Random Matrix Theory - Tensor Decomposition and Multilinear Algebra - Quantum Information and Quantum Computing
Recommended Resources
Classic Textbooks: - Gilbert Strang's Introduction to Linear Algebra: Intuition-first, great for beginners - Sheldon Axler's Linear Algebra Done Right: Elegant proofs, abstract perspective - Trefethen & Bau's Numerical Linear Algebra: Essential for numerical methods - Golub & Van Loan's Matrix Computations: Engineer's bible
Online Courses: - MIT 18.06 (Gilbert Strang): Free on YouTube, classic of classics - 3Blue1Brown's Essence of Linear Algebra: Visual introduction, highly recommended first - Stanford CS229 (Machine Learning): Linear algebra in ML applications
Software Tools: - NumPy/SciPy: Python scientific computing - MATLAB/Octave: Traditional numerical computing - Julia: Modern scientific computing language - PyTorch/JAX: Deep learning frameworks with automatic differentiation
Papers and Surveys: - "The Matrix Cookbook": Matrix formula reference - "Randomized Numerical Linear Algebra" survey - "Attention is All You Need" (Transformer original paper)
![](前沿应用与总结/fig1.png)
![](前沿应用与总结/fig2.png)
![](前沿应用与总结/fig3.png)
![](前沿应用与总结/fig4.png)
![](前沿应用与总结/fig5.png)
![](前沿应用与总结/gif1_matmul.gif)
Exercises
Quantum Computing Basics
Exercise 1: Prove the Hadamard gate
Exercise 2: Compute
Exercise 3: Prove Pauli matrices
Exercise 4: Can the Bell state
Exercise 5: Design a quantum circuit that
transforms
Graph Neural Networks
Exercise 6: For the graph below, write its adjacency matrix:

```
1 -- 2
|    |
3 -- 4
```
Exercise 7: Compute the Laplacian matrix eigenvalues for the above graph, verify the smallest is 0, and explain its meaning.
Exercise 8: Prove for any graph signal
Exercise 9: Explain why the normalized
Laplacian
Exercise 10: In the GCN layer
Large Models and Efficient Computing
Exercise 11: In self-attention
Exercise 12: In LoRA, if original weight matrix
Exercise 13: Explain why KV cache speeds up
autoregressive generation. For sequence length
Exercise 14: Design an INT8 symmetric quantization
scheme for a tensor with weights in
Exercise 15: What's the complexity of sparse
attention (attending to only
Comprehensive Applications
Exercise 16: Design a simple GNN for molecular polarity prediction. - Input: Molecular graph (atoms are nodes, chemical bonds are edges) - Output: Polarity (0 or 1) - Describe what node features, edge features, and network structure you'd use
Exercise 17: Suppose you need to run a 7B parameter language model for inference on an 8GB VRAM GPU. Calculate: - Model weight space (FP16)? - With INT4 quantization? - Is it feasible? What else needs consideration?
Exercise 18: Derive how the GCN layer can be viewed
as a 1st-order Chebyshev approximation of spectral convolution. Starting
from spectral graph convolution definition, show how to arrive at
Exercise 19: Compare these three approaches for handling long sequences: - Sparse attention (local window) - Linear attention (kernel approximation) - Sliding window + global tokens
Analyze their pros, cons, and suitable scenarios.
Exercise 20: Design an architecture combining GNN and Transformer for molecular property prediction. Explain how to leverage both molecular graph structure and atomic sequence information.
Programming Practice
Exercise 21: Implement with NumPy: - Hadamard gate
and CNOT gate - Simulate a simple quantum circuit:
Exercise 23: Implement a LoRA layer, verify that
when
Exercise 24: Implement INT8 symmetric quantization and dequantization functions, test quantization error on a pretrained model's weights.
Exercise 25: Compare standard attention and sparse
attention (window size
Conclusion
Linear algebra is both an ancient and young discipline. Ancient because its basic concepts — vectors, matrices, linear transformations — have over two hundred years of history; young because it finds new life in each generation of technology.
From 19th century equation solving, to 20th century quantum mechanics, to 21st century machine learning and artificial intelligence, linear algebra remains the universal language of science and technology. Quantum computers use unitary matrices to describe quantum gates, graph neural networks use Laplacian matrices to propagate information, large language models use attention matrices to capture semantic relationships — the underlying mathematical essence remains the same.
Learning linear algebra isn't just learning computational techniques, it's learning a way of thinking: - Represent states with vectors - Represent transformations with matrices - Reveal structure through decomposition - Solve problems through optimization
I hope this series helps you: - Build solid conceptual foundations and geometric intuition - See the deep connections between linear algebra and modern technology - Gain motivation and direction for further study
Mathematics is not about memorization, but understanding. The beauty of linear algebra lies in its elegant structure and powerful applications.
Thank you for reading the complete "Essence of Linear Algebra" series!
This is Chapter 18 — the final chapter — of the 18-part "Essence of Linear Algebra" series.
- Post title: Essence of Linear Algebra (18): Frontiers and Summary
- Post author: Chen Kai
- Create time: 2019-03-30 16:15:00
- Post link: https://www.chenk.top/chapter-18-frontiers-and-summary/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.