The core mission of computer vision is to make machines "see" and understand images and videos. Remarkably, the field is built almost entirely on linear algebra: images themselves are matrices, geometric transformations are matrix multiplications, camera imaging is a projective transformation, and 3D reconstruction is solving systems of linear equations. Master linear algebra, and you master the mathematical core of computer vision.
Vector and Matrix Representation of Images
From Pixels to Matrices
When you take a photo with your smartphone, the sensor captures brightness values in tiny squares called pixels. A grayscale image can be directly represented as a matrix $I \in \mathbb{R}^{H \times W}$, where entry $I_{ij} \in [0, 255]$ is the brightness of the pixel in row $i$, column $j$.
Intuitive Understanding: Think of an image as a spreadsheet where each cell contains a number. Larger numbers mean brighter pixels at that location.
Tensor Representation of Color Images
Color images typically use RGB (Red, Green, Blue) three-channel representation, where each channel is a matrix. Stacking the three channels gives a tensor $I \in \mathbb{R}^{H \times W \times 3}$.
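A minimal NumPy sketch of this representation (using a small synthetic image rather than a real photo):

```python
import numpy as np

# Synthetic 4x6 RGB image: height 4, width 6, 3 channels
img = np.zeros((4, 6, 3), dtype=np.uint8)
img[:, :, 0] = 255  # a pure-red image

red_channel = img[:, :, 0]    # each channel is a 4x6 matrix
print(img.shape)              # (4, 6, 3)
print(red_channel.shape)      # (4, 6)
```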
Images as High-Dimensional Vectors
Sometimes, "flattening" an image into a long vector is more convenient for processing. An $m \times n$ image becomes a vector in $\mathbb{R}^{mn}$ by concatenating its rows (or columns).
Why do this? Many machine learning algorithms expect vector inputs. For example, PCA dimensionality reduction, support vector machines, and fully connected neural network input layers all require flattened images.
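A short sketch of flattening and restoring an image with NumPy (a toy matrix stands in for a real image):

```python
import numpy as np

# Flatten image
img = np.arange(12, dtype=np.float64).reshape(3, 4)  # a toy 3x4 "image"
vec = img.flatten()            # row-major flattening into R^12
restored = vec.reshape(3, 4)   # the operation is invertible

print(vec.shape)                       # (12,)
print(np.array_equal(img, restored))   # True
```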
Inner Products and Image Similarity
Since images can be viewed as vectors, vector inner products can measure image similarity. The cosine similarity between two images $x$ and $y$ (as flattened vectors) is
$$\cos\theta = \frac{\langle x, y \rangle}{\|x\|\,\|y\|},$$
which equals 1 for identical images (up to brightness scaling) and is close to 0 for unrelated ones.
Geometric Transformation Matrices
Why Use Matrices for Transformations?
Imagine using photo editing software to rotate, scale, and translate an image. These operations fundamentally transform pixel coordinates. Using matrices to represent transformations has two major benefits:
- Simple composition: Combining multiple transformations is just matrix multiplication
- Simple inverse: The inverse transformation is the matrix inverse
2D Rotation Matrix
The transformation matrix for counterclockwise rotation by angle $\theta$ is
$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$
Geometric Intuition: The rotation matrix sends basis vector $e_1 = (1, 0)$ to $(\cos\theta, \sin\theta)$ and $e_2 = (0, 1)$ to $(-\sin\theta, \cos\theta)$; the columns of $R(\theta)$ are exactly the rotated basis vectors.
Key Properties:
- $R(\theta)$ is orthogonal: $R(\theta)^{-1} = R(\theta)^T = R(-\theta)$
- $\det R(\theta) = 1$ (rotations preserve area and orientation)

```python
import numpy as np

def rotation_matrix(theta):
    """Create 2D rotation matrix"""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Rotate 45 degrees
theta = np.pi / 4
R = rotation_matrix(theta)

# Rotate a point
point = np.array([1, 0])
rotated = R @ point
print(f"After rotation: ({rotated[0]:.3f}, {rotated[1]:.3f})")  # (0.707, 0.707)
```
Scaling Matrix
Scaling by factors $s_x$ and $s_y$ along the axes is the diagonal matrix
$$S = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}.$$
Special Cases:
- $s_x = s_y$: uniform scaling
- a negative factor (e.g. $s_x = -1$, $s_y = 1$): reflection
- $s_x = s_y = 1$: the identity
Shear Matrix
Shear transformations "tilt" the image, like italic text:
$$H = \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix},$$
which maps $(x, y)$ to $(x + ky, y)$, shifting each row horizontally in proportion to its height.
Composing Transformations
Multiple transformations applied sequentially correspond to matrix multiplication from right to left: applying $A$ first and then $B$ sends $x$ to $B(Ax) = (BA)x$.
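A minimal NumPy sketch of composition (scale first, then rotate, matching the right-to-left rule):

```python
import numpy as np

# Combined transformation: first scale 2x, then rotate 45 degrees
theta = np.pi / 4
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
S = np.array([[2.0, 0.0], [0.0, 2.0]])

M = R @ S          # right-to-left: scale is applied first, then rotation
point = np.array([1.0, 0.0])
print(M @ point)   # same result as R @ (S @ point)
```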
Homogeneous Coordinate System
The Translation Problem
The transformations above (rotation, scaling, shear) are all linear transformations, expressible as $x' = Ax$ with a $2 \times 2$ matrix $A$. But translation $x' = x + t$ is not linear (it does not fix the origin), so no $2 \times 2$ matrix can represent it.
The Magic of Homogeneous Coordinates
Add an extra coordinate 1 after the 2D point: $(x, y)$ becomes $(x, y, 1)$. Translation by $(t_x, t_y)$ is then a matrix multiplication:
$$\begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} x + t_x \\ y + t_y \\ 1 \end{pmatrix}.$$
Intuitive Understanding: We "embed" the 2D plane into 3D space (placed on the plane $z = 1$); what looks like a translation in 2D is a linear shear of that 3D space.
General Affine Transformation Matrix
Rotation, scaling, shear, and translation can all be unified using $3 \times 3$ matrices of the block form
$$M = \begin{pmatrix} A & t \\ 0^T & 1 \end{pmatrix},$$
where $A$ is the $2 \times 2$ linear part and $t$ the translation vector.
Rotation Around an Arbitrary Point
Rotating around the origin is simple, but what about rotating around the image center or any point $(c_x, c_y)$? Compose three steps:
- Translate to move the rotation center to the origin
- Rotate around the origin
- Translate back
Combined: $M = T(c_x, c_y)\, R(\theta)\, T(-c_x, -c_y)$, read right to left.

```python
import cv2

def rotation_around_center(img, angle_deg):
    """Rotate around image center"""
    h, w = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    return cv2.warpAffine(img, M, (w, h))
```
Perspective Transformation and Homography Matrix
Limitations of Affine Transformations
Affine transformations preserve parallel lines, but in the real world, parallel railroad tracks "converge" in the distance. This near-big-far-small perspective effect requires a more general perspective transformation (projective transformation).
Homography Matrix
Homography uses a $3 \times 3$ matrix $H$ acting on homogeneous coordinates:
$$\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} \sim H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix},$$
where $\sim$ means equality up to scale; the image point is $(x'/w',\, y'/w')$. Because $H$ is defined only up to scale, it has 8 degrees of freedom.
Key Difference: Affine transformations have 6 degrees of freedom and fix the last row to $(0, 0, 1)$, so $w' = 1$ always; a homography's last row is unconstrained, and the resulting division by $w'$ is what makes parallel lines converge.
Estimating Homography Matrix
Given corresponding point pairs $(x_i, y_i) \leftrightarrow (x_i', y_i')$, we want the $H$ that maps one set to the other. Since $H$ has 8 degrees of freedom, 4 point pairs (no 3 collinear) determine it.
Each point pair provides two constraints (x and y directions).
DLT Algorithm (Direct Linear Transform): Rewrite the constraints as homogeneous linear equations $Ah = 0$ in the 9 entries of $H$; the least-squares solution under $\|h\| = 1$ is the right singular vector of $A$ associated with the smallest singular value.
Homography Applications
- Image Stitching: Estimate homography between adjacent images, align them to the same coordinate system
- Document Rectification: Transform tilted document/whiteboard photos into frontal view
- Augmented Reality: "Attach" virtual objects to detected planes
Camera Model and Projection Matrix
Pinhole Camera Model
A camera's working principle can be described with a simplified model: 3D scene points project through a small hole (pinhole) onto the imaging plane behind the camera.
Let a 3D point in camera coordinates be $(X, Y, Z)$ with $Z > 0$. With focal length $f$, similar triangles give the image coordinates
$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}.$$
The division by depth $Z$ is what makes far objects appear small.
Camera Intrinsic Matrix
Real cameras have additional parameters:
- Focal length may differ in x and y directions: $f_x$, $f_y$
- The principal point (where the optical axis meets the sensor) is at $(c_x, c_y)$
- A skew term $s$ (usually 0 on modern sensors)

These parameters form the intrinsic matrix
$$K = \begin{pmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}.$$
Camera Extrinsics
Extrinsics describe the camera's position and orientation in world coordinates, including:
- Rotation matrix $R$ (a $3 \times 3$ orthogonal matrix with $\det R = 1$)
- Translation vector $t$

They convert world coordinates to camera coordinates: $X_{cam} = R\,X_{world} + t$.
Complete Projection Equation
A 3D point $X_{world}$ (in homogeneous coordinates) projects to pixel $(u, v)$ via
$$\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K\,[R \mid t] \begin{pmatrix} X_{world} \\ 1 \end{pmatrix},$$
where $\lambda$ is the depth. The $3 \times 4$ matrix $P = K[R \mid t]$ is the projection matrix.
Physical Meaning:
1. $[R \mid t]$ moves the point into camera coordinates
2. Perspective division by the depth $\lambda$ projects it onto the image plane
3. $K$ converts image-plane coordinates to pixel coordinates
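The `project_3d_to_2d` helper can be sketched in plain NumPy (toy intrinsics; a point on the optical axis must land on the principal point):

```python
import numpy as np

def project_3d_to_2d(X_world, K, R, t):
    """Project a 3D world point to pixel coordinates (pinhole model)."""
    X_cam = R @ X_world + t   # world -> camera coordinates
    x = K @ X_cam             # apply intrinsics
    return x[:2] / x[2]       # perspective division by depth

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R = np.eye(3)
t = np.zeros(3)
uv = project_3d_to_2d(np.array([0.0, 0.0, 2.0]), K, R, t)
print(uv)  # (320, 240): a point on the optical axis hits the principal point
```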
Camera Calibration
Camera calibration aims to estimate the intrinsics $K$ and distortion coefficients from images of a known pattern; Zhang's checkerboard method is standard:
- Capture multiple checkerboard images (different angles)
- Detect checkerboard corners
- Establish 2D-3D correspondences
- Solve intrinsics with nonlinear optimization
Distortion Correction
Real lenses aren't ideal pinholes and produce distortion:
Radial Distortion (barrel/pincushion): with $r^2 = x^2 + y^2$ in normalized coordinates,
$$x_{dist} = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad y_{dist} = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6).$$
Tangential Distortion (lens not parallel to sensor):
$$x_{dist} = x + 2p_1 xy + p_2(r^2 + 2x^2), \qquad y_{dist} = y + p_1(r^2 + 2y^2) + 2p_2 xy.$$
Epipolar Geometry and Fundamental Matrix
Epipolar Constraint
Suppose the same 3D point is captured by two cameras, appearing at pixel positions $x_1$ and $x_2$. These positions are not independent; they satisfy the epipolar constraint
$$x_2^T F\, x_1 = 0,$$
where $F$ is the $3 \times 3$ fundamental matrix and $x_1$, $x_2$ are homogeneous pixel coordinates.
Geometric Intuition:
- The 3D point, the two camera centers, and the two image points: these 5 points are coplanar (the epipolar plane)
- The epipolar constraint is the algebraic expression of this coplanarity
Essential and Fundamental Matrices
Essential Matrix: works in normalized camera coordinates ($\hat{x} = K^{-1} x$) and encodes the relative pose directly: $E = [t]_\times R$, with constraint $\hat{x}_2^T E\, \hat{x}_1 = 0$.
Fundamental Matrix: works directly on pixel coordinates and absorbs the intrinsics as well, so it can be estimated from matches alone (the 8-point algorithm) without calibration.
The relationship between them: $F = K_2^{-T} E\, K_1^{-1}$.
Essential Matrix Decomposition
The essential matrix decomposes into rotation and translation: $E = [t]_\times R$, where $[t]_\times$ is the skew-symmetric cross-product matrix of $t$. An SVD $E = U \Sigma V^T$ yields four candidate $(R, t)$ pairs; the one placing reconstructed points in front of both cameras is correct.
Significance: From the essential matrix, we can recover the relative pose between two cameras (rotation and translation direction; translation scale is undetermined).
Epipolar Lines
Given a point $x_1$ in the first image, its match in the second image must lie on the epipolar line $l_2 = F\,x_1$ (and symmetrically $l_1 = F^T x_2$). This reduces correspondence search from 2D to a 1D search along a line.
3D Reconstruction
Triangulation Principle
Knowing two cameras' poses and a pair of corresponding points, we can recover the 3D point through triangulation.
Given two cameras with projection matrices $P_1$ and $P_2$ and an observed pair $x_1 \leftrightarrow x_2$, each observation gives $x_i \times (P_i X) = 0$, i.e. two independent linear equations in the unknown 3D point $X$. Stacking them yields a system $AX = 0$, solved via the SVD of the $4 \times 4$ matrix $A$.
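A single-point version of the lost `triangulate_points` helper, in plain NumPy (toy cameras: identical intrinsics, the second shifted along x; exact data recovers the point exactly):

```python
import numpy as np

def triangulate_point(x1, x2, P1, P2):
    """DLT triangulation of one point pair; x1, x2 are (u, v) pixels."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 5.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
X_rec = triangulate_point(x1, x2, P1, P2)
print(np.round(X_rec, 6))  # recovers X_true
```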
Structure from Motion (SfM)
SfM reconstructs 3D scene structure and camera poses from a set of images. Basic pipeline:
- Feature Extraction and Matching: Detect SIFT/ORB features in all images, perform pairwise matching
- Initialization: Select two images, estimate fundamental/essential matrix, triangulate initial points
- Incremental Reconstruction: Gradually add new images
- PnP to estimate new camera pose
- Triangulate new 3D points
- Bundle Adjustment: Global optimization of all parameters
Bundle Adjustment
Bundle Adjustment is the core optimization step in SfM, jointly optimizing all camera parameters $C_i$ and 3D point positions $X_j$ to minimize the reprojection error
$$\min_{C,\,X} \sum_{i,j} \left\| x_{ij} - \pi(C_i, X_j) \right\|^2,$$
where $x_{ij}$ is the observation of point $j$ in image $i$ and $\pi$ is the projection function.
This is a large-scale nonlinear least squares problem, typically solved with Levenberg-Marquardt. Since the Jacobian matrix has sparse structure (each observation only involves one camera and one 3D point), it can be solved efficiently.
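A heavily simplified sketch of the Bundle Adjustment concept using `scipy.optimize.least_squares`: cameras are held fixed and only the 3D points are refined (real BA optimizes both; the cameras and points here are toy values):

```python
import numpy as np
from scipy.optimize import least_squares

# Two fixed cameras and two 3D points observed in both views
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P = [K @ np.hstack([np.eye(3), np.array([[dx], [0.0], [0.0]])])
     for dx in (0.0, -1.0)]

X_true = np.array([[0.2, -0.1, 5.0], [-0.3, 0.4, 6.0]])
obs = []  # (camera index, point index, observed pixel)
for i, Pi in enumerate(P):
    for j, Xj in enumerate(X_true):
        x = Pi @ np.append(Xj, 1)
        obs.append((i, j, x[:2] / x[2]))

def residuals(points_flat):
    """Reprojection residuals for all observations."""
    Xs = points_flat.reshape(-1, 3)
    r = []
    for i, j, x_obs in obs:
        x = P[i] @ np.append(Xs[j], 1)
        r.extend(x[:2] / x[2] - x_obs)
    return np.array(r)

X0 = X_true + 0.2  # perturbed initialization
result = least_squares(residuals, X0.ravel(), method='lm')
print(np.round(result.x.reshape(-1, 3), 4))  # converges back to X_true
```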
Linear Algebra in SLAM
SLAM Problem Overview
SLAM (Simultaneous Localization and Mapping) solves: a robot moves in an unknown environment, simultaneously estimating its position and building an environment map.
This involves extensive linear algebra:
- Pose representation (rotation matrices, quaternions)
- State estimation (matrix operations in Kalman filtering)
- Graph optimization (solving sparse linear systems)
Lie Group Representation of Poses
3D rigid body transformations can be represented with $4 \times 4$ matrices forming the Lie group $SE(3)$:
$$T = \begin{pmatrix} R & t \\ 0^T & 1 \end{pmatrix}, \quad R \in SO(3),\; t \in \mathbb{R}^3.$$
Why use Lie algebra? Optimization requires derivatives with respect to parameters. Rotation matrices have constraints ($R^T R = I$, $\det R = 1$), so their 9 entries are not free parameters. The Lie algebra $\mathfrak{se}(3)$ provides a minimal, unconstrained 6D parameterization connected to the group by the exponential map.
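A minimal NumPy sketch of the $\mathfrak{so}(3) \to SO(3)$ exponential map via the Rodrigues formula (the rotation part of the $SE(3)$ story):

```python
import numpy as np

def hat(w):
    """so(3) hat operator: 3-vector -> skew-symmetric matrix."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_so3(w):
    """Exponential map so(3) -> SO(3) via the Rodrigues formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    W = hat(w / theta)
    return np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)

R = exp_so3(np.array([0.0, 0.0, np.pi / 2]))  # 90-degree rotation about z
print(np.round(R, 3))
```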
Extended Kalman Filter (EKF-SLAM)
EKF-SLAM models SLAM as a state estimation problem. The state vector includes the robot pose and all landmark positions:
$$x = \begin{pmatrix} x_{robot} \\ m_1 \\ \vdots \\ m_n \end{pmatrix}$$
Prediction Step (motion model):
$$\hat{x}_{k|k-1} = f(\hat{x}_{k-1}, u_k), \qquad P_{k|k-1} = F_k P_{k-1} F_k^T + Q_k$$
Update Step (observation model):
$$K_k = P_{k|k-1} H_k^T \left( H_k P_{k|k-1} H_k^T + R_k \right)^{-1}$$
$$\hat{x}_k = \hat{x}_{k|k-1} + K_k \left( z_k - h(\hat{x}_{k|k-1}) \right), \qquad P_k = (I - K_k H_k)\, P_{k|k-1}$$
Here $F_k$ and $H_k$ are the Jacobians of the motion and observation models, and $Q_k$, $R_k$ are the corresponding noise covariances.
Graph Optimization SLAM
Modern SLAM increasingly uses graph optimization. SLAM is modeled as a factor graph:
- Nodes: Robot poses, landmark positions
- Edges: Motion constraints, observation constraints
Optimization objective: minimize the sum of squared errors of all constraints,
$$\min_{x} \sum_{(i,j)} e_{ij}(x)^T\, \Omega_{ij}\, e_{ij}(x),$$
where $e_{ij}$ is the error of the constraint between nodes $i$ and $j$ and $\Omega_{ij}$ its information matrix. The resulting normal equations are sparse, because each constraint touches only two nodes.
Image Filtering in Matrix Form
Convolution as Matrix Multiplication
Image convolution is typically written as
$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n).$$
Because this is linear in $I$, flattening the image into a vector turns convolution into multiplication by a (doubly block-Toeplitz) matrix.
Why is this important? Because we can use linear algebra tools to analyze filter properties.
Common Filters
Mean Filter (blur):
$$K = \frac{1}{9} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$
Gaussian Filter (smoother blur): weights sampled from a 2D Gaussian, e.g.
$$K = \frac{1}{16} \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}$$
Laplacian Operator (edge detection):
$$K = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$
Sobel Operator (gradient):
$$G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \qquad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}$$

```python
import cv2
import numpy as np

# Various filters
img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)

# Gaussian blur
blurred = cv2.GaussianBlur(img, (5, 5), 1.0)

# Sobel gradient
grad_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
gradient_magnitude = np.sqrt(grad_x**2 + grad_y**2)

# Laplacian
laplacian = cv2.Laplacian(img, cv2.CV_64F)

# Custom kernel
kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])  # Sharpening
sharpened = cv2.filter2D(img, -1, kernel)
```
Frequency Domain Analysis of Filtering
The convolution theorem tells us: spatial-domain convolution equals frequency-domain multiplication, $\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$. A filter's frequency response therefore reveals which spatial frequencies it keeps or suppresses (low-pass for blurs, high-pass for edge detectors).
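A quick NumPy check of the theorem on 1D signals (circular convolution, which is what the DFT diagonalizes):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random(16)
g = rng.random(16)

# Circular convolution computed directly in the spatial domain
spatial = np.array([sum(f[m] * g[(n - m) % 16] for m in range(16))
                    for n in range(16)])

# Same result via pointwise multiplication in the frequency domain
freq = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

print(np.allclose(spatial, freq))  # True
```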
Feature Detection with Linear Algebra
Harris Corner Detection
The core of Harris corner detection is analyzing the autocorrelation matrix (structure tensor) around a point:
$$M = \sum_{(x,y) \in W} w(x, y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix},$$
where $I_x$, $I_y$ are image gradients and $w$ is a window weighting.
Key Observation: The eigenvalues $\lambda_1, \lambda_2$ of $M$ classify the point:
- both small: flat region
- one large, one small: edge
- both large: corner

Harris response avoids computing eigenvalues directly:
$$R = \det M - k\,(\operatorname{tr} M)^2 = \lambda_1 \lambda_2 - k\,(\lambda_1 + \lambda_2)^2, \quad k \approx 0.04.$$
PCA and Feature Descriptors
SIFT and similar feature descriptors compute statistics on local gradient histograms. PCA can reduce dimensions or extract principal directions.
For a set of data points $\{x_i\}$, subtract the mean $\mu$, form the covariance matrix $C = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T$, and take the eigenvectors with the largest eigenvalues as the principal directions; projecting onto the top $k$ of them reduces the descriptor to $k$ dimensions.
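A minimal NumPy sketch of this procedure (synthetic 2D data with one dominant direction):

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k principal directions."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / len(X)                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]           # top-k eigenvectors
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Y = pca(X, 1)      # keep only the dominant direction
print(Y.shape)     # (200, 1)
```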
Exercises
Basic Exercises
1. Image Matrix Operations
A
2. Rotation Matrix Properties
Prove:
- (a) The 2D rotation matrix has determinant 1
- (b) The 2D rotation matrix's inverse equals its transpose
- (c) The product of two rotation matrices is still a rotation matrix, and $R(\alpha)R(\beta) = R(\alpha + \beta)$
3. Homogeneous Transformations
Express the following operations with $3 \times 3$ homogeneous transformation matrices:
Advanced Exercises
4. Homography Matrix
Given four point correspondences:
- Explain why 4 point pairs are needed to determine a homography matrix
- Set up the linear system for solving $H$ using DLT
5. Essential Matrix Decomposition
Given essential matrix
Hint: Use SVD
6. Triangulation
Two cameras have projection matrices:
Programming Exercises
7. Implement image affine transformation from scratch (build the transformation matrix manually).
8. Implement fundamental matrix estimation using the 8-point algorithm.
9. Implement Harris corner detection with non-maximum suppression.
Application Problems
10. Panorama Stitching
Design a panoramic image stitching system. Discuss: - Why can homography matrices describe relationships between different viewpoint images? - How to handle accumulated errors? - How to handle exposure differences?
11. Simple Visual Odometry
Implement a simple monocular visual odometry. Discuss: - Scale drift problem in monocular VO - How to detect and handle tracking failures?
Chapter Summary
Computer vision and linear algebra are inseparable:
Image Representation:
- Images are matrices, color images are tensors
- Images can be flattened to vectors for machine learning

Geometric Transformations:
- Rotation, scaling, shear are linear transformations
- Homogeneous coordinates make translation expressible as matrices
- Perspective transformations use homography matrices

Cameras and 3D:
- Projection matrices map 3D points to 2D images
- Epipolar geometry constrains corresponding point relationships
- Triangulation and SfM recover 3D structure

Optimization and Estimation:
- State estimation in SLAM relies on matrix operations
- Lie groups/algebras handle rotation optimization
- Bundle Adjustment solves sparse linear systems

Signal Processing:
- Convolution can be written as matrix multiplication
- Fourier transform analyzes filters
- Deep learning heavily uses matrix operations
Master these concepts, and you master computer vision's mathematical foundations. Next steps could be multiple view geometry, visual SLAM, and 3D reconstruction.
References
- Hartley, R., & Zisserman, A. Multiple View Geometry in Computer Vision. Cambridge University Press.
- Szeliski, R. Computer Vision: Algorithms and Applications. Springer.
- Forsyth, D. A., & Ponce, J. Computer Vision: A Modern Approach. Pearson.
- Strang, G. Introduction to Linear Algebra. Wellesley-Cambridge Press.
- Barfoot, T. D. State Estimation for Robotics. Cambridge University Press.
This is Chapter 17 of the 18-part "Essence of Linear Algebra" series.
- Post title: Essence of Linear Algebra (17): Linear Algebra in Computer Vision
- Post author: Chen Kai
- Create time: 2019-03-26 10:00:00
- Post link: https://www.chenk.top/chapter-17-linear-algebra-in-computer-vision/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.