Essence of Linear Algebra (17): Linear Algebra in Computer Vision
Chen Kai

The core mission of computer vision is to make machines "see" and understand images and videos. Remarkably, this entire field is built almost entirely on linear algebra: images themselves are matrices, geometric transformations are matrix multiplications, camera imaging is projective transformation, and 3D reconstruction is solving systems of linear equations. Master linear algebra, and you master the mathematical core of computer vision.

Vector and Matrix Representation of Images

From Pixels to Matrices

Image as Matrix

When you take a photo with your smartphone, the sensor captures brightness values in tiny squares called pixels. An $m \times n$ grayscale image can be directly represented as a matrix:

$$I = \begin{bmatrix} I_{1,1} & I_{1,2} & \cdots & I_{1,n} \\ I_{2,1} & I_{2,2} & \cdots & I_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ I_{m,1} & I_{m,2} & \cdots & I_{m,n} \end{bmatrix}$$

where $I_{i,j}$ represents the grayscale value at row $i$, column $j$. Values typically range over $[0, 255]$ (8-bit images) or $[0, 1]$ (after normalization).

Intuitive Understanding: Think of an image as a spreadsheet where each cell contains a number. Larger numbers mean brighter pixels at that location.

Tensor Representation of Color Images

Color images typically use an RGB (Red, Green, Blue) three-channel representation, where each channel is an $m \times n$ matrix: $I_R, I_G, I_B \in \mathbb{R}^{m \times n}$. This forms a 3-dimensional tensor with shape $m \times n \times 3$. In Python/NumPy, a $640 \times 480$ color image is an array of shape $(480, 640, 3)$.

import numpy as np
import cv2

# Read image
img = cv2.imread('photo.jpg')
print(f"Image shape: {img.shape}") # (480, 640, 3)

# Split channels
B, G, R = cv2.split(img)
print(f"Single channel shape: {R.shape}") # (480, 640)

# Convert to grayscale (weighted average)
gray = 0.299 * R + 0.587 * G + 0.114 * B

Images as High-Dimensional Vectors

Sometimes, "flattening" an image into a long vector is more convenient for processing. An $m \times n$ grayscale image can become an $mn$-dimensional vector:

$$\mathbf{x} = \mathrm{vec}(I) = (I_{1,1}, \ldots, I_{1,n}, I_{2,1}, \ldots, I_{m,n})^\top \in \mathbb{R}^{mn}$$

Why do this? Many machine learning algorithms expect vector inputs. For example, PCA dimensionality reduction, support vector machines, and fully connected neural network input layers all require flattened images.

# Flatten image
img_flat = gray.flatten() # or gray.reshape(-1)
print(f"Vector dimension: {img_flat.shape}") # (307200,) for 640x480 image

# Restore to matrix
img_restored = img_flat.reshape(480, 640)

Inner Products and Image Similarity

Since images can be viewed as vectors, vector inner products can measure image similarity. The cosine similarity between two images $\mathbf{x}$ and $\mathbf{y}$ is:

$$\cos\theta = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\|\,\|\mathbf{y}\\|}$$

This is widely used in image retrieval and face recognition. Two similar images have corresponding vectors with a small angle between them, with cosine value close to 1.
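As a quick sketch, here is cosine similarity between two small images flattened to vectors (the arrays and the `cosine_similarity` helper are illustrative, not from a library):

```python
import numpy as np

def cosine_similarity(img_a, img_b):
    """Cosine similarity of two images viewed as flattened vectors."""
    x = img_a.astype(np.float64).ravel()
    y = img_b.astype(np.float64).ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([[10, 20],
              [30, 40]])
b = 2 * a            # same pattern, doubled brightness
c = a[::-1, ::-1]    # same pixels, rearranged

print(cosine_similarity(a, b))   # ≈ 1.0: brightness scaling keeps the angle at 0
print(cosine_similarity(a, c))   # < 1.0: different spatial arrangement
```

Note that uniform brightness changes leave the angle unchanged, which is exactly why cosine similarity is more robust than raw Euclidean distance for retrieval.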

Geometric Transformation Matrices

2D Transformations

Why Use Matrices for Transformations?

Imagine using photo editing software to rotate, scale, and translate an image. These operations fundamentally transform pixel coordinates. Using matrices to represent transformations has two major benefits:

  1. Simple composition: Combining multiple transformations is just matrix multiplication
  2. Simple inverse: The inverse transformation is the matrix inverse

2D Rotation Matrix

The transformation matrix for counterclockwise rotation by angle $\theta$ around the origin is:

$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

Geometric Intuition: The rotation matrix transforms basis vector $(1, 0)^\top$ to $(\cos\theta, \sin\theta)^\top$, and $(0, 1)^\top$ to $(-\sin\theta, \cos\theta)^\top$. These are precisely the new basis vectors after rotation!

Key Properties:
- $R$ is an orthogonal matrix: $R^\top R = I$
- Inverse equals transpose: $R^{-1} = R^\top$
- Determinant is 1: $\det R(\theta) = 1$

import numpy as np

def rotation_matrix(theta):
    """Create 2D rotation matrix"""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Rotate 45 degrees
theta = np.pi / 4
R = rotation_matrix(theta)

# Rotate a point
point = np.array([1, 0])
rotated = R @ point
print(f"After rotation: ({rotated[0]:.3f}, {rotated[1]:.3f})") # (0.707, 0.707)
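The key properties listed above can also be checked numerically; a minimal sketch:

```python
import numpy as np

theta = np.pi / 3
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s],
              [s,  c]])

assert np.allclose(R.T @ R, np.eye(2))       # orthogonal: R^T R = I
assert np.allclose(np.linalg.inv(R), R.T)    # inverse equals transpose
assert np.isclose(np.linalg.det(R), 1.0)     # determinant is 1
```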

Scaling Matrix

Scaling by $s_x$ along the x-axis and $s_y$ along the y-axis:

$$S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}$$

Special Cases:
- $s_x = s_y$: Uniform scaling (preserves aspect ratio)
- $s_x = -1,\ s_y = 1$: Reflection across the y-axis
- $s_x = s_y = -1$: Point reflection (180° rotation)

Shear Matrix

Shear transformations "tilt" the image, like italic text:

$$H = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$$

Parameter $k$ controls the tilt amount: $k = 0$ is the identity, $k > 0$ tilts right, $k < 0$ tilts left.

Composing Transformations

Multiple transformations applied sequentially correspond to matrix multiplication from right to left:

$$\mathbf{x}' = R(\theta)\, S\, \mathbf{x}$$

This means first scale ($S$), then rotate ($R(\theta)$). Order matters! Rotating then scaling usually gives different results (unless scaling is uniform).

# Combined transformation: first scale 2x, then rotate 45 degrees
scale = 2
theta = np.pi / 4

S = np.array([[scale, 0], [0, scale]])
R = rotation_matrix(theta)

# Combined matrix (note order)
T = R @ S # S first, then R

# Apply to points
points = np.array([[1, 0], [0, 1], [1, 1]]).T # 2x3 matrix
transformed = T @ points

Homogeneous Coordinate System

The Translation Problem

The transformations above (rotation, scaling, shear) are all linear transformations, expressible as $\mathbf{x}' = A\mathbf{x}$. But translation doesn't fit this form:

$$\mathbf{x}' = \mathbf{x} + \mathbf{t}$$

This is an affine transformation, not a linear transformation (because it doesn't preserve the origin). To unify all transformations as matrix multiplication, we introduce homogeneous coordinates.

The Magic of Homogeneous Coordinates

Add an extra coordinate 1 after the 2D point: $(x, y) \mapsto (x, y, 1)$. Now translation can be written as matrix multiplication:

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

Intuitive Understanding: We "embed" the 2D plane into 3D space (placed on the $z = 1$ plane). In this higher-dimensional space, 2D translation becomes a 3D "shear," which is a linear transformation.
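A minimal numeric check of this idea, with illustrative values: translation really is a single matrix multiplication once the point is written in homogeneous form.

```python
import numpy as np

tx, ty = 3.0, -2.0
T = np.array([[1.0, 0.0, tx],
              [0.0, 1.0, ty],
              [0.0, 0.0, 1.0]])

p = np.array([5.0, 7.0, 1.0])   # the point (5, 7) in homogeneous form
p2 = T @ p                      # translated point, still homogeneous
# p2[:2] is (8, 5): the point moved by (3, -2)
```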

General Affine Transformation Matrix

Rotation, scaling, shear, and translation can all be unified using $3 \times 3$ homogeneous matrices:

$$T = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{bmatrix}$$

The upper-left $2 \times 2$ block $A$ handles linear transformations, and the upper-right column $\mathbf{t} = (t_x, t_y)^\top$ handles translation.

import cv2
import numpy as np

def affine_matrix(rotation_deg, scale, translation):
    """Build affine transformation matrix"""
    theta = np.radians(rotation_deg)
    c, s = np.cos(theta), np.sin(theta)
    sx, sy = scale
    tx, ty = translation

    return np.array([
        [sx * c, -sy * s, tx],
        [sx * s, sy * c, ty],
        [0, 0, 1]
    ])

# Create transformation: rotate 30°, scale 1.5x, translate (100, 50)
M = affine_matrix(30, (1.5, 1.5), (100, 50))

# Apply to image (OpenCV uses 2x3 matrix)
img = cv2.imread('input.jpg')
rows, cols = img.shape[:2]
M_cv = M[:2, :] # Take first two rows
result = cv2.warpAffine(img, M_cv, (cols, rows))

Rotation Around an Arbitrary Point

Rotating around the origin is simple, but what about rotating around the image center or any point? Three steps:

  1. Translate to move the rotation center to the origin
  2. Rotate around the origin
  3. Translate back

Combined:

$$M = T_{+\mathbf{c}}\, R(\theta)\, T_{-\mathbf{c}}$$

where $\mathbf{c}$ is the rotation center and $T_{\pm\mathbf{c}}$ are the translation matrices.

def rotation_around_center(img, angle_deg):
    """Rotate around image center"""
    h, w = img.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    return cv2.warpAffine(img, M, (w, h))
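The three-step composition can also be assembled by hand in homogeneous coordinates; a sketch (the helper name `rotation_around_point` is mine, and for scale 1 this is essentially what `cv2.getRotationMatrix2D` computes):

```python
import numpy as np

def rotation_around_point(cx, cy, theta):
    """Homogeneous 3x3 matrix: rotate by theta around the point (cx, cy)."""
    c, s = np.cos(theta), np.sin(theta)
    to_origin = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]])  # step 1
    rotate    = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])     # step 2
    back      = np.array([[1, 0, cx], [0, 1, cy], [0, 0, 1]])    # step 3
    return back @ rotate @ to_origin   # rightmost matrix applies first

M = rotation_around_point(2.0, 2.0, np.pi / 2)
p = M @ np.array([3.0, 2.0, 1.0])   # rotate (3, 2) by 90° around (2, 2)
# p[:2] is (2, 3): the point swings a quarter turn around the center
```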

Perspective Transformation and Homography Matrix

Limitations of Affine Transformations

Affine transformations preserve parallel lines, but in the real world, parallel railroad tracks "converge" in the distance. This foreshortening effect (near objects appear larger, far objects smaller) requires a more general perspective transformation (projective transformation).

Homography Matrix

Homography uses a $3 \times 3$ matrix $H$ to describe the projective relationship between two planes:

$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

Final 2D coordinates are obtained by dividing by $w'$: $u = x'/w'$, $v = y'/w'$.

Key Difference: Affine transformations have $(0, 0, 1)$ in the last row, but perspective transformations allow non-zero $h_{31}, h_{32}$, introducing the "perspective division" that makes the transformation nonlinear.

Estimating Homography Matrix

Given corresponding point pairs $(\mathbf{x}_i, \mathbf{x}_i')$ in two images, how do we estimate $H$?

Each point pair provides two constraints (x and y directions). $H$ has 8 degrees of freedom (9 elements, but the overall scale can be fixed), so theoretically 4 point pairs suffice. In practice, noise exists, requiring more points and robust estimation (like RANSAC).

DLT Algorithm (Direct Linear Transform): Rewrite the constraints as a homogeneous linear system $A\mathbf{h} = \mathbf{0}$, and solve with SVD.
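A bare-bones sketch of the DLT step (no coordinate normalization or RANSAC, which a practical implementation would add; `dlt_homography` is an illustrative helper, not a library function):

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H from >= 4 point pairs by SVD of the stacked DLT system."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)   # right singular vector of smallest singular value
    return H / H[2, 2]

# Recover a known homography from exact correspondences
H_true = np.array([[1.0, 0.2, 5.0],
                   [0.1, 1.1, -3.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [0.0, 100.0]])
proj = np.column_stack([src, np.ones(4)]) @ H_true.T
dst = proj[:, :2] / proj[:, 2:]
H_est = dlt_homography(src, dst)
print(np.allclose(H_est, H_true, atol=1e-6))  # True
```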

import cv2
import numpy as np

# Assume we have feature matching points from two images
src_pts = np.array([[100, 100], [200, 100], [200, 200], [100, 200]], dtype=np.float32)
dst_pts = np.array([[120, 90], [220, 110], [210, 220], [90, 210]], dtype=np.float32)

# Compute homography matrix
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)

# Apply perspective transformation
img = cv2.imread('input.jpg')
h, w = img.shape[:2]
result = cv2.warpPerspective(img, H, (w, h))

Homography Applications

  1. Image Stitching: Estimate homography between adjacent images, align them to the same coordinate system
  2. Document Rectification: Transform tilted document/whiteboard photos into frontal view
  3. Augmented Reality: "Attach" virtual objects to detected planes
def rectify_document(img, corners):
    """
    Document rectification: transform quadrilateral region to rectangle
    corners: four corner coordinates of document [top-left, top-right, bottom-right, bottom-left]
    """
    # Target rectangle size (A4 ratio)
    width, height = 595, 842

    src = np.array(corners, dtype=np.float32)
    dst = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=np.float32)

    H = cv2.getPerspectiveTransform(src, dst)
    rectified = cv2.warpPerspective(img, H, (width, height))
    return rectified

Camera Model and Projection Matrix

Pinhole Camera Model


A camera's working principle can be described with a simplified model: 3D scene points project through a small hole (pinhole) onto the imaging plane behind the camera.

Let a 3D point in camera coordinates be $(X, Y, Z)$ with focal length $f$. Projection onto the image plane gives:

$$x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z}$$

This is the basic perspective projection formula. Division by $Z$ (depth) is the source of the foreshortening ("near objects appear larger") effect.

Camera Intrinsic Matrix

Real cameras have additional parameters:
- Focal length may differ in the x and y directions: $f_x, f_y$
- Principal point (image plane origin) not at the image center: $(c_x, c_y)$
- Possible pixel skew $s$ (usually 0)

These parameters form the intrinsic matrix:

$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where $s$ is the skew parameter, approximately 0 for most modern cameras.

Camera Extrinsics

Extrinsics describe the camera's position and orientation in world coordinates, including:
- Rotation matrix $R$ ($3 \times 3$ orthogonal matrix)
- Translation vector $\mathbf{t}$ ($3 \times 1$)

Complete Projection Equation

A 3D point $\mathbf{X}_w$ (homogeneous coordinates) in world coordinates projects to pixel coordinates $\mathbf{x}$:

$$\lambda \mathbf{x} = K [R \mid \mathbf{t}]\, \mathbf{X}_w = P\, \mathbf{X}_w$$

where $\lambda$ is depth (scale factor), and $P = K[R \mid \mathbf{t}]$ is the $3 \times 4$ projection matrix.

Physical Meaning:
1. $[R \mid \mathbf{t}]$: Transform from world coordinates to camera coordinates
2. $K$: Project from camera coordinates to pixel coordinates

def project_3d_to_2d(X_world, K, R, t):
    """
    Project 3D points to 2D image plane
    X_world: (N, 3) points in world coordinates
    K: (3, 3) intrinsic matrix
    R: (3, 3) rotation matrix
    t: (3,) translation vector
    """
    # Transform to camera coordinates
    X_cam = (R @ X_world.T).T + t # (N, 3)

    # Project (divide by depth)
    x = X_cam[:, 0] / X_cam[:, 2]
    y = X_cam[:, 1] / X_cam[:, 2]

    # Convert to pixel coordinates
    u = K[0, 0] * x + K[0, 2]
    v = K[1, 1] * y + K[1, 2]

    return np.stack([u, v], axis=1)

Camera Calibration

Camera calibration aims to estimate the intrinsics $K$ and the distortion coefficients. The classic method is Zhang's calibration:

  1. Capture multiple checkerboard images (different angles)
  2. Detect checkerboard corners
  3. Establish 2D-3D correspondences
  4. Solve intrinsics with nonlinear optimization
import cv2
import numpy as np
import glob

# Checkerboard size (inner corners)
CHESS_SIZE = (9, 6)
SQUARE_SIZE = 25.0 # Each square's side length (mm)

# Prepare 3D point coordinates (assume checkerboard on z=0 plane)
objp = np.zeros((CHESS_SIZE[0] * CHESS_SIZE[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:CHESS_SIZE[0], 0:CHESS_SIZE[1]].T.reshape(-1, 2)
objp *= SQUARE_SIZE

# Store points from all images
objpoints = [] # 3D points
imgpoints = [] # 2D points

# Read calibration images
images = glob.glob('calibration/*.jpg')
for fname in images:
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect corners
    ret, corners = cv2.findChessboardCorners(gray, CHESS_SIZE, None)

    if ret:
        # Subpixel refinement
        corners2 = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1),
            criteria=(cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))

        objpoints.append(objp)
        imgpoints.append(corners2)

# Perform calibration
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None)

print("Intrinsic matrix K:")
print(K)
print(f"\nDistortion coefficients: {dist.ravel()}")

Distortion Correction

Real lenses aren't ideal pinholes and produce distortion:

Radial Distortion (barrel/pincushion):

$$x_d = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad y_d = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

Tangential Distortion (lens not parallel to sensor):

$$x_d = x + 2p_1 xy + p_2(r^2 + 2x^2), \qquad y_d = y + p_1(r^2 + 2y^2) + 2p_2 xy$$

where $r^2 = x^2 + y^2$.

# Undistort
undistorted = cv2.undistort(img, K, dist)

# Or get mapping (faster for batch processing)
h, w = img.shape[:2]
map1, map2 = cv2.initUndistortRectifyMap(K, dist, None, K, (w, h), cv2.CV_32FC1)
undistorted = cv2.remap(img, map1, map2, cv2.INTER_LINEAR)

Epipolar Geometry and Fundamental Matrix


Epipolar Constraint

Suppose the same 3D point is captured by two cameras, appearing at positions $\mathbf{x}_1$ and $\mathbf{x}_2$ in the two images. A geometric constraint exists between these two points, the epipolar constraint:

$$\mathbf{x}_2^\top F\, \mathbf{x}_1 = 0$$

where $F$ is the $3 \times 3$ Fundamental Matrix.

Geometric Intuition:
- The 3D point, the two camera centers, and the two image points are coplanar (the epipolar plane)
- The epipolar constraint is the algebraic expression of this coplanarity

Essential and Fundamental Matrices

Essential Matrix $E$: Describes the relationship between two calibrated cameras (using normalized coordinates):

$$\hat{\mathbf{x}}_2^\top E\, \hat{\mathbf{x}}_1 = 0$$

where $\hat{\mathbf{x}}_i = K_i^{-1} \mathbf{x}_i$ are normalized coordinates.

Fundamental Matrix $F$: Describes the relationship between two uncalibrated cameras (using pixel coordinates): $\mathbf{x}_2^\top F\, \mathbf{x}_1 = 0$

The relationship between them:

$$F = K_2^{-\top} E\, K_1^{-1}$$

Essential Matrix Decomposition

The essential matrix decomposes into rotation and translation:

$$E = [\mathbf{t}]_\times R$$

where $[\mathbf{t}]_\times$ is the skew-symmetric matrix of the translation vector:

$$[\mathbf{t}]_\times = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix}$$

Significance: From the essential matrix, we can recover the relative pose between two cameras (rotation and translation direction; translation scale is undetermined).
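A sketch of the standard SVD-based decomposition (it returns both rotation candidates; `cv2.recoverPose` additionally applies a cheirality check to pick the physically valid one of the four pose hypotheses; the helper names here are mine):

```python
import numpy as np

def decompose_essential(E):
    """Candidate rotations and translation direction from an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations (det = +1); E is only defined up to sign anyway
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    t = U[:, 2]          # translation direction, up to sign and scale
    return R1, R2, t

# Check on a synthetic E built from a known pose
def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

th = 0.3
R_true = np.array([[np.cos(th), -np.sin(th), 0],
                   [np.sin(th),  np.cos(th), 0],
                   [0, 0, 1]])
t_true = np.array([1.0, 2.0, 3.0])
t_true /= np.linalg.norm(t_true)

R1, R2, t = decompose_essential(skew(t_true) @ R_true)
# One of the two rotation candidates matches R_true; t matches up to sign
```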

import cv2
import numpy as np

def compute_fundamental_matrix(pts1, pts2):
    """
    Compute fundamental matrix from corresponding points
    pts1, pts2: (N, 2) corresponding point coordinates
    """
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    return F, mask.ravel().astype(bool)

def compute_essential_matrix(pts1, pts2, K):
    """
    Compute essential matrix from corresponding points
    """
    E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    return E, mask.ravel().astype(bool)

def recover_pose(E, pts1, pts2, K):
    """
    Recover camera pose from essential matrix
    """
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K)
    return R, t

Epipolar Lines

Given a point $\mathbf{x}_1$ in the first image, its corresponding point in the second image must lie on a specific line, the epipolar line:

$$\mathbf{l}_2 = F\, \mathbf{x}_1$$

Conversely, given $\mathbf{x}_2$, the epipolar line for $\mathbf{x}_1$ in the first image is $\mathbf{l}_1 = F^\top \mathbf{x}_2$.

def draw_epipolar_lines(img1, img2, pts1, pts2, F):
    """Draw epipolar lines"""
    lines1 = cv2.computeCorrespondEpilines(pts2.reshape(-1, 1, 2), 2, F).reshape(-1, 3)
    lines2 = cv2.computeCorrespondEpilines(pts1.reshape(-1, 1, 2), 1, F).reshape(-1, 3)

    h, w = img1.shape[:2]

    for r, pt in zip(lines1, pts1):
        color = tuple(np.random.randint(0, 255, 3).tolist())
        x0, y0 = map(int, [0, -r[2]/r[1]])
        x1, y1 = map(int, [w, -(r[2] + r[0]*w)/r[1]])
        cv2.line(img1, (x0, y0), (x1, y1), color, 1)
        cv2.circle(img1, tuple(pt.astype(int)), 5, color, -1)

    return img1, img2

3D Reconstruction

Triangulation Principle

Knowing two cameras' poses and a pair of corresponding points, we can recover the 3D point through triangulation.

Given two cameras with projection matrices $P_1$ and $P_2$, and corresponding image points $\mathbf{x}_1$ and $\mathbf{x}_2$, the 3D point $\mathbf{X}$ satisfies:

$$\mathbf{x}_1 \times (P_1 \mathbf{X}) = \mathbf{0}, \qquad \mathbf{x}_2 \times (P_2 \mathbf{X}) = \mathbf{0}$$

Each cross product yields two independent linear equations, giving four equations in the homogeneous coordinates of $\mathbf{X}$, which can be solved for $\mathbf{X}$ (up to scale) as a least-squares problem.

def triangulate_points(pts1, pts2, P1, P2):
    """
    Triangulate to recover 3D points
    pts1, pts2: (N, 2) image point coordinates
    P1, P2: (3, 4) projection matrices
    """
    pts1_h = pts1.T # (2, N)
    pts2_h = pts2.T

    # OpenCV function returns homogeneous coordinates
    X_4d = cv2.triangulatePoints(P1, P2, pts1_h, pts2_h) # (4, N)

    # Convert to 3D coordinates
    X_3d = X_4d[:3] / X_4d[3] # (3, N)
    return X_3d.T # (N, 3)
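The linear system can also be solved directly with SVD; a sketch for a single point pair (essentially what linear triangulation does internally; `triangulate_dlt` is an illustrative helper):

```python
import numpy as np

def triangulate_dlt(x1, x2, P1, P2):
    """Linear triangulation of one correspondence via SVD."""
    # Two rows per camera: x * (P row 3) - (P row 1/2), from the cross product
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null vector = homogeneous 3D point
    return X[:3] / X[3]

# Two axis-aligned cameras and a known 3D point
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 3.0])
x1 = P1 @ np.append(X_true, 1.0)
x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0)
x2 = x2[:2] / x2[2]
print(np.allclose(triangulate_dlt(x1, x2, P1, P2), X_true))  # True
```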

Structure from Motion (SfM)

SfM reconstructs 3D scene structure and camera poses from a set of images. Basic pipeline:

  1. Feature Extraction and Matching: Detect SIFT/ORB features in all images, perform pairwise matching
  2. Initialization: Select two images, estimate fundamental/essential matrix, triangulate initial points
  3. Incremental Reconstruction: Gradually add new images
    • PnP to estimate new camera pose
    • Triangulate new 3D points
  4. Bundle Adjustment: Global optimization of all parameters

Bundle Adjustment

Bundle Adjustment is the core optimization step in SfM, jointly optimizing all camera parameters and 3D point positions to minimize reprojection error:

$$\min_{\{R_j, \mathbf{t}_j\},\, \{\mathbf{X}_i\}} \sum_{i,j} \rho\!\left( \left\| \pi(R_j, \mathbf{t}_j, \mathbf{X}_i) - \mathbf{x}_{ij} \right\|^2 \right)$$

where:
- $\pi(\cdot)$ is the projection function
- $\rho(\cdot)$ is a robust kernel function (e.g., Huber) for handling outliers
- The sum runs over all visible point-camera pairs

This is a large-scale nonlinear least squares problem, typically solved with Levenberg-Marquardt. Since the Jacobian matrix has sparse structure (each observation only involves one camera and one 3D point), it can be solved efficiently.

# Simplified Bundle Adjustment concept using scipy
from scipy.optimize import least_squares

def bundle_adjustment(camera_params, points_3d, points_2d, camera_indices, point_indices, K):
    """
    Bundle Adjustment optimization
    camera_params: (n_cameras, 6) [rvec, tvec]
    points_3d: (n_points, 3)
    points_2d: (n_observations, 2)
    camera_indices: (n_observations,) camera index for each observation
    point_indices: (n_observations,) 3D point index for each observation
    """
    def residual(params):
        n_cameras = len(np.unique(camera_indices))

        cam_params = params[:n_cameras * 6].reshape(-1, 6)
        pts_3d = params[n_cameras * 6:].reshape(-1, 3)

        residuals = []
        for i, (cam_idx, pt_idx) in enumerate(zip(camera_indices, point_indices)):
            rvec = cam_params[cam_idx, :3]
            tvec = cam_params[cam_idx, 3:6]

            # Project
            R, _ = cv2.Rodrigues(rvec)
            pt_cam = R @ pts_3d[pt_idx] + tvec
            pt_proj = K @ pt_cam
            pt_2d_pred = pt_proj[:2] / pt_proj[2]

            residuals.extend(pt_2d_pred - points_2d[i])

        return np.array(residuals)

    x0 = np.hstack([camera_params.ravel(), points_3d.ravel()])
    result = least_squares(residual, x0, method='lm')
    return result

Linear Algebra in SLAM

SLAM Problem Overview

SLAM (Simultaneous Localization and Mapping) solves: a robot moves in an unknown environment, simultaneously estimating its position and building an environment map.

This involves extensive linear algebra:
- Pose representation (rotation matrices, quaternions)
- State estimation (matrix operations in Kalman filtering)
- Graph optimization (solving sparse linear systems)

Lie Group Representation of Poses

3D rigid body transformations can be represented with $4 \times 4$ transformation matrices:

$$T = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} \in SE(3)$$

where $R \in SO(3)$ is the rotation matrix and $\mathbf{t} \in \mathbb{R}^3$ is the translation vector. $SE(3)$ forms a Lie group, with the corresponding Lie algebra $\mathfrak{se}(3)$ being 6-dimensional:

$$\boldsymbol{\xi} = \begin{bmatrix} \boldsymbol{\rho} \\ \boldsymbol{\phi} \end{bmatrix} \in \mathbb{R}^6$$

The exponential map maps Lie algebra to Lie group: $T = \exp(\boldsymbol{\xi}^\wedge)$.

Why use Lie algebra? Optimization requires derivatives with respect to parameters. Rotation matrices have constraints ($R^\top R = I$, $\det R = 1$), making direct optimization inconvenient. Lie algebra is an unconstrained vector space, better suited for optimization.

import numpy as np

def skew_symmetric(v):
    """Skew-symmetric matrix of a vector"""
    return np.array([
        [0, -v[2], v[1]],
        [v[2], 0, -v[0]],
        [-v[1], v[0], 0]
    ])

def so3_exp(phi):
    """Exponential map from so(3) to SO(3)"""
    theta = np.linalg.norm(phi)
    if theta < 1e-10:
        return np.eye(3)

    a = phi / theta
    A = skew_symmetric(a)
    R = np.eye(3) + np.sin(theta) * A + (1 - np.cos(theta)) * A @ A
    return R

def se3_exp(xi):
    """Exponential map from se(3) to SE(3)"""
    rho = xi[:3]
    phi = xi[3:6]

    R = so3_exp(phi)
    theta = np.linalg.norm(phi)

    if theta < 1e-10:
        J = np.eye(3)
    else:
        a = phi / theta
        A = skew_symmetric(a)
        J = np.sin(theta)/theta * np.eye(3) + (1 - np.sin(theta)/theta) * np.outer(a, a) + (1 - np.cos(theta))/theta * A

    t = J @ rho

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

Extended Kalman Filter (EKF-SLAM)

EKF-SLAM models SLAM as a state estimation problem. The state vector includes the robot pose and all landmark positions:

$$\mathbf{x} = \left(\mathbf{x}_{\text{robot}},\ \mathbf{m}_1,\ \ldots,\ \mathbf{m}_n\right)^\top$$

Prediction Step (motion model):

$$\bar{\mathbf{x}}_k = f(\mathbf{x}_{k-1}, \mathbf{u}_k), \qquad \bar{P}_k = F_k P_{k-1} F_k^\top + Q_k$$

Update Step (observation model):

$$K_k = \bar{P}_k H_k^\top \left( H_k \bar{P}_k H_k^\top + R_k \right)^{-1}$$

$$\mathbf{x}_k = \bar{\mathbf{x}}_k + K_k\left(\mathbf{z}_k - h(\bar{\mathbf{x}}_k)\right), \qquad P_k = (I - K_k H_k)\bar{P}_k$$

where $F_k$ and $H_k$ are the Jacobian matrices of the motion and observation models.
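As a sketch, here are the two steps for the linear (ordinary Kalman) case; the EKF replaces $F_k$ and $H_k$ with Jacobians of $f$ and $h$ evaluated at the current estimate. All matrices and the toy values below are illustrative:

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Prediction: propagate the state estimate and its covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Update: fuse a measurement z through the Kalman gain."""
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(len(x)) - K @ H) @ P_pred
    return x, P

# 1D toy: uncertain prior at 0, one direct measurement at 1
x, P = kf_predict(np.array([0.0]), np.array([[1.0]]),
                  F=np.array([[1.0]]), Q=np.array([[0.01]]))
x, P = kf_update(x, P, z=np.array([1.0]),
                 H=np.array([[1.0]]), R=np.array([[0.1]]))
# x moves most of the way toward the measurement; P shrinks well below 1
```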

Graph Optimization SLAM

Modern SLAM increasingly uses graph optimization. SLAM is modeled as a factor graph:

  • Nodes: Robot poses, landmark positions
  • Edges: Motion constraints, observation constraints

The optimization objective minimizes the squared error sum of all constraints:

$$\min_{\mathbf{x}} \sum_k \mathbf{e}_k(\mathbf{x})^\top \Omega_k\, \mathbf{e}_k(\mathbf{x})$$

This is a nonlinear least squares problem, solved iteratively. Each iteration solves a sparse linear system:

$$H\, \Delta\mathbf{x} = -\mathbf{b}$$

where $H = J^\top \Omega J$ is the approximate Hessian matrix with sparse structure.

Image Filtering in Matrix Form


Convolution as Matrix Multiplication

Image convolution is typically written as:

$$(I * K)(i, j) = \sum_{u}\sum_{v} I(i-u,\, j-v)\, K(u, v)$$

But it can also be expressed as matrix multiplication. Flatten the image to a vector $\mathbf{x}$, and convert the kernel to a special matrix $A$ (block Toeplitz form):

$$\mathbf{y} = A\mathbf{x}$$

Why is this important? Because we can use linear algebra tools to analyze filter properties.
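A sketch of the construction for a tiny "valid"-mode case (using correlation, as `cv2.filter2D` does; the helper `conv_matrix_2d` is mine):

```python
import numpy as np

def conv_matrix_2d(kernel, in_shape):
    """Build A such that A @ img.ravel() equals the 'valid'-mode
    2D correlation of img with kernel (block-Toeplitz structure)."""
    kh, kw = kernel.shape
    H, W = in_shape
    out_h, out_w = H - kh + 1, W - kw + 1
    A = np.zeros((out_h * out_w, H * W))
    for i in range(out_h):
        for j in range(out_w):
            row = np.zeros((H, W))
            row[i:i + kh, j:j + kw] = kernel   # kernel placed at output position
            A[i * out_w + j] = row.ravel()
    return A

img = np.arange(9, dtype=float).reshape(3, 3)
k = np.array([[1.0, 0.0],
              [0.0, -1.0]])           # difference along the diagonal
A = conv_matrix_2d(k, img.shape)      # shape (4, 9)
out = (A @ img.ravel()).reshape(2, 2)
# out[i, j] == img[i, j] - img[i+1, j+1] for this kernel
```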

Common Filters

Mean Filter (blur):

$$K = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

Gaussian Filter (smoother blur):

$$K = \frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$

Laplacian Operator (edge detection):

$$K = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

Sobel Operator (gradient):

$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

import cv2
import numpy as np

# Various filters
img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)

# Gaussian blur
blurred = cv2.GaussianBlur(img, (5, 5), 1.0)

# Sobel gradient
grad_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
gradient_magnitude = np.sqrt(grad_x**2 + grad_y**2)

# Laplacian
laplacian = cv2.Laplacian(img, cv2.CV_64F)

# Custom kernel
kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]]) # Sharpening
sharpened = cv2.filter2D(img, -1, kernel)

Frequency Domain Analysis of Filtering

The convolution theorem tells us: spatial domain convolution equals frequency domain multiplication.

$$\mathcal{F}\{I * K\} = \mathcal{F}\{I\} \cdot \mathcal{F}\{K\}$$

This means we can understand filters by analyzing their frequency response:
- Low-pass filters (e.g., Gaussian): Preserve low frequencies, remove high-frequency noise
- High-pass filters (e.g., Laplacian): Preserve high frequencies, extract edges
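The theorem can be verified numerically with the DFT, which implements circular convolution; a small sketch on an 8x8 array (values are random test data):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))

kernel = np.zeros((8, 8))
kernel[:3, :3] = 1.0 / 9.0   # 3x3 mean filter, zero-padded to image size

# Spatial domain: circular convolution (wrap-around indexing)
spatial = np.zeros_like(img)
for i in range(8):
    for j in range(8):
        for u in range(8):
            for v in range(8):
                spatial[i, j] += img[(i - u) % 8, (j - v) % 8] * kernel[u, v]

# Frequency domain: element-wise product of the DFTs
freq = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kernel)))

print(np.allclose(spatial, freq))  # True
```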

Feature Detection with Linear Algebra

Harris Corner Detection

The core of Harris corner detection is analyzing the autocorrelation matrix (structure tensor) around a point:

$$M = \sum_{(u,v)} w(u,v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

where $I_x, I_y$ are image gradients, and $w(u, v)$ is a window function (usually Gaussian).

Key Observation: $M$'s eigenvalues $\lambda_1, \lambda_2$ reflect local image structure:
- Both eigenvalues large → Corner (change in all directions)
- One large, one small → Edge (change in one direction)
- Both small → Flat region

The Harris response avoids computing eigenvalues directly:

$$R = \det M - k\,(\operatorname{tr} M)^2$$

where $k$ is typically 0.04-0.06.

def harris_corner_detector(img, k=0.04, block_size=2, ksize=3):
    """Manual Harris corner detection implementation"""
    # Compute gradients
    Ix = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=ksize)
    Iy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=ksize)

    # Compute M matrix elements
    Ixx = Ix ** 2
    Iyy = Iy ** 2
    Ixy = Ix * Iy

    # Gaussian-weighted sum
    Sxx = cv2.GaussianBlur(Ixx, (block_size*2+1, block_size*2+1), 0)
    Syy = cv2.GaussianBlur(Iyy, (block_size*2+1, block_size*2+1), 0)
    Sxy = cv2.GaussianBlur(Ixy, (block_size*2+1, block_size*2+1), 0)

    # Harris response
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    R = det_M - k * trace_M ** 2

    return R

# OpenCV implementation
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
harris = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

PCA and Feature Descriptors

SIFT and similar feature descriptors compute statistics on local gradient histograms. PCA can reduce dimensions or extract principal directions.

For a set of data points $\{\mathbf{x}_i\}$, the covariance matrix

$$C = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top$$

has eigenvectors giving the data's principal directions, and eigenvalues giving the variance in each direction.
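A sketch of extracting principal directions from a 2D point cloud via the covariance matrix's eigendecomposition (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2D point cloud, stretched along the x-axis
pts = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])

mean = pts.mean(axis=0)
centered = pts - mean
C = centered.T @ centered / len(pts)    # 2x2 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
principal = eigvecs[:, -1]              # direction of greatest variance
# principal is close to ±[1, 0]; eigvals are roughly (0.25, 9.0)
```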

Exercises

Basic Exercises

1. Image Matrix Operations

An $m \times n$ grayscale image is stored in matrix $A$. Write matrix expressions for:
- (a) Flip the image vertically
- (b) Flip the image horizontally
- (c) Increase brightness by 50 (add 50 to pixel values)
- (d) Increase contrast by 1.5x (multiply pixel values by 1.5)

2. Rotation Matrix Properties

Prove:
- (a) A 2D rotation matrix's determinant equals 1
- (b) A 2D rotation matrix's inverse equals its transpose
- (c) The product of two rotation matrices is still a rotation matrix: $R(\alpha)R(\beta) = R(\alpha + \beta)$

3. Homogeneous Coordinates

Express the following operations with $3 \times 3$ homogeneous transformation matrices, and write the combined transformation matrix:
- First translate by $(t_x, t_y)$
- Then rotate by $\theta$ around the origin
- Finally scale by $s$

Advanced Exercises

4. Homography Matrix

Given four point correspondences $(\mathbf{x}_i, \mathbf{x}_i')$, $i = 1, \ldots, 4$:

    1. Explain why 4 point pairs are needed to determine a homography matrix
    2. Set up the linear system for solving $H$ using DLT

5. Essential Matrix Decomposition

Given an essential matrix $E$, prove it can be decomposed as $E = [\mathbf{t}]_\times R$, where $[\mathbf{t}]_\times$ is skew-symmetric and $R$ is a rotation matrix.

Hint: Use the SVD $E = U \Sigma V^\top$, together with the special form of the singular values, $\Sigma = \operatorname{diag}(\sigma, \sigma, 0)$.

6. Triangulation

Two cameras have projection matrices $P_1$ and $P_2$. A corresponding point pair is $\mathbf{x}_1$ and $\mathbf{x}_2$. Derive the linear system $A\mathbf{X} = \mathbf{0}$ for triangulation, where $\mathbf{X}$ is the 3D point in homogeneous coordinates.

Programming Exercises

7. Implement image affine transformation from scratch (build the transformation matrix manually).

8. Implement fundamental matrix estimation using the 8-point algorithm.

9. Implement Harris corner detection with non-maximum suppression.

Application Problems

10. Panorama Stitching

Design a panoramic image stitching system. Discuss:
- Why can homography matrices describe the relationship between images taken from different viewpoints?
- How to handle accumulated errors?
- How to handle exposure differences?

11. Simple Visual Odometry

Implement a simple monocular visual odometry. Discuss:
- The scale drift problem in monocular VO
- How to detect and handle tracking failures?

Chapter Summary

Computer vision and linear algebra are inseparable:

Image Representation:
- Images are matrices; color images are tensors
- Images can be flattened to vectors for machine learning

Geometric Transformations:
- Rotation, scaling, and shear are linear transformations
- Homogeneous coordinates make translation expressible as matrix multiplication
- Perspective transformations use homography matrices

Cameras and 3D:
- Projection matrices map 3D points to 2D images
- Epipolar geometry constrains corresponding point relationships
- Triangulation and SfM recover 3D structure

Optimization and Estimation:
- State estimation in SLAM relies on matrix operations
- Lie groups/algebras handle rotation optimization
- Bundle Adjustment solves sparse linear systems

Signal Processing:
- Convolution can be written as matrix multiplication
- The Fourier transform analyzes filters
- Deep learning heavily uses matrix operations

Master these concepts, and you master computer vision's mathematical foundations. Next steps could be multiple view geometry, visual SLAM, and 3D reconstruction.

References

  1. Hartley, R., & Zisserman, A. Multiple View Geometry in Computer Vision. Cambridge University Press.
  2. Szeliski, R. Computer Vision: Algorithms and Applications. Springer.
  3. Forsyth, D. A., & Ponce, J. Computer Vision: A Modern Approach. Pearson.
  4. Strang, G. Introduction to Linear Algebra. Wellesley-Cambridge Press.
  5. Barfoot, T. D. State Estimation for Robotics. Cambridge University Press.

This is Chapter 17 of the 18-part "Essence of Linear Algebra" series.

  • Post title: Essence of Linear Algebra (17): Linear Algebra in Computer Vision
  • Post author: Chen Kai
  • Create time: 2019-03-26 10:00:00
  • Post link: https://www.chenk.top/chapter-17-linear-algebra-in-computer-vision/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.