In 1886, Francis Galton discovered a peculiar phenomenon while
studying the relationship between parents' and children's heights:
parents with extreme heights tended to have children whose heights were
closer to the average. He coined the term "regression toward the mean,"
which is where "regression" comes from. However, the true power of
linear regression lies not in statistical description, but in its role
as the mathematical foundation for almost all machine learning
algorithms — from neural networks to support vector machines, all can be
viewed as generalizations of linear regression.
The essence of linear regression is finding the optimal hyperplane in
data space. This seemingly simple problem conceals deep connections
between linear algebra, probability theory, and optimization. This
chapter provides complete mathematical derivations of linear regression
from multiple perspectives.
Objective: Find a parameter vector $\mathbf{w} \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ such that the linear model

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

best fits the training data.

Notation Simplification: To unify notation, we absorb the bias into the weight vector. Define the augmented feature vector and weight vector:

$$\tilde{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}, \qquad \tilde{\mathbf{w}} = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$$

Then the model simplifies to:

$$\hat{y} = \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}$$

For brevity, we omit the tildes and write $\hat{y} = \mathbf{w}^\top \mathbf{x}$, understanding that $\mathbf{x}$ includes the constant term 1.
Matrix Form
Organize all $m$ training samples into matrix form:

Design Matrix: $X \in \mathbb{R}^{m \times d}$, whose $i$-th row is $\mathbf{x}_i^\top$

Output Vector: $\mathbf{y} = (y_1, \ldots, y_m)^\top \in \mathbb{R}^m$

Prediction Vector: $\hat{\mathbf{y}} = X\mathbf{w}$

Our goal is to find the optimal $\mathbf{w}$ such that the prediction vector $\hat{\mathbf{y}}$ is as close as possible to the true values $\mathbf{y}$.
Least Squares: Algebraic Derivation
Loss Function
Using squared loss (L2 loss) to measure prediction error:

$$L(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 = \frac{1}{2}(X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y})$$

The coefficient $\frac{1}{2}$ eliminates constants when taking derivatives.

Objective:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} L(\mathbf{w})$$

Gradient Derivation

Calculate the gradient of the loss function with respect to $\mathbf{w}$. Expanding:

$$L(\mathbf{w}) = \frac{1}{2}\left(\mathbf{w}^\top X^\top X \mathbf{w} - 2\mathbf{y}^\top X \mathbf{w} + \mathbf{y}^\top \mathbf{y}\right)$$

Taking derivatives with respect to $\mathbf{w}$ (using the matrix differentiation formulas $\nabla_{\mathbf{w}}\, \mathbf{w}^\top A \mathbf{w} = 2A\mathbf{w}$ for symmetric $A$, and $\nabla_{\mathbf{w}}\, \mathbf{b}^\top \mathbf{w} = \mathbf{b}$), we therefore obtain:

$$\nabla_{\mathbf{w}} L = X^\top X \mathbf{w} - X^\top \mathbf{y} = X^\top (X\mathbf{w} - \mathbf{y})$$
Normal Equation
Setting the gradient to zero:

$$X^\top X \mathbf{w} = X^\top \mathbf{y}$$

This is the famous Normal Equation.

Theorem 1 (Least Squares Solution): If $X^\top X$ is invertible, the unique solution to the least squares problem is:

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$
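As a quick numerical sanity check, the normal equation can be solved directly with NumPy. This is a minimal sketch on synthetic, noise-free data (all variable names are illustrative); `np.linalg.solve` is used rather than forming an explicit inverse, which is both faster and more numerically stable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # design matrix with full column rank
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true                        # noise-free targets, so recovery is exact

# Normal equation X^T X w = X^T y, solved as a linear system
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true))
```

With noisy targets the recovered weights would match only approximately, with error shrinking as the sample size grows.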
Proof:

First-order necessary condition: $\nabla_{\mathbf{w}} L = 0$ gives $X^\top X \mathbf{w} = X^\top \mathbf{y}$.

Second-order sufficient condition: Computing the Hessian matrix:

$$\nabla^2_{\mathbf{w}} L = X^\top X$$

For any non-zero vector $\mathbf{v}$:

$$\mathbf{v}^\top X^\top X \mathbf{v} = \|X\mathbf{v}\|^2 \ge 0$$

If $X$ has full column rank (i.e., $\operatorname{rank}(X) = d$), then $\|X\mathbf{v}\|^2 = 0$ if and only if $\mathbf{v} = 0$, thus $X^\top X$ is positive definite.

Positive definite Hessian + zero gradient = global minimum. QED.
Invertibility Conditions for $X^\top X$:

Necessary and sufficient condition: $X$ has full column rank, i.e., $\operatorname{rank}(X) = d$

Equivalent condition: $X^\top X$ is positive definite

Practical interpretation:

Number of samples $m \ge d$ (at least as many samples as features)

Features are linearly independent (no perfect collinearity)
Moore-Penrose Pseudoinverse
When $X^\top X$ is not invertible (e.g., $m < d$ or collinear features), use the Pseudoinverse:

$$\mathbf{w}^* = X^+ \mathbf{y}$$

where $X^+$ is the Moore-Penrose pseudoinverse.

Properties:

When $X^\top X$ is invertible, $X^+ = (X^\top X)^{-1} X^\top$ (degenerates to the ordinary least squares solution)

$X^+ \mathbf{y}$ is the minimum norm solution among all solutions satisfying $X^\top X \mathbf{w} = X^\top \mathbf{y}$

Computation: Via Singular Value Decomposition (SVD). Let $X = U \Sigma V^\top$, then:

$$X^+ = V \Sigma^+ U^\top$$

where $\Sigma^+$ is the pseudoinverse of $\Sigma$ (non-zero singular values are inverted, zeros remain zero).
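The SVD recipe above can be verified against NumPy's built-in pseudoinverse. A minimal sketch with deliberately collinear features (all names illustrative; the same cutoff `rcond=1e-10` is used in both computations so the two results truncate the same singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 2))
X = np.c_[A, A[:, 0] + A[:, 1]]      # third column is collinear: X^T X singular
y = rng.normal(size=50)

# Minimum-norm least squares solution via the built-in Moore-Penrose pseudoinverse
w_pinv = np.linalg.pinv(X, rcond=1e-10) @ y

# Same solution computed explicitly from the SVD: X^+ = V Sigma^+ U^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
s_inv = np.where(s > 1e-10 * s.max(), 1.0 / s, 0.0)  # invert non-zero singular values only
w_svd = Vt.T @ (s_inv * (U.T @ y))
print(np.allclose(w_pinv, w_svd, atol=1e-8))
```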
Geometric Interpretation: Projection Perspective
Column Space and Projection
The geometric essence of linear regression is orthogonal
projection.
Definitions:

$\operatorname{col}(X) \subseteq \mathbb{R}^m$ is the column space of $X$, i.e., the subspace spanned by the columns of $X$

$\mathbf{y} \in \mathbb{R}^m$ is a vector in this ambient space

The goal is to find the point in $\operatorname{col}(X)$ closest to $\mathbf{y}$

Theorem 2 (Orthogonal Projection Theorem): $\hat{\mathbf{y}} = X\mathbf{w}^*$ is the orthogonal projection of $\mathbf{y}$ onto $\operatorname{col}(X)$ if and only if the residual vector $\mathbf{y} - X\mathbf{w}^*$ is orthogonal to $\operatorname{col}(X)$:

$$X^\top (\mathbf{y} - X\mathbf{w}^*) = 0$$

This is exactly the normal equation!
Proof:

For any $\mathbf{w}$, consider the squared norm of the prediction error. Let $\mathbf{r} = \mathbf{y} - X\mathbf{w}^*$ and $\mathbf{u} = X\mathbf{w}^* - X\mathbf{w} \in \operatorname{col}(X)$. If $X^\top \mathbf{r} = 0$, then $\mathbf{r} \perp \mathbf{u}$, and by the Pythagorean theorem (which holds when two vectors are orthogonal):

$$\|\mathbf{y} - X\mathbf{w}\|^2 = \|\mathbf{r} + \mathbf{u}\|^2 = \|\mathbf{r}\|^2 + \|\mathbf{u}\|^2 \ge \|\mathbf{y} - X\mathbf{w}^*\|^2$$

Equality holds if and only if $X\mathbf{w} = X\mathbf{w}^*$. QED.
Projection Matrix
Definition: The projection matrix

$$H = X(X^\top X)^{-1} X^\top$$

projects any vector onto $\operatorname{col}(X)$:

$H^2 = H$ (idempotent)

$H^\top = H$ (symmetric)

$X^\top (I - H) = 0$ (residual is orthogonal to the column space)
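These three properties are easy to confirm numerically. A minimal sketch (variable names illustrative) building $H$ for a random full-rank design matrix and checking idempotency, symmetry, and residual orthogonality:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
# Projection matrix H = X (X^T X)^{-1} X^T, built via a linear solve
H = X @ np.linalg.solve(X.T @ X, X.T)

y = rng.normal(size=30)
assert np.allclose(H @ H, H)           # idempotent: projecting twice = projecting once
assert np.allclose(H, H.T)             # symmetric
residual = (np.eye(30) - H) @ y
assert np.allclose(X.T @ residual, 0)  # residual orthogonal to col(X)
print("all projection properties hold")
```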
Geometric Intuition
In $m$-dimensional space:

$\mathbf{y}$ is the true output vector

$\operatorname{col}(X)$ is a $d$-dimensional subspace (if $X$ has full column rank)

$\hat{\mathbf{y}} = H\mathbf{y}$ is the "shadow" of $\mathbf{y}$ on $\operatorname{col}(X)$

$\mathbf{y} - \hat{\mathbf{y}}$ is the residual, perpendicular to $\operatorname{col}(X)$

Analogy (plane projection): Imagine, in 3D space, the perpendicular distance from a point $\mathbf{y}$ to a 2D plane. The projection point $\hat{\mathbf{y}}$ minimizes this distance.
Least Squares Geometric Interpretation
The figure below shows OLS as orthogonal projection in 3D: the observation vector $\mathbf{y}$ is projected onto the column space of the design matrix $X$, with the residual perpendicular to the column space:

[Figure: Projection Geometry]
Probabilistic Perspective: Maximum Likelihood Estimation
Linear Gaussian Model
Assume the data generating process is:

$$y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

where the $\epsilon_i$ are independent and identically distributed Gaussian noise.

Equivalent form:

$$p(y_i \mid \mathbf{x}_i; \mathbf{w}, \sigma^2) = \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2)$$

That is, given $\mathbf{x}_i$, $y_i$ follows a Gaussian distribution with mean $\mathbf{w}^\top \mathbf{x}_i$ and variance $\sigma^2$.
Likelihood Function
Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^m$, the likelihood function for the parameters $(\mathbf{w}, \sigma^2)$ is:

$$L(\mathbf{w}, \sigma^2) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}\right)$$

Log-Likelihood

Taking the logarithm (a monotonic transformation doesn't change the maximizer):

$$\ell(\mathbf{w}, \sigma^2) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$
Maximum Likelihood Estimation

Optimization with respect to $\mathbf{w}$:

Maximizing $\ell(\mathbf{w}, \sigma^2)$ is equivalent to minimizing:

$$\sum_{i=1}^{m}(y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$

This is exactly the least squares objective!

Theorem 3: Under the linear Gaussian model, maximum likelihood estimation is equivalent to least squares estimation:

$$\mathbf{w}_{\text{MLE}} = \mathbf{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top \mathbf{y}$$

Optimization with respect to $\sigma^2$:

Fixing $\mathbf{w}$, taking the partial derivative with respect to $\sigma^2$ and setting it to zero:

$$\hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}(y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$

The MLE of the noise variance is the mean of the squared residuals.
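This estimator can be checked on synthetic data: fit $\mathbf{w}$ by least squares, then take the mean squared residual. A minimal sketch (all names illustrative) with true noise standard deviation 2, so the MLE should land near $\sigma^2 = 4$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, sigma = 5000, 2.0
X = rng.normal(size=(m, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=m)

# Fit w by least squares, then estimate the noise variance from residuals
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_mle = np.mean((y - X @ w_hat) ** 2)   # MLE: mean of squared residuals
print(sigma2_mle)                             # close to sigma^2 = 4.0 for large m
```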
Bayesian Perspective
Introducing a prior distribution $p(\mathbf{w})$ on the parameters, compute the posterior via Bayes' rule:

$$p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})$$

Gaussian Prior: Assuming $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$, then maximizing the posterior probability (MAP) is equivalent to minimizing:

$$\sum_{i=1}^{m}(y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \frac{\sigma^2}{\tau^2}\|\mathbf{w}\|^2$$

This is the Ridge Regression objective function! The regularization term $\lambda = \sigma^2/\tau^2$ reflects the strength of the prior belief.
Regularization: Ridge Regression and Lasso
Ridge Regression (L2 Regularization)
Objective Function:

$$J(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$

where $\lambda > 0$ is the regularization parameter.

Gradient:

$$\nabla J = X^\top (X\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}$$

Setting the gradient to zero gives the Analytical Solution:

$$\mathbf{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$
Key Observations:

Adding $\lambda I$ guarantees invertibility of $X^\top X + \lambda I$ (even if $X^\top X$ is not invertible)
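A minimal NumPy sketch of the closed-form ridge solution on deliberately collinear data (all names illustrative): the plain normal equation would fail here because $X^\top X$ is singular, but the ridge system is always solvable for $\lambda > 0$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(40, 2))
X = np.c_[A, A.sum(axis=1)]          # third feature = sum of first two: X^T X singular
y = rng.normal(size=40)

lam = 0.1
d = X.shape[1]
# Ridge solution: (X^T X + lam I)^{-1} X^T y, well defined for any lam > 0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge.shape)
```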
Lasso (L1 Regularization)

Objective Function:

$$J(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|_1$$

Sparsity: Some parameters are compressed exactly to zero, achieving feature selection.

Geometric Interpretation:

In constrained form, Lasso minimizes the squared loss subject to $\|\mathbf{w}\|_1 \le t$. The L1 constraint ball is a diamond (hyperdiamond in high dimensions), whose corners are more likely to intersect the loss contour lines on the coordinate axes, leading to sparse solutions.
Elastic Net
Combining L1 and L2:

$$J(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2 + \lambda_1 \|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|^2$$
Advantages:
Retains L1 sparsity
Retains L2 stability (friendly to collinear features)
Effects of Regularization
Bias-Variance Tradeoff:

No regularization ($\lambda = 0$): Low bias, high variance (overfitting)

Strong regularization (large $\lambda$): High bias, low variance (underfitting)

Optimal $\lambda$: Selected via cross-validation

Ridge Trace:

Plot $\mathbf{w}_{\text{ridge}}(\lambda)$ as $\lambda$ varies to observe the shrinkage behavior of the parameters.
Stochastic Gradient Descent (SGD)

Update parameters using only one sample at a time.

Algorithm:

Initialize $\mathbf{w}_0$

For $t = 1, 2, \ldots$:

Randomly select a sample $i$

Update: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\, \mathbf{x}_i(\mathbf{x}_i^\top \mathbf{w}_t - y_i)$
Advantages:
Fast per iteration ($O(d)$ per update)
Suitable for large-scale data
Ability to escape local minima (for non-convex problems)
Disadvantages:
Unstable convergence (high variance)
Requires careful learning rate tuning
Mini-batch Gradient Descent
Compromise: Use $B$ samples (the batch size) each time:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{B} \sum_{i \in \mathcal{B}_t} \mathbf{x}_i(\mathbf{x}_i^\top \mathbf{w}_t - y_i)$$

where $\mathcal{B}_t$ is the batch at iteration $t$ (of size $B$).

Typical choices: powers of two such as $B = 32, 64, 128, 256$.
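The update rule above can be sketched in a few lines of NumPy (synthetic data, illustrative names): sample a mini-batch, compute its gradient, and step:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.c_[rng.normal(size=(200, 2)), np.ones(200)]   # bias column appended
y = X @ np.array([3.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
alpha, B = 0.1, 32                    # learning rate and batch size
for t in range(500):
    idx = rng.choice(200, size=B, replace=False)     # sample a mini-batch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / B      # mini-batch gradient
    w -= alpha * grad
print(w)                               # near the true weights [3, -1, 0.5]
```

With a fixed learning rate the iterates fluctuate around the optimum; decaying the learning rate tightens the final accuracy.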
Convergence Analysis
Theorem 4 (BGD Convergence): For learning rate $\eta \le 1/\lambda_{\max}$, batch gradient descent converges linearly to the optimal solution:

$$\|\mathbf{w}_t - \mathbf{w}^*\| \le (1 - \eta\lambda_{\min})^t \|\mathbf{w}_0 - \mathbf{w}^*\|$$

where $\lambda_{\min}$, $\lambda_{\max}$ are the minimum and maximum eigenvalues of $X^\top X$.

Proof Sketch:

The loss function is strongly convex (the Hessian is positive definite)

The gradient is Lipschitz continuous

Apply the convergence theorem for strongly convex functions

Practical Recommendations:

Learning rate: $\eta = 1/\lambda_{\max}$ (conservative choice)

Adaptive learning rates: Adam, RMSprop, etc.

Learning rate decay, e.g. $\eta_t = \eta_0 / (1 + \gamma t)$
Gradient Descent Animation
Model Evaluation and Selection
Evaluation Metrics
Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE):

$$\text{RMSE} = \sqrt{\text{MSE}}$$

Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$$

Coefficient of Determination (R²):

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

where $\bar{y}$ is the mean of the $y_i$.

Interpretation:

$R^2 = 1$: Perfect fit

$R^2 = 0$: Model equivalent to predicting the mean

$R^2 < 0$: Model worse than predicting the mean (rare, indicates model problems)

Adjusted R²

Penalizing model complexity:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{m - 1}{m - d - 1}$$

Advantage: Adding useless features decreases $\bar{R}^2$ (while $R^2$ always increases).
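The four metrics above are one-liners in NumPy. A small helper (illustrative, not part of any library) evaluated on a hand-checkable example:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2, as defined above."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, mae, r2

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
mse, rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)   # 0.125 0.25 0.975
```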
The figure below shows four classic residual diagnostic plots —
residuals vs fitted values, Q-Q plot, Scale-Location plot, and residual
histogram — for checking model assumptions:
Residual Diagnostics
Cross-Validation
k-fold Cross-Validation:

Split data into $k$ folds

For $i = 1, \ldots, k$:

Train on all data except fold $i$

Test on fold $i$

Average the $k$ test errors
Python Implementation:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# X, y: training features and targets
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"R^2: {scores.mean():.4f} (+/- {scores.std():.4f})")
```
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class LinearRegression:
    """
    Complete implementation of linear regression
    Supports:
    - Analytical solution (normal equation)
    - Batch gradient descent
    - Stochastic gradient descent
    - Ridge regularization
    """

    def __init__(self, method='normal', alpha=0.01, n_iterations=1000,
                 lambda_reg=0.0, batch_size=None, random_state=42):
        """
        Parameters:
            method: str, solving method
                'normal': Normal equation
                'bgd': Batch gradient descent
                'sgd': Stochastic gradient descent
                'mini_batch': Mini-batch gradient descent
            alpha: float, learning rate (only for gradient descent)
            n_iterations: int, number of iterations
            lambda_reg: float, Ridge regularization parameter
            batch_size: int, batch size (only for mini_batch)
            random_state: int, random seed
        """
        self.method = method
        self.alpha = alpha
        self.n_iterations = n_iterations
        self.lambda_reg = lambda_reg
        self.batch_size = batch_size
        self.random_state = random_state
        self.w = None
        self.loss_history = []

    def fit(self, X, y):
        """
        Train linear regression model

        Parameters:
            X: np.array, shape=(m, d), input features
            y: np.array, shape=(m,), output labels
        """
        # Add bias term
        X_bias = self._add_bias(X)
        m, d = X_bias.shape

        # Initialize weights
        np.random.seed(self.random_state)
        self.w = np.random.randn(d) * 0.01

        if self.method == 'normal':
            # Normal equation
            self.w = self._normal_equation(X_bias, y)
        elif self.method == 'bgd':
            # Batch gradient descent
            self._batch_gradient_descent(X_bias, y)
        elif self.method == 'sgd':
            # Stochastic gradient descent
            self._stochastic_gradient_descent(X_bias, y)
        elif self.method == 'mini_batch':
            # Mini-batch gradient descent
            if self.batch_size is None:
                self.batch_size = min(32, m)
            self._mini_batch_gradient_descent(X_bias, y)
        else:
            raise ValueError(f"Unknown method: {self.method}")

        return self

    def predict(self, X):
        """
        Prediction

        Parameters:
            X: np.array, shape=(m, d)
        Returns:
            y_pred: np.array, shape=(m,)
        """
        X_bias = self._add_bias(X)
        return X_bias @ self.w

    def score(self, X, y):
        """Calculate R^2 score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    def _add_bias(self, X):
        """Add bias column"""
        return np.c_[X, np.ones(X.shape[0])]

    def _compute_loss(self, X, y):
        """Compute loss (including regularization)"""
        predictions = X @ self.w
        mse = np.mean((predictions - y) ** 2)
        reg = self.lambda_reg * np.sum(self.w[:-1] ** 2)  # Don't regularize bias
        return mse + reg

    def _compute_gradient(self, X, y):
        """Compute gradient"""
        m = len(y)
        predictions = X @ self.w
        gradient = X.T @ (predictions - y) / m
        # Add regularization gradient (excluding bias)
        if self.lambda_reg > 0:
            reg_gradient = np.zeros_like(self.w)
            reg_gradient[:-1] = 2 * self.lambda_reg * self.w[:-1]
            gradient += reg_gradient
        return gradient

    def _normal_equation(self, X, y):
        """Solve via normal equation"""
        # w = (X^T X + lambda*I)^{-1} X^T y
        d = X.shape[1]
        reg_matrix = self.lambda_reg * np.eye(d)
        reg_matrix[-1, -1] = 0  # Don't regularize bias
        try:
            w = np.linalg.solve(X.T @ X + reg_matrix, X.T @ y)
        except np.linalg.LinAlgError:
            # If matrix is singular, use pseudoinverse
            w = np.linalg.pinv(X.T @ X + reg_matrix) @ X.T @ y
        return w

    def _batch_gradient_descent(self, X, y):
        """Batch gradient descent"""
        for i in range(self.n_iterations):
            gradient = self._compute_gradient(X, y)
            self.w -= self.alpha * gradient
            # Record loss
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _stochastic_gradient_descent(self, X, y):
        """Stochastic gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)
        for i in range(self.n_iterations):
            # Randomly select one sample
            idx = np.random.randint(m)
            X_i = X[idx:idx + 1]
            y_i = y[idx:idx + 1]
            gradient = self._compute_gradient(X_i, y_i)
            self.w -= self.alpha * gradient
            # Record loss (every 10 iterations)
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))

    def _mini_batch_gradient_descent(self, X, y):
        """Mini-batch gradient descent"""
        m = len(y)
        np.random.seed(self.random_state)
        for i in range(self.n_iterations):
            # Randomly select batch
            indices = np.random.choice(m, self.batch_size, replace=False)
            X_batch = X[indices]
            y_batch = y[indices]
            gradient = self._compute_gradient(X_batch, y_batch)
            self.w -= self.alpha * gradient
            # Record loss
            if i % 10 == 0:
                self.loss_history.append(self._compute_loss(X, y))


# Example: Housing Price Prediction
def demo_linear_regression():
    """
    Complete example: Linear regression application on simulated housing data
    """
    # Generate synthetic data (simulating house price prediction)
    np.random.seed(42)
    m, d = 500, 5

    # True weights (last one is the bias)
    w_true = np.array([50, -20, 30, 15, -10, 200])

    # Generate features (standardized)
    X = np.random.randn(m, d)

    # Add bias and compute true values
    X_bias = np.c_[X, np.ones(m)]
    y_true = X_bias @ w_true

    # Add noise
    y = y_true + np.random.randn(m) * 20

    # Split training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Compare different methods
    methods = {
        'Normal Equation': LinearRegression(method='normal'),
        'Batch GD': LinearRegression(method='bgd', alpha=0.1, n_iterations=1000),
        'SGD': LinearRegression(method='sgd', alpha=0.01, n_iterations=2000),
        'Mini-batch GD': LinearRegression(method='mini_batch', alpha=0.05,
                                          n_iterations=1000, batch_size=32),
    }

    results = {}
    print("=" * 70)
    print("Linear Regression Method Comparison")
    print("=" * 70)

    for name, model in methods.items():
        # Train
        model.fit(X_train_scaled, y_train)
        # Evaluate
        train_score = model.score(X_train_scaled, y_train)
        test_score = model.score(X_test_scaled, y_test)
        y_pred = model.predict(X_test_scaled)
        rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
        results[name] = {
            'train_r2': train_score,
            'test_r2': test_score,
            'rmse': rmse,
            'weights': model.w,
        }
        print(f"\n{name}:")
        print(f"  Training R^2: {train_score:.4f}")
        print(f"  Test R^2: {test_score:.4f}")
        print(f"  Test RMSE: {rmse:.2f}")

    # Ridge Regression: Effect of regularization
    print("\n" + "=" * 70)
    print("Ridge Regression: Regularization Parameter Selection")
    print("=" * 70)

    lambdas = [0, 0.01, 0.1, 1, 10, 100]
    train_scores, test_scores = [], []
    for lam in lambdas:
        model = LinearRegression(method='normal', lambda_reg=lam)
        model.fit(X_train_scaled, y_train)
        train_scores.append(model.score(X_train_scaled, y_train))
        test_scores.append(model.score(X_test_scaled, y_test))
        print(f"λ={lam:6.2f}: Training R^2={train_scores[-1]:.4f}, "
              f"Test R^2={test_scores[-1]:.4f}")


if __name__ == "__main__":
    demo_linear_regression()
```
Q&A: Core Linear Regression Questions
Q1: Why use squared loss instead of absolute value loss?
Squared Loss (L2 loss):

Analytical solution: Squared loss leads to a quadratic optimization problem with a closed-form solution

Statistical significance: Under the Gaussian noise assumption, squared loss corresponds to maximum likelihood estimation

Absolute Value Loss (L1 loss):

Advantages: Robust to outliers (less affected by extreme values)

Disadvantages: Not differentiable at zero, no analytical solution, requires linear programming
Comparison:

| Loss Function | Differentiability | Analytical Solution | Outlier Robustness | Noise Distribution |
|---|---|---|---|---|
| Squared (L2) | Everywhere | Yes | Weak | Gaussian |
| Absolute (L1) | Not at 0 | No | Strong | Laplacian |
| Huber | Everywhere | No | Medium | Mixed |
Huber Loss (compromise):

$$L_\delta(r) = \begin{cases} \dfrac{1}{2}r^2 & |r| \le \delta \\[4pt] \delta\left(|r| - \dfrac{\delta}{2}\right) & |r| > \delta \end{cases}$$

Quadratic when $|r| \le \delta$, linear when $|r| > \delta$, combining the advantages of both.
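The piecewise definition translates directly to NumPy. A minimal vectorized sketch (function name illustrative), checked on two residuals: $r = 0.5$ falls in the quadratic branch ($0.5 \cdot 0.25 = 0.125$) and $r = 2$ in the linear branch ($1 \cdot (2 - 0.5) = 1.5$):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond, matching the formula above."""
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

r = np.array([0.5, 2.0])
print(huber_loss(r))   # [0.125 1.5]
```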
Q2: Normal equation vs gradient descent — when to use which?
Decision Tree:

```
Data Scale?
├─ Small scale (m < 10000, d < 1000)
│   └─ Use normal equation (fast, high precision)
└─ Large scale (m > 10000 or d > 1000)
    ├─ X^T X invertible?
    │   ├─ Yes → Gradient descent
    │   └─ No  → Ridge regression or gradient descent
    └─ Enough memory?
        ├─ Yes → Batch gradient descent
        └─ No  → SGD or Mini-batch GD
```
Detailed Comparison:

| Dimension | Normal Equation | Gradient Descent |
|---|---|---|
| Time Complexity | $O(md^2 + d^3)$ | $O(kmd)$ ($k$ is the iteration count) |
| Space Complexity | $O(d^2)$ | $O(d)$ |
| Convergence | One step | Requires multiple iterations |
| Hyperparameters | None | Learning rate, iterations |
| Feature Count | $d \lesssim 10^4$ feasible | Arbitrary |
| Invertibility | $X^\top X$ must be invertible | Not required |
| Regularization | Easy to add | Easy to add |
| Online Learning | Not supported | Supported (SGD) |
Practical Recommendations:

Default choice: Use the normal equation when the feature count is small (up to a few thousand), otherwise gradient descent

Big data: Use SGD or Mini-batch GD

Real-time updates: Use SGD (supports online learning)
Q3: Why does the Ridge regression solution always exist?
Core Reason: Adding $\lambda I$ makes the matrix positive definite.

Theorem: For any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is positive definite, hence invertible.
Proof:

For any non-zero vector $\mathbf{v}$:

$$\mathbf{v}^\top (X^\top X + \lambda I)\mathbf{v} = \|X\mathbf{v}\|^2 + \lambda\|\mathbf{v}\|^2$$

Since $\|X\mathbf{v}\|^2 \ge 0$ and $\lambda\|\mathbf{v}\|^2 > 0$, the sum is strictly positive. Even if $X\mathbf{v} = 0$ ($\mathbf{v}$ is in the null space of $X$), the term $\lambda\|\mathbf{v}\|^2 > 0$ keeps the quadratic form positive. Therefore $X^\top X + \lambda I$ is positive definite, hence invertible.
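A small numerical illustration of the theorem (matrix values chosen for the example): with perfectly collinear columns, $X^\top X$ is rank-deficient, yet adding $\lambda I$ makes every eigenvalue strictly positive:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # second column = 2 * first: rank 1
G = A.T @ A
print(np.linalg.matrix_rank(G))        # rank 1: singular, normal equation fails

lam = 0.5
eigvals = np.linalg.eigvalsh(G + lam * np.eye(2))
print((eigvals > 0).all())             # all eigenvalues positive: invertible
```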
Intuition:

$X^\top X$ may be singular (e.g., collinear features)

Adding $\lambda I$ is like adding a "safety cushion" to the diagonal

Even if some features are perfectly correlated, $\lambda I$ ensures positive definiteness

Geometric Interpretation:

In feature space, the null space of $X$ corresponds to directions that cannot be determined from the data. Ridge regression, by adding $\lambda\|\mathbf{w}\|^2$, imposes a "prefer zero" prior in these directions, making the problem well-posed.
Summary:

Optimization algorithms: Normal equation for small data, gradient descent for large data

Model diagnostics: Check linearity, independence, homoscedasticity, normality

Practical Tips:

Standardize features

Handle collinearity

Select the regularization parameter

Use cross-validation for evaluation
Next Chapter Preview: Chapter 6 will explore
logistic regression and classification, extending linear models to
discrete output spaces. We will derive the origin of the sigmoid
function, mathematical foundations of cross-entropy loss, and geometric
interpretation of decision boundaries.
✏️ Exercises and Solutions
Exercise 1: Normal Equation Derivation
Problem: Derive the normal equation from matrix calculus.
Solution:
Loss: $L(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2$

Gradient: $\nabla_{\mathbf{w}} L = X^\top (X\mathbf{w} - \mathbf{y})$

Setting $\nabla_{\mathbf{w}} L = 0$ gives $X^\top X \mathbf{w} = X^\top \mathbf{y}$, so $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ when $X^\top X$ is invertible.
Exercise 2: Bayesian Interpretation of Ridge

Problem: Prove that Ridge regression is equivalent to MAP estimation with Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$, and find the relationship between $\lambda$, $\sigma^2$, and $\tau^2$.

Solution: As derived in the Bayesian Perspective section, the MAP objective is $\sum_i (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \frac{\sigma^2}{\tau^2}\|\mathbf{w}\|^2$, which is the Ridge objective with $\lambda = \sigma^2 / \tau^2$.
Exercise 3: Projection Matrix Properties

Problem: Prove that $H = X(X^\top X)^{-1}X^\top$ is symmetric and idempotent ($H^\top = H$, $H^2 = H$).

Solution:

Symmetry: $H^\top = X\left((X^\top X)^{-1}\right)^\top X^\top = X(X^\top X)^{-1}X^\top = H$ (since $X^\top X$ is symmetric).

Idempotency: $H^2 = X(X^\top X)^{-1}X^\top X(X^\top X)^{-1}X^\top = X(X^\top X)^{-1}X^\top = H$.

Geometric meaning: $H$ orthogonally projects any vector onto the column space of $X$. Projecting twice is the same as projecting once.
Exercise 4: MLE of Noise Variance

Problem: Under the linear Gaussian model ($\epsilon_i \sim \mathcal{N}(0, \sigma^2)$), derive the MLE of $\sigma^2$ and show it is biased.

Solution:

Log-likelihood: $\ell(\sigma^2) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - \mathbf{x}_i^\top \hat{\mathbf{w}})^2$

Setting $\partial\ell/\partial\sigma^2 = 0$:

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{m}\sum_{i=1}^{m}(y_i - \mathbf{x}_i^\top \hat{\mathbf{w}})^2$$

This is biased: $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{m - d}{m}\sigma^2$. Unbiased estimator: $\hat{\sigma}^2 = \frac{1}{m - d}\sum_i (y_i - \mathbf{x}_i^\top \hat{\mathbf{w}})^2$.
Exercise 5: Multicollinearity and Ridge

Problem: When features are highly correlated, analyze how the OLS variance is affected and why Ridge helps.

Solution:

$\operatorname{Var}(\hat{\mathbf{w}}_{\text{OLS}}) = \sigma^2 (X^\top X)^{-1}$. High correlation makes $X^\top X$ near-singular, inflating $(X^\top X)^{-1}$ and hence the variance.

The variance inflation factor $\text{VIF}_j = 1/(1 - R_j^2)$ quantifies this; for example, $R_j^2 = 0.98$ gives $\text{VIF}_j = 50$, amplifying variance 50x.

Ridge: $\operatorname{Var}(\hat{\mathbf{w}}_{\text{ridge}}) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1}$.

Adding $\lambda I$ stabilizes the inversion, reducing variance at the cost of a small bias — a favorable MSE trade-off.
References
Hastie, T., Tibshirani, R., & Friedman, J. (2009).
The Elements of Statistical Learning (2nd ed.).
Springer.
Bishop, C. M. (2006). Pattern Recognition and Machine
Learning. Springer.
Murphy, K. P. (2012). Machine Learning: A Probabilistic
Perspective. MIT Press.
Shalev-Shwartz, S., & Ben-David, S. (2014).
Understanding Machine Learning: From Theory to
Algorithms. Cambridge University Press.
Goodfellow, I., Bengio, Y., & Courville, A. (2016).
Deep Learning. MIT Press.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013).
An Introduction to Statistical Learning.
Springer.
Tibshirani, R. (1996). Regression shrinkage and selection
via the lasso. Journal of the Royal Statistical Society:
Series B, 58(1), 267-288.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge
regression: Biased estimation for nonorthogonal problems.
Technometrics, 12(1), 55-67.
Zou, H., & Hastie, T. (2005). Regularization and
variable selection via the elastic net. Journal of the
Royal Statistical Society: Series B, 67(2), 301-320.
Huber, P. J. (1964). Robust estimation of a location
parameter. The Annals of Mathematical Statistics,
35(1), 73-101.
Post title: Machine Learning Mathematical Derivations (5): Linear Regression
Post author: Chen Kai
Create time: 2021-09-18 09:00:00
Post link: https://www.chenk.top/Machine-Learning-Mathematical-Derivations-5-Linear-Regression/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.